[Data Science from Scratch] Ch2. A Crash Course in Python

These notes are based on my reading and understanding of [Data Science from Scratch (Joel Grus, O'Reilly)].

The Not-So-Basics

Sorting

x = [4, 1, 2, 3]
y = sorted(x)        #is [1, 2, 3, 4], x is unchanged
x.sort()             #now x is [1, 2, 3, 4] (x is changed)
print(x, y)

[1, 2, 3, 4] [1, 2, 3, 4]

The sorted(arg) function returns a sorted copy of its argument and leaves the argument itself unchanged.
The sort() method returns nothing (None) and instead sorts the list it is called on in place.

A function is a piece of code that is called by name. It can be passed data to operate on (i.e. the parameters) and can optionally return data (the return value). All data that is passed to a function is explicitly passed.

A method is a piece of code that is called by a name that is associated with an object. In most respects it is identical to a function except for two key differences:

A method is implicitly passed the object on which it was called.

A method is able to operate on data that is contained within the class (remembering that an object is an instance of a class - the class is the definition, the object is an instance of that data).
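To make the distinction concrete, here is a small sketch (the variable names are my own, not from the book). It shows sorted receiving the list explicitly as a function argument, while sort receives the list implicitly as the object it is called on:

```python
numbers = [3, 1, 2]

# Function: the list is passed explicitly, and a new list is returned.
ordered_copy = sorted(numbers)

# Method: the list it is called on is passed implicitly (as `self`),
# the list is modified in place, and the return value is None.
result = numbers.sort()

# The method call is roughly shorthand for looking the function up on the
# class and passing the object explicitly.
other = [9, 8, 7]
list.sort(other)

print(ordered_copy, numbers, result, other)
```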

By default, sort() and sorted() sort from smallest to largest (ascending order); to sort in descending order, pass the `reverse=True` parameter. You can also sort by the values computed by a function passed as `key`.

def word_count(text):
    counts = dict()
    words = text.split()   #split on whitespace

    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

    return counts

word_counts = word_count('the quick brown fox jumps over the lazy dog.')

#sort the list by absolute value from largest to smallest
x = sorted([-4, 1, -2, 3], key=abs, reverse=True)
print(x)

#sort the words and counts from highest count to lowest
wc = sorted(word_counts.items(),
            key=lambda word_and_count: word_and_count[1], reverse=True)
print(wc)

[-4, 3, -2, 1]
[('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('jumps', 1), ('over', 1), ('lazy', 1), ('dog.', 1)]

List Comprehensions

Sometimes you want to turn a list into another list by picking only certain elements, by applying a function to each element, or both. The Pythonic way to do this is a list comprehension!

even_numbers = [x for x in range(5) if x % 2 == 0] #transform a list into another list, by choosing only certain elements
squares = [x * x for x in range(5)] #transform elements
even_squares = [x * x for x in even_numbers] #choose only certain elements and transform elements

print(even_numbers, squares, even_squares, sep='\n')

[0, 2, 4]
[0, 1, 4, 9, 16]
[0, 4, 16]
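As a rough comparison, the sketch below (variable names are mine) writes the same filter-and-transform step once as an explicit loop and once as a single comprehension that does both at the same time:

```python
# Loop version: filter and transform by hand.
even_squares_loop = []
for x in range(5):
    if x % 2 == 0:
        even_squares_loop.append(x * x)

# Comprehension version: the same filter and transform in one expression.
even_squares_comp = [x * x for x in range(5) if x % 2 == 0]

print(even_squares_loop == even_squares_comp)
```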

Comprehensions work for dictionaries and sets too!

square_dict = {x : x * x for x in range(5)}
square_set = {x * x for x in [1, -1]}

print(square_dict, square_set, sep='\n')

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
{1}

If you don't need the values from the list and only use its length, it is conventional to use `_` as the variable.

zeroes = [0 for _ in even_numbers] #has the same length as even_numbers
print(zeroes)

[0, 0, 0]

A list comprehension can also include multiple `for` clauses.

pairs = [(x, y) for x in range(10) for y in range(10)] #100 pairs (0, 0), (0, 1) ... (9, 8), (9, 9)
print(pairs[:20])

[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9)]

A later for clause can use the results of an earlier one.

increasing_pairs = [(x, y)
                   for x in range(10)
                   for y in range(x+1, 10)]
print(increasing_pairs[:20])

[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (2, 3), (2, 4), (2, 5)]

Generators and Iterators

iteration
'Iteration' means 'repetition' or 'looping'. In Python, an object you can loop over, such as a list whose elements a for loop reads one at a time, is called an iterable. An iterator is the object that actually hands out those items one by one as the loop runs.
generator
A generator is an iterator! However, a generator can be used only once. It does not store its values in memory; it produces each value at the moment it is needed. (generators generate the values on the fly)

Generators are iterators, a kind of iterable you can only iterate over once. Generators do not store all the values in memory; they generate the values on the fly. A generator is something that you can iterate over (for us, usually using for) but whose values are produced only as needed (lazily).

def lazy_range(n):
    '''a lazy version of range'''
    i = 0
    while i < n:
        yield i
        i += 1

for i in lazy_range(10):
    print(i, end =',')

0,1,2,3,4,5,6,7,8,9,

In Python 3, range itself is lazy.

def natural_numbers():
    '''returns 1, 2, 3, ...'''
    n = 1
    while True:
        yield n
        n += 1
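An endless generator like this cannot be turned into a list directly, so you have to take values from it a few at a time. The sketch below is one way to do that with next() and itertools.islice; count_up_from_one is a hypothetical stand-in defined here so the example is self-contained:

```python
from itertools import islice

def count_up_from_one():
    """Hypothetical stand-in for an endless generator: yields 1, 2, 3, ..."""
    n = 1
    while True:
        yield n
        n += 1

numbers = count_up_from_one()
first = next(numbers)                  # runs the body up to the first yield
next_three = list(islice(numbers, 3))  # continues where the generator left off
print(first, next_three)
```

Because the generator remembers its position, the islice call picks up at 2 rather than starting over.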

The second way to create a generator is to use a `for` comprehension wrapped in parentheses:

lazy_evens_below_20 = (i for i in lazy_range(20) if i % 2 == 0)
lazy_evens_below_20

print([x for x in lazy_evens_below_20])
print([x for x in lazy_evens_below_20]) #a generator cannot be used more than once

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
[]

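In Python 2, a dictionary's items() method returned a list of key-value pairs, while iteritems() lazily yielded the pairs one at a time and could be consumed only once.
In Python 3, iteritems() no longer exists, and items() returns a view of the key-value pairs instead of a list.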
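For Python 3 specifically, a short sketch of how items() behaves (the dictionary contents are made up for illustration):

```python
counts = {'a': 1, 'b': 2}

pairs_view = counts.items()        # a view object, not a list
counts['c'] = 3                    # the view reflects later changes
snapshot = list(pairs_view)        # materialize a list when one is needed

view_type = type(pairs_view).__name__
has_iteritems = hasattr(counts, 'iteritems')

print(view_type)       # dict_items
print(snapshot)        # [('a', 1), ('b', 2), ('c', 3)]
print(has_iteritems)   # False on Python 3
```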

Randomness

When learning data science, you often need random numbers. They are generated with the random module.

import random
four_uniform_randoms = [random.random() for _ in range(4)]
four_uniform_randoms

[0.8052686857490019,
0.31968743424368706,
0.24001904386504735,
0.6330763389850422]

The random module actually produces pseudorandom numbers. So if you want reproducible results, you can set the seed with random.seed.

Definition of pseudorandom
: being or involving entities (such as numbers) that are selected by a definite computational process but that satisfy one or more standard tests for statistical randomness.

random.seed(10) #set the seed to 10
print(random.random())
random.seed(10)
print(random.random())

0.5714025946899135
0.5714025946899135
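random.seed resets the module-wide generator. If you would rather not touch that shared state, one option is to create separate random.Random instances with their own seed. This is a minimal sketch; the seed value 10 and the names rng_a and rng_b are arbitrary choices:

```python
import random

# Two independent generators created with the same (arbitrary) seed.
rng_a = random.Random(10)
rng_b = random.Random(10)

draws_a = [rng_a.random() for _ in range(3)]
draws_b = [rng_b.random() for _ in range(3)]

print(draws_a == draws_b)  # same seed, same sequence
```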

random.randrange picks an element at random from the corresponding range.

random.randrange(10) #choose randomly from range(10)

random.randrange(3, 6) #choose randomly from range(3, 6)

random.shuffle shuffles the elements of a list in place.

up_to_ten = list(range(10))
random.shuffle(up_to_ten) #results will probably be different each time
up_to_ten

[4, 5, 8, 1, 2, 6, 7, 3, 0, 9]

random.choice picks one element at random from a list.

my_best_friend = random.choice(['Alice', 'Bob', 'Charlie'])
my_best_friend

'Bob'

To pick numbers without duplicates, use random.sample.
(Repeated calls to random.choice can return duplicates.)

lottery_numbers = range(60)
winning_numbers = random.sample(lottery_numbers, 6)
winning_numbers

[4, 15, 47, 23, 2, 26]

four_with_replacement = [random.choice(list(range(10))) for _ in range(4)]
four_with_replacement

[2, 9, 5, 6]

Regular Expression

Regular expressions provide a way to search text; in Python they are available through the re module.

import re
print ([
    not re.match('a', 'cat'), #'cat' doesn't start with 'a'
    re.search('a', 'cat'), #'cat' has an 'a' in it
    not re.search('c', 'dog'), #'dog' doesn't have a 'c' in it
    3 == len(re.split('[ab]', 'carbs')), #split on a or b to ['c', 'r', 's']
    'R-D-' == re.sub('[0-9]', '-', 'R2D2') #replace digits with dashes
])

[True, <re.Match object; span=(1, 2), match='a'>, True, True, True]

Character Classes

. : any character except newline
\w \d \s: word, digit, whitespace
\W \D \S : not word, digit, whitespace
[abc] : any of a, b, or c
[^abc] : not a, b, or c
[a-g] : character between a&g

Anchors

^abc$ : start / end of the string
\b \B : word, not-word boundary

Escaped characters

\ : escaped special characters
\t \n \r : tab, linefeed, carriage return

Group & Lookaround

(abc) : capture group
\1 : backreference to group #1
(?:abc) : non-capturing group
(?=abc) : positive lookahead
(?!abc) : negative lookahead

Quantifiers & Alternation

a* a+ a? : 0 or more, 1 or more, 0 or 1
a{5} a{2,} : exactly five, two or more
a{1,3} : between one & three
a+? a{2,}? : match as few as possible
ab|cd : match ab or cd
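A few of the entries above, tried out with Python's re module. This is only an illustrative sketch; the sample strings are made up:

```python
import re

# Capture groups: pull the year and month out of a date-like string.
match = re.search(r'(\d{4})-(\d{2})', 'released 2021-12-08')
year, month = match.groups()

# Backreference: \1 must repeat whatever group 1 matched.
doubled = re.findall(r'\b(\w+) \1\b', 'it is is fine')

# Positive lookahead: digits only if followed by 'px' ('px' is not consumed).
pixels = re.findall(r'\d+(?=px)', '10px 20em 30px')

# Greedy vs. lazy quantifier.
greedy = re.search(r'<.+>', '<a><b>').group()
lazy = re.search(r'<.+?>', '<a><b>').group()

# Alternation.
pets = re.findall(r'cat|dog', 'cat, dog, bird')

print(year, month, doubled, pixels, greedy, lazy, pets)
```

Raw strings (the r'...' prefix) keep Python from interpreting backslashes before the regex engine sees them.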

Practice regular expressions here: regexr.com

Object-Oriented Programming

Python allows you to define classes that encapsulate data and the functions that operate on them.
Python is an object-oriented programming language! OOP✨

Imagine implementing the built-in Python set class ourselves!
We need to add items to the set, remove them, and check whether the set contains a given value.
→ If we implement add, remove, and contains as member functions of the class, we can reach them by writing . after a Set object.

#by convention, we give classes PascalCase names
class Set:

    #these are the member functions
    #every one takes a first parameter 'self' (another convention)
    #that refers to the particular Set object being used

    def __init__(self, values=None):

        """This is the constructor.
        It gets called when you create a new Set.
        You would use it like
        s1 = Set() #empty set
        s2 = Set([1, 2, 2, 3]) #initialize with values"""

        self.dict = {} #each instance of Set has its own dict property

        if values is not None:
            for value in values:
                self.add(value)

    def __repr__(self):
        """This is the string representation of a Set object
        if you type it at the Python prompt or pass it to str()"""
        return "Set: " + str(set(self.dict.keys()))

    #we'll represent membership by being a key in self.dict with value True
    def add (self, value):
        self.dict[value] = True

    #value is in the Set if it's a key in the dictionary
    def contains(self, value):
        return value in self.dict

    def remove(self, value):
        del self.dict[value]

s = Set([1, 2, 3])
s.add(4)
print(s.contains(4))
s.remove(3)
print(s.contains(3))

True
False
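The class above exposes membership as an ordinary contains() method. As an aside that goes slightly beyond the book's version, Python's special methods let the same class work with the built-in `in` operator and len(). The following is a cut-down sketch under that assumption, not the book's implementation:

```python
class Set:
    """A cut-down version of the Set class, plus two special methods."""

    def __init__(self, values=None):
        self.dict = {}
        if values is not None:
            for value in values:
                self.add(value)

    def add(self, value):
        self.dict[value] = True

    def __contains__(self, value):
        # Called by Python for `value in some_set`.
        return value in self.dict

    def __len__(self):
        # Called by Python for `len(some_set)`.
        return len(self.dict)

s = Set([1, 2, 2, 3])
print(2 in s, 5 in s, len(s))
```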

Functional Tools

When passing functions around, sometimes we'll want to partially apply(or curry) functions to create new functions. As a simple example, imagine that we have a function of two variables:

def exp(base, power):
    return base ** power

and we want to use it to create a function of one variable two_to_the whose input is a power and whose output is the result of exp(2, power).

def two_to_the(power):
    return exp(2, power) #type(two_to_the_):function

from functools import partial
two_to_the = partial(exp, 2) #is now a function of one variable → type(two_to_the_): functools.partial
print(two_to_the(3)) #2**3

You can also use partial to fill in later arguments if you specify their names:

square_of = partial(exp, power=2)
print(square_of(3))
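Filling in a later argument by name has one catch worth seeing: the stored keyword can be overridden by keyword at call time, but supplying it positionally collides with the stored value. A small sketch, repeating the exp function so it runs on its own:

```python
from functools import partial

def exp(base, power):
    return base ** power

square_of = partial(exp, power=2)

# A keyword fixed by partial can still be overridden by keyword at call time.
overridden = square_of(3, power=3)   # exp(3, power=3)

# Passing the same parameter positionally collides with the stored keyword.
conflict = None
try:
    square_of(3, 4)                   # effectively exp(3, 4, power=2)
except TypeError as error:
    conflict = str(error)

print(overridden)
print(conflict)
```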

Suppose we write our own version of Python's built-in pow() function. → power()

def power(base, exponent):
    return base ** exponent

If we then want dedicated square and cube functions with the exponent fixed at 2 and 3, we would define more functions like this:

def square(base):
    return power(base, 2)

def cube(base):
    return power(base, 3)

But if we wanted 10,000 variants of power(), we would have to define 10,000 separate functions.
With functools.partial, this becomes:

from functools import partial

square = partial(power, exponent=2)
cube = partial(power, exponent=3)

def test_partials():
    assert square(2) == 4
    assert cube(2) == 8

test_partials()

The attributes of these partial functions can be checked as follows.

def test_partials_docs():
    assert square.keywords == {'exponent':2}
    assert square.func == power

    assert cube.keywords == {'exponent':3}
    assert cube.func == power

test_partials_docs()

Python also has map, reduce, and filter, which provide functional alternatives to list comprehensions:

Using map to apply the double() function to each element of a list:

def double(x):
    return 2 * x

xs = [1, 2, 3, 4]
twice_xs = [double(x) for x in xs] #[2, 4, 6, 8]
twice_xs = map(double, xs) #[2, 4, 6, 8]
list(twice_xs)

[2, 4, 6, 8]

Using partial() to create list_doubler, a partial function that applies double() to each element of a list and returns the result:

when not using partial():

def list_doubler(lst):
    return map(double, lst)

when using partial():

list_doubler = partial(map, double)

def list_doubler(lst):
    return map(double, lst)

list(list_doubler([1, 2, 3, 4]))

[2, 4, 6, 8]

list_doubler = partial(map, double) #"function" that doubles a list: map with double already passed as its first argument
list(list_doubler([1, 2, 3, 4]))

[2, 4, 6, 8]

map also works with functions that take multiple arguments, if you give it one list per argument!

def multiply(x, y) : return x * y
products = map(multiply, [1, 2], [4, 5]) #[1*4, 2*5] = [4, 10]
list(products)

[4, 10]

filter plays the same role as the if in a list comprehension.
filter(condition, list) returns only the elements of list that satisfy condition.

def is_even(x):
    """True if x is even, False if x is odd"""
    return x % 2 == 0

x_evens = [x for x in [1, 2, 3, 4] if is_even(x)] #[2, 4]
x_evens = filter(is_even, xs) #same as above
list(x_evens)

[2, 4]

list_evener = partial(filter, is_even) #partial function: filter with is_even already passed as its first argument
x_evens = list_evener([1, 2, 3, 4])
list(x_evens)

[2, 4]

reduce accumulates a result across the elements of a single list (it does not pair up elements by index the way zip does). It combines the first two elements of a list, then that result with the third, that result with the fourth, and so on, producing a single result:

from functools import reduce
x_product = reduce(multiply, [1, 2, 3, 4]) # = 1 * 2 * 3 * 4
x_product

list_product = partial(reduce, multiply) #partial function: reduce with multiply already passed as its first argument
x_product = list_product([1, 2, 3, 4])
x_product
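reduce also accepts an optional third argument, an initial value, which matters for empty lists. The sketch below uses operator.mul from the standard library in place of the multiply function; the variable names are my own:

```python
from functools import reduce
import operator

# operator.mul multiplies two values, like the multiply function above.
x_product = reduce(operator.mul, [1, 2, 3, 4])   # ((1 * 2) * 3) * 4

# With an initial value, reduce starts from it, so an empty list is fine.
empty_product = reduce(operator.mul, [], 1)

# Without an initial value, reducing an empty list raises TypeError.
empty_without_initial_fails = False
try:
    reduce(operator.mul, [])
except TypeError:
    empty_without_initial_fails = True

print(x_product, empty_product, empty_without_initial_fails)
```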

enumerate

When looping over a list, use enumerate() to get each element together with its index.

documents = ['a', 'b', 'c']
#not Pythonic
for i in range(len(documents)):
    document = documents[i]
    print((i, document), end=' ')

(0, 'a') (1, 'b') (2, 'c')

#also not Pythonic
i = 0
for document in documents:
    print((i, document), end=' ')
    i += 1

(0, 'a') (1, 'b') (2, 'c')

#Pythonic solution: `enumerate`
for i, document in enumerate(documents):
    print((i, document))

(0, 'a')
(1, 'b')
(2, 'c')
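enumerate also takes an optional start argument, which is handy when the numbering should begin at 1 rather than 0. A short sketch:

```python
documents = ['a', 'b', 'c']

# start=1 shifts only the counter; the elements themselves are unchanged.
numbered = list(enumerate(documents, start=1))
print(numbered)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```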

When you only want the indexes:

#not Pythonic
for i in range(len(documents)): print(i)

0
1
2

#Pythonic
for i, _ in enumerate(documents): print(i)

0
1
2

zip and Argument Unpacking

zip groups the elements at the same position (index) of several lists into tuples. In Python 3 it returns an iterator, so wrap it in list() to get a list.

list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]
list3 = ['one', 'two', 'three']
list(zip(list1, list2, list3))

[('a', 1, 'one'), ('b', 2, 'two'), ('c', 3, 'three')]

If the lists have different lengths, zip stops at the length of the shortest one.

list1 = ['a', 'b']
list2 = [1, 2, 3]
list3 = ['one', 'two', 'three']
list(zip(list1, list2, list3))

[('a', 1, 'one'), ('b', 2, 'two')]
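If dropping the extra elements is not what you want, the standard library's itertools.zip_longest pads the shorter inputs instead. A minimal sketch for comparison (the fill value None is just one possible choice):

```python
from itertools import zip_longest

list1 = ['a', 'b']
list2 = [1, 2, 3]

truncated = list(zip(list1, list2))
padded = list(zip_longest(list1, list2, fillvalue=None))

print(truncated)  # [('a', 1), ('b', 2)]
print(padded)     # [('a', 1), ('b', 2), (None, 3)]
```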

You can unzip by passing *args to zip().
The asterisk (`*`) unpacks the arguments: unpacking is like taking the elements out of the list one by one and passing each of them as a separate argument.

pairs = [('a', 1), ('b', 2), ('c', 3)]
letters, numbers = zip(*pairs)  #zip the unpacked pairs
print(letters, numbers)
list(zip(('a', 1), ('b', 2), ('c', 3))) #same as letters, numbers

('a', 'b', 'c') (1, 2, 3)
[('a', 'b', 'c'), (1, 2, 3)]

pairs = [('a', 1), ('b', 2), ('c', 3)]
print(pairs)
print(*pairs)

[('a', 1), ('b', 2), ('c', 3)]
('a', 1) ('b', 2) ('c', 3)

You can use argument unpacking with any function:

def add(a, b): return a + b

add(1, 2)

add([1, 2])

TypeError Traceback (most recent call last)

in
----> 1 add([1, 2])

TypeError: add() missing 1 required positional argument: 'b'

add(*[1, 2])

args and kwargs

Suppose we want to create a higher-order function (a function that takes a function as a parameter and returns another function).

def doubler(f):
    def g(x):
        return 2 * f(x)
    return g

def f1(x):
    return x + 1

g = doubler(f1)
print(g(3))
print(g(-1))

8
0

But defined this way, the returned function cannot accept two or more parameters.

def f2(x, y):
    return x + y
g = doubler(f2)
print(g(1, 2))

TypeError Traceback (most recent call last)

in
2 return x + y
3 g = doubler(f2)
----> 4 print(g(1, 2))

TypeError: g() takes 1 positional argument but 2 were given

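We need argument unpacking and kwargs to build a function that accepts arbitrary arguments.
*args is used when you don't know in advance how many arguments will be passed: it collects any number of unnamed (positional) arguments into a tuple.
**kwargs collects the keyword arguments into a dict of the form {'keyword': 'value'}. In other words, think of it as *args with names (keywords) attached.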

def magic(*args, **kwargs):
    print('unnamed args:', args)
    print('keyword args:', kwargs)

magic(1, 2, 3, key='word', key2='word2')

unnamed args: (1, 2, 3)
keyword args: {'key': 'word', 'key2': 'word2'}

def other_way_magic(x, y, z):
    return x + y + z

x_y_list = [1, 2]
z_dict = {'z' : 3}
print(other_way_magic(*x_y_list, **z_dict))

def f2(x, y):
    return x + y


def doubler_correct(f):
    """works no matter what kind of inputs f expects"""
    def g(*args, **kwargs):
        """whatever arguments g is supplied, pass them through to f"""
        return 2 * f(*args, **kwargs)
    return g

g = doubler_correct(f2)
print(g(1, 2))
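Because g forwards both *args and **kwargs, the wrapped function can be called positionally, by keyword, or with a mix. The sketch below repeats the two definitions so it runs on its own:

```python
def doubler_correct(f):
    """Forward every argument to f and double the result."""
    def g(*args, **kwargs):
        return 2 * f(*args, **kwargs)
    return g

def f2(x, y):
    return x + y

g = doubler_correct(f2)

positional = g(1, 2)      # arguments arrive in args as (1, 2)
by_keyword = g(x=1, y=2)  # arguments arrive in kwargs as {'x': 1, 'y': 2}
mixed = g(1, y=2)         # one of each

print(positional, by_keyword, mixed)
```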

License: Attribution-NonCommercial-NoDerivs

hellomygreenworld
