[Data Science from Scratch] Ch4. Linear Algebra

Linear algebra is the branch of mathematics that deals with vector spaces.

📌 Vectors

Abstractly, vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (also to form new vectors).

Concretely (for us), vectors are points in some finite-dimensional space. Although you might not think of your data as vectors, they are a good way to represent numeric data.

For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (`height, weight, age`). If you're teaching a class with four exams, you can treat student grades as four-dimensional vectors (`exam1, exam2, exam3, exam4`).

The simplest from-scratch approach is to represent vectors as lists of numbers. A list of three numbers corresponds to a vector in three-dimensional space, and vice versa:

height_weight_age = [70, 170, 40] # height(inches), weight(pounds), age(years)
grades = [95, 80, 75, 62] # exam1, exam2, exam3, exam4

The problem with this approach is that Python lists aren't vectors, so we have to build the mathematical tools for vectors ourselves.
As a first example, consider adding two vectors.

Vectors add componentwise: if two vectors are the same length, their sum is the list obtained by adding the values at the same index. (If they're not the same length, then we're not allowed to add them.)
For example, adding [1, 2] and [2, 1] results in [1 + 2, 2 + 1], or [3, 3].

def vector_add(v, w):
    """adds corresponding elements"""
    return [v_i + w_i
            for v_i, w_i in zip(v, w)]

def vector_subtract(v, w):
    """subtracts corresponding elements"""
    return [v_i - w_i
            for v_i, w_i in zip(v, w)]
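
One caveat worth sketching: `zip` stops at the shorter input, so `vector_add` as written quietly truncates instead of rejecting vectors of different lengths. The variant below is a minimal sketch of how the "same length" rule could be enforced; the name `checked_vector_add` is hypothetical and not part of the original notes.

```python
def checked_vector_add(v, w):
    """Add corresponding elements, refusing vectors of different lengths.

    Hypothetical helper for illustration: the plain zip-based version
    would silently drop the extra elements of the longer vector.
    """
    if len(v) != len(w):
        raise ValueError("vectors must be the same length")
    return [v_i + w_i for v_i, w_i in zip(v, w)]


print(checked_vector_add([1, 2], [2, 1]))  # [3, 3]
```

Calling `checked_vector_add([1, 2], [1, 2, 3])` raises a `ValueError` rather than returning a two-element result.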

We'll also sometimes want to componentwise sum a list of vectors. That is, create a new vector whose first element is the sum of all the first elements, whose second element is the sum of all the second elements, and so on. The easiest way to do this is by adding one vector at a time:

def vector_sum(vectors):
    """sums all corresponding elements"""
    result = vectors[0] # start with first vector
    for vector in vectors[1:]: # then loop over the others
        result = vector_add(result, vector) # and add them to the result
    return result

We can rewrite this more briefly with reduce:

from functools import partial, reduce

def vector_sum(vectors):
    return reduce(vector_add, vectors)

vector_sum = partial(reduce, vector_add) # equivalent one-liner; arguably more clever than helpful

vector_sum([[1, 2, 3], [1, 2, 3], [1, 2, 3]])

[3, 6, 9]

def scalar_multiply(c, v):
    """c is a number(scalar), v is a vector"""
    return [c * v_i for v_i in v]

def vector_mean(vectors):
    """compute the vector whose ith element is the mean of the
    ith elements of the input vectors"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))
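
As a quick check of how these helpers fit together, here is a small self-contained sketch. It assumes Python 3, where `1/n` is true division (under Python 2 it would be integer division and give 0), and the input vectors are made-up values chosen only for illustration.

```python
from functools import reduce


def vector_add(v, w):
    """adds corresponding elements"""
    return [v_i + w_i for v_i, w_i in zip(v, w)]


def vector_sum(vectors):
    """sums all corresponding elements"""
    return reduce(vector_add, vectors)


def scalar_multiply(c, v):
    """c is a number (scalar), v is a vector"""
    return [c * v_i for v_i in v]


def vector_mean(vectors):
    """the vector whose ith element is the mean of the ith elements"""
    n = len(vectors)
    return scalar_multiply(1 / n, vector_sum(vectors))


# Made-up sample inputs, not data from the original notes.
sample_vectors = [[1, 2], [3, 4], [5, 6], [7, 8]]

print(vector_sum(sample_vectors))   # [16, 20]
print(vector_mean(sample_vectors))  # [4.0, 5.0]
```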

The dot product of two vectors is the sum of their componentwise products:

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i
               for v_i, w_i in zip(v, w))

dot([1, 2], [1, 2])

The dot product measures how far the vector v extends in the w direction. For example, if w = [1, 0] then dot(v, w) is just the first component of v. Another way of saying this is that it's the length of the vector you'd get if you projected v onto w (this holds when w has length 1; for a general w the result is additionally scaled by the length of w).
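
The projection reading of the dot product can be checked numerically. The sketch below uses made-up vectors; dividing by the length of `w` to handle a direction that is not unit length is a standard identity, not something stated in the original notes.

```python
import math


def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


v = [3, 4]  # made-up example vector

# Against the unit vectors, the dot product picks out one component.
print(dot(v, [1, 0]))  # 3
print(dot(v, [0, 1]))  # 4

# For a direction that is not unit length, divide by its length
# to recover the length of the projection of v onto w.
w = [2, 0]
projection_length = dot(v, w) / math.sqrt(dot(w, w))
print(projection_length)  # 3.0
```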

Using the dot product, it's easy to compute a vector's sum of squares:

def sum_of_squares(v):
    """v_1 * v_1 + v_2 * v_2 + ... + v_n * v_n"""
    return dot(v, v)

sum_of_squares([3, 4])

import math

def magnitude(v):  # the magnitude of a vector is its length
    return math.sqrt(sum_of_squares(v))  # math.sqrt is the square root function

magnitude([3, 4])

5.0

We now have everything we need to compute the distance between two vectors:

def squared_distance(v, w):
    """(v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(vector_subtract(v, w))

def distance(v, w):
    return math.sqrt(squared_distance(v, w))

# equivalently, the distance is the magnitude of the difference vector
def distance(v, w):
    return magnitude(vector_subtract(v, w))
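
To see the two formulations agree, here is a self-contained sketch that repeats the helpers above and applies them to a pair of made-up points (a 3-4-5 triangle, so the expected distance is easy to verify by hand).

```python
import math


def vector_subtract(v, w):
    """subtracts corresponding elements"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]


def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)


def magnitude(v):
    """length of the vector v"""
    return math.sqrt(sum_of_squares(v))


def squared_distance(v, w):
    """(v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(vector_subtract(v, w))


def distance(v, w):
    return math.sqrt(squared_distance(v, w))


p, q = [1, 2], [4, 6]  # made-up points: the difference is [-3, -4]

print(squared_distance(p, q))            # 25
print(distance(p, q))                    # 5.0
print(magnitude(vector_subtract(p, q)))  # 5.0, same value via magnitude
```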

📌 Matrix

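A matrix is a two-dimensional collection of numbers. In Python we represent a matrix as a nested list (a list of lists). For example, if A is a matrix, then A[i][j] is the value in the ith row and the jth column. (By convention, matrices are named with capital letters.)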

A = [[1, 2, 3],
    [4, 5, 6]] # A has 2 rows and 3 columns
B = [[1, 2],
    [3, 4],
    [5, 6]] # B has 3 rows and 2 columns

def shape(M):
    num_rows = len(M)
    num_cols = len(M[0]) if M else 0 # number of elements in first row
    return num_rows, num_cols

shape(A)

(2, 3)

If a matrix has n rows and k columns, we call it an n✕k matrix.
We can think of each row of an n✕k matrix as a vector of length k, and each column as a vector of length n.

def get_row(M, i):
    return M[i]

def get_column(M, j):
    return [M_i[j]
           for M_i in M]

get_row(A, 0)

[1, 2, 3]

get_column(A, 1)

[2, 5]

def make_matrix(num_rows, num_cols, entry_fn):
    """returns a num_rows x num_cols matrix
    whose (i,j)th entry is entry_fn(i, j)"""
    return [[entry_fn(i, j) # given i, create a list
            for j in range(num_cols)] # [entry_fn(i, 0), ...]
            for i in range(num_rows)] # create one list for each i

def is_diagonal(i, j):
    """1's on the 'diagonal', 0's everywhere else"""
    return 1 if i == j else 0

identity_matrix = make_matrix(5, 5, is_diagonal)
identity_matrix

[[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1]]

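Why matrices matter to us!

First, a matrix can represent a data set made up of multiple vectors, simply by treating each vector as one row of the matrix.
Second, an n✕k matrix can represent a linear function that maps k-dimensional vectors to n-dimensional vectors.

Third, matrices can represent binary relationships.
For example, the friendships below were originally represented as a collection of (i, j) pairs. With a matrix, we instead record a 1 where a relationship exists and a 0 where it does not, which gives one vector per user.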

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# user0 1 2 3 4 5 6 7 8 9
friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0
            [1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1 
            [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2 
            [0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3 
            [0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4 
            [0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5 
            [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6 
            [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7 
            [0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8 
            [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9
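
Rather than typing the matrix by hand, it can be derived from the list of pairs. The following is a minimal sketch under a few assumptions: the names `friendship_pairs`, `num_users`, and `build_adjacency_matrix` are hypothetical (chosen so the edge list and the matrix can coexist), and friendship is taken to be symmetric, as it is in the matrix above.

```python
# The same (i, j) pairs as the edge list above, under a hypothetical name.
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
num_users = 10  # users are numbered 0 through 9


def build_adjacency_matrix(pairs, n):
    """Return an n x n matrix with 1 where two users are friends, else 0."""
    matrix = [[0] * n for _ in range(n)]  # a fresh row list for every user
    for i, j in pairs:
        matrix[i][j] = 1
        matrix[j][i] = 1  # assumed: friendship goes both ways
    return matrix


adjacency = build_adjacency_matrix(friendship_pairs, num_users)
print(adjacency[5])  # [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]
```

The row for user 5 matches the hand-written row above. Building each row with its own `[0] * n` matters: `[[0] * n] * n` would make every row the same list object.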

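When there are few connections, the matrix is a less efficient representation than the list of friendship tuples, because it has to store many more zeros. However, with the matrix representation it is much quicker to check whether two nodes are connected: it takes a single matrix lookup instead of a search through every edge.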

print(friendships[0][2] == 1) # True, user0 and user2 are friends.
print(friendships[0][8] == 1) # False, user0 and user8 are not friends.

True
False

Similarly, to find the connections a node has, you only need to inspect the column (or the row) corresponding to that node:

friends_of_user5 = [i
                   for i, is_friend in enumerate(friendships[5])
                   if is_friend]
friends_of_user5

[4, 6, 7]
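
Returning to the second use listed earlier (an n✕k matrix as a linear function from k-dimensional vectors to n-dimensional vectors), here is one way it might be sketched. The helper `matrix_vector_multiply` is a hypothetical name not defined in the original notes; it takes the dot product of each row with the input vector.

```python
def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def matrix_vector_multiply(M, v):
    """Apply the n x k matrix M to the k-dimensional vector v.

    Hypothetical helper: each output element is dot(row, v),
    so the result has one element per row of M.
    """
    return [dot(row, v) for row in M]


A = [[1, 2, 3],
     [4, 5, 6]]  # 2 rows, 3 columns: maps 3-dimensional vectors to 2-dimensional ones

v = [1, 0, 1]
w = [0, 1, 0]

print(matrix_vector_multiply(A, v))  # [4, 10]
print(matrix_vector_multiply(A, w))  # [2, 5]

# Linearity: applying A to v + w gives the sum of the two results above.
v_plus_w = [v_i + w_i for v_i, w_i in zip(v, w)]
print(matrix_vector_multiply(A, v_plus_w))  # [6, 15]
```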
