파이썬 프로그래밍

데이터 과학을 위한 프로그래밍을 배워봅니다.

저자

소속

이광춘

TCS

공개

2023년 01월 15일

1 정규화

파이썬 정규화 함수를 작성하고 이를 통해 정규분포 난수에서 나온 표본을 표준정규분포로 정규화하는 코드.

import numpy as np 

# 정규분포에서 나온 난수
mu, sigma = 100, 10 

random_samples = np.random.normal(mu, sigma, size = 10)

# 정규화 값을 저장
results = []

# 정규화 함수
def normalize(sample):
  return( (sample - mu) / sigma)

# 정규화
for sample in random_samples:
    results.append( normalize(sample) )

print(results)
#> [-0.37082741498345084, 0.8656631990371195, -0.18901873427866747, 0.82126547785832, -1.829051157694846, -0.533750732400371, -0.6976763280348308, -0.14691958971619384, 0.03463514277669048, -1.5771028398960425]

2 벡터화

벡터화(Vectorization)는 데이터 과학에서 크게 두가지 점에서 중요하다.

루프 반복문을 돌리는 대신 전체 배열과 행렬에 연산작업을 수행하여 속도를 높임. 특히 데이터가 큰 경우 큰 효과를 볼 수 있음. 대체로 Vectorization은 C 혹은 포트란으로 작성됨.
동일한 작업을 벡터화 코드로 작성하게 되면 가독성을 크게 높이고 코드가 간결해지는 장점이 있음.

for 루프 작성
벡터화 코드

import numpy as np

# 배열 두개를 생성
A_arr = np.array([1, 2, 3, 4, 5])
B_arr = np.array([6, 7, 8, 7, 8])

# 최종 결과 저장
result_arr = []

# 각 배열 항목별로 연산작업을 수행
for i in range(len(A_arr)):
    result_arr.append(A_arr[i] + B_arr[i])

print(result_arr)
#> [7, 9, 11, 11, 13]

import numpy as np

# 배열 두개를 생성
A_arr = np.array([1, 2, 3, 4, 5])
B_arr = np.array([6, 7, 8, 7, 8])

# 벡터화 연산
result_arr = A_arr + B_arr

print(result_arr)
#> [ 7  9 11 11 13]

3 제곱 방법

다양한 방식으로 데이터에 함수 (제곱)를 적용시켜 보자.

samples = [1, 2, 3, 4, 5]
squared_samples = []

for sample in samples:
    squared_samples.append(sample**2)
    
squared_samples
#> [1, 4, 9, 16, 25]

samples = [1, 2, 3, 4, 5]
squared_samples = [x**2 for x in samples]
squared_samples
#> [1, 4, 9, 16, 25]

samples = (1, 2, 3, 4, 5)
squared_samples = (x**2 for x in samples)
list(squared_samples)
#> [1, 4, 9, 16, 25]

samples = [1, 2, 3, 4, 5]
squared_samples = list(map(lambda x: x**2, samples))
squared_samples
#> [1, 4, 9, 16, 25]

samples_arr = np.array([1, 2, 3, 4, 5])
squared_samples = samples_arr ** 2
squared_samples
# print(squared_samples)
#> array([ 1,  4,  9, 16, 25])

import pandas as pd
samples = [1, 2, 3, 4, 5]

df = pd.DataFrame(samples, columns=["samples"])
df["squared_numbers"] = df["samples"].apply(lambda x: x**2)
print(df)
#>    samples  squared_numbers
#> 0        1                1
#> 1        2                4
#> 2        3                9
#> 3        4               16
#> 4        5               25

4 `gapminder` 데이터셋

4.1 사용자 정의 함수

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# 제곱함수
def square(x):
    return x**2

# A 칼럽에 사용자 정의 함수를 적용
df['A_squared'] = df['A'].apply(square)

print(df)
#>    A  B  A_squared
#> 0  1  4          1
#> 1  2  5          4
#> 2  3  6          9

4.2 `apply()` 함수

import pandas as pd
from gapminder import gapminder

gapminder['GDP'] = gapminder.apply(lambda row: row['pop'] * row['gdpPercap'] / 10**9, axis=1)

gapminder[['country', 'pop', 'gdpPercap', 'GDP']]
#>           country       pop   gdpPercap       GDP
#> 0     Afghanistan   8425333  779.445314  6.567086
#> 1     Afghanistan   9240934  820.853030  7.585449
#> 2     Afghanistan  10267083  853.100710  8.758856
#> 3     Afghanistan  11537966  836.197138  9.648014
#> 4     Afghanistan  13079460  739.981106  9.678553
#> ...           ...       ...         ...       ...
#> 1699     Zimbabwe   9216418  706.157306  6.508241
#> 1700     Zimbabwe  10704340  693.420786  7.422612
#> 1701     Zimbabwe  11404948  792.449960  9.037851
#> 1702     Zimbabwe  11926563  672.038623  8.015111
#> 1703     Zimbabwe  12311143  469.709298  5.782658
#> 
#> [1704 rows x 4 columns]

4.3 사용자 정의 함수

import pandas as pd
from gapminder import gapminder
from functools import partial

def calculate_gdp(row):
  return(row['pop'] * row['gdpPercap'] / 10**9)

gapminder['GDP'] = gapminder.apply(calculate_gdp, axis=1)

gapminder[['country', 'pop', 'gdpPercap', 'GDP']]
#>           country       pop   gdpPercap       GDP
#> 0     Afghanistan   8425333  779.445314  6.567086
#> 1     Afghanistan   9240934  820.853030  7.585449
#> 2     Afghanistan  10267083  853.100710  8.758856
#> 3     Afghanistan  11537966  836.197138  9.648014
#> 4     Afghanistan  13079460  739.981106  9.678553
#> ...           ...       ...         ...       ...
#> 1699     Zimbabwe   9216418  706.157306  6.508241
#> 1700     Zimbabwe  10704340  693.420786  7.422612
#> 1701     Zimbabwe  11404948  792.449960  9.037851
#> 1702     Zimbabwe  11926563  672.038623  8.015111
#> 1703     Zimbabwe  12311143  469.709298  5.782658
#> 
#> [1704 rows x 4 columns]

1 정규화

2 벡터화

3 제곱 방법

4 gapminder 데이터셋

4.1 사용자 정의 함수

4.2 apply() 함수

4.3 사용자 정의 함수

4 `gapminder` 데이터셋

4.2 `apply()` 함수