MLOps

WSL 리눅스 배포
저자
소속
이광춘

TCS

공개

2023년 02월 26일

1 WSL 리눅스

윈도우에서 리눅스를 사용하는 방식은 진화를 거듭했지만 최근에는 Microsoft Store에서 ubuntu 리눅스를 앱 사용하듯이 설치해서 사용하면 된다.

1.1 로컬 디렉토리

/mnt/ 디렉토리 아래 C:\, D:\ 드라이브를 찾을 수 있고 그 아래 작업 디렉토리와 파일로 이동하면 후속 작업을 이어서 수행할 수 있다.

statkclee@dl:/mnt/d/tcs/curriculum/mlops$ cd /mnt/
statkclee@dl:/mnt$ ls
c  d  wsl
statkclee@dl:/mnt$ cd d/tcs/curriculum/mlops/
statkclee@dl:/mnt/d/tcs/curriculum/mlops$ pwd
/mnt/d/tcs/curriculum/mlops

2 암수 기계학습 모형

파머 펭귄 데이터를 사용하여 Random Forest 기계학습 예측모형을 사용하여 성별 예측 기계학습 모형을 파이썬 sk-learn, numpy, pandas 패키지를 사용해서 구축해보자.

코드
library(reticulate)

reticulate::source_python("mlops/vanilla/penguins.py")

py$accuracy
#> [1] 0.8208955

2.1 파이썬 코드

패키지 상단에 설치하고 파머펭귄 패키지 데이터를 가져와서 결측값 제거하고 기계학습 관련 훈련/시험 데이터 나누고, Random Forest 예측모형으로 성별 예측 모형 개발하고 시험데이터로 성별 예측모형 정확도를 측정한다.


import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

penguins = load_penguins()

df = penguins.dropna()

# X, y 구분
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['sex']

# 훈련 시험데이터 분리
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RF 분류모형
rfc = RandomForestClassifier()

# 기계학습
rfc.fit(X_train, y_train)

# 예측값 생성
y_pred = rfc.predict(X_test)

# 평가
accuracy = accuracy_score(y_test, y_pred)
print('정확도(Accuracy):', accuracy)

2.2 실제 배포환경

개발환경에서 성별예측모형을 개발했지만 이를 배포하게 되면 배포될 컴퓨터에는 관련 패키지가 설치되어 있지 않다. 그렇다고 수작업으로 requirements.txt 파일에 관련 패키지를 수작업으로 일일이 넣을 수는 없다.

2.3 pipreqs 패키지

pipreqs 패키지는 pip freeze 명령어와 비교하여 해당 프로젝트(디렉토리)의 파이썬 패키지만 requirements.txt 파일에 담아낸다.

statkclee@dl:/mnt/d/tcs/curriculum/mlops/vanilla$ pip install pipreqs
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pipreqs in /home/statkclee/.local/lib/python3.10/site-packages (0.4.11)
Requirement already satisfied: yarg in /home/statkclee/.local/lib/python3.10/site-packages (from pipreqs) (0.1.9)
Requirement already satisfied: docopt in /home/statkclee/.local/lib/python3.10/site-packages (from pipreqs) (0.6.2)
Requirement already satisfied: requests in /home/statkclee/.local/lib/python3.10/site-packages (from yarg->pipreqs) (2.28.2)
Requirement already satisfied: idna<4,>=2.5 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (1.26.14)
Requirement already satisfied: certifi>=2017.4.17 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (2022.12.7)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (3.0.1)

2.4 패키지 설치

pip install -r requirements.txt 명령어로 암수성별 예측 기계학습모형 개발에 사용된 패키지를 일괄 설치할 수 있다.

statkclee@dl:/mnt/d/tcs/curriculum/mlops/vanilla$ pipreqs .
statkclee@dl:/mnt/d/tcs/curriculum/mlops/vanilla$ ls
penguins.py  requirements.txt

statkclee@dl:/mnt/d/tcs/curriculum/mlops/vanilla$ pip install -r requirements.txt
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: numpy==1.24.2 in /home/statkclee/.local/lib/python3.10/site-packages (from -r requirements.txt (line 1)) (1.24.2)
Collecting palmerpenguins==0.1.4
  Downloading palmerpenguins-0.1.4-py3-none-any.whl (17 kB)
Collecting pandas==1.5.3
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 11.3 MB/s eta 0:00:00
Collecting scikit_learn==1.2.1
  Downloading scikit_learn-1.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.6/9.6 MB 11.4 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2022.7.1-py2.py3-none-any.whl (499 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 499.4/499.4 kB 10.3 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.1
  Downloading python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 247.7/247.7 kB 9.7 MB/s eta 0:00:00
Collecting scipy>=1.3.2
  Downloading scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 10.9 MB/s eta 0:00:00
Collecting joblib>=1.1.1
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 298.0/298.0 kB 9.8 MB/s eta 0:00:00
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.1->pandas==1.5.3->-r requirements.txt (line 3)) (1.16.0)
Installing collected packages: pytz, threadpoolctl, scipy, python-dateutil, joblib, scikit_learn, pandas, palmerpenguins
Successfully installed joblib-1.2.0 palmerpenguins-0.1.4 pandas-1.5.3 python-dateutil-2.8.2 pytz-2022.7.1 scikit_learn-1.2.1 scipy-1.10.1 threadpoolctl-3.1.0

2.5 실행

python3 penguins.py 명령어로 개발환경에서 제작한 암수성별 예측모형을 성공적으로 배포함을 확인할 수 있다.

statkclee@dl:/mnt/d/tcs/curriculum/mlops/vanilla$ python3 penguins.py
정확도(Accuracy): 0.835820895522388

3 기계학습 코드 모듈화

스크립트로 작성한 코드를 함수로 기능을 구분하여 작성한다. 동일한 기능을 제공하고 있으나 향후 코드 유지보수를 효율적으로 수행할 수 있게 한다. 암수성별 예측모형을 리팩토링하여 다시 제작하여 실행한다.

코드
library(reticulate)

reticulate::source_python("mlops/vanilla_module/penguins_functions.py")

py$accuracy
#> [1] 0.8208955

3.1 파이썬 코드

기능을 함수로 담아내어 모듈화시킨 코드는 다음과 같다.

import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


def ingest_data():
    penguins = load_penguins()
    df = penguins.dropna()
    return df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']]


def preprocess_data(df):
    # Feature engineering
    df['bill_ratio'] = df['bill_length_mm'] / df['bill_depth_mm']
    df['body_mass_log'] = np.log(df['body_mass_g'])

    # X, y 구분
    X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'bill_ratio', 'body_mass_log']]
    y = df['sex']

    # 훈련/시험 데이터셋
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Feature Engineering
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, X_test, y_train, y_test


def train_data(X_train, X_test, y_train, y_test):
    # 데이터 적합 / 기계학습
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)

    # 평가
    y_pred = rfc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy:', accuracy)

    # 평가결과 파일 저장
    with open('mlops/vanilla_module/results.txt', 'w') as f:
        f.write(f'함수 코딩 Accuracy: {accuracy}')

    return rfc


if __name__ == '__main__':
  
    df = ingest_data()
    X_train, X_test, y_train, y_test = preprocess_data(df)
    train_data(X_train, X_test, y_train, y_test)

4 파이프라인 기계학습

기계학습 파이프라인을 구축하여 암수성별 예측모형을 개발한다.

코드
library(reticulate)

reticulate::source_python("mlops/pipeline/penguins_pipeline.py")

py$accuracy
#> [1] 0.8358209

4.1 파이썬 코드

Feature Engineering과 예측모형을 Pipeline() 함수에 넣어 체계화한다.


import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score

penguins = load_penguins()

df = penguins.dropna()

# X, y 구분
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['sex']

# 훈련 시험데이터 분리
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 상수 feature 제거
selector = VarianceThreshold()
X_train = selector.fit_transform(X_train)
X_test = selector.transform(X_test)

# 파이프라인 구축
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('classifier', RandomForestClassifier())
])

# 기계학습과 예측
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# 평가
accuracy = accuracy_score(y_test, y_pred)
print('파이프라인 방식 정확도(Accuracy):', accuracy)

4.2 실행

python3 penguins_pipeline.py 명령어로 개발환경에서 파이프라인 방식을 적용하여 제작한 암수성별 예측모형을 성공적으로 배포함을 확인할 수 있다.

statkclee@dl:/mnt/d/tcs/curriculum/mlops/pipeline$ python3 penguins_pipeline.py
파이프라인 방식 정확도(Accuracy): 0.8507462686567164

5 객체지향 기계학습

기계학습모형을 객체지향 클래스로 구축하여 암수성별 예측모형을 클래스로 개발하여 적용시킨다.

코드
library(reticulate)

reticulate::source_python("mlops/oop/penguins_oop.py")

py$accuracy
#> [1] 0.8358209

5.1 파이썬 코드

객체지향 PenguinSexClassifier 클래스를 생성하여 기계학습 예측모형을 제작할 수 있다.

import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score

class PenguinSexClassifier:
    
    def __init__(self, test_size=0.2, random_state=42):
        self.test_size = test_size
        self.random_state = random_state
        self.pipeline = None
        
    def load_data(self):
      
        penguins = load_penguins()

        df = penguins.dropna()

        # X, y 구분
        self.X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
        self.y = df['sex']
    
    def preprocess_data(self):
        # 훈련 시험데이터 분리
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, test_size=self.test_size, random_state=self.random_state)

        # 상수 Feature 제거
        selector = VarianceThreshold()
        X_train = selector.fit_transform(X_train)
        X_test = selector.transform(X_test)

        # 파이프라인 구축
        self.pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('poly', PolynomialFeatures(degree=2)),            
            ('classifier', RandomForestClassifier())
        ])

        # 모형 적합(훈련)
        self.pipeline.fit(X_train, y_train)
        
        # 모형 예측
        self.y_pred = self.pipeline.predict(X_test)
        
        # 예측 정확도 평가
        self.accuracy = accuracy_score(y_test, self.y_pred)

    def run(self):
        self.load_data()
        self.preprocess_data()

if __name__ == '__main__':
    clf = PenguinSexClassifier()
    clf.run()
    print('OOP 정확도(Accuracy):', clf.accuracy)
    

5.2 실행

python3 명령어로 개발환경에서 객체지향 방식으로 개발한 클래스 방식을 적용하여 제작한 암수성별 예측모형을 성공적으로 배포함을 확인할 수 있다.

statkclee@dl:/mnt/d/tcs/curriculum/mlops/oop$ python3 penguins_oop.py
OOP 정확도(Accuracy): 0.8656716417910447

6 Makefile 자동화

6.1 파이썬 파일

암수성별 예측모형 정확도를 텍스트 파일에 저장하는 로직을 추가한다. 이를 통해 requirements.txt 파일을 저장시킨다.


import numpy as np
import pandas as pd
from palmerpenguins import load_penguins
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score

penguins = load_penguins()

df = penguins.dropna()

# X, y 구분
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = df['sex']

# 훈련 시험데이터 분리
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 상수 feature 제거
selector = VarianceThreshold()
X_train = selector.fit_transform(X_train)
X_test = selector.transform(X_test)

# 파이프라인 구축
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('classifier', RandomForestClassifier())
])

# 기계학습과 예측
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# 평가
accuracy = accuracy_score(y_test, y_pred)
print('파이프라인 방식 정확도(Accuracy):', accuracy)

# 출력결과 내보내기
with open('results.txt', 'w') as f:
    f.write(f'\n파이프라인 방식 정확도(Accuracy): {accuracy}\n')

6.2 Makefile

자동화 대상을 다음과 같이 정리한다.

  1. make requirements : 파이썬 패키지 설치 목록을 requirements.txt 에 작성한다.
  2. make install : requirements.txt 파일에 저장된 패키지를 설치한다.
  3. make run : 암수성별 예측모형을 실행하고 예측정확도를 results.txt 파일에 기록한다.
.PHONY: all clean install run requirements help

PYTHON=python3
PIP=pip3

SRC=penguins_pipeline.py
REQ=requirements.txt

all: install run

install:
	$(PIP) install -r $(REQ)

run:
	$(PYTHON) $(SRC)

requirements:
	$(PIP) install pipreqs
	pipreqs --force .

clean:
	rm -rf __pycache__ *.pyc
	rm results.txt
	rm requirements.txt

help:
	@echo "Usage: make [target]"
	@echo ""
	@echo "Targets:"
	@echo "  help		  Show this help message"
	@echo "  all		   Install dependencies and run the script"
	@echo "  requirements  Generate requirements.txt using pipreqs"
	@echo "  install	   Install dependencies"
	@echo "  run		   Run the script"
	@echo "  clean		 Clean up the project directory"

	

6.3 작업 예시

코드
statkclee@dl:/mnt/d/tcs/curriculum/mlops/makefile$ make help
Usage: make [target]

Targets:
  help            Show this help message
  all              Install dependencies and run the script
  requirements  Generate requirements.txt using pipreqs
  install          Install dependencies
  run              Run the script
  clean          Clean up the project directory

statkclee@dl:/mnt/d/tcs/curriculum/mlops/makefile$ make requirements
pip3 install pipreqs
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pipreqs in /home/statkclee/.local/lib/python3.10/site-packages (0.4.11)
Requirement already satisfied: docopt in /home/statkclee/.local/lib/python3.10/site-packages (from pipreqs) (0.6.2)
Requirement already satisfied: yarg in /home/statkclee/.local/lib/python3.10/site-packages (from pipreqs) (0.1.9)
Requirement already satisfied: requests in /home/statkclee/.local/lib/python3.10/site-packages (from yarg->pipreqs) (2.28.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (1.26.14)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (3.0.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (2022.12.7)
Requirement already satisfied: idna<4,>=2.5 in /home/statkclee/.local/lib/python3.10/site-packages (from requests->yarg->pipreqs) (3.4)
pipreqs --force .
INFO: Successfully saved requirements file in ./requirements.txt

statkclee@dl:/mnt/d/tcs/curriculum/mlops/makefile$ make install
pip3 install -r requirements.txt
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: numpy==1.24.2 in /home/statkclee/.local/lib/python3.10/site-packages (from -r requirements.txt (line 1)) (1.24.2)
Requirement already satisfied: palmerpenguins==0.1.4 in /home/statkclee/.local/lib/python3.10/site-packages (from -r requirements.txt (line 2)) (0.1.4)
Requirement already satisfied: pandas==1.5.3 in /home/statkclee/.local/lib/python3.10/site-packages (from -r requirements.txt (line 3)) (1.5.3)
Requirement already satisfied: scikit_learn==1.2.1 in /home/statkclee/.local/lib/python3.10/site-packages (from -r requirements.txt (line 4)) (1.2.1)
Requirement already satisfied: pytz>=2020.1 in /home/statkclee/.local/lib/python3.10/site-packages (from pandas==1.5.3->-r requirements.txt (line 3)) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/statkclee/.local/lib/python3.10/site-packages (from pandas==1.5.3->-r requirements.txt (line 3)) (2.8.2)
Requirement already satisfied: scipy>=1.3.2 in /home/statkclee/.local/lib/python3.10/site-packages (from scikit_learn==1.2.1->-r requirements.txt (line 4)) (1.10.1)
Requirement already satisfied: joblib>=1.1.1 in /home/statkclee/.local/lib/python3.10/site-packages (from scikit_learn==1.2.1->-r requirements.txt (line 4)) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/statkclee/.local/lib/python3.10/site-packages (from scikit_learn==1.2.1->-r requirements.txt (line 4)) (3.1.0)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.1->pandas==1.5.3->-r requirements.txt (line 3)) (1.16.0)

statkclee@dl:/mnt/d/tcs/curriculum/mlops/makefile$ make run
python3 penguins_pipeline.py
파이프라인 방식 정확도(Accuracy): 0.835820895522388

statkclee@dl:/mnt/d/tcs/curriculum/mlops/makefile$ ls
Makefile  penguins_pipeline.py  requirements.txt  results.txt