1 파이썬 환경 설정

reticulate 패키지로 콘다 파이썬 환경을 구축한다. 필요한 경우 패키지도 설치한다.

Riddhiman (Apr 19, 2022), ‘Getting started with Python using R and reticulate’

코드

# install.packages("reticulate")
library(reticulate)

# conda_list()
use_condaenv(condaenv = "r-reticulate")

# py_install(packages = c("pandas", "scikit-learn"))

2 데이터 가져오기

펭귄 데이터를 다운로드 받아 로컬 컴퓨터 data 폴더에 저장시킨다.

코드

library(tidyverse)

fs::dir_create("data")
download.file(url = "https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv", destfile = "data/penguins_cleaned.csv")

penguin_df <- readr::read_csv("data/penguins_cleaned.csv")

penguin_df
#> # A tibble: 333 × 7
#>    species island    bill_length_mm bill_depth_mm flipper_length…¹ body_…² sex  
#>    <chr>   <chr>              <dbl>         <dbl>            <dbl>   <dbl> <chr>
#>  1 Adelie  Torgersen           39.1          18.7              181    3750 male 
#>  2 Adelie  Torgersen           39.5          17.4              186    3800 fema…
#>  3 Adelie  Torgersen           40.3          18                195    3250 fema…
#>  4 Adelie  Torgersen           36.7          19.3              193    3450 fema…
#>  5 Adelie  Torgersen           39.3          20.6              190    3650 male 
#>  6 Adelie  Torgersen           38.9          17.8              181    3625 fema…
#>  7 Adelie  Torgersen           39.2          19.6              195    4675 male 
#>  8 Adelie  Torgersen           41.1          17.6              182    3200 fema…
#>  9 Adelie  Torgersen           38.6          21.2              191    3800 male 
#> 10 Adelie  Torgersen           34.6          21.1              198    4400 male 
#> # … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g

3 파이썬 기계학습 모형

파이썬 sklearn 패키지로 펭귄 성별예측 모형을 구축하자.

파이썬 코드
R 환경 불러오기

# "code/penguin_sex_clf.py"

import pandas as pd
penguins = pd.read_csv('data/penguins_cleaned.csv')

penguins_df = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']]

# Ordinal feature encoding
# https://www.kaggle.com/pratik1120/penguin-dataset-eda-classification-and-clustering
df = penguins_df.copy()

target_mapper = {'male':0, 'female':1}
def target_encode(val):
    return target_mapper[val]

df['sex'] = df['sex'].apply(target_encode)

# Separating X and Y
X = df.drop('sex', axis=1)
Y = df['sex']

# Build random forest model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, Y)

코드

source_python("code/penguin_sex_clf.py")

clf
#> RandomForestClassifier()

4 시각화

파이썬 기계학습 결과를 R로 가져와서 변수 중요도를 시각화한다.

코드

feat_tbl <- tibble(features = clf$feature_names_in_,
                   importance = clf$feature_importances_)

feat_tbl %>% 
  ggplot(aes(x = fct_reorder(features, importance), y = importance)) +
    geom_point(size = 3) +
    geom_segment( aes(x=features, xend=features, y=0, yend=importance)) +
    labs(y = "Feature 중요도", x = "Feature",
         title = "펭귄 암수 예측모형 Feature 중요도") +
    coord_flip() +
    theme_bw(base_family = "AppleGothic")

--- title: "chatGPT" subtitle: "파이썬 환경구축" author: - name: 이광춘 url: https://www.linkedin.com/in/kwangchunlee/ affiliation: 한국 R 사용자회 affiliation-url: https://github.com/bit2r title-block-banner: true #title-block-banner: "#562457" format: html: css: css/quarto.css theme: flatly code-fold: true toc: true toc-depth: 3 toc-title: 목차 number-sections: true highlight-style: github self-contained: false filters: - lightbox - interview-callout.lua lightbox: auto link-citations: yes knitr: opts_chunk: message: false warning: false collapse: true comment: "#>" R.options: knitr.graphics.auto_pdf: true editor_options: chunk_output_type: console --- ![](images/python_with_R.png) # 파이썬 환경 설정 `reticulate` 패키지로 콘다 파이썬 환경을 구축한다. 필요한 경우 패키지도 설치한다. [[Riddhiman (Apr 19, 2022), 'Getting started with Python using R and reticulate'](https://rtichoke.netlify.app/post/getting_started_with_reticulate/)]{.aside} ```{r} # install.packages("reticulate") library(reticulate) # conda_list() use_condaenv(condaenv = "r-reticulate") # py_install(packages = c("pandas", "scikit-learn")) ``` # 데이터 가져오기 [펭귄 데이터](https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv)를 다운로드 받아 로컬 컴퓨터 `data` 폴더에 저장시킨다. ```{r} library(tidyverse) fs::dir_create("data") download.file(url = "https://raw.githubusercontent.com/dataprofessor/data/master/penguins_cleaned.csv", destfile = "data/penguins_cleaned.csv") penguin_df <- readr::read_csv("data/penguins_cleaned.csv") penguin_df ``` # 파이썬 기계학습 모형 파이썬 sklearn 패키지로 펭귄 성별예측 모형을 구축하자. :::{.panel-tabset} ## 파이썬 코드 ```python # "code/penguin_sex_clf.py" import pandas as pd penguins = pd.read_csv('data/penguins_cleaned.csv') penguins_df = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']] # Ordinal feature encoding # https://www.kaggle.com/pratik1120/penguin-dataset-eda-classification-and-clustering df = penguins_df.copy() target_mapper = {'male':0, 'female':1} def target_encode(val): return target_mapper[val] df['sex'] = df['sex'].apply(target_encode) # Separating X and Y X = df.drop('sex', axis=1) Y = df['sex'] # Build random forest model from sklearn.ensemble import RandomForestClassifier clf = RandomForestClassifier(n_estimators=100) clf.fit(X, Y) ``` ## R 환경 불러오기 ```{r} source_python("code/penguin_sex_clf.py") clf ``` ::: # 시각화 파이썬 기계학습 결과를 R로 가져와서 변수 중요도를 시각화한다. ```{r} feat_tbl <- tibble(features = clf$feature_names_in_, importance = clf$feature_importances_) feat_tbl %>% ggplot(aes(x = fct_reorder(features, importance), y = importance)) + geom_point(size = 3) + geom_segment( aes(x=features, xend=features, y=0, yend=importance)) + labs(y = "Feature 중요도", x = "Feature", title = "펭귄 암수 예측모형 Feature 중요도") + coord_flip() + theme_bw(base_family = "AppleGothic") ```