Data Science

Let's tackle data science with ChatGPT.

1 autoEDA

1.1 Hmisc::describe()

The Hmisc package offers a taste of how data analysis was done some 20 years ago.

Code
library(tidyverse)

penguins <- palmerpenguins::penguins %>%
  # Rename the English columns to Korean: species, island, bill length,
  # bill depth, flipper length, body mass, sex, year
  set_names(c("종명칭", "섬이름", "부리_길이", "부리_깊이", "물갈퀴_길이",
              "체중", "성별", "연도")) %>%
  # Missing values are kept on purpose so the summaries below report them
  # drop_na() %>%
  # Translate the English values to Korean (NA stays NA)
  mutate(성별 = ifelse(성별 == "male", "수컷", "암컷"),
         섬이름 = case_when( str_detect(섬이름, "Biscoe") ~ "비스코",
                          str_detect(섬이름, "Dream") ~ "드림",
                          str_detect(섬이름, "Torgersen") ~ "토르거센"),
         종명칭 = case_when( str_detect(종명칭, "Adelie") ~ "아델리",
                          str_detect(종명칭, "Chinstrap") ~ "턱끈",
                          str_detect(종명칭, "Gentoo") ~ "젠투")
  ) %>%
  # Convert to factors with an explicit level order (year as an ordered factor)
  mutate(성별   = factor(성별, levels = c("수컷", "암컷")),
         섬이름 = factor(섬이름, levels = c("비스코", "드림", "토르거센")),
         종명칭 = factor(종명칭, levels = c("아델리", "턱끈", "젠투")),
         연도   = ordered(연도, levels = c(2007, 2008, 2009)))



Hmisc::describe(penguins)
penguins 

 8  Variables      344  Observations
--------------------------------------------------------------------------------
종명칭 
       n  missing distinct 
     344        0        3 
                                     
Value        아델리     턱끈     젠투
Frequency    152       68      124   
Proportion 0.442    0.198    0.360   
--------------------------------------------------------------------------------
섬이름 
       n  missing distinct 
     344        0        3 
                                     
Value        비스코     드림 토르거센
Frequency    168      124       52   
Proportion 0.488    0.360    0.151   
--------------------------------------------------------------------------------
부리_길이 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2      164        1    43.92    6.274    35.70    36.60 
     .25      .50      .75      .90      .95 
   39.23    44.45    48.50    50.80    51.99 

lowest : 32.1 33.1 33.5 34.0 34.1, highest: 55.1 55.8 55.9 58.0 59.6
--------------------------------------------------------------------------------
부리_깊이 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       80        1    17.15    2.267     13.9     14.3 
     .25      .50      .75      .90      .95 
    15.6     17.3     18.7     19.5     20.0 

lowest : 13.1 13.2 13.3 13.4 13.5, highest: 20.7 20.8 21.1 21.2 21.5
--------------------------------------------------------------------------------
물갈퀴_길이 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       55    0.999    200.9    16.03    181.0    185.0 
     .25      .50      .75      .90      .95 
   190.0    197.0    213.0    220.9    225.0 

lowest : 172 174 176 178 179, highest: 226 228 229 230 231
--------------------------------------------------------------------------------
체중 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       94        1     4202    911.8     3150     3300 
     .25      .50      .75      .90      .95 
    3550     4050     4750     5400     5650 

lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300
--------------------------------------------------------------------------------
성별 
       n  missing distinct 
     333       11        2 
                          
Value         수컷    암컷
Frequency    168     165  
Proportion 0.505   0.495  
--------------------------------------------------------------------------------
연도 
       n  missing distinct 
     344        0        3 
                            
Value       2007  2008  2009
Frequency    110   114   120
Proportion 0.320 0.331 0.349
--------------------------------------------------------------------------------
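
The Gmd column in the describe() output is Gini's mean difference: the average absolute difference between all pairs of observations, reported as a robust measure of spread. A minimal base-R sketch of that definition follows; the helper name gini_mean_diff and the toy vector are hypothetical and not part of Hmisc.

```r
# Hypothetical helper (not an Hmisc function): Gini's mean difference,
# i.e. the mean of |x_i - x_j| over all ordered pairs with i != j.
gini_mean_diff <- function(x) {
  x <- x[!is.na(x)]                  # ignore missing values
  n <- length(x)
  if (n < 2) return(NA_real_)        # undefined for fewer than two values
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

# Made-up values for illustration
toy <- c(1, 2, 4, NA)
gmd_toy <- gini_mean_diff(toy)
print(gmd_toy)
```

Applied to a column such as penguins$부리_깊이, the result should land close to the Gmd value printed above, assuming this definition matches the one Hmisc uses.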

1.2 skimr

The skimr package is a quick way to get familiar with the data to be analyzed.

Code
penguins %>% 
  skimr::skim()
Data summary
Name Piped data
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 4
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
종명칭         0          1.00           FALSE    3         아델리: 152, 젠투: 124, 턱끈: 68
섬이름         0          1.00           FALSE    3         비스코: 168, 드림: 124, 토르거: 52
성별           11         0.97           FALSE    2         수컷: 168, 암컷: 165
연도           0          1.00           TRUE     3         200: 120, 200: 114, 200: 110

Variable type: numeric

skim_variable  n_missing  complete_rate  mean     sd      p0      p25      p50      p75     p100    hist
부리_길이      2          0.99           43.92    5.46    32.1    39.23    44.45    48.5    59.6    ▃▇▇▆▁
부리_깊이      2          0.99           17.15    1.97    13.1    15.60    17.30    18.7    21.5    ▅▅▇▇▂
물갈퀴_길이    2          0.99           200.92   14.06   172.0   190.00   197.00   213.0   231.0   ▂▇▃▅▂
체중           2          0.99           4201.75  801.95  2700.0  3550.00  4050.00  4750.0  6300.0  ▃▇▆▃▂
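
The n_missing and complete_rate columns are simple to reproduce: complete_rate is the share of non-missing values in each column. A base-R sketch on a small made-up data frame (toy_df and its values are illustrative, not taken from the penguins data):

```r
# Made-up data with a few missing values
toy_df <- data.frame(
  mass = c(3750, NA, 3250, 3450),
  sex  = factor(c("male", NA, NA, "female"))
)

# One row per column: count of NAs and share of non-missing values
missing_summary <- data.frame(
  skim_variable = names(toy_df),
  n_missing     = unname(vapply(toy_df, function(col) sum(is.na(col)), integer(1))),
  complete_rate = unname(vapply(toy_df, function(col) mean(!is.na(col)), numeric(1)))
)
print(missing_summary)
```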

1.3 dataxray

The dataxray package can deepen your understanding of the data further by presenting the summary as an interactive table.

Code
library(dataxray)

penguins %>% 
   make_xray() %>% 
   view_xray()

1.4 dlookr

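The dlookr package offers another way to get familiar with the data: dlookr::describe() returns descriptive statistics for the numeric variables.

Code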
library(kableExtra)
penguins %>% 
  dlookr::describe() %>% 
  kable(caption = "Summary statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Summary statistics
described_variables n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20 p25 p30 p40 p50 p60 p70 p75 p80 p90 p95 p99 p100
부리_길이 342 2 43.92193 5.459584 0.2952205 9.275 0.0531181 -0.8760270 32.1 34.041 35.7 36.6 38.34 39.225 40.20 42.0 44.45 46.0 47.37 48.5 49.38 50.8 51.995 55.513 59.6
부리_깊이 342 2 17.15117 1.974793 0.1067846 3.100 -0.1434646 -0.9068661 13.1 13.441 13.9 14.3 15.00 15.600 15.93 16.8 17.30 17.9 18.50 18.7 18.90 19.5 20.000 21.100 21.5
물갈퀴_길이 342 2 200.91520 14.061714 0.7603704 23.000 0.3456818 -0.9842729 172.0 178.000 181.0 185.0 188.00 190.000 191.00 194.0 197.00 203.0 210.00 213.0 215.00 220.9 225.000 230.000 231.0
체중 342 2 4201.75439 801.954536 43.3647348 1200.000 0.4703293 -0.7192219 2700.0 2900.000 3150.0 3300.0 3475.00 3550.000 3650.00 3800.0 4050.00 4300.0 4650.00 4750.0 4950.00 5400.0 5650.000 5979.500 6300.0
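
Two of the derived columns in this table can be checked by hand from the others: se_mean is sd / sqrt(n), and IQR is p75 - p25. A short sketch using the 부리_길이 (bill length) row copied from the table above; the variable names are chosen here for illustration.

```r
# Values copied from the 부리_길이 row of the dlookr::describe() table
n_obs <- 342
sd_x  <- 5.459584
p25   <- 39.225
p75   <- 48.5

se_mean <- sd_x / sqrt(n_obs)   # standard error of the mean
iqr     <- p75 - p25            # interquartile range

print(c(se_mean = se_mean, IQR = iqr))
```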

1.5 DataExplorer

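The DataExplorer package covers similar ground: create_report() builds a full HTML report in one call, while the individual functions that follow produce each piece separately.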

DataExplorer::create_report(penguins)
Code
DataExplorer::plot_str(penguins)
Code
DataExplorer::introduce(penguins)
# A tibble: 1 × 9
   rows columns discrete_columns continuous_columns all_missing_columns
  <int>   <int>            <int>              <int>               <int>
1   344       8                4                  4                   0
# ℹ 4 more variables: total_missing_values <int>, complete_rows <int>,
#   total_observations <int>, memory_usage <dbl>
Code
DataExplorer::plot_intro(penguins)

Code
DataExplorer::plot_missing(penguins)

Code
DataExplorer::plot_bar(penguins)

Code
DataExplorer::plot_histogram(penguins)

Code
penguins %>% select(where(is.numeric)) %>% 
  # drop_na() %>% 
  # Compute each correlation from the rows that are complete for that pair
  DataExplorer::plot_correlation(cor_args = list("use" = "pairwise.complete.obs"))
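
The cor_args list is passed on to cor(), and use = "pairwise.complete.obs" tells it to compute each coefficient from the rows where both variables are present instead of returning NA. A base-R sketch on made-up data (toy_df is illustrative only):

```r
# Made-up data: column a and column b each have one missing value
toy_df <- data.frame(
  a = c(1, 2, 3, 4, NA),
  b = c(2, 4, 6, NA, 10),
  c = c(5, 3, 4, 1, 2)
)

# Default (use = "everything"): a missing value in either column yields NA
cor_default <- cor(toy_df)

# Pairwise deletion: each coefficient uses the rows complete for that pair
cor_pairwise <- cor(toy_df, use = "pairwise.complete.obs")

print(cor_default)
print(cor_pairwise)
```

With pairwise deletion each coefficient may rest on a different subset of rows, so the effective sample size can differ from cell to cell.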

Code
# PCA on the standardized numeric variables; rows with missing values are dropped
penguins_pca <- penguins %>% select(where(is.numeric)) %>% 
  drop_na() %>% prcomp(scale. = TRUE)

summary(penguins_pca)$importance %>% as.data.frame() %>% 
  kable(caption = "PCA summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
PCA summary
                        PC1       PC2       PC3       PC4
Standard deviation      1.659444  0.8789293 0.6043475 0.3293816
Proportion of Variance  0.688440  0.1931300 0.0913100 0.0271200
Cumulative Proportion   0.688440  0.8815700 0.9728800 1.0000000
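
The Proportion of Variance row follows directly from the standard deviations: each component's variance (its squared standard deviation) divided by the total variance. A sketch using the values copied from the PCA summary:

```r
# Standard deviations copied from the PCA summary table
sdev <- c(PC1 = 1.659444, PC2 = 0.8789293, PC3 = 0.6043475, PC4 = 0.3293816)

variance   <- sdev^2                      # variance explained by each component
proportion <- variance / sum(variance)    # share of the total variance
cumulative <- cumsum(proportion)          # running total of the shares

print(round(rbind(proportion, cumulative), 5))
```

Because the four variables were standardized before the PCA, the variances sum to roughly 4, the number of variables.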

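2 ChatGPT EDA

Using ChatGPT with the GPT-4 model can substantially boost productivity (Raiyyan, 2023). First assign the model a role and, if the dataset is small, paste it into the prompt so that GPT-4 has the context it needs to work.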

I want you to act as a data scientist using R programming language and code for me. I have a dataset of penguins as follows. If you understand what I instructed. print “OK” and wait for another instruction.

species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
Adelie,Torgersen,39.3,20.6,190,3650,MALE
Adelie,Torgersen,38.9,17.8,181,3625,FEMALE
Adelie,Torgersen,39.2,19.6,195,4675,MALE

GPT-4 response: OK
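
Pasting the data by hand works for a few rows; the same kind of prompt can also be assembled programmatically. A minimal base-R sketch, assuming a hypothetical helper build_prompt() and a small made-up data frame, that serializes the first rows as CSV and appends them to the role instruction:

```r
# Hypothetical helper: combine a role instruction with the first rows of a
# data frame, serialized as CSV text.
build_prompt <- function(df, role, n_rows = 8) {
  csv_lines <- capture.output(
    write.csv(head(df, n_rows), row.names = FALSE, quote = FALSE)
  )
  paste(c(role, "", csv_lines), collapse = "\n")
}

# Made-up rows for illustration only
toy_penguins <- data.frame(
  species     = c("Adelie", "Adelie", "Adelie"),
  island      = c("Torgersen", "Torgersen", "Torgersen"),
  body_mass_g = c(3750, 3800, NA),
  sex         = c("MALE", "FEMALE", NA)
)

role <- paste(
  "I want you to act as a data scientist using R programming language",
  "and code for me. I have a dataset of penguins as follows."
)

prompt <- build_prompt(toy_penguins, role)
cat(prompt, "\n")
```

This only makes sense for small datasets, since every row sent this way takes up space in the prompt.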

Next, hand the actual EDA task to ChatGPT with an explicit instruction.

Please write code for the exploratory data analysis (EDA) task and target variable is sex.

Here’s an example of an R script that performs exploratory data analysis (EDA) on a dataset, with the target variable being ‘sex’. The code uses the same dataset (‘penguins.csv’) as the previous answer.

# Load libraries
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(readr)

# Import dataset
data <- read_csv("penguins.csv")

# Data preprocessing and cleaning
data_clean <- data %>%
  drop_na() %>%
  mutate(sex = as.factor(sex))

# Examine the structure of the dataset
print(str(data_clean))

# Summary statistics
print(summary(data_clean))

# Target variable distribution
print(table(data_clean$sex))

# Visualize target variable distribution
ggplot(data_clean, aes(x = sex)) +
  geom_bar() +
  ggtitle("Sex Distribution")

# Pairwise scatterplot of continuous variables
pairs_plot <- data_clean %>%
  select_if(is.numeric) %>%
  ggpairs()

print(pairs_plot)

# Boxplots for continuous variables by sex
data_clean %>%
  gather("variable", "value", -sex) %>%
  ggplot(aes(x = sex, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  ggtitle("Boxplots of Continuous Variables by Sex")

# Visualize categorical variables by sex
data_clean %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = "dodge") +
  ggtitle("Island Distribution by Sex")

# Correlation matrix
cor_matrix <- cor(data_clean %>% select_if(is.numeric))
print(cor_matrix)

This script includes several EDA techniques such as examining the dataset’s structure, summarizing the dataset, visualizing the target variable distribution, creating pairwise scatterplots, boxplots, and bar charts, and calculating the correlation matrix for the continuous variables.

Make sure to replace the filename and adapt the code to your specific needs.

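The generated script does not run unchanged here. The version below loads the data from the palmerpenguins package instead of reading penguins.csv, attaches GGally (which provides ggpairs()), and drops the species and island columns before reshaping with pivot_longer() so that only numeric values reach the boxplots.

Code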
# Load libraries
library(tidyverse)
library(palmerpenguins)
library(GGally)

# Import dataset
penguins <- palmerpenguins::penguins

# Data preprocessing and cleaning
penguins <- penguins %>%
  drop_na() %>%
  mutate(sex = as.factor(sex))

# Examine the structure of the dataset
# (str() prints its result itself; wrapping it in print() only adds a stray NULL)
str(penguins)
tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ year             : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Code
# Summary statistics
print(summary(penguins))
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60  
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30  
                                 Mean   :43.99   Mean   :17.16  
                                 3rd Qu.:48.60   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172       Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197       Median :4050                Median :2008  
 Mean   :201       Mean   :4207                Mean   :2008  
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:2009  
 Max.   :231       Max.   :6300                Max.   :2009  
Code
# Target variable distribution
print(table(penguins$sex))

female   male 
   165    168 
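
Since sex is the target variable, the counts are easier to judge as proportions. A small sketch using the counts printed above (the variable names are chosen here for illustration):

```r
# Counts copied from the table() output above
sex_counts <- c(female = 165, male = 168)

# Share of each class; values near 0.5 indicate a balanced target
sex_share <- prop.table(sex_counts)
print(round(sex_share, 3))
```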
Code
# Visualize target variable distribution
ggplot(penguins, aes(x = sex)) +
  geom_bar() +
  ggtitle("Sex Distribution")

Code
# Pairwise scatterplot of continuous variables
pairs_plot <- penguins %>%
  select(where(is.numeric)) %>%
  ggpairs()

print(pairs_plot)

Code
# Boxplots for continuous variables by sex
penguins %>%
  select(-species, -island) %>% 
  pivot_longer(cols = -sex, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = sex, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  ggtitle("Boxplots of Continuous Variables by Sex")

Code
# Visualize categorical variables by sex
penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = "dodge") +
  ggtitle("Island Distribution by Sex")

Code
# Correlation matrix
cor_matrix <- cor(penguins %>% select(where(is.numeric)))
print(cor_matrix)
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm         1.0000000    -0.2286256         0.6530956  0.58945111
bill_depth_mm         -0.2286256     1.0000000        -0.5777917 -0.47201566
flipper_length_mm      0.6530956    -0.5777917         1.0000000  0.87297890
body_mass_g            0.5894511    -0.4720157         0.8729789  1.00000000
year                   0.0326569    -0.0481816         0.1510679  0.02186213
                         year
bill_length_mm     0.03265690
bill_depth_mm     -0.04818160
flipper_length_mm  0.15106792
body_mass_g        0.02186213
year               1.00000000

References

Raiyyan, S. (2023). 11 Cool ChatGPT Prompts for Data Science. Medium.com. https://medium.com/ai-in-plain-english/11-cool-chatgpt-prompts-for-data-science-84ecf5e592bc