1 autoEDA

1.1 `Hmisc::describe()`

The Hmisc package gives a taste of how data analysis was done some 20 years ago.

library(tidyverse)

penguins <- palmerpenguins::penguins %>%
  # Rename the English columns to Korean, in order: species, island,
  # bill length, bill depth, flipper length, body mass, sex, year
  set_names(c("종명칭", "섬이름", "부리_길이", "부리_깊이", "물갈퀴_길이",
              "체중", "성별", "연도")) %>%
  # Missing values are kept on purpose; drop_na() is left disabled
  # drop_na() %>%
  # Translate the English values to Korean (수컷 = male, 암컷 = female)
  mutate(성별 = ifelse(성별 == "male", "수컷", "암컷"),
         섬이름 = case_when( str_detect(섬이름, "Biscoe") ~ "비스코",
                          str_detect(섬이름, "Dream") ~ "드림",
                          str_detect(섬이름, "Torgersen") ~ "토르거센"),
         종명칭 = case_when( str_detect(종명칭, "Adelie") ~ "아델리",
                          str_detect(종명칭, "Chinstrap") ~ "턱끈",
                          str_detect(종명칭, "Gentoo") ~ "젠투")
  ) %>%
  # Convert the recoded columns to factors and the year to an ordered factor
  mutate(성별   = factor(성별, levels = c("수컷", "암컷")),
         섬이름 = factor(섬이름, levels = c("비스코", "드림", "토르거센")),
         종명칭 = factor(종명칭, levels = c("아델리", "턱끈", "젠투")),
         연도   = ordered(연도, levels = c(2007, 2008, 2009)))
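
The `ifelse()` recoding above leaves missing `성별` values as `NA` instead of forcing them into one of the two labels, which is why the summaries that follow still report missing values for that column. A minimal sketch with a toy vector; the labels `"M"` and `"F"` are placeholders chosen for illustration, not the values used in this document:

```r
sex <- c("male", "female", NA)

# Comparing NA with a string yields NA, so ifelse() returns NA for that element.
recoded <- ifelse(sex == "male", "M", "F")
recoded

# factor() keeps the NA as a missing value rather than as an extra level.
sex_factor <- factor(recoded, levels = c("M", "F"))
sum(is.na(sex_factor))
```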



Hmisc::describe(penguins)

penguins 

 8  Variables      344  Observations
--------------------------------------------------------------------------------
종명칭 
       n  missing distinct 
     344        0        3 
                                     
Value        아델리     턱끈     젠투
Frequency    152       68      124   
Proportion 0.442    0.198    0.360   
--------------------------------------------------------------------------------
섬이름 
       n  missing distinct 
     344        0        3 
                                     
Value        비스코     드림 토르거센
Frequency    168      124       52   
Proportion 0.488    0.360    0.151   
--------------------------------------------------------------------------------
부리_길이 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2      164        1    43.92    6.274    35.70    36.60 
     .25      .50      .75      .90      .95 
   39.23    44.45    48.50    50.80    51.99 

lowest : 32.1 33.1 33.5 34.0 34.1, highest: 55.1 55.8 55.9 58.0 59.6
--------------------------------------------------------------------------------
부리_깊이 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       80        1    17.15    2.267     13.9     14.3 
     .25      .50      .75      .90      .95 
    15.6     17.3     18.7     19.5     20.0 

lowest : 13.1 13.2 13.3 13.4 13.5, highest: 20.7 20.8 21.1 21.2 21.5
--------------------------------------------------------------------------------
물갈퀴_길이 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       55    0.999    200.9    16.03    181.0    185.0 
     .25      .50      .75      .90      .95 
   190.0    197.0    213.0    220.9    225.0 

lowest : 172 174 176 178 179, highest: 226 228 229 230 231
--------------------------------------------------------------------------------
체중 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     342        2       94        1     4202    911.8     3150     3300 
     .25      .50      .75      .90      .95 
    3550     4050     4750     5400     5650 

lowest : 2700 2850 2900 2925 2975, highest: 5850 5950 6000 6050 6300
--------------------------------------------------------------------------------
성별 
       n  missing distinct 
     333       11        2 
                          
Value         수컷    암컷
Frequency    168     165  
Proportion 0.505   0.495  
--------------------------------------------------------------------------------
연도 
       n  missing distinct 
     344        0        3 
                            
Value       2007  2008  2009
Frequency    110   114   120
Proportion 0.320 0.331 0.349
--------------------------------------------------------------------------------
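
The `Gmd` column in the output above is Gini's mean difference, a dispersion measure defined as the mean absolute difference between all pairs of distinct observations. The following base-R sketch shows the calculation on a toy vector; `gini_md` is a hypothetical helper written for illustration and is not part of Hmisc:

```r
# Hypothetical helper (not an Hmisc function): Gini's mean difference,
# the mean of |x_i - x_j| over all ordered pairs with i != j.
gini_md <- function(x) {
  x <- x[!is.na(x)]            # ignore missing values
  n <- length(x)
  if (n < 2) return(NA_real_)  # undefined for fewer than two values
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

toy <- c(1, 2, 3, 4, NA)
gini_md(toy)
```

If Hmisc is installed, `Hmisc::GiniMd(toy, na.rm = TRUE)` should return the same value.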

1.2 `skimr`

Use the skimr package to get acquainted with the data to be analyzed.

penguins %>% 
  skimr::skim()

Data summary
Name	Piped data
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	4
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
종명칭	0	1.00	FALSE	3	아델리: 152, 젠투: 124, 턱끈: 68
섬이름	0	1.00	FALSE	3	비스코: 168, 드림: 124, 토르거: 52
성별	11	0.97	FALSE	2	수컷: 168, 암컷: 165
연도	0	1.00	TRUE	3	200: 120, 200: 114, 200: 110

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
부리_길이	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
부리_깊이	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
물갈퀴_길이	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
체중	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
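
The `n_missing` and `complete_rate` columns can be reproduced with base R. For example, `성별` has 11 missing values among 344 rows, so its complete rate is 333/344, which rounds to 0.97. A minimal sketch on a toy data frame whose values are assumed for illustration:

```r
# Toy data frame with assumed values; NA marks a missing measurement.
toy <- data.frame(
  bill_length = c(39.1, 39.5, NA, 36.7),
  sex         = factor(c("male", "female", NA, NA))
)

n_missing     <- colSums(is.na(toy))    # number of missing values per column
complete_rate <- colMeans(!is.na(toy))  # share of non-missing values per column

data.frame(n_missing, complete_rate)
```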

1.3 `dataxray`

The dataxray package provides an interactive view that deepens understanding of the data further.

library(dataxray)

penguins %>% 
   make_xray() %>% 
   view_xray()

1.4 `dlookr`

The dlookr package is another way to get acquainted with the data to be analyzed.

library(kableExtra)
penguins %>% 
  dlookr::describe() %>% 
  kable(caption = "Summary statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)

Summary statistics
described_variables	n	na	mean	sd	se_mean	IQR	skewness	kurtosis	p00	p01	p05	p10	p20	p25	p30	p40	p50	p60	p70	p75	p80	p90	p95	p99	p100
부리_길이	342	2	43.92193	5.459584	0.2952205	9.275	0.0531181	-0.8760270	32.1	34.041	35.7	36.6	38.34	39.225	40.20	42.0	44.45	46.0	47.37	48.5	49.38	50.8	51.995	55.513	59.6
부리_깊이	342	2	17.15117	1.974793	0.1067846	3.100	-0.1434646	-0.9068661	13.1	13.441	13.9	14.3	15.00	15.600	15.93	16.8	17.30	17.9	18.50	18.7	18.90	19.5	20.000	21.100	21.5
물갈퀴_길이	342	2	200.91520	14.061714	0.7603704	23.000	0.3456818	-0.9842729	172.0	178.000	181.0	185.0	188.00	190.000	191.00	194.0	197.00	203.0	210.00	213.0	215.00	220.9	225.000	230.000	231.0
체중	342	2	4201.75439	801.954536	43.3647348	1200.000	0.4703293	-0.7192219	2700.0	2900.000	3150.0	3300.0	3475.00	3550.000	3650.00	3800.0	4050.00	4300.0	4650.00	4750.0	4950.00	5400.0	5650.000	5979.500	6300.0

1.5 `DataExplorer`

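The DataExplorer package can generate a full report in one call, or its individual functions can be used to examine the data step by step.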

DataExplorer::create_report(penguins)

DataExplorer::plot_str(penguins)

DataExplorer::introduce(penguins)

# A tibble: 1 × 9
   rows columns discrete_columns continuous_columns all_missing_columns
  <int>   <int>            <int>              <int>               <int>
1   344       8                4                  4                   0
# ℹ 4 more variables: total_missing_values <int>, complete_rows <int>,
#   total_observations <int>, memory_usage <dbl>

DataExplorer::plot_intro(penguins)

DataExplorer::plot_missing(penguins)

DataExplorer::plot_bar(penguins)

DataExplorer::plot_histogram(penguins)

penguins %>% select_if(is.numeric) %>% 
  # drop_na() %>% 
  DataExplorer::plot_correlation(cor_args = list("use" = "pairwise.complete.obs"))
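
The `use = "pairwise.complete.obs"` argument is passed through to `cor()`, so each coefficient is computed from the rows that are complete for that particular pair of columns. Calling `drop_na()` first would instead keep only the rows that are complete in every column. The difference can be seen with a small sketch in base R, using a toy data frame with assumed values:

```r
# Toy data with missing values scattered across columns (assumed values).
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, NA),
  y = c(2, 4, 6, 8, NA, 12),
  z = c(1, NA, 2, 5, 3, 4)
)

# Pairwise deletion: each coefficient uses the rows complete for that pair.
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Listwise deletion: only rows complete in every column are used.
r_listwise <- cor(toy, use = "complete.obs")

r_pairwise["x", "z"]  # based on four rows
r_listwise["x", "z"]  # based on three rows
```

Pairwise deletion uses more of the data, but the coefficients in one matrix may then rest on different subsets of rows.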

penguins_pca <- penguins %>% select_if(is.numeric) %>% 
  drop_na() %>% prcomp(scale. = TRUE)

summary(penguins_pca)$importance %>% as.data.frame() %>% 
  kable(caption = "PCA summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)

PCA summary
	PC1	PC2	PC3	PC4
Standard deviation	1.659444	0.8789293	0.6043475	0.3293816
Proportion of Variance	0.688440	0.1931300	0.0913100	0.0271200
Cumulative Proportion	0.688440	0.8815700	0.9728800	1.0000000
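
The two proportion rows follow directly from the standard deviations: each component's variance is its squared standard deviation, and the proportion is that variance divided by the total. A short sketch that recomputes them from the numbers in the table above:

```r
# Standard deviations copied from the PCA table above.
sdev <- c(PC1 = 1.659444, PC2 = 0.8789293, PC3 = 0.6043475, PC4 = 0.3293816)

variance   <- sdev^2                     # variance explained by each component
proportion <- variance / sum(variance)   # "Proportion of Variance" row
cumulative <- cumsum(proportion)         # "Cumulative Proportion" row

round(rbind(proportion, cumulative), 5)
```

Because the four variables were standardized before the PCA, the variances sum to roughly 4, the number of variables.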

2 ChatGPT EDA

Model: GPT-4, autoEDA and autoML

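Using ChatGPT with the GPT-4 model can greatly improve productivity (Raiyyan, 2023). First assign the model a role and, if the dataset is small, paste the data into the prompt so that GPT-4 has what it needs to work with.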

I want you to act as a data scientist using R programming language and code for me. I have a dataset of penguins as follows. If you understand what I instructed. print “OK” and wait for another instruction.

species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
Adelie,Torgersen,39.3,20.6,190,3650,MALE
Adelie,Torgersen,38.9,17.8,181,3625,FEMALE
Adelie,Torgersen,39.2,19.6,195,4675,MALE
…

GPT-4 response: OK
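
A prompt of this shape can be assembled programmatically rather than by copy and paste. The sketch below is one possible way to do it in base R; `build_prompt` is a hypothetical helper, and the toy data frame contains only a few assumed columns and rows:

```r
# Hypothetical helper: serialize a small data frame as CSV text and
# append it to the role-setting instruction.
build_prompt <- function(instruction, data) {
  csv_lines <- capture.output(
    write.csv(data, row.names = FALSE, quote = FALSE)
  )
  paste(c(instruction, "", csv_lines), collapse = "\n")
}

# Toy data standing in for the real dataset (assumed values).
toy <- data.frame(
  species     = c("Adelie", "Adelie"),
  island      = c("Torgersen", "Torgersen"),
  body_mass_g = c(3750, NA),
  sex         = c("MALE", NA)
)

prompt <- build_prompt(
  "I want you to act as a data scientist using R programming language and code for me.",
  toy
)
cat(prompt)
```

Missing values are written as `NA`, matching the rows in the prompt above. This approach only suits small datasets, since the whole table has to fit in the prompt.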

With the context in place, hand the actual EDA task to ChatGPT.

Please write code for the exploratory data analysis (EDA) task and target variable is sex.

Here’s an example of an R script that performs exploratory data analysis (EDA) on a dataset, with the target variable being ‘sex’. The code uses the same dataset (‘penguins.csv’) as the previous answer.

# Load libraries
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(readr)

# Import dataset
data <- read_csv("penguins.csv")

# Data preprocessing and cleaning
data_clean <- data %>%
  drop_na() %>%
  mutate(sex = as.factor(sex))

# Examine the structure of the dataset
print(str(data_clean))

# Summary statistics
print(summary(data_clean))

# Target variable distribution
print(table(data_clean$sex))

# Visualize target variable distribution
ggplot(data_clean, aes(x = sex)) +
  geom_bar() +
  ggtitle("Sex Distribution")

# Pairwise scatterplot of continuous variables
pairs_plot <- data_clean %>%
  select_if(is.numeric) %>%
  ggpairs()

print(pairs_plot)

# Boxplots for continuous variables by sex
data_clean %>%
  gather("variable", "value", -sex) %>%
  ggplot(aes(x = sex, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  ggtitle("Boxplots of Continuous Variables by Sex")

# Visualize categorical variables by sex
data_clean %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = "dodge") +
  ggtitle("Island Distribution by Sex")

# Correlation matrix
cor_matrix <- cor(data_clean %>% select_if(is.numeric))
print(cor_matrix)

This script includes several EDA techniques such as examining the dataset’s structure, summarizing the dataset, visualizing the target variable distribution, creating pairwise scatterplots, boxplots, and bar charts, and calculating the correlation matrix for the continuous variables.

Make sure to replace the filename and adapt the code to your specific needs.

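The script below adapts that answer to the `palmerpenguins` data: it loads `GGally` for `ggpairs()` and drops the non-numeric columns before reshaping the data for the boxplots.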

# Load libraries
library(tidyverse)
library(palmerpenguins)
library(GGally)

# Import dataset
penguins <- palmerpenguins::penguins

# Data preprocessing and cleaning
penguins <- penguins %>%
  drop_na() %>%
  mutate(sex = as.factor(sex))

# Examine the structure of the dataset
print(str(penguins))

tibble [333 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ year             : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
NULL

# Summary statistics
print(summary(penguins))

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :146   Biscoe   :163   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :123   1st Qu.:39.50   1st Qu.:15.60  
 Gentoo   :119   Torgersen: 47   Median :44.50   Median :17.30  
                                 Mean   :43.99   Mean   :17.16  
                                 3rd Qu.:48.60   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172       Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190       1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197       Median :4050                Median :2008  
 Mean   :201       Mean   :4207                Mean   :2008  
 3rd Qu.:213       3rd Qu.:4775                3rd Qu.:2009  
 Max.   :231       Max.   :6300                Max.   :2009

# Target variable distribution
print(table(penguins$sex))


female   male 
   165    168

# Visualize target variable distribution
ggplot(penguins, aes(x = sex)) +
  geom_bar() +
  ggtitle("Sex Distribution")

# Pairwise scatterplot of continuous variables
pairs_plot <- penguins %>%
  select_if(is.numeric) %>%
  ggpairs()

print(pairs_plot)

# Boxplots for continuous variables by sex
penguins %>%
  select(-species, -island) %>% 
  pivot_longer(cols = -sex, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = sex, y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  ggtitle("Boxplots of Continuous Variables by Sex")

# Visualize categorical variables by sex
penguins %>%
  ggplot(aes(x = island, fill = sex)) +
  geom_bar(position = "dodge") +
  ggtitle("Island Distribution by Sex")

# Correlation matrix
cor_matrix <- cor(penguins %>% select_if(is.numeric))
print(cor_matrix)

                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm         1.0000000    -0.2286256         0.6530956  0.58945111
bill_depth_mm         -0.2286256     1.0000000        -0.5777917 -0.47201566
flipper_length_mm      0.6530956    -0.5777917         1.0000000  0.87297890
body_mass_g            0.5894511    -0.4720157         0.8729789  1.00000000
year                   0.0326569    -0.0481816         0.1510679  0.02186213
                         year
bill_length_mm     0.03265690
bill_depth_mm     -0.04818160
flipper_length_mm  0.15106792
body_mass_g        0.02186213
year               1.00000000

References

Raiyyan, S. (2023). 11 Cool ChatGPT Prompts for Data Science. Medium.com. https://medium.com/ai-in-plain-english/11-cool-chatgpt-prompts-for-data-science-84ecf5e592bc
