Machine Learning

Tree-Based Models
Author: 이광춘
Affiliation: TCS
Published: January 16, 2023

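1 Problem Definition

The goal is to predict whether a breast mass is malignant, using information extracted from a digitized image of a fine needle aspirate (FNA) of the mass.

Class distribution: 357 benign, 212 malignant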
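
The class counts above already fix two reference numbers for any classifier: the accuracy of always predicting the majority class, and the impurity of the unsplit data that a tree tries to reduce. The sketch below assumes a CART-style tree that measures impurity with the Gini index; the source does not say which criterion is used.

```python
# Class counts taken from the problem definition above.
counts = {"benign": 357, "malignant": 212}
total = sum(counts.values())

# Accuracy obtained by always predicting the most frequent class.
baseline_accuracy = max(counts.values()) / total

# Gini impurity of the unsplit data: 1 - sum over classes of p_k^2.
gini = 1.0 - sum((n / total) ** 2 for n in counts.values())

print(f"total = {total}")
print(f"majority-class baseline accuracy = {baseline_accuracy:.3f}")
print(f"root-node Gini impurity = {gini:.3f}")
```

A fitted model is only useful if it clearly beats the baseline accuracy, and each split in a Gini-based tree is chosen to lower the weighted impurity of the child nodes below the root value.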

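2 Datasets

2.1 Breast Cancer (Classification)

The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and each variable describes a characteristic of the cell nuclei present in the image.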

  • Variable description
      1. ID number
      2. Diagnosis (M = malignant, B = benign)
  • Ten real-valued features are computed for each cell nucleus:
      1. radius (mean of distances from center to points on the perimeter)
      2. texture (standard deviation of gray-scale values)
      3. perimeter
      4. area
      5. smoothness (local variation in radius lengths)
      6. compactness (perimeter^2 / area - 1.0)
      7. concavity (severity of concave portions of the contour)
      8. concave points (number of concave portions of the contour)
      9. symmetry
      10. fractal dimension ("coastline approximation" - 1)
  • The mean, standard error, and "worst" (mean of the three largest values) of each feature are computed per image, giving 30 features. For example, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
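
The field numbering above can be reproduced in a few lines. This is a sketch that assumes the fields are laid out as ID, diagnosis, then the ten features repeated for three statistics, and that the names follow the `feature_statistic` pattern seen in the data summary that follows; the helper `field` is hypothetical.

```python
# The ten per-nucleus features, in the order listed above.
features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three summary statistics per feature, in field order.
stats = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis; fields 3 to 32 are the features.
columns = ["id", "diagnosis"] + [f"{f}_{s}" for s in stats for f in features]


def field(n):
    """Return the column name for a 1-based field number."""
    return columns[n - 1]


print(field(3), field(13), field(23))  # radius_mean radius_se radius_worst
print(len(columns))  # 32
```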
Data summary
Name: Piped data
Number of rows: 568
Number of columns: 33
Column type frequency: character 1, logical 1, numeric 31
Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
diagnosis 0 1 1 1 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
...33 568 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 30425139.67 125124311.81 8670.00 869222.50 906157.00 8825022.25 911320502.00 ▇▁▁▁▁
radius_mean 0 1 14.14 3.52 6.98 11.71 13.38 15.80 28.11 ▂▇▃▁▁
texture_mean 0 1 19.28 4.30 9.71 16.17 18.84 21.78 39.28 ▃▇▃▁▁
perimeter_mean 0 1 92.05 24.25 43.79 75.20 86.29 104.15 188.50 ▃▇▃▁▁
area_mean 0 1 655.72 351.66 143.50 420.30 551.40 784.15 2501.00 ▇▃▂▁▁
smoothness_mean 0 1 0.10 0.01 0.06 0.09 0.10 0.11 0.16 ▂▇▅▁▁
compactness_mean 0 1 0.10 0.05 0.02 0.07 0.09 0.13 0.35 ▇▇▂▁▁
concavity_mean 0 1 0.09 0.08 0.00 0.03 0.06 0.13 0.43 ▇▃▂▁▁
concave points_mean 0 1 0.05 0.04 0.00 0.02 0.03 0.07 0.20 ▇▃▂▁▁
symmetry_mean 0 1 0.18 0.03 0.11 0.16 0.18 0.20 0.30 ▁▇▅▁▁
fractal_dimension_mean 0 1 0.06 0.01 0.05 0.06 0.06 0.07 0.10 ▆▇▂▁▁
radius_se 0 1 0.41 0.28 0.11 0.23 0.32 0.48 2.87 ▇▁▁▁▁
texture_se 0 1 1.22 0.55 0.36 0.83 1.11 1.47 4.88 ▇▅▁▁▁
perimeter_se 0 1 2.87 2.02 0.76 1.61 2.29 3.36 21.98 ▇▁▁▁▁
area_se 0 1 40.37 45.52 6.80 17.85 24.57 45.24 542.20 ▇▁▁▁▁
smoothness_se 0 1 0.01 0.00 0.00 0.01 0.01 0.01 0.03 ▇▃▁▁▁
compactness_se 0 1 0.03 0.02 0.00 0.01 0.02 0.03 0.14 ▇▃▁▁▁
concavity_se 0 1 0.03 0.03 0.00 0.02 0.03 0.04 0.40 ▇▁▁▁▁
concave points_se 0 1 0.01 0.01 0.00 0.01 0.01 0.01 0.05 ▇▇▁▁▁
symmetry_se 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.08 ▇▃▁▁▁
fractal_dimension_se 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.03 ▇▁▁▁▁
radius_worst 0 1 16.28 4.83 7.93 13.02 14.97 18.79 36.04 ▆▇▃▁▁
texture_worst 0 1 25.67 6.15 12.02 21.08 25.41 29.68 49.54 ▃▇▆▁▁
perimeter_worst 0 1 107.35 33.57 50.41 84.15 97.66 125.53 251.20 ▇▇▃▁▁
area_worst 0 1 881.66 569.28 185.20 515.68 686.55 1085.00 4254.00 ▇▂▁▁▁
smoothness_worst 0 1 0.13 0.02 0.07 0.12 0.13 0.15 0.22 ▂▇▇▂▁
compactness_worst 0 1 0.25 0.16 0.03 0.15 0.21 0.34 1.06 ▇▅▁▁▁
concavity_worst 0 1 0.27 0.21 0.00 0.12 0.23 0.38 1.25 ▇▅▂▁▁
concave points_worst 0 1 0.11 0.07 0.00 0.06 0.10 0.16 0.29 ▅▇▅▃▁
symmetry_worst 0 1 0.29 0.06 0.16 0.25 0.28 0.32 0.66 ▅▇▁▁▁
fractal_dimension_worst 0 1 0.08 0.02 0.06 0.07 0.08 0.09 0.21 ▇▃▁▁▁
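
The single logical column in the summary above has all 568 values missing, so it carries no information and can be dropped before modeling. One common cause of such a column is a trailing delimiter on each line of the CSV file; that cause is an assumption here, since the source does not show the raw file. A minimal sketch using Python's `csv` module on a hypothetical three-row excerpt:

```python
import csv
import io

# Hypothetical excerpt (made-up IDs and values). The trailing comma on every
# line mimics a file that yields an extra, unnamed, entirely empty column.
raw = """id,diagnosis,radius_mean,
1001,M,17.9,
1002,M,20.5,
1003,B,13.5,
"""

header, *records = list(csv.reader(io.StringIO(raw)))

# Keep only columns that have a name and at least one non-empty value.
keep = [
    i for i, name in enumerate(header)
    if name.strip() and any(rec[i].strip() for rec in records)
]
clean_header = [header[i] for i in keep]
clean_records = [[rec[i] for i in keep] for rec in records]

print(clean_header)      # ['id', 'diagnosis', 'radius_mean']
print(clean_records[0])  # ['1001', 'M', '17.9']
```

The `id` column is a row identifier, not a measurement, so it would normally be excluded from the predictors as well.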

2.2 Fuel Economy (Regression)

Data source: Kaggle automobile fuel economy dataset
  • Variable description
    • mpg: fuel economy in miles per gallon (the prediction target)
    • cylinders: number of engine cylinders, where gasoline is turned into power
    • displacement: engine displacement
    • horsepower: engine power output
    • weight: weight of the car
    • acceleration: acceleration of the car
    • model year: model year of the car
    • origin: origin of the car
    • car name: name of the car
Data summary
Name: Piped data
Number of rows: 398
Number of columns: 9
Column type frequency: character 2, numeric 7
Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
horsepower 0 1 1 3 0 94 0
car name 0 1 6 36 0 305 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 23.51 7.82 9 17.50 23.0 29.00 46.6 ▆▇▆▃▁
cylinders 0 1 5.45 1.70 3 4.00 4.0 8.00 8.0 ▇▁▃▁▃
displacement 0 1 193.43 104.27 68 104.25 148.5 262.00 455.0 ▇▂▂▃▁
weight 0 1 2970.42 846.84 1613 2223.75 2803.5 3608.00 5140.0 ▇▇▅▅▂
acceleration 0 1 15.57 2.76 8 13.83 15.5 17.17 24.8 ▁▆▇▃▁
model year 0 1 76.01 3.70 70 73.00 76.0 79.00 82.0 ▇▆▇▆▇
origin 0 1 1.57 0.80 1 1.00 1.0 2.00 3.0 ▇▁▂▁▂
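
In the summary above, horsepower is listed as a character variable even though it is a physical quantity, which suggests that some entries are not numeric. A minimal sketch of converting such a column follows; the helper `to_float_or_none` and the sample values, including the "?" placeholder, are assumptions for illustration and are not taken from the source.

```python
def to_float_or_none(value):
    """Convert a string to float; return None when it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


# Hypothetical raw values; "?" stands in for whatever non-numeric
# marker the real file uses.
raw_horsepower = ["130", "165", "?", "150"]
horsepower = [to_float_or_none(v) for v in raw_horsepower]

n_missing = sum(v is None for v in horsepower)
print(horsepower)  # [130.0, 165.0, None, 150.0]
print(n_missing)   # 1
```

After conversion, the rows with missing horsepower can be dropped or imputed before the column is used as a predictor.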

3 Predictive Model (Jupyter Notebook)
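
As a rough illustration of a tree-based classifier for the breast cancer problem, the sketch below fits a shallow decision tree. It assumes scikit-learn and its bundled copy of the Wisconsin diagnostic data rather than the CSV file summarized earlier, and the split ratio, `max_depth=3`, and `random_state=0` are illustrative choices, not the author's settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# scikit-learn's bundled copy of the Wisconsin diagnostic data
# (569 samples, 30 features; target 0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the rows, stratified so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A shallow tree; max_depth=3 is an illustrative choice, not a tuned value.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Limiting the depth keeps the tree small enough to read and reduces overfitting; a tuned model would pick the depth by cross-validation.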

4 Practice Datasets

4.1 Recommend three tabular machine learning datasets for classification problems

Sure, here are three tabular machine learning datasets for classification problems:

  • Titanic: The Titanic dataset is a well-known dataset that contains information on the passengers of the Titanic. The task is to predict whether a passenger survived or not based on features such as their age, gender, and passenger class.
  • Credit Card Fraud Detection: This dataset contains credit card transaction data, and the task is to classify transactions as fraudulent or genuine. The features are the time and amount of each transaction plus 28 anonymized numerical components (V1 to V28) obtained by PCA.
  • Adult Census Income: This dataset contains census information on individuals, such as their age, education, and occupation. The task is to predict whether an individual's income exceeds $50,000 per year.

I hope these datasets are helpful! Let me know if you have any other questions.

4.2 Where can I download the Credit Card Fraud Detection dataset and the Adult Census Income dataset?

The Titanic dataset is available on Kaggle at the following link: https://www.kaggle.com/c/titanic

The Credit Card Fraud Detection dataset is available on Kaggle at the following link: https://www.kaggle.com/mlg-ulb/creditcardfraud

The Adult Census Income dataset is available on the UCI Machine Learning Repository at the following link: https://archive.ics.uci.edu/ml/datasets/Adult

These sites are well-known repositories for machine learning datasets and are a good resource for finding datasets for a variety of tasks.

I hope this helps! Let me know if you have any other questions.