Machine Learning
Tree-Based Model
1 Problem Definition
Predict whether a breast mass is malignant, using information extracted from a digitized image of a fine needle aspirate (FNA) of the mass.
Class distribution: 357 benign, 212 malignant
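As a quick sanity check on these class counts, the sketch below computes the class proportions, the accuracy of a majority-class baseline, and the Gini impurity of the unsplit root node that a classification tree would start from. Only the counts 357 and 212 come from the text; the variable names are illustrative.

```python
# Class counts quoted in the problem definition.
counts = {"benign": 357, "malignant": 212}
total = sum(counts.values())

# Share of each class in the full dataset.
proportions = {label: n / total for label, n in counts.items()}

# A model that always predicts the most frequent class reaches this accuracy;
# any useful classifier should beat it.
baseline_accuracy = max(proportions.values())

# Gini impurity of the root node: 1 - sum of squared class proportions.
root_gini = 1.0 - sum(p ** 2 for p in proportions.values())

print(f"total samples: {total}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"root Gini impurity: {root_gini:.4f}")
```

Because the classes are moderately imbalanced, plain accuracy should be read against this baseline rather than against 50%.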
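2 Datasets
2.1 Breast Cancer (Classification)
The dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; each variable describes a characteristic of the cell nuclei present in the image.
- Variable descriptions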
- ID number
- Diagnosis (M = malignant, B = benign)
- Ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” - 1)
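- For each image, the mean, standard error, and "worst" (mean of the three largest values) of these features were computed, giving 30 features; for instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.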
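The field numbering in the variable list can be reproduced by laying out the columns programmatically. The sketch below assumes the column order is ID, diagnosis, then the ten features repeated for three statistics (mean, se, worst), and uses the column spellings that appear in the data summary; the helper name `build_columns` is hypothetical.

```python
# Ten per-nucleus features, spelled as in the summary table.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Statistics computed for each feature (assumed order).
STATS = ["mean", "se", "worst"]


def build_columns():
    """Return column names: id, diagnosis, then one block of features per statistic."""
    columns = ["id", "diagnosis"]
    for stat in STATS:
        columns.extend(f"{feature}_{stat}" for feature in FEATURES)
    return columns


columns = build_columns()

# Fields are numbered from 1 in the description, so subtract 1 to index.
for field in (3, 13, 23):
    print(field, columns[field - 1])
```

Under these assumptions the layout yields 32 named columns: 2 identifier/target columns plus 30 features.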
Name | Piped data |
Number of rows | 568 |
Number of columns | 33 |
_______________________ | |
Column type frequency: | |
character | 1 |
logical | 1 |
numeric | 31 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
diagnosis | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
…33 | 568 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 30425139.67 | 125124311.81 | 8670.00 | 869222.50 | 906157.00 | 8825022.25 | 911320502.00 | ▇▁▁▁▁ |
radius_mean | 0 | 1 | 14.14 | 3.52 | 6.98 | 11.71 | 13.38 | 15.80 | 28.11 | ▂▇▃▁▁ |
texture_mean | 0 | 1 | 19.28 | 4.30 | 9.71 | 16.17 | 18.84 | 21.78 | 39.28 | ▃▇▃▁▁ |
perimeter_mean | 0 | 1 | 92.05 | 24.25 | 43.79 | 75.20 | 86.29 | 104.15 | 188.50 | ▃▇▃▁▁ |
area_mean | 0 | 1 | 655.72 | 351.66 | 143.50 | 420.30 | 551.40 | 784.15 | 2501.00 | ▇▃▂▁▁ |
smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.06 | 0.09 | 0.10 | 0.11 | 0.16 | ▂▇▅▁▁ |
compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.07 | 0.09 | 0.13 | 0.35 | ▇▇▂▁▁ |
concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 | ▇▃▂▁▁ |
concave points_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 | ▇▃▂▁▁ |
symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 | ▁▇▅▁▁ |
fractal_dimension_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 | ▆▇▂▁▁ |
radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 | ▇▁▁▁▁ |
texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 | ▇▅▁▁▁ |
perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 | ▇▁▁▁▁ |
area_se | 0 | 1 | 40.37 | 45.52 | 6.80 | 17.85 | 24.57 | 45.24 | 542.20 | ▇▁▁▁▁ |
smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▃▁▁▁ |
compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 | ▇▃▁▁▁ |
concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 | ▇▁▁▁▁ |
concave points_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | ▇▇▁▁▁ |
symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 | ▇▃▁▁▁ |
fractal_dimension_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | ▇▁▁▁▁ |
radius_worst | 0 | 1 | 16.28 | 4.83 | 7.93 | 13.02 | 14.97 | 18.79 | 36.04 | ▆▇▃▁▁ |
texture_worst | 0 | 1 | 25.67 | 6.15 | 12.02 | 21.08 | 25.41 | 29.68 | 49.54 | ▃▇▆▁▁ |
perimeter_worst | 0 | 1 | 107.35 | 33.57 | 50.41 | 84.15 | 97.66 | 125.53 | 251.20 | ▇▇▃▁▁ |
area_worst | 0 | 1 | 881.66 | 569.28 | 185.20 | 515.68 | 686.55 | 1085.00 | 4254.00 | ▇▂▁▁▁ |
smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 | ▂▇▇▂▁ |
compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 | ▇▅▁▁▁ |
concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.12 | 0.23 | 0.38 | 1.25 | ▇▅▂▁▁ |
concave points_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 | ▅▇▅▃▁ |
symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 | ▅▇▁▁▁ |
fractal_dimension_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 | ▇▃▁▁▁ |
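The summary shows two columns that carry no predictive information: `id` is an identifier, and the unnamed 33rd column is entirely missing. A minimal cleanup sketch using only the standard library follows. The inline CSV is a made-up miniature stand-in for the real file, the trailing comma as the cause of the empty column is an assumption, and `load_rows` and `drop_empty_columns` are hypothetical helper names.

```python
import csv
import io

# Made-up miniature stand-in for the real file: the trailing comma on each
# line produces an extra, always-empty column (assumed cause of column 33).
SAMPLE_CSV = """id,diagnosis,radius_mean,
1001,M,17.99,
1002,B,11.42,
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_empty_columns(rows, also_drop=("id",)):
    """Remove columns whose values are all blank, plus any named in also_drop."""
    keep = [
        name for name in rows[0]
        if name not in also_drop
        and any((row[name] or "").strip() for row in rows)
    ]
    return [{name: row[name] for name in keep} for row in rows]


rows = drop_empty_columns(load_rows(SAMPLE_CSV))

# Encode the target as 1 for malignant (M) and 0 for benign (B).
for row in rows:
    row["diagnosis"] = 1 if row["diagnosis"] == "M" else 0

print(rows)
```

Dropping `id` before modeling matters for trees in particular, since an identifier can otherwise be picked as a split variable and memorize the training rows.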
2.2 Fuel Economy (Prediction)
Data source: Kaggle automobile fuel economy dataset
- Variable descriptions
- mpg: fuel economy in miles per gallon
- cylinders: number of cylinders, the engine units where gasoline is turned into power
- displacement: engine displacement of the car
- horsepower: power output of the engine
- weight: the weight of the car
- acceleration: the acceleration of the car
- model year: the model year of the car
- origin: the origin of the car
- car name: the name of the car
Name | Piped data |
Number of rows | 398 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
horsepower | 0 | 1 | 1 | 3 | 0 | 94 | 0 |
car name | 0 | 1 | 6 | 36 | 0 | 305 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 23.51 | 7.82 | 9 | 17.50 | 23.0 | 29.00 | 46.6 | ▆▇▆▃▁ |
cylinders | 0 | 1 | 5.45 | 1.70 | 3 | 4.00 | 4.0 | 8.00 | 8.0 | ▇▁▃▁▃ |
displacement | 0 | 1 | 193.43 | 104.27 | 68 | 104.25 | 148.5 | 262.00 | 455.0 | ▇▂▂▃▁ |
weight | 0 | 1 | 2970.42 | 846.84 | 1613 | 2223.75 | 2803.5 | 3608.00 | 5140.0 | ▇▇▅▅▂ |
acceleration | 0 | 1 | 15.57 | 2.76 | 8 | 13.83 | 15.5 | 17.17 | 24.8 | ▁▆▇▃▁ |
model year | 0 | 1 | 76.01 | 3.70 | 70 | 73.00 | 76.0 | 79.00 | 82.0 | ▇▆▇▆▇ |
origin | 0 | 1 | 1.57 | 0.80 | 1 | 1.00 | 1.0 | 2.00 | 3.0 | ▇▁▂▁▂ |
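In the summary above, `horsepower` is reported as a character column rather than a numeric one, which usually means some entries are not valid numbers. The sketch below shows one way to coerce such a column, treating anything unparseable as missing and then filling with the median. The sample values and the `to_float_or_none` helper are hypothetical, and the actual placeholder used in the file is not stated in the source.

```python
def to_float_or_none(value):
    """Convert a string to float, returning None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical raw entries, including a non-numeric placeholder.
raw_horsepower = ["130", "165", "?", "95"]

horsepower = [to_float_or_none(v) for v in raw_horsepower]
n_missing = sum(v is None for v in horsepower)

# Fill missing entries with the median of the observed values (one simple option).
observed = sorted(v for v in horsepower if v is not None)
mid = len(observed) // 2
if len(observed) % 2:
    median = observed[mid]
else:
    median = (observed[mid - 1] + observed[mid]) / 2
imputed = [median if v is None else v for v in horsepower]

print(horsepower, n_missing, imputed)
```

Median imputation is only one choice; dropping the affected rows is an equally reasonable option when few values are missing.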
3 Predictive Models (Jupyter Notebook)
- Classification model
- Regression model
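Since the topic here is tree-based models, it may help to see the split search that sits at the core of such models. The following is a minimal sketch, not the notebooks' code: it finds the best single threshold on one feature, scoring splits by Gini impurity for classification and by within-node squared error for regression. The toy values and all function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def sse(values):
    """Sum of squared deviations from the mean (regression node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, impurity, weighted=True):
    """Return (threshold, score) of the best split x <= threshold on one feature.

    With weighted=True each child's impurity is weighted by its share of the
    samples (suits Gini); with weighted=False child impurities are summed
    (suits SSE, which already grows with node size).
    """
    best = (None, float("inf"))
    candidates = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if weighted:
            n = len(ys)
            score = len(left) / n * impurity(left) + len(right) / n * impurity(right)
        else:
            score = impurity(left) + impurity(right)
        if score < best[1]:
            best = (t, score)
    return best


# Toy data (hypothetical): one feature, a class label, and a numeric target.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_class = ["B", "B", "B", "M", "M", "M"]
y_reg = [30.0, 31.0, 29.0, 15.0, 14.0, 16.0]

clf_threshold, clf_score = best_split(x, y_class, gini)
reg_threshold, reg_score = best_split(x, y_reg, sse, weighted=False)
print(clf_threshold, clf_score)
print(reg_threshold, reg_score)
```

A full tree applies this search recursively over every feature and stops by depth or node-size limits; library implementations add pruning and faster split evaluation on top of the same idea.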