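Machine Learning
1 Problem definition
Predict whether a breast mass is cancerous, using information extracted from a digitized image of a fine needle aspirate (FNA) of the mass.
Class distribution: 357 benign, 212 malignant (breast cancer)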
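The class counts above already fix a reference point for any classifier: always predicting the majority class gives a known accuracy. The following is a minimal sketch of that arithmetic; the variable names are illustrative and the counts are the ones stated above.

```python
# Class counts as stated in the problem definition.
class_counts = {"benign": 357, "malignant": 212}

total = sum(class_counts.values())
proportions = {label: n / total for label, n in class_counts.items()}

# A classifier that always predicts the most frequent class sets the
# accuracy floor that a trained model has to beat.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = proportions[majority_label]

print(f"total samples: {total}")
for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.3f}")
```

Because the classes are moderately imbalanced, accuracy alone can look better than it is; comparing against this baseline is one simple guard.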
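2 Datasets
2.1 Breast cancer (classification)
The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; each variable describes a characteristic of the cell nuclei present in the image.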
- Variable description
  - ID number
  - Diagnosis (M = malignant, B = benign)
  - Ten real-valued features are computed for each cell nucleus:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension (“coastline approximation” - 1)
  - The mean, standard error, and "worst" value (mean of the three largest values) of each feature are computed per image, giving 30 features. For example, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
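The field numbers quoted for the radius follow from the column order: ID and diagnosis first, then the ten means, the ten standard errors, and the ten worst values. The sketch below rebuilds that order; it assumes the column spellings used in the summary table that follows (for example `concave points_mean` with a space), and `build_field_names` is a hypothetical helper, not part of any library.

```python
# The ten per-nucleus features, spelled as in the summary table.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each feature is summarized three ways per image.
STATISTICS = ["mean", "se", "worst"]


def build_field_names():
    """Return field names in file order: id, diagnosis, 10 means, 10 SEs, 10 worst values."""
    names = ["id", "diagnosis"]
    for stat in STATISTICS:
        names.extend(f"{feature}_{stat}" for feature in BASE_FEATURES)
    return names


fields = build_field_names()

# Fields are numbered from 1 in the description, so field k is fields[k - 1].
for k in (3, 13, 23):
    print(k, fields[k - 1])
```

The list has 32 named fields, which is consistent with the 33 columns reported below once the empty `…33` column is counted.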
Name | Piped data |
Number of rows | 568 |
Number of columns | 33 |
_______________________ | |
Column type frequency: | |
character | 1 |
logical | 1 |
numeric | 31 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
diagnosis | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
…33 | 568 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 30425139.67 | 125124311.81 | 8670.00 | 869222.50 | 906157.00 | 8825022.25 | 911320502.00 | ▇▁▁▁▁ |
radius_mean | 0 | 1 | 14.14 | 3.52 | 6.98 | 11.71 | 13.38 | 15.80 | 28.11 | ▂▇▃▁▁ |
texture_mean | 0 | 1 | 19.28 | 4.30 | 9.71 | 16.17 | 18.84 | 21.78 | 39.28 | ▃▇▃▁▁ |
perimeter_mean | 0 | 1 | 92.05 | 24.25 | 43.79 | 75.20 | 86.29 | 104.15 | 188.50 | ▃▇▃▁▁ |
area_mean | 0 | 1 | 655.72 | 351.66 | 143.50 | 420.30 | 551.40 | 784.15 | 2501.00 | ▇▃▂▁▁ |
smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.06 | 0.09 | 0.10 | 0.11 | 0.16 | ▂▇▅▁▁ |
compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.07 | 0.09 | 0.13 | 0.35 | ▇▇▂▁▁ |
concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 | ▇▃▂▁▁ |
concave points_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 | ▇▃▂▁▁ |
symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 | ▁▇▅▁▁ |
fractal_dimension_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 | ▆▇▂▁▁ |
radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 | ▇▁▁▁▁ |
texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 | ▇▅▁▁▁ |
perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 | ▇▁▁▁▁ |
area_se | 0 | 1 | 40.37 | 45.52 | 6.80 | 17.85 | 24.57 | 45.24 | 542.20 | ▇▁▁▁▁ |
smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▃▁▁▁ |
compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 | ▇▃▁▁▁ |
concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 | ▇▁▁▁▁ |
concave points_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | ▇▇▁▁▁ |
symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 | ▇▃▁▁▁ |
fractal_dimension_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | ▇▁▁▁▁ |
radius_worst | 0 | 1 | 16.28 | 4.83 | 7.93 | 13.02 | 14.97 | 18.79 | 36.04 | ▆▇▃▁▁ |
texture_worst | 0 | 1 | 25.67 | 6.15 | 12.02 | 21.08 | 25.41 | 29.68 | 49.54 | ▃▇▆▁▁ |
perimeter_worst | 0 | 1 | 107.35 | 33.57 | 50.41 | 84.15 | 97.66 | 125.53 | 251.20 | ▇▇▃▁▁ |
area_worst | 0 | 1 | 881.66 | 569.28 | 185.20 | 515.68 | 686.55 | 1085.00 | 4254.00 | ▇▂▁▁▁ |
smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 | ▂▇▇▂▁ |
compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 | ▇▅▁▁▁ |
concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.12 | 0.23 | 0.38 | 1.25 | ▇▅▂▁▁ |
concave points_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 | ▅▇▅▃▁ |
symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 | ▅▇▁▁▁ |
fractal_dimension_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 | ▇▃▁▁▁ |
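Two columns in this summary are not usable as predictors: `…33` is missing in all 568 rows, and `id` is an identifier rather than a measurement. The sketch below shows one way to drop them before modeling, using only the standard library. The rows are toy values invented for illustration (not records from the dataset), `drop_empty_columns` is a hypothetical helper, and the trailing comma is only an assumed explanation for the empty column.

```python
import csv
import io


def drop_empty_columns(rows):
    """Remove columns whose value is missing (empty string or None) in every row."""
    if not rows:
        return rows
    keep = [
        col for col in rows[0]
        if any(row[col] not in ("", None) for row in rows)
    ]
    return [{col: row[col] for col in keep} for row in rows]


# Toy rows invented for illustration. The trailing comma on each line yields
# an unnamed, always-empty last column (assumed origin of the `…33` column).
sample = io.StringIO(
    "id,diagnosis,radius_mean,\n"
    "1,M,20.0,\n"
    "2,B,12.0,\n"
)
reader = csv.reader(sample)
header = next(reader)
rows = [dict(zip(header, values)) for values in reader]

cleaned = drop_empty_columns(rows)

# Encode the target as 1 = malignant, 0 = benign, and keep only measurements.
labels = [1 if row["diagnosis"] == "M" else 0 for row in cleaned]
features = [
    {col: float(val) for col, val in row.items() if col not in ("id", "diagnosis")}
    for row in cleaned
]

print(list(cleaned[0]))
print(labels)
print(features)
```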
2.2 Fuel economy (regression)
- Variable description
  - mpg: fuel economy in miles per gallon
  - cylinders: number of engine cylinders
  - displacement: engine displacement of the car
  - horsepower: power output of the engine
  - weight: weight of the car
  - acceleration: acceleration of the car
  - model year: model year of the car
  - origin: origin of the car
  - car name: name of the car
Name | Piped data |
Number of rows | 398 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 7 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
horsepower | 0 | 1 | 1 | 3 | 0 | 94 | 0 |
car name | 0 | 1 | 6 | 36 | 0 | 305 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 23.51 | 7.82 | 9 | 17.50 | 23.0 | 29.00 | 46.6 | ▆▇▆▃▁ |
cylinders | 0 | 1 | 5.45 | 1.70 | 3 | 4.00 | 4.0 | 8.00 | 8.0 | ▇▁▃▁▃ |
displacement | 0 | 1 | 193.43 | 104.27 | 68 | 104.25 | 148.5 | 262.00 | 455.0 | ▇▂▂▃▁ |
weight | 0 | 1 | 2970.42 | 846.84 | 1613 | 2223.75 | 2803.5 | 3608.00 | 5140.0 | ▇▇▅▅▂ |
acceleration | 0 | 1 | 15.57 | 2.76 | 8 | 13.83 | 15.5 | 17.17 | 24.8 | ▁▆▇▃▁ |
model year | 0 | 1 | 76.01 | 3.70 | 70 | 73.00 | 76.0 | 79.00 | 82.0 | ▇▆▇▆▇ |
origin | 0 | 1 | 1.57 | 0.80 | 1 | 1.00 | 1.0 | 2.00 | 3.0 | ▇▁▂▁▂ |
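In the summary above, `horsepower` is typed as character with string lengths from 1 to 3, although it is a numeric quantity. That pattern suggests at least one non-numeric entry, such as a placeholder for a missing value. The sketch below converts such a column to numbers under that assumption. The sample strings, the `?` placeholder, and the median fill are illustrative choices, not something the summary confirms.

```python
import statistics


def parse_horsepower(raw):
    """Convert a horsepower string to float; return None when it is not numeric."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


# Illustrative strings only; "?" stands in for an assumed missing-value marker.
raw_values = ["130", "165", "?", "95"]

parsed = [parse_horsepower(value) for value in raw_values]
observed = [value for value in parsed if value is not None]
n_missing = len(parsed) - len(observed)

# One simple option: replace missing entries with the median of observed values.
fill_value = statistics.median(observed)
filled = [fill_value if value is None else value for value in parsed]

print(f"missing entries: {n_missing}")
print(filled)
```

Dropping the affected rows instead of imputing is an equally reasonable choice when only a few values are missing.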
3 Predictive models (Jupyter Notebook)
- Classification model
- Regression model
4 Practice datasets
4.1 Recommend three tabular machine learning datasets for classification problems
Sure, here are three tabular machine learning datasets for classification problems:
- Titanic: The Titanic dataset is a well-known dataset that contains information on the passengers of the Titanic. The task is to predict whether a passenger survived or not based on features such as their age, gender, and passenger class.
- Credit Card Fraud Detection: This dataset contains credit card transaction data, and the task is to classify transactions as fraudulent or genuine. Its features are the time and amount of each transaction plus 28 anonymized numerical features (V1 to V28) obtained by PCA.
- Adult Census Income: This dataset contains information on the income of individuals based on features such as their age, education, and occupation. The task is to predict whether an individual's income exceeds $50,000 per year.
I hope these datasets are helpful! Let me know if you have any other questions.
4.2 Where can I download the Credit Card Fraud Detection dataset and the Adult Census Income dataset?
The Titanic dataset is available on Kaggle at the following link: https://www.kaggle.com/c/titanic
The Credit Card Fraud Detection dataset is available on Kaggle at the following link: https://www.kaggle.com/mlg-ulb/creditcardfraud
The Adult Census Income dataset is available on the UCI Machine Learning Repository at the following link: https://archive.ics.uci.edu/ml/datasets/Adult
These sites are well-known repositories for machine learning datasets and are a good resource for finding datasets for a variety of tasks.
I hope this helps! Let me know if you have any other questions.