Machine Learning

Tree-Based Models

Author: 이광춘

Affiliation: TCS

Published: January 16, 2023

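1 Problem Definition

Kaggle: Breast Cancer Wisconsin (Diagnostic) Data Set

The goal is to predict whether a breast mass is cancerous, using information extracted from a digitized image of a fine needle aspirate (FNA) of the mass.

Class distribution: 357 benign, 212 malignant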
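As a quick sanity check on the class distribution above, the following minimal sketch computes two reference numbers: the accuracy of a trivial model that always predicts the majority class, and the Gini impurity of the unsplit data, which is the quantity a classification tree typically tries to reduce with each split. The variable names are illustrative and are not taken from the original notebook.

```python
# Class counts as stated in the problem definition.
class_counts = {"benign": 357, "malignant": 212}

total = sum(class_counts.values())

# Accuracy of a trivial model that always predicts the majority class.
majority_accuracy = max(class_counts.values()) / total

# Gini impurity of the root node: 1 - sum of squared class proportions.
root_gini = 1.0 - sum((n / total) ** 2 for n in class_counts.values())

print(f"samples: {total}")
print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"root Gini impurity: {root_gini:.3f}")
```

Any tree-based model is only useful on this task if it clearly beats the majority-class accuracy of roughly 63%.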

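2 Datasets

2.1 Breast Cancer (Classification)

The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, and each variable describes a characteristic of the cell nuclei present in the image.

Variable descriptions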
- 1. ID number
- 2. Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
- a. radius (mean of distances from center to points on the perimeter)
- b. texture (standard deviation of gray-scale values)
- c. perimeter
- d. area
- e. smoothness (local variation in radius lengths)
- f. compactness (perimeter^2 / area - 1.0)
- g. concavity (severity of concave portions of the contour)
- h. concave points (number of concave portions of the contour)
- i. symmetry
- j. fractal dimension (“coastline approximation” - 1)
The mean, standard error, and "worst" (mean of the three largest values) of each feature are computed per image, giving 30 features in fields 3 to 32. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
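The field numbering can be reproduced by combining the ten base features with three statistics (mean, standard error, worst), in that order. The sketch below is one way to generate the column names; the helper `field_names` is hypothetical, and the name spellings follow the data summary in this section (for example `concave points_mean` with a space).

```python
# The ten base features, in the order listed above.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# The three statistics computed for each base feature.
STATISTICS = ["mean", "se", "worst"]


def field_names():
    """Return a dict mapping 1-based field number to column name."""
    names = {1: "id", 2: "diagnosis"}
    field = 3
    for stat in STATISTICS:
        for feature in BASE_FEATURES:
            names[field] = f"{feature}_{stat}"
            field += 1
    return names


fields = field_names()
print(fields[3], fields[13], fields[23])  # radius_mean radius_se radius_worst
```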

Data summary
Name	Piped data
Number of rows	568
Number of columns	33
_______________________
Column type frequency:
character	1
logical	1
numeric	31
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
diagnosis	0	1	1	1	0	2	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
…33	568	0	NaN	:

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	1	30425139.67	125124311.81	8670.00	869222.50	906157.00	8825022.25	911320502.00	▇▁▁▁▁
radius_mean	1	14.14	3.52	6.98	11.71	13.38	15.80	28.11	▂▇▃▁▁
texture_mean	1	19.28	4.30	9.71	16.17	18.84	21.78	39.28	▃▇▃▁▁
perimeter_mean	1	92.05	24.25	43.79	75.20	86.29	104.15	188.50	▃▇▃▁▁
area_mean	1	655.72	351.66	143.50	420.30	551.40	784.15	2501.00	▇▃▂▁▁
smoothness_mean	1	0.10	0.01	0.06	0.09	0.10	0.11	0.16	▂▇▅▁▁
compactness_mean	1	0.10	0.05	0.02	0.07	0.09	0.13	0.35	▇▇▂▁▁
concavity_mean	1	0.09	0.08	0.00	0.03	0.06	0.13	0.43	▇▃▂▁▁
concave points_mean	1	0.05	0.04	0.00	0.02	0.03	0.07	0.20	▇▃▂▁▁
symmetry_mean	1	0.18	0.03	0.11	0.16	0.18	0.20	0.30	▁▇▅▁▁
fractal_dimension_mean	1	0.06	0.01	0.05	0.06	0.06	0.07	0.10	▆▇▂▁▁
radius_se	1	0.41	0.28	0.11	0.23	0.32	0.48	2.87	▇▁▁▁▁
texture_se	1	1.22	0.55	0.36	0.83	1.11	1.47	4.88	▇▅▁▁▁
perimeter_se	1	2.87	2.02	0.76	1.61	2.29	3.36	21.98	▇▁▁▁▁
area_se	1	40.37	45.52	6.80	17.85	24.57	45.24	542.20	▇▁▁▁▁
smoothness_se	1	0.01	0.00	0.00	0.01	0.01	0.01	0.03	▇▃▁▁▁
compactness_se	1	0.03	0.02	0.00	0.01	0.02	0.03	0.14	▇▃▁▁▁
concavity_se	1	0.03	0.03	0.00	0.02	0.03	0.04	0.40	▇▁▁▁▁
concave points_se	1	0.01	0.01	0.00	0.01	0.01	0.01	0.05	▇▇▁▁▁
symmetry_se	1	0.02	0.01	0.01	0.02	0.02	0.02	0.08	▇▃▁▁▁
fractal_dimension_se	1	0.00	0.00	0.00	0.00	0.00	0.00	0.03	▇▁▁▁▁
radius_worst	1	16.28	4.83	7.93	13.02	14.97	18.79	36.04	▆▇▃▁▁
texture_worst	1	25.67	6.15	12.02	21.08	25.41	29.68	49.54	▃▇▆▁▁
perimeter_worst	1	107.35	33.57	50.41	84.15	97.66	125.53	251.20	▇▇▃▁▁
area_worst	1	881.66	569.28	185.20	515.68	686.55	1085.00	4254.00	▇▂▁▁▁
smoothness_worst	1	0.13	0.02	0.07	0.12	0.13	0.15	0.22	▂▇▇▂▁
compactness_worst	1	0.25	0.16	0.03	0.15	0.21	0.34	1.06	▇▅▁▁▁
concavity_worst	1	0.27	0.21	0.00	0.12	0.23	0.38	1.25	▇▅▂▁▁
concave points_worst	1	0.11	0.07	0.00	0.06	0.10	0.16	0.29	▅▇▅▃▁
symmetry_worst	1	0.29	0.06	0.16	0.25	0.28	0.32	0.66	▅▇▁▁▁
fractal_dimension_worst	1	0.08	0.02	0.06	0.07	0.08	0.09	0.21	▇▃▁▁▁
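
The summary above shows one logical column (…33) that is missing in every row, and an `id` column that carries no diagnostic information. A minimal sketch of removing both before modeling is given below, using only the standard library. The inline CSV text and its values are hypothetical stand-ins for the real file, and the trailing comma on each line is an assumed cause of the empty column, not something the summary confirms.

```python
import csv
import io

# Hypothetical three-row excerpt; the trailing commas mimic an empty last column.
raw_csv = """id,diagnosis,radius_mean,texture_mean,
1,M,18.0,10.4,
2,B,13.5,14.4,
3,B,12.4,15.7,
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Columns whose value is empty in every row.
empty_columns = {
    name for name in rows[0]
    if all(row[name] == "" for row in rows)
}

# Drop the empty columns and the identifier column.
drop = empty_columns | {"id"}
cleaned = [
    {name: value for name, value in row.items() if name not in drop}
    for row in rows
]

print(list(cleaned[0].keys()))
```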

2.2 Fuel Economy (Regression)

Data source: Kaggle automobile fuel economy (mpg) dataset

Variable descriptions
- mpg: fuel economy in miles per gallon
- cylinders: number of engine cylinders, the units in which gasoline is converted into power
- displacement: engine displacement of the car
- horsepower: power output of the engine
- weight: weight of the car
- acceleration: acceleration of the car
- model year: model year of the car
- origin: origin of the car
- car name: name of the car

Data summary
Name	Piped data
Number of rows	398
Number of columns	9
_______________________
Column type frequency:
character	2
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
horsepower	0	1	1	3	0	94	0
car name	0	1	6	36	0	305	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mpg	1	23.51	7.82	9	17.50	23.0	29.00	46.6	▆▇▆▃▁
cylinders	1	5.45	1.70	3	4.00	4.0	8.00	8.0	▇▁▃▁▃
displacement	1	193.43	104.27	68	104.25	148.5	262.00	455.0	▇▂▂▃▁
weight	1	2970.42	846.84	1613	2223.75	2803.5	3608.00	5140.0	▇▇▅▅▂
acceleration	1	15.57	2.76	8	13.83	15.5	17.17	24.8	▁▆▇▃▁
model year	1	76.01	3.70	70	73.00	76.0	79.00	82.0	▇▆▇▆▇
origin	1	1.57	0.80	1	1.00	1.0	2.00	3.0	▇▁▂▁▂
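
Note that `horsepower` is reported as a character column even though it is a numeric quantity, which suggests that some entries are not valid numbers. A minimal sketch of converting it follows, assuming the non-numeric entries are placeholder strings such as "?" (an assumption, not confirmed by the summary above). The sample values are made up, and median imputation is only one of several reasonable choices.

```python
from statistics import median


def parse_horsepower(raw):
    """Convert a raw string to float, or None if it is not numeric."""
    try:
        return float(raw)
    except ValueError:
        return None


# Hypothetical raw values; "?" stands in for an assumed missing-value marker.
raw_values = ["130", "165", "?", "95", "150"]

parsed = [parse_horsepower(value) for value in raw_values]

# Fill missing entries with the median of the known values.
known = [value for value in parsed if value is not None]
fill_value = median(known)
imputed = [fill_value if value is None else value for value in parsed]

print(imputed)
```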

3 Predictive Models (Jupyter Notebook)
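
The following is not the notebook's code. It is a minimal sketch of the core step behind a tree-based classifier: searching one feature for the threshold that minimizes the weighted Gini impurity of the two resulting groups. CART-style trees apply this search recursively over all features. The functions `gini` and `best_split` and the eight sample values are hypothetical and chosen only for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_impurity = None, gini(labels)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        # Candidate threshold: midpoint between neighboring sorted values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical radius_mean values with diagnoses (B = benign, M = malignant).
radius = [11.2, 12.5, 13.1, 13.8, 14.9, 16.4, 17.9, 20.3]
diagnosis = ["B", "B", "B", "B", "M", "B", "M", "M"]

threshold, impurity = best_split(radius, diagnosis)
print(f"split at radius_mean <= {threshold:.2f}, weighted Gini = {impurity:.4f}")
```

On this toy data the best split separates the four smallest values, all benign, from the rest, which lowers the impurity from about 0.47 for the unsplit data to about 0.19.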

4 Practice Datasets

Datasets

4.2 Where can I download the Credit Card Fraud Detection dataset and the Adult Census Income dataset?

Note

The Titanic dataset is also available on Kaggle at the following link: https://www.kaggle.com/c/titanic

The Credit Card Fraud Detection dataset is available on Kaggle at the following link: https://www.kaggle.com/mlg-ulb/creditcardfraud

The Adult Census Income dataset is available on the UCI Machine Learning Repository at the following link: https://archive.ics.uci.edu/ml/datasets/Adult

These sites are well-known repositories for machine learning datasets and are a good resource for finding datasets for a variety of tasks.
