Continue with Machine Learning - Try with Multiple Algorithms
In this post, we will look at a way to test several algorithms and then choose the best one.
The data is from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/data
Our machine learning task here is to predict whether a case's diagnosis is benign or malignant (B, M). Let's look at the data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import time

df = pd.read_csv('data.csv')
df.head()

#Output
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean
0    842302         M        17.99         10.38          122.80     1001.0
1    842517         M        20.57         17.77          132.90     1326.0
2  84300903         M        19.69         21.25          130.00     1203.0
3  84348301         M        11.42         20.38           77.58      386.1
4  84358402         M        20.29         14.34          135.10     1297.0

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean
0          0.11840           0.27760          0.3001              0.14710
1          0.08474           0.07864          0.0869              0.07017
2          0.10960           0.15990          0.1974              0.12790
3          0.14250           0.28390          0.2414              0.10520
4          0.10030           0.13280          0.1980              0.10430

   ...  texture_worst  perimeter_worst  area_worst  smoothness_worst
0  ...          17.33           184.60      2019.0            0.1622
1  ...          23.41           158.80      1956.0            0.1238
2  ...          25.53           152.50      1709.0            0.1444
3  ...          26.50            98.87       567.7            0.2098
4  ...          16.67           152.20      1575.0            0.1374

   compactness_worst  concavity_worst  concave points_worst  symmetry_worst
0             0.6656           0.7119                0.2654          0.4601
1             0.1866           0.2416                0.1860          0.2750
2             0.4245           0.4504                0.2430          0.3613
3             0.8663           0.6869                0.2575          0.6638
4             0.2050           0.4000                0.1625          0.2364

   fractal_dimension_worst  Unnamed: 32
0                  0.11890          NaN
1                  0.08902          NaN
2                  0.08758          NaN
3                  0.17300          NaN
4                  0.07678          NaN

[5 rows x 33 columns]
Data Description
Ten real-valued features are computed for each cell nucleus, and each is measured in three ways: mean, standard error, and worst. The ten features are:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The class label is the diagnosis: malignant or benign (M, B).
Let's separate the data into features and the class label (what we want to predict):
y = df['diagnosis']
x = df.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1)
x.head()

#Output
   radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   compactness_mean  concavity_mean  concave points_mean  symmetry_mean
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   fractal_dimension_mean  ...  radius_worst
0                 0.07871  ...         25.38
1                 0.05667  ...         24.99
2                 0.05999  ...         23.57
3                 0.09744  ...         14.91
4                 0.05883  ...         22.54

   texture_worst  perimeter_worst  area_worst  smoothness_worst
0          17.33           184.60      2019.0            0.1622
1          23.41           158.80      1956.0            0.1238
2          25.53           152.50      1709.0            0.1444
3          26.50            98.87       567.7            0.2098
4          16.67           152.20      1575.0            0.1374

   compactness_worst  concavity_worst  concave points_worst  symmetry_worst
0             0.6656           0.7119                0.2654          0.4601
1             0.1866           0.2416                0.1860          0.2750
2             0.4245           0.4504                0.2430          0.3613
3             0.8663           0.6869                0.2575          0.6638
4             0.2050           0.4000                0.1625          0.2364

   fractal_dimension_worst
0                  0.11890
1                  0.08902
2                  0.08758
3                  0.17300
4                  0.07678

[5 rows x 30 columns]
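Why drop 'Unnamed: 32'? The df.head() output above shows it is all NaN (a common artifact of a trailing comma in the CSV header). Rather than hard-coding such columns, they can be detected programmatically; here is a small sketch on a toy frame (the column names mimic this dataset, but the frame itself is synthetic):

```python
import numpy as np
import pandas as pd

# toy frame with an all-NaN trailing column, like 'Unnamed: 32'
demo = pd.DataFrame({
    'id': [1, 2],
    'diagnosis': ['M', 'B'],
    'radius_mean': [17.99, 20.57],
    'Unnamed: 32': [np.nan, np.nan],
})

# columns that are entirely NaN carry no information and can be dropped
all_nan = demo.columns[demo.isnull().all()].tolist()
print(all_nan)  # ['Unnamed: 32']

features = demo.drop(['id', 'diagnosis'] + all_nan, axis=1)
print(features.columns.tolist())  # ['radius_mean']
```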
Let's check the data distribution with a density plot for each feature:
x.plot(kind='density', subplots=True, layout=(6, 6), sharex=False, legend=False, fontsize=1)
plt.show()
All of the features roughly follow a Gaussian distribution. Let's check the number of benign and malignant cases:
ax = sns.countplot(y, label="Count")
b, m = y.value_counts()
print("Number of Benign: ", b)
print("Number of Malignant: ", m)
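The unpacking `b, m = y.value_counts()` works because value_counts() sorts counts in descending order, and benign cases outnumber malignant ones in this dataset. A small sketch with synthetic labels (the 357/212 split mirrors the real data):

```python
import pandas as pd

# synthetic labels mirroring the dataset's class balance (357 B vs 212 M)
y_demo = pd.Series(['B'] * 357 + ['M'] * 212)

# value_counts() sorts by frequency (descending), so with more benign
# cases the first value unpacked is the benign count
counts = y_demo.value_counts()
b, m = counts
print(b, m)                   # 357 212
print(counts.index.tolist())  # ['B', 'M']
```

If the classes were ever reordered, indexing by label (`counts['B']`, `counts['M']`) would be the safer choice.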
Let's check data statistics:
x.describe()

#Output
       radius_mean  texture_mean  perimeter_mean    area_mean
count   569.000000    569.000000      569.000000   569.000000
mean     14.127292     19.289649       91.969033   654.889104
std       3.524049      4.301036       24.298981   351.914129
min       6.981000      9.710000       43.790000   143.500000
25%      11.700000     16.170000       75.170000   420.300000
50%      13.370000     18.840000       86.240000   551.100000
75%      15.780000     21.800000      104.100000   782.700000
max      28.110000     39.280000      188.500000  2501.000000

       smoothness_mean  compactness_mean  concavity_mean  concave points_mean
count       569.000000        569.000000      569.000000           569.000000
mean          0.096360          0.104341        0.088799             0.048919
std           0.014064          0.052813        0.079720             0.038803
min           0.052630          0.019380        0.000000             0.000000
25%           0.086370          0.064920        0.029560             0.020310
50%           0.095870          0.092630        0.061540             0.033500
75%           0.105300          0.130400        0.130700             0.074000
max           0.163400          0.345400        0.426800             0.201200

       symmetry_mean  fractal_dimension_mean  ...
count     569.000000              569.000000  ...
mean        0.181162                0.062798  ...
std         0.027414                0.007060  ...
min         0.106000                0.049960  ...
25%         0.161900                0.057700  ...
50%         0.179200                0.061540  ...
75%         0.195700                0.066120  ...
max         0.304000                0.097440  ...

       radius_worst  texture_worst  perimeter_worst   area_worst
count    569.000000     569.000000       569.000000   569.000000
mean      16.269190      25.677223       107.261213   880.583128
std        4.833242       6.146258        33.602542   569.356993
min        7.930000      12.020000        50.410000   185.200000
25%       13.010000      21.080000        84.110000   515.300000
50%       14.970000      25.410000        97.660000   686.500000
75%       18.790000      29.720000       125.400000  1084.000000
max       36.040000      49.540000       251.200000  4254.000000

       smoothness_worst  compactness_worst  concavity_worst
count        569.000000         569.000000       569.000000
mean           0.132369           0.254265         0.272188
std            0.022832           0.157336         0.208624
min            0.071170           0.027290         0.000000
25%            0.116600           0.147200         0.114500
50%            0.131300           0.211900         0.226700
75%            0.146000           0.339100         0.382900
max            0.222600           1.058000         1.252000

       concave points_worst  symmetry_worst  fractal_dimension_worst
count            569.000000      569.000000               569.000000
mean               0.114606        0.290076                 0.083946
std                0.065732        0.061867                 0.018061
min                0.000000        0.156500                 0.055040
25%                0.064930        0.250400                 0.071460
50%                0.099930        0.282200                 0.080040
75%                0.161400        0.317900                 0.092080
max                0.291000        0.663800                 0.207500

[8 rows x 30 columns]
Let's check the feature correlations:
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=0.5, fmt='.1f', ax=ax)
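The heatmap makes strongly correlated groups easy to spot (radius, perimeter, and area track each other closely, for instance). Such pairs can also be extracted programmatically from the correlation matrix; a sketch on a small synthetic frame whose column names only loosely echo this dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
radius = rng.normal(14, 3, 200)
demo = pd.DataFrame({
    'radius': radius,
    # perimeter is almost a linear function of radius, plus small noise
    'perimeter': 2 * np.pi * radius + rng.normal(0, 1, 200),
    # texture is drawn independently
    'texture': rng.normal(19, 4, 200),
})

corr = demo.corr()
# collect column pairs whose absolute correlation exceeds a threshold;
# these are candidates for dropping one of the two
high = [(a, b) for a in corr for b in corr
        if a < b and abs(corr.loc[a, b]) > 0.9]
print(high)  # [('perimeter', 'radius')]
```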
Training Models
Several algorithms work well for binary classification. We will test five of them and see which performs best: Classification and Regression Trees (CART), Support Vector Machines (SVM), Gaussian Naive Bayes (NB), k-Nearest Neighbors (KNN), and Random Forest (RF).
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# hold out a test set; cross-validation runs on the training portion only
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)

models = []
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC()))
models.append(('NB', GaussianNB()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('RF', RandomForestClassifier()))

num_folds = 10
results = []
names = []
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=123)

for name, model in models:
    start = time.time()
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
    end = time.time()
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f) (run time: %f)" % (name, cv_results.mean(), cv_results.std(), end - start))

#Output
CART: 0.919551 (0.024681) (run time: 0.069104)
SVM: 0.625769 (0.074918) (run time: 0.569782)
NB: 0.921987 (0.034719) (run time: 0.039054)
KNN: 0.901859 (0.044437) (run time: 0.046674)
RF: 0.934679 (0.032022) (run time: 0.326259)
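One caveat with plain KFold on an imbalanced dataset like this one: a fold may end up with a skewed benign/malignant ratio. StratifiedKFold preserves the class proportions in every fold and is usually the safer default for classification. A sketch on synthetic data (make_classification here stands in for the real features, with a roughly similar class balance):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the real features/labels, ~63/37 class split
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.63, 0.37], random_state=123)

# StratifiedKFold keeps the class ratio in every fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
scores = cross_val_score(DecisionTreeClassifier(random_state=123),
                         X_demo, y_demo, cv=skf, scoring='accuracy')
print("%.3f (%.3f)" % (scores.mean(), scores.std()))
```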
Let's plot the performance:
fig = plt.figure()
fig.suptitle('Performance Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
We find that the SVM performance is poor. This may be because the data has not been scaled yet. Let's scale the features before training and check the performance again.
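The intuition: an RBF-kernel SVM works on distances between samples, and in this dataset area_mean (mean ~655) dwarfs features like smoothness_mean (mean ~0.096), so unscaled distances are dominated by a few large features. Standardization rescales each feature to zero mean and unit variance; its effect can be sketched on a toy array (the values mimic the first rows of area_mean and smoothness_mean):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two features on wildly different scales, like area_mean vs smoothness_mean
X_demo = np.array([[1001.0, 0.1184],
                   [1326.0, 0.0847],
                   [ 386.1, 0.1425],
                   [1297.0, 0.1003]])

# after standardization, every column has mean 0 and std 1,
# so both features contribute comparably to distances
X_scaled = StandardScaler().fit_transform(X_demo)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```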
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import warnings

pipelines = []
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()), ('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()), ('SVM', SVC())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()), ('NB', GaussianNB())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsClassifier())])))
pipelines.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()), ('RF', RandomForestClassifier())])))

results = []
names = []
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=123)
    for name, model in pipelines:
        start = time.time()
        cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='accuracy')
        end = time.time()
        results.append(cv_results)
        names.append(name)
        print("%s: %f (%f) (%f)" % (name, cv_results.mean(), cv_results.std(), end - start))

#Output
ScaledCART: 0.937179 (0.025657) (0.154313)
ScaledSVM: 0.969744 (0.027240) (0.134548)
ScaledNB: 0.937051 (0.039612) (0.058743)
ScaledKNN: 0.952115 (0.043058) (0.091208)
ScaledRF: 0.949744 (0.031627) (0.405433)
There is a lot of improvement, and SVM is now the best. Here is the crux of this post: we will use GridSearchCV from sklearn.model_selection to search over the important parameters and find the best combination.
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler().fit(x_train)
scaledX = scaler.transform(x_train)

c_values = [round(0.1 * (i + 1), 1) for i in range(20)]  # 0.1, 0.2, ..., 2.0
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
params_grid = dict(C=c_values, kernel=kernel_values)

kfold = KFold(n_splits=num_folds, shuffle=True, random_state=121)
grid = GridSearchCV(estimator=SVC(), param_grid=params_grid, scoring='accuracy', cv=kfold)
grid_result = grid.fit(scaledX, y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, std, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, std, param))

#Output
Best: 0.972362 using {'C': 0.1, 'kernel': 'linear'}
0.972362 (0.026491) with: {'C': 0.1, 'kernel': 'linear'}
0.841709 (0.053980) with: {'C': 0.1, 'kernel': 'poly'}
0.932161 (0.039436) with: {'C': 0.1, 'kernel': 'rbf'}
0.939698 (0.020594) with: {'C': 0.1, 'kernel': 'sigmoid'}
0.964824 (0.036358) with: {'C': 0.2, 'kernel': 'linear'}
0.861809 (0.040516) with: {'C': 0.2, 'kernel': 'poly'}
0.947236 (0.030812) with: {'C': 0.2, 'kernel': 'rbf'}
0.944724 (0.022233) with: {'C': 0.2, 'kernel': 'sigmoid'}
0.962312 (0.034665) with: {'C': 0.3, 'kernel': 'linear'}
0.866834 (0.043296) with: {'C': 0.3
...
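The grid-search score above is still a cross-validation estimate on the training data; the tuned model should finally be evaluated once on the held-out test set. A sketch on synthetic data, with the scaler folded into a Pipeline so no test-set statistics leak into training (the `svm__` prefixes address pipeline steps; all names here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the real features/labels
X, y = make_classification(n_samples=500, n_features=10, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# scaler inside the pipeline: it is re-fit on each CV training fold
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
grid = GridSearchCV(pipe,
                    {'svm__C': [0.1, 1.0], 'svm__kernel': ['linear', 'rbf']},
                    scoring='accuracy', cv=5)
grid.fit(X_tr, y_tr)

# the final, honest number: accuracy of the best model on unseen data
test_acc = grid.best_estimator_.score(X_te, y_te)
print(grid.best_params_, "test accuracy: %.3f" % test_acc)
```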