Machine Learning with Random Forest Algorithm
What is Random Forest?
Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
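To make the "mode of the classes / mean prediction" idea concrete, here is a minimal sketch of the aggregation step (my own illustration, not scikit-learn's internals, and the toy data is made up): each tree is trained on a bootstrap sample, and for regression the forest's prediction is simply the average of the trees' predictions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

#Toy regression data (hypothetical; any numeric X, y would do)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(10):
    #Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_features='sqrt', random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

#For regression the forest averages the individual tree predictions;
#for classification it would take the most common class (the mode) instead
forest_prediction = np.mean([t.predict(X) for t in trees], axis=0)
print(forest_prediction[:5])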
Advantages of using Random Forest
Here are some advantages of using the random forest algorithm:
- Applies to both classification and regression tasks
- Handles missing values and maintains accuracy for data with missing entries
- Is far less prone to overfitting than a single decision tree
- Handles large datasets with high dimensionality
Disadvantages of using Random Forest
- Good at classification, but not as strong for regression
- You have very little control over what the model does
Two major beliefs
- Most of the trees can provide a correct class prediction for most of the data
- The trees make their mistakes in different places. That's why, if we take a vote across the trees for each observation and decide its class from the poll result, the outcome is expected to be closer to the correct one (the simulation sketch below makes this concrete).
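As a quick illustration of the voting intuition, here is a toy simulation I wrote for this point (it is not part of the wine example, and it assumes the trees' errors are independent, which real forests only approximate): if each of 25 classifiers is right 70% of the time, the majority vote is right far more often.

import numpy as np

rng = np.random.default_rng(42)
n_trees, n_obs, p_correct = 25, 10_000, 0.7

#1 where a tree classifies an observation correctly, 0 where it errs;
#errors are independent across trees in this toy setup
correct = rng.random((n_obs, n_trees)) < p_correct

#The majority vote is right whenever more than half of the trees are right
majority_correct = correct.sum(axis=1) > n_trees / 2

print(correct.mean())           #~0.70 accuracy for a single tree
print(majority_correct.mean())  #~0.98 accuracy for the majority vote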
Example
Let's get into an example using RandomForestRegressor. We'll try to predict wine quality based on its attributes (features). You can get the code and data from this repo: https://github.com/kchivorn/wine. First of all, we need to set up the environment. I assume you have Python 3 and the necessary packages installed.
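If anything is missing, the packages used in this walkthrough can be installed with pip (the names below are the standard PyPI package names; pin versions to match your setup if needed):

pip install numpy pandas scikit-learn joblib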
#Import numpy module
import numpy as np
#Import pandas
import pandas as pd
#Import sampling helper
from sklearn.model_selection import train_test_split
#Import preprocessing modules
from sklearn import preprocessing
#Import random forest model
from sklearn.ensemble import RandomForestRegressor
#Import cross-validation pipeline
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
#Import evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
#Import module for saving scikit-learn models
#(sklearn.externals.joblib was removed from recent scikit-learn releases;
# import the standalone joblib package instead)
import joblib
#Load red wine data
data = pd.read_csv('winequality-red.csv', sep=';')

#Output the first 5 rows of data
print(data.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5

print(data.shape)
# (1599, 12)

print(data.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar
count    1599.000000       1599.000000  1599.000000     1599.000000
mean        8.319637          0.527821     0.270976        2.538806
std         1.741096          0.179060     0.194801        1.409928
min         4.600000          0.120000     0.000000        0.900000
25%         7.100000          0.390000     0.090000        1.900000
50%         7.900000          0.520000     0.260000        2.200000
75%         9.200000          0.640000     0.420000        2.600000
max        15.900000          1.580000     1.000000       15.500000

         chlorides  free sulfur dioxide  total sulfur dioxide      density
count  1599.000000          1599.000000           1599.000000  1599.000000
mean      0.087467            15.874922             46.467792     0.996747
std       0.047065            10.460157             32.895324     0.001887
min       0.012000             1.000000              6.000000     0.990070
25%       0.070000             7.000000             22.000000     0.995600
50%       0.079000            14.000000             38.000000     0.996750
75%       0.090000            21.000000             62.000000     0.997835
max       0.611000            72.000000            289.000000     1.003690

                pH    sulphates      alcohol      quality
count  1599.000000  1599.000000  1599.000000  1599.000000
mean      3.311113     0.658149    10.422983     5.636023
std       0.154386     0.169507     1.065668     0.807569
min       2.740000     0.330000     8.400000     3.000000
25%       3.210000     0.550000     9.500000     5.000000
50%       3.310000     0.620000    10.200000     6.000000
75%       3.400000     0.730000    11.100000     6.000000
max       4.010000     2.000000    14.900000     8.000000
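Since one of the advantages listed above is handling missing values, it is worth checking whether this dataset actually has any. This quick sanity check is an extra step I added, not part of the original walkthrough; this particular dataset turns out to be complete.

#Count missing values per column
print(data.isnull().sum())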
#Separate target from training features
y = data.quality
X = data.drop('quality', axis=1)

#Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=123,
                                                    stratify=y)

#Fit the Transformer API
scaler = preprocessing.StandardScaler().fit(X_train)

#Apply transformer to training data
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

#Apply transformer to test data
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.mean(axis=0))
print(X_test_scaled.std(axis=0))

#Pipeline with preprocessing and model
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

#List tunable hyperparameters
print(pipeline.get_params())

#Declare hyperparameters to tune
#(recent scikit-learn releases removed the 'auto' option;
# 1.0, meaning "use all features", is the regressor's equivalent)
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

#Add sklearn cross-validation with pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

#Fit and tune model
clf.fit(X_train, y_train)
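Once the grid search finishes, you can also inspect how each hyperparameter combination scored during cross-validation. best_score_ and cv_results_ are standard GridSearchCV attributes; the exact numbers will vary from run to run.

#Best cross-validated score found by the grid search
print(clf.best_score_)

#Mean cross-validation score for every combination tried
for params, score in zip(clf.cv_results_['params'],
                         clf.cv_results_['mean_test_score']):
    print(params, score)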
#Print out the best params
print(clf.best_params_)
# {'randomforestregressor__max_depth': None, 'randomforestregressor__max_features':

#Confirm the model will be retrained on the full training set with the best params
print(clf.refit)
# True

#Predict a new set of data
y_pred = clf.predict(X_test)
print(y_pred)
# [6.35 5.77 4.86 5.36 6.31 5.47 5.   4.71 5.01 6.02 5.24 5.68 5.87 5.08
#  5.85 5.7  6.48 5.75 5.67 7.   ... 5.69 6.13 5.46]  (320 predictions, abridged)
#Print prediction errors
print(r2_score(y_test, y_pred))
# 0.45044082571584243
print(mean_squared_error(y_test, y_pred))
# 0.35461593750000003

#Save model to a .pkl file
joblib.dump(clf, 'rf_regressor.pkl')

#Load model from .pkl file
clf2 = joblib.load('rf_regressor.pkl')
clf2.predict(X_test)
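To round the example off, here is how the reloaded model could score a single new wine. The feature values below are just the first row of the dataset, reused for illustration; what matters is that the columns match the training features in order and name.

#Score one new sample (values copied from the dataset's first row)
new_wine = pd.DataFrame([[7.4, 0.70, 0.00, 1.9, 0.076, 11.0, 34.0,
                          0.9978, 3.51, 0.56, 9.4]],
                        columns=X.columns)
print(clf2.predict(new_wine))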
Conclusion
The rule of thumb is that your very first model probably won't be the best possible one. However, we recommend a combination of three strategies to decide whether you're satisfied with your model's performance:
- Start with the goal of the model: if it is tied to a business problem, have you successfully solved that problem?
- Look in the academic literature to get a sense of the current performance benchmarks for this type of data.
- Try to find low-hanging fruit in terms of ways to improve your model.