This lab covers the random forest algorithm, as well as the more general approach of ensemble learning, using sklearn.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
# turn off KNN future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Examples throughout this lab will use the Wisconsin breast cancer data that were used in several of our previous labs:
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)
## Separate the target from the predictors and re-label the target
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)
The fundamental idea of an ensemble learner is to combine the prediction rules from several different models/estimators to produce a composite model with superior performance on new data. This works best when the models in the ensemble each learn different things from the training data, so that each base model is prone to making different types of errors.
There are two main approaches to creating an ensemble learner: aggregation, where several independently trained models vote on (or average) the final prediction, and stacked generalization, where an additional model is trained to combine the base models' predictions.
We can build our own aggregation ensemble using the VotingClassifier() function:
## Import each model
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
## Defining the individual models
model1 = Pipeline([('scaler', StandardScaler()),
('model', SVC(kernel = 'poly', probability=True))])
model2 = DecisionTreeClassifier(max_depth=5)
model3 = Pipeline([('scaler', StandardScaler()),
('model', KNeighborsClassifier())])
## Create the ensemble
my_ensemble = VotingClassifier(estimators=[('svm', model1),('tree', model2),('knn', model3)], voting='soft')
This example combines the predictions of three different classification models (a support vector machine, a decision tree, and a $k$-nearest neighbors model) using "soft" voting.
- voting = 'soft' sums the predicted probabilities for each outcome class produced by each model in the ensemble, with the final prediction being the class with the largest sum.
- voting = 'hard' uses a simple majority vote, where the final prediction is the most commonly predicted class. Unfortunately, if there's a tie the final prediction is assigned according to the ascending sort order of the class labels.

Thus, we'll generally prefer soft voting unless our scenario involves enough models and outcome classes that ties are unlikely to occur.
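To make these two schemes concrete, below is a minimal numpy sketch using made-up predicted probabilities (the values are purely illustrative and not produced by the models above):
## Illustration: how soft vs. hard voting combine three models' outputs for one observation
import numpy as np
## Hypothetical predicted probabilities of [class 0, class 1] from each of three models
probs = np.array([[0.60, 0.40],   ## model 1
                  [0.45, 0.55],   ## model 2
                  [0.40, 0.60]])  ## model 3
## Soft voting: sum the class probabilities and predict the class with the largest sum
print(probs.sum(axis=0))            ## [1.45, 1.55] -> class 1
## Hard voting: each model votes for its most probable class and the majority wins
votes = probs.argmax(axis=1)        ## [0, 1, 1]
print(np.bincount(votes).argmax())  ## -> class 1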
Once it is set up, our voting classifier (i.e., my_ensemble) has the same capabilities as the individual modeling functions in sklearn. As an example, we can find the ensemble's cross-validated F1-score and compare it to those of its individual base models:
## Imports
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
## Pipeline to compare models
model_pipe = Pipeline([('model', SVC())])
candidate_models = {'model': [my_ensemble, model1, model2, model3]}
## Cross-validated F1 scores
grid = GridSearchCV(model_pipe, candidate_models, cv=5, scoring = 'f1').fit(train_X, train_y)
pd.DataFrame(grid.cv_results_).sort_values('mean_test_score', ascending=False)[['param_model', 'mean_test_score']]
| | param_model | mean_test_score |
|---|---|---|
| 0 | VotingClassifier(estimators=[('svm', ... | 0.903979 |
| 3 | (StandardScaler(), KNeighborsClassifier()) | 0.898491 |
| 2 | DecisionTreeClassifier(max_depth=5) | 0.862679 |
| 1 | (StandardScaler(), SVC(kernel='poly', probabil... | 0.848937 |
This example demonstrates the primary reason to consider using an ensemble: the composite predictions tend to exhibit better out-of-sample performance than any individual model achieves by itself.
Another thing to know is that ensembles created using VotingClassifier() are fully compatible with pipelines and grid searches that optimize the hyperparameters of the individual models:
## Some tuning parameters to search over
params = {'svm__model__kernel': ['poly','linear'],
'tree__max_depth': [4,5,6],
'knn__model__n_neighbors': [5,10,15],
'knn__model__weights': ['distance','uniform'],
'voting': ['soft','hard']}
## Perform the grid search
grid = GridSearchCV(my_ensemble, param_grid=params, cv=5, scoring = 'f1').fit(train_X, train_y)
print(grid.best_estimator_)
VotingClassifier(estimators=[('svm', Pipeline(steps=[('scaler', StandardScaler()), ('model', SVC(kernel='linear', probability=True))])), ('tree', DecisionTreeClassifier(max_depth=6)), ('knn', Pipeline(steps=[('scaler', StandardScaler()), ('model', KNeighborsClassifier(n_neighbors=15))]))])
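Because this is an ordinary fitted GridSearchCV object, we can also pull out the winning parameter combination and its cross-validated F1-score directly:
## Best parameter combination and its cross-validated F1-score
print(grid.best_params_)
print(grid.best_score_)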
Finally, you should be aware that it's not required for every model in an ensemble to contribute equally to the ensemble's prediction. The weights argument can be used to adjust the relative contribution of each model in the ensemble:
## Example that gives less weight to SVM and more to KNN
weighted_ensemble = VotingClassifier(estimators=[('svm', model1),
('tree', model2),
('knn', model3)],
voting='soft', weights=[0.8,1,1.2])
## Example comparing some different weighting schemes using cross-validation
candidate_models = {'weights': [[0.8,1,1.2], [1.2,1,0.8], [1,0.8,1.2]]}
grid = GridSearchCV(weighted_ensemble, candidate_models, cv=5, scoring = 'f1').fit(train_X, train_y)
grid.best_estimator_
VotingClassifier(estimators=[('svm', Pipeline(steps=[('scaler', StandardScaler()), ('model', SVC(kernel='poly', probability=True))])), ('tree', DecisionTreeClassifier(max_depth=5)), ('knn', Pipeline(steps=[('scaler', StandardScaler()), ('model', KNeighborsClassifier())]))], voting='soft', weights=[1, 0.8, 1.2])
You should note that when voting = 'soft' these weights act as direct multipliers on the predicted class probabilities that are summed to make class predictions. When voting = 'hard', each model's vote is multiplied by its weight when tallying the majority, which also provides a way to break ties.
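Continuing the small numpy illustration from earlier (again with made-up probabilities), here is how the weights [0.8, 1, 1.2] would enter a soft-voting calculation:
## Illustration: weights multiply each model's predicted probabilities before the sum
import numpy as np
probs = np.array([[0.60, 0.40],   ## svm
                  [0.45, 0.55],   ## tree
                  [0.40, 0.60]])  ## knn
weights = np.array([0.8, 1, 1.2])
weighted_sum = (probs * weights[:, None]).sum(axis=0)
print(weighted_sum)               ## [1.41, 1.59] -> class 1 still wins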
Question 1:
- Part A: ... C = 1 or C = 0.1 ...
- Part B: Use GridSearchCV() to find a weighting scheme for aggregating the base models in your best performing ensemble from Part A that yields better classification accuracy than weighting each model's predictions equally.

Functions like VotingClassifier() use relatively simple strategies to aggregate the predictions of the individual models within the ensemble. A more complex strategy is to train another model on top of the ensemble's "base models" whose sole purpose is to optimally map the predictions of the base models to predicted classes. This approach is known as stacked generalization.
The example below uses a decision tree model to aggregate the output of our ensemble from Part 1 of the lab:
## Set up the base estimators (model1 - model3 defined previously)
my_base_models = [('svm', model1), ('tree', model2), ('knn', model3)]
## Set up the final estimator
my_final_model = DecisionTreeClassifier(max_depth=2)
## Create the stack
from sklearn.ensemble import StackingClassifier
my_stack = StackingClassifier(estimators = my_base_models,
final_estimator = my_final_model,
stack_method ='predict_proba', cv=5)
## Fit and Evaluate (note that CV is done internally in StackingClassifier)
fitted_stack = my_stack.fit(train_X, train_y)
cv_stacked_preds = fitted_stack.predict(train_X)
from sklearn.metrics import f1_score
print(f1_score(train_y, cv_stacked_preds))
0.9320987654320988
Here we can see that the stacked generalization combined the predicted probabilities from these base models in a way that leads to a higher F1-score than simple 'soft' voting achieved.
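If we want a score for the stack that is directly comparable to the cross-validated scores reported earlier, we can also pass the entire stack to cross_val_score (a sketch using the objects defined above):
## Cross-validate the entire stack (base models + final estimator)
from sklearn.model_selection import cross_val_score
stack_scores = cross_val_score(my_stack, train_X, train_y, cv=5, scoring='f1')
print(stack_scores.mean())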
Since the final model was a decision tree, we can see how it is using the predicted probabilities from the base models to generate final predictions:
from sklearn.tree import plot_tree
plt.figure(figsize=(8,4.5))
plot_tree(fitted_stack.final_estimator_, class_names=True)
plt.show()
Here the notation x[0] references the predicted probabilities of the first base model in the stack. Thus, we can see that our final estimator is using the output of KNN in its initial step, then it is using the SVM output (in both child branches) to further partition the data.
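If you want to see exactly what the final estimator receives as input, the fitted stack's transform() method returns the base models' predicted probabilities, which here should be one column per base model in the order the estimators were listed. The column labels below are just illustrative names:
## Peek at the features passed to the final estimator
stack_features = pd.DataFrame(fitted_stack.transform(train_X),
                              columns=['svm_prob', 'tree_prob', 'knn_prob'])  ## illustrative labels
print(stack_features.head())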
Question 2: Recognizing that the final estimator in this example is a decision tree, find the feature importance values of the stacked classifier used in the example given above. Briefly explain why the KNN base model exhibits the greatest importance.
Predicting a numeric outcome using ensembles or stacked generalizations is largely the same as the classification examples provided in the previous sections with only a few minor differences.
First, you should note that the relevant functions are VotingRegressor() and StackingRegressor(); a short sketch of both follows the list below.
- Predictions made by VotingRegressor() are a simple average of the predictions made by each individual base model. The weights argument can be used to adjust the relative contribution of each model's prediction.
- When using StackingRegressor(), you must use a regressor as the final estimator.
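Here is a minimal sketch of both functions. The base models, final estimator, and synthetic data below are purely illustrative choices and are not part of the lab's examples:
## Minimal sketch of regression ensembles on synthetic data (illustration only)
import numpy as np
from sklearn.ensemble import VotingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3*X_demo[:, 0] + rng.normal(size=200)

## Voting: the final prediction is an (optionally weighted) average of each base model's prediction
vote_reg = VotingRegressor(estimators=[('lm', LinearRegression()),
                                       ('tree', DecisionTreeRegressor(max_depth=3))])

## Stacking: the final estimator must itself be a regressor
stack_reg = StackingRegressor(estimators=[('lm', LinearRegression()),
                                          ('tree', DecisionTreeRegressor(max_depth=3))],
                              final_estimator=LinearRegression(), cv=5)

print(cross_val_score(vote_reg, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error').mean())
print(cross_val_score(stack_reg, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error').mean())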
Question 3: For this question you should use the Iowa City home sales data provided below. You will use the numeric features in the data to predict the outcome 'sale.amount'.
- Use the 'neg_root_mean_squared_error' scoring metric to optimize the following tuning parameters within these base models:
  - C = 1 or C = 100
## Read the IC home sales data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Split to training and testing sets
from sklearn.model_selection import train_test_split
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Split the outcome from the predictors
train_y_ic = train_ic['sale.amount']
train_X_ic= train_ic.select_dtypes("number").drop('sale.amount',axis=1)
The random forest algorithm is a popular ensemble approach that aggregates the results of many decision trees trained independently on bootstrapped samples (random samples drawn with replacement) of the training data. Usually, these trees are also forced to use only a subset of the predictors available in the original data set. Together, bootstrapping and feature sampling address two major weaknesses of single decision trees:
- Individual decision trees tend to overfit the training data, especially when grown deep.
- Because splits are chosen greedily, a few dominant predictors can drive most of a single tree's splits, leaving other potentially useful variables with little influence.
Bootstrapping addresses the first weakness by removing the need for deep trees and rewarding splits that generalize well to variations of the training data. Feature sampling addresses the second weakness by allowing more variables to contribute to the ensemble than would normally contribute to a single decision tree.
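For intuition about the bootstrapping piece, here is a minimal numpy sketch of drawing one bootstrapped sample of the training rows (some rows appear multiple times while others are left out entirely; the left-out rows are what the "out-of-bag" metrics discussed below are computed on):
## Illustration: one bootstrapped sample of the training rows
import numpy as np
rng = np.random.default_rng(0)
n = train_X.shape[0]
boot_idx = rng.choice(n, size=n, replace=True)    ## row indices sampled with replacement
oob_idx = np.setdiff1d(np.arange(n), boot_idx)    ## rows never selected ("out-of-bag")
print(len(np.unique(boot_idx)), "unique rows in the sample;", len(oob_idx), "rows out-of-bag")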
Overall, the success of a random forest is tied to its ability to build a diverse set of decision trees using slightly different training data and sets of predictors. Achieving this involves selecting appropriate tuning parameters, the most important being:
- max_depth - the maximum depth of the individual trees in the ensemble
- min_samples_split - the minimum number of samples a node must contain to be eligible for splitting (within a single tree in the ensemble)
- max_features - typically given as an int representing the number of randomly selected predictors considered at each split within an individual tree

You should also be aware of the n_estimators tuning parameter, which governs the number of trees in the forest. If this parameter is set too low, the forest may not be sufficiently flexible to capture all of the meaningful patterns in the data. However, there is little benefit to choosing a very large number of trees, as the performance of a random forest tends to stabilize after a certain number of trees and any additional trees only add computational burden.
Below is a quick demonstration of RandomForestClassifier(). You should note that there is an analogous function, RandomForestRegressor(), that is used for regression tasks.
from sklearn.ensemble import RandomForestClassifier
my_forest = RandomForestClassifier(max_depth=3, min_samples_split=10,
max_features=2, n_estimators=200,
random_state=0, oob_score=True)
fitted_forest = my_forest.fit(train_X, train_y)
print(fitted_forest.oob_score_) ## out-of-sample classification accuracy
0.9208791208791208
It's important to recognize that bootstrapping will naturally exclude some of the data-points from the sample used to train each tree. We can use these excluded data-points to calculate "out-of-bag" performance. Out-of-bag metrics are out-of-sample measures of performance, so we can use them in the same way that we use cross-validated measures of performance. Setting oob_score = True will use classification accuracy as the scoring metric, but you can also provide a callable function that takes the arguments y_true and y_pred (such as f1_score()).
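For example, assuming a recent version of scikit-learn (one where oob_score accepts a callable), we could score the out-of-bag predictions with F1 instead of accuracy:
## Use F1 rather than accuracy as the out-of-bag scoring metric (requires a recent sklearn)
from sklearn.metrics import f1_score
f1_forest = RandomForestClassifier(max_depth=3, min_samples_split=10,
                                   max_features=2, n_estimators=200,
                                   random_state=0, oob_score=f1_score)
print(f1_forest.fit(train_X, train_y).oob_score_)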
Unfortunately, there's no easy way to rely upon oob_score_ to efficiently tune hyperparameters, so you'll likely need to rely on GridSearchCV() despite its computational inefficiency with random forests.
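As a quick illustration of what that looks like (using the breast cancer training data from earlier and a deliberately small, arbitrary grid), tuning a random forest works the same way as tuning any other sklearn estimator:
## Small illustrative grid search for a random forest classifier
rf_params = {'max_depth': [2, 3],
             'max_features': [2, 4],
             'min_samples_split': [10, 20]}
rf_grid = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                       rf_params, cv=5, scoring='f1').fit(train_X, train_y)
print(rf_grid.best_params_, rf_grid.best_score_)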
Question 4:
- Part A: Use the 'neg_root_mean_squared_error' scoring metric to find a well-performing random forest model for the Iowa City home sales data (using all other features to predict sale.amount). Report the best model and its cross-validated score. Your search should explore the following tuning parameter values:
  - max_depth of either 2 or 3
  - min_samples_split of either 20 or 60
  - max_features of either 3, 5, or 7
  - n_estimators of 200
- Part B: Use GridSearchCV() to compare the best performing random forest from Part A with the best performing approach from Question 3. Briefly comment upon which method performed better (in terms of its cross-validated RMSE).
- Part C: ...
- Part D: Create a VotingRegressor() that averages the predictions of the two machine learning methods you compared in Part B. Calculate the RMSE of this ensemble on the test set and compare it with your results from Part C.