Lab 9 - Random Forests and Ensembles¶

This lab covers the random forest algorithm, as well as the more general approach of ensemble learning using sklearn.

In [1]:
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# turn off KNN future warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  

Examples throughout this lab will use the Wisconsin breast cancer data that were used in several of our previous labs:

In [2]:
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wbc, test_size=0.2, random_state=7)

## Separate the target from the predictors and re-label the target
train_y = train['Label'].map({'M': 1, 'B': 0})
test_y = test['Label'].map({'M': 1, 'B': 0})
train_X = train.drop(['ID','Label'], axis = 1)
test_X = test.drop(['ID','Label'], axis = 1)

Part 1 - Ensemble Classifiers¶

The fundamental idea of an ensemble learner is to combine the prediction rules of several different models/estimators to produce a composite model with superior performance on new data. This works best when the models in the ensemble each learn something different from the training data, so that the individual base models are prone to making different types of errors.

There are two main approaches to creating an ensemble learner:

  1. Aggregation (the focus of this lab) - Train several models independently and aggregate predictions using voting or simple/weighted averaging.
  2. Boosting (discussed next time) - Train several models sequentially such that each model addresses a weak point of the previous model.

We can build our own aggregation ensemble using the VotingClassifier() function:

In [3]:
## Import each model
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler

## Defining the individual models
model1 = Pipeline([('scaler', StandardScaler()),
                  ('model', SVC(kernel = 'poly', probability=True))])
model2 = DecisionTreeClassifier(max_depth=5)
model3 = Pipeline([('scaler', StandardScaler()),
                  ('model', KNeighborsClassifier())])
                  
## Create the ensemble
my_ensemble = VotingClassifier(estimators=[('svm', model1),('tree', model2),('knn', model3)], voting='soft')

This example combines the predictions of three different classification models (a support vector machine, a decision tree, and a $k$-nearest neighbors model) using "soft" voting.

  • The argument voting = 'soft' sums the predicted probabilities for each outcome class across the models in the ensemble, with the final prediction being the class with the largest sum.
  • The argument voting = 'hard' uses a simple majority vote, where the final prediction is the most commonly predicted class. Unfortunately, if there's a tie, the final prediction is simply the class that comes first in ascending sort order.

Thus, we'll generally prefer soft voting unless our scenario involves enough models and outcome classes that ties are unlikely to occur.
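To make the two voting schemes concrete, here is a small sketch using made-up probability values (not output from the lab's models) for a single observation:

import numpy as np

## Hypothetical class probabilities for [benign, malignant] from three models
p_svm  = np.array([0.30, 0.70])
p_tree = np.array([0.55, 0.45])
p_knn  = np.array([0.40, 0.60])

## Soft voting: sum the probabilities across models, predict the class with the largest sum
soft_sum = p_svm + p_tree + p_knn
print(soft_sum)           # [1.25 1.75]
print(soft_sum.argmax())  # 1 -> malignant

## Hard voting: each model votes for its most probable class, the majority wins
votes = [p.argmax() for p in (p_svm, p_tree, p_knn)]
print(np.bincount(votes).argmax())  # 1 -> malignant wins 2 votes to 1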

Once it is set up, our voting classifier (i.e., my_ensemble) has the same capabilities as individual modeling functions in sklearn. As an example, we can find the ensemble's cross-validated F1 score and compare it to those of its individual base models:

In [4]:
## Imports
from sklearn.pipeline import Pipeline 
from sklearn.model_selection import GridSearchCV

## Pipeline to compare models 
model_pipe = Pipeline([('model', SVC())])
candidate_models = {'model': [my_ensemble, model1, model2, model3]}

## Cross-validated F1 scores
grid = GridSearchCV(model_pipe, candidate_models, cv=5, scoring = 'f1').fit(train_X, train_y)
pd.DataFrame(grid.cv_results_).sort_values('mean_test_score', ascending=False)[['param_model', 'mean_test_score']]
Out[4]:
param_model mean_test_score
0 VotingClassifier(estimators=[('svm',\n ... 0.903979
3 (StandardScaler(), KNeighborsClassifier()) 0.898491
2 DecisionTreeClassifier(max_depth=5) 0.862679
1 (StandardScaler(), SVC(kernel='poly', probabil... 0.848937

This example demonstrates the primary reason to consider using an ensemble: the composite predictions tend to exhibit better out-of-sample performance than any individual model achieves by itself.

Another thing to know is that ensembles created using VotingClassifier() are fully compatible with pipelines and grid search, which lets us optimize the hyperparameters of the individual models. Parameters of a base model are referenced with double-underscore naming: the model's name in the ensemble, then (if applicable) the pipeline step, then the parameter (e.g., 'svm__model__kernel' or 'tree__max_depth'):

In [5]:
## Some tuning parameters to search over
params = {'svm__model__kernel': ['poly','linear'],
          'tree__max_depth': [4,5,6],
          'knn__model__n_neighbors': [5,10,15],
          'knn__model__weights': ['distance','uniform'],
          'voting': ['soft','hard']}

## Perform the grid search
grid = GridSearchCV(my_ensemble, param_grid=params, cv=5, scoring = 'f1').fit(train_X, train_y)
print(grid.best_estimator_)
VotingClassifier(estimators=[('svm',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               SVC(kernel='linear',
                                                   probability=True))])),
                             ('tree', DecisionTreeClassifier(max_depth=6)),
                             ('knn',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               KNeighborsClassifier(n_neighbors=15))]))])

Finally, you should be aware that it's not required for every model in an ensemble to contribute equally to the ensemble's prediction.

The weights argument can be used to adjust the relative contribution of each model in an ensemble:

In [6]:
## Example that gives less weight to SVM and more to KNN
weighted_ensemble = VotingClassifier(estimators=[('svm', model1),
                                                 ('tree', model2), 
                                                 ('knn', model3)], 
                                     voting='soft', weights=[0.8,1,1.2])

## Example comparing some different weighting schemes using cross-validation
candidate_models = {'weights': [[0.8,1,1.2], [1.2,1,0.8], [1,0.8,1.2]]}
grid = GridSearchCV(weighted_ensemble, candidate_models, cv=5, scoring = 'f1').fit(train_X, train_y)
grid.best_estimator_
Out[6]:
VotingClassifier(estimators=[('svm',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               SVC(kernel='poly',
                                                   probability=True))])),
                             ('tree', DecisionTreeClassifier(max_depth=5)),
                             ('knn',
                              Pipeline(steps=[('scaler', StandardScaler()),
                                              ('model',
                                               KNeighborsClassifier())]))],
                 voting='soft', weights=[1, 0.8, 1.2])

You should note that when voting = 'soft' these weights act as direct multipliers on each model's predicted class probabilities before they are summed to make the final prediction. When voting = 'hard', each model's vote is counted with its weight, so the final prediction is effectively a weighted majority vote.
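Continuing the small numeric sketch from earlier (again with hypothetical probabilities), the weights enter the calculation like this:

## Hypothetical class probabilities [benign, malignant], one row per model
probs = np.array([[0.30, 0.70],    # svm
                  [0.55, 0.45],    # tree
                  [0.40, 0.60]])   # knn
weights = np.array([0.8, 1, 1.2])

## Soft voting: weighted sum of the probability rows
print((weights[:, None] * probs).sum(axis=0).argmax())

## Hard voting: each model's vote is counted with its weight
votes = probs.argmax(axis=1)                          # [1, 0, 1]
print(np.bincount(votes, weights=weights).argmax())   # class 1 gets 0.8 + 1.2 = 2 votes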

Question 1:

  • Part A: Create a "soft" voting ensemble involving five base models: a linear SVM, an 'rbf' SVM, a decision tree with a max depth of 4, a KNN classifier using $k=5$, and a KNN classifier using $k=15$. Include a standardization step as part of the SVM and KNN base models. Conduct a cross-validated grid search using the default scoring metric (classification accuracy) to optimize the following parameters:
    • The regularization or "slack" in each SVM base model as either C = 1 or C = 0.1
    • The weighting scheme in each KNN base model as either inverse-distance weighting or uniform weighting.
  • Part B: Use GridSearchCV() to find a weighting scheme for aggregating the base models in your best performing ensemble from Part A that yields better classification accuracy than weighting each model's predictions equally.

Part 2 - Stacked Generalizations¶

Functions like VotingClassifier() use relatively simple strategies to aggregate the predictions of the individual models within the ensemble. A more flexible strategy is to train another model on top of the ensemble's "base models" whose sole purpose is to learn how best to map the base models' predictions onto final predicted classes. This approach is known as stacked generalization (or "stacking").

The example below uses a decision tree model to aggregate the output of our ensemble from Part 1 of the lab:

In [7]:
## Set up the base estimators (model1 - model3 defined previously)
my_base_models = [('svm', model1), ('tree', model2), ('knn', model3)]

## Set up the final estimator
my_final_model = DecisionTreeClassifier(max_depth=2)

## Create the stack
from sklearn.ensemble import StackingClassifier
my_stack = StackingClassifier(estimators = my_base_models, 
                              final_estimator = my_final_model, 
                              stack_method ='predict_proba', cv=5)

## Fit the stack (the CV specified above is used internally to train the final estimator)
fitted_stack = my_stack.fit(train_X, train_y)

## Evaluate the fitted stack on the training data
stacked_preds = fitted_stack.predict(train_X)

from sklearn.metrics import f1_score
print(f1_score(train_y, stacked_preds))
0.9320987654320988

Here we can see that the stacked generalization combines the predicted probabilities from the base models in a way that yields a higher F1 score than simple 'soft' voting. Keep in mind, however, that this score was computed on the training data (the internal CV only determines how the final estimator is trained), so it is not a fully out-of-sample estimate like the cross-validated scores in Part 1.
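For an apples-to-apples comparison with those cross-validated scores, one option (a sketch, not part of the lab's original output) is to cross-validate the entire stack:

## Cross-validate the whole stack (base models + final estimator)
from sklearn.model_selection import cross_val_score
cv_f1 = cross_val_score(my_stack, train_X, train_y, cv=5, scoring='f1')
print(cv_f1.mean())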

Since the final model was a decision tree, we can see how it is using the predicted probabilities from the base models to generate final predictions:

In [8]:
from sklearn.tree import plot_tree
plt.figure(figsize=(8,4.5))
plot_tree(fitted_stack.final_estimator_, class_names=True)
plt.show()

Here the notation x[0] references the predicted probability produced by the first base model in the stack (the SVM), x[1] the second (the tree), and x[2] the third (KNN). Thus, we can see that our final estimator uses the KNN output for its initial split, then uses the SVM output (in both child branches) to further partition the data.
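If the bare x[0], x[1], x[2] labels are hard to read, the tree can be redrawn with more informative names. This assumes (as is the case for binary classification with stack_method='predict_proba') that each base model contributes a single probability column, in the same order the estimators were listed:

plt.figure(figsize=(8,4.5))
plot_tree(fitted_stack.final_estimator_,
          feature_names=['svm_prob', 'tree_prob', 'knn_prob'],
          class_names=['benign', 'malignant'])
plt.show()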

Question 2: Recognizing that the final estimator in this example is a decision tree, find the feature importance values of the stacked classifier used in the example above. Briefly explain why the KNN base model exhibits the greatest importance.

Part 3 - Ensembles for Regression Tasks¶

Predicting a numeric outcome using ensembles or stacked generalizations is largely the same as the classification examples provided in the previous sections with only a few minor differences.

First, you should note that the relevant functions are VotingRegressor() and StackingRegressor().

  • By default, the final predictions of VotingRegressor() are a simple average of the predictions made by each individual base model. The weights argument can be used to adjust the relative contribution of each model's prediction.
  • In StackingRegressor() you must use a regressor as the final estimator (see the brief sketch after this list).
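Below is a minimal construction sketch showing that the regression versions are set up the same way as their classification counterparts. The two base models here are arbitrary placeholders (not the ensemble asked for in Question 3), and the Pipeline/StandardScaler imports from earlier in the lab are reused:

## Regression analogues of the ensemble functions used earlier
from sklearn.ensemble import VotingRegressor, StackingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

reg_tree = DecisionTreeRegressor(max_depth=4)
reg_knn = Pipeline([('scaler', StandardScaler()), ('model', KNeighborsRegressor())])

## VotingRegressor averages the base models' numeric predictions (optionally weighted)
reg_ensemble = VotingRegressor(estimators=[('tree', reg_tree), ('knn', reg_knn)])

## StackingRegressor requires a regressor as the final estimator
reg_stack = StackingRegressor(estimators=[('tree', reg_tree), ('knn', reg_knn)],
                              final_estimator=DecisionTreeRegressor(max_depth=2), cv=5)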

Question 3: For this question you should use the Iowa City home sales data provided below. You will use the numeric features in the data to predict the outcome 'sale.amount'.

  • Part A: Create an ensemble that uses three base models: a support vector regressor, a decision tree regressor, and a KNN regressor. Be sure to include standardization when needed. Then, perform a cross-validated grid search with the 'neg_root_mean_squared_error' scoring metric to optimize the following tuning parameters within these base models:
    • SVM kernel - either 'linear' or 'rbf'
    • SVM regularization - either C=1 or C=100
    • Decision tree depth - either 3, 4, or 5
    • KNN number of neighbors - either 10 or 25
  • Part B: Using the ensemble involving the best combination of tuning parameters in Part A, fit a stacked generalization that uses a decision tree with a maximum depth of 3 as the final estimator. Report the cross-validated RMSE of this approach.
  • Part C: Briefly comment upon whether your stacked generalization improved upon the predictive performance of your ensemble from Part A.
In [9]:
## Read the IC home sales data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## Split to training and testing sets
from sklearn.model_selection import train_test_split
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)

## Split the outcome from the predictors
train_y_ic = train_ic['sale.amount']
train_X_ic = train_ic.select_dtypes("number").drop('sale.amount', axis=1)

Part 4 - Random Forests¶

The random forest algorithm is a popular ensemble approach that aggregates the results of many decision trees trained independently on bootstrapped samples of the training data. Additionally, each split within these trees is typically restricted to a randomly chosen subset of the predictors available in the original data set. Together, bootstrapping and feature sampling address two major weaknesses of single decision trees:

  1. Deep decision trees are often necessary to adequately capture the predictive patterns in the training data; but deep trees exhibit high variability, meaning the splits are highly dependent on the specific set of training data and are prone to overfitting.
  2. The decision tree algorithm is greedy, so variables that are more predictive will dominate correlated variables that are less predictive but still contain some useful information about the outcome.

Bootstrapping addresses the first weakness: averaging over many trees, each fit to a slightly different version of the training data, reduces variance and rewards splits that generalize well across those variations. Feature sampling addresses the second weakness by allowing more variables to contribute to the ensemble than would normally contribute to a single decision tree.
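To make these two sources of randomness concrete, here is a small sketch of the rows one tree in the forest might be grown on and the predictors that might be considered at a single split (an arbitrary subset of 3 is assumed here):

import numpy as np
rng = np.random.default_rng(0)

## Rows used to grow one tree: a bootstrap sample (same size as the data, drawn with replacement)
n_rows = train_X.shape[0]
boot_rows = rng.choice(n_rows, size=n_rows, replace=True)
one_tree_data = train_X.iloc[boot_rows]

## Predictors eligible at one particular split within that tree
split_candidates = rng.choice(np.array(train_X.columns), size=3, replace=False)
print(split_candidates)

## About a third of the original rows (roughly 1/e) never appear in this tree's bootstrap sample
print(1 - len(np.unique(boot_rows)) / n_rows)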

Overall, the success of a random forest is tied to its ability to build a diverse set of decision trees using slightly different training data and sets of predictors. Achieving success involves selecting appropriate tuning parameters, the most important being:

  • max_depth - the maximum depth of the individual trees in the ensemble
  • min_samples_split - the minimum samples in a node for it to be eligible for splitting (for a single tree in the ensemble)
  • max_features - the number of randomly selected predictors considered when searching for the best split at each node (typically given as an int).

You should also be aware of the n_estimators tuning parameter, which governs the number of trees in the forest. If this parameter is set too low, the forest may not be sufficiently flexible to capture all of the meaningful patterns in the data. However, there is little benefit to choosing a very large number of trees: the performance of a random forest tends to stabilize once enough trees are included, and adding more only increases the computational burden.

Below is a quick demonstration of RandomForestClassifier(). You should note that there is an analogous function RandomForestRegressor() that is used for regression tasks.

In [10]:
from sklearn.ensemble import RandomForestClassifier
my_forest = RandomForestClassifier(max_depth=3, min_samples_split=10, 
                                   max_features=2, n_estimators=200, 
                                   random_state=0, oob_score=True)

fitted_forest = my_forest.fit(train_X, train_y)
print(fitted_forest.oob_score_) ## out-of-bag classification accuracy
0.9208791208791208

It's important to recognize that bootstrapping will naturally exclude some of the data points from the sample used to train each tree. We can use these excluded data points to calculate "out-of-bag" performance. Out-of-bag metrics are out-of-sample measures of performance, so we can use them in much the same way as cross-validated measures of performance. Setting oob_score = True scores the out-of-bag predictions using classification accuracy. You can also provide a callable function with the arguments y_true and y_pred (such as f1_score()).
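For example (assuming a recent enough version of sklearn for oob_score to accept a callable), the F1 score could be used as the out-of-bag metric instead:

## Same forest as above, but scoring the out-of-bag predictions with F1
from sklearn.metrics import f1_score
f1_forest = RandomForestClassifier(max_depth=3, min_samples_split=10,
                                   max_features=2, n_estimators=200,
                                   random_state=0, oob_score=f1_score)
print(f1_forest.fit(train_X, train_y).oob_score_)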

Unfortunately, there's no built-in way to use oob_score_ to efficiently tune hyperparameters, so you'll likely need to rely on GridSearchCV() despite its computational inefficiency with random forests.
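That said, out-of-bag scores come essentially for free once a forest is fit, so they remain handy for quick, informal checks, such as verifying that performance has stabilized for your chosen number of trees. A sketch of such a check (the printed values are not part of the lab's output):

## Informal check: out-of-bag accuracy as trees are added to the forest
for n_trees in [50, 100, 200, 400]:
    rf = RandomForestClassifier(max_depth=3, min_samples_split=10, max_features=2,
                                n_estimators=n_trees, random_state=0,
                                oob_score=True).fit(train_X, train_y)
    print(n_trees, round(rf.oob_score_, 3))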

Question 4:

  • Part A: Conduct a cross-validated grid search using the 'neg_root_mean_squared_error' scoring metric to find a well-performing random forest model for the Iowa City home sales data (using all other features to predict sale.amount). Report the best model and its cross-validated score. Your search should explore the following tuning parameter values:
    • max_depth of either 2 or 3
    • min_samples_split of either 20 or 60
    • max_features of either 3, 5, or 7
    • n_estimators as 200
  • Part B: Use GridSearchCV() to compare the best performing random forest from Part A with the best performing approach from Question 3. Briefly comment upon which method performed better (in terms of its cross-validated RMSE).
  • Part C: Calculate the RMSE of the best model from Part B on the test set.
  • Part D: Create an ensemble using VotingRegressor() that averages the predictions of the two machine learning methods you compared in Part B. Calculate the RMSE of this ensemble on the test set and compare it with your results from Part C.