This lab will cover implementations of ensemble learning methods in sklearn, with random forests being a notable example.
As usual, we'll begin by loading several familiar libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
# Unfortunately, knn functions prompt "future warnings", so the commands below turn these off
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
Our classification examples will again use the SMS spam dataset. The code below creates several different features from the message text:
## Note that the text file containing these data uses a tab delimiter to separate the label and message
sms = pd.read_csv("https://remiller1450.github.io/data/sms_spam.txt", sep='\t', names=['Label','Message'])
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(sms, test_size=0.2, random_state=8)
## Split outcome from predictors
train_y = (train['Label'] == 'spam').astype(int)
train_msg = train['Message']
## Feature engineering functions
def get_num(text):
    return sum(map(str.isdigit, text))/len(text)    ## proportion of characters that are digits
def cap_percent(text):
    return sum(map(str.isupper, text))/len(text)    ## proportion of characters that are uppercase
def alpha_percent(text):
    return sum(map(str.isalnum, text))/len(text)    ## proportion of characters that are alphanumeric
## Define "first_word" function
def first_word(text):
    return text.split(sep=' ')[0].lower().replace('!','')
## Create data frame with these features
d = {'prop_num': train_msg.apply(get_num),
'prop_cap': train_msg.apply(cap_percent),
'prop_alp': train_msg.apply(alpha_percent),
'first': train_msg.apply(first_word)}
train_X = pd.DataFrame(d)
## Assemble final training X data
train_X_ohe = pd.get_dummies(train_X, columns=['first'])
train_X = train_X_ohe[['prop_num','prop_cap', 'prop_alp', 'first_urgent','first_free']]
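As a quick check (not part of the lab's original output), you might peek at the assembled features and the class balance before modeling:
print(train_X.head())
print(train_y.mean())  ## proportion of training messages labeled spam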
The fundamental idea of an ensemble is to combine predictions from several different models/estimators in order to produce a composite model with superior performance and generalizability to new data.
Ensemble modeling methods generally fall into one of two categories:
Aggregation ensembles, such as random forests, use multiple independent models/estimators and aggregate predictions from each using strategies like simple or weighted averaging.
Boosting ensembles use sequentially built models/estimators, where each new model attempts to correct the errors of the ones built before it.
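Boosting is not the focus of this lab, but as a rough illustration (a minimal sketch, not part of the original lab code), sklearn provides boosting ensembles such as GradientBoostingClassifier, which sequentially fits shallow trees that correct the current ensemble's errors:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
## Sequentially fit 100 shallow trees, each one correcting the current ensemble's mistakes
boost = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
print(np.average(cross_val_score(boost, train_X, train_y, scoring='f1', cv=5)))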
For classification tasks, we can construct our own ensemble for any set of base models using VotingClassifier. The example below demonstrates a simple ensemble consisting of three different models: k-nearest neighbors, logistic regression, and a decision tree.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
## Defining the models
model1 = LogisticRegression(penalty='none')
model2 = DecisionTreeClassifier(max_depth=5)
model3 = Pipeline([('scaler', StandardScaler()),
('model', KNeighborsClassifier())])
## Creating the ensemble
my_ensemble = VotingClassifier(estimators=[('logr', model1),
('tree', model2),
('knn', model3)],
voting='soft')
In VotingClassifier, the voting argument controls how each model in the ensemble contributes to the final prediction:
voting = 'soft' sums the predicted probability of each class across each model in the ensemble. The final prediction is the class with the largest sum.
voting = 'hard' uses each model's predicted class in a simple majority vote. For example, if 2 of 3 models predict a message is "spam", then the ensemble's prediction for that message is "spam".
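To make the distinction concrete, here is a small numerical sketch (the probabilities are made up, not produced by the lab's models):
## Hypothetical P(ham), P(spam) for one message from the three models
example_probs = np.array([[0.05, 0.95],   ## logistic regression
                          [0.55, 0.45],   ## decision tree
                          [0.60, 0.40]])  ## k-nearest neighbors
## Soft voting: sum the probabilities across models and predict the class with the largest sum
print(example_probs.sum(axis=0).argmax())                   ## 1 ("spam"), since the sums are [1.20, 1.80]
## Hard voting: each model votes for its most probable class and the majority wins
print(np.bincount(example_probs.argmax(axis=1)).argmax())   ## 0 ("ham"), since the votes are [1, 0, 0]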
We can use our VotingClassifier ensemble in many of the same ways we can use individual models. For example, we can evaluate its performance using cross-validation:
from sklearn.model_selection import cross_val_score
print(np.average(cross_val_score(my_ensemble, train_X, train_y, scoring='f1', cv=5)))
## Individual models
print([np.average(cross_val_score(model1, train_X, train_y, scoring='f1', cv=5)),
np.average(cross_val_score(model2, train_X, train_y, scoring='f1', cv=5)),
np.average(cross_val_score(model3, train_X, train_y, scoring='f1', cv=5))])
0.9128026222243044 [0.8453849485158784, 0.9053659207079692, 0.9084395064728428]
Here we can see the benefit of ensembles. The composite predictions from the ensemble result in a better cross-validated F1-score than any base model achieves by itself.
Ensembles created using VotingClassifier are compatible with pipelines and cross-validated grid search, providing us a useful set of tools for achieving strong predictive performance.
This process is demonstrated below. Pay special attention to the syntax used to specify values for the n_neighbors argument of the k-nearest neighbors component of the ensemble, which uses double underscores to reference arguments within a particular component (for example, knn__model__n_neighbors refers to the n_neighbors argument of the 'model' step inside the 'knn' pipeline).
params = {'logr__penalty': ['none','l2'],
'tree__max_depth': [4,5,6,7],
'knn__model__n_neighbors': [3,6,10,15],
'knn__model__weights': ['distance','uniform'],
'voting': ['soft','hard']}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(my_ensemble, param_grid=params, cv=5, scoring = 'f1').fit(train_X, train_y)
print(grid.best_estimator_)
VotingClassifier(estimators=[('logr', LogisticRegression(penalty='none')), ('tree', DecisionTreeClassifier(max_depth=5)), ('knn', Pipeline(steps=[('scaler', StandardScaler()), ('model', KNeighborsClassifier(n_neighbors=3, weights='distance'))]))], voting='soft')
Another useful feature is the option to weight each individual model in an ensemble differently. In our previous example, the k-nearest neighbors classifier was the strongest model on its own, so we might think to give it slightly more weight in our ensemble:
weighted_ensemble = VotingClassifier(estimators=[('logr', model1),
('tree', model2),
('knn', model3)],
voting='soft',
weights=[0.8,1,1.2])
By setting weights = [0.8, 1, 1.2] and using voting = 'soft', the predicted class probabilities from the logistic regression model are multiplied by 0.8, so they contribute less to the ensemble's summed scores. Similarly, the class probabilities from the k-nearest neighbors model are multiplied by 1.2, thereby giving them more influence.
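As a rough numerical sketch of this calculation (again using made-up probabilities for a single message):
## Made-up P(ham), P(spam) from logistic regression, the tree, and knn
probs = np.array([[0.4, 0.6], [0.7, 0.3], [0.3, 0.7]])
w = np.array([0.8, 1, 1.2])
## Scale each model's probabilities by its weight before summing across models
print((probs * w[:, None]).sum(axis=0))   ## [1.38, 1.62] -> the ensemble predicts "spam"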
We can re-run our cross-validated grid search using this new ensemble to see if the weighting is beneficial:
params = {'logr__penalty': ['none','l2'],
'tree__max_depth': [4,5,6,7],
'knn__model__n_neighbors': [3,6,10,15],
'knn__model__weights': ['distance','uniform']}
from sklearn.model_selection import GridSearchCV
grid_weighted = GridSearchCV(weighted_ensemble, param_grid=params, cv=5, scoring = 'f1').fit(train_X, train_y)
print(grid.best_score_, grid_weighted.best_score_)
0.9167655692653746 0.916737416166477
We see essentially no change in performance with this weighting scheme (the two best scores are nearly identical). A few additional comments:
weights can be treated as a tuning parameter, and you can use methods like grid search to explore different combinations (a brief sketch of this is shown below).
weights = [0.5,0.5,2] effectively leads to ensemble predictions that are extremely similar to the predictions obtained using just the KNN model, so you might argue that it's more logical to simply use the KNN model by itself rather than this weighting scheme.
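As a rough sketch of treating the weights as tuning parameters (the candidate weight vectors below are arbitrary choices, not from the original lab):
## Candidate weight vectors can be supplied to GridSearchCV like any other parameter
params_w = {'weights': [[1, 1, 1], [0.8, 1, 1.2], [0.5, 0.5, 2]],
            'knn__model__n_neighbors': [3, 6, 10, 15]}
grid_w = GridSearchCV(weighted_ensemble, param_grid=params_w, cv=5, scoring='f1').fit(train_X, train_y)
print(grid_w.best_params_, grid_w.best_score_)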
The methods described in the previous section aggregate the predictions of multiple models using relatively simple strategies, such as simple majority voting (voting = 'hard'), the sums of predicted probabilities (voting = 'soft'), or the weighted sums of predicted probabilities (voting = 'soft' and weights = ...).
Another strategy is to use a formally defined model to aggregate the predictions from the base models. For example, we can take the predicted probabilities from each base estimator in an ensemble and use them as inputs into a new model, such as a decision tree or logistic regression, which translates them into final prediction scores and predicted classes.
This strategy is known as stacked generalization, and an example is demonstrated below:
## Create list of models in the ensemble
base_models = [('logreg', LogisticRegression(penalty='none')),
('tree', DecisionTreeClassifier(max_depth=5)),
('knn', KNeighborsClassifier(n_neighbors=20))]
## Final model used to aggregate predicted probs
my_final_estimator = DecisionTreeClassifier(max_depth=3)
## Create stack
from sklearn.ensemble import StackingClassifier
my_stack = StackingClassifier(estimators = base_models, final_estimator = my_final_estimator,
stack_method ='predict_proba', cv=5)
## Fit and evaluate (StackingClassifier uses internal cross-validation when generating the base-model predictions used to train the final estimator)
fitted_stack = my_stack.fit(train_X, train_y)
cv_stacked_preds = fitted_stack.predict(train_X)
from sklearn.metrics import f1_score
print(f1_score(train_y, cv_stacked_preds))  ## true labels first, then predictions
0.9242957746478873
Compared with our previous cross-validated F1-scores on the training data, this is an improvement, and it is substantially higher than the benchmark set for KNN back in Lab 3.
Since the model we used to aggregate the base estimators was a decision tree, we can use methods from our previous lab to explore it:
from sklearn.tree import plot_tree
plt.figure(figsize=(8,5.5))
plot_tree(fitted_stack.final_estimator_, class_names=True)
plt.show()
Note that the splitting rules of this model are based upon the predicted probabilities from individual models in the ensemble. For example, the first rule X[1] <= 0.685 uses predicted probabilities from the decision tree (the model defined in position [1]).
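If you'd like to inspect these inputs directly, StackingClassifier provides a transform method that returns the base-model predictions fed to the final estimator; for a binary problem with stack_method='predict_proba', each base model should contribute a single column (its predicted probability of spam):
## Columns correspond to the base models in order: logreg, tree, knn
meta_features = fitted_stack.transform(train_X)
print(meta_features.shape)
print(meta_features[:3])   ## predicted spam probabilities for the first few messages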
Because our final model was a decision tree, we could even look at the relative importance of each base model in the final tree:
fitted_stack.final_estimator_.feature_importances_
array([0.01446582, 0.95041708, 0.03511711])
However, this information can be misleading due to the greedy nature of decision trees. It's not that the decision tree among our base models was far superior to the others, but rather that it yielded the first split, which was highly discriminatory compared to the subsequent splits.
Ensembles and stacked generalizations can be applied to regression tasks with a few minor modifications.
VotingRegressor is analogous to VotingClassifier with the argument voting = 'soft'. It aggregates predictions using simple or weighted averaging of the predictions produced by individual models within the ensemble.
Similarly, StackingRegressor is analogous to StackingClassifier.
An example is shown below:
## Read IC home sales data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Split to training and testing sets
from sklearn.model_selection import train_test_split
train_ic, test_ic = train_test_split(ic, test_size=0.2, random_state=7)
## Create outcome var
train_y_ic = train_ic['sale.amount']
## Create predictor matrix (numeric predictors only for simplicity, but we could use OHE if we wanted to)
train_X_ic = train_ic.select_dtypes("number").drop('sale.amount', axis=1)
## Create list of models in the ensemble
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
base_models = [('linreg', LinearRegression()),
('tree', DecisionTreeRegressor(max_depth=5)),
('knn', KNeighborsRegressor(n_neighbors=20))]
## Final model to aggregate base models in the ensemble
my_final_estimator = DecisionTreeRegressor(max_depth=3)
## Stacked regressor
from sklearn.ensemble import StackingRegressor
reg_stack = StackingRegressor(base_models, final_estimator = my_final_estimator, cv=5)
## Fit and evaluate (internal cross-validation is again used when training the final estimator)
fitted_stack = reg_stack.fit(train_X_ic, train_y_ic)
cv_stacked_preds = fitted_stack.predict(train_X_ic)
from sklearn.metrics import mean_squared_error
print(np.sqrt(mean_squared_error(train_y_ic, cv_stacked_preds)))
23905.17554946077
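For comparison, a VotingRegressor could be built from the same base models; the sketch below (not part of the original lab) uses simple averaging and evaluates it with cross-validated RMSE:
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import cross_val_score
## Average the predictions of the three base models (a weights argument could also be supplied)
vote_reg = VotingRegressor(estimators=base_models)
rmse_scores = -cross_val_score(vote_reg, train_X_ic, train_y_ic,
                               scoring='neg_root_mean_squared_error', cv=5)
print(np.average(rmse_scores))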
Random forests are an ensemble of decision trees built using bootstrapped samples of the training data and random selections of input features at each split. These strategies address two weaknesses of decision trees:
Individual trees are high-variance estimators, so small changes in the training data can lead to very different trees.
Greedy splitting tends to be dominated by a small number of strong predictors, so trees built on similar data are highly correlated with one another.
Using these strategies, random forests contain a diverse set of decision trees because each tree is fit to slightly different training data using different sets of predictors.
The most important tuning parameters for random forests are:
max_depth - the maximum depth of each individual tree.
min_samples_split - the minimum number of data-points in a node for it to be eligible for splitting.
max_features - the number (if an int is given) or the fraction of predictors that are randomly selected for consideration at each split.
Additionally, you should be aware of the n_estimators parameter, which determines the number of trees in the forest. If this value is too low, the forest may not perform optimally, but there is little benefit to increasing it beyond the point where its predictions become stable.
The code below fits a random forest classifier to the SMS spam data we've been working with:
from sklearn.ensemble import RandomForestClassifier
my_forest = RandomForestClassifier(max_depth=3, min_samples_split=10,
max_features=2, n_estimators=200,
random_state=0, oob_score=True)
fitted_forest = my_forest.fit(train_X, train_y)
print(fitted_forest.oob_score_) ## out-of-bag (out-of-sample) classification accuracy
0.9692618353152345
Because bootstrapping will exclude some data-points from each tree, we can use these "out of bag" data-points as new data for the corresponding trees. The result is a performance measure known as "out of bag accuracy", or the oob_score_. To reiterate, this is an out-of-sample performance measure, so we can view it the same way we'd look at a cross-validated performance measure.
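Because the OOB score is cheap to obtain, it also provides a quick way to check the earlier claim that there is little benefit to increasing n_estimators once predictions stabilize. The sketch below (the specific values of n_estimators are arbitrary choices) refits the same forest with different numbers of trees:
## Compare out-of-bag accuracy across forests of different sizes
for n in [10, 50, 200, 500]:
    rf = RandomForestClassifier(max_depth=3, min_samples_split=10, max_features=2,
                                n_estimators=n, random_state=0, oob_score=True)
    print(n, rf.fit(train_X, train_y).oob_score_)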
Unfortunately, classification accuracy is the only oob_score_ that RandomForestClassifier can calculate and return (at least in the current version of sklearn). So we must fall back on cross-validation for any other metric (despite this being an inefficient use of computational resources).
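For example, a cross-validated F1-score for this forest can be obtained in the usual way (a brief sketch reusing functions introduced earlier in the lab):
from sklearn.model_selection import cross_val_score
print(np.average(cross_val_score(my_forest, train_X, train_y, scoring='f1', cv=5)))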
For the exercises at the end of this lab:
Part A: Tune a random forest for the SMS spam training data using the parameters max_depth, min_samples_split, and max_features. You should explore at least 3 different values of max_depth and min_samples_split and at least 2 different values of max_features.
Then, use GridSearchCV to compare the best random forest from Part A and the stacked generalization you created in Part B using the F1-score as your performance metric.