Lab 7 - Boosting and xgboost¶

This lab introduces the concept of boosting, the xgboost library, and comparisons with other machine learning methods we've covered this semester. xgboost is currently viewed as a state-of-the-art machine learning algorithm for "flat" or "tabular" data, which generally refers to data that aren't images, text, audio, video, etc. In fact, many recent Kaggle competitions involving flat data were won by teams using xgboost (or ensembles that include it) combined with careful feature engineering.

Unlike previous libraries we've used, xgboost is not included in Anaconda, so you'll need to install it yourself. You can do this in the "Environment" tab of the Anaconda Navigator by choosing "Not Installed" libraries, searching for "xgboost", then clicking to install xgboost and its dependencies (there shouldn't be any conflicts with the default Anaconda packages). You may also install xgboost using pip or conda; see here for the installation guidelines.
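Alternatively, a quick way to install from within a notebook cell is shown below (assuming pip or conda is available in your environment; the leading ! runs the command in a shell):

!pip install xgboost
## or, using conda:
## !conda install -y -c conda-forge xgboost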

Once you've got the xgboost library installed, the following command should run without producing an error (warning messages are okay):

In [1]:
import xgboost as xgb
C:\Users\millerry\Anaconda3\lib\site-packages\xgboost\compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index

Next, we'll still need to load the other libraries we've been using throughout the semester:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math

## We'll ignore the warnings xgboost gives us
import warnings
warnings.simplefilter(action='ignore')

Examples in this lab will use the well-switching data found below. Each row in this dataset records information about a household in Bangladesh that had previously been using a well with unsafe levels of arsenic. After an intervention, some of these households opted to switch to a safe well, while others did not. The aim of our analysis is to accurately predict the likelihood of a household switching to a safe well.

In [3]:
### Read data
wells = pd.read_csv("https://remiller1450.github.io/data/Wells.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wells, test_size=0.1, random_state=9)

### Separate the outcome and predictors
train_y = (train['switch'] == 'yes').astype(int)
train_X = train.drop(['switch', 'association'], axis="columns")

For the initial parts of the lab, we'll use only the numeric predictors (for simplicity).

Part 1 - AdaBoost¶

Boosting algorithms are a class of ensemble methods that aggregate the predictions of multiple sequentially trained models. The original boosting proposal, the AdaBoost algorithm, builds sequential models by re-weighting each data point. The first base model assigns each observation a weight of $\tfrac{1}{n}$ (equivalent to unweighted model fitting), then successive models increase the weights assigned to the observations with the largest prediction errors, thereby forcing later models to focus their attention on hard-to-predict observations.

While AdaBoost is no longer considered to achieve state-of-the-art performance, it's worth knowing about for its simplicity and history.
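To make the re-weighting idea concrete, here is a minimal sketch of a single AdaBoost-style iteration using a decision stump. The variable names (weights, stump, alpha) are our own, and the sketch simplifies the bookkeeping that sklearn performs internally:

from sklearn.tree import DecisionTreeClassifier

n = len(train_y)
weights = np.ones(n) / n                       ## every observation starts with weight 1/n

## Fit a "stump" (depth-1 tree) using the current weights
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(train_X, train_y, sample_weight=weights)

## Weighted error rate of this stump (assumes 0 < err < 1)
miss = (stump.predict(train_X) != train_y.to_numpy())
err = np.sum(weights[miss]) / np.sum(weights)

## The stump's weight in the ensemble, and the updated observation weights
alpha = np.log((1 - err) / err)
weights[miss] *= np.exp(alpha)                 ## misclassified points get larger weights
weights /= np.sum(weights)                     ## renormalize before the next iteration

Repeating these steps, with each new stump fit using the updated weights, produces the boosted ensemble.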

We can fit an AdaBoost model with decision tree base learners using AdaBoostClassifier in sklearn:

In [4]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
ada_model = AdaBoostClassifier(n_estimators=100)
cv_res1 = cross_val_score(ada_model, train_X, train_y, cv=5)
np.average(cv_res1)
Out[4]:
0.6269188061965117

When using AdaBoost, we should be aware of the learning_rate argument, which can be used to shrink the contribution of subsequent models in the ensemble. The default learning rate is 1, which allows each base model's contribution to the ensemble to be determined by its weighted error rate (see the AdaBoost algorithm in our lecture notes).

In [5]:
## Setup two different AdaBoost models:
ada_model1 = AdaBoostClassifier(n_estimators=100, learning_rate=1.2, algorithm = 'SAMME')
ada_model2 = AdaBoostClassifier(n_estimators=100, learning_rate=0.3, algorithm = 'SAMME')

## Plot weights of each tree in the ensemble:
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.linspace(0, 100, 100), ada_model1.fit(train_X, train_y).estimator_weights_)
ax1.set_title('learning_rate = 1.2')
ax2.plot(np.linspace(0, 100, 100), ada_model2.fit(train_X, train_y).estimator_weights_)
ax2.set_title('learning_rate = 0.3')
plt.show()

As shown above, a smaller learning rate will generally allow for more of the estimators in a boosted ensemble to make noticeable contributions to the final predictions. In this regard, it has been suggested that a smaller learning rate and more iterations may result in better performance, which seems to be true in our example (though the minor difference might not be that compelling):

In [6]:
cv_res1 = cross_val_score(ada_model1, train_X, train_y, cv=5)
print(np.average(cv_res1))

cv_res2 = cross_val_score(ada_model2, train_X, train_y, cv=5)
print(np.average(cv_res2))
0.6214156104430723
0.6291321362799263

By default, AdaBoostClassifier will use decision trees with a maximum depth of 1, but this can be changed.

In fact, we can use AdaBoost with any base learner whose implementation allows sample weighting. For example, we can create an ensemble of boosted logistic regression models:

In [7]:
from sklearn.linear_model import LogisticRegression
ada_model_lr = AdaBoostClassifier(LogisticRegression(), n_estimators=100)
cv_res2 = cross_val_score(ada_model_lr, train_X, train_y, cv=5)
np.average(cv_res2)
Out[7]:
0.6114810150579568

Or we can change the maximum depth of the decision trees that are fit at each iteration of the algorithm:

In [8]:
from sklearn.tree import DecisionTreeClassifier
ada_model_depth2 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=100)
cv_res3 = cross_val_score(ada_model_depth2, train_X, train_y, cv=5)
np.average(cv_res3)
Out[8]:
0.6008002924926876

Notice how increasing the maximum depth to 2 actually decreased the cross-validated accuracy. This result is not unique or surprising, as boosting algorithms tend to work best with simple base models like decision tree "stumps" with a maximum depth of 1 (i.e., single splits of the data).

Question #1¶

  • Using classification accuracy as the scoring criterion, use a cross-validated grid search to tune the learning_rate and max_depth of an AdaBoost classifier that uses decision trees as its base models. Report the optimal values of each parameter and the accuracy achieved by that model. You may search over any range that you deem appropriate, but you should check at least three values for each tuning parameter.

Part 2 - Gradient Boosting¶

Gradient boosting was developed as a generalization of AdaBoost. The fundamental idea is to sequentially fit each base model to the negative gradient of the loss function, which corresponds to the residuals under squared error loss and to pseudo-residuals for other loss functions.

While AdaBoost was originally developed for binary classification, gradient boosting is easily applied to both classification and regression problems, and has been found to generally outperform AdaBoost. If exponential loss is used for a binary classification task, gradient boosting is equivalent to AdaBoost.
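To illustrate the idea, below is a minimal hand-rolled sketch of gradient boosting with log loss on our training data. The variable names are our own, and the sketch skips the leaf-value optimization that sklearn and xgboost perform, so it's meant for intuition rather than actual use:

from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1
y = train_y.to_numpy()
raw_pred = np.zeros(len(y))                   ## ensemble prediction on the log-odds scale

for b in range(100):
    prob = 1 / (1 + np.exp(-raw_pred))        ## current predicted probabilities
    pseudo_resid = y - prob                   ## negative gradient of the log loss
    tree = DecisionTreeRegressor(max_depth=1) ## fit a small regression tree to the pseudo-residuals
    tree.fit(train_X, pseudo_resid)
    raw_pred += learning_rate * tree.predict(train_X)  ## shrunken update to the ensemble

## Training accuracy of this sketch (for illustration only)
np.mean((raw_pred > 0).astype(int) == y)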

The functions and arguments used to fit a gradient boosted model in sklearn are very similar to those in the previous section. Below we fit a gradient boosted classifier and find its cross-validated accuracy:

In [9]:
from sklearn.ensemble import GradientBoostingClassifier
gb_mod = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb_mod_res = cross_val_score(gb_mod, train_X, train_y, cv=5)
np.average(gb_mod_res)
Out[9]:
0.6394370869894919

GradientBoostingClassifier fits a decision tree at each boosting iteration. By default every predictor is considered at each split, but predictor subsampling can be enabled via the max_features argument. The function also allows subsampling of observations through the subsample argument, which fits each tree to a random sample of the data and can further reduce the potential for overfitting.
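For example, a model that uses both kinds of subsampling might be specified as shown below (the specific values are arbitrary illustrations, not recommendations):

gb_sub = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                    max_features=2,   ## consider 2 randomly chosen predictors per split
                                    subsample=0.5)    ## fit each tree to a random 50% of the observations
np.average(cross_val_score(gb_sub, train_X, train_y, cv=5))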

Question #2¶

  • Part A - Fit a gradient boosted model using 100 decision trees with a maximum depth of 1 and a learning rate of 0.1 to training data. Then, extract the importance of each predictor (note: feature importance is explained at the end of Lab 6, part 1, and you may also consult the documentation for details) and plot these values using a bar chart.
  • Part B - Introduce subsampling of predictors by using the argument max_features = 1. Then, find the importance of each predictor when the model is fit to the training data. How do these results differ from what you saw in Part A? Why does changing max_features to 1 lead to this type of change? Briefly explain.
  • Part C - Use a cross-validated grid search to assess whether subsampling of predictors, subsampling of observations, subsampling of both, or subsampling of neither yields the best classifier. You may keep the number of base estimators fixed at 100 and the learning rate fixed at 0.1.

Part 3 - Introduction to xgboost¶

The xgboost library provides an implementation of gradient boosting using decision tree base learners with a few additional features that can lead to improved performance (both predictive performance and computational efficiency). In terms of predictive performance, xgboost offers L1, L2, and elastic net regularization options. In terms of computational efficiency, xgboost was designed for parallel computation (with optional GPU support) and can exploit special cases that we haven't considered, such as sparse data matrices.

When training an xgboost model, we should consider the following tuning parameters (a brief example using several of them appears after this list):

  • learning_rate - The shrinkage applied to the estimator weights at each boosting iteration (learning rate), range: $[0,1]$
  • gamma (or min_split_loss) - The minimum improvement in cost required for a node to be split within a tree. Larger values prevent overfitting, range: $[0,\infty]$
  • reg_alpha and reg_lambda - Regularization applied to the estimator weights. reg_alpha provides L1 regularization, which encourages sparsity, while reg_lambda provides L2 regularization, which encourages weights to be closer to zero. Specifying both imposes the elastic net penalty. These parameters accept values in $[0, \infty]$, and the optimal value of each is highly specific to the application.
  • max_depth - The maximum depth of trees. Higher values increase the likelihood of overfitting but provide increased flexibility, range: $[0, \infty]$. Generally depths of 1 or 2 tend to perform best.
  • min_child_weight - The minimum sum of weight needed in a node for it to be further partitioned. If each data-point is weighted equally (which is often the case), this is equivalent to min_samples_split in other functions. Larger values prevent overfitting but reduce flexibility, range $[0, \infty]$
  • n_estimators - The number of trees in the ensemble (i.e., the number of boosting iterations). Larger values tend to work best when combined with smaller learning rates (and vice-versa).
  • colsample_bytree - The fraction of columns (predictors) to randomly sample for use in a tree.
  • colsample_bylevel - The fraction of columns (predictors) to randomly sample for use in a level (depth) of a tree.
  • colsample_bynode - The fraction of columns (predictors) to randomly sample for use in a single split (node) in a tree.

For advanced uses, additional tuning parameters are described here.
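As a point of reference, a sketch of how several of the parameters listed above might be specified is shown below; the particular values are arbitrary and would normally be chosen via cross-validation:

example_xgb = xgb.XGBClassifier(n_estimators=200, learning_rate=0.05,
                                max_depth=2, min_child_weight=5, gamma=0.1,
                                reg_alpha=0.0, reg_lambda=1.0,
                                colsample_bytree=0.8, eval_metric='error')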

To begin, we'll note that xgboost models are compatible with sklearn pipelines (and many other sklearn functions):

In [10]:
### Read data
wells = pd.read_csv("https://remiller1450.github.io/data/Wells.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wells, test_size=0.1, random_state=9)

### Separate the outcome and predictors
train_y = (train['switch'] == 'yes').astype(int)
train_X = train.drop('switch', axis=1)

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score 
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer, make_column_selector

## Separate pipelines for numeric vs. categorical vars
num_transformer = Pipeline([("scaler", MinMaxScaler())])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])

## Column ids
cat_cols = ['association']
num_cols = ['arsenic', 'distance', 'education']

## Pre-processor
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, cat_cols)
])

## Final pipeline
pipe = Pipeline([
('preprocessor', preprocessor),
('model', xgb.XGBClassifier(eval_metric='error'))
])

## Accuracy with the default parms
np.average(cross_val_score(pipe, train_X, train_y, cv = 5))
Out[10]:
0.6004387390315242

We can apply familiar methods like cross-validated grid search using this pipeline.

For the purposes of this lab, we'll use this as an opportunity to explore the roles of different tuning parameters:

In [11]:
from sklearn.model_selection import GridSearchCV

## Tree depths to try:
pars = {'model__max_depth': [1,2,3,5,8,10]}

## Grid search
grid_results = GridSearchCV(pipe, pars, cv = 5).fit(train_X, train_y)

## Plot results
mean_accs = grid_results.cv_results_['mean_test_score']
plt.plot(list(pars.values())[0], mean_accs)
plt.show()

Here we see that cross-validated accuracy tends to decline significantly when deeper trees are used. We've seen this already in our other boosted classifiers, but it's nice to see the same finding when using xgboost.

Next, we might wonder whether we can offset the overfitting created by deeper trees by imposing regularization:

In [12]:
## New parm grid
pars = {'model__reg_alpha': [0,1.5, 5],
        'model__max_depth': [1,2,3,5,8]}

## Grid search
grid_results = GridSearchCV(pipe, pars, cv = 5).fit(train_X, train_y)
In [13]:
## Plot results
mean_accs = grid_results.cv_results_['mean_test_score']

## Plot accuracy vs. max_depth for each value of alpha
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 3))
depths = [1,2,3,5,8] 
ax1.plot(depths, mean_accs[range(0, len(mean_accs),3)])
ax1.set_title('Alpha = 0')
ax2.plot(depths, mean_accs[range(1, len(mean_accs),3)])
ax2.set_title('Alpha = 1.5')
ax3.plot(depths, mean_accs[range(2, len(mean_accs),3)])
ax3.set_title('Alpha = 5')
plt.show()

It doesn't seem like regularization is an effective strategy for overcoming the overfitting that occurs when deep trees are used. Overall, these results suggest we might be best sticking to shallow trees (depths of 1 or 2) combined with modest levels of regularization.

At this point we've seen a couple of examples of model tuning. It's important to recognize that xgboost provides a large number of tuning parameters, and optimizing them can improve performance. However, in most applications you'll see an even greater improvement by investing more time and effort in feature engineering, and worrying less about finding the perfect combination of tuning parameters.

Question #3 (the xgboost challenge)¶

  • Lab 5 (part 1) introduced the Ames Housing dataset. The goal of this challenge is to accurately predict sale prices on the test set created below. You are welcome to use any strategy that does not rely on information that could not be known before a home sells (such as the sale date).
  • You should use the code given below as a starting point. It will remove some variables with large amounts of missing data, and some variables that are not useful predictors.
  • You should try to find a model that achieves a benchmark RMSE of at least $16,600 (cross-validated on the training data) to expect full credit on this question. This benchmark was achieved using modest amounts of hyperparameter tuning. You are welcome to use feature engineering to help you beat the benchmark.
  • Additionally, you are expected to report the RMSE of your final model on the test data.
In [14]:
## Read the data
ames = pd.read_csv("https://remiller1450.github.io/data/AmesHousing.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train_ames, test_ames = train_test_split(ames, test_size=0.2, random_state=9)

## Separate y and X
train_ames_y = train_ames['SalePrice']
train_ames_X = train_ames.drop('SalePrice', axis=1)

## We'll remove any variables with 1000 or more missing values since they have too much missing data to reliably impute.
train_ames_X = train_ames_X[train_ames_X.columns[train_ames_X.isna().sum() < 1000]]

## We'll also remove variables that shouldn't be used for prediction
train_ames_X = train_ames_X.drop(['Order','PID','MS.SubClass', 'Yr.Sold','Mo.Sold'],axis=1)
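When you evaluate your final model, you'll need the test set organized the same way. A minimal sketch (assuming you keep exactly the columns retained in the training data) might look like:

## Separate y and X in the test set, keeping only the columns used for training
test_ames_y = test_ames['SalePrice']
test_ames_X = test_ames[train_ames_X.columns]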