xgboost

This lab introduces the concept of boosting, the xgboost library, and comparisons with other machine learning methods we've covered this semester. xgboost is currently viewed as a state-of-the-art machine learning algorithm for "flat data" or "tabular data", terms that generally refer to data that aren't images, text, audio, video, etc. In fact, many recent Kaggle competitions involving flat data were won by teams using xgboost (or ensembles that include it) combined with careful feature engineering.
Unlike the previous libraries we've used, xgboost is not included in Anaconda, so you'll need to install it yourself. You can do this in the "Environments" tab of the Anaconda Navigator by choosing "Not Installed" libraries, searching for "xgboost", then clicking to install xgboost and its dependencies (there shouldn't be any conflicts with the default Anaconda packages). You may also install xgboost using pip or conda; see here for installation guidelines.
Once you've got the xgboost library installed, the following command should run without producing an error (warning messages are okay):
import xgboost as xgb
C:\Users\millerry\Anaconda3\lib\site-packages\xgboost\compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. from pandas import MultiIndex, Int64Index
Next, we'll also need to load the other libraries we've been using throughout the semester:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import math
## We'll ignore the warnings xgboost gives us
import warnings
warnings.simplefilter(action='ignore')
Examples in this lab will use the well-switching data found below. Each row in this dataset records information about a household in Bangladesh that had previously been using a well with unsafe levels of arsenic. After an intervention, some of these households opted to switch to a safe well, while others did not. The aim of our analysis is to accurately predict the likelihood of a household switching to a safe well.
### Read data
wells = pd.read_csv("https://remiller1450.github.io/data/Wells.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wells, test_size=0.1, random_state=9)
### Separate the outcome and predictors
train_y = (train['switch'] == 'yes').astype(int)
train_X = train.drop(['switch', 'association'], axis="columns")
For the initial parts of the lab, we'll use only the numeric predictors (for simplicity).
Boosting algorithms are a class of ensemble methods that aggregate the predictions of multiple sequentially trained models. The original boosting proposal, the AdaBoost algorithm, builds its sequence of models by weighting each data point. The first base model assigns each observation a weight of $\tfrac{1}{n}$ (equivalent to unweighted model fitting), then successive models increase the weights assigned to the observations with the largest prediction errors, thereby forcing later models to focus their attention on hard-to-predict observations.
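To make the weighting idea concrete, here is a minimal sketch of a single AdaBoost-style weight update for a binary problem. The arrays and the particular update formula are our own illustration (one common form of the update), not code taken from any library.

## Hypothetical illustration of one AdaBoost-style weight update (binary labels)
y    = np.array([1, 0, 1, 1, 0])               # true labels
pred = np.array([1, 1, 1, 0, 0])               # predictions from the current base model
w    = np.full(len(y), 1 / len(y))             # first model: every observation weighted 1/n

miss  = (pred != y)                            # which observations were misclassified
err   = np.sum(w[miss]) / np.sum(w)            # weighted error of this base model
alpha = 0.5 * np.log((1 - err) / err)          # one common form of the model's weight
w     = w * np.exp(alpha * miss)               # up-weight the misclassified observations
w     = w / w.sum()                            # renormalize so the weights sum to 1

After several such updates, the weights concentrate on the observations that earlier base models keep getting wrong.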
While AdaBoost is no longer considered to achieve state-of-the-art performance, it's worth knowing about for its simplicity and history.
We can fit an AdaBoost ensemble of decision trees using AdaBoostClassifier in sklearn:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
ada_model = AdaBoostClassifier(n_estimators=100)
cv_res1 = cross_val_score(ada_model, train_X, train_y, cv=5)
np.average(cv_res1)
0.6269188061965117
When using AdaBoost, we should be aware of the learning_rate argument, which can be used to shrink the contribution of subsequent models in the ensemble. The default learning rate is 1, which allows each base model to contribute to the ensemble proportional to its error rate (see the AdaBoost algorithm in our lecture notes).
## Setup two different AdaBoost models:
ada_model1 = AdaBoostClassifier(n_estimators=100, learning_rate=1.2, algorithm = 'SAMME')
ada_model2 = AdaBoostClassifier(n_estimators=100, learning_rate=0.3, algorithm = 'SAMME')
## Plot weights of each tree in the ensemble:
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.linspace(0, 100, 100), ada_model1.fit(train_X, train_y).estimator_weights_)
ax1.set_title('learning_rate = 1.2')
ax2.plot(np.linspace(0, 100, 100), ada_model2.fit(train_X, train_y).estimator_weights_)
ax2.set_title('learning_rate = 0.3')
plt.show()
As shown above, a smaller learning rate will generally allow for more of the estimators in a boosted ensemble to make noticeable contributions to the final predictions. In this regard, it has been suggested that a smaller learning rate and more iterations may result in better performance, which seems to be true in our example (though the minor difference might not be that compelling):
cv_res1 = cross_val_score(ada_model1, train_X, train_y, cv=5)
print(np.average(cv_res1))
cv_res2 = cross_val_score(ada_model2, train_X, train_y, cv=5)
print(np.average(cv_res2))
0.6214156104430723
0.6291321362799263
By default, AdaBoostClassifier will use decision trees with a maximum depth of 1, but this can be changed. In fact, we can use AdaBoost with any base learner whose implementation allows sample weighting. For example, we can create an ensemble of boosted logistic regression models:
from sklearn.linear_model import LogisticRegression
ada_model_lr = AdaBoostClassifier(LogisticRegression(), n_estimators=100)
cv_res2 = cross_val_score(ada_model_lr, train_X, train_y, cv=5)
np.average(cv_res2)
0.6114810150579568
Or we can change the maximum depth of the decision trees that are fit at each iteration of the algorithm:
from sklearn.tree import DecisionTreeClassifier
ada_model_depth2 = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=100)
cv_res3 = cross_val_score(ada_model_depth2, train_X, train_y, cv=5)
np.average(cv_res3)
0.6008002924926876
Notice how increasing the maximum depth to 2 actually decreased the cross-validated accuracy. This result is not unique or surprising, as boosting algorithms tend to work best with simple base models such as decision tree "stumps", trees with a maximum depth of 1 (i.e., a single split of the data).
Tune the learning_rate and max_depth of an AdaBoost classifier that uses decision trees as its base models. Report the optimal values of each parameter and the accuracy achieved by that model. You may search over any range that you deem appropriate, but you should check at least three values for each tuning parameter.

Gradient boosting was developed as a generalization of AdaBoost. The fundamental idea is to sequentially fit each base model to the negative gradient of the cost function, which equals the ordinary residuals under squared error loss and is known as the pseudo-residuals for other cost functions.
While AdaBoost was originally developed for binary classification, gradient boosting is easily applied to both classification and regression problems, and has been found to generally outperform AdaBoost. If exponential loss is used for a binary classification task, gradient boosting is equivalent to AdaBoost.
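To make the residual-fitting idea concrete, here is a minimal sketch of gradient boosting for regression under squared-error loss, written with shallow sklearn regression trees. It treats our 0/1 outcome as a numeric target purely for illustration, and the settings (learning rate, number of trees, stump depth) are arbitrary choices rather than anything prescribed by sklearn or xgboost.

from sklearn.tree import DecisionTreeRegressor

## Minimal gradient boosting sketch: under squared-error loss the negative gradient is the residual
X_mat, y_vec = train_X.to_numpy(), train_y.to_numpy()
learning_rate, n_estimators = 0.1, 100

pred = np.full(len(y_vec), y_vec.mean())          # start from a constant prediction
trees = []
for _ in range(n_estimators):
    residuals = y_vec - pred                      # pseudo-residuals for squared-error loss
    tree = DecisionTreeRegressor(max_depth=1).fit(X_mat, residuals)
    pred += learning_rate * tree.predict(X_mat)   # shrink each tree's contribution
    trees.append(tree)

A prediction for a new observation would then be the initial constant plus the learning rate times the sum of each stored tree's prediction.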
The functions and arguments used to fit a gradient boosted model in sklearn are very similar to those in the previous section. Below we fit a gradient boosted classifier and find its cross-validated accuracy:
from sklearn.ensemble import GradientBoostingClassifier
gb_mod = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb_mod_res = cross_val_score(gb_mod, train_X, train_y, cv=5)
np.average(gb_mod_res)
0.6394370869894919
GradientBoostingClassifier uses decision tree base learners, and the max_features argument controls how many predictors are considered when splitting each node. Further, the subsample argument allows a random sample of the observations to be used at each iteration, which can also reduce the potential for overfitting.
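As a quick illustration (the specific values below are arbitrary and not recommendations), both forms of subsampling can be requested directly:

## Gradient boosting with both types of subsampling (values chosen only for illustration)
gb_mod_sub = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                        max_features=2,   # predictors considered at each split
                                        subsample=0.8)    # fraction of observations used per tree
np.average(cross_val_score(gb_mod_sub, train_X, train_y, cv=5))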
Refit the model using max_features = 1. Then, find the importance of each predictor when the model is fit to the training data. How do these results differ from what you saw in Part A? Why does changing max_features to 1 lead to this type of change? Briefly explain.

xgboost
The xgboost library provides an implementation of gradient boosting using decision tree base learners, along with a few additional features that can improve both predictive performance and computational efficiency. In terms of model performance, xgboost offers L1, L2, and elastic net regularization options. In terms of computational efficiency, xgboost was designed for parallel computation (including GPU support) and can exploit special cases we haven't considered, such as sparse data matrices.
When training an xgboost model, we should consider the following tuning parameters:
- learning_rate - The shrinkage applied to the contribution of each base model at each boosting iteration, range: $[0, 1]$
- gamma (or min_split_loss) - The minimum improvement in cost required for a node to be split within a tree. Larger values prevent overfitting, range: $[0, \infty)$
- reg_alpha and reg_lambda - Regularization applied to the estimator weights. reg_alpha provides L1 regularization, which encourages sparsity, while reg_lambda provides L2 regularization, which encourages weights to be closer to zero. Specifying both imposes the elastic net penalty. These parameters accept values in $[0, \infty)$, and the optimal value of each is highly specific to the application.
- max_depth - The maximum depth of the trees. Higher values increase the likelihood of overfitting but provide increased flexibility, range: $[0, \infty)$. Generally, depths of 1 or 2 tend to perform best.
- min_child_weight - The minimum sum of weight needed in a node for it to be further partitioned. If each data point is weighted equally (which is often the case), this is equivalent to min_samples_split in other functions. Larger values prevent overfitting but reduce flexibility, range: $[0, \infty)$
- n_estimators - The number of trees in the ensemble (i.e., the number of boosting iterations). Larger values tend to work best when combined with smaller learning rates (and vice versa).
- colsample_bytree - The fraction of columns (predictors) to randomly sample for use in a tree.
- colsample_bylevel - The fraction of columns (predictors) to randomly sample for use in a level (depth) of a tree.
- colsample_bynode - The fraction of columns (predictors) to randomly sample for use in a single split (node) in a tree.

For advanced uses, additional tuning parameters are described here.
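To make the list above concrete, here is a sketch showing how several of these arguments might be passed to XGBClassifier. The particular values are arbitrary and chosen only for illustration; in practice they should be tuned.

## Illustrative XGBClassifier with several tuning parameters set (values are arbitrary)
xgb_mod = xgb.XGBClassifier(n_estimators=200,
                            learning_rate=0.05,
                            max_depth=2,
                            gamma=0.5,              # minimum loss reduction required to split a node
                            reg_alpha=1.0,          # L1 regularization
                            reg_lambda=1.0,         # L2 regularization
                            min_child_weight=5,
                            colsample_bytree=0.8,
                            eval_metric='error')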
To begin, we'll note that xgboost models are compatible with sklearn pipelines (and many other sklearn functions):
### Read data
wells = pd.read_csv("https://remiller1450.github.io/data/Wells.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wells, test_size=0.1, random_state=9)
### Separate the outcome and predictors
train_y = (train['switch'] == 'yes').astype(int)
train_X = train.drop('switch', axis=1)
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
## Separate pipelines for numeric vs. categorical variables
num_transformer = Pipeline([("scaler", MinMaxScaler())])
cat_transformer = Pipeline([("encoder", OneHotEncoder(sparse=False, handle_unknown='ignore'))])
## Column ids
cat_cols = ['association']
num_cols = ['arsenic', 'distance', 'education']
## Pre-processor
preprocessor = ColumnTransformer([
('num', num_transformer, num_cols),
('cat', cat_transformer, cat_cols)
])
## Final pipeline
pipe = Pipeline([
('preprocessor', preprocessor),
('model', xgb.XGBClassifier(eval_metric='error'))
])
## Accuracy with the default parameters
np.average(cross_val_score(pipe, train_X, train_y, cv = 5))
0.6004387390315242
We can apply familiar methods like cross-validated grid search using this pipeline. For the purposes of this lab, we'll use this as an opportunity to explore the roles of different tuning parameters:
from sklearn.model_selection import GridSearchCV
## Tree depths to try:
pars = {'model__max_depth': [1,2,3,5,8,10]}
## Grid search
grid_results = GridSearchCV(pipe, pars, cv = 5).fit(train_X, train_y)
## Plot results
mean_accs = grid_results.cv_results_['mean_test_score']
plt.plot(list(pars.values())[0], mean_accs)
plt.show()
Here we see that cross-validated accuracy tends to decline significantly when deeper trees are used. We've seen this already in our other boosted classifiers, but it's nice to see the same finding when using xgboost.
Next, we might wonder whether we can offset the overfitting created by deeper trees by imposing regularization.
## New parm grid
pars = {'model__reg_alpha': [0,1.5, 5],
'model__max_depth': [1,2,3,5,8]}
## Grid search
grid_results = GridSearchCV(pipe, pars, cv = 5).fit(train_X, train_y)
## Plot results
mean_accs = grid_results.cv_results_['mean_test_score']
##
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20,3))
depths = [1,2,3,5,8]
ax1.plot(depths, mean_accs[range(0, len(mean_accs),3)])
ax1.set_title('Alpha = 0')
ax2.plot(depths, mean_accs[range(1, len(mean_accs),3)])
ax2.set_title('Alpha = 1.5')
ax3.plot(depths, mean_accs[range(2, len(mean_accs),3)])
ax3.set_title('Alpha = 5')
plt.show()
It doesn't seem like regularization is an effective strategy for overcoming the overfitting that occurs when deep trees are used. Overall, these results suggest we're probably best off sticking to shallow trees (depths of 1 or 2) combined with modest levels of regularization.
At this point we've seen a couple of examples of model tuning. It's important to recognize that xgboost provides a large number of tuning parameter options, and you can improve performance by optimizing them. However, in most applications you can see an even greater improvement in performance by investing more time and effort into feature engineering, and worrying less about finding the perfect combination of tuning parameters.
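As a purely hypothetical illustration of what feature engineering could look like here (the derived column below, and any benefit from it, are our own assumptions rather than results from this lab), we might construct a new predictor from the existing ones and re-run the same cross-validated pipeline:

## Hypothetical engineered feature: arsenic level scaled by distance to the nearest safe well
train_X_fe = train_X.copy()
train_X_fe['arsenic_per_distance'] = train_X_fe['arsenic'] / (train_X_fe['distance'] + 1)

## Reuse the earlier pre-processing, adding the new column to the numeric transformer
num_cols_fe = num_cols + ['arsenic_per_distance']
preprocessor_fe = ColumnTransformer([
    ('num', num_transformer, num_cols_fe),
    ('cat', cat_transformer, cat_cols)
])
pipe_fe = Pipeline([
    ('preprocessor', preprocessor_fe),
    ('model', xgb.XGBClassifier(eval_metric='error'))
])
np.average(cross_val_score(pipe_fe, train_X_fe, train_y, cv=5))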
xgboost challenge

## Read the data
ames = pd.read_csv("https://remiller1450.github.io/data/AmesHousing.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train_ames, test_ames = train_test_split(ames, test_size=0.2, random_state=9)
## Separate y and X
train_ames_y = train_ames['SalePrice']
train_ames_X = train_ames.drop('SalePrice', axis=1)
## We'll remove any variables with more than 1000 missing values since they have too much missing data to reliably impute.
train_ames_X = train_ames_X[train_ames_X.columns[train_ames_X.isna().sum() < 1000]]
## We'll also remove variables that shouldn't be used for prediction
train_ames_X = train_ames_X.drop(['Order','PID','MS.SubClass', 'Yr.Sold','Mo.Sold'],axis=1)
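If you'd like a starting point, the skeleton below is only a sketch under our own assumptions about reasonable preprocessing (median and most-frequent imputation for the remaining missing values, one-hot encoding for the categorical columns, and an XGBRegressor with default settings); it is not a prescribed solution.

from sklearn.impute import SimpleImputer

## One possible starting pipeline for the challenge (all preprocessing choices are assumptions)
num_tr = Pipeline([('imputer', SimpleImputer(strategy='median'))])
cat_tr = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                   ('encoder', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
prep = ColumnTransformer([
    ('num', num_tr, make_column_selector(dtype_include=np.number)),
    ('cat', cat_tr, make_column_selector(dtype_exclude=np.number))
])
ames_pipe = Pipeline([
    ('preprocessor', prep),
    ('model', xgb.XGBRegressor())
])
np.average(cross_val_score(ames_pipe, train_ames_X, train_ames_y, cv=5))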