xgboost
This lab introduces gradient boosting methods with a focus on the xgboost library.
Currently, xgboost is viewed as a state-of-the-art machine learning algorithm for flat/tabular data. While we will not cover them in this lab because of their heavy overlap with xgboost, it is worth knowing about the following gradient boosting libraries:
CatBoost - A library developed by the Russian company Yandex that handles categorical predictors more effectively than other gradient boosting implementations - here is a link to the technical paper
lightGBM - A gradient boosting library developed by Microsoft that focuses on speed and efficiency - here is a link to the technical paper
Unlike the libraries we've been using so far, xgboost is not a default inclusion in Anaconda, so you'll need to install it yourself. You can do this using the "Environment" tab of the Anaconda Navigator by choosing "Not Installed" libraries, searching for "xgboost", then clicking to install xgboost and its dependencies (there shouldn't be any conflicts with the default Anaconda packages). You may also install xgboost using pip or conda; see the official installation guide.
If you have xgboost installed, you should be able to load the library without an error:
import warnings # We will turn off the future warnings that xgboost gives us
warnings.simplefilter(action='ignore', category=FutureWarning)
import xgboost as xgb
We'll also need to load our standard libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
The examples in this lab will use data from a public health study conducted in Araihazar upazila, Bangladesh. In this study, researchers tested water from community wells for arsenic and encouraged households that were using unsafe wells to switch to a nearby well that had tested as safe. Several years after their initial testing, the researchers revisited each household they had encouraged to switch to a safe well and recorded which of these households had done so, an outcome of switch = 'yes'.
### Read data
wells = pd.read_csv("https://remiller1450.github.io/data/Wells.csv")
## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wells, test_size=0.1, random_state=9)
### Separate the outcome and predictors (dropping the "association" predictor)
train_y = (train['switch'] == 'yes').astype(int)
train_X = train.drop(['switch', 'association'], axis="columns")
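The held-out test split will eventually need the same treatment as the training split. A minimal sketch of one way to do this is shown below, wrapping the two steps above in a helper so both splits are processed identically. The helper name `split_outcome` and the small made-up data frame (including its `arsenic` column) are hypothetical and only stand in for the real `test` data.

```python
import pandas as pd

def split_outcome(df, outcome="switch", drop=("association",)):
    """Return (X, y) using the same recipe applied to the training data."""
    y = (df[outcome] == "yes").astype(int)            # 1 = switched, 0 = did not
    X = df.drop([outcome, *drop], axis="columns")     # remove outcome and unused predictors
    return X, y

# Tiny made-up frame standing in for the held-out `test` split
toy = pd.DataFrame({
    "switch": ["yes", "no", "yes"],
    "arsenic": [2.4, 0.7, 1.1],          # hypothetical predictor values
    "association": ["no", "yes", "no"],
})
toy_X, toy_y = split_outcome(toy)
print(toy_X.columns.tolist(), toy_y.tolist())
```

With the real data, the call would be `split_outcome(test)`.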
sklearn
The AdaBoost algorithm is regarded as the first successful boosting method. Modern gradient boosting approaches are generalizations of AdaBoost, making it worthwhile to have some familiarity with AdaBoost for its historical importance and easy-to-understand framework.
AdaBoost sequentially builds an ensemble classifier using decision trees. The first tree in the ensemble is fit with each data-point given an equal weight of $\tfrac{1}{n}$. Each subsequent tree is fit using weights that are updated according to the prediction errors of the previous tree, so that misclassified data-points are given more attention.
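The reweighting step can be sketched by hand for a single boosting round. The code below is an illustration under stated assumptions rather than a copy of sklearn's internals: it uses synthetic data from `make_classification` instead of the wells data, a depth-1 tree as the base model, and the two-class form of the discrete AdaBoost (SAMME) update with a learning rate of 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the wells data); flip_y adds label noise
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)
n = len(y)

w = np.full(n, 1 / n)                        # first tree: equal weights of 1/n
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y, sample_weight=w)

miss = stump.predict(X) != y                 # True where the stump is wrong
err = np.sum(w[miss])                        # weighted error rate of this tree
alpha = np.log((1 - err) / err)              # this tree's weight in the ensemble

w_next = w * np.exp(alpha * miss)            # up-weight only the misclassified points
w_next = w_next / w_next.sum()               # renormalize so the weights sum to 1

print(f"weighted error = {err:.3f}, tree weight = {alpha:.3f}")
print(f"weight if misclassified = {w_next[miss][0]:.5f}, "
      f"weight if correct = {w_next[~miss][0]:.5f}")
```

The next tree would then be fit with `sample_weight=w_next`, so the points the first stump got wrong count for more.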
An essential parameter in the AdaBoost algorithm is the learning rate, which scales how much weight is given to each base model at each boosting iteration. The default value is 1, which leaves each base model's weight determined entirely by its weighted error rate (lower error means a larger weight), while smaller values shrink every base model's weight:
from sklearn.ensemble import AdaBoostClassifier
## Setup two different AdaBoost models:
ada_model1 = AdaBoostClassifier(n_estimators=100, learning_rate=1, algorithm = 'SAMME')
ada_model2 = AdaBoostClassifier(n_estimators=100, learning_rate=0.2, algorithm = 'SAMME')
## Plot weights of each tree in the ensemble:
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(np.arange(1, 101), ada_model1.fit(train_X, train_y).estimator_weights_)
ax1.set_title('learning_rate = 1')
ax2.plot(np.arange(1, 101), ada_model2.fit(train_X, train_y).estimator_weights_)
ax2.set_title('learning_rate = 0.2')
plt.show()
One implication of setting a lower learning rate is that more of the base learners will make noticeable contributions to the ensemble (right plot), whereas choosing a higher learning rate tends to allow the base learners that are earlier parts of the ensemble to be primarily responsible for the ensemble's predictions.
There is an inherent tradeoff between the learning rate and the number of boosting iterations that produce optimal performance. Smaller learning rates should be paired with more boosting iterations so that the ensemble can continue to improve. It's generally believed that a smaller learning rate paired with more iterations tends to generalize better, at the cost of longer training.
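One way to look at this tradeoff without refitting a model for every ensemble size is the `staged_score()` method of `AdaBoostClassifier`, which reports accuracy after each boosting iteration of a single fitted model. The sketch below assumes synthetic data and a simple hold-out split; the two learning rates are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the wells data)
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.1, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

curves = {}
for rate in [0.05, 1.0]:
    model = AdaBoostClassifier(n_estimators=200, learning_rate=rate, random_state=1)
    model.fit(X_tr, y_tr)
    # staged_score yields the validation accuracy after 1, 2, ..., k trees
    curves[rate] = np.array(list(model.staged_score(X_val, y_val)))

for rate, acc in curves.items():
    print(f"learning_rate={rate}: best accuracy {acc.max():.3f} "
          f"reached after {acc.argmax() + 1} trees")
```

Plotting each entry of `curves` against the iteration number shows how quickly each learning rate reaches its best validation accuracy.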
For a fixed number of boosting iterations, we can see the implications of using different learning rates:
from sklearn.model_selection import cross_val_score
cv_res = []
rates_to_try = [0.01,0.1,0.2,0.5,1,2]
for rate in rates_to_try:
    ada_model_temp = AdaBoostClassifier(n_estimators=100, learning_rate=rate, algorithm = 'SAMME')
    cv_res.append(np.average(cross_val_score(ada_model_temp, train_X, train_y, cv=5, scoring='accuracy')))
## Plot of CV accuracy
plt.plot(rates_to_try, cv_res)
plt.show()
Learning rates that are too small lead to biased models that never fully learn all of the patterns present in the data, while learning rates that are too large lead to overfit models.
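A rough way to tell these two failure modes apart is to compare training accuracy with cross-validated accuracy: low values on both suggest underfitting, while a large gap between them suggests overfitting. The sketch below assumes synthetic data and uses `cross_validate()` with `return_train_score=True`; the learning rates tried are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (not the wells data)
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=2)

gaps = {}
for rate in [0.01, 0.5, 2.0]:
    model = AdaBoostClassifier(n_estimators=100, learning_rate=rate, random_state=2)
    res = cross_validate(model, X, y, cv=5, scoring="accuracy", return_train_score=True)
    train_acc, val_acc = res["train_score"].mean(), res["test_score"].mean()
    gaps[rate] = (train_acc, val_acc)      # store both so they can be compared or plotted
    print(f"rate={rate}: train={train_acc:.3f}, validation={val_acc:.3f}")
```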
Now let's see what happens if we fix the learning rate at 0.5 (what seemed to work best in the exploration above) and vary the number of boosting iterations:
cv_res = []
n_iter_to_try = [10,100,200,300,600]
for it in n_iter_to_try:
    ada_model_temp = AdaBoostClassifier(n_estimators=it, learning_rate=0.5, algorithm = 'SAMME')
    cv_res.append(np.average(cross_val_score(ada_model_temp, train_X, train_y, cv=5, scoring='accuracy')))
## Plot of CV accuracy
plt.plot(n_iter_to_try, cv_res)
plt.show()
We can see that there's a clear relationship between these two tuning parameters. However, it's difficult to assess what is a large or a small learning rate (or a large or small number of boosting iterations) from one application to another.
Question 1:
GridSearchCV() to evaluate the cross-validated F1 scores of each combination of the following:
AdaBoostClassifier() will use decision trees with a maximum depth of 1 as base models, but you can change this using the estimator argument (i.e., estimator = DecisionTreeClassifier(max_depth=2) uses decision trees with a max depth of 2 as base models). Add a second line to your line chart in Part B that shows the performance of decision trees with a max depth of 2 for the same tuning parameter values you used in Part B.
xgboost
The explorations in Part 1 of the lab illustrate the tradeoff between the two most important hyperparameters in boosting ensembles, the learning rate and the number of boosting iterations. The motivation behind the xgboost library was to provide an efficient implementation of gradient boosting with numerous hyperparameters that could be customized. Additionally, xgboost was designed to be able to run on a GPU using parallelization and to accommodate sparse data matrices (a data structure we will not discuss). These features have made xgboost widely used, though it is fundamentally just an implementation of gradient boosting.
When training an xgboost model, you should carefully consider the following hyperparameters (listed roughly in their order of importance):
The following parameters influence the ensemble-building process and the use of the ensemble as a whole:
learning_rate - The shrinkage applied to estimates at each boosting iteration, which controls how much each additional boosting iteration contributes to the ensemble. The learning rate can be set to any value in $(0,1]$.
n_estimators - The number of boosting iterations (i.e., trees in the ensemble). More boosting iterations tend to work best when combined with a smaller learning rate (and vice-versa).
reg_alpha and reg_lambda - The amount of regularization applied to the contributions of base estimators to the ensemble (the leaf weights of each tree). reg_alpha provides the L1 regularization, which encourages sparsity (some weights being exactly zero), while reg_lambda provides the L2 regularization, which encourages weights to be closer to zero (but not exactly zero). Increasing these tuning parameters introduces bias but reduces the variance of the ensemble. These parameters can be set to any value in the range $[0,\infty)$.
The following parameters influence the individual base models within the ensemble:
max_depth - The maximum depth of trees. Higher values provide increased flexibility but increase the likelihood of overfitting. Generally maximum depths of 1 or 2 tend to work best in boosting ensembles.
colsample_bytree, colsample_bylevel, and colsample_bynode - The fraction of columns (variables) randomly sampled for use at a particular stage of the base model. For example, colsample_bytree = 0.5 will randomly select 50% of the available predictors and use them for building an entire tree, while colsample_bylevel = 0.5 will select a random 50% of predictors at each depth level of each individual tree, and colsample_bynode = 0.5 will select a random 50% of predictors to be considered at each individual split within each individual tree.
min_child_weight - The minimum sum of instance weights (hessians) needed in a child node; a split that would leave a child below this threshold is not made. For squared-error regression with equally weighted data-points this amounts to a minimum number of data-points per node, similar in spirit to min_samples_leaf in sklearn's decision trees. Larger values can help prevent overfitting but reduce flexibility.
gamma (alias: min_split_loss) - The minimum improvement in cost required for a node to be split within a tree. Larger values help prevent overfitting.
For advanced uses, there are a few additional tuning parameters that you can read about here.
Fortunately for us, xgboost is compatible with sklearn pipelines and grid searches:
from sklearn.pipeline import Pipeline
## Example pipeline
pipe = Pipeline([
    ('model', xgb.XGBClassifier(eval_metric='error', use_label_encoder=False))
])
## Try three different learning rates
parms = {'model__learning_rate': [0.01, 0.1, 0.3]}
## Grid Search
from sklearn.model_selection import GridSearchCV
grid_res = GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y)
print(grid_res.best_estimator_)
print(grid_res.best_score_)
Pipeline(steps=[('model', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, eval_metric='error', gamma=0, gpu_id=-1, importance_type=None, interaction_constraints='', learning_rate=0.01, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=12, num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', use_label_encoder=False, validate_parameters=1, verbosity=None))])
0.6383266980825478
Finally, while we will not use it in this lab, you should recognize that the XGBRegressor() function should be used when predicting a numeric outcome.
Question 2: For this question you should use the well-switching data used throughout this lab.
xgboost classification model.
GridSearchCV() to explore the cross-validated classification accuracy that results from base models having maximum depths of 1, 2, 3, 4, 5, and 6. Display the relationship between maximum depth and cross-validated accuracy using a line chart.
learning_rate parameter between $(0,1]$ and draw values of the n_estimators parameter between 100 and 500.
reg_alpha and reg_lambda tuning parameters. Hint: You can extract the best tuning parameters from a GridSearchCV() object via the best_params_ attribute.
GridSearchCV() to identify a reasonable SVM classifier to use as a competitor to the xgboost model you identified in Part E. Report the tuning parameters and cross-validated accuracy score of your SVM.
GridSearchCV() to compare your SVM from Part F and your best xgboost model from Part E. Which model achieves a higher cross-validated accuracy score?