Lab 10 - Boosting Ensembles and `xgboost`¶

This lab introduces gradient boosting methods with a focus on the xgboost library.

Currently, xgboost is viewed as a state-of-the-art machine learning algorithm for flat/tabular data. While we will not cover them in this lab because of their heavy overlap with xgboost, it is worth knowing about the following gradient boosting libraries:

CatBoost - A library developed by the Russian company Yandex that handles categorical predictors more effectively than other gradient boosting implementations - here is a link to the technical paper
LightGBM - A gradient boosting library developed by Microsoft that focuses on speed and efficiency - here is a link to the technical paper

Unlike the libraries we've been using so far, xgboost is not included in Anaconda by default, so you'll need to install it yourself. You can do this using the "Environment" tab of the Anaconda Navigator by choosing "Not Installed" libraries, searching for "xgboost", then clicking to install xgboost and its dependencies (there shouldn't be any conflicts with the default Anaconda packages). You may also install xgboost using pip or conda; click here for the official installation guide.

If you have xgboost installed, you should be able to load the library without an error:

In [1]:

import warnings   # We will turn off the future warnings that xgboost gives us
warnings.simplefilter(action='ignore', category=FutureWarning)
import xgboost as xgb

We'll also need to load our standard libraries:

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

The examples in this lab will use data from a public health study conducted in Arahazar upazila, Bangladesh. In this study, researchers tested water from community wells for arsenic and encouraged households that were using unsafe wells to switch to a nearby well that was tested to be safe. Several years after their initial testing, the researchers revisited each household they had encouraged to switch to a safe well and recorded which of these households had done so, an outcome of switch = 'yes'.

In [3]:

### Read data
wells = pd.read_csv("https://remiller1450.github.io/data/Wells.csv")

## Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wells, test_size=0.1, random_state=9)

### Separate the outcome and predictors (dropping the "association" predictor)
train_y = (train['switch'] == 'yes').astype(int)
train_X = train.drop(['switch', 'association'], axis="columns")

Part 1 - Boosting in `sklearn`¶

The AdaBoost algorithm is regarded as the first successful boosting method. Modern gradient boosting approaches are generalizations of AdaBoost, so it is worth being familiar with AdaBoost both for its historical importance and for its easy-to-understand framework.

AdaBoost sequentially builds an ensemble classifier using decision trees. The first tree in the ensemble gives each data-point a weight of $\tfrac{1}{n}$, as is normal. However, the next tree assigns each data-point a weight that is determined by its prediction error, thereby allowing data-points that are misclassified to be given more attention.
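
The reweighting step described above can be sketched in a few lines of NumPy. This is a minimal illustration of one round of the binary discrete AdaBoost (SAMME) update, not sklearn's actual implementation; the function name `adaboost_reweight` and the toy labels are made up for this example, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import numpy as np

def adaboost_reweight(y_true, y_pred, weights, learning_rate=1.0):
    """One round of the binary discrete AdaBoost weight update (illustrative sketch)."""
    miss = (y_true != y_pred).astype(float)           # 1 where the tree was wrong
    err = np.sum(weights * miss) / np.sum(weights)    # weighted error rate (assumed in (0, 1))
    alpha = learning_rate * np.log((1.0 - err) / err) # weight given to this tree
    new_weights = weights * np.exp(alpha * miss)      # up-weight misclassified points only
    return new_weights / new_weights.sum(), alpha     # renormalize so weights sum to 1

# Toy example: five data-points, the second one is misclassified
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])
w0 = np.full(len(y_true), 1 / len(y_true))   # first tree: every point has weight 1/n

w1, alpha = adaboost_reweight(y_true, y_pred, w0)
print("tree weight:", alpha)
print("data-point weights for the next tree:", w1)
```

In this toy case the single misclassified point ends up carrying half of the total weight, so the next tree is pushed to classify it correctly.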

An essential parameter in the AdaBoost algorithm is the learning rate, which controls how much weight is given to a base model at each boosting iteration. The default value is 1, which allows each base model to contribute proportional to its error:

In [4]:

from sklearn.ensemble import AdaBoostClassifier

## Setup two different AdaBoost models:
ada_model1 = AdaBoostClassifier(n_estimators=100, learning_rate=1, algorithm = 'SAMME')
ada_model2 = AdaBoostClassifier(n_estimators=100, learning_rate=0.2, algorithm = 'SAMME')

## Plot weights of each tree in the ensemble:
iterations = np.arange(1, 101)   # boosting iteration index: 1, 2, ..., 100
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(iterations, ada_model1.fit(train_X, train_y).estimator_weights_)
ax1.set_title('learning_rate = 1')
ax2.plot(iterations, ada_model2.fit(train_X, train_y).estimator_weights_)
ax2.set_title('learning_rate = 0.2')
plt.show()

One implication of setting a lower learning rate is that more of the base learners will make noticeable contributions to the ensemble (right plot), whereas choosing a higher learning rate tends to make the base learners added early in the ensemble primarily responsible for the ensemble's predictions.

There is an inherent tradeoff between the learning rate and the number of boosting iterations that produce optimal performance. Smaller learning rates should be paired with more boosting iterations so that the ensemble can continue to improve. It's generally believed that a smaller learning rate paired with more iterations tends to produce a better-performing ensemble, at the cost of longer training time.

For a fixed number of boosting iterations, we can see the implications of using different learning rates:

In [5]:

from sklearn.model_selection import cross_val_score
cv_res = []
rates_to_try = [0.01,0.1,0.2,0.5,1,2]
for rate in rates_to_try:
    ada_model_temp = AdaBoostClassifier(n_estimators=100, learning_rate=rate, algorithm = 'SAMME')
    cv_res.append(np.average(cross_val_score(ada_model_temp, train_X, train_y, cv=5, scoring='accuracy')))

## Plot of CV accuracy
plt.plot(rates_to_try, cv_res)
plt.show()

Learning rates that are too small lead to biased models that never fully learn all of the patterns present in the data, while learning rates that are too large lead to overfit models.

Now let's see what happens if we fix the learning rate at 0.5 (what seemed to work best in the exploration above) and vary the number of boosting iterations:

In [6]:

cv_res = []
n_iter_to_try = [10,100,200,300,600]
for it in n_iter_to_try:
    ada_model_temp = AdaBoostClassifier(n_estimators=it, learning_rate=0.5, algorithm = 'SAMME')
    cv_res.append(np.average(cross_val_score(ada_model_temp, train_X, train_y, cv=5, scoring='accuracy')))

## Plot of CV accuracy
plt.plot(n_iter_to_try, cv_res)
plt.show()

We can see that there's a clear relationship between these two tuning parameters. However, it's difficult to assess what is a large or a small learning rate (or a large or small number of boosting iterations) from one application to another.
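
One inexpensive way to look at this relationship is to fit a single ensemble and score it after every boosting iteration using the `staged_score()` method of `AdaBoostClassifier`. The sketch below makes some assumptions for the sake of being self-contained: it uses synthetic data from `make_classification` rather than the wells data, and it scores on a single held-out validation split instead of using cross-validation, so the curves will be noisier than the cross-validated results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration; not the wells data)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

curves = {}
for rate in [0.1, 1.0]:
    model = AdaBoostClassifier(n_estimators=50, learning_rate=rate, random_state=0)
    model.fit(X_fit, y_fit)
    # staged_score yields the ensemble's accuracy after each boosting iteration
    curves[rate] = np.array(list(model.staged_score(X_val, y_val)))

for rate, scores in curves.items():
    print(f"learning_rate={rate}: best validation accuracy {scores.max():.3f} "
          f"at iteration {scores.argmax() + 1}")
```

Plotting each array in `curves` against the iteration number shows how quickly each learning rate reaches its peak without refitting the model once per value of `n_estimators`.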

Question 1:

Part A: Use GridSearchCV() to evaluate the cross-validated F1 scores of each combination of the following:
- Learning rate: 0.1, 0.3, or 0.6
- Number of boosting iterations: 10, 100, 200, or 400
Part B: Using the learning rate of the best performing model identified in Part A, create a line chart showing the cross-validated F1 scores at various numbers of boosting iterations that surround the number of iterations identified in your search from Part A. You should choose values so that your graph displays a clear peak/top.
Part C: By default AdaBoostClassifier() will use decision trees with a maximum depth of 1 as base models, but you can change this using the estimator argument (i.e., estimator = DecisionTreeClassifier(max_depth=2) uses decision trees with a max depth of 2 as base models). Add a second line to your line chart in Part B that shows the performance of decision trees with a max depth of 2 for the same tuning parameter values you used in Part B.
Part D: The base estimators in AdaBoost can be any model that accommodates sample weighting. Change the base model to linear SVM classifiers and compare the cross-validated accuracy with the best performing model you found in Part A.

Part 2 - `xgboost`¶

The explorations in Part 1 of the lab illustrate the tradeoff between the two most important hyperparameters in boosting ensembles: the learning rate and the number of boosting iterations. The motivation behind the xgboost library was to provide an efficient implementation of gradient boosting with numerous hyperparameters that could be customized. Additionally, xgboost was designed to run on a GPU using parallelization and to accommodate sparse data matrices (a data structure we will not discuss). These features have made xgboost widely used, though it is fundamentally just an implementation of gradient boosting.

When training an xgboost model, you should carefully consider the following hyperparameters (listed roughly in their order of importance):

The following parameters influence the ensemble-building process and the use of the ensemble as a whole:

learning_rate - The shrinkage applied to estimates at each boosting iteration, which controls how much each additional boosting iteration contributes to the ensemble. The learning rate can be set to any value in $(0,1]$.
n_estimators - The number of boosting iterations (i.e., trees in the ensemble). More boosting iterations tend to work best when combined with a smaller learning rate (and vice-versa).
reg_alpha and reg_lambda - The amount of regularization applied to the contributions (leaf weights) of the base estimators in the ensemble. reg_alpha provides the L1 regularization, which encourages sparsity (some contributions being exactly zero), while reg_lambda provides the L2 regularization, which encourages weights to be closer to zero (but not exactly zero). Increasing these tuning parameters introduces bias but reduces the variance of the ensemble. These parameters can be set to any value in the range $[0,\infty)$.

The following parameters influence the individual base models within the ensemble:

max_depth - The maximum depth of trees. Higher values increase the likelihood of overfitting but provide increased flexibility. Generally maximum depths of 1 or 2 tend to work best in boosting ensembles.
colsample_bytree, colsample_bylevel, and colsample_bynode - The fraction of columns (variables) randomly sampled for use at a particular stage of the base model. For example, colsample_bytree = 0.5 will randomly select 50% of the available predictors and use them for building an entire tree, while colsample_bylevel = 0.5 will select a random 50% of predictors at each depth level of each individual tree and colsample_bynode = 0.5 will select a random 50% of predictors to be considered at each individual split within each individual tree.
min_child_weight - The minimum sum of instance weights (hessians) needed in a child node; a split that would create a child below this threshold is not made. For squared-error regression with equally weighted data-points, this amounts to a minimum number of data-points per node, making it roughly analogous to min_samples_leaf in decision tree functions. Larger values can help prevent overfitting but reduce flexibility.
gamma (alias: min_split_loss) - The minimum improvement in cost required for a node to be split within a tree. Larger values prevent overfitting.
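
To make the roles of reg_alpha, reg_lambda, and gamma more concrete, here is a minimal sketch of the standard second-order formulas for a leaf's value and a split's gain, following the form given in the XGBoost paper with an added L1 term. It is an illustration rather than xgboost's actual code: the helper names are made up for this example, `G` and `H` denote the summed gradients and hessians of the data-points in a node, and the library's internal bookkeeping may differ in constants.

```python
import numpy as np

def soft_threshold(g, reg_alpha):
    """Shrink g toward zero by reg_alpha (the effect of the L1 penalty)."""
    return np.sign(g) * max(abs(g) - reg_alpha, 0.0)

def leaf_weight(G, H, reg_alpha=0.0, reg_lambda=1.0):
    """Regularized leaf value from the summed gradient G and hessian H."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

def leaf_score(G, H, reg_alpha=0.0, reg_lambda=1.0):
    """Quality of a node; larger means a larger reduction in the objective."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_alpha=0.0, reg_lambda=1.0, gamma=0.0):
    """Improvement from splitting a node into left/right children, minus gamma."""
    parent = leaf_score(GL + GR, HL + HR, reg_alpha, reg_lambda)
    children = (leaf_score(GL, HL, reg_alpha, reg_lambda)
                + leaf_score(GR, HR, reg_alpha, reg_lambda))
    return 0.5 * (children - parent) - gamma

# A leaf with summed gradient -4 and summed hessian 3
w_default = leaf_weight(-4, 3)                  # only the default reg_lambda=1
w_l1 = leaf_weight(-4, 3, reg_alpha=2)          # L1 shrinks the gradient sum
w_zero = leaf_weight(-4, 3, reg_alpha=5)        # a large enough L1 penalty gives exactly zero
w_l2 = leaf_weight(-4, 3, reg_lambda=5)         # L2 shrinks the value but not to zero
print(w_default, w_l1, w_zero, w_l2)

# The same candidate split is accepted or rejected depending on gamma
gain_no_penalty = split_gain(-5, 2, 1, 1)
gain_with_gamma = split_gain(-5, 2, 1, 1, gamma=3)
print(gain_no_penalty, gain_with_gamma)
```

Under these formulas a split is only worthwhile when its gain is positive, which is why a larger gamma leads to smaller trees.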

For advanced uses, there are a few additional tuning parameters that you can read about here.

Fortunately for us, xgboost is compatible with sklearn pipelines and grid searches:

In [8]:

from sklearn.pipeline import Pipeline 

## Example pipeline
pipe = Pipeline([
    ('model', xgb.XGBClassifier(eval_metric='error', use_label_encoder=False))
])

## Try three different learning rates
parms = {'model__learning_rate': [0.01, 0.1, 0.3]}

## Grid Search
from sklearn.model_selection import GridSearchCV
grid_res = GridSearchCV(pipe, parms, cv=5).fit(train_X, train_y)
print(grid_res.best_estimator_)
print(grid_res.best_score_)

Pipeline(steps=[('model',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               eval_metric='error', gamma=0, gpu_id=-1,
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.01, max_delta_step=0,
                               max_depth=6, min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=100,
                               n_jobs=12, num_parallel_tree=1, predictor='auto',
                               random_state=0, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=1, subsample=1,
                               tree_method='exact', use_label_encoder=False,
                               validate_parameters=1, verbosity=None))])
0.6383266980825478
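
The output above only reports the winning model. To compare every candidate in the grid, the `cv_results_` attribute of a fitted `GridSearchCV` object can be converted into a data frame. The sketch below is self-contained under two assumptions: it uses synthetic data from `make_classification` instead of the wells data, and it swaps in sklearn's `GradientBoostingClassifier` as a stand-in so that it runs without xgboost installed. The same pattern should apply to a grid search over `xgb.XGBClassifier`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption for illustration; not the wells data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Stand-in model: sklearn's gradient boosting in place of xgb.XGBClassifier
pipe = Pipeline([
    ('model', GradientBoostingClassifier(n_estimators=30, random_state=0))
])
parms = {'model__learning_rate': [0.01, 0.1, 0.3]}
grid_res = GridSearchCV(pipe, parms, cv=5).fit(X, y)

# One row per candidate, with its mean and standard deviation across the folds
results = pd.DataFrame(grid_res.cv_results_)
summary = results[['param_model__learning_rate', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

Because no scoring argument is given, the scores in this table are classification accuracies, which is the default for sklearn classifiers.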

Finally, while we will not use it in this lab, you should recognize that the XGBRegressor() function should be used when predicting a numeric outcome.

Question 2: For this question you should use the well-switching data used throughout this lab.

Part A: Create a data processing pipeline that includes a min-max scaling step followed by an xgboost classification model.
Part B: Using your pipeline from Part A with a learning rate of 0.01 and the default values for every other tuning parameter, use GridSearchCV() to explore the cross-validated classification accuracy that results from base models having maximum depths of 1, 2, 3, 4, 5, and 6. Display the relationship between maximum depth and cross-validated accuracy using a line chart.
Part C: Using the best maximum depth from Part B, perform 10 iterations of randomized search using a uniform distribution to draw values of the learning_rate parameter from the interval $(0,1]$ and to draw values of the n_estimators parameter between 100 and 500.
Part D: Using the best set of tuning parameters from Part C, conduct a cross-validated grid search exploring the regularization values of 0, 2, and 30 for the reg_alpha and reg_lambda tuning parameters. Hint: You can extract the best tuning parameters from a GridSearchCV() object via the best_params_ attribute.
Part E: Using the best set of tuning parameters from Part D, conduct a cross-validated grid search exploring column sampling. You should try out a few reasonable values for at least one type of column sampling in your search.
Part F: Use GridSearchCV() to identify a reasonable SVM classifier to use as a competitor to the xgboost model you identified in Part E. Report the tuning parameters and cross-validated accuracy score of your SVM.
Part G: Use GridSearchCV() to compare your SVM from Part F and your best xgboost model from Part E. Which model achieves a higher cross-validated accuracy score?
Part H: Evaluate each model you compared in Part G on the test set. Report each model's classification accuracy and F1 score.
