Lab 9 - Introduction to xgboost¶

Gradient boosting ensembles are arguably the most powerful models for "tabular" or "flat" data, and are frequently used by winning submissions in [Kaggle competitions](https://www.kaggle.com/competitions).

This lab covers xgboost (extreme gradient boosting), a library used to efficiently build scalable gradient boosting ensembles. The library is not included in the Anaconda distribution, so you will need to add it to your environment yourself. An easy way to do this is to run the command pip install xgboost, but you can also locate it among the "uninstalled" libraries in the environment panel of Anaconda Navigator and install it that way.

If you've successfully installed xgboost you should be able to run the following code without any warnings or error messages:

In [1]:
import warnings   # Turns off the future warnings from xgboost
warnings.simplefilter(action='ignore', category=FutureWarning)
import xgboost as xgb

In addition to xgboost, we'll continue using several standard libraries:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

The examples in this lab will use data from a public health study conducted in Araihazar upazila, Bangladesh. In this study, researchers tested the water in community wells for arsenic and encouraged households using unsafe wells to switch to the nearest safe well. Several years after the initial testing, the researchers revisited each household they had previously contacted and recorded whether it had actually switched from an unsafe well to a safe one, which is coded as switch = 'yes'.

In [3]:
# Well switching data
wells = pd.read_csv("https://remiller1450.github.io/data/Wells.csv")

# Train-test split
from sklearn.model_selection import train_test_split
train, test = train_test_split(wells, test_size=0.1, random_state=9)

# Separate the outcome and predictors (and drop the "association" predictor)
train_y = (train['switch'] == 'yes').astype(int)
train_X = train.drop(['switch', 'association'], axis="columns")
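
If you'd like to confirm which predictors remain after dropping association, a quick optional peek (not needed for anything later) is sketched below:

## Optional: inspect the remaining predictors and the balance of the outcome
print(train_X.head())
print(train_y.value_counts())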

Part 1 - Hyperparameter Tuning¶

A strength of xgboost models is the relatively large number of hyperparameters that can be tuned, which enables the approach to find a sweet spot in the bias-variance trade-off across a wide range of applications. Thus, when training an xgboost model, you should carefully consider the following hyperparameters (listed roughly in their order of importance):

The following parameters influence the ensemble-building process and the use of the ensemble as a whole (a short example setting several of these parameters appears after the two lists below):

  • learning_rate - The shrinkage applied to estimates at each boosting iteration that controls how much each additional boosting iteration contributes to the ensemble. The learning rate can be set to any value between 0 and 1.
  • n_estimators - The number of boosting iterations (i.e., trees in the ensemble). More boosting iterations tend to work best when combined with a smaller learning rate (and vice versa).
  • reg_alpha and reg_lambda - The amount of regularization applied to the contributions of base estimators to the ensemble. reg_alpha provides L1 regularization, which encourages sparsity (some estimators making contributions of exactly zero), while reg_lambda provides L2 regularization, which encourages contributions to be closer to zero (but not exactly zero). Increasing these tuning parameters introduces bias but reduces the variance of the ensemble. Both parameters can be set to any non-negative value.

The following parameters influence the individual base models within the ensemble:

  • max_depth - The maximum depth of trees. Higher values increase the likelihood of overfitting but provide increased flexibility. Generally maximum depths of 1 or 2 tend to work best in boosting ensembles.
  • colsample_bytree, colsample_bylevel, and colsample_bynode - The fraction of columns (variables) randomly sampled for use at a particular stage of the base model. For example, colsample_bytree = 0.5 will randomly select 50% of the available predictors and use them for building an entire tree, while colsample_bylevel = 0.5 will select a random 50% of predictors at each depth level of each individual tree and colsample_bynode = 0.5 will select a random 50% of predictors to be considered at each individual split within each individual tree.
  • min_child_weight - The minimum sum of instance weights (hessians) required in each child node for a split to be kept. When every observation carries a weight of 1 (as in regression with squared error loss), this is simply the minimum number of observations allowed in a child node, similar in spirit to min_samples_leaf in sklearn's decision trees. Larger values can help prevent overfitting but reduce flexibility.
  • gamma (alias: min_split_loss) - The minimum improvement in cost required for a node to be split within a tree. Larger values prevent overfitting.

For advanced uses, there are some additional tuning parameters that you can read about here.
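
To make these options concrete, the sketch below (using arbitrary, purely illustrative values) shows how several of the hyperparameters above are passed to the scikit-learn style XGBClassifier wrapper:

## Illustrative values only -- these are not tuned for the well switching data
example_model = xgb.XGBClassifier(
    learning_rate=0.05,       # shrinkage applied at each boosting iteration
    n_estimators=300,         # number of boosting iterations (trees)
    reg_alpha=0.0,            # L1 regularization of base model contributions
    reg_lambda=1.0,           # L2 regularization of base model contributions
    max_depth=2,              # shallow trees tend to work well in boosting
    colsample_bylevel=0.5,    # random 50% of predictors at each depth level
    min_child_weight=5,       # minimum total instance weight in a child node
    gamma=0.1                 # minimum loss reduction required to keep a split
)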

Fortunately, xgboost is compatible with sklearn pipelines and cross-validated grid searches:

In [4]:
## Simple pipeline
from sklearn.pipeline import Pipeline 
pipe = Pipeline([
('model', xgb.XGBClassifier(eval_metric='error'))
])

## A few different learning rates
parms = {'model__learning_rate': [0.005, 0.01, 0.03]}

## Grid Search and results
from sklearn.model_selection import GridSearchCV
grid_res = GridSearchCV(pipe, parms, scoring = 'roc_auc', cv=5).fit(train_X, train_y)
print(grid_res.best_estimator_)
print(grid_res.best_score_)
Pipeline(steps=[('model',
                 XGBClassifier(base_score=None, booster=None, callbacks=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None, device=None,
                               early_stopping_rounds=None,
                               enable_categorical=False, eval_metric='error',
                               feature_types=None, feature_weights=None,
                               gamma=None, grow_policy=None,
                               importance_type=None,
                               interaction_constraints=None, learning_rate=0.03,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=None, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, multi_strategy=None,
                               n_estimators=None, n_jobs=None,
                               num_parallel_tree=None, ...))])
0.6475470088364157

As we will see in the next section, it can be cumbersome to rely upon GridSearchCV() to tune the learning rate and number of boosting iterations, so this approach is most useful for getting some preliminary ideas about the optimal tree depth and feature subsampling in a given application before moving on to a more sophisticated hyperparameter tuning scheme.
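
For instance, a preliminary grid over tree depth and feature subsampling might look like the sketch below (the candidate values here are arbitrary and chosen only for illustration); the cross-validated score for every combination can be inspected through cv_results_:

## Hypothetical preliminary grid over tree depth and feature subsampling
parms2 = {'model__max_depth': [1, 2, 3],
          'model__colsample_bytree': [0.5, 1.0]}
grid_res2 = GridSearchCV(pipe, parms2, scoring='roc_auc', cv=5).fit(train_X, train_y)
print(pd.DataFrame(grid_res2.cv_results_)[['params', 'mean_test_score']])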

Question #1:

  • Part A: Using the well switching data, a learning rate of 0.03, and 100 boosting iterations, find the cross-validated classification accuracy that results from base models having maximum depths of 1, 2, 3, 4, 5, and 6. Display the relationship between maximum depth and cross-validated accuracy using a line chart.
  • Part B: Using the best performing combination from Part A, explore the values of 1/3, 2/3, and 1 for the parameter colsample_bylevel using a cross-validated grid search. Briefly comment on why it only makes sense to consider these values for this application.
  • Part C: Use GridSearchCV() to identify a reasonable SVM classifier to use as a competitor to the xgboost model you identified in Part B. Report the hyperparameters (kernel function, etc.) and cross-validated accuracy score of your SVM.
  • Part D: Compare the performance of the best models from Parts B and C on the test set, reporting their classification accuracy, ROC-AUC, and F1 score.

Part 2 - Early Stopping and Validation Monitoring¶

In the previous section we saw that xgboost models can overfit the training data if too many boosting iterations are used; however, it can be difficult to know in advance how many boosting iterations are necessary in a given application, and it can be computationally burdensome to rely upon GridSearchCV() to tune this hyperparameter. Fortunately, xgboost allows you to monitor the performance of your model on an external validation set and stop training early if performance on the validation set plateaus or the model begins to overfit the training data.

To make use of early stopping, we must specify how many consecutive boosting iterations without improvement on the validation set should be allowed before training stops early (the argument early_stopping_rounds), and we must also provide a performance metric to be monitored on the validation set (the argument eval_metric). We will use the 'logloss' evaluation metric here because it tends to be more sensitive to small changes in predicted probabilities than either classification accuracy or AUC.
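
To illustrate this sensitivity, the small sketch below uses made-up predicted probabilities: the two sets of probabilities lead to identical 0/1 classifications (and thus identical accuracy), but different log loss values:

## Made-up predicted probabilities illustrating the sensitivity of log loss
from sklearn.metrics import log_loss, accuracy_score
y_true = np.array([1, 0, 1, 1])
p_a = np.array([0.60, 0.40, 0.70, 0.55])
p_b = np.array([0.65, 0.35, 0.75, 0.60])   # slightly more confident than p_a

print(log_loss(y_true, p_a), log_loss(y_true, p_b))     # log loss differs
print(accuracy_score(y_true, (p_a > 0.5).astype(int)),
      accuracy_score(y_true, (p_b > 0.5).astype(int)))  # accuracy is identical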

In our example we will use the test set as our validation data, but in practice you might want to use separate validation and test sets.
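
For example, a separate validation set could be carved out of the training data (keeping the test set purely for final evaluation), as in the sketch below; the names train_sub and valid are arbitrary, and this split is not used in the examples that follow:

## Not used below: split the training data again to create a validation set
from sklearn.model_selection import train_test_split
train_sub, valid = train_test_split(train, test_size=0.2, random_state=9)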

In [5]:
# Set up validation set using the test data
test_y = (test['switch'] == 'yes').astype(int)
test_X = test.drop(['switch', 'association'], axis="columns")

# Initialize and train the model with early stopping
model = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.01,
                          early_stopping_rounds=15, eval_metric='logloss')
model.fit(train_X, train_y,
          eval_set=[(train_X, train_y), (test_X, test_y)],
          verbose=False)

In this example we provided both the training and validation data to the eval_set argument, which will track the performance of the model on these two sets by boosting iteration.

We can then extract the results and plot the training and validation performance:

In [6]:
## Extract results
results = model.evals_result()

## Visualize performance
plt.figure(figsize=(10,7))
plt.plot(results["validation_0"]["logloss"], label="Training loss")
plt.plot(results["validation_1"]["logloss"], label="Validation loss")
plt.axvline(model.best_iteration, color="gray", label="Final Model")
plt.xlabel("Number of boosting iterations")
plt.ylabel("Loss")
plt.legend()
plt.show()

To use the best model identified by our early stopping rule, or any model corresponding to an even earlier part of the training process, we can use the iteration_range argument in the predict() method:

In [7]:
## Predictions on test set
test_preds = model.predict(test_X, iteration_range=(0, model.best_iteration+1))

## Final evaluation of test set
from sklearn.metrics import accuracy_score
accuracy_score(test_y, test_preds)
Out[7]:
0.6225165562913907

In this example we used the base learners from boosting iterations 0 through best_iteration (the upper end of iteration_range is exclusive) to produce our predictions.

More generally, we could use any collection of base learners from the final model to produce predictions, though there aren't many clear instances where this would be useful:

In [8]:
## Another set of predictions on test set
test_preds = model.predict(test_X, iteration_range=(50, 150))
accuracy_score(test_y, test_preds)
Out[8]:
0.6026490066225165

Question #2:

  • Part A: Modify the example in this section to use 'auc' as the evaluation metric and plot the performance of the model by iteration using this metric. Why does the pattern/trend in the performance plot look very different when using this metric than it does when using 'logloss'? You should continue using a learning rate of 0.01 and early_stopping_rounds=15.
  • Part B: Try changing the learning rate to 0.001 and 0.1 and comment on how the learning rate impacts the ability of an xgboost model with early stopping to find a well-fitting model.
  • Part C: Use the test set to compare the performance of the early stopping model that was found in Part A using 'auc' as the evaluation metric and the one from this section's examples that used 'logloss' as the evaluation metric. Report the classification accuracy, AUC, and F1 score of these models and briefly comment upon which approach (choice of evaluation metric) seemed to do a better job of finding a well-fitting model.

Part 3 - Feature Importance¶

Another strength of xgboost models (and of other models built from decision trees, such as random forests) is that the influence of individual features on the model's predictions can be assessed using gain-based importance: the total improvement in the splitting criterion achieved by the splits involving a given variable. For xgboost this is the reduction in the boosting objective, while for sklearn's tree ensembles it is the reduction in Gini impurity (or in squared error for numeric outcomes).

The example below shows how to extract gain-based importance scores from a fitted xgboost model:

In [9]:
# Get feature importance scores
model = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.01,
                          early_stopping_rounds=15, eval_metric='logloss')
model.fit(train_X, train_y,
          eval_set=[(train_X, train_y), (test_X, test_y)],
          verbose=False)
importance_scores = model.get_booster().get_score(importance_type='gain')

# Print the scores
print(importance_scores)
{'arsenic': 3.682039976119995, 'distance': 2.698807716369629, 'education': 2.5666866302490234}

In this example we can see that the variable arsenic was most influential in the early stopping model from the previous section.
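
Because get_score() returns a plain dictionary, it can be convenient to convert the scores into a pandas Series for sorting or further summaries, as in this short sketch:

## Convert the importance dictionary to a sorted pandas Series
importance_series = pd.Series(importance_scores).sort_values(ascending=False)
print(importance_series)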

Question #3:

  • Part A: Fit a new xgboost model that randomly selects one of the three available features to be considered at each split (node). Are the feature importance scores of this model more similar or less similar to one another than those of the model in the example? Briefly explain why you think this occurs.
  • Part B: Change the feature importance type to 'weight', which simply counts the number of splits made using a given feature, and repeat the comparison between a model that randomly subsamples 1/3 of the available features at each split and another model that considers all available features at every split. Based upon your findings, briefly describe the implications of feature subsampling for weight-based importance.

Part 4 - Application¶

In this section you'll revisit the Wisconsin Breast Cancer data set. Recall that the goal in this application was to accurately classify tissue samples from tumors as either benign or malignant.

In [10]:
## Read the data
wbc = pd.read_csv("https://remiller1450.github.io/data/wisc_bc.csv")

The purpose of this section is to provide you with an opportunity to apply the concepts from each of the previous sections to another data set.

Question #4:

  • Part A: Split these data into three segments, a training set consisting of 60% of the data, a validation set containing 20% of the data, and a test set containing 20% of the data.
  • Part B: Use GridSearchCV() to make a preliminary determination of an appropriate tree depth, feature subsampling strategy, and learning rate for this application.
  • Part C: Use early stopping and the validation set to optimize the learning rate and number of boosting iterations for the two or three most promising hyperparameter configurations you found in Part B using the 'logloss' evaluation metric. Use the F1 score on the validation data of the best model from each early stopping exploration to determine a final model.
  • Part D: Create a bar plot showing the gain-based importance of each feature in the best model you identified in Part C. Normalize the values so that the sum of the feature importance scores is 1.
  • Part E: Report the final performance (accuracy, ROC-AUC, and F1 score) of the best model on the test data.