This lab is optional and may be completed for a small amount of extra credit. I encourage you to consider using these methods on your final project.
library(rpart)
library(dplyr)
library(ggplot2)
#install.packages("randomForest")
library(randomForest)
\(~\)
Decision tree models tend to be high variance, meaning they are prone to over-fitting the data they are trained on.
Random forests are an ensemble modeling method that ameliorates the high variance of single decision trees by exploiting the idea that averaging a set of independent elements yields an outcome with lower variability than any of the individual elements in the set.
The basic idea that an average is less variable than an individual entity should be familiar from intro stats class. For example, the sample mean, \(\bar{x}\), has substantially less variability (\(\sigma/\sqrt{n}\)) than the variability of the individual data-points in a population (\(\sigma\)).
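If you'd like to see this principle numerically, the brief simulation below (an optional sketch, not part of the lab's required code) compares the standard deviation of individual observations to the standard deviation of sample means based on \(n = 25\) observations:
## Simulation sketch: variability of individual values vs. variability of sample means
set.seed(1)
x <- rnorm(10000, mean = 0, sd = 2)  ## 10,000 individual data-points (sigma = 2)
xbar <- replicate(10000, mean(rnorm(25, mean = 0, sd = 2)))  ## 10,000 sample means (n = 25)
sd(x)     ## roughly 2
sd(xbar)  ## roughly 2/sqrt(25) = 0.4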
To translate this idea into a workable modeling method, random forests use bagging (short for bootstrap aggregation).
Bagging creates \(B\) different bootstrapped training data sets that are each slightly different. The model of interest (here, a decision tree) is trained separately on each of these data sets, producing a collection of \(B\) different estimated “models”, sometimes called “base learners”. When it comes time to generate predictions, the new data is given to each of these models and the resulting set of predictions is then aggregated (usually by averaging or voting) to produce a single prediction.
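To make these steps concrete, here is a minimal sketch of bagging “by hand” using rpart trees as base learners and a majority vote to aggregate predictions (the object names here are my own, and the predictions are made in-sample purely to illustrate the mechanics; the randomForest() function used later handles all of this internally):
## Bagging sketch: fit B trees to bootstrapped samples, then aggregate by majority vote
B <- 100
n <- nrow(iris)
votes <- matrix(NA_character_, nrow = n, ncol = B)
for(b in 1:B){
  boot_rows <- sample(1:n, size = n, replace = TRUE)      ## bootstrapped training set
  tree_b <- rpart(Species ~ ., data = iris[boot_rows, ])  ## base learner (decision tree)
  votes[, b] <- as.character(predict(tree_b, newdata = iris, type = "class"))
}
## Each flower's final prediction is the most common vote across the B trees
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))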
You should note that bagging is a general method that can be applied to any base learner. We’ll focus on its use in the random forest algorithm, which combines bagging, decision trees, and one additional strategy. Below is an outline of the random forest algorithm:
- Draw \(B\) bootstrapped samples of the training data
- Grow a decision tree on each bootstrapped sample, but at each split consider only a randomly chosen subset of \(m\) of the available predictors
- Aggregate the \(B\) trees’ predictions by averaging (regression) or majority vote (classification)
Essentially, a random forest is just a collection of decision trees that each “vote” on a final prediction.
Remark: Bootstrapping alone isn’t enough to decorrelate the \(B\) classification and regression trees. By mandating that each split consider only a random set of \(m\) predictors, the strongest explanatory variables are prevented from dominating every tree at the expense of other variables that might provide some unique predictive value. Decreasing \(m\) produces a set of trees that are closer to being independent.
The code below demonstrates how to apply the random forest algorithm in R:
rf <- randomForest(Species ~ ., data = iris)
Because the random forest algorithm uses bagging, it can naturally provide an out-of-sample error rate by considering the data-points that didn’t end up in each bootstrapped sample. These are called out-of-bag (OOB) samples, and evaluating the model using them provides an out-of-bag error rate.
This means cross-validation is not necessary to evaluate the performance of a random forest model; however, properly evaluating a single CART tree does require cross-validation.
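For example, the err.rate component of a classification forest stores the OOB error rate after each tree is added, so the final row reflects the full ensemble (printing the fitted model also reports this overall OOB estimate):
## OOB error rates of the full forest (final row of the error-rate matrix)
tail(rf$err.rate, 1)
## Printing the model reports the overall OOB estimate of the error rate
print(rf)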
Notice the substantial improvement in classification accuracy achieved by the random forest algorithm relative to a single tree:
## Random Forest confusion matrix (naturally out-of-sample)
rf$confusion
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 4 46 0.08
## CART confusion matrix
### CV Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(iris)), size = nrow(iris))
preds <- numeric(nrow(iris))
## Loop across CV folds
for(k in 1:5){
## Subset the data
train <- iris[fold_id != k, ]
test <- iris[fold_id == k, ]
## Fit CART model
tr <- rpart(Species ~ ., data = train)
## Store predictions (because preds is numeric, each predicted factor level is stored as its integer code: 1 = setosa, 2 = versicolor, 3 = virginica)
preds[fold_id == k] <- predict(tr, newdata = test, type = "class")
}
## Confusion Matrix
table(iris$Species, preds)
## preds
## 1 2 3
## setosa 50 0 0
## versicolor 0 46 4
## virginica 0 7 43
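If you'd like a single summary number for each approach, the short sketch below computes overall accuracy (recall that the cross-validated CART predictions were stored as the integer codes 1, 2, and 3):
## Out-of-sample accuracy of the cross-validated CART model
mean(preds == as.integer(iris$Species))
## Out-of-bag accuracy of the random forest (correct counts from the OOB confusion matrix)
sum(diag(rf$confusion[, 1:3]))/nrow(iris)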
Question #1: In your own words, briefly explain why the CART algorithm yields less accurate out-of-sample predictions than the random forest algorithm when applied to Fisher’s iris data set.
\(~\)
The random forest algorithm depends upon the following tuning parameters, each of which can influence the model’s accuracy. The two most important are:
- mtry - the number of variables that are randomly chosen to be candidates at each split
- nodesize, which sets the minimum size of terminal nodes (setting this larger leads to smaller trees), or maxnodes, which sets the maximum number of terminal nodes an individual tree can have
You should also be mindful of the ntree parameter, which determines the number of bootstrapped samples (and therefore trees) used in the forest. If too few trees are used, predictions might not be as accurate as they could be. However, the benefits of using more trees plateau at a certain point.
You can type ?randomForest to read more about these options (such as their default values, etc.) in the R documentation for the randomForest() function. The code below provides an example of how to specify them:
rf <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 2, maxnodes = 4)
This forest will contain 100 trees, with each tree having a maximum of 4 terminal nodes. Additionally, two variables are randomly selected to be considered for each split.
\(~\)
A downside of the random forest algorithm (as well as many other algorithmic modeling approaches) is the difficulty in quantifying the roles played by individual variables in making predictions. However, the importance of individual variables in a random forest can still be expressed using a measure known as variable importance.
Variable importance is a general term, but most R functions take it to refer to the average improvement in tree purity (measured by the Gini Index) that is attributable to splits involving each variable. Some functions re-scale these improvements so that they sum to 100, while randomForest() reports them directly as the mean decrease in Gini impurity. The code below demonstrates the calculation in R:
rf <- randomForest(Species ~ ., data = iris)
rf$importance
## MeanDecreaseGini
## Sepal.Length 10.673018
## Sepal.Width 2.508506
## Petal.Length 43.115187
## Petal.Width 42.949449
This metric suggests petal length and petal width are most useful in accurately predicting the class of an iris flower, while sepal width seems to have little importance.
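The randomForest package also provides a built-in plotting function for these importance measures, which can make comparisons easier to see at a glance:
## Dot chart of variable importance for the fitted forest
varImpPlot(rf)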
Unfortunately, variable importance isn’t as easily interpreted as coefficients in a regression (or logistic regression) model, so if the goal of an application is to estimate the effect of a particular variable, regression (or one of its variants) might be preferable to a random forest, especially if the performance is comparable.
Question #2: Fit two different random forest models to the full iris data set, both using \(B = 300\) bootstrapped samples and a maximum of 4 terminal nodes. Allow the first model to consider only 1 randomly selected variable at each split, and allow the second model to consider 3 randomly selected variables at each split. How do the variable importance measures for each model compare? Briefly explain why the importance measures look so different across the two models.
\(~\)
Bagging methods, such as the random forest algorithm, build an ensemble of base learners (single decision trees) that are trained separately. Alternatively, boosting methods build an ensemble of base learners that are trained sequentially.
The original proposal of boosting was an algorithm known as adaptive boosting, or AdaBoost.
The original AdaBoost algorithm has since been generalized and improved upon, with modern approaches training each subsequent model on the residuals of the previous model rather than a re-weighted version of the original data set. This approach is known as gradient boosting, a name referencing the fact that a model’s residuals are related to the gradient of the model’s objective function (a concept from calculus/optimization).
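To see the “fit each new tree to the residuals” idea in action, below is a small illustrative sketch of gradient boosting for a numeric outcome using shallow rpart trees (predicting Sepal.Length with a learning rate of 0.1, 50 iterations, and a depth limit of 2 are all arbitrary choices made for this illustration; real implementations such as xgboost add many refinements):
## Gradient boosting sketch: each new tree is fit to the residuals of the current ensemble
boost_df <- select(iris, -Species, -Sepal.Length)  ## predictors
y <- iris$Sepal.Length                             ## numeric outcome
current_pred <- rep(mean(y), length(y))            ## start from a constant prediction
eta <- 0.1                                         ## learning rate (scales back each tree)
for(b in 1:50){
  boost_df$resid <- y - current_pred               ## residuals of the current ensemble
  tree_b <- rpart(resid ~ ., data = boost_df, control = rpart.control(maxdepth = 2))
  current_pred <- current_pred + eta*predict(tree_b, newdata = boost_df)
}
## In-sample RMSE of the boosted ensemble
sqrt(mean((y - current_pred)^2))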
The software package and modeling approach xgboost is an efficient implementation of gradient boosted decision trees:
#install.packages("xgboost")
library(xgboost)
It’s worth noting that xgboost was not originally designed for R, and training an xgb model requires you to create a special data storage object:
## Create an xgb Data Matrix - notice the data must be an R matrix, and the "label" is the response variable
map = c("setosa" = 0, "versicolor" = 1, "virginica" = 2)
iris_xgb = xgb.DMatrix(data = as.matrix(select(iris, -Species)), label = map[iris$Species])
## Fit the xgb model
my_xgb <- xgboost(data = iris_xgb, params = list(objective = "multi:softmax", num_class = 3), nrounds = 100, verbose = 0)
Notice how the categorical outcome must be encoded by integers that start at zero. This is achieved in the example above by using a named vector to map each species to an integer (an idea we haven’t discussed, but which is relatively straightforward).
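If you'd prefer not to write out the mapping by hand, the same zero-based encoding can be obtained from the factor's underlying integer codes (a quick sketch; this matches the named mapping above because its values follow the factor's level order):
## Factor levels are stored internally as the integers 1, 2, 3, ..., so subtracting 1 gives 0-based labels
labels_alt <- as.integer(iris$Species) - 1
table(iris$Species, labels_alt)  ## each species should map to a single integer code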
Additionally, when using xgboost we must specify an objective function that is appropriate for the modeling task. We won’t go into the details in this lab, but you should use:
- objective = "multi:softmax" when working with a categorical outcome variable with at least 3 categories
- objective = "binary:logistic" when working with a binary outcome variable (see the sketch after this list)
- objective = "reg:squarederror" when working with a numeric outcome variable
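For example, here is a brief sketch of how the workflow above would change for a binary outcome (the 0/1 indicator for virginica is created purely for illustration):
## Binary outcome sketch: predict whether a flower is virginica (1) or not (0)
is_virginica <- as.integer(iris$Species == "virginica")
iris_xgb_bin <- xgb.DMatrix(data = as.matrix(select(iris, -Species)), label = is_virginica)
xgb_bin <- xgboost(data = iris_xgb_bin, params = list(objective = "binary:logistic"), nrounds = 100, verbose = 0)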
The xgboost method involves a large number of tuning parameters. A few of the most important are summarized below:
- eta controls the “learning rate”, i.e., how much to scale back the weighted contribution of trees appearing later in the ensemble (to prevent overfitting)
- gamma sets the minimum improvement in purity required for a split to occur within an individual tree in the ensemble
- max_depth sets the maximum depth of each tree in the ensemble
- colsample_bytree sets the proportion of variables used to construct a single tree; similar to the random forest algorithm, forced sampling of variables can decorrelate the trees in the ensemble
- alpha and lambda are regularization terms that penalize (shrink downward) the weighted contributions of trees in the ensemble, with alpha providing L1 regularization (encouraging sparsity, where some trees contribute nothing to the ensemble) and lambda providing L2 regularization
The xgboost package includes its own implementation of cross-validation:
cv_res = xgb.cv(data = iris_xgb, params = list(eta = 0.1, max_depth = 2),
nfold = 4, nrounds = 100, verbose = 0)
The cross-validated error is stored in the “evaluation_log” of the object created by xgb.cv. The error metric depends on the objective function; because no objective was specified in this call, the default squared-error objective is used and RMSE is reported. The visualization below shows the training data error (in-sample RMSE) and the testing data error (out-of-sample RMSE) of the model at each of the 100 boosting iterations:
## Items to graph
boosting_iter = cv_res$evaluation_log$iter
testing_rmse = cv_res$evaluation_log$test_rmse_mean
training_rmse = cv_res$evaluation_log$train_rmse_mean
ggplot() + geom_line(aes(x = boosting_iter,
y = training_rmse), color = "blue") +
geom_line(aes(x = boosting_iter,
y = testing_rmse), color = "red")
Here you should notice that the training error never reaches a minimum; it continues to decrease as the model becomes more complex (via more boosting iterations, which amounts to more trees being added to the ensemble). However, the testing error shows a clear minimum and tends to increase as more trees are added (suggesting these additional trees lead to over-fitting).
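Because the evaluation log records the average test error at every boosting iteration, we can also locate the iteration where the cross-validated error is smallest (a quick sketch using the objects defined above):
## Boosting iteration with the lowest cross-validated RMSE, and the corresponding RMSE
which.min(cv_res$evaluation_log$test_rmse_mean)
min(cv_res$evaluation_log$test_rmse_mean)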
Question #3: The code below loads game data from the Golden State Warriors’ historic 2015-16 season, in which the team broke the NBA record for most wins in a single season. The goal of this question is to compare models that aim to predict the observed score differential using the other statistics in the data set.
gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv") %>%
mutate(score_diff = Points - OppPoints) %>%
select(-Game, -Date, -Opp, -Win, -Points, -OppPoints)
Part A: Fit a random forest model that predicts “score_diff” using all other variables in gsw as predictors. Then, using the OOB mean-squared errors, report the estimated out-of-sample RMSE of the model by taking the square root of these MSE values and averaging the results.
Part B: Fit an xgboost model and estimate its out-of-sample RMSE using 4-fold cross-validation. Identify a combination of the tuning parameters nrounds, eta, and max_depth that produces a model with better performance than the random forest from Part A.
Part C: Construct a visual display of how the xgboost model’s performance compares to the OOB performance of the random forest, similar to the example given below:
Note: This example graph suggests trying even more boosting iterations since the cross-validated RMSE appears to still be decreasing after 100 iterations.
\(~\)
Because xgboost provides an ensemble that is built upon decision trees, measures of gain-based importance can be used to assess the relative contributions of individual variables to the model:
xgb.importance(model = my_xgb)
## Feature Gain Cover Frequency
## 1: Petal.Length 0.67029255 0.5152926 0.2884211
## 2: Petal.Width 0.31180273 0.2904833 0.3052632
## 3: Sepal.Width 0.01082526 0.0871759 0.1915789
## 4: Sepal.Length 0.00707946 0.1070482 0.2147368
Here, the “Gain” column shows the proportion of the total purity improvement that is attributable to splits involving each feature (row), the “Cover” column indicates the relative number of observations influenced by splits on that feature, and the “Frequency” column describes the relative frequency with which the feature is used across all trees in the ensemble.
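These importance measures can also be displayed graphically using the plotting function included in the xgboost package:
## Bar chart of gain-based importance for the fitted xgboost model
xgb.plot.importance(xgb.importance(model = my_xgb))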
Question #4: Compare the gain-based importance measures of the random forest and xgboost models used in Question #3. Identify one similarity and one difference in the most important variables under each approach.
\(~\)
This lab is intended to introduce the foundational topics of ensemble modeling, specifically bagging (random forests) and boosting (xgboost). When applying these models, I encourage you to seek out additional information to better understand their implementations, tuning parameters, and limitations. Below are a few resources:
xgboost R documentation: https://xgboost.readthedocs.io/en/stable/R-package/xgboostPresentation.html