This lab is optional and may be completed for a small amount of extra credit. I encourage you to consider using these methods on your final project.
library(rpart)
library(dplyr)
library(ggplot2)
#install.packages("randomForest")
library(randomForest)
\(~\)
Decision tree models tend to be high variance, meaning they are prone to over-fitting the data they are trained on.
Random forests are an ensemble modeling method that ameliorates the high variance of single decision trees by exploiting the idea that averaging a set of independent elements yields an outcome with lower variability than any of the individual elements in the set.
The basic idea that an average is less variable than an individual entity should be familiar from intro stats class. For example, the sample mean, \(\bar{x}\), has substantially less variability (\(\sigma/\sqrt{n}\)) than the variability of the individual data-points in a population (\(\sigma\)).
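If you'd like to see this principle numerically, the brief simulation below (an optional sketch, not part of the lab's required code) compares the standard deviation of individual observations to the standard deviation of sample means based on \(n = 25\) observations:
## Simulation sketch: variability of individual values vs. variability of sample means
set.seed(1)
x <- rnorm(10000, mean = 0, sd = 2)  ## 10,000 individual data-points (sigma = 2)
xbar <- replicate(10000, mean(rnorm(25, mean = 0, sd = 2)))  ## 10,000 sample means (n = 25)
sd(x)     ## roughly 2
sd(xbar)  ## roughly 2/sqrt(25) = 0.4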
To translate this idea into a workable modeling method, random forests use bagging (short for bootstrap aggregation).
Bagging creates \(B\) different bootstrapped training data sets that are each slightly different. The model of interest (here, a decision tree) is trained separately on each of these data sets, producing a collection of \(B\) different estimated “models”, sometimes called “base learners”. When it comes time to generate predictions, the new data is given to each of these models and the resulting set of predictions is then aggregated (usually by averaging or voting) to produce a single prediction.
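To make these steps concrete, here is a minimal sketch of bagging “by hand” using rpart trees as base learners and a majority vote to aggregate predictions (the object names here are my own, and the predictions are made in-sample purely to illustrate the mechanics; the randomForest() function used later handles all of this internally):
## Bagging sketch: fit B trees to bootstrapped samples, then aggregate by majority vote
B <- 100
n <- nrow(iris)
votes <- matrix(NA_character_, nrow = n, ncol = B)
for(b in 1:B){
  boot_rows <- sample(1:n, size = n, replace = TRUE)      ## bootstrapped training set
  tree_b <- rpart(Species ~ ., data = iris[boot_rows, ])  ## base learner (decision tree)
  votes[, b] <- as.character(predict(tree_b, newdata = iris, type = "class"))
}
## Each flower's final prediction is the most common vote across the B trees
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))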
You should note that bagging is a general method that can be applied to any base learner. We’ll focus on its use in the random forest algorithm, which combines bagging, decision trees, and one additional strategy. Below is an outline of the random forest algorithm:
- Draw \(B\) bootstrapped samples of the training data
- Grow a decision tree on each bootstrapped sample, but at each split consider only a randomly chosen subset of \(m\) of the available predictors
- Aggregate the \(B\) trees’ predictions by averaging (regression) or majority vote (classification)
Essentially, a random forest is just a collection of decision trees that each “vote” on a final prediction.
Remark: Bootstrapping alone isn’t enough to decorrelate the \(B\) classification and regression trees. By mandating that each split consider only a random set of \(m\) predictors, the strongest explanatory variables are prevented from dominating every tree at the expense of other variables that might provide some unique predictive value. Decreasing \(m\) produces a set of trees that are closer to being independent.
The code below demonstrates how to apply the random forest algorithm in R:
rf <- randomForest(Species ~ ., data = iris)
Because the random forest algorithm uses bagging, it can naturally provide an out-of-sample error rate by considering the data-points that didn’t end up in each bootstrapped sample. These are called out-of-bag (OOB) samples, and evaluating the model using them provides an out-of-bag error rate.
This means cross-validation is not necessary to evaluate the performance of a random forest model; however, properly evaluating a single CART tree does require cross-validation.
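For example, the err.rate component of a classification forest stores the OOB error rate after each tree is added, so the final row reflects the full ensemble (printing the fitted model also reports this overall OOB estimate):
## OOB error rates of the full forest (final row of the error-rate matrix)
tail(rf$err.rate, 1)
## Printing the model reports the overall OOB estimate of the error rate
print(rf)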
Notice the substantial improvement in classification accuracy achieved by the random forest algorithm relative to a single tree:
## Random Forest confusion matrix (naturally out-of-sample)
rf$confusion
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 4 46 0.08
## CART confusion matrix
### CV Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(iris)), size = nrow(iris))
preds <- numeric(nrow(iris))
## Loop across CV folds
for(k in 1:5){
## Subset the data
train <- iris[fold_id != k, ]
test <- iris[fold_id == k, ]
## Fit CART model
tr <- rpart(Species ~ ., data = train)
## Store predictions (because preds is numeric, each predicted factor level is stored as its integer code: 1 = setosa, 2 = versicolor, 3 = virginica)
preds[fold_id == k] <- predict(tr, newdata = test, type = "class")
}
## Confusion Matrix
table(iris$Species, preds)
## preds
## 1 2 3
## setosa 50 0 0
## versicolor 0 46 4
## virginica 0 7 43
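If you'd like a single summary number for each approach, the short sketch below computes overall accuracy (recall that the cross-validated CART predictions were stored as the integer codes 1, 2, and 3):
## Out-of-sample accuracy of the cross-validated CART model
mean(preds == as.integer(iris$Species))
## Out-of-bag accuracy of the random forest (correct counts from the OOB confusion matrix)
sum(diag(rf$confusion[, 1:3]))/nrow(iris)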
Question #1: In your own words, briefly explain why the CART algorithm yields less accurate out-of-sample predictions than the random forest algorithm when applied to Fisher’s iris data set.
\(~\)
The random forest algorithm depends upon the following tuning parameters, each of which can influence the model’s accuracy. The two most important are:
- mtry - the number of variables that are randomly chosen to be candidates at each split
- nodesize, which sets the minimum size of terminal nodes (setting this larger leads to smaller trees), or maxnodes, which sets the maximum number of terminal nodes an individual tree can have
You should also be mindful of the ntree parameter, which determines the number of bootstrapped samples (and therefore trees) used in the forest. If too few trees are used, predictions might not be as accurate as they could be. However, the benefits of using more trees plateau at a certain point.
You can type ?randomForest to read more about these options (such as their default values, etc.) in the R documentation for the randomForest() function. The code below provides an example of how to specify them:
rf <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 2, maxnodes = 4)
This forest will contain 100 trees, with each tree having a maximum of 4 terminal nodes. Additionally, two variables are randomly selected to be considered for each split.
\(~\)
A downside of the random forest algorithm (as well as many other algorithmic modeling approaches) is the difficulty in quantifying the roles played by individual variables in making predictions. However, the importance of individual variables in a random forest can still be expressed using a measure known as variable importance.
Variable importance is a general term, but most R functions take it to refer to the average improvement in tree purity (measured by the Gini Index) that is attributable to splits involving each variable. Some functions re-scale these improvements so that they sum to 100, while randomForest() reports them directly as the mean decrease in Gini impurity. The code below demonstrates the calculation in R:
rf <- randomForest(Species ~ ., data = iris)
rf$importance
## MeanDecreaseGini
## Sepal.Length 10.673018
## Sepal.Width 2.508506
## Petal.Length 43.115187
## Petal.Width 42.949449
This metric suggests petal length and petal width are most useful in accurately predicting the class of an iris flower, while sepal width seems to have little importance.
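The randomForest package also provides a built-in plotting function for these importance measures, which can make comparisons easier to see at a glance:
## Dot chart of variable importance for the fitted forest
varImpPlot(rf)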
Unfortunately, variable importance isn’t as easily interpreted as coefficients in a regression (or logistic regression) model, so if the goal of an application is to estimate the effect of a particular variable, regression (or one of its variants) might be preferable to a random forest, especially if the performance is comparable.
Question #2: Fit two different random forest models to the full iris data set, both using \(B = 300\) bootstrapped samples and a maximum of 4 terminal nodes. Allow the first model to consider only 1 randomly selected variable at each split, and allow the second model to consider 3 randomly selected variables at each split. How do the variable importance measures for each model compare? Briefly explain why the importance measures look so different across the two models.
\(~\)
Bagging methods, such as the random forest algorithm, build an ensemble of base learners (single decision trees) that are trained separately. Alternatively, boosting methods build an ensemble of base learners that are trained sequentially.
The original proposal of boosting was an algorithm known as adaptive boosting, or AdaBoost.
The original AdaBoost algorithm has since been generalized and improved upon, with modern approaches training each subsequent model on the residuals of the previous model rather than a re-weighted version of the original data set. This approach is known as gradient boosting, a name referencing the fact that a model’s residuals are related to the gradient of the model’s objective function (a concept from calculus/optimization).
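To see the “fit each new tree to the residuals” idea in action, below is a small illustrative sketch of gradient boosting for a numeric outcome using shallow rpart trees (predicting Sepal.Length with a learning rate of 0.1, 50 iterations, and a depth limit of 2 are all arbitrary choices made for this illustration; real implementations such as xgboost add many refinements):
## Gradient boosting sketch: each new tree is fit to the residuals of the current ensemble
boost_df <- select(iris, -Species, -Sepal.Length)  ## predictors
y <- iris$Sepal.Length                             ## numeric outcome
current_pred <- rep(mean(y), length(y))            ## start from a constant prediction
eta <- 0.1                                         ## learning rate (scales back each tree)
for(b in 1:50){
  boost_df$resid <- y - current_pred               ## residuals of the current ensemble
  tree_b <- rpart(resid ~ ., data = boost_df, control = rpart.control(maxdepth = 2))
  current_pred <- current_pred + eta*predict(tree_b, newdata = boost_df)
}
## In-sample RMSE of the boosted ensemble
sqrt(mean((y - current_pred)^2))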
The software package and modeling approach xgboost is an efficient implementation of gradient boosted decision trees:
#install.packages("xgboost")
library(xgboost)
It’s worth noting that xgboost was not originally designed for R, and training an xgb model requires you to create a special data storage object:
## Create an xgb Data Matrix - notice the data must be an R matrix, and the "label" is the response variable
map = c("setosa" = 0, "versicolor" = 1, "virginica" = 2)
iris_xgb = xgb.DMatrix(data = as.matrix(select(iris, -Species)), label = map[iris$Species])
## Fit the xgb model
my_xgb <- xgboost(data = iris_xgb, params = list(objective = "multi:softmax", num_class = 3), nrounds = 100, verbose = 0)
Notice how the categorical outcome must be encoded by integers that start at zero. This is achieved in the example above by using a named vector to map each species to an integer (an idea we haven’t discussed, but which is relatively straightforward).
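If you'd prefer not to write out the mapping by hand, the same zero-based encoding can be obtained from the factor's underlying integer codes (a quick sketch; this matches the named mapping above because its values follow the factor's level order):
## Factor levels are stored internally as the integers 1, 2, 3, ..., so subtracting 1 gives 0-based labels
labels_alt <- as.integer(iris$Species) - 1
table(iris$Species, labels_alt)  ## each species should map to a single integer code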
Additionally, when using xgboost we must specify an objective function that is appropriate for the modeling task. We won’t go into the details in this lab, but you should use:
- objective = "multi:softmax" when working with a categorical outcome variable with at least 3 categories
- objective = "binary:logistic" when working with a binary outcome variable (see the sketch after this list)
- objective = "reg:squarederror" when working with a numeric outcome variable
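For example, here is a brief sketch of how the workflow above would change for a binary outcome (the 0/1 indicator for virginica is created purely for illustration):
## Binary outcome sketch: predict whether a flower is virginica (1) or not (0)
is_virginica <- as.integer(iris$Species == "virginica")
iris_xgb_bin <- xgb.DMatrix(data = as.matrix(select(iris, -Species)), label = is_virginica)
xgb_bin <- xgboost(data = iris_xgb_bin, params = list(objective = "binary:logistic"), nrounds = 100, verbose = 0)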
The xgboost method involves a large number of tuning parameters. A few of the most important are summarized below:
- eta controls the “learning rate”, i.e., how much to scale back the weighted contribution of trees appearing later in the ensemble (to prevent overfitting)
- gamma sets the minimum improvement in purity required for a split to occur within an individual tree in the ensemble
- max_depth sets the maximum depth of each tree in the ensemble
- colsample_bytree sets the proportion of variables used to construct a single tree; similar to the random forest algorithm, forced sampling of variables can decorrelate the trees in the ensemble
- alpha and lambda are regularization terms that penalize (shrink downward) the weighted contributions of trees in the ensemble, with alpha providing L1 regularization (encouraging sparsity, where some trees contribute nothing to the ensemble) and lambda providing L2 regularization
The xgboost package includes its own implementation of cross-validation:
cv_res = xgb.cv(data = iris_xgb, params = list(eta = 0.1, max_depth = 2),
nfold = 4, nrounds = 100, verbose = 0)
The cross-validated error is stored in the “evaluation_log” of the object created by xgb.cv. The error metric depends on the objective function; because no objective was specified in this call, the default squared-error objective is used and RMSE is reported. The visualization below shows the training data error (in-sample RMSE) and the testing data error (out-of-sample RMSE) of the model at each of the 100 boosting iterations:
## Items to graph
boosting_iter = cv_res$evaluation_log$iter
testing_rmse = cv_res$evaluation_log$test_rmse_mean
training_rmse = cv_res$evaluation_log$train_rmse_mean
ggplot() + geom_line(aes(x = boosting_iter,
y = training_rmse), color = "blue") +
geom_line(aes(x = boosting_iter,
y = testing_rmse), color = "red")
Here you should notice that the training error never reaches a minimum; it continues to decrease as the model becomes more complex (via more boosting iterations, which amounts to more trees being added to the ensemble). However, the testing error shows a clear minimum and tends to increase as more trees are added (suggesting these additional trees lead to over-fitting).
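Because the evaluation log records the average test error at every boosting iteration, we can also locate the iteration where the cross-validated error is smallest (a quick sketch using the objects defined above):
## Boosting iteration with the lowest cross-validated RMSE, and the corresponding RMSE
which.min(cv_res$evaluation_log$test_rmse_mean)
min(cv_res$evaluation_log$test_rmse_mean)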
Question #3: The code below loads game data from the Golden State Warriors’ historic 2015-16 season, in which the team broke the NBA record for most wins in a single season. The goal of this question is to compare models that aim to predict the observed score differential using the other statistics in the data set.
gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv") %>%
mutate(score_diff = Points - OppPoints) %>%
select(-Game, -Date, -Opp, -Win, -Points, -OppPoints)
Part A: Fit a random forest model that predicts “score_diff” using all other variables in gsw as predictors. Then, using the OOB mean-squared errors, report the estimated out-of-sample RMSE of the model by taking the square root of these MSE values and averaging the results.
Part B: Fit an xgboost model and estimate its out-of-sample RMSE using 4-fold cross-validation. Identify a combination of the tuning parameters nrounds, eta, and max_depth that produces a model with better performance than the random forest from Part A.
Part C: Construct a visual display of how the xgboost model’s performance compares to the OOB performance of the random forest, similar to the example given below:
Note: This example graph suggests trying even more boosting iterations since the cross-validated RMSE appears to still be decreasing after 100 iterations.
\(~\)
Because xgboost provides an ensemble that is built upon decision trees, measures of gain-based importance can be used to assess the relative contributions of individual variables to the model:
xgb.importance(model = my_xgb)
## Feature Gain Cover Frequency
## 1: Petal.Length 0.67029255 0.5152926 0.2884211
## 2: Petal.Width 0.31180273 0.2904833 0.3052632
## 3: Sepal.Width 0.01082526 0.0871759 0.1915789
## 4: Sepal.Length 0.00707946 0.1070482 0.2147368
Here, the “Gain” column shows the proportion of the total purity improvement that is attributable to splits involving each feature (row), the “Cover” column indicates the relative number of observations influenced by splits on that feature, and the “Frequency” column describes the relative frequency with which the feature is used across all trees in the ensemble.
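These importance measures can also be displayed graphically using the plotting function included in the xgboost package:
## Bar chart of gain-based importance for the fitted xgboost model
xgb.plot.importance(xgb.importance(model = my_xgb))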
Question #4: Compare the gain-based importance measures of the random forest and xgboost models used in Question #3. Identify one similarity and one difference in the most important variables under each approach.
\(~\)
This lab is intended to introduce the foundational topics of ensemble modeling, specifically bagging (random forests) and boosting (xgboost). When applying these models, I encourage you to seek out additional information to better understand their implementations, tuning parameters, and limitations. Below are a few resources:
xgboost R documentation: https://xgboost.readthedocs.io/en/stable/R-package/xgboostPresentation.html