This lab continues our introduction to supervised learning by introducing a new type of modeling approach for categorical outcomes.
Directions (Please read before starting)
\(~\)
This lab will use the following packages:
# load the following packages
# install.packages("rpart")
# install.packages("rpart.plot")
# install.packages("randomForest")
library(ggplot2)
library(dplyr)
library(plotROC)
library(rpart)
library(rpart.plot)
library(randomForest)
\(~\)
Classification and Regression Trees (CART) are non-parametric models that are well-suited for applications involving many interactions between features.
CART models are “trained” by recursively partitioning the \(p\)-dimensional space (defined by the explanatory variables) until an acceptable level of homogeneity or “purity” is achieved within each partition:
For classification, a commonly used measure of purity is the Gini Index: \(G = \sum_{j=1}^{k}p_j(1-p_j)\), or \(p_1(1-p_1) + p_2(1-p_2)\) for binary classification.
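To make the formula concrete, here is a small helper function (hypothetical, written only for illustration; it is not part of any package) that computes the Gini Index from a vector of class labels:
gini <- function(labels){
  p <- table(labels) / length(labels)   ## class proportions p_j
  sum(p * (1 - p))                      ## G = sum of p_j * (1 - p_j)
}

## A node containing 8 "yes" and 2 "no" labels: G = 0.8*0.2 + 0.2*0.8 = 0.32
gini(c(rep("yes", 8), rep("no", 2)))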
Shown below is an example of the CART algorithm applied to the famous “iris” data set, which contains 4 different measurements describing flowers from 3 different iris species.
data("iris")
mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
rpart.plot(mytree, extra=104) ## "extra = 104" formats the output
As previously mentioned, the main idea behind the CART algorithm is that nodes are split to achieve better “purity”. In this tree, the first node is split into two “branches” using the rule “Sepal.Length < 5.5”; thus, all data-points that satisfy this condition are sent to the branch on the left, and all data-points that don’t satisfy this condition are sent to the branch on the right.
We can see how this splitting rule partitions the data visually:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) +
  geom_point() +
  geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5)
All data-points to the left of the dashed line are sent to the left branch, and all data-points to the right are sent to the right branch. Based upon the other information in the tree, we know that the left branch contains 35% of the data-points and is 87% setosa flowers. The right branch contains 65% of the data and is pretty evenly split between virginica and versicolor flowers.
Next, we can see that both nodes in the tree’s second level are each split again. The node in the left branch is split using the decision rule “Sepal.Width \(\geq\) 2.8”, while the node in the right branch is split using the decision rule “Sepal.Length < 6.2”; these splits are shown visually in the plot below:
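For reference, a plot along these lines can be recreated with the code below (a sketch; the cutpoints 5.5, 2.8, and 6.2 are the rounded values reported by the tree described above):
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) +
  geom_point() +
  geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5) +                        ## first split
  annotate("segment", x = -Inf, xend = 5.5, y = 2.8, yend = 2.8, lty = 2) + ## Sepal.Width >= 2.8 (left branch)
  annotate("segment", x = 6.2, xend = 6.2, y = -Inf, yend = Inf, lty = 2)   ## Sepal.Length < 6.2 (right branch)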
As you can see in the plot, the CART algorithm is recursively partitioning the two-dimensional space (defined by Sepal.Width and Sepal.Length) in order to optimally separate the three categories of the outcome variable “Species”.
At this point we’ve described four nodes, two on the left branch and two on the right branch. The final tree includes one additional split (the third of these nodes is split using “Sepal.Width \(\geq\) 3.1”); a visual representation of this final model is shown below:
Question #1: We previously introduced the well switching data set. Our goal in modeling these data is to predict whether a household will switch from a contaminated well to the nearest safe well.
library(carData)
data("Wells")
Part A: Use the rpart() function to fit the CART model defined by the formula switch ~ arsenic + distance + education + association. Then use the rpart.plot() function with the argument extra=104 to visualize this model.
Part B: Describe the first splitting rule in the tree created in Part A.
Part C: Calculate the Gini Index before and after the first split.
Part D: What is the combination of explanatory variables that puts an observation into the tree’s most populated node? What proportion of observations in this node switched wells?
The CART algorithm will consider splitting its nodes until a stopping criterion is reached. We will not go in-depth into possible stopping rules, but they generally depend upon one or more factors such as the number of observations in a node, the improvement in purity offered by the best candidate split, and the overall depth or complexity of the tree.
Had we desired a tree of a particular depth, we could “prune” our tree by tuning what’s known as the complexity parameter cp, an internal parameter used to determine when the recursive partitioning algorithm should stop:
mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
newtree <- prune(mytree, cp = .1)
rpart.plot(newtree, type = 1, extra=104)
The complexity parameter describes the minimum improvement (proportionally speaking) for a split to be considered worthwhile. Setting cp = 0.1 specifies that an improvement (in information gain) of at least 10% is required for a new split.
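One way to see the improvement associated with each split is to print the cp table stored inside a fitted rpart object (shown here for the iris tree from above):
## Display the complexity parameter table: one row per split, showing the
## relative error after each split and the cp value at which it would be pruned away
printcp(mytree)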
Because the CART algorithm is greedy, we can always fit a larger tree using a small complexity parameter and then retroactively go back and increase cp using the prune() function if we desire a smaller or simpler tree.
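As a rough sketch of that workflow (using the same iris example), we might grow a deliberately deep tree with a tiny cp and then prune it back:
## Grow a deliberately large tree by setting a very small complexity parameter
bigtree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris, cp = 0.001)
## Retroactively increase cp to collapse splits that don't improve the fit by at least 10%
smalltree <- prune(bigtree, cp = 0.1)
rpart.plot(smalltree, extra = 104)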
Question #2: Prune the tree you created in Question #1 to only include splits that lead to at least a 5% information gain. Do you think this model (the pruned tree) is more or less likely to be overfit to the sample data? Briefly explain.
\(~\)
CART models tend to be high variance, meaning they are prone to over-fitting. Random forests are an ensemble modeling method based upon the fact that averaging a set of independent elements yields an outcome with lower variability than any of the individual elements in the set.
This concept should seem familiar from your intro stats class. The sample mean, \(\bar{x}\), has substantially less variability (\(\sigma/\sqrt{n}\)) than the individual data-points (\(\sigma\)).
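A quick simulation along these lines illustrates the gap in variability (a sketch for illustration only):
## 500 samples, each of size n = 100, drawn from a standard normal distribution
set.seed(1)
samples <- matrix(rnorm(500 * 100), nrow = 500)
sd(as.vector(samples))   ## variability of individual data-points (roughly 1)
sd(rowMeans(samples))    ## variability of the 500 sample means (roughly 1/sqrt(100) = 0.1)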
To translate this idea into a workable model, random forests use bagging (short for bootstrap aggregation).
Bagging creates \(B\) different training data sets that are nearly independent of each other (with \(B\) typically in the hundreds). The model of interest (in this case CART) is trained separately on each set, resulting in \(B\) different estimated “models”. These models are then aggregated (usually averaged) to produce a single, low-variance estimate.
Bagging is a general method; however, we’ll focus on its use in the context of random forests:
Essentially, a random forest is just a collection of decision trees that each “vote” on a final prediction:
Remark: Bootstrapping alone isn’t enough to decorrelate the \(B\) classification and regression trees. By mandating each split only considers a random set of \(m\) predictors, the strongest explanatory variables are prevented from dominating every tree (at the expense of other variables that might provide some predictive value). This results in trees that are closer to independent.
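To make these ideas concrete, here is a minimal sketch of bagging “by hand” using rpart() (for illustration only; it omits the random selection of \(m\) candidate predictors at each split, and the aggregated predictions below are in-sample rather than out-of-bag):
set.seed(42)
B <- 100
votes <- matrix(NA_character_, nrow = nrow(iris), ncol = B)

for(b in 1:B){
  ## Draw a bootstrap sample (sampling rows with replacement)
  boot_rows <- sample(1:nrow(iris), size = nrow(iris), replace = TRUE)
  ## Fit a CART model to the bootstrapped data
  tree_b <- rpart(Species ~ ., data = iris[boot_rows, ])
  ## Each tree "votes" on a predicted class for every observation
  votes[, b] <- as.character(predict(tree_b, newdata = iris, type = "class"))
}

## Aggregate the votes: the majority class across the B trees is the bagged prediction
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
table(iris$Species, bagged_pred)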
The code below demonstrates how to apply the random forest algorithm in R:
#install.packages("randomForest")
library(randomForest)
rf <- randomForest(Species ~ ., data = iris)
Because the random forest algorithm uses bagging, each tree never sees the observations left out of its bootstrapped sample, so the forest naturally provides out-of-sample classifications for every data-point; this means that methods like cross-validation are not necessary. However, properly evaluating a single CART model does require cross-validation.
Notice the substantial improvement in classification accuracy achieved by the random forest algorithm relative to a single tree:
## Random Forest confusion matrix (naturally out-of-sample)
rf$confusion
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 4 46 0.08
## CART confusion matrix
### CV Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(iris)), size = nrow(iris))
preds <- numeric(nrow(iris))  ## predicted classes will be stored as integer codes (1 = setosa, 2 = versicolor, 3 = virginica)
## Loop across CV folds
for(k in 1:5){
## Subset the data
train <- iris[fold_id != k, ]
test <- iris[fold_id == k, ]
## Fit CART model
tr <- rpart(Species ~ ., data = train)
## Store Predictions
preds[fold_id == k] <- predict(tr, newdata = test, type = "class")
}
## Confusion Matrix
table(iris$Species, preds)
## preds
## 1 2 3
## setosa 50 0 0
## versicolor 0 46 4
## virginica 0 7 43
Question #3: In your own words, briefly explain why the CART algorithm yields less accurate out-of-sample predictions than the random forest algorithm when applied to Fisher’s iris dataset.
\(~\)
The random forest algorithm depends upon the following tuning parameters, each of which can influence the model’s accuracy:

- ntree - the number of bagged samples, \(B\), onto which trees will be grown
- mtry - the number of variables that are randomly chosen to be candidates at each split
- nodesize, which sets the minimum size of terminal nodes (setting this larger leads to smaller trees), or maxnodes, which sets the maximum number of terminal nodes an individual tree can have

You can type ?randomForest to read more about these options (such as their default values, etc.) in the R documentation for the randomForest() function. The code below demonstrates how to specify them:
rf <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 2, maxnodes = 4)
\(~\)
A downside of both the CART and random forest algorithms (as well as many other algorithmic modeling approaches) is the difficulty in quantifying the roles played by individual variables in making predictions. However, the importance of individual variables in a random forest can still be expressed using a measure known as variable importance.
Variable importance is a general term, but most R functions will (by default) take it to refer to the average improvement in tree purity (measured by the Gini Index) that is attributable to splits involving each variable. The code below demonstrates the calculation in R:
rf <- randomForest(Species ~ ., data = iris)
rf$importance
## MeanDecreaseGini
## Sepal.Length 10.673018
## Sepal.Width 2.508506
## Petal.Length 43.115187
## Petal.Width 42.949449
This metric suggests petal length and petal width are most useful in accurately predicting the class of an iris flower, while sepal width seems to have little importance.
Unfortunately, variable importance isn’t as easily interpreted as coefficients in a regression (or logistic regression) model, so if the goal of an application is to estimate the effect of a particular variable, regression (or one of its variants) is preferable to methods like CART and random forests.
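If a graphical summary is preferred, the randomForest package also includes varImpPlot(), which displays these importance scores as a dot chart:
## Dot chart of variable importance (mean decrease in Gini) for the fitted forest
varImpPlot(rf)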
Question #4: Fit two different random forest models to the full iris data set, both using \(B = 300\) bootstrapped samples and a maximum of 4 nodes. Allow the first model to consider only 1 randomly selected variable for each split, and allow the second model to consider 3 randomly selected variables for each split. How do the variable importance measures for each model compare? Briefly explain why the importance measures look so different across the two models.
\(~\)
Question #5: When studying PCA, we were exposed to the wine data set from the UC-Irvine machine learning data repository. These data describe wine samples from three different growers in the same region of Italy. The goal of this analysis will be to correctly identify which wines came from which growers.
The code below reads these data:
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
colnames(wine) <- c("GrowerID" ,"Alcohol", "MalicAcid", "Ash", "Alkalinity", "Mg", "TotalPhenols", "Flavanoids", "NonFlavanoidPhenols", "Proanthocyanins", "ColorIntensity", "Hue", "OD280-OD315Ratio", "Proline")
wine$GrowerID <- factor(wine$GrowerID)
colnames(wine) <- make.names(colnames(wine))
Part A: Fit a classification tree to the wine data to predict Grower ID using all other available variables as predictors. Create a visualization of this tree and describe its first split.
Part B: Estimate the out-of-sample classification accuracy of the CART model from Part A using 5-fold cross validation.
Part C: Fit a random forest model using default values for tuning parameters. Compare the out-of-sample classification accuracy of this model to the value you found in Part B. Which model seems to perform better?
Part D: By trial and error (or grid search) find a combination of tuning parameters that improves the classification accuracy of the random forest model described in Part C.
Part E: Identify the two most important explanatory variables from the random forest model in Part D, then create a visualization displaying these two variables and the outcome (Grower ID).