This lab continues our introduction to supervised learning by introducing a new type of modeling approach for categorical outcomes.
Directions (Please read before starting)
\(~\)
This lab will use the following packages:
# load the following packages
# install.packages("rpart")
# install.packages("rpart.plot")
# install.packages("randomForest")
library(ggplot2)
library(dplyr)
library(plotROC)
library(rpart)
library(rpart.plot)
library(randomForest)
\(~\)
Classification and Regression Trees (CART) are non-parametric models that are well-suited for applications involving many interactions between features.
CART models are “trained” by recursively partitioning the \(p\)-dimensional space (defined by the explanatory variables) until an acceptable level of homogeneity or “purity” is achieved within each partition:
For classification, a commonly used measure of purity is the Gini Index: \(G = \sum_{j=1}^{k}p_j(1-p_j)\), or \(p_1(1-p_1) + p_2(1-p_2)\) for binary classification.
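To make the formula concrete, here is a small helper function (hypothetical, written only for illustration; it is not part of any package) that computes the Gini Index from a vector of class labels:
gini <- function(labels){
  p <- table(labels) / length(labels)   ## class proportions p_j
  sum(p * (1 - p))                      ## G = sum of p_j * (1 - p_j)
}

## A node containing 8 "yes" and 2 "no" labels: G = 0.8*0.2 + 0.2*0.8 = 0.32
gini(c(rep("yes", 8), rep("no", 2)))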
Shown below is an example of the CART algorithm applied to the famous “iris” data set, which contains 4 different measurements describing flowers from 3 different iris species.
data("iris")
mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
rpart.plot(mytree, extra=104) ## "extra = 104" formats the output
As previously mentioned, the main idea behind the CART algorithm is that nodes are split to achieve better “purity”. In this tree, the first node is split into two “branches” using the rule “Sepal.Length < 5.5”; thus, all data-points that satisfy this condition are sent to the branch on the left, and all data-points that don’t satisfy this condition are sent to the branch on the right.
We can see how this splitting rule partitions the data visually:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) +
  geom_point() +
  geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5)
All data-points to the left of the dashed line are sent to the left branch, and all data-points to the right are sent to the right branch. Based upon the other information in the tree, we know that the left branch contains 35% of the data-points and is 87% setosa flowers. The right branch contains 65% of the data and is pretty evenly split between virginica and versicolor flowers.
Next, we can see that both nodes in the tree’s second level are each split again. The node in the left branch is split using the decision rule “Sepal.Width \(\geq\) 2.8”, while the node in the right branch is split using the decision rule “Sepal.Length < 6.2”; these splits are shown visually in the plot below:
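For reference, a plot along these lines can be recreated with the code below (a sketch; the cutpoints 5.5, 2.8, and 6.2 are the rounded values reported by the tree described above):
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) +
  geom_point() +
  geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5) +                        ## first split
  annotate("segment", x = -Inf, xend = 5.5, y = 2.8, yend = 2.8, lty = 2) + ## Sepal.Width >= 2.8 (left branch)
  annotate("segment", x = 6.2, xend = 6.2, y = -Inf, yend = Inf, lty = 2)   ## Sepal.Length < 6.2 (right branch)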
As you can see in the plot, the CART algorithm is recursively partitioning the two-dimensional space (defined by Sepal.Width and Sepal.Length) in order to optimally separate the three categories of the outcome variable “Species”.
At this point we’ve described four nodes, two on the left branch and two on the right branch. The final tree includes one additional split (the third of these nodes is split using “Sepal.Width \(\geq\) 3.1”); a visual representation of this final model is shown below:
Question #1: We previously introduced the well switching data set. Our goal in modeling these data is to predict whether a household will switch from a contaminated well to the nearest safe well.
library(carData)
data("Wells")
Part A: Use the rpart() function to fit the CART model defined by the formula switch ~ arsenic + distance + education + association. Then use the rpart.plot() function with the argument extra=104 to visualize this model.
Part B: Describe the first splitting rule in the tree created in Part A.
Part C: Calculate the Gini Index before and after the first split.
Part D: What is the combination of explanatory variables that puts an observation into the tree’s most populated node? What proportion of observations in this node switched wells?
The CART algorithm will consider splitting its nodes until a stopping criterion is reached. We will not go in-depth into possible stopping rules, but they generally depend upon one or more factors such as the number of observations in a node, the improvement in purity offered by the best candidate split, and the overall depth or complexity of the tree.
Had we desired a tree of a particular depth, we could “prune” our tree by tuning what’s known as the complexity parameter cp, an internal parameter used to determine when the recursive partitioning algorithm should stop:
mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
newtree <- prune(mytree, cp = .1)
rpart.plot(newtree, type = 1, extra=104)
The complexity parameter describes the minimum improvement (proportionally speaking) for a split to be considered worthwhile. Setting cp = 0.1 specifies that an improvement (in information gain) of at least 10% is required for a new split.
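One way to see the improvement associated with each split is to print the cp table stored inside a fitted rpart object (shown here for the iris tree from above):
## Display the complexity parameter table: one row per split, showing the
## relative error after each split and the cp value at which it would be pruned away
printcp(mytree)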
Because the CART algorithm is greedy, we can always fit a larger tree using a small complexity parameter and then retroactively go back and increase cp using the prune() function if we desire a smaller or simpler tree.
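As a rough sketch of that workflow (using the same iris example), we might grow a deliberately deep tree with a tiny cp and then prune it back:
## Grow a deliberately large tree by setting a very small complexity parameter
bigtree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris, cp = 0.001)
## Retroactively increase cp to collapse splits that don't improve the fit by at least 10%
smalltree <- prune(bigtree, cp = 0.1)
rpart.plot(smalltree, extra = 104)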
Question #2: Prune the tree you created in Question #1 to only include splits that lead to at least a 5% information gain. Do you think this model (the pruned tree) is more or less likely to be overfit to the sample data? Briefly explain.
\(~\)
CART models tend to be high variance, meaning they are prone to over-fitting. Random forests are an ensemble modeling method based upon the fact that averaging a set of independent elements yields an outcome with lower variability than any of the individual elements in the set.
This concept should seem familiar from your intro stats class. The sample mean, \(\bar{x}\), has substantially less variability (\(\sigma/\sqrt{n}\)) than the individual data-points (\(\sigma\)).
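A quick simulation along these lines illustrates the gap in variability (a sketch for illustration only):
## 500 samples, each of size n = 100, drawn from a standard normal distribution
set.seed(1)
samples <- matrix(rnorm(500 * 100), nrow = 500)
sd(as.vector(samples))   ## variability of individual data-points (roughly 1)
sd(rowMeans(samples))    ## variability of the 500 sample means (roughly 1/sqrt(100) = 0.1)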
To translate this idea into a workable model, random forests use bagging (short for bootstrap aggregation).
Bagging creates \(B\) different training data sets that are nearly independent of each other (with \(B\) typically in the hundreds). The model of interest (in this case CART) is trained separately on each set, resulting in \(B\) different estimated “models”. These models are then aggregated (usually averaged) to produce a single, low-variance estimate.
Bagging is a general method; however, we’ll focus on its use in the context of random forests:
Essentially, a random forest is just a collection of decision trees that each “vote” on a final prediction:
Remark: Bootstrapping alone isn’t enough to decorrelate the \(B\) classification and regression trees. By mandating each split only considers a random set of \(m\) predictors, the strongest explanatory variables are prevented from dominating every tree (at the expense of other variables that might provide some predictive value). This results in trees that are closer to independent.
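To make these ideas concrete, here is a minimal sketch of bagging “by hand” using rpart() (for illustration only; it omits the random selection of \(m\) candidate predictors at each split, and the aggregated predictions below are in-sample rather than out-of-bag):
set.seed(42)
B <- 100
votes <- matrix(NA_character_, nrow = nrow(iris), ncol = B)

for(b in 1:B){
  ## Draw a bootstrap sample (sampling rows with replacement)
  boot_rows <- sample(1:nrow(iris), size = nrow(iris), replace = TRUE)
  ## Fit a CART model to the bootstrapped data
  tree_b <- rpart(Species ~ ., data = iris[boot_rows, ])
  ## Each tree "votes" on a predicted class for every observation
  votes[, b] <- as.character(predict(tree_b, newdata = iris, type = "class"))
}

## Aggregate the votes: the majority class across the B trees is the bagged prediction
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
table(iris$Species, bagged_pred)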
The code below demonstrates how to apply the random forest algorithm in R:
#install.packages("randomForest")
library(randomForest)
rf <- randomForest(Species ~ ., data = iris)
Because the random forest algorithm uses bagging, each tree never sees the observations left out of its bootstrapped sample, so the forest naturally provides out-of-sample classifications for every data-point; this means that methods like cross-validation are not necessary. However, properly evaluating a single CART model does require cross-validation.
Notice the substantial improvement in classification accuracy achieved by the random forest algorithm relative to a single tree:
## Random Forest confusion matrix (naturally out-of-sample)
rf$confusion
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 4 46 0.08
## CART confusion matrix
### CV Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(iris)), size = nrow(iris))
preds <- numeric(nrow(iris))  ## predicted classes will be stored as integer codes (1 = setosa, 2 = versicolor, 3 = virginica)
## Loop across CV folds
for(k in 1:5){
## Subset the data
train <- iris[fold_id != k, ]
test <- iris[fold_id == k, ]
## Fit CART model
tr <- rpart(Species ~ ., data = train)
## Store Predictions
preds[fold_id == k] <- predict(tr, newdata = test, type = "class")
}
## Confusion Matrix
table(iris$Species, preds)
## preds
## 1 2 3
## setosa 50 0 0
## versicolor 0 46 4
## virginica 0 7 43
Question #3: In your own words, briefly explain why the CART algorithm yields less accurate out-of-sample predictions than the random forest algorithm when applied to Fisher’s iris dataset.
\(~\)
The random forest algorithm depends upon the following tuning parameters, each of which can influence the model’s accuracy:

- ntree - the number of bagged samples, \(B\), onto which trees will be grown
- mtry - the number of variables that are randomly chosen to be candidates at each split
- nodesize, which sets the minimum size of terminal nodes (setting this larger leads to smaller trees), or maxnodes, which sets the maximum number of terminal nodes an individual tree can have

You can type ?randomForest to read more about these options (such as their default values, etc.) in the R documentation for the randomForest() function. The code below demonstrates how to specify them:
rf <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 2, maxnodes = 4)
\(~\)
A downside of both the CART and random forest algorithms (as well as many other algorithmic modeling approaches) is the difficulty in quantifying the roles played by individual variables in making predictions. However, the importance of individual variables in a random forest can still be expressed using a measure known as variable importance.
Variable importance is a general term, but most R functions will (by default) take it to refer to the average improvement in tree purity (measured by the Gini Index) that is attributable to splits involving each variable. The code below demonstrates the calculation in R:
rf <- randomForest(Species ~ ., data = iris)
rf$importance
## MeanDecreaseGini
## Sepal.Length 10.673018
## Sepal.Width 2.508506
## Petal.Length 43.115187
## Petal.Width 42.949449
This metric suggests petal length and petal width are most useful in accurately predicting the class of an iris flower, while sepal width seems to have little importance.
Unfortunately, variable importance isn’t as easily interpreted as coefficients in a regression (or logistic regression) model, so if the goal of an application is to estimate the effect of a particular variable, regression (or one of its variants) is preferable to methods like CART and random forests.
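If a graphical summary is preferred, the randomForest package also includes varImpPlot(), which displays these importance scores as a dot chart:
## Dot chart of variable importance (mean decrease in Gini) for the fitted forest
varImpPlot(rf)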
Question #4: Fit two different random forest models to the full iris data set, both using \(B = 300\) bootstrapped samples and a maximum of 4 nodes. Allow the first model to consider only 1 randomly selected variable for each split, and allow the second model to consider 3 randomly selected variables for each split. How do the variable importance measures for each model compare? Briefly explain why the importance measures look so different across the two models.
\(~\)
Question #5: When studying PCA, we were exposed to the wine data set from the UC-Irvine machine learning data repository. These data describe wine samples from three different growers in the same region of Italy. The goal of this analysis will be to correctly identify which wines came from which growers.
The code below reads these data:
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
colnames(wine) <- c("GrowerID" ,"Alcohol", "MalicAcid", "Ash", "Alkalinity", "Mg", "TotalPhenols", "Flavanoids", "NonFlavanoidPhenols", "Proanthocyanins", "ColorIntensity", "Hue", "OD280-OD315Ratio", "Proline")
wine$GrowerID <- factor(wine$GrowerID)
colnames(wine) <- make.names(colnames(wine))
Part A: Fit a classification tree to the wine data to predict Grower ID using all other available variables as predictors. Create a visualization of this tree and describe its first split.
Part B: Estimate the out-of-sample classification accuracy of the CART model from Part A using 5-fold cross validation.
Part C: Fit a random forest model using default values for tuning parameters. Compare the out-of-sample classification accuracy of this model to the value you found in Part B. Which model seems to perform better?
Part D: By trial and error (or grid search) find a combination of tuning parameters that improves the classification accuracy of the random forest model described in Part C.
Part E: Identify the two most important explanatory variables from the random forest model in Part D, then create a visualization displaying these two variables and the outcome (Grower ID).