Directions:

As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.

You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.

You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.

\(~\)

Introduction

Throughout the majority of the semester we’ve focused extensively on two varieties of generalized linear models (linear regression and logistic regression). To recap, these models enjoy the following strengths:

  1. The effects of individual predictors on the outcome are easily understood
  2. Statistical inference, such as hypothesis testing or interval estimation, is straightforward
  3. Methods and procedures for selecting, comparing, and summarizing these models are well-established and extensively studied

For these reasons, GLMs are very popular; however, their rigid parametric structure can put them at a disadvantage in certain circumstances:

  • Complex, non-linear relationships between predictors and the outcome
  • High degrees of interaction between predictors
  • Nominal outcome variables with several categories

In these situations, non-parametric or algorithmic modeling approaches have the potential to better capture the underlying trends in the data.

In this lab, we’ll briefly cover a couple of these approaches, namely classification and regression trees (CART) and random forests. Additionally, you should recall that we’ve already discussed k-nearest neighbors, which is also appropriate for these settings. Thus, by the end of this lab you should have three different non-parametric models to consider as alternatives to regression.

Throughout the lab, we’ll use Fisher’s Iris dataset to illustrate the aforementioned methods:

data("iris")
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Recognize that this dataset involves a nominal categorical outcome, “Species”, that has three categories.

\(~\)

Classification and Regression Trees

Classification and regression trees (CART) are “trained” by recursively partitioning the \(p\)-dimensional space (defined by the explanatory variables) until an acceptable level of homogeneity or “purity” is achieved within each partition.

The recursive partitioning can be viewed as a set of rules that begins with the entire dataset, and divides it up into distinct compartments that are then used to generate predictions.

The method is best illustrated by looking at an actual tree. The code below uses the rpart() function in the rpart package to fit a classification tree on the iris data using the variables “Sepal.Length” and “Sepal.Width” as predictors.

library(rpart)
library(rpart.plot)

mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
rpart.plot(mytree, extra=104) ## "extra = 104" adds some extra info to each node; see ?rpart.plot for the details

Let’s begin with the information in each “node” of the tree.

  • The first line in the box reports the node’s most common (or average) outcome (either “setosa”, “versicolor”, or “virginica” here) - this can be viewed as the model’s predicted outcome
  • The second line reports the distribution of outcomes in that node - this helps you evaluate the purity of a node
  • The third line indicates the percentage of the dataset that belongs to that node - this helps you understand which branches of the tree are most influential

As previously mentioned, the main idea behind the CART algorithm is that nodes are split to achieve better “purity”. In this tree, the first node is split into two “branches” using the rule “Sepal.Length < 5.5”; thus, all data-points that satisfy this condition are sent to the branch on the left, and all data-points that don’t satisfy this condition are sent to the branch on the right.

We can see how this splitting rule partitions the data visually:

library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) + 
  geom_point() + 
  geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5)

All data-points to the left of the dashed line are sent to the left branch, and all data-points to the right are sent to the right branch. Based upon the other information in the tree, we know that the left branch contains 35% of the data-points and is 87% setosa flowers. The right branch contains 65% of the data and is pretty evenly split between virginica and versicolor flowers.

Next, we can see that both nodes in the tree’s second level are each split again. The node in the left branch is split using the decision rule “Sepal.Width \(\geq\) 2.8”, while the node in the right branch is split according to the decision rule “Sepal.Length < 6.2”; these splits are shown visually in the plot below:
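A sketch of ggplot2 code that draws these partitions (using the split values 5.5, 2.8, and 6.2 described above) is:

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) + 
  geom_point() + 
  geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5) +                     ## first split
  annotate("segment", x = min(iris$Sepal.Length), xend = 5.5,
           y = 2.8, yend = 2.8, linetype = 2) +                          ## split of the left branch (Sepal.Width >= 2.8)
  annotate("segment", x = 6.2, xend = 6.2, y = min(iris$Sepal.Width),
           yend = max(iris$Sepal.Width), linetype = 2)                   ## split of the right branch (Sepal.Length < 6.2)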

As you can see in the plot, the CART algorithm is recursively partitioning the two-dimensional space (defined by Sepal.Width and Sepal.Length) in order to optimally separate the three categories of the outcome variable “Species”.

At this point we’ve described four nodes, two on the left branch and two on the right branch. You’ll recognize that the final tree includes one additional split (the third of these nodes is split using “Sepal.Width \(\geq\) 3.1”); a visual representation of this final model is shown below:

Question #1: Fit a classification and regression tree to the iris dataset that uses all four predictor variables. Use the prp() function from the rpart.plot package to plot the tree (include the argument extra = 104). After plotting your tree, briefly describe the 2 splits it makes, and determine what percentage of the dataset would be misclassified if each data-point were predicted to be the dominant species of the node it is contained in.

Remark: Recognize that regression trees can also be applied to numeric outcome variables. In this situation, the drop in root mean squared error (RMSE) is used to determine splits.
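For example, a regression tree can be fit with the same rpart() function by supplying a numeric outcome; the sketch below uses “Sepal.Length” as the outcome purely for illustration:

reg_tree <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
rpart.plot(reg_tree)  ## each node now reports a predicted average of the outcome rather than a predicted category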

\(~\)

Pruning

The CART algorithm will consider splitting any (or all) of these nodes until a stopping criterion is reached. We will not go in-depth into possible stopping rules, but they generally depend upon one or more of the following factors:

  • Purity of a node
  • Number of data-points in a node
  • Depth of the tree (how many splits occurred)

Had we desired a tree of a particular depth, we could “prune” our tree by tuning what’s known as the complexity parameter cp, an internal parameter used to determine when the recursive partitioning algorithm should stop:

mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
newtree <- prune(mytree, cp = .1)
rpart.plot(newtree, type = 1, extra=104)

The complexity parameter describes the minimum improvement (proportionally speaking) for a split to be considered worthwhile. Setting cp = 0.1 specifies that an improvement (in information gain) of at least 10% is required for a new split.

Because the CART algorithm is greedy, we can always fit a larger tree using a small complexity parameter and then retroactively go back and increase cp using the prune() function if we desire a smaller or simpler tree.
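As a sketch of this workflow (the cp values used here are arbitrary choices for illustration):

bigtree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris, cp = 0.001)  ## grow a large tree
printcp(bigtree)                        ## table of the cp value associated with each split
smalltree <- prune(bigtree, cp = 0.05)  ## retroactively require a larger improvement per split
rpart.plot(smalltree, extra = 104)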

Question #2: Prune the CART model displayed above to only include splits that improve the information gain by at least 20%. Based on the information in the tree’s nodes, do you think this model provides a good fit to the data? Additionally, do you think it’s less likely to be overfit to the sample? Briefly explain.

\(~\)

Random Forests

A major issue with tree-based models is that they tend to have high variance (leading to a high propensity towards over-fitting). Random forests are a non-parametric, tree-based modeling approach built upon the idea that averaging a set of independent elements yields an outcome with lower variability than any of the individual elements in the set.

This general concept should seem familiar. Thinking back to your introductory statistics course, you should remember that the sample mean, \(\bar{x}\), of a dataset has substantially less variability (\(\sigma/\sqrt{n}\)) than the individual data-points themselves (\(\sigma\)).

To turn this idea into a workable model, we need to introduce a procedure called bagging (short for bootstrap aggregation).

The goal of bagging is to produce \(B\) separate training datasets that are independent of each other (typically \(B\) is in the hundreds). The model of interest (in this case classification and regression trees) is trained separately on each of these datasets, resulting in \(B\) different estimated “models”. These are then averaged to produce a single, low-variance estimate.
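For example, a single bootstrapped training dataset can be constructed in base R as follows (a small sketch; the random forest function will handle this step for us automatically):

set.seed(1)
boot_rows <- sample(1:nrow(iris), size = nrow(iris), replace = TRUE)  ## sample row indices with replacement
boot_sample <- iris[boot_rows, ]  ## same size as the original data, but some rows appear multiple times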

Bagging is a general approach, but its most well-known application is in the random forest algorithm:

  1. Construct \(B\) bootstrap samples by sampling cases from the original dataset with replacement (this results in \(B\) unique datasets that are similar to the original)
  2. Fit a classification and regression tree to each sample, but at each split randomly choose a subset of \(m\) variables to be considered as splitting candidates (this results in \(B\) unique trees that are fit to similar datasets using different sets of predictors)
  3. For a given data-point, each of the \(B\) trees in the forest contributes a prediction or “vote”, with the majority (or average) of these votes forming the random forest’s final prediction, \(\hat{y}_i\)

Essentially, a random forest is just a collection of decision trees that each “vote” on a final prediction:

Remark: Bootstrapping alone isn’t enough to decorrelate the \(B\) classification and regression trees. By mandating each split only considers a random set of \(m\) predictors, the strongest explanatory variables are prevented from dominating every tree (at the expense of other variables that might still provide moderate predictive value), thereby resulting in trees that are roughly independent.

The code below demonstrates how to apply the random forest algorithm in R:

#install.packages("randomForest")
library(randomForest)
rf <- randomForest(Species ~ ., data = iris)
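To see the “voting” idea in action, the predict() method in the randomForest package can return every individual tree’s prediction via its predict.all argument; a small sketch:

votes <- predict(rf, newdata = iris[1, ], predict.all = TRUE)
votes$aggregate           ## the forest's final (majority vote) prediction for this flower
table(votes$individual)   ## how the individual trees voted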

Next, the code below shows the confusion matrix of the random forest model. Because the random forest algorithm involves bagging, each tree was grown without seeing roughly one-third of the data-points (its “out-of-bag” cases), so the algorithm naturally yields out-of-sample classifications. Thus, to compare it against something like CART, we should use cross-validation to get comparable out-of-sample measures.

Notice the substantial improvement in classification accuracy achieved by the random forest algorithm:

## Random Forest confusion matrix (naturally out-of-sample)
rf$confusion
##            setosa versicolor virginica class.error
## setosa         50          0         0        0.00
## versicolor      0         47         3        0.06
## virginica       0          4        46        0.08
## CART confusion matrix
### CV Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(iris)), size = nrow(iris))
preds <- numeric(nrow(iris))

## Loop across CV folds
for(k in 1:5){
  
  ## Subset the data
  train <- iris[fold_id != k, ]
  test <- iris[fold_id == k, ]
  
  ## Fit CART model
  tr <- rpart(Species ~ ., data = train)
  
  ## Store predictions (note: the factor predictions are stored as their integer codes,
  ## so 1 = "setosa", 2 = "versicolor", 3 = "virginica" in the confusion matrix below)
  preds[fold_id == k] <- predict(tr, newdata = test, type = "class")
}

## Confusion Matrix   
table(iris$Species, preds)
##             preds
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 46  4
##   virginica   0  7 43

Question #3: In your own words, briefly explain why the CART algorithm yields less accurate out-of-sample predictions than the random forest algorithm when applied to Fisher’s iris dataset.

\(~\)

Tuning the Random Forest Algorithm

The random forest algorithm depends upon the following tuning parameters:

  1. ntree - the number of bagged samples, \(B\), on which trees will be grown
  2. mtry - the number of variables that are randomly chosen to be candidates at each split
  3. Some sort of stopping criterion for individual trees; this can be nodesize, which sets the minimum size of terminal nodes (setting this larger leads to smaller trees), or maxnodes, which sets the maximum number of terminal nodes an individual tree can have.

You can type ?randomForest to read more about these options (such as their default values, etc.) in the R documentation for the randomForest() function. The code below demonstrates how to specify them:

rf <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 2, maxnodes = 4)

\(~\)

Variable Importance

A downside of both the CART and random forest algorithms (as well as many other algorithmic modeling approaches) is an inability to clearly quantify the roles played by individual variables in making predictions. However, the importance of individual variables in a random forest can still be expressed using a measure known as variable importance.

Variable importance is a general term, but most R functions will (by default) take it to refer to the average improvement in tree purity (measured by the Gini Index) that is attributable to splits involving each variable. The code below demonstrates the calculation in R:

rf <- randomForest(Species ~ ., data = iris)
rf$importance
##              MeanDecreaseGini
## Sepal.Length        10.673018
## Sepal.Width          2.508506
## Petal.Length        43.115187
## Petal.Width         42.949449

This suggests petal length and petal width are most useful in accurately predicting the class of a flower, while sepal width seems to have little importance.
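These importance measures can also be displayed graphically using the varImpPlot() function from the randomForest package:

varImpPlot(rf)  ## dot chart of the mean decrease in Gini for each predictor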

Unfortunately, variable importance isn’t as explicit about the precise role of each variable in the model (at least compared to logistic regression). So if the goal of a modeling application is to estimate the effect of a particular variable, logistic regression (or one of its variants intended for nominal outcomes) is likely preferable to methods like CART and random forests.

Question #4: Fit two different random forest models, both using \(B = 300\) bootstrapped samples and a maximum of 4 nodes, but have the first allow only 1 randomly selected variable to be considered for each split and the second allow 3 randomly selected variables to be considered for each split. How do the variable importance measures for each model compare? Briefly explain why the importance measures look so different across the two models.

\(~\)

Practice

For this application you will use the Ames Housing dataset:

library(dplyr)
ah <- read.csv("https://remiller1450.github.io/data/AmesHousing.csv", stringsAsFactors = TRUE)
ah2 <- select(ah, -Order, -PID, -MS.SubClass, -MS.Zoning, -Neighborhood) ## Remove identifiers

The outcome variable will be “SalePrice”, which you’ll notice is numeric. Thus, one motivation for this application is to illustrate how the CART and random forest algorithms can apply to a numeric outcome.

Question #5: After removing the identifier variables “Order”, “PID”, “MS.SubClass”, “MS.Zoning”, and “Neighborhood”, fit a CART model that predicts “SalePrice” using all of the remaining predictors. Set the parameter cp = 0.02 when fitting your model. Then, use the rpart.plot() function with the option extra = 100 to plot the tree.

Question #6: Describe the “rule” that defines the data-points in the left-most node of the tree. Additionally, what is the model’s predicted sale price for these homes?

Question #7: This question has been removed due to difficulties with missing data in the dataset.

## Example function to check missing percentage by column
perc_miss <- function(X){
  mean(is.na(X))
}

apply(ah2, 2, perc_miss)

Question #8: A challenge in most modeling applications is missing data. For this question, first verify (using the function complete.cases()) that every data-point in the dataset ah2 is missing a value for at least one variable. Next, use the argument na.action = "na.roughfix" to fit a random forest model that uses all of the available predictors. Provide a summary measure describing the out-of-sample accuracy of this model.