Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a "divide and conquer" strategy to rush through the lab's questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.
\(~\)
Throughout the majority of the semester we’ve focused extensively on two varieties of generalized linear models (linear regression and logistic regression). To recap, these models enjoy the following strengths:
For these reasons, GLMs are very popular; however, their rigid parametric structure can put them at a disadvantage in certain circumstances:
In these situations, non-parametric or algorithmic modeling approaches have the potential to better capture the underlying trends in the data.
In this lab, we'll briefly cover a couple of these approaches, namely classification and regression trees (CART) and random forests. Additionally, you should recall that we've already discussed k-nearest neighbors, which is also appropriate in these settings. Thus, by the end of this lab you should have three different non-parametric models to consider as alternatives to regression.
Throughout the lab, we’ll use Fisher’s Iris dataset to illustrate the aforementioned methods:
data("iris")
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Recognize that this dataset involves a nominal categorical outcome, “Species”, that has three categories.
\(~\)
Classification and regression trees (CART) are “trained” by recursively partitioning the \(p\)-dimensional space (defined by the explanatory variables) until an acceptable level of homogeneity or “purity” is achieved within each partition.
The recursive partitioning can be viewed as a set of rules that begins with the entire dataset, and divides it up into distinct compartments that are then used to generate predictions.
The method is best illustrated by looking at an actual tree. The code below uses the rpart() function in the rpart package to fit a classification tree on the iris data using the variables "Sepal.Length" and "Sepal.Width" as predictors.
library(rpart)
library(rpart.plot)
mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
rpart.plot(mytree, extra=104) ## "extra = 104" adds some info to the tree, see ?rpart.plot to see the details
Let's begin with the information in each "node" of the tree.
As previously mentioned, the main idea behind the CART algorithm is that nodes are split to achieve better "purity". In this tree, the first node is split into two "branches" using the rule "Sepal.Length < 5.5"; thus, all data-points that satisfy this condition are sent to the branch on the left, and all data-points that don't satisfy this condition are sent to the branch on the right.
We can see visually how this splitting rule partitions the data:
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) + geom_point() + geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5)
All data-points to the left of the dashed line are sent to the left branch, and all data-points to the right are sent to the right branch. Based upon the other information in the tree, we know that the left branch contains 35% of the data-points and is 87% setosa flowers. The right branch contains 65% of the data and is pretty evenly split between virginica and versicolor flowers.
Next, we can see that both nodes in the tree's second level are each split again. The node in the left branch is split using the decision rule "Sepal.Width \(\geq\) 2.8", while the node in the right branch is split using the decision rule "Sepal.Length < 6.2"; these splits are shown visually in the plot below:
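If you'd like to recreate such a plot yourself, one possible sketch is given below; the cut-points 5.5, 2.8, and 6.2 are taken from the splitting rules just described, and the use of annotate() to draw the additional boundary segments is simply one way of doing it.
library(ggplot2)
## Sketch of the recursive partitions: Sepal.Length = 5.5 (first split),
## Sepal.Width = 2.8 (left branch), and Sepal.Length = 6.2 (right branch)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, pch = Species)) +
  geom_point() +
  geom_vline(xintercept = 5.5, lty = 2, lwd = 1.5) +
  annotate("segment", x = min(iris$Sepal.Length), xend = 5.5,
           y = 2.8, yend = 2.8, lty = 2) +
  annotate("segment", x = 6.2, xend = 6.2,
           y = min(iris$Sepal.Width), yend = max(iris$Sepal.Width), lty = 2)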
As you can see in the plot, the CART algorithm is recursively partitioning the two-dimensional space (defined by Sepal.Width and Sepal.Length) in order to optimally separate the three categories of the outcome variable "Species".
At this point we've described four nodes, two on the left branch and two on the right branch. You'll recognize that the final tree includes one additional split (the third of these nodes is split using "Sepal.Width \(\geq\) 3.1"); a visual representation of this final model is shown below:
Question #1: Fit a classification and regression tree to the iris dataset that uses all four predictor variables. Use the prp function in the rpart.plot package to plot the tree (include the argument extra = 104). After plotting your tree, briefly describe the 2 splits it makes, and determine what percentage of the dataset will be misclassified if each data-point were predicted to be the dominant species in the node it is contained in.
Remark: Recognize that regression trees can also be applied to numeric outcome variables. In this situation, the drop in root-mean squared error (RMSE) is used to determine splits.
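As a quick illustration of this, the sketch below fits a regression tree to a numeric outcome; the choice of Sepal.Length as the outcome is arbitrary and is only meant for demonstration.
## Regression tree for a numeric outcome (Sepal.Length chosen arbitrarily)
reg_tree <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
rpart.plot(reg_tree)  ## terminal nodes now display a predicted mean of "Sepal.Length"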
\(~\)
The CART algorithm will consider splitting any (or all) of these nodes until a stopping criterion is reached. We will not go in-depth into possible stopping rules, but they generally depend upon one or more of the following factors:
Had we desired a tree of a particular depth, we could "prune" our tree by tuning what's known as the complexity parameter, cp, an internal parameter used to determine when the recursive partitioning algorithm should stop:
mytree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris)
newtree <- prune(mytree, cp = .1)
rpart.plot(newtree, type = 1, extra=104)
The complexity parameter describes the minimum improvement (proportionally speaking) for a split to be considered worthwhile. Setting cp = 0.1 specifies that an improvement (in information gain) of at least 10% is required for a new split.
Because the CART algorithm is greedy, we can always fit a larger tree using a small complexity parameter and then go back and increase cp using the prune() function if we desire a smaller or simpler tree.
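For example, the sketch below grows a deliberately deep tree using a very small cp and then prunes it back; the particular value cp = 0.001 is an arbitrary choice for illustration, and pruning at cp = 0.1 should yield essentially the same tree shown earlier.
## Grow a deliberately deep tree with a tiny cp, then prune it back
bigtree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = iris, cp = 0.001)
smalltree <- prune(bigtree, cp = 0.1)  ## discard splits improving the fit by less than 10%
rpart.plot(smalltree, extra = 104)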
Question #2: Prune the CART model displayed above to only include splits that improve the information gain by at least 20%. Based on the information in the tree's nodes, do you think this model provides a good fit to the data? Additionally, do you think it's less likely to be overfit to the sample? Briefly explain.
\(~\)
A major issue with tree-based models is that they tend to be high variance (leading to a high propensity toward over-fitting). The random forest is a non-parametric, tree-based modeling algorithm built upon the idea that averaging a set of independent elements yields an outcome with lower variability than any of the individual elements in the set.
This general concept should seem familiar. Thinking back to your introductory statistics course, you should remember that the sample mean, \(\bar{x}\), of a dataset has substantially less variability (\(\sigma/\sqrt{n}\)) than the individual data-points themselves (\(\sigma\)).
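If you'd like to see this numerically, here is a small simulation of that fact (the sample size of 25 and the number of repetitions are arbitrary choices):
set.seed(1)
single_draws <- rnorm(5000, mean = 0, sd = 1)                      ## individual data-points
means_of_25 <- replicate(5000, mean(rnorm(25, mean = 0, sd = 1)))  ## sample means of n = 25
sd(single_draws)  ## roughly 1, the value of sigma
sd(means_of_25)   ## roughly 1/sqrt(25) = 0.2, the value of sigma/sqrt(n)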
To turn this idea into a workable model, we need to introduce a procedure called bagging (short for bootstrap aggregation).
The goal of bagging is to produce \(B\) separate training datasets that are independent of each other (typically \(B\) is in the hundreds). The model of interest (in this case classification and regression trees) is trained separately on each of these datasets, resulting in \(B\) different estimated “models”. These are then averaged to produce a single, low-variance estimate.
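To make the idea concrete, the sketch below carries out a bare-bones version of bagging by hand using classification trees. This illustrates only the bagging half of the story (it doesn't include the random selection of predictors described below), and the value B = 25 is kept small purely for illustration.
library(rpart)
set.seed(7)
B <- 25  ## number of bootstrap samples (kept small for illustration)
votes <- matrix(NA_character_, nrow = nrow(iris), ncol = B)
for(b in 1:B){
  ## Draw a bootstrap sample (sampling rows with replacement)
  boot_ids <- sample(1:nrow(iris), size = nrow(iris), replace = TRUE)
  ## Fit a classification tree to the bootstrap sample
  boot_tree <- rpart(Species ~ ., data = iris[boot_ids, ])
  ## Store this tree's predicted class for every data-point
  votes[, b] <- as.character(predict(boot_tree, newdata = iris, type = "class"))
}
## Aggregate the B trees via majority vote
## (these predictions are in-sample; the randomForest package uses out-of-bag predictions instead)
bagged_preds <- apply(votes, 1, function(v) names(which.max(table(v))))
table(iris$Species, bagged_preds)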
Bagging is a general approach, but its most well-known application is the random forest algorithm.
Essentially, a random forest is just a collection of decision trees that each “vote” on a final prediction:
Remark: Bootstrapping alone isn’t enough to decorrelate the \(B\) classification and regression trees. By mandating each split only considers a random set of \(m\) predictors, the strongest explanatory variables are prevented from dominating every tree (at the expense of other variables that might still provide moderate predictive value), thereby resulting in trees that are roughly independent.
The code below demonstrates how to apply the random forest algorithm in R
:
#install.packages("randomForest")
library(randomForest)
rf <- randomForest(Species ~ ., data = iris)
Next, the code below shows the confusion matrix of the random forest model. Because the random forest algorithm involves bagging, each data-point can be classified using only the trees that were fit without it (its "out-of-bag" prediction), so the reported confusion matrix is naturally out-of-sample. Thus, to compare it against something like CART, we should use cross-validation to get comparable out-of-sample measures.
Notice the substantial improvement in classification accuracy achieved by the random forest algorithm:
## Random Forest confusion matrix (naturally out-of-sample)
rf$confusion
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 4 46 0.08
## CART confusion matrix
### CV Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(iris)), size = nrow(iris))
preds <- numeric(nrow(iris)) ## factor predictions are stored as numeric codes (1 = setosa, 2 = versicolor, 3 = virginica)
## Loop across CV folds
for(k in 1:5){
## Subset the data
train <- iris[fold_id != k, ]
test <- iris[fold_id == k, ]
## Fit CART model
tr <- rpart(Species ~ ., data = train)
## Store Predictions
preds[fold_id == k] <- predict(tr, newdata = test, type = "class")
}
## Confusion Matrix
table(iris$Species, preds)
## preds
## 1 2 3
## setosa 50 0 0
## versicolor 0 46 4
## virginica 0 7 43
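If you'd like a single numerical summary of this comparison, one option (a sketch, assuming the rf and preds objects created above) is to compute each model's overall out-of-sample misclassification rate:
## Overall out-of-sample misclassification rates
rf_error <- mean(predict(rf) != iris$Species)           ## predict() with no newdata returns out-of-bag predictions
cart_error <- mean(preds != as.numeric(iris$Species))   ## preds holds the numeric factor codes (1, 2, 3)
c(RandomForest = rf_error, CART = cart_error)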
Question #3: In your own words, briefly explain why the CART algorithm yields less accurate out-of-sample predictions than the random forest algorithm when applied to Fisher’s iris dataset.
\(~\)
The random forest algorithm relies upon the following tuning parameters:
- ntree - the number of bagged samples, \(B\), onto which trees will be grown
- mtry - the number of variables that are randomly chosen to be candidates at each split
- nodesize, which sets the minimum size of terminal nodes (setting this larger leads to smaller trees), or maxnodes, which sets the maximum number of terminal nodes an individual tree can have
You can type ?randomForest to read more about these options (such as their default values, etc.) in the R documentation for the randomForest() function. The code below demonstrates how to specify them:
rf <- randomForest(Species ~ ., data = iris, ntree = 100, mtry = 2, maxnodes = 4)
\(~\)
A downside of both the CART and random forest algorithms (as well as many other algorithmic modeling approaches) is an inability to clearly quantify the roles played by individual variables in making predictions. However, the importance of individual variables in a random forest can still be expressed using a measure known as variable importance.
Variable importance is a general term, but most R functions will (by default) take it to refer to the average improvement in tree purity (measured by the Gini Index) that is attributable to splits involving each variable. The code below demonstrates the calculation in R:
rf <- randomForest(Species ~ ., data = iris)
rf$importance
## MeanDecreaseGini
## Sepal.Length 10.673018
## Sepal.Width 2.508506
## Petal.Length 43.115187
## Petal.Width 42.949449
This suggests petal length and petal width are most useful in accurately predicting the class of a flower, while sepal width seems to have little importance.
Unfortunately, variable importance isn't as explicit about the precise role of each variable in the model (at least compared to logistic regression). So if the goal of a modeling application is to estimate the effect of a particular variable, logistic regression (or its variants intended for nominal outcomes) is likely preferable to methods like CART and random forests.
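As a side note, if you prefer a graphical display of these importance scores, the randomForest package provides the varImpPlot() function (applied here to the rf object fit above):
varImpPlot(rf)  ## dot chart of the importance score for each predictor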
Question #4: Fit two different random forest models, both using \(B = 300\) bootstrapped samples and a maximum of 4 nodes, but have the first allow only 1 randomly selected variable to be considered for each split and the second allow 3 randomly selected variables to be considered for each split. How do the variable importance measures for each model compare? Briefly explain why the importance measures look so different across the two models.
\(~\)
For this application you will use the Ames Housing dataset:
library(dplyr)
ah <- read.csv("https://remiller1450.github.io/data/AmesHousing.csv", stringsAsFactors = TRUE)
ah2 <- select(ah, -Order, -PID, -MS.SubClass, -MS.Zoning, -Neighborhood) ## Remove identifiers
The outcome variable will be “SalePrice”, which you’ll notice is numeric. Thus, one motivation for this application is to illustrate how the CART and random forest algorithms can apply to a numeric outcome.
Question #5: After removing identifier variables "Order", "PID", "MS.SubClass", "MS.Zoning", and "Neighborhood", fit a CART model that predicts "SalePrice" using all of the remaining predictors. Set the parameter cp = 0.02 when fitting your model. Then, use the rpart.plot function with the option extra = 100 to plot the tree.
Question #6: Describe the “rule” that defines the data-points in the left-most node of the tree. Additionally, what is the model’s predicted sale price for these homes?
Question #7: This question has been removed due to difficulties with missing data in the dataset.
## Example function to check missing percentage by column
perc_miss <- function(X){
mean(is.na(X))
}
apply(ah2, 2, perc_miss)
Question #8: A challenge in most modeling applications is missing data. For this question, first verify (using the function complete.cases()) that the dataset ah2 contains no complete cases, i.e., every data-point is missing a value in at least one variable. Next, use the argument na.action = "na.roughfix" to fit a random forest model that uses all of the available predictors. Provide a summary measure describing the out-of-sample accuracy of this model.