Directions (Please read before starting)
\(~\)
This lab will use a data set from the alr4 library, and it will use the plotROC package to help evaluate the performance of classification models.
# load the following packages
# install.packages("alr4")
# install.packages("plotROC")
library(FNN)
library(ggplot2)
library(dplyr)
library(alr4)
library(plotROC)
library(rpart)
library(rpart.plot)
\(~\)
The Donner Party was a group of 45 pioneers whose migration to California was delayed by a series of mishaps that ultimately led to the group being stranded in the Sierra Nevada mountains. The data below (from the alr4 package) record the survival status of all adult members of the Donner Party, with survival status mapped onto a binary numeric scale (1 = “died”, 0 = “survived”).
data("Donner")
Donner <- filter(Donner, age >= 18)
Donner$survived_numeric <- ifelse(Donner$y == "died", 1, 0)
By encoding the outcome using 0s and 1s, we are able to fit a linear regression model that predicts survival status using an individual’s age and sex. Since \(y = 1\) indicates death, the predicted y-value from this model can be taken to reflect the probability of death:
mod_lin <- lm(survived_numeric ~ age + sex, data = Donner)
coef(mod_lin)
## (Intercept) age sexMale
## 0.03206308 0.01069589 0.30679199
On the surface, this model seems fine. Each 1-year increase in age increases the expected probability of death by about 1%, and male members of the Donner Party have a roughly 30% higher probability of death than equally aged females.
However, a flaw becomes apparent when this model is visualized: the fitted lines are unbounded, so the predicted “probabilities” can exceed 1 (or fall below 0) for certain combinations of age and sex.
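The figure itself is not reproduced in this document, but a minimal sketch that recreates it (assuming the original plot showed the observed outcomes with the model's fitted lines overlaid) is:
## Observed outcomes with the linear model's fitted lines overlaid
ggplot(Donner, aes(x = age, y = survived_numeric, color = sex)) +
  geom_point() +
  geom_line(aes(y = predict(mod_lin))) +
  labs(y = "Predicted probability of death")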
A more sensible approach to modeling a binary outcome is logistic regression: \[\log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \]
Note: The “odds” of an outcome describe how often that outcome occurs relative to how often it does not occur. So, odds of “3” would imply that the outcome “died” is expected 3 times for every 1 instance of the outcome “survived” (equivalent to \(\pi = 0.75\)).
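Since this conversion comes up repeatedly, here is a quick sketch of the arithmetic in R (using the values from the note above):
## Converting between probability and odds
p <- 0.75
p / (1 - p)   # a probability of 0.75 corresponds to odds of 3
3 / (1 + 3)   # odds of 3 correspond back to a probability of 0.75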
The glm() function with the argument family = "binomial" is used to fit logistic regression models:
mod_log <- glm(survived_numeric ~ age + sex, data = Donner, family = "binomial")
When interpreting this model’s estimated coefficients, \(\{\hat{\beta}_0, \hat{\beta}_1, \ldots \hat{\beta}_p \}\), we must recognize that each coefficient represents an additive linear contribution on the log-odds scale.
mod_log$coefficients
## (Intercept) age sexMale
## -2.40022081 0.05788031 1.39309808
So, in the Donner Party example, each 1 year increase in age increases the log-odds of death by 0.0579.
Unfortunately the “log-odds of death” are not easily interpreted, but luckily: \[\log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \\ \implies \tfrac{\pi}{1-\pi} = \exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p) \\ \implies \tfrac{\pi}{1-\pi} = \exp(\beta_0)\cdot\exp(\beta_1X_1)\cdot \ldots \cdot\exp(\beta_pX_p)\]
Based on these expressions, each exponentiated coefficient can be interpreted as a multiplicative effect on the odds of the outcome:
In the Donner Party example, the coefficient for “age” is estimated as \(\hat{\beta}_1 = 0.058\). After exponentiating, \(e^{0.058} = 1.06\), which suggests the odds of dying change by a factor of 1.06 for each 1-year increase in age (after adjusting for the variable “sex”).
Equivalently, we might interpret this effect as a 6% increase in the odds of death for every 1-year increase in age (again, after adjusting for “sex”).
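In R, these odds ratios can be obtained by exponentiating the fitted model's coefficients:
## Odds ratios: exponentiate the estimated coefficients
exp(mod_log$coefficients)
# age: exp(0.0579) = 1.06; sexMale: exp(1.393) = 4.03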
\(~\)
Recall that decision trees use the “anova” splitting method when the outcome is numeric. This method seeks splits that maximize the improvement in the model’s squared error.
For categorical outcomes, the default splitting method utilizes the Gini index: \(G = \sum_{j=1}^{k}p_j(1-p_j)\) in general, or \(p_1(1-p_1) + p_2(1-p_2)\) for binary classification, where \(p_j\) is the proportion of outcomes belonging to category \(j\).
As an example, the overall Gini index of the Donner party data can be found using the distribution of survival:
prop.table(table(Donner$y))
##
## died survived
## 0.5813953 0.4186047
From this we can calculate that \(G_\text{init} = 0.58*(1-0.58) + 0.42*(1-0.42) = 0.487\)
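The same calculation can be performed in R using the proportions from above:
## Overall Gini index from the outcome proportions
p <- prop.table(table(Donner$y))
sum(p * (1 - p))   # approximately 0.487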
The decision tree algorithm (fit using default tuning parameters) suggests we can better predict survival by making separate predictions according to the rule sex = Male:
tree <- rpart(y ~ age + sex, data = Donner)
rpart.plot(tree, extra=104)
The Gini index after this split is a weighted average of the Gini indices within the two resulting nodes, where the weights (here 0.63 and 0.37) are the proportions of observations falling in each node: \(G_{sp1}=0.63*(0.7*(1-0.7) + 0.3*(1-0.3)) + 0.37*(0.38*(1-0.38) + 0.62*(1-0.62)) = 0.439\), meaning this split produces a Gini gain of \(0.487-0.439 = 0.048\).
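As a check, here is a short dplyr sketch of the same weighted calculation using exact node proportions (the arithmetic above uses the rounded values displayed by rpart.plot()):
## Weighted Gini index after splitting on sex
Donner %>%
  group_by(sex) %>%
  summarize(n = n(), p_died = mean(y == "died")) %>%
  mutate(gini = p_died * (1 - p_died) + (1 - p_died) * p_died) %>%
  summarize(G_sp1 = sum((n / sum(n)) * gini))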
\(~\)
In the early 2000s, a research team composed of US and Bangladeshi scientists measured the arsenic levels of wells in Arahazar upazila, Bangladesh. The team identified safe and unsafe wells (arsenic concentrations \(\geq 50 \mu g/L\), recorded in the data in units of hundreds of \(\mu g/L\)), and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.
A few years later, researchers followed up to see which of these residents had actually switched to safer wells. The results of this follow-up study are documented in the “Wells” dataset within the “carData” package:
library(carData)
data("Wells")
Wells$switch_numeric <- ifelse(Wells$switch == "yes", 1, 0)
The data set includes the following variables:
switch - whether or not the household switched to a safe well (“yes”/“no”)
arsenic - the arsenic level of the household’s well (in hundreds of \(\mu g/L\))
distance - the distance (in meters) to the closest known safe well
education - years of education of the head of the household
association - whether household members are active in community organizations (“yes”/“no”)
Question #1:
Part A: Use logistic regression to model the likelihood of a household switching to a safe well based upon the explanatory variables “arsenic”, “distance”, and “education”. Interpret the coefficient of “distance” after exponentiation.
Part B: Use the rpart() function to fit a CART model defined by the formula switch ~ arsenic + distance + education + association. Then use the rpart.plot() function with the argument extra=104 to visualize this model.
Part C: Describe the first splitting rule in the tree created in Part B.
Part D: Calculate the Gini Index before and after the first split. Hint: After the first split you should calculate two quantities (one for each node) and combine them after weighting by the proportion of observations in each node.
Part E: What are the rules that will lead to an observation ending up in the tree’s most populated terminal node? What proportion of observations in this node switched wells?
\(~\)
For numeric outcomes, we used \(RMSE\) to measure a model’s error. However, this calculation is not possible for categorical outcomes.
The simplest option is classification accuracy, or the proportion of correct classifications the model makes: \[\text{Accuracy} = \sum_{i=1}^{n}(y_i = \hat{y}_i)/n\]
Note that we could also represent this information via the classification error rate, or \(1 - \text{Accuracy} = \sum_{i=1}^{n}(y_i \ne \hat{y}_i)/n\).
For logistic regression, calculating the classification accuracy (or error rate) requires mapping the fitted model’s predicted probabilities (i.e., the \(\hat{\pi}_i\)’s) to predicted categories (i.e., \(\hat{y}_i\)’s) using a decision rule. A simple rule is to map all predicted probabilities above some threshold, \(t\), to the “positive” class: \[\hat{y}_i = 1 \text{ if } \hat{\pi}_i \geq t; \\ \hat{y}_i=0 \text{ if } \hat{\pi}_i<t\]
For a binary outcome, \(t = 0.5\) is a reasonable choice since it maps predicted probabilities to the “most likely” outcome category.
The glm() function returns predicted probabilities in the vector fitted.values. The code below uses ifelse() to map these probabilities into predicted categories for the Donner Party example:
## Fit logistic model
fit <- glm(survived_numeric ~ age + sex, data = Donner, family = "binomial")
## translate predicted probabilities to died/survived
pred_cat <- ifelse(fit$fitted.values > .5, "died", "survived")
Other types of models, such as KNN or decision trees, require different approaches for obtaining predicted probabilities. The code below demonstrates this for decision tree models:
## Fit decision tree
tree <- rpart(y ~ age + sex, data = Donner)
## get predictions
pred_probs_tree = predict(tree, newdata = Donner, type = "prob")
## translate predicted probabilities to died/survived
pred_cat_tree <- ifelse(pred_probs_tree[,1] > .5, "died", "survived") # Note that the first column is Pr(death)
And this code demonstrates the process for KNN models:
## First separate the predictors and make them into a numeric matrix
Donner_X <- model.matrix(~ -1 + age + sex, data = select(Donner, age, sex))
## Fit the model (knn() returns predicted classes, with vote proportions stored as an attribute)
knn_mod <- knn(train = Donner_X, test = Donner_X, cl = Donner$y, k = 4, prob = TRUE)
## translate predicted probabilities to died/survived
pred_cat_knn <- ifelse(attr(knn_mod, "prob") > .5, "died", "survived")
This code is relatively complex due to the way knn() was programmed, as it requires the predictors to be given as a matrix of exclusively numeric data. The model.matrix() function facilitates this, using -1 in the formula to avoid creating an intercept term. I encourage you to print Donner_X to see that all this code is doing is recoding sex to be a binary numeric variable. Additionally, predicted probabilities are returned as an attribute of the fitted model, so they must be accessed using attr().
Once we have predicted categories, we can then calculate the model’s in-sample classification accuracy by finding the proportion of these predictions which match the actual observed outcome:
## Sum the number of matches and divide by the sample size
sum(Donner$y == pred_cat)/length(Donner$y)
## [1] 0.7674419
## Results for decision trees and knn
sum(Donner$y == pred_cat_tree)/length(Donner$y)
## [1] 0.6744186
sum(Donner$y == pred_cat_knn)/length(Donner$y)
## [1] 0.627907
We can see that logistic regression achieves the highest in-sample accuracy.
Question #2: Using the well switching data introduced in Question #1, calculate and report the in-sample classification accuracy of three different models used to predict the switch to a safe well: logistic regression, a decision tree, and KNN. Each model should use the predictors arsenic, distance, education, and association. You may use the default tuning parameters for the decision tree model, and \(k=15\) for KNN.
\(~\)
Accuracy is often reported alongside a confusion matrix, or two-way table summarizing the model’s predicted categories vs. the data’s true categories. The rows of a confusion matrix represent the observed (true) categories, while the columns represent predicted categories:
## Confusion Matrix (note: predictions = columns)
table(Donner$y, pred_cat)
## pred_cat
## died survived
## died 23 2
## survived 8 10
Using the confusion matrix, we see that this model makes a total of 10 classification errors: 8 were people who actually survived but whom our model predicted would die (based upon their age and sex, using a probability threshold of \(t = 0.5\)), while 2 were people whom our model predicted would survive but who actually died.
Question #3: For the Donner Party data, using the same logistic regression model, survived_numeric ~ age + sex, that was used above, change the threshold used to map predicted probabilities to survival outcomes to \(t = 0.6\). How does this change impact the in-sample classification accuracy? How does it influence the distribution of classification errors in the confusion matrix?
\(~\)
Within the confusion matrix, you should notice that two types of classification errors are possible in any binary classification application. Sometimes certain types of errors are more harmful than others. For example, it is less problematic for a model to classify a healthy person as having cancer than it is for a model to classify a person with cancer as being healthy. The former might be inconvenient, but the latter could be deadly.
These two types of classification errors are called false positives, which describes when a “negative” observation (outcome encoded by “0”) is classified as “positive” (outcome encoded by “1”), and false negatives, which describes when a “positive” observation is classified as “negative”.
Because we decide which categories to code as “1” and “0”, the notion of a false positive or false negative is determined by the data analyst.
Two important metrics derived from these quantities are:
Sensitivity, or the true positive rate: \[\text{Sensitivity} = \tfrac{\text{#True Positives}}{\text{#True Positives + #False Negatives}}\]
Specificity, or the true negative rate: \[\text{Specificity} = \tfrac{\text{#True Negatives}}{\text{#True Negatives + #False Positives}}\] Note: A model that classifies everything as “positive” will have a sensitivity of 1 and a specificity of 0. Conversely, a model that classifies everything as “negative” will have a sensitivity of 0 and a specificity of 1.
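Using the confusion matrix from the previous section (treating “died” as the positive class), these metrics can be computed directly:
## Sensitivity and specificity from the Donner confusion matrix
cm <- table(Donner$y, pred_cat)
TP <- cm["died", "died"];         FN <- cm["died", "survived"]
TN <- cm["survived", "survived"]; FP <- cm["survived", "died"]
TP / (TP + FN)   # sensitivity: 23/25 = 0.92
TN / (TN + FP)   # specificity: 10/18, approximately 0.56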
For any non-trivial model, there is a trade-off between sensitivity and specificity that depends upon the decision threshold, \(t\) (which maps predicted probabilities to predicted categories). One way to visually represent this trade-off is a Receiver Operating Characteristic (ROC) curve.
The code below shows the ROC curve for the Donner Party logistic regression model. It is created using ggplot with help from the plotROC package. The aesthetic “d” signifies the true “disease” status, while “m” signifies the “marker”, or the predicted probability of disease. By default, geom_roc will take the first level of the binary categorical outcome to be the “positive” class.
library(plotROC)
fit <- glm(survived_numeric ~ age + sex, data = Donner, family = "binomial")
df <- data.frame(y = Donner$survived_numeric, pi = fit$fitted.values)
ggplot(df, aes(d = y, m = pi)) + geom_roc()
The y-axis of the ROC plot depicts the true positive rate, or sensitivity, while the x-axis shows the false positive rate, which is the same as 1-specificity. The curve depicts the sensitivity and 1-specificity corresponding to different decision thresholds (values of \(t\)), which are marked at various points.
For example, the threshold \(t = 0.4\) leads to around 95% sensitivity, but at the expense of the false positive rate being around 60%.
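These quantities can be verified for any threshold by recomputing the predicted classes; for example, a sketch at \(t = 0.4\):
## True positive and false positive rates at the threshold t = 0.4
pred_04 <- ifelse(fit$fitted.values >= 0.4, 1, 0)
mean(pred_04[Donner$survived_numeric == 1])   # sensitivity (true positive rate)
mean(pred_04[Donner$survived_numeric == 0])   # false positive rate (1 - specificity)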
We can also supply a color argument to produce stratified ROC curves that depict the model’s performance for males and females separately:
df <- data.frame(y = Donner$survived_numeric, pi = fit$fitted.values, sex = Donner$sex)
ggplot(df, aes(d = y, m = pi, color = sex)) + geom_roc()
Exploring the stratified ROC curves is useful in this application because we can clearly see that our model does a much better job predicting outcomes for female members of the Donner Party.
The Area Under the Curve (AUC) of an ROC curve is a popular single number summary of a model’s efficacy. An AUC of 1 indicates perfect classification, while an AUC of 0.5 is akin to random guessing.
The AUC of an ROC curve can be found using the calc_auc() function in the plotROC package:
plot1 <- ggplot(df, aes(d = y, m = pi)) + geom_roc()
plot2 <- ggplot(df, aes(d = y, m = pi, color = sex)) + geom_roc()
calc_auc(plot1)
## PANEL group AUC
## 1 1 -1 0.6933333
calc_auc(plot2)
## PANEL group sex AUC
## 1 1 1 Female 0.9250000
## 2 1 2 Male 0.3914474
In our example, we see decent overall prediction of survival status, but this is largely due to the accuracy of predictions for females (the model actually does worse than guessing for males).
Question #4:
Part A: Create a single ggplot graphic displaying the ROC curves for each model you created in Question #2. You should display each model’s curve using a different color. Hint: You could choose to combine the predicted and observed values for each model into a single data frame and add a column to indicate the model, or you could use local specification of the d and m aesthetics within 3 different layers of geom_roc().
Part B: Using the plot from Part A and without performing any calculations, which model do you believe has the highest AUC value? Briefly explain.
\(~\)
So far we’ve been calculating in-sample performance measures (classification accuracy, AUC, etc.), which are generally too optimistic and tend to reward overfitting. It’s much better to use cross-validation to estimate out-of-sample performance.
### Setup
set.seed(111)
n <- nrow(Donner)
fold_id <- sample(rep(1:4, length.out = n), size = n)
preds <- numeric(n)
## Loop across CV folds
for(k in 1:4){
## Subset the data
train <- Donner[fold_id != k, ]
test <- Donner[fold_id == k, ]
## Fit models on the data
m <- glm(survived_numeric ~ age + sex, data = train, family = "binomial")
## Store predictions
preds[fold_id == k] <- predict(m, newdata = test, type = "response")
}
## Out of sample accuracy
pred_class <- ifelse(preds >= .5, 1, 0)
out_acc <- sum(pred_class == Donner$survived_numeric)/n
out_acc
## [1] 0.6744186
## Out of sample AUC
df <- data.frame(y = Donner$survived_numeric, pi = preds)
pl <- ggplot(df, aes(d = y, m = pi)) + geom_roc()
calc_auc(pl)
## PANEL group AUC
## 1 1 -1 0.5922222
Earlier we saw that this model achieves 76.7% in-sample accuracy, but cross-validation indicates the out-of-sample accuracy is substantially lower. The results for AUC are similar.
In general, you should use cross-validation to help you select and evaluate the performance of models. Then you should fit/estimate that model on the entire data set and use those results to develop interpretations (such as odds ratios) or construct visualizations (such as ROC curves).
Question #5: Use 5-fold cross-validation to estimate the out-of-sample classification accuracy and out-of-sample AUC of the three models you used in Questions #2 and #4.