Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Preamble

Packages and Datasets

This lab will use a data set from the alr4 package, and it will use the plotROC package to help evaluate the performance of classification models.

# load the following packages
# install.packages("alr4")
# install.packages("plotROC")
library(FNN)
library(ggplot2)
library(dplyr)
library(alr4)
library(plotROC)
library(rpart)
library(rpart.plot)

\(~\)

Motivation

The Donner Party was a group of 45 pioneers whose migration to California was delayed by a series of mishaps that ultimately led to the group being stranded in the Sierra Nevada mountains. The data below (from the alr4 package) record the survival status of all adult members of the Donner Party, with survival status mapped onto a binary numeric scale (1 = “died”, 0 = “survived”).

data("Donner")
Donner <- filter(Donner, age >= 18)
Donner$survived_numeric <- ifelse(Donner$y == "died", 1, 0)

By encoding the outcome using 0s and 1s, we are able to fit a linear regression model that predicts survival status using an individual’s age and sex. Since \(y = 1\) indicates death, the predicted y-value from this model can be taken to reflect the probability of death:

mod_lin <- lm(survived_numeric ~ age + sex, data = Donner)
coef(mod_lin)
## (Intercept)         age     sexMale 
##  0.03206308  0.01069589  0.30679199

On the surface, this model seems fine. Each 1 year increase in age increases the expected probability of death by about 1 percentage point, and male members of the Donner Party have a roughly 30 percentage point higher probability of death than equally aged females.

However, here’s what this model looks like when visualized:
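Below is a minimal sketch of one way to draw this plot using ggplot2. Notice that the fitted lines are unbounded, so the model can produce predicted “probabilities” below 0 or above 1:

## Plot the observed outcomes along with the linear model's fitted lines
ggplot(Donner, aes(x = age, y = survived_numeric, color = sex)) +
  geom_point() +
  geom_line(aes(y = predict(mod_lin))) +
  labs(y = "Predicted probability of death")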

Logistic Regression

A more sensible approach to modeling a binary outcome is logistic regression: \[log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \]

  • \(\pi\) denotes the probability of the outcome category encoded as “1” (ie: \(P(y_i = 1) = \pi\), and \(P(y_i = 0) = 1-\pi\))
  • The log-odds of \(\pi\) are modeled by a linear function of the observed data
  • \(\hat{\pi}_i\) is used to express the estimated probability that the \(i^{th}\) data-point belongs to the category encoded as “1”

Note: The “odds” of an outcome describe how often that outcome occurs relative to how often it does not occur. So, odds of “3” would imply that the outcome “died” is expected 3 times for every 1 instance of the outcome “survived” (equivalent to \(\pi = 0.75\), since \(\pi = \tfrac{\text{odds}}{1+\text{odds}}\)).

The glm() function with the argument family = "binomial" is used to fit logistic regression models:

mod_log <- glm(survived_numeric ~ age + sex, data = Donner, family = "binomial")

When interpreting this model’s estimated coefficients, \(\{\hat{\beta}_0, \hat{\beta}_1, \ldots \hat{\beta}_p \}\), we must recognize that each coefficient represents an additive linear contribution on the log-odds scale.

coef(mod_log)
## (Intercept)         age     sexMale 
## -2.40022081  0.05788031  1.39309808

So, in the Donner Party example, each 1 year increase in age increases the log-odds of death by 0.0579 (holding sex constant).

Unfortunately the “log-odds of death” are not easily interpreted, but luckily: \[log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \\ \implies \tfrac{\pi}{1-\pi} = exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p) \\ \implies \tfrac{\pi}{1-\pi} = exp(\beta_0)*exp(\beta_1X_1)* \ldots *exp(\beta_pX_p)\]

Based on these expressions:

  • Exponentiating the intercept, \(e^{\beta_0}\), gives the baseline odds of category “1” (ie: the odds when every predictor equals zero).
  • Exponentiating a coefficient, \(e^{\beta_j}\), gives a multiplier of the baseline odds for each one unit increase in variable \(j\).

In the Donner Party example, the coefficient for “age” is estimated as \(\hat{\beta}_1 = 0.058\). After exponentiating, \(e^{0.058} = 1.06\), which suggests the odds of dying change by a factor of 1.06 for each 1 year increase in age (after adjusting for the variable “sex”).

Equivalently, we might interpret this effect as a 6% increase in the odds of death for every 1 year increase in age (again, after adjusting for “sex”).
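These odds ratios can be obtained directly in R by exponentiating the fitted coefficients:

## Exponentiate the logistic model's coefficients to obtain odds ratios
exp(coef(mod_log))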

\(~\)

Decision Trees

Recall that decision trees use the “anova” splitting method when the outcome is numeric. This method seeks splits that maximize the reduction in the model’s sum of squared errors.

For categorical outcomes, the default splitting method utilizes the Gini index: \(G = \sum_{j=1}^{k}p_j(1-p_j)\) in general, or \(p_1(1-p_1) + p_2(1-p_2)\) for binary classification, where \(p_j\) is the proportion of outcomes belonging to category \(j\).

As an example, the overall Gini index of the Donner party data can be found using the distribution of survival:

prop.table(table(Donner$y))
## 
##      died  survived 
## 0.5813953 0.4186047

From this we can calculate that \(G_\text{init} = 0.58*(1-0.58) + 0.42*(1-0.42) = 0.487\)
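This calculation is easy to reproduce in R using the proportions above:

## Gini index of the full data set
p <- prop.table(table(Donner$y))
sum(p * (1 - p))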

The decision tree algorithm (fit using default tuning parameters) suggests we can better predict survival by making separate predictions according to the rule sex = Male:

tree <- rpart(y ~ age + sex, data = Donner)
rpart.plot(tree, extra=104)

The Gini index after this split is a weighted average of the Gini indices of the two resulting nodes. Using the (rounded) node proportions from the tree, \(G_{sp1}=0.63*\big(0.7*(1-0.7) + 0.3*(1-0.3)\big) + 0.37*\big(0.38*(1-0.38) + 0.62*(1-0.62)\big) = 0.439\), meaning this split produces a Gini gain of 0.487-0.439 = 0.048.
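A minimal sketch of this weighted calculation in R (using exact, rather than rounded, proportions; each node’s Gini index is computed and then weighted by the node’s share of observations):

## Weighted Gini index after splitting on sex
gini <- function(p) sum(p * (1 - p))
node_gini <- sapply(split(Donner$y, Donner$sex), function(v) gini(prop.table(table(v))))
node_wts <- prop.table(table(Donner$sex))
sum(node_wts * node_gini)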

  • What would this Gini gain calculation look like if a single split perfectly separated the survivors and non-survivors?

\(~\)

Lab

Model Practice

In the early 2000s, a research team of scientists from the United States and Bangladesh measured the arsenic levels of wells in Arahazar upazila, Bangladesh. The team identified unsafe wells (arsenic concentrations \(\geq 0.5\), measured in hundreds of \(\mu g/L\)) and safe wells, and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.

A few years later, researchers followed up to see which of these residents had actually switched to safer wells. The results of this follow-up study are documented in the “Wells” dataset within the “carData” package:

library(carData)
data("Wells")
Wells$switch_numeric <- ifelse(Wells$switch == "yes", 1, 0)

The data set includes the following variables:

  • switch - whether or not a household switched to a safe well
  • arsenic - the arsenic level of the household’s original well (note that all are at least 0.5, the threshold for a well being considered unsafe)
  • distance - how many meters the household would need to travel to get to the nearest safe well
  • education - how many years of formal education the head of the household has
  • association - a binary variable describing whether or not any members of the household participated in any community organizations
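Before fitting any models, it can be helpful to preview a few rows to confirm how these variables are coded:

## Preview the first few rows of the Wells data
head(Wells)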

Question #1:

Part A: Use logistic regression to model the probability of a household switching to a safe well based upon the explanatory variables “arsenic”, “distance”, and “education”. Interpret the coefficient of “distance” after exponentiation.

Part B: Use the rpart() function to fit a CART model defined by the formula switch ~ arsenic + distance + education + association. Then use the rpart.plot() function with the argument extra=104 to visualize this model.

Part C: Describe the first splitting rule in the tree created in Part B.

Part D: Calculate the Gini Index before and after the first split. Hint: After the first split you should calculate two quantities (one for each node) and combine them after weighting by the proportion of observations in each node.

Part E: What are the rules that will lead to an observation ending up in the tree’s most populated terminal node? What proportion of observations in this node switched wells?

\(~\)

Classification Accuracy

For numeric outcomes, we used \(RMSE\) to measure a model’s error. However, this calculation is not possible for categorical outcomes.

The simplest option is classification accuracy, or the proportion of correct classifications the model makes:
\[\text{Accuracy} = \tfrac{1}{n}\sum_{i=1}^{n}I(y_i = \hat{y}_i),\] where \(I(\cdot)\) equals 1 when its argument is true and 0 otherwise.

Note that we could also represent this information via the classification error rate, or \(1 - \text{Accuracy} = \tfrac{1}{n}\sum_{i=1}^{n}I(y_i \ne \hat{y}_i)\).

For logistic regression, calculating the classification accuracy (or error rate) requires mapping the fitted model’s predicted probabilities (ie: the \(\hat{\pi}_i\)’s) to predicted categories (ie: the \(\hat{y}_i\)’s) using a decision rule. A simple rule is to map all predicted probabilities at or above some threshold, \(t\), to the “positive” class: \[\hat{y}_i = 1 \text{ if } \hat{\pi}_i \geq t; \\ \hat{y}_i=0 \text{ if } \hat{\pi}_i<t\]

For a binary outcome, \(t = 0.5\) is a reasonable choice since it maps predicted probabilities to the “most likely” outcome category.

A model fit using glm() stores its predicted probabilities in the vector fitted.values. The code below uses ifelse() to map these probabilities into predicted categories for the Donner Party example:

## Fit logistic model
fit <- glm(survived_numeric ~ age + sex, data = Donner, family = "binomial")

## translate predicted probabilities to died/survived
pred_cat <- ifelse(fit$fitted.values > .5, "died", "survived")

Other types of models, such as KNN or decision trees, require different approaches to obtain predicted probabilities. The code below demonstrates this for decision tree models:

## Fit decision tree
tree <- rpart(y ~ age + sex, data = Donner)

## get predictions
pred_probs_tree <- predict(tree, newdata = Donner, type = "prob")

## translate predicted probabilities to died/survived
pred_cat_tree <- ifelse(pred_probs_tree[,1] > .5, "died", "survived") # Note that the first column is Pr(death)

And this code demonstrates the process for KNN models:

## First separate the predictors and make them into a numeric matrix
Donner_X <- model.matrix(~ -1 + age + sex, data = select(Donner, age, sex))

## Fit the model
knn_mod <- knn(train = Donner_X, test = Donner_X, cl = Donner$y, k = 4, prob = TRUE)

## attr(knn_mod, "prob") stores the vote share of the *winning* class,
## so convert it into the predicted probability of "died" before thresholding
prob_died <- ifelse(knn_mod == "died", attr(knn_mod, "prob"), 1 - attr(knn_mod, "prob"))

## translate predicted probabilities to died/survived
pred_cat_knn <- ifelse(prob_died > .5, "died", "survived")

This code is relatively complex due to the way knn() was programmed: it requires that the predictors be given as a matrix of exclusively numeric data. The model.matrix() function facilitates this, using -1 in the formula to avoid creating an intercept term. I encourage you to print Donner_X to see that all this code is doing is recoding sex into numeric indicator columns. Additionally, predicted probabilities are returned as an attribute of the fitted model, so they must be accessed using attr(); keep in mind that this attribute stores the vote share of each observation’s winning class, not the probability of a fixed category.
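For example:

## Inspect the numeric predictor matrix (first few rows)
head(Donner_X)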

Once we have predicted categories, we can then calculate the model’s in-sample classification accuracy by finding the proportion of these predictions which match the actual observed outcome:

## Sum the number of matches and divide by the sample size
sum(Donner$y == pred_cat)/length(Donner$y)
## [1] 0.7674419
## Results for decision trees and knn
sum(Donner$y == pred_cat_tree)/length(Donner$y)
## [1] 0.6744186
sum(Donner$y == pred_cat_knn)/length(Donner$y)
## [1] 0.627907

We can see that logistic regression achieves the highest in-sample accuracy.

Question #2: Using the well switching data introduced in Question #1, calculate and report the in-sample classification accuracy of three different models: logistic regression, decision tree, and KNN, that are used to predict the switch to a safe well. Each model should use the predictors: arsenic, distance, education, and association. You may use the default tuning parameters for the decision tree model, and \(k=15\) for KNN.

\(~\)

Confusion Matrices

Accuracy is often reported alongside a confusion matrix, or two-way table summarizing the model’s predicted categories vs. the data’s true categories. The rows of a confusion matrix represent the observed (true) categories, while the columns represent predicted categories:

## Confusion Matrix (note: predictions = columns)
table(Donner$y, pred_cat)
##           pred_cat
##            died survived
##   died       23        2
##   survived    8       10

Using the confusion matrix, we see that this model makes a total of 10 classification errors: 8 people who actually survived but whom the model predicted would die (based upon their age and sex, using a probability threshold of \(t = 0.5\)), and 2 people whom the model predicted would survive but who actually died.

Question #3: For the Donner Party data, using the same logistic regression model, survived_numeric ~ age + sex, that was used above, change the threshold used to map predicted probabilities to survival outcomes to \(t = 0.6\). How does this change impact the in-sample classification accuracy? How does it influence the distribution of classification errors in the confusion matrix?

\(~\)

ROC Analysis

Within the confusion matrix, you should notice that two types of classification errors are possible in any binary classification application. Sometimes certain types of errors are more harmful than others. For example, it is less problematic for a model to classify a healthy person as having cancer than it is for a model to classify a person with cancer as being healthy. The former might be inconvenient, but the latter could be deadly.

These two types of classification errors are called false positives, which occur when a “negative” observation (outcome encoded by “0”) is classified as “positive” (outcome encoded by “1”), and false negatives, which occur when a “positive” observation is classified as “negative”.

Because we decide which categories to code as “1” and “0”, the notion of a false positive or false negative is determined by the data analyst.

Two important metrics derived from these quantities are:

  1. Sensitivity, or the true positive rate: \[\text{Sensitivity} = \tfrac{\text{#True Positives}}{\text{#True Positives + #False Negatives}}\]

  2. Specificity, or the true negative rate: \[\text{Specificity} = \tfrac{\text{#True Negatives}}{\text{#True Negatives + #False Positives}}\] Note: A model that classifies everything as “positive” will have a sensitivity of 1 and a specificity of 0. Conversely, a model that classifies everything as “negative” will have a sensitivity of 0 and a specificity of 1.
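As a quick sketch, these two metrics can be computed from the Donner confusion matrix shown earlier (treating “died” as the positive class, consistent with the 0/1 coding used above):

## Sensitivity and specificity from the confusion matrix
cm <- table(Donner$y, pred_cat)
sens <- cm["died", "died"] / sum(cm["died", ])             # TP / (TP + FN)
spec <- cm["survived", "survived"] / sum(cm["survived", ]) # TN / (TN + FP)
c(sensitivity = sens, specificity = spec)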

For any non-trivial model, there is a trade-off between sensitivity and specificity that depends upon the decision threshold, \(t\) (which maps predicted probabilities to predicted categories). One way to visually represent this trade-off is a Receiver Operating Characteristic (ROC) curve.

The code below shows the ROC curve for the Donner Party logistic regression model. It is created using ggplot with help from the plotROC package. The aesthetic “d” signifies the true “disease” status, while “m” signifies the “marker”, or the predicted probability of disease. By default, geom_roc will take the first level of the binary categorical outcome to be the “positive” class.

library(plotROC)
fit <- glm(survived_numeric ~ age + sex, data = Donner, family = "binomial")

df <- data.frame(y = Donner$survived_numeric, pi = fit$fitted.values)
ggplot(df, aes(d = y, m = pi)) + geom_roc()

The y-axis of the ROC plot depicts the true positive rate, or sensitivity, while the x-axis shows the false positive rate, which is the same as 1-specificity. The curve depicts the sensitivity and 1-specificity corresponding to different decision thresholds (values of \(t\)), which are marked at various points.

For example, the threshold \(t = 0.4\) leads to around 95% sensitivity, but at the expense of the false positive rate being around 60%.
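These values can be checked for any single threshold; below is a minimal sketch using the fit object from above:

## Sensitivity and false positive rate at threshold t = 0.4
t <- 0.4
pred_t <- ifelse(fit$fitted.values >= t, 1, 0)
sens_t <- mean(pred_t[Donner$survived_numeric == 1] == 1) # true positive rate
fpr_t <- mean(pred_t[Donner$survived_numeric == 0] == 1)  # false positive rate
c(sensitivity = sens_t, false_positive_rate = fpr_t)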

We can also map a variable to the color aesthetic to produce stratified ROC curves that depict the model’s performance for males and females separately:

df <- data.frame(y = Donner$survived_numeric, pi = fit$fitted.values, sex = Donner$sex)
ggplot(df, aes(d = y, m = pi, color = sex)) + geom_roc()

Exploring the stratified ROC curves is useful in this application because we can clearly see that our model does a much better job predicting outcomes for female members of the Donner Party.

The Area Under the Curve (AUC) of an ROC curve is a popular single number summary of a model’s efficacy. An AUC of 1 indicates perfect classification, while an AUC of 0.5 is akin to random guessing.

The AUC of an ROC curve can be found using the calc_auc function in the plotROC package.

plot1 <- ggplot(df, aes(d = y, m = pi)) + geom_roc()
plot2 <- ggplot(df, aes(d = y, m = pi, color = sex)) + geom_roc()

calc_auc(plot1)
##   PANEL group       AUC
## 1     1    -1 0.6933333
calc_auc(plot2)
##   PANEL group    sex       AUC
## 1     1     1 Female 0.9250000
## 2     1     2   Male 0.3914474

In our example, we see decent overall prediction of survival status, but this is largely due to the accuracy of predictions for females (the model actually does worse than guessing for males).

Question #4:

Part A: Create a single ggplot graphic displaying the ROC curves for each model you created in Question #2. You should display each model’s curve using a different color. Hint: You could choose to combine the predicted and observed values for each model into a single data frame and add a column to indicate the model, or you could use local specification of the d and m aesthetics within 3 different layers of geom_roc().

Part B: Using the plot from Part A and without performing any calculations, which model do you believe has the highest AUC value? Briefly explain.

\(~\)

Cross-validation

So far we’ve been calculating in-sample performance measures (classification accuracy, AUC, etc.), which are generally too optimistic and tend to reward overfitting. It’s much better to use cross-validation to estimate out-of-sample performance.

### Setup
set.seed(111)
n <- nrow(Donner)
fold_id <- sample(rep(1:4, length.out = n), size = n)
preds <- numeric(n)

## Loop across CV folds
for(k in 1:4){
  
  ## Subset the data
  train <- Donner[fold_id != k, ]
  test <- Donner[fold_id == k, ]
  
  ## Fit models on the data
  m <- glm(survived_numeric ~ age + sex, data = train, family = "binomial")  
  
  ## Store predictions
  preds[fold_id == k] <- predict(m, newdata = test, type = "response")
}

## Out of sample accuracy
pred_class <- ifelse(preds >= .5, 1, 0)
out_acc <- sum(pred_class == Donner$survived_numeric)/n
out_acc
## [1] 0.6744186
## Out of sample AUC
df <- data.frame(y = Donner$survived_numeric, pi = preds)
pl <- ggplot(df, aes(d = y, m = pi)) + geom_roc()
calc_auc(pl)
##   PANEL group       AUC
## 1     1    -1 0.5922222

Earlier we saw that this model achieves 76.7% in-sample accuracy, but cross-validation indicates the out-of-sample accuracy is substantially lower. The results for AUC are similar.

In general, you should use cross-validation to help you select and evaluate the performance of models. Then you should fit/estimate that model on the entire data set and use those results to develop interpretations (such as odds ratios) or construct visualizations (such as ROC curves).

Question #5: Use 5-fold cross-validation to estimate the out-of-sample classification accuracy and out-of-sample AUC of the three models you used in Questions #2 and #4.