Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.
\(~\)
This lab will continue our study of logistic regression, an approach to modeling binary outcomes using the following statistical model: \[\log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \]
Last week’s lab focused on conceptually understanding the impact of variables in logistic regression models, including methods of statistical inference. This week we’ll focus on evaluating and comparing different logistic regression models.
\(~\)
When considering a logistic regression model, it is essential to have a way of telling how well it fits the data. This also has implications for model selection: in a scenario involving multiple candidate models, it could be that all of them fit the data poorly, or that all of them fit relatively well.
\(~\)
In the linear regression setting, the coefficient of determination (often called \(R^2\)) expresses the proportion of variability in the outcome variable that can be explained by a particular model. \(R^2\) is easily interpreted and does not depend upon the scale/units of the outcome variable, which makes it an appealing choice for summarizing the efficacy of a model.
The mathematical calculation of \(R^2\) (in the linear regression setting) isn’t directly transferable to logistic regression; however, there are a few proposed pseudo-\(R^2\) measures that closely resemble certain aspects of linear regression’s \(R^2\). Three common ones are:

- Squared correlation: the squared Pearson correlation between the model’s fitted probabilities and the observed binary outcomes.
- Efron’s pseudo-\(R^2\): one minus the ratio of the sum of squared residuals, \(\sum_i (y_i - \hat{\pi}_i)^2\), to the total sum of squares, \(\sum_i (y_i - \bar{y})^2\), mirroring the usual \(R^2\) formula.
- McFadden’s pseudo-\(R^2\): one minus the ratio of the model’s log-likelihood to the log-likelihood of the intercept-only model.

You can expect these three approaches to produce slightly different results. This link provides a nice summary of these (and other) proposed pseudo-\(R^2\) measures.
The code below demonstrates these pseudo-\(R^2\) methods on the Donner Party dataset:
## Load the Donner Party data, then filter to include only adults
library(alr4)
data("Donner")
Donner <- dplyr::filter(Donner, age >= 18)
Donner$y_numeric <- ifelse(Donner$y == "survived", 1, 0)
## Models
m <- glm(y_numeric ~ age, data=Donner, family = "binomial")
m_in <- glm(y_numeric ~ 1, data=Donner, family = "binomial")
## Option #1 (squared correlation)
cor(m$fitted.values, Donner$y_numeric)^2
## [1] 0.08461331
## Option #2 (Efron)
1 - sum((Donner$y_numeric - m$fitted.values)^2)/sum((Donner$y_numeric - mean(Donner$y_numeric))^2)
## [1] 0.08454015
## Option #3 (McFadden)
1 - as.numeric(logLik(m)/logLik(m_in))
## [1] 0.07193049
Generally speaking, accurately predicting a binary outcome is considerably harder than predicting a numeric outcome, so you shouldn’t hold pseudo-\(R^2\) measures to the same standards that you’d apply to \(R^2\) in linear regression. That said, the observed pseudo-\(R^2\) values of less than 0.10 indicate that this model doesn’t explain much of the variability in survival outcomes.
\(~\)
Question #1: Using the Donner Party dataset (loaded above), compare the pseudo-\(R^2\) of the logistic regression models y ~ age + sex and y ~ age; which model appears to explain more variability in the outcome? Are you fully convinced that it’s the better model? Briefly explain.
\(~\)
An even simpler way to quantify model performance for a binary outcome is classification accuracy, or the proportion of correct classifications the model makes:
\[\text{Accuracy} = \tfrac{1}{n}\sum_{i=1}^{n} I(y_i = \hat{y}_i)\]
where \(I(y_i = \hat{y}_i)\) equals 1 when the predicted category matches the observed category and 0 otherwise.
Calculating accuracy requires the data analyst to map the model’s predicted probabilities (ie: the \(\hat{\pi}_i\)’s) into predicted categories (ie: \(\hat{y}_i\)’s) using a decision rule: \[\hat{y}_i = 1 \text{ if } \hat{\pi}_i \geq t; \\ \hat{y}_i=0 \text{ if } \hat{\pi}_i<t\]
In many circumstances, a threshold of \(t = 0.5\) is a reasonable choice since it maps predicted probabilities to the “most likely” category.
For a logistic regression model fit using the glm function, predicted probabilities are returned as a list element named fitted.values. The code below shows how to use the ifelse function to map these probabilities into predicted categories:
fit <- glm(y ~ age + sex, data = Donner, family = "binomial")
predictedCategory <- ifelse(fit$fitted.values > .5, "survived", "died")
We can then calculate accuracy by finding the proportion of these predictions which match the actual observed outcome:
sum(Donner$y == predictedCategory)/length(Donner$y)
## [1] 0.7674419
We see that this logistic regression model achieves roughly 77% accuracy, although we should note that this is in-sample accuracy (out-of-sample accuracy will come up later in this lab).
It’s common to report accuracy alongside a confusion matrix, which is a two-way table displaying the model’s predicted categories and the data’s actual observed categories. The typical convention is to put the observed categories in the rows and the predicted categories in the columns.
## Confusion Matrix (predictions = columns)
table(Donner$y, predictedCategory)
## predictedCategory
## died survived
## died 23 2
## survived 8 10
Based upon this confusion matrix, we can see that there were a total of 10 misclassifications, and 8 of them were people who actually survived but whom our model predicted would die (based upon their age and sex, using a probability threshold of \(t = 0.5\)).
Question #2: Using the same logistic regression model, y ~ age + sex, change the threshold that maps predicted probabilities to survival outcomes to \(t = 0.6\). How does this change impact the in-sample classification accuracy? How does it influence the distribution of misclassifications in the confusion matrix?
\(~\)
Accuracy is a popular performance measure because it’s easy to understand and act upon, but relying solely on accuracy can be problematic when one category is much more common than the other. For example, suppose you build a model to predict whether Xavier freshmen will return for a second year. It might sound fairly impressive if this model achieves 85% classification accuracy.
However, the reality is that 85% of all Xavier freshmen return, which means that an uninformative model which simply predicts “return” for everyone can trivially achieve 85% accuracy. In other words, accuracy isn’t always a good reflection of how useful a model is. Instead, we’d like to know by what degree are a model’s predictions more accurate than what could be expected by random chance.
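To make this concrete with the data we already have loaded, here is a quick sketch (our own illustration, not part of the original Donner analysis) showing that an uninformative classifier which always predicts the more common outcome achieves the base-rate accuracy without using any predictors:
## Accuracy of always predicting "died" (25 of the 43 adults died): about 0.58
mean(Donner$y == "died")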
Cohen’s kappa, denoted as \(\kappa\), is a statistic originally designed to measure inter-rater reliability in psychology, but it is also frequently used to measure the agreement between a model’s predictions and the actual data. \[\kappa = \tfrac{p_o - p_e}{1 - p_e}\]
Here, \(p_o\) is the observed accuracy from the model, and \(p_e\) is the expected accuracy under random chance.
In our XU freshmen example, \(p_o = 0.85\), as the model achieved 85% accuracy. Also, \(p_e = 0.85\) because an uninformative model could achieve 85% accuracy. Thus, \(\kappa = 0\) in this example.
In contrast, suppose we had a perfect model which could predict whether a student returns with 100% accuracy. Then \(p_o = 1.00\) and \(p_e = 0.85\), leading to \(\kappa = 1\). Thus, kappa values nearer to 1 indicate better classification performance, while values nearer to zero indicate worse performance.
Note, it is possible for kappa to be negative if a model makes predictions that are worse than random chance.
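As a minimal sketch of this calculation, using the hypothetical Xavier retention numbers from above (and, as in this lab, taking \(p_e\) to be the accuracy of always guessing the most common category):
p_o <- 0.85  ## observed accuracy of the model
p_e <- 0.85  ## accuracy of an uninformative model that predicts "return" for everyone
(p_o - p_e)/(1 - p_e)  ## kappa = 0, no improvement over chance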
Question #3: Calculate Cohen’s kappa for the logistic regression model y ~ age + sex. How would you interpret this measure?
\(~\)
In some applications certain types of misclassifications can be more harmful than others. For example, it is much less problematic for a model to classify a healthy person as having cancer than it is for a model to classify a person with cancer as being healthy. The former might lead to some inconveniences, while the latter will certainly result in grave health consequences.
These two different types of misclassification give rise to the terms false positive, which describes when a “negative” observation is classified as “positive”, and false negative, which describes when a “positive” observation is classified as “negative”.
Just as we previously re-coded the binary categories as “1” and “0”, it is up to the analyst to decide which category to designate as “positive” and which as “negative”.
Sensitivity is defined as the true positive rate: \[\text{Sensitivity} = \tfrac{\text{#True Positives}}{\text{#True Positives + #False Negatives}}\]
Put differently, sensitivity is the ability of a model to correctly classify an observation as “positive”, as the denominator represents the total number of “positive” observations, and the numerator represents the number correctly classified.
Specificity is defined as the true negative rate: \[\text{Specificity} = \tfrac{\text{#True Negatives}}{\text{#True Negatives + #False Positives}}\]
So, specificity is the ability of a model to correctly classify an observation as “negative”, as the denominator represents the total number of “negative” observations, and the numerator represents the number correctly classified.
Note, a model that classifies everything as “positive” will have a sensitivity of 1 and a specificity of 0. Conversely, a model that classifies everything as “negative” will have a sensitivity of 0 and a specificity of 1.
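As a concrete sketch (our own choice of treating “survived” as the “positive” category), both quantities can be computed from the confusion matrix we built earlier for the y ~ age + sex model with \(t = 0.5\):
## Rows are the observed categories, columns are the predicted categories
cm <- table(Donner$y, predictedCategory)
cm["survived", "survived"]/sum(cm["survived", ])  ## sensitivity: 10/18, about 0.56
cm["died", "died"]/sum(cm["died", ])              ## specificity: 23/25, or 0.92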
For all non-trivial models there is a trade-off between sensitivity and specificity that depends upon the classification threshold, \(t\), which is used to map predicted probabilities to categories. One way to visually represent this trade-off is a Receiver Operating Characteristics (ROC) curve.
The code below shows the ROC curve for the Donner Party logistic regression model, created using ggplot via the plotROC package. The aesthetic “d” signifies the true “disease” status, while “m” signifies the “marker”, or the predicted probability of disease. By default, geom_roc will take the first level of the binary categorical outcome to be the “positive” class.
library(plotROC)
fit <- glm(y_numeric ~ age + sex, data = Donner, family = "binomial")
df <- data.frame(y_numeric = Donner$y_numeric, pi = fit$fitted.values)
ggplot(df, aes(d = y_numeric, m = pi)) + geom_roc()
The y-axis of the ROC plot depicts the true positive rate, or sensitivity, while the x-axis shows the false positive rate, which is the same as 1-specificity. The curve depicts the sensitivity and 1-specificity that result from different decision thresholds (values of \(t\) as previously defined), which are marked at various points.
For example, the threshold \(t = 0.6\) results in approximately 50% sensitivity and 90% specificity, indicating the model is better at predicting deaths than at predicting survival. This same trend could be observed in the confusion matrix we analyzed earlier.
We could supply a color argument to produce separate ROC curves for males and females:
df <- data.frame(y_numeric = Donner$y_numeric, pi = fit$fitted.values, sex = Donner$sex)
ggplot(df, aes(d = y_numeric, m = pi, color = sex)) + geom_roc()
Here we see that it is much more difficult to predict survival for males than for females in the Donner Party.
An ROC curve can provide an overall picture of a model’s classification ability, but it is also nice to have single number that summarizes the efficacy of a model. The Area Under the Curve (AUC) of the ROC curve is one such commonly used summary measure. An AUC of 1 indicates perfect classification, while an AUC of 0.5 is akin to random guessing.
The AUC of an ROC curve can be found using the calc_auc function in the plotROC package.
plot1 <- ggplot(df, aes(d = y_numeric, m = pi)) + geom_roc()
plot2 <- ggplot(df, aes(d = y_numeric, m = pi, color = sex)) + geom_roc()
calc_auc(plot1)
## PANEL group AUC
## 1 1 -1 0.6933333
calc_auc(plot2)
## PANEL group AUC
## 1 1 1 0.9250000
## 2 1 2 0.3914474
calc_auc accepts a ggplot object containing an roc geom and returns the area under the curve(s) for that plot.
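If you just want the numeric AUC value(s), say to compare two models programmatically, one option is to extract the AUC column from the data frame that calc_auc returns:
calc_auc(plot1)$AUC  ## the AUC value(s) as a numeric vector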
Question #4: Plot the ROC curve for the logistic regression model y_numeric ~ age + sex + status. Based upon this curve, what are the sensitivity and specificity (approximately) when using the threshold \(t = 0.5\) to make classifications?
Question #5: Compare the AUC of the model y_numeric ~ age + sex + status with the AUC of the model y_numeric ~ age + sex. Based upon this comparison, which model do you feel is more accurate? Are you convinced that the difference is significant?
\(~\)
While the methods we’ve described are useful in understanding the fit of a model, they tend to systematically favor larger, more complex models. In order to choose a model that balances parsimony with accuracy, there are two commonly used approaches: information criteria (such as AIC and BIC) and cross-validation.
\(~\)
All generalized linear models can be compared using the AIC and BIC model selection criteria. Thus, regardless of whether you’re using logistic regression, linear regression, or any other GLM, AIC and BIC are calculated the same way (with the only difference being the probability distribution that is the basis for the log-likelihood).
As a reminder:

- \(\text{AIC} = -2\log(\hat{\mathcal{L}}) + 2k\)
- \(\text{BIC} = -2\log(\hat{\mathcal{L}}) + k\log(n)\)

where \(\hat{\mathcal{L}}\) is the model’s maximized likelihood, \(k\) is the number of estimated parameters, and \(n\) is the sample size. Recall that the major difference between these two criteria is that BIC tends to favor more parsimonious models (so long as \(n \geq 8\), since then \(\log(n) > 2\)).
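As a quick sketch of how these criteria are used in practice, base R’s AIC() and BIC() functions can be applied directly to fitted glm objects, making it easy to compare a few candidate Donner Party models (these particular fits, m1 and m2, are our own illustration):
## Compare two candidate models using AIC and BIC (smaller is better for both)
m1 <- glm(y ~ age, data = Donner, family = "binomial")
m2 <- glm(y ~ age + sex, data = Donner, family = "binomial")
AIC(m1); AIC(m2)
BIC(m1); BIC(m2)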
Further, these model selection criteria can be the basis of stepwise selection algorithms:
library(MASS)
## Backwards stepwise selection (initial model contains all predictors)
init_fit <- glm(y ~ age + sex + status, data = Donner, family = "binomial")
stepAIC(init_fit,
scope = list(upper = y ~ age + sex + status,
lower = y ~ 1),
direction = "both")
## Start: AIC=51.6
## y ~ age + sex + status
##
## Df Deviance AIC
## - sex 1 41.674 49.674
## <none> 41.604 51.604
## - status 2 50.134 56.134
## - age 1 48.229 56.229
##
## Step: AIC=49.67
## y ~ age + status
##
## Df Deviance AIC
## <none> 41.674 49.674
## + sex 1 41.604 51.604
## - age 1 48.922 54.922
## - status 2 54.261 58.261
##
## Call: glm(formula = y ~ age + status, family = "binomial", data = Donner)
##
## Coefficients:
## (Intercept) age statusHired statusSingle
## 3.22364 -0.08161 -2.00809 -18.28240
##
## Degrees of Freedom: 42 Total (i.e. Null); 39 Residual
## Null Deviance: 58.47
## Residual Deviance: 41.67 AIC: 49.67
Question #6: Using the spam dataset, which was introduced in the previous lab, apply forward stepwise selection using the BIC selection criterion to choose a model that predicts the binary outcome variable “type”. Report the final model. Please type out or copy-paste the final model formula and comment out your stepAIC() code when compiling your lab write-up.
Hints:

- Supply the argument k = ... to make stepAIC() use BIC.
- Use trace = FALSE to stop stepAIC() from printing output at every step.

Question #7: As illustrated in the previous question, the computational burden of model selection algorithms in the logistic regression setting is non-trivial. Had you used a purely forward selection algorithm (ie: direction = "forward") rather than a stepwise forward selection algorithm (ie: direction = "both"), would you expect the run time to remain approximately the same? Or would it decrease? Briefly explain. (Note: you don’t need to write any code for your answer to this question.)
library(kernlab)
data("spam")
table(spam$type)
##
## nonspam spam
## 2788 1813
## Construct a formula string containing all 57 predictors (column 58 of 'spam' is the outcome, "type")
rhs <- paste(names(spam)[-58], collapse = " + ")
max_form <- paste0("type ~ ", rhs)
\(~\)
A final tool that can be used in both the selection and summarization of logistic regression models is cross-validation. Recall that cross-validation allows us to estimate the out-of-sample performance of a model. Evaluating a model based upon its out-of-sample performance, rather than its in-sample performance, protects against overfitting the sample data.
In the logistic regression setting, it’s common to report the cross-validated classification accuracy, Cohen’s kappa, and/or AUC. Notice that all of these are based upon predicted probabilities, so the cross-validation loop is actually fairly simple and directly mirrors code we’ve seen previously in Lab #4.
### Make outcome numeric for ease of use
spam$type <- ifelse(spam$type == "spam", 1, 0)
### Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(spam)), size = nrow(spam))
preds <- numeric(nrow(spam))
## Loop across CV folds
for(k in 1:5){
## Subset the data
train <- spam[fold_id != k, ]
test <- spam[fold_id == k, ]
## Fit models on the data
m <- glm(type ~ ., data = train, family = "binomial")
## Store predictions
preds[fold_id == k] <- predict(m, newdata = test, type = "response")
}
## Out of sample accuracy
pred_class <- ifelse(preds >= .5, 1, 0)
out_acc <- sum(pred_class == spam$type)/nrow(spam)
out_acc
## [1] 0.9234949
## Out of sample kappa
prop.table(table(spam$type)) ## Guessing "nonspam" for everything is 60.59% accurate
##
## 0 1
## 0.6059552 0.3940448
(out_acc - 0.6059552)/(1 - 0.6059552)
## [1] 0.8058467
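As a side note, rather than typing the chance accuracy by hand, you could compute it directly; this is a small variation on the calculation above, not a different method:
## Same kappa calculation without hard-coding the chance accuracy
p_e <- max(prop.table(table(spam$type)))  ## proportion of the most common class
(out_acc - p_e)/(1 - p_e)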
## Out of sample AUC
df <- data.frame(y_numeric = spam$type, pi = preds)
pl <- ggplot(df, aes(d = y_numeric, m = pi)) + geom_roc()
calc_auc(pl)
## PANEL group AUC
## 1 1 -1 0.9712265
\(~\)
In the early 2000s, a research team of US and Bangladeshi scientists measured the arsenic levels of wells in Arahazar upazila, Bangladesh. The team identified safe and unsafe wells (unsafe wells had arsenic concentrations of at least 0.5, measured in hundreds of micrograms per liter), and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.
A few years later, researchers followed up to see which of these residents had actually switched to safer wells. The results of this follow-up study are documented in the “Wells” dataset within the “carData” package:
library(carData)
data("Wells")
summary(Wells)
## switch arsenic distance education association
## no :1283 Min. :0.510 Min. : 0.387 Min. : 0.000 no :1743
## yes:1737 1st Qu.:0.820 1st Qu.: 21.117 1st Qu.: 0.000 yes:1277
## Median :1.300 Median : 36.761 Median : 5.000
## Mean :1.657 Mean : 48.332 Mean : 4.828
## 3rd Qu.:2.200 3rd Qu.: 64.041 3rd Qu.: 8.000
## Max. :9.650 Max. :339.531 Max. :17.000
The dataset contains the following variables:

- switch: whether the household switched to a different (safe) well (“yes” or “no”)
- arsenic: the arsenic level of the household’s original well (in hundreds of micrograms per liter)
- distance: the distance (in meters) to the nearest known safe well
- education: years of education of the head of the household
- association: whether any members of the household are active in community organizations (“yes” or “no”)
Question #8: Use a backwards stepwise selection algorithm (with AIC as the model selection criteria) to find a logistic regression model that predicts “switch” using some or all of the remaining variables in the dataset. Report your final model.
Question #9: Interpret the estimated effect of each variable in the model on the odds that a household switches wells. Based upon the \(Z\)-tests provided by the summary function, which variable appears to be the strongest predictor (after adjusting for the others)?
Question #10: Report the in-sample pseudo-\(R^2\) (your choice as to which one), classification accuracy, confusion matrix, and Cohen’s kappa of your model.
Question #11: Create an ROC curve displaying the sensitivity and specificity of your model. Based upon your own subjective beliefs regarding the importance of sensitivity and specificity in this application, use the ROC curve to select a decision threshold \(t\). Briefly explain the rationale for your choice. Finally, calculate and report the model’s AUC.
Question #12: Use cross-validation to estimate the out-of-sample classification accuracy of this model. Based upon a comparison of the in-sample and out-of-sample accuracy, do you believe your model is overfit?