Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.
\(~\)
This lab will continue our study of logistic regression, an approach to modeling binary outcomes using the following statistical model: \[\log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \]
Last week’s lab focused on conceptually understanding the impact of variables in logistic regression models, including methods of statistical inference. This week we’ll focus on evaluating and comparing different logistic regression models.
\(~\)
When considering a logistic regression model, it is essential to have a way of telling how well it fits the data. This also has implications for model selection: in a scenario involving multiple candidate models, it could be that all of them fit the data poorly, or that all of them fit relatively well.
\(~\)
In the linear regression setting, the coefficient of determination (often called \(R^2\)) expresses the proportion of variability in the outcome variable that can be explained by a particular model. \(R^2\) is easily interpreted and does not depend upon the scale/units of the outcome variable, which makes it an appealing choice for summarizing the efficacy of a model.
The mathematical calculation of \(R^2\) (in the linear regression setting) isn’t directly transferable to logistic regression; however, there are a few proposed pseudo-\(R^2\) measures that closely resemble certain aspects of linear regression’s \(R^2\). Three common ones are:

- Squared correlation: the squared Pearson correlation between the model’s fitted probabilities and the observed binary outcomes.
- Efron’s pseudo-\(R^2\): one minus the ratio of the sum of squared residuals, \(\sum_i (y_i - \hat{\pi}_i)^2\), to the total sum of squares, \(\sum_i (y_i - \bar{y})^2\), mirroring the usual \(R^2\) formula.
- McFadden’s pseudo-\(R^2\): one minus the ratio of the model’s log-likelihood to the log-likelihood of the intercept-only model.

You can expect these three approaches to produce slightly different results. This link provides a nice summary of these (and other) proposed pseudo-\(R^2\) measures.
The code below demonstrates these pseudo-\(R^2\) methods on the Donner Party dataset:
## Load the Donner Party data, then filter to include only adults
library(alr4)
data("Donner")
Donner <- dplyr::filter(Donner, age >= 18)
Donner$y_numeric <- ifelse(Donner$y == "survived", 1, 0)
## Models
m <- glm(y_numeric ~ age, data=Donner, family = "binomial")
m_in <- glm(y_numeric ~ 1, data=Donner, family = "binomial")
## Option #1 (squared correlation)
cor(m$fitted.values, Donner$y_numeric)^2
## [1] 0.08461331
## Option #2 (Efron)
1 - sum((Donner$y_numeric - m$fitted.values)^2)/sum((Donner$y_numeric - mean(Donner$y_numeric))^2)
## [1] 0.08454015
## Option #3 (McFadden)
1 - as.numeric(logLik(m)/logLik(m_in))
## [1] 0.07193049
Generally speaking, accurately predicting a binary outcome is considerably harder than predicting a numeric outcome, so you shouldn’t hold pseudo-\(R^2\) measures to the same standards that you’d apply to \(R^2\) in linear regression. That said, the observed pseudo-\(R^2\) values of less than 0.10 indicate that this model doesn’t explain much of the variability in survival outcomes.
\(~\)
Question #1: Using the Donner Party dataset (loaded above), compare the pseudo-\(R^2\) of the logistic regression models y ~ age + sex and y ~ age; which model appears to explain more variability in the outcome? Are you fully convinced that it’s the better model? Briefly explain.
\(~\)
An even simpler way to quantify model performance for a binary outcome is classification accuracy, or the proportion of correct classifications the model makes:
\[\text{Accuracy} = \tfrac{1}{n}\sum_{i=1}^{n} I(y_i = \hat{y}_i)\]
where \(I(y_i = \hat{y}_i)\) equals 1 when the predicted category matches the observed category and 0 otherwise.
Calculating accuracy requires the data analyst to map the model’s predicted probabilities (ie: the \(\hat{\pi}_i\)’s) into predicted categories (ie: \(\hat{y}_i\)’s) using a decision rule: \[\hat{y}_i = 1 \text{ if } \hat{\pi}_i \geq t; \\ \hat{y}_i=0 \text{ if } \hat{\pi}_i<t\]
In many circumstances, a threshold of \(t = 0.5\) is a reasonable choice since it maps predicted probabilities to the “most likely” category.
For a logistic regression model fit using the glm function, predicted probabilities are returned as a list element named fitted.values. The code below shows how to use the ifelse function to map these probabilities into predicted categories:
fit <- glm(y ~ age + sex, data = Donner, family = "binomial")
predictedCategory <- ifelse(fit$fitted.values > .5, "survived", "died")
We can then calculate accuracy by finding the proportion of these predictions which match the actual observed outcome:
sum(Donner$y == predictedCategory)/length(Donner$y)
## [1] 0.7674419
We see that this logistic regression model achieves roughly 77% accuracy, although we should note that this is in-sample accuracy (out-of-sample accuracy will come up later in this lab).
It’s common to report accuracy alongside a confusion matrix, which is a two-way table displaying the model’s predicted categories and the data’s actual observed categories. The typical convention is to put the observed categories in the rows and the predicted categories in the columns.
## Confusion Matrix (predictions = columns)
table(Donner$y, predictedCategory)
## predictedCategory
## died survived
## died 23 2
## survived 8 10
Based upon this confusion matrix, we can see that there were a total of 10 misclassifications, and 8 of them were people who actually survived but whom our model predicted would die (based upon their age and sex, using a probability threshold of \(t = 0.5\)).
Question #2: Using the same logistic regression model, y ~ age + sex, change the threshold that maps predicted probabilities to survival outcomes to \(t = 0.6\). How does this change impact the in-sample classification accuracy? How does it influence the distribution of misclassifications in the confusion matrix?
\(~\)
Accuracy is a popular performance measure because it’s easy to understand and act upon, but relying solely on accuracy can be problematic when one category is much more common than the other. For example, suppose you build a model to predict whether Xavier freshmen will return for a second year. It might sound fairly impressive if this model achieves 85% classification accuracy.
However, the reality is that 85% of all Xavier freshmen return, which means that an uninformative model which simply predicts “return” for everyone can trivially achieve 85% accuracy. In other words, accuracy isn’t always a good reflection of how useful a model is. Instead, we’d like to know by what degree are a model’s predictions more accurate than what could be expected by random chance.
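To make this concrete with the data we already have loaded, here is a quick sketch (our own illustration, not part of the original Donner analysis) showing that an uninformative classifier which always predicts the more common outcome achieves the base-rate accuracy without using any predictors:
## Accuracy of always predicting "died" (25 of the 43 adults died): about 0.58
mean(Donner$y == "died")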
Cohen’s kappa, denoted as \(\kappa\), is a statistic originally designed to measure inter-rater reliability in psychology, but it is also frequently used to measure the agreement between a model’s predictions and the actual data. \[\kappa = \tfrac{p_o - p_e}{1 - p_e}\]
Here, \(p_o\) is the observed accuracy from the model, and \(p_e\) is the expected accuracy under random chance.
In our XU freshmen example, \(p_o = 0.85\), as the model achieved 85% accuracy. Also, \(p_e = 0.85\) because an uninformative model could achieve 85% accuracy. Thus, \(\kappa = 0\) in this example.
In contrast, suppose we had a perfect model which could predict whether a student returns with 100% accuracy. Then \(p_o = 1.00\) and \(p_e = 0.85\), leading to \(\kappa = 1\). Thus, kappa values nearer to 1 indicate better classification performance, while values nearer to zero indicate worse performance.
Note, it is possible for kappa to be negative if a model makes predictions that are worse than random chance.
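As a minimal sketch of this calculation, using the hypothetical Xavier retention numbers from above (and, as in this lab, taking \(p_e\) to be the accuracy of always guessing the most common category):
p_o <- 0.85  ## observed accuracy of the model
p_e <- 0.85  ## accuracy of an uninformative model that predicts "return" for everyone
(p_o - p_e)/(1 - p_e)  ## kappa = 0, no improvement over chance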
Question #3: Calculate Cohen’s kappa for the logistic regression model y ~ age + sex. How would you interpret this measure?
\(~\)
In some applications certain types of misclassifications can be more harmful than others. For example, it is much less problematic for a model to classify a healthy person as having cancer than it is for a model to classify a person with cancer as being healthy. The former might lead to some inconveniences, while the latter will certainly result in grave health consequences.
These two different types of misclassification give rise to the terms false positive, which describes when a “negative” observation is classified as “positive”, and false negative, which describes when a “positive” observation is classified as “negative”.
Just as we previously re-coded the binary categories as “1” and “0”, it is up to the analyst to decide which category to designate as “positive” and which as “negative”.
Sensitivity is defined as the true positive rate: \[\text{Sensitivity} = \tfrac{\text{#True Positives}}{\text{#True Positives + #False Negatives}}\]
Put differently, sensitivity is the ability of a model to correctly classify an observation as “positive”, as the denominator represents the total number of “positive” observations, and the numerator represents the number correctly classified.
Specificity is defined as the true negative rate: \[\text{Specificity} = \tfrac{\text{#True Negatives}}{\text{#True Negatives + #False Positives}}\]
So, specificity is the ability of a model to correctly classify an observation as “negative”, as the denominator represents the total number of “negative” observations, and the numerator represents the number correctly classified.
Note, a model that classifies everything as “positive” will have a sensitivity of 1 and a specificity of 0. Conversely, a model that classifies everything as “negative” will have a sensitivity of 0 and a specificity of 1.
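As a concrete sketch (our own choice of treating “survived” as the “positive” category), both quantities can be computed from the confusion matrix we built earlier for the y ~ age + sex model with \(t = 0.5\):
## Rows are the observed categories, columns are the predicted categories
cm <- table(Donner$y, predictedCategory)
cm["survived", "survived"]/sum(cm["survived", ])  ## sensitivity: 10/18, about 0.56
cm["died", "died"]/sum(cm["died", ])              ## specificity: 23/25, or 0.92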
For all non-trivial models there is a trade-off between sensitivity and specificity that depends upon the classification threshold, \(t\), which is used to map predicted probabilities to categories. One way to visually represent this trade-off is a Receiver Operating Characteristics (ROC) curve.
The code below shows the ROC curve for the Donner Party logistic regression model, created using ggplot via the plotROC package. The aesthetic “d” signifies the true “disease” status, while “m” signifies the “marker”, or the predicted probability of disease. By default, geom_roc will take the first level of the binary categorical outcome to be the “positive” class.
library(plotROC)
fit <- glm(y_numeric ~ age + sex, data = Donner, family = "binomial")
df <- data.frame(y_numeric = Donner$y_numeric, pi = fit$fitted.values)
ggplot(df, aes(d = y_numeric, m = pi)) + geom_roc()
The y-axis of the ROC plot depicts the true positive rate, or sensitivity, while the x-axis shows the false positive rate, which is the same as 1-specificity. The curve depicts the sensitivity and 1-specificity that result from different decision thresholds (values of \(t\) as previously defined), which are marked at various points.
For example, the threshold \(t = 0.6\) results in approximately 50% sensitivity and 90% specificity, indicating the model is better at predicting deaths than at predicting survival. This same trend could be observed in the confusion matrix we analyzed earlier.
We could supply a color argument to produce separate ROC curves for males and females:
df <- data.frame(y_numeric = Donner$y_numeric, pi = fit$fitted.values, sex = Donner$sex)
ggplot(df, aes(d = y_numeric, m = pi, color = sex)) + geom_roc()
Here we see that it is much more difficult to predict survival for males than for females in the Donner Party.
An ROC curve can provide an overall picture of a model’s classification ability, but it is also nice to have single number that summarizes the efficacy of a model. The Area Under the Curve (AUC) of the ROC curve is one such commonly used summary measure. An AUC of 1 indicates perfect classification, while an AUC of 0.5 is akin to random guessing.
The AUC of an ROC curve can be found using the calc_auc function in the plotROC package.
plot1 <- ggplot(df, aes(d = y_numeric, m = pi)) + geom_roc()
plot2 <- ggplot(df, aes(d = y_numeric, m = pi, color = sex)) + geom_roc()
calc_auc(plot1)
## PANEL group AUC
## 1 1 -1 0.6933333
calc_auc(plot2)
## PANEL group AUC
## 1 1 1 0.9250000
## 2 1 2 0.3914474
calc_auc accepts a ggplot object containing an roc geom and returns the area under the curve(s) for that plot.
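If you just want the numeric AUC value(s), say to compare two models programmatically, one option is to extract the AUC column from the data frame that calc_auc returns:
calc_auc(plot1)$AUC  ## the AUC value(s) as a numeric vector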
Question #4: Plot the ROC curve for the logistic regression model y_numeric ~ age + sex + status. Based upon this curve, what are the sensitivity and specificity (approximately) when using the threshold \(t = 0.5\) to make classifications?
Question #5: Compare the AUC of the model y_numeric ~ age + sex + status with the AUC of the model y_numeric ~ age + sex. Based upon this comparison, which model do you feel is more accurate? Are you convinced that the difference is significant?
\(~\)
While the methods we’ve described are useful in understanding the fit of a model, they tend to systematically favor larger, more complex models. In order to choose a model that balances parsimony with accuracy, there are two commonly used approaches: information criteria (such as AIC and BIC) and cross-validation.
\(~\)
All generalized linear models can be compared using the AIC and BIC model selection criteria. Thus, regardless of whether you’re using logistic regression, linear regression, or any other GLM, AIC and BIC are calculated the same way (with the only difference being the probability distribution that is the basis for the log-likelihood).
As a reminder:

- \(\text{AIC} = -2\log(\hat{\mathcal{L}}) + 2k\)
- \(\text{BIC} = -2\log(\hat{\mathcal{L}}) + k\log(n)\)

where \(\hat{\mathcal{L}}\) is the model’s maximized likelihood, \(k\) is the number of estimated parameters, and \(n\) is the sample size. Recall that the major difference between these two criteria is that BIC tends to favor more parsimonious models (so long as \(n \geq 8\), since then \(\log(n) > 2\)).
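As a quick sketch of how these criteria are used in practice, base R’s AIC() and BIC() functions can be applied directly to fitted glm objects, making it easy to compare a few candidate Donner Party models (these particular fits, m1 and m2, are our own illustration):
## Compare two candidate models using AIC and BIC (smaller is better for both)
m1 <- glm(y ~ age, data = Donner, family = "binomial")
m2 <- glm(y ~ age + sex, data = Donner, family = "binomial")
AIC(m1); AIC(m2)
BIC(m1); BIC(m2)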
Further, these model selection criteria can be the basis of stepwise selection algorithms:
library(MASS)
## Backwards stepwise selection (initial model contains all predictors)
init_fit <- glm(y ~ age + sex + status, data = Donner, family = "binomial")
stepAIC(init_fit,
scope = list(upper = y ~ age + sex + status,
lower = y ~ 1),
direction = "both")
## Start: AIC=51.6
## y ~ age + sex + status
##
## Df Deviance AIC
## - sex 1 41.674 49.674
## <none> 41.604 51.604
## - status 2 50.134 56.134
## - age 1 48.229 56.229
##
## Step: AIC=49.67
## y ~ age + status
##
## Df Deviance AIC
## <none> 41.674 49.674
## + sex 1 41.604 51.604
## - age 1 48.922 54.922
## - status 2 54.261 58.261
##
## Call: glm(formula = y ~ age + status, family = "binomial", data = Donner)
##
## Coefficients:
## (Intercept) age statusHired statusSingle
## 3.22364 -0.08161 -2.00809 -18.28240
##
## Degrees of Freedom: 42 Total (i.e. Null); 39 Residual
## Null Deviance: 58.47
## Residual Deviance: 41.67 AIC: 49.67
Question #6: Using the spam dataset, which was introduced in the previous lab, apply forward stepwise selection using the BIC selection criterion to choose a model that predicts the binary outcome variable “type”. Report the final model. Please type out or copy-paste the final model formula and comment out your stepAIC() code when compiling your lab write-up.
Hints:

- Supply the argument k = ... to make stepAIC() use BIC.
- Use trace = FALSE to stop stepAIC() from printing output at every step.

Question #7: As illustrated in the previous question, the computational burden of model selection algorithms in the logistic regression setting is non-trivial. Had you used a purely forward selection algorithm (ie: direction = "forward") rather than a stepwise forward selection algorithm (ie: direction = "both"), would you expect the run time to remain approximately the same? Or would it decrease? Briefly explain. (Note: you don’t need to write any code for your answer to this question.)
library(kernlab)
data("spam")
table(spam$type)
##
## nonspam spam
## 2788 1813
## Construct a formula string containing all 57 predictors (column 58 of 'spam' is the outcome, "type")
rhs <- paste(names(spam)[-58], collapse = " + ")
max_form <- paste0("type ~ ", rhs)
\(~\)
A final tool that can be used in both the selection and summarization of logistic regression models is cross-validation. Recall that cross-validation allows us to estimate the out-of-sample performance of a model. Evaluating a model based upon its out-of-sample performance, rather than its in-sample performance, protects against overfitting the sample data.
In the logistic regression setting, it’s common to report the cross-validated classification accuracy, Cohen’s kappa, and/or AUC. Notice that all of these are based upon predicted probabilities, so the cross-validation loop is actually fairly simple and directly mirrors code we’ve seen previously in Lab #4.
### Make outcome numeric for ease of use
spam$type <- ifelse(spam$type == "spam", 1, 0)
### Setup
set.seed(123)
fold_id <- sample(rep(1:5, length.out = nrow(spam)), size = nrow(spam))
preds <- numeric(nrow(spam))
## Loop across CV folds
for(k in 1:5){
## Subset the data
train <- spam[fold_id != k, ]
test <- spam[fold_id == k, ]
## Fit models on the data
m <- glm(type ~ ., data = train, family = "binomial")
## Store predictions
preds[fold_id == k] <- predict(m, newdata = test, type = "response")
}
## Out of sample accuracy
pred_class <- ifelse(preds >= .5, 1, 0)
out_acc <- sum(pred_class == spam$type)/nrow(spam)
out_acc
## [1] 0.9234949
## Out of sample kappa
prop.table(table(spam$type)) ## Guessing "nonspam" for everything is 60.59% accurate
##
## 0 1
## 0.6059552 0.3940448
(out_acc - 0.6059552)/(1 - 0.6059552)
## [1] 0.8058467
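As a side note, rather than typing the chance accuracy by hand, you could compute it directly; this is a small variation on the calculation above, not a different method:
## Same kappa calculation without hard-coding the chance accuracy
p_e <- max(prop.table(table(spam$type)))  ## proportion of the most common class
(out_acc - p_e)/(1 - p_e)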
## Out of sample AUC
df <- data.frame(y_numeric = spam$type, pi = preds)
pl <- ggplot(df, aes(d = y_numeric, m = pi)) + geom_roc()
calc_auc(pl)
## PANEL group AUC
## 1 1 -1 0.9712265
\(~\)
In the early 2000s, a research team of US and Bangladeshi scientists measured the arsenic levels of wells in Arahazar upazila, Bangladesh. The team identified safe and unsafe wells (unsafe wells had arsenic concentrations of at least 0.5, measured in hundreds of micrograms per liter), and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.
A few years later, researchers followed up to see which of these residents had actually switched to safer wells. The results of this follow-up study are documented in the “Wells” dataset within the “carData” package:
library(carData)
data("Wells")
summary(Wells)
## switch arsenic distance education association
## no :1283 Min. :0.510 Min. : 0.387 Min. : 0.000 no :1743
## yes:1737 1st Qu.:0.820 1st Qu.: 21.117 1st Qu.: 0.000 yes:1277
## Median :1.300 Median : 36.761 Median : 5.000
## Mean :1.657 Mean : 48.332 Mean : 4.828
## 3rd Qu.:2.200 3rd Qu.: 64.041 3rd Qu.: 8.000
## Max. :9.650 Max. :339.531 Max. :17.000
The dataset contains the following variables:

- switch: whether the household switched to a different (safe) well (“yes” or “no”)
- arsenic: the arsenic level of the household’s original well (in hundreds of micrograms per liter)
- distance: the distance (in meters) to the nearest known safe well
- education: years of education of the head of the household
- association: whether any members of the household are active in community organizations (“yes” or “no”)
Question #8: Use a backwards stepwise selection algorithm (with AIC as the model selection criteria) to find a logistic regression model that predicts “switch” using some or all of the remaining variables in the dataset. Report your final model.
Question #9: Interpret the estimated effect of each variable in the model on the odds that a household switches wells. Based upon the \(Z\)-tests provided by the summary function, which variable appears to be the strongest predictor (after adjusting for the others)?
Question #10: Report the in-sample pseudo-\(R^2\) (your choice as to which one), classification accuracy, confusion matrix, and Cohen’s kappa of your model.
Question #11: Create an ROC curve displaying the sensitivity and specificity of your model. Based upon your own subjective beliefs regarding the importance of sensitivity and specificity in this application, use the ROC curve to select a decision threshold \(t\). Briefly explain the rationale for your choice. Finally, calculate and report the model’s AUC.
Question #12: Use cross-validation to estimate the out-of-sample classification accuracy of this model. Based upon a comparison of the in-sample and out-of-sample accuracy, do you believe your model is overfit?