Directions:

As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This means you will work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.

You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.

You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.

\(~\)

Introduction

The focus of this lab is logistic regression, an approach to modeling binary outcomes using the model: \[log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \]

  • \(\pi\) is the probability of a “success”, or the category a dummy variable would encode as “1”
  • Logistic regression models the log-odds of \(\pi\) using the observed outcomes, \(\{y_1, \ldots, y_n\}\), and a set of predictors
  • In this model, we will use \(\hat{\pi}_i\) to express the estimated probability that \(y_i =\) “Category 1”

In words, logistic regression models the natural-logarithm of the odds of category “1” as a linear combination of explanatory variables \(\{X_1, \ldots, X_p\}\).
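To make the log-odds scale more concrete, here is a small illustration (not part of the lab’s required code, and using an arbitrary example probability of 0.75) of how a probability maps to odds and then to log-odds:

p <- 0.75                   # an example probability of "success"
odds <- p / (1 - p)         # odds = 3, ie: success is 3 times as likely as failure
log_odds <- log(odds)       # natural logarithm of the odds, roughly 1.10
c(probability = p, odds = odds, log_odds = log_odds)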

Before getting into the details, we’ll first see an illustration of how logistic regression differs from linear regression.

\(~\)

Logistic vs. Linear Regression for Binary Outcomes

To illustrate the differences between logistic and linear regression, we’ll analyze the survival of each adult member of the Donner Party, a group of 45 pioneers whose migration to California was delayed by a series of mishaps which resulted in the group being stranded in the Sierra Nevada mountains.

library(alr4)
data("Donner")

## Recode the outcome as a numeric dummy variable (1 = survived, 0 = died)
Donner$survived_numeric <- ifelse(Donner$y == "survived", 1, 0)

## Keep only the adult members of the party
Donner <- dplyr::filter(Donner, age >= 18)

Visual inspection seems to show that the variables “Age” and “Sex” are both predictive of the binary outcome “Survival”. A naive analysis might re-code “Survival” into a numeric dummy variable, then use it as the outcome in the linear regression model: Survival ~ Age + Sex.

On the surface this model seems sensible, as the slope coefficients can be interpreted as linear increases in the estimated probability of the outcome coded as “1” (ie: \(\hat{\pi}\)). However, when plotted we can quickly see there are some problems with the naive approach…

Question #1: Fit the linear regression model, survived_numeric ~ age + sex, that is depicted earlier in this section. Using this model, provide a one-sentence interpretation of the estimated slope coefficient for the variable “age”. Then, briefly describe the most noticeable problem with the predictions generated by this model.

\(~\)

The analogous logistic regression model, which uses the formula y ~ age + sex, is depicted below:

Question #2: Inspect the fitted logistic regression model displayed above. Based upon the graph of this model, what would you anticipate happens to the model’s estimated probability of survival as the predictor “age” \(\rightarrow \infty\)? How does this differ from the linear regression model depicted earlier?

\(~\)

Fitting a Logistic Regression Model in R

Logistic regression (and linear regression) is an example of a generalized linear model (GLM), a broad class of statistical models where each predictor makes a linear contribution towards the prediction of an outcome. Without getting too far into the details of GLMs, they consist of the following components:

  • Systematic component - a linear combination of predictor variables (ie: \(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\))
  • Random component - a probability distribution for \(Y\), the outcome variable
  • Link function - a function that links \(E(Y)\) to the model’s systematic component

Linear regression was a GLM that used the Normal distribution as its random component, and the identity function (ie: \(g(A) = A\)) as its link function. Thus, the model specified \(E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\), and \(Y\) was presumed to follow a Normal distribution.
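If you’d like to verify this correspondence yourself, the sketch below (which uses R’s built-in mtcars data purely for illustration, not part of this lab) shows that glm() with family = gaussian() reproduces the coefficient estimates from lm():

## Ordinary least squares fit
fit_lm  <- lm(mpg ~ wt, data = mtcars)

## The same model expressed as a GLM: Normal random component, identity link
fit_glm <- glm(mpg ~ wt, data = mtcars, family = gaussian())

coef(fit_lm)   # intercept and slope from lm()
coef(fit_glm)  # identical estimates from the GLM formulation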

In logistic regression, the outcome variable, \(Y\), is a binary categorical variable, thus the binomial distribution is used as the model’s random component. Then, the logit function (ie: \(g(A) = log\big(\tfrac{A}{1-A}\big)\)) is used as the model’s link function.

To fit a logistic regression model in R you can use the glm() function with the argument family = "binomial".

m <- glm(y ~ age + sex, data = Donner, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = y ~ age + sex, family = "binomial", data = Donner)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5996  -0.9972  -0.4511   0.8703   2.0389  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  2.40022    1.21648   1.973   0.0485 *
## age         -0.05788    0.03322  -1.742   0.0814 .
## sexMale     -1.39310    0.70579  -1.974   0.0484 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 58.466  on 42  degrees of freedom
## Residual deviance: 50.134  on 40  degrees of freedom
## AIC: 56.134
## 
## Number of Fisher Scoring iterations: 4

\(~\)

Interpreting Logistic Regression Coefficients

When interpreting the estimated coefficients, \(\{\hat{\beta}_0, \hat{\beta}_1, \ldots \hat{\beta}_p \}\), in a fitted logistic regression model we must recognize that each coefficient represents an additive linear contribution on the log-odds scale. Further, if the model includes multiple predictors, these are adjusted effects.

m <- glm(y ~ age + sex, data = Donner, family = "binomial")
m$coefficients
## (Intercept)         age     sexMale 
##  2.40022081 -0.05788031 -1.39309808

So, in the Donner Party example, each 1 year increase in age decreases the estimated log-odds of survival by 0.0579 (after adjusting for “sex”).

Unfortunately this effect isn’t very easily understood, as most people have no intuition at all regarding the log-odds of an outcome, but luckily: \[log\big(\tfrac{\pi}{1-\pi}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \\ \implies \tfrac{\pi}{1-\pi} = exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p) \\ \implies \tfrac{\pi}{1-\pi} = exp(\beta_0)*exp(\beta_1X_1)* \ldots *exp(\beta_pX_p)\]

Based upon the expressions shown above, we can interpret:

  • The exponentiated intercept (\(e^{\beta_0}\)) as the baseline odds of category “1” (ie: the odds when every explanatory variable equals zero).
  • The exponentiated coefficient of any predictor (ie: \(e^{\beta_1}\)) as a multiplier of the baseline odds for each one unit increase in that explanatory variable.

For the Donner Party example, the coefficient estimate of age, \(\hat{\beta}_1 = -0.0579\), can be exponentiated: \(e^{-0.0579} = 0.944\), which suggests that the odds of survival change by a factor of 0.944 for each 1 year increase in age (after adjusting for the variable “sex”). Equivalently, we might interpret this effect as a roughly 5.6% decline in the odds of survival for every 1 year increase in age (again, after adjusting for “sex”).
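In R, a quick way to obtain these exponentiated effects is to apply exp() to the vector of estimated coefficients (the values in the comments below are approximations based upon the estimates shown earlier):

exp(m$coefficients)
# (Intercept): roughly 11.0  - baseline odds of survival
# age:         roughly 0.944 - odds multiplier per 1 year increase in age
# sexMale:     roughly 0.248 - odds multiplier for males relative to females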

Question #3: Fit a logistic regression model that uses “age” but does not use “sex” as a predictor. What is the unadjusted effect of “age” on survival status on the log-odds scale? How would you interpret the unadjusted effect of “age” on the odds of survival?

Question #4: What do we learn about these data if the unadjusted effect of “age” is similar to the effect of “age” in a model that adjusts for “sex”?

\(~\)

Statistical Inference and Logistic Regression

Recall that the random component of the logistic regression model is based upon the binomial distribution. This makes logistic regression a statistical model, meaning we can use the model to make statistical inferences about a broader population of interest. Similar to linear regression, there are three major types of inference we might be interested in:

  1. Hypothesis Testing
  2. Confidence Interval Estimation
  3. Prediction

The following subsections cover how to perform each type of statistical inference using a fitted logistic regression model.

Hypothesis Tests on \(\beta\) Coefficients

Recall that with linear regression we used the \(t\)-distribution and \(t\)-tests to evaluate the statistical uncertainty in our model coefficient estimates. The rationale was that the model’s random errors were assumed to follow a Normal distribution, but we needed to estimate the error variance, \(\sigma^2\), which created a small amount of extra uncertainty that prompted us to use the \(t\)-distribution rather than the Normal distribution.

For logistic regression, our model is based upon the binomial distribution, which does not have a separate parameter to describe its variance (recall that the binomial variance, \(np(1-p)\), is determined by the probability of observing a “success”). Thus, we can use the Normal distribution and \(z\)-tests for statistical inference (much like we do in hypothesis tests of a single proportion or a difference in proportions).

Practically speaking, the steps involved in testing individual \(\beta\) coefficients are identical to what we did with linear regression. The summary() function will provide us all of the information (such as standard errors, etc.) that is necessary:

m <- glm(survived_numeric ~ age + sex, data = Donner, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = survived_numeric ~ age + sex, family = "binomial", 
##     data = Donner)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5996  -0.9972  -0.4511   0.8703   2.0389  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  2.40022    1.21648   1.973   0.0485 *
## age         -0.05788    0.03322  -1.742   0.0814 .
## sexMale     -1.39310    0.70579  -1.974   0.0484 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 58.466  on 42  degrees of freedom
## Residual deviance: 50.134  on 40  degrees of freedom
## AIC: 56.134
## 
## Number of Fisher Scoring iterations: 4

As a refresher, summary() will test the null hypothesis \(H_0: \beta = 0\); so if you wanted to hypothesize another value, you’d need to extract the coefficient estimate and its standard error to set up the test yourself.
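As a sketch of how you might do this, the code below tests the hypothetical null value \(H_0: \beta_{age} = -0.1\) (chosen purely for illustration) using the estimate and standard error from the summary table:

## Extract the estimate and standard error for "age" from the coefficient table
coef_table <- summary(m)$coefficients
est <- coef_table["age", "Estimate"]
se  <- coef_table["age", "Std. Error"]

## z-statistic and two-sided p-value for the hypothetical null H0: beta_age = -0.1
beta_null <- -0.1
z <- (est - beta_null) / se
p_value <- 2 * pnorm(-abs(z))
c(z = z, p_value = p_value)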

Question #5: Fit the logistic regression model y ~ sex and use this model to test the hypothesis that adult men and women were equally likely to survive in the Donner Party disaster.

\(~\)

Comparing Nested Models

In the linear regression setting, we used ANOVA and the \(F\)-test to compare nested models (recall this also gave us an overall test of model fit relative to an intercept-only model). The analogous approach for comparing two nested logistic regression models is the likelihood ratio test.

Without going too far into the mathematical details, the likelihood ratio test does exactly what its name implies - that is, it compares the ratio of the likelihoods of the two models. A model’s likelihood is just a numeric measurement of how closely a model fits the sample data (recall that likelihood is a key component in model selection criteria like AIC and BIC).

If the larger model has significantly better fit, we’d expect the ratio of its likelihood to that of the smaller sub-model to exceed 1.0 by an amount that is larger than what could reasonably be attributed to random chance. We will not cover the mathematical derivation, but this can be statistically evaluated using a Chi-square test with degrees of freedom equal to the difference in parameters between the two models.

The lrtest() function in the “lmtest” library is used to conduct a likelihood ratio test. Similar to the anova() function, you just need to provide the two fitted models you’re comparing:

m1 <- glm(y ~ age + sex, data = Donner, family = "binomial")
m2 <- glm(y ~ age, data = Donner, family = "binomial")

library(lmtest)
lrtest(m2, m1)
## Likelihood ratio test
## 
## Model 1: y ~ age
## Model 2: y ~ age + sex
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   2 -27.130                       
## 2   3 -25.067  1 4.1263    0.04222 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
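To see where this test statistic comes from, the chi-square statistic reported by lrtest() is simply twice the difference in the two models’ log-likelihoods, which you can reproduce using logLik() and pchisq():

## The LRT statistic is 2*(log-likelihood of larger model - log-likelihood of smaller model)
chisq_stat <- as.numeric(2 * (logLik(m1) - logLik(m2)))

## Compare to a Chi-square distribution with df = difference in number of parameters (here 1)
pchisq(chisq_stat, df = 1, lower.tail = FALSE)
# These values match the Chisq = 4.1263 and Pr(>Chisq) = 0.04222 reported by lrtest()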

Question #6: Use a likelihood ratio test to determine whether the logistic regression model y ~ age + status has significantly better fit than the model y ~ age. Additionally, briefly comment on why using the summary() function’s default hypothesis tests is not sufficient for assessing the importance of the variable “status”.

\(~\)

Confidence and Prediction Intervals

Recall that in the linear regression setting we had two different types of intervals to consider:

  1. Confidence intervals - these represented a range of statistically plausible values for \(E(y)\) for a given set of predictors
  2. Prediction intervals - these represented a range of possible individual \(y\)-values for a given set of predictors

Unfortunately, prediction intervals don’t make sense for logistic regression, since the outcome variable, \(Y\), can only take on the values “0” and “1”.

That leaves us with only confidence intervals as a tool for expressing uncertainty in the model’s predicted outcomes. The code below demonstrates how to come up with these intervals and add them to a graph:

m <- glm(y ~ age + sex, data = Donner, family = "binomial")

## Grid of predictor values used to draw the fitted curves
x1_grid <- seq(17, 80, by = 0.1)
x2_grid <- levels(Donner$sex)
grid <- expand.grid(age = x1_grid, sex = x2_grid)

## Predictions (and their standard errors) on the log-odds scale
preds <- predict(m, newdata = grid,  se.fit = TRUE)
#preds

## Inverse logit function - maps log-odds back to the probability scale
inverse_logit = function(x){
  exp(x)/(1+exp(x))
}

## Point estimates and approximate 95% confidence bounds for E(Y)
ey_pred <- inverse_logit(preds$fit)
ey_lower <- inverse_logit(preds$fit - 1.96*preds$se.fit)
ey_upper <- inverse_logit(preds$fit + 1.96*preds$se.fit)


plot(Donner$age, Donner$survived_numeric, ylim = c(-0.15, 1.15), xlim = c(17,80), xlab = "Age", ylab = "Survival")
lines(x = grid$age[grid$sex == "Male"], y = ey_pred[grid$sex == "Male"], col = "red", lwd =2 )
lines(x = grid$age[grid$sex == "Male"], y = ey_lower[grid$sex == "Male"], col = "red", lty = 2)
lines(x = grid$age[grid$sex == "Male"], y = ey_upper[grid$sex == "Male"], col = "red", lty = 2)

lines(x = grid$age[grid$sex == "Female"], y = ey_pred[grid$sex == "Female"], col = "blue", lwd = 2)
lines(x = grid$age[grid$sex == "Female"], y = ey_lower[grid$sex == "Female"], col = "blue", lty = 2)
lines(x = grid$age[grid$sex == "Female"], y = ey_upper[grid$sex == "Female"], col = "blue", lty = 2)
legend("topright", legend = c("Female", "Male"), col = c("blue","red"), lwd = 2, lty = 1, bty = "n")

The trickiest part is that the predict() function won’t return the confidence bounds directly, so we need to calculate them ourselves by adding and subtracting 1.96 standard errors from the estimated value (for a given set of predictors). Further complicating things is the fact that predict() returns predicted values on the logit scale, so we needed to define an inverse transformation to get back to the probability scale.
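As an aside, predict() can also return fitted values directly on the probability scale via its type argument; the confidence bounds, however, are still best constructed on the logit scale and then back-transformed (as done above):

## Fitted probabilities returned directly on the probability scale
prob_pred <- predict(m, newdata = grid, type = "response")
head(prob_pred)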

Question #7: Modify the code above to use the model y ~ age + status. Your plot should include separate confidence bands for each of the three levels of status. Organize your plot such that family members are colored in blue, hired help is colored in red, and singles are colored in green.

\(~\)

Practice

The following applications are aimed at providing additional practice fitting logistic regression models and interpreting their coefficients.

\(~\)

Application #1

The code below reads the game results for the Xavier men’s basketball team’s 2020-21 season and creates a dummy variable “win” that takes the numeric value of 1 if XU outscored their opponent and 0 otherwise. In this application we’ll use logistic regression to model wins/losses.

xub <- read.csv("https://remiller1450.github.io/data/xubball2021.csv")
xub$win <- ifelse(xub$Margin > 0, 1, 0)

Question #10: Fit the logistic regression model win ~ X3P, which predicts whether a game was won or lost based upon the number of made 3-point shots. How does the number of made 3-point shots impact the odds of Xavier winning? Is the effect of “X3P” statistically meaningful?

Question #11: Fit the logistic regression model win ~ X3P + X3P., which adjusts for the percentage of three point attempts that are made. How does the effect of “X3P” in this model differ from the effect you described in the previous question? Briefly explain why the adjusted effect is so different from the unadjusted effect.

Question #12: Use a likelihood ratio test to compare the two models described in Question #10 and #11. Does the test provide statistically compelling evidence that the larger model provides a better fit?

\(~\)

Application #2

Classifying emails as “spam” is a classical data-science application that involves modeling a binary outcome. The following questions will use the spam dataset contained in the kernlab package, which contains a sample of 4601 emails that were manually labeled as either “spam” or “nonspam” (recorded as the variable “type”), along with various predictors that describe the normalized frequencies of certain words/symbols within each email message. We will focus on the following predictors:

  • “capitalAve” - The average length of uninterrupted sequences of capital letters in the email
  • “charDollar” - The normalized frequency of the dollar sign character
  • “charExclamation” - The normalized frequency of the exclamation point character

library(kernlab)
data("spam")
table(spam$type)
## 
## nonspam    spam 
##    2788    1813

Question #13: Use boxplots to show the distribution of each of the three variables listed above in “spam” and “nonspam” emails. Based upon a visual inspection of these plots, which variable appears to be the strongest predictor of an email being “spam”? (Hint: Your choice of predictor is somewhat subjective, but try “zooming in” on your boxplots using the “xlim” or “ylim” arguments to help you make your selection)

Question #14: Fit a logistic regression model that uses the variable you chose in Question #13 to predict “spam”. Then, use this model to interpret the effect of this variable on the odds of an email being “spam”.

Question #15: Create a graph that displays the expected probability, along with 90% confidence intervals, of an email being spam in response to the predictor you identified in Question #13. (Hint: this graph should resemble the ones from the “Confidence and Prediction Intervals” section)