\(~\)

Lab

This lab covers logistic regression, that is, statistical models of the form: \[\log\big(\tfrac{Pr(y=1)}{1-Pr(y=1)}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p\]

Logistic regression is used to model a binary outcome, \(y\), using one or more explanatory variables. In this lab’s examples we’ll look at the NBA shots data set from our previous lecture.
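
Equivalently, solving the model equation for the probability gives the inverse-logit (logistic) form, which we’ll rely upon when interpreting fitted models:

\[Pr(y=1) = \frac{e^{\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p}}{1 + e^{\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p}}\]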

## Libraries and data
library(ggplot2)
library(dplyr)

## Load Data, remove low quality points, and add a couple new variables
shots = read.csv("https://remiller1450.github.io/data/shot_logs.csv") %>%
  filter(TOUCH_TIME > 0) %>%
  mutate(OUTCOME = ifelse(SHOT_RESULT == "made", 1, 0),
         CAT_DRIBBLES = cut(DRIBBLES, breaks = c(-Inf, 0, 1, Inf), 
                            labels = c("zero", "one", "two or more")))

The variables of interest in this data set are:

  • OUTCOME - the primary outcome in our models: 1 if the shot was made, 0 otherwise
  • SHOT_DIST - distance from the basket when the shot was attempted
  • CLOSE_DEF_DIST - distance from the closest defender when the shot was attempted
  • TOUCH_TIME - how long the player had control of the ball prior to the shot
  • CAT_DRIBBLES - whether the player made “zero”, “one”, or “two or more” dribbles before attempting the shot
  • PERIOD - the period of the game during which the shot was attempted (1 = first quarter, 2 = second quarter, …, 5 = first overtime, …)
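
Before fitting any models, it can be helpful to take a quick look at these variables. A minimal sketch (all column names are those described above):

## Peek at the modeling variables
head(shots[, c("OUTCOME", "SHOT_DIST", "CLOSE_DEF_DIST", "TOUCH_TIME", "CAT_DRIBBLES", "PERIOD")])

## Tally made (1) vs. missed (0) shots
table(shots$OUTCOME)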

\(~\)

Fitting Models using glm()

Logistic regression models are fit using the glm() function with the argument family = "binomial". The rest of the syntax mirrors what we’ve used when fitting linear regression models with lm(); that is, we still provide a formula representation of the variables involved in our model. The code below fits a logistic regression model predicting OUTCOME using SHOT_DIST:

mod1 = glm(OUTCOME ~ SHOT_DIST, data = shots, family = "binomial")

You should recognize that this is the fitted version of the theoretical model:

\[\log\bigg(\frac{Pr(\text{Outcome = 1})}{1-Pr(\text{Outcome = 1})}\bigg) = \beta_0 + \beta_1 \cdot \text{Shot Distance}\]

We can access the estimated coefficients, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), using the coef() function:

coef(mod1)
## (Intercept)   SHOT_DIST 
##  0.41475994 -0.04478712

Interpreting each coefficient (a code sketch follows this list):

  • The estimated intercept, \(\hat{\beta}_0 = 0.4\), tells us the expected log-odds of a shot being made when it is attempted from a distance of zero feet are 0.4
    • We should always apply the inverse transformation when interpreting, so \(exp(0.4) = 1.49\) suggests the expected odds of a shot being made from zero feet are 1.49
    • Converting these odds to a probability, \(1.49/(1 + 1.49) \approx 0.6\), gives an implied probability of roughly 60%
  • The estimated coefficient, \(\hat{\beta}_1 = -0.044\), tells us the expected log-odds of a shot being made decrease by 0.044 for every 1-foot increase in distance from the basket
    • Again, we should always apply the inverse transformation when interpreting, so \(exp(-0.044) = 0.957\) suggests the expected odds of a shot being made change by a multiplicative factor of 0.957, a 4.3% decrease, for each additional foot from the basket
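
These calculations are easy to perform in R. The sketch below exponentiates the fitted coefficients to move from the log-odds scale to the odds scale, then uses plogis(), base R’s inverse-logit function, to convert the intercept into an implied probability:

## Odds scale: exponentiate the estimated coefficients
exp(coef(mod1))

## Implied probability of making a shot from zero feet
plogis(coef(mod1)["(Intercept)"])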

Question #1: In this question you’ll fit models involving various combinations of the predictors TOUCH_TIME and CAT_DRIBBLES and interpret the coefficients in these models.

  • Part A: Fit a logistic regression model using only TOUCH_TIME to predict OUTCOME. Provide an interpretation for both estimated coefficients in this model. Remember to apply the inverse transformation before forming your interpretation.
  • Part B: Now fit a logistic regression model using only CAT_DRIBBLES to predict OUTCOME. Provide an interpretation for all of the estimated coefficients in this model. Hint: this is a categorical predictor, so you should review our lecture slides and recognize that your interpretations should involve odds ratios.
  • Part C: Now fit a logistic regression model using both TOUCH_TIME and CAT_DRIBBLES to predict OUTCOME. Why do the estimated coefficients of TOUCH_TIME and the “two or more” level of CAT_DRIBBLES both get closer to zero in this model? Hint: If these predictors were independent, adjusting for one would not impact the coefficient of the other. But are they independent?

\(~\)

Comparing Models with Likelihood Ratio Tests

The family = "binomial" argument we’ve been using specifies the model’s “likelihood” function. When evaluated, the likelihood measures how closely the model fits the sample data. Similar to the sum of squared residuals in linear regression, the likelihood is a single number summarizing the fit of a model; it doesn’t mean anything in an absolute sense, but it is meaningful in a relative sense when comparing models fit to the same data.

The lrtest() function in the lmtest library is used to perform likelihood ratio tests on nested logistic regression models. The function is extremely similar to anova(), which we used to compare nested linear regression models:

## Fit both models
mod1 = glm(OUTCOME ~ SHOT_DIST, data = shots, family = "binomial")
mod2 = glm(OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, data = shots, family = "binomial")

## Likelihood ratio test
library(lmtest)
lrtest(mod1, mod2)
## Likelihood ratio test
## 
## Model 1: OUTCOME ~ SHOT_DIST
## Model 2: OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST
##   #Df LogLik Df  Chisq Pr(>Chisq)    
## 1   2 -83514                         
## 2   3 -82565  1 1898.2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis here is that the two models have the same likelihood (a likelihood ratio of 1), and the very small \(p\)-value in this example provides strong evidence that the larger model fits the data significantly better than the smaller model.
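
To see where the numbers in this output come from, we can reproduce the test by hand using logLik() and pchisq(); a minimal sketch:

## Log-likelihood of each model (the LogLik column above)
logLik(mod1)
logLik(mod2)

## Test statistic: twice the difference in log-likelihoods
LR_stat = 2 * (as.numeric(logLik(mod2)) - as.numeric(logLik(mod1)))

## p-value from a chi-squared distribution with 1 df (one added coefficient)
pchisq(LR_stat, df = 1, lower.tail = FALSE)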

Question #2:

  • Part A: Can a likelihood ratio test help you determine whether the model OUTCOME ~ TOUCH_TIME is preferable to the model OUTCOME ~ CAT_DRIBBLES? Briefly explain.
  • Part B: Use a likelihood ratio test to compare the model OUTCOME ~ TOUCH_TIME + CAT_DRIBBLES with the model OUTCOME ~ TOUCH_TIME. Provide the \(p\)-value of the test and a brief conclusion indicating which model is preferable.

\(~\)

Inference on Coefficients

Inference on individual coefficients in a logistic regression model can be conducted using the same functions we previously saw for linear regression:

  • summary() will test the null hypothesis \(H_0:\beta_j=0\) for every coefficient in the model
  • confint() will calculate confidence interval estimates for each of the model’s coefficients

Below are quick examples:

## Model from before
mod2 = glm(OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, data = shots, family = "binomial")

## Tests on coefficients using summary()
summary(mod2)
## 
## Call:
## glm(formula = OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, family = "binomial", 
##     data = shots)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0268  -1.0882  -0.8232   1.1524   2.1135  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.2053367  0.0116966   17.55   <2e-16 ***
## SHOT_DIST      -0.0643797  0.0008261  -77.93   <2e-16 ***
## CLOSE_DEF_DIST  0.1150303  0.0027351   42.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 171730  on 124710  degrees of freedom
## Residual deviance: 165130  on 124708  degrees of freedom
## AIC: 165136
## 
## Number of Fisher Scoring iterations: 4

## Confidence intervals using confint()
confint(mod2)
##                      2.5 %      97.5 %
## (Intercept)     0.18241356  0.22826366
## SHOT_DIST      -0.06600082 -0.06276239
## CLOSE_DEF_DIST  0.10967881  0.12040027
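
Because these endpoints are on the log-odds scale, it is common to exponentiate them to obtain confidence intervals for odds ratios, which are easier to interpret:

## 95% confidence intervals on the odds ratio scale
exp(confint(mod2))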

A few additional details:

  • The coefficients in logistic regression are estimated using a method known as maximum likelihood estimation. Likelihood theory provides a Normal approximation for the sampling distribution of these coefficients, which becomes more accurate as \(n\) approaches infinity (see the sketch after this list).
  • In contrast, the coefficients in linear regression follow a \(t\)-distribution in finite samples when the model’s Normality assumption is met.
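
The sketch below uses this Normal approximation to construct Wald-style 95% confidence intervals directly from the estimates and standard errors reported by summary(); in a sample this large they should closely match the profile-likelihood intervals returned by confint():

## Wald intervals: estimate +/- 1.96 standard errors
ests = coef(summary(mod2))[, "Estimate"]
ses  = coef(summary(mod2))[, "Std. Error"]
cbind(lower = ests - 1.96 * ses, upper = ests + 1.96 * ses)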

Question #3: The code provided below converts the PERIOD variable into categories, storing the result as a new variable CAT_PERIOD.

## Add a categorical (factor) version of the period variable
shots$CAT_PERIOD = factor(shots$PERIOD)
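
A quick check that the conversion produced the intended factor levels:

## Inspect the levels of the new factor
levels(shots$CAT_PERIOD)
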
  • Part A: Use a likelihood ratio test to compare the models: OUTCOME ~ SHOT_DIST + CAT_PERIOD and OUTCOME ~ SHOT_DIST. Report the \(p\)-value and briefly explain whether you think CAT_PERIOD is a worthwhile predictor after accounting for SHOT_DIST.
  • Part B: Consider the model OUTCOME ~ SHOT_DIST + CAT_PERIOD. After adjusting for shot distance, which periods tend to have significantly worse shooting performance than the first quarter (period 1)?
  • Part C: Now consider the model OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST + TOUCH_TIME + CAT_PERIOD. After adjusting for shot distance, defender distance, and touch time, which periods tend to have significantly worse shooting performance than the first quarter (period 1)?
  • Part D: Create a data visualization that relates the variables TOUCH_TIME and CAT_PERIOD. Based upon what you see, why might the differences involving periods 2, 4, and 5 (first overtime) diminish after adjusting for TOUCH_TIME?
  • Part E: Report a 95% confidence interval estimate for the adjusted odds ratio comparing the odds of making a shot in the fourth quarter (period 4) relative to the first quarter (period 1) that adjusts for differences in shot distance, defender distance, and touch time. Hint: you should use confint() and exponentiation.

\(~\)

Application

In the early 2000s, a team of US and Bangladeshi scientists measured the arsenic levels of wells in Arahazar upazila, Bangladesh. The team labeled each well as safe or unsafe (unsafe wells had arsenic concentrations ≥0.5μg/L), and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.

A year later the researchers followed up to see which residents had switched to a safe well. The data below record the results of this study:

wells = read.csv("https://remiller1450.github.io/data/Wells.csv") %>%
  mutate(switch_binary = ifelse(switch == "yes", 1, 0))

The variables in this data set are:

  • switch_binary - whether a family switched to a safe well (1 = yes, 0 = no)
  • arsenic - the arsenic concentration of the family’s original well
  • distance - the distance from the family’s home to the nearest safe well
  • education - the number of years of formal education of the head of household
  • association - whether the household is active in community associations
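
As a syntax refresher for Part A of the question below, a set of box plots for a single variable might be created using ggplot2 (loaded at the start of the lab); for example, for arsenic:

## Compare arsenic concentrations between switchers and non-switchers
ggplot(wells, aes(x = switch, y = arsenic)) + geom_boxplot()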

Question #4: This question is intended to provide a step-by-step walk through of a basic logistic regression analysis. You should keep this in mind if you’re considering logistic regression for the project.

  • Part A: There are three quantitative explanatory variables in this data set. For each of these variables, create a set of box plots comparing their distribution among those who switched wells and those who didn’t. Based upon what you see in these box plots, come up with your own ranking of the strength of marginal association between each of these variables and switch. That is, determine which variable seems most strongly related to switch, which seems second most strongly related, etc.
  • Part B: Use likelihood ratio tests to sequentially add explanatory variables one at a time and find a well-fitting logistic regression model. You should start by comparing an intercept-only model with a model containing only the strongest predictor you identified in Part A. You should then consider models with additional predictors added in the order you determined in Part A until there is no longer a significant improvement in fit. You may simply provide your code and output for these tests with a single sentence indicating which model was chosen as your final model.
  • Part C: Use a likelihood ratio test to determine whether association should be added to your final model from Part B. Report the \(p\)-value of this test and a brief conclusion.
  • Part D: Interpret the effect of each variable in your final model. Be sure to exponentiate each coefficient to facilitate a meaningful interpretation.