\(~\)

This lab covers logistic regression, or statistical models of the form: \[\log\big(\tfrac{Pr(y=1)}{1-Pr(y=1)}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p\]
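
Solving for the probability gives the equivalent inverse-logit form, which we'll return to when converting predicted log-odds back to the probability scale: \[Pr(y=1) = \frac{exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)}{1+exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)}\]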

Logistic regression is used when modeling a binary outcome, \(y\), using one or more explanatory variables. In this lab’s examples we’ll look at the NBA shots data set from our previous lecture.

## Libraries and data
library(ggplot2)
library(dplyr)

## Load data, remove low-quality points (non-positive touch times), and add two new variables
shots = read.csv("https://remiller1450.github.io/data/shot_logs.csv") %>%
  filter(TOUCH_TIME > 0) %>%
  mutate(OUTCOME = ifelse(SHOT_RESULT == "made", 1, 0),
         CAT_DRIBBLES = cut(DRIBBLES, breaks = c(-Inf, 0, 1, Inf), 
                            labels = c("zero", "one", "two or more")))

The variables of interest in this data set are:

  • OUTCOME - the primary outcome in models: 1 if a shot is made, 0 otherwise
  • SHOT_DIST - distance from the basket when the shot was attempted
  • CLOSE_DEF_DIST - distance from the closest defender when the shot was attempted
  • TOUCH_TIME - how long the player had control of the ball prior to the shot
  • CAT_DRIBBLES - whether the player made “zero”, “one”, or “two or more” dribbles before attempting the shot
  • PERIOD - the period of the game during which the shot was attempted (1 = first quarter, 2 = second quarter, …, 5 = first overtime, …)
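
Before fitting any models, it's worth a quick check that the filtering and new variables behaved as intended. A minimal sketch using base R's table():

## Quick checks on the outcome and the new categorical predictor
table(shots$OUTCOME)
table(shots$CAT_DRIBBLES)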

\(~\)

Fitting Logistic Regression Models using glm()

Logistic regression models are fit using the glm() function with the argument family = "binomial". The rest of the syntax mirrors what we've done for linear regression models in lm(); that is, we still provide a formula representation of the variables involved in our model. The code below fits a logistic regression model predicting OUTCOME using SHOT_DIST:

mod1 = glm(OUTCOME ~ SHOT_DIST, data = shots, family = "binomial")

You should recognize that this is the fitted version of the theoretical model:

\[\log\bigg(\frac{Pr(\text{Outcome = 1})}{1-Pr(\text{Outcome = 1})}\bigg) = \beta_0 + \beta_1 \cdot \text{Shot Distance}\]

We can access the estimated coefficients, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), using the coef() function:

coef(mod1)
## (Intercept)   SHOT_DIST 
##  0.41475994 -0.04478712

Interpreting each:

  • The estimated intercept, \(\hat{\beta}_0 = 0.415\), tells us the expected log-odds of a shot being made when it is attempted from a distance of zero feet is 0.415
    • We should always apply the inverse transformation when interpreting, so \(exp(0.415) = 1.51\) suggests the expected odds of a shot being made from zero feet are 1.51
    • If we wanted a predicted probability, we could plug the predicted log-odds for a shot from zero feet (0.415, the intercept) into the inverse-logit function \(1/(1+exp(-x))\), which suggests a 60.2% probability of a shot taken from zero feet from the basket being made (see the code sketch after this list)
  • The estimated coefficient, \(\hat{\beta}_1 = -0.045\), tells us the expected log-odds of a shot being made decrease by 0.045 for every 1 foot increase in distance from the basket
    • Again, we should always apply the inverse transformation when interpreting, so \(exp(-0.045) = 0.956\) suggests the expected odds of a shot being made change by a multiplicative factor of 0.956, a 4.4% decrease, for each additional foot from the basket
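
The arithmetic in these interpretations is easy to reproduce in R. Below is a minimal sketch using mod1; plogis() is base R's inverse-logit function, \(1/(1+exp(-x))\):

## Odds ratio for a 1-foot increase in shot distance
exp(coef(mod1)["SHOT_DIST"])

## Predicted probability of a make from zero feet, two equivalent ways
plogis(coef(mod1)["(Intercept)"])
predict(mod1, newdata = data.frame(SHOT_DIST = 0), type = "response")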

Question #1: In this question you’ll fit models involving various combinations of the predictors TOUCH_TIME and CAT_DRIBBLES and interpret the coefficients in these models.

  • Part A: Fit a logistic regression model using only TOUCH_TIME to predict OUTCOME. Provide an interpretation for both estimated coefficients in this model. Remember to apply the inverse transformation before forming your interpretation.
  • Part B: Now fit a logistic regression model using only CAT_DRIBBLES to predict OUTCOME. Provide an interpretation for all of the estimated coefficients in this model. Hint: this is a categorical predictor, so you should review our lecture slides and recognize that your interpretations should involve odds ratios.
  • Part C: Now fit a logistic regression model using both TOUCH_TIME and CAT_DRIBBLES to predict OUTCOME. Why do the estimated coefficients of TOUCH_TIME and of the “two or more” level of CAT_DRIBBLES both get closer to zero in this model? Hint: if these predictors were independent, adjusting for one would not impact the coefficient of the other. But are they independent?

\(~\)

Inference on Coefficients

Inference on individual coefficients in a logistic regression model can be conducted using the same functions we previously saw for linear regression:

  • summary() will test the null hypothesis \(H_0:\beta_j=0\) for every coefficient in the model
  • confint() will calculate confidence interval estimates for each of the model’s coefficients

Below are quick examples:

## Fit a model using shot distance and closest defender distance
mod2 = glm(OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, data = shots, family = "binomial")

## Tests on coefficients using summary()
summary(mod2)
## 
## Call:
## glm(formula = OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, family = "binomial", 
##     data = shots)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.2053367  0.0116966   17.55   <2e-16 ***
## SHOT_DIST      -0.0643797  0.0008261  -77.93   <2e-16 ***
## CLOSE_DEF_DIST  0.1150303  0.0027351   42.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 171730  on 124710  degrees of freedom
## Residual deviance: 165130  on 124708  degrees of freedom
## AIC: 165136
## 
## Number of Fisher Scoring iterations: 4

## Confidence intervals for coefficients using confint()
confint(mod2)
##                      2.5 %      97.5 %
## (Intercept)     0.18241356  0.22826366
## SHOT_DIST      -0.06600082 -0.06276239
## CLOSE_DEF_DIST  0.10967881  0.12040027
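
Because these endpoints are on the log-odds scale, exponentiating them yields confidence intervals for odds ratios (the same trick is used in Question #2, Part D):

## Exponentiate the endpoints to get odds-ratio intervals
exp(confint(mod2))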

A few additional details:

  • The coefficients in logistic regression are estimated using a method known as maximum likelihood estimation. Likelihood theory provides a Normal approximation for the sampling distribution of these coefficients, which becomes more accurate as \(n\) approaches infinity. Thus, we might not trust the \(p\)-values provided by the summary() function unless \(n\) is large (see the comparison below).
    • There is a lot of disagreement on precisely how large the sample needs to be, with some advocating for 10 cases of each outcome class per predictor in the model (a sample size as low as \(n=20\) for a balanced outcome and a single predictor), and others suggesting roughly 100 cases are needed for even the simplest models
  • It’s worthwhile contrasting this with tests on the coefficients in linear regression, which follow a \(t\)-distribution exactly in finite samples when the model’s Normality assumption is met.
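
As a quick illustration of these approximations: for models fit by glm(), confint() computes profile-likelihood intervals, while confint.default() applies the Normal (Wald) approximation directly. With a sample this large the two should agree closely:

## Profile-likelihood intervals (the default for glm objects)
confint(mod2)

## Wald intervals based on the large-sample Normal approximation
confint.default(mod2)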

Question #2: The code provided below converts the PERIOD variable into categories, storing the result as a new variable CAT_PERIOD.

## Add a categorical (factor) version of the period variable
shots$CAT_PERIOD = factor(shots$PERIOD)
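
By default, factor() orders its levels alphanumerically, so period 1 becomes the reference category that the other periods are compared against in the models below:

## Check the ordering of the levels; the first level is the reference
levels(shots$CAT_PERIOD)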
  • Part A: Consider the model OUTCOME ~ SHOT_DIST + CAT_PERIOD. After adjusting for shot distance, which periods tend to have significantly worse shooting performance than the first quarter (period 1)?
  • Part B: Now consider the model OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST + TOUCH_TIME + CAT_PERIOD. After adjusting for shot distance, defender distance, and touch time, which periods tend to have significantly worse shooting performance than the first quarter (period 1)?
  • Part C: Create a data visualization that relates the variables TOUCH_TIME and CAT_PERIOD. Based upon what you see, why might the difference between periods 2, 4, and 5 (first overtime) diminish after adjusting for TOUCH_TIME?
  • Part D: Report a 95% confidence interval estimate for the adjusted odds ratio comparing the odds of making a shot in the fourth quarter (period 4) relative to the first quarter (period 1) that adjusts for differences in shot distance, closest defender distance, and touch time. Hint: you should use confint() first and then exponentiate the end-points.
  • Part E: Statistical inference using the tests performed by summary() relies upon a large sample size assumption. With this in mind, are you concerned with the validity of the 95% confidence interval you found in Part D? Briefly explain.

\(~\)

Application

In the early 2000s, a team of scientists from the US and Bangladesh measured the arsenic levels of wells in Araihazar upazila, Bangladesh. The team identified safe and unsafe wells (unsafe wells had arsenic concentrations of 0.5 or greater, recorded in units of hundreds of micrograms per liter), and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.

A year later the researchers followed up to see which residents had switched to a safe well, which is encoded below as switch_binary, the outcome of the study:

wells = read.csv("https://remiller1450.github.io/data/Wells.csv") %>%
  mutate(switch_binary = ifelse(switch == "yes", 1, 0))

The variable switch_binary records whether a family switched to a safe well, while arsenic records the arsenic concentration of the family’s original well, distance is the distance (in meters) from their home to the nearest safe well, education is the number of years of formal education of the head of household, and association indicates whether the household is active in community associations.
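
Before working through the questions below, it's sensible to take a quick look at the data. A minimal sketch, assuming the column names described above:

## Inspect the outcome and the quantitative explanatory variables
table(wells$switch_binary)
summary(wells[, c("arsenic", "distance", "education")])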

Question #3: This question is intended to provide a step-by-step walk through of a basic logistic regression analysis. You should keep this in mind if you’re considering logistic regression for the project.

  • Part A: There are three quantitative explanatory variables in this data set. For each of these variables, create a set of box plots comparing their distribution among those who switched wells and those who didn’t. Based upon what you see in these box plots, come up with your own rankings for the strength of marginal association between each of these variables and switch. That is, determine which variable seems most strongly related to switch, which seems second most strongly related, etc.
  • Part B: Fit a logistic regression model using the explanatory variable you identified as most strongly associated with the outcome. Interpret each coefficient in the model after exponentiating.
  • Part C: Now fit a logistic regression model that includes both distance and arsenic as predictors. How does the adjusted effect of the variable you identified in Part A compare to its marginal effect from Part B? What might have caused it to change in this way? Explain your reasoning.
  • Part D: Now fit a third logistic regression model using association, distance, and arsenic to predict whether a household switched to a safe well. For this model, interpret the \(p\)-value provided by the summary() function for the dummy variable associationyes.