\(~\)
This lab covers logistic regression, or statistical models of the form: \[\log\big(\tfrac{Pr(y=1)}{1-Pr(y=1)}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p\]
Logistic regression is used when modeling a binary outcome, \(y\), using one or more explanatory variables. In this lab’s examples we’ll look at the NBA shots data set from our previous lecture.
## Libraries and data
library(ggplot2)
library(dplyr)
## Load Data, remove low quality points, and add a couple new variables
shots = read.csv("https://remiller1450.github.io/data/shot_logs.csv") %>%
filter(TOUCH_TIME > 0) %>%
mutate(OUTCOME = ifelse(SHOT_RESULT == "made", 1, 0),
CAT_DRIBBLES = cut(DRIBBLES, breaks = c(-Inf, 0, 1, Inf),
labels = c("zero", "one", "two or more")))
The variables of interest in this data set are:
OUTCOME - the primary outcome in models: 1 if a shot is
made, 0 otherwiseSHOT_DIST - distance from the basket when the shot was
attemptedCLOSE_DEF_DIST - distance from the closest defender
when the shot was attemptedTOUCH_TIME - how long the player had control of the
ball prior to the shotCAT_DRIBBLES - whether the player made “zero”, “one”,
or “two or more” dribbles before attempting the shotPERIOD - the period of the game during which the shot
was attempted (1 = first quarter, 2 = second quarter, …, 5 = first
overtime, …)\(~\)
glm()Logistic regression models are fit using glm() function
with the argument family = "binomial". The other syntax
used to fit the model mirrors what we’ve done for linear regression
models in lm(). That is, we still provide a formula
representation of the variables involved in our model. The code below
fits a logistic regression model predicting OUTCOME using
SHOT_DIST:
mod1 = glm(OUTCOME ~ SHOT_DIST, data = shots, family = "binomial")
You should recognize that this is the fitted version of the theoretical model:
\[\log\bigg(\frac{Pr(\text{Outcome =
1})}{1-Pr(\text{Outcome = 1})}\bigg) = \beta_0 + \beta_1 \cdot
\text{Shot Distance}\] We can access the estimated coefficients,
\(\hat{\beta}_0\) and \(\hat{\beta_1}\), using the
coef() function:
coef(mod1)
## (Intercept) SHOT_DIST
## 0.41475994 -0.04478712
Interpreting each:
Question #1: In this question you’ll fit models
involving various combinations of the predictors TOUCH_TIME
and CAT_DRIBBLES and interpret the coefficients in these
models.
TOUCH_TIME to predict OUTCOME. Provide an
interpretation for both estimated coefficients in this model.
Remember to apply the inverse transformation before forming your
interpretation.CAT_DRIBBLES to predict OUTCOME. Provide
an interpretation for all of the estimated coefficients in this
model. Hint: this is a categorical predictor, so you should
review our
lecture slides and recognize that your interpretations should
involve odds ratios.TOUCH_TIME and CAT_DRIBBLES to
predict OUTCOME. Why do the estimated coefficients of
TOUCH_TIME and CAT_DRIBBLES two or more both
get closer to zero in this model? Hint: If these predictors
were independent, adjusting for one would not impact the coefficient of
the other. But are they independent?\(~\)
Inference on individual coefficients in a logistic regression model can be conducted using the same functions we previously saw for linear regression:
summary() will test the null hypothesis \(H_0:\beta_j=0\) for every coefficient in
the modelconfint() will calculate confidence interval estimates
for each of the model’s coefficientsBelow are quick examples:
## Model from before
mod2 = glm(OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, data = shots, family = "binomial")
## Tests on coefficients using summary()
summary(mod2)
##
## Call:
## glm(formula = OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, family = "binomial",
## data = shots)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2053367 0.0116966 17.55 <2e-16 ***
## SHOT_DIST -0.0643797 0.0008261 -77.93 <2e-16 ***
## CLOSE_DEF_DIST 0.1150303 0.0027351 42.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 171730 on 124710 degrees of freedom
## Residual deviance: 165130 on 124708 degrees of freedom
## AIC: 165136
##
## Number of Fisher Scoring iterations: 4
## Confidence intervals for coefficients using conf.int()
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 0.18241356 0.22826366
## SHOT_DIST -0.06600082 -0.06276239
## CLOSE_DEF_DIST 0.10967881 0.12040027
A few additional details:
summary() function unless \(n\) is large.
Question #2: The code provided below converts the
PERIOD variable into categories, storing the result as a
new variable CAT_PERIOD.
## Add a categorical (factor) version of the period variable
shots$CAT_PERIOD = factor(shots$PERIOD)
OUTCOME ~ SHOT_DIST + CAT_PERIOD. After adjusting for shot
distance, which periods tend to have significantly worse shooting
performance than the first quarter (period 1)?OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST + TOUCH_TIME + CAT_PERIOD.
After adjusting for shot distance, defender distance, and touch time,
which periods tend to have significantly worse shooting performance than
the first quarter (period 1)?TOUCH_TIME and CAT_PERIOD. Based
upon what you see, why might the difference between periods 2, 4, and 5
(first overtime) diminish after adjusting for
TOUCH_TIME.confint() first
and then exponentiate the end-points.summary() relies upon a large sample size
assumption. With this in mind, are you concerned with the validity of
the 95% confidence interval you found in Part D? Briefly explain.\(~\)
In the early 2000s, a team of US and Bangladesh scientists measured the arsenic levels of wells in Arahazar upazila, Bangladesh. The team identified safe and unsafe wells (arsenic concentrations ≥0.5μg/L), and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.
A year later the researchers followed up to see which residents had
switched to a safe well, which is encoded below as
switch_binary, the outcome of the study:
wells = read.csv("https://remiller1450.github.io/data/Wells.csv") %>%
mutate(switch_binary = ifelse(switch == "yes", 1, 0))
The variable switch_binary records whether a family
switched to a safe well, while arsenic records the arsenic
concentration of the family’s original well, distance is
the distance (in meters) from their home to the nearest safe well,
education is the number of years of formal education of the
head of household, and association indicates whether the
household is active in community associations.
Question #3: This question is intended to provide a step-by-step walk through of a basic logistic regression analysis. You should keep this in mind if you’re considering logistic regression for the project.
switch. That is,
determine which variable seems most strongly related to
switch, which seems second most strongly related, etc.distance and arsenic and
predictors. How does the adjusted effect of the variable you considered
in Part A compare to its marginal effect? What might have caused it to
change in this way? Explain your reasoning.association, distance, and
arsenic to predict whether a household switched to a safe
well. For this model, interpret the \(p\)-value provided by the
summary() function for the dummy variable
associationyes.