\(~\)
This lab covers logistic regression, or statistical models of the form: \[\log\big(\tfrac{Pr(y=1)}{1-Pr(y=1)}\big) = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p\]
Logistic regression is used when modeling a binary outcome, \(y\), using one or more explanatory variables. In this lab’s examples we’ll look at the NBA shots data set from our previous lecture.
## Libraries and data
library(ggplot2)
library(dplyr)
## Load Data, remove low quality points, and add a couple new variables
shots = read.csv("https://remiller1450.github.io/data/shot_logs.csv") %>%
  filter(TOUCH_TIME > 0) %>%
  mutate(OUTCOME = ifelse(SHOT_RESULT == "made", 1, 0),
         CAT_DRIBBLES = cut(DRIBBLES, breaks = c(-Inf, 0, 1, Inf),
                            labels = c("zero", "one", "two or more")))
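As a quick check on the `cut()` call above, here is how the breakpoints `c(-Inf, 0, 1, Inf)` bin a few toy dribble counts (a standalone sketch, not part of the lab's data pipeline):

```r
## cut() uses right-closed intervals here: (-Inf, 0], (0, 1], (1, Inf]
dribbles = c(0, 1, 2, 5)
cat_dribbles = cut(dribbles, breaks = c(-Inf, 0, 1, Inf),
                   labels = c("zero", "one", "two or more"))
table(cat_dribbles)  # 0 -> "zero", 1 -> "one", 2 and 5 -> "two or more"
```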
The variables of interest in this data set are:

- `OUTCOME` - the primary outcome in models: 1 if a shot is made, 0 otherwise
- `SHOT_DIST` - distance from the basket when the shot was attempted
- `CLOSE_DEF_DIST` - distance from the closest defender when the shot was attempted
- `TOUCH_TIME` - how long the player had control of the ball prior to the shot
- `CAT_DRIBBLES` - whether the player made "zero", "one", or "two or more" dribbles before attempting the shot
- `PERIOD` - the period of the game during which the shot was attempted (1 = first quarter, 2 = second quarter, ..., 5 = first overtime, ...)

\(~\)
## `glm()`

Logistic regression models are fit using the `glm()` function with the argument `family = "binomial"`. The other syntax used to fit the model mirrors what we've done for linear regression models in `lm()`. That is, we still provide a formula representation of the variables involved in our model. The code below fits a logistic regression model predicting `OUTCOME` using `SHOT_DIST`:
mod1 = glm(OUTCOME ~ SHOT_DIST, data = shots, family = "binomial")
You should recognize that this is the fitted version of the theoretical model: \[\log\bigg(\frac{Pr(\text{Outcome = 1})}{1-Pr(\text{Outcome = 1})}\bigg) = \beta_0 + \beta_1 \cdot \text{Shot Distance}\] We can access the estimated coefficients, \(\hat{\beta}_0\) and \(\hat{\beta}_1\), using the `coef()` function:
coef(mod1)
## (Intercept) SHOT_DIST
## 0.41475994 -0.04478712
Interpreting each:

- \(\hat{\beta}_0 = 0.415\) is the estimated log-odds of a made shot when the shot distance is zero, so the estimated odds of making a shot from right at the basket are \(e^{0.415} \approx 1.51\).
- \(\hat{\beta}_1 = -0.045\) means each additional foot of shot distance is associated with a 0.045 decrease in the log-odds of a made shot; equivalently, the odds are multiplied by \(e^{-0.045} \approx 0.956\) per foot.
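These inverse transformations can be demonstrated end-to-end on simulated data (a sketch with made-up coefficients; the same `exp()` and `plogis()` calls apply to `mod1`):

```r
set.seed(7)
## Simulate shots whose make probability declines with distance
## (hypothetical truth: log-odds = 0.4 - 0.05 * distance)
dist = runif(5000, 0, 30)
made = rbinom(5000, size = 1, prob = plogis(0.4 - 0.05 * dist))
fit = glm(made ~ dist, family = "binomial")

exp(coef(fit))  # intercept -> odds at distance 0; slope -> odds ratio per additional foot
plogis(coef(fit)[1] + coef(fit)[2] * 10)  # estimated probability of a make from 10 feet
```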
Question #1: In this question you'll fit models involving various combinations of the predictors `TOUCH_TIME` and `CAT_DRIBBLES` and interpret the coefficients in these models.

- Part A: Fit a model using `TOUCH_TIME` to predict `OUTCOME`. Provide an interpretation for both estimated coefficients in this model. Remember to apply the inverse transformation before forming your interpretation.
- Part B: Fit a model using `CAT_DRIBBLES` to predict `OUTCOME`. Provide an interpretation for all of the estimated coefficients in this model. Hint: this is a categorical predictor, so you should review our lecture slides and recognize that your interpretations should involve odds ratios.
- Part C: Fit a model using both `TOUCH_TIME` and `CAT_DRIBBLES` to predict `OUTCOME`. Why do the estimated coefficients of `TOUCH_TIME` and `CAT_DRIBBLES two or more` both get closer to zero in this model? Hint: If these predictors were independent, adjusting for one would not impact the coefficient of the other. But are they independent?

\(~\)
The `family = "binomial"` argument we've been using describes a model's "likelihood" function. When evaluated, the likelihood provides a measurement of how closely the model fits the sample data. Similar to the sum of squared residuals in linear regression, the likelihood is a single number summarizing the fit of the model that doesn't mean anything in an absolute sense, but is meaningful in a relative sense.

The `lrtest()` function in the `lmtest` library is used to perform likelihood ratio tests on nested logistic regression models. The function is extremely similar to `anova()`, which we used to compare nested linear regression models:
## Fit both models
mod1 = glm(OUTCOME ~ SHOT_DIST, data = shots, family = "binomial")
mod2 = glm(OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, data = shots, family = "binomial")
## Likelihood ratio test
library(lmtest)
lrtest(mod1, mod2)
## Likelihood ratio test
##
## Model 1: OUTCOME ~ SHOT_DIST
## Model 2: OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -83514
## 2 3 -82565 1 1898.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis here is that the two models have the same likelihood (a likelihood ratio of 1), and the very small \(p\)-value in this example suggests strong evidence that the larger model fits the data significantly better than the smaller model.
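Under the hood, the chi-square statistic reported by `lrtest()` is just twice the gap between the two models' log-likelihoods, which you can verify with `logLik()`. A sketch on simulated data (made-up coefficients):

```r
set.seed(1)
## Simulate a binary outcome driven by two predictors
x1 = rnorm(2000)
x2 = rnorm(2000)
y = rbinom(2000, size = 1, prob = plogis(-0.2 + 0.8 * x1 + 0.5 * x2))

small = glm(y ~ x1, family = "binomial")
large = glm(y ~ x1 + x2, family = "binomial")

## Test statistic: 2 * (log-likelihood of larger model - log-likelihood of smaller model)
chisq = as.numeric(2 * (logLik(large) - logLik(small)))
## p-value from a chi-square distribution with df = difference in number of parameters
pval = pchisq(chisq, df = 1, lower.tail = FALSE)
c(chisq = chisq, pval = pval)
```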
Question #2:

- Part A: Could a likelihood ratio test be used to determine whether the model `OUTCOME ~ TOUCH_TIME` is preferable over the model `OUTCOME ~ CAT_DRIBBLES`? Briefly explain.
- Part B: Use a likelihood ratio test to compare the model `OUTCOME ~ TOUCH_TIME + CAT_DRIBBLES` with the model `OUTCOME ~ TOUCH_TIME`. Provide the \(p\)-value of the test and a brief conclusion indicating which model is preferable.

\(~\)
Inference on individual coefficients in a logistic regression model can be conducted using the same functions we previously saw for linear regression:

- `summary()` will test the null hypothesis \(H_0:\beta_j=0\) for every coefficient in the model
- `confint()` will calculate confidence interval estimates for each of the model's coefficients

Below are quick examples:
## Model from before
mod2 = glm(OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, data = shots, family = "binomial")
## Tests on coefficients using summary()
summary(mod2)
##
## Call:
## glm(formula = OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST, family = "binomial",
## data = shots)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0268 -1.0882 -0.8232 1.1524 2.1135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2053367 0.0116966 17.55 <2e-16 ***
## SHOT_DIST -0.0643797 0.0008261 -77.93 <2e-16 ***
## CLOSE_DEF_DIST 0.1150303 0.0027351 42.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 171730 on 124710 degrees of freedom
## Residual deviance: 165130 on 124708 degrees of freedom
## AIC: 165136
##
## Number of Fisher Scoring iterations: 4
## Confidence intervals using confint()
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 0.18241356 0.22826366
## SHOT_DIST -0.06600082 -0.06276239
## CLOSE_DEF_DIST 0.10967881 0.12040027
A few additional details:

- The tests shown by `summary()` for a logistic regression model are \(z\)-tests (Wald tests) rather than the \(t\)-tests used for linear regression.
- `confint()` applied to a logistic regression model computes profile likelihood intervals, so the endpoints may differ slightly from the estimate plus or minus two standard errors.
- These estimates and interval endpoints are on the log-odds scale; exponentiating them produces odds and odds ratios.
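To report odds ratios with interval estimates, the endpoints returned by `confint()` can be exponentiated directly. A sketch on simulated data (made-up coefficient values; the same call works on `mod2`):

```r
set.seed(42)
## Simulate a binary outcome with a positive predictor effect (true log-odds slope = 0.6)
x = rnorm(3000)
y = rbinom(3000, size = 1, prob = plogis(0.2 + 0.6 * x))
fit = glm(y ~ x, family = "binomial")

ci = confint(fit)  # profile likelihood intervals on the log-odds scale
exp(ci)            # exponentiate the endpoints to get odds-ratio intervals
```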
Question #3: The code provided below converts the `PERIOD` variable into categories, storing the result as a new variable `CAT_PERIOD`.
## Add a categorical (factor) version of the period variable
shots$CAT_PERIOD = factor(shots$PERIOD)
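A small sketch of what `factor()` does here: the numeric periods become categorical levels, and the first level serves as the reference group in the dummy coding `glm()` will use.

```r
## Toy vector of periods; factor() sorts the distinct values into levels
p = factor(c(1, 2, 2, 3, 5))
levels(p)     # "1" "2" "3" "5" -- period 1 is the reference level
contrasts(p)  # treatment (dummy) coding: one indicator column per non-reference level
```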
- Part A: Use a likelihood ratio test to compare the models `OUTCOME ~ SHOT_DIST + CAT_PERIOD` and `OUTCOME ~ SHOT_DIST`. Report the \(p\)-value and briefly explain whether you think `CAT_PERIOD` is a worthwhile predictor after accounting for `SHOT_DIST`.
- Part B: Fit the model `OUTCOME ~ SHOT_DIST + CAT_PERIOD`. After adjusting for shot distance, which periods tend to have significantly worse shooting performance than the first quarter (period 1)?
- Part C: Fit the model `OUTCOME ~ SHOT_DIST + CLOSE_DEF_DIST + TOUCH_TIME + CAT_PERIOD`. After adjusting for shot distance, defender distance, and touch time, which periods tend to have significantly worse shooting performance than the first quarter (period 1)?
- Part D: Explore the relationship between `TOUCH_TIME` and `CAT_PERIOD`. Based upon what you see, why might the difference between periods 2, 4, and 5 (first overtime) diminish after adjusting for `TOUCH_TIME`?
- Part E: Report interval estimates for the odds ratios in your model using `confint()` and exponentiation.

\(~\)
In the early 2000s, a team of US and Bangladeshi scientists measured the arsenic levels of wells in Arahazar upazila, Bangladesh. The team identified safe and unsafe wells (unsafe wells had arsenic concentrations ≥ 0.5 μg/L), and they distributed materials encouraging residents who used unsafe wells to switch to safe ones.
A year later the researchers followed up to see which residents had switched to a safe well. The data below record the results of this study:
wells = read.csv("https://remiller1450.github.io/data/Wells.csv") %>%
  mutate(switch_binary = ifelse(switch == "yes", 1, 0))
The variable `switch_binary` records whether a family switched to a safe well, while `arsenic` records the arsenic concentration of the family's original well, `distance` is the distance from their home to the nearest safe well, `education` is the number of years of formal education of the head of household, and `association` indicates whether the household is active in community associations.
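One way to begin this kind of analysis is to compare group summaries and single-predictor models. The sketch below uses a hypothetical stand-in data frame (`toy`, filled with random values and no built-in relationships) purely to show the mechanics; the same pattern applies to `wells`:

```r
set.seed(99)
## Hypothetical stand-in for the wells data (random values, mechanics only)
toy = data.frame(switch_binary = rbinom(200, 1, 0.5),
                 arsenic = rexp(200, rate = 1),
                 distance = runif(200, 0, 300))

## Compare the mean of each numeric predictor across the two outcome groups
group_means = aggregate(cbind(arsenic, distance) ~ switch_binary, data = toy, FUN = mean)
group_means

## Single-predictor logistic models offer a rough ranking of association strength
summary(glm(switch_binary ~ arsenic, data = toy, family = "binomial"))$coefficients
```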
Question #4: This question is intended to provide a step-by-step walk-through of a basic logistic regression analysis. You should keep this in mind if you're considering logistic regression for the project.

- Explore how each variable in the data set relates to `switch`. That is, determine which variable seems most strongly related to `switch`, which seems second most strongly related, etc.
- Use a likelihood ratio test to determine whether `association` should be added to your final model from Part B. Report the \(p\)-value of this test and a brief conclusion.