Directions:

\(~\)

Question #1 - 4.9 from Intro to Statistical Learning

This problem has to do with odds.

  1. On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
  2. Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
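Both parts turn on the same identity; as a reminder, a probability \(p\) and the corresponding odds are related by

\[
\text{odds} = \frac{p}{1-p}
\qquad\text{and, inverting,}\qquad
p = \frac{\text{odds}}{1+\text{odds}}.
\]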

\(~\)

Question #2 - 9.16 from OpenIntro Statistics (adapted)

On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch.

The table below summarizes observational data on O-rings for 23 shuttle missions. Temp gives the launch temperature in degrees Fahrenheit, and Damaged gives the number of damaged O-rings (out of six).

library(alr4)
data("Challeng")
data.frame(Mission = 1:nrow(Challeng),
           Damaged = Challeng$fail,
           Temp = Challeng$temp)
##    Mission Damaged Temp
## 1        1       0   66
## 2        2       1   70
## 3        3       0   69
## 4        4       0   68
## 5        5       0   67
## 6        6       0   72
## 7        7       0   73
## 8        8       0   70
## 9        9       1   57
## 10      10       1   63
## 11      11       1   70
## 12      12       0   78
## 13      13       0   67
## 14      14       2   53
## 15      15       0   67
## 16      16       0   75
## 17      17       0   70
## 18      18       0   81
## 19      19       0   76
## 20      20       0   79
## 21      21       2   75
## 22      22       0   76
## 23      23       1   58
  1. While “Damaged” is not a binary variable, make an argument that logistic regression is better suited for modeling these data than linear regression.
  2. The logistic regression model below codes each mission as “1” if at least one O-ring was damaged, and “0” otherwise. Write out this model using the point estimates of the model parameters.
summary(glm(fail >= 1 ~ temp, data = Challeng, family = "binomial"))
## 
## Call:
## glm(formula = fail >= 1 ~ temp, family = "binomial", data = Challeng)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0611  -0.7613  -0.3783   0.4524   2.2175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  15.0429     7.3786   2.039   0.0415 *
## temp         -0.2322     0.1082  -2.145   0.0320 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.315  on 21  degrees of freedom
## AIC: 24.315
## 
## Number of Fisher Scoring iterations: 5
  3. Interpret the effect of Temp on the odds of an O-ring failure occurring during a mission.
  4. Based upon the model, do you think temperature is associated with O-ring failure? Or could the observed relationship be explained by random chance?
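For reference, a logistic regression model with a single predictor takes the form

\[
\log\!\left(\frac{\hat{p}}{1-\hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 \times \text{Temp},
\]

so a one-unit increase in the predictor multiplies the estimated odds of the outcome by \(e^{\hat{\beta}_1}\).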

\(~\)

Question #3 - 9.18 from OpenIntro Statistics (adapted)

Exercise 9.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.

  1. The code below prints the model coefficients of a logistic model that uses Temp to predict O-ring damage. Based upon this model, predict the probability that an O-ring will be damaged at each of the following temperatures: 51, 53, and 55 degrees Fahrenheit.
coef(glm(fail >= 1 ~ temp, data = Challeng, family = "binomial"))
## (Intercept)        temp 
##  15.0429016  -0.2321627
  2. The graph below displays the observed data (plus signs), along with a curve connecting the fitted probabilities (circles). Based upon a visual inspection of this graph, comment on how well you believe the estimated logistic regression model fits these data.
plot(Challeng$temp, Challeng$fail >= 1, pch = 3)
m <- glm(fail >= 1 ~ temp, data = Challeng, family = "binomial")
ord <- order(Challeng$temp)
lines(Challeng$temp[ord], m$fitted.values[ord], type = "b")
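The predicted probabilities in part 1 can be checked numerically. The sketch below hard-codes the point estimates printed above and applies the inverse-logit by hand (the `inv_logit` helper is our own, not part of base R), so it runs without the `alr4` package:

``` r
# Point estimates copied from the coef() output above
b0 <- 15.0429016   # intercept
b1 <- -0.2321627   # slope on temp

# Inverse-logit: maps the linear predictor to a probability
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Fitted probabilities of O-ring damage at 51, 53, and 55 degrees F
round(inv_logit(b0 + b1 * c(51, 53, 55)), 3)
# roughly 0.961 0.939 0.907
```

Equivalently, `predict(m, newdata = data.frame(temp = c(51, 53, 55)), type = "response")` returns the same values from the fitted model object.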

\(~\)

Question #4 - 9.20 from OpenIntro Statistics

Determine which of the following statements are true and false. For each statement that is false, explain why it is false.

  1. Suppose we consider the first two observations based on a logistic regression model, where the first variable in observation 1 takes a value of x1 = 6 and observation 2 has x1 = 4. Suppose we realized we made an error for these two observations, and the first observation was actually x1 = 7 (instead of 6) and the second observation actually had x1 = 5 (instead of 4). Then the predicted probability from the logistic regression model would increase the same amount for each observation after we correct these variables.
  2. When using a logistic regression model, it is impossible for the model to predict a probability that is negative or a probability that is greater than 1.
  3. Because logistic regression predicts probabilities of outcomes, observations used to build a logistic regression model need not be independent.
  4. When fitting logistic regression, we typically perform model selection using adjusted \(R^2\).
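One fact worth keeping in mind while evaluating these statements: the fitted values of a logistic regression come from the logistic function, which satisfies

\[
0 < \frac{e^{\eta}}{1+e^{\eta}} < 1 \quad \text{for every real } \eta.
\]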

\(~\)

Question #5 - 9.23 from OpenIntro Statistics (adapted)

Previously, we encountered a data set where we applied logistic regression to aid in spam classification for individual emails. In this exercise, we’ve taken a small set of these variables (all of which are binary) and fit a formal model with the following output:

Predictor     Estimate   Std Error        Z   p-value
(Intercept)    -0.8124      0.0870    -9.34    0.0000
multiple       -2.6351      0.3036    -8.68    0.0000
winner          1.6272      0.3185     5.11    0.0000
format         -1.5881      0.1196   -13.28    0.0000
re_subj        -3.0467      0.3625    -8.40    0.0000
  1. Write down the model using the coefficients from the model fit.
  2. Suppose we have an observation where multiple = 0, winner = 1, format = 0, and re_subj = 0. What is the predicted probability that this message is spam?
  3. Put yourself in the shoes of a data scientist working on a spam filter. Are you more concerned with the sensitivity or the specificity of your model? What are some ways you might explore the tradeoff between these two metrics?
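For part 3, one concrete way to explore the tradeoff is to sweep the classification threshold and recompute sensitivity and specificity at each cutoff. The sketch below uses simulated labels and fitted probabilities (not the actual email data, which is not reproduced here):

``` r
set.seed(1)
y <- rbinom(200, 1, 0.3)  # simulated true labels (1 = spam)
# Simulated fitted probabilities: spam tends to score higher than non-spam
p <- ifelse(y == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

# Raising the threshold trades sensitivity for specificity
for (t in c(0.25, 0.50, 0.75)) {
  pred <- as.numeric(p >= t)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)  # true positive rate
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)  # true negative rate
  cat(sprintf("threshold %.2f: sensitivity %.3f, specificity %.3f\n", t, sens, spec))
}
```

A fuller treatment would trace the same tradeoff over all thresholds at once, i.e., an ROC curve.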

\(~\)