Directions:

\(~\)

Question #1 - 4.9 from Intro to Statistical Learning

This problem has to do with odds.

  1. On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
  2. Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?
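Both parts turn on the same identity; as a reminder, a probability \(p\) and the corresponding odds are related by

\[
\text{odds} = \frac{p}{1-p}
\qquad\text{and, inverting,}\qquad
p = \frac{\text{odds}}{1+\text{odds}}.
\]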

\(~\)

Question #2 - 9.16 from OpenIntro Statistics (adapted)

On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch.

The table below summarizes observational data on O-rings for 23 shuttle missions. Temp gives the launch temperature in degrees Fahrenheit, and Damaged gives the number of damaged O-rings (out of six).

library(alr4)
data("Challeng")
data.frame(Mission = 1:nrow(Challeng),
           Damaged = Challeng$fail,
           Temp = Challeng$temp)
##    Mission Damaged Temp
## 1        1       0   66
## 2        2       1   70
## 3        3       0   69
## 4        4       0   68
## 5        5       0   67
## 6        6       0   72
## 7        7       0   73
## 8        8       0   70
## 9        9       1   57
## 10      10       1   63
## 11      11       1   70
## 12      12       0   78
## 13      13       0   67
## 14      14       2   53
## 15      15       0   67
## 16      16       0   75
## 17      17       0   70
## 18      18       0   81
## 19      19       0   76
## 20      20       0   79
## 21      21       2   75
## 22      22       0   76
## 23      23       1   58
  1. While “Damaged” is not a binary variable, make an argument that logistic regression is better suited for modeling these data than linear regression.
  2. The logistic regression model below codes each mission as “1” if at least one O-ring was damaged, and “0” otherwise. Write out this model using the point estimates of the model parameters.
summary(glm(fail >= 1 ~ temp, data = Challeng, family = "binomial"))
## 
## Call:
## glm(formula = fail >= 1 ~ temp, family = "binomial", data = Challeng)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0611  -0.7613  -0.3783   0.4524   2.2175  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  15.0429     7.3786   2.039   0.0415 *
## temp         -0.2322     0.1082  -2.145   0.0320 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28.267  on 22  degrees of freedom
## Residual deviance: 20.315  on 21  degrees of freedom
## AIC: 24.315
## 
## Number of Fisher Scoring iterations: 5
  3. Interpret the effect of Temp on the odds of an O-ring failure occurring during a mission.
  4. Based upon the model, do you think temperature is associated with O-ring failure? Or could the observed relationship be explained by random chance?
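For reference, a logistic regression model with a single predictor takes the form

\[
\log\!\left(\frac{\hat{p}}{1-\hat{p}}\right) = \hat{\beta}_0 + \hat{\beta}_1 \times \text{Temp},
\]

so a one-unit increase in the predictor multiplies the estimated odds of the outcome by \(e^{\hat{\beta}_1}\).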

\(~\)

Question #3 - 9.18 from OpenIntro Statistics (adapted)

Exercise 9.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.

  1. The code below prints the model coefficients of a logistic model that uses Temp to predict O-ring damage. Based upon this model, predict the probability that an O-ring will be damaged at each of the following temperatures: 51, 53, and 55 degrees Fahrenheit.
coef(glm(fail >= 1 ~ temp, data = Challeng, family = "binomial"))
## (Intercept)        temp 
##  15.0429016  -0.2321627
  2. The graph below displays the observed data (plus signs), along with a curve connecting the fitted probabilities (circles). Based upon a visual inspection of this graph, comment on how well you believe the estimated logistic regression model fits these data.
plot(Challeng$temp, Challeng$fail >= 1, pch = 3)
m <- glm(fail >= 1 ~ temp, data = Challeng, family = "binomial")
ord <- order(Challeng$temp)
lines(Challeng$temp[ord], m$fitted.values[ord], type = "b")
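The predicted probabilities in part 1 can be checked numerically. The sketch below hard-codes the point estimates printed above and applies the inverse-logit by hand (the `inv_logit` helper is our own, not part of base R), so it runs without the `alr4` package:

``` r
# Point estimates copied from the coef() output above
b0 <- 15.0429016   # intercept
b1 <- -0.2321627   # slope on temp

# Inverse-logit: maps the linear predictor to a probability
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Fitted probabilities of O-ring damage at 51, 53, and 55 degrees F
round(inv_logit(b0 + b1 * c(51, 53, 55)), 3)
# roughly 0.961 0.939 0.907
```

Equivalently, `predict(m, newdata = data.frame(temp = c(51, 53, 55)), type = "response")` returns the same values from the fitted model object.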

\(~\)

Question #4 - 9.20 from OpenIntro Statistics

Determine which of the following statements are true and false. For each statement that is false, explain why it is false.

  1. Suppose we consider the first two observations based on a logistic regression model, where the first variable in observation 1 takes a value of x1 = 6 and observation 2 has x1 = 4. Suppose we realized we made an error for these two observations, and the first observation was actually x1 = 7 (instead of 6) and the second observation actually had x1 = 5 (instead of 4). Then the predicted probability from the logistic regression model would increase the same amount for each observation after we correct these variables.
  2. When using a logistic regression model, it is impossible for the model to predict a probability that is negative or a probability that is greater than 1.
  3. Because logistic regression predicts probabilities of outcomes, observations used to build a logistic regression model need not be independent.
  4. When fitting logistic regression, we typically perform model selection using adjusted \(R^2\).
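One fact worth keeping in mind while evaluating these statements: the fitted values of a logistic regression come from the logistic function, which satisfies

\[
0 < \frac{e^{\eta}}{1+e^{\eta}} < 1 \quad \text{for every real } \eta.
\]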

\(~\)

Question #5 - 9.23 from OpenIntro Statistics (adapted)

Previously, we encountered a data set where we applied logistic regression to aid in spam classification for individual emails. In this exercise, we’ve taken a small set of these variables (all of which are binary) and fit a formal model with the following output:

Predictor     Estimate   Std Error        Z   p-value
(Intercept)    -0.8124      0.0870    -9.34    0.0000
multiple       -2.6351      0.3036    -8.68    0.0000
winner          1.6272      0.3185     5.11    0.0000
format         -1.5881      0.1196   -13.28    0.0000
re_subj        -3.0467      0.3625    -8.40    0.0000
  1. Write down the model using the coefficients from the model fit.
  2. Suppose we have an observation where multiple = 0, winner = 1, format = 0, and re_subj = 0. What is the predicted probability that this message is spam?
  3. Put yourself in the shoes of a data scientist working on a spam filter. Are you more concerned with the sensitivity or the specificity of your model? What are some ways you might explore the tradeoff between these two metrics?
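For part 3, one concrete way to explore the tradeoff is to sweep the classification threshold and recompute sensitivity and specificity at each cutoff. The sketch below uses simulated labels and fitted probabilities (not the actual email data, which is not reproduced here):

``` r
set.seed(1)
y <- rbinom(200, 1, 0.3)  # simulated true labels (1 = spam)
# Simulated fitted probabilities: spam tends to score higher than non-spam
p <- ifelse(y == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

# Raising the threshold trades sensitivity for specificity
for (t in c(0.25, 0.50, 0.75)) {
  pred <- as.numeric(p >= t)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)  # true positive rate
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)  # true negative rate
  cat(sprintf("threshold %.2f: sensitivity %.3f, specificity %.3f\n", t, sens, spec))
}
```

A fuller treatment would trace the same tradeoff over all thresholds at once, i.e., an ROC curve.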

\(~\)