Directions

Please document your answers to all homework questions using R Markdown, submitting your compiled output on P-web.

\(~\)

Question #1

This question uses the Auto data set contained in the “ISLR” package (the R package companion to our recommended textbook, Introduction to Statistical Learning). This data set contains information describing \(n=392\) different passenger vehicles.

# install.packages("ISLR")   # run once if the package isn't installed
library(ISLR)
library(dplyr)   # needed for the %>% pipelines below
data("Auto")

Your aim in this question is to accurately predict whether a car is American or Japanese using the attributes of the car. To facilitate this task, I’ve split the outcome variable and predictors:

## Split X, y
Auto_X = Auto %>% filter(origin %in% c(1,3)) %>% select(mpg, cylinders, displacement, horsepower, weight, acceleration)
Auto_y = Auto %>% filter(origin %in% c(1,3)) %>% mutate(American = ifelse(origin == 1, 1, 0)) %>% select(American)

## Combined into a single data.frame
Auto_full = cbind(Auto_y, Auto_X)

Note that the variable American takes a value of 1 if a car’s origin is America, and a value of 0 if its origin is Japan.

Part A: Fit a logistic regression model predicting whether country of origin is “America”. Then provide an interpretation of the effect of a 1-unit increase in mpg on the odds that a car is American.
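
Not required, but a minimal sketch of the fitting step looks like the following (the object names are placeholders; remember that the raw coefficients are on the log-odds scale):

glm_fit <- glm(American ~ ., data = Auto_full, family = binomial)
summary(glm_fit)           # the mpg row gives the log-odds coefficient
exp(coef(glm_fit)["mpg"])  # multiplicative change in the odds per 1-unit increase in mpg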

Part B: Fit a decision tree model predicting whether country of origin is “America”. Based upon this model, describe the characteristics that make a car most likely to be predicted as Japanese (ie: those that yield the lowest predicted probability that American = 1).
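
A sketch using the rpart package (the package choice is an assumption on my part; any tree implementation covered in class is fine):

# install.packages(c("rpart", "rpart.plot"))   # run once if needed
library(rpart)
library(rpart.plot)
tree_fit <- rpart(as.factor(American) ~ ., data = Auto_full, method = "class")
rpart.plot(tree_fit)   # trace the branches to the leaf with the lowest P(American = 1)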

Part C: Construct ROC curves for the models from Part A and Part B, displayed on the same plot. Indicate which model has the better in-sample AUC, justifying your answer either numerically or by referencing the ROC curves.
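
One possible approach uses the pROC package (again an assumption; other ROC packages behave similarly) and reuses the glm_fit and tree_fit objects from the sketches above. Note that both models must supply predicted probabilities, not class labels:

# install.packages("pROC")   # run once if needed
library(pROC)
p_glm  <- predict(glm_fit, type = "response")       # in-sample probabilities, Part A model
p_tree <- predict(tree_fit, type = "prob")[, "1"]   # in-sample probabilities, Part B model

roc_glm  <- roc(Auto_full$American, p_glm)
roc_tree <- roc(Auto_full$American, p_tree)
plot(roc_glm); lines(roc_tree, col = "red")         # both curves on one plot
auc(roc_glm); auc(roc_tree)                         # in-sample AUCs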

Part D: Use cross-validation to compare the out-of-sample AUC of the models from Part A and Part B. Clearly state which model performed better.
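
A sketch of the loop, mirroring the structure of the cross-validation code given in Question #3 below (the choice of 5 folds and all object names are assumptions; it reuses rpart and pROC from the earlier sketches):

set.seed(1)
n <- nrow(Auto_full)
fold_id <- sample(rep(1:5, length.out = n), size = n)
p_glm_cv <- p_tree_cv <- numeric(n)

for(k in 1:5){
  train <- Auto_full[fold_id != k, ]
  test  <- Auto_full[fold_id == k, ]
  glm_k  <- glm(American ~ ., data = train, family = binomial)
  tree_k <- rpart(as.factor(American) ~ ., data = train, method = "class")
  p_glm_cv[fold_id == k]  <- predict(glm_k, newdata = test, type = "response")
  p_tree_cv[fold_id == k] <- predict(tree_k, newdata = test, type = "prob")[, "1"]
}

auc(roc(Auto_full$American, p_glm_cv))    # out-of-sample AUC, logistic regression
auc(roc(Auto_full$American, p_tree_cv))   # out-of-sample AUC, decision tree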

\(~\)

Question #2 (no R needed)

Suppose a researcher collects data (\(n = 100\) observations) consisting of a single numeric predictor, \(X\), and a numeric response, \(Y\). They are considering analyzing these data using either a simple linear regression model or a cubic regression model (ie: \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon\)).

Part A: Suppose that the true relationship between \(X\) and \(Y\) is linear (ie: the true \(f()\) relating \(X\) and \(Y\) reflects a straight-line relationship). Consider the in-sample \(RMSE\) of the simple linear regression model and the in-sample \(RMSE\) of the cubic regression model. Do you expect one to be lower than the other, do you expect them to be the same, or is there not enough information to tell? Justify your answer.

Part B: Answer the question stated in Part A, but this time for the out-of-sample \(RMSE\) rather than the in-sample \(RMSE\).

Part C: Now suppose that the true relationship between \(X\) and \(Y\) is not linear (ie: the true \(f()\) relating \(X\) and \(Y\) is not a straight line), but we don’t know how far it is from linear. Consider the in-sample \(RMSE\) of the simple linear regression model and the in-sample \(RMSE\) of the cubic regression model. Do you expect one to be lower than the other, do you expect them to be the same, or is there not enough information to tell? Justify your answer.

Part D: Answer the question stated in Part C, but this time for the out-of-sample \(RMSE\) rather than the in-sample \(RMSE\).
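
Although no R is needed for this question, a short simulation can serve as a sanity check on your answers. In the sketch below the data-generating model, sample sizes, and coefficients are all arbitrary assumptions (here the true \(f()\) is linear, matching Parts A and B; swap in a nonlinear \(f()\) to explore Parts C and D):

set.seed(1)
n <- 100
x <- runif(n); x_new <- runif(n)
y     <- 2 + 3*x     + rnorm(n)    # training data from a truly linear f()
y_new <- 2 + 3*x_new + rnorm(n)    # fresh data for out-of-sample RMSE

lin <- lm(y ~ x)                   # simple linear regression
cub <- lm(y ~ poly(x, 3))          # cubic regression
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

rmse(y, fitted(lin)); rmse(y, fitted(cub))                  # in-sample RMSEs
rmse(y_new, predict(lin, newdata = data.frame(x = x_new)))  # out-of-sample, linear
rmse(y_new, predict(cub, newdata = data.frame(x = x_new)))  # out-of-sample, cubic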

\(~\)

Question #3 (no R needed)

This question is based upon the R code given below. You will not need to write any additional code of your own to answer its parts.

The “tips” data set used in this question records several weeks of tips received by a waiter working in a chain restaurant in suburban New York City during the early 1990s. The goal of this application is to use modeling to understand the factors that might predict the tip received by the waiter.

tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")
set.seed(123)
n <- nrow(tips)
fold_id <- sample(rep(1:4, length.out = n), size = n)
preds <- numeric(n)

for(k in 1:max(fold_id)){
  train <- tips[fold_id != k, ]
  test <- tips[fold_id == k, ]
  mod <- lm(Tip ~ as.factor(Size), data = train)
  preds[fold_id == k] <- predict(mod, newdata = test)
}

Part A: You should recognize that this code is an implementation of \(k\)-fold cross-validation. How many folds does it use? How many observations are assigned to each fold?

Part B: Before the for loop begins, preds is a numeric vector filled entirely with zeros. After the loop’s first iteration, how many elements in this vector will have a non-zero value? Briefly explain.

Part C: After the for loop has finished three iterations, how many elements in the vector preds will have non-zero values? Briefly explain.

Part D: When the for loop is finished, the vector preds will consist entirely of non-zero values. How many fitted linear regression models contributed to these predictions across the entirety of the loop? Briefly explain.
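
No additional R is needed, but after you’ve reasoned through Parts A–D you can check yourself by running the given code and inspecting a couple of quantities:

table(fold_id)    # how many observations land in each fold (Part A)
sum(preds != 0)   # how many non-zero predictions remain once the loop finishes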

\(~\)

Question #4 (no R needed)

For this question I’d like you to read this paper by Leo Breiman, one of the early developers of the random forest modeling algorithm. Many data scientists consider this essay a “must read” that has shaped their views on the differences between traditional statisticians and data scientists. The paper itself is the first 14 pages of the linked document, and you should read it with the goal of extracting key concepts (not following along with every minor detail).

You should then write a 1-2 paragraph response that addresses the following:

  1. What are the main points/arguments made by the author?
  2. How do you assess the arguments made by the author?

A satisfactory response will indicate that you have read the essay and thoughtfully crafted at least 6 sentences relating to its content. If your response lacks sufficient detail or doesn’t contain evidence that you understood the author’s main point, it might not receive full credit.