Directions:

\(~\)

Question #1 - Intro to Statistical Learning #3.4 (adapted)

Suppose I collect a set of data (n = 100 observations) containing a single predictor and a numeric response. I then fit a simple linear regression model to the data, as well as a separate cubic regression (i.e., \(Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \epsilon\)).

  1. Suppose that the true relationship between X and Y is linear. Consider the in-sample RMSE for the linear regression, and also the in-sample RMSE for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
  2. Answer Part A using the out-of-sample RMSE rather than the in-sample RMSE.
  3. Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the in-sample RMSE for the linear regression, and also the in-sample RMSE for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
  4. Answer Part C using the out-of-sample RMSE rather than the in-sample RMSE.
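Hint: a small simulation can be used to check your reasoning for Question #1. The sketch below assumes a made-up linear data-generating process (the intercept, slope, noise level, and range of x are all assumptions chosen for illustration) and compares in-sample and out-of-sample RMSE for a linear and a cubic fit. The helper names `rmse` and `simulate` are hypothetical.

```r
# Simulated illustration (assumed setup, not part of the assignment data):
# the true relationship is linear, and we compare a linear and a cubic fit.
set.seed(1)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Draw n observations from an assumed linear model with normal errors
simulate <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 1))
}

train <- simulate(100)    # n = 100, matching the question
test  <- simulate(10000)  # large held-out sample from the same process

fit_linear <- lm(y ~ x, data = train)
fit_cubic  <- lm(y ~ x + I(x^2) + I(x^3), data = train)

results <- data.frame(
  model      = c("linear", "cubic"),
  in_sample  = c(rmse(train$y, fitted(fit_linear)),
                 rmse(train$y, fitted(fit_cubic))),
  out_sample = c(rmse(test$y, predict(fit_linear, newdata = test)),
                 rmse(test$y, predict(fit_cubic,  newdata = test)))
)
print(results)
```

Editing the line that generates `y` (for example, adding a squared term) gives a way to explore Parts C and D. A single simulated data set is only one draw, so rerun with several seeds before drawing conclusions.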

\(~\)

Question #2 - Intro to Statistical Learning #3.9 (adapted)

This question involves the use of multiple linear regression on the Auto data set.

#install.packages("ISLR")
library(ISLR)
data("Auto")

  1. Produce a scatterplot matrix which includes all of the variables in the data set.
  2. Compute the matrix of correlations between the variables using the function cor. You will need to exclude the “name” variable, which is qualitative.
  3. Use the lm function to perform a multiple linear regression with “mpg” as the response and all other variables, except “name”, as the predictors. Use the summary() function to print the results. Comment on the output:
    1. Is there a relationship between the predictors and the response?
    2. Which predictors appear to have a statistically significant relationship to the response?
    3. What does the coefficient for the year variable suggest?
  4. Use the plot function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.
  5. Use the * or : symbol to fit a multiple linear regression model with interaction effects. Do any interactions appear to be statistically significant?
  6. Based upon your scatterplot matrix in Part A, try log-transforming the predictor with the most right-skewed distribution. How does this impact the hypothesis testing results for an association between this predictor and “mpg”?
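Hint: the formula syntax needed for Question #2 can be sketched on a different data set. The example below uses R's built-in `mtcars` data as an assumed stand-in (it is not the Auto data), with a character `name` column added to mimic a qualitative identifier. It shows one way to exclude a qualitative column from `cor`, the equivalence of `*` and `:` notation, and a log-transformed predictor.

```r
# Illustration on the built-in mtcars data (an assumed stand-in, not the Auto data).
# A character "name" column is added to mimic a qualitative identifier variable.
cars <- mtcars[, c("mpg", "wt", "hp", "disp")]
cars$name <- rownames(mtcars)

# cor() requires numeric input, so drop the qualitative column first
cor_mat <- cor(cars[, names(cars) != "name"])
print(round(cor_mat, 2))

# "." means "all other columns"; "- name" removes the qualitative one
fit_all <- lm(mpg ~ . - name, data = cars)
print(summary(fit_all))

# `a * b` expands to a + b + a:b, so these two formulas specify the same model
fit_star  <- lm(mpg ~ wt * hp, data = cars)
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = cars)
print(all.equal(coef(fit_star), coef(fit_colon)))

# A predictor can be log-transformed directly inside the formula
fit_raw <- lm(mpg ~ hp, data = cars)
fit_log <- lm(mpg ~ log(hp), data = cars)
print(summary(fit_raw)$coefficients)
print(summary(fit_log)$coefficients)
```

Transforming inside the formula, rather than creating a new column, keeps `predict()` working on new data in the original units.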

\(~\)

Question #3 - Intro to Statistical Learning #3.10 (adapted)

This question should be answered using the Carseats data set.

#install.packages("ISLR")
library(ISLR)
data("Carseats")

  1. Fit a multiple regression model to predict “Sales” using “Price”, “Urban”, and “US.”
  2. Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are categorical!
  3. Write out the model in equation form, being careful to handle the qualitative variables properly (i.e., use dummy variables).
  4. For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
  5. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is statistical evidence of association with the outcome.
  6. How well do the models in Part A and Part E fit the data? Report each model’s cross-validated RMSE.
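Hint: two ideas in Question #3, dummy coding and cross-validated RMSE, can be sketched in base R. The example below uses simulated data (the variable names, coefficients, and sample size are assumptions, not the Carseats data), and `cv_rmse` is a hypothetical helper written for this illustration, not a function from any package. It implements plain k-fold cross-validation; other schemes such as leave-one-out would also be reasonable.

```r
# Simulated data (assumed for illustration; not the Carseats data)
set.seed(42)
n <- 200
sim <- data.frame(
  price = runif(n, 50, 150),
  urban = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$sales <- 13 - 0.05 * sim$price + rnorm(n)   # urban has no true effect here

# R expands a factor into 0/1 dummy columns; "No" is the reference level here
X <- model.matrix(sales ~ price + urban, data = sim)
print(head(X, 3))

# k-fold cross-validated RMSE for a model formula (helper name is hypothetical)
cv_rmse <- function(formula, data, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]                           # name of the outcome
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    held_out <- folds == i
    fit <- lm(formula, data = data[!held_out, ])             # fit without fold i
    pred <- predict(fit, newdata = data[held_out, ])         # predict fold i
    sq_err[held_out] <- (data[[response]][held_out] - pred)^2
  }
  sqrt(mean(sq_err))
}

rmse_full  <- cv_rmse(sales ~ price + urban, sim)
rmse_small <- cv_rmse(sales ~ price, sim)
print(c(full = rmse_full, small = rmse_small))
```

Because the fold assignment is random, the reported RMSE changes slightly from run to run unless the seed is fixed.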

\(~\)

Question #4 - A Second Course in Statistics 4.6 (adapted)

Earnings of Mexican street vendors. Detailed interviews were conducted with over 1,000 street vendors in the city of Puebla, Mexico, in order to study the factors influencing vendors’ incomes (World Development, February 1998). Vendors were defined as individuals working in the street, and included vendors with carts and stands on wheels and excluded beggars, drug dealers, and prostitutes. The researchers collected data on gender, age, hours worked per day, annual earnings, and education level. For this question, use the subset of data provided below.

  1. Write the population-level model for mean annual earnings, E(y), as a function of age (x1) and hours worked (x2).
  2. Fit this model using the data provided below. Write out the estimated model.
  3. Interpret each of the estimated \(\beta\) coefficients in your model.
  4. Conduct a test of the global utility of the model (at \(\alpha = 0.01\)). Interpret the result.
  5. Find and interpret the value of \(R^2\).
  6. Find and interpret \(s\), the estimated standard deviation of the errors.
  7. Is age (x1) a statistically useful predictor of annual earnings? Conduct your test using \(\alpha = 0.01\).
  8. Find a 95% confidence interval for \(\beta_2\). Interpret the interval in the context of the problem.

Street_Vendors_Subset <- read.csv("https://remiller1450.github.io/data/StreetVendors.csv")
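
Hint: the quantities requested in Question #4 can all be extracted from a fitted `lm` object. The sketch below uses the built-in `mtcars` data as an assumed stand-in (it is not the street vendor data, and its column names will differ from yours) to show where the global F-test, \(R^2\), \(s\), the individual t-tests, and a confidence interval can be found.

```r
# Illustration on built-in mtcars data (assumed stand-in for the vendor data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# Global F-test: H0 is that all slope coefficients equal zero
f <- fit_summary$fstatistic   # named vector: value, numdf, dendf
global_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(global_p))

# R^2 and s (the residual standard error)
print(fit_summary$r.squared)
print(sigma(fit))

# t-test for a single coefficient (estimate, std. error, t value, p-value)
print(fit_summary$coefficients["hp", ])

# 95% confidence interval for that coefficient
ci_hp <- confint(fit, "hp", level = 0.95)
print(ci_hp)
```

The same numbers appear in the printed output of `summary(fit)`; extracting them by name is mainly useful when you want to report them inline.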

Question #5 - A Second Course in Statistics 4.28 (adapted)

Refer to the World Development (February 1998) study of street vendors in the city of Puebla, Mexico, Exercise 4.6 (Question #4 in this assignment). Recall that the vendors’ mean annual earnings, E(y), was modeled as a first-order function of age (x1) and hours worked (x2). Now, consider the interaction model \(E(y) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_1x_2\).

  1. Fit this model and write the least squares prediction equation.
  2. What is the estimated slope relating annual earnings (y) to age (x1) when number of hours worked (x2) is 10? Interpret the result.
  3. What is the estimated slope relating annual earnings (y) to hours worked (x2) when age (x1) is 40? Interpret the result.
  4. Give the null hypothesis for testing whether age (x1) and hours worked (x2) interact.
  5. Find the p-value of the hypothesis test from Part D.
  6. Based upon the p-value in Part E, provide an appropriate conclusion in the context of these data.
  7. Throughout Question #5 you’ve only used a subset of the full street vendors dataset. Had you been working with the full dataset, how would you expect the p-value in Part E to change?

Street_Vendors_Subset <- read.csv("https://remiller1450.github.io/data/StreetVendors.csv")
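
Hint: in an interaction model the slope for one predictor depends on the value of the other. For \(E(y) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_1x_2\), the slope for \(x_1\) at a fixed \(x_2\) is \(\beta_1 + \beta_3 x_2\), and the slope for \(x_2\) at a fixed \(x_1\) is \(\beta_2 + \beta_3 x_1\). The sketch below demonstrates this on simulated data; the coefficients, ranges, and noise level are made up for illustration, and `slope_x1_at` and `slope_x2_at` are hypothetical helper names.

```r
# Simulated data (assumed for illustration; coefficients are made up)
set.seed(7)
n <- 50
sim <- data.frame(x1 = runif(n, 20, 60), x2 = runif(n, 4, 12))
sim$y <- 1000 + 10 * sim$x1 + 150 * sim$x2 +
  2 * sim$x1 * sim$x2 + rnorm(n, sd = 200)

fit <- lm(y ~ x1 * x2, data = sim)
b <- coef(fit)

# Slope for x1 at a given x2: b1 + b3 * x2
slope_x1_at <- function(x2) unname(b["x1"] + b["x1:x2"] * x2)
# Slope for x2 at a given x1: b2 + b3 * x1
slope_x2_at <- function(x1) unname(b["x2"] + b["x1:x2"] * x1)

print(slope_x1_at(10))
print(slope_x2_at(40))

# Check: the slope equals the change in prediction for a one-unit increase in x1
pred <- predict(fit, newdata = data.frame(x1 = c(40, 41), x2 = c(10, 10)))
print(unname(diff(pred)))

# p-value for the interaction term (test of H0: beta_3 = 0)
print(summary(fit)$coefficients["x1:x2", "Pr(>|t|)"])
```

Changing `n` in the simulation and refitting is one way to build intuition for Part G.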