Problem Set #4 (MATH-257, Spring 2021)

Directions:

Unlike previous problem sets, this assignment is meant to give you one last opportunity to practice or revisit topics from this course that you may not be using in your final project.
When completing the assignment, you should select the two questions that are most different from what you did (or are currently doing) in your final project.
- To reiterate, please answer only two of these questions, and ignore the ones that are most closely related to your final project.
- For example, if your project involves using logistic regression to model a binary outcome, you should choose to ignore Question #3 for sure, and the other question you ignore might depend on the focus of your model as well as whether or not you compared it to something like CART. As another example, if your project involves multiple linear regression, you’d choose to ignore both Question #1 and Question #2.

Question #1 - Adjusted Effects in Multiple Linear Regression Models

Research has shown that the tar and nicotine content of cigarettes is associated with the amount of carbon monoxide emitted in the cigarette smoke. The data for this application involve laboratory testing results of 25 different brands of cigarettes. The goal of our analysis is to understand the influence of these variables on the amount of carbon monoxide emitted in the cigarette smoke. The data include the following variables:

Brand name
Tar content (mg)
Nicotine content (mg)
Weight (g)
Carbon monoxide content (mg)

cigs <- read.csv("https://remiller1450.github.io/data/Cigs.csv")
summary(cigs)

##     Brand                Tar           Nicotine          Weight      
##  Length:25          Min.   : 1.00   Min.   :0.1300   Min.   :0.7851  
##  Class :character   1st Qu.: 8.60   1st Qu.:0.6900   1st Qu.:0.9225  
##  Mode  :character   Median :12.80   Median :0.9000   Median :0.9573  
##                     Mean   :12.22   Mean   :0.8764   Mean   :0.9703  
##                     3rd Qu.:15.10   3rd Qu.:1.0200   3rd Qu.:1.0070  
##                     Max.   :29.80   Max.   :2.0300   Max.   :1.1650  
##  CarbonMonoxide 
##  Min.   : 1.50  
##  1st Qu.:10.00  
##  Median :13.00  
##  Mean   :12.53  
##  3rd Qu.:15.40  
##  Max.   :23.50

Part A: Fit a simple linear regression model that uses the variable “Nicotine” to predict the outcome “CarbonMonoxide”. Use this model to describe the relationship between these two variables. Additionally, answer whether the relationship appears to be statistically significant.

Part B: Fit a multiple linear regression model that uses the variables “Nicotine” and “Tar” to predict the outcome “CarbonMonoxide”. Interpret the adjusted effect of “Nicotine”.

Part C: Using graphs or descriptive statistics, thoroughly explain why the adjusted effect of “Nicotine” (after adjusting for “Tar”) is so different from the unadjusted effect.

Part D: Now modify the model from Part B so that a third-degree polynomial effect is used for the variable “Tar”. Why do you think the adjusted effect of “Nicotine” in this model is so far from statistical significance?
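
As a reminder of the syntax involved in this question, here is a minimal sketch on simulated data (not the cigarette data). The variables x1, x2, and y are hypothetical, and the simulation assumes two strongly correlated predictors of which only x2 truly drives the outcome. It is meant to illustrate the general idea of unadjusted versus adjusted effects, not to answer the question for you.

```r
# Simulated data (hypothetical): x1 and x2 are strongly correlated,
# and only x2 has a true effect on y.
set.seed(1)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.9 * x2 + rnorm(n, sd = 0.3)
y  <- 2 * x2 + rnorm(n)

unadjusted <- lm(y ~ x1)                 # x1 alone picks up the effect of x2
adjusted   <- lm(y ~ x1 + x2)            # effect of x1 holding x2 fixed
cubic      <- lm(y ~ x1 + poly(x2, 3))   # third-degree polynomial in x2

coef(unadjusted)["x1"]                   # unadjusted slope of x1
coef(adjusted)["x1"]                     # adjusted slope of x1
cor(x1, x2)                              # how strongly the predictors overlap
summary(cubic)$coefficients["x1", ]      # estimate, SE, t, and p-value for x1
```

A scatterplot such as plot(x1, x2) is one way to see why the two slopes differ in a simulation like this one.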

Question #2 - Interactions and Model Performance in Multiple Linear Regression Models

This application will look at the sale price of diamonds sold by a major online retailer. The dataset is contained in the ggplot2 package, and is loaded below:

library(ggplot2)
data("diamonds")
diamonds$cut <- as.character(diamonds$cut)
summary(diamonds)

##      carat            cut            color        clarity          depth      
##  Min.   :0.2000   Length:53940       D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Class :character   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Mode  :character   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979                      G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400                      H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                      I: 5422   VVS1   : 3655   Max.   :79.00  
##                                      J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

Our analysis will focus on the following variables:

price - the sale price (in dollars)
carat - the weight of the diamond (in carats)
cut - the quality of the diamond’s cut (measured on a five-point scale)
x - the length (in mm) of the largest flat facet of the diamond (the part that typically faces upward)

Part A: Fit the regression model price ~ cut*carat. Using this model, what is the estimated effect of “carat” on “price” for “Fair” cut diamonds? What is the estimated effect of “carat” on “price” for “Good” cut diamonds? What is the estimated effect of “carat” on “price” for “Ideal” cut diamonds?
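
If you need a refresher on reading slopes from a model with a categorical-by-numeric interaction, the sketch below uses the built-in iris data rather than the diamonds data. R treats the first level of the categorical variable (the alphabetically first value for a character variable) as the reference group, so the slope for any other group is the main-effect slope plus that group's interaction coefficient. The object names here are only for illustration.

```r
# Categorical-by-numeric interaction on the built-in iris data
fit <- lm(Sepal.Length ~ Species * Petal.Length, data = iris)
b   <- coef(fit)

# Reference group (setosa): the main-effect slope on its own
slope_setosa <- unname(b["Petal.Length"])

# Other groups: main-effect slope + that group's interaction term
slope_versicolor <- unname(b["Petal.Length"] + b["Speciesversicolor:Petal.Length"])
slope_virginica  <- unname(b["Petal.Length"] + b["Speciesvirginica:Petal.Length"])

c(setosa = slope_setosa, versicolor = slope_versicolor, virginica = slope_virginica)

# Cross-check: the same slope from a model fit to one group only
sub_fit <- lm(Sepal.Length ~ Petal.Length,
              data = subset(iris, Species == "versicolor"))
coef(sub_fit)["Petal.Length"]
```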

Part B: Fit the regression model price ~ carat*x. Using this model, what is the estimated effect of a 1mm increase in “x” on “price” for a 1 carat diamond? What is the estimated effect of a 1mm increase in “x” on “price” for a 2 carat diamond?
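
When two numeric variables interact, the effect of a one-unit increase in one of them depends on the value of the other. The sketch below shows the arithmetic on the built-in mtcars data (not the diamonds data). The helper slope_hp_at() is a hypothetical name introduced only for this illustration.

```r
# Numeric-by-numeric interaction on the built-in mtcars data
fit <- lm(mpg ~ wt * hp, data = mtcars)
b   <- coef(fit)

# Effect of a one-unit increase in hp at a chosen value of wt:
# main effect of hp + interaction coefficient * wt
slope_hp_at <- function(wt) unname(b["hp"] + b["wt:hp"] * wt)

slope_hp_at(2)   # effect of hp when wt = 2
slope_hp_at(4)   # effect of hp when wt = 4

# Cross-check using the difference between two predictions at wt = 2
nd <- data.frame(wt = 2, hp = c(100, 101))
diff(predict(fit, newdata = nd))
```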

Part C: Use the visreg2d() function in the visreg package to create an interaction plot for the model price ~ carat*x. Then, using this plot, qualitatively describe the interaction between “carat” and “x” when predicting “price”.

Question #3 - Evaluating the Efficacy of a Logistic Regression Model

This application will explore the use of easily collected, minimally invasive predictors in screening for type II diabetes. Because gold-standard diagnostic testing for diabetes is expensive and time-consuming, the hope is that a set of more easily obtained predictors can provide suitable predictive ability.

The data were collected as part of a research study on the Pima Indian tribe, and contain the following variables:

diabetes - the outcome variable (measured by a gold-standard test)
pregnant - number of prior pregnancies
glucose - glucose tolerance test results
pressure - diastolic blood pressure (mm Hg)
triceps - triceps skin fold thickness (mm)
insulin - two-hour serum insulin (mu U/ml)
mass - body mass index (BMI)
pedigree - a score describing family history of diabetes
age - age (in years)

The code below loads the data from the mlbench library, then filters it to remove any cases with missing values.

library(mlbench)
data(PimaIndiansDiabetes2)
pima <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
summary(pima)

##     pregnant         glucose         pressure         triceps     
##  Min.   : 0.000   Min.   : 56.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.:21.00  
##  Median : 2.000   Median :119.0   Median : 70.00   Median :29.00  
##  Mean   : 3.301   Mean   :122.6   Mean   : 70.66   Mean   :29.15  
##  3rd Qu.: 5.000   3rd Qu.:143.0   3rd Qu.: 78.00   3rd Qu.:37.00  
##  Max.   :17.000   Max.   :198.0   Max.   :110.00   Max.   :63.00  
##     insulin            mass          pedigree           age        diabetes 
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0850   Min.   :21.00   neg:262  
##  1st Qu.: 76.75   1st Qu.:28.40   1st Qu.:0.2697   1st Qu.:23.00   pos:130  
##  Median :125.50   Median :33.20   Median :0.4495   Median :27.00            
##  Mean   :156.06   Mean   :33.09   Mean   :0.5230   Mean   :30.86            
##  3rd Qu.:190.00   3rd Qu.:37.10   3rd Qu.:0.6870   3rd Qu.:36.00            
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00

Part A: The code below is used to display histograms of each numeric predictor in the dataset. Based upon your inspection of these graphs, apply a base-two log-transformation to each predictor that appears to be highly right-skewed before proceeding to the model building portion of this application (Part B).

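par(mfrow = c(3,3))  # arrange the plots in a 3 x 3 grid
# Loop over every column except the last one (the outcome, diabetes)
for(i in 1:(ncol(pima) - 1)){
  hist(pima[,i], main = paste(names(pima)[i]), xlab = "", ylab = "")
}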
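
For reference, here is a minimal sketch of a base-two log-transformation applied to a single column of a copied data frame. It uses simulated data with hypothetical column names (skewed and symmetric), so deciding which of the actual predictors to transform is still up to you. Keep in mind that log2() returns -Inf for a value of zero, so a variable containing zeros would need separate handling.

```r
# Simulated data (hypothetical): one right-skewed and one symmetric variable
set.seed(2)
toy <- data.frame(
  skewed    = rlnorm(300, meanlog = 4, sdlog = 1),
  symmetric = rnorm(300, mean = 70, sd = 10)
)

# Work on a copy so the original data frame is left untouched
toy_log <- toy
toy_log$skewed <- log2(toy$skewed)

# Compare the distribution before and after the transformation
par(mfrow = c(1, 2))
hist(toy$skewed, main = "skewed (raw)", xlab = "", ylab = "")
hist(toy_log$skewed, main = "skewed (log2)", xlab = "", ylab = "")
```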

Part B: Using the new dataset you constructed in Part A, apply a stepwise backwards elimination algorithm that uses AIC to select the optimal logistic regression model for predicting “diabetes”.
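
One common way to carry out backward elimination in R is the step() function, which uses AIC as its criterion by default (penalty k = 2). The sketch below runs it on simulated data with hypothetical predictors x1 through x4, where only x1 and x2 truly affect the outcome. Treat it as an illustration of the workflow rather than the required approach.

```r
# Simulated data (hypothetical): only x1 and x2 truly affect the outcome
set.seed(3)
n   <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- factor(ifelse(runif(n) < plogis(-0.5 + 1.5 * sim$x1 - 1 * sim$x2),
                       "pos", "neg"))

# Start from the full model containing every predictor
full <- glm(y ~ ., data = sim, family = binomial)

# Backward elimination; trace = 0 suppresses the step-by-step printout
selected <- step(full, direction = "backward", trace = 0)

formula(selected)               # predictors that were retained
c(full = AIC(full), selected = AIC(selected))
```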

Part C: Report the in-sample accuracy of the model you selected in Part B. Briefly comment upon why an ROC analysis might be more useful than raw accuracy in this application.
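
In-sample accuracy can be computed by comparing the observed classes with the classes predicted from the fitted probabilities. The sketch below does this on simulated data, and the 0.5 cutoff is an assumption made for illustration. The majority-class proportion is included only as a point of comparison.

```r
# Simulated data (hypothetical): a binary outcome and a single predictor
set.seed(4)
n <- 300
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(-1 + 1.2 * x), "pos", "neg"),
            levels = c("neg", "pos"))
fit <- glm(y ~ x, family = binomial)

# In-sample predicted probabilities of "pos", classified at a 0.5 cutoff
p_hat <- predict(fit, type = "response")
pred  <- factor(ifelse(p_hat >= 0.5, "pos", "neg"), levels = c("neg", "pos"))

conf_mat <- table(predicted = pred, actual = y)   # confusion matrix
accuracy <- mean(pred == y)                       # proportion classified correctly
baseline <- max(prop.table(table(y)))             # always guessing the majority class

conf_mat
c(accuracy = accuracy, baseline = baseline)
```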

Part D: Create an ROC curve displaying the sensitivity and specificity of the model you selected in Part B. Based upon this curve, determine a predicted probability cutoff (i.e., a threshold value of \(t\)) that you feel does the best job balancing the need for sensitivity and specificity in this application. Report your cutoff along with the corresponding sensitivity and specificity.
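
An ROC curve can be built in base R by computing sensitivity and specificity at every candidate cutoff, as in the sketch below. It uses simulated data, and all object names are hypothetical. The final step picks the cutoff that maximizes Youden's J (sensitivity + specificity - 1), which is only one possible rule. It weights the two kinds of error equally, and that may not be the right balance for a screening application.

```r
# Simulated data (hypothetical): a binary outcome and a single predictor
set.seed(5)
n <- 300
x <- rnorm(n)
y <- ifelse(runif(n) < plogis(-1 + 1.5 * x), 1, 0)
fit   <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)                       # in-sample predicted probabilities

# Sensitivity and specificity at every candidate cutoff
cutoffs <- sort(unique(c(0, p_hat, 1)))
roc <- as.data.frame(t(sapply(cutoffs, function(thr) {
  pred <- as.integer(p_hat >= thr)
  c(cutoff      = thr,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
})))

# ROC curve: sensitivity against 1 - specificity
plot(1 - roc$specificity, roc$sensitivity, type = "s",
     xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)                      # reference line for random guessing

# One possible rule: the cutoff maximizing Youden's J
best <- roc[which.max(roc$sensitivity + roc$specificity - 1), ]
best
```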

Question #4 - The Two Cultures in Data-Modeling

For this question, I’d like you to read the 14-page paper linked here: Statistical Modeling - The Two Cultures. It’s not essential that you read the entire paper in great detail, but I’d like you to read it closely enough to extract the main arguments. For reference, the author of this paper is perhaps the best-known researcher in the area of Classification and Regression Trees, and his essay is considered a “must read” in most data science programs.

For this question, I’d like you to write a 1-2 paragraph reflection that summarizes the main ideas of the paper, as well as your opinions regarding the author’s arguments. A satisfactory response will include at least five thoughtfully crafted sentences.

Assigned: Apr 16th, Due: Friday Apr 30th at 11:59pm