Directions:
\(~\)
Research has shown that the tar and nicotine content of cigarettes is associated with the amount of carbon monoxide emitted in the cigarette smoke. The data for this application involve laboratory testing results of 25 different brands of cigarettes. The goal of our analysis is to understand the influence of these variables on the amount of carbon monoxide emitted in the cigarette smoke. The data include the following variables:
cigs <- read.csv("https://remiller1450.github.io/data/Cigs.csv")
summary(cigs)
## Brand Tar Nicotine Weight
## Length:25 Min. : 1.00 Min. :0.1300 Min. :0.7851
## Class :character 1st Qu.: 8.60 1st Qu.:0.6900 1st Qu.:0.9225
## Mode :character Median :12.80 Median :0.9000 Median :0.9573
## Mean :12.22 Mean :0.8764 Mean :0.9703
## 3rd Qu.:15.10 3rd Qu.:1.0200 3rd Qu.:1.0070
## Max. :29.80 Max. :2.0300 Max. :1.1650
## CarbonMonoxide
## Min. : 1.50
## 1st Qu.:10.00
## Median :13.00
## Mean :12.53
## 3rd Qu.:15.40
## Max. :23.50
Part A: Fit a simple linear regression model that uses the variable “Nicotine” to predict the outcome “CarbonMonoxide”. Use this model to describe the relationship between these two variables. Additionally, answer whether the relationship appears to be statistically significant.
Part B: Fit a multiple linear regression model that uses the variables “Nicotine” and “Tar” to predict the outcome “CarbonMonoxide”. Interpret the adjusted effect of “Nicotine”.
Part C: Using graphs or descriptive statistics, thoroughly explain why the adjusted effect of “Nicotine” (after adjusting for “Tar”) is so different from the unadjusted effect.
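As a hedged illustration of the general phenomenon behind Part C (not the answer for the cigarette data), the sketch below simulates two strongly correlated predictors where only one of them truly drives the outcome. All variable names (tar_sim, nic_sim, co_sim) and the simulation settings are hypothetical stand-ins.

```r
# Simulated illustration only: these are NOT the Cigs.csv data.
set.seed(1)
n <- 200
tar_sim <- rnorm(n, mean = 12, sd = 5)            # hypothetical stand-in for Tar
nic_sim <- 0.07 * tar_sim + rnorm(n, sd = 0.05)   # built to track tar_sim closely
co_sim  <- 3 + 0.8 * tar_sim + rnorm(n, sd = 1)   # outcome depends on tar_sim only

unadjusted <- lm(co_sim ~ nic_sim)             # simple linear regression
adjusted   <- lm(co_sim ~ nic_sim + tar_sim)   # multiple linear regression

cor(tar_sim, nic_sim)            # how strongly the two predictors move together
coef(unadjusted)["nic_sim"]      # slope when tar_sim is ignored
coef(adjusted)["nic_sim"]        # slope after holding tar_sim fixed
```

In this simulation the unadjusted slope absorbs the effect of the omitted predictor. A scatterplot matrix such as pairs(cbind(co_sim, nic_sim, tar_sim)) is one way to make the same point graphically.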
Part D: Now modify the model from Part B so that a third-degree polynomial effect is used for the variable “Tar”. Why do you think the adjusted effect of “Nicotine” in this model is so insignificant?
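One possible way to specify a third-degree polynomial term in R is with poly() inside the model formula. The sketch below uses simulated data with hypothetical names (tar_sim, nic_sim, co_sim); it shows the syntax only and says nothing about the real cigarette results.

```r
# Simulated illustration only: these are NOT the Cigs.csv data.
set.seed(2)
n <- 100
tar_sim <- runif(n, 1, 30)
nic_sim <- 0.07 * tar_sim + rnorm(n, sd = 0.1)
co_sim  <- 2 + 1.2 * tar_sim - 0.02 * tar_sim^2 + rnorm(n)

# poly(tar_sim, 3) adds linear, quadratic, and cubic terms for tar_sim.
# By default poly() uses orthogonal polynomials; poly(tar_sim, 3, raw = TRUE)
# would use tar_sim, tar_sim^2, and tar_sim^3 directly.
fit_poly <- lm(co_sim ~ nic_sim + poly(tar_sim, 3))

names(coef(fit_poly))                          # one intercept, nic_sim, three polynomial terms
summary(fit_poly)$coefficients["nic_sim", ]    # estimate, SE, t value, p-value for nic_sim
```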
\(~\)
This application will look at the sale price of diamonds sold by a major online retailer. The dataset is contained in the ggplot2 package, and is loaded below:
library(ggplot2)
data("diamonds")
diamonds$cut <- as.character(diamonds$cut)
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Length:53940 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Class :character E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Mode :character F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
Our analysis will focus on the following variables:
Part A: Fit the regression model price ~ cut*carat. Using this model, what is the estimated effect of “carat” on “price” for “Fair” cut diamonds? What is the estimated effect of “carat” on “price” for “Good” cut diamonds? What is the estimated effect of “carat” on “price” for “Ideal” cut diamonds?
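Because cut was converted to a character vector above, lm() treats it as an unordered categorical predictor whose reference level is the alphabetically first category. The sketch below, on simulated data with hypothetical names (grp, size, y), shows how group-specific slopes can be assembled from the coefficients of a model of the form y ~ group*x. It is an illustration of the bookkeeping, not output from the diamonds data.

```r
# Simulated illustration only: three hypothetical groups with different slopes.
set.seed(3)
n <- 300
grp  <- sample(c("A", "B", "C"), n, replace = TRUE)   # character, like cut after as.character()
size <- runif(n, 0.2, 2)
true_slope <- c(A = 5, B = 7, C = 9)
y <- 1 + unname(true_slope[grp]) * size + rnorm(n, sd = 0.5)

fit <- lm(y ~ grp * size)
b <- coef(fit)

# The "size" coefficient is the slope in the reference group ("A").
# Each interaction coefficient is the DIFFERENCE in slope from the reference group.
slope_A <- unname(b["size"])
slope_B <- unname(b["size"] + b["grpB:size"])
slope_C <- unname(b["size"] + b["grpC:size"])

round(c(A = slope_A, B = slope_B, C = slope_C), 2)
```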
Part B: Fit the regression model price ~ carat*x. Using this model, what is the estimated effect of a 1mm increase in “x” on “price” for a 1 carat diamond? What is the estimated effect of a 1mm increase in “x” on “price” for a 2 carat diamond?
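With two quantitative predictors, the interaction coefficient makes the slope of one predictor a linear function of the other. A minimal sketch on simulated data follows; the names w, x, y, and effect_of_x are hypothetical, and the simulation is built so that the true slope of x is 3 + 1.5 * w.

```r
# Simulated illustration only: NOT the diamonds data.
set.seed(4)
n <- 500
w <- runif(n, 0.2, 3)    # hypothetical stand-in for carat
x <- runif(n, 3, 10)     # hypothetical stand-in for x (length in mm)
y <- 10 + 2 * w + 3 * x + 1.5 * w * x + rnorm(n)

fit <- lm(y ~ w * x)
b <- coef(fit)

# Estimated change in y per one-unit increase in x, holding w fixed at w0:
#   b_x + b_{w:x} * w0
effect_of_x <- function(w0) unname(b["x"] + b["w:x"] * w0)

effect_of_x(1)   # slope of x when w = 1
effect_of_x(2)   # slope of x when w = 2
```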
Part C: Use the visreg2d() function in the visreg package to create an interaction plot for the model price ~ carat*x. Then, using this plot, qualitatively describe the interaction between “carat” and “x” when predicting “price”.
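A hedged sketch of how visreg2d() might be called is shown below on simulated data (the data frame sim and its columns w, x, y are hypothetical). The plotting call is guarded so the rest of the code still runs if the visreg package is not installed.

```r
# Simulated illustration only: NOT the diamonds data.
set.seed(5)
n <- 500
sim <- data.frame(w = runif(n, 0.2, 3), x = runif(n, 3, 10))
sim$y <- 10 + 2 * sim$w + 3 * sim$x + 1.5 * sim$w * sim$x + rnorm(n)

fit <- lm(y ~ w * x, data = sim)

# visreg2d() displays the fitted surface over a grid of the two predictors.
# plot.type = "image" gives a filled heat map; "persp" gives a 3D surface.
if (requireNamespace("visreg", quietly = TRUE)) {
  visreg::visreg2d(fit, xvar = "w", yvar = "x", plot.type = "image")
}
```

When reading such a plot, contours that are not evenly spaced parallel lines suggest that the effect of one predictor changes across values of the other.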
\(~\)
This application will explore the use of easily collected, minimally invasive predictors in screening for type II diabetes. Because gold-standard diagnostic testing for diabetes is expensive and time consuming, the hope is that a set of more easily attained predictors can provide suitable predictive ability.
The data were collected as part of a research study on the Pima Indian tribe, and contain the following variables:
The code below loads the data from the mlbench library, then filters it to remove any cases with missing values.
library(mlbench)
data(PimaIndiansDiabetes2)
pima <- PimaIndiansDiabetes2[complete.cases(PimaIndiansDiabetes2),]
summary(pima)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 56.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.:21.00
## Median : 2.000 Median :119.0 Median : 70.00 Median :29.00
## Mean : 3.301 Mean :122.6 Mean : 70.66 Mean :29.15
## 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00 3rd Qu.:37.00
## Max. :17.000 Max. :198.0 Max. :110.00 Max. :63.00
## insulin mass pedigree age diabetes
## Min. : 14.00 Min. :18.20 Min. :0.0850 Min. :21.00 neg:262
## 1st Qu.: 76.75 1st Qu.:28.40 1st Qu.:0.2697 1st Qu.:23.00 pos:130
## Median :125.50 Median :33.20 Median :0.4495 Median :27.00
## Mean :156.06 Mean :33.09 Mean :0.5230 Mean :30.86
## 3rd Qu.:190.00 3rd Qu.:37.10 3rd Qu.:0.6870 3rd Qu.:36.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
Part A: The code below is used to display histograms of each numeric predictor in the dataset. Based upon your inspection of these graphs, apply a base-two log-transformation to each predictor that appears to be highly right-skewed before proceeding to the model building portion of this application (Part B).
par(mfrow = c(3,3))
for(i in 1:(ncol(pima) - 1)){
  hist(pima[,i], main = paste(names(pima)[i]), xlab = "", ylab = "")
}
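The transformation step in Part A could be sketched as follows. The example uses a simulated data frame (sim, with hypothetical columns skewed and symmetric) and a small helper skew() written here for illustration; which pima predictors to transform is left to your own inspection of the histograms.

```r
# Simulated illustration only: NOT the pima data.
set.seed(6)
sim <- data.frame(
  skewed    = rlnorm(300, meanlog = 5, sdlog = 0.8),   # strongly right-skewed
  symmetric = rnorm(300, mean = 70, sd = 10)           # roughly symmetric
)

# Hypothetical helper: a simple moment-based sample skewness.
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
sapply(sim, skew)

# Work on a copy so the original data stay available for comparison.
sim_log <- sim
sim_log$skewed <- log2(sim$skewed)
sapply(sim_log, skew)

# Caution: log2(0) is -Inf, so a predictor containing zeros needs
# different handling (for example log2(v + 1)) before it can be used in a model.
```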
Part B: Using the new dataset you constructed in Part A, apply a stepwise backwards elimination algorithm that uses AIC to select the optimal logistic regression model for predicting “diabetes”.
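Backward elimination by AIC can be carried out with the base R function step(). The following is a minimal sketch on simulated data; the data frame sim and its columns (x1, x2, noise1, noise2, y) are hypothetical, with two informative predictors and two pure-noise predictors.

```r
# Simulated illustration only: NOT the pima data.
set.seed(7)
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
p <- plogis(-0.5 + 1.5 * sim$x1 - 1 * sim$x2)
sim$y <- factor(ifelse(runif(n) < p, "pos", "neg"), levels = c("neg", "pos"))

# Start from the full model containing every candidate predictor.
full <- glm(y ~ ., data = sim, family = binomial)

# direction = "backward" drops one term at a time while doing so lowers AIC;
# trace = 0 suppresses the step-by-step printout.
selected <- step(full, direction = "backward", trace = 0)

formula(selected)
c(full = AIC(full), selected = AIC(selected))
```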
Part C: Report the in-sample accuracy of the model you selected in Part B. Briefly comment upon why an ROC analysis might be more useful than raw accuracy in this application.
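In-sample accuracy at a 0.5 cutoff might be computed as in the sketch below, which uses simulated data with hypothetical names (sim, x1, y). It also computes the accuracy of always predicting the majority class, which is a useful baseline: in the pima summary above, always predicting “neg” would already be correct for 262 of the 392 cases.

```r
# Simulated illustration only: NOT the pima data.
set.seed(8)
n <- 400
sim <- data.frame(x1 = rnorm(n))
sim$y <- factor(ifelse(runif(n) < plogis(-0.7 + 1.5 * sim$x1), "pos", "neg"),
                levels = c("neg", "pos"))

fit  <- glm(y ~ x1, data = sim, family = binomial)
prob <- predict(fit, type = "response")      # predicted probability of "pos"
pred <- ifelse(prob > 0.5, "pos", "neg")     # classify using a 0.5 cutoff

accuracy <- mean(pred == sim$y)
baseline <- max(prop.table(table(sim$y)))    # accuracy of always guessing the majority class

table(predicted = pred, actual = sim$y)      # confusion matrix
c(accuracy = accuracy, baseline = baseline)
```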
Part D: Create an ROC curve displaying the sensitivity and specificity of the model you selected in Part B. Based upon this curve, determine a predicted probability cutoff (i.e., a threshold value of \(t\)) that you feel does the best job balancing the need for sensitivity and specificity in this application. Report your cutoff along with the corresponding sensitivity and specificity.
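An ROC curve can be built by hand in base R by sweeping the cutoff \(t\) over a grid, as in the sketch below on simulated data (all names are hypothetical). The cutoff is chosen here by maximizing Youden's J, sensitivity + specificity - 1, which is only one possible rule; in a screening setting you may reasonably prefer a cutoff that favors sensitivity.

```r
# Simulated illustration only: NOT the pima data.
set.seed(9)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.7 + 1.5 * x))    # 1 = positive, 0 = negative

fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")

# Classify as positive when prob >= t, for each cutoff t on a grid.
cutoffs <- seq(0, 1, by = 0.01)
sens <- sapply(cutoffs, function(t) mean(prob[y == 1] >= t))   # true positive rate
spec <- sapply(cutoffs, function(t) mean(prob[y == 0] <  t))   # true negative rate

plot(1 - spec, sens, type = "l",
     xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)                        # reference line for a useless classifier

best <- which.max(sens + spec - 1)           # Youden's J (an assumed selection rule)
c(cutoff = cutoffs[best], sensitivity = sens[best], specificity = spec[best])
```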
\(~\)
For this question, I’d like you to read the 14-page paper linked here: Statistical Modeling - The Two Cultures. It’s not essential that you read the entire paper in great detail, but I’d like you to read it closely enough to extract the main arguments. For reference, the author of this paper is perhaps the most well-known researcher in the area of Classification and Regression Trees, and his essay is considered a “must read” in most data science programs.
For this question, I’d like you to write a 1-2 paragraph reflection that summarizes the main ideas of the paper, as well as your opinions regarding the author’s arguments. A satisfactory response will include at least five thoughtfully crafted sentences.