Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This requires you to work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.
\(~\)
Linear regression describes a class of models that are parametric and statistical.
To understand the method, first consider a population of interest. Simple linear regression assumes a straight-line relationship between two variables in this population. This population-level model is expressed using the following parametric structure:
\[y = \beta_0 + \beta_1x + \epsilon\]
The random errors, \(\epsilon\), follow a \(N(0, \sigma^2)\) distribution that is independent of \(x\), meaning the distribution of errors should be similar everywhere along the line.
The line itself is the expected value of the response variable for a given value of \(x\). That is, \(E(y) = \beta_0 + \beta_1x\). Sometimes this expression is called the model’s deterministic component, while \(\epsilon\) is the model’s random component.
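To make this model concrete, the short simulation below generates a sample from a hypothetical population where \(\beta_0 = 2\), \(\beta_1 = 0.5\), and \(\sigma = 1\) (made-up values chosen purely for illustration). The resulting scatterplot shows data varying randomly around the line \(E(y) = 2 + 0.5x\):

## Simulate data from a hypothetical straight-line model (made-up parameters)
set.seed(123)
n <- 100
x <- runif(n, min = 0, max = 10)        ## Explanatory variable
epsilon <- rnorm(n, mean = 0, sd = 1)   ## Random component, N(0, sigma^2)
y <- 2 + 0.5*x + epsilon                ## Deterministic component plus random error

## Plot the simulated data along with the population-level line
plot(x, y)
abline(a = 2, b = 0.5)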
\(~\)
Using a linear regression model to analyze data involves the following steps:

1. Specify the form of the population-level model.
2. Estimate the model’s unknown parameters using sample data.
3. Evaluate the assumptions of the model.
4. Use the estimated model for statistical inference and prediction.
In the next section, we’ll take a closer look at Step #2 using the “Tips” dataset, which is introduced below.
This dataset was recorded by a restaurant server at a national chain restaurant located in a suburban shopping mall near New York City. The server kept records on every table they served over the span of several months in hopes of better understanding the tips they’d been earning. In this lab, we’ll focus on the variables “Tip” and “TotBill”, the amount tipped and the total bill, respectively.
## Code to load the Tips dataset
tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")
Question #1: Create a scatterplot displaying the relationship between “TotBill” and “Tip”. Based upon this graph, does a simple linear regression model seem reasonable for using “TotBill” to explain “Tip”?
\(~\)
Linear regression models are specified at the population-level, then the unknown parameters are estimated from the sample data.
In this class, we’ll use “hats” to denote estimates. So, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) represent estimates of the population-level intercept and slope. For instance, we can write out the estimated/fitted model (using generic terms) for our Iowa City home sales example as:
\[\widehat{\text{Sale Price}} = \hat{\beta}_0 + \hat{\beta}_1 \text{Assessed Value}\]
As we’ve previously discussed, there are infinitely many possibilities for these estimates. That is, the family of straight-line models contains infinitely many lines. We want the best estimates, an idea we’ll need to more rigorously define.
Question #2: Considering the “Tips” dataset, write out the population-level model that uses “TotBill” to predict “Tip”. Then, write out the estimated model (in generic terms).
\(~\)
Least squares estimation is the most widely used method for finding the “best” estimates of the model’s slope and intercept. The method is based upon the residuals, or the deviations between the observed values of \(y\) and the regression line. More specifically, the least squares estimates are the values of the slope and intercept that minimize the sum of the squared residuals.
Mathematically, we’ll use \(r_i = y_i - \hat{y}_i\) to denote the residual of the \(i^{th}\) data-point. Then, we’ll denote the sum of squared residuals as:
\[SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] Notice \(SSE\) is a function of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) because \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\); so, we could find the estimates that minimize \(SSE\) by taking partial derivatives with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\), setting these derivatives equal to zero, then solving the resulting system of equations.
Minimizing \(SSE\) to find the least squares estimates of \(\beta_0\) and \(\beta_1\) is a very important idea, but I believe there are better ways to spend our time than re-deriving the least squares estimates for ourselves. If you’re interested in the mathematical details of these estimates, this video provides an excellent derivation.
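If you’d like to verify the result of that derivation without doing the calculus yourself, the sketch below applies the closed-form solutions, \(\hat{\beta}_1 = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^{n}(x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), to the Iowa City home sales data used in the next section:

## Load the Iowa City home sales data (also used in the next section)
ic <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
x <- ic$assessed
y <- ic$sale.amount

## Closed-form least squares estimates of the slope and intercept
b1 <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
b0 <- mean(y) - b1*mean(x)
c(b0, b1)   ## Should match the estimates produced by "lm" in the next section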
Question #3: Consider the following estimated model: \(\widehat{\text{Sale Price}} = -1.53 + 1.03 \times \text{Assessed Value}\). One home in the dataset had an assessed value of $173,040 and sold for $172,500 (case #1), while another home had an assessed value of $679,540 and sold for $550,000 (case #14). Use this estimated model to calculate the residuals for each of these observations (you can call them \(r_1\) and \(r_{14}\)). Which observation (case #1 or case #14) do you think had a greater influence on the least squares estimates?
\(~\)
The lm Function

In R, the lm function will fit linear regression models using least squares estimation. The example below demonstrates how to use this function, along with a few related functions that are used to access the quantities mentioned in previous sections:
## Load Iowa City Home Sales data
ic <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Fit model using "lm"
model <- lm(sale.amount ~ assessed, data = ic)
## Get the estimated coefficients
coef(model)
## (Intercept) assessed
## -1.522832 1.033127
## Get the residuals
res <- residuals(model)
## Histogram of the residuals
hist(res)
## Sum of squares residuals
sum(res^2)
## [1] 3.40932e+11
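As a quick sanity check on the definition \(r_i = y_i - \hat{y}_i\), we can extract the fitted values \(\hat{y}_i\) using the fitted function and confirm that subtracting them from the observed sale amounts reproduces the residuals:

## Fitted values (the y-hats) for each home
yhat <- fitted(model)

## Confirm the residuals equal the observed y-values minus the fitted values
all.equal(as.numeric(res), as.numeric(ic$sale.amount - yhat))
## [1] TRUE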
Question #4: Use the lm function to estimate the slope and intercept for the linear regression that relates “TotBill” and “Tip”. Report these estimates using proper statistical notation and provide a brief interpretation of each.
Question #5: Using the model you fit in Question #4, use the which.min function to locate the index of the largest negative residual. Next, find this observation in the dataset and calculate the percentage of the total bill that this person(s) tipped.
Question #6: Use the sum function to sum up all of the model’s residuals (do not square them). Are you surprised by the result? Briefly explain.
\(~\)
You can use least squares estimation to fit a straight-line model to any dataset. However, in order to capitalize on the statistical nature of this model, we must verify that the probability distribution we are using to represent the model’s random component is appropriate.

The validity of this distribution can be expressed via four basic assumptions about \(\epsilon\), which we can check using the fitted model’s residuals:

1. The mean of \(\epsilon\) is zero.
2. The variance of \(\epsilon\) is constant for all values of \(x\).
3. \(\epsilon\) follows a Normal distribution.
4. The errors are independent of one another.
In R, the plot function takes a fitted linear regression model as an input and produces several graphs to help us evaluate these assumptions. For now, we’ll focus on the first two plots, “Residuals vs. Fitted” and “Normal Q-Q”.
## Using "model" from the Iowa City homes example, "par" is used to graph these plots in a 4x4 grid
par(mfrow = c(2,2))
plot(model)
The “Residuals vs. Fitted” plot can be used to evaluate Assumptions #2 and #4, while the “Normal Q-Q” plot can be used to evaluate Assumption #3.
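If you’d prefer to view these two plots individually rather than in a grid, the plot function’s which argument selects diagnostic plots by number:

## Display only the "Residuals vs. Fitted" plot
plot(model, which = 1)

## Display only the "Normal Q-Q" plot
plot(model, which = 2)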
Question #7: Use the plot function to evaluate the assumptions of the model you fit in Question #4 that relates “TotBill” and “Tip”. Briefly comment on any violations you see.
\(~\)
After verifying that the Normal probability distribution is appropriate, we can engage in statistical inference. For linear regression, there are two broad types of inference we might be interested in:

1. Inference on the model’s parameters, such as hypothesis tests and confidence intervals involving the slope, \(\beta_1\).
2. Inference on outcomes predicted by the model, such as prediction intervals for new observations.
These inferences are based upon the errors following a \(N(0,\sigma^2)\) distribution. However, because we must estimate \(\sigma^2\) in order to determine the standard errors of our parameter estimates, hypothesis tests and confidence intervals should use the \(t\)-distribution. We will not cover the details, but for simple linear regression the degrees of freedom are \(n - 2\) when conducting inference on the model’s slope or intercept.
You can find more information regarding the sampling distribution of \(\hat{\beta}_1\) in Chapter 3.6 of our textbook, A Second Course in Statistics. For our purposes, we’ll rely upon R to obtain the standard errors of these estimates using the summary function. Recall that we can also obtain confidence intervals using the confint function.
## Model summary (estimates, std errors, and t-tests)
summary(model)
##
## Call:
## lm(formula = sale.amount ~ assessed, data = ic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152050 -7137 -347 7496 148286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.523e+00 1.712e+03 -0.001 0.999
## assessed 1.033e+00 8.819e-03 117.142 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20970 on 775 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9465
## F-statistic: 1.372e+04 on 1 and 775 DF, p-value: < 2.2e-16
## Confidence interval estimates
confint(model)
## 2.5 % 97.5 %
## (Intercept) -3361.652060 3358.60640
## assessed 1.015815 1.05044
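To connect this output with the \(t\)-distribution discussion above, the sketch below reproduces the slope’s \(t\)-statistic, p-value, and 95% confidence interval from its estimate, its standard error, and \(n - 2\) degrees of freedom:

## Extract the slope estimate and standard error from the summary table
est <- summary(model)$coefficients["assessed", "Estimate"]
se <- summary(model)$coefficients["assessed", "Std. Error"]
df <- nrow(ic) - 2   ## n - 2 degrees of freedom for simple linear regression

## t-statistic and two-sided p-value (match the "summary" output)
est/se
2*pt(-abs(est/se), df = df)

## 95% confidence interval (matches the "confint" output)
est + c(-1, 1)*qt(0.975, df = df)*se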
Predictions can be found using the predict function, with prediction intervals stemming from the argument interval = "prediction". It is often useful to set up a “grid” of x-values that spans the entire range of the explanatory variable when looking at prediction intervals.
## Setup a grid of x-values to predict over
xgrid <- data.frame(assessed = seq(min(ic$assessed), max(ic$assessed), length.out = 1000))
## Use "predict" to get model-based predictions and prediction intervals
preds <- predict(model, newdata = xgrid, interval = "prediction", level = .99)
## Base R plotting
plot(ic$assessed, ic$sale.amount) ## Scatterplot
abline(a = model$coefficients[1], b = model$coefficients[2]) ## Line equation
lines(xgrid$assessed, preds[,2], lty = 2) ## Prediction lower band
lines(xgrid$assessed, preds[,3], lty = 2) ## Prediction upper band
## ggplot version
library(ggplot2)
ggplot() + geom_point(aes(x = ic$assessed, y = ic$sale.amount)) +
geom_line(aes(x = xgrid$assessed, y = preds[,2]), linetype = "dashed") +
geom_line(aes(x = xgrid$assessed, y = preds[,3]), linetype = "dashed")
The graphs generated by the code above display 99% prediction intervals, implying that the actual y-value of 99% of data-points should fall within these limits. Recognize that prediction intervals are fundamentally different from confidence intervals. The former deal with the uncertainty involved in predicting individual observations, while the latter deal with the uncertainty involved in estimating model parameters.
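To see this distinction in action, the sketch below compares the two interval types for a home assessed at $200,000 (an arbitrary value chosen for illustration). The prediction interval is noticeably wider because it must account for the variability of individual homes around the line, not just the uncertainty in the estimated line itself:

## 99% confidence interval for the average sale price at an assessed value of $200,000
predict(model, newdata = data.frame(assessed = 200000), interval = "confidence", level = .99)

## 99% prediction interval for the sale price of a single such home
predict(model, newdata = data.frame(assessed = 200000), interval = "prediction", level = .99)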
Question #8: Using the model you fit in Question #4 that relates “TotBill” and “Tip”, create a graph showing the 95% prediction bands. Does it appear that 95% of the data falls inside of these bands?
\(~\)
Question #9: For this question you will conduct a comprehensive analysis of the data introduced below by following the steps outlined in Parts A-H.
The Armed Forces Qualification Test (AFQT) is a multiple choice aptitude test designed to gauge whether an individual’s reasoning and verbal skills are sufficient for military service. Because the test has been so broadly applied, scores are reported as percentiles and are frequently used as a measure of general intelligence.
The data you will use in the questions that follow comes from a random sample of 2584 Americans who were first selected and tested in 1979, then later re-interviewed in 2006 and asked about their education and annual income. Your focus will be on using AFQT percentile scores (AFQT) to model annual income (Income2005).
## Load the data
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
Part A: Use the summary function to check for any missing values in the variables AFQT and Income2005.

Part B: Create a histogram of Income2005. Is the distribution of this variable skewed or symmetric?

Part C: Fit a simple linear regression model that uses AFQT to predict Income2005 and report the least squares estimates using proper notation.

Part D: Use the plot function to evaluate the assumptions of this model. Do you believe inference using this model is statistically valid, or are the model’s assumptions violated?

Part E: Create a new variable by log-transforming Income2005. Is the distribution of this new variable skewed or symmetric?

Part F: Fit a new simple linear regression model that uses the transformed version of income as the response (still using AFQT as the predictor). Then use the plot function to evaluate the assumptions of this model. Based upon what you see, do you believe this model is better suited for valid statistical inference than the model you explored in Part D?

Part H: Briefly summarize what your analysis suggests about the relationship between AFQT percentiles and future income.