Directions:
As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This requires you to work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.
You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.
\(~\)
Linear regression describes a class of models that are parametric and statistical.
To understand the method, first consider a population of interest. Simple linear regression assumes a straight-line relationship between two variables in this population. This population-level model is expressed using the following parametric structure:
\[y = \beta_0 + \beta_1x + \epsilon\]
The random errors, \(\epsilon\), follow a \(N(0, \sigma^2)\) distribution that is independent of \(x\), meaning the distribution of errors should be similar everywhere along the line.
The line itself is the expected value of the response variable for a given value of \(x\). That is, \(E(y) = \beta_0 + \beta_1x\). Sometimes this expression is called the model’s deterministic component, while \(\epsilon\) is the model’s random component.
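To make this model concrete, the short simulation below generates a sample from a hypothetical population where \(\beta_0 = 2\), \(\beta_1 = 0.5\), and \(\sigma = 1\) (made-up values chosen purely for illustration). The resulting scatterplot shows data varying randomly around the line \(E(y) = 2 + 0.5x\):

## Simulate data from a hypothetical straight-line model (made-up parameters)
set.seed(123)
n <- 100
x <- runif(n, min = 0, max = 10)        ## Explanatory variable
epsilon <- rnorm(n, mean = 0, sd = 1)   ## Random component, N(0, sigma^2)
y <- 2 + 0.5*x + epsilon                ## Deterministic component plus random error

## Plot the simulated data along with the population-level line
plot(x, y)
abline(a = 2, b = 0.5)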
\(~\)
Using a linear regression model to analyze data involves the following steps:

1. Specify the form of the population-level model.
2. Estimate the model’s unknown parameters using sample data.
3. Evaluate the assumptions of the model.
4. Use the estimated model for statistical inference and prediction.
In the next section, we’ll take a closer look at Step #2 using the “Tips” dataset, which is introduced below.
This dataset was recorded by a restaurant server at a national chain restaurant located in a suburban shopping mall near New York City. The server kept records on every table they served over the span of several months in hopes of better understanding the tips they’d been earning. In this lab, we’ll focus on the variables “Tip” and “TotBill”, the amount tipped and the total bill, respectively.
## Code to load the Tips dataset
tips <- read.csv("https://remiller1450.github.io/data/Tips.csv")
Question #1: Create a scatterplot displaying the relationship between “TotBill” and “Tip”. Based upon this graph, does a simple linear regression model seem reasonable for using “TotBill” to explain “Tip”?
\(~\)
Linear regression models are specified at the population-level, then the unknown parameters are estimated from the sample data.
In this class, we’ll use “hats” to denote estimates. So, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) represent estimates of the population-level intercept and slope. For instance, we can write out the estimated/fitted model (using generic terms) for our Iowa City home sales example as:
\[\widehat{\text{Sale Price}} = \hat{\beta}_0 + \hat{\beta}_1 \text{Assessed Value}\]
As we’ve previously discussed, there are infinitely many possibilities for these estimates. That is, the family of straight-line models contains infinitely many lines. We want the best estimates, an idea we’ll need to more rigorously define.
Question #2: Considering the “Tips” dataset, write out the population-level model that uses “TotBill” to predict “Tip”. Then, write out the estimated model (in generic terms).
\(~\)
Least squares estimation is the most widely used method for finding the “best” estimates of the model’s slope and intercept. The method is based upon the residuals, or the deviations between the observed values of \(y\) and the regression line. More specifically, the least squares estimates are the values of the slope and intercept that minimize the sum of the squared residuals.
Mathematically, we’ll use \(r_i = y_i - \hat{y}_i\) to denote the residual of the \(i^{th}\) data-point. Then, we’ll denote the sum of squared residuals as:
\[SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] Notice \(SSE\) is a function of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) because \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\); so, we could find the estimates that minimize \(SSE\) by taking partial derivatives with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\), setting these derivatives equal to zero, then solving the resulting system of equations.
Minimizing \(SSE\) to find the least squares estimates of \(\beta_0\) and \(\beta_1\) is a very important idea, but I believe there are better ways to spend our time than re-deriving the least squares estimates for ourselves. If you’re interested in the mathematical details of these estimates, this video provides an excellent derivation.
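If you’d like to verify the result of that derivation without doing the calculus yourself, the sketch below applies the closed-form solutions, \(\hat{\beta}_1 = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^{n}(x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), to the Iowa City home sales data used in the next section:

## Load the Iowa City home sales data (also used in the next section)
ic <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
x <- ic$assessed
y <- ic$sale.amount

## Closed-form least squares estimates of the slope and intercept
b1 <- sum((x - mean(x))*(y - mean(y)))/sum((x - mean(x))^2)
b0 <- mean(y) - b1*mean(x)
c(b0, b1)   ## Should match the estimates produced by "lm" in the next section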
Question #3: Consider the following estimated model: \(\widehat{\text{Sale Price}} = -1.53 + 1.03 \times \text{Assessed Value}\). One home in the dataset had an assessed value of $173,040 and sold for $172,500 (case #1), while another home had an assessed value of $679,540 and sold for $550,000 (case #14). Use this estimated model to calculate the residuals for each of these observations (you can call them \(r_1\) and \(r_{14}\)). Which observation (case #1 or case #14) do you think had a greater influence on the least squares estimates?
\(~\)
The lm Function

In R, the lm function will fit linear regression models using least squares estimation. The example below demonstrates how to use this function, along with a few related functions that are used to access the quantities mentioned in previous sections:
## Load Iowa City Home Sales data
ic <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
## Fit model using "lm"
model <- lm(sale.amount ~ assessed, data = ic)
## Get the estimated coefficients
coef(model)
## (Intercept) assessed
## -1.522832 1.033127
## Get the residuals
res <- residuals(model)
## Histogram of the residuals
hist(res)
## Sum of squares residuals
sum(res^2)
## [1] 3.40932e+11
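As a quick sanity check on the definition \(r_i = y_i - \hat{y}_i\), we can extract the fitted values \(\hat{y}_i\) using the fitted function and confirm that subtracting them from the observed sale amounts reproduces the residuals:

## Fitted values (the y-hats) for each home
yhat <- fitted(model)

## Confirm the residuals equal the observed y-values minus the fitted values
all.equal(as.numeric(res), as.numeric(ic$sale.amount - yhat))
## [1] TRUE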
Question #4: Use the lm function to estimate the slope and intercept for the linear regression that relates “TotBill” and “Tip”. Report these estimates using proper statistical notation and provide a brief interpretation of each.
Question #5: Using the model you fit in Question #4, use the which.min function to locate the index of the largest negative residual. Next, find this observation in the dataset and calculate the percentage of the total bill that this person(s) tipped.
Question #6: Use the sum function to sum up all of the model’s residuals (do not square them). Are you surprised by the result? Briefly explain.
\(~\)
You can use least squares estimation to fit a straight-line model to any dataset. However, in order to capitalize on the statistical nature of this model, we must verify that the probability distribution we are using to represent the model’s random component is appropriate.

The validity of this distribution can be expressed via four basic assumptions about \(\epsilon\), which we can check using the fitted model’s residuals:

1. The mean of \(\epsilon\) is zero.
2. The variance of \(\epsilon\) is constant for all values of \(x\).
3. \(\epsilon\) follows a Normal distribution.
4. The errors are independent of one another.
In R, the plot function takes a fitted linear regression model as an input and produces several graphs to help us evaluate these assumptions. For now, we’ll focus on the first two plots, “Residuals vs. Fitted” and “Normal Q-Q”.
## Using "model" from the Iowa City homes example, "par" is used to graph these plots in a 4x4 grid
par(mfrow = c(2,2))
plot(model)
The “Residuals vs. Fitted” plot can be used to evaluate Assumptions #2 and #4, while the “Normal Q-Q” plot can be used to evaluate Assumption #3.
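If you’d prefer to view these two plots individually rather than in a grid, the plot function’s which argument selects diagnostic plots by number:

## Display only the "Residuals vs. Fitted" plot
plot(model, which = 1)

## Display only the "Normal Q-Q" plot
plot(model, which = 2)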
Question #7: Use the plot function to evaluate the assumptions of the model you fit in Question #4 that relates “TotBill” and “Tip”. Briefly comment on any violations you see.
\(~\)
After verifying that the Normal probability distribution is appropriate, we can engage in statistical inference. For linear regression, there are two broad types of inference we might be interested in:

1. Inference on the model’s parameters, such as hypothesis tests and confidence intervals involving the slope, \(\beta_1\).
2. Inference on outcomes predicted by the model, such as prediction intervals for new observations.
These inferences are based upon the errors following a \(N(0,\sigma^2)\) distribution. However, because we must estimate \(\sigma^2\) in order to determine the standard errors of our parameter estimates, hypothesis tests and confidence intervals should use the \(t\)-distribution. We will not cover the details, but for simple linear regression the degrees of freedom are \(n - 2\) when conducting inference on the model’s slope or intercept.
You can find more information regarding the sampling distribution of \(\hat{\beta}_1\) in Chapter 3.6 of our textbook, A Second Course in Statistics. For our purposes, we’ll rely upon R to obtain the standard errors of these estimates using the summary function. Recall that we can also obtain confidence intervals using the confint function.
## Model summary (estimates, std errors, and t-tests)
summary(model)
##
## Call:
## lm(formula = sale.amount ~ assessed, data = ic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152050 -7137 -347 7496 148286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.523e+00 1.712e+03 -0.001 0.999
## assessed 1.033e+00 8.819e-03 117.142 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20970 on 775 degrees of freedom
## Multiple R-squared: 0.9465, Adjusted R-squared: 0.9465
## F-statistic: 1.372e+04 on 1 and 775 DF, p-value: < 2.2e-16
## Confidence interval estimates
confint(model)
## 2.5 % 97.5 %
## (Intercept) -3361.652060 3358.60640
## assessed 1.015815 1.05044
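To connect this output with the \(t\)-distribution discussion above, the sketch below reproduces the slope’s \(t\)-statistic, p-value, and 95% confidence interval from its estimate, its standard error, and \(n - 2\) degrees of freedom:

## Extract the slope estimate and standard error from the summary table
est <- summary(model)$coefficients["assessed", "Estimate"]
se <- summary(model)$coefficients["assessed", "Std. Error"]
df <- nrow(ic) - 2   ## n - 2 degrees of freedom for simple linear regression

## t-statistic and two-sided p-value (match the "summary" output)
est/se
2*pt(-abs(est/se), df = df)

## 95% confidence interval (matches the "confint" output)
est + c(-1, 1)*qt(0.975, df = df)*se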
Predictions can be found using the predict function, with prediction intervals stemming from the argument interval = "prediction". It is often useful to set up a “grid” of x-values that spans the entire range of the explanatory variable when looking at prediction intervals.
## Setup a grid of x-values to predict over
xgrid <- data.frame(assessed = seq(min(ic$assessed), max(ic$assessed), length.out = 1000))
## Use "predict" to get model-based predictions and prediction intervals
preds <- predict(model, newdata = xgrid, interval = "prediction", level = .99)
## Base R plotting
plot(ic$assessed, ic$sale.amount) ## Scatterplot
abline(a = model$coefficients[1], b = model$coefficients[2]) ## Line equation
lines(xgrid$assessed, preds[,2], lty = 2) ## Prediction lower band
lines(xgrid$assessed, preds[,3], lty = 2) ## Prediction upper band
## ggplot version
library(ggplot2)
ggplot() + geom_point(aes(x = ic$assessed, y = ic$sale.amount)) +
geom_line(aes(x = xgrid$assessed, y = preds[,2]), linetype = "dashed") +
geom_line(aes(x = xgrid$assessed, y = preds[,3]), linetype = "dashed")
The graphs generated by the code above display 99% prediction intervals, implying that the actual y-value of 99% of data-points should fall within these limits. Recognize that prediction intervals are fundamentally different from confidence intervals. The former deal with the uncertainty involved in predicting individual observations, while the latter deal with the uncertainty involved in estimating model parameters.
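To see this distinction in action, the sketch below compares the two interval types for a home assessed at $200,000 (an arbitrary value chosen for illustration). The prediction interval is noticeably wider because it must account for the variability of individual homes around the line, not just the uncertainty in the estimated line itself:

## 99% confidence interval for the average sale price at an assessed value of $200,000
predict(model, newdata = data.frame(assessed = 200000), interval = "confidence", level = .99)

## 99% prediction interval for the sale price of a single such home
predict(model, newdata = data.frame(assessed = 200000), interval = "prediction", level = .99)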
Question #8: Using the model you fit in Question #4 that relates “TotBill” and “Tip”, create a graph showing the 95% prediction bands. Does it appear that 95% of the data falls inside of these bands?
\(~\)
Question #9: For this question you will conduct a comprehensive analysis of the data introduced below by following the steps outlined in Parts A-H.
The Armed Forces Qualification Test (AFQT) is a multiple choice aptitude test designed to gauge whether an individual’s reasoning and verbal skills are sufficient for military service. Because the test has been so broadly applied, scores are reported as percentiles and are frequently used as a measure of general intelligence.
The data you will use in the questions that follow comes from a random sample of 2584 Americans who were first selected and tested in 1979, then later re-interviewed in 2006 and asked about their education and annual income. Your focus will be on using AFQT percentile scores (AFQT) to model annual income (Income2005).
## Load the data
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
Part A: Use the summary function to check for any missing values in the variables AFQT and Income2005.

Part B: Create a histogram of Income2005. Is the distribution of this variable skewed or symmetric?

Part C: Fit a simple linear regression model that uses AFQT to predict Income2005 and report the least squares estimates using proper notation.

Part D: Use the plot function to evaluate the assumptions of this model. Do you believe inference using this model is statistically valid, or are the model’s assumptions violated?

Part E: Create a new variable by log-transforming Income2005. Is the distribution of this new variable skewed or symmetric?

Part F: Fit a new simple linear regression model that uses the transformed version of income as the response (still using AFQT as the predictor). Then use the plot function to evaluate the assumptions of this model. Based upon what you see, do you believe this model is better suited for valid statistical inference than the model you explored in Part D?

Part H: Briefly summarize what your analysis suggests about the relationship between AFQT percentiles and future income.