Directions (read before starting)
Finding correlations in R is straightforward and only involves a single function, cor(). This lab will also provide a short introduction to regression, a topic that we'll continue to explore over the next several classes.
Before we begin, I wanted to briefly discuss the geom_smooth() function in the ggplot2 data visualization package and how it can help us make decisions during a correlation or regression analysis.
Consider the following graph:
colleges = read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
library(ggplot2)
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point()
As we’ve previously discussed, there seems to be a moderate relationship between these variables, and it appears to be non-linear. We can add a smoother to the scatter plot to confirm this:
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point() + geom_smooth(method = 'loess', se = FALSE)
This graph uses locally weighted smoothing (LOESS), a method similar to a moving average, to summarize the trend in the graph.
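As an aside, geom_smooth() also accepts a span argument that controls how flexible the LOESS curve is (the default is 0.75, and smaller values produce a wigglier curve). You won't need this for the lab, but it can be handy when the default smoother looks too rigid or too noisy:
## A smaller span yields a more flexible LOESS curve (default is 0.75)
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point() + geom_smooth(method = 'loess', span = 0.5, se = FALSE)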
We can also use the geom_smooth() function to add a simple linear regression line to a scatter plot by changing the method argument from 'loess' to 'lm' (short for linear model):
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point() + geom_smooth(method = 'lm', se = FALSE)
This will be useful for visualizing the models covered in the latter portions of this lab.
At this point you should begin working independently with your assigned partner(s).
Correlation analysis is performed using the cor() function. The code below demonstrates how to find Pearson's correlation coefficient between two variables:
## Pearson's correlation
cor(x = colleges$Adm_Rate, y = colleges$ACT_median, method = "pearson")
## [1] -0.4339326
If we determined Spearman's rank correlation to be more appropriate for this relationship, we can easily find it using the method argument:
## Spearman's correlation
cor(x = colleges$Adm_Rate, y = colleges$ACT_median, method = "spearman")
## [1] -0.2199741
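One caveat worth mentioning: if either variable contains missing values, cor() returns NA by default. This particular data set is complete, so the lab's examples aren't affected, but the use argument tells cor() to drop incomplete observations when you do encounter missing data:
## Compute the correlation using only complete pairs of observations
cor(x = colleges$Adm_Rate, y = colleges$ACT_median, method = "pearson", use = "complete.obs")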
Question 1: Use the cor() function to find the correlation between Avg_Fac_Salary and Salary10yr_median and describe the relationship between these variables.
A useful feature of the cor() function is its ability to produce a correlation matrix, which is a table displaying correlation coefficients for all pairings of variables in a given data set.
However, because correlation can only be calculated for quantitative variables, we'll need to learn how to select a specific set of columns from our data using functions from the dplyr package:
library(dplyr)
## Select 3 demographic vars
college_demo_vars = colleges %>%
select(PercentFemale, PercentWhite, PercentBlack)
## Notice how 'college_demo_vars' only contains these 3 selected vars:
ncol(college_demo_vars)
## [1] 3
We can now pass this data frame to cor() using only the x argument:
cor(x = college_demo_vars, method = "pearson")
## PercentFemale PercentWhite PercentBlack
## PercentFemale 1.0000000 -0.2052631 0.1684161
## PercentWhite -0.2052631 1.0000000 -0.7808669
## PercentBlack 0.1684161 -0.7808669 1.0000000
Notice how we get back a 3 by 3 table that is symmetric about the diagonal. Each entry in this table is the correlation coefficient for a particular pair of variables; for example, the correlation between PercentWhite and PercentBlack is -0.78.
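As a side note, a larger correlation matrix can be difficult to scan with so many decimal places. Wrapping the call in base R's round() function (not something we'll need for this lab, but often convenient) makes it easier to read; the rounded values below follow directly from the matrix above:
## Round each correlation to 2 decimal places for readability
round(cor(x = college_demo_vars, method = "pearson"), 2)
##               PercentFemale PercentWhite PercentBlack
## PercentFemale          1.00        -0.21         0.17
## PercentWhite          -0.21         1.00        -0.78
## PercentBlack           0.17        -0.78         1.00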
Question 2: Use the select() function in the dplyr package to create a new data frame containing only the variables Cost, Net_Tuition, Debt_median, and Avg_Fac_Salary. Then use cor() to produce a correlation matrix for these variables.
In addition to correlation, another way to quantitatively describe the relationship between two quantitative variables is simple linear regression:
\(y = b_0 + b_1 x + e\)
Here the outcome variable, \(y\), is modeled by a straight line relationship with an explanatory variable, \(x\), with error represented by \(e\).
The model formula above is a theoretical framework; in practice, we need to estimate the model's coefficients, \(b_0\) and \(b_1\), from our data.
This is done by solving for the combination that minimizes the sum of the model's squared residuals, which are the observed errors: \(e_i = y_i - \hat{y}_i\), where \(\hat{y}_i\) is the predicted y-value (ie: the point on the regression line) for the \(i^{th}\) observation.
The diagram below illustrates this concept (albeit with some slightly different notation):
We’ll use the following notation to represent a fitted regression line (ie: the model estimated from our data that minimizes the sum of squared residuals): \[\hat{y} = \hat{b}_0 + \hat{b}_1 x\]
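For reference (you won't need to do this by hand, but it demystifies what the fitting procedure does), minimizing the sum of squared residuals yields closed-form solutions for these estimates: \[\hat{b}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \hspace{1cm} \hat{b}_0 = \bar{y} - \hat{b}_1\bar{x}\] Notice that the slope estimate is closely tied to Pearson's correlation: it equals \(r \cdot s_y/s_x\), where \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\).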
Below is a quick example of this in R using the colleges data set.
1. Write down the model to be estimated: \[\text{Debt_median} = b_0 + b_1*\text{Net_Tuition} + e\]
2. Fit the regression model using lm():
my_model = lm(Debt_median ~ Net_Tuition, data = colleges)
Here, Debt_median ~ Net_Tuition is an R formula. We can read it as “The variable Debt_median is modeled as a function of Net_Tuition”.
## Print coefficients
coef(my_model)
## (Intercept) Net_Tuition
## 1.406931e+04 2.170265e-01
Thus, the fitted regression equation is: \[\widehat{\text{Debt_median}} = 14069.31 + 0.217* \text{Net_Tuition}\]
So we can conclude that each 1-dollar increase in net tuition is associated with a 21.7-cent increase in the median debt of a college's graduates.
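Once the model is fit, we can also use it to make predictions with predict(), a base R function that this lab doesn't otherwise cover. For example, for a hypothetical college with a net tuition of $20,000:
## Predicted median debt at a net tuition of $20,000
predict(my_model, newdata = data.frame(Net_Tuition = 20000))
Based upon the fitted equation above, this should return roughly 14069.31 + 0.217*20000, or about $18,410.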
Question 3: Fit a simple linear regression model that predicts Salary10yr_median using the variable ACT_median. Then report the fitted regression equation and briefly describe the relationship between ACT_median and Salary10yr_median.
The coefficient of determination, or \(R^2\), describes how much of the total variation in the outcome variable is explained by the regression model.
Mathematically, total variation is defined as the sum of squared differences between each observation’s \(y\)-value and the mean of \(y\), or \(SST=\sum_{i=1}^{n}(y_i - \bar{y})^2\) where “SST” abbreviates the phrase “sum of squares total”.
From here, \(R^2\) is defined as: \[R^2 = \tfrac{SSR}{SST} = \tfrac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\] where “SSR” abbreviates the phrase “sum of squares due to regression”.
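To make this definition concrete, here's a quick sketch that computes \(R^2\) directly from the formula using the my_model object we fit earlier (fitted() is the base R function that extracts the \(\hat{y}_i\) values):
## Compute R-squared directly from its definition
y = colleges$Debt_median
SSR = sum((fitted(my_model) - mean(y))^2)
SST = sum((y - mean(y))^2)
SSR/SST
The result should match the “Multiple R-squared” value reported by the summary() function below.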
In R, the coefficient of determination is one of many quantities calculated by the summary() function (we'll discuss many of these at a later date):
my_model = lm(Debt_median ~ Net_Tuition, data = colleges)
summary(my_model)
##
## Call:
## lm(formula = Debt_median ~ Net_Tuition, data = colleges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11668.8 -2368.4 70.1 2424.7 10091.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.407e+04 2.265e+02 62.12 <2e-16 ***
## Net_Tuition 2.170e-01 1.462e-02 14.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3486 on 1093 degrees of freedom
## Multiple R-squared: 0.1678, Adjusted R-squared: 0.167
## F-statistic: 220.3 on 1 and 1093 DF, p-value: < 2.2e-16
Here we see that the \(R^2\) value (Multiple R-squared) is 0.1678, so we can say that about 16.8% of the variability in Debt_median across colleges is explained by the variable Net_Tuition.
Note that in addition to the multiple R-squared, the summary() function reports an adjusted R-squared, a related quantity that we'll discuss at a later date.
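If you'd like just the \(R^2\) value rather than the entire printout (which might be convenient for the next question), it can be extracted from the summary object:
## Extract only the R-squared value from the model summary
summary(my_model)$r.squared
This returns the same 0.1678 seen above, just with more decimal places.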
Question 4: Find the \(R^2\) value of the model you fit in Question 3. Provide a 1-sentence explanation of what this value tells you about the relationship between the variables in the model.
Pearson's correlation can only describe the strength of linear relationships, and Spearman's rank correlation the strength of monotonic ones; however, \(R^2\) can be used to describe any sort of relationship so long as we have a regression model that reflects the proper type of trend.
A simple example of this is polynomial regression, where we can fit a quadratic (degree 2 polynomial) or cubic (degree 3 polynomial) model to our data (rather than a straight line). This can be done directly inside the lm() function by using the poly() function within our model's formula:
my_model = lm(Debt_median ~ poly(Net_Tuition, degree = 2), data = colleges)
We can see what this model looks like using the visreg package:
# install.packages('visreg')
library(visreg)
visreg(my_model, xvar = 'Net_Tuition', data=colleges, gg=TRUE)
Something a bit tricky about the syntax of visreg() is that it expects you to provide variable names in quotations, and you'll get an “object error” if you don't do so. However, I'll never ask you to memorize this type of detail, so you can always just refer back to this example or use ?visreg to see the help documentation.
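One quick (informal) way to judge whether the quadratic term was worth adding is to compare the \(R^2\) values of the straight-line and polynomial fits:
## Compare R-squared for the linear and quadratic models
linear_model = lm(Debt_median ~ Net_Tuition, data = colleges)
quad_model = lm(Debt_median ~ poly(Net_Tuition, degree = 2), data = colleges)
summary(linear_model)$r.squared
summary(quad_model)$r.squared
Because the straight-line model is a special case of the quadratic one, the quadratic model's \(R^2\) can never be smaller; the question is whether the improvement is large enough to justify the extra flexibility.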
Question 5: