This lab introduces R and RStudio, as well as a few procedures we’ll use in future class sessions.
Directions (read before starting)
This lab will cover methods for analyzing two quantitative variables. The starting point in such an analysis should always be to graph the involved variables using a scatter plot, which is done using geom_point() in ggplot2:
## Load some example data
example_data = read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
## Scatter plot
library(ggplot2)
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point()
When looking at a scatter plot we want to judge the form, direction, and strength of the relationship, along with any unusual observations (outliers).
These attributes can be difficult to judge, but we can use smoothing to help assess form:
## Add loess smoother with error band
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
geom_smooth(method = "loess")
The "loess"
smoothing method essentially takes a locally
weighted average of the trend present in a certain region of the graph,
thereby allowing the trend line to have different slopes in different
locations.
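If you're curious how much local averaging is being done, geom_smooth() has a span argument that controls the width of the regions used by the "loess" method; a quick optional sketch (the value 0.3 below is an arbitrary choice for illustration):
## Smaller span values use narrower local regions, producing a wigglier smoother
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess", span = 0.3)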
We can plot a "loess" smoother on top of a linear regression line to judge whether a relationship can reasonably be considered “linear”:
## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
geom_smooth(method = "loess") +
geom_smooth(method = "lm", se = FALSE, color = "red")
In this example, we see that the regression line is largely within the error band of the "loess" smoother, suggesting it is reasonable to conclude that there is an approximately linear relationship between the variables “LifeExpectancy” and “Happiness”.
To provide a second example, the data below exhibit a non-linear relationship, as the regression line is noticeably above the "loess" smoother’s error bands for GDP per capita values less than $2000, and noticeably below them for values between $2000 and $10000.
## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(example_data, aes(x = GDPperCapita, y = Happiness)) + geom_point() +
geom_smooth(method = "loess") +
geom_smooth(method = "lm", se = FALSE, color = "red")
It might be more reasonable to describe this relationship as logarithmic (changing at a slowing rate). Other types of relationships you should be comfortable identifying are exponential (changing at an increasing rate), piecewise linear (two distinct slopes in different parts of the graph), and quadratic (parabolic or U-shaped).
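One informal way to check the logarithmic description is to redraw the scatter plot with GDP per capita on a logarithmic axis; if the description is reasonable, the trend should look roughly linear on this scale. A sketch of this optional check using scale_x_log10():
## Optional check: a logarithmic relationship should appear roughly linear on a log-scaled x-axis
ggplot(example_data, aes(x = GDPperCapita, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_log10()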
Throughout this lab you will work with data from 2024 on Grinnell College’s official peer group, a sample of 17 selective liberal arts colleges that are considered to represent the type of academic institution Grinnell sees itself as.
grinnell_peers = read.csv("https://remiller1450.github.io/data/grinnell_peers.csv")
In the lab’s introduction we saw how to judge the form of a relationship between two quantitative variables using a scatter plot. We can assess the strength and direction using correlation.
The example below demonstrates the cor() function, which is used to find the correlation coefficient relating two quantitative variables in the sample data:
cor(x = example_data$LifeExpectancy, y = example_data$Happiness, method = "pearson")
## [1] 0.8323819
For non-linear, monotonic relationships (such as exponential or logarithmic) we can change the method argument to calculate Spearman’s rank-based correlation:
cor(x = example_data$LifeExpectancy, y = example_data$Happiness, method = "spearman")
## [1] 0.8413194
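Since the relationship between GDP per capita and happiness appeared monotonic but non-linear, Spearman’s correlation would be a sensible summary for that pair as well. A sketch (the use = "complete.obs" argument, which is explained later in this lab, is included because GDPperCapita has missing values):
## Spearman's correlation for the monotonic, non-linear GDP per capita example
cor(x = example_data$GDPperCapita, y = example_data$Happiness, method = "spearman", use = "complete.obs")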
The cor.test() function uses the observed sample correlation to test the null hypothesis that the correlation in the broader population is zero, or \(H_0: \rho = 0\) in statistical notation:
cor.test(x = example_data$LifeExpectancy, y = example_data$Happiness, alternative = "two.sided", method = "pearson")
##
## Pearson's product-moment correlation
##
## data: example_data$LifeExpectancy and example_data$Happiness
## t = 17.835, df = 141, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7739865 0.8767380
## sample estimates:
## cor
## 0.8323819
Here we might conclude:
“There is overwhelming evidence (\(p\) < 0.001) of a strong, positive, linear relationship between the life expectancies and happiness scores of countries.”
Notice that this conclusion provides the strength of evidence (including the \(p\)-value), the strength, direction, and form of the relationship, and the variables involved (described in context).
You are not responsible for the specific details of this test (i.e., how to calculate the test statistic, what distribution is used, degrees of freedom, etc.); you are only responsible for knowing the null hypothesis (\(H_0: \rho=0\)) and how to interpret the results (such as the conclusion in the previous example). Nevertheless, you might notice from the R output that the underlying procedure used by cor.test() is a \(T\)-test that compares the sample correlation, \(r\), with the hypothesized value, 0, using a standardized test statistic.
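If you're curious, the t-value shown in this output can be reproduced directly from the sample correlation; the sketch below is purely illustrative and not something you'll be asked to do:
## Illustration only: cor.test()'s t-statistic is a standardized version of r with df = n - 2
r = cor(example_data$LifeExpectancy, example_data$Happiness)
n = sum(complete.cases(example_data$LifeExpectancy, example_data$Happiness))
r * sqrt(n - 2) / sqrt(1 - r^2)  ## approximately 17.8, matching the t-value reported above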
Something else to notice is that cor.test() gives us the sample correlation, so why would we ever use the cor() function? The answer is that cor() is useful for exploratory data analysis, or the initial steps of understanding the contents of a data set, because it can provide correlation matrices:
## Create a subset containing only numeric variables
library(dplyr)
example_data_quant_only = example_data %>% select(where(is.numeric))
cor(x = example_data_quant_only, method = "pearson")
## Region Happiness LifeExpectancy Footprint HLY
## Region 1.00000000 -0.38873868 -0.17287036 -0.1933923 -0.34928587
## Happiness -0.38873868 1.00000000 0.83238194 0.6435921 0.97440023
## LifeExpectancy -0.17287036 0.83238194 1.00000000 0.6272413 0.92450723
## Footprint -0.19339232 0.64359212 0.62724130 1.0000000 0.69322700
## HLY -0.34928587 0.97440023 0.92450723 0.6932270 1.00000000
## HPI -0.22590887 0.58999384 0.54337762 -0.1806185 0.55727149
## HPIRank 0.21939986 -0.55521562 -0.52066824 0.2100918 -0.52612709
## GDPperCapita NA NA NA NA NA
## HDI NA NA NA NA NA
## Population 0.09641703 0.04270338 0.01359268 -0.0686804 0.02377288
## HPI HPIRank GDPperCapita HDI Population
## Region -0.2259089 0.2193999 NA NA 0.09641703
## Happiness 0.5899938 -0.5552156 NA NA 0.04270338
## LifeExpectancy 0.5433776 -0.5206682 NA NA 0.01359268
## Footprint -0.1806185 0.2100918 NA NA -0.06868040
## HLY 0.5572715 -0.5261271 NA NA 0.02377288
## HPI 1.0000000 -0.9869268 NA NA 0.13781113
## HPIRank -0.9869268 1.0000000 NA NA -0.15832782
## GDPperCapita NA NA 1 NA NA
## HDI NA NA NA 1 NA
## Population 0.1378111 -0.1583278 NA NA 1.00000000
Each cell of the correlation matrix contains the pairwise correlation between the two variables represented by the corresponding row and column labels. This is why all of the diagonal elements of the matrix have a value of exactly 1, since every variable is perfectly correlated with itself.
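As a quick check of this, you can store the correlation matrix in an object (called cm below, an arbitrary name) and extract a single cell by its row and column names; the result should match the pairwise correlation computed earlier in the lab:
## Store the correlation matrix and extract one pairwise correlation by name
cm = cor(x = example_data_quant_only, method = "pearson")
cm["Happiness", "LifeExpectancy"]  ## same value as cor(example_data$Happiness, example_data$LifeExpectancy)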
You should also notice the NA values in the matrix, which are due to some variables not having data on every country. We can use the argument use = "complete.obs" to only use countries with complete data (this will filter the data to remove any rows with missing values in one or more variables); however, you should notice that this will affect the entire correlation matrix, not just the variables with missing data:
cor(x = example_data_quant_only, method = "pearson", use = "complete.obs")
## Region Happiness LifeExpectancy Footprint HLY
## Region 1.00000000 -0.39403612 -0.18329022 -0.19614183 -0.35715858
## Happiness -0.39403612 1.00000000 0.83342784 0.64329499 0.97482271
## LifeExpectancy -0.18329022 0.83342784 1.00000000 0.62661887 0.92457559
## Footprint -0.19614183 0.64329499 0.62661887 1.00000000 0.69236646
## HLY -0.35715858 0.97482271 0.92457559 0.69236646 1.00000000
## HPI -0.23185284 0.59024399 0.54409200 -0.18107454 0.55783015
## HPIRank 0.22555826 -0.55519491 -0.52064898 0.21120155 -0.52611546
## GDPperCapita -0.24143178 0.69768299 0.66620716 0.88482249 0.75209315
## HDI -0.14590423 0.83559974 0.92656383 0.73675814 0.90758362
## Population 0.09973282 0.04256389 0.01391005 -0.06962928 0.02364168
## HPI HPIRank GDPperCapita HDI Population
## Region -0.23185284 0.22555826 -0.24143178 -0.145904231 0.099732817
## Happiness 0.59024399 -0.55519491 0.69768299 0.835599738 0.042563894
## LifeExpectancy 0.54409200 -0.52064898 0.66620716 0.926563827 0.013910049
## Footprint -0.18107454 0.21120155 0.88482249 0.736758140 -0.069629281
## HLY 0.55783015 -0.52611546 0.75209315 0.907583624 0.023641676
## HPI 1.00000000 -0.98695406 -0.02162382 0.376950260 0.138471998
## HPIRank -0.98695406 1.00000000 0.04132563 -0.352355297 -0.158931586
## GDPperCapita -0.02162382 0.04132563 1.00000000 0.779706121 -0.042591186
## HDI 0.37695026 -0.35235530 0.77970612 1.000000000 -0.008187599
## Population 0.13847200 -0.15893159 -0.04259119 -0.008187599 1.000000000
For data sets containing a large number of quantitative variables it can be difficult to wade through the large number of correlations reported in the correlation matrix. In these circumstances it can be beneficial to visualize the correlation matrix and use aesthetics like color and position to help understand the relationships between variables in the data set. The corrplot library and corrplot() function are demonstrated below:
# install.packages("corrplot")
library(corrplot)
## Store the correlation matrix
my_corr_matrix = cor(x = example_data_quant_only, method = "pearson", use = "complete.obs")
## Plot using corrplot
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order="hclust")
The argument order="hclust" applies hierarchical clustering to rearrange the ordering of variables in the matrix such that clusters appear together. This allows us to more easily locate blocks of variables that are all highly correlated with each other in the same way. For example, “Footprint”, “GDPperCapita”, “Happiness”, “HLY”, “LifeExpectancy”, and “HDI” form a cluster of positively correlated variables.
Question #1: For this question you should use the grinnell_peers data introduced earlier in the lab.
Part A: Create a scatter plot displaying the relationship between Adm_Rate (the college’s admissions rate) and Med_Grad_Debt (the median debt of the college’s students upon graduating).
Part B: Calculate the correlation between Adm_Rate and Med_Grad_Debt and perform a hypothesis test evaluating whether the data provide evidence that these variables are related. Your final answer should be a one-sentence conclusion like the example given in this section. Be sure to carefully choose between Pearson’s and Spearman’s correlation depending upon whether you determined the relationship to be linear or non-linear.
Question #2: For this question you should continue using the grinnell_peers data set.
Part A: Visualize the correlations between the quantitative variables in this data set using the corrplot() function.
Part B: According to this visualization, what is the correlation between P_Female and Ret_Rate (the college’s retention rate)?
Part C: Create a scatter plot of P_Female and Ret_Rate. Based upon this plot, briefly comment upon why it is problematic to rely upon correlation results without having first graphed the data.
Part D: Perform a hypothesis test evaluating the relationship between P_Female and Ret_Rate at colleges like Grinnell. Provide a one-sentence conclusion modeled after the example in this section.
Recall from our lecture slides that simple linear regression relates a quantitative explanatory variable, \(X\), with a quantitative outcome variable, \(Y\), via the model: \[Y = \beta_0 + \beta_1X + \epsilon\]
The above model is theoretical (sometimes described as the “population model”) and in practice we estimate \(\beta_0\) and \(\beta_1\) using our data, denoting these estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), respectively.
We can use the lm()
function to find these estimates
using our sample data, thereby allowing us to write out the equation of
the fitted regression line:
## Fit the model
fitted_model = lm(LifeExpectancy ~ Happiness, data = example_data)
## Print the coefficient estimates
coef(fitted_model)
## (Intercept) Happiness
## 28.223143 6.693042
In this example, \(\hat{\beta}_0 = 28.22\) and \(\hat{\beta}_1 = 6.69\), so we estimate the life expectancy of a country whose citizens have a happiness score of zero to be 28.22 (which might be meaningless extrapolation), and we expect each 1-point increase in a country’s happiness to increase its expected life expectancy by 6.69 years.
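To see what these estimates imply, we can plug a happiness score into the fitted equation, either by hand or using the predict() function; the score of 6 below is just an arbitrary illustration:
## Predicted life expectancy for a hypothetical country with a happiness score of 6
## (equivalent to computing 28.223143 + 6.693042*6, or roughly 68.4 years)
predict(fitted_model, newdata = data.frame(Happiness = 6))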
In a regression analysis we might also consider how much of the variation in the outcome is explained by the model (the \(R^2\) value), and whether the data provide evidence that the explanatory variable is related to the outcome (a hypothesis test of \(H_0: \beta_1 = 0\)). We can do both of these by providing the fitted model to the summary() function:
## Fit the model
fitted_model = lm(LifeExpectancy ~ Happiness, data = example_data)
## Give the model object to summary()
summary(fitted_model)
##
## Call:
## lm(formula = LifeExpectancy ~ Happiness, data = example_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5032 -3.6853 -0.1982 4.3488 13.6968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.2231 2.2799 12.38 <2e-16 ***
## Happiness 6.6930 0.3753 17.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.141 on 141 degrees of freedom
## Multiple R-squared: 0.6929, Adjusted R-squared: 0.6907
## F-statistic: 318.1 on 1 and 141 DF, p-value: < 2.2e-16
Here \(R^2 = 0.6929\), and we see strong evidence against \(H_0: \beta_1 = 0\) (\(p < 2.2 \cdot 10^{-16}\)).
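If you'd rather work with these quantities as numbers instead of reading them off the printed output, they can be extracted from the summary object; a brief sketch:
## Extract R-squared and the coefficient table (estimates, standard errors, t-values, p-values)
model_summary = summary(fitted_model)
model_summary$r.squared
model_summary$coefficients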
You might also notice that summary() is using a \(T\)-test to determine the \(p\)-value. Like any \(T\)-test, this comes with an underlying set of assumptions that should be checked to ensure the test results are trustworthy. For regression we must check that the model’s errors are independent and Normally distributed. This can be done by giving the fitted model to the plot() function:
## Check independence
plot(fitted_model, which = 1)
## Check Normality
plot(fitted_model, which = 2)
The first diagnostic plot shows the model’s residuals (observed prediction errors) plotted against its predicted values. If the errors are independent we expect this graph to show random scatter. Thus, we are looking for the smoothing line (shown in red) to stay near zero throughout the entire range of the x-axis, and we are also looking for the vertical spread of the points to stay roughly constant throughout this range (constant variance).
The second diagnostic plot standardizes the model’s residuals (i.e., applies the Z-score transformation), then finds the percentile of each standardized residual and compares it with the corresponding value from the Normal distribution. If the observed residuals align perfectly with a Normal distribution we expect these points to fall on a 45-degree line. Deviation from the 45-degree line indicates skew.
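If it helps to see what these built-in plots are constructed from, the same checks can be assembled by hand using residuals() and fitted() along with base R's Normal quantile-quantile functions; a sketch:
## Residuals vs. fitted values (look for random scatter around zero)
plot(fitted(fitted_model), residuals(fitted_model))
abline(h = 0)
## Normal quantile-quantile plot of the residuals (look for points near the 45-degree line)
qqnorm(residuals(fitted_model))
qqline(residuals(fitted_model))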
Question #3: For this question you should use the grinnell_peers data introduced earlier in the lab. The guiding question in this analysis is whether Grinnell’s peer institutions appear to reward their faculty for the economic success of the school’s graduates.
Part A: Create a scatter plot of Med_Grad_Earnings_4Y and Avg_Fac_Salary and briefly describe how these variables are related.
Part B: Fit a simple linear regression model that uses Med_Grad_Earnings_4Y to predict Avg_Fac_Salary. Report the estimated slope and intercept of the fitted model.
Part C: Use the summary() function to perform a hypothesis test investigating whether the median earnings of a college’s graduates and the average salary of its faculty are unrelated. Provide a one-sentence conclusion that summarizes the results of the test (including the \(p\)-value) and includes an interpretation of the estimated slope coefficient.
Part D: Use the plot() function to evaluate the assumptions of the \(T\)-test you used in Part C. Briefly explain whether or not you believe the test to be appropriate in this situation based upon what you see in the two diagnostic plots you created.
Part E: Report the \(R^2\) value given by the summary() function in Part C. Based upon this value, do you believe that Med_Grad_Earnings_4Y explains a high amount of the observed variation in faculty salaries, explains a moderate amount of variation in faculty salaries, or explains little variation in faculty salaries?
Question #4: In this question you should continue using the grinnell_peers data from earlier in the lab. The guiding question in this analysis is whether the percentage of female students is related to a school’s average faculty salary.
Part A: Fit a simple linear regression model that uses P_Female to predict Avg_Fac_Salary, then use the summary() function to assess whether the data provide statistical evidence of a relationship between the percentage of female students and the college’s average faculty salary. Report your \(p\)-value and a brief conclusion.
Part B: Create a scatter plot of P_Female and Avg_Fac_Salary and notice the outlier, Smith College, that is nearly 100% female. How do you think this data-point is influencing the slope of the regression model you fit in Part A?
Part C: Use the filter() function in the dplyr library to create a subset of data that doesn’t contain Smith College (you may use the logical condition Institution != "Smith College" when filtering). Next, refit the regression model described in Part A to this subset. How did the estimated slope change after Smith College was removed?
After Exam 1, we’ll introduce the course project, but as a preparatory step I’d like you to propose and pursue your own research questions using one of the following data sets:
The grinnell_peers data set used throughout this lab - available at: https://remiller1450.github.io/data/grinnell_peers.csv
The HollywoodMovies data set - available at: https://remiller1450.github.io/data/HollywoodMovies.csv
After choosing your data set you are to propose three research questions that fit the following guidelines:
(you may need to use filter() to get the appropriate groups)
Your research questions need to involve variables that either already exist or can be created in the data set that you selected. For each research question you are to provide:
To reiterate, you should provide these components for all three research questions you developed.