Directions (read before starting)
Finding correlations in R is straightforward and only involves a single function, cor(). This lab will also provide a short introduction to regression, a topic that we'll continue to explore over the next several classes.
Before we begin, I wanted to briefly discuss the geom_smooth() function in the ggplot2 data visualization package and how it can help us make decisions during a correlation or regression analysis.
Consider the following graph:
colleges = read.csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
library(ggplot2)
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point()
As we’ve previously discussed, there seems to be a moderate relationship between these variables, and it appears to be non-linear. We can add a smoother to the scatter plot to confirm this:
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point() + geom_smooth(method = 'loess', se = FALSE)
This graph uses locally weighted smoothing (LOESS), a method similar to a moving average, to summarize the trend in the graph.
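As an aside, geom_smooth() also accepts a span argument that controls how flexible the LOESS curve is (the default is 0.75, and smaller values produce a wigglier curve). You won't need this for the lab, but it can be handy when the default smoother looks too rigid or too noisy:
## A smaller span yields a more flexible LOESS curve (default is 0.75)
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point() + geom_smooth(method = 'loess', span = 0.5, se = FALSE)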
We can also use the geom_smooth() function to add a simple linear regression line to a scatter plot by changing the method argument from 'loess' to 'lm' (short for linear model):
ggplot(colleges, aes(x = Cost, y = Salary10yr_median)) + geom_point() + geom_smooth(method = 'lm', se = FALSE)
This will be useful for visualizing the models covered in the latter portions of this lab.
At this point you should begin working independently with your assigned partner(s).
Correlation analysis is performed using the cor() function. The code below demonstrates how to find Pearson's correlation coefficient between two variables:
## Pearson's correlation
cor(x = colleges$Adm_Rate, y = colleges$ACT_median, method = "pearson")
## [1] -0.4339326
If we determined Spearman's rank correlation to be more appropriate for this relationship, we can easily find it using the method argument:
## Spearman's correlation
cor(x = colleges$Adm_Rate, y = colleges$ACT_median, method = "spearman")
## [1] -0.2199741
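One caveat worth mentioning: if either variable contains missing values, cor() returns NA by default. This particular data set is complete, so the lab's examples aren't affected, but the use argument tells cor() to drop incomplete observations when you do encounter missing data:
## Compute the correlation using only complete pairs of observations
cor(x = colleges$Adm_Rate, y = colleges$ACT_median, method = "pearson", use = "complete.obs")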
Question 1: Use the cor() function to find the correlation between Avg_Fac_Salary and Salary10yr_median and describe the relationship between these variables.
A useful feature of the cor() function is its ability to produce a correlation matrix, which is a table displaying correlation coefficients for all pairings of variables in a given data set.
However, because correlation can only be calculated for quantitative variables, we'll need to learn how to select a specific set of columns from our data using functions from the dplyr package:
library(dplyr)
## Select 3 demographic vars
college_demo_vars = colleges %>%
select(PercentFemale, PercentWhite, PercentBlack)
## Notice how 'college_demo_vars' only contains these 3 selected vars:
ncol(college_demo_vars)
## [1] 3
We can now pass this data frame to cor() using only the x argument:
cor(x = college_demo_vars, method = "pearson")
## PercentFemale PercentWhite PercentBlack
## PercentFemale 1.0000000 -0.2052631 0.1684161
## PercentWhite -0.2052631 1.0000000 -0.7808669
## PercentBlack 0.1684161 -0.7808669 1.0000000
Notice how we get back a 3 by 3 table that is symmetric about the diagonal. Each entry in this table is the correlation coefficient for a particular pair of variables; for example, the correlation between PercentWhite and PercentBlack is -0.78.
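As a side note, a larger correlation matrix can be difficult to scan with so many decimal places. Wrapping the call in base R's round() function (not something we'll need for this lab, but often convenient) makes it easier to read; the rounded values below follow directly from the matrix above:
## Round each correlation to 2 decimal places for readability
round(cor(x = college_demo_vars, method = "pearson"), 2)
##               PercentFemale PercentWhite PercentBlack
## PercentFemale          1.00        -0.21         0.17
## PercentWhite          -0.21         1.00        -0.78
## PercentBlack           0.17        -0.78         1.00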
Question 2: Use the select() function in the dplyr package to create a new data frame containing only the variables Cost, Net_Tuition, Debt_median, and Avg_Fac_Salary. Then use cor() to produce a correlation matrix for these variables.
In addition to correlation, another way to quantitatively describe the relationship between two quantitative variables is simple linear regression:
\(y = b_0 + b_1 x + e\)
Here the outcome variable, \(y\), is modeled by a straight line relationship with an explanatory variable, \(x\), with error represented by \(e\).
The model formula above is a theoretical framework; in practice, we need to estimate the model's coefficients, \(b_0\) and \(b_1\), from our data.
This is done by solving for the combination that minimizes the sum of the model's squared residuals, which are the observed errors: \(e_i = y_i - \hat{y}_i\), where \(\hat{y}_i\) is the predicted y-value (ie: the point on the regression line) for the \(i^{th}\) observation.
The diagram below illustrates this concept (albeit with some slightly different notation):
We’ll use the following notation to represent a fitted regression line (ie: the model estimated from our data that minimizes the sum of squared residuals): \[\hat{y} = \hat{b}_0 + \hat{b}_1 x\]
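For reference (you won't need to do this by hand, but it demystifies what the fitting procedure does), minimizing the sum of squared residuals yields closed-form solutions for these estimates: \[\hat{b}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \hspace{1cm} \hat{b}_0 = \bar{y} - \hat{b}_1\bar{x}\] Notice that the slope estimate is closely tied to Pearson's correlation: it equals \(r \cdot s_y/s_x\), where \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\).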
Below is a quick example of this in R using the colleges data set.
1. Write down the model to be estimated: \[\text{Debt_median} = b_0 + b_1*\text{Net_Tuition} + e\]
2. Fit the regression model using lm():
my_model = lm(Debt_median ~ Net_Tuition, data = colleges)
Here, Debt_median ~ Net_Tuition is an R formula. We can read it as “The variable Debt_median is modeled as a function of Net_Tuition”.
## Print coefficients
coef(my_model)
## (Intercept) Net_Tuition
## 1.406931e+04 2.170265e-01
Thus, the fitted regression equation is: \[\widehat{\text{Debt_median}} = 14069.31 + 0.217* \text{Net_Tuition}\]
So we can conclude that each 1-dollar increase in net tuition is associated with a 21.7-cent increase in the median debt of a college's graduates.
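Once the model is fit, we can also use it to make predictions with predict(), a base R function that this lab doesn't otherwise cover. For example, for a hypothetical college with a net tuition of $20,000:
## Predicted median debt at a net tuition of $20,000
predict(my_model, newdata = data.frame(Net_Tuition = 20000))
Based upon the fitted equation above, this should return roughly 14069.31 + 0.217*20000, or about $18,410.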
Question 3: Fit a simple linear regression model that predicts Salary10yr_median using the variable ACT_median. Then report the fitted regression equation and briefly describe the relationship between ACT_median and Salary10yr_median.
The coefficient of determination, or \(R^2\), describes how much of the total variation in the outcome variable is explained by the regression model.
Mathematically, total variation is defined as the sum of squared differences between each observation’s \(y\)-value and the mean of \(y\), or \(SST=\sum_{i=1}^{n}(y_i - \bar{y})^2\) where “SST” abbreviates the phrase “sum of squares total”.
From here, \(R^2\) is defined as: \[R^2 = \tfrac{SSR}{SST} = \tfrac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\] where “SSR” abbreviates the phrase “sum of squares due to regression”.
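To make this definition concrete, here's a quick sketch that computes \(R^2\) directly from the formula using the my_model object we fit earlier (fitted() is the base R function that extracts the \(\hat{y}_i\) values):
## Compute R-squared directly from its definition
y = colleges$Debt_median
SSR = sum((fitted(my_model) - mean(y))^2)
SST = sum((y - mean(y))^2)
SSR/SST
The result should match the “Multiple R-squared” value reported by the summary() function below.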
In R, the coefficient of determination is one of many quantities calculated by the summary() function (we'll discuss many of these at a later date):
my_model = lm(Debt_median ~ Net_Tuition, data = colleges)
summary(my_model)
##
## Call:
## lm(formula = Debt_median ~ Net_Tuition, data = colleges)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11668.8 -2368.4 70.1 2424.7 10091.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.407e+04 2.265e+02 62.12 <2e-16 ***
## Net_Tuition 2.170e-01 1.462e-02 14.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3486 on 1093 degrees of freedom
## Multiple R-squared: 0.1678, Adjusted R-squared: 0.167
## F-statistic: 220.3 on 1 and 1093 DF, p-value: < 2.2e-16
Here we see that the \(R^2\) value (Multiple R-squared) is 0.1678, so we can say that about 16.8% of the variability in Debt_median across colleges is explained by the variable Net_Tuition.
Note that in addition to the multiple R-squared, the summary() function reports an adjusted R-squared, a related quantity that we'll discuss at a later date.
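If you'd like just the \(R^2\) value rather than the entire printout (which might be convenient for the next question), it can be extracted from the summary object:
## Extract only the R-squared value from the model summary
summary(my_model)$r.squared
This returns the same 0.1678 seen above, just with more decimal places.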
Question 4: Find the \(R^2\) value of the model you fit in Question 3. Provide a 1-sentence explanation of what this value tells you about the relationship between the variables in the model.
Pearson's correlation can only describe the strength of linear relationships, and Spearman's rank correlation the strength of monotonic ones; however, \(R^2\) can be used to describe any sort of relationship so long as we have a regression model that reflects the proper type of trend.
A simple example of this is polynomial regression, where we can fit a quadratic (degree 2 polynomial) or cubic (degree 3 polynomial) model to our data (rather than a straight line). This can be done directly inside the lm() function by using the poly() function within our model's formula:
my_model = lm(Debt_median ~ poly(Net_Tuition, degree = 2), data = colleges)
We can see what this model looks like using the visreg package:
# install.packages('visreg')
library(visreg)
visreg(my_model, xvar = 'Net_Tuition', data=colleges, gg=TRUE)
Something a bit tricky about the syntax of visreg() is that it expects you to provide variable names in quotations, and you'll get an “object error” if you don't do so. However, I'll never ask you to memorize this type of detail, so you can always just refer back to this example or use ?visreg to see the help documentation.
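One quick (informal) way to judge whether the quadratic term was worth adding is to compare the \(R^2\) values of the straight-line and polynomial fits:
## Compare R-squared for the linear and quadratic models
linear_model = lm(Debt_median ~ Net_Tuition, data = colleges)
quad_model = lm(Debt_median ~ poly(Net_Tuition, degree = 2), data = colleges)
summary(linear_model)$r.squared
summary(quad_model)$r.squared
Because the straight-line model is a special case of the quadratic one, the quadratic model's \(R^2\) can never be smaller; the question is whether the improvement is large enough to justify the extra flexibility.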
Question 5: