This lab introduces R and RStudio, as well as a few procedures we’ll use in future class sessions.
Directions (read before starting)
This lab will cover methods for analyzing two quantitative variables. The starting point in such an analysis should always be to graph the involved variables using a scatter plot, which is done using geom_point() in ggplot2:
## Load some example data
example_data = read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
## Scatter plot
library(ggplot2)
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point()
When looking at a scatter plot we want to judge the form, direction, and strength of the relationship, along with any unusual observations (outliers).
These attributes can be difficult to judge, but we can use smoothing to help assess form:
## Add loess smoother with error band
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
geom_smooth(method = "loess")
The "loess"
smoothing method essentially takes a locally
weighted average of the trend present in a certain region of the graph,
thereby allowing the trend line to have different slopes in different
locations.
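If you're curious how much local averaging is being done, geom_smooth() has a span argument that controls the width of the regions used by the "loess" method; a quick optional sketch (the value 0.3 below is an arbitrary choice for illustration):
## Smaller span values use narrower local regions, producing a wigglier smoother
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess", span = 0.3)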
We can plot a "loess" smoother on top of a linear regression line to judge whether a relationship can reasonably be considered “linear”:
## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(example_data, aes(x = LifeExpectancy, y = Happiness)) + geom_point() +
geom_smooth(method = "loess") +
geom_smooth(method = "lm", se = FALSE, color = "red")
In this example, we see that the regression line is largely within the error band of the "loess" smoother, suggesting it is reasonable to conclude that there is an approximately linear relationship between the variables “LifeExpectancy” and “Happiness”.
To provide a second example, the data below exhibit a non-linear relationship, as the regression line is noticeably above the "loess" smoother’s error bands for GDP per capita values less than $2000, and noticeably below them for values between $2000 and $10000.
## Combine loess w/ error bands and linear regression line (shown in red)
ggplot(example_data, aes(x = GDPperCapita, y = Happiness)) + geom_point() +
geom_smooth(method = "loess") +
geom_smooth(method = "lm", se = FALSE, color = "red")
It might be more reasonable to describe this relationship as logarithmic (changing at a slowing rate). Other types of relationships you should be comfortable identifying are exponential (changing at an increasing rate), piecewise linear (two distinct slopes in different parts of the graph), and quadratic (parabolic or U-shaped).
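One informal way to check the logarithmic description is to redraw the scatter plot with GDP per capita on a logarithmic axis; if the description is reasonable, the trend should look roughly linear on this scale. A sketch of this optional check using scale_x_log10():
## Optional check: a logarithmic relationship should appear roughly linear on a log-scaled x-axis
ggplot(example_data, aes(x = GDPperCapita, y = Happiness)) + geom_point() +
  geom_smooth(method = "loess") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  scale_x_log10()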
Throughout this lab you will work with data from 2024 on Grinnell College’s official peer group, a sample of 17 selective liberal arts colleges that are considered to represent the type of academic institution Grinnell sees itself as.
grinnell_peers = read.csv("https://remiller1450.github.io/data/grinnell_peers.csv")
In the lab’s introduction we saw how to judge the form of a relationship between two quantitative variables using a scatter plot. We can assess the strength and direction using correlation.
The example below demonstrates the cor() function, which is used to find the correlation coefficient relating two quantitative variables in the sample data:
cor(x = example_data$LifeExpectancy, y = example_data$Happiness, method = "pearson")
## [1] 0.8323819
For non-linear, monotonic relationships (such as exponential or logarithmic) we can change the method argument to calculate Spearman’s rank-based correlation:
cor(x = example_data$LifeExpectancy, y = example_data$Happiness, method = "spearman")
## [1] 0.8413194
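Since the relationship between GDP per capita and happiness appeared monotonic but non-linear, Spearman’s correlation would be a sensible summary for that pair as well. A sketch (the use = "complete.obs" argument, which is explained later in this lab, is included because GDPperCapita has missing values):
## Spearman's correlation for the monotonic, non-linear GDP per capita example
cor(x = example_data$GDPperCapita, y = example_data$Happiness, method = "spearman", use = "complete.obs")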
The cor.test() function uses the observed sample correlation to test the null hypothesis that the correlation in the broader population is zero, or \(H_0: \rho = 0\) in statistical notation:
cor.test(x = example_data$LifeExpectancy, y = example_data$Happiness, alternative = "two.sided", method = "pearson")
##
## Pearson's product-moment correlation
##
## data: example_data$LifeExpectancy and example_data$Happiness
## t = 17.835, df = 141, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7739865 0.8767380
## sample estimates:
## cor
## 0.8323819
Here we might conclude:
“There is overwhelming evidence (\(p\) < 0.001) of a strong, positive, linear relationship between the life expectancies and happiness scores of countries.”
Notice that this conclusion provides the strength of evidence (including the \(p\)-value), the strength, direction, and form of the relationship, and the variables involved (described in context).
You are not responsible for the specific details of this test (i.e., how to calculate the test statistic, what distribution is used, degrees of freedom, etc.); you are only responsible for knowing the null hypothesis (\(H_0: \rho=0\)) and how to interpret the results (such as the conclusion in the previous example). Nevertheless, you might notice from the R output that the underlying procedure used by cor.test() is a \(T\)-test that compares the sample correlation, \(r\), with the hypothesized value, 0, using a standardized test statistic.
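If you're curious, the t-value shown in this output can be reproduced directly from the sample correlation; the sketch below is purely illustrative and not something you'll be asked to do:
## Illustration only: cor.test()'s t-statistic is a standardized version of r with df = n - 2
r = cor(example_data$LifeExpectancy, example_data$Happiness)
n = sum(complete.cases(example_data$LifeExpectancy, example_data$Happiness))
r * sqrt(n - 2) / sqrt(1 - r^2)  ## approximately 17.8, matching the t-value reported above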
Something else to notice is that cor.test() gives us the sample correlation, so why would we ever use the cor() function? The answer is that cor() is useful for exploratory data analysis, or the initial steps of understanding the contents of a data set, because it can provide correlation matrices:
## Create a subset containing only numeric variables
library(dplyr)
example_data_quant_only = example_data %>% select(where(is.numeric))
cor(x = example_data_quant_only, method = "pearson")
## Region Happiness LifeExpectancy Footprint HLY
## Region 1.00000000 -0.38873868 -0.17287036 -0.1933923 -0.34928587
## Happiness -0.38873868 1.00000000 0.83238194 0.6435921 0.97440023
## LifeExpectancy -0.17287036 0.83238194 1.00000000 0.6272413 0.92450723
## Footprint -0.19339232 0.64359212 0.62724130 1.0000000 0.69322700
## HLY -0.34928587 0.97440023 0.92450723 0.6932270 1.00000000
## HPI -0.22590887 0.58999384 0.54337762 -0.1806185 0.55727149
## HPIRank 0.21939986 -0.55521562 -0.52066824 0.2100918 -0.52612709
## GDPperCapita NA NA NA NA NA
## HDI NA NA NA NA NA
## Population 0.09641703 0.04270338 0.01359268 -0.0686804 0.02377288
## HPI HPIRank GDPperCapita HDI Population
## Region -0.2259089 0.2193999 NA NA 0.09641703
## Happiness 0.5899938 -0.5552156 NA NA 0.04270338
## LifeExpectancy 0.5433776 -0.5206682 NA NA 0.01359268
## Footprint -0.1806185 0.2100918 NA NA -0.06868040
## HLY 0.5572715 -0.5261271 NA NA 0.02377288
## HPI 1.0000000 -0.9869268 NA NA 0.13781113
## HPIRank -0.9869268 1.0000000 NA NA -0.15832782
## GDPperCapita NA NA 1 NA NA
## HDI NA NA NA 1 NA
## Population 0.1378111 -0.1583278 NA NA 1.00000000
Each cell of the correlation matrix contains the pairwise correlation between the two variables represented by the corresponding row and column labels. This is why all of the diagonal elements of the matrix have a value of exactly 1, since every variable is perfectly correlated with itself.
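As a quick check of this, you can store the correlation matrix in an object (called cm below, an arbitrary name) and extract a single cell by its row and column names; the result should match the pairwise correlation computed earlier in the lab:
## Store the correlation matrix and extract one pairwise correlation by name
cm = cor(x = example_data_quant_only, method = "pearson")
cm["Happiness", "LifeExpectancy"]  ## same value as cor(example_data$Happiness, example_data$LifeExpectancy)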
You should also notice the NA values in the matrix, which are due to some variables not having data on every country. We can use the argument use = "complete.obs" to only use countries with complete data (this will filter the data to remove any rows with missing values in one or more variables); however, you should notice that this will affect the entire correlation matrix, not just the variables with missing data:
cor(x = example_data_quant_only, method = "pearson", use = "complete.obs")
## Region Happiness LifeExpectancy Footprint HLY
## Region 1.00000000 -0.39403612 -0.18329022 -0.19614183 -0.35715858
## Happiness -0.39403612 1.00000000 0.83342784 0.64329499 0.97482271
## LifeExpectancy -0.18329022 0.83342784 1.00000000 0.62661887 0.92457559
## Footprint -0.19614183 0.64329499 0.62661887 1.00000000 0.69236646
## HLY -0.35715858 0.97482271 0.92457559 0.69236646 1.00000000
## HPI -0.23185284 0.59024399 0.54409200 -0.18107454 0.55783015
## HPIRank 0.22555826 -0.55519491 -0.52064898 0.21120155 -0.52611546
## GDPperCapita -0.24143178 0.69768299 0.66620716 0.88482249 0.75209315
## HDI -0.14590423 0.83559974 0.92656383 0.73675814 0.90758362
## Population 0.09973282 0.04256389 0.01391005 -0.06962928 0.02364168
## HPI HPIRank GDPperCapita HDI Population
## Region -0.23185284 0.22555826 -0.24143178 -0.145904231 0.099732817
## Happiness 0.59024399 -0.55519491 0.69768299 0.835599738 0.042563894
## LifeExpectancy 0.54409200 -0.52064898 0.66620716 0.926563827 0.013910049
## Footprint -0.18107454 0.21120155 0.88482249 0.736758140 -0.069629281
## HLY 0.55783015 -0.52611546 0.75209315 0.907583624 0.023641676
## HPI 1.00000000 -0.98695406 -0.02162382 0.376950260 0.138471998
## HPIRank -0.98695406 1.00000000 0.04132563 -0.352355297 -0.158931586
## GDPperCapita -0.02162382 0.04132563 1.00000000 0.779706121 -0.042591186
## HDI 0.37695026 -0.35235530 0.77970612 1.000000000 -0.008187599
## Population 0.13847200 -0.15893159 -0.04259119 -0.008187599 1.000000000
For data sets containing a large number of quantitative variables it can be difficult to wade through the large number of correlations reported in the correlation matrix. In these circumstances it can be beneficial to visualize the correlation matrix and use aesthetics like color and position to help understand the relationships between variables in the data set. The corrplot library and corrplot() function are demonstrated below:
# install.packages("corrplot")
library(corrplot)
## Store the correlation matrix
my_corr_matrix = cor(x = example_data_quant_only, method = "pearson", use = "complete.obs")
## Plot using corrplot
corrplot(my_corr_matrix, method = "color", addCoef.col = "black", order="hclust")
The argument order="hclust" applies hierarchical clustering to rearrange the ordering of variables in the matrix such that clusters appear together. This allows us to more easily locate blocks of variables that are all highly correlated with each other in the same way. For example, “Footprint”, “GDPperCapita”, “Happiness”, “HLY”, “LifeExpectancy”, and “HDI” form a cluster of positively correlated variables.
Question #1: For this question you should use the grinnell_peers data introduced earlier in the lab.
Part A: Create a scatter plot displaying the relationship between Adm_Rate (the college’s admissions rate) and Med_Grad_Debt (the median debt of the college’s students upon graduating).
Part B: Calculate the correlation between Adm_Rate and Med_Grad_Debt and perform a hypothesis test evaluating whether the data provide evidence that these variables are related. Your final answer should be a one-sentence conclusion like the example given in this section. Be sure to carefully choose between Pearson’s and Spearman’s correlation depending upon whether you determined the relationship to be linear or non-linear.
Question #2: For this question you should continue using the grinnell_peers data set.
Part A: Visualize the correlations between the quantitative variables in this data set using the corrplot() function.
Part B: According to this visualization, what is the correlation between P_Female and Ret_Rate (the college’s retention rate)?
Part C: Create a scatter plot of P_Female and Ret_Rate. Based upon this plot, briefly comment upon why it is problematic to rely upon correlation results without having first graphed the data.
Part D: Perform a hypothesis test evaluating the relationship between P_Female and Ret_Rate at colleges like Grinnell. Provide a one-sentence conclusion modeled after the example in this section.
Recall from our lecture slides that simple linear regression relates a quantitative explanatory variable, \(X\), with a quantitative outcome variable, \(Y\), via the model: \[Y = \beta_0 + \beta_1X + \epsilon\]
The above model is theoretical (sometimes described as the “population model”) and in practice we estimate \(\beta_0\) and \(\beta_1\) using our data, denoting these estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\), respectively.
We can use the lm()
function to find these estimates
using our sample data, thereby allowing us to write out the equation of
the fitted regression line:
## Fit the model
fitted_model = lm(LifeExpectancy ~ Happiness, data = example_data)
## Print the coefficient estimates
coef(fitted_model)
## (Intercept) Happiness
## 28.223143 6.693042
In this example, \(\hat{\beta}_0 = 28.22\) and \(\hat{\beta}_1 = 6.69\), so we estimate the life expectancy of a country whose citizens have a happiness score of zero to be 28.22 (which might be meaningless extrapolation), and we expect each 1-point increase in a country’s happiness to increase its expected life expectancy by 6.69 years.
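To see what these estimates imply, we can plug a happiness score into the fitted equation, either by hand or using the predict() function; the score of 6 below is just an arbitrary illustration:
## Predicted life expectancy for a hypothetical country with a happiness score of 6
## (equivalent to computing 28.223143 + 6.693042*6, or roughly 68.4 years)
predict(fitted_model, newdata = data.frame(Happiness = 6))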
In a regression analysis we might also consider how much of the variation in the outcome is explained by the model (the \(R^2\) value), and whether the data provide evidence that the explanatory variable is related to the outcome (a hypothesis test of \(H_0: \beta_1 = 0\)). We can do both of these by providing the fitted model to the summary() function:
## Fit the model
fitted_model = lm(LifeExpectancy ~ Happiness, data = example_data)
## Give the model object to summary()
summary(fitted_model)
##
## Call:
## lm(formula = LifeExpectancy ~ Happiness, data = example_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5032 -3.6853 -0.1982 4.3488 13.6968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.2231 2.2799 12.38 <2e-16 ***
## Happiness 6.6930 0.3753 17.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.141 on 141 degrees of freedom
## Multiple R-squared: 0.6929, Adjusted R-squared: 0.6907
## F-statistic: 318.1 on 1 and 141 DF, p-value: < 2.2e-16
Here \(R^2 = 0.6929\), and we see strong evidence against \(H_0: \beta_1 = 0\) (\(p < 2.2 \cdot 10^{-16}\)).
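If you'd rather work with these quantities as numbers instead of reading them off the printed output, they can be extracted from the summary object; a brief sketch:
## Extract R-squared and the coefficient table (estimates, standard errors, t-values, p-values)
model_summary = summary(fitted_model)
model_summary$r.squared
model_summary$coefficients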
You might also notice that summary() is using a \(T\)-test to determine the \(p\)-value. Like any \(T\)-test, this comes with an underlying set of assumptions that should be checked to ensure the test results are trustworthy. For regression we must check that the model’s errors are independent and Normally distributed. This can be done by giving the fitted model to the plot() function:
## Check independence
plot(fitted_model, which = 1)
## Check Normality
plot(fitted_model, which = 2)
The first diagnostic plot shows the model’s residuals (observed prediction errors) plotted against its predicted values. If the errors are independent we expect this graph to show random scatter. Thus, we are looking for the smoothing line (shown in red) to stay near zero throughout the entire range of the x-axis, and we are also looking for the vertical spread of the points to stay roughly constant throughout this range (constant variance).
The second diagnostic plot standardizes the model’s residuals (i.e., applies the Z-score transformation), then finds the percentile of each standardized residual and compares it with the corresponding value from the Normal distribution. If the observed residuals align perfectly with a Normal distribution we expect these points to fall on a 45-degree line. Deviation from the 45-degree line indicates skew.
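If it helps to see what these built-in plots are constructed from, the same checks can be assembled by hand using residuals() and fitted() along with base R's Normal quantile-quantile functions; a sketch:
## Residuals vs. fitted values (look for random scatter around zero)
plot(fitted(fitted_model), residuals(fitted_model))
abline(h = 0)
## Normal quantile-quantile plot of the residuals (look for points near the 45-degree line)
qqnorm(residuals(fitted_model))
qqline(residuals(fitted_model))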
Question #3: For this question you should use the grinnell_peers data introduced earlier in the lab. The guiding question in this analysis is whether Grinnell’s peer institutions appear to reward their faculty for the economic success of the school’s graduates.
Part A: Create a scatter plot of Med_Grad_Earnings_4Y and Avg_Fac_Salary and briefly describe how these variables are related.
Part B: Fit a simple linear regression model that uses Med_Grad_Earnings_4Y to predict Avg_Fac_Salary. Report the estimated slope and intercept of the fitted model.
Part C: Use the summary() function to perform a hypothesis test investigating whether the median earnings of a college’s graduates and the average salary of its faculty are unrelated. Provide a one-sentence conclusion that summarizes the results of the test (including the \(p\)-value) and includes an interpretation of the estimated slope coefficient.
Part D: Use the plot() function to evaluate the assumptions of the \(T\)-test you used in Part C. Briefly explain whether or not you believe the test to be appropriate in this situation based upon what you see in the two diagnostic plots you created.
Part E: Report the \(R^2\) value given by the summary() function in Part C. Based upon this value, do you believe that Med_Grad_Earnings_4Y explains a high amount of the observed variation in faculty salaries, explains a moderate amount of variation in faculty salaries, or explains little variation in faculty salaries?
Question #4: In this question you should continue using the grinnell_peers data from earlier in the lab. The guiding question in this analysis is whether the percentage of female students is related to a school’s average faculty salary.
Part A: Fit a simple linear regression model that uses P_Female to predict Avg_Fac_Salary, then use the summary() function to assess whether the data provide statistical evidence of a relationship between the percentage of female students and the college’s average faculty salary. Report your \(p\)-value and a brief conclusion.
Part B: Create a scatter plot of P_Female and Avg_Fac_Salary and notice the outlier, Smith College, that is nearly 100% female. How do you think this data-point is influencing the slope of the regression model you fit in Part A?
Part C: Use the filter() function in the dplyr library to create a subset of data that doesn’t contain Smith College (you may use the logical condition Institution != "Smith College" when filtering). Next, refit the regression model described in Part A to this subset. How did the estimated slope change after Smith College was removed?
After Exam 1, we’ll introduce the course project, but as a preparatory step I’d like you to propose and pursue your own research questions using one of the following data sets:
The grinnell_peers data set used throughout this lab - available at: https://remiller1450.github.io/data/grinnell_peers.csv
The HollywoodMovies data set - available at: https://remiller1450.github.io/data/HollywoodMovies.csv
After choosing your data set you are to propose three research questions that fit the following guidelines:
(you may need to use filter() to get the appropriate groups)
Your research questions need to involve variables that either already exist or can be created in the data set that you selected. For each research question you are to provide:
To reiterate, you should provide these components for all three research questions you developed.