\(~\)
So far we’ve largely been able to use our data in the format that it
was given. For example, we’ve used entire data frames to create data
visualizations, and we’ve been able to use the $
operator
to access specific variables to calculate descriptive statistics like
the mean or median.
In some instances we’ll want to work with data consisting of a subset
of cases and/or variables from our original data set. The dplyr
package provides a suite of data manipulation functions that allow us to
obtain these subsets using pipelines, or ways of feeding data
forward through a series of steps without storing the intermediate
results.
## Load the dplyr package, install if necessary
# install.packages("dplyr")
library(dplyr)
## Read some data (118th congress we've seen previously)
congress = read.csv("https://remiller1450.github.io/data/congress_2024.csv")
## Notice the number of cases and variables in "congress"
dim(congress)
## [1] 538 7
The code below is a pipeline that uses the
filter()
function to subset the cases and the
select()
function to subset the variables, storing the
resulting subset in a data frame named “senate_ages”:
## Data pipeline
senate_ages = congress %>%
filter(Chamber == "Senate") %>%
select(Name, Party, Age)
The pipe operator, %>%, is used to pass the output of one function into a subsequent function. In this example pipeline, the %>% operator is used twice:

- The data frame “congress” is passed into the filter() function, which subsets the data it receives to only include cases where the logical condition Chamber == "Senate" is TRUE.
- The output of filter() is passed into the select() function, which removes all columns from the data frame except for “Name”, “Party”, and “Age”. Since this is the final step in the pipeline, its result is stored in the object “senate_ages”.

We can check the dimensions of “senate_ages” to confirm that it only contains information on the three variables we selected for the 100 congress members in the US Senate:
dim(senate_ages)
## [1] 100 3
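Pipelines aren’t limited to two steps. As a quick aside (a sketch, not needed for this lab), we could add dplyr’s arrange() function as a third step to sort the senators from oldest to youngest:
## Hypothetical extension of the pipeline: sort the senators by Age, oldest first
oldest_senators = congress %>%
  filter(Chamber == "Senate") %>%
  select(Name, Party, Age) %>%
  arrange(desc(Age))

## Print the first few rows of the sorted subset
head(oldest_senators)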
\(~\)
As a reminder, you should work on the lab with your assigned partner(s) using the principles of pair programming. Everyone should keep and submit their own copy of your group’s work.
In this lab you’ll use two data sets:
wines <- read.csv("https://remiller1450.github.io/data/wines.csv")
homes <- read.csv("https://remiller1450.github.io/data/AmesHousing.csv")
You’ll also need the same packages used in our previous lab,
ggplot2
and forcats
. Remember that you do
not need to install these packages if you’ve done so previously,
but you do need to load them using the library()
function:
# install.packages("ggplot2")
# install.packages("forcats")
library(ggplot2)
library(forcats)
\(~\)
Before calculating the correlation between two quantitative variables we should visualize their relationship to determine whether Pearson’s or Spearman’s correlation coefficient is most appropriate. As a reminder, we prefer Pearson’s correlation for measuring the strength of a linear relationship, and Spearman’s correlation when the relationship is non-linear (but still monotonic).
Below is a plot showing the relationship between a wine’s alcohol content and its proline content:
ggplot(data = wines, aes(x = Alcohol, y = Proline)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE)
In this example we included two types of smoother on the scatter plot:

- geom_smooth(color = "blue", se = FALSE) fits a locally weighted moving average and is shown in blue. This smoother is sensitive to non-linear patterns in the data.
- geom_smooth(method = "lm", color = "red", se = FALSE) includes the argument method = "lm", which amounts to the smoother being the best fitting regression line. This smoother is shown in red and we include it for comparison purposes.

Here we see a moderate-to-strong linear relationship. The similarity of the red and blue smoothers suggests a linear relationship, thus we should use Pearson’s correlation for a more precise quantification of just how strongly these variables are related.
Both Pearson’s and Spearman’s correlation coefficients are found
using the cor()
function. The method
argument
determines whether Spearman’s or Pearson’s correlation is given:
## Pearson's cor
cor(x = wines$Alcohol, y = wines$Proline, method = "pearson")
## [1] 0.64372
## Spearman's cor
cor(x = wines$Alcohol, y = wines$Proline, method = "spearman")
## [1] 0.6335799
Both Pearson’s and Spearman’s correlation coefficients are shown in this example for illustration purposes, but you should only calculate and report one of them. It’s also not advised to calculate both and report whichever suggests a stronger association. This practice is an example of “data dredging” and can increase the risk of false positive findings. Instead, you should decide upon one measure of association based upon your subject-area knowledge and exploratory analyses (data visualizations, etc.).
Question #1: For this question you should use the Ames Housing data set. Create a scatter plot displaying the relationship between Overall.Qual, the county assessor’s overall rating of the home’s quality, and the variable SalePrice. Qualitatively describe the form, strength, and direction of the relationship between these two variables.
\(~\)
In some analyses we’ll want to find correlations between many
different pairings of variables. For the wines data set, we might be
interested in the relationships between the variables
Alcohol
, Mg
, Proline
, and
TotalPhenols
. The cor()
function allows us to
provide an entire data frame in its x
argument so long as
that data frame only contains numeric variables:
## Use the select function to keep only the numeric variables we want
wines_subset = wines %>% select(Alcohol, Mg, Proline, TotalPhenols)
## Give this data frame to cor()
cor(x = wines_subset, method = "pearson")
## Alcohol Mg Proline TotalPhenols
## Alcohol 1.0000000 0.2707982 0.6437200 0.2891011
## Mg 0.2707982 1.0000000 0.3933508 0.2144012
## Proline 0.6437200 0.3933508 1.0000000 0.4981149
## TotalPhenols 0.2891011 0.2144012 0.4981149 1.0000000
In this setting the cor()
function produces a
correlation matrix, where each element of the matrix is
the correlation coefficient relating the corresponding pairing of
variables (defined by the row and column). For example, we can see that
the correlation between Proline
and Mg
is
0.393.
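A small formatting tip (optional): correlation matrices are often easier to read after rounding, which we can do with the round() function:
## Round the correlation matrix to 2 decimal places for readability
round(cor(x = wines_subset, method = "pearson"), 2)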
Question #2: For this question you should use the Ames Housing data set. Use the select() function to create a subset of the Ames Housing data containing only the variables Overall.Qual, Overall.Cond, TotRms.AbvGrd, and SalePrice. Then, use cor() to create a correlation matrix from this subset and report the correlation between Overall.Cond and SalePrice.
\(~\)
When working with real data we’ll frequently encounter instances
where some cases do not have a recorded value for every variable. In
R
, missing values are intended to be stored as the special
value NA
.
In the Ames Housing data set, the variable
Total.Bsmt.SF, the total square footage of a
home’s basement, was not recorded for every home sale. Thus, when we try
to find the correlation between Total.Bsmt.SF and a home’s
sale price we get the special value NA:
cor(x = homes$Total.Bsmt.SF, y = homes$SalePrice, method = "pearson")
## [1] NA
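If we want to confirm that missing values are the culprit, we can count them using the is.na() function (a quick sketch, not part of the lab questions):
## Count the number of missing values in Total.Bsmt.SF
sum(is.na(homes$Total.Bsmt.SF))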
One way to avoid this is to calculate the correlation using only the
cases with complete data for all variables involved in the
calculations performed by cor(), via the use = "complete.obs" argument:
cor(x = homes$Total.Bsmt.SF, y = homes$SalePrice, method = "pearson", use = "complete.obs")
## [1] 0.6322805
Note that this approach may exclude more data than necessary if we
provide an entire data frame to cor() but only care about a
few pairings of variables, since a case is dropped entirely if it is missing
any of the variables in that data frame.
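If you do give cor() an entire data frame, one alternative worth knowing about is use = "pairwise.complete.obs", which computes each correlation from all cases that are complete for that particular pair of variables. A minimal sketch (the lab questions below only require "complete.obs"; the homes_subset name is just for illustration):
## Pairwise deletion: each entry of the matrix uses the cases complete for that pair
homes_subset = homes %>% select(Total.Bsmt.SF, SalePrice, Overall.Qual)
cor(x = homes_subset, method = "pearson", use = "pairwise.complete.obs")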
Question #3: Modify your code from Question #2 to
include the variable Total.Bsmt.SF
in your subset, using
the use = "complete.obs"
argument to account for missing
data in this variable when finding the correlation matrix. Record how
the correlation between Overall.Qual
and
SalePrice
differs in this correlation matrix compared to
the correlation matrix you found in Question #2 (the first digits will
be the same, but you should see a slight difference).
\(~\)
Simple linear regression provides a model that relates an explanatory variable, \(X\), with a quantitative response variable, \(Y\), via the model: \[Y = \beta_0 + \beta_1X + \epsilon\]
This model is theoretical, and in practice we estimate \(\beta_0\) and \(\beta_1\) using our data, denoting these estimates \(b_0\) and \(b_1\), respectively.
Thus, the fitted regression model equation looks like: \[\hat{y}_i = b_0 + b_1x_i\]
Here \(b_0\) and \(b_1\) have numerical values (you’ll see an example soon).
The distinction between the theoretical model and the fitted model is important - we can write out the theoretical model without having seen any data, it doesn’t involve any known numerical values, and it cannot be used to make predictions. The fitted model is estimated from the data, \(b_0\) and \(b_1\) have numerical values, and these values can be used to make predictions.
We can estimate the coefficients, \(b_0\) and \(b_1\), of a fitted simple linear regression
model using the lm()
function:
## Fitting a regression model
wine_model = lm(ColorIntensity ~ Alcohol, data = wines)
In this example, we fit the model: \(\text{ColorIntensity} = \beta_0 +
\beta_1*\text{Alcohol} + \epsilon\) and stored the fitted model
in an object named wine_model
.
We can use the coef()
function to access the estimated
slope and intercept of our fitted model:
coef(wine_model)
## (Intercept) Alcohol
## -15.22574 1.56022
We could write out the fitted model as \(\widehat{\text{ColorIntensity}} = -15.23 + 1.56*\text{Alcohol}\)
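Because the fitted model has numerical coefficient estimates, we can use it to make predictions. As a quick illustration (not one of the lab questions), here is the predicted color intensity for a hypothetical wine with an alcohol content of 13:
## Prediction from the fitted model for a hypothetical wine with Alcohol = 13
predict(wine_model, newdata = data.frame(Alcohol = 13))

## The same prediction computed by hand from the estimated coefficients
coef(wine_model)[1] + coef(wine_model)[2] * 13
Both lines return the same value, roughly -15.23 + 1.56*13, or about 5.06 (up to rounding of the displayed coefficients).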
Question #4: For this question you should use the Ames Housing Data. Fit a simple linear regression model that predicts SalePrice using the variable Overall.Qual, storing the fitted model in an object named price_model. Then, use the coef() function to access the estimated slope and intercept.
\(~\)
The coefficient of determination, or \(R^2\), describes how much of the total
variation in the outcome variable is explained by the regression model. It
is one of many details of a fitted regression model that can be found
using the summary() function:
summary(wine_model)
##
## Call:
## lm(formula = ColorIntensity ~ Alcohol, data = wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0189 -1.3322 -0.4905 0.6174 6.0705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.2257 2.3483 -6.484 8.72e-10 ***
## Alcohol 1.5602 0.1803 8.654 3.06e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.947 on 176 degrees of freedom
## Multiple R-squared: 0.2985, Adjusted R-squared: 0.2945
## F-statistic: 74.9 on 1 and 176 DF, p-value: 3.056e-15
In this output, the \(R^2\) value is described as “Multiple R-squared”. Next to it is a similar measure, “Adjusted R-squared”, that imposes a penalty on \(R^2\) based upon the complexity of a model.
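One common form of the adjustment is \(\text{Adjusted } R^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(n\) is the number of cases and \(p\) is the number of explanatory terms in the model. If you’d like to extract these quantities directly rather than reading them off the printed output, they are stored as components of the summary object (a small sketch, not required for this lab):
## Extract R-squared and adjusted R-squared from the summary object
wine_summary = summary(wine_model)
wine_summary$r.squared      ## Multiple R-squared (0.2985 above)
wine_summary$adj.r.squared  ## Adjusted R-squared (0.2945 above)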
Question #5: Find and report the \(R^2\) value of the model you fit in Question #4.
\(~\)
Simple linear regression imposes a straight-line relationship between our explanatory and response variables; however, the overall framework of regression is flexible enough to accommodate non-linear relationships.
Consider using the alcohol content of a wine to predict the amount of flavanoids it contains:
ggplot(data = wines, aes(x = Alcohol, y = Flavanoids)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE)
The relationship appears non-linear, and also non-monotonic. Wines having either very high or very low alcohol content tend to have the most flavanoids. Instead of trying to model this relationship using a straight-line, a quadratic polynomial might be more appropriate:
## Example of a quadratic polynomial model
quad_model = lm(Flavanoids ~ poly(Alcohol, degree = 2), data = wines)
Notice how the poly() function is used to indicate that the
explanatory variable, Alcohol, should be expressed via a
second-degree polynomial.
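A note on poly(): by default it constructs orthogonal polynomial terms, which are numerically stable but whose individual coefficients aren’t on the original scale of Alcohol. If you want coefficients that correspond directly to Alcohol and Alcohol squared, you can add raw = TRUE (a sketch; either parameterization yields the same fitted values and \(R^2\)):
## The same quadratic model using raw (non-orthogonal) polynomial terms
quad_model_raw = lm(Flavanoids ~ poly(Alcohol, degree = 2, raw = TRUE), data = wines)
coef(quad_model_raw)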
This model has an \(R^2\) value of 0.16 and an adjusted \(R^2\) of 0.15.
## Check R^2
summary(quad_model)
##
## Call:
## lm(formula = Flavanoids ~ poly(Alcohol, degree = 2), data = wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0161 -0.7064 0.1393 0.6993 2.8481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.02927 0.06905 29.387 < 2e-16 ***
## poly(Alcohol, degree = 2)1 3.14702 0.92128 3.416 0.00079 ***
## poly(Alcohol, degree = 2)2 4.26156 0.92128 4.626 7.24e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9213 on 175 degrees of freedom
## Multiple R-squared: 0.1589, Adjusted R-squared: 0.1493
## F-statistic: 16.53 on 2 and 175 DF, p-value: 2.65e-07
For comparison, let’s look at the \(R^2\) of a simple linear regression model:
## Comparison linear model
line_model = lm(Flavanoids ~ Alcohol, data = wines)
summary(line_model)
##
## Call:
## lm(formula = Flavanoids ~ Alcohol, data = wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9455 -0.7794 0.2340 0.6726 3.4705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.75876 1.17370 -1.498 0.13580
## Alcohol 0.29137 0.09011 3.234 0.00146 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9732 on 176 degrees of freedom
## Multiple R-squared: 0.05608, Adjusted R-squared: 0.05072
## F-statistic: 10.46 on 1 and 176 DF, p-value: 0.001459
Because the relationship between these variables appears to be non-linear, the quadratic polynomial model explains more of the variability in the response variable than the straight-line model does, and its adjusted \(R^2\) is higher, indicating this difference is unlikely to reflect overfitting.
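If you’d like to see the quadratic fit on the scatter plot itself, geom_smooth() accepts a formula argument (a quick sketch, not required for the lab):
## Overlay the fitted quadratic curve (red) on the earlier scatter plot
ggplot(data = wines, aes(x = Alcohol, y = Flavanoids)) +
  geom_point() +
  geom_smooth(color = "blue", se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ poly(x, degree = 2), color = "red", se = FALSE)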
As another example, consider the relationship between the amount of
living space on the first floor of a home, X1st.Flr.SF
, and
the total number of rooms it has above ground,
TotRms.AbvGrd
:
ggplot(data = homes, aes(x = X1st.Flr.SF, y = TotRms.AbvGrd)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE)
Notice that the smoother closely resembles a straight line, though
there might be some minor curvature near the smaller values of
X1st.Flr.SF. Let’s compare the coefficient of determination for
straight-line and quadratic polynomial regression models:
## Quadratic model
quad_model = lm(TotRms.AbvGrd ~ poly(X1st.Flr.SF, degree = 2), data = homes)
summary(quad_model)
##
## Call:
## lm(formula = TotRms.AbvGrd ~ poly(X1st.Flr.SF, degree = 2), data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1994 -1.0154 -0.2401 0.9030 7.1294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.44300 0.02676 240.744 <2e-16 ***
## poly(X1st.Flr.SF, degree = 2)1 33.21418 1.44866 22.927 <2e-16 ***
## poly(X1st.Flr.SF, degree = 2)2 1.06328 1.44866 0.734 0.463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.449 on 2927 degrees of freedom
## Multiple R-squared: 0.1524, Adjusted R-squared: 0.1518
## F-statistic: 263.1 on 2 and 2927 DF, p-value: < 2.2e-16
## Linear model
line_model = lm(TotRms.AbvGrd ~ X1st.Flr.SF, data = homes)
summary(line_model)
##
## Call:
## lm(formula = TotRms.AbvGrd ~ X1st.Flr.SF, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1502 -1.0147 -0.2436 0.9008 7.1178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6271070 0.0835947 55.35 <2e-16 ***
## X1st.Flr.SF 0.0015660 0.0000683 22.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.449 on 2928 degrees of freedom
## Multiple R-squared: 0.1522, Adjusted R-squared: 0.1519
## F-statistic: 525.8 on 1 and 2928 DF, p-value: < 2.2e-16
In this example we see that the \(R^2\) value is higher for the quadratic model, which is expected since the model is more complex and it has the flexibility to get closer to the observed data-points than a straight line can. However, the adjusted \(R^2\) is higher for the straight-line model, suggesting that the quadratic model is overfitting the data.
Question #6: Modify the model you fit in Question #4
and subsequently used in Question #5 to allow for a quadratic
relationship between SalePrice
and
Overall.Qual
in the Ames Housing data set. Use \(R^2\) and/or adjusted \(R^2\) to decide whether you believe this
quadratic model is overfit to the data (relative to the straight-line
model from Question #4/5).
\(~\)
The previous sections of this lab covered a lot of ground. This section is intended to provide an opportunity for you to use those topics in a single analysis.
In this application you’ll use the “Tips” dataset:
tips = read.csv("https://remiller1450.github.io/data/Tips.csv")
These data were recorded by a waiter in a national chain restaurant located in a suburban New York City shopping mall in the early 1990s. The data document various aspects of each table served by the waiter, including the total bill, tip, size of the party, time of day, day of the week, and whether the party included a smoker.
Question #7:

- Part A: Use the select() function to create a data subset containing only the numeric variables Tip, TotBill, and Size. Next, create a correlation matrix using these three variables and Spearman’s correlation.
- Part B: Suppose we’d like to predict Tip using either TotBill or Size. Which of these two variables appears most strongly associated with Tip based upon the correlation matrix you created in Part A?
- Part C: Create a scatter plot displaying the relationship between the variable you chose in Part B and Tip. Qualitatively describe the association you see in the plot.
- Part D: Fit a simple linear regression model that uses the variable you chose in Part B to predict Tip. Report and interpret each of the coefficients of the fitted model. Additionally, if this model’s intercept is not meaningful, you should explain why.