\(~\)

Onboarding

So far we’ve largely been able to use our data in the format in which it was given. For example, we’ve used entire data frames to create data visualizations, and we’ve been able to use the $ operator to access specific variables to calculate descriptive statistics like the mean or median.

In some instances we’ll want to work with data consisting of a subset of cases and/or variables from our original data. The dplyr package provides a suite of data manipulation functions that allow us to obtain these subsets using pipelines, or ways of feeding data forward through a series of steps without storing the intermediate results.

## Load the dplyr package, install if necessary
# install.packages("dplyr")
library(dplyr)

## Read some data (118th congress we've seen previously)
congress = read.csv("https://remiller1450.github.io/data/congress_2024.csv")

## Notice the number of cases and variables in "congress"
dim(congress)
## [1] 538   7
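
As a quick refresher on the $ operator mentioned above, we can compute a descriptive statistic for a single column of “congress”. The example below uses the Age variable; na.rm = TRUE is included just in case any ages happen to be missing:

## Refresher: use $ to access one variable and summarize it
median(congress$Age, na.rm = TRUE)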

The code below is a pipeline that uses the filter() function to subset the cases and the select() function to subset the variables, storing the resulting subset in a data frame named “senate_ages”:

## Data pipeline
senate_ages = congress %>% 
      filter(Chamber == "Senate") %>% 
      select(Name, Party, Age)

The pipe operator, %>%, is used to pass the output of one function into a subsequent function. In this example pipeline, the %>% operator is used twice:

  • First, the data frame “congress” is passed into the filter() function, which subsets the data it receives to only include cases where the logical condition Chamber == "Senate" is TRUE.
  • Second, the output from filter() is passed into the select() function, which removes all columns from the data frame except for “Name”, “Party”, and “Age”. Since this is the final step in the pipeline, its result is stored in the object “senate_ages” (an equivalent version without pipes is shown below for comparison).
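
For comparison, here is what that pipeline looks like without pipes, with the function calls nested inside one another (the object name senate_ages_nested is used only for illustration):

## Equivalent nested version without the pipe operator
senate_ages_nested = select(filter(congress, Chamber == "Senate"), Name, Party, Age)

The nested version produces the same data frame, but pipelines are usually easier to read because the steps appear in the order they are performed.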

We can check the dimensions of “senate_ages” to confirm that it only contains information on the three variables we selected for the 100 congress members in the US Senate:

dim(senate_ages)
## [1] 100   3
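
If you’d like to see the contents of the subset rather than just its dimensions, head() prints the first few rows:

## Preview the first few rows of the subset
head(senate_ages)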

\(~\)

Lab

As a reminder, you should work on the lab with your assigned partner(s) using the principles of pair programming. Everyone should keep and submit their own copy of your group’s work.

In this lab you’ll use two data sets:

  1. “Wines” - a data set documenting the results of a chemical analysis of 178 different samples of wine grown by three different cultivators in the same region of Italy. Additional documentation can be found here.
  2. “Ames Housing” - a data set documenting all homes sold in Ames, Iowa between 2006 and 2010. Additional documentation can be found here.
wines <- read.csv("https://remiller1450.github.io/data/wines.csv")
homes <- read.csv("https://remiller1450.github.io/data/AmesHousing.csv")

You’ll also need the same packages used in our previous lab, ggplot2 and forcats. Remember that you do not need to install these packages if you’ve done so previously, but you do need to load them using the library() function:

# install.packages("ggplot2")
# install.packages("forcats")

library(ggplot2)
library(forcats)

\(~\)

Correlation

Before calculating the correlation between two quantitative variables we should visualize their relationship to determine whether Pearson’s or Spearman’s correlation coefficient is most appropriate. As a reminder, we prefer Pearson’s correlation for measuring the strength of a linear relationship, and Spearman’s correlation when the relationship is non-linear (but still monotonic).

Below is a plot showing the relationship between a wine’s alcohol content and its proline content:

ggplot(data = wines, aes(x = Alcohol, y = Proline)) + 
  geom_point() + 
  geom_smooth(color = "blue", se = FALSE) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

In this example we included two types of smoother on the scatter plot.

  1. geom_smooth(color = "blue", se = FALSE) fits a locally weighted smoother (loess) and is shown in blue. This smoother is sensitive to non-linear patterns in the data.
  2. geom_smooth(method = "lm", color = "red", se = FALSE) includes the argument method = "lm", which makes the smoother the best-fitting straight regression line. This smoother is shown in red and is included for comparison purposes.

Here we see a moderate-to-strong positive relationship. The similarity of the red and blue smoothers suggests the relationship is approximately linear, so we should use Pearson’s correlation to more precisely quantify how strongly these variables are related.

Both Pearson’s and Spearman’s correlation coefficients are found using the cor() function. The method argument determines whether Spearman’s or Pearson’s correlation is given:

## Pearson's cor
cor(x = wines$Alcohol, y = wines$Proline, method = "pearson")
## [1] 0.64372
## Spearman's cor
cor(x = wines$Alcohol, y = wines$Proline, method = "spearman")
## [1] 0.6335799

Both Pearson’s and Spearman’s correlation coefficients are shown in this example for illustration purposes, but you should only calculate and report one of them. It’s also not advised to calculate both and report whichever suggests a stronger association; this practice is an example of “data dredging” and can increase the risk of false positive findings. Instead, you should decide upon one measure of association based upon your subject-area knowledge and exploratory analyses (data visualizations, etc.).

Question #1: For this question you should use the Ames Housing data set.

  • Part A: Create a scatter plot depicting the relationship between the variable Overall.Qual, the county assessor’s overall rating of the home’s quality, and the variable SalePrice. Qualitatively describe the form, strength, and direction of the relationship between these two variables.
  • Part B: Informed by your answer to Part A, find and report either Pearson’s or Spearman’s correlation coefficient. Briefly explain why you chose the method you reported.

\(~\)

Correlation Matrices

In some analyses we’ll want to find correlations between many different pairings of variables. For the wines data set, we might be interested in the relationships between the variables Alcohol, Mg, Proline, and TotalPhenols. The cor() function allows us to provide an entire data frame in its x argument so long as that data frame only contains numeric variables:

## Use the select function to keep only the numeric variables we want
wines_subset = wines %>% select(Alcohol, Mg, Proline, TotalPhenols)

## Give this data frame to cor()
cor(x = wines_subset, method = "pearson")
##                Alcohol        Mg   Proline TotalPhenols
## Alcohol      1.0000000 0.2707982 0.6437200    0.2891011
## Mg           0.2707982 1.0000000 0.3933508    0.2144012
## Proline      0.6437200 0.3933508 1.0000000    0.4981149
## TotalPhenols 0.2891011 0.2144012 0.4981149    1.0000000

In this setting the cor() function produces a correlation matrix, where each element of the matrix is the correlation coefficient relating the corresponding pairing of variables (defined by the row and column). For example, we can see that the correlation between Proline and Mg is 0.393.
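
Because cor() returns a named matrix, you can store it and pull out any single entry by row and column name. The object name wine_cors below is only for illustration:

## Store the correlation matrix and extract one entry by name
wine_cors = cor(x = wines_subset, method = "pearson")
wine_cors["Proline", "Mg"]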

Question #2: For this question you should use the Ames Housing data set.

  • Part A: Use the select() function to create a subset of the Ames Housing data containing only the variables Overall.Qual, Overall.Cond, TotRms.AbvGrd and SalePrice.
  • Part B: Create a correlation matrix showing the pairwise relationships between these variables using Spearman’s correlation coefficient.
  • Part C: Find and report the pair of variables with the weakest association.
  • Part D: Report and interpret the correlation between Overall.Cond and SalePrice.

\(~\)

Missing Data

When working with real data we’ll frequently encounter instances where some cases do not have a recorded value for every variable. In R, missing values are intended to be stored as the special value NA.

In the Ames Housing data set, the variable Total.Bsmt.SF, the total square footage of a home’s basement, was not recorded for every home sale. Thus, when we try to find the correlation between Total.Bsmt.SF and a home’s sale price we get the special value NA:

cor(x = homes$Total.Bsmt.SF, y = homes$SalePrice, method = "pearson")
## [1] NA
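
If you want to see how many homes are affected, the is.na() function flags missing entries, and summing the resulting TRUE values counts them:

## Count the missing values in Total.Bsmt.SF
sum(is.na(homes$Total.Bsmt.SF))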

One way to avoid this is to calculate the correlation using only cases that have complete data for every variable involved in the calculation, which is done via the use argument of cor():

cor(x = homes$Total.Bsmt.SF, y = homes$SalePrice, method = "pearson", use = "complete.obs")
## [1] 0.6322805

Note that this approach may exclude more data than necessary if we provide an entire data frame to cor() but only care about a few pairings of variables, since any case missing a value for any variable in the data frame is dropped from every correlation.
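
If that concern applies, cor() also accepts use = "pairwise.complete.obs", which computes each correlation using every case that is complete for that particular pair of variables. The sketch below shows the argument on the same two-variable example, where it happens to give the same result as "complete.obs"; the distinction only matters when an entire data frame is supplied:

## Pairwise-complete alternative (differences appear when x is a whole data frame)
cor(x = homes$Total.Bsmt.SF, y = homes$SalePrice, method = "pearson", use = "pairwise.complete.obs")

Question #3 below still asks for use = "complete.obs", so this alternative is shown only for awareness.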

Question #3: Modify your code from Question #2 to include the variable Total.Bsmt.SF in your subset, using the use = "complete.obs" argument to account for missing data in this variable when finding the correlation matrix. Record how the correlation between Overall.Qual and SalePrice differs in this correlation matrix compared to the correlation matrix you found in Question #2 (the first digits will be the same, but you should see a slight difference).

\(~\)

Simple Linear Regression

Simple linear regression provides a model that relates an explanatory variable, \(X\), with a quantitative response variable, \(Y\), via the model: \[Y = \beta_0 + \beta_1X + \epsilon\]

This model is theoretical, and in practice we estimate \(\beta_0\) and \(\beta_1\) using our data, denoting these estimates \(b_0\) and \(b_1\), respectively.

Thus, the fitted regression model equation looks like: \[\hat{y}_i = b_0 + b_1x_i\]

Here \(b_0\) and \(b_1\) have numerical values (you’ll see an example soon).

The distinction between the theoretical model and the fitted model is important - we can write out the theoretical model without having seen any data, it doesn’t involve any known numerical values, and it cannot be used to make predictions. The fitted model is estimated from the data, \(b_0\) and \(b_1\) have numerical values, and these values can be used to make predictions.

We can estimate the coefficients, \(b_0\) and \(b_1\), of a fitted simple linear regression model using the lm() function:

## Fitting a regression model
wine_model  = lm(ColorIntensity ~ Alcohol, data = wines)

In this example, we fit the model: \(\text{ColorIntensity} = \beta_0 + \beta_1*\text{Alcohol} + \epsilon\) and stored the fitted model in an object named wine_model.

We can use the coef() function to access the estimated slope and intercept of our fitted model:

coef(wine_model)
## (Intercept)     Alcohol 
##   -15.22574     1.56022

We could write out the fitted model as \(\widehat{\text{ColorIntensity}} = -15.23 + 1.56*\text{Alcohol}\)

  • We can interpret the estimate \(b_0 = -15.23\) as the expected color intensity rating of a wine without any alcohol. However, since none of the wines in our data set have an alcohol content below 11%, we shouldn’t interpret the intercept; doing so would be improper extrapolation.
  • We can interpret \(b_1 = 1.56\) as the expected change in color intensity for a 1-unit change in the explanatory variable, a 1 percentage point increase in alcohol content in this example.
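
As noted earlier, the fitted model can be used to make predictions. The predict() function does this for new data; the alcohol content of 13% below is just an illustrative input:

## Predicted color intensity for a wine with 13% alcohol
## (equivalent to computing -15.22574 + 1.56022*13 by hand)
predict(wine_model, newdata = data.frame(Alcohol = 13))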

Question #4: For this question you should use the Ames Housing Data.

  • Part A: Fit a simple linear regression model that predicts SalePrice using the variable Overall.Qual, storing the fitted model in an object price_model.
  • Part B: Write out the fitted model from Part A using the coef() function to access the estimated slope and intercept.
  • Part C: Provide a brief interpretation of the fitted model’s coefficients. Be careful to indicate whether or not the model’s intercept is appropriate to interpret.
  • Part D: A homeowner is considering a remodeling project with an estimated cost of $20,000. They anticipate it would raise the overall quality rating of their home by 0.5 points. According to the fitted regression model, if this person sells their home immediately after the remodel, how much money are they expected to gain or lose as a result of the project?
  • Part E: Is it possible that the homeowner in Part D loses money? Briefly explain. Hint: Think about the concept of residuals/error in regression modeling.

\(~\)

Coefficient of Determination (\(R^2\))

The coefficient of determination, or \(R^2\), describes how much of the total variation in the outcome variable is explained by the regression model. It is one of many details of a fitted regression model that can be found using the summary() function:

summary(wine_model)
## 
## Call:
## lm(formula = ColorIntensity ~ Alcohol, data = wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0189 -1.3322 -0.4905  0.6174  6.0705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.2257     2.3483  -6.484 8.72e-10 ***
## Alcohol       1.5602     0.1803   8.654 3.06e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.947 on 176 degrees of freedom
## Multiple R-squared:  0.2985, Adjusted R-squared:  0.2945 
## F-statistic:  74.9 on 1 and 176 DF,  p-value: 3.056e-15

In this output, the \(R^2\) value is described as “Multiple R-squared”. Next to it is a similar measure, “Adjusted R-squared”, that imposes a penalty on \(R^2\) based upon the complexity of a model.

  • The basic idea behind adjusted \(R^2\) is to avoid rewarding models that overfit the data by using more explanatory variables than necessary, or by allowing explanatory variables to have complex, non-linear relationships with the response variable in the model.
    • For simple linear regression, there’s no reason to look at the adjusted \(R^2\) value since the model is already as simple as it can be.
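
If you only need the \(R^2\) values rather than the entire printout, they can be extracted directly from the summary object:

## Extract R-squared and adjusted R-squared from the model summary
summary(wine_model)$r.squared
summary(wine_model)$adj.r.squared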

Question #5: Find and report the \(R^2\) value of the model you fit in Question #4.

\(~\)

Non-linear Relationships and \(R^2\)

Simple linear regression imposes a straight-line relationship between our explanatory and response variables; however, the overall framework of regression is flexible enough to accommodate non-linear relationships.

Consider using the alcohol content of a wine to predict the amount of flavanoids it contains:

ggplot(data = wines, aes(x = Alcohol, y = Flavanoids)) + 
  geom_point() + 
  geom_smooth(color = "blue", se = FALSE) 

The relationship appears non-linear, and also non-monotonic. Wines having either very high or very low alcohol content tend to have the most flavanoids. Instead of trying to model this relationship using a straight line, a quadratic polynomial might be more appropriate:

## Example of a quadratic polynomial model
quad_model = lm(Flavanoids ~ poly(Alcohol, degree = 2), data = wines)

Notice how the poly() function is used to indicate the explanatory variable, Alcohol, should be expressed via a second-degree polynomial.
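
If you’d like to see this quadratic fit on the scatter plot, geom_smooth() accepts the same polynomial formula (shown in red below; the default loess smoother is omitted here):

## Overlay the fitted quadratic curve on the scatter plot
ggplot(data = wines, aes(x = Alcohol, y = Flavanoids)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, degree = 2), color = "red", se = FALSE)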

This model has an \(R^2\) value of 0.16 and an adjusted \(R^2\) of 0.15.

## Check R^2
summary(quad_model)
## 
## Call:
## lm(formula = Flavanoids ~ poly(Alcohol, degree = 2), data = wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0161 -0.7064  0.1393  0.6993  2.8481 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2.02927    0.06905  29.387  < 2e-16 ***
## poly(Alcohol, degree = 2)1  3.14702    0.92128   3.416  0.00079 ***
## poly(Alcohol, degree = 2)2  4.26156    0.92128   4.626 7.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9213 on 175 degrees of freedom
## Multiple R-squared:  0.1589, Adjusted R-squared:  0.1493 
## F-statistic: 16.53 on 2 and 175 DF,  p-value: 2.65e-07

For comparison, let’s look at the \(R^2\) of a simple linear regression model:

## Comparison linear model
line_model = lm(Flavanoids ~ Alcohol, data = wines)
summary(line_model)
## 
## Call:
## lm(formula = Flavanoids ~ Alcohol, data = wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9455 -0.7794  0.2340  0.6726  3.4705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.75876    1.17370  -1.498  0.13580   
## Alcohol      0.29137    0.09011   3.234  0.00146 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9732 on 176 degrees of freedom
## Multiple R-squared:  0.05608,    Adjusted R-squared:  0.05072 
## F-statistic: 10.46 on 1 and 176 DF,  p-value: 0.001459

Because the relationship between these variables appears to be non-linear, the quadratic polynomial model explains more of the variability in the response variable than the straight-line model does, and its adjusted \(R^2\) is higher, indicating this difference is unlikely to reflect overfitting.

As another example, consider the relationship between the amount of living space on the first floor of a home, X1st.Flr.SF, and the total number of rooms it has above ground, TotRms.AbvGrd:

ggplot(data = homes, aes(x = X1st.Flr.SF, y = TotRms.AbvGrd)) + 
  geom_point() + 
  geom_smooth(color = "blue", se = FALSE) 

Notice that the smoother closely resembles a straight line, though there might be some minor curvature near the smaller values of X1st.Flr.SF. Let’s compare the coefficient of determination (\(R^2\)) for straight-line and quadratic polynomial regression models:

## Quadratic model
quad_model = lm(TotRms.AbvGrd ~ poly(X1st.Flr.SF, degree = 2), data = homes)
summary(quad_model)
## 
## Call:
## lm(formula = TotRms.AbvGrd ~ poly(X1st.Flr.SF, degree = 2), data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1994 -1.0154 -0.2401  0.9030  7.1294 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.44300    0.02676 240.744   <2e-16 ***
## poly(X1st.Flr.SF, degree = 2)1 33.21418    1.44866  22.927   <2e-16 ***
## poly(X1st.Flr.SF, degree = 2)2  1.06328    1.44866   0.734    0.463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.449 on 2927 degrees of freedom
## Multiple R-squared:  0.1524, Adjusted R-squared:  0.1518 
## F-statistic: 263.1 on 2 and 2927 DF,  p-value: < 2.2e-16
## Linear model
line_model = lm(TotRms.AbvGrd ~ X1st.Flr.SF, data = homes)
summary(line_model)
## 
## Call:
## lm(formula = TotRms.AbvGrd ~ X1st.Flr.SF, data = homes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1502 -1.0147 -0.2436  0.9008  7.1178 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.6271070  0.0835947   55.35   <2e-16 ***
## X1st.Flr.SF 0.0015660  0.0000683   22.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.449 on 2928 degrees of freedom
## Multiple R-squared:  0.1522, Adjusted R-squared:  0.1519 
## F-statistic: 525.8 on 1 and 2928 DF,  p-value: < 2.2e-16

In this example we see that the \(R^2\) value is higher for the quadratic model, which is expected since the model is more complex and has the flexibility to get closer to the observed data points than a straight line can. However, the adjusted \(R^2\) is higher for the straight-line model, suggesting that the quadratic model is overfitting the data.

At this point a few more things are worth noting:

  1. Statisticians generally do not consider adjusted \(R^2\) to be a great model selection tool; however, it’s frequently used in some disciplines for this purpose in practice, so you should be aware of what it is.
  2. For a non-linear relationship that isn’t monotonic, Pearson’s and Spearman’s correlations are not appropriate. This is the primary situation where \(R^2\) can be useful in assessing the strength of association between two quantitative variables.

Question #6: Modify the model you fit in Question #4 and subsequently used in Question #5 to allow for a quadratic relationship between SalePrice and Overall.Qual in the Ames Housing data set. Use \(R^2\) and/or adjusted \(R^2\) to decide whether you believe this quadratic model is overfit to the data (relative to the straight-line model from Question #4/5).

\(~\)

Application

The previous sections of this lab covered a lot of ground. This section is intended to provide an opportunity for you to use those topics in a single analysis.

In this application you’ll use the “Tips” dataset:

tips = read.csv("https://remiller1450.github.io/data/Tips.csv")

These data were recorded by a waiter at a national chain restaurant located in a suburban New York City shopping mall in the early 1990s. The data document various aspects of each table served by the waiter, including the total bill, tip, size of the party, time of day, day of the week, and whether the party included a smoker.

Question #7:

  • Part A: Use the select() function to create a data subset containing only the numeric variables Tip, TotBill, and Size. Next, create a correlation matrix using these three variables and Spearman’s correlation.
  • Part B: Suppose we’d like to predict Tip using either TotBill or Size. Which of these two variables appears most strongly associated with Tip based upon the correlation matrix you created in Part A?
  • Part C: Create a scatter plot showing the relationship between the variable you selected in Part B and the response variable Tip. Qualitatively describe the association you see in the plot.
  • Part D: Based upon your assessment in Part C, should Pearson’s or Spearman’s correlation coefficient be preferred when reporting the strength of association between the variables depicted in your scatter plot? Briefly explain.
  • Part E: Fit a simple linear regression model using the explanatory variable you identified in Part B to predict Tip. Report and interpret each of the coefficients of the fitted model. Additionally, if this model’s intercept is not meaningful, you should explain why.
  • Part F: Modify your model from Part E to allow for a quadratic polynomial relationship between the variables. Then, using \(R^2\) and/or adjusted \(R^2\), state whether you prefer this model over the one from Part E if the goal is to explain as much variation in tips as possible without overfitting this sample of data.