Introduction:

Exploratory data analysis is an important preliminary step aimed at helping you better understand the nuances of your data. It is usually much easier to see trends and relationships visually than it is to rely upon guess-and-check model building. For this reason, every modeling project should begin with a comprehensive graphical analysis. This lab will cover data visualizations in R as a first step in model building.

Directions:

As with all our labs, you are expected to work through the examples and questions in this lab collaboratively with your partner(s). This requires you to work together and discuss each section. You each will receive the same score for your work, so it is your responsibility to make sure that both of you are on the same page. Any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions will be penalized.

You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown.

You are strongly encouraged to open a separate, blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this code.

\(~\)

Colleges 2019 Data Dictionary

Throughout the lab we will use the Colleges 2019 Dataset, a simplified set of cases from the US government’s 2019-2020 College Scorecard dataset (it includes fewer variables and is filtered to include only primarily undergraduate institutions with at least 400 enrolled students). Below is a description of the available variables:

  • Name - Name of the institution
  • City - City where the institution is located
  • State - State where the institution is located
  • Enrollment - Number of full-time enrolled students
  • Private - Binary indicator distinguishing public and private institutions
  • Region - Geographic region
  • Adm_Rate - Admissions rate, the proportion of applicants who are admitted
  • ACT_median - Median composite ACT score of enrolled students
  • ACT_Q1 - 25th percentile composite ACT score of enrolled students
  • ACT_Q3 - 75th percentile composite ACT score of enrolled students
  • Cost - Average yearly cost of attendance
  • Net_Tuition - Average tuition cost after discounts (scholarships, grants, etc.)
  • Avg_Fac_Salary - Average faculty salary
  • PercentFemale - Proportion of enrolled students who are female
  • PercentWhite - Proportion of enrolled students who identify as White
  • PercentBlack - Proportion of enrolled students who identify as Black
  • PercentHispanic - Proportion of enrolled students who identify as Hispanic
  • PercentAsian - Proportion of enrolled students who identify as Asian
  • FourYearComp_Males - Proportion of male students who go on to earn a degree within four years of their initial enrollment
  • FourYearComp_Females - Proportion of female students who go on to earn a degree within four years of their initial enrollment
  • Debt_median - Median student debt upon leaving the institution
  • Salary10yr_median - Median salary 10 years after graduating from the institution

\(~\)

First Steps

The first step in exploratory data analysis is getting a high-level understanding of the data by formally describing the cases and variables. In this step, you’ll also want to look for missing data and sparsely populated categories.

A few helpful commands are shown below. Try running them for yourself to see what they do:

## Begin by reading in the data
colleges19 <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

## Print the dimensions (rows and columns)
dim(colleges19)

## Print the "structure" of the data.frame
str(colleges19)

## Print a quick summary of each variable (missing data reported as NA's)
summary(colleges19)

## Print a contingency table for two categorical variables
table(colleges19$Region, colleges19$Private)
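
If the summary output feels crowded, the two checks described above (missing data and sparsely populated categories) can also be done directly. The sketch below uses a small made-up data frame, `toy`, rather than the real Colleges 2019 data; its values and the cutoff of 2 cases for a "sparse" category are assumptions chosen only for illustration.

```r
## Hypothetical toy data (NOT the real Colleges 2019 data)
toy <- data.frame(
  Region     = c("Plains", "Plains", "Plains", "Far West", "Plains"),
  ACT_median = c(21, NA, 25, NA, 30),
  Cost       = c(20000, 35000, NA, 41000, 18000)
)

## Count the missing values (NA's) in every column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

## Count the cases in each category of a categorical variable
region_counts <- table(toy$Region)
print(region_counts)

## Flag categories with fewer than 2 cases (an arbitrary cutoff for this sketch)
sparse <- names(region_counts)[region_counts < 2]
print(sparse)
```

The same calls should work on any data.frame, so you can swap `toy` for your own data when you experiment.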

Question #1: Using the output from the code given above, answer the following: A) How many colleges are in this dataset? B) How many different regions are recorded in these data? C) How many colleges are missing ACT data?

\(~\)

Univariate Exploration

After getting a preliminary sense of what the data contain, the next step is to investigate the distributions of the variables of interest. This step is important because it will help you identify any variables with outliers or unusual distributions that may need to be transformed prior to modeling.

In general, you should use boxplots and histograms to explore numeric variables, and you should use bar charts and tables to explore categorical variables. The sections below provide example code showing how to create each graph. Please try these examples for yourselves.

\(~\)

Boxplots

## If possible, we'll prefer ggplot graphics
library(ggplot2)
colleges19 <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

## ggplot example (coord_flip() is optional, it makes the box horizontal)
ggplot(colleges19, aes(y = ACT_median)) + geom_boxplot() + coord_flip()

## Base R example
boxplot(colleges19$ACT_median, horizontal = TRUE)
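
When reading a boxplot, it helps to know how the individual points beyond the whiskers are chosen. By default, base R's boxplot flags values more than 1.5 times the box width beyond the edges of the box. The sketch below reproduces that rule on a made-up vector `x` (hypothetical values, not taken from the lab's data) using `boxplot.stats`, the function base R uses for these calculations.

```r
## Hypothetical values with one unusually large observation
x <- c(10, 12, 13, 15, 16, 18, 95)

## boxplot.stats() returns the five-number summary and the flagged points
stats <- boxplot.stats(x)   # the default coef = 1.5 gives the usual rule

## The 2nd and 4th elements are the edges of the box (the "hinges")
hinges <- stats$stats[c(2, 4)]

## Points beyond these fences are drawn individually as potential outliers
fences <- c(lower = hinges[1] - 1.5 * diff(hinges),
            upper = hinges[2] + 1.5 * diff(hinges))
print(fences)

flagged <- stats$out
print(flagged)
```

A flagged point is only a candidate outlier. Whether it should be transformed, investigated, or left alone is a judgment you make after looking at the data.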

Histograms

## ggplot example (notice the impact of the bins argument)
ggplot(colleges19, aes(x = ACT_median)) + geom_histogram(bins = 10)

## Base R example (the argument is named "breaks"; a single number is treated as a suggestion)
hist(colleges19$ACT_median, breaks = 10)

Bar charts

## ggplot example (notice how theme(...) is used to rotate the x-axis text to make the graph more readable)
ggplot(colleges19, aes(x = Region)) + geom_bar() + theme(axis.text.x = element_text(angle = 45))

## Base R example (notice how the table function is used inside of the barplot function)
barplot(table(colleges19$Region))

Question #2: Based upon the examples given above, write code that creates appropriate univariate graphs that display the following variables: A) “Debt_median”, B) “Private”, C) “Adm_Rate”

\(~\)

Multivariate Exploration

Most models involve more than one variable, making it essential for us to explore how various pairs of variables are related in our data. In general, you should follow the guidelines below:

  • Use stacked and conditional bar charts to explore relationships involving two categorical variables
  • Use box plots to explore relationships involving one categorical and one numeric variable
  • Use a scatterplot to explore relationships involving two numeric variables
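
The guidelines above can be summarized as a small decision rule. The helper below, `suggest_graph`, is hypothetical (it is not part of R or any package) and the data frame `toy` is made up. It simply counts how many of the two variables are numeric.

```r
## Hypothetical helper: suggest a graph type from the types of two variables
suggest_graph <- function(x, y) {
  n_numeric <- is.numeric(x) + is.numeric(y)
  switch(as.character(n_numeric),
         "0" = "stacked or conditional bar chart",
         "1" = "box plots",
         "2" = "scatterplot")
}

## Hypothetical toy data
toy <- data.frame(
  Region   = c("Plains", "Plains", "Southeast"),
  Private  = c("Public", "Private", "Private"),
  ACT      = c(21, 27, 24),
  Adm_Rate = c(0.8, 0.5, 0.6)
)

suggest_graph(toy$Region, toy$Private)   # two categorical variables
suggest_graph(toy$Private, toy$ACT)      # one categorical, one numeric
suggest_graph(toy$Adm_Rate, toy$ACT)     # two numeric variables
```

One caution: a categorical variable that is stored as numbers (for example, a 0/1 indicator) would be treated as numeric by this sketch, so always check how a variable is stored with `str` before trusting a rule like this.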

The sections below provide example code showing how to create each of these graphs. Please try these code examples for yourselves.

\(~\)

Stacked and Conditional Bar Charts

## ggplot stacked bar chart
ggplot(colleges19, aes(x = Region, fill = Private)) + geom_bar() + theme(axis.text.x = element_text(angle = 45))

## ggplot conditional bar chart
ggplot(colleges19, aes(x = Region, fill = Private)) + geom_bar(position = "fill") + theme(axis.text.x = element_text(angle = 45))

## Base R stacked bar chart
barplot(table(colleges19$Region, colleges19$Private), legend.text = TRUE)

## Base R conditional bar chart, plus some legend formatting
barplot(prop.table(table(colleges19$Region, colleges19$Private), margin = 2), legend.text = TRUE,
        args.legend = list(x = "topleft", bty = "n", ncol = 3, cex = .5))
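
The margin argument of prop.table decides which variable you are conditioning on, and barplot draws one bar per column of the table it is given. The sketch below uses a made-up data frame `toy` (hypothetical counts, not the real data) to show the difference between the two margins, and how transposing with `t` gives one bar per region.

```r
## Hypothetical toy data: 4 schools in "East", 6 in "West"
toy <- data.frame(
  Region  = c(rep("East", 4), rep("West", 6)),
  Private = c("Private", "Private", "Public", "Public",
              "Private", "Public", "Public", "Public", "Public", "Public")
)
counts <- table(toy$Region, toy$Private)
print(counts)

## margin = 1: proportions within each row (each Region sums to 1)
by_region <- prop.table(counts, margin = 1)
print(by_region)

## margin = 2: proportions within each column (each Private category sums to 1)
by_type <- prop.table(counts, margin = 2)
print(by_type)

## One bar per Region: transpose so that the Regions become the columns
barplot(t(by_region), legend.text = TRUE)
```

In this toy example half of the East schools are private (margin = 1), while two thirds of the private schools are in the East (margin = 2). These answer different questions, so choose the margin that matches the comparison you want to make.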

Boxplots

## ggplot
ggplot(colleges19, aes(x = Private, y = ACT_median)) + geom_boxplot()  + coord_flip()

## Base R (notice the formula notation)
boxplot(colleges19$ACT_median ~ colleges19$Private, horizontal = TRUE)

Scatterplots

## ggplot
ggplot(colleges19, aes(x = Adm_Rate, y = ACT_median)) + geom_point()

## Base R
plot(colleges19$Adm_Rate, colleges19$ACT_median)

Question #3: Based upon the examples given above, write code that creates an appropriate graph to display the relationship between the following combinations of variables: A) “Region” and “PercentWhite”, B) “State” and “Private”, C) “Enrollment” and “Adm_Rate”. Then, write a sentence or two describing whether you think there is a relationship between each of these variable pairs.

\(~\)

Confounding Variables

Humans tend to be pretty good at recognizing and understanding patterns between two variables. Unfortunately, things become increasingly more difficult when the data contain three or more interrelated variables. Sometimes researchers will use clever experimental design to ensure they only need to consider two variables. Randomized experiments are covered extensively in introductory statistics courses for this reason.

When working with observational data, we always must be concerned about the potential for confounding variables, or variables that are related to both the explanatory and response variable we are interested in.

Confounding variables can obscure the relationships we see in our data, so they must be accounted for if we want to accurately describe the phenomenon we’re attempting to model/understand. For now, we’ll focus on identifying these variables and displaying their impact graphically. Later this semester we’ll learn how to adjust for them using models.
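
To see how a confounding variable can create a misleading pattern, it can help to simulate one. In the sketch below everything is made up: `z` is a hypothetical binary confounder that shifts both `x` and `y`, while `y` never depends on `x` directly. The sample size, the shift of 3, and the seed are arbitrary choices for illustration.

```r
## Simulated data: z affects both x and y, but x has no effect on y
set.seed(1)
n <- 1000
z <- rep(c(0, 1), each = n / 2)   # hypothetical binary confounder
x <- rnorm(n) + 3 * z             # x is shifted upward when z = 1
y <- rnorm(n) + 3 * z             # y is shifted upward when z = 1

## Ignoring z, x and y look strongly related
overall <- cor(x, y)

## Within each level of z, the relationship largely disappears
within0 <- cor(x[z == 0], y[z == 0])
within1 <- cor(x[z == 1], y[z == 1])

round(c(overall = overall, z0 = within0, z1 = within1), 2)
```

The overall correlation here reflects nothing more than the two groups sitting in different places, which is exactly the kind of pattern that goes unnoticed if the confounder is ignored.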

\(~\)

Categorical Confounds

If you can identify a categorical confounding variable, stratification is a simple and effective way to neutralize its impact. Stratification involves filtering the data into distinct subsets, or strata, based upon the confounding variable, then conducting separate analyses in each.

When using ggplot, the facet_wrap function provides a quick way to stratify the data you’re graphing. The example below shows the relationship between admissions rates and attendance costs, stratified by whether a school is private or public.

ggplot(colleges19, aes(x = Adm_Rate, y = Cost)) + geom_point() + facet_wrap(~Private)

In this example, stratifying the data is important because a different relationship can be observed in each stratum. If we did not stratify, this distinction would go unnoticed. Going further, the correlation coefficient gives us a useful way of gauging the impact of this confounding variable:

## Overall, "un-stratified" correlation
cor(x = colleges19$Adm_Rate, y = colleges19$Cost, use = "pairwise.complete.obs")
## [1] -0.3141143

## Stratifying by "Private"
priv <- subset(colleges19, Private == "Private")
pub <- subset(colleges19, Private == "Public")

cor(x = priv$Adm_Rate, y = priv$Cost, use = "pairwise.complete.obs")
## [1] -0.35123
cor(x = pub$Adm_Rate, y = pub$Cost, use = "pairwise.complete.obs")
## [1] -0.083372

Overall, there is a moderate inverse relationship between admissions rate and cost (more selective schools tend to cost more). However, this isn’t true when you consider only public schools; the overall relationship is driven almost entirely by private schools.
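
When a stratifying variable has more than two categories, writing one subset call per category gets tedious. One possible alternative, sketched below on a made-up data frame `toy` (hypothetical values, with one missing `y`), is to use `split` to create every stratum at once and `sapply` to compute the correlation in each. The sketch also shows why the `use` argument matters when data are missing.

```r
## Hypothetical toy data: two groups with opposite trends and one missing y
toy <- data.frame(
  group = rep(c("A", "B"), times = c(4, 5)),
  x     = c(1, 2, 3, 4,   1, 2, 3, 4, 5),
  y     = c(2, 4, 6, 8,   8, 6, 4, 2, NA)
)

## Without the "use" argument, a single NA makes the result NA
cor(toy$x, toy$y)

## Overall correlation using only the complete pairs
overall <- cor(toy$x, toy$y, use = "pairwise.complete.obs")

## split() returns a list with one data.frame per stratum
strata <- split(toy, toy$group)
by_group <- sapply(strata, function(d) {
  cor(d$x, d$y, use = "pairwise.complete.obs")
})

print(overall)
print(by_group)
```

In this toy example the two strata have perfectly opposite trends that cancel in the overall correlation. With the lab's data, `split(colleges19, colleges19$Private)` should produce the same two subsets as the subset calls used earlier.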

Finally, recognize that stratification isn’t always necessary. If a variable is not a confound, you are adding unnecessary complexity to your analysis by incorporating it. An example of this is shown below. Notice how a similar relationship appears for both private and public schools.

ggplot(colleges19, aes(x = Adm_Rate, y = ACT_median)) + geom_point() + facet_wrap(~Private)

\(~\)

Numeric Confounds

Numerical confounding variables are a bit trickier to handle via stratification (but we’ll soon learn how to adjust for them using modeling). One option is to “cut” the variable into a categorical one, a process demonstrated below:

## Create the categories using the "cut" function
colleges19$Enrollment_Category <- cut(colleges19$Enrollment, breaks = c(-Inf, 3000, 8000, Inf))
table(colleges19$Enrollment_Category)
## 
##  (-Inf,3e+03] (3e+03,8e+03]  (8e+03, Inf] 
##           979           371           308
## Name the categories using the "levels" function
levels(colleges19$Enrollment_Category) <- c("Small","Midsize","Large")
table(colleges19$Enrollment_Category)
## 
##   Small Midsize   Large 
##     979     371     308

Cutting a numeric variable into a categorical one is a somewhat controversial strategy, particularly if the category boundaries are arbitrarily chosen without any external basis. That said, it’s a strategy that is frequently used in real-world analyses, so it’s something worth knowing how to do.
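
If you have no external basis for choosing boundaries, one commonly used alternative is to let the data choose them, for example by cutting at quantiles so that each category holds a similar number of cases. The sketch below does this with terciles on a made-up vector `enrollment` (hypothetical values). It also names the categories directly through the labels argument of cut, which avoids the separate call to levels. This is only one option, not a recommendation over the approach above.

```r
## Hypothetical enrollment values (not from the real data)
enrollment <- c(500, 1200, 2500, 4000, 6000, 9000, 15000, 22000, 30000)

## Use the terciles of the data as the category boundaries
tercile_breaks <- quantile(enrollment, probs = c(0, 1/3, 2/3, 1))
print(tercile_breaks)

## include.lowest = TRUE keeps the minimum value in the first category
size <- cut(enrollment,
            breaks = tercile_breaks,
            labels = c("Small", "Midsize", "Large"),
            include.lowest = TRUE)

table(size)
```

Quantile-based boundaries depend on the sample, so the cutoffs will change if the data change. That trade-off is worth mentioning whenever you justify your choice.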

When it comes to identifying and understanding numerical confounding variables, we can use graphical elements such as color or shape:

ggplot(colleges19, aes(x = ACT_median, y = Debt_median, col = Cost)) + geom_point()
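
Color works for both numeric and categorical variables, but ggplot only allows shape to be mapped to a categorical variable. The sketch below combines the two ideas from this section: it cuts a numeric variable into categories, then maps those categories to both color and shape. The data frame `toy` and its values are made up for illustration, and the breaks simply reuse the ones from the cut example earlier in this section.

```r
library(ggplot2)

## Hypothetical toy data (not the real Colleges 2019 data)
toy <- data.frame(
  ACT_median  = c(19, 21, 23, 25, 28, 31),
  Debt_median = c(14000, 16500, 15000, 19000, 21000, 18000),
  Enrollment  = c(900, 2500, 5200, 7600, 12000, 30000)
)

## Cut the numeric variable into named categories
toy$Size <- cut(toy$Enrollment,
                breaks = c(-Inf, 3000, 8000, Inf),
                labels = c("Small", "Midsize", "Large"))

## Map the categories to both color and shape
p <- ggplot(toy, aes(x = ACT_median, y = Debt_median, col = Size, shape = Size)) +
  geom_point(size = 3)
p
```

Using color and shape together is redundant, but it can make a graph easier to read when printed in grayscale.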

Question #4: Based upon the examples in the sections above, write code that creates an appropriate graph to display the relationship between the following combinations of variables: A) “Region” and “PercentWhite”, B) “ACT_median” and “Adm_Rate”. In doing so, consider whether the variables “Private” and “Enrollment” are confounding variables in the comparison and account for them in your visualization (if necessary). Additionally, write 1-2 sentences summarizing the relationship for the comparisons in A) and B). For this question, you are welcome to use cutting, coloration, and/or stratification; just be sure to justify your decision.

\(~\)

Practice

In this section you’ll use the Ames Housing dataset to practice the graphical techniques above while using the patterns you see to make specific judgments about how these data might be modeled.

This dataset contains detailed information on all residential properties in Ames, Iowa that sold between 2006 and 2010. Because it contains so many variables, please refer to this link for the data dictionary.

# Load the "Ames Housing" dataset
AmesHousing <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/data-viz/data/AmesHousing.csv")

Question #5: How many properties are contained in this dataset? How many of them are single family (i.e., 1Fam building types)?

Question #6: Graph the variable “SalePrice” and describe its distribution. That is, identify if the distribution is skewed or approximately Normal, as well as if there are any outliers.

Question #7: Take the base-ten logarithm of the variable “SalePrice” and graph it (Hint: use the log function with the argument base = 10). How does this distribution compare to the one you saw in Question #6? What does a 1-unit change in the transformed variable mean in terms of sale price?

Question #8: Consider a property’s above ground living area, “GrLivArea”, as a predictor of “SalePrice”. Does this variable appear to be a strong or weak predictor of price? Briefly explain, using a graph to justify your answer.

Question #9: Consider whether a property contains central air conditioning, “CentralAir”, as a predictor of that property’s “SalePrice”. Does this variable appear to be a strong or weak predictor? Briefly explain, using a graph to justify your answer.

Question #10: Consider the year that a property was built, “YearBuilt”, as a predictor of “SalePrice”. A) How is this variable related to price? B) Does the overall quality of the property appear to be a confounding variable in this relationship? C) Can you identify another confounding variable in the relationship between “YearBuilt” and “SalePrice”? Please justify your answers using one or more appropriate graphs.

\(~\)

Going Further (optional)

If you’d like to learn more about making your visualizations more beautiful, some great resources are listed below: