This lab is intended to provide practice with the topics we covered during Week 1 of class (namely, data structures, univariate summaries and graphs, and bivariate summaries and graphs).

Directions (Please read before starting)

  1. You are put into groups when working through labs because you are expected to work together - this means talking through the logic, steps, code, etc. that are needed to progress through the lab.
  2. Every member of your group should submit a lab write-up, and each write-up should list the names of all group members at the top.
  3. Please take advantage of the opportunity to ask questions. Labs are meant to be formative and are intended to help you apply your “theoretical” knowledge of statistics and data analysis to more realistic applications - as an instructor I am happy to help with any aspect of this.


College Scorecard Data

The College Scorecard is a government-run database that stores institution-level data on all accredited colleges and universities in the United States. A new version of the data is published yearly and contains over 400 variables. This lab will use a simplified dataset from the 2019-2020 academic year.

Data Dictionary

The College19 Dataset is a reduced version of the 2019-2020 College Scorecard data that includes fewer variables and is filtered to include only primarily undergraduate institutions with at least 400 enrolled students. It contains the following variables:

  • Name - Name of the institution
  • City - City where the institution is located
  • State - State where the institution is located
  • Enrollment - Number of full-time enrolled students
  • Private - Binary indicator distinguishing public and private institutions
  • Region - Geographic region
  • Adm_Rate - Admissions rate, the proportion of applicants who are admitted
  • ACT_median - Median composite ACT score of enrolled students
  • ACT_Q1 - 25th percentile composite ACT score of enrolled students
  • ACT_Q3 - 75th percentile composite ACT score of enrolled students
  • Cost - Average yearly cost of attendance
  • Net_Tuition - Average tuition cost after discounts (scholarships, grants, etc.)
  • Avg_Fac_Salary - Average faculty salary
  • PercentFemale - Proportion of enrolled students who are female
  • PercentWhite - Proportion of enrolled students who identify as White
  • PercentBlack - Proportion of enrolled students who identify as Black
  • PercentHispanic - Proportion of enrolled students who identify as Hispanic
  • PercentAsian - Proportion of enrolled students who identify as Asian
  • FourYearComp_Males - Proportion of male students who go on to earn a degree within four years of their initial enrollment
  • FourYearComp_Females - Proportion of female students who go on to earn a degree within four years of their initial enrollment
  • Debt_median - Median student debt upon leaving the institution
  • Salary10yr_median - Median salary 10 years after graduating from the institution

First Steps

The first step in any data analysis is getting a high-level understanding of the data. This means describing the cases and how many there are, as well as looking over the variables and being aware of how they’re recorded.

Some helpful commands are shown below:

## Begin by reading in the data
colleges19 <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

## Print the dimensions
dim(colleges19)

## Print the "structure" of the data.frame
str(colleges19)

## Print the first 6 cases
head(colleges19)
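
As a rough sketch of how these pieces of information can also be pulled out one at a time, the example below uses a small made-up data frame. The name `toy_colleges` and all of its values are hypothetical stand-ins, not the real `colleges19` data.

```r
## Hypothetical stand-in for colleges19 (made-up values, for illustration only)
toy_colleges <- data.frame(
  Name = c("College A", "College B", "College C", "College D"),
  Private = c("Private", "Public", "Private", "Public"),
  Enrollment = c(1200, 15000, 800, 22000),
  stringsAsFactors = FALSE
)

## Number of cases (rows) and number of variables (columns)
n_cases <- nrow(toy_colleges)
n_vars <- ncol(toy_colleges)

## How each variable is stored: "character" usually signals a categorical
## variable, "numeric" or "integer" a quantitative one
var_types <- sapply(toy_colleges, class)
var_types
```

The same calls should work on `colleges19` once it has been read in, although how a variable is stored is only a hint about its type (a numeric code can still represent categories).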

Question #1: Copy the two commands given above that read in the data and print its dimensions into a code block under your header for Question #1 (i.e., the grey area between the ```{r} and ``` under a Question #1 header). Then, in the space beneath the code block intended for written responses, describe how many cases and variables are in these data, as well as what each case represents.


Graphical Exploration

Once you are comfortable with the contents of a dataset, your next step should be to identify variables of interest and explore their univariate distributions.

Univariate Graphs

To generate a specific type of graph, you will need to use the References/Examples provided in the last sections of this document. The questions that follow will help you practice this process (and being able to work from another person’s examples is an essential R programming skill).

Question #2: Identify one categorical variable and one quantitative variable that you think might be interesting. Then, using the examples mentioned above, create a univariate graph of the categorical variable you selected. Write 1-2 sentences describing the distribution of this variable.

Question #3: Create a univariate graph of the quantitative variable you selected in Question #2. Write 1-2 sentences describing the distribution of this variable. Be sure to comment on shape, center, and spread.


Bivariate Graphs

After studying the univariate distributions of the variables you are interested in, it is now appropriate to evaluate the possible association between those variables.

Question #4: Create a bivariate graph depicting the relationship between the two variables you selected in Question #2. Then, describe the association (if any) between the variables.

Question #5: What information did you obtain in Questions #2 or #3 that is important in understanding your variables, but is not contained in the graph you created in Question #4?

Question #6: Create a stacked conditional barchart showing the proportion of private colleges within each geographic region. Then, write 1-2 sentences describing whether or not these two variables appear associated.

Question #7: Create a scatterplot showing the relationship between cost and average faculty salary. Then, write 1-2 sentences describing whether or not these two variables appear associated. Be sure to address form, direction, and strength.


Multivariate Graphs

Bivariate relationships can be misleading, particularly for data like these. An increasingly important statistical skill is multivariate thinking, the process of understanding and describing the potentially complicated relationships between three or more variables.

Question #8: Create a scatterplot showing the relationship between cost and average faculty salary that colors the data-points by public/private. How does considering the variable “Private” impact your interpretation of the relationship between cost and average faculty salary?


Reporting Associations

After using data visualization to qualitatively identify whether two (or more) variables are associated, the next step is to summarize that association using descriptive statistics. The second portion of the References/Examples section provides examples on how to do this.

Question #9: Report a difference in means for two groups involved in the comparisons you made in Question #4. If your categorical variable has more than two groups, report the difference for the most interesting pairing of groups.

Question #10: Report the difference in conditional proportions using the proportion of public schools in the Rocky Mountain region minus the proportion of public schools in the Great Lakes region.

Question #11: Report the correlation coefficient between cost and average faculty salary. Then, subset/filter the data to obtain the correlation coefficient between cost and average faculty salary for only private colleges. Write 1-2 sentences describing why these two correlations are so different.


Extra Credit (Optional)

Question #12 (Extra Credit): Briefly explain how the ecological fallacy might apply to the scatterplot you created in Question #7.


References and examples

The sections that follow provide easy-to-follow examples of how to obtain some frequently requested graphs and summaries.

Whenever possible, code is shown for both ggplot2 graphics (preferred) and base R graphics (in case you have trouble installing ggplot2).

Visualization

## Read in the "Tips" data to use in these examples
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")

## If possible, try to use "ggplot" graphics
library(ggplot2)

Scatterplot (basic)

## ggplot
ggplot(example_data, aes(x = TotBill, y = Tip)) + geom_point()

## base R
plot(example_data$TotBill, example_data$Tip)

Scatterplot (color by group)

## ggplot
ggplot(example_data, aes(x = TotBill, y = Tip, color = Sex)) + geom_point()

## base R
plot(example_data$TotBill, example_data$Tip, col = factor(example_data$Sex))

Scatterplot (color by numeric)

## ggplot
ggplot(example_data, aes(x = TotBill, y = Tip, color = Size)) + geom_point()

## base R
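## heat.colors(n) builds a palette of n colors; index it by Size so each point gets the color for its party size
plot(example_data$TotBill, example_data$Tip, col = heat.colors(max(example_data$Size))[example_data$Size])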

Boxplot (simple)

## ggplot
ggplot(example_data, aes(x = Tip)) + geom_boxplot()

## base R
boxplot(example_data$Tip)

Boxplot (by group)

## ggplot
ggplot(example_data, aes(x = Tip, y = Time)) + geom_boxplot()

## base R
boxplot(Tip ~ Time, data = example_data)

Boxplot (by group, stratified)

## ggplot
ggplot(example_data, aes(x = Tip, y = Time)) + geom_boxplot() + facet_wrap(~Sex)

## base R
boxplot(Tip ~ Time + Sex, data = example_data)

Histogram

## ggplot
ggplot(example_data, aes(x = Tip)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## base R
hist(example_data$Tip)
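
The `stat_bin()` message above is a reminder that the default number of bins is rarely the best choice. Below is a minimal sketch of setting the bins yourself, using a made-up vector (`toy_tips` is hypothetical, not the Tips data). The ggplot2 portion only runs if the package is installed.

```r
## Hypothetical tip amounts (made-up values, not the Tips data)
toy_tips <- c(1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 5.0, 6.5, 10.0)

## base R: 'breaks' accepts explicit bin edges; plot = FALSE returns the
## bin counts without drawing (drop it to see the graph)
h <- hist(toy_tips, breaks = seq(0, 10, by = 2), plot = FALSE)
h$counts

## ggplot2 (only if installed): 'binwidth' sets the width of each bin and
## 'boundary' aligns the bin edges with 0, 2, 4, ...
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(Tip = toy_tips), aes(x = Tip)) +
    geom_histogram(binwidth = 2, boundary = 0)
}
```

Trying a few bin widths is usually worthwhile, since the apparent shape of a distribution can change with the choice.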

Barchart (simple)

## ggplot
ggplot(example_data, aes(x = Time)) + geom_bar()

## base R
barplot(table(example_data$Time)) ## notice the table() function

Barchart (stacked)

## ggplot
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar()

## base R
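## rows of the table (Sex) are stacked within its columns (Time), matching the ggplot version
barplot(table(example_data$Sex, example_data$Time), legend.text = TRUE) ## notice the table() function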

Barchart (conditional)

## ggplot
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill")

## base R
## margin = 2 makes each column (Time) sum to 1, matching the ggplot version
barplot(prop.table(table(example_data$Sex, example_data$Time), margin = 2), legend.text = TRUE) ## notice the prop.table() function

Barchart (conditional, stratified)

## ggplot
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill") + facet_wrap(~Day)

## Cannot be easily done in base R :/

Summarization

## Read in the "Tips" data to use in examples
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")

Default Numeric Summary

summary(example_data)
##     TotBill           Tip             Sex               Smoker         
##  Min.   : 3.07   Min.   : 1.000   Length:244         Length:244        
##  1st Qu.:13.35   1st Qu.: 2.000   Class :character   Class :character  
##  Median :17.80   Median : 2.900   Mode  :character   Mode  :character  
##  Mean   :19.79   Mean   : 2.998                                        
##  3rd Qu.:24.13   3rd Qu.: 3.562                                        
##  Max.   :50.81   Max.   :10.000                                        
##      Day                Time                Size     
##  Length:244         Length:244         Min.   :1.00  
##  Class :character   Class :character   1st Qu.:2.00  
##  Mode  :character   Mode  :character   Median :2.00  
##                                        Mean   :2.57  
##                                        3rd Qu.:3.00  
##                                        Max.   :6.00

After Subsetting

summary(example_data[example_data$Sex == "M",])  # Only male customers
##     TotBill           Tip            Sex               Smoker         
##  Min.   : 7.25   Min.   : 1.00   Length:157         Length:157        
##  1st Qu.:14.00   1st Qu.: 2.00   Class :character   Class :character  
##  Median :18.35   Median : 3.00   Mode  :character   Mode  :character  
##  Mean   :20.74   Mean   : 3.09                                        
##  3rd Qu.:24.71   3rd Qu.: 3.76                                        
##  Max.   :50.81   Max.   :10.00                                        
##      Day                Time                Size      
##  Length:157         Length:157         Min.   :1.000  
##  Class :character   Class :character   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :2.000  
##                                        Mean   :2.631  
##                                        3rd Qu.:3.000  
##                                        Max.   :6.000
summary(example_data[example_data$Tip > 5,])  # Only tips above $5
##     TotBill           Tip             Sex               Smoker         
##  Min.   : 7.25   Min.   : 5.070   Length:18          Length:18         
##  1st Qu.:26.46   1st Qu.: 5.178   Class :character   Class :character  
##  Median :30.16   Median : 5.885   Mode  :character   Mode  :character  
##  Mean   :31.94   Mean   : 6.273                                        
##  3rd Qu.:34.83   3rd Qu.: 6.650                                        
##  Max.   :50.81   Max.   :10.000                                        
##      Day                Time                Size      
##  Length:18          Length:18          Min.   :2.000  
##  Class :character   Class :character   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Median :4.000  
##                                        Mean   :3.667  
##                                        3rd Qu.:4.000  
##                                        Max.   :6.000
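
The same subsetting idea can be used to compare group means directly. The sketch below is one possible approach, shown on a small made-up data frame (`toy_data` and its values are hypothetical, not the Tips data).

```r
## Hypothetical stand-in for example_data (made-up values)
toy_data <- data.frame(
  Tip = c(2, 3, 4, 1, 2, 3),
  Sex = c("M", "M", "M", "F", "F", "F"),
  stringsAsFactors = FALSE
)

## Mean of Tip within each subset
mean_m <- mean(toy_data$Tip[toy_data$Sex == "M"])
mean_f <- mean(toy_data$Tip[toy_data$Sex == "F"])

## Difference in means (here: M minus F)
diff_means <- mean_m - mean_f
diff_means

## The same group means in a single call
group_means <- tapply(toy_data$Tip, toy_data$Sex, mean)
group_means
```

When reporting a difference in means, state the order of subtraction so the sign can be interpreted.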

Standard deviation

sd(example_data$Tip)
## [1] 1.383638

Percentiles

quantile(example_data$Tip, .9)
## 90% 
##   5
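
`quantile()` also accepts several probabilities at once, which is one way to get the quartiles and a measure of spread. A small sketch on made-up numbers (`toy_values` is hypothetical):

```r
## Hypothetical values (made up for illustration)
toy_values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

## 25th, 50th, and 75th percentiles in one call
qs <- quantile(toy_values, c(0.25, 0.5, 0.75))
qs

## Interquartile range: 75th percentile minus 25th percentile
iqr_val <- IQR(toy_values)
iqr_val
```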

Tables (one-way)

table(example_data$Day)
## 
## Fri Sat Sun Thu 
##  19  87  76  62

Tables (two-way)

table(example_data$Sex, example_data$Day)
##    
##     Fri Sat Sun Thu
##   F   9  28  18  32
##   M  10  59  58  30

Tables (row props)

my_table <- table(example_data$Sex, example_data$Day)
prop.table(my_table, margin = 1)  ## Change to "margin = 2" for columns
##    
##            Fri        Sat        Sun        Thu
##   F 0.10344828 0.32183908 0.20689655 0.36781609
##   M 0.06369427 0.37579618 0.36942675 0.19108280

Tables (column props)

my_table <- table(example_data$Sex, example_data$Day)
prop.table(my_table, margin = 2)
##    
##           Fri       Sat       Sun       Thu
##   F 0.4736842 0.3218391 0.2368421 0.5161290
##   M 0.5263158 0.6781609 0.7631579 0.4838710
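
Individual cells of a proportion table can be pulled out by row and column name, which is one way to compute a difference in conditional proportions. The sketch below uses a made-up data frame (`toy_data` and its values are hypothetical).

```r
## Hypothetical stand-in for example_data (made-up values)
toy_data <- data.frame(
  Sex = c("F", "M", "M", "M", "F", "F", "M", "M"),
  Day = c("Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"),
  stringsAsFactors = FALSE
)

## Column proportions: within each Day, the share of each Sex
col_props <- prop.table(table(toy_data$Sex, toy_data$Day), margin = 2)

## Proportion of "F" on Sat minus proportion of "F" on Sun
diff_props <- col_props["F", "Sat"] - col_props["F", "Sun"]
diff_props
```

Check which margin you conditioned on before subtracting. With `margin = 2` the two proportions being compared sit in the same row but in different columns.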

Tables (stratified/three-way)

table(example_data$Sex, example_data$Day, example_data$Time)
## , ,  = Day
## 
##    
##     Fri Sat Sun Thu
##   F   4   0   0  31
##   M   3   0   0  30
## 
## , ,  = Night
## 
##    
##     Fri Sat Sun Thu
##   F   5  28  18   1
##   M   7  59  58   0

Correlation

cor(example_data$TotBill, example_data$Tip)
## [1] 0.6757341

Correlation (stratified)

## First do the stratification
smokers <- example_data[example_data$Smoker == "Yes",]
non_smokers <- example_data[example_data$Smoker == "No",]

## Then calculate the correlations
cor(smokers$TotBill, smokers$Tip)
## [1] 0.4882179
cor(non_smokers$TotBill, non_smokers$Tip)
## [1] 0.8221826
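
As an alternative sketch, `split()` and `sapply()` can compute the correlation within every group in one step. The made-up data below (`toy_data` is hypothetical) are deliberately constructed so that the overall correlation and the within-group correlations disagree.

```r
## Hypothetical stand-in for example_data (made-up values)
toy_data <- data.frame(
  TotBill = c(10, 20, 30, 10, 20, 30),
  Tip = c(1, 2, 3, 3, 2, 1),
  Smoker = c("No", "No", "No", "Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

## Correlation ignoring the grouping variable
overall_cor <- cor(toy_data$TotBill, toy_data$Tip)
overall_cor

## Split into one data frame per group, then compute the correlation in each
group_cors <- sapply(split(toy_data, toy_data$Smoker),
                     function(d) cor(d$TotBill, d$Tip))
group_cors
```

In this toy example the two groups have perfectly linear but opposite relationships, so pooling them hides both. This is why the overall and stratified correlations are worth reporting side by side.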

Turning in Lab #2

Please submit your responses to the questions contained in Lab #2 via Canvas as a compiled .html file. Please let me know if you have trouble “knitting” your code, or if you think something is going wrong with the formatting.

As a general reminder, everyone should turn in their own copy of the lab, but you should include the names of the other members of your group (either in a comment or as authors in R Markdown).