This lab is intended to provide practice with the topics we covered during Week 1 of class (namely, data structures, univariate summaries and graphs, and bivariate summaries and graphs).
The College Scorecard is a government-run database that stores institution-level data on all accredited colleges and universities in the United States. A new version of the data is published yearly and contains over 400 variables. This lab will use a simplified dataset from the 2019-2020 academic year.
The College19 Dataset is a reduced version of the 2019-2020 College Scorecard data that includes fewer variables and is filtered to include only primarily undergraduate institutions with at least 400 enrolled students. It contains the following variables:
The first step in any data analysis is getting a high-level understanding of the data. This means describing the cases and how many there are, as well as looking over the variables and being aware of how they’re recorded.
Some helpful commands are shown below:
## Begin by reading in the data
colleges19 <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
## Print the dimensions
dim(colleges19)
## Print the "structure" of the data.frame
str(colleges19)
## Print the first 6 cases
head(colleges19)
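A few other base R functions can help with this first look at the cases and variables. The sketch below is illustrative only: it uses a small made-up data frame (`toy`, with hypothetical columns `Name`, `Private`, and `Cost`) rather than the lab data.

```r
## A small, made-up data frame (hypothetical values, not the College Scorecard data)
toy <- data.frame(
  Name    = c("College A", "College B", "College C", "College D"),
  Private = c("Private", "Public", "Private", "Public"),
  Cost    = c(52000, 21000, 47000, 18500)
)

## Number of cases (rows) and number of variables (columns)
n_cases <- nrow(toy)
n_vars  <- ncol(toy)

## Names of the variables
var_names <- names(toy)

## How each variable is stored (numeric variables are typically quantitative)
var_types <- sapply(toy, class)

print(n_cases)
print(n_vars)
print(var_names)
print(var_types)
```

The same calls should work on any data frame, including one read in with `read.csv()`.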
Question #1: Copy the two commands given above that read in the data and print its dimensions into a code block under your header for Question #1 (i.e., the grey area between the ```{r} and ``` under a Question #1 header). Then, in the space beneath the code block intended for written responses, describe how many cases and variables are in these data, as well as what each case represents.
Once you are comfortable with the contents of a dataset, your next step should be to identify variables of interest and explore their univariate distributions.
To generate a specific type of graph, you will need to use the References/Examples provided in the last sections of this document. The questions that follow will help you practice this process (and being able to work from another person’s examples is an essential R programming skill).
Question #2: Identify one categorical variable and one quantitative variable that you think might be interesting. Then, using the examples mentioned above, create a univariate graph of the categorical variable you selected. Write 1-2 sentences describing the distribution of this variable.
Question #3: Create a univariate graph of the quantitative variable you selected in Question #2. Write 1-2 sentences describing the distribution of this variable. Be sure to comment on shape, center, and spread.
After studying the univariate distributions of the variables you are interested in, the next step is to evaluate the possible association between those variables.
Question #4: Create a bivariate graph depicting the relationship between the two variables you selected in Question #2. Then, describe the association (if any) between the variables.
Question #5: What information did you obtain in Questions #2 or #3 that is important in understanding your variables, but is not contained in the graph you created in Question #4?
Question #6: Create a stacked conditional barchart showing the proportion of private colleges within each geographic region. Then, write 1-2 sentences describing whether or not these two variables appear associated.
Question #7: Create a scatterplot showing the relationship between cost and average faculty salary. Then, write 1-2 sentences describing whether or not these two variables appear associated. Be sure to address form, direction, and strength.
Bivariate relationships can be misleading, particularly for data like these. An increasingly important statistical skill is multivariate thinking, the process of understanding and describing the potentially complicated relationships between three or more variables.
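One way to see how a third variable can change a bivariate picture is with simulated data. The sketch below is not based on the College Scorecard data: the groups `A` and `B`, the variables `x` and `y`, and all parameter values are made up for illustration. The two groups differ in their typical `x` and `y` values, but within each group `x` and `y` are generated independently.

```r
## Simulated (hypothetical) data: two groups with different centers,
## but no built-in relationship between x and y within either group
set.seed(1)
n <- 200
group <- rep(c("A", "B"), each = n)
x <- c(rnorm(n, mean = 10, sd = 1), rnorm(n, mean = 20, sd = 1))
y <- c(rnorm(n, mean = 5,  sd = 1), rnorm(n, mean = 15, sd = 1))
sim <- data.frame(group, x, y)

## Correlation ignoring the grouping variable
overall_cor <- cor(sim$x, sim$y)

## Correlation computed separately within each group
within_cor <- sapply(split(sim, sim$group), function(d) cor(d$x, d$y))

print(overall_cor)
print(within_cor)
```

In this setup the overall correlation is driven almost entirely by the difference between the groups, so it is strong even though the within-group correlations are close to zero.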
Question #8: Create a scatterplot showing the relationship between cost and average faculty salary that colors the data-points by public/private. How does considering the variable “Private” impact your interpretation of the relationship between cost and average faculty salary?
After using data visualization to qualitatively identify whether two (or more) variables are associated, the next step is to summarize that association using descriptive statistics. The second portion of the References/Examples section provides examples on how to do this.
Question #9: Report a difference in means for two groups involved in the comparisons you made in Question #4. If your categorical variable has more than two groups, report the difference for the most interesting pairing of groups.
Question #10: Report the difference in conditional proportions using the proportion of public schools in the Rocky Mountain region minus the proportion of public schools in the Great Lakes region.
Question #11: Report the correlation coefficient between cost and average faculty salary. Then, subset/filter the data to obtain the correlation coefficient between cost and average faculty salary for only private colleges. Write 1-2 sentences describing why these two correlations are so different.
Question #12 (Extra Credit): Briefly explain how the ecological fallacy might apply to the scatterplot you created in Question #7.
The sections that follow provide easy-to-follow examples of how to obtain some frequently sought-after graphs or summary information.
Whenever possible, code is shown for both ggplot2 graphics (preferred) and base R graphics (in case you have trouble installing ggplot2).
## Read in the "Tips" data to use in these examples
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")
## If possible, try to use "ggplot" graphics
library(ggplot2)
## Scatterplot (ggplot)
ggplot(example_data, aes(x = TotBill, y = Tip)) + geom_point()
## base R
plot(example_data$TotBill, example_data$Tip)
## Scatterplot colored by a categorical variable (ggplot)
ggplot(example_data, aes(x = TotBill, y = Tip, color = Sex)) + geom_point()
## base R
plot(example_data$TotBill, example_data$Tip, col = factor(example_data$Sex))
## Scatterplot colored by a quantitative variable (ggplot)
ggplot(example_data, aes(x = TotBill, y = Tip, color = Size)) + geom_point()
## base R
plot(example_data$TotBill, example_data$Tip, col = heat.colors(max(example_data$Size))[example_data$Size]) ## one color per value of Size
## Boxplot of one quantitative variable (ggplot)
ggplot(example_data, aes(x = Tip)) + geom_boxplot()
## base R
boxplot(example_data$Tip)
## Side-by-side boxplots by a categorical variable (ggplot)
ggplot(example_data, aes(x = Tip, y = Time)) + geom_boxplot()
## base R
boxplot(Tip ~ Time, data = example_data)
## Boxplots by two categorical variables (ggplot)
ggplot(example_data, aes(x = Tip, y = Time)) + geom_boxplot() + facet_wrap(~Sex)
## base R
boxplot(Tip ~ Time + Sex, data = example_data)
## Histogram (ggplot)
ggplot(example_data, aes(x = Tip)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## base R
hist(example_data$Tip)
## Bar chart (ggplot)
ggplot(example_data, aes(x = Time)) + geom_bar()
## base R
barplot(table(example_data$Time)) ## notice the table() function
## Stacked bar chart (ggplot)
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar()
## base R
barplot(table(example_data$Sex, example_data$Time)) ## notice the table() function; one bar per Time, stacked by Sex
## Stacked conditional bar chart (ggplot)
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill")
## base R
barplot(prop.table(table(example_data$Sex, example_data$Time), margin = 2)) ## notice the prop.table() function; proportions of Sex within each Time
## Stacked conditional bar chart, faceted by a third variable (ggplot)
ggplot(example_data, aes(x = Time, fill = Sex)) + geom_bar(position = "fill") + facet_wrap(~Day)
## Cannot be easily done in base R :/
## Read in the "Tips" data to use in examples
example_data <- read.csv("https://remiller1450.github.io/data/Tips.csv")
summary(example_data)
## TotBill Tip Sex Smoker
## Min. : 3.07 Min. : 1.000 Length:244 Length:244
## 1st Qu.:13.35 1st Qu.: 2.000 Class :character Class :character
## Median :17.80 Median : 2.900 Mode :character Mode :character
## Mean :19.79 Mean : 2.998
## 3rd Qu.:24.13 3rd Qu.: 3.562
## Max. :50.81 Max. :10.000
## Day Time Size
## Length:244 Length:244 Min. :1.00
## Class :character Class :character 1st Qu.:2.00
## Mode :character Mode :character Median :2.00
## Mean :2.57
## 3rd Qu.:3.00
## Max. :6.00
summary(example_data[example_data$Sex == "M",]) # Only male customers
## TotBill Tip Sex Smoker
## Min. : 7.25 Min. : 1.00 Length:157 Length:157
## 1st Qu.:14.00 1st Qu.: 2.00 Class :character Class :character
## Median :18.35 Median : 3.00 Mode :character Mode :character
## Mean :20.74 Mean : 3.09
## 3rd Qu.:24.71 3rd Qu.: 3.76
## Max. :50.81 Max. :10.00
## Day Time Size
## Length:157 Length:157 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.000
## Mean :2.631
## 3rd Qu.:3.000
## Max. :6.000
summary(example_data[example_data$Tip > 5,]) # Only tips above $5
## TotBill Tip Sex Smoker
## Min. : 7.25 Min. : 5.070 Length:18 Length:18
## 1st Qu.:26.46 1st Qu.: 5.178 Class :character Class :character
## Median :30.16 Median : 5.885 Mode :character Mode :character
## Mean :31.94 Mean : 6.273
## 3rd Qu.:34.83 3rd Qu.: 6.650
## Max. :50.81 Max. :10.000
## Day Time Size
## Length:18 Length:18 Min. :2.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :4.000
## Mean :3.667
## 3rd Qu.:4.000
## Max. :6.000
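The logical subsetting used in the `summary()` calls above can combine several conditions with `&` (and) or `|` (or). A minimal sketch on a made-up data frame follows; the values are hypothetical, and the column names mirror the Tips data only for familiarity.

```r
## A small, made-up data frame (hypothetical values)
toy <- data.frame(
  Tip = c(1.5, 6.0, 3.0, 7.5, 2.0, 5.5),
  Sex = c("M", "M", "F", "F", "M", "F"),
  Day = c("Sat", "Sun", "Sat", "Sun", "Thu", "Thu")
)

## Rows where BOTH conditions hold: male customers with tips above $5
male_big <- toy[toy$Sex == "M" & toy$Tip > 5, ]

## Rows where EITHER condition holds: Saturday or Sunday
weekend <- toy[toy$Day == "Sat" | toy$Day == "Sun", ]

## subset() is a base R alternative to bracket indexing
weekend2 <- subset(toy, Day %in% c("Sat", "Sun"))

print(male_big)
print(nrow(weekend))
```

Note the comma before the closing bracket: it selects rows and keeps every column.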
sd(example_data$Tip)
## [1] 1.383638
quantile(example_data$Tip, .9)
## 90%
## 5
table(example_data$Day)
##
## Fri Sat Sun Thu
## 19 87 76 62
table(example_data$Sex, example_data$Day)
##
## Fri Sat Sun Thu
## F 9 28 18 32
## M 10 59 58 30
my_table <- table(example_data$Sex, example_data$Day)
prop.table(my_table, margin = 1) ## Change to "margin = 2" for columns
##
## Fri Sat Sun Thu
## F 0.10344828 0.32183908 0.20689655 0.36781609
## M 0.06369427 0.37579618 0.36942675 0.19108280
my_table <- table(example_data$Sex, example_data$Day)
prop.table(my_table, margin = 2)
##
## Fri Sat Sun Thu
## F 0.4736842 0.3218391 0.2368421 0.5161290
## M 0.5263158 0.6781609 0.7631579 0.4838710
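A difference in conditional proportions can be computed by indexing the output of `prop.table()` by row and column name. The following is a minimal sketch under assumed data: the data frame `toy` and its counts are made up and are not the Tips data.

```r
## A small, made-up data frame (hypothetical counts)
toy <- data.frame(
  Sex = c(rep("F", 10), rep("M", 10)),
  Day = c(rep("Sat", 4), rep("Sun", 6),   # 4 of the 10 "F" rows are Sat
          rep("Sat", 7), rep("Sun", 3))   # 7 of the 10 "M" rows are Sat
)

## Column proportions: within each Day, the share of each Sex
col_props <- prop.table(table(toy$Sex, toy$Day), margin = 2)

## Proportion of "M" among Sat rows minus proportion of "M" among Sun rows
diff_props <- col_props["M", "Sat"] - col_props["M", "Sun"]

print(col_props)
print(diff_props)
```

The choice of `margin` matters here: `margin = 2` conditions on the column variable, so each column of `col_props` sums to 1.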
table(example_data$Sex, example_data$Day, example_data$Time)
## , , = Day
##
##
## Fri Sat Sun Thu
## F 4 0 0 31
## M 3 0 0 30
##
## , , = Night
##
##
## Fri Sat Sun Thu
## F 5 28 18 1
## M 7 59 58 0
cor(example_data$TotBill, example_data$Tip)
## [1] 0.6757341
## First do the stratification
smokers <- example_data[example_data$Smoker == "Yes",]
non_smokers <- example_data[example_data$Smoker == "No",]
## Then calculate the correlations
cor(smokers$TotBill, smokers$Tip)
## [1] 0.4882179
cor(non_smokers$TotBill, non_smokers$Tip)
## [1] 0.8221826
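The examples above do not include a difference in means between groups. One possible approach uses `tapply()` to get the mean of a quantitative variable within each level of a categorical variable. The sketch below uses a made-up data frame with hypothetical values.

```r
## A small, made-up data frame (hypothetical values)
toy <- data.frame(
  Tip  = c(2, 3, 4, 3, 5, 7),
  Time = c("Day", "Day", "Day", "Night", "Night", "Night")
)

## Mean of Tip within each level of Time (result is named by group)
group_means <- tapply(toy$Tip, toy$Time, mean)
print(group_means)

## Difference in means: Night minus Day
diff_means <- unname(group_means["Night"] - group_means["Day"])
print(diff_means)
```

An alternative is to subset the data by group, as in the stratified correlation example, and call `mean()` on each subset.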
Please submit your responses to the questions contained in Lab #2 via Canvas as a compiled .html file. Please let me know if you have trouble “knitting” your code, or if you think something is going wrong with the formatting.
As a general reminder, everyone should turn in their own copy of the lab, but you should include the names of the other members of your group (either in a comment or as authors in R Markdown).