Lab #4 - Comparative summaries and dplyr

Directions (read before starting)

Please work together with your assigned partner. Make sure you both fully understand something before moving on.
Record your answers to lab questions separately from the lab’s examples. You and your partner should only turn in responses to lab questions, nothing more and nothing less.
Ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Onboarding

The dplyr package contains several functions that are widely used for data manipulation.

## Library
# install.packages("dplyr")
library(dplyr)

## Data for these examples
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")

The main data preparation steps we’ll use in this class are:

Filtering - reducing our data set to only include rows that meet a certain criteria
Mutating - creating new variables by transforming existing ones
Summarizing - calculating descriptive statistics
Grouping - instructing subsequent dplyr functions to act separately on groups of data (when their typical behavior is to act on the entire data set)

Additional details and examples of these will be given in the lab. Another important feature of the dplyr.

A unique feature of this package is the ability to use piping to chain together a sequence of data preparation steps. The pipe operator, %>%, will pass forward the object/output returned by a function into a subsequent function.

The follow code summarizes the 4-year cost of colleges in IA or MN without using piping:

## Summarize the 4-year 'Cost' of colleges in 'IA' or 'MN' w/o piping
ia_mn_colleges = filter(colleges, State %in% c("IA", "MN")) # Filter
ia_mn_colleges2 = mutate(ia_mn_colleges, Cost_4yrs = 4*Cost) # Create new var 'Cost_4yrs'
summarize(ia_mn_colleges2,   # Calculate mean/sd of Cost_4yrs
          mean_cost = mean(Cost_4yrs), 
          sd_cost = sd(Cost_4yrs))

##   mean_cost  sd_cost
## 1  158481.4 54980.78

Notice how we needed to store several intermediate objects that are only used once.

The code below achieves the same result using piping:

colleges %>% 
  filter(State %in% c("IA", "MN")) %>%
  mutate(Cost_4yrs = 4*Cost) %>%
  summarize(mean_cost = mean(Cost_4yrs), sd_cost = sd(Cost_4yrs))

##   mean_cost  sd_cost
## 1  158481.4 54980.78

Note that we could store the results of our pipeline as an object that can be used to create data visualizations or tables:

## Store the final output of the pipeline
ia_mn_cost_summary = colleges %>% 
  filter(State %in% c("IA", "MN")) %>%
  mutate(Cost_4yrs = 4*Cost) %>%
  summarize(mean_cost = mean(Cost_4yrs), sd_cost = sd(Cost_4yrs))

ia_mn_cost_summary[1] # Print first element of the output

##   mean_cost
## 1  158481.4

Lab

At this point you should begin working independently with your assigned partner(s).

The first part of the lab will provide an overview of the data preparation steps mentioned in the introduction. The dplyr functions for these steps are summarized in the table below:

Function	Meaning
`filter`	keep specific observations (i.e. specific rows)
`mutate`	add new derived columns to a data frame
`summarize`	aggregate many rows into a summary measure
`group_by`	perform subsequent operations separately on groups created by a categorical variable

\(~\)

Filtering

The purpose of filtering is to reduce a data set to only contain rows that meet a certain criteria. That criteria must be specified in the form of a logical expression. A few common logical operators are described in the table below:

Operator	Meaning
`>`	greater than the given value
`<`	less than the given value
`==`	equal to the given value
`%in%`	equal to any element of the given vector

Most of these are straightforward, the example below filters the colleges data to only include colleges with an ‘ACT_median’ of at least 30:

## Filter to 30+ ACT
act_30plus = colleges %>% filter(ACT_median > 29)
head(act_30plus)

##                    Name          City State Enrollment Private      Region
## 1       Amherst College       Amherst    MA       1855 Private New England
## 2        Babson College     Wellesley    MA       2361 Private New England
## 3       Barnard College      New York    NY       2557 Private    Mid East
## 4    Bentley University       Waltham    MA       4177 Private New England
## 5 Binghamton University        Vestal    NY      13990  Public    Mid East
## 6        Boston College Chestnut Hill    MA       9639 Private New England
##   Adm_Rate ACT_median ACT_Q1 ACT_Q3  Cost Net_Tuition Avg_Fac_Salary
## 1   0.1281         33     31     31 71300       27541         116775
## 2   0.2441         30     28     28 68482       34365         146502
## 3   0.1392         32     30     30 71071       34147         124551
## 4   0.4321         30     28     28 66180       30307         126936
## 5   0.3981         30     28     28 25846        8187          88011
## 6   0.2789         33     31     31 70588       34327         137907
##   PercentFemale PercentWhite PercentBlack PercentHispanic PercentAsian
## 1     0.5439093       0.7564       0.0986          0.1143       0.0703
## 2     0.4822134       0.8108       0.0668          0.0899       0.0586
## 3            NA       0.7115       0.1131          0.1382       0.0790
## 4     0.3995138       0.8769       0.0440          0.0528       0.0365
## 5     0.5177917       0.7664       0.0907          0.1061       0.0691
## 6     0.5189448       0.8172       0.0751          0.0810       0.0495
##   FourYearComp_Males FourYearComp_Females Debt_median Salary10yr_median
## 1          0.8421053            0.8969072       11900             65000
## 2                 NA                   NA       21875             96100
## 3                 NA                   NA       16250             57900
## 4          0.8166259            0.8916968       23250             86900
## 5          0.6869496            0.7648980       16750             61600
## 6          0.8724584            0.9169329       16000             72500

You should be aware that we can combine multiple filtering conditions into the same use of the filter() function by specifying each condition as its own argument:

## Filter to 30+ ACT and located in IA
ia_act_30plus = colleges %>% filter(ACT_median > 29, State == "IA")
head(ia_act_30plus)

##               Name     City State Enrollment Private Region Adm_Rate ACT_median
## 1 Grinnell College Grinnell    IA       1683 Private Plains   0.2438         32
##   ACT_Q1 ACT_Q3  Cost Net_Tuition Avg_Fac_Salary PercentFemale PercentWhite
## 1     30     30 65814       20369         101979     0.5348837       0.7933
##   PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1       0.0971          0.0841         0.04          0.7413793
##   FourYearComp_Females Debt_median Salary10yr_median
## 1            0.8372093       15000             49100

The %in% operator is slightly more complicated, as it allows a logical condition to be satisfied by multiple values. Below is a simple example that filters the data to include colleges with an ‘ACT_median’ of at least 30 that are located in either “IA” or “MN”:

## Filter to 30+ ACT and located in IA
ia_mn_act_30plus = colleges %>% filter(ACT_median > 29, State %in% c("IA", "MN"))
head(ia_mn_act_30plus)

##                 Name       City State Enrollment Private Region Adm_Rate
## 1   Carleton College Northfield    MN       2046 Private Plains   0.1984
## 2   Grinnell College   Grinnell    IA       1683 Private Plains   0.2438
## 3 Macalester College Saint Paul    MN       2160 Private Plains   0.4124
##   ACT_median ACT_Q1 ACT_Q3  Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1         33     31     31 68835       32952         109116     0.4594595
## 2         32     30     30 65814       20369         101979     0.5348837
## 3         31     29     29 66280       23225          95607     0.5933852
##   PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1       0.8066       0.0766          0.0760       0.0501                 NA
## 2       0.7933       0.0971          0.0841       0.0400          0.7413793
## 3       0.8358       0.0623          0.0602       0.0409          0.8071429
##   FourYearComp_Females Debt_median Salary10yr_median
## 1                   NA       12163             54200
## 2            0.8372093       15000             49100
## 3            0.9123711       19432             47600

Question #1: Filter the colleges data to include only schools in the Plains and Great Lakes regions with enrollments less than 20,000. Then, create a one-way table summarizing the variable “Private” for only these colleges.

\(~\)

Mutating

The purpose of mutating is to create new variable that is an function of one or more existing variables. The example below demonstrates this by creating the variable ‘debt_cost_ratio’ for colleges in IA:

## Notice the use of piping
ia_debt_cost = colleges %>% 
                 filter(State == "IA") %>%
                 mutate(debt_cost_ratio = Debt_median/Cost)
head(ia_debt_cost$debt_cost_ratio)  ## First few values of the ratio

## [1] 0.3241166 0.3782317 0.4612143 0.4789379 0.4106096 0.3964742

In general, any function or mathematical operator that acts on a single variable can be used in mutate(). One common use case that you should know is how to create a new binary variable using ifelse():

## Create selectivity var then make a one-way table
ia_sel = colleges %>% 
                 filter(State == "IA") %>%
                 mutate(Selective = ifelse(Adm_Rate < 0.5, "selective", "not selective"))
table(ia_sel$Selective)  ## one-way table

## 
## not selective     selective 
##            26             1

The first argument inside of ifelse() is a logical test, in this example the test is whether college’s admissions rate is less than 0.5
- The second argument is what should happen if the test is TRUE, in this example if a college has an admissions rate less than 0.5 the resulting value is set to be "selective"
- The third argument is what happens when the test is FALSE, in this case these colleges are given the value "not selective"

Question #2:

Part A: Filter the colleges data to only include schools in Illinois (abbreviated IL), then use mutate() and ifelse() to create a new variable called “large_sized” that has the value "large" if a school has an enrollment larger than 10,000 and "not large" if the enrollment is less than 10,000.
Part B: Create a two-way frequency table showing the variables large_sized and Private using the data from Part A, then use the table to determine whether these two variables are associated.

\(~\)

Summarizing

The summarize() function will calculate descriptive statistics that require an aggregation of rows. For example:

colleges %>% 
  filter(State == "IA") %>%
  summarize(Median_Cost = median(Cost))

##   Median_Cost
## 1       43520

We can calculate any number of different statistics in same use of summarize(). The example below shows a customized set:

colleges %>% 
  filter(State == "IA") %>%
  summarize(Min_Cost = min(Cost),
            TenPer_Cost = quantile(Cost, 0.1),     ## 10th percentile
            Median_Cost = median(Cost),
            NinetyPer_Cost = quantile(Cost, 0.9),  ## 90th percentile
            Max_Cost = max(Cost))

##   Min_Cost TenPer_Cost Median_Cost NinetyPer_Cost Max_Cost
## 1    20476     22368.2       43520        54256.2    65814

Question #3:

Part A: Using the summarize() function, report the interquartile range (IQR) and standard deviation of the variable “Enrollment” for all colleges located in the Great Lakes and Plains regions (for all schools in these regions).
Part B: Create a data visualization showing the distribution of Enrollment for the colleges used in Part A. Based upon the shape of this distribution, is the IQR or standard deviation a more appropriate way to describe the variable’s spread?

\(~\)

Grouped Summarization

Frequently, we’d like to perform parallel calculations for different subsets of data. For example, you might want to find the mean and standard deviation of the variable “Cost” separately for every region, or you might even want to find it separately for the private and public schools in each different state.

Group-wise manipulation involves two steps:

First, the group_by() function is used to internally add grouping tags to the rows of your data. You will not see these tags, but other tidyverse functions will recognize and use them.
Next, the tagged data are passed into any of the aforementioned dplyr functions (usually summarize()), and those functions are executed separately on each group.

Shown below are a few examples.

In Example #1, we find the state-specific median cost of the colleges in each state located in the “Plains” region:

## Example #1
colleges %>% 
  filter(Region == "Plains") %>%
  group_by(State) %>%
  summarize(Median_Cost = median(Cost, na.rm = TRUE))

## # A tibble: 7 × 2
##   State Median_Cost
##   <chr>       <dbl>
## 1 IA         43520 
## 2 KS         38832 
## 3 MN         35887 
## 4 MO         30279 
## 5 ND         19299 
## 6 NE         29258.
## 7 SD         22609

In Example #2, we find each state’s median cost separately for private and public colleges located in Iowa, Minnesota, or Missouri. We also use n() to count the number of colleges, recorded as “N”, belonging to each group reported in the summary.

## Example #2
colleges %>% 
  filter(State %in% c("IA", "MN", "MO")) %>%
  group_by(State, Private) %>%
  summarize(Median_Cost = median(Cost, na.rm = TRUE),
            N = n())

## # A tibble: 6 × 4
## # Groups:   State [3]
##   State Private Median_Cost     N
##   <chr> <chr>         <dbl> <int>
## 1 IA    Private      44206     25
## 2 IA    Public       21295      4
## 3 MN    Private      48860     24
## 4 MN    Public       21416     12
## 5 MO    Private      37788.    32
## 6 MO    Public       19346     13

Notice the use of the argument na.rm = TRUE in the median() function to remove missing values before calculating the median. Without this argument it is impossible to determine the median due to the presence of missing data.

Question #4: Find the mean, median, and standard deviation of the variable “Cost” separately for each region. Based upon these results, are “Cost” and “Region” associated? Briefly explain.