dplyr
Directions (read before starting)
\(~\)
The dplyr
package contains several functions that are
widely used for data manipulation.
## Library
# install.packages("dplyr")
library(dplyr)
## Data for these examples
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
The main data preparation steps we’ll use in this class are:
dplyr
functions to act separately on groups of data (when their typical
behavior is to act on the entire data set)Additional details and examples of these will be given in the lab.
Another important feature of the dplyr
.
A unique feature of this package is the ability to use
piping to chain together a sequence of data preparation
steps. The pipe operator, %>%
, will pass forward the
object/output returned by a function into a subsequent function.
The follow code summarizes the 4-year cost of colleges in IA or MN without using piping:
## Summarize the 4-year 'Cost' of colleges in 'IA' or 'MN' w/o piping
ia_mn_colleges = filter(colleges, State %in% c("IA", "MN")) # Filter
ia_mn_colleges2 = mutate(ia_mn_colleges, Cost_4yrs = 4*Cost) # Create new var 'Cost_4yrs'
summarize(ia_mn_colleges2, # Calculate mean/sd of Cost_4yrs
mean_cost = mean(Cost_4yrs),
sd_cost = sd(Cost_4yrs))
## mean_cost sd_cost
## 1 158481.4 54980.78
Notice how we needed to store several intermediate objects that are only used once.
The code below achieves the same result using piping:
colleges %>%
filter(State %in% c("IA", "MN")) %>%
mutate(Cost_4yrs = 4*Cost) %>%
summarize(mean_cost = mean(Cost_4yrs), sd_cost = sd(Cost_4yrs))
## mean_cost sd_cost
## 1 158481.4 54980.78
Note that we could store the results of our pipeline as an object that can be used to create data visualizations or tables:
## Store the final output of the pipeline
ia_mn_cost_summary = colleges %>%
filter(State %in% c("IA", "MN")) %>%
mutate(Cost_4yrs = 4*Cost) %>%
summarize(mean_cost = mean(Cost_4yrs), sd_cost = sd(Cost_4yrs))
ia_mn_cost_summary[1] # Print first element of the output
## mean_cost
## 1 158481.4
At this point you should begin working independently with your assigned partner(s).
The first part of the lab will provide an overview of the data
preparation steps mentioned in the introduction. The dplyr
functions for these steps are summarized in the table below:
Function | Meaning |
---|---|
filter |
keep specific observations (i.e. specific rows) |
mutate |
add new derived columns to a data frame |
summarize |
aggregate many rows into a summary measure |
group_by |
perform subsequent operations separately on groups created by a categorical variable |
\(~\)
The purpose of filtering is to reduce a data set to only contain rows that meet a certain criteria. That criteria must be specified in the form of a logical expression. A few common logical operators are described in the table below:
Operator | Meaning |
---|---|
> |
greater than the given value |
< |
less than the given value |
== |
equal to the given value |
%in% |
equal to any element of the given vector |
Most of these are straightforward, the example below filters the
colleges
data to only include colleges with an ‘ACT_median’
of at least 30:
## Filter to 30+ ACT
act_30plus = colleges %>% filter(ACT_median > 29)
head(act_30plus)
## Name City State Enrollment Private Region
## 1 Amherst College Amherst MA 1855 Private New England
## 2 Babson College Wellesley MA 2361 Private New England
## 3 Barnard College New York NY 2557 Private Mid East
## 4 Bentley University Waltham MA 4177 Private New England
## 5 Binghamton University Vestal NY 13990 Public Mid East
## 6 Boston College Chestnut Hill MA 9639 Private New England
## Adm_Rate ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary
## 1 0.1281 33 31 31 71300 27541 116775
## 2 0.2441 30 28 28 68482 34365 146502
## 3 0.1392 32 30 30 71071 34147 124551
## 4 0.4321 30 28 28 66180 30307 126936
## 5 0.3981 30 28 28 25846 8187 88011
## 6 0.2789 33 31 31 70588 34327 137907
## PercentFemale PercentWhite PercentBlack PercentHispanic PercentAsian
## 1 0.5439093 0.7564 0.0986 0.1143 0.0703
## 2 0.4822134 0.8108 0.0668 0.0899 0.0586
## 3 NA 0.7115 0.1131 0.1382 0.0790
## 4 0.3995138 0.8769 0.0440 0.0528 0.0365
## 5 0.5177917 0.7664 0.0907 0.1061 0.0691
## 6 0.5189448 0.8172 0.0751 0.0810 0.0495
## FourYearComp_Males FourYearComp_Females Debt_median Salary10yr_median
## 1 0.8421053 0.8969072 11900 65000
## 2 NA NA 21875 96100
## 3 NA NA 16250 57900
## 4 0.8166259 0.8916968 23250 86900
## 5 0.6869496 0.7648980 16750 61600
## 6 0.8724584 0.9169329 16000 72500
You should be aware that we can combine multiple filtering conditions
into the same use of the filter()
function by specifying
each condition as its own argument:
## Filter to 30+ ACT and located in IA
ia_act_30plus = colleges %>% filter(ACT_median > 29, State == "IA")
head(ia_act_30plus)
## Name City State Enrollment Private Region Adm_Rate ACT_median
## 1 Grinnell College Grinnell IA 1683 Private Plains 0.2438 32
## ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale PercentWhite
## 1 30 30 65814 20369 101979 0.5348837 0.7933
## PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.0971 0.0841 0.04 0.7413793
## FourYearComp_Females Debt_median Salary10yr_median
## 1 0.8372093 15000 49100
The %in%
operator is slightly more complicated, as it
allows a logical condition to be satisfied by multiple values. Below is
a simple example that filters the data to include colleges with an
‘ACT_median’ of at least 30 that are located in either “IA” or “MN”:
## Filter to 30+ ACT and located in IA
ia_mn_act_30plus = colleges %>% filter(ACT_median > 29, State %in% c("IA", "MN"))
head(ia_mn_act_30plus)
## Name City State Enrollment Private Region Adm_Rate
## 1 Carleton College Northfield MN 2046 Private Plains 0.1984
## 2 Grinnell College Grinnell IA 1683 Private Plains 0.2438
## 3 Macalester College Saint Paul MN 2160 Private Plains 0.4124
## ACT_median ACT_Q1 ACT_Q3 Cost Net_Tuition Avg_Fac_Salary PercentFemale
## 1 33 31 31 68835 32952 109116 0.4594595
## 2 32 30 30 65814 20369 101979 0.5348837
## 3 31 29 29 66280 23225 95607 0.5933852
## PercentWhite PercentBlack PercentHispanic PercentAsian FourYearComp_Males
## 1 0.8066 0.0766 0.0760 0.0501 NA
## 2 0.7933 0.0971 0.0841 0.0400 0.7413793
## 3 0.8358 0.0623 0.0602 0.0409 0.8071429
## FourYearComp_Females Debt_median Salary10yr_median
## 1 NA 12163 54200
## 2 0.8372093 15000 49100
## 3 0.9123711 19432 47600
Question #1: Filter the colleges data to include only schools in the Plains and Great Lakes regions with enrollments less than 20,000. Then, create a one-way table summarizing the variable “Private” for only these colleges.
\(~\)
The purpose of mutating is to create new variable that is an function of one or more existing variables. The example below demonstrates this by creating the variable ‘debt_cost_ratio’ for colleges in IA:
## Notice the use of piping
ia_debt_cost = colleges %>%
filter(State == "IA") %>%
mutate(debt_cost_ratio = Debt_median/Cost)
head(ia_debt_cost$debt_cost_ratio) ## First few values of the ratio
## [1] 0.3241166 0.3782317 0.4612143 0.4789379 0.4106096 0.3964742
In general, any function or mathematical operator that acts on a
single variable can be used in mutate()
. One common use
case that you should know is how to create a new binary variable using
ifelse()
:
## Create selectivity var then make a one-way table
ia_sel = colleges %>%
filter(State == "IA") %>%
mutate(Selective = ifelse(Adm_Rate < 0.5, "selective", "not selective"))
table(ia_sel$Selective) ## one-way table
##
## not selective selective
## 26 1
ifelse()
is a logical
test, in this example the test is whether college’s admissions rate is
less than 0.5
TRUE
, in this example if a college has an admissions rate
less than 0.5 the resulting value is set to be
"selective"
FALSE
, in this case these colleges are given the value
"not selective"
Question #2:
IL
), then use
mutate()
and ifelse()
to create a new variable
called “large_sized” that has the value "large"
if a school
has an enrollment larger than 10,000 and "not large"
if the
enrollment is less than 10,000.large_sized
and Private
using
the data from Part A, then use the table to determine whether these two
variables are associated.\(~\)
The summarize()
function will calculate descriptive
statistics that require an aggregation of rows. For example:
colleges %>%
filter(State == "IA") %>%
summarize(Median_Cost = median(Cost))
## Median_Cost
## 1 43520
We can calculate any number of different statistics in same use of
summarize()
. The example below shows a customized set:
colleges %>%
filter(State == "IA") %>%
summarize(Min_Cost = min(Cost),
TenPer_Cost = quantile(Cost, 0.1), ## 10th percentile
Median_Cost = median(Cost),
NinetyPer_Cost = quantile(Cost, 0.9), ## 90th percentile
Max_Cost = max(Cost))
## Min_Cost TenPer_Cost Median_Cost NinetyPer_Cost Max_Cost
## 1 20476 22368.2 43520 54256.2 65814
Question #3:
summarize()
function, report the interquartile range (IQR) and standard deviation of
the variable “Enrollment” for all colleges located in the Great Lakes
and Plains regions (for all schools in these regions).\(~\)
Frequently, we’d like to perform parallel calculations for different subsets of data. For example, you might want to find the mean and standard deviation of the variable “Cost” separately for every region, or you might even want to find it separately for the private and public schools in each different state.
Group-wise manipulation involves two steps:
group_by()
function is used to internally
add grouping tags to the rows of your data. You will not see these tags,
but other tidyverse
functions will recognize and use
them.dplyr
functions (usually summarize()
), and
those functions are executed separately on each group.Shown below are a few examples.
In Example #1, we find the state-specific median cost of the colleges in each state located in the “Plains” region:
## Example #1
colleges %>%
filter(Region == "Plains") %>%
group_by(State) %>%
summarize(Median_Cost = median(Cost, na.rm = TRUE))
## # A tibble: 7 × 2
## State Median_Cost
## <chr> <dbl>
## 1 IA 43520
## 2 KS 38832
## 3 MN 35887
## 4 MO 30279
## 5 ND 19299
## 6 NE 29258.
## 7 SD 22609
In Example #2, we find each state’s median cost separately for
private and public colleges located in Iowa, Minnesota, or Missouri. We
also use n()
to count the number of colleges, recorded as
“N”, belonging to each group reported in the summary.
## Example #2
colleges %>%
filter(State %in% c("IA", "MN", "MO")) %>%
group_by(State, Private) %>%
summarize(Median_Cost = median(Cost, na.rm = TRUE),
N = n())
## # A tibble: 6 × 4
## # Groups: State [3]
## State Private Median_Cost N
## <chr> <chr> <dbl> <int>
## 1 IA Private 44206 25
## 2 IA Public 21295 4
## 3 MN Private 48860 24
## 4 MN Public 21416 12
## 5 MO Private 37788. 32
## 6 MO Public 19346 13
Notice the use of the argument na.rm = TRUE
in the
median()
function to remove missing values before
calculating the median. Without this argument it is impossible to
determine the median due to the presence of missing data.
Question #4: Find the mean, median, and standard deviation of the variable “Cost” separately for each region. Based upon these results, are “Cost” and “Region” associated? Briefly explain.