This lab briefly introduces a few miscellaneous topics related to data processing.
# Please load the following packages (installing if necessary)
library(dplyr)
library(readr)
In R, you can create your own custom function using function() and the
assignment operator. The example below creates a function that returns
the squared distance between the mean and median of a numeric input
variable, “x”:
my_fun <- function(x = 99){
  d2 <- (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))^2
  return(d2)
}
## Try the function
ex_data <- c(0,2,10)
my_fun(x = ex_data)
## [1] 4
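As a minimal sketch of two properties of this function, the chunk below repeats the definition so that it runs on its own, calls my_fun() without an argument, and checks whether d2 is visible outside the function. The names default_result and d2_visible are illustrative, and the check assumes no object named d2 has been created elsewhere in your session.

```r
## Same definition as in the lab, repeated so this chunk is self-contained
my_fun <- function(x = 99){
  d2 <- (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))^2
  return(d2)
}

## No argument supplied, so the default x = 99 is used;
## the mean and median of a single value coincide
default_result <- my_fun()
print(default_result)

## d2 was created inside the function, so it should not be found
## in the global environment (assumes no other "d2" was created there)
d2_visible <- exists("d2", envir = globalenv(), inherits = FALSE)
print(d2_visible)
```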
x = 99 dictates that the function should have an argument “x”, and
specifies a default value of 99. If a user of the function doesn’t
supply their own “x” argument, the function will use the value 99.
return() defines the function’s output; in this example, the function
will output the object d2.
x and d2 only exist internally while the function is being used.
Question #1: Create a function named top_cat() that accepts a
categorical variable, “y”, and returns the frequency of the most
frequent category. You can accomplish this using the table() and max()
functions. Next, include code that tests your function on the variable
“Region” in the colleges data set (the correct output should be 400).
Custom functions are most useful when you’d like to repeat an action several times with slightly different inputs. For example, you might want to find the squared distance between the mean and median for every numeric variable in a data set.
A for loop is one way to iterate through the relevant portions of a data frame:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
num_data <- colleges %>% select_if(is.numeric) ## Selects all columns that return TRUE to is.numeric
for(i in 1:ncol(num_data)){
  print(my_fun(num_data[,i]))
}
## [1] 10274337
## [1] 0.0006547193
## [1] 0.5254721
## [1] 0.6265691
## [1] 0.6265691
## [1] 6919279
## [1] 1653630
## [1] 12303334
## [1] 6.353848e-06
## [1] 0.001075313
## [1] 0.001652711
## [1] 0.002807293
## [1] 9.5804e-05
## [1] 5.645276e-05
## [1] 1.662577e-05
## [1] 1.995113
## [1] 2548162
The code for(i in 1:ncol(num_data)) will create a variable, “i”, which
will increment along the sequence 1:ncol(num_data) each time the loop is
run. This allows the ith column to be given as an input to the my_fun()
function during the ith repetition of the loop.
The example above merely prints the output of my_fun() for each numeric
column; however, we’ll generally want to store these results. To do
this, we can set up an empty object before the loop and assign values
into one of its positions on each iteration:
## Create a blank numeric vector with a length equal to "ncol(num_data)"
colleges_d2 <- numeric(length = ncol(num_data))
## Loop and store
for(i in 1:ncol(num_data)){
  colleges_d2[i] <- my_fun(num_data[,i])
}
print(colleges_d2)
## [1] 1.027434e+07 6.547193e-04 5.254721e-01 6.265691e-01 6.265691e-01
## [6] 6.919279e+06 1.653630e+06 1.230333e+07 6.353848e-06 1.075313e-03
## [11] 1.652711e-03 2.807293e-03 9.580400e-05 5.645276e-05 1.662577e-05
## [16] 1.995113e+00 2.548162e+06
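One caveat with a sequence like 1:ncol(num_data): if a data frame has zero columns, 1:0 counts down as 1, 0 and the loop body still runs. A minimal sketch of a more defensive variant uses seq_len(), shown here on a small made-up data frame so that it runs without the colleges data (toy_data and toy_d2 are illustrative names):

```r
## Same function as in the lab, repeated so this chunk runs on its own
my_fun <- function(x = 99){
  d2 <- (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))^2
  return(d2)
}

## Made-up data for illustration only
toy_data <- data.frame(a = c(0, 2, 10), b = c(1, 2, 3))

## seq_len(n) gives 1, 2, ..., n and is empty when n is 0
toy_d2 <- numeric(length = ncol(toy_data))
for(i in seq_len(ncol(toy_data))){
  toy_d2[i] <- my_fun(toy_data[, i])
}

## Label each stored result with its column name
names(toy_d2) <- colnames(toy_data)
print(toy_d2)

## With zero columns, 1:0 has length 2 while seq_len(0) has length 0
length(1:0)
length(seq_len(0))
```

Naming the stored vector also makes the printed result easier to read than a bare sequence of numbers.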
Question #2: Write code that uses a for loop to store the frequency of
the most common category in each of the first three columns of the
“colleges” data set, using the function you created in Question #1.
Then, print the object storing these frequencies (the correct result
should be 3 24 127).
The apply() function can sometimes be used as an alternative to a for
loop. Consider the example below:
apply(num_data, MARGIN = 2, FUN = my_fun)
## Enrollment Adm_Rate ACT_median
## 1.027434e+07 6.547193e-04 5.254721e-01
## ACT_Q1 ACT_Q3 Cost
## 6.265691e-01 6.265691e-01 6.919279e+06
## Net_Tuition Avg_Fac_Salary PercentFemale
## 1.653630e+06 1.230333e+07 6.353848e-06
## PercentWhite PercentBlack PercentHispanic
## 1.075313e-03 1.652711e-03 2.807293e-03
## PercentAsian FourYearComp_Males FourYearComp_Females
## 9.580400e-05 5.645276e-05 1.662577e-05
## Debt_median Salary10yr_median
## 1.995113e+00 2.548162e+06
The argument MARGIN dictates which of the object’s dimensions the
function should be iterated across, with MARGIN = 1 going through the
rows, and MARGIN = 2 going through the columns.
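The difference between the two MARGIN values is easiest to see on a small object. The sketch below uses a made-up 2-by-3 matrix (m, row_sums, and col_sums are illustrative names) with sum() as the function being applied:

```r
## A small made-up matrix: 2 rows, 3 columns
m <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)

## MARGIN = 1 applies the function to each row
row_sums <- apply(m, MARGIN = 1, FUN = sum)
print(row_sums)   ## one value per row: 6 15

## MARGIN = 2 applies the function to each column
col_sums <- apply(m, MARGIN = 2, FUN = sum)
print(col_sums)   ## one value per column: 5 7 9
```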
Question #3: Use apply() to find the frequency of the most common
category for the first three columns in the “colleges” data set. You
should prepare your input data beforehand by using either the select()
function or subsetting using indices.
Consider an experiment where the researchers record information on
each participant in a separate Excel file. Efficiently loading these
files into R is an essential first step in any analysis.
As a preliminary step in this analysis, you’ll need to recognize that
your computer’s installation of R has a default location where it looks
for files (and stores saved output) called the working directory.
You can find the file path to your working directory using the getwd()
function:
getwd()
## [1] "C:/Users/millerry/OneDrive - Grinnell College/Documents/STA-230/f23/lab"
You can change this location using the setwd() function:
setwd("C:/Users/millerry/Downloads")
getwd()
## [1] "C:/Users/millerry/Downloads"
Note that in R Markdown, changing the working directory in a code chunk will only apply to that chunk (which is generally not a problem if you are using a single code chunk to load your data).
At times you might choose to change your working directory to the
location of your data files to reduce the complexity of the file paths
needed to find them. The example below takes a folder of data files
that was downloaded and extracted into the Downloads folder, then uses
the list.files() function to list each file contained in the folder:
list.files(path = "C:/Users/millerry/Downloads/experiment")
## [1] "run18_treatment.xlsx" "run21_control.xlsx" "run34_control.xlsx"
## [4] "run35_treatment.xlsx"
We can see that there are 4 different .xlsx files stored in the “experiment” folder.
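Two optional list.files() arguments are often handy at this step: pattern restricts the listing to matching file names, and full.names = TRUE returns paths that can be passed straight to a reading function without changing the working directory. The sketch below is only an illustration: it creates a throwaway folder of empty placeholder files in a temporary directory, and demo_dir, demo_names, xlsx_names, and xlsx_paths are hypothetical names.

```r
## Create a throwaway folder with empty placeholder files (illustration only)
demo_dir <- file.path(tempdir(), "experiment_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_names <- c("run18_treatment.xlsx", "run21_control.xlsx", "notes.txt")
file.create(file.path(demo_dir, demo_names))

## pattern is a regular expression; "\\.xlsx$" keeps names ending in .xlsx
xlsx_names <- list.files(path = demo_dir, pattern = "\\.xlsx$")
print(xlsx_names)

## full.names = TRUE prepends the folder to each file name
xlsx_paths <- list.files(path = demo_dir, pattern = "\\.xlsx$", full.names = TRUE)
print(basename(xlsx_paths))
```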
Now let’s suppose we want to find the mean of the variable
“VDS.Veh.Speed” in each participant file. Note that these are Excel
files (not .csv files), so we must first load (and possibly install) the
readxl package in order to read them into R.
We then can iterate through these files using a for loop, storing the mean of each participant:
library(readxl)
my_dir = "C:/Users/millerry/Downloads/experiment"
my_files <- list.files(path = my_dir) ## List of file names in your directory
means <- numeric(length(my_files)) ## Set up storage object
for(i in 1:length(my_files)){ ## Loop over each file
  temp <- read_excel(paste0(my_dir, "/", my_files[i])) ## Read by appending file name to the path prefix using paste0()
  means[i] <- mean(temp$VDS.Veh.Speed) ## Store the mean of the current file
}
print(means)
## [1] 24.91347 35.62761 56.93149 26.81412
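Because means is filled in the same order as my_files, each value can be labeled with the file it came from. The sketch below hard-codes the file names and means printed above so that it runs without the Excel files; participant_id is a hypothetical name introduced for illustration.

```r
## File names and means copied from the output shown in the lab
my_files <- c("run18_treatment.xlsx", "run21_control.xlsx",
              "run34_control.xlsx", "run35_treatment.xlsx")
means <- c(24.91347, 35.62761, 56.93149, 26.81412)

## Drop the ".xlsx" extension to get a short label for each file
participant_id <- sub("\\.xlsx$", "", my_files)

## Attach the labels so each mean is printed under its file
names(means) <- participant_id
print(means)
```

A label built this way is one possible starting point whenever a result needs to be traced back to its source file.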
Question #4: Using the example above as a template, find the standard
deviations (using the sd() function) of each participant file. Store
these standard deviations in an object named “sds” and print them as
part of your answer.
Sometimes it can be useful to aggregate several data frames with the
same structure into one larger data frame. For example, we might want to
combine all four participant files from Question #4 into a combined data
frame. Or, perhaps we want to aggregate several years of data from the
same source. These tasks can be handled by the rbind() function, which
will append the rows of one or more data frames to an initial data frame
(provided the column names match):
df1 <- data.frame(Year = 2019, val = rnorm(3))
df2 <- data.frame(Year = 2020, val = rnorm(3, mean = 10))
rbind(df1, df2)
## Year val
## 1 2019 -0.6032545
## 2 2019 -0.2141548
## 3 2019 0.3347360
## 4 2020 10.2357148
## 5 2020 8.8312616
## 6 2020 10.5517912
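To illustrate the column-name requirement, here is a minimal sketch with made-up data frames (df_a, df_b, and df_c are illustrative names). For data frames, rbind() matches columns by name rather than by position, so a different column order is fine, whereas a mismatched name causes an error:

```r
## Made-up data frames for illustration
df_a <- data.frame(Year = 2019, val = 1)
df_b <- data.frame(val = 2, Year = 2020)     ## same names, different order
df_c <- data.frame(Year = 2021, value = 3)   ## "value" does not match "val"

## Columns are matched by name, so the order in df_b does not matter
ab <- rbind(df_a, df_b)
print(ab)

## Mismatched names produce an error, caught here so the chunk keeps running
bad <- tryCatch(rbind(df_a, df_c), error = function(e) "error")
print(bad)
```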
Note that this result could also be achieved using full_join() (though
I personally find that approach less intuitive):
full_join(x = df1, y = df2)
## Year val
## 1 2019 -0.6032545
## 2 2019 -0.2141548
## 3 2019 0.3347360
## 4 2020 10.2357148
## 5 2020 8.8312616
## 6 2020 10.5517912
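One caveat, sketched below with made-up data frames: full_join() is a join, so it reproduces the rbind() result only when no row of one data frame matches a row of the other. When the same row appears in both inputs, rbind() keeps both copies while full_join() keeps one. The names df_a, df_b, stacked, and joined are illustrative, and the chunk assumes dplyr is installed.

```r
library(dplyr)

## Made-up data frames that share the row (Year = 2019, val = 2)
df_a <- data.frame(Year = 2019, val = c(1, 2))
df_b <- data.frame(Year = 2019, val = c(2, 3))

## rbind() stacks every row, keeping the repeated one
stacked <- rbind(df_a, df_b)
print(nrow(stacked))

## full_join() matches the shared row instead of repeating it;
## "by" is given explicitly to avoid the message about the joining columns
joined <- full_join(x = df_a, y = df_b, by = c("Year", "val"))
print(nrow(joined))
```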
However, rbind() has the advantage of easily being able to bind an
arbitrary number of data frames in a single command:
df3 <- data.frame(Year = 2021, val = rnorm(3, mean = -5)) ## How about a third year?
rbind(df1, df2, df3)
## Year val
## 1 2019 -0.6032545
## 2 2019 -0.2141548
## 3 2019 0.3347360
## 4 2020 10.2357148
## 5 2020 8.8312616
## 6 2020 10.5517912
## 7 2021 -6.1669451
## 8 2021 -4.9003577
## 9 2021 -5.3024403
Question #5: Use rbind() to combine the four data files in the
“experiment” folder (used in Question #4) into a single data frame. Be
sure to add a participant identifier to each file as a preliminary step.
Print the dimensions of the resulting data frame using dim(). Hint: You
may choose to use rbind() inside a for loop to repeatedly append new
rows onto an existing data frame. In this approach, your initial data
frame can be NULL if you don’t want to anticipate the structure of the
files.
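The pattern described in the hint can be sketched on simulated data rather than the experiment files. The object names (combined, yr, new_rows) are illustrative, and the values are random draws, so only the dimensions of the result are predictable:

```r
## Start from NULL so the structure need not be known in advance
combined <- NULL

for(yr in 2019:2021){
  ## Simulate three rows for the current year (random values)
  new_rows <- data.frame(Year = yr, val = rnorm(3))
  ## Append the new rows below whatever has been collected so far
  combined <- rbind(combined, new_rows)
}

dim(combined)   ## 9 rows (3 per year) and 2 columns
```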