This lab briefly introduces a few miscellaneous topics related to data processing.

# Please load the following packages (installing if necessary)
library(dplyr)
library(readr)


Functions

In R, you can create your own custom function using function() and the assignment operator. The example below creates a function that returns the squared distance between the mean and median of a numeric input variable, “x”:

my_fun <- function(x = 99){
  d2 <- (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))^2
  return(d2)
}

## Try the function
ex_data <- c(0,2,10)
my_fun(x = ex_data)
## [1] 4
  • The syntax x = 99 dictates that the function should have an argument “x”, and specifies a default value of 99. If a user of the function doesn’t supply their own “x” argument, the function will use the value 99.
  • return() defines the function’s output. In this example, the function will output the object d2.
  • Note that functions use their own local environment, so the objects x and d2 only exist internally when the function is being used.
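
The points above can be checked with a short sketch. It repeats the definition of my_fun() so that it runs on its own, calls the function with no arguments so that the default is used, and then checks whether d2 exists outside the function. The name default_result is only illustrative, and the last check assumes you have not created your own object named d2 elsewhere in your session.

```r
## Same function as above, repeated so this block runs on its own
my_fun <- function(x = 99){
  d2 <- (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE))^2
  return(d2)
}

## No "x" is supplied, so the default x = 99 is used
## The mean and median of the single value 99 are both 99
default_result <- my_fun()
print(default_result)
## [1] 0

## d2 was created inside the function's local environment, so it should not
## be found here (assuming no object named d2 was created elsewhere)
print(exists("d2"))
```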

Question #1: Create a function named top_cat() that accepts a categorical variable, “y”, and returns the frequency of the most frequent category. You can accomplish this using the table() and max() functions. Next, include code that tests your function on the variable “Region” in the colleges data set (the correct output should be 400).


Iteration

Custom functions are most useful when you’d like to repeat an action several times with slightly different inputs. For example, you might want to find the squared distance between the mean and median for every numeric variable in a data set.

Loops

A for loop is one way to iterate through the relevant portions of a data frame:

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
num_data <- colleges %>% select_if(is.numeric)  ## Selects all columns that return TRUE to is.numeric

for(i in 1:ncol(num_data)){ 
   print(my_fun(num_data[,i]))
}
## [1] 10274337
## [1] 0.0006547193
## [1] 0.5254721
## [1] 0.6265691
## [1] 0.6265691
## [1] 6919279
## [1] 1653630
## [1] 12303334
## [1] 6.353848e-06
## [1] 0.001075313
## [1] 0.001652711
## [1] 0.002807293
## [1] 9.5804e-05
## [1] 5.645276e-05
## [1] 1.662577e-05
## [1] 1.995113
## [1] 2548162

The code for(i in 1:ncol(num_data)) will create a variable, “i”, which will increment along the sequence 1:ncol(num_data) each time the loop is run. This allows the ith column to be given as an input to the my_fun() function during the ith repetition of the loop.
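
To see how “i” changes from one repetition to the next, here is a minimal sketch using a small made-up data frame (the object toy and its columns are assumptions for illustration, not part of the colleges data):

```r
## A toy data frame with two numeric columns
toy <- data.frame(a = c(1, 2, 9), b = c(4, 4, 10))

## "i" takes the value 1 on the first repetition and 2 on the second
for(i in 1:ncol(toy)){
  print(paste("Iteration", i, "uses column", names(toy)[i]))
}
## [1] "Iteration 1 uses column a"
## [1] "Iteration 2 uses column b"
```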

The example above merely prints the output of my_fun() for each numeric column; however, we’ll generally want to store these results. To do this, we can set up an empty object before the loop and assign values into one of its positions on each iteration:

## Create a blank numeric vector with a length equal to "ncol(num_data)"
colleges_d2 <- numeric(length = ncol(num_data))

## Loop and store
for(i in 1:ncol(num_data)){ 
  colleges_d2[i] <- my_fun(num_data[,i])
}

print(colleges_d2)
##  [1] 1.027434e+07 6.547193e-04 5.254721e-01 6.265691e-01 6.265691e-01
##  [6] 6.919279e+06 1.653630e+06 1.230333e+07 6.353848e-06 1.075313e-03
## [11] 1.652711e-03 2.807293e-03 9.580400e-05 5.645276e-05 1.662577e-05
## [16] 1.995113e+00 2.548162e+06

Question #2: Write code that includes a for loop and stores the frequency of the most common category for the first three columns in the “colleges” data set using the function you created in Question #1. Then, print the object storing these frequencies (the correct result should be 3 24 127).


Apply

The apply() function can sometimes be used as an alternative to a for loop. Consider the example below:

apply(num_data, MARGIN = 2, FUN = my_fun)
##           Enrollment             Adm_Rate           ACT_median 
##         1.027434e+07         6.547193e-04         5.254721e-01 
##               ACT_Q1               ACT_Q3                 Cost 
##         6.265691e-01         6.265691e-01         6.919279e+06 
##          Net_Tuition       Avg_Fac_Salary        PercentFemale 
##         1.653630e+06         1.230333e+07         6.353848e-06 
##         PercentWhite         PercentBlack      PercentHispanic 
##         1.075313e-03         1.652711e-03         2.807293e-03 
##         PercentAsian   FourYearComp_Males FourYearComp_Females 
##         9.580400e-05         5.645276e-05         1.662577e-05 
##          Debt_median    Salary10yr_median 
##         1.995113e+00         2.548162e+06

The argument MARGIN dictates which of the object’s dimensions the function should be iterated across, with MARGIN = 1 going through the rows, and MARGIN = 2 going through the columns.
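
The difference between the two MARGIN values is easiest to see on a small matrix. The sketch below uses a made-up 2 x 3 matrix and sum() as the function being applied; the object names are illustrative only.

```r
## A 2 x 3 matrix filled column by column:
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, MARGIN = 1, FUN = sum)   ## One result per row
print(row_sums)
## [1]  9 12

col_sums <- apply(m, MARGIN = 2, FUN = sum)   ## One result per column
print(col_sums)
## [1]  3  7 11
```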

Question #3: Use apply() to find the frequency of the most common category for the first three columns in the “colleges” data set. You should prepare your input data beforehand by using either the select() function or subsetting using indices.


Files

Consider an experiment where the researchers record information on each participant in a separate Excel file. Efficiently loading these files into R is an essential first step in any analysis.

Working Directories

As a preliminary step in this analysis, you’ll need to recognize that your computer’s installation of R has a default location where it looks for files (and stores saved output) called the working directory.

You can find the file path to your working directory using the getwd() function:

getwd()
## [1] "C:/Users/millerry/OneDrive - Grinnell College/Documents/STA-230/f23/lab"

You can change this location using the setwd() function:

setwd("C:/Users/millerry/Downloads")
getwd()
## [1] "C:/Users/millerry/Downloads"

Note that in R Markdown, changing the working directory in a code chunk will only apply to that chunk (which is generally not a problem if you are using a single code chunk to load your data).
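
If you do change the working directory in a script, one cautious pattern is to save the original location first and restore it afterwards. The sketch below uses tempdir(), R's per-session temporary folder, as a stand-in destination so that it runs on any machine; the name old_wd is just illustrative.

```r
old_wd <- getwd()        ## Remember the current working directory
setwd(tempdir())         ## Temporarily switch to R's session temp folder
print(getwd())           ## The path printed here will differ on every machine
setwd(old_wd)            ## Restore the original working directory
```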

At times you might choose to change your working directory to the location of your data files to reduce the complexity of the file paths needed to find them. The example below takes a folder of data files that was downloaded and extracted into the Downloads folder, then uses the list.files() function to list each file contained in the folder:

list.files(path = "C:/Users/millerry/Downloads/experiment")
## [1] "run18_treatment.xlsx" "run21_control.xlsx"   "run34_control.xlsx"  
## [4] "run35_treatment.xlsx"

We can see that there are 4 different .xlsx files stored in the “experiment” folder.
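
When a folder also contains files you don't want, the pattern argument of list.files() can restrict the listing, and full.names = TRUE returns complete paths rather than bare file names. The sketch below builds a small throwaway folder inside tempdir() so that it is reproducible; the folder and file names are invented for illustration and are not the experiment files.

```r
## Create a throwaway folder containing three empty files
demo_dir <- file.path(tempdir(), "experiment_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("run01_control.csv", "run02_treatment.csv", "notes.txt")))

## Keep only the files whose names end in ".csv"
csv_files <- list.files(path = demo_dir, pattern = "\\.csv$")
print(csv_files)
## [1] "run01_control.csv"   "run02_treatment.csv"

## full.names = TRUE prepends the folder path to each file name
csv_paths <- list.files(path = demo_dir, pattern = "\\.csv$", full.names = TRUE)
print(basename(csv_paths))
## [1] "run01_control.csv"   "run02_treatment.csv"
```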


Reading several files

Now let’s suppose we want to find the mean of the variable “VDS.Veh.Speed” in each participant file. Note that these are Excel files (not .csv files), so we must first load (and possibly install) the readxl package in order to read them into R.

We then can iterate through these files using a for loop, storing the mean of each participant:

library(readxl)
my_dir <- "C:/Users/millerry/Downloads/experiment"
my_files <- list.files(path = my_dir)  ## List of file names in your directory

means <- numeric(length(my_files))                           ## Set up storage object
for(i in 1:length(my_files)){                                ## Loop over each file
  temp <- read_excel(paste0(my_dir, "/", my_files[i]))       ## Read by appending file name to the path prefix using paste0()
  means[i] <- mean(temp$VDS.Veh.Speed)                       ## Store the mean of the current file
}
print(means)
## [1] 24.91347 35.62761 56.93149 26.81412
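
Because the loop above depends on files stored on one particular computer, here is a self-contained sketch of the same read-and-store pattern. As assumptions for illustration, it writes two tiny .csv files (rather than Excel files) into a temporary folder, uses a made-up column named “speed”, and builds paths with file.path(), which does the same job as paste0() with a "/" separator.

```r
## Write two small example files into a throwaway folder
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(speed = c(10, 20, 30)), file.path(demo_dir, "p1.csv"), row.names = FALSE)
write.csv(data.frame(speed = c(40, 50)), file.path(demo_dir, "p2.csv"), row.names = FALSE)

demo_files <- list.files(path = demo_dir, pattern = "\\.csv$")  ## File names in the folder
demo_means <- numeric(length(demo_files))                       ## Set up storage object
for(i in seq_along(demo_files)){                                ## Loop over each file
  temp <- read.csv(file.path(demo_dir, demo_files[i]))          ## Read the current file
  demo_means[i] <- mean(temp$speed)                             ## Store its mean
}
print(demo_means)
## [1] 20 45
```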

Question #4: Using the example above as a template, find the standard deviation of “VDS.Veh.Speed” (using the sd() function) in each participant file. Store these standard deviations in an object named “sds” and print them as part of your answer.


Row binding

Sometimes it can be useful to aggregate several data frames with the same structure into one larger data frame. For example, we might want to combine all four participant files from Question #4 into a single data frame. Or, perhaps we want to aggregate several years of data from the same source. These tasks can be handled by the rbind() function, which will append the rows of one or more data frames to an initial data frame (provided the column names match):

df1 <- data.frame(Year = 2019, val = rnorm(3))
df2 <- data.frame(Year = 2020, val = rnorm(3, mean = 10))

rbind(df1, df2)
##   Year        val
## 1 2019 -0.6032545
## 2 2019 -0.2141548
## 3 2019  0.3347360
## 4 2020 10.2357148
## 5 2020  8.8312616
## 6 2020 10.5517912

Note that in this example the same result could also be achieved using full_join(), because no row of df1 matches a row of df2 on both columns (though I personally find that approach less intuitive):

full_join(x = df1, y = df2)
##   Year        val
## 1 2019 -0.6032545
## 2 2019 -0.2141548
## 3 2019  0.3347360
## 4 2020 10.2357148
## 5 2020  8.8312616
## 6 2020 10.5517912
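
The two approaches only agree when the data frames share no identical rows. The sketch below uses two made-up data frames (a and b) that have one row in common: rbind() stacks all four rows, while full_join() treats the shared row as a match and keeps it once. The by argument is spelled out here to make the matching columns explicit.

```r
library(dplyr)

## Two small data frames that share the row (Year = 2020, val = 2)
a <- data.frame(Year = c(2019, 2020), val = c(1, 2))
b <- data.frame(Year = c(2020, 2021), val = c(2, 3))

stacked <- rbind(a, b)                             ## Appends rows, keeping the duplicate
joined  <- full_join(a, b, by = c("Year", "val"))  ## Matches on both columns

print(nrow(stacked))
## [1] 4
print(nrow(joined))
## [1] 3
```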

However, rbind() has the advantage of easily being able to bind an arbitrary number of data frames in a single command:

df3 <- data.frame(Year = 2021, val = rnorm(3, mean = -5)) ## How about a third year?
rbind(df1, df2, df3)
##   Year        val
## 1 2019 -0.6032545
## 2 2019 -0.2141548
## 3 2019  0.3347360
## 4 2020 10.2357148
## 5 2020  8.8312616
## 6 2020 10.5517912
## 7 2021 -6.1669451
## 8 2021 -4.9003577
## 9 2021 -5.3024403
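
The requirement that column names match can be checked directly. In the sketch below, two made-up data frames differ in the name of their second column (“val” versus “value”), so rbind() stops with an error; tryCatch() is used only so that the error message is captured and printed instead of halting the script.

```r
df_a <- data.frame(Year = 2019, val = 1)
df_b <- data.frame(Year = 2020, value = 2)   ## Second column is named differently

## Capture the error message rather than stopping
result <- tryCatch(rbind(df_a, df_b),
                   error = function(e) conditionMessage(e))
print(result)   ## Prints the error message reported by rbind()
```

Renaming the columns so they agree (for example, with names(df_b) <- names(df_a)) is one way to make the call succeed.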

Question #5: Use rbind() to combine the four data files in the “experiment” folder (used in Question #4) into a single data frame. Be sure to add a participant identifier to each file as a preliminary step. Print the dimensions of the resulting data frame using dim(). Hint: You may choose to use rbind() inside a for loop to repeatedly append new rows onto an existing data frame. In this approach, your initial data frame can be NULL if you don’t want to anticipate the structure of the files.