Introduction:
The purpose of this lab is to acquaint (or re-acquaint) you with operations in R that are essential when working with real data. Data are often stored or recorded in formats that are not conducive to modeling. Accordingly, basic data processing and management skills are necessary for anyone involved in data modeling.
Please note that this lab assumes a basic level of experience using R in the R Studio environment. If you are unfamiliar with R or feeling rusty, I encourage you to take a look at the “Introduction to R” lab I use in Math-256 (linked here) before attempting this lab.
Directions:
You are expected to work through the examples and questions in this lab collaboratively with your partner(s). This requires you to work together and discuss each topic. Both of you will receive the same score for your work, so it is your responsibility to stay on the same page. Additionally, any groups found to be using a “divide and conquer” strategy to rush through the lab’s questions may be penalized for not adhering to the guidelines presented above.
You should record your answers to the lab’s questions in an R Markdown file. When submitting the lab, you should only turn in the compiled .html file created by R Markdown. While working on the lab, you are strongly encouraged to open a separate/blank R script to run and experiment with the example code that is given throughout the lab. Please do not turn in this practice code; I only want to see your responses to the lab’s questions.
\(~\)
All models utilize data. In this class, we’ll load data into the R environment using the read.csv function (be aware that there are other, similar functions for non-csv files). By default, read.csv will store the .csv file as an object known as a data frame.
## Load data on the Golden State Warriors record-setting 2015-16 season
gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv")
## Check that "gsw" is a data frame
class(gsw)
## [1] "data.frame"
Shown below are a couple of useful functions for verifying that a dataset was loaded without any issues:
## Check variable names
names(gsw)
## [1] "Game" "Date" "Location" "Opp" "Win"
## [6] "Points" "OppPoints" "FG" "FGA" "FG3"
## [11] "FG3A" "FT" "FTA" "Rebounds" "OffReb"
## [16] "Assists" "Steals" "Blocks" "Turnovers" "Fouls"
## [21] "OppFG" "OppFGA" "OppFG3" "OppFG3A" "OppFT"
## [26] "OppFTA" "OppRebounds" "OppOffReb" "OppAssists" "OppSteals"
## [31] "OppBlocks" "OppTurnovers" "OppFouls"
## Check dimensions of the data frame
dim(gsw) # it has 82 rows (games) and 33 columns (variables)
## [1] 82 33
You can also double-click on the dataset in the “Environment” tab in the top-right corner of R Studio to view it using R’s spreadsheet viewer.
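If you prefer working from the console, a few standard functions give a quick look at the data (these are part of base R and its built-in utils package, so nothing extra needs to be installed):
## Open the spreadsheet viewer from the console (same as double-clicking in the Environment tab)
View(gsw)
## Print the first 6 rows of the data frame
head(gsw)
## Display each variable's type along with a preview of its values
str(gsw)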
\(~\)
In R, data.frame objects are best viewed as a collection of column vectors. The code below demonstrates this by first creating two vectors, x and y, using the c() function. It then uses these vectors as the basis for a data.frame named df:
## Create vectors
x <- c(1, 4, 5, 2, 4)
y <- c("A", "B", "B", "A", "A")
## Assemble into data frame
df <- data.frame(value = x, group = y)
## Print df
print(df)
## value group
## 1 1 A
## 2 4 B
## 3 5 B
## 4 2 A
## 5 4 A
## Notice the variable names
names(df)
## [1] "value" "group"
Often we need to isolate or reference a single variable (vector) contained within a data.frame. There are two ways of doing this: by position (index) or by name. The code below demonstrates how to access the vector corresponding to the variable “group” in the data.frame we created earlier:
## Select via index (group is the second column)
df[,2]
## [1] "A" "B" "B" "A" "A"
## Select via name (group is named "group")
df$group
## [1] "A" "B" "B" "A" "A"
While there are multiple ways to select variables, cases (synonymous with data points, observations, or subjects) can only be accessed using indexing:
## Select the 3rd subject in the data frame (ie: row #3)
df[3,]
## value group
## 3 5 B
## Select the 3rd subject from the "group" vector
df$group[3]
## [1] "B"
Question #1: First, write your own code that loads the “Colleges 2019” dataset from the url “https://remiller1450.github.io/data/Colleges2019.csv” into a data.frame named “colleges”. Next, use the dim function to print the dimensions of this dataset. Finally, write your own code to print the city of Xavier University (the 1650th observation in the dataset).
\(~\)
In R, functions translate input(s) into output. A few examples are shown below:
## This function takes the square root of its input
sqrt(4)
## [1] 2
## This function takes the absolute value of its input
abs(-1)
## [1] 1
A function’s inputs are called arguments in R documentation. For complex functions, the arguments should be specified using names which are internally defined within the function. The example below takes the base-two logarithm of the number 4 using the log function and appropriate inputs to the arguments x and base.
log(x = 4, base = 2)
## [1] 2
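Arguments can also be matched by position rather than by name. Naming them is usually clearer, but the following call is equivalent because x is the first argument of log and base is the second:
## The same calculation with arguments matched by position
log(4, 2)
## [1] 2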
If you are ever uncertain about a function’s inputs or output, you can read its help documentation by typing ? in front of the function’s name in the R console. The example below opens the documentation for the log function:
?log
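If you only need a reminder of a function’s argument names and default values (rather than the full help page), the args function will print them:
## Print the argument names and defaults of the log function
args(log)
## function (x, base = exp(1)) 
## NULL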
\(~\)
Many of the functions used in modeling use formulas as a way of describing how the variables in the model are related. The general syntax for a formula is to put the name of the outcome or response variable on the left of a ~ and the names of the explanatory variables on the right side (separated by plus signs if multiple are used):
\[\text{Outcome} \sim \text{Explanatory Variable(s) separated by +'s}\]
Shown below are a few examples using the lm function (which fits linear regression models, a topic we’ll explore in greater detail starting in week 3):
## Read GS Warriors data
gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv")
## Use made 3's to predict/explain points
model1 <- lm(Points ~ FG3, data = gsw)
model1
##
## Call:
## lm(formula = Points ~ FG3, data = gsw)
##
## Coefficients:
## (Intercept) FG3
## 89.105 1.963
## Use made 3's and Rebounds to predict/explain points
model2 <- lm(Points ~ FG3 + Rebounds, data = gsw)
model2
##
## Call:
## lm(formula = Points ~ FG3 + Rebounds, data = gsw)
##
## Coefficients:
## (Intercept) FG3 Rebounds
## 83.4475 1.9643 0.1222
Sometimes, functions are used inside of formulas:
## Use the log of Rebounds as an explanatory variable
model3 <- lm(Points ~ FG3 + log(Rebounds), data = gsw)
model3
##
## Call:
## lm(formula = Points ~ FG3 + log(Rebounds), data = gsw)
##
## Coefficients:
## (Intercept) FG3 log(Rebounds)
## 70.486 1.965 4.865
Question #2: Using the “Colleges 2019” data that you loaded in Question #1, write code that uses a formula inside of the lm function to fit a linear regression model where both Adm_Rate and the square root of Enrollment are used to predict a college’s Cost. Store this model in an object named cost_model.
\(~\)
To facilitate more complex tasks in R, many people have developed their own sets of functions known as packages. The first time you use a package it will need to be installed (ie: downloaded onto your PC):
install.packages("ggplot2")
Once a package is installed, it still needs to be loaded using the library (or require) function before it can be used. You’ll need to load a package every time you open R Studio, but you’ll only need to install it once.
library(ggplot2)
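A pattern you may see in other people’s scripts (entirely optional here, and best kept out of your submitted R Markdown file) is to install a package only when it isn’t already available before loading it:
## Install ggplot2 only if it isn't already available, then load it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)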
The “ggplot2” package is an extremely popular graphics package that we will use throughout the semester:
## Load colleges data
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
## qplot or "quick plot" is a function contained in the package ggplot2
qplot(x = Adm_Rate, y = Cost, color = Region, data = colleges)
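qplot (“quick plot”) is a convenience wrapper; the same scatterplot can also be written with ggplot2’s fuller ggplot() syntax, which you may encounter in other references:
## Equivalent plot using ggplot() + geom_point()
ggplot(data = colleges, aes(x = Adm_Rate, y = Cost, color = Region)) +
  geom_point()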
\(~\)
Another package we’ll occasionally use throughout the semester is dplyr, which contains a suite of functions used for processing and preparing data into a form that is suitable for modeling.
One very common data processing step is to filter our data set to contain only observations that match a certain criterion. The example below filters the college data (using the filter function in the dplyr package) to contain only colleges located in New England (storing this subset in a data.frame named ne_colleges):
library(dplyr)
ne_colleges <- filter(colleges, Region == "New England")
qplot(x = Adm_Rate, y = Cost, color = Region, data = ne_colleges)
The first argument of filter is the dataset, and the subsequent arguments are logical conditions that define the filtering criteria. You can use multiple filtering conditions at the same time by adding them as additional arguments. The code below filters the college data to contain only private colleges located in New England with enrollments less than 5,000:
library(dplyr)
ne_small_priv_colleges <- filter(colleges, Region == "New England", Private == "Private", Enrollment < 5000)
qplot(x = Adm_Rate, y = Cost, color = Region, data = ne_small_priv_colleges)
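You may also see dplyr code written with the pipe operator %>% (which dplyr makes available); it passes the object on its left as the first argument of the function on its right, so the filtering above could equivalently be written as:
## The same filtering written with the pipe operator
ne_small_priv_colleges <- colleges %>%
  filter(Region == "New England", Private == "Private", Enrollment < 5000)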
Question #3: Starting with the full “Colleges 2019” dataset, use the filter function in the dplyr package to create a data frame named “cheap_oh_colleges” that contains colleges located in the state of Ohio (OH) with a Net_Tuition less than $10,000. (Hint: be sure to include library(dplyr) in your R Markdown file, but do not include install.packages("dplyr"), as this will prevent your document from compiling.)
\(~\)
The dplyr package also allows you to combine multiple data frames that are linked by a common variable. The usage of these functions is based upon structured query language (SQL). In this class, we’ll only ever use the left_join function (which combines two data sets using a common key or ID variable). The example below demonstrates how this function is used:
## A data frame named "orders"
orders <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/orders.csv")
orders
## order id date
## 1 1 4 Jan-01
## 2 2 8 Feb-01
## 3 3 42 Apr-15
## 4 4 50 Apr-17
## Another data frame named "customers"
customers <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/customers.csv")
customers
## id name
## 1 4 Tukey
## 2 8 Wickham
## 3 15 Mason
## 4 16 Jordan
## 5 23 Patil
## 6 42 Cox
## Left join will attach entries from the "y" data frame to the "x" data frame using the variable "id"
orders_customers <- left_join(x = orders, y = customers, by = "id")
orders_customers
## order id date name
## 1 1 4 Jan-01 Tukey
## 2 2 8 Feb-01 Wickham
## 3 3 42 Apr-15 Cox
## 4 4 50 Apr-17 <NA>
## Notice what happens if we switch x and y
orders_customers2 <- left_join(x = customers, y = orders, by = "id")
orders_customers2
## id name order date
## 1 4 Tukey 1 Jan-01
## 2 8 Wickham 2 Feb-01
## 3 15 Mason NA <NA>
## 4 16 Jordan NA <NA>
## 5 23 Patil NA <NA>
## 6 42 Cox 3 Apr-15
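If the key variable has a different name in each data frame, left_join can still match them using a named by argument. Below is a minimal sketch with hypothetical data frames (sales and people are made up for illustration and are not part of the lab’s data):
## Hypothetical tables where the key is named differently in each
sales <- data.frame(sale = 1:3, cust_id = c(4, 8, 15))
people <- data.frame(id = c(4, 8), name = c("Tukey", "Wickham"))
## Match the "cust_id" column of sales to the "id" column of people
left_join(x = sales, y = people, by = c("cust_id" = "id"))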
Question #4: Join the books and authors data frames below using the variable ISBN to attach an authorID to each book. (Hint: there are 6 books, so if your answer doesn’t contain 6 rows, you should briefly explain why.)
books <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/books.csv")
authors <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/merging/data/book-authors.csv")
\(~\)
As the dominant software among statisticians, R contains numerous functions for a wide variety of formal statistical procedures.
Conveniently, most of these simply require you to input your model object into the appropriate function. Demonstrated below are two statistical methods we will learn about throughout the semester, confidence intervals and hypothesis testing:
## Read GS Warriors data
gsw <- read.csv("https://remiller1450.github.io/data/GSWarriors.csv")
## Use made 3's to predict/explain points
model1 <- lm(Points ~ FG3, data = gsw)
## CI estimates for the model parameters using "confint"
confint(model1)
## 2.5 % 97.5 %
## (Intercept) 82.583343 95.626129
## FG3 1.488098 2.438386
## Hypothesis testing using "summary"
summary(model1)
##
## Call:
## lm(formula = Points ~ FG3, data = gsw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7739 -4.5809 -0.0717 5.0588 17.2261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.1047 3.2770 27.191 < 2e-16 ***
## FG3 1.9632 0.2388 8.223 2.96e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.614 on 80 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.4513
## F-statistic: 67.61 on 1 and 80 DF, p-value: 2.959e-12
It is important to recognize that statistical inference is not always useful or necessary. For it to work, we must be operating under the assumption that there is uncertainty inherent in the sample data we’re basing our model upon. Typically, this uncertainty comes from random sampling from a population, but there can be other justifications. For example, we might consider the 82 games used in the model above to be a sample that reflects the intrinsic attributes of that Warriors team. In other words, if that Warriors team played these games again, we’d expect different outcomes, and our statistical procedures can account for that variability.
Question #5: To begin, write code to load the “RMR” dataset from the url: https://remiller1450.github.io/data/RMR.csv; this file contains the bodyweights (in lbs) and resting metabolic rates (in kcal per day) of a random sample of 44 adult US women. Next, use the lm function to fit a linear model that uses bodyweight to predict resting metabolic rate. Finally, answer the following questions:
\(~\)
Question #6: Using the “Colleges 2019” dataset, test whether net tuition is significantly different for private and public colleges. In your answer, be sure to formally state the hypothesis you are testing, describe how you are testing it, and clearly state your conclusion in the context of the data.