MATH-256 - Lab #1 - An Introduction to R

This lab will provide an introduction to RStudio, its user interface, some basic R programming commands, and a publishing extension known as R Markdown. In doing so, it will also cover some basic definitions and operations involving data.

Directions (Please read before starting)

You are put into groups when working through labs because you are expected to work together - this means talking through the logic, steps, code, etc. that are needed to progress through the lab.
Every member of your group should submit a lab write-up, and each of you should include the names of all group members at the top.
Please take advantage of the opportunity to ask questions. Labs are meant to be formative and are intended to help you apply your “on paper” knowledge to more realistic applications - as an instructor I am happy to help with any aspect of this.

0 - Accessing RStudio

In this course you’ll need to install R on your own personal computer, which you should bring to class every day. Getting RStudio to function properly requires two steps:

Download and install R from http://www.r-project.org/
Download and install RStudio from http://www.rstudio.com/

R and RStudio are both open-source software. They are completely free to download and use, and they don’t take up much disk space, so there is very little downside to adding them to your personal computer.


1 - The Layout of RStudio

After you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> R Script, which will open a new window in the top left of the RStudio user interface for you to work in. At this point you should see four panels:

Your R Script (top left)
The Console (bottom left)
Your Environment (top right)
The Files/Plots/Help viewer (bottom right)

An R Script is like a text file that stores your code while you work on it. At any time, you can send some or all of the code in your R Script to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run, and will display the textual/numeric output that the code generates.

The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).

Question #1: Create a blank R Script. You will use this R Script to record your answers to future questions in this document.


2 - Using R

R is an interpreted programming language, which allows you to have the computer execute any piece of code contained in your R Script at any time without a lengthy compiling process.

To run a single piece of code, simply highlight it and either hit Ctrl-Enter or click on the “Run” button near the top right corner of your R Script. You should see an echo of the code you ran in the Console, along with any response generated by that code.

R as a Calculator

At its most basic level, R can be used to perform arithmetic operations. A few examples are shown below; try typing them into your R Script and executing them on your own.

4 + 6 - (24/6)

## [1] 6

5 ^ 2 + 2 * 2

## [1] 29
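As a brief aside on the order in which R carries out these operations, the sketch below shows that exponentiation is applied before multiplication and division, which are applied before addition and subtraction, and that parentheses override this order. The object names are only illustrative and are not used elsewhere in this lab.

```r
# Exponentiation happens before multiplication, which happens before addition
no_parens   <- 5 ^ 2 + 2 * 2     # evaluated as (5^2) + (2*2)
with_parens <- 5 ^ (2 + 2) * 2   # parentheses force 2 + 2 to be computed first

# A leading minus sign is applied after exponentiation
neg_square <- -2 ^ 2             # evaluated as -(2^2)
square_neg <- (-2) ^ 2           # the parentheses make -2 the number being squared

# Integer division and remainder
quotient  <- 17 %/% 5
remainder <- 17 %% 5

print(c(no_parens, with_parens, neg_square, square_neg, quotient, remainder))
```

When in doubt about the order, adding parentheses costs nothing and makes the intent clear to anyone reading the code.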

Some arithmetic operations require the use of functions. In the example below, the function “exp” raises the number $e$ to the power you provide as an input; here the number “2” is given as the input:

exp(2)

## [1] 7.389056

This function takes the square root:

sqrt(4)

## [1] 2

This function takes the absolute value:

abs(-1)

## [1] 1

A function’s inputs are called arguments in R documentation. For functions with several arguments, it is good practice to specify each argument by name; these names are defined by the function itself and are listed in its documentation. The example below takes the base two logarithm of the number 4 using the log function and appropriate inputs to the arguments x and base.

log(x = 4, base = 2)

## [1] 2
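To illustrate why naming arguments can matter, here is a small sketch. The argument names x and base come from the log function; the object names on the left are only illustrative.

```r
# Named arguments can be supplied in any order
named_a <- log(x = 4, base = 2)
named_b <- log(base = 2, x = 4)

# Unnamed arguments are matched by position: the first is x, the second is base
positional <- log(4, 2)

# Swapping unnamed arguments asks a different question
swapped <- log(2, 4)   # the base-4 logarithm of 2

# Leaving out 'base' uses its default value, e, which gives the natural logarithm
natural <- log(4)

print(c(named_a, named_b, positional, swapped, natural))
```

The first three results agree, while the swapped version does not, which is the kind of silent mistake that naming arguments helps avoid.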

Help Documentation

R contains thousands of functions, each potentially containing many different arguments. Whenever using an unfamiliar function for the first time it is good practice to read that function’s documentation, which will describe the function’s uses and arguments. You can pull up a function’s documentation by typing ? before the function’s name in the Console. The example below pulls up the documentation of the log function.

?log

Remark: A complete list of the functions contained in base R is available at this link; that said, browsing this list tends to be a very inefficient way of finding the function you need. There’s no shame in searching online (e.g., using Google, StackOverflow, etc.) for the right R commands for the task you’re working on.

Including Comments

When coding, it is good practice to include comments that describe what your code is doing. In R the character “#” is used to start a comment. Everything on that line to the right of the “#” will not be executed when the line is submitted to the Console.

# This entire line is a comment and will do nothing if run

1:6 # The command "1:6" appears before this comment

## [1] 1 2 3 4 5 6

In your R Script, comments appear in green. You also should remember that the “#” starts a comment only for a single line of your R Script, so every line of a long, multi-line comment should begin with its own “#”.

Question #2: In your R Script add a comment with the text “Question 2” on the script’s first line. Then, on the second line, write a command that finds the square root of the absolute value of negative four.


3 - Objects and Assignment

R stores data in containers called objects. Data is assigned into an object using <- or =. After assignment, data can be referenced using that object’s name. The simplest objects are scalars, or single elements:

x <- 5 # This assigns the value 5 to an object called 'x'
x^2    # We can now reference 'x'

## [1] 25

R stores sequences of elements in objects called vectors:

x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)

## [1] 1 2 3

y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)

## [1] 1 2 3

z <- c("A","B","C") # Vectors can also hold other types of values, such as character strings
print(z)

## [1] "A" "B" "C"

The three types of vectors you will encounter most often are:

numeric/integer vectors - for example: x = c(1,2,3)
character vectors - for example: x = c("A","B","C")
logical vectors - for example: x = c(TRUE, FALSE, TRUE)

Vector types are important because most functions expect inputs of a certain type and will produce an error or warning if an input of the wrong type is given. A vector can only hold one type, so mixing types forces R to convert every element to a common type; if any element is a character string, the whole vector becomes a character vector. This is a common source of errors when working with real-world raw data loaded into R.

You can check the type of an object using the typeof function:

chars <- c("1","2","3") # Create a character vector
typeof(chars)

## [1] "character"

nums <- c(1,2,3) # Create a numeric vector
typeof(nums)

## [1] "double"

mean(chars) # This produces a warning and returns NA; mean() only works for numeric or logical vectors

## Warning in mean.default(chars): argument is not numeric or logical: returning NA

## [1] NA

mean(nums) # This works as intended

## [1] 2
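The type conversion that happens when types are mixed can be seen directly with typeof. The short sketch below uses made-up values; the object names mixed_text and mixed_logical are only illustrative.

```r
# Numbers mixed with text: every element becomes a character string
mixed_text <- c(1, "two", 3)
typeof(mixed_text)

# Logical values mixed with numbers: TRUE becomes 1 and FALSE becomes 0
mixed_logical <- c(TRUE, 2, FALSE)
typeof(mixed_logical)

# Numbers that were stored as text can be converted back explicitly
chars <- c("1", "2", "3")
mean(as.numeric(chars))
```

If a column of a dataset that ought to be numeric shows up as a character vector, a stray non-numeric entry somewhere in the raw data is a likely cause.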

Many R functions are vectorized, meaning they can accept a scalar input, for example 1, and return a scalar output, f(1), or they can accept a vector input, such as c(1,2,3), and return a vector c(f(1),f(2),f(3)). The sqrt function is vectorized:

nums <- c(1,2,3,4)
sqrt(nums)

## [1] 1.000000 1.414214 1.732051 2.000000
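Arithmetic operators are vectorized in the same way. The sketch below is a minimal illustration with made-up numbers; the object names are not used elsewhere in this lab.

```r
nums <- c(1, 2, 3, 4)

# Arithmetic with a single number is applied to every element
doubled <- nums * 2
squared <- nums ^ 2

# Arithmetic between two vectors of equal length works position by position
other <- c(10, 20, 30, 40)
added <- nums + other

print(doubled)
print(squared)
print(added)
```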

Datasets are usually stored in objects called data.frames, which are composed of several vectors of the same length:

DF <- data.frame(x = x, y = y, z = z) # Creates a data.frame object 'DF'
print(DF)

##   x y z
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
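As a quick illustrative sketch, each column of a data.frame keeps its own type, and the str function gives a compact overview of them. The data.frame below is made up, and its name and column names are hypothetical. The example also uses try, a function not covered in this lab, only so that the deliberate error does not stop the script.

```r
# A small made-up data.frame; the names and values are hypothetical
pets <- data.frame(
  name   = c("Rex", "Tom", "Kiwi"),
  weight = c(30.5, 4.2, 0.1),
  indoor = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE   # keeps 'name' as a character vector on any version of R
)

str(pets)   # compact overview: dimensions, column names, and column types

# Columns whose lengths do not fit together are rejected with an error
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(bad, "try-error")
```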

Question #3: In your R Script go to a new line and add the comment “Question 3” and on the line(s) below this comment create a data.frame object named “dfn” containing a vector named “number” that is the integers from 1 to 10, and a vector named “number_squared” which is the integers from 1 to 10 squared.

Indexing

Suppose we have a vector ‘x’ and would like to extract the element in its second position and assign it to a new object called ‘b’:

x <- 5:10
b <- x[2]
b

## [1] 6

The square brackets are used to access a certain position (or multiple positions) within an object. In this example we access the second position within the object “x”.

Some objects, such as data.frames, have multiple dimensions, requiring indices in each dimension to describe an element’s position:

DF <- data.frame(x = 1:3, y = y, z = z) # 'x' now holds 5:10, so the 1:3 column is given explicitly
DF[2,3] # The element in row 2, column 3

## [1] "B"
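Square brackets accept more than a single position. The sketch below shows a few common variations; the data.frame called toy and its column names are made up for illustration.

```r
x <- 5:10

x[c(1, 3)]   # the first and third elements
x[2:4]       # elements two through four
x[-1]        # a negative index drops that position, keeping everything else

# A made-up data.frame with hypothetical column names
toy <- data.frame(id = 1:3, letter = c("A", "B", "C"), stringsAsFactors = FALSE)

toy[2, "letter"]   # a column can be referred to by name instead of by position
```

Referring to columns by name tends to be safer than using positions, since the code keeps working if the column order changes.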


4 - Reading in Data

There are many ways to get data into R. If a data set is stored somewhere on your computer as a .csv file, you can load it using the R function read.csv:

# my.data <- read.csv("H://path_to_my_data/my_data.csv")
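Because the path above is only a placeholder, here is a self-contained sketch of reading a local file. It first writes a tiny made-up .csv file to a temporary location and then reads it back; the file contents and the names csv_path and toy_data are hypothetical.

```r
# Write a tiny made-up .csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90", "2,85"), csv_path)

# By default read.csv treats the first line of the file as the column names
toy_data <- read.csv(csv_path)
print(toy_data)

# getwd() shows the folder that relative file paths are resolved against
getwd()
```

If read.csv reports that a file cannot be found, comparing the path you typed against the output of getwd() is a reasonable first check.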

read.csv is also capable of reading .csv files stored on the web. All of the datasets that you’ll need for labs and homework assignments in this class will be hosted online. The example below reads the data set IowaCityHomeSales.csv (which is hosted at the specified URL) and stores it as an object called ‘my_data’:

my_data <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

We can use the head function to see the first few rows of these data:

head(my_data)  # This function prints the first several rows of an object

When working with a newly loaded dataset there are a few things you might want to know:

dim(my_data) # prints the dimensions of 'my_data'

## [1] 777  19

nrow(my_data) # prints the number of rows of 'my_data'

## [1] 777

ncol(my_data) # prints the number of columns of 'my_data'

## [1] 19

colnames(my_data) # prints the column names of 'my_data'

##  [1] "sale.amount"  "sale.date"    "occupancy"    "style"        "built"       
##  [6] "bedrooms"     "bsmt"         "ac"           "attic"        "area.base"   
## [11] "area.add"     "area.bsmt"    "area.garage1" "area.garage2" "area.living" 
## [16] "area.lot"     "lon"          "lat"          "assessed"

Statisticians typically refer to the rows of a dataset as “cases”, “observations”, or “data-points”. They typically refer to the columns of a dataset as “variables” or “features”. Much like how vectors can be of different types, so can variables:

Numeric or Quantitative variables record a numeric value describing a case. These variables can be either continuous, taking on any value within an interval (uncountably many possible values), or discrete, taking on countably many values (typically integers). Be aware that datasets might use numeric values to encode other meanings; a variable is truly numeric only if arithmetic operations on its values make sense.
Categorical variables divide the cases into groups, typically using a set of labeled categories. These variables can be nominal, containing many categories with no natural order, binary, containing two mutually exclusive categories, or ordinal, containing categories with a natural ordering.
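One way these ideas show up in R is through the factor function, which stores a categorical variable together with its set of categories. The sketch below uses made-up values; the variable names jersey and size are hypothetical and do not come from any dataset in this lab.

```r
# Made-up values for illustration only
jersey <- c(23, 7, 7, 10)                          # numbers used as labels
size   <- c("small", "large", "medium", "small")   # categories with a natural order

# R will happily average jersey numbers, but the result has no meaning,
# which is a sign that the variable is categorical despite being stored as numbers
mean(jersey)

# factor() records the distinct categories of a categorical variable
jersey_factor <- factor(jersey)
levels(jersey_factor)

# An ordinal variable can be given an explicit ordering of its categories
size_ordered <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
size_ordered < "large"
```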

Question #4 (Part A): In your R Script go to a new line and add the comment “Question 4”, then on the line(s) below write code that reads and stores the data located at “https://remiller1450.github.io/data/ElectionMargin.csv” as “election_data”. Next, write code that finds the dimensions of this data.frame.

Question #4 (Part B): For the election margin data, use comments to record your answers to the following questions:

What are the cases and variables in these data?
What type of variable is “Result”? (be specific!)
Is “Candidate” a categorical variable?

Subsetting Data

Suppose we want to access a single variable in our data set. There are a few different ways we can do so:

sale_price1 <- my_data$sale.amount # The $ accesses the variable named 'sale.amount' within 'my_data'
sale_price2 <- my_data[,1] # We can also use indexing to access 'sale.amount'

# Notice how we specified only the second dimension of 'my_data' (its columns)
head(sale_price1)

## [1] 172500  90000 168500 205000 121000 215000

head(sale_price2)

## [1] 172500  90000 168500 205000 121000 215000

Suppose we want to access a single case (subject) in our dataset:

first_house <- my_data[1,] # This stores the entire first row
head(first_house)

##   sale.amount sale.date           occupancy         style built bedrooms bsmt
## 1      172500  1/3/2005 116 (Zero Lot Line) 1 Story Frame  1993        3 Full
##    ac attic area.base area.add area.bsmt area.garage1 area.garage2 area.living
## 1 Yes  None      1102        0       925          418            0        1102
##   area.lot       lon      lat assessed
## 1     5520 -91.50913 41.65116   173040

Suppose we want a range of cases:

firstfive_houses <- my_data[1:5,] # This stores the first five rows
head(firstfive_houses)

##   sale.amount sale.date                            occupancy             style
## 1      172500  1/3/2005                  116 (Zero Lot Line)     1 Story Frame
## 2       90000  1/5/2005                    113 (Condominium)     1 Story Frame
## 3      168500 1/12/2005 101 (Single-Family / Owner Occupied) Split Foyer Frame
## 4      205000 1/14/2005 101 (Single-Family / Owner Occupied) Split Foyer Frame
## 5      121000 1/24/2005                    113 (Condominium)     1 Story Condo
##   built bedrooms bsmt  ac attic area.base area.add area.bsmt area.garage1
## 1  1993        3 Full Yes  None      1102        0       925          418
## 2  2001        2 None Yes  None       878        0         0            0
## 3  1976        4 Full Yes  None      1236        0       700          576
## 4  1995        3 Full Yes  None      1466        0       500            0
## 5  2001        2 None Yes  None      1150        0         0            0
##   area.garage2 area.living area.lot       lon      lat assessed
## 1            0        1102     5520 -91.50913 41.65116   173040
## 2          264         878     3718 -91.52296 41.67324    89470
## 3            0        1236     8800 -91.48231 41.65849   164230
## 4            0        1466    16720 -91.55224 41.64900   211890
## 5          528        1150     3427 -91.57814 41.65263   115430
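Rows and columns can also be selected together, and columns can be chosen by name. The sketch below uses a small made-up data.frame as a stand-in for a real dataset; the name homes and its columns and values are hypothetical.

```r
# A made-up stand-in for a dataset; names and values are hypothetical
homes <- data.frame(
  price = c(150000, 220000, 310000, 185000),
  beds  = c(2, 3, 4, 3),
  ac    = c("Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

homes[, "price"]               # one column selected by name rather than by position
homes[["price"]]               # double brackets also extract a single column
homes[, c("price", "beds")]    # several columns at once
homes[2:3, c("price", "ac")]   # a range of rows and a set of columns together
```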

Logical Conditions and Subsetting

Suppose we want to know which elements of sale.amount are larger than $200,000:

my_data$sale.amount > 200000 # Logical vector for the condition "> 200000"
which(my_data$sale.amount > 200000) # Positions of elements where the condition is "TRUE"
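Since the output of these commands is long for the full dataset, here is the same idea on a short made-up vector. The name prices and its values are hypothetical.

```r
prices <- c(150000, 220000, 310000, 185000)   # made-up sale prices

over_200k <- prices > 200000   # one TRUE or FALSE for each element
over_200k

which(over_200k)    # the positions where the condition is TRUE
sum(over_200k)      # TRUE counts as 1, so this counts the elements meeting the condition
mean(over_200k)     # the proportion of elements meeting the condition
prices[over_200k]   # a logical vector inside square brackets keeps the TRUE positions
```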

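Some useful logical operators include:

== (equal to) and != (not equal to)
> and < (greater than and less than), along with >= and <= (greater than or equal to and less than or equal to)
& (and), which is TRUE only when both conditions are TRUE
| (or), which is TRUE when at least one of the conditions is TRUE
! (not), which reverses TRUE and FALSE
%in% (whether a value is contained in a given vector)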

Logical conditions are particularly useful for subsetting objects. Here are a few examples:

large_and_expensive1 <- my_data[my_data$sale.amount > 500000 & my_data$area.living > 3000,]  ## Subset via indexing
large_and_expensive2 <- subset(my_data, my_data$sale.amount > 500000 & my_data$area.living > 3000) ## Subset via the "subset" function

The code given above shows two different ways of creating a new object that contains all homes which sold for more than $500,000 and have living areas over 3,000 square feet.

nobsmt_or_noac1 <- my_data[my_data$bsmt == "None" | my_data$ac != "Yes",]
nobsmt_or_noac2 <- subset(my_data, my_data$bsmt == "None" | my_data$ac != "Yes") 
dim(nobsmt_or_noac1)

## [1] 239  19

The example above creates a new object containing all homes that don’t have a basement or don’t have air conditioning.

nobsmt_and_noac <- my_data[my_data$bsmt == "None" & my_data$ac != "Yes",]
dim(nobsmt_and_noac)

## [1] 14 19

The example above creates a new object containing homes that don’t have a basement and don’t have air conditioning. Pay careful attention to the difference between this example and the prior example.
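The difference between “or” and “and” is easier to verify by hand on a tiny example. The sketch below uses a made-up data.frame with one home for each combination of basement and air conditioning; the name homes and its values are hypothetical.

```r
# One made-up home for each combination of basement and air conditioning
homes <- data.frame(
  bsmt = c("None", "Full", "None", "Full"),
  ac   = c("Yes",  "Yes",  "No",   "No"),
  stringsAsFactors = FALSE
)

# "or" keeps a row when at least one of the conditions holds
either <- homes[homes$bsmt == "None" | homes$ac != "Yes", ]
nrow(either)

# "and" keeps a row only when both conditions hold
both <- homes[homes$bsmt == "None" & homes$ac != "Yes", ]
nrow(both)

# Inside subset(), columns can be referred to by name without the 'homes$' prefix
both_again <- subset(homes, bsmt == "None" & ac != "Yes")
nrow(both_again)
```

Only the home with a full basement and air conditioning is excluded by the “or” condition, while the “and” condition keeps just the single home lacking both.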

Question #5: In your R Script go to a new line and add the comment “Question 5”, then on the line(s) below write code that creates a new data.frame called “election_losers” that subsets “election_data” (which you created in Question 4) to include only rows where the result was “Lost” and the year was less than or equal to 1984.


5 - Summarizing Data

Tables

Frequency tables are a way to summarize a single categorical (factor) variable. A one-way frequency table shows the frequencies of categories in a single categorical variable, while a two-way frequency table shows the relationship between two categorical variables. Both of these summaries can be created by the “table” function:

table(my_data$style) # A one-way frequency table of 'style'

## 
## 1 1/2 Story Frame     1 Story Brick     1 Story Condo     1 Story Frame 
##                25                24                45               347 
##     2 Story Brick     2 Story Condo     2 Story Frame Split Foyer Frame 
##                10                27               184                84 
## Split Level Frame 
##                31

table(my_data$bedrooms, my_data$bsmt) # A two-way frequency table of 'bedrooms' and 'bsmt'

##    
##     1/2 3/4 Crawl Full None
##   1   1   0     0    6   10
##   2   2   0     1  116  120
##   3   2   0     0  250   43
##   4   1   1     0  163    5
##   5   0   0     0   47    1
##   6   0   0     0    5    0
##   7   0   0     0    3    0

# Notice how 'bedrooms' is stored as a numeric variable, but it still can be used in the table function

Tables are their own type of object, and they can be used by functions like “barplot”:

my_table <- table(my_data$bsmt) # Tables can be stored as objects
barplot(my_table) # Creates a bar plot from a table

We can construct basic visuals of numeric variables too:

hist(my_data$sale.amount) # Histograms are for numeric variables

In next week’s lab we will cover how to create a broader range of more elegant visualizations, but for now these examples provide a useful illustration of how R handles different variables.

Numeric Summaries

Below are some examples showing how to calculate a few common summary statistics:

mean(my_data$sale.amount) # mean

## [1] 180098.3

sd(my_data$sale.amount) # standard deviation

## [1] 90655.31

min(my_data$sale.amount) # minimum

## [1] 38250

max(my_data$sale.amount) # maximum

## [1] 815000

quantile(my_data$sale.amount, .35) # the 35th percentile

##    35% 
## 139740

Right now it’s okay if you don’t know the exact definition of some of these summaries; we’ll cover methods for summarizing data in greater detail later this week.

The summary function conveniently provides many of these statistics all at once:

summary(my_data$sale.amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   38250  130000  157900  180098  205000  815000
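Real datasets often contain missing values, which R records as NA. The sketch below, using a made-up vector called prices, shows how a single missing value affects these summary functions and how the na.rm argument tells them to ignore it.

```r
prices <- c(150000, NA, 310000, 185000)   # made-up values with one missing entry

sum(is.na(prices))            # counts how many values are missing

mean(prices)                  # a single NA makes the result NA
mean(prices, na.rm = TRUE)    # na.rm = TRUE drops missing values before averaging

median(prices, na.rm = TRUE)  # sd, min, max, and quantile accept na.rm as well
```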

Question #6: In your R Script go to a new line and add the comment “Question 6”, then on the line(s) below write code that finds the range of the variable “Approval” in the data.frame “election_data” (the original one you first read in) by using either the min and max functions, or the range function. (Hint: be careful to read the statistical definition of “range” before finalizing your answer.)


6 - Packages

To facilitate more complex tasks in R, many people have developed their own sets of functions known as packages. If you will be working with a new package for the first time, it first must be installed:

install.packages("ggplot2")

After a package is installed, it still needs to be loaded using the library (or require) function before its functions can be used. You’ll need to re-load a package every time you open RStudio, but you’ll only need to install it once.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

qplot(my_data$ac) # qplot is a function in the package ggplot2
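Two related idioms are sketched below: checking whether a package is already installed before installing it, and calling a single function from a package without loading the whole package. The example uses the stats package, which ships with R, purely as a stand-in so that the code runs without downloading anything; the object name pkg is hypothetical.

```r
pkg <- "stats"   # a package that ships with R, used here only as a stand-in

# requireNamespace() returns TRUE if the package is installed and FALSE otherwise,
# so install.packages() only runs when the package is actually missing
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# The package::function form calls one function without a library() call
spread <- stats::sd(c(2, 4, 6))
print(spread)
```

The same pattern applies to any other package by changing the name stored in pkg.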

Question #7: In your R Script go to a new line and add the comment “Question 7”, then on the line(s) below write code that installs and loads the “plotly” package. Then, write a line of code that reads the “professor salaries” dataset from this link: https://remiller1450.github.io/data/Salaries.csv and stores it as an object called prof_data. Finally, add the code ggplotly(qplot(prof_data$rank, prof_data$salary)) and write a comment that briefly describes what this code generates.


7 - R Markdown

At the beginning of this document you were instructed to open a file called an “R Script”. This type of file is intended to contain only executable R code and comments.

RStudio also supports many other types of files, some of which use a general authoring framework known as “Markdown”. An “R Markdown” file allows you to both:

Save and execute R code
Generate high quality, reproducible reports that can be shared with an audience

To use R Markdown you might need to install and load the rmarkdown package:

install.packages("rmarkdown")
library("rmarkdown")

Once you have the package installed and loaded, you’ll be able to create R Markdown files by selecting: File -> New File -> R Markdown. Go ahead and try this, hitting “OK” to use the default options.

The R Markdown file you opened should look somewhat like an R Script. The first thing you should notice is the header, which is initiated by three ‘-’ characters and closed by another three ‘-’ characters. The header contains information, such as title, author, etc. that will appear at the top of the document created by your code.
The second thing you’ll see is a block of code initiated by ```{r setup} and closed by ```. The ``` wrappers are used to tell R Markdown that what is contained inside is code that should be executed when the document is created. This particular code block just sets up some options that will be used in executing your R code when your report is built; for now you can ignore it.
Next you’ll see section headers defined by the # character. The number of # characters used determines the level (size) of the header.
Finally you’ll see some blocks of R code initiated by ```{r} and closed by ```. These are chunks of R code that can be executed by clicking on the green arrow in the upper right corner of the code box. You can still execute smaller pieces of code within these blocks by highlighting them and hitting Ctrl-Enter.
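To make the pieces described above concrete, the sketch below uses ordinary R code to write a minimal R Markdown file to a temporary location and print its contents. The title, text, and chunk contents are made up, and the names fence, rmd_lines, and rmd_path are hypothetical; this is only an illustration of the file’s structure, not the template RStudio generates.

```r
fence <- strrep("`", 3)   # the three-backtick marker that opens and closes a code chunk

rmd_lines <- c(
  "---",                       # the header opens with three dashes
  'title: "Example report"',
  "output: html_document",
  "---",                       # and closes with another three dashes
  "",
  "# A top-level header",      # one '#' gives the largest header size
  "",
  "Ordinary text goes outside of the code chunks.",
  "",
  paste0(fence, "{r}"),        # the start of an R code chunk
  "1 + 1",
  fence                        # the end of the chunk
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)
cat(readLines(rmd_path), sep = "\n")

# With the rmarkdown package installed, rmarkdown::render(rmd_path) would knit this file
```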

As was mentioned earlier, the main use of R Markdown is to create beautiful documents that blend R code, output, and text into a polished report. To generate this document you must compile your R Markdown file using the “Knit” button (a blue yarn ball icon) located towards the upper left part of your screen.

More on R Markdown

The information in the prior section provides a very brief introduction to R Markdown. I encourage you to go through the lessons created by R Markdown’s developers at https://rmarkdown.rstudio.com/lesson-1.html if you have the time. These lessons include numerous screenshots, videos, and more detailed explanations of exactly how R Markdown works and what it is capable of.

In this class, you will be expected to submit all future homework assignments and labs as compiled R Markdown documents. On the first few assignments I will provide you a template for doing so, but later on you’ll be expected to create these documents by yourself, so it is to your advantage to learn about R Markdown now if you have the time.

Question #8 (Optional Extra Credit) - Part A: Create a new R Markdown file and delete all of the template code that appears beneath the “r setup” code block. Change the title to “Lab #1” and the authors to the names of your group members. Next, create section labels for each of this lab’s questions using three # characters followed by “Question X” (where X is the number of the question). Then, create an R code block within each section and add the code you wrote pertaining to that question. Finally, move any textual answers that had been written in comments to the area beneath your code block (but before the next section label). Be sure to remove the # characters from these comments.

Question #8 (Optional Extra Credit) - Part B: R Markdown will use LaTeX typesetting for any text that is wrapped in $ characters. For example, $\beta$ will appear as the Greek letter β after you knit your document. To practice this, include $H_0: \mu = 0$ in a sentence (the sentence can be anything, but should not be contained in an R code chunk or a section label).


Turning in Lab #1

Please submit your responses to the questions contained in Lab #1 via Canvas. If you completed the extra credit questions, you should turn in the compiled .html report (this can be found in the location where you saved your R Markdown file). Otherwise, you should turn in your .R script.

As a reminder, everyone should turn in their own copy of the lab, but you should include the names of the other members of your group (either in a comment or as authors in R Markdown).