A Crash Course in R

This lab provides a very brief introduction to RStudio, its user interface, some basic R programming, as well as a publishing extension known as RMarkdown.

Accessing RStudio

In this class, RStudio may be accessed in two ways: either from a local copy of the software on your computer, or from the Grinnell RStudio server https://rstudio.grinnell.edu/. A local copy of RStudio is already installed on each computer in this classroom as well as a few other labs on campus. If you’d like to get RStudio on your personal computer, you simply need to:

Download and install R from http://www.r-project.org/
Download and install RStudio from http://www.rstudio.com/

R and RStudio are open-source software and completely free to download and use, so there is little downside to adding them to your personal computer.

The Layout of RStudio

When you open RStudio the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> RScript. This will open a new window in the top left of the RStudio user interface for you to work in. At this point you should see four panels:

Your RScript (top left)
The Console (bottom left)
Your Environment (top right)
The Files/Plots/Help viewer (bottom right)

The RScript is similar to a text file and stores your code while you work on it. At any time you can send some or all of the code in your RScript to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run and display any textual/numeric output that the code generates.

The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code. On the Grinnell RStudio Server it will also allow you to conveniently access files (such as labs or data sets) that are stored on the server.

Using R

R is an interpreted language, which allows you to have the computer execute any piece of the code in your RScript at any time.
To run a piece of code, simply highlight it and hit Ctrl-Enter. You should see the code that you ran appear in the Console, along with the response generated by the code.

R as a Calculator

At its most basic level, R can be used to perform arithmetic operations. A few examples are shown below. Try typing them into your RScript and executing them on your own.

4 + 6 - (24/6)

## [1] 6

5 ^ 2 + 2 * 2

## [1] 29

Some arithmetic operations require the use of functions. In the example below, the function “exp” raises the number $e$ to the power that is input into the function. The input is given in the parentheses. In this example, the number 2 is input into the function “exp”:

exp(2)

## [1] 7.389056

This function takes the square root:

sqrt(4)

## [1] 2

This function takes the absolute value:

abs(-1)

## [1] 1
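
Functions can also be nested, and some accept more than one input. The short sketch below uses only functions that ship with base R (round, log, sqrt, and abs); the particular numbers are arbitrary and chosen just for illustration.

```r
# Round e^2 to two decimal places; 'digits' is a named argument of round()
rounded <- round(exp(2), digits = 2)
rounded

# The output of abs() is passed directly into sqrt()
nested <- sqrt(abs(-16))
nested

# log() computes natural logarithms by default; the 'base' argument changes that
log10.of.100 <- log(100, base = 10)
log10.of.100
```

Naming an argument (as in digits = 2) is optional when inputs are given in the expected order, but it tends to make code easier to read.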

Including Comments

Often in programming languages, you can provide comments within code to explain what the code does or leave notes for yourself. In R, the character “#” is used to start a comment. Everything on the same line immediately to the right of the “#” will not be executed if submitted to the console.

# This entire line is a comment and will do nothing if run

1:6 # The command "1:6" appears before this comment

## [1] 1 2 3 4 5 6

In your RScript, comments appear in green. You also should remember that the “#” starts a comment only for a single line of your RScript.

Objects and Assignment

R allows you to store things in objects, which can later be referenced or used as inputs to functions:

x <- 5 # This assigns the value 5 to the variable called 'x'
x^2    # We can now reference 'x'

## [1] 25

R stores sequences in objects called vectors:

x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)

## [1] 1 2 3

y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)

## [1] 1 2 3

z <- c("A","B","C") # Vectors can contain many types of values
print(z)

## [1] "A" "B" "C"
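
Most arithmetic in R operates on every element of a vector at once. The following is a minimal sketch with a made-up vector named 'v' (not part of the lab's data); the values in the comments are what each line should return.

```r
v <- c(2, 4, 6)   # A numeric vector with three elements

v * 2             # Each element is doubled: 4 8 12
v + c(1, 1, 1)    # Vectors of the same length are added element by element: 3 5 7
length(v)         # The number of elements: 3
sum(v)            # The total of the elements: 12
```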

Data is typically stored in objects called data.frames, which are comprised of several vectors of the same length:

DF <- data.frame(x = x, y = y, z = z) # Creates a data.frame object 'DF'
print(DF)

##   x y z
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C

Note that R is case sensitive, meaning the lower case ‘x’ is a different object than the upper case ‘X’.

Indexing

Suppose we set up a vector ‘x’ and would like to extract the element in its second position and store it in a new object called ‘b’:

x <- 5:10
b <- x[2]
b

## [1] 6

The square brackets indicate we want to access a certain position (or multiple positions) within ‘x’.
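
To access multiple positions, a vector of positions can be placed inside the square brackets. The sketch below reuses the same vector 'x' defined as 5:10; the comments give the values each line should return.

```r
x <- 5:10

x[c(1, 3)]   # Positions 1 and 3: 5 7
x[2:4]       # Positions 2 through 4: 6 7 8
x[-1]        # A negative position drops that element: 6 7 8 9 10
```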

For some objects, such as data.frames, multiple dimensions are needed to specify an element’s position:

DF <- data.frame(x = x, y = y, z = z) 
DF[2,3] # The element in row 2, column 3

## [1] B
## Levels: A B C
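
Leaving one of the two positions blank returns an entire row or column, and columns can also be referenced by name. The sketch below rebuilds the three-row data.frame from the previous section so that it runs on its own. Note that in R 4.0.0 and later, data.frame() keeps text columns as character vectors by default, so the single-element lookup prints "B" without the "Levels" line shown above.

```r
x <- 1:3
y <- c(1, 2, 3)
z <- c("A", "B", "C")
DF <- data.frame(x = x, y = y, z = z)

DF[2, ]       # Leaving the column position blank returns the entire second row
DF[, 3]       # Leaving the row position blank returns the entire third column
DF[2, "z"]    # Columns can also be referenced by name instead of position
```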

Reading in Data

There are many ways to get data into R. If a data set is stored somewhere on your computer as a .csv file, you can load it using the R function read.csv:

# my.data <- read.csv("H://path_to_my_data/my_data.csv")
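
Because the path above is only a placeholder, here is a minimal sketch that can be run anywhere: it writes a small made-up table to a temporary .csv file and then reads it back with read.csv. The names 'toy', 'toy.path', and 'toy.data' are hypothetical and exist only for this illustration.

```r
# Build a small made-up data set (hypothetical values, not from the course data)
toy <- data.frame(id = 1:3, price = c(100, 250, 175))

# Write it to a temporary .csv file so the example does not depend on any real path
toy.path <- tempfile(fileext = ".csv")
write.csv(toy, toy.path, row.names = FALSE)

# Read the file back in, just as you would with a file of your own
toy.data <- read.csv(toy.path)
head(toy.data)
```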

read.csv is also capable of reading .csv files stored on the web. All of the datasets that you’ll need for this class are stored on my personal website. The example below reads the data set IowaCityHomeSales.csv and stores it as an object called ‘my.data’:

my.data <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

We can check that the data were read in correctly:

head(my.data)  # This function prints the first several rows of an object

When working with a newly loaded dataset there are a few things we might want to know about the object storing the data:

dim(my.data) # prints the dimensions of 'my.data'

## [1] 777  19

nrow(my.data) # prints the number of rows of 'my.data'

## [1] 777

ncol(my.data) # prints the number of columns of 'my.data'

## [1] 19

colnames(my.data) # prints the names of the variables (columns) of 'my.data'

##  [1] "sale.amount"  "sale.date"    "occupancy"    "style"       
##  [5] "built"        "bedrooms"     "bsmt"         "ac"          
##  [9] "attic"        "area.base"    "area.add"     "area.bsmt"   
## [13] "area.garage1" "area.garage2" "area.living"  "area.lot"    
## [17] "lon"          "lat"          "assessed"

Variable Types

Most of the data we’ll work with will contain one of two types of variables: numeric variables (e.g., ‘sale.amount’) or factor variables (e.g., ‘bsmt’). If you’re ever unsure of a variable’s type, you can check it using data.class:

data.class(my.data$sale.amount)

## [1] "numeric"

data.class(my.data$bsmt)

## [1] "factor"

Occasionally we might need to use logical variables. In R these take on values of “TRUE” or “FALSE” and have several uses. Logical variables can be created using logical conditions:

x <- (1 > 3)
x

## [1] FALSE

data.class(x)

## [1] "logical"
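
The three variable types can be compared side by side using small made-up vectors (the names 'sizes', 'bsmt.type', and 'is.large' are hypothetical). One caveat worth knowing: since R 4.0.0, read.csv stores text columns as character vectors unless stringsAsFactors = TRUE is supplied, so on a newer installation data.class(my.data$bsmt) may report "character" rather than "factor".

```r
sizes     <- c(1102, 1540, 980)                  # numeric
bsmt.type <- factor(c("Full", "None", "Full"))   # factor (categorical)
is.large  <- sizes > 1000                        # logical, one value per element

data.class(sizes)       # "numeric"
data.class(bsmt.type)   # "factor"
data.class(is.large)    # "logical"

levels(bsmt.type)   # The distinct categories of a factor: "Full" "None"
sum(is.large)       # TRUE counts as 1 and FALSE as 0, so this counts the TRUEs: 2
```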

Subsetting Data

Suppose we want to access a single variable in our data set. There are a few different ways we can do so:

sale.price1 <- my.data$sale.amount # The $ accesses the variable 'sale.amount' within 'my.data'
sale.price2 <- my.data[,1] # We can also use indexing to access 'sale.amount'
# Notice how we specified the second dimension of 'my.data' (its columns)
head(sale.price1)

## [1] 172500  90000 168500 205000 121000 215000

head(sale.price2)

## [1] 172500  90000 168500 205000 121000 215000

Suppose we want to access a single case (subject) in our dataset:

first.house <- my.data[1,] # This stores the entire first row
head(first.house)

##   sale.amount sale.date           occupancy         style built bedrooms
## 1      172500  1/3/2005 116 (Zero Lot Line) 1 Story Frame  1993        3
##   bsmt  ac attic area.base area.add area.bsmt area.garage1 area.garage2
## 1 Full Yes  None      1102        0       925          418            0
##   area.living area.lot       lon      lat assessed
## 1        1102     5520 -91.50913 41.65116   173040
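
The same indexing ideas extend to several rows or columns at once. The sketch below uses a small made-up data.frame named 'homes' (hypothetical values standing in for 'my.data') so that it runs without downloading anything.

```r
# A small made-up data set standing in for 'my.data' (hypothetical values)
homes <- data.frame(sale.amount = c(172500, 90000, 168500, 205000),
                    bedrooms    = c(3, 2, 3, 4),
                    built       = c(1993, 1956, 1978, 2001))

two.cols <- homes[, c("sale.amount", "built")]   # Two columns selected by name
two.rows <- homes[2:3, ]                         # Rows 2 and 3, all columns
homes[1:2, "bedrooms"]                           # Rows 1 and 2 of a single column: 3 2

dim(two.cols)   # 4 rows and 2 columns
dim(two.rows)   # 2 rows and 3 columns
```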

Logical Conditions and Subsetting

Suppose we want to know which elements of sale.amount are larger than $200,000:

my.data$sale.amount > 200000 # Logical vector for the condition "> 200000"
which(my.data$sale.amount > 200000) # Positions of elements where the condition is "TRUE"
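
Since the output of these commands on the full data set is long, a small sketch with four made-up prices may make the behavior easier to see. The vector 'prices' is hypothetical; the comments give the values each line should return.

```r
prices <- c(150000, 250000, 90000, 310000)   # Hypothetical sale prices

over.200k <- prices > 200000
over.200k            # FALSE TRUE FALSE TRUE
which(over.200k)     # 2 4
sum(over.200k)       # 2, the number of prices above 200000
mean(over.200k)      # 0.5, the proportion of prices above 200000
prices[over.200k]    # 250000 310000, the prices that meet the condition
```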

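Some useful logical operators include:

== (equal to)
!= (not equal to)
> and < (greater than, less than)
>= and <= (greater than or equal to, less than or equal to)
& (and)
| (or)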

Logical conditions are particularly useful for subsetting objects. Here are a few examples:

large.and.expensive <- my.data[my.data$sale.amount > 500000 & my.data$area.living > 3000,]

The example above creates a new object containing all homes that sold for more than $500,000 and have living areas over 3,000 square feet.

no.bsmt.or.no.ac <- my.data[my.data$bsmt == "None" | my.data$ac != "Yes",]
dim(no.bsmt.or.no.ac)

## [1] 239  19

The example above creates a new object containing all homes that don’t have a basement or don’t have air conditioning.

no.bsmt.and.no.ac <- my.data[my.data$bsmt == "None" & my.data$ac != "Yes",]
dim(no.bsmt.and.no.ac)

## [1] 14 19

The example above creates a new object containing homes that don’t have a basement and don’t have air conditioning. Pay careful attention to the difference between this example and the prior example.
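
The difference between “or” and “and” can be traced by hand on a data set small enough to read at a glance. The four-row data.frame 'homes' below is made up for this purpose and is not part of the course data.

```r
# Four hypothetical homes
homes <- data.frame(bsmt = c("None", "Full", "None", "Full"),
                    ac   = c("Yes",  "No",   "No",   "Yes"))

either <- homes[homes$bsmt == "None" | homes$ac != "Yes", ]
both   <- homes[homes$bsmt == "None" & homes$ac != "Yes", ]

nrow(either)   # 3: rows 1, 2, and 3 each satisfy at least one condition
nrow(both)     # 1: only row 3 satisfies both conditions
```

Every row kept by “and” is also kept by “or”, which is why the “or” subset can never be the smaller of the two.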

Tables, Bar Charts and Histograms

One-way frequency tables summarize a single categorical (factor) variable, while two-way frequency tables summarize the relationship between two categorical variables. Both of these summaries can be created by the “table” function:

table(my.data$style) # A one-way frequency table of 'style'

## 
## 1 1/2 Story Frame     1 Story Brick     1 Story Condo     1 Story Frame 
##                25                24                45               347 
##     2 Story Brick     2 Story Condo     2 Story Frame Split Foyer Frame 
##                10                27               184                84 
## Split Level Frame 
##                31

table(my.data$bedrooms, my.data$bsmt) # A two-way frequency table of 'bedrooms' and 'bsmt'

##    
##     1/2 3/4 Crawl Full None
##   1   1   0     0    6   10
##   2   2   0     1  116  120
##   3   2   0     0  250   43
##   4   1   1     0  163    5
##   5   0   0     0   47    1
##   6   0   0     0    5    0
##   7   0   0     0    3    0

# Notice that 'bedrooms' is stored as a numeric variable, 
# but it still can be used in the table function
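
Counts are sometimes easier to interpret as proportions. One possible approach, sketched here with a made-up vector named 'style' rather than the course data, is to pass a table to the base R function prop.table:

```r
style <- c("Frame", "Brick", "Frame", "Frame")   # Hypothetical categories

counts <- table(style)
counts               # Brick 1, Frame 3
prop.table(counts)   # Brick 0.25, Frame 0.75
```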

Tables are their own type of object, and they can be used by functions like “barplot”:

my.table <- table(my.data$bsmt) # Tables can be stored as objects
barplot(my.table) # Creates a bar plot from a table

We can construct basic visuals of numeric variables too:

hist(my.data$sale.amount) # Histograms are for numeric variables

Numeric Summaries

Below are some examples showing how to calculate some common summary statistics:

mean(my.data$sale.amount) # mean

## [1] 180098.3

sd(my.data$sale.amount) # standard deviation

## [1] 90655.31

min(my.data$sale.amount) # minimum

## [1] 38250

max(my.data$sale.amount) # maximum

## [1] 815000

quantile(my.data$sale.amount, .35) # the 35th percentile

##    35% 
## 139740

The summary function conveniently provides many of these statistics all at once:

summary(my.data$sale.amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   38250  130000  157900  180098  205000  815000
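
Real data sets often contain missing values, which R records as NA. Most of the summary functions above return NA when any value is missing unless the argument na.rm = TRUE is supplied. The vector 'vals' below is made up to illustrate this.

```r
vals <- c(10, 20, NA, 40)    # Hypothetical data with one missing value

mean(vals)                   # NA, because one value is missing
mean(vals, na.rm = TRUE)     # 23.33333, the mean of 10, 20, and 40
median(vals, na.rm = TRUE)   # 20
```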

Packages

To facilitate more complex tasks in R, people have developed their own sets of functions known as packages. If you are working on your own computer, packages will need to be installed:

install.packages("ggplot2")

Once a package is installed it still needs to be loaded in order to be used. You’ll need to load a package every time you open RStudio, but you’ll only need to install it once.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.2

qplot(my.data$ac) # qplot is a function in the package ggplot2

If you are working on the Grinnell RStudio Server, you will be unable to install packages; however, almost all of the packages you’ll need are already installed on the server.
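
A common pattern, sketched below, is to install a package only when it is missing and then load it. To keep the example from downloading anything, the variable 'pkg' is set to "stats", a package that ships with R; on your own computer you might substitute "ggplot2".

```r
pkg <- "stats"   # Placeholder: 'stats' ships with R, so nothing is downloaded here

# requireNamespace() returns TRUE if the package is already installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that 'pkg' holds the package name as text
library(pkg, character.only = TRUE)
```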

RMarkdown

At the beginning of this document you were instructed to open a file called an “RScript”. RStudio supports many other types of files, some of which are written in an authoring framework known as “RMarkdown”. RMarkdown conveniently allows you to both:

Save and execute R code
Generate high quality, reproducible reports that can be shared with an audience

To use RMarkdown you might need to install and load the package:

install.packages("rmarkdown")
library("rmarkdown")

If you have the package installed and loaded, you’ll be able to create RMarkdown files by selecting: File -> New File -> R Markdown. Go ahead and try this, hitting “Ok” to use the default options.

Your new RMarkdown file should look a lot like an RScript. The first thing you should notice is the header, which is initiated by three ‘-’ characters and closed by another three ‘-’ characters. The header contains information that will appear at the top of your compiled report.
The second thing you’ll see is a block of code initiated by three backtick characters followed by {r setup} and closed by another three backtick characters. This just sets up some options used when executing your R code when your report is built. For now you can ignore it.
Next you’ll see section headers defined by the ‘#’ character. The number of ‘#’ characters determines the level (size) of the header.
Finally you’ll see some blocks of code initiated by three backtick characters followed by {r} and closed by another three backtick characters. These are bits of R code that can be executed by clicking on the green arrow in the upper right corner of the code box. You can execute smaller pieces of code within these blocks by highlighting them and hitting Ctrl-Enter.

To compile your RMarkdown file into a polished report you need to “Knit” the file. You can do this by clicking on the “Knit” button (the yarn ball icon) located towards the upper left of the screen.
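
To make the pieces of an RMarkdown file concrete, the sketch below assembles a very small one from R and saves it to a temporary file. The title, section header, and file location are all made up for illustration. The final commented line shows how the file could be knit from code rather than with the button, assuming the rmarkdown package and pandoc are available (pandoc is bundled with RStudio).

```r
fence <- strrep("`", 3)   # The three backtick characters that open and close a code block

rmd.lines <- c(
  "---",                          # The header opens with three '-' characters
  "title: \"Example Report\"",
  "output: html_document",
  "---",                          # ... and closes with three more
  "",
  "# A section header",
  "",
  paste0(fence, "{r}"),           # Start of a block of R code
  "summary(c(1, 2, 3))",
  fence                           # End of the block
)

rmd.path <- tempfile(fileext = ".Rmd")
writeLines(rmd.lines, rmd.path)

# Knitting from code instead of the button (needs the rmarkdown package and pandoc):
# rmarkdown::render(rmd.path)
```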

More on RMarkdown

The information in the prior section provides a minimally sufficient introduction to RMarkdown. I encourage you to go through the lessons created by RMarkdown’s developers at https://rmarkdown.rstudio.com/lesson-1.html if you have the time. These lessons include numerous screenshots, videos, and more detailed explanations of exactly how RMarkdown works and what it is capable of.

On Your Own Questions

Directions:

Create a new RMarkdown file “HW1_MyName.Rmd” where “MyName” is replaced with your actual name. Title the document “Homework 1”
Write code blocks to address the following questions, labeling each block as “Question 1”, “Question 2”, etc. See the file “HW_Sample.Rmd” if you need a template
Make sure your .Rmd file compiles (knits) and the formatting looks good
Submit via email both your .Rmd file and the .html file that it generates

Option #1 (Questions 1 - 6)

Question #1

Write code that reads in the data file “CollegeData.csv”, which is available at the url: “https://remiller1450.github.io/data/CollegeData.csv”, and stores the data as an object named “Dat”.

Question #2

Write code that finds the 20th percentile of average four-year tuition cost (ie: the 20th percentile of the variable “COSTT4_A”). Write a sentence describing (in non-statistical terms) what the 20th percentile means.

Question #3

Write code that creates a new vector named “log.sal” that is the log of the variable “AVGFACSAL” in the College Data. Then use the summary function to provide a summary of the new variable. Write a sentence stating whether “log.sal” is approximately symmetric.

Question #4

Write code that creates a two-way frequency table using the variables “REGION” and “LOCALE”. Write 1-2 sentences describing the relationship you see in the table.

Question #5

Use the qplot function in the package ggplot2 to construct a plot of the variable “REGION”.

Question #6

Write code that creates a new object named “Iowa.dat” that contains only colleges located in the state of Iowa (ie: the variable “STABBR” is “IA”). Print the dimensions of this new object.

Option #2 (Questions 4 and 7)

Question #4

See Question 4 from the section above

Question #7

In R you can define your own functions. The following code defines a function titled “position2”, which accepts an object as its input (which is defined internally as “X”) and returns the element in the second position:

position2 <- function(X){
  out <- X[2]
  return(out)
}
position2(c("Q","R","S"))

## [1] "R"
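
Functions can accept more than one argument, and an argument can be given a default value. As an illustration only, the hypothetical function “position.k” below generalizes “position2” by letting the caller choose which position to return:

```r
# 'k' has a default value of 1, so it can be omitted when calling the function
position.k <- function(X, k = 1){
  out <- X[k]
  return(out)
}

position.k(c("Q", "R", "S"), 3)   # "S"
position.k(c("Q", "R", "S"))      # "Q", because k defaults to 1
```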

The trimmed (or truncated) mean is a statistical measure of central tendency that removes a certain percentage of the highest and lowest observations (e.g., the 10% trimmed mean uses the middle 80% of the data).

For this question, write an R function named “trimmedmean” that accepts two arguments, a data vector “X” and a percentage “p”, and returns the p% trimmed mean. Then use your function to find the 5% trimmed mean of the variable “AVGFACSAL” from the College Data described in Question #1. (Hint: Use the function “sort” on your vector and then use logical conditions to subset it before taking its mean)