This lab provides an introduction to RStudio, its user interface, some basic R programming commands, and a publishing extension known as R Markdown. In doing so, it will also cover some basic definitions and operations involving data.
In this course you’ll need to install R on your own personal computer, which you should bring to class every day. Getting RStudio to function properly requires two steps:
1. Install R from http://www.r-project.org/
2. Install RStudio from http://www.rstudio.com/
R and RStudio are both open-source software that is completely free to download and use, and they don’t take up much space on your computer, so there is very little downside to adding them to your personal computer.
After you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> R Script, which will open a new window in the top left of the RStudio user interface for you to work in. At this point you should see four panels: the R Script, the Console, the Environment, and the Files/Plots/Help Viewer.
An R Script is like a text file that stores your code while you work on it. At any time, you can send some or all of the code in your R Script to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run, and will display the textual/numeric output that the code generates.
The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).
Question #1: Create a blank R Script. You will use this R Script to record your answers to future questions in this document.
R is an interpreted programming language, which allows you to have the computer execute any piece of code contained in your R Script at any time without a lengthy compiling process.
To run a single piece of code, simply highlight it and either hit Ctrl-Enter or click on the “Run” button near the top right corner of your R Script. You should see an echo of the code you ran in the Console, along with any response generated by that code.
At its most basic level, R can be used to perform arithmetic operations. A few examples are shown below; try typing them into your R Script and executing them on your own.
4 + 6 - (24/6)
## [1] 6
5 ^ 2 + 2 * 2
## [1] 29
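As a further sketch (these lines are illustrative additions, not part of the examples above), the code below shows how parentheses change the order of operations, and tries two operators not used above: %% (remainder) and %/% (integer division). The object names a, b, r, and q are arbitrary.

```r
# Exponents are evaluated first, then multiplication/division, then addition/subtraction
a <- 5 ^ 2 + 2 * 2      # 25 + 4
b <- 5 ^ (2 + 2) * 2    # parentheses force 2 + 2 first, giving 5^4 * 2

# Remainder and integer division
r <- 17 %% 5            # remainder after dividing 17 by 5
q <- 17 %/% 5           # whole-number part of 17 / 5

print(c(a, b, r, q))
```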
Some arithmetic operations require the use of functions. In the example below, the function “exp” raises the number \(e\) to the power you provide as an input; here the number “2” is given as the input, so the result is \(e^2\):
exp(2)
## [1] 7.389056
This function takes the square root:
sqrt(4)
## [1] 2
This function takes the absolute value:
abs(-1)
## [1] 1
A function’s inputs are called arguments in R documentation. For complex functions, the arguments should be specified using the names defined by the function. The example below takes the base two logarithm of the number 4 using the log function and appropriate inputs to the arguments x and base.
log(x = 4, base = 2)
## [1] 2
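To make the role of argument names more concrete, here is a small sketch using the same log function. It relies only on the documented signature log(x, base = exp(1)); the object names are arbitrary.

```r
# Named arguments can be supplied in any order
named_1 <- log(x = 4, base = 2)
named_2 <- log(base = 2, x = 4)

# Unnamed arguments are matched by position: 'x' first, then 'base'
positional <- log(4, 2)

# Leaving out 'base' uses its default value, which gives the natural logarithm
natural <- log(4)

print(c(named_1, named_2, positional, natural))
```

Naming arguments is more typing, but it protects you from accidentally supplying inputs in the wrong order.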
R contains thousands of functions, each potentially containing many different arguments. Whenever you use an unfamiliar function for the first time, it is good practice to read that function’s documentation, which will describe the function’s uses and arguments. You can pull up a function’s documentation by typing ? before the function’s name in the Console. The example below pulls up the documentation of the log function.
?log
Remark: A complete list of the functions contained in base R is available at this link; that said, browsing this list tends to be a very inefficient way of finding the function you need. There’s no shame in searching online (e.g., using Google or StackOverflow) for the right R commands for the task you’re working on.
When coding, it is good practice to include comments that describe what your code is doing. In R the character “#” is used to start a comment. Everything on that line to the right of the “#” will not be executed when that line is submitted to the Console.
# This entire line is a comment and will do nothing if run
1:6 # The command "1:6" appears before this comment
## [1] 1 2 3 4 5 6
In your R Script, comments appear in green. You should also remember that the “#” starts a comment only for a single line of your R Script, so each line of a long, multi-line comment must begin with its own “#”.
Question #2: In your R Script add a comment with the text “Question 2” on the script’s first line. Then, on the second line, write a command that finds the square root of the absolute value of negative four.
R stores data in containers called objects. Data is assigned to an object using <- or =. After assignment, data can be referenced using that object’s name. The simplest objects are scalars, or single elements:
x <- 5 # This assigns the numeric value '5' to an object called 'x'
x^2 # We can now reference 'x'
## [1] 25
R stores sequences of elements in objects called vectors:
x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)
## [1] 1 2 3
y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)
## [1] 1 2 3
z <- c("A","B","C") # Vectors can contain different types of values
print(z)
## [1] "A" "B" "C"
There are three main types of vectors:
- numeric, such as c(1,2,3)
- character, such as c("A","B","C")
- logical, such as c(TRUE, FALSE, TRUE)
Vector types are important because most functions expect inputs of a certain type and will produce an error if an input of the wrong type is given. Mixing these different types usually will default to a character vector - this is a common source of errors when working with real world raw data loaded into R.
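A minimal sketch of this mixing behavior is shown below. The object names mixed and num_and_logical are made up for illustration.

```r
# Combining numeric, character, and logical values forces everything to character
mixed <- c(1, "A", TRUE)
print(mixed)
typeof(mixed)

# Combining only numeric and logical values gives a numeric vector:
# TRUE is converted to 1 and FALSE is converted to 0
num_and_logical <- c(1, TRUE, FALSE)
print(num_and_logical)
typeof(num_and_logical)
```

If a column of numbers loads as a character vector, a single stray text entry in the raw data is a likely cause.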
You can check the type of an object using the typeof function:
chars <- c("1","2","3") # Create a character vector
typeof(chars)
## [1] "character"
nums <- c(1,2,3) # Create a numeric vector
typeof(nums)
## [1] "double"
mean(chars) # This returns NA with a warning; mean() only works for numeric or logical vectors
## Warning in mean.default(chars): argument is not numeric or logical: returning NA
## [1] NA
mean(nums) # This works as intended
## [1] 2
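One common way to recover from a vector that was stored as the wrong type is to convert it. The sketch below uses the base R function as.numeric; the object name nums_from_chars is made up for illustration.

```r
chars <- c("1", "2", "3")             # numbers stored as text

# Convert the character vector to a numeric vector
nums_from_chars <- as.numeric(chars)
typeof(nums_from_chars)
mean(nums_from_chars)

# Text that does not look like a number is converted to NA (with a warning)
as.numeric(c("1", "two", "3"))
```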
Many R functions are vectorized, meaning they can accept a scalar input, for example 1, and return a scalar output, f(1), or they can accept a vector input, such as c(1,2,3), and return a vector c(f(1),f(2),f(3)). The sqrt function is vectorized:
nums <- c(1,2,3,4)
sqrt(nums)
## [1] 1.000000 1.414214 1.732051 2.000000
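Arithmetic and comparison operators are vectorized in the same way. The following sketch applies a few of them to the same vector; the names doubled, summed, and big are arbitrary.

```r
nums <- c(1, 2, 3, 4)

doubled <- nums * 2                  # every element is multiplied by 2
summed  <- nums + c(10, 20, 30, 40)  # elements are added position by position
big     <- nums^2 > 5                # a logical vector, one TRUE/FALSE per element

print(doubled)
print(summed)
print(big)
```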
Datasets are usually stored in objects called data.frames, which are composed of several vectors of the same length:
DF <- data.frame(x = x, y = y, z = z) # Creates a data.frame object 'DF'
print(DF)
## x y z
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
Question #3: In your R Script go to a new line and add the comment “Question 3” and on the line(s) below this comment create a data.frame object named “dfn” containing a vector named “number” that is the integers from 1 to 10, and a vector named “number_squared” which is the integers from 1 to 10 squared.
Suppose we have a vector ‘x’ and would like to extract the element in its second position and assign it to a new object called ‘b’:
x <- 5:10
b <- x[2]
b
## [1] 6
The square brackets are used to access a certain position (or multiple positions) within an object. In this example we access the second position within the object “x”.
Some objects, such as data.frames, have multiple dimensions, requiring indices in each dimension to describe an element’s position:
DF <- data.frame(x = x, y = y, z = z)
DF[2,3] # The element in row 2, column 3
## [1] "B"
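The sketch below illustrates accessing multiple positions at once. To keep it self-contained, it rebuilds a small three-row data.frame like the one printed earlier; the object names first_and_third, middle, and all_but_first are made up.

```r
x <- 5:10
first_and_third <- x[c(1, 3)]   # a vector of positions selects several elements
middle <- x[2:4]                # a range of positions
all_but_first <- x[-1]          # a negative index drops that position

# A small stand-in data.frame (assumed values, matching the earlier printout)
DF <- data.frame(x = 1:3, y = c(1, 2, 3), z = c("A", "B", "C"))
DF[2, ]      # leaving the column index blank returns the entire second row
DF[, 3]      # leaving the row index blank returns the entire third column
DF[2, "z"]   # columns can also be referenced by name
```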
There are many ways to get data into R. If a data set is stored somewhere on your computer as a .csv file, you can load it using the R function read.csv:
# my.data <- read.csv("H://path_to_my_data/my_data.csv")
read.csv is also capable of reading .csv files stored on the web. All of the datasets that you’ll need for labs and homework assignments in this class will be hosted online. The example below reads the data set IowaCityHomeSales.csv (which is hosted at the specified URL) and stores it as an object called ‘my_data’:
my_data <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
We can use the head function to see the first few rows of these data:
head(my_data) # This function prints the first several rows of an object
When working with a newly loaded dataset there are a few things you might want to know:
dim(my_data) # prints the dimensions of 'my_data'
## [1] 777 19
nrow(my_data) # prints the number of rows of 'my_data'
## [1] 777
ncol(my_data) # prints the number of columns of 'my_data'
## [1] 19
colnames(my_data) # prints the column names of 'my_data'
## [1] "sale.amount" "sale.date" "occupancy" "style" "built"
## [6] "bedrooms" "bsmt" "ac" "attic" "area.base"
## [11] "area.add" "area.bsmt" "area.garage1" "area.garage2" "area.living"
## [16] "area.lot" "lon" "lat" "assessed"
Statisticians typically refer to the rows of a dataset as “cases”, “observations”, or “data-points”. They typically refer to the columns of a dataset as “variables” or “features”. Much like how vectors can be of different types, so can variables:
Question #4 (Part A): In your R Script go to a new line and add the comment “Question 4”, then on the line(s) below write code that reads and stores the data located at “https://remiller1450.github.io/data/ElectionMargin.csv” as “election_data”. Next, write code that finds the dimensions of this data.frame.
Question #4 (Part B): For the election margin data, use comments to record your answers to the following questions:
Suppose we want to access a single variable in our data set. There are a few different ways we can do so:
sale_price1 <- my_data$sale.amount # The $ accesses the variable named 'sale.amount' within 'my_data'
sale_price2 <- my_data[,1] # We can also use indexing to access 'sale.amount'
# Notice how we specified only the second dimension of 'my_data' (its columns)
head(sale_price1)
## [1] 172500 90000 168500 205000 121000 215000
head(sale_price2)
## [1] 172500 90000 168500 205000 121000 215000
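A variable can also be accessed by its name inside the square brackets. The sketch below uses a small stand-in data.frame called homes (a made-up name, with values copied from the first three rows shown in this lab) so that it runs without downloading anything.

```r
# Stand-in data (assumption: a tiny copy of two columns, not the full data set)
homes <- data.frame(sale.amount = c(172500, 90000, 168500),
                    bedrooms = c(3, 2, 4))

p1 <- homes$sale.amount        # access by name using $
p2 <- homes[, 1]               # access by column position
p3 <- homes[, "sale.amount"]   # access by column name inside [ , ]
p4 <- homes[["sale.amount"]]   # access by column name using [[ ]]

# All four approaches return the same numeric vector
identical(p1, p2) && identical(p2, p3) && identical(p3, p4)
```

Accessing by name tends to be safer than accessing by position, since the code keeps working if the columns are ever reordered.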
Suppose we want to access a single case (subject) in our dataset:
first_house <- my_data[1,] # This stores the entire first row
head(first_house)
## sale.amount sale.date occupancy style built bedrooms bsmt
## 1 172500 1/3/2005 116 (Zero Lot Line) 1 Story Frame 1993 3 Full
## ac attic area.base area.add area.bsmt area.garage1 area.garage2 area.living
## 1 Yes None 1102 0 925 418 0 1102
## area.lot lon lat assessed
## 1 5520 -91.50913 41.65116 173040
Suppose we want a range of cases:
firstfive_houses <- my_data[1:5,] # This stores the first five rows
head(firstfive_houses)
## sale.amount sale.date occupancy style
## 1 172500 1/3/2005 116 (Zero Lot Line) 1 Story Frame
## 2 90000 1/5/2005 113 (Condominium) 1 Story Frame
## 3 168500 1/12/2005 101 (Single-Family / Owner Occupied) Split Foyer Frame
## 4 205000 1/14/2005 101 (Single-Family / Owner Occupied) Split Foyer Frame
## 5 121000 1/24/2005 113 (Condominium) 1 Story Condo
## built bedrooms bsmt ac attic area.base area.add area.bsmt area.garage1
## 1 1993 3 Full Yes None 1102 0 925 418
## 2 2001 2 None Yes None 878 0 0 0
## 3 1976 4 Full Yes None 1236 0 700 576
## 4 1995 3 Full Yes None 1466 0 500 0
## 5 2001 2 None Yes None 1150 0 0 0
## area.garage2 area.living area.lot lon lat assessed
## 1 0 1102 5520 -91.50913 41.65116 173040
## 2 264 878 3718 -91.52296 41.67324 89470
## 3 0 1236 8800 -91.48231 41.65849 164230
## 4 0 1466 16720 -91.55224 41.64900 211890
## 5 528 1150 3427 -91.57814 41.65263 115430
Suppose we want to know which elements of sale.amount are larger than $200,000:
my_data$sale.amount > 200000 # Logical vector for the condition "> 200000"
which(my_data$sale.amount > 200000) # Positions of elements where the condition is "TRUE"
Some useful logical operators include:
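The sketch below tries the comparison operators ==, !=, >, and >=, along with the combining operators & (and), | (or), and ! (not), on a short vector. The vector prices is a stand-in holding the first six sale amounts printed earlier in this lab; the other object names are made up.

```r
prices <- c(172500, 90000, 168500, 205000, 121000, 215000)

over_200k <- prices > 200000          # TRUE where the condition holds
positions <- which(over_200k)         # positions of the TRUE elements

prices == 90000                       # equal to
prices != 90000                       # not equal to
prices >= 168500 & prices <= 205000   # both conditions must be TRUE
prices < 100000 | prices > 200000     # at least one condition must be TRUE
!over_200k                            # flips TRUE and FALSE

n_over <- sum(over_200k)              # TRUE counts as 1, so this counts the matches
print(positions)
print(n_over)
```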
Logical conditions are particularly useful for subsetting objects. Here are a few examples:
large_and_expensive1 <- my_data[my_data$sale.amount > 500000 & my_data$area.living > 3000,] ## Subset via indexing
large_and_expensive2 <- subset(my_data, my_data$sale.amount > 500000 & my_data$area.living > 3000) ## Subset via the "subset" function
The code given above shows two different ways of creating a new object that contains all homes which sold for more than $500,000 and have living areas over 3,000 square feet.
nobsmt_or_noac1 <- my_data[my_data$bsmt == "None" | my_data$ac != "Yes",]
nobsmt_or_noac2 <- subset(my_data, my_data$bsmt == "None" | my_data$ac != "Yes")
dim(nobsmt_or_noac1)
## [1] 239 19
The example above creates a new object containing all homes that don’t have a basement or don’t have air conditioning (or both).
nobsmt_and_noac <- my_data[my_data$bsmt == "None" & my_data$ac != "Yes",]
dim(nobsmt_and_noac)
## [1] 14 19
The example above creates a new object containing homes that don’t have a basement and don’t have air conditioning. Pay careful attention to the difference between this example and the prior example.
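To see the difference between “or” and “and” on a scale small enough to check by hand, consider the following sketch. The four-row data.frame homes is made up for illustration and is not part of the Iowa City data.

```r
# Four hypothetical homes covering every combination of basement and air conditioning
homes <- data.frame(bsmt = c("Full", "None", "None", "Full"),
                    ac   = c("Yes",  "Yes",  "No",   "No"))

# '|' keeps a row if at least one condition is TRUE
either <- homes[homes$bsmt == "None" | homes$ac != "Yes", ]

# '&' keeps a row only if both conditions are TRUE
both <- homes[homes$bsmt == "None" & homes$ac != "Yes", ]

nrow(either)   # rows 2, 3, and 4 qualify
nrow(both)     # only row 3 qualifies
```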
Question #5: In your R Script go to a new line and add the comment “Question 5”, then on the line(s) below write code that creates a new data.frame called “election_losers” that subsets “election_data” (which you created in Question 4) to include only rows where the result was “Lost” and the year was less than or equal to 1984.
Frequency tables are a way to summarize a single categorical (factor) variable. A one-way frequency table shows the frequencies of categories in a single categorical variable, while a two-way frequency table shows the relationship between two categorical variables. Both of these summaries can be created by the “table” function:
table(my_data$style) # A one-way frequency table of 'style'
##
## 1 1/2 Story Frame 1 Story Brick 1 Story Condo 1 Story Frame
## 25 24 45 347
## 2 Story Brick 2 Story Condo 2 Story Frame Split Foyer Frame
## 10 27 184 84
## Split Level Frame
## 31
table(my_data$bedrooms, my_data$bsmt) # A two-way frequency table of 'bedrooms' and 'bsmt'
##
## 1/2 3/4 Crawl Full None
## 1 1 0 0 6 10
## 2 2 0 1 116 120
## 3 2 0 0 250 43
## 4 1 1 0 163 5
## 5 0 0 0 47 1
## 6 0 0 0 5 0
## 7 0 0 0 3 0
# Notice how 'bedrooms' is stored as a numeric variable, but it still can be used in the table function
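Because tables are objects, individual counts can be pulled out of them by category name. The sketch below uses two short made-up vectors (style and bsmt here are stand-ins, not the columns of my_data) so the counts can be verified by eye.

```r
# Five hypothetical homes
style <- c("1 Story", "2 Story", "1 Story", "1 Story", "2 Story")
bsmt  <- c("Full",    "Full",    "None",    "Full",    "None")

one_way <- table(style)         # counts for each category of 'style'
two_way <- table(style, bsmt)   # counts for each combination of 'style' and 'bsmt'

print(one_way)
print(two_way)

one_way["1 Story"]              # the count for a single category
two_way["2 Story", "None"]      # the count for a single row/column combination
```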
Tables are their own type of object, and they can be used by functions like “barplot”:
my_table <- table(my_data$bsmt) # Tables can be stored as objects
barplot(my_table) # Creates a bar plot from a table
We can construct basic visuals of numeric variables too:
hist(my_data$sale.amount) # Histograms are for numeric variables
In next week’s lab we will cover how to create a broader range of more elegant visualizations, but for now these examples provide a useful illustration of how R handles different variables.
Below are some examples showing how to calculate a few common summary statistics:
mean(my_data$sale.amount) # mean
## [1] 180098.3
sd(my_data$sale.amount) # standard deviation
## [1] 90655.31
min(my_data$sale.amount) # minimum
## [1] 38250
max(my_data$sale.amount) # maximum
## [1] 815000
quantile(my_data$sale.amount, .35) # the 35th percentile
## 35%
## 139740
Right now it’s okay if you don’t know the exact definition of some of these summaries; we’ll cover methods for summarizing data in greater detail later this week.
The summary function conveniently provides many of these statistics all at once:
summary(my_data$sale.amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 38250 130000 157900 180098 205000 815000
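For a sense of how these functions relate to one another, here is a sketch on a short made-up vector v whose summaries are easy to verify by hand. It also uses the base R function median, which does not appear in the examples above.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical values

v_mean   <- mean(v)              # the sum (40) divided by the number of values (8)
v_median <- median(v)            # the middle of the sorted values
v_q50    <- quantile(v, 0.5)     # the 50th percentile is the same thing as the median

print(c(v_mean, v_median))
print(v_q50)
summary(v)                       # min, quartiles, median, mean, and max together
```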
Question #6: In your R Script go to a new line and add the comment “Question 6”, then on the line(s) below write code that finds the range of the variable “Approval” in the data.frame “election_data” (the original one you first read in) by using either the min and max functions, or the range function. (Hint: be careful to read the statistical definition of “range” before finalizing your answer.)
To facilitate more complex tasks in R, many people have developed their own sets of functions known as packages. Before you can work with a new package for the first time, it must be installed:
install.packages("ggplot2")
After a package is installed, it still needs to be loaded using the library (or require) function before its functions can be used. You’ll need to re-load a package every time you open RStudio, but you’ll only need to install it once.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
qplot(my_data$ac) # qplot is a function in the package ggplot2
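Since a package only needs to be installed once, it can be handy to check whether it is already present before installing it again. The sketch below shows one possible approach using the base R function requireNamespace; it only prints a message and deliberately does not install anything itself.

```r
pkg <- "ggplot2"

# requireNamespace() returns TRUE if the package is installed and FALSE otherwise
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is already installed; library(", pkg, ") will load it")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") once")
}
```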
Question #7: In your R Script go to a new line and add the comment “Question 7”, then on the line(s) below write code that installs and loads the “plotly” package. Then, write a line of code that reads the “professor salaries” dataset from this link: https://remiller1450.github.io/data/Salaries.csv and stores it as an object called prof_data. Finally, add the code ggplotly(qplot(prof_data$rank, prof_data$salary)) and write a comment that briefly describes what this code generates.
At the beginning of this document you were instructed to open a file called an “R Script”. This type of file is intended to contain only executable R code and comments.
RStudio also supports many other types of files, some of which use a general authoring framework known as “Markdown”. An “R Markdown” file allows you to both write and execute R code and blend that code, its output, and text into a polished report.
To use R Markdown you might need to install and load the rmarkdown package:
install.packages("rmarkdown")
library("rmarkdown")
Once you have the package installed and loaded, you’ll be able to create RMarkdown files by selecting: File -> New File -> R Markdown. Go ahead and try this, hitting “Ok” to use the default options.
The template that opens contains a few components worth noting:
- A setup block that controls options for R code when your report is built; for now you can ignore it.
- Code blocks containing R code that can be executed by clicking on the green arrow in the upper right corner of the code box. You can still execute smaller pieces of code within these blocks by highlighting them and hitting Ctrl-Enter.
As was mentioned earlier, the main use of R Markdown is to create beautiful documents that blend R code, output, and text into a polished report. To generate this document you must compile your R Markdown file using the “Knit” button (a blue yarn ball icon) located towards the upper left part of your screen.
The information in the prior section provides a very brief introduction to R Markdown. I encourage you to go through the lessons created by R Markdown’s developers at https://rmarkdown.rstudio.com/lesson-1.html if you have the time. These lessons include numerous screenshots, videos, and more detailed explanations of exactly how R Markdown works and what it is capable of.
In this class, you will be expected to submit all future homework assignments and labs as compiled R Markdown documents. On the first few assignments I will provide you a template for doing so, but later on you’ll be expected to create these documents by yourself, so it is to your advantage to learn about R Markdown now if you have the time.
Question #8 (Optional Extra Credit) - Part A: Create a new R Markdown file and delete all of the template code that appears beneath the “r setup” code block. Change the title to “Lab #1” and the authors to the names of your group members. Next, create section labels for each of this lab’s questions using three # characters followed by “Question X” (where X is the number of the question). Then, create an R code block within each section and add the code you wrote pertaining to that question. Finally, move any textual answers that had been written in comments to the area beneath your code block (but before the next section label). Be sure to remove the # characters from these comments.
Question #8 (Optional Extra Credit) - Part B: R Markdown will use LaTeX typesetting for any text that is wrapped in $ characters. For example, $\beta$ will appear as the Greek letter \(\beta\) after you knit your document. To practice this, include $H_0: \mu = 0$ in a sentence (the sentence can be anything, but should not be contained in an R code chunk or a section label).
Please submit your responses to the questions contained in Lab #1 via Canvas. If you completed the extra credit questions, you should turn in the compiled .html report (this can be found in the location where you saved your R Markdown file). Otherwise, you should turn in your .R script.
As a reminder, everyone should turn in their own copy of the lab, but you should include the names of the other members of your group (either in a comment or as authors in R Markdown).