This lab provides a very brief introduction to RStudio, its user interface, some basic R programming, as well as a publishing extension known as RMarkdown.
In this class, RStudio may be accessed in two ways: from a local copy installed on your computer, or from the Grinnell RStudio server at https://rstudio.grinnell.edu/. A local copy of RStudio is already installed on each computer in this classroom as well as in a few other labs on campus. If you’d like to get RStudio on your personal computer, you simply need to:
1. Download and install R from http://www.r-project.org/
2. Download and install RStudio from http://www.rstudio.com/
R and RStudio are open-source software and completely free to download and use, so there is little downside to adding them to your personal computer.
When you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> RScript. This will open a new window in the top left of the RStudio user interface for you to work in. At this point you should see four panels:
The RScript is similar to a text file and stores your code while you work on it. At any time you can send some or all of the code in your RScript to the Console for the computer to execute. You can also type commands directly into the Console. The Console will echo any code you tell it to run and display any textual/numeric output that the code generates.
The Environment shows you the names of datasets, variables, and user-created functions that have been loaded into your workspace and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code. On the Grinnell RStudio Server it will also allow you to conveniently access files (such as labs or data sets) that are stored on the server.
R is an interpreted language, which allows you to have the computer execute any piece of the code in your RScript at any time.
To run a piece of code, simply highlight it and hit Ctrl-Enter. You should see the code that you ran appear in the Console, along with the response generated by the code.
At its most basic level, R can be used to perform arithmetic operations. A few examples are shown below; try typing them into your RScript and executing them on your own.
4 + 6 - (24/6)
## [1] 6
5 ^ 2 + 2 * 2
## [1] 29
Some arithmetic operations require the use of functions. In the example below, the function “exp” raises the number \(e\) to the power that is input into the function. The input is given in the parentheses. In this example, the number 2 is input into the function “exp”:
exp(2)
## [1] 7.389056
This function takes the square root:
sqrt(4)
## [1] 2
This function takes the absolute value:
abs(-1)
## [1] 1
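Functions can also be nested, and some accept more than one input. The short sketch below uses only the base R functions round and log; the particular numbers and the object names "rounded" and "log.base10" are arbitrary illustrations rather than part of the lab.

```r
# Functions can be nested: the result of exp(2) becomes the input of round()
rounded <- round(exp(2), digits = 2)
rounded

# Some functions take more than one input, separated by commas
log.base10 <- log(100, base = 10)
log.base10
```

Naming an input (such as "digits" or "base") makes it clear which value goes where.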
Often in programming languages, you can provide comments within code to explain what the code does or leave notes for yourself. In R, the character “#” is used to start a comment. Everything on the same line immediately to the right of the “#” will not be executed if submitted to the console.
# This entire line is a comment and will do nothing if run
1:6 # The command "1:6" appears before this comment
## [1] 1 2 3 4 5 6
In your RScript, comments appear in green. You also should remember that the “#” starts a comment only for a single line of your RScript.
R allows you to store things in objects, which can later be referenced or used as inputs to functions:
x <- 5 # This assigns the value '5' to the variable called 'x'
x^2 # We can now reference 'x'
## [1] 25
R stores sequences in objects called vectors:
x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)
## [1] 1 2 3
y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)
## [1] 1 2 3
z <- c("A","B","C") # Vectors can contain many types of values
print(z)
## [1] "A" "B" "C"
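Most arithmetic in R is applied to every element of a vector at once. The sketch below illustrates this with made-up values; the object names "doubled", "total", and "n" are hypothetical and chosen only for this example.

```r
x <- 1:3
y <- c(10, 20, 30)

doubled <- x * 2   # multiplies every element of 'x' by 2
total <- x + y     # adds elements in matching positions
n <- length(x)     # the number of elements in 'x'

print(doubled)
print(total)
print(n)
```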
Data is typically stored in objects called data.frames, which are comprised of several vectors of the same length:
DF <- data.frame(x = x, y = y, z = z) # Creates a data.frame object 'DF'
print(DF)
## x y z
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
Note that R is case sensitive, meaning the lower case ‘x’ is a different object than the upper case ‘X’.
Suppose we set up a vector ‘x’ and would like to extract the element in its second position and store it in a new object called ‘b’:
x <- 5:10
b <- x[2]
b
## [1] 6
The square brackets indicate we want to access a certain position (or multiple positions) within ‘x’.
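As a brief sketch of accessing multiple positions, the square brackets can be given a vector of positions instead of a single number. The object names on the left are hypothetical and used only for illustration.

```r
x <- 5:10

first.and.third <- x[c(1, 3)]   # positions 1 and 3
second.to.fourth <- x[2:4]      # positions 2 through 4
all.but.first <- x[-1]          # a negative index drops that position

print(first.and.third)
print(second.to.fourth)
print(all.but.first)
```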
For some objects, such as data.frames, multiple dimensions are needed to specify an element’s position:
DF <- data.frame(x = x, y = y, z = z)
DF[2,3] # The element in row 2, column 3
## [1] B
## Levels: A B C
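Elements of a data.frame can also be selected by column name. The sketch below builds a small data.frame with made-up values (the name "DF2" is hypothetical). The "Levels:" line in the output above appears when text columns are stored as factors, which was the default before R version 4.0.0; newer versions keep text as character unless stringsAsFactors = TRUE is supplied, as is assumed here.

```r
# Made-up data; stringsAsFactors = TRUE stores the text column 'z' as a factor
DF2 <- data.frame(x = 1:3, z = c("A", "B", "C"), stringsAsFactors = TRUE)

DF2[2, "z"]     # row 2 of the column named 'z'
DF2$z[2]        # the same element, using $ and then a position
DF2[2, ]        # the entire second row
levels(DF2$z)   # the distinct categories of the factor 'z'
```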
There are many ways to get data into R. If a data set is stored somewhere on your computer as a .csv file, you can load it using the R function read.csv:
# my.data <- read.csv("H://path_to_my_data/my_data.csv")
read.csv is also capable of reading .csv files stored on the web. All of the datasets that you’ll need for this class are stored on my personal website. The example below reads the data set IowaCityHomeSales.csv and stores it as an object called ‘my.data’:
my.data <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
We can check that the data were read in correctly:
head(my.data) # This function prints the first several rows of an object
When working with a newly loaded dataset there are a few things we might want to know about the object storing the data:
dim(my.data) # prints the dimensions of 'my.data'
## [1] 777 19
nrow(my.data) # prints the number of rows of 'my.data'
## [1] 777
ncol(my.data) # prints the number of columns of 'my.data'
## [1] 19
colnames(my.data) # prints the names of the variables (columns) of 'my.data'
## [1] "sale.amount" "sale.date" "occupancy" "style"
## [5] "built" "bedrooms" "bsmt" "ac"
## [9] "attic" "area.base" "area.add" "area.bsmt"
## [13] "area.garage1" "area.garage2" "area.living" "area.lot"
## [17] "lon" "lat" "assessed"
Most of the data we’ll work with will contain one of two types of variables: numeric variables (e.g., ‘sale.amount’) or factor variables (e.g., ‘bsmt’). If you’re ever unsure of a variable’s type you can check it using data.class:
data.class(my.data$sale.amount)
## [1] "numeric"
data.class(my.data$bsmt)
## [1] "factor"
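Whether a text column is reported as a factor can depend on your version of R: starting with R 4.0.0, read.csv keeps text columns as character unless told otherwise. The sketch below uses a tiny made-up data set (the names "csv.text", "toy", and "toy2" are hypothetical) to show two ways of obtaining a factor, assuming you want text columns treated as categories.

```r
# A tiny made-up csv, supplied as text instead of a file path
csv.text <- "price,bsmt\n172500,Full\n90000,None\n168500,Full"

# Option 1: ask read.csv to convert text columns to factors
toy <- read.csv(text = csv.text, stringsAsFactors = TRUE)
data.class(toy$price)   # a numeric column
data.class(toy$bsmt)    # a factor column

# Option 2: read text as character, then convert a column afterwards
toy2 <- read.csv(text = csv.text, stringsAsFactors = FALSE)
toy2$bsmt <- as.factor(toy2$bsmt)
data.class(toy2$bsmt)
```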
Occasionally we might need to use logical variables. In R these take on values of “TRUE” or “FALSE” and have several uses. Logical variables can be created using logical conditions:
x <- (1 > 3)
x
## [1] FALSE
data.class(x)
## [1] "logical"
Suppose we want to access a single variable in our data set. There are a few different ways we can do so:
sale.price1 <- my.data$sale.amount # The $ accesses the variable 'sale.amount' within 'my.data'
sale.price2 <- my.data[,1] # We can also use indexing to access 'sale.amount'
# Notice how we specified the second dimension of 'my.data' (its columns)
head(sale.price1)
## [1] 172500 90000 168500 205000 121000 215000
head(sale.price2)
## [1] 172500 90000 168500 205000 121000 215000
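A column can also be requested by its name inside the square brackets, which avoids having to remember its position. The sketch below uses a small made-up data.frame (the name "toy" and its values are hypothetical) to show that these approaches return the same vector.

```r
toy <- data.frame(sale.amount = c(172500, 90000, 168500),
                  bedrooms = c(3, 2, 4))

by.dollar <- toy$sale.amount            # using $
by.position <- toy[, 1]                 # using the column's position
by.name <- toy[, "sale.amount"]         # using the column's name
by.double.bracket <- toy[["sale.amount"]]

identical(by.dollar, by.name)           # checks that two objects are the same
```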
Suppose we want to access a single case (subject) in our dataset:
first.house <- my.data[1,] # This stores the entire first row
head(first.house)
## sale.amount sale.date occupancy style built bedrooms
## 1 172500 1/3/2005 116 (Zero Lot Line) 1 Story Frame 1993 3
## bsmt ac attic area.base area.add area.bsmt area.garage1 area.garage2
## 1 Full Yes None 1102 0 925 418 0
## area.living area.lot lon lat assessed
## 1 1102 5520 -91.50913 41.65116 173040
Suppose we want to know which elements of sale.amount are larger than $200,000:
my.data$sale.amount > 200000 # Logical vector for the condition "> 200000"
which(my.data$sale.amount > 200000) # Positions of elements where the condition is "TRUE"
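Because R treats TRUE as 1 and FALSE as 0, a logical vector can also be counted or averaged. The sketch below uses a short made-up vector of prices (the names "prices" and "over.200k" are hypothetical) so the results are easy to check by hand.

```r
prices <- c(150000, 250000, 90000, 300000)   # made-up sale prices

over.200k <- prices > 200000   # logical vector, one value per element
over.200k

which(over.200k)   # positions where the condition is TRUE
sum(over.200k)     # how many elements satisfy the condition
mean(over.200k)    # the proportion of elements that satisfy it
```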
Some useful logical operators include:
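As a rough illustration, the standard comparison and logical operators in base R can be tried on a single made-up value (the object "a" is hypothetical); each line returns TRUE or FALSE.

```r
a <- 5

a == 5              # equal to
a != 5              # not equal to
a > 3               # greater than (similarly <, >=, <=)
a > 3 & a < 4       # "and": TRUE only if both conditions are TRUE
a > 3 | a < 4       # "or": TRUE if at least one condition is TRUE
!(a > 3)            # "not": flips TRUE to FALSE and vice versa
a %in% c(1, 5, 9)   # is 'a' one of the listed values?
```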
Logical conditions are particularly useful for subsetting objects. Here are a few examples:
large.and.expensive <- my.data[my.data$sale.amount > 500000 & my.data$area.living > 3000,]
The example above creates a new object containing all homes that sold for more than $500,000 and have living areas over 3,000 square feet.
no.bsmt.or.no.ac <- my.data[my.data$bsmt == "None" | my.data$ac != "Yes",]
dim(no.bsmt.or.no.ac)
## [1] 239 19
The example above creates a new object containing all homes that don’t have a basement or don’t have air conditioning.
no.bsmt.and.no.ac <- my.data[my.data$bsmt == "None" & my.data$ac != "Yes",]
dim(no.bsmt.and.no.ac)
## [1] 14 19
The example above creates a new object containing homes that don’t have a basement and don’t have air conditioning. Pay careful attention to the difference between this example and the prior example.
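The difference between “or” and “and” is easier to see on a data set small enough to check by eye. The sketch below uses four made-up homes (the names "toy", "either", and "both" are hypothetical); every home kept by the “and” condition is also kept by the “or” condition, but not the other way around.

```r
# Four made-up homes
toy <- data.frame(bsmt = c("None", "Full", "None", "Full"),
                  ac   = c("Yes",  "No",   "No",   "Yes"))

# Homes with no basement OR no air conditioning (rows 1, 2, and 3)
either <- toy[toy$bsmt == "None" | toy$ac != "Yes", ]

# Homes with no basement AND no air conditioning (row 3 only)
both <- toy[toy$bsmt == "None" & toy$ac != "Yes", ]

nrow(either)
nrow(both)
```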
One-way frequency tables summarize a single categorical (factor) variable, while two-way frequency tables summarize the relationship between two categorical variables. Both of these summaries can be created by the “table” function:
table(my.data$style) # A one-way frequency table of 'style'
##
## 1 1/2 Story Frame 1 Story Brick 1 Story Condo 1 Story Frame
## 25 24 45 347
## 2 Story Brick 2 Story Condo 2 Story Frame Split Foyer Frame
## 10 27 184 84
## Split Level Frame
## 31
table(my.data$bedrooms, my.data$bsmt) # A two-way frequency table of 'bedrooms' and 'bsmt'
##
## 1/2 3/4 Crawl Full None
## 1 1 0 0 6 10
## 2 2 0 1 116 120
## 3 2 0 0 250 43
## 4 1 1 0 163 5
## 5 0 0 0 47 1
## 6 0 0 0 5 0
## 7 0 0 0 3 0
# Notice that 'bedrooms' is stored as a numeric variable,
# but it still can be used in the table function
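Counts are sometimes easier to interpret as proportions. A minimal sketch, assuming a short made-up vector of basement types (the names "bsmt", "counts", and "props" are hypothetical), uses the base R function prop.table to convert a table of counts into proportions that sum to 1.

```r
bsmt <- c("Full", "None", "Full", "Full")   # made-up categories

counts <- table(bsmt)         # a one-way frequency table
props <- prop.table(counts)   # each count divided by the total

print(counts)
print(props)
```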
Tables are their own type of object. They can be used by functions like “barplot”:
my.table <- table(my.data$bsmt) # Tables can be stored as objects
barplot(my.table) # Creates a bar plot from a table
We can construct basic visuals of numeric variables too:
hist(my.data$sale.amount) # Histograms are for numeric variables
Below are some examples showing how to calculate some common summary statistics:
mean(my.data$sale.amount) # mean
## [1] 180098.3
sd(my.data$sale.amount) # standard deviation
## [1] 90655.31
min(my.data$sale.amount) # minimum
## [1] 38250
max(my.data$sale.amount) # maximum
## [1] 815000
quantile(my.data$sale.amount, .35) # the 35th percentile
## 35%
## 139740
The summary function conveniently provides many of these statistics all at once:
summary(my.data$sale.amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 38250 130000 157900 180098 205000 815000
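A few other base R functions for summary statistics follow the same pattern. The sketch below applies them to a short made-up vector (the name "x" and its values are hypothetical) so the results can be verified by hand.

```r
x <- c(2, 4, 6, 8, 10)   # made-up values

median(x)                    # the middle value
quantile(x, c(.25, .75))     # several percentiles at once
IQR(x)                       # 75th percentile minus 25th percentile
range(x)                     # the minimum and maximum together
```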
To facilitate more complex tasks in R, people have developed their own sets of functions known as packages. If you are working on your own computer, packages will need to be installed:
install.packages("ggplot2")
Once a package is installed it still needs to be loaded in order to be used. You’ll need to load a package every time you open RStudio, but you’ll only need to install it once.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
qplot(my.data$ac) # qplot is a function in the package ggplot2
If you are working on the Grinnell RStudio Server, you will be unable to install packages; however, almost all of the packages you’ll need are already installed on the server.
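If you are unsure whether a package is already installed, you can check before trying to install it. The sketch below is one possible approach using the base R function requireNamespace; the helper name "is.installed" is hypothetical, and the install and load lines are left commented out so that nothing is downloaded unless you choose to run them.

```r
# Hypothetical helper: returns TRUE if a package is already installed
is.installed <- function(pkg){
  requireNamespace(pkg, quietly = TRUE)
}

is.installed("stats")   # 'stats' ships with R, so this should be TRUE

# Install ggplot2 only when it is missing, then load it
# if (!is.installed("ggplot2")) install.packages("ggplot2")
# library(ggplot2)
```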
At the beginning of this document you were instructed to open a file called an “RScript”. RStudio supports many other types of files, some of which are written in an authoring framework known as “RMarkdown”. RMarkdown conveniently allows you to both write and run R code and compile your work into a polished report.
To use RMarkdown you might need to install and load the package:
install.packages("rmarkdown")
library("rmarkdown")
If you have the package installed and loaded, you’ll be able to create RMarkdown files by selecting: File -> New File -> R Markdown. Go ahead and try this, hitting “Ok” to use the default options.
- ... R code when your report is built; for now you can ignore it.
- ... R code that can be executed by clicking on the green arrow in the upper right corner of the code box. You can execute smaller pieces of code within these blocks by highlighting them and hitting Ctrl-Enter.
To compile your RMarkdown file into a polished report you need to “Knit” the file. You can do this by clicking on the “Knit” button (the yarn ball icon) located towards the upper left of the screen.
The information in the prior section provides a minimally sufficient introduction to RMarkdown. I encourage you to go through the lessons created by RMarkdown’s developers at https://rmarkdown.rstudio.com/lesson-1.html if you have the time. These lessons include numerous screenshots, videos, and more detailed explanations of exactly how RMarkdown works and what it is capable of.
Directions:
Question #1
Write code that reads in the data file “CollegeData.csv”, which is available at the url: “https://remiller1450.github.io/data/CollegeData.csv”, and stores the data as an object named “Dat”.
Question #2
Write code that finds the 20th percentile of average four-year tuition cost (i.e., the 20th percentile of the variable “COSTT4_A”). Write a sentence describing (in non-statistical terms) what the 20th percentile means.
Question #3
Write code that creates a new vector named “log.sal” that is the log of the variable “AVGFACSAL” in the College Data. Then use the summary function to provide a summary of the new variable. Write a sentence stating whether “log.sal” is approximately symmetric.
Question #4
Write code that creates a two-way frequency table using the variables “REGION” and “LOCALE”. Write 1-2 sentences describing the relationship you see in the table.
Question #5
Use the qplot function in the package ggplot2 to construct a plot of the variable “REGION”.
Question #6
Write code that creates a new object named “Iowa.dat” that contains only colleges located in the state of Iowa (i.e., the variable “STABBR” is “IA”). Print the dimensions of this new object.
Question #4
See Question 4 from the section above
Question #7
In R you can define your own functions. The following code defines a function titled “position2”, which accepts an object as its input (which is defined internally as “X”) and returns the element in the second position:
position2 <- function(X){
  out <- X[2]
  return(out)
}
position2(c("Q","R","S"))
## [1] "R"
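Functions can accept more than one argument, and an argument can be given a default value. The sketch below is a hypothetical extension of “position2” (the name “positionk” and the argument “k” are made up for illustration) that returns the element in whichever position is requested.

```r
# Hypothetical extension of 'position2': 'k' chooses which position to return
positionk <- function(X, k = 2){
  out <- X[k]
  return(out)
}

positionk(c("Q","R","S"))          # 'k' is not supplied, so the default of 2 is used
positionk(c("Q","R","S"), k = 3)   # returns the element in the third position
```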
The trimmed (or truncated) mean is a statistical measure of central tendency that removes a certain percentage of the highest and lowest observations (e.g., the 10% trimmed mean uses the middle 80% of the data).
For this question, write an R function named “trimmedmean” that accepts two arguments, a data vector “X” and a percentage “p”, and returns the p% trimmed mean. Then use your function to find the 5% trimmed mean of the variable “AVGFACSAL” from the College Data described in Question #1. (Hint: Use the function “sort” on your vector and then use logical conditions to subset it before taking its mean)