R
This lab introduces R and RStudio, as well as the format of future class sessions.
Directions (read before starting)
The “Preamble” section of labs is something we’ll go through together at the start of class.
The “Lab” section is something you will work on with a partner using paired programming, a framework defined as follows:
Partners are encouraged to switch roles throughout the “Lab” section, but for the first few labs the less experienced coder should spend more time as the driver.
After you open RStudio, the first thing you’ll want to do is open a file to work in. You can do this by navigating: File -> New File -> R Script, which will open a new window in the top left of the RStudio interface for you to work in. At this point you should see four panels:
An R Script is like a text file that stores your code while you work on it. At any point you can send some or all of the code in your R Script to the Console to execute. You can also type commands directly into the Console. The Console will echo any code you run, and it will display any textual/numeric output generated by your code.
The Environment shows you the names of data sets, variables, and user-created functions that have been loaded into your work space and can be accessed by your code. The Files/Plots/Help Viewer will display graphics generated by your code and a few other useful entities (like help documentation and file trees).
Question #0: Create a blank R Script. You will use this R Script to record your answers to future questions in this document.
R is an interpreted programming language, which allows you to have the computer execute any piece of code contained in your R Script at any time without a lengthy compiling process.
To run a single piece of code, simply highlight it and either hit Ctrl-Enter or click on the “Run” button near the top right corner of your R Script. You should see an echo of the code you ran in the Console, along with any response generated by that code.
4 + 6 - (24/6)
## [1] 6
5 ^ 2 + 2 * 2
## [1] 29
The examples shown above demonstrate how R can be used as a calculator. However, most of the code we will write will rely upon functions, or pre-built units of code that translate one or more inputs into one or more outputs.
log(x = 4, base = 2)
## [1] 2
The example above demonstrates the log() function. The input named “x” is set to be 4, and the input named “base” is set to 2. The labels given to these inputs, “x” and “base”, are the function’s arguments. The function returns the output “2”, which is \(\text{log}_2(4)\). Note that log(4, 2) will also produce the output “2” as any unlabeled inputs are mapped to arguments in the order defined by the creator of the function.
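To make the argument-matching rule concrete, the short sketch below calls log() with labeled arguments in both orders, with unlabeled arguments in the default order, and with unlabeled arguments swapped. The object names (a, b, c1, d) are chosen only for this illustration. Only the last call changes the result, because the first unlabeled input is always matched to x.

```r
# Labeled arguments can be supplied in any order
a <- log(x = 4, base = 2)
b <- log(base = 2, x = 4)

# Unlabeled arguments are matched by position: x first, then base
c1 <- log(4, 2)

# Swapping unlabeled arguments asks a different question:
# this is the base-4 logarithm of 2, which equals one half
d <- log(2, 4)

print(c(a, b, c1, d))
```

Labeling arguments takes a little more typing, but it protects against this kind of silent mix-up.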
You’ll eventually end up memorizing the arguments of common R functions; however, while you’re learning I strongly encourage you to read the help documentation for any R function used in your code. You can access a function’s documentation by typing a ? in front of the function name and submitting to the console.
?log
When coding, it is good practice to include comments that describe what your code is doing. In R the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.
# This entire line is a comment and will do nothing if run
1:6 # The command "1:6" appears before this comment
## [1] 1 2 3 4 5 6
In your R Script, comments appear in green. You also should remember that the “#” starts a comment only for a single line of your R Script, so long comments requiring multiple lines should each begin with their own “#”.
The remainder of the lab is to be completed by you and your lab partner. You should work at a comfortable pace that ensures both of you thoroughly understand the lab’s contents and examples.
An important part of data science is reproducibility, or the ability for two people to independently replicate the results of a project.
To ensure reproducibility, every data analysis should begin by importing raw data into R and manipulating it using documented (commented) code. Further, the raw data should be imported using functions, such as read.csv(), instead of the point-and-click interface provided by the “Import Dataset” button (at the top of the Environment pane).
Below are two different examples:
## Loading a CSV file from a web URL (storing it as "my_data")
my_data <- read.csv("https://some_webpage/some_data.csv")
## Loading a CSV file with a local file path
my_data <- read.csv("H:/path_to_my_data/my_data.csv")
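Because the URL and drive path above are placeholders, the following sketch shows the same read.csv() pattern in a form that can run on any computer. It first writes a tiny made-up data set to a temporary file (the object names toy, csv_path, and toy_data, and all of the values, are invented for illustration) and then imports that file using code.

```r
# A tiny, made-up data set used only for this illustration
toy <- data.frame(Country = c("A", "B", "C"),
                  Happiness = c(5.5, 6.1, 4.3))

# Write it to a temporary CSV file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Import the raw data with code, just as you would for a real file
toy_data <- read.csv(csv_path)
print(toy_data)
```

Anyone who reruns this script gets the same object, which is the point of importing data with code.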
A few things to note:
Either <- or = can be used to assign something to a named object. At the top level of a script, both operators create the object in the environment where the code is run. They differ inside a function call, where = is read as supplying a named argument rather than creating an object. For the purposes of this course, we can use the two interchangeably since our code will “live” in the global environment.
File paths should separate folders using / or \\. A single \ is used by R to start an instance of a special text character. For example, \n creates a new line in a string of text.
Question #1 (Part A): Add code to your script that uses the read.csv() function to create an object named my_data that contains the “Happy Planet” data stored at: https://remiller1450.github.io/data/HappyPlanet.csv
After running your Question #1 code, an entry named “my_data” should appear in the Environment panel (top right).
You can click on the small arrow icon to reveal the data’s structure, or you can click on the object’s name to view the data in spreadsheet format.
Question #1 (Part B): Inspect the structure of my_data and view the data set in spreadsheet format. In an R comment, briefly describe how this data set is structured (i.e., what each row and column represents, what some of the columns are, etc.).
R stores data in containers called objects. Data are assigned to an object using <- or =. After assignment, data can be referenced using the object’s name. The simplest objects are scalars, or single elements:
x <- 5 # This assigns the numeric value 5 to an object called 'x'
x^2 # We can now reference 'x'
## [1] 25
R stores sequences of elements in objects called vectors:
x <- 1:3 # The sequence {1, 2, 3} is assigned to the vector called 'x'
print(x)
## [1] 1 2 3
y <- c(1,2,3) # The function 'c' concatenates arguments (separated by commas) into a vector
print(y)
## [1] 1 2 3
z <- c("A","B","C") # Vectors can contain many types of values
print(z)
## [1] "A" "B" "C"
The three most important types of vectors are:
numeric, for example x = c(1,2,3)
character, for example x = c("A","B","C")
logical, for example x = c(TRUE, FALSE, TRUE)
You should always consider a vector’s type before using it. Many functions expect specific input types and will produce an error or a warning if the wrong type is used. You can check the type of an object using the typeof() function:
chars <- c("1","2","3") # Create a character vector
typeof(chars)
## [1] "character"
nums <- c(1,2,3) # Create a numeric vector
typeof(nums)
## [1] "double"
mean(chars) # This produces a warning and returns NA; mean() only works for numeric or logical vectors
## Warning in mean.default(chars): argument is not numeric or logical: returning NA
## [1] NA
mean(nums) # This works as intended
## [1] 2
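The warning above mentions that mean() accepts numeric or logical input. As a brief illustration (the vector names here are made up for the example), the code below checks the type of a logical vector and shows that its mean is the proportion of TRUE values, since TRUE is treated as 1 and FALSE as 0.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)   # a logical vector
typeof(flags)                         # reports the storage type, "logical"

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values
mean(flags)                           # 3 of the 4 values are TRUE

# A character vector can hold digits without being numeric
chars <- c("1", "2", "3")
is.numeric(chars)                     # FALSE
is.numeric(c(1, 2, 3))                # TRUE
```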
Certain R functions are vectorized, meaning they can accept a scalar input, for example 1, and return the scalar output f(1), or they can accept a vector input, such as c(1,2,3), and return the vector c(f(1),f(2),f(3)). For example, sqrt() is vectorized:
nums <- c(1,2,3,4)
sqrt(nums)
## [1] 1.000000 1.414214 1.732051 2.000000
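Arithmetic operators behave the same way. The sketch below, which uses a made-up vector and illustrative object names, applies an operation to every element at once and confirms that the vectorized result matches applying the function to each element separately.

```r
nums <- c(1, 4, 9, 16)

doubled <- nums * 2        # each element is multiplied by 2
roots <- sqrt(nums)        # each element is square-rooted

# The vectorized result equals the element-by-element result
one_at_a_time <- c(sqrt(1), sqrt(4), sqrt(9), sqrt(16))
identical(roots, one_at_a_time)
```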
Data are usually stored in objects called data.frames, which are composed of several vectors of the same length:
DF <- data.frame(A = x, B = y, C = z) # Creates a data.frame object 'DF'
print(DF)
## A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
Functions like read.csv() will automatically store their output as a data frame:
my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
typeof(my_data)
## [1] "list"
However, notice that typeof() describes my_data as a list object. Lists are a flexible class of objects whose elements can be of any type. A data frame is a special case of a list.
Shown below is an example list containing three components, two different data frames and the character string “ABC”:
my_list <- list(my_data, DF, "ABC")
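One way to see the relationship between lists and data frames is to compare typeof(), which reports how an object is stored, with class(), which reports how R treats it. The sketch below uses a small made-up data frame (small_df and small_list are names invented for this example) so that it can run on its own.

```r
small_df <- data.frame(A = 1:3, B = c("A", "B", "C"))

typeof(small_df)          # "list": the underlying storage
class(small_df)           # "data.frame": how R treats the object
is.list(small_df)         # TRUE, a data frame is a special kind of list

# A list can mix components of different types and sizes
small_list <- list(small_df, "ABC", c(TRUE, FALSE))
length(small_list)        # the number of components in the list
```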
Question #2: Create a data frame named my_DF containing two vectors, \(J\) and \(K\), where \(J\) is created using the seq() function to be a sequence from 0 to 100 counting by 10 and \(K\) is created using the rep() function to replicate the character string “XYZ” the proper number of times. Hint: read the help documentation for each function (seq() and rep()) to determine the necessary arguments.
Suppose we have a vector “x” and would like to extract the element in its second position and assign it to a new object called “b”:
x <- 5:10
b <- x[2]
b
## [1] 6
The square brackets, [ and ], are used to access a certain position (or multiple positions) within an object. In this example we access the second position of the object “x”.
Some objects, such as data frames, have multiple dimensions, requiring indices in each dimension (separated by commas) to describe a single element. A few examples are shown below:
DF <- data.frame(x = x, y = y, z = z)
DF[2,3] # The element in row 2, column 3
## [1] "B"
DF[2,] # Everything in row 2
## x y z
## 2 6 2 B
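Indices can also be vectors, which lets you pull out several positions at once. The sketch below rebuilds the objects from this section so that it can run by itself; the names first_and_third, middle, z_col, and rows_1_2 are chosen only for illustration.

```r
x <- 5:10
y <- c(1, 2, 3)
z <- c("A", "B", "C")

first_and_third <- x[c(1, 3)]   # positions 1 and 3 of x
middle <- x[2:4]                # positions 2 through 4 of x

# x has 6 elements while y and z have 3, so y and z are repeated to fill 6 rows
DF <- data.frame(x = x, y = y, z = z)
z_col <- DF[, 3]                # everything in column 3
rows_1_2 <- DF[1:2, ]           # everything in rows 1 and 2
```

Leaving one side of the comma blank means "everything" in that dimension.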
For list objects, double square brackets, [[, are used to access positions within the list:
my_list[[2]] ## The 2nd component of the list
## A B C
## 1 1 1 A
## 2 2 2 B
## 3 3 3 C
Question #3: Use indices to print the Happiness score (column #3) of Hong Kong (row #57) in the object my_data (the Happy Planet data from Question #1). Be sure your code does not print any other information about this observation.
Suppose we want to access a single variable from a data set. There are a few different ways we can do so:
# The $ accesses the component named 'Country' within 'my_data'
countries <- my_data$Country
# Position indexing to access the variable 'Country' (since it's the first column)
countries2 <- my_data[,1]
# Use the name of the variable in place of an index position
countries3 <- my_data[,'Country']
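All three approaches return the same vector. The sketch below checks this on a small made-up data frame (toy and its values are invented for illustration and are not taken from the Happy Planet data).

```r
toy <- data.frame(Country = c("A", "B", "C"),
                  Happiness = c(5.5, 6.1, 4.3))

by_dollar   <- toy$Happiness        # access by name with $
by_position <- toy[, 2]             # access by column position
by_name     <- toy[, "Happiness"]   # access by name inside brackets

identical(by_dollar, by_position)   # TRUE
identical(by_dollar, by_name)       # TRUE
```

Accessing by name is usually the safer habit, since it keeps working if the column order changes.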
Suppose we want to access a single observation (data point) in our dataset:
Albania <- my_data[1,] # This stores the entire first row
Suppose we want a range of observations:
FirstFive <- my_data[1:5,] # This stores the first five rows
head(FirstFive)
## Country Region Happiness LifeExpectancy Footprint HLY HPI HPIRank
## 1 Albania 7 5.5 76.2 2.2 41.7 47.91 54
## 2 Algeria 3 5.6 71.7 1.7 40.1 51.23 40
## 3 Angola 4 4.3 41.7 0.9 17.8 26.78 130
## 4 Argentina 1 7.1 74.8 2.5 53.4 58.95 15
## 5 Armenia 7 5.0 71.7 1.4 36.1 48.28 48
## GDPperCapita HDI Population
## 1 5316 0.801 3.15
## 2 7062 0.733 32.85
## 3 2335 0.446 16.10
## 4 14280 0.869 38.75
## 5 4945 0.775 3.02
The head() function prints the first few rows of an object (six by default). Here are a few other functions you might use when working with a new data set:
dim(my_data) # prints the dimensions of 'my_data'
## [1] 143 11
nrow(my_data) # prints the number of rows of 'my_data'
## [1] 143
ncol(my_data) # prints the number of columns of 'my_data'
## [1] 11
colnames(my_data) # prints the names of the variables (columns) of 'my_data'
## [1] "Country" "Region" "Happiness" "LifeExpectancy"
## [5] "Footprint" "HLY" "HPI" "HPIRank"
## [9] "GDPperCapita" "HDI" "Population"
Question #4 (Part A): Write code that prints the populations of the last three observations (countries) that appear in the Happy Planet data.
Question #4 (Part B): Write code that finds the median value of the “LifeExpectancy” variable for the last 10 observations in the Happy Planet data.
Often we want to access all data that meet certain criteria. For example, we may want to analyze all countries with a life expectancy above 80. To accomplish this, we’ll need to use logical operators:
## This returns a logical vector using the condition "> 80"
my_data$LifeExpectancy > 80
A few logical operators you should know are:
Operator | Description |
---|---|
== | equal to |
!= | not equal to |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
& | and |
\| | or |
! | negation (“not”) |
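As a quick illustration of how these operators behave on vectors, the sketch below applies several of them to two short made-up vectors (life and happy are hypothetical values, not the Happy Planet data). Each comparison is carried out element by element and returns a logical vector of the same length.

```r
life  <- c(65, 82, 71, 85)      # made-up life expectancies
happy <- c(7.1, 5.5, 6.2, 7.8)  # made-up happiness scores

life > 80                 # TRUE in positions 2 and 4 only
life > 80 & happy > 6     # TRUE only where both conditions hold
life > 80 | happy > 6     # TRUE where at least one condition holds
!(life > 80)              # flips each TRUE/FALSE value
life != 82                # TRUE everywhere except position 2
```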
The which() function can be used to identify the indices of the elements within an object that contain the logical value TRUE, for example:
## This returns the positions where the condition evaluated to TRUE
which(my_data$LifeExpectancy > 80 )
This result could then be used as indices to subset my_data:
## sub-setting via indices
keep_idx <- which(my_data$LifeExpectancy > 80)
my_subset <- my_data[keep_idx, ]
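To see each intermediate step of this approach, here is a minimal sketch on a made-up data frame (toy, is_high, and toy_subset are names and values invented for the example).

```r
toy <- data.frame(Country = c("A", "B", "C", "D"),
                  LifeExpectancy = c(65, 82, 71, 85))

is_high  <- toy$LifeExpectancy > 80   # logical vector, one value per row
keep_idx <- which(is_high)            # positions of the TRUE values: 2 and 4
toy_subset <- toy[keep_idx, ]         # keep only those rows, all columns

print(toy_subset)
```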
The approach shown above is a bit cumbersome. As an alternative we can use the subset() function alongside logical expressions:
## Example #1
Ex1 <- subset(my_data, LifeExpectancy > 80)
In example #1, the data frame Ex1 will contain the subset of countries with life expectancy above 80. Notice how the subset() function knows that LifeExpectancy is a component of my_data.
## Example #2
Ex2 <- subset(my_data, LifeExpectancy <= 70 & Happiness > 6)
In example #2, the & operator is used to create a data frame, Ex2, containing all countries with a life expectancy of 70 or below and a happiness score above 6.
## Example #3
Ex3 <- subset(my_data, LifeExpectancy <= 70 | Happiness > 6)
In example #3, the | operator is used to create a data frame of all countries with a life expectancy of 70 or below or a happiness score above 6. Notice the different dimensions of Ex2 and Ex3:
dim(Ex2)
## [1] 9 11
dim(Ex3)
## [1] 118 11
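The difference in row counts follows from the logic: every row that satisfies both conditions also satisfies at least one of them, so the & subset can never be larger than the | subset. The sketch below illustrates this on a made-up data frame (toy and its values are assumptions made for the example).

```r
toy <- data.frame(Country = c("A", "B", "C", "D"),
                  LifeExpectancy = c(65, 82, 68, 85),
                  Happiness = c(7.1, 5.5, 5.0, 7.8))

both   <- subset(toy, LifeExpectancy <= 70 & Happiness > 6)  # needs both
either <- subset(toy, LifeExpectancy <= 70 | Happiness > 6)  # needs one

nrow(both)     # only country A meets both conditions
nrow(either)   # countries A, C, and D each meet at least one
```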
Question #5: Create a data frame named “Q5” that contains all countries with a population over 100 million that also have a happiness score of 6 or lower. Then, print the number of rows of this data frame.
Descriptive summaries are an essential component of any data analysis. A few functions used to calculate several basic numerical summaries are shown below:
mean(my_data$LifeExpectancy) # mean
## [1] 67.83846
sd(my_data$LifeExpectancy) # standard deviation
## [1] 11.04193
min(my_data$LifeExpectancy) # minimum
## [1] 40.5
max(my_data$LifeExpectancy ) # maximum
## [1] 82.3
quantile(my_data$LifeExpectancy, probs = .35) # the 35th percentile
## 35%
## 66.18
Each of these functions operates on a single variable. For a broader set of summary statistics, you can input an entire data frame into the summary() function:
summary(my_data)
## Country Region Happiness LifeExpectancy
## Length:143 Min. :1.000 Min. :2.400 Min. :40.50
## Class :character 1st Qu.:2.000 1st Qu.:5.000 1st Qu.:61.90
## Mode :character Median :4.000 Median :5.900 Median :71.50
## Mean :3.832 Mean :5.919 Mean :67.84
## 3rd Qu.:6.000 3rd Qu.:7.000 3rd Qu.:76.05
## Max. :7.000 Max. :8.500 Max. :82.30
##
## Footprint HLY HPI HPIRank
## Min. : 0.500 Min. :11.60 Min. :16.59 Min. : 1.0
## 1st Qu.: 1.300 1st Qu.:31.10 1st Qu.:34.47 1st Qu.: 36.5
## Median : 2.200 Median :41.80 Median :43.60 Median : 72.0
## Mean : 2.877 Mean :41.38 Mean :43.38 Mean : 72.0
## 3rd Qu.: 3.850 3rd Qu.:53.20 3rd Qu.:52.20 3rd Qu.:107.5
## Max. :10.200 Max. :66.70 Max. :76.12 Max. :143.0
##
## GDPperCapita HDI Population
## Min. : 667 Min. :0.3360 Min. : 0.290
## 1st Qu.: 2107 1st Qu.:0.5790 1st Qu.: 4.455
## Median : 6632 Median :0.7720 Median : 10.480
## Mean :11275 Mean :0.7291 Mean : 44.145
## 3rd Qu.:15711 3rd Qu.:0.8680 3rd Qu.: 31.225
## Max. :60228 Max. :0.9680 Max. :1304.500
## NA's :2 NA's :2
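The NA's entries at the bottom of the output above indicate missing values in GDPperCapita and HDI. Functions such as mean() return NA when their input contains a missing value unless they are told to drop it. A minimal sketch with a made-up vector (vals is a hypothetical name):

```r
vals <- c(10, NA, 30)      # a made-up vector with one missing value

mean(vals)                 # NA, because one element is missing
mean(vals, na.rm = TRUE)   # 20, the mean of the non-missing values
sum(is.na(vals))           # 1, the number of missing values
```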
Notice how summary() is not particularly useful for categorical variables. For these variables you should use frequency tables.
A one-way frequency table shows the frequencies of categories in a single categorical variable, while a two-way frequency table shows the relationship between two categorical variables. Both are created by the table() function:
table(my_data$Region) # A one-way frequency table of 'region'
##
## 1 2 3 4 5 6 7
## 24 24 16 33 7 12 27
table(my_data$Region, my_data$LifeExpectancy > 80) # A two-way frequency table showing the number of countries w/ LifeExpectancy > 80 by region
##
## FALSE TRUE
## 1 24 0
## 2 16 8
## 3 15 1
## 4 33 0
## 5 7 0
## 6 10 2
## 7 27 0
# Notice how the table function can use numeric, logical, and character variables
Tables are their own type of object, and they can be used as an input to functions like barplot():
my_table <- table(my_data$Region) # Tables can be stored as objects
barplot(my_table) # Creates a bar plot from a table
They can also be used as an input to the prop.table() function to find row or column proportions:
prop.table(my_table, margin = 1) # "margin = 1" gives row props, "margin = 2" gives column props
##
## 1 2 3 4 5 6 7
## 1 1 1 1 1 1 1
In the example above, the table only had a single dimension (so each row total was the same as the frequency). Shown below is a more typical example:
my_table <- table(my_data$Region, my_data$LifeExpectancy > 80)
prop.table(my_table, margin = 1)
##
## FALSE TRUE
## 1 1.0000000 0.0000000
## 2 0.6666667 0.3333333
## 3 0.9375000 0.0625000
## 4 1.0000000 0.0000000
## 5 1.0000000 0.0000000
## 6 0.8333333 0.1666667
## 7 1.0000000 0.0000000
Notice how this example used a logical condition to construct a binary variable to serve as the columns in the table.
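To compare the two margin settings side by side, the sketch below builds a small two-way table from made-up data (group, passed, and the other object names are hypothetical) and checks that row proportions sum to 1 across each row while column proportions sum to 1 down each column.

```r
# Made-up categorical data for illustration
group  <- c("a", "a", "a", "b", "b")
passed <- c(TRUE, FALSE, TRUE, TRUE, TRUE)

counts <- table(group, passed)

row_props <- prop.table(counts, margin = 1)  # proportions within each row
col_props <- prop.table(counts, margin = 2)  # proportions within each column

rowSums(row_props)   # each row of row_props sums to 1
colSums(col_props)   # each column of col_props sums to 1
```

Which margin to use depends on the question: margin = 1 answers "within each group, what share passed?", while margin = 2 answers "among those who passed, what share came from each group?".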
Question #6: Find the mean, median, and range (maximum - minimum) of the variable LifeExpectancy in the Happy Planet data. Briefly comment on whether the distribution of this variable seems to be symmetric or skewed using plain text beneath your answer’s code chunk.
The College Scorecard is a government database that records various characteristics of accredited colleges and universities within the United States. A portion of this database containing 2019-2020 data on colleges that primarily award undergraduate degrees and had at least 400 full-time students is available at the URL below:
https://remiller1450.github.io/data/Colleges2019.csv
You will use these data in Question #7 (below).
Question #7:
Read the College Scorecard data into R and store them as a data.frame object named colleges.
Create a subset of colleges that admit fewer than 25% of applicants (as measured by “Adm_Rate”). Store this subset in an object named colleges_selective.
Using colleges_selective, construct a table containing the proportion of private colleges within each region.