R
This lab introduces R and RStudio, as well as a few procedures we’ll use in future class sessions.
Directions (read before starting)
Most of our labs will begin with a short “onboarding” section that we will go through together as a class.
After onboarding, you’ll complete the “lab” section using a paired programming approach. This framework entails:
R open to test out certain pieces of code.
You should aim to switch roles every few questions, with less experienced coders spending more time in the “driver” role.
When you first open RStudio, you’ll need a place to write and store your code. The simplest solution is to create a new “R Script” using the menus:
File -> New File -> R Script
This will open a blank page in the upper-left of the RStudio interface. At this point you should see four panels:
An R Script is like a text file that stores your code while you work on it. You can execute some or all of your code by sending it to the console in the following ways:
To illustrate this, try running the following code:
log2(4)
## [1] 2
You should notice that the Console will echo any code you run, and it will display any textual/numeric output generated by your code.
However, if you ask R to run an improper command, you’ll see a red error message in the console. For example, try running the following code:
log2(x) # Produces an error since 'x' hasn't been defined
Next, the Environment (upper-right) shows information on any objects that have been loaded into your work session.
To illustrate the features of this panel, try running the following code:
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")
This command uses the function read.csv() to load data from the file “majors.csv” located at the URL provided in quotations. This data set is stored in R as an object named college_majors, which you should see in the Environment.
Clicking on the name college_majors will open a viewer page, allowing you to inspect the data as if it were a spreadsheet.
Clicking on the arrow icon next to college_majors will display a list of variable names contained in the data set.
Finally, the Files/Plots/Help Viewer (lower-right) will display graphics generated by your code, help documentation, and a file explorer tree.
To demonstrate this panel, try running the following code:
?sum
This opens the help documentation for the function/object name given after the question mark (the sum() function in this example). You should aim to become comfortable using the help documentation when working with unfamiliar functions.
At this point you should begin working independently with your assigned partner(s) using a paired programming framework. You will be responsible for submitting one copy of responses to the questions embedded in the lab. However, homework assignments will involve R and you will encounter R output on in-class exams, so everyone in the group should be comfortable with the contents of the lab.
In R, data are stored in objects. Data are assigned into an object using either <- or =. Once the data have been assigned, they can be referenced using the object’s name.
The simplest type of object is a single-element vector:
x = 5 # assigns the value '5' to an object named 'x'
We can now use the name x to reference this stored value:
x^2 # squares the value stored in 'x'
## [1] 25
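As a small sketch of the two assignment operators mentioned above, the code below stores the same value with each one and checks that the results match. The object names a and b are hypothetical and are not used elsewhere in the lab.

```r
a <- 5            # assignment using the arrow operator
b = 5             # assignment using the equals sign
identical(a, b)   # TRUE: both objects store the same value
a^2 + b           # stored values can be combined in later calculations
```

Either operator works for the assignments in this lab; the lab's own examples use =.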
More generally, we’ll encounter vectors containing more than one element:
y = 1:3 # the sequence {1, 2, 3} is assigned to the vector called 'y'
print(y)
## [1] 1 2 3
In any data storage object, values are given positions that can be used to access them later. For example, the code below accesses and then prints the second value stored in y using its index position, which is 2:
print(y[2])
## [1] 2
If you’re familiar with other programming languages, you’ll want to note that R starts indexing at 1. In other words, the first value contained in a vector is stored in position 1, the second is stored in position 2, and so on.
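The sketch below illustrates 1-based indexing on a hypothetical vector named v (not part of the lab's data). It also shows two common variations: selecting several positions at once, and using a negative index to drop a position.

```r
v = c(10, 20, 30, 40)   # hypothetical example vector
v[1]                    # first element: 10
v[c(1, 3)]              # first and third elements: 10 30
v[-1]                   # everything except the first element: 20 30 40
v[length(v)]            # last element, found using the vector's length: 40
```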
Vectors can store values of various different types, for example:
z = c("X", "Y", "Z")
Here the c() function was used to concatenate three different character strings that were separated by commas into a single vector named z.
There are 3 types of values you’ll need to be aware of in this course:
x = c(1,2,3) # numeric
x = c("A","B","C") # character
x = c(TRUE, FALSE, TRUE) # logical
The class() function provides information about an object and can also be used to check the type of a vector:
class(z)
## [1] "character"
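To make the three types concrete, the sketch below applies class() to one vector of each type. The object names num, chr, lgl, and mixed are hypothetical. The last line shows what happens when types are mixed inside c(): the values are all converted to a single common type.

```r
num = c(1, 2, 3)             # numeric values
chr = c("A", "B", "C")       # character strings
lgl = c(TRUE, FALSE, TRUE)   # logical values

class(num)    # "numeric"
class(chr)    # "character"
class(lgl)    # "logical"

mixed = c(1, "A", TRUE)      # mixing types forces everything to character
class(mixed)  # "character"
```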
Being familiar with different data types is important because many functions will expect their inputs to be a certain type. For example, the mean() function will issue a warning and return NA when its input isn’t a numeric or logical type:
mean(z) # Recall that z is character, so it has no average value
## Warning in mean.default(z): argument is not numeric or logical: returning NA
## [1] NA
Sometimes we can successfully coerce an object of one type into a different type so that it can be analyzed properly. Below are a few examples of coercion:
x = c("10", "20") # Example vector
class(x) # Notice that it's a character vector, so mean() won't work
## [1] "character"
new_x = as.numeric(x) # Coerce x to be numeric using as.numeric(), then store as new_x
mean(new_x) # This now works!
## [1] 15
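Coercion does not always succeed for every element. The sketch below, using hypothetical values, shows that as.numeric() turns text it cannot interpret into NA (normally with a warning, which is silenced here with suppressWarnings()), along with two other coercions that do succeed.

```r
x = c("10", "twenty", "30")               # "twenty" cannot be read as a number
new_x = suppressWarnings(as.numeric(x))   # 10 NA 30
is.na(new_x)                              # FALSE TRUE FALSE
mean(new_x, na.rm = TRUE)                 # 20, after ignoring the NA

as.character(c(1, 2))                     # "1" "2"
as.numeric(c(TRUE, FALSE))                # 1 0
```

Checking for NA values after a coercion is a simple way to catch entries that were not stored in the expected format.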
Question #2: The as.character() function will coerce a numeric or logical type into a character type. For example, 1 will be coerced into "1". Using an R comment, briefly describe a situation where it might be useful to coerce a numeric vector of integers into a character vector. Hint: Think about categorical data and how someone might choose to represent this data in a CSV or Excel file.
We will primarily work with data stored in objects known as “data frames”, which are collections of several named vectors of the same length (i.e., the same number of elements).
We can assemble a data frame ourselves using data.frame():
x = c(1,5,9) # First vector
y = c("X", "Y", "Z") # Second vector
df = data.frame(var_name1 = x, var_name2 = y) # Put together into a data frame, notice the naming format
print(df)
##   var_name1 var_name2
## 1         1         X
## 2         5         Y
## 3         9         Z
The named vectors contained within a data frame, in this case var_name1 and var_name2, can be accessed individually using the $ operator as demonstrated below:
median(df$var_name1) # Finding the median of "var_name1" from "df"
## [1] 5
Notice how the name of the data frame is given before the $ and the name of the vector within the data frame is then given after the $. Because object names are case sensitive, you must get both names exactly correct: a misspelled data frame name produces an error letting you know that the object you’re trying to access doesn’t exist, while a misspelled vector name after the $ typically returns NULL instead of the values you wanted.
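The sketch below rebuilds the small data frame from above and illustrates case sensitivity when using $. The misspelled name Var_Name1 is a deliberate, hypothetical mistake; names() is a base R function that lists the correct spellings.

```r
df = data.frame(var_name1 = c(1, 5, 9), var_name2 = c("X", "Y", "Z"))

names(df)              # lists the vector names: "var_name1" "var_name2"
df$var_name1           # correct spelling returns the stored values: 1 5 9
df$Var_Name1           # wrong capitalization returns NULL
is.null(df$Var_Name1)  # TRUE
```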
Data frames are two-dimensional objects whose values are indexed via rows and columns. So, if we wanted to access the value in position 2,2 (second row, second column) we could use the following code:
df[2,2] # Print element in position 2,2
## [1] "Y"
We can access an entire row of the data by leaving an empty space in the second dimension’s index position:
df[2,] # Print 2nd row
##   var_name1 var_name2
## 2         5         Y
Similarly, we could access an entire column by leaving the first dimension’s index position blank:
df[,2] # Print 2nd column
## [1] "X" "Y" "Z"
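Row and column positions can also be combined with column names, which is often easier to read than numeric positions alone. The following is a minimal sketch that rebuilds the same small data frame so that it runs on its own.

```r
df = data.frame(var_name1 = c(1, 5, 9), var_name2 = c("X", "Y", "Z"))

df[2, "var_name2"]   # second row of the column named var_name2
df[, "var_name1"]    # the entire var_name1 column: 1 5 9
df[c(1, 3), ]        # the first and third rows, with all columns
```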
Finally, you should note that the read.csv() function will automatically store its output in a data frame object. In the example below we can see that the class() function tells us that college_majors is indeed a data frame object.
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv") # Load college majors data using read.csv()
class(college_majors) # Check that it's a data frame
## [1] "data.frame"
Most of the data that we’ll work with this semester will be loaded into R from the web using read.csv().
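If you'd like to experiment with read.csv() without an internet connection, one option is its text argument, which parses a character string as though it were the contents of a file. The sketch below uses made-up values and hypothetical column names; they are not taken from the majors data.

```r
# Hypothetical CSV content written inline (the first line holds the column names)
csv_text = "major,median_salary
Art,40000
Biology,45000
Statistics,60000"

toy_majors = read.csv(text = csv_text)  # parse the string as if it were a file
class(toy_majors)                       # "data.frame"
toy_majors$median_salary                # numeric column: 40000 45000 60000
```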
Question #3: Make sure you’ve run the code provided above, then use the Environment panel to determine the name of the vector inside the data frame college_majors that stores information on the median salary of workers with a bachelor’s degree in the given major (recalling that each row in the data frame denotes a different major). Next, use the max() function to find the maximum value of this variable.
Question #4: Use the which.min() function to determine the index position of the major whose workforce contains the lowest percentage of bachelor’s degrees. Store this value in an object named mb, and use the object mb, which should store an integer, to access and print the entire row of data corresponding to this major.
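For reference, the sketch below shows how which.min() behaves on a plain vector of made-up values (the name scores is hypothetical). It returns the position of the smallest value, not the value itself.

```r
scores = c(72, 58, 91, 64)     # hypothetical values
lowest = which.min(scores)     # position of the smallest value: 2
scores[lowest]                 # the smallest value itself: 58
which.max(scores)              # position of the largest value: 3
```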
Now that we know a little bit about data structures in R, we’re ready to see a few functions that will help us learn some basic things about our data:
## Find the number of rows (observations)
nrow(college_majors)
## Find the number of columns (variables)
ncol(college_majors)
## Print the "structure" of the data, which includes variable names and their types
str(college_majors)
## Print the "head" or the first few rows of the data
head(college_majors)
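The same functions can be tried on a small data frame whose contents are known in advance, which makes the output easier to check. The sketch below uses a hypothetical data frame named toy; dim() and names() are additional base R functions that report both dimensions at once and the variable names.

```r
# Hypothetical data frame with 10 rows and 2 columns
toy = data.frame(id = 1:10, group = rep(c("a", "b"), times = 5))

nrow(toy)          # 10 rows (observations)
ncol(toy)          # 2 columns (variables)
dim(toy)           # both at once: 10 2
names(toy)         # variable names: "id" "group"
head(toy, n = 3)   # only the first 3 rows
```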
We’ll continue to learn some additional functions as necessary, but for now you should focus your attention on understanding: how to read data from the web using read.csv(), how different data structures store values in R, and how to access information on the cases/observations, variables, and values of a data frame object using R.
Question #5: The “Happy Planet” data set is available at the URL https://remiller1450.github.io/data/HappyPlanet.csv. It was assembled by The Happy Planet Index using data from a global survey that asked respondents from various countries a set of questions about how they felt their lives were going. These responses were combined with other data sources to create an index measuring the health and well-being of the inhabitants of various nations around the world.
Read this data set into R as a data frame object named “happy_planet”.
Use an R comment to write a 1-sentence description of what constitutes an observation for this data set.
Use an R comment to write the names of any variables that were read into R as character type.
Use an R comment to indicate whether there are any variables in this data set that should be coerced from their current type to another type before being used. Briefly explain why you feel this way. As an example you might say that variable_x was read as a character but it’s actually numeric, so we should use as.numeric() before using it.
Comments
Computer code never lives in isolation. You should always prepare your code to be read and reused by your future self and others. Annotating your code with comments, or text that appears alongside the code but is not executed by the console, is an important aspect of coding.
In R, the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.
Question #1: Open a new R Script, on the first line create a comment with the text “Question 1 - your names”, and save the script in a location that you’ll later be able to find under the name “Lab1_your_names.R” (filling in your group’s names). Indicate your responses to all future lab questions using an appropriate comment.
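As a brief illustration of how comments behave, the sketch below shows a full-line comment, an end-of-line comment, and a line of code that has been "commented out" so it never runs. The object name total is hypothetical.

```r
# A full-line comment: nothing on this line is executed
total = 2 + 3    # an end-of-line comment; only the code to its left runs
# total = 100    # this assignment is commented out, so 'total' is unchanged
print(total)     # prints 5
```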