Lab #1 - Introduction to R

This lab introduces R and RStudio as well as a few procedures we’ll use in future class sessions.

Directions (read before starting)

Please work together with your assigned partner(s). Make sure you all fully understand each topic or example before moving on.
Record your answers to lab questions separately from the lab’s examples. You and your partner(s) should only turn in one set of responses to the lab’s questions, nothing more and nothing less.
Ask for help, clarification, or even just a check-in if anything seems unclear.


Onboarding

Most of our labs will begin with a short “onboarding” section that we will go through together as a class.

After onboarding, you’ll complete the “lab” section using a paired programming approach. This framework entails:

One person acting as the driver, or the person who is in charge of physically operating the computer and writing the code/responses that will be turned in.
The other partner(s) acting as navigators. Their primary role is to review the actions of the driver, providing guidance and oversight. They might also have their own sessions of R open to test out certain pieces of code.

You should aim to switch roles every few questions, with less experienced coders spending more time in the “driver” role.


The Layout of RStudio

When you first open RStudio you’ll need a place to write and store your code. The simplest solution is to create a new “R Script” using the menus:

File -> New File -> R Script

This will open a blank page in the upper-left of the RStudio interface. At this point you should see four panels:

Your R Script (upper-left)
The Console (lower-left)
Your Environment (upper-right)
The Files/Plots/Help viewer (lower-right)

An R Script is like a text file that stores your code while you work on it. You can execute some or all of your code by sending it to the console in the following ways:

Click the “Run” arrow in the upper-right of the R Script menu
Press “Ctrl-Enter” (PC) or “Cmd-Enter” (Mac)

To illustrate this, try running the following code:

log2(4)

## [1] 2

You should notice that the Console will echo any code you run, and it will display any textual/numeric output generated by your code.

However, if you ask R to run an improper command you’ll see a red error message in the console. For example, try running the following code:

log2(x)  # Produces an error since 'x' hasn't been defined

Next, the Environment (upper-right) shows information on any objects that have been loaded into your work session.

To illustrate the features of this panel, try running the following code:

college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")

This command uses the function read.csv() to load data from the file “majors.csv” located at the URL provided in quotation marks. This data set is stored in R as an object named college_majors, which you should see in the Environment.

Clicking on the object name college_majors will open a viewer page, allowing you to inspect the data as if it were a spreadsheet.
Clicking on the blue arrow next to college_majors will display a list of variable names contained in the data set.

Finally, the Files/Plots/Help Viewer (lower-right) will display graphics generated by your code, help documentation, and a file explorer tree.

To demonstrate this panel, try running the following code:

?sum

This opens the help documentation for the function/object name given after the question mark (the sum() function in this example). You should aim to become comfortable using the help documentation when working with unfamiliar functions.
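A help page also lists a function’s optional arguments. As a small sketch (the vector vals is a made-up example, not part of the lab’s data), the documentation for sum() describes an argument named na.rm that controls whether missing values are dropped before adding:

```r
# 'vals' is a hypothetical vector used only for illustration
vals <- c(1, NA, 3)

args(sum)  # shows the arguments of sum(), including na.rm

total_with_na <- sum(vals)                # missing value is kept, so the result is NA
total_no_na   <- sum(vals, na.rm = TRUE)  # missing value is dropped before adding

print(total_with_na)
print(total_no_na)
```

Running help("sum") in the console opens the same page as ?sum.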


Lab

At this point you should begin working independently with your assigned partner(s) using a paired programming framework. You will be responsible for submitting one copy of responses to the questions embedded in the lab. However, homework assignments will involve R and you will encounter R output on in-class exams, so everyone in the group should be comfortable with the contents of the lab.

Comments

Computer code never lives in isolation. You should always prepare your code to be read and reused by your future self and others. Annotating your code with comments, or text that appears alongside the code but is not executed by the console, is an important aspect of coding.

In R, the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.

# This entire line is a comment and will do nothing if run

1:6 # The command "1:6" appears before this comment

## [1] 1 2 3 4 5 6

Question #1: Open a new R Script, on the first line create a comment with the text “Question 1 - your names”, and save the script in a location that you’ll later be able to find under the name “Lab1_your_names.R” (filling in your group’s names). Indicate your responses to all future lab questions using an appropriate comment.


Vectors

In R, data are stored in objects. Data are assigned to an object using either <- or =. Once the data have been assigned, they can be referenced using the object’s name.

The simplest type of object is a single-element vector:

x = 5  # assigns the value '5' to an object named 'x'

We can now use the name x to reference this stored value:

x^2  # squares the value stored in 'x'

## [1] 25
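To see that the two assignment operators behave the same way for simple assignments like these, here is a short sketch. The object names a and b are made up for this illustration:

```r
a <- 5   # assignment using the arrow operator
b = 5    # assignment using the equals sign

same_value <- identical(a, b)  # checks whether the two objects hold the same value
print(same_value)

print(a + b)  # both objects can be referenced by name
```

Either operator is fine in this course; what matters most is using one consistently so your code is easy to read.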

More generally, we’ll encounter vectors containing more than one element:

y = 1:3 # the sequence {1, 2, 3} is assigned to the vector called 'y'
print(y)

## [1] 1 2 3

In any data storage object, values are given positions that can be used to access them later. For example, the code below accesses then prints the second value stored in y using its index position, which is 2:

print(y[2])

## [1] 2

If you’re familiar with other programming languages, you’ll want to note that R starts indexing at 1. In other words, the first value contained in a vector is stored in position 1, the second is stored in position 2, and so on.
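Index positions can do more than pick out a single value. The sketch below uses a made-up vector v (not part of the lab’s data) to show a few common patterns:

```r
v <- c(10, 20, 30, 40)   # hypothetical example vector

first_value   <- v[1]          # the first element (R starts counting at 1)
last_value    <- v[length(v)]  # length(v) gives the position of the last element
some_values   <- v[c(1, 3)]    # a vector of positions selects several elements
without_first <- v[-1]         # a negative position drops that element

print(first_value)
print(last_value)
print(some_values)
print(without_first)
```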


Data Types

Vectors can store values of various different types, for example:

z = c("X", "Y", "Z")

Here the c() function was used to concatenate three different character strings that were separated by commas into a single vector named z.

There are 3 types of values you’ll need to be aware of in this course:

numeric - for example: x = c(1,2,3)
character - for example: x = c("A","B","C")
logical - for example: x = c(TRUE, FALSE, TRUE)

The class() function provides information about an object and can also be used to check the type of a vector:

class(z)

## [1] "character"
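As a quick check of the three types listed above, the sketch below applies class() to one vector of each type. It also shows what happens when types are mixed inside c(): a vector can only hold one type, so R converts every value to character. All object names here are made up for illustration:

```r
num_vec <- c(1, 2, 3)
chr_vec <- c("A", "B", "C")
lgl_vec <- c(TRUE, FALSE, TRUE)

print(class(num_vec))   # "numeric"
print(class(chr_vec))   # "character"
print(class(lgl_vec))   # "logical"

# Mixing types: every value is converted to a character string
mixed <- c(1, "A", TRUE)
print(class(mixed))     # "character"
print(mixed)
```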

Being familiar with different data types is important because many functions expect their inputs to be a certain type. For example, the mean() function returns NA along with a warning when its input isn’t a numeric or logical type:

mean(z)  # Recall that z is character, so it has no average value

## Warning in mean.default(z): argument is not numeric or logical: returning NA

## [1] NA
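One way to avoid a surprise like this is to check the type before computing. The sketch below is only one possible approach; it uses is.numeric() and stores the result in a made-up object named z_mean:

```r
z <- c("X", "Y", "Z")

if (is.numeric(z)) {
  z_mean <- mean(z)   # only runs when z holds numbers
} else {
  z_mean <- NA        # no meaningful average exists
  message("z is of type '", class(z), "', so no mean was computed")
}

print(z_mean)
```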

Sometimes we can successfully coerce an object of one type into a different type so that it can be analyzed properly. Below is an example of coercion:

x = c("10", "20")  # Example vector
class(x)           # Notice that it's a character vector, so mean() won't work

## [1] "character"

new_x = as.numeric(x)  # Coerce x to be numeric using as.numeric(), then store as new_x
mean(new_x)            # This now works!

## [1] 15
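Coercion does not always succeed for every value. In the sketch below (using a made-up vector named mixed_text), the string "twenty" cannot be interpreted as a number, so as.numeric() replaces it with NA and issues a warning:

```r
mixed_text <- c("10", "twenty", "30")   # hypothetical example vector

converted <- as.numeric(mixed_text)     # warns that NAs were introduced by coercion
print(converted)

# na.rm = TRUE tells mean() to ignore the missing value
avg <- mean(converted, na.rm = TRUE)
print(avg)

# Logical values coerce to numbers: TRUE becomes 1 and FALSE becomes 0
print(as.numeric(c(TRUE, FALSE)))
```

Checking a coerced vector for unexpected NA values is a good habit before using it in an analysis.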

Question #2: The as.character() function will coerce a numeric or logical type into a character type. For example, 1 will be coerced into "1". Using an R comment, briefly describe a situation where it might be useful to coerce a numeric vector of integers into a character vector. Hint: Think about categorical data and how someone might choose to represent this data in a CSV or Excel file.


Data Frames

We will primarily work with data stored in objects known as “data frames”, which are collections of several named vectors of the same length (i.e., the same number of elements).

We can assemble a data frame ourselves using data.frame():

x = c(1,5,9)                # First vector
y = c("X", "Y", "Z")        # Second vector

df = data.frame(var_name1 = x, var_name2 = y)  # Put together into a data frame, notice the naming format
print(df)

##   var_name1 var_name2
## 1         1         X
## 2         5         Y
## 3         9         Z

The named vectors contained within a data frame, in this case var_name1 and var_name2, can be accessed individually using the $ operator as demonstrated below:

median(df$var_name1)  # Finding the median of "var_name1" from "df"

## [1] 5

Notice how the name of the data frame is given before the $ and the name of the vector within the data frame is given after the $. Because object names are case sensitive, you must get both names exactly correct. A misspelled data frame name produces an error letting you know that the object doesn’t exist, while a misspelled vector name typically returns NULL instead of the values you wanted.
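If you aren’t sure of the exact spelling of a name, names() lists every vector in a data frame. The sketch below rebuilds the same small data frame and also shows the [[ ]] operator, which is an alternative to $. The misspelled name Var_Name1 is deliberate:

```r
df <- data.frame(var_name1 = c(1, 5, 9), var_name2 = c("X", "Y", "Z"))

print(names(df))           # lists the names stored in df

via_dollar   <- df$var_name1       # access with $
via_brackets <- df[["var_name1"]]  # same vector, accessed by name in quotes
print(identical(via_dollar, via_brackets))

# A name with the wrong capitalization does not match anything,
# so $ returns NULL in this case
print(is.null(df$Var_Name1))
```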

Data frames are two-dimensional objects whose values are indexed via rows and columns. So, if we wanted to access the value in position 2,2 (second row, second column) we could use the following code:

df[2,2]  # Print element in position 2,2

## [1] "Y"

We can access an entire row of the data by leaving an empty space in the second dimension’s index position:

df[2,]  # Print 2nd row

##   var_name1 var_name2
## 2         5         Y

Similarly, we could access an entire column by leaving the first dimension’s index position blank:

df[,2]  # Print 2nd column

## [1] "X" "Y" "Z"
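Row and column positions can also be given as names, as several positions at once, or as a logical condition. The sketch below rebuilds the same small data frame to show each pattern; the object name big_rows is made up for this illustration:

```r
df <- data.frame(var_name1 = c(1, 5, 9), var_name2 = c("X", "Y", "Z"))

second_col <- df[, "var_name2"]   # a column can be selected by its name
print(second_col)

two_rows <- df[c(1, 3), ]         # a vector of positions selects several rows
print(two_rows)

# A logical condition keeps only the rows where the condition is TRUE
big_rows <- df[df$var_name1 > 2, ]
print(big_rows)
```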

Finally, you should note that the read.csv() function will automatically store its output in a data frame object. In the example below we can see that the class() function tells us that college_majors is indeed a data frame object.

college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")  # Load college majors data using read.csv()
class(college_majors)    # Check that it's a data frame

## [1] "data.frame"

Most of the data that we’ll work with this semester will be loaded into R from the web using read.csv().
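To see what read.csv() does without needing an internet connection, the sketch below reads a few lines of made-up CSV text using the function’s text argument. The names csv_text, toy_data, name, and height_cm are all invented for this illustration:

```r
# A tiny CSV "file" written as text: the first line holds the variable names
csv_text <- "name,height_cm
Ana,160
Ben,175
Cai,168"

toy_data <- read.csv(text = csv_text)   # 'text' supplies the CSV content directly

print(class(toy_data))                  # the result is a data frame
print(nrow(toy_data))                   # one row per line after the header
print(is.numeric(toy_data$height_cm))   # numbers are read in as a numeric type
```

Reading from a URL or a file path works the same way, except the location is given in place of the text argument.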

Question #3: Make sure you’ve run the code provided above, then use the environment panel to determine the name of the vector inside the data frame college_majors that stores information on the median salary of workers with a bachelor’s degree in the given major (recalling that each row in the data frame denotes a different major). Next, use the max() function to find the maximum value of this variable.

Question #4: Use the which.min() function to determine the index position of the major whose workforce contains the lowest percentage of bachelor’s degrees. Store this value in an object named mb, and use the object mb, which should store an integer, to access and print the entire row of data corresponding to this major.


Data basics

Now that we know a little bit about data structures in R, we’re ready to see a few functions that will help us learn some basic things about our data:

# Find the number of rows (observations)
nrow(college_majors)

# Find the number of columns (variables)
ncol(college_majors)

# Print the "structure" of the data, which includes variable names and their types
str(college_majors)

# Print the "head" or the first few rows of the data
head(college_majors)
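Two related functions can be handy alongside these. The sketch below uses a small made-up data frame named toy (not the lab’s data): dim() reports the rows and columns together, and sapply() combined with class() lists the type of each variable:

```r
# 'toy' is a hypothetical data frame used only for illustration
toy <- data.frame(id    = 1:4,
                  group = c("a", "b", "a", "b"),
                  score = c(2.5, 3.0, 4.5, 1.0))

toy_dim <- dim(toy)              # number of rows, then number of columns
print(toy_dim)

toy_types <- sapply(toy, class)  # applies class() to each variable in turn
print(toy_types)
```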

We’ll continue to learn some additional functions as necessary, but for now you should focus your attention on understanding: how to read data from the web using read.csv(), how different data structures store values in R, and how to access information on the cases/observations, variables, and values of a data frame object using R.

Question #5: The URL https://remiller1450.github.io/data/HappyPlanet.csv contains the “Happy Planet” data set, which was assembled by The Happy Planet Index using data from a global survey that asked respondents from various countries a set of questions about how they felt their lives were going. These responses were combined with other data sources to create an index measuring the health and well-being of the inhabitants of various nations around the world.

Part A: Write code that loads the “Happy Planet” data set into R as a data frame object named “happy_planet”.
Part B: Find the number of observations in the “Happy Planet” data set, then use an R comment to write a 1-sentence description of what constitutes an observation for this data set.
Part C: Find the number of variables in the “Happy Planet” data and use an R comment to write the names of any variables that were read into R as character type.
Part D: Use an R comment to indicate whether there are any variables in this data set that should be coerced from their current type to another type before being used. Briefly explain why you feel this way. As an example you might say that variable_x was read as a character but it’s actually numeric, so we should use as.numeric() before using it.