Lab #1 - Introduction to R

This lab introduces R and RStudio as well as a few procedures we’ll use in future class sessions.

Directions (read before starting)

Please work together with your assigned partner(s). Make sure you all fully understand each topic or example before moving on.
Record your answers to lab questions separately from the lab’s examples. Everyone should have their own copy of the group’s responses, and each individual should turn in this copy, even if it’s identical to that of their lab partners.
Ask for help, clarification, or even just a check-in if anything seems unclear.


Onboarding

Most labs will begin with a short “onboarding” section that we’ll cover as a class. After that, you’re expected to work through the lab with your group following the “paired programming” paradigm. This framework entails:

One person acting as a driver, or the person in charge of physically typing out the ideas generated by the group and testing them out in R.
The other partner(s) acting as navigators, or collaborators who review the actions of the driver while providing ideas, oversight, and guidance. As a navigator you may still have your own instance of R open to try out ideas and record your group’s final responses.

You should aim to switch roles on a regular basis, with less experienced coders spending more time in the “driver” role.


The Layout of RStudio

When you first open RStudio you’ll need a place to write and store your code. The simplest solution is to create a new “R Script” via:

File -> New File -> R Script

This opens a blank page in the upper-left of the RStudio interface. At this point you should see four panels:

Your Script (upper-left)
The Console (lower-left)
Your Environment (upper-right)
The Files/Plots/Help viewer (lower-right)

An R Script is like a text file that stores your code while you work on it. You can execute some or all of your code by sending it to the console in the following ways:

Click the “Run” arrow in the upper-right of the R Script menu
Press “Ctrl-Enter” (PC) or “Cmd-Enter” (Mac)

If you highlight a segment of code before running, only the highlighted code will be sent to the console for execution. Try this out by running the following code:

log2(4)

## [1] 2

You should notice that the Console echoes any code you run, and beneath the echo it prints any textual/numeric output that the code produces.

When code you submit to the console cannot be executed due to an error, you will receive a red-colored message describing the problem. For example, try running the following code:

log2(x)  # Produces an error since 'x' hasn't been defined

Next, the Environment (upper-right) shows information on any objects that have been loaded into your work session.

To illustrate the features of this panel, try running the following code:

college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")

This uses the function read.csv() to load data from the file “majors.csv” that is housed at the URL provided in quotations. This data set is stored in R as an object named college_majors, which you should see in the Environment.

Clicking on the object name college_majors will open a viewer page, allowing you to inspect the data as if it were a spreadsheet.
Clicking on the blue arrow next to college_majors will display a list of variable names contained in the data set.
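
The information shown in the Environment panel can also be retrieved with code. The sketch below is only an illustration: it uses a small made-up data frame named toy_data (not part of the lab’s data) so that it runs without an internet connection, and the functions names(), head(), and str() are standard R functions that this lab does not otherwise cover.

```r
# Hypothetical stand-in for a data set loaded with read.csv()
toy_data = data.frame(Major = c("Biology", "History", "Physics"),
                      Students = c(120, 85, 40))

var_names = names(toy_data)      # variable (column) names, similar to clicking the blue arrow
first_rows = head(toy_data, 2)   # the first 2 rows, similar to a peek at the viewer page
print(var_names)
print(first_rows)
str(toy_data)                    # compact summary: number of cases, variable names, and types
```

Any of these could be run on college_majors in place of toy_data.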

Finally, the Files/Plots/Help Viewer (lower-right) will display graphics generated by your code, help documentation, and a file explorer tree.

To demonstrate this panel, try running the following code:

?sum

This opens the help documentation for the function/object name given after the question mark (the sum() function in this example). If you encounter an error when working with a function, you should try reading its help documentation.


Lab

At this point you should begin working independently with your assigned partner(s) using a paired programming framework. Remember that you should read the lab’s content, not just the questions, and you should all agree with an answer before moving on. Despite being graded, labs aren’t intended to be formal assessments, and you can ask questions at any time.

Comments

Computer code never lives in isolation. You should always prepare your code to be read and reused by your future self and others. Annotating your code with comments, or text that appears alongside the code but is not executed by the console, is an important aspect of coding.

In R, the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.

# This entire line is a comment and will do nothing if run

1:6 # The command "1:6" appears before this comment

## [1] 1 2 3 4 5 6

Question #1:

Part A: Open a new R Script and add a line with a comment containing the text: “Question 1”. Save your script in a location on your PC that you can find later using the name: “Lab1_your_names.R” (filling in your group’s names). Indicate your responses to all future lab questions using an appropriate comment.
Part B: Beneath the comment you created in Part A add the code: happy_planet_data = read.csv("https://remiller1450.github.io/data/happy_planet_2025.csv")
Part C: Run the code you added in Part B, then use the environment panel to determine how many cases and variables are contained in this data set. You may consider every column to be a variable despite the fact that some are simply case identifiers. Record your answer using an R comment.


Vectors

In R, objects are nameable structures used to store data. Data are assigned into an object using either <- or =. After assignment, an object’s name is used to reference the data it holds.

The simplest type of data storage object is a single-element vector, which is sometimes called a “scalar”:

x = 5  # assigns the value '5' to an object named 'x'

We can now use the name x to reference this stored value:

x^2  # squares the value stored in 'x'

## [1] 25
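
As a quick check that the two assignment operators mentioned earlier behave the same way for this kind of simple assignment, here is a minimal sketch. The object names a, b, and same are chosen only for illustration.

```r
a <- 5   # assignment using the arrow operator
b = 5    # assignment using the equals sign

same = identical(a, b)  # TRUE when both objects hold exactly the same value
print(same)
```

Either operator is fine for the examples in this lab; what matters is using one consistently so your code is easy to read.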

More generally, we’ll encounter vectors containing multiple elements:

y = 1:3 # the sequence {1, 2, 3} is assigned to the vector called 'y'
print(y)

## [1] 1 2 3

This motivates us to understand indices, or the positions where specific pieces of data, known as elements or atomic units, are located within an object. In R, vectors have a single index that begins counting at 1 (many other programming languages begin counting at 0).

The example below uses square brackets ([ and ]) to access the second element of y, which is the number 2. This element is then stored as a new object named y2, which is printed using the print() function:

y2 = y[2] # access the element in position 2 and store as y2
print(y2) # print to confirm we extracted "2"

## [1] 2
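
Square brackets can also take more than one position at a time. The sketch below uses a made-up vector v (an assumption for illustration, not part of the lab’s data) to show a few common patterns:

```r
v = c(10, 20, 30, 40)        # hypothetical example vector

first_and_third = v[c(1, 3)] # elements in positions 1 and 3
last_three = v[2:4]          # elements in positions 2 through 4
all_but_first = v[-1]        # a negative index drops that position

print(first_and_third)
print(last_three)
print(all_but_first)
```

Note that the positions themselves are supplied as a vector, which is why c() or a sequence like 2:4 appears inside the brackets.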

Data Frames

The most common data structure in R is the data frame, which is a collection of one or more named vectors that each contain the same number of elements. While we’ll primarily work with data frames created by the read.csv() function, there are some valuable insights in creating one ourselves.

The example below takes two vectors, x1 and x2, and assembles them into a data frame object named df. You should notice how x1 and x2 are given the names “ID” and “value” when the data frame is created.

x1 = c(209, 230, 310)  # First vector
x2 = c(5, 6, 8)        # Second vector

df = data.frame(ID = x1, value = x2)  # Put together into a data frame, giving each a name
print(df)

##    ID value
## 1 209     5
## 2 230     6
## 3 310     8

A useful property of data frames is that the individual vectors they contain can be accessed using the $ operator. The code below demonstrates this by accessing the vector named “value” in the data frame df, then using the median() function to find its median (the midpoint of the values when they’re put in ascending order):

median(df$value)

## [1] 6

We can also use indices to access data stored in df, but we’ll need to recognize that data frames use two dimensions (rows and columns) to organize their elements. Consider the following examples:

## Access a single element
df[2,1]

## [1] 230

## Access an entire column
df[ ,1]

## [1] 209 230 310

## Access an entire row
df[2, ]

##    ID value
## 2 230     6
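
Columns can be referenced by name as well as by position, which is often easier to read. The sketch below rebuilds the same df and reaches the same element three ways. The object names by_position, by_name, and by_dollar are chosen only for illustration.

```r
df = data.frame(ID = c(209, 230, 310), value = c(5, 6, 8))  # same data frame as earlier

by_position = df[2, 2]        # row 2, column 2
by_name     = df[2, "value"]  # row 2, column named "value"
by_dollar   = df$value[2]     # second element of the vector df$value

print(c(by_position, by_name, by_dollar))  # all three are the same element
```

Referring to a column by name keeps working even if the order of the columns changes, whereas a position like 2 does not.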

Question #2: For the parts that follow you should use the happy_planet_data data frame you previously created in Question 1.

Part A: Find the median value of the variable that records a country’s life expectancy.
Part B: Use indices to print the 100th case in the data set, which displays that country’s data.


Data Types

Vectors can store values of various different types. The example below creates a vector, z, whose type is “character”:

z = c("X", "Y", "Z")

Notice that this vector was created using the c() function to concatenate three different character strings that are separated by commas. We will frequently use the c() function to create our own vectors at various points throughout the semester.

There are many types of vectors in R, but there are only 3 that you need to be aware of for this course:

numeric - for example: x = c(1,2,3)
character - for example: x = c("A","B","C")
logical - for example: x = c(TRUE, FALSE, TRUE)

The class() function provides information about an object and it can be used to check the type of a vector:

class(z)

## [1] "character"
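
To see class() applied to each of the three types listed above, here is a small sketch. The object names are made up for illustration, and the last example shows what appears to happen when types are mixed inside c():

```r
num_vec = c(1, 2, 3)
chr_vec = c("A", "B", "C")
lgl_vec = c(TRUE, FALSE, TRUE)

print(class(num_vec))  # "numeric"
print(class(chr_vec))  # "character"
print(class(lgl_vec))  # "logical"

mixed = c(1, "A")      # a vector holds one type, so the number 1 is stored as "1"
print(class(mixed))    # "character"
```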

It is important to be familiar with different data types because many functions only work as intended when they receive inputs of the proper type.

For example, the mean() function will not work as intended when given an input that isn’t numeric or logical. Instead of an average, it produces a warning and returns NA:

mean(z)  # Recall that z is character, so it has no average value

## Warning in mean.default(z): argument is not numeric or logical: returning NA

## [1] NA
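
One way to avoid this kind of surprise is to check an object’s type before using it. The sketch below uses is.numeric(), a standard R function not otherwise covered in this lab, along with a made-up logical vector named flags to show that mean() does accept logical input:

```r
z = c("X", "Y", "Z")
flags = c(TRUE, FALSE, TRUE, TRUE)  # hypothetical logical vector

z_is_numeric = is.numeric(z)  # FALSE, so mean(z) would give a warning and NA
mean_flags = mean(flags)      # TRUE counts as 1 and FALSE as 0

print(z_is_numeric)
print(mean_flags)             # 0.75, the proportion of TRUE values
```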

Sometimes we can successfully coerce an object into a different type so that it can be properly handled by functions that expect a certain type. Below is an example of coercion:

x = c("10", "20")  # Example vector
class(x)           # Notice that it's a character vector, so mean() doesn't work

## [1] "character"

new_x = as.numeric(x)  # Coerce x to be numeric using as.numeric(), then store as new_x
mean(new_x)            # This now works!

## [1] 15
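
Coercion does not always succeed. The sketch below, using a made-up vector named mixed, shows what happens when one element cannot be interpreted as a number. Here suppressWarnings() is used only to keep the output tidy; without it R would also print a warning that NAs were introduced by coercion.

```r
mixed = c("10", "twenty")                        # hypothetical example vector
converted = suppressWarnings(as.numeric(mixed))  # "twenty" cannot be converted

print(converted)                                 # the second element becomes NA
mean_converted = mean(converted, na.rm = TRUE)   # na.rm = TRUE ignores the NA
print(mean_converted)
```

It is worth checking for NA values after coercing, since they indicate elements that could not be converted.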

Question #3: The as.character() function will coerce a numeric or logical type into a character type. For example, 1 will be coerced into "1". Using an R comment, briefly describe why this might be desirable for one of the variables contained in happy_planet_data, the data frame you’ve worked with in Questions 1 and 2.


Basic data functions

At this point we’re ready to cover a few commonly used functions that help us work with data structures in R. Below are a few functions that provide the dimensions of data objects:

## Example data frame
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")

## Find the number of elements in a vector (college_majors$Major is the vector named Major in college_majors)
length(college_majors$Major)

## Find the number of rows (cases) in a data frame
nrow(college_majors)

## Find the number of columns (variables) in a data frame
ncol(college_majors)

## Find the dimensions (rows, columns) of a data frame
dim(college_majors)

These functions are useful in finding the number of cases and variables in an object so that other quantities can be computed. Soon we’ll need to calculate proportions, which use the number of cases/elements as a denominator. It’s preferable to have code that finds and stores this denominator rather than using a magic number you got from the Environment panel or your own knowledge of the data.
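
As a rough sketch of that idea, the code below computes a proportion using nrow() for the denominator. It reuses the small df from the Data Frames section. The cutoff of 5 and the object names are arbitrary choices for illustration, and the comparison df$value > 5 (which produces a logical vector that sum() can count) is not otherwise covered in this lab.

```r
df = data.frame(ID = c(209, 230, 310), value = c(5, 6, 8))  # small data frame from earlier

n_cases = nrow(df)               # denominator found by code, not typed in by hand
n_large = sum(df$value > 5)      # number of cases whose value is above 5
prop_large = n_large / n_cases   # proportion of cases with a value above 5

print(prop_large)
```

If rows are later added to or removed from df, this code still gives the right proportion, which would not be true if the denominator had been typed in as 3.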


Practice (required)

Question #4: For this question you will use the data stored at the URL https://remiller1450.github.io/data/congress_2024.csv. This file contains information on the 118th US Congress.

Part A: Using an R comment, briefly describe what a “case” is for this data set.
Part B: Use a basic data function to find the number of cases in this data set. You should confirm this using the information in the Environment panel.
Part C: Print the entire row corresponding to the 50th case in the data set using indices.
Part D: Find the median age of all members of the 118th US Congress. Use each member’s age at the time they took office; you do not need to use their birthday to find their present age.