Lab #2 - Introduction to R and One-Sample Categorical Data

This lab introduces R and RStudio as well as a few procedures we’ll use in future class sessions.

Directions (read before starting)

Please work together with your assigned partner(s). Make sure you all fully understand each topic or example before moving on.
Record your answers to lab questions separately from the lab’s examples. Everyone should have their own copy of your group’s responses, and each individual should turn in this copy, even if it’s identical to that of your lab partners.
Ask for help, clarification, or even just a check-in if anything seems unclear.

$~$

Onboarding

Most labs will begin with a short “onboarding” section that we’ll cover as a class. After that, you’re expected to work through the lab with your group following the “paired programming” paradigm. This framework entails:

One person acting as a driver, or the person in charge of physically typing out the ideas generated by the group and testing them out in R.
The other partner(s) acting as navigators, or collaborators who review the actions of the driver, providing ideas, oversight, and guidance. As a navigator you may still have your own instance of R open to try out ideas and record your group’s final responses.

You should aim to switch roles on a regular basis, with less experienced coders spending more time in the “driver” role.

$~$

The Layout of RStudio

When you first open RStudio you’ll need a place to write and store your code. The simplest solution is to create a new “R Script” via:

File -> New File -> R Script

This opens a blank page in the upper-left of the RStudio interface. At this point you should see four panels:

Your Script (upper-left)
The Console (lower-left)
Your Environment (upper-right)
The Files/Plots/Help viewer (lower-right)

An R Script is like a text file that stores your code while you work on it. You can execute some or all of your code by sending it to the console in the following ways:

Click the “Run” arrow in the upper-right of the R Script menu
Type “Ctrl-Enter” (PC) or “Cmd-Enter” (Mac)

If you highlight a segment of code before running, only the highlighted code will be sent to the console for execution. Try this out by running the following code:

log2(4)

## [1] 2

You should notice that the Console echoes any code you run, and beneath the echo it prints any textual/numeric output that the code produces.

When code you submit to the console cannot be executed due to errors, you will receive a red-colored message describing the problem. For example, try running the following code:

log2(x)  # Produces an error since 'x' hasn't been defined
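
The error goes away once an object named x exists. A minimal sketch, assuming the value 8 is chosen purely for illustration:

```r
x = 8      # define 'x' first (8 is an arbitrary illustrative value)
log2(x)    # now runs without error and returns 3, since 2^3 = 8
```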

Next, the Environment (upper-right) shows information on any objects that have been loaded into your work session.

To illustrate the features of this panel, try running the following code:

college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")

This uses the function read.csv() to load data from the file “majors.csv” that is housed at the URL provided in quotations. This data set is stored in R as an object named college_majors, which you should see in the Environment.

Clicking on the object name college_majors will open a viewer page, allowing you to inspect the data as if it were a spreadsheet.
Clicking on the blue arrow next to college_majors will display a list of variable names contained in the data set.
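
The same information can also be requested with code rather than by clicking. The sketch below uses a small made-up data frame named example_data (a hypothetical stand-in for a data set loaded with read.csv()) so that it runs without an internet connection:

```r
# A small made-up data frame standing in for a data set loaded with read.csv()
example_data = data.frame(name = c("A", "B", "C"), score = c(90, 85, 77))

dim(example_data)      # number of rows (cases) and columns (variables)
names(example_data)    # the variable names
head(example_data, 2)  # the first 2 rows
str(example_data)      # a compact summary of each variable and its type
```

These functions should work in the same way on college_majors once it has been loaded.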

Finally, the Files/Plots/Help Viewer (lower-right) will display graphics generated by your code, help documentation, and a file explorer tree.

To demonstrate this panel, try running the following code:

?sum

This opens the help documentation for the function/object name given after the question mark (the sum() function in this example). If you encounter an error when working with a function you should try to read its help documentation.

$~$

Lab

At this point you should begin working independently with your assigned partner(s) using a paired programming framework. Remember that you should read the lab’s content, not just the questions, and you should all agree with an answer before moving on. Despite being graded, labs aren’t intended to be formal assessments, and you can ask questions.

Comments

Computer code never lives in isolation. You should always prepare your code to be read and reused by your future self and others. Annotating your code with comments, or text that appears alongside the code but is not executed by the console, is an important aspect of coding.

In R, the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.

# This entire line is a comment and will do nothing if run

1:6 # The command "1:6" appears before this comment

## [1] 1 2 3 4 5 6

Question #1:

Part A: Open a new R Script and add a line with a comment containing the text: “Question 1”. Save your script in a location on your PC that you can find later using the name: “Lab1_your_names.R” (filling in your group’s names). Indicate your responses to all future lab questions using an appropriate comment.
Part B: Beneath the comment you created in Part A add the code: happy_planet_data = read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
Part C: Run the code you added in Part B, then use the environment panel to determine how many cases (countries) and variables are contained in this data set. You may consider every column to be a variable even though some are only used to identify cases. Record your answer using an R comment.

$~$

Vectors

In R, objects are nameable structures used to store data. Data are assigned into an object using either <- or =. After assignment, an object’s name is used to reference the data it holds.
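
As a quick check that the two assignment operators behave the same way here, consider the sketch below. The object names a and b are hypothetical, and == (which tests whether two values are equal) is used only to compare them:

```r
a <- 10   # assignment using the arrow operator
b = 10    # assignment using the equals sign

a == b    # TRUE, since both objects hold the same value
```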

The simplest type of data storage object is a single-element vector, which is sometimes called a “scalar”:

x = 5  # assigns the value '5' to an object named 'x'

We can now use the name x to reference this stored value:

x^2  # squares the value stored in 'x'

## [1] 25

More generally, we’ll encounter vectors containing multiple elements:

y = 1:3 # the sequence {1, 2, 3} is assigned to the vector called 'y'
print(y)

## [1] 1 2 3

This motivates us to understand indices, or the positions where specific pieces of data, known as elements or atomic units, are located within an object. In R, vectors have a single index that begins counting at 1 (many other programming languages begin counting at 0).

The example below uses square brackets ([ and ]) to access the second element of y, which is the number 2. This element is then stored as a new object named y2, which is printed using the print() function:

y2 = y[2] # access the element in position 2 and store as y2
print(y2) # print to confirm we extracted "2"

## [1] 2
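
Square brackets can also accept more than one position at a time. The sketch below uses a hypothetical vector v to show a few common patterns; the values are made up for illustration:

```r
v = c(10, 20, 30, 40)  # hypothetical vector for illustration

length(v)     # number of elements: 4
v[c(1, 3)]    # elements in positions 1 and 3: 10 30
v[2:4]        # elements in positions 2 through 4: 20 30 40
v[-1]         # everything except the first element: 20 30 40
```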

Data Frames

The most common data structure in R is the data frame, which is a collection of one or more named vectors that each contain the same number of elements. While we’ll primarily work with data frames created by the read.csv() function, there are some valuable insights in creating one ourselves.

The example below takes two vectors, x1 and x2, and assembles them into a data frame object named df. You should notice how x1 and x2 are given the names “ID” and “value” when the data frame is created.

x1 = c(209, 230, 310)  # First vector
x2 = c(5, 6, 8)        # Second vector

df = data.frame(ID = x1, value = x2)  # Put together into a data frame, giving each a name
print(df)

##    ID value
## 1 209     5
## 2 230     6
## 3 310     8

A useful property of data frames is that the individual vectors they contain can be accessed using the $ operator. The code below demonstrates this by accessing the vector named “value” in the data frame df, then using the median() function to find its median (the midpoint of the values when they’re put in ascending order):

median(df$value)

## [1] 6
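
The $ operator is not the only way to pull a named vector out of a data frame. As a small sketch, the code below rebuilds the same df so that it runs on its own and shows that double square brackets with a quoted name return the same vector:

```r
df = data.frame(ID = c(209, 230, 310), value = c(5, 6, 8))  # same data as the example above

df$value        # access by name using the '$' operator
df[["value"]]   # access by name using double square brackets
names(df)       # lists the names that can be used with '$'
```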

We can also use indices to access data stored in df, but we’ll need to recognize that data frames use two dimensions (rows and columns) to organize their elements. Consider the following examples:

## Access a single element
df[2,1]

## [1] 230

## Access an entire column
df[ ,1]

## [1] 209 230 310

## Access an entire row
df[2, ]

##    ID value
## 2 230     6
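
Row and column indices can be combined with a few helper functions. The sketch below rebuilds df so it is self-contained, then shows how to count rows and columns and how to use a column name in place of a column number:

```r
df = data.frame(ID = c(209, 230, 310), value = c(5, 6, 8))  # same data as the example above

nrow(df)          # number of rows: 3
ncol(df)          # number of columns: 2
df[2, "value"]    # row 2 of the column named "value": 6
df[c(1, 3), ]     # rows 1 and 3, all columns
```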

Question #2: For the parts that follow you should use the happy_planet_data data frame you previously created in Question 1.

Part A: Find the median value of the variable that records a country’s life expectancy.
Part B: Print the 100th case in the data set using indices to access this country’s data.

$~$

Data Types

Vectors can store values of various different types. The example below creates a vector, z, whose type is “character”:

z = c("X", "Y", "Z")

Notice that this vector was created using the c() function to concatenate three different character strings that are separated by commas. We will frequently use the c() function to create our own vectors at various points throughout the semester.

There are many types of vectors in R, but there are only 3 that you need to be aware of for this course:

numeric - for example: x = c(1,2,3)
character - for example: x = c("A","B","C")
logical - for example: x = c(TRUE, FALSE, TRUE)

The class() function provides information about an object and it can be used to check the type of a vector:

class(z)

## [1] "character"
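
For comparison, the sketch below applies class() to one vector of each of the three types listed above. It also uses is.numeric(), a base R function that checks for a single type and returns TRUE or FALSE:

```r
class(c(1, 2, 3))            # "numeric"
class(c(TRUE, FALSE, TRUE))  # "logical"
class(c("A", "B", "C"))      # "character"

is.numeric(c(1, 2, 3))       # TRUE
is.numeric(c("1", "2"))      # FALSE, since the quotes make these character strings
```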

It is important to be familiar with different data types because many functions only work as intended when they receive inputs of the proper type.

For example, the mean() function will issue a warning and return NA (a missing value) when given an input that isn’t a numeric or logical type:

mean(z)  # Recall that z is character, so calculating a mean doesn't make sense for it

## Warning in mean.default(z): argument is not numeric or logical: returning NA

## [1] NA

Sometimes we can successfully coerce an object into a different type so that it can be properly handled by functions that expect a certain type. Below are a few examples of coercion:

x = c("10", "20")  # Example vector
class(x)           # Notice that it's a character vector, so mean() doesn't work

## [1] "character"

new_x = as.numeric(x)  # Coerce x to be numeric using as.numeric(), then store as new_x
mean(new_x)            # This now works!

## [1] 15
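
Coercion does not always succeed for every element. The sketch below uses a hypothetical vector w in which one string is not written as a number. R converts that element to NA and issues a warning, which is hidden here with suppressWarnings() so the example runs quietly:

```r
w = c("10", "twenty")                     # "twenty" cannot be read as a number
new_w = suppressWarnings(as.numeric(w))   # the failed element becomes NA

new_w                       # 10 NA
mean(new_w)                 # NA, because one element is missing
mean(new_w, na.rm = TRUE)   # 10, the mean after dropping the missing value
```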

Question #3: The as.character() function will coerce a numeric or logical type into a character type. For example, 1 will be coerced into "1". Using an R comment, briefly describe why this might be desirable for one of the variables contained in happy_planet_data, the data frame you’ve worked with in Questions 1 and 2.

$~$

Describing One-Sample Categorical Data

One-sample categorical data is straightforward to describe, as our only real options are:

Frequencies - or counts of the number of cases in each category
Proportions - the fraction of all cases in each category

We can use R to create a one-way table of frequencies using the table() function as demonstrated below:

## Table showing frequencies of each region in the happy planet data
table(happy_planet_data$Region)

## 
##  1  2  3  4  5  6  7 
## 24 24 16 33  7 12 27

Note that 1=Latin America, 2=Western nations, 3=Middle East, 4=Sub-Saharan Africa, 5=South Asia, 6=East Asia, and 7=former Communist countries. You can find more information about these data at this link.
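
A table object can be stored and then queried like a named vector. The sketch below uses a made-up character vector, fav_color (hypothetical data, not part of the lab's data sets), to show how to pull out a single count and the total number of cases:

```r
fav_color = c("red", "blue", "red", "green", "red")  # made-up data for illustration
color_table = table(fav_color)

color_table          # frequencies, with categories listed alphabetically
color_table["red"]   # the count for a single category: 3
sum(color_table)     # the total number of cases: 5
```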

We can also use R to calculate proportions for each category shown in our table via the prop.table() function:

## First store the frequency table in its own object
region_table = table(happy_planet_data$Region)

## Next use this table object as the input to prop.table()
prop.table(region_table)

## 
##          1          2          3          4          5          6          7 
## 0.16783217 0.16783217 0.11188811 0.23076923 0.04895105 0.08391608 0.18881119
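
Each proportion is simply a category's frequency divided by the total number of cases. As a check, the sketch below types in the region frequencies printed earlier (so it runs without loading the data) and reproduces the proportions by hand:

```r
counts = c(24, 24, 16, 33, 7, 12, 27)  # frequencies printed earlier for regions 1-7

sum(counts)                      # total number of countries: 143
counts / sum(counts)             # matches the output of prop.table()
round(counts / sum(counts), 3)   # rounded to 3 decimal places for easier reading
```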

Question #4: Use the table() and prop.table() functions to find the proportion of science majors included among the majors present in the “college majors” data set introduced in the lab’s onboarding section.

$~$

Visualizing One-Sample Categorical Data

In our next lab we’ll learn about a more sophisticated graphics package, ggplot2, but for now we’ll make simpler visualizations.

For one-sample categorical data the goal of a data visualization is to display the frequencies of each category so that the overall distribution of the variable can be assessed. Bar charts are generally the most effective way of doing this, but in some circumstances, such as the analysis of a series of binary categorical variables, pie charts can be useful.

The code below provides an example of the functions used to create these data visualizations:

## Example barplot of regions in the happy planet data set
barplot(region_table)

## Example pie chart
pie(region_table)

You should note that both barplot() and pie() expect a frequency table as their input.
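
Both functions accept optional arguments for titles and axis labels. The sketch below uses a made-up vector, pets (hypothetical data), and the standard main, xlab, and ylab arguments:

```r
pets = c("cat", "dog", "dog", "fish", "dog", "cat")  # made-up data for illustration
pet_table = table(pets)                              # frequency table used as the input

# Bar chart with a title and axis labels
barplot(pet_table, main = "Pets owned", xlab = "Type of pet", ylab = "Frequency")

# Pie chart with a title
pie(pet_table, main = "Pets owned")
```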

Question #5: Create both a pie chart and a bar chart displaying the distribution of the variable “Category” in the “college majors” data set introduced in the lab’s onboarding section. Next, briefly comment upon why statisticians might prefer bar charts over pie charts by considering how easy it is to compare the relative frequency of each category using each graph. Hint: Think about how easy/difficult it is to determine whether there are more humanities or science majors included in this data set using each type of graph.

$~$

Putting it All Together

So far our lectures and labs have introduced the following topics:

Sampling - Statisticians need to understand how data were collected and assess whether it seems reasonable for it to be representative of the population of interest. Random sampling is ideal, but other methods that introduce minimal bias are acceptable.
Descriptive Statistics - Sample data is typically not suitable for presentation in its raw, spreadsheet format. To make it accessible to others, statisticians will summarize the key attributes and relationships using descriptive statistics. For one-sample categorical data this entails one-way frequency tables and proportions.
Data Visualizations - Descriptive statistics are generally accompanied by data visualizations. For one-sample categorical data, we can use bar charts and pie charts to visually display frequencies and proportions so that trends are easy to identify.
Inference - Hypothesis tests (or other statistical approaches) are used to evaluate population-level claims using the sample data. Inference informs us about the findings in our sample data that we can be statistically confident will apply to the broader population.

The concepts listed above comprise the core of most statistical analyses. In this section you will practice applying these concepts to real data.
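
This course uses StatKey for inference, but the logic of a randomization test for a single proportion can also be sketched in R. Every number below is hypothetical (a sample of 60 cases with 24 in the category of interest, tested against a null proportion of 0.5), and the simulation is only a rough illustration of the idea, not a replacement for StatKey:

```r
set.seed(1)          # makes the simulation reproducible
n = 60               # hypothetical sample size
observed = 24        # hypothetical count of cases in the category of interest
p0 = 0.5             # hypothetical null proportion

# Simulate 10,000 counts that could occur if the null hypothesis were true
sim_counts = rbinom(10000, size = n, prob = p0)

# Two-sided p-value: share of simulated counts at least as far from the
# expected count (n * p0) as the observed count is
p_value = mean(abs(sim_counts - n * p0) >= abs(observed - n * p0))
p_value
```

A small p-value would suggest that the observed count is unusual under the null hypothesis.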

Question #6: For this question you should use the data stored at the following URL: https://remiller1450.github.io/data/congress_sample.csv

These data are a random sample of $n=30$ current or former members of Congress in the United States who have held office at any point since the year 2000.

Part A: Does it seem reasonable to use these data to make statistical inferences about all members of Congress in recent political history? Briefly explain.
Part B: Find the proportion of senators in this sample.
Part C: Create a data visualization showing the distribution of the variable “Current_Chamber”.
Part D: This sample of data was actually collected by asking generative AI (Microsoft Copilot) to put together a random sample of members of Congress. Given that we know that Congress is composed of 435 House of Representatives members and 100 Senators, do these data provide statistical evidence that Copilot did not actually select members of Congress at random? Clearly state your hypotheses, use StatKey to find a two-sided $p$-value, and provide a one-sentence conclusion.