R and One-Sample Categorical Data
This lab introduces R and RStudio, as well as a few procedures we’ll use in future class sessions.
Directions (read before starting)
Most labs will begin with a short “onboarding” section that we’ll cover as a class. After that, you’re expected to work through the lab with your group following the “paired programming” paradigm. This framework entails:
… R … .R … open to try out ideas and record your group’s final responses.
You should aim to switch roles on a regular basis, with less experienced coders spending more time in the “driver” role.
When you first open RStudio you’ll need a place to write and store your code. The simplest solution is to create a new “R Script” via:
File -> New File -> R Script
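This opens a blank page in the upper-left of the RStudio interface. At this point you should see four panels: the R Script (upper-left), the Console (lower-left), the Environment (upper-right), and the Files/Plots/Help Viewer (lower-right).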
An R Script is like a text file that stores your code while you work on it. You can execute some or all of your code by sending it to the console in the following ways:
If you highlight a segment of code before running, only the highlighted code will be sent to the console for execution. Try this out by running the following code:
log2(4)
## [1] 2
You should notice that the Console echoes any code you run, and beneath the echo it prints any textual/numeric output that the code produces.
When code you submit to the console cannot be executed due to errors, you will receive a red-colored message describing the problem. For example, try running the following code:
log2(x) # Produces an error since 'x' hasn't been defined
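One way to resolve this kind of error is to define the object before using it. A minimal sketch, where the value 8 is an arbitrary choice made for illustration:

```r
# Define 'x' first so that log2() has a value to work with
x = 8

# This now runs without an error because 'x' exists
log2(x)
## [1] 3
```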
Next, the Environment (upper-right) shows information on any objects that have been loaded into your work session.
To illustrate the features of this panel, try running the following code:
college_majors = read.csv("https://remiller1450.github.io/data/majors.csv")
This uses the function read.csv() to load data from the file “majors.csv” that is housed at the URL provided in quotations. This data set is stored in R as an object named college_majors, which you should see in the Environment.
From the Environment, you can open college_majors in a viewer page, allowing you to inspect the data as if it were a spreadsheet. You can also expand college_majors to display a list of variable names contained in the data set.
Finally, the Files/Plots/Help Viewer (lower-right) will display graphics generated by your code, help documentation, and a file explorer tree.
To demonstrate this panel, try running the following code:
?sum
This opens the help documentation for the function/object name given after the question mark (the sum() function in this example). If you encounter an error when working with a function, you should try to read its help documentation.
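As a rough sketch of how help documentation can be put to use: the help page for sum() describes an argument named na.rm, and args() gives a quick reminder of a function's arguments without opening the full page. The numbers below are made up for illustration.

```r
# help("sum") is an equivalent way of writing ?sum

# Show the arguments that sum() accepts
args(sum)

# A made-up vector containing a missing value (NA)
scores = c(1, 2, NA)

# Without na.rm the missing value makes the total NA
sum(scores)

# Setting na.rm = TRUE removes missing values before adding
total = sum(scores, na.rm = TRUE)
print(total)
## [1] 3
```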
At this point you should begin working independently with your assigned partner(s) using a paired programming framework. Remember that you should read the lab’s content, not just the questions, and you should all agree with an answer before moving on. Despite being graded, labs aren’t intended to be formal assessments, and you are welcome to ask questions.
In R, objects are nameable structures used to store data. Data are assigned into an object using either <- or =. After assignment, an object’s name is used to reference the data it holds.
The simplest type of data storage object is a single-element vector, which is sometimes called a “scalar”:
x = 5 # assigns the value '5' to an object named 'x'
We can now use the name x to reference this stored value:
x^2 # squares the value stored in 'x'
## [1] 25
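The two assignment operators can be compared side by side. The sketch below uses the hypothetical object names a and b; it also shows that assigning to an existing name overwrites the stored value.

```r
a <- 5   # assignment using the arrow operator
b = 5    # assignment using the equals sign

# Both objects hold the same value
identical(a, b)
## [1] TRUE

# Assigning to an existing name replaces what it stored
a = a + 1
print(a)
## [1] 6
```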
More generally, we’ll encounter vectors containing multiple elements:
y = 1:3 # the sequence {1, 2, 3} is assigned to the vector called 'y'
print(y)
## [1] 1 2 3
This motivates us to understand indices, or the positions where specific pieces of data, known as elements or atomic units, are located within an object. In R, vectors have a single index that begins counting at 1 (many other programming languages begin counting at 0).
The example below uses square brackets ([ and ]) to access the second element of y, which is the number 2. This element is then stored as a new object named y2, which is printed using the print() function:
y2 = y[2] # access the element in position 2 and store as y2
print(y2) # print to confirm we extracted "2"
## [1] 2
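Square brackets can also take more than one position at a time. The following is a small sketch using a made-up vector named w, chosen so that the values differ from their positions:

```r
w = c(10, 20, 30)   # made-up values for illustration

# Access the elements in positions 1 and 3
w[c(1, 3)]
## [1] 10 30

# A negative index drops that position instead of selecting it
w[-1]
## [1] 20 30

# length() gives the number of elements, so this accesses the last one
w[length(w)]
## [1] 30
```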
The most common data structure in R is the data frame, which is a collection of one or more named vectors that each contain the same number of elements. While we’ll primarily work with data frames created by the read.csv() function, there are valuable insights to be gained by creating one ourselves.
The example below takes two vectors, x1 and x2, and assembles them into a data frame object named df. You should notice how x1 and x2 are given the names “ID” and “value” when the data frame is created.
x1 = c(209, 230, 310) # First vector
x2 = c(5, 6, 8) # Second vector
df = data.frame(ID = x1, value = x2) # Put together into a data frame, giving each a name
print(df)
##    ID value
## 1 209     5
## 2 230     6
## 3 310     8
A useful property of data frames is that the individual vectors they contain can be accessed using the $ operator. The code below demonstrates this by accessing the vector named “value” in the data frame df, then using the median() function to find its median (the midpoint of the values when they’re put in ascending order):
median(df$value)
## [1] 6
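A vector pulled out with $ behaves like any other vector. The sketch below rebuilds the same small data frame so that it runs on its own, then shows a couple of related ways to look at its columns:

```r
# Rebuild the example data frame so this block is self-contained
df = data.frame(ID = c(209, 230, 310), value = c(5, 6, 8))

# List the names of the vectors stored in the data frame
names(df)
## [1] "ID"    "value"

# Double square brackets with a quoted name return the same vector as $
identical(df$value, df[["value"]])
## [1] TRUE

# Other summary functions work on the extracted vector too
max(df$value)
## [1] 8
```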
We can also use indices to access data stored in df, but we’ll need to recognize that data frames use two dimensions (rows and columns) to organize their elements. Consider the following examples:
## Access a single element
df[2,1]
## [1] 230
## Access an entire column
df[ ,1]
## [1] 209 230 310
## Access an entire row
df[2, ]
##    ID value
## 2 230     6
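Column positions can be replaced by column names, and the row position can be replaced by a logical condition. This is only a sketch of a few common patterns, again rebuilding df so the block runs on its own:

```r
# Rebuild the example data frame so this block is self-contained
df = data.frame(ID = c(209, 230, 310), value = c(5, 6, 8))

# Use a column name in place of a column number
df[2, "value"]
## [1] 6

# Keep only the rows where 'value' is greater than 5
big_values = df[df$value > 5, ]
print(big_values)

# dim() reports the number of rows followed by the number of columns
dim(df)
## [1] 3 2
```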
Question #2: For the parts that follow you should use the happy_planet_data data frame you previously created in Question 1.
Vectors can store values of various different types. The example below creates a vector, z, whose type is “character”:
z = c("X", "Y", "Z")
Notice that this vector was created using the c() function to concatenate three different character strings that are separated by commas. We will frequently use the c() function to create our own vectors at various points throughout the semester.
There are many types of vectors in R, but there are only 3 that you need to be aware of for this course:
x = c(1,2,3) # numeric
x = c("A","B","C") # character
x = c(TRUE, FALSE, TRUE) # logical
The class() function provides information about an object and it can be used to check the type of a vector:
class(z)
## [1] "character"
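The same check can be applied to each of the three types. The object names below are hypothetical and were chosen only for this sketch:

```r
# One made-up vector of each type
heights = c(1, 2, 3)
grades = c("A", "B", "C")
passed = c(TRUE, FALSE, TRUE)

class(heights)
## [1] "numeric"
class(grades)
## [1] "character"
class(passed)
## [1] "logical"
```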
It is important to be familiar with different data types because many functions only work as intended when they receive inputs of the proper type.
For example, the mean() function will produce a warning and return NA when given an input that isn’t a numeric or logical type:
mean(z) # Recall that z is character, so calculating a mean doesn't make sense for it
## Warning in mean.default(z): argument is not numeric or logical: returning NA
## [1] NA
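One way to avoid this warning is to check an object's type before using it. The sketch below uses is.numeric() for that check, and it also shows that mean() accepts a logical vector, in which case it returns the proportion of TRUE values. The vector named flags is made up for illustration.

```r
z = c("X", "Y", "Z")

# Only calculate a mean when the input is numeric
if (is.numeric(z)) {
  print(mean(z))
} else {
  print("z is not numeric, so no mean was calculated")
}

# For a logical vector, TRUE counts as 1 and FALSE counts as 0
flags = c(TRUE, FALSE, TRUE)
prop_true = mean(flags)   # two of the three values are TRUE
print(prop_true)
```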
Sometimes we can successfully coerce an object into a different type so that it can be properly handled by functions that expect a certain type. Below is an example of coercion:
x = c("10", "20") # Example vector
class(x) # Notice that it's a character vector, so mean() doesn't work
## [1] "character"
new_x = as.numeric(x) # Coerce x to be numeric using as.numeric(), then store as new_x
mean(new_x) # This now works!
## [1] 15
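Coercion does not always succeed for every element. As a sketch with made-up values, the vector below contains one string that cannot be read as a number, so as.numeric() turns it into NA (and would normally print a warning, which suppressWarnings() hides here):

```r
# Coercion in the other direction: numeric to character
as.character(c(1, 2))
## [1] "1" "2"

# A made-up character vector where one element is not a number
mixed = c("10", "twenty")

# "twenty" cannot be converted, so it becomes NA
converted = suppressWarnings(as.numeric(mixed))
print(converted)
## [1] 10 NA

# na.rm = TRUE ignores the missing value when calculating the mean
mean(converted, na.rm = TRUE)
## [1] 10
```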
Question #3: The as.character() function will coerce a numeric or logical type into a character type. For example, 1 will be coerced into "1". Using an R comment, briefly describe why this might be desirable for one of the variables contained in happy_planet_data, the data frame you’ve worked with in Questions 1 and 2.
One-sample categorical data is straightforward to describe, as our only real options are:
We can use R to create a one-way table of frequencies using the table() function as demonstrated below:
## Table showing frequencies of each region in the happy planet data
table(happy_planet_data$Region)
##
##  1  2  3  4  5  6  7
## 24 24 16 33  7 12 27
Note that 1=Latin America, 2=Western nations, 3=Middle East, 4=Sub-Saharan Africa, 5=South Asia, 6=East Asia, and 7=former Communist countries. You can find more information about these data at this link.
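Numeric codes like these can be given readable labels before tabulating. The sketch below shows one possible approach using factor(); the vector region_codes is a short made-up example rather than the real data, and the labels follow the coding described above.

```r
# A short, made-up vector of region codes (not the real data)
region_codes = c(1, 2, 2, 7, 4, 4, 4)

# Labels listed in the same order as the codes 1 through 7
region_labels = c("Latin America", "Western nations", "Middle East",
                  "Sub-Saharan Africa", "South Asia", "East Asia",
                  "Former Communist countries")

# Attach a label to each code
region_named = factor(region_codes, levels = 1:7, labels = region_labels)

# The table now shows labels instead of codes,
# including a count of 0 for regions that never appear
named_table = table(region_named)
print(named_table)
```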
We can also use R to calculate proportions for each category shown in our table via the prop.table() function:
## First store the frequency table in its own object
region_table = table(happy_planet_data$Region)
## Next use this table object as the input to prop.table()
prop.table(region_table)
##
##          1          2          3          4          5          6          7
## 0.16783217 0.16783217 0.11188811 0.23076923 0.04895105 0.08391608 0.18881119
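The same two steps work on any categorical vector. Here is a minimal sketch using a made-up vector named fav_color, which also shows that the proportions add up to 1 and that a single category can be pulled out by name:

```r
# A made-up categorical variable with six observations
fav_color = c("red", "blue", "red", "green", "red", "blue")

# Frequencies of each category
color_table = table(fav_color)
print(color_table)

# Proportions of each category, rounded to two decimal places
color_props = prop.table(color_table)
round(color_props, 2)

# Proportions across all categories sum to 1
sum(color_props)

# Access the proportion for one category by its name (3 of 6 are "red")
color_props[["red"]]
## [1] 0.5
```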
Question #4: Use the table() and prop.table() functions to find the proportion of science majors included among the majors present in the “college majors” data set introduced in the lab’s onboarding section.
In our next lab we’ll learn about a more sophisticated graphics package, ggplot2, but for now we’ll make simpler visualizations.
For one-sample categorical data the goal of a data visualization is to display the frequencies of each category so that the overall distribution of the variable can be assessed. Bar charts are generally the most effective way of doing this, but in some circumstances, such as the analysis of a series of binary categorical variables, pie charts can be useful.
The code below provides an example of the functions used to create these data visualizations:
## Example barplot of regions in the happy planet data set
barplot(region_table)
## Example pie chart
pie(region_table)
You should note that both barplot() and pie() expect a frequency table as their input.
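Both functions accept optional arguments for titles and axis labels. The sketch below uses a made-up vector named shirt_sizes; the arguments main, xlab, and ylab are standard base R graphics arguments, and the title text is only a placeholder.

```r
# A made-up categorical variable
shirt_sizes = c("S", "M", "M", "L", "M", "S")

# Both plotting functions need a frequency table, so build it first
size_table = table(shirt_sizes)

# Bar chart with a title and axis labels
# (barplot() also returns the horizontal position of each bar)
bar_positions = barplot(size_table,
                        main = "Shirt sizes (made-up data)",
                        xlab = "Size",
                        ylab = "Frequency")

# Pie chart of the same table
pie(size_table, main = "Shirt sizes (made-up data)")
```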
Question #5: Create both a pie chart and a bar chart displaying the distribution of the variable “Category” in the “college majors” data set introduced in the lab’s onboarding section. Next, briefly comment upon why statisticians might prefer bar charts over pie charts by considering how easy it is to compare the relative frequency of each category using each graph. Hint: Think about how easy/difficult it is to determine whether there are more humanities or science majors included in this data set using each type of graph.
So far our lectures and labs have introduced the following topics:
The concepts listed above comprise the core of most statistical analyses. In this section you will practice applying these concepts to real data.
Question #6: For this question you should use the data stored at the following URL:
https://remiller1450.github.io/data/congress_sample.csv
These data are a random sample of \(n=30\) current or former members of Congress in the United States who have held office at any point since the year 2000.
Comments
Computer code never lives in isolation. You should always prepare your code to be read and reused by your future self and others. Annotating your code with comments, or text that appears alongside the code but is not executed by the console, is an important aspect of coding.
In R, the character “#” is used to start a comment. Everything appearing on the same line to the right of the “#” will not be executed when that line is submitted to the console.
Question #1:
happy_planet_data = read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
… R comment.