R Essentials
This lab is intended to further your understanding of essential aspects of R needed as a precursor for performing more advanced data science tasks.
Directions (Please read before starting)
To facilitate more complex tasks in R, many people have developed their own sets of functions known as packages. If you plan on working with a new package for the first time, it must be installed:
install.packages("ggplot2")
Once a package is installed, it still needs to be loaded into your R session using the library() function (or require()) before its contents can be used. You’ll need to re-load a package every time you open R Studio, but you’ll only need to install it once.
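The install-once, load-every-session pattern can be sketched as a small helper. The function name `ensure_package` is hypothetical (it is not part of R or any package), and the sketch assumes a CRAN mirror is configured if an install is actually needed.

```r
# Hypothetical helper: install a package only if it is missing, then load it
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (rather than an error) when 'pkg' is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE tells library() that 'pkg' holds the package name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("stats")  # 'stats' ships with R, so nothing is installed here
```

Calling `ensure_package("ggplot2")` would follow the same path, installing the package only the first time.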
my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
library(ggplot2)
qplot(my_data$Region) # qplot is a function in the package ggplot2
Our first lab introduced the “R Script” file type. R Scripts are built to contain only executable R code and comments. R Studio supports several other types of files, some of which use the “Markdown” authoring framework. An “R Markdown” file allows you to both execute R code and build a report that combines that code, its output, and written text.
To use R Markdown, you’ll need the rmarkdown package:
install.packages("rmarkdown")
library("rmarkdown")
Once you have the package installed and loaded, you can create a new R Markdown file by selecting: File -> New File -> R Markdown.
At the top of the document is the header:
The second thing you’ll see is a code chunk:
This “r setup” chunk controls how R code is handled when your report is built. For now, you should keep this chunk as it appears and place your actual code inside of other code chunks.
You can run the R code in a chunk by clicking the small green arrow in the upper right corner. You can also highlight individual code pieces and execute them using Ctrl-Enter.
Next you’ll see section headers:
Finally, R Markdown allows you to type ordinary text outside of code chunks. Thus, you can easily integrate written text into the same document as your code and its output.
The primary purpose of R Markdown is to create documents that blend R code, output, and text into a polished report. To generate this document you must compile your R Markdown file using the “Knit” button (a blue yarn ball icon) located towards the upper left part of your screen.
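Knitting can also be triggered from code rather than from the button. The sketch below writes a minimal R Markdown file and renders it with `rmarkdown::render()`. The file contents, the title “Lab example”, and the temporary file path are illustrative assumptions, and rendering is skipped if the rmarkdown package or pandoc is not available.

```r
# Build the text of a minimal R Markdown file: header, section label, plain text, one code chunk
fence <- strrep("`", 3)  # three backticks open and close a code chunk
rmd_lines <- c(
  "---",
  'title: "Lab example"',
  "output: html_document",
  "---",
  "",
  "### Question 1",
  "",
  "Plain text goes outside of code chunks.",
  "",
  paste0(fence, "{r}"),
  "mean(c(1, 2, 3))",
  fence
)

# Write the file to a temporary location (an illustrative path, not a required one)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Render only when the rmarkdown package and pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_path)  # path of the generated HTML report
}
```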
Question #1: Create a new R Markdown file and delete all of the template code that appears beneath the “r setup” code block. Change the title to “Lab #2” and the author to your name. Next, create section labels for each of this lab’s questions (there are 6 of them) using three # characters followed by “Question X” (where X is the number of the question).
Question #1 (continued): R Markdown will use LaTeX typesetting for any text wrapped in $ characters. For example, $\beta$ will appear as the Greek letter \(\beta\) after you knit your document. To practice this, include $H_0: \mu = 0$ in a sentence (the sentence can say anything, but it should not be inside an R code chunk or a section header).
At this point you should begin working with your partner. This lab will continue building on the fundamental aspects of R introduced in Lab #1. The lab’s examples will continue using the “Happy Planet” data, so please make sure you include code to load it.
my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
In our previous lab we saw how to access elements of a data frame using their positional indices. However, suppose we instead want to determine which countries have an average life expectancy over 80 years.
my_data$LifeExpectancy > 80 # Logical vector for the condition "> 80"
which(my_data$LifeExpectancy > 80 ) # Positions of elements where the condition is "TRUE"
The first step in this task is identifying which elements match the condition we’re interested in (life expectancy over 80).
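Once the matching elements are identified, the same logical vector (or the positions from which()) can be used to pull out the corresponding country names. The sketch below uses a small made-up data frame named `toy`, so its values are assumptions and not the Happy Planet data.

```r
# Hypothetical toy data; the values are made up for illustration
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  LifeExpectancy = c(81.2, 67.5, 82.0, 79.9),
  stringsAsFactors = FALSE
)

is_over_80 <- toy$LifeExpectancy > 80   # logical vector: TRUE FALSE TRUE FALSE
positions  <- which(is_over_80)         # positions of the TRUE elements: 1 3

toy$Country[is_over_80]   # "A" "C"
toy$Country[positions]    # same result using positional indices
```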
A few logical operators you should know of are:
Operator | Description |
---|---|
== | equal to |
!= | not equal to |
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
& | and |
\| | or |
! | negation (“not”) |
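To see how these operators behave element by element, here is a short sketch on a made-up numeric vector `x` (the values are arbitrary):

```r
x <- c(2, 5, 8, 11)

x == 5            # FALSE  TRUE FALSE FALSE
x != 5            #  TRUE FALSE  TRUE  TRUE
x >= 5 & x < 11   # FALSE  TRUE  TRUE FALSE (both conditions must hold)
x < 5 | x > 10    #  TRUE FALSE FALSE  TRUE (either condition may hold)
!(x > 5)          #  TRUE  TRUE FALSE FALSE (flips each TRUE/FALSE)
```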
Logical expressions can be used to create a subset of an object via the subset() function:
## Example #1
Ex1 <- subset(my_data, LifeExpectancy > 80)
In example #1, the data frame Ex1 will contain the subset of countries with life expectancy above 80.
## Example #2
Ex2 <- subset(my_data, LifeExpectancy <= 70 & Happiness > 6)
In example #2, the & operator is used to create a data frame, Ex2, containing all countries with a life expectancy of 70 or below and a happiness score above 6.
## Example #3
Ex3 <- subset(my_data, LifeExpectancy <= 70 | Happiness > 6)
In example #3, the | operator is used to create a data frame of all countries with a life expectancy of 70 or below or a happiness score above 6. Notice the different dimensions of Ex2 and Ex3:
dim(Ex2)
## [1] 9 11
dim(Ex3)
## [1] 118 11
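For comparison, the same kind of subset can be built with bracket indexing. The sketch below uses a made-up data frame `toy` (its values are assumptions) that includes one missing value, to show that subset() drops rows where the condition evaluates to NA.

```r
# Hypothetical toy data with one missing life expectancy
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  LifeExpectancy = c(65, 72, 68, NA),
  Happiness = c(6.5, 7.1, 4.9, 6.8),
  stringsAsFactors = FALSE
)

# subset() keeps rows where the condition is TRUE and drops rows where it is NA
via_subset <- subset(toy, LifeExpectancy <= 70 & Happiness > 6)

# Bracket indexing needs the data frame name on every variable;
# which() drops the NA row so the result matches subset()
keep <- which(toy$LifeExpectancy <= 70 & toy$Happiness > 6)
via_brackets <- toy[keep, ]

nrow(via_subset)     # 1 (only country "A" meets both conditions)
nrow(via_brackets)   # 1
```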
Question #2: Create a data frame named “Q2” that contains all countries with a population over 100 million that also have a happiness score of 6 or lower. Then, print the number of rows of this data frame. Remember to place your code in a properly formatted code chunk. This code chunk should begin by loading the Happy Planet data.
Descriptive summaries of data are an essential component of any analysis. Key functions for finding a few basic numerical summaries are shown below:
mean(my_data$LifeExpectancy) # mean
## [1] 67.83846
sd(my_data$LifeExpectancy) # standard deviation
## [1] 11.04193
min(my_data$LifeExpectancy) # minimum
## [1] 40.5
max(my_data$LifeExpectancy ) # maximum
## [1] 82.3
quantile(my_data$LifeExpectancy, .35) # the 35th percentile
## 35%
## 66.18
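The quantile() function also accepts several percentiles at once, and median() is a shortcut for the 50th percentile. A short sketch on a made-up vector `x` (not the lab data):

```r
x <- c(2, 4, 6, 8, 10)

median(x)                          # 6 (the 50th percentile)
quantile(x, c(0.25, 0.50, 0.75))   # 25%: 4, 50%: 6, 75%: 8
summary(x)                         # min, quartiles, mean, and max in one call
```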
Each of these functions operates on a single variable. For a broader set of summary statistics, you can input an entire data frame into the summary() function:
summary(my_data)
##    Country              Region        Happiness     LifeExpectancy
##  Length:143         Min.   :1.000   Min.   :2.400   Min.   :40.50
##  Class :character   1st Qu.:2.000   1st Qu.:5.000   1st Qu.:61.90
##  Mode  :character   Median :4.000   Median :5.900   Median :71.50
##                     Mean   :3.832   Mean   :5.919   Mean   :67.84
##                     3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:76.05
##                     Max.   :7.000   Max.   :8.500   Max.   :82.30
##
##    Footprint           HLY             HPI           HPIRank
##  Min.   : 0.500   Min.   :11.60   Min.   :16.59   Min.   :  1.0
##  1st Qu.: 1.300   1st Qu.:31.10   1st Qu.:34.47   1st Qu.: 36.5
##  Median : 2.200   Median :41.80   Median :43.60   Median : 72.0
##  Mean   : 2.877   Mean   :41.38   Mean   :43.38   Mean   : 72.0
##  3rd Qu.: 3.850   3rd Qu.:53.20   3rd Qu.:52.20   3rd Qu.:107.5
##  Max.   :10.200   Max.   :66.70   Max.   :76.12   Max.   :143.0
##
##   GDPperCapita        HDI           Population
##  Min.   :  667   Min.   :0.3360   Min.   :   0.290
##  1st Qu.: 2107   1st Qu.:0.5790   1st Qu.:   4.455
##  Median : 6632   Median :0.7720   Median :  10.480
##  Mean   :11275   Mean   :0.7291   Mean   :  44.145
##  3rd Qu.:15711   3rd Qu.:0.8680   3rd Qu.:  31.225
##  Max.   :60228   Max.   :0.9680   Max.   :1304.500
##  NA's   :2       NA's   :2
Notice how summary() is not particularly useful for categorical variables. For these variables you should be using frequency tables.
A one-way frequency table shows the frequencies of categories in a single categorical variable, while a two-way frequency table shows the relationship between two categorical variables. Both are created by the table() function:
table(my_data$Region) # A one-way frequency table of 'region'
##
## 1 2 3 4 5 6 7
## 24 24 16 33 7 12 27
table(my_data$Region, my_data$LifeExpectancy > 80) # A two-way frequency table showing the number of countries w/ LifeExpectancy > 80 by region
##
##     FALSE TRUE
##   1    24    0
##   2    16    8
##   3    15    1
##   4    33    0
##   5     7    0
##   6    10    2
##   7    27    0
# Notice how the table function can use numeric, logical, and character variables
Tables are their own type of object, and they can be used as an input to functions like barplot():
my_table <- table(my_data$Region) # Tables can be stored as objects
barplot(my_table) # Creates a bar plot from a table
They can also be used as an input to the prop.table() function to find row or column proportions:
prop.table(my_table, margin = 1) # "margin = 1" gives row props, "margin = 2" gives column props
##
## 1 2 3 4 5 6 7
## 1 1 1 1 1 1 1
In the example above, the table only had a single dimension (so each row total was the same as the frequency). Shown below is a more typical example:
my_table <- table(my_data$Region, my_data$LifeExpectancy > 80)
prop.table(my_table, margin = 1)
##
##         FALSE      TRUE
##   1 1.0000000 0.0000000
##   2 0.6666667 0.3333333
##   3 0.9375000 0.0625000
##   4 1.0000000 0.0000000
##   5 1.0000000 0.0000000
##   6 0.8333333 0.1666667
##   7 1.0000000 0.0000000
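The effect of the margin argument is easiest to see on a very small table. The sketch below uses two made-up vectors, `region` and `over_80` (both are assumptions for illustration), and compares row, column, and overall proportions.

```r
# Hypothetical toy data: five cases, two regions
region  <- c("North", "North", "South", "South", "South")
over_80 <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
toy_table <- table(region, over_80)

prop.table(toy_table, margin = 1)  # each row sums to 1 (North: 0.5 and 0.5)
prop.table(toy_table, margin = 2)  # each column sums to 1
prop.table(toy_table)              # no margin: all cells together sum to 1
```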
Question #3: Find the mean, median, and range (maximum - minimum) of the variable LifeExpectancy in the Happy Planet data. Briefly comment on whether the distribution of this variable seems to be symmetric or skewed using plain text beneath your answer’s code chunk.
Lab #1 introduced three important types of vectors:
x = c(1,2,3)              # numeric vector
x = c("A","B","C")        # character vector
x = c(TRUE, FALSE, TRUE)  # logical vector
Many functions require their inputs be of a certain type. Fortunately, data can be coerced into another type using the as. family of functions:
## A character vector where the text strings are numbers
x <- c("1","12","123")
typeof(x)
## [1] "character"
## Coerce 'x' to a numeric vector
x <- as.numeric(x)
x
## [1] 1 12 123
typeof(x)
## [1] "double"
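Other members of the as. family work the same way. A brief sketch using made-up values:

```r
as.character(c(1, 12, 123))   # "1" "12" "123"
as.logical(c(0, 1, 2))        # FALSE TRUE TRUE (zero is FALSE, any other number is TRUE)
as.integer("42")              # 42, stored as an integer rather than a double
typeof(as.integer("42"))      # "integer"
```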
Question #4: Coerce the variable “Region” into a character variable. Use the typeof() function to verify the change. Hint: you should overwrite the “Region” vector within “my_data” as part of this question.
Real data sometimes contain missing values, which R stores as the special element NA. Usually missing values stem directly from your raw data, but they can also be introduced by coercion or other operations/functions:
## The second element is a blank space
x <- c("1"," ","123")
typeof(x)
## [1] "character"
## Coerce to a numeric vector (stored as 'y'), notice the NA
y <- as.numeric(x)
y
## [1] 1 NA 123
Missing values cause problems for many functions, unless you explicitly instruct those functions on how to handle them:
mean(y) ## Doesn't handle the missing value
## [1] NA
mean(y, na.rm = TRUE) ## Removes the missing value
## [1] 62
If missing values are removed in any part of an analysis, you should track and report the identities of the cases that were excluded. You can use the is.na() function to help locate these cases.
is.na(y) ## Returns TRUE if the value is missing
## [1] FALSE TRUE FALSE
which(is.na(y)) ## Uses the which function to return the positions where is.na() returns "TRUE"
## [1] 2
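In a data frame, the positions returned by which(is.na()) can be used to look up an identifying variable for the excluded cases. A minimal sketch, assuming a made-up data frame `toy` with hypothetical columns `Country` and `GDP`:

```r
# Hypothetical toy data with one missing GDP value
toy <- data.frame(
  Country = c("A", "B", "C"),
  GDP = c(1200, NA, 560),
  stringsAsFactors = FALSE
)

missing_rows <- which(is.na(toy$GDP))  # 2
toy$Country[missing_rows]              # "B" is the case excluded by na.rm = TRUE
sum(is.na(toy$GDP))                    # 1 missing value in total
mean(toy$GDP, na.rm = TRUE)            # 880
```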
Question #5: Find the median value of the variable “GDPperCapita” in the Happy Planet data, removing missing values if necessary. Report the country names corresponding to any missing values that you removed (if applicable).
Many functions will coerce character variables into factors.
On the surface you won’t notice a difference, but internally a factor uses categorical labels called levels. By default, these labels are ordered alphabetically, but in some circumstances you’ll want to organize them yourself.
## A vector containing different months
mons <- c("March","April","January","November","January", "September","October","September","November","August","January","November",
"November","February","May","August", "July","December","August","August","September","November", "February","April")
## Convert it to a factor
mons_unordered = factor(mons)
## Notice the factor defaults to alphabetical order
barplot(table(mons_unordered))
## Convert to a factor with ordering specified by the "levels" argument
mons_ordered = factor(mons_unordered, levels= c("January","February","March","April","May","June",
"July","August","September","October","November","December"),
ordered = TRUE)
## Notice the new ordering (useful for data visualization!)
barplot(table(mons_ordered))
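The levels() function offers a quick way to check the ordering without drawing a plot. A small sketch using a made-up vector `sizes` (not part of the lab data):

```r
sizes <- c("small", "large", "medium", "small")

default_levels <- factor(sizes)
levels(default_levels)   # "large" "medium" "small" (alphabetical)

custom_levels <- factor(sizes, levels = c("small", "medium", "large"))
levels(custom_levels)    # "small" "medium" "large"
table(custom_levels)     # counts follow the level order: small 2, medium 1, large 1
```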
The code below loads data from The College Scorecard, a government database that records various characteristics of accredited colleges and universities within the United States. This particular data set contains only primarily undergraduate institutions with at least 400 full-time students for the 2019-20 academic year.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
Question #6 - Part A: Create a subset of these data that contains all schools that admit less than 50% of applicants (an “Adm_Rate” less than 50%) and are located in the “Great Lakes” region. You should use this subset in Parts B and C.
Question #6 - Part B: Using the subset created in Part A, find the average value of “Salary10yr_median”, the median salary of a school’s alumni 10 years after their graduation. Remove missing data if necessary, but be sure to report the identity of any colleges that were removed.
Question #6 - Part C: The “Great Lakes” region consists of 5 different states: IL, IN, MI, OH, and WI. Using the subset created in Part A, create a bar plot that displays the number of colleges (meeting the criteria specified in Part A) in each of these states in descending order (you may examine the frequency table to determine this ordering “by hand”).