R Essentials
This lab is a continuation of our introductory lab. It will cover R Markdown, packages/libraries, and a few additional topics.
Directions (Please read before starting)
To facilitate more complex tasks in R, many people have developed their own sets of functions known as packages. Before you work with a new package for the first time, you must install it:
install.packages("ggplot2")
Once a package is installed, it still needs to be loaded into your R session using the library() function (or require()) before its contents can be used. You’ll need to re-load a package every time you open R Studio, but you’ll only need to install it once.
my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
library(ggplot2)
qplot(my_data$Region) # qplot is a function in the package ggplot2
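The install-once, load-every-session pattern can be wrapped in a small helper. This is only a sketch: load_or_install() is a hypothetical name that is not part of the lab or of any package, and it assumes install.packages() can reach a CRAN mirror when a package is missing.

```r
# Hypothetical helper: install a package only when it is missing, then load it
load_or_install <- function(pkg) {
  # requireNamespace() checks whether the package is installed without attaching it
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the package name as a string
  library(pkg, character.only = TRUE)
}

load_or_install("stats")      # "stats" ships with R, so nothing is installed here
# load_or_install("ggplot2")  # would install ggplot2 first if it were missing
```

A function from an installed package can also be called without loading the whole package by using the package::function form, for example ggplot2::qplot(my_data$Region).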
Our first lab introduced the “R Script” file type. R Scripts are built to contain only executable R code and comments. R Studio supports several other types of files, some of which use the “Markdown” authoring framework. An “R Markdown” file allows you to keep both your R code and your written text in the same document.
To use R Markdown, you’ll need the rmarkdown package:
install.packages("rmarkdown")
library("rmarkdown")
Once you have the package installed and loaded, you can create a new R Markdown file by selecting: File -> New File -> R Markdown.
At the top of the document is the header:
The second thing you’ll see is a code chunk:
The “r setup” chunk sets options that apply to the R code when your report is built. For now, you should keep this chunk as it appears and place your actual code inside of other code chunks.
You can run the R code in a chunk by clicking the small green arrow in the upper right corner. You can also highlight individual code pieces and execute them using Ctrl-Enter.
Next you’ll see section headers:
Finally, R Markdown allows you to type ordinary text outside of code chunks. Thus, you can easily integrate written text into the same document as your code and its output.
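To make the pieces described above concrete, the sketch below writes a tiny R Markdown file from R so the header, setup chunk, section header, ordinary text, and a regular code chunk can be seen together. The title, author, chunk contents, and the file name lab1_sketch.Rmd are illustrative assumptions, not the template R Studio generates word for word.

```r
# Three backticks open and close a code chunk; built here to keep the example tidy
fence <- strrep("`", 3)

# Lines of a small, hypothetical R Markdown file
rmd_lines <- c(
  "---",                                   # the header sits between two '---' lines
  'title: "Lab #1"',
  'author: "Your Name"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r setup, include=FALSE}"),  # the setup code chunk
  "knitr::opts_chunk$set(echo = TRUE)",
  fence,
  "",
  "### Question 1",                        # a section header
  "",
  "Ordinary text goes outside of code chunks.",
  "",
  paste0(fence, "{r}"),                    # a regular code chunk
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
)

# Write the file to a temporary folder
rmd_file <- file.path(tempdir(), "lab1_sketch.Rmd")
writeLines(rmd_lines, rmd_file)

# rmarkdown::render(rmd_file)  # builds the report from code; needs rmarkdown and pandoc
```

Opening the resulting file in R Studio and pressing “Knit” should give the same kind of report as the commented-out render() call.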
The primary purpose of R Markdown is to create documents that blend R code, output, and text into a polished report. To generate this document you must compile your R Markdown file using the “Knit” button (a blue yarn ball icon) located towards the upper left part of your screen.
Question #8: Create a new R Markdown file and delete all of the template code that appears beneath the “r setup” code block. Change the title to “Lab #1” and the author to your name(s). Next, create section labels for each question in Lab 1, Part 1 using three # characters followed by “Question X” (where X is the number of the question). Finally, copy over all of your responses from Lab 1, Part 1 into the appropriately placed code chunks.
Question #8 (continued): R Markdown will use LaTeX typesetting for any text wrapped in $ characters. For example, $\beta$ will appear as the Greek letter \(\beta\) after you knit your document. To practice this, add a label for Question #8 and below it include $H_0: \mu = 0$ in a sentence (the sentence can say anything, but it should not be inside an R code chunk or a section header).
At this point you should begin working with your partner. This lab will continue building on the fundamental aspects of R introduced previously. The lab’s examples will continue using the “Happy Planet” data, so please make sure you include code to load it.
my_data <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
Part 1 of Lab #1 introduced three important types of vectors:
x = c(1,2,3)
x = c("A","B","C")
x = c(TRUE, FALSE, TRUE)
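The three vectors above are numeric, character, and logical. As a quick check, the sketch below stores each under its own (illustrative) name and asks R for its type; it also shows what happens when types are mixed in a single vector.

```r
num_vec <- c(1, 2, 3)
chr_vec <- c("A", "B", "C")
lgl_vec <- c(TRUE, FALSE, TRUE)

typeof(num_vec)   # "double"  (class(num_vec) is "numeric")
typeof(chr_vec)   # "character"
typeof(lgl_vec)   # "logical"

# A vector holds a single type, so mixed inputs are converted to the most flexible one
mixed <- c(1, "A", TRUE)
typeof(mixed)     # "character"
mixed             # "1" "A" "TRUE"
```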
Many functions require their inputs be of a certain type. Fortunately, data can be coerced into another type using the as. family of functions:
## A character vector where the text strings are numbers
x <- c("1","12","123")
typeof(x)
## [1] "character"
## Coerce 'x' to a numeric vector
x <- as.numeric(x)
x
## [1] 1 12 123
typeof(x)
## [1] "double"
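A few other members of the as. family behave in the same way. The values below are made up for illustration only.

```r
## Numeric to character
as.character(c(1, 12, 123))             # "1" "12" "123"

## Logical to numeric: TRUE becomes 1 and FALSE becomes 0
as.numeric(c(TRUE, FALSE, TRUE))        # 1 0 1

## Character to logical: text that is not recognized becomes NA
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA

## Double to integer: the decimal part is dropped, not rounded
as.integer(3.9)                         # 3
```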
Question #9: Coerce the variable “Region” into a character variable. Use the typeof function to verify the change. Hint: you should overwrite the “Region” vector within “my_data” as part of this question.
Real data sometimes contain missing values, which R stores as the special element NA. Missing values may be present in your raw data, but they can also be introduced by coercion or other operations/functions:
## The second element is a blank space
x <- c("1"," ","123")
typeof(x)
## [1] "character"
## Coerce to a numeric vector (stored as 'y'), notice the NA
y <- as.numeric(x)
y
## [1] 1 NA 123
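When coercion introduces missing values, it can help to see which original entries were affected. The sketch below uses a made-up vector and separates values that were already missing from values that became missing during coercion.

```r
## The fourth element is already missing; the second and third are not valid numbers
x <- c("1", " ", "12a", NA)

## Non-numeric text such as "12a" triggers a coercion warning, silenced here
y <- suppressWarnings(as.numeric(x))

## TRUE where the NA was created by coercion rather than present in the raw data
introduced <- is.na(y) & !is.na(x)
which(introduced)   # 2 3
x[introduced]       # " " "12a"
```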
Missing values can cause problems for many functions, but some functions have arguments that control how missing values are handled. The example below shows how to remove any missing values when calculating the mean of y:
mean(y) ## Doesn't handle the missing value
## [1] NA
mean(y, na.rm = TRUE) ## Removes the missing value
## [1] 62
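The na.rm argument is not specific to mean(). As a brief sketch, several other base R summary functions accept it in the same way:

```r
y <- c(1, NA, 123)

sum(y)                   # NA
sum(y, na.rm = TRUE)     # 124
max(y, na.rm = TRUE)     # 123
range(y, na.rm = TRUE)   # 1 123
```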
If missing values are removed in any part of an analysis, you should track and report the identities of the cases that were excluded. You can use the is.na function to help locate these cases.
is.na(y) ## Returns TRUE if the value is missing
## [1] FALSE TRUE FALSE
which(is.na(y)) ## Uses the which function to return the positions where is.na() returns "TRUE"
## [1] 2
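In a data frame, the positions returned by which() can be used to look up the cases that were excluded. The data frame below is invented for illustration and is not part of the lab's data.

```r
## A small made-up data frame with two missing scores
scores <- data.frame(
  student = c("Ana", "Ben", "Cho", "Dev"),
  score   = c(90, NA, 75, NA)
)

## Positions of the missing scores
missing_rows <- which(is.na(scores$score))

## Identities of the cases that would be excluded
excluded <- scores$student[missing_rows]
excluded                              # "Ben" "Dev"

## Summary computed without the excluded cases
mean(scores$score, na.rm = TRUE)      # 82.5
```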
Another useful function is na.omit, which will subset a data frame to remove any rows that contain missing data in any variable. This function is demonstrated on the Happy Planet data below:
## Store the subset without missing data
my_data_without_na <- na.omit(my_data)
## Compare dimensions
dim(my_data)
## [1] 143 11
dim(my_data_without_na)
## [1] 141 11
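Before dropping rows with na.omit, it can be useful to see which rows will be lost and which variables are responsible. One possible approach, sketched on a made-up data frame, uses complete.cases() and colSums():

```r
## A small made-up data frame with missing values in two different variables
toy <- data.frame(
  id     = 1:4,
  height = c(150, NA, 172, 181),
  weight = c(55, 62, NA, 90)
)

## TRUE for rows with no missing values in any variable
keep <- complete.cases(toy)
keep                    # TRUE FALSE FALSE TRUE

## The rows that na.omit() would remove
toy[!keep, ]

## Number of missing values in each variable
colSums(is.na(toy))     # id 0, height 1, weight 1

## The same result as keeping only the complete rows
toy_clean <- na.omit(toy)
nrow(toy_clean)         # 2
```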
Question #10: Find the median value of the variable “GDPperCapita” in the Happy Planet data, removing any missing values in this variable if necessary. Report the country names corresponding to any missing values that you removed (if applicable).
Many functions will coerce character variables into factors.
On the surface you might not notice any difference, but internally a factor relies upon a set of categorical labels known as levels. By default, these labels are ordered alphabetically, but in some circumstances you’ll want to organize them yourself.
## A vector containing different months
mons <- c("March","April","January","November","January", "September","October","September","November","August","January","November",
"November","February","May","August", "July","December","August","August","September","November", "February","April")
## Convert it to a factor
mons_unordered = factor(mons)
## Notice the factor defaults to alphabetical order
barplot(table(mons_unordered))
## Convert to a factor with ordering specified by the "levels" argument
mons_ordered = factor(mons_unordered,
                      levels = c("January","February","March","April","May","June",
                                 "July","August","September","October","November","December"),
                      ordered = TRUE)
## Notice the new ordering (useful for data visualization!)
barplot(table(mons_ordered))
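The levels of a factor can be inspected directly. The sketch below uses a short made-up vector to show the default alphabetical levels, the integer codes stored internally, and two consequences of supplying the levels yourself: ordered comparisons become possible, and levels with no observations still appear in tables (just as "June" does in the months example).

```r
sizes <- c("low", "high", "medium", "low")

## Default factor: levels are sorted alphabetically
f_default <- factor(sizes)
levels(f_default)       # "high" "low" "medium"
as.integer(f_default)   # 2 1 3 2  (position of each value among the levels)

## Ordered factor with user-specified levels, including one that never occurs
f_ordered <- factor(sizes,
                    levels = c("low", "medium", "high", "extreme"),
                    ordered = TRUE)
f_ordered < "high"      # TRUE FALSE TRUE TRUE
table(f_ordered)        # low 2, medium 1, high 1, extreme 0
```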
Question #11:
The code below loads the “colleges” data set. Recall that this data set contains information pertaining to all primarily undergraduate institutions with at least 400 full-time students in the 2019-20 academic year.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
The data available at the URL below contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010:
https://remiller1450.github.io/data/AmesHousing.csv
A more detailed description can be found at this link.
Read these data into R and store them in a data frame named ames_housing. Then check the type of the variable MS.SubClass and compare it with the description of this variable given in the link above. Based upon your assessment, should this variable be coerced to a different type? Briefly explain.
[...] Garage.Type. Hint: you can use the sum() function on a logical vector to count the number of TRUE values.
[...] Garage.Type. What is the average Garage.Area of these homes?
For the variable Exter.Cond (exterior condition), create an ordered factor that goes from “Poor” condition (a value of Po) to “Excellent” condition (a value of Ex) following the order and definitions given in the detailed description for this variable. Use the barplot() and table() functions to construct a bar chart using your ordered factor variable.
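The hint about sum() works because R treats TRUE as 1 and FALSE as 0 when doing arithmetic on a logical vector. A minimal sketch with a made-up vector (not the Ames data):

```r
temps <- c(12, NA, 30, 25, NA, 8)

## Count the missing values
sum(is.na(temps))               # 2

## Count the values above 20, ignoring the missing ones
sum(temps > 20, na.rm = TRUE)   # 2

## The mean of a logical vector is the proportion of TRUE values
mean(is.na(temps))              # 2 of 6 values are missing
```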