R and One-Sample Quantitative Data
This lab introduces R and RStudio, as well as a few procedures we’ll use in future class sessions.
Directions (read before starting)
\(~\)
While some of the functions we’ll use are contained in the default installation of R, others are stored in external libraries known as “packages”.
The first package we’ll use is ggplot2, which is used to create professional quality data visualizations. To use a package, it must be installed onto the computer you’re working on. You can do this using install.packages():
install.packages("ggplot2")
Installing a package is similar to downloading an app - you only ever need to download it once, but you’ll need to load it each time you want to use it. As a result, you’ll frequently encounter code that looks like this:
# install.packages("ggplot2")
library(ggplot2)
Here the installation command will not be run because it is commented out, but the library() function, which loads the ggplot2 package into your current R session, will be run.
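As an alternative to commenting the installation line in and out, the same idea can be sketched with a check that installs the package only when it is missing. This is a minimal sketch rather than part of the lab's required workflow; it assumes an internet connection is available the first time it runs.

```r
# Install ggplot2 only if it is not already installed on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading still has to happen once per R session
library(ggplot2)
```

With this pattern the script can be re-run safely: the installation step is skipped whenever the package is already present.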
\(~\)
The ggplot2 package uses a layer-based framework to build data visualizations. We can understand layering via the following sequence of examples:
## Read the data from our last lab
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")
## Example #1.1 - nothing (just a base layer)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income))
## Example #1.2 - add a layer displaying raw data
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point()
## Example #1.3 - add another layer (regression line)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Example #1.4 - add another layer (reference text)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
geom_smooth(method = "lm") +
annotate("text", x=25, y=80000, label = "Strong, positive relationship")
## `geom_smooth()` using formula = 'y ~ x'
Notice how each layer is added to an existing graph using +. The data frame and variable names from the initial use of ggplot() (the first layer) are passed forward to subsequent layers. This allows + geom_point() to create a scatter plot despite being given no arguments.
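Because each use of + returns a new plot, layers can also be added one at a time to a stored plot object. The sketch below illustrates this with a small made-up data frame (the names toy, hours, and score are hypothetical and are used only so the example runs without downloading anything).

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(hours = c(1, 2, 3, 4, 5, 6),
                  score = c(52, 58, 61, 70, 74, 81))

# First layer: the data and aesthetic mappings only (draws empty axes)
base_plot <- ggplot(data = toy, mapping = aes(x = hours, y = score))

# Each "+" returns a new plot object containing one more layer
scatter_plot <- base_plot + geom_point()
fitted_plot  <- scatter_plot + geom_smooth(method = "lm")

# Printing a plot object draws it
fitted_plot
```

Neither geom_point() nor geom_smooth() is told which variables to use; both inherit the data and mappings stored in base_plot.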
\(~\)
The guiding philosophy of ggplot is to define data visualizations grammatically using:
- Aesthetic mappings, defined inside aes(). For example, x = Per_Male instructs ggplot to relate each value of Per_Male to a position on the x-axis.
- Geometric layers. For example, + geom_point() instructs ggplot to use points to display the aesthetic mappings you defined inside of aes().
Sometimes we’ll want a visual cue to be mapped to a variable, but other times we won’t. What difference do you see in the following examples?
## Example 2.1
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = Category)) + geom_point() + labs(title = "Example 2.1")
## Example 2.2
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = "blue")) + geom_point() + labs(title = "Example 2.2")
## Example 2.3
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point(color = "blue") + labs(title = "Example 2.3")
\(~\)
At this point you should begin working independently with your assigned partner(s) using a pair programming framework. Remember that you should read the lab’s content, not just the questions, and you should all agree on an answer before moving on.
Last week we introduced the concept of one-sample data, which involves treating all of the observations in our data set as a single sample.
The table below summarizes various things we might include in an analysis of one-sample data depending upon whether the variable of interest is categorical or quantitative:
| | One-sample Categorical | One-sample Quantitative |
|---|---|---|
| Descriptive Statistics | Frequencies - table(); Proportions - prop.table() | Mean - mean(); Median - median(); Standard Deviation - sd(); IQR - IQR(); Range - range() |
| Data Visualizations | Bar plot - geom_bar(); Pie chart - geom_bar() + coord_polar() | Histogram - geom_histogram(); Boxplot - geom_boxplot() |
| Hypothesis Tests | \(H_0: p =\_\_\) | \(H_0: \mu =\_\_\) |
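As a brief reminder of the categorical column, the sketch below applies table() and prop.table() to a made-up vector (the name responses and its values are hypothetical, chosen only for illustration).

```r
# Hypothetical categorical responses, used only for illustration
responses <- c("A", "B", "A", "C", "A", "B")

# Frequencies: how many times each category appears
counts <- table(responses)
counts

# Proportions: each frequency divided by the total number of observations
props <- prop.table(counts)
props
```

Note that prop.table() is applied to the result of table(), not to the raw vector.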
As you might notice, describing quantitative data can be more complex than describing categorical data. Two of the most important attributes we should describe are a quantitative variable’s center, or where values tend to typically fall, and its spread, or the extent to which values tend to vary relative to the center.
The table below provides a brief description of how each of these is calculated:
| Attribute | Measure | Description |
|---|---|---|
| Center | Mean | Sum of all values divided by the number of values: \(\bar{x} = \frac{1}{n} \sum x_i\) |
| | Median | Middle value when the data are ordered; if there is an even number of values, average the two middle values |
| Spread | Standard Deviation | Roughly the typical deviation of data values from the mean: \(s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}\) |
| | IQR | Difference between the 75th and 25th percentiles: \(\text{IQR} = Q_3 - Q_1\) |
| | Range | Maximum value minus minimum value: \(\text{Range} = \max(x) - \min(x)\) |
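The measures in the table can be computed as sketched below. The data frame toy and its column score are hypothetical and exist only for illustration. One detail worth noticing is that range() returns the minimum and maximum as two numbers, so the single-number range defined in the table requires taking their difference.

```r
# Hypothetical data frame, used only for illustration
toy <- data.frame(score = c(2, 4, 4, 5, 7, 9, 11))

# Measures of center
center_mean   <- mean(toy$score)
center_median <- median(toy$score)

# Measures of spread
spread_sd  <- sd(toy$score)
spread_iqr <- IQR(toy$score)

# range() gives c(min, max); diff() turns that into max - min
min_max      <- range(toy$score)
spread_range <- diff(min_max)

# The standard deviation formula from the table, written out by hand
n <- length(toy$score)
sd_by_hand <- sqrt(sum((toy$score - center_mean)^2) / (n - 1))
```

The hand calculation and sd() agree because sd() divides by n - 1, matching the formula in the table.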
Question #1: For this question you should use the toy data set created below:
example_data = data.frame(x1 = c(1,3,3,4,5,6),
x2 = c(1,3,3,4,5,60))
Part A: Use R to find the mean and median of the variable x1. Do both of these measures of center seem to accurately describe the “typical value” of this variable’s distribution? Briefly explain.
Part B: Use R to find the mean and median of x2. Do both of these measures of center seem to accurately describe the “typical value” of this variable’s distribution? Briefly explain.
Part C: In x2, the value 60 can be considered an outlier, an extreme observation that is far away from the majority of the observed data. Based upon what you’ve seen in Parts A and B, which measure of center, the mean or median, appears to be more greatly influenced by the presence of outliers?
Part D: Use R to find the standard deviations of x1 and x2. How do outliers impact the standard deviation of a variable?
\(~\)
Descriptive statistics are used to summarize the most important features of a variable, but we’ll also want to know the shape of the variable’s distribution.
Histograms and boxplots are the primary data visualizations you should consider for a single quantitative variable, with histograms being more useful for judging shape but less useful for understanding center and spread, and boxplots providing a nice balance of information related to center, spread, and shape.
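To make the link between a boxplot and the descriptive statistics concrete, the sketch below computes the quantities a boxplot displays for a made-up vector (the values in x are hypothetical). It assumes the common 1.5 × IQR convention for flagging points beyond the whiskers, which is the default used by geom_boxplot().

```r
# Hypothetical values; the last one is deliberately extreme
x <- c(12, 15, 16, 18, 19, 21, 22, 24, 25, 48)

# The box spans Q1 to Q3, with a line drawn at the median
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
box_iqr <- unname(q[3] - q[1])

# Points farther than 1.5 * IQR beyond the box are drawn individually
lower_fence <- unname(q[1]) - 1.5 * box_iqr
upper_fence <- unname(q[3]) + 1.5 * box_iqr
flagged <- x[x < lower_fence | x > upper_fence]

q
box_iqr
flagged
```

Reading a boxplot therefore gives the median (a measure of center) and the IQR (a measure of spread) directly, along with a rough sense of skew and outliers.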
Below is an example of how to create each visualization using ggplot():
## Histogram example
ggplot(data = college_majors, mapping = aes(x = Per_Male)) + geom_histogram(bins = 20)
## Boxplot example
ggplot(data = college_majors, mapping = aes(x = Per_Male)) + geom_boxplot()
A few things to note:
- We used bins = 20 to specify that we wanted a histogram that uses 20 bins.
Graphing the distribution of a variable to show its shape helps statisticians identify skew and outliers, both of which will influence the hypothesis testing approaches used (a topic we’ll discuss later on).
Below are some examples:
Question #2: For this question you should use the Happy Planet data set, which is available at the URL below:
Part A: Create a histogram of the variable HPI, which records each country’s composite happiness index score. Try making histograms with 10, 15, and 25 bins to gauge how the number of bins influences how the histogram looks. After doing so, describe the distribution of “HPI” using one of the four shapes given in this section’s examples.
Part B: Describe the shape of the variable LifeExpectancy. Your final answer should create a histogram displaying the variable and include a comment describing the shape of the distribution.
Part C: Create a boxplot of the variable LifeExpectancy. Use this boxplot to find approximate values of one measure of center and one measure of spread.
\(~\)
As we’ve discussed, a statistical analysis should include an assessment of the sample, descriptive statistics, data visualizations, and inference (hypothesis testing). For one-sample categorical data, we used StatKey simulations to help us estimate the \(p\)-value in our hypothesis tests. In this section of the lab we’ll do the same for one-sample quantitative data. Next week we’ll learn how to perform similar hypothesis tests in R.
The first difference between one-sample categorical data and one-sample quantitative data is the information that is needed to create the null distribution by simulating outcomes that could have been observed had the null hypothesis been true.
After we’ve uploaded our data and entered the value of our null hypothesis, we can use StatKey to simulate outcomes that could have been observed had our null hypothesis been true. We then follow the same procedure to calculate the \(p\)-value. That is, we find the proportion of outcomes at least as extreme as the mean observed in the real sample by checking one of the “left tail”, “two-tail”, or “right tail” boxes, then clicking and adjusting the value below the x-axis to match the observed sample mean.
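For readers curious about what such a simulation involves, the sketch below shows one common approach: shift the sample so that its mean equals the null value, then repeatedly resample with replacement and record each resample's mean. This is an illustration under stated assumptions, not a description of StatKey's exact internals; the sample obs, the null value null_mean, and the number of simulations are all hypothetical.

```r
set.seed(1)  # makes the simulated results reproducible

# Hypothetical sample and null hypothesis value, used only for illustration
obs <- c(14, 9, 12, 17, 11, 15, 13, 10, 16, 18)
null_mean <- 12

# Shift the data so the null hypothesis is true for the resampled "population"
shifted <- obs - mean(obs) + null_mean

# Simulate sample means that could have been observed under the null
sim_means <- replicate(5000, mean(sample(shifted, replace = TRUE)))

# p-value: proportion of simulated means at least as extreme as the observed mean
obs_mean <- mean(obs)
p_right    <- mean(sim_means >= obs_mean)
p_left     <- mean(sim_means <= obs_mean)
p_two_tail <- mean(abs(sim_means - null_mean) >= abs(obs_mean - null_mean))
```

The three p-values correspond to the “right tail”, “left tail”, and “two-tail” choices, and which one is appropriate depends on the alternative hypothesis.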
Question #3: Click on this link to download the “Tips” data set. These data were collected by a server working in a chain restaurant in a suburb of New York City. Each row records information about a table they served.
\(~\)
In this section you will analyze a few variables from a random sample of \(n=200\) patients admitted to the intensive care unit (ICU) of a research hospital affiliated with Carnegie Mellon University (CMU):
The relevant variables for your analyses of these data are:
Question #4:
\(~\)
Question #5: