This lab introduces R and RStudio as well as a few procedures we’ll use in future class sessions.

Directions (read before starting)

  1. Please work together with your assigned partner(s). Make sure you all fully understand each topic or example before moving on.
  2. Record your answers to lab questions separately from the lab’s examples. Everyone should have their own copy of the group’s responses, and each individual should turn in this copy, even if it’s identical to that of their lab partners.
  3. Ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Onboarding

Packages

While many of the functions we’ll use are contained in the default installation of R, others are stored in external libraries known as “packages”.

The first package we’ll use is ggplot2, which is used to create professional quality data visualizations. Before you can use a package, it must be installed on the computer you’re working on. You can do this using install.packages():

install.packages("ggplot2")

Installing a package is similar to downloading an app - you only ever need to download it once, but you’ll need to load it each time you want to use it. As a result, you’ll frequently encounter code that looks like this:

# install.packages("ggplot2")
library(ggplot2)

Here the installation command will not be run due to the comment, but the library() function, which loads the ggplot2 package into your current R session, will be run.
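A common variation, sketched below, installs the package only when it is missing, so the same script can be re-run without commenting lines in and out. This is an optional pattern rather than something the lab requires; requireNamespace() is a base R function that reports whether a package is available.

```r
# Install ggplot2 only if it is not already available on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package into the current R session
library(ggplot2)
```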

\(~\)

The “Grammar of Graphics”

The ggplot2 package uses a layer-based framework to build data visualizations. We can understand layering via the following sequence of examples:

## Read the data from our last lab
college_majors <- read.csv("https://remiller1450.github.io/data/majors.csv")

## Example #1.1 - nothing (just a base layer)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income))

## Example #1.2 - add a layer displaying raw data
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point()

## Example #1.3 - add another layer (regression line)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

## Example #1.4 - add another layer (reference text)
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point() +
  geom_smooth(method = "lm") + 
  annotate("text", x=25, y=80000, label = "Strong, positive relationship")
## `geom_smooth()` using formula = 'y ~ x'

Notice how each layer is added to an existing graph using +. The data frame and variable names from the initial use of ggplot() (the first layer) are passed forward to subsequent layers. This allows + geom_point() to create a scatter plot despite being given no arguments.
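Because a plot is built by adding layers, the base layer can also be stored in an object and reused. The sketch below illustrates this with a small made-up data frame (the name toy and its columns x and y are assumptions for illustration, not part of the lab's data).

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Base layer: sets the data and the aesthetic mappings, but draws no geoms
base_plot <- ggplot(data = toy, mapping = aes(x = x, y = y))

# Each added layer inherits the data and mappings from the base layer
scatter <- base_plot + geom_point()
scatter_with_line <- scatter + geom_smooth(method = "lm", formula = y ~ x)

# Printing a plot object displays it
scatter_with_line
```

Supplying formula = y ~ x explicitly should also suppress the message that geom_smooth() otherwise prints about the formula it chose.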

\(~\)

Terminology

The guiding philosophy of ggplot is to define data visualizations grammatically using:

  1. Aesthetic mappings - relations between visual cues and variable names given inside of aes()
    • x = Per_Male instructs ggplot to relate each value of Per_Male to a position on the x-axis.
  2. Geometric elements - elements you actually see on the graphic (i.e., points, lines, etc.)
    • + geom_point() instructs ggplot to use points to display the aesthetic mappings you defined inside of aes()

Sometimes we’ll want a visual cue to be mapped to a variable, but other times we won’t. What difference do you see in the following examples?

## Example 2.1
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = Category)) + geom_point() + labs(title = "Example 2.1")

## Example 2.2
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income, color = "blue")) + geom_point() + labs(title = "Example 2.2")

## Example 2.3
ggplot(data = college_majors, mapping = aes(x = Per_Male, y = Bach_Med_Income)) + geom_point(color = "blue") + labs(title = "Example 2.3")

\(~\)

Lab

At this point you should begin working independently with your assigned partner(s) using a pair programming framework. Remember that you should read the lab’s content, not just the questions, and you should all agree on an answer before moving on.

Describing One-Sample Quantitative Data

Last week we introduced the concept of one-sample data, which involves treating all of the observations in our data set as a single sample.

The table below summarizes various things we might include in an analysis of one-sample data depending upon whether the variable of interest is categorical or quantitative:

| | One-sample Categorical | One-sample Quantitative |
|---|---|---|
| Descriptive Statistics | Frequencies - table(); Proportions - prop.table() | Mean - mean(); Median - median(); Standard Deviation - sd(); IQR - IQR(); Range - range() |
| Data Visualizations | Bar plot - geom_bar(); Pie chart - geom_bar() + coord_polar() | Histogram - geom_histogram(); Boxplot - geom_boxplot() |
| Hypothesis Tests | \(H_0: p =\_\_\) | \(H_0: \mu =\_\_\) |
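As a quick illustration of the categorical column, the sketch below applies table() and prop.table() to a made-up vector of responses (the vector answers is hypothetical and not part of any lab data set).

```r
# Hypothetical responses, used only for illustration
answers <- c("yes", "no", "yes", "yes", "no")

# Frequencies: how many times each category appears
counts <- table(answers)
counts

# Proportions: each frequency divided by the total number of observations
props <- prop.table(counts)
props
```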

As you might notice, describing quantitative data can be more complex than describing categorical data. Two of the most important attributes we should describe are a quantitative variable’s center, or where values typically fall, and its spread, or the extent to which values tend to vary relative to the center.

The table below provides a brief description of how each of these is calculated:

| Attribute | Measure | Description |
|---|---|---|
| Center | Mean | Sum of all values divided by the number of values: \(\bar{x} = \frac{1}{n} \sum x_i\) |
| | Median | Middle value when the data are ordered; with an even number of values, the average of the two middle values |
| Spread | Standard Deviation | Roughly the average deviation of the data from the mean: \(s = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}\) |
| | IQR | Difference between the 75th and 25th percentiles: \(\text{IQR} = Q_3 - Q_1\) |
| | Range | Maximum value minus minimum value: \(\text{Range} = \max(x) - \min(x)\) |
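To connect these formulas to R, the sketch below computes each measure for a small made-up vector (the values in x are hypothetical), with the mean and standard deviation also calculated "by hand" so they can be compared with mean() and sd(). Note that range() returns the minimum and maximum rather than their difference, so diff() is used to get a single number.

```r
# Hypothetical values, used only for illustration
x <- c(2, 4, 4, 5, 7, 8)
n <- length(x)

# Center
x_bar <- sum(x) / n      # mean "by hand"; matches mean(x)
med <- median(x)         # average of the two middle values (4 and 5)

# Spread
s <- sqrt(sum((x - x_bar)^2) / (n - 1))  # standard deviation "by hand"; matches sd(x)
iqr <- IQR(x)                            # Q3 - Q1
rng <- diff(range(x))                    # range() gives c(min, max); diff() subtracts them

c(mean = x_bar, median = med, sd = s, IQR = iqr, range = rng)
```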

Question #1: For this question you should use the toy data set created below:

example_data = data.frame(x1 = c(1,3,3,4,5,6),
                          x2 = c(1,3,3,4,5,60))
  • Part A: Use R to find the mean and median of the variable x1. Do both of these measures of center seem to accurately describe the “typical value” of this variable’s distribution? Briefly explain.
  • Part B: Now use R to find the mean and median of x2. Do both of these measures of center seem to accurately describe the “typical value” of this variable’s distribution? Briefly explain.
  • Part C: For the variable x2 the value 60 can be considered an outlier, an extreme observation that is far away from the majority of the observed data. Based upon what you’ve seen in Parts A and B, which measure of center, the mean or median, appears to be more greatly influenced by the presence of outliers?
  • Part D: Use R to find the standard deviations of x1 and x2. How do outliers impact the standard deviation of a variable?

\(~\)

Visualizing One-Sample Quantitative Data

Descriptive statistics are used to summarize the most important features of a variable, but we’ll also want to know the shape of the variable’s distribution.

Histograms and boxplots are the primary data visualizations you should consider for a single quantitative variable. Histograms are more useful for judging shape but less useful for understanding center and spread, while boxplots provide a nice balance of information related to center, spread, and shape.

Below is an example of how to create each visualization using ggplot():

## Histogram example
ggplot(data = college_majors, mapping = aes(x = Per_Male)) + geom_histogram(bins = 20)

## Boxplot example
ggplot(data = college_majors, mapping = aes(x = Per_Male)) + geom_boxplot()

A few things to note:

  1. We only need a single aesthetic mapping to display one-sample data. Here we map the “Per_Male” variable to positions on the x-axis.
  2. Histograms are bar charts created by organizing the numeric values into equally spaced bins. In our example we used the argument bins = 20 to specify that we wanted a histogram that uses 20 bins.
  3. Sometimes you’ll want to try out a few different values for the number of bins as this parameter can influence your visual assessment of the distribution’s shape.
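The effect of the bins argument can be explored with a sketch like the one below. It uses simulated values drawn from two clusters (the data frame toy and its column value are assumptions for illustration, not the lab's data); with very few bins the two clusters may blur together, while with many bins the bars can become noisy.

```r
library(ggplot2)

# Simulated values from two clusters, used only for illustration
set.seed(123)
toy <- data.frame(value = c(rnorm(100, mean = 10, sd = 2),
                            rnorm(100, mean = 20, sd = 2)))

# The same variable drawn with three different numbers of bins
hist_5  <- ggplot(data = toy, mapping = aes(x = value)) + geom_histogram(bins = 5)
hist_20 <- ggplot(data = toy, mapping = aes(x = value)) + geom_histogram(bins = 20)
hist_50 <- ggplot(data = toy, mapping = aes(x = value)) + geom_histogram(bins = 50)

# Display each version to compare them
hist_5
hist_20
hist_50
```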

Graphing the distribution of a variable to show its shape helps statisticians identify skew and outliers, both of which will influence the hypothesis testing approaches used (a topic we’ll discuss later on).

  • A variable is skewed right if its distribution has a long tail of large values.
  • Similarly, a variable is skewed left if its distribution has a long tail of small values.
  • If both tails of the distribution are similar we’ll describe the distribution as approximately symmetric.
  • Some symmetric distributions follow a bell-curve shape that we’ll soon describe as approximately Normal, in reference to the Normal distribution.

Below are some examples:
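One way to see these four shapes is to simulate them. The sketch below is only an illustration: the distributions chosen for each shape and the names shapes and shape_plot are assumptions, and facet_wrap() (a ggplot2 function not otherwise covered in this lab) is used to draw one histogram per shape.

```r
library(ggplot2)

set.seed(2024)
n <- 500

# Simulated values for each shape (all hypothetical, for illustration only)
shapes <- data.frame(
  shape = rep(c("Skewed right", "Skewed left", "Symmetric", "Approximately Normal"),
              each = n),
  value = c(rexp(n, rate = 1),             # long tail of large values
            -rexp(n, rate = 1),            # long tail of small values
            runif(n, min = -1, max = 1),   # similar tails, but flat rather than bell-shaped
            rnorm(n, mean = 0, sd = 1))    # bell-curve shape
)

# One histogram per shape, each with its own axis scales
shape_plot <- ggplot(data = shapes, mapping = aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~ shape, scales = "free")

shape_plot
```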

Question #2: For this question you should use the Happy Planet data set, which is available at the URL below:

https://remiller1450.github.io/data/HappyPlanet.csv

  • Part A: Create a histogram of the variable HPI, which records each country’s composite happiness index score. Try making histograms with 10, 15, and 25 bins to gauge how the number of bins influences how the histogram looks. After doing so, describe the distribution of “HPI” using one of the four shapes given in this section’s examples.
  • Part B: Repeat the same steps as Part A, but this time using the variable LifeExpectancy. Your final answer should create a histogram displaying the variable and include a comment describing the shape of the distribution.
  • Part C: Now create a boxplot of the variable LifeExpectancy. Use this boxplot to find approximate values of one measure of center and one measure of spread.

\(~\)

Hypothesis Testing for One-Sample Quantitative Data

As we’ve discussed, a statistical analysis should include an assessment of the sample, descriptive statistics, data visualizations, and inference (hypothesis testing). For one-sample categorical data, we used StatKey simulations to help us estimate the \(p\)-value in our hypothesis tests. In this section of the lab we’ll do the same for one-sample quantitative data. Next week we’ll learn how to perform similar hypothesis tests in R.

The first difference between one-sample categorical data and one-sample quantitative data is the information that is needed to create the null distribution by simulating outcomes that could have been observed had the null hypothesis been true.

  • For one-sample categorical data we only needed the sample size and the proportion stated in the null hypothesis
    • Recall the helper-hinderer example where we could simulate sets of \(n=16\) using \(H_0: p=0.5\), recording the observed proportions in the null distribution.
  • For one-sample quantitative data our null hypothesis pertains to the population’s mean, denoted by the symbol \(\mu\). Like before, the null distribution should be centered around the value given in the null hypothesis, but the spread will depend upon the amount of variability present in the sample data.
    • Therefore, we need to provide StatKey with both the value given in the null hypothesis and all of the numeric values of the variable of interest.
    • This involves uploading the CSV containing our data using the “Upload File” button, then selecting the column of interest.

After we’ve uploaded our data and entered the value of our null hypothesis we can use StatKey to simulate outcomes that could have been observed had our null hypothesis been true. We then follow the same procedure to calculate the \(p\)-value. That is, we find the proportion of outcomes at least as extreme as the mean observed in the real sample by checking one of the “left tail”, “two-tail”, or “right tail” boxes, then clicking and adjusting the value below the x-axis to match the observed sample mean.
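For intuition only, the logic of this simulation can be sketched in R. The sketch assumes a shift-and-resample approach (shifting the sample so its mean equals the null value, then resampling with replacement); StatKey's exact procedure may differ, and the data values and null value below are made up for illustration.

```r
# Hypothetical sample, used only for illustration (not the lab's data)
sample_values <- c(12, 15, 9, 14, 18, 11, 16, 13, 17, 15)
null_value <- 12                       # value stated in H0: mu = 12
observed_mean <- mean(sample_values)   # 14 for these values

# Shift the sample so its mean equals the null value, keeping its spread
shifted <- sample_values - observed_mean + null_value

# Simulate many sample means that could have been observed if H0 were true
set.seed(42)
null_means <- replicate(5000, mean(sample(shifted, replace = TRUE)))

# Right-tail p-value: proportion of simulated means at least as large as observed
p_value <- mean(null_means >= observed_mean)
p_value
```

The simulated means in null_means play the role of the null distribution: they are centered near the null value, and their spread reflects the variability in the sample.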

Question #3: Click on this link to download the “Tips” data set. These data were collected by a server working in a chain restaurant in a suburb of New York City. Each row records information about a table they served.

  • Part A: Many people feel that the minimum acceptable tip in a restaurant is 10%, reflecting below average/poor service. Perform a one-sided hypothesis test to evaluate whether there is statistical evidence that this server on average receives tip percentages exceeding 10%. Provide the one-sided \(p\)-value you estimate using StatKey along with a one-sentence conclusion.
  • Part B: According to several online sources, 15% is considered a standard tip, reflecting expected/satisfactory service. Perform a two-sided hypothesis test to evaluate whether there is statistical evidence that this server on average receives tip percentages differing from 15%. Provide the two-sided \(p\)-value you estimate using StatKey along with a one-sentence conclusion.

\(~\)

Practice (required)

In this section you will analyze a few variables from a random sample of \(n=200\) patients admitted to the intensive care unit (ICU) of a research hospital affiliated with Carnegie Mellon University (CMU):

https://remiller1450.github.io/data/ICUAdmissions.csv

The relevant variables for your analyses of these data are:

  • Systolic: The systolic blood pressure reading of the patient when they were admitted
  • Previous: 0 if this is the first time the patient was admitted or 1 if the patient has been re-admitted for the same underlying issue

Question #4:

  • Part A: Name two populations that might be of interest to the researchers who collected these data. One of these populations should be a group that you are highly confident these data provide an unbiased representation of. The other can be a population where these data might present minor-to-moderate sampling bias.
  • Part B: Suppose the researchers wanted to know if these data suggest the readmission rate of ICU patients at CMU hospitals differs from the national average, which is estimated to be 14.67%. Does this scenario involve one-sample categorical or one-sample quantitative data? Briefly explain.
  • Part C: Provide an appropriate set of descriptive statistics and at least one relevant data visualization for the research question described in Part B.
  • Part D: Perform an appropriate hypothesis test, using StatKey to estimate the \(p\)-value, that addresses the research question described in Part B. Your response should clearly state your null and alternative hypotheses using proper statistical notation, provide the \(p\)-value, and make a one-sentence conclusion that uses appropriate context and avoids generic phrases like “reject \(H_0\)”.

\(~\)

Question #5:

  • Part A: A systolic blood pressure of less than 120 mm Hg is considered healthy. Suppose the researchers want to know if their data suggest ICU patients on average have higher than normal systolic blood pressure when they are admitted. Does this scenario involve one-sample categorical or one-sample quantitative data? Briefly explain.
  • Part B: Provide an appropriate set of descriptive statistics and at least one relevant data visualization for the research question described in Part A.
  • Part C: Perform an appropriate hypothesis test, using StatKey to estimate the \(p\)-value, that addresses the research question described in Part A. Your response should clearly state your null and alternative hypotheses using proper statistical notation, provide the \(p\)-value, and make a one-sentence conclusion that uses appropriate context and avoids generic phrases like “reject \(H_0\)”.