This lab introduces R and RStudio, as well as a few procedures we'll use in future class sessions.
Directions (read before starting)
In our first lab you wrote your code in an R Script; however, RStudio supports several other file types, including R Markdown, a framework that allows R code, its output, and markdown text to coexist seamlessly in the same document.
If you recently installed RStudio, it should come with R Markdown already available. You can check this by navigating:
File -> New File -> R Markdown
If you do not see “R Markdown” displayed in this menu you’ll need to
install the rmarkdown
package:
# install.packages("rmarkdown")
library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 4.3.3
At the top of an R Markdown document is the header. After the end of the header you'll see a code chunk, followed by a section header and then ordinary text, as sketched below.
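This is a generic sketch rather than the exact template RStudio generates; the title, author, and output fields are placeholders, and the lines made of three backticks mark where a code chunk begins and ends.

---
title: "My Report"
author: "Your Name"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## A Section Header

Ordinary markdown text describing your analysis goes here.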
The purpose of R Markdown is to seamlessly blend R code, output, and written text. This is accomplished by “knitting” your file into a completed report. You can knit a file using the “Knit” button (blue yarn ball icon) or, on Windows, by pressing Ctrl+Shift+K.
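If you prefer working from the console, you can also knit a file with the rmarkdown::render() function (the file name below is just a placeholder for whatever you named your R Markdown document):

# Knit (render) an R Markdown file from the R console
rmarkdown::render("my_report.Rmd")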
A few things to know about knitting: install.packages() and View() cannot be used in the environment where the document is knit, so you should comment out or remove these commands before knitting to prevent errors.
At this point you should begin working independently with your assigned partner(s) using a pair programming framework. Remember that you should read the lab's content, not just the questions, and you should all agree on an answer before moving on.
When analyzing one-sample categorical data we compare the observed sample proportion, \(\hat{p}\), with values that would be expected under a null hypothesis of the form \(H_0: p = \_\_\).
The following sections will cover the one-sample \(Z\)-test, an approach based upon a Normal probability model that works well for large samples, and the exact binomial test, a more computationally expensive approach that calculates the exact probability of each possible value of the observed proportion (rather than approximating these probabilities with a smooth curve like the \(Z\)-test does).
Recall from our previous lecture that the one-sample \(Z\)-test involves two steps:
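(The steps below are restated in their standard form, paraphrased rather than quoted from the lecture slides.)

1. Calculate the standardized test statistic \(Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\), where \(p_0\) is the proportion specified by the null hypothesis.
2. Find the \(p\)-value by comparing this statistic to the standard Normal distribution in the direction given by the alternative hypothesis.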
The prop.test() function in R can be used to find the \(p\)-value produced by the one-sample \(Z\)-test. The code below uses prop.test() to replicate the one-sample \(Z\)-test on the infant toy choice data from our previous lecture:
prop.test(x = 14, n = 16, p = 0.5, alternative = "greater", correct = FALSE)
##
## 1-sample proportions test without continuity correction
##
## data: 14 out of 16, null probability 0.5
## X-squared = 9, df = 1, p-value = 0.00135
## alternative hypothesis: true p is greater than 0.5
## 95 percent confidence interval:
## 0.6837869 1.0000000
## sample estimates:
## p
## 0.875
A few details to unpack:
- The numerator (the number of “successes”) is supplied as the x argument and the denominator (the sample size for one-sample data) as the n argument.
- The value of the proportion under the null hypothesis is supplied as the p argument.
- By default the alternative argument specifies a two-sided test, but we could set it to "less" or "greater" for one-sided tests.
- By default prop.test() applies the Yates' continuity correction, but we can turn this off using correct = FALSE.
- You generally should not use correct = FALSE in a real data analysis, but we are only doing it in this lab to see that the \(p\)-values from prop.test() exactly match the ones we calculate ourselves (a sketch of this check appears below).
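For example, the short sketch below (not part of the original handout) recomputes the one-sided \(p\)-value for the infant toy choice data directly from the Normal model; it should match the \(p\)-value shown in the prop.test() output above:

p_hat <- 14/16                                    # observed sample proportion
z <- (p_hat - 0.5) / sqrt(0.5 * (1 - 0.5) / 16)   # standardized test statistic (z = 3 here)
pnorm(z, lower.tail = FALSE)                      # one-sided (greater) p-value, approximately 0.00135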
Question #1: In Lab 3 you worked with the CMU ICU admissions data set, a random sample of \(n=200\) patients. Use prop.test() to perform the one-sample \(Z\)-test on these data and confirm that the \(p\)-value you get matches the one you found using StatKey.
The Normal probability model that the one-sample Z-test relies upon is only reasonable when at least 10 observations in the sample belong to each category involved in the proportions. Or, put differently, when \(n\cdot p \geq 10\) and \(n\cdot (1-p) \geq 10\).
The primary issue with the Normal model in these small-sample situations is that the null distribution contains a small number of discrete possibilities that cannot be reliably approximated by a continuous curve. The exact binomial test overcomes this by using the binomial probability distribution to calculate the probability of each discrete outcome present in the null distribution.
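To illustrate with the infant toy choice data (a brief sketch using the counts from earlier), the exact one-sided \(p\)-value is the total binomial probability of observing 14 or more “helper” choices out of 16 when the null value \(p = 0.5\) is true:

# P(X >= 14) when X ~ Binomial(16, 0.5); this should match the binom.test() p-value below
sum(dbinom(14:16, size = 16, prob = 0.5))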
The example below uses binom.test()
to perform the exact
binomial test on the data from our helper-hinderer example:
binom.test(x = 14, n = 16, p = 0.5, alternative = "greater")
##
## Exact binomial test
##
## data: 14 and 16
## number of successes = 14, number of trials = 16, p-value = 0.00209
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.6561748 1.0000000
## sample estimates:
## probability of success
## 0.875
You should note that this function uses the same arguments/syntax as prop.test(), but the \(p\)-value we get is slightly different.
You should also notice that we only observed 2 choices of the “hinderer” toy in the study, so the large sample condition of the \(Z\)-test is not met, leading us to prefer the exact binomial test for an analysis of these data.
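As a quick check (simple arithmetic, not part of the original handout):

16 * 0.875        # 14 infants chose the "helper" toy
16 * (1 - 0.875)  # only 2 chose the "hinderer" toy, well below the threshold of 10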
Question #2: People who receive a liver transplant have an 89% chance of surviving at least one year. A medical consultant seeking to attract patients advertises that 59 of 62 patients she has worked with have survived at least one year, a death rate of less than half the national average (4.8% vs. 11%).
- Perform the appropriate test using prop.test(). Provide the R code for the test along with a one-sentence conclusion.
- Use prop.test() to confirm that you arrive at the same \(p\)-value that you found in Part A, and also use binom.test() to confirm you arrive at the same \(p\)-value you found in Part C. Is it surprising that these \(p\)-values are identical? Briefly explain.
When analyzing one-sample quantitative data we typically compare the observed sample mean, \(\overline{x}\), with what could have been expected under a null hypothesis of the form \(H_0: \mu = \_\_\).
The following sections will cover the one-sample \(T\)-test, an approach that relies upon Student’s \(t\)-distribution as a probability model, and the Wilcoxon Signed-Rank test, an approach that doesn’t assume any particular probability model and thus can be used in scenarios where the conditions for the \(T\)-test are not met.
The one-sample \(T\)-test is performed in almost exactly the same manner as the one-sample \(Z\)-test:
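(The steps below restate the standard procedure; they are paraphrased rather than quoted from lecture.)

1. Calculate the test statistic \(T = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}\), where \(\mu_0\) is the mean specified by the null hypothesis and \(s\) is the sample standard deviation.
2. Find the \(p\)-value by comparing this statistic to a \(t\)-distribution with \(n - 1\) degrees of freedom.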
The t.test() function is used to perform the one-sample \(T\)-test. The R code below performs this test on data from an experiment that compared the changes in LDL cholesterol subjects experienced when they ate an oat bran cereal as part of their breakfast relative to when they ate a corn flakes cereal. The variable Difference recorded each subject's change in LDL (mmol/L).
## Load data
oat_diet = read.csv("https://remiller1450.github.io/data/Oatbran.csv")
## Perform T-test
t.test(x = oat_diet$Difference, mu = 0, alternative = "two.sided")
##
## One Sample t-test
##
## data: oat_diet$Difference
## t = 3.3444, df = 13, p-value = 0.005278
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.1284606 0.5972537
## sample estimates:
## mean of x
## 0.3628571
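If you'd like to see where these numbers come from, the sketch below (not required for the lab) rebuilds the test statistic and two-sided \(p\)-value from summary statistics; it should reproduce the values of t and the \(p\)-value printed above:

# Reconstruct the one-sample T-test of H0: mu = 0 from summary statistics
xbar <- mean(oat_diet$Difference)                     # sample mean
s <- sd(oat_diet$Difference)                          # sample standard deviation
n <- length(oat_diet$Difference)                      # sample size
t_stat <- (xbar - 0) / (s / sqrt(n))                  # test statistic
2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)   # two-sided p-value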
Question #3: For this question you will use the CMU ICU admissions data set used previously in this lab and in Lab 3.
Use t.test() to perform the one-sample \(T\)-test on these data and confirm that the \(p\)-value you get matches the one you found using StatKey in Part B.
The Wilcoxon Signed-Rank Test is a non-parametric analog to the one-sample \(T\)-test for a single mean. It is often used when the sample size is small and it is unreasonable to assume the data came from a Normally distributed population.
Consider the oat bran study from the previous section where we performed a \(T\)-test on the variable Difference. Rather than using the numerical values themselves as the basis of the test, the Wilcoxon Signed-Rank Test first ranks each data-point based upon its absolute value, then the data-points are grouped according to their sign (positive or negative).
The test then compares the sum of each data-point’s rank multiplied by its sign against a null distribution to produce a \(p\)-value. This comparison amounts to a test of whether the median of the population is a specified value. We will not cover the precise steps of this test in detail, but you should recognize that if the null hypothesis is \(H_0: m = 0\) and the data are symmetric then the expected test statistic is zero.
The code below performs the Wilcoxon Signed-Rank Test on the oat bran data set:
## Signed Rank Test
wilcox.test(x = oat_diet$Difference, mu = 0)
##
## Wilcoxon signed rank exact test
##
## data: oat_diet$Difference
## V = 93, p-value = 0.008545
## alternative hypothesis: true location is not equal to 0
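To make the ranking idea concrete, the sketch below (not part of the original lab) computes the sum of the positive signed ranks by hand; it should match the V statistic reported in the output above:

# Rank the absolute differences, then sum the ranks of the positive values
d <- oat_diet$Difference[oat_diet$Difference != 0]   # drop exact zeros, as the test does
r <- rank(abs(d))                                     # rank each value by its absolute size
sum(r[d > 0])                                         # sum of positive ranks (the V statistic)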
Question #4: A veterinary anatomist measured nerve cell density at two different locations in the intestine: site 1, the mid-region of the jejunum, and site 2, the mesenteric region of the jejunum. The nerve cell density (thousands of cells per mm\(^3\)) was measured at each location, with the difference recorded as the variable diff, which is the focus of this analysis.
Create a histogram of the variable diff using 15 bins. Based upon this histogram and the sample size, explain whether you believe the one-sample \(T\)-test is appropriate for these data.
In this section you will practice the decision-making skills involved in choosing the proper statistical analysis for a given research question, including deciding among the tests introduced earlier in this lab.
Question #5: On Homework 1 you worked with the “TSA claims” data set, a random sample of \(n=5000\) claims made by travelers against the Transportation Security Administration (TSA) between 2003 and 2008, the first five years that the agency existed. In this question you will investigate whether the average paid claim amount is less than $200.
Along with the R code used to perform the test, provide a one-sentence conclusion and a one-sentence justification for the choice of test, being mindful of the assumptions involved.
Question #6: Homework 2 introduces the “ACS Employment” data set, a random sample of 1287 employed individuals collected as part of the American Community Survey (ACS) performed by the US Census Bureau. In this question you will investigate a claim by Forbes that 89% of US adults have health insurance.
Along with the R code used to perform the test, provide a one-sentence conclusion and a one-sentence justification for the choice of test, being mindful of the assumptions involved.