This lab provides a brief overview of functions used to perform statistical tests that are frequently discussed in an introductory statistics course.
If you are unfamiliar with hypothesis testing, I encourage you to look over my course notes on the topic.
Also note that this lab is entirely optional. The format is closer to that of a reference guide, and all questions appear at the end of the document; however, they’ll require you to be familiar with the earlier sections.
\(~\)
One-sample tests are used to decide whether a summary statistic from your sample data is statistically different from a hypothesized value.
For categorical data, a common null hypothesis is \(H_0: p = p_0\), where \(p_0\) is a hypothesized proportion for the categorical outcome of interest. This hypothesis can be evaluated using a one-sample Z-test:
acs <- read.csv("https://remiller1450.github.io/data/EmployedACS.csv") ## Random sample of 1287 employed individuals from the American Community Survey
n_male = sum(acs$Sex == 1) ## Number of males among respondents
n_total = nrow(acs) ## Total sample size
prop.test(x = n_male, n = n_total, p = 0.5, alternative = "two.sided")
##
## 1-sample proportions test with continuity correction
##
## data: n_male out of n_total, null probability 0.5
## X-squared = 3.1826, df = 1, p-value = 0.07443
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4975476 0.5528050
## sample estimates:
## p
## 0.5252525
or an exact binomial test:
binom.test(x = n_male, n = n_total, p = 0.5, alternative = "two.sided")
##
## Exact binomial test
##
## data: n_male and n_total
## number of successes = 676, number of trials = 1287, p-value = 0.07439
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4975508 0.5528386
## sample estimates:
## probability of success
## 0.5252525
For either test, you must provide the numerator and denominator of the sample proportion (i.e., the count of males and the total sample size), as well as the hypothesized population proportion, p.
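If you’re curious where the X-squared statistic comes from, it can be reproduced by hand. The sketch below reuses the n_male and n_total objects defined above; the 0.5 subtracted inside the square is the continuity correction:
n_expected = n_total * 0.5 ## Expected count of males under the null hypothesis
x_sq = (abs(n_male - n_expected) - 0.5)^2 / (n_total * 0.5 * (1 - 0.5)) ## Continuity-corrected statistic
x_sq ## Should match the X-squared value reported by prop.test()
pchisq(x_sq, df = 1, lower.tail = FALSE) ## Two-sided p-value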
For quantitative data, the one-sample \(t\)-test should be used to assess the hypothesis: \(H_0: \mu = \mu_0\), where \(\mu_0\) is a hypothesized mean.
t.test(x = acs$Income, mu = 40, alternative = "two.sided")
##
## One Sample t-test
##
## data: acs$Income
## t = 2.9449, df = 1286, p-value = 0.003289
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
## 41.50877 47.53074
## sample estimates:
## mean of x
## 44.51976
In this example, a vector containing the quantitative variable is given as the x argument, and the hypothesized mean is defined by the mu argument.
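As a quick check on what t.test is doing, the t-statistic can be computed from first principles. This sketch assumes acs$Income contains no missing values (if it did, you’d need to remove them and adjust the sample size accordingly):
x_bar = mean(acs$Income) ## Sample mean
se = sd(acs$Income) / sqrt(length(acs$Income)) ## Estimated standard error of the mean
(x_bar - 40) / se ## Should match the t value reported by t.test()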
\(~\)
Two-sample tests are used to assess whether two different sample groups are statistically different. A common example is A/B testing, where experimental participants are randomly assigned into one of two conditions (A or B) and an outcome is recorded.
For categorical data, you might use a difference in proportions Z-test:
## First you'll need the numerator and denominator of each sample's proportion
ins_white = sum(acs$HealthInsurance == 1 & acs$Race == "white")
n_white = sum(acs$Race == "white")
ins_black = sum(acs$HealthInsurance == 1 & acs$Race == "black")
n_black = sum(acs$Race == "black")
## Z-test
prop.test(x = c(ins_white, ins_black), n = c(n_white, n_black))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(ins_white, ins_black) out of c(n_white, n_black)
## X-squared = 0.14491, df = 1, p-value = 0.7034
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04383785 0.07295584
## sample estimates:
## prop 1 prop 2
## 0.9283521 0.9137931
Notice that the x argument is given a vector containing two values; in this example, they are the numbers of insured individuals in each group. Similarly, the n argument is given the total number of individuals belonging to each group (also as a vector containing two elements).
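As an aside, the same counts can be pulled from a two-way table rather than computed one sum at a time. This is only a sketch; it relies on HealthInsurance being coded 0/1 and Race taking the values "white" and "black", as in the code above:
tab = table(acs$Race, acs$HealthInsurance) ## Rows are race categories, columns are 0/1 insurance status
prop.test(x = tab[c("white", "black"), "1"], n = rowSums(tab)[c("white", "black")]) ## Same test as above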
For quantitative data, you should use a two-sample \(t\)-test:
t.test(Income ~ Sex, data = acs)
##
## Welch Two Sample t-test
##
## data: Income by Sex
## t = -4.921, df = 1231.2, p-value = 9.776e-07
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -20.650616 -8.878211
## sample estimates:
## mean in group 0 mean in group 1
## 36.76471 51.52913
The syntax relating “Income” and “Sex” in this example uses formula notation. Here, the formula Income ~ Sex indicates the quantitative outcome, “Income”, should be evaluated according to the two groups created by the variable “Sex”. You might read the ~ symbol as “is predicted by”, so this entire formula can be read as “Income is predicted by Sex”.
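Equivalently, you can skip the formula and pass each group as its own vector. The sketch below assumes Sex is coded 0/1, which matches the group labels in the output above:
t.test(x = acs$Income[acs$Sex == 0], y = acs$Income[acs$Sex == 1]) ## Same Welch test, two-vector syntax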
\(~\)
One- and two-sample testing procedures assume all categorical variables are binary. However, it is often too simplistic to reduce a nominal categorical variable (many categories) to a binary variable (two categories).
In these circumstances, you might consider a Chi-squared Test (either goodness of fit or association) or Fisher’s Exact Test (association):
## Goodness of fit Chi-squared test
chisq.test(x = table(acs$Race), p = c(0.05, 0.15, 0.1, 0.7))
##
## Chi-squared test for given probabilities
##
## data: table(acs$Race)
## X-squared = 54.6, df = 3, p-value = 8.356e-12
## Chi-squared test of association
chisq.test(x = table(acs$Race, acs$HealthInsurance))
##
## Pearson's Chi-squared test
##
## data: table(acs$Race, acs$HealthInsurance)
## X-squared = 25.378, df = 3, p-value = 1.287e-05
## Fisher's exact test
fisher.test(x = table(acs$Race, acs$HealthInsurance))
##
## Fisher's Exact Test for Count Data
##
## data: table(acs$Race, acs$HealthInsurance)
## p-value = 0.0001573
## alternative hypothesis: two.sided
For goodness of fit testing, the x argument should be a one-way table, while for tests of association it should be a two-way table. The p argument is only used for goodness of fit testing, and it indicates the hypothesized proportions.
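It can also be useful to look at the expected counts behind the test of association, for example to verify the common rule of thumb that all expected cell counts are at least 5. The sketch below stores the test result and extracts its expected element:
assoc_test = chisq.test(x = table(acs$Race, acs$HealthInsurance)) ## Store the test result
assoc_test$expected ## Expected cell counts under the null hypothesis of no association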
\(~\)
Similarly, some studies involve the comparison of a numeric outcome across several groups. In these settings you should use one-way ANOVA:
anova_mod <- aov(Income ~ Race, data = acs)
summary(anova_mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## Race 3 56523 18841 6.291 0.000309 ***
## Residuals 1283 3842204 2995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice how aov is another hypothesis testing function that uses formula notation. It turns out that one-way ANOVA is equivalent to a two-sample t-test when there are only two groups.
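You can verify this equivalence yourself by comparing the ANOVA F-test on the two-level Sex variable with a pooled two-sample t-test; the F statistic will equal the square of the t statistic, and the p-values will match. Note that the sketch below uses var.equal = TRUE, since the equivalence holds for the pooled t-test rather than the Welch version shown earlier:
summary(aov(Income ~ factor(Sex), data = acs)) ## One-way ANOVA with two groups
t.test(Income ~ Sex, data = acs, var.equal = TRUE) ## Pooled two-sample t-test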
\(~\)
A statistically significant ANOVA test indicates that at least one pair of groups differs more than would be expected by random chance.
This finding should be followed up by post-hoc testing to determine which groups are different from each other. One method for this is Tukey’s Honest Significant Differences:
TukeyHSD(anova_mod)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Income ~ Race, data = acs)
##
## $Race
## diff lwr upr p adj
## black-asian -26.750588 -46.403386 -7.0977905 0.0026929
## other-asian -30.045124 -50.285682 -9.8045650 0.0008099
## white-asian -15.697207 -31.049147 -0.3452658 0.0428292
## other-black -3.294535 -22.402516 15.8134456 0.9708621
## white-black 11.053382 -2.771118 24.8778819 0.1680528
## white-other 14.347917 -0.300105 28.9959390 0.0573859
The input to TukeyHSD is an object containing the results of our ANOVA test. We see that the function provides pairwise confidence intervals and p-values for all combinations of groups. These are also adjusted for multiple comparisons (to preserve the family-wise Type-1 error rate at \(\alpha = 0.05\)).
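If a graphical summary is easier to digest, the object returned by TukeyHSD has a plot method that draws each pairwise confidence interval:
plot(TukeyHSD(anova_mod)) ## One interval per pairwise comparison; intervals crossing zero are not significant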
\(~\)
The final combination of variables yet to be considered in this tutorial is two quantitative variables.
In this situation, the correlation coefficient can form the basis of a hypothesis test of \(H_0: \rho = 0\), or no correlation between the variables being studied:
cor.test(x = acs$HoursWk, y = acs$Income, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: acs$HoursWk and acs$Income
## t = 12.893, df = 1285, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2891509 0.3859513
## sample estimates:
## cor
## 0.3384462
Note that cor.test expects two vectors, x and y, of the same length. The method argument can be changed to calculate different types of correlation (e.g., Spearman rank correlation).
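For instance, a Spearman rank correlation test on the same variables looks like this (a sketch; exact = FALSE is used because tied ranks prevent an exact p-value):
cor.test(x = acs$HoursWk, y = acs$Income, method = "spearman", exact = FALSE) ## Rank-based alternative to Pearson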
\(~\)
This tutorial is intended to provide a quick reference to several functions used to perform common statistical tests; it is not a comprehensive guide to hypothesis testing.
\(~\)
For each scenario, write out a reasonable null hypothesis and evaluate it using the proper statistical test.
Question #1: The “infant heart” data set documents the results of an experiment investigating two developmental indices, PDI and MDI, measured after random assignment to one of two surgical approaches: low-flow bypass and circulatory arrest.
ih <- read.csv("https://remiller1450.github.io/data/InfantHeart.csv")
Part A: Use a statistical test to determine if there is sufficient statistical evidence to conclude that one of the two surgeries yields significantly greater PDI outcomes (indicating better physical development).
Part B: Use a statistical test to determine if there is sufficient statistical evidence to conclude that an infant’s PDI and MDI scores are related to each other.
Part C: Use a statistical test to determine if there is sufficient statistical evidence that a larger share of male infants was assigned to the low-flow bypass group than to the circulatory arrest group.
\(~\)
Question #2: The “commute tracker” data set is a sample of daily commutes tracked by a GPS app for a worker in the greater Toronto area.
ct <- read.csv("https://remiller1450.github.io/data/CommuteTracker.csv")
Part A: Use a statistical test to determine if there is sufficient statistical evidence to conclude that the commuter is more likely to take Hwy 407 on certain days of the week.
Part B: Use a statistical test to determine if there is sufficient statistical evidence to conclude that average value of “MaxSpeed” differs by month. If it does, decide which months are statistically different from each other.
Part C: Use a statistical test to determine if there is sufficient statistical evidence to conclude that the commuter is more likely to not record a commute whose destination is “Home” (as opposed to one whose destination is “GSK”, the commuter’s place of employment). Hint: If each trip were equally likely to be missing, you’d expect half of the recorded commutes to be going “Home”.