R
This lab provides a brief overview of the functions used to perform the statistical tests commonly covered in an introductory statistics course.
If you are unfamiliar with the concept of hypothesis testing, I encourage you to read through my course notes on the topic.
Directions (Please read before starting)
One-sample tests are used to assess whether a summary statistic observed in the sample data is statistically different from a hypothesized value.
For categorical data, a common null hypothesis is \(H_0: p = p_0\), where \(p_0\) is a hypothesized proportion for the categorical outcome of interest. This hypothesis can be evaluated using a one-sample Z-test:
acs <- read.csv("https://remiller1450.github.io/data/EmployedACS.csv") ## Random sample of 1287 employed individuals from the American Community Survey
n_male = sum(acs$Sex == 1) ## Number of males among respondents
prop.test(x = n_male, n = nrow(acs), p = 0.5, alternative = "two.sided")
##
## 1-sample proportions test with continuity correction
##
## data: n_male out of nrow(acs), null probability 0.5
## X-squared = 3.1826, df = 1, p-value = 0.07443
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4975476 0.5528050
## sample estimates:
## p
## 0.5252525
or an exact binomial test:
binom.test(x = n_male, n = nrow(acs), p = 0.5, alternative = "two.sided")
##
## Exact binomial test
##
## data: n_male and nrow(acs)
## number of successes = 676, number of trials = 1287, p-value = 0.07439
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4975508 0.5528386
## sample estimates:
## probability of success
## 0.5252525
For either test, you must provide the numerator and denominator of the sample proportion (i.e., the count of males and the total sample size), as well as the hypothesized population proportion via the p argument.
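As a minimal sketch (using hypothetical counts rather than the ACS data), note that both functions accept raw counts directly, so you can run them without a data frame:

```r
## Hypothetical example: 60 "successes" out of 100 trials, testing H0: p = 0.5
res_z     <- prop.test(x = 60, n = 100, p = 0.5, alternative = "two.sided")
res_exact <- binom.test(x = 60, n = 100, p = 0.5, alternative = "two.sided")

res_z$estimate     ## sample proportion (0.6)
res_exact$p.value  ## exact p-value for H0: p = 0.5
```

The two tests generally give similar, but not identical, p-values: the Z-test relies on a normal approximation, while the binomial test is exact.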
For quantitative data, the one-sample \(t\)-test should be used to assess the hypothesis: \(H_0: \mu = \mu_0\), where \(\mu_0\) is a hypothesized mean.
t.test(x = acs$Income, mu = 40, alternative = "two.sided")
##
## One Sample t-test
##
## data: acs$Income
## t = 2.9449, df = 1286, p-value = 0.003289
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
## 41.50877 47.53074
## sample estimates:
## mean of x
## 44.51976
In this example, the quantitative variable is given as the x argument, and the hypothesized mean is given as mu.
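The alternative argument also accepts one-sided options ("greater" or "less"). A minimal sketch on simulated data (the values here are hypothetical, chosen only for illustration):

```r
set.seed(1)                          ## for reproducibility
x <- rnorm(50, mean = 45, sd = 10)   ## simulated incomes (hypothetical)

## Test H0: mu = 40 against the one-sided alternative Ha: mu > 40
t.test(x = x, mu = 40, alternative = "greater")
```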
Two-sample tests are used to assess whether a summary statistic differs between two sample groups. A common example is an A/B test, where experimental participants are randomly assigned to one of two conditions (A or B) and an outcome is recorded.
For categorical data, you might use a difference in proportions Z-test:
## First you'll need the numerator and denominator of each sample's proportion
ins_white = sum(acs$HealthInsurance == 1 & acs$Race == "white")
n_white = sum(acs$Race == "white")
ins_black = sum(acs$HealthInsurance == 1 & acs$Race == "black")
n_black = sum(acs$Race == "black")
## Z-test
prop.test(x = c(ins_white, ins_black), n = c(n_white, n_black))
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(ins_white, ins_black) out of c(n_white, n_black)
## X-squared = 0.14491, df = 1, p-value = 0.7034
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.04383785 0.07295584
## sample estimates:
## prop 1 prop 2
## 0.9283521 0.9137931
For quantitative data, you should use a two-sample t-test:
t.test(Income ~ Sex, data = acs)
##
## Welch Two Sample t-test
##
## data: Income by Sex
## t = -4.921, df = 1231.2, p-value = 9.776e-07
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -20.650616 -8.878211
## sample estimates:
## mean in group 0 mean in group 1
## 36.76471 51.52913
The syntax provided to t.test uses formula notation. The formula Income ~ Sex indicates that the quantitative outcome, “Income”, should be compared across the two groups created by the variable “Sex”.
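An equivalent call supplies each group's values as separate vectors via the x and y arguments; this sketch assumes Sex is coded 0/1, as in the ACS data:

```r
acs <- read.csv("https://remiller1450.github.io/data/EmployedACS.csv")  ## as loaded earlier

income_0 <- acs$Income[acs$Sex == 0]  ## incomes in group 0
income_1 <- acs$Income[acs$Sex == 1]  ## incomes in group 1
t.test(x = income_0, y = income_1)    ## same Welch test as Income ~ Sex
```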
Sometimes it is too simplistic to reduce a nominal categorical variable (many categories) into a binary variable (two categories) in order to use one of the aforementioned statistical tests. In these circumstances, you can consider a Chi-squared Test (either goodness of fit or association) or Fisher’s Exact Test (association):
## Goodness of fit Chi-squared test
chisq.test(x = table(acs$Race), p = c(0.05, 0.15, 0.1, 0.7))
##
## Chi-squared test for given probabilities
##
## data: table(acs$Race)
## X-squared = 54.6, df = 3, p-value = 8.356e-12
## Chi-squared test of association
chisq.test(x = table(acs$Race, acs$HealthInsurance))
##
## Pearson's Chi-squared test
##
## data: table(acs$Race, acs$HealthInsurance)
## X-squared = 25.378, df = 3, p-value = 1.287e-05
## Fisher's exact test
fisher.test(x = table(acs$Race, acs$HealthInsurance))
##
## Fisher's Exact Test for Count Data
##
## data: table(acs$Race, acs$HealthInsurance)
## p-value = 0.0001573
## alternative hypothesis: two.sided
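Fisher's exact test is typically preferred when the Chi-squared approximation is questionable; a common rule of thumb flags any expected cell count below 5. A minimal sketch of how to check, using a hypothetical 2x2 table of counts:

```r
tab <- matrix(c(3, 10, 8, 40), nrow = 2)  ## hypothetical 2x2 table of counts
chisq.test(tab)$expected                  ## expected counts under independence
## One cell falls below 5 here, so fisher.test(tab) would be the safer choice
```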
Similarly, some studies will naturally require you to compare a quantitative outcome across more than two groups, in which case you should use one-way ANOVA:
anova_mod <- aov(Income ~ Race, data = acs)
summary(anova_mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## Race 3 56523 18841 6.291 0.000309 ***
## Residuals 1283 3842204 2995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Recall that a statistically significant ANOVA test should be followed up with post-hoc testing, such as Tukey’s Honest Significant Differences:
TukeyHSD(anova_mod)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Income ~ Race, data = acs)
##
## $Race
## diff lwr upr p adj
## black-asian -26.750588 -46.403386 -7.0977905 0.0026929
## other-asian -30.045124 -50.285682 -9.8045650 0.0008099
## white-asian -15.697207 -31.049147 -0.3452658 0.0428292
## other-black -3.294535 -22.402516 15.8134456 0.9708621
## white-black 11.053382 -2.771118 24.8778819 0.1680528
## white-other 14.347917 -0.300105 28.9959390 0.0573859
The final combination of variables yet to be considered in this tutorial is two quantitative variables.
In this situation, the correlation coefficient can form the basis of a hypothesis test of \(H_0: \rho = 0\), or no correlation between the variables being studied:
cor.test(x = acs$HoursWk, y = acs$Income, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: acs$HoursWk and acs$Income
## t = 12.893, df = 1285, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2891509 0.3859513
## sample estimates:
## cor
## 0.3384462
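The default method, Pearson's correlation, measures linear association; for skewed variables like income, a rank-based measure may be more appropriate, and cor.test supports this via its method argument. A minimal sketch on simulated data (the values are hypothetical):

```r
set.seed(1)
hours  <- runif(100, min = 20, max = 60)     ## hypothetical weekly hours
income <- 1.2 * hours + rnorm(100, sd = 10)  ## hypothetical incomes, positively related

cor.test(x = hours, y = income, method = "spearman")
```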
This tutorial is intended to provide a quick reference to several functions used to perform common statistical tests.
It does not cover:
For each scenario, write out a reasonable null hypothesis and evaluate it using the proper statistical test.
Question #1: The “infant heart” data set documents the results of an experiment investigating two developmental indices, PDI and MDI, after random assignment to one of two surgical approaches, low-flow bypass and circulatory arrest.
ih <- read.csv("https://remiller1450.github.io/data/InfantHeart.csv")
Part A: Use a statistical test to determine if there is sufficient statistical evidence to conclude that one of the two surgeries yields significantly greater PDI outcomes (indicating better physical development).
Part B: Use a statistical test to determine if there is sufficient statistical evidence to conclude that an infant’s PDI and MDI scores are related to each other.
Part C: Use a statistical test to determine if there is sufficient statistical evidence to conclude that a larger share of male infants were assigned to the low-flow bypass group than to the circulatory arrest group.
Question #2: The “commute tracker” data set is a sample of daily commutes tracked by a GPS app for a worker in the greater Toronto area.
ct <- read.csv("https://remiller1450.github.io/data/CommuteTracker.csv")
Part A: Use a statistical test to determine if there is sufficient statistical evidence to conclude that the commuter is more likely to take Hwy 407 on certain days of the week.
Part B: Use a statistical test to determine if there is sufficient statistical evidence to conclude that average value of “MaxSpeed” differs by month. If it does, conduct an appropriate follow-up test.
Part C: Use a statistical test to determine if there is sufficient statistical evidence to conclude that the commuter is more likely to fail to record a commute whose destination is “Home” (as opposed to one whose destination is “GSK”, the commuter’s place of employment).