Lab #13 - Hypothesis Testing using Probability Models

\(~\)

Onboarding

Our previous lab introduced key concepts in hypothesis testing using simulations to approximate the null distribution of outcomes that might have been observed had the null hypothesis been true. You’ll recall that the observed outcome is compared against the null distribution to determine the \(p\)-value, which is a measure of how much evidence the data provide against the null hypothesis.

In this lab you’ll use probability models and standardized procedures rather than simulations to find the null distribution and \(p\)-value. To begin, we’ll revisit the topic of standardization, or transforming data onto a common scale by adjusting the mean and standard deviation.

When studying the correlation coefficient, we introduced the concept of Z-scores. For example, in Pearson’s height data, sons had an average height of \(\overline{x} = 63.3\) inches and a standard deviation of heights of \(s = 2.8\). Thus, the Z-score for a son with a height of 68.7 inches is given by:

\[Z = \frac{68.7 - 63.3}{2.8} \approx 1.9\]

We used this to conclude that a son who is 68.7 inches is 1.9 standard deviations above average.

This same idea applies to descriptive statistics such as sample means, proportions, and many others. Suppose we take a sample of \(n=5\) cases and obtain a sample mean of \(\overline{x} = 5\) and a sample standard deviation of \(s = 3\). If we assume the population’s mean is \(\mu = 2\) we can calculate the following Z-score:

\[Z = \frac{5 - 2}{\frac{3}{\sqrt{5}}} = 2.24\]

Thus we can conclude that this sample’s mean is 2.24 standard errors above the hypothesized mean. You might recall that we use the term “standard error” to describe the standard deviation of a descriptive statistic across different samples. See our notes on confidence intervals and the corresponding lab for details.

Going one step further, if the conditions are right for \(Z\) to follow a known probability distribution we can use that distribution to calculate the probability of observing a sample mean at least as extreme as \(\overline{x} = 5\) under the assumption we made of the population’s mean being \(\mu = 2\). Since we’re working with a single mean and using an estimate of the population’s standard deviation, the \(t\)-distribution is appropriate:

Thus, the probability of observing a sample mean at least as extreme as 2.24 standard errors from the hypothesized mean of 2, in either direction, is 0.089 (shaded in blue). By definition, this is the \(p\)-value.
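As a rough check, the calculation above can be reproduced in R. The sketch below assumes a two-sided \(p\)-value and \(n - 1 = 4\) degrees of freedom; the object names are chosen for illustration and are not part of any data set used in this lab.

```r
## Summary values from the example above
x_bar = 5    # sample mean
s = 3        # sample standard deviation
n = 5        # sample size
mu_0 = 2     # hypothesized population mean

## Standard error and test statistic
se = s / sqrt(n)
t_stat = (x_bar - mu_0) / se

## Two-sided p-value from a t-distribution with n - 1 degrees of freedom
p_value = 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

round(t_stat, 2)
round(p_value, 3)
```

After rounding, these should agree with the values of 2.24 and 0.089 given above.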

To conclude, we can use this example to establish a standardized hypothesis testing procedure based upon a test statistic of the form: \[\text{Test Statistic} = \frac{\text{observed statistic} - \text{hypothesized value}}{\text{standard error}}\]

The standard error can be found via the Central Limit theorem (or computational methods) and the distribution of the test statistic can be used to determine the \(p\)-value.

\(~\)

Lab

Throughout this lab you’ll use the “Infant Heart” data set, which comes from an experiment performed by researchers at Harvard Medical School who randomly assigned infants born with a congenital heart defect to one of two surgical approaches: low-flow bypass or circulatory arrest. The researchers followed each infant for two years, with the child’s MDI (mental development index) and PDI (psychomotor development index) scores at two years of age being the study’s primary outcomes.

## Libraries
library(dplyr)
library(ggplot2)

## Data used in the lab
infants = read.csv("https://remiller1450.github.io/data/InfantHeart.csv")

\(~\)

Single Proportions

Recall that statistical tests involving a single proportion use hypotheses of the form: \[H_0: p = \text{hypothesized value} \\ H_a: p \ne \text{hypothesized value}\]

As an example, we can hypothesize that congenital heart defects are equally likely for male and female infants, which would be a null hypothesis of \(H_0: p = 0.5\) (suggesting 50% of babies born with congenital heart defects are male).

Using the table() function we can see that our sample is mostly male:

table(infants$Sex)

## 
## Female   Male 
##     44     99

A hypothesis test will evaluate whether this could have happened due to chance (sampling variability), or if there’s statistical evidence that male babies are more likely to be born with congenital heart defects.

To perform the test “by hand” we’d need all of the components of the test statistic formula: \[\text{Test Statistic} = \frac{\text{observed statistic} - \text{hypothesized value}}{\text{standard error}} = \frac{99/143 - 0.5}{\sqrt{\frac{0.5(1-0.5)}{143}}}=4.6\]

Note: The standard error (the denominator of the test statistic) arises from the Central Limit theorem. See our previous lab on confidence intervals for a table of various standard error formulas.
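The arithmetic in this formula can be checked in R. This is only a sketch using the counts from the table above; the object names (p_hat, p_null, se_null) are placeholders chosen for illustration.

```r
## Counts from table(infants$Sex)
x = 99      # number of male infants
n = 143     # total number of infants (44 + 99)

## Observed proportion and hypothesized value
p_hat = x / n
p_null = 0.5

## Standard error computed under the null hypothesis
se_null = sqrt(p_null * (1 - p_null) / n)

## Test statistic
z = (p_hat - p_null) / se_null
round(z, 1)
```

Note that the standard error uses the hypothesized proportion (0.5) rather than the sample proportion, because the null distribution describes what would happen if the null hypothesis were true.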

We can find the \(p\)-value by inputting our test statistic in pnorm(). In doing so we must be careful to recognize that we need to set lower.tail = FALSE to ensure we calculate \(Pr(Z>4.6)\) and not \(Pr(Z<4.6)\), and we need to multiply this probability by 2 to account for the extreme outcomes on the other side of the distribution (which is symmetric).

## Finding the p-value "by hand"
2*pnorm(q=4.6, lower.tail = FALSE)

## [1] 4.224909e-06
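To see why both lower.tail = FALSE and the factor of 2 matter, it may help to compute each tail separately. The short sketch below uses only pnorm(); the object names upper and lower are our own.

```r
z = 4.6

## Area above 4.6 and area below -4.6
upper = pnorm(q = z, lower.tail = FALSE)   # Pr(Z > 4.6)
lower = pnorm(q = -z)                      # Pr(Z < -4.6)

## The Normal distribution is symmetric, so the two tails are equal
c(upper, lower)

## Adding the tails gives the two-sided p-value (same as 2 * upper)
upper + lower

## Forgetting lower.tail = FALSE gives Pr(Z < 4.6), which is close to 1
pnorm(q = z)
```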

We can compare our “by hand” result with the results from prop.test() (which also uses a Normal probability model) and binom.test() (which uses the binomial distribution, an exact probability model for proportions).

## Finding the p-value using prop.test
prop.test(x = 99, n = 143, p = 0.5, correct = FALSE)

## 
##  1-sample proportions test without continuity correction
## 
## data:  99 out of 143, null probability 0.5
## X-squared = 21.154, df = 1, p-value = 4.238e-06
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.6124572 0.7620965
## sample estimates:
##         p 
## 0.6923077
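The output of prop.test() reports a statistic labeled X-squared rather than a Z-score. For a single proportion without continuity correction, this value is the square of the Z test statistic, which the sketch below illustrates (the object names z and res are our own):

```r
## Z test statistic, kept at full precision
z = (99/143 - 0.5) / sqrt(0.5 * (1 - 0.5) / 143)

## Squaring z should reproduce the X-squared value from prop.test
z^2
res = prop.test(x = 99, n = 143, p = 0.5, correct = FALSE)
unname(res$statistic)

## The p-values also agree when z is not rounded
2 * pnorm(abs(z), lower.tail = FALSE)
res$p.value
```

The small gap between the "by hand" \(p\)-value and the one from prop.test() comes from rounding the test statistic to 4.6 before using pnorm().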

## Finding the p-value using binom.test
binom.test(x = 99, n = 143, p = 0.5)

## 
##  Exact binomial test
## 
## data:  99 and 143
## number of successes = 99, number of trials = 143, p-value = 4.887e-06
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6097308 0.7667202
## sample estimates:
## probability of success 
##              0.6923077
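The exact \(p\)-value can also be pieced together from the binomial distribution using pbinom(). The sketch below relies on the fact that the binomial distribution is symmetric when \(p = 0.5\), so an outcome of 44 or fewer males is exactly as extreme as 99 or more. This shortcut is an assumption tied to this particular null value and should not be expected to carry over to other hypothesized proportions.

```r
n = 143   # number of infants
x = 99    # number of male infants

## Pr(X >= 99) under the null hypothesis
upper_tail = pbinom(q = x - 1, size = n, prob = 0.5, lower.tail = FALSE)

## Pr(X <= 44), the equally extreme outcomes on the other side
lower_tail = pbinom(q = n - x, size = n, prob = 0.5)

## Two-sided exact p-value
upper_tail + lower_tail
```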

Regardless of the precise approach chosen, the \(p\)-value is on the order of 1e-6, or around 0.000001. This indicates overwhelming statistical evidence that males are more common among infants born with congenital heart defects.

Question #1: Citing a paper in Nature, Google’s search AI describes a study where the implied proportion of male infants needing surgery for congenital heart defects is 0.62. In this question you will consider the null hypothesis \(H_0: p=0.62\), which suggests that 62% of infants needing surgery for congenital heart defects are male, and perform a test of this hypothesis using the infants data set.

Part A: Using the infants data set and the null hypothesis provided above, show the calculation of the test statistic for the hypothesis test that is described.
Part B: Use this StatKey menu to find the area of the Normal distribution that is at least as extreme as the test statistic you calculated in Part A. Note that this is the \(p\)-value.
Part C: Use the pnorm() function to confirm the \(p\)-value you found in Part B.
Part D: Provide a 1-sentence summary of the results of the hypothesis test you’ve performed.

Question #2: Repeat a similar hypothesis test to the one you performed in Question 1 using the binom.test() function. You only need to provide the code necessary to perform this test, though you should note a \(p\)-value that is similar to the one you found “by hand”.

\(~\)

Differences in Proportions

Statistical tests involving a difference in proportions use hypotheses of the form: \[H_0: p_1 - p_2= \text{hypothesized value} \\ H_a: p_1-p_2 \ne \text{hypothesized value}\]

A hypothesized value of 0 is almost always used in this context.

Difference in proportions tests are conducted using prop.test(), a function we’ve previously used to find confidence interval estimates for a difference in proportions. To make use of the function, we must provide the following:

x - a vector containing the frequencies of the event of interest in each group
n - a vector containing the sample sizes of each group

An example is shown below:

## Using the NFL games data set
nfl = read.csv("https://remiller1450.github.io/data/nfl_sample.csv")

## We'll re-code "game outcome" as a binary variable
nfl$win_binary = ifelse(nfl$game_outcome == 1, "Win", "Loss or Tie")

## Table relating game type and game outcome
type_outcome_table = table(nfl$game_type, nfl$win_binary)

## Printing the table
type_outcome_table

##          
##           Loss or Tie Win
##   Playoff           6   5
##   Reg              88 101

In order to test for a difference in the win rate of home teams in regular season and playoff games, we should provide the vector (5,101) as the x argument and (11,189) as the n argument. This is accomplished by the code below:

## Extract the vectors needed
my_x = c(type_outcome_table[1,2], type_outcome_table[2,2])
my_n = rowSums(type_outcome_table)

## Use prop.test
prop.test(x = my_x, n = my_n)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  my_x out of my_n
## X-squared = 0.042056, df = 1, p-value = 0.8375
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.4306698  0.2709776
## sample estimates:
##    prop 1    prop 2 
## 0.4545455 0.5343915

In this example, the \(p\)-value is 0.8375, which suggests that this sample provides insufficient evidence of a relationship between game type and the win rate of the home team.
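For reference, a difference in proportions test can also be sketched "by hand" using the test statistic formula from the start of this lab. The version below assumes a hypothesized difference of 0 and a standard error based on the pooled proportion, which corresponds to prop.test() with correct = FALSE. The output shown above used the default continuity correction, so its \(p\)-value will differ somewhat from this one. All object names are our own.

```r
## Wins and games for playoff and regular season (from the table above)
x = c(5, 101)
n = c(11, 189)

## Sample proportions and the pooled proportion under H0: p1 - p2 = 0
p_hat = x / n
p_pool = sum(x) / sum(n)

## Standard error of the difference under the null hypothesis
se_null = sqrt(p_pool * (1 - p_pool) * (1/n[1] + 1/n[2]))

## Test statistic and two-sided p-value from the Normal model
z = (p_hat[1] - p_hat[2] - 0) / se_null
p_value = 2 * pnorm(abs(z), lower.tail = FALSE)

## Compare with prop.test without continuity correction
res = prop.test(x = x, n = n, correct = FALSE)
c(z^2, unname(res$statistic))
c(p_value, res$p.value)
```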

Question #3:

Part A: Use prop.test() to evaluate whether there is statistical evidence that the proportions of male infants in each type of surgery (low-flow and circulatory arrest) are unequal. You should clearly state your null hypothesis, then you should provide the \(p\)-value and a 1-sentence summary.
Part B: Consider the design of the Infant Heart study. Why does it make sense that the hypothesis test you performed found insufficient evidence of a difference in proportions?

\(~\)

Means and Differences in Means

Statistical tests involving a single mean use hypotheses of the form: \[H_0: \mu= \text{hypothesized value} \\ H_a: \mu \ne \text{hypothesized value}\]

While tests involving a difference in means use hypotheses of the form: \[H_0: \mu_1 - \mu_2 = \text{hypothesized value} \\ H_a: \mu_1 - \mu_2 \ne \text{hypothesized value}\]

For a difference in means test, the hypothesized value is almost always zero, while for a test of a single mean, it can be any value depending on the application.

Both of these tests are performed using the t.test() function, which uses the \(t\)-distribution as a probability model. The \(t\)-distribution accounts for the additional variability that arises because the standard error of the mean requires us to estimate both the mean and the standard deviation of the population using the same data.

To use t.test() to perform a test involving a single mean we provide the relevant quantitative variable as the x argument and the hypothesized value as the mu argument. The example below tests whether the average score differential in NFL games is zero:

## Example of t.test for a single mean
t.test(x = nfl$score_diff, mu = 0)

## 
##  One Sample t-test
## 
## data:  nfl$score_diff
## t = 0.91137, df = 199, p-value = 0.3632
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -1.140453  3.100453
## sample estimates:
## mean of x 
##      0.98
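The pieces of this output connect back to the test statistic formula. As a sketch, the reported \(p\)-value and confidence interval can be recovered from the printed values of t, df, and the sample mean alone. The standard error here is backed out from the output rather than computed from the data, so small rounding differences are possible.

```r
## Values copied from the t.test output above
t_stat = 0.91137
df = 199
x_bar = 0.98
mu_0 = 0

## Two-sided p-value from the t-distribution
p_value = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
round(p_value, 4)

## Standard error implied by t = (x_bar - mu_0) / se
se = (x_bar - mu_0) / t_stat

## 95% confidence interval: estimate +/- (t quantile) * (standard error)
ci = x_bar + c(-1, 1) * qt(0.975, df = df) * se
ci
```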

To use t.test() to perform a difference in means \(t\)-test we can use formula syntax similar to what we’ve previously seen for regression models. The example below tests whether the average points scored by the home team differs for playoff and regular season games.

## Example of t.test for a difference in means
t.test(home_score ~ game_type, mu = 0, data = nfl)

## 
##  Welch Two Sample t-test
## 
## data:  home_score by game_type
## t = 0.26493, df = 11.128, p-value = 0.7959
## alternative hypothesis: true difference in means between group Playoff and group Reg is not equal to 0
## 95 percent confidence interval:
##  -6.404705  8.160357
## sample estimates:
## mean in group Playoff     mean in group Reg 
##              23.45455              22.57672
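Notice that the degrees of freedom in this output (11.128) are not a whole number. By default, t.test() uses the Welch approach, which does not assume the two groups share a common standard deviation and approximates the degrees of freedom from the data. The sketch below shows the calculation using two small made-up vectors; group_a and group_b are hypothetical values for illustration only and are not taken from the NFL data.

```r
## Hypothetical scores for two groups (made up for illustration only)
group_a = c(24, 17, 31, 20, 27, 13)
group_b = c(21, 28, 10, 35, 23, 17, 30, 14, 24, 20)

## Estimated variance of each sample mean (s^2 / n)
v_a = var(group_a) / length(group_a)
v_b = var(group_b) / length(group_b)

## Welch test statistic: difference in means over its standard error
t_stat = (mean(group_a) - mean(group_b) - 0) / sqrt(v_a + v_b)

## Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (v_a + v_b)^2 /
  (v_a^2 / (length(group_a) - 1) + v_b^2 / (length(group_b) - 1))

## Two-sided p-value
p_value = 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

## Compare with t.test
res = t.test(group_a, group_b)
c(t_stat, df_welch, p_value)
c(unname(res$statistic), unname(res$parameter), res$p.value)
```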

Question #4: PDI scores are calculated such that a score of 100 reflects the population average for children of a certain age. In this question you will test whether the PDI scores at two years of age for infants born with congenital heart defects deviate from the population average.

Part A: The standard error of a single mean is \(\sigma/\sqrt{n}\), or the standard deviation of the cases in the population divided by the square root of the sample size. Because \(\sigma\) is unknown, we estimate it using the sample standard deviation \(s\). With this in mind, calculate the test statistic involved in the scenario described above.
Part B: For a test of a single mean, the test statistic you calculated in Part A should be compared against a \(t\)-distribution with \(n-1\) degrees of freedom. Use this StatKey page to determine the \(p\)-value associated with the test statistic you found in Part A.
Part C: Use the t.test() function to confirm the test statistic and \(p\)-value you found in Parts A and B.
Part D: Suppose you had compared the test statistic from Part A against a Normal distribution rather than a \(t\)-distribution. Would you expect this mistake to result in a \(p\)-value that is larger or smaller than the one you found in Parts B and C? Briefly explain.

Question #5: A primary analysis goal in this study was to compare the mean PDI scores of infants receiving each type of surgery. For this question you should perform a hypothesis test that evaluates whether the study found a significant difference in mean PDI scores. You should clearly state your null and alternative hypotheses, calculate a \(p\)-value using t.test(), and report a 1-sentence summary of what you conclude from the hypothesis test.

\(~\)

Practice

The purpose of this section is to practice your ability to decide the proper statistical test and hypotheses to use when given a new research question and data set. Each question will introduce a data set and research question, and your task is to perform an appropriate hypothesis test and report a 1-sentence conclusion based upon the results of that test.

When deciding upon a test you should consider the variables involved in the research question and their types:

If the research question involves one categorical variable, then a single proportion test (using prop.test() or binom.test()) is appropriate.
If the research question involves one categorical variable and one quantitative variable, then a difference in means \(t\)-test is most likely appropriate.
If the research question involves two categorical variables, then a difference in proportions test (using prop.test()) is most likely appropriate.

Question #6: The “Oatbran” data set (read into R below) contains the results of an experiment in which 14 male participants each ate two types of cereal, oatbran and cornflakes, for two weeks apiece. Participants were randomly assigned to either eat oatbran first, followed by cornflakes, or to eat cornflakes first, followed by oatbran. Their LDL cholesterol levels (mmol/L) were measured at the end of each two-week period, and the two periods were separated by a washout period.

The data set contains 3 columns:

CornFlakes - the LDL measurement at the end of the period in which the subject was on the cornflakes diet
OatBran - the LDL measurement at the end of the period in which the subject was on the oatbran diet
difference - the difference in LDL measurements (CornFlakes minus OatBran) for a subject

## Data for Question #6
oatbran = read.csv("https://remiller1450.github.io/data/Oatbran.csv")

For this question, you should perform an appropriate hypothesis test to evaluate whether this study provides evidence that oat bran cereal helps lower LDL cholesterol. Be sure your answer includes all of the components requested at the start of this section.

\(~\)

Question #7: The “Commute Tracker” data set (loaded below) contains a sample of daily commutes tracked by a GPS app used by a worker in the greater Toronto area.

ct <- read.csv("https://remiller1450.github.io/data/CommuteTracker.csv")

For this question, you should use a hypothesis test to evaluate whether the worker is more likely to take Hwy 407, a toll road which is faster but more expensive than their normal route, when they are headed to their workplace (GoingTo = 'GSK') or when they are headed home (GoingTo = 'Home'). Be sure your answer includes all of the components requested at the start of this section.

\(~\)

Question #8: For this question, you will use the “Commute Tracker” data introduced in Question 7. You are to perform a hypothesis test to evaluate whether the average total time of trips back to the worker’s home (GoingTo = 'Home') is longer than the average total time of trips to the worker’s office (GoingTo = 'GSK'). Be sure your answer includes all of the components requested at the start of this section.