\(~\)

Onboarding

Statisticians will sometimes apply a transformation to their data before analyzing it as part of a hypothesis test or model, with log-transformations being the most common. Statisticians use the term “log” to refer to what most others call the natural logarithm: \[\log(X)=t \leftrightarrow X=e^t\] The natural logarithm has a base of \(e\), a mathematical constant with a value of approximately 2.718.
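As a quick check, R follows this same convention: log() defaults to the natural logarithm, and exp() is its inverse.

```r
## R's log() is the natural logarithm (base e) by default
log(exp(1))    # returns 1
exp(log(10))   # exp() undoes log(), returning 10
```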

Other popular bases are 2 (Log2) and 10 (Log10). A 1-unit change on the Log2 scale reflects a doubling of values on the original scale, while a 1-unit change on the Log10 scale reflects a 10-fold increase.

The figure below illustrates how the values \(\{1, 2, 4, 10, 100\}\) appear on these scales:

A few things to notice:

  1. On the original scale the values 1, 2, and 4 are bunched together, but after log-transformation they are spread out.
  2. The distance between 1 and 10 on the Log10 scale is the same as the distance between 10 and 100 since both of these are a 10-fold increase.
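Both properties can be verified directly in R using values from the figure:

```r
## One-unit steps on the log scale are constant multiplicative changes
log(c(1, 2, 4), base = 2)      # doublings: 0, 1, 2
log(c(1, 10, 100), base = 10)  # 10-fold increases: 0, 1, 2
```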

These properties make log-transformations an attractive tool to reduce the impact of right-skew and outliers:

## Let's load some right-skewed data
library(dplyr)
library(ggplot2)
tsa_eyeglasses <- read.csv("https://remiller1450.github.io/data/tsa.csv") %>% 
  filter(Item == "Eyeglasses - (including contact lenses)")

## We use the log() function with base = 10 to create a new log-transformed variable
tsa_eyeglasses$log10_claim <- log(tsa_eyeglasses$Claim_Amount, base = 10)

## Graph the distributions of claim amount and log10(claim amount)
ggplot(data = tsa_eyeglasses, aes(x = Claim_Amount)) + geom_histogram(bins = 30) + theme_light()
ggplot(data = tsa_eyeglasses, aes(x = log10_claim)) + geom_histogram(bins = 30) + theme_light()

Another nice feature of logarithms is that differences calculated on the log scale become ratios on the original scale once the transformation is undone: \[\log(X) - \log(Y) = \log(X/Y) \\ e^{\log(X/Y)} = X/Y\]
So, if we perform a hypothesis test on a difference in means using log-transformed data, this test is equivalent to a test involving the ratio of means using the non-transformed data.
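A small sketch (with made-up values for x and y) confirms this identity:

```r
## A difference on the log scale is a ratio on the original scale
x <- 50
y <- 20
exp(log(x) - log(y))  # returns 2.5, the same as x / y
```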

Note: Because \(\sum_{i=1}^{n}\log(x_i)/n \ne \log(\sum_{i=1}^{n}x_i/n)\) we actually get a ratio of geometric means rather than arithmetic means when undoing a log-transformation. This is a technical detail that is seldom practically important. The main idea is that by analyzing log-transformed data we can assess relative changes between groups.
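To see the distinction, compare the back-transformed mean of the logs with the ordinary mean on a few made-up values:

```r
## Undoing a log-transformation of a mean yields the geometric mean
x <- c(1, 10, 100)
exp(mean(log(x)))  # geometric mean: 10
mean(x)            # arithmetic mean: 37
```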

\(~\)

Lab

This lab will use the “Tailgating” data set, which was introduced in our lecture on decision errors and false discoveries. This data set contains the following columns:

  • Drug - the hardest substance that subject regularly consumed
  • D - the subject’s average following distance in the simulated drive
## Libraries
library(dplyr)
library(ggplot2)

## Data used in the lab
tailgating = read.csv("https://remiller1450.github.io/data/Tailgating.csv")

\(~\)

Outliers and Transformations

In a recent lecture, several large outliers were omitted from the visualizations of the Tailgating data. Here’s what the data look like when those outliers are included:

ggplot(data = tailgating, aes(x = D, y = Drug)) + geom_boxplot()

Recall that the \(t\)-test is used to compare the means of two groups, and the standard error used in the test statistic involves the sample standard deviation. We saw early in the semester that both the mean and standard deviation are sensitive to outliers, so what role is the large outlier in the THC group playing in the hypothesis tests involving that group?
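As a reminder of that sensitivity, here is a small sketch using hypothetical values:

```r
## A single large outlier pulls both the mean and the SD upward
x <- c(10, 12, 14, 15, 16)             # hypothetical following distances
c(mean = mean(x), sd = sd(x))
x_out <- c(x, 200)                     # add one extreme value
c(mean = mean(x_out), sd = sd(x_out))  # both increase dramatically
```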

Question #1: The code provided below creates a second version of the Tailgating data set using filter() to remove the large outliers whose average following distances exceed 100 feet.

tailgating_filtered = tailgating %>% filter(D < 100)
  • Part A: Use the t.test() function on the original data set (which contains outliers) to evaluate whether there is statistical evidence of a difference in the mean following distance of drivers in the THC and MDMA groups. Report the \(p\)-value of your test and a 1-sentence conclusion. Hint: You might consider using the command filter(Drug %in% c("MDMA","THC")) to create a version of these data that only contains members of the MDMA and THC groups.
  • Part B: Now repeat the same test on the filtered data set (which has removed the outliers). Report the \(p\)-value of this test and a 1-sentence conclusion.
  • Part C: Assuming the outliers in this data set are real participants and not measurement errors, should these outliers be excluded prior to analysis? More generally, is it reasonable to arbitrarily remove data from the sample in a way that changes the conclusions drawn from the data? Briefly explain your thoughts.

\(~\)

Question #2:

  • Part A: Create a new variable in the original Tailgating data set that applies a Log10 transformation to the subject’s following distance (the variable D). Next, use side-by-side box plots similar to those seen earlier in this section to describe the impact of the log-transformation on the outliers present in these data.
  • Part B: Use t.test() to perform a hypothesis test comparing the mean log-transformed following distances in the THC and MDMA groups. Report the \(p\)-value of your test and a 1-sentence conclusion.
  • Part C: You’ve now considered three different approaches to testing whether the THC and MDMA groups differ (\(t\)-test on the non-transformed data including outliers, \(t\)-test on the non-transformed data excluding outliers, and \(t\)-test on the log-transformed data). Considering the strengths and limitations of each approach, which would you recommend the researchers in this study use? Hint: recall that the \(t\)-test relies upon a probability model that assumes either a large sample size or data from a Normally distributed population.

\(~\)

Non-Parametric Tests

When we first studied confidence intervals and probability models we briefly discussed the circumstances where the Normal and \(t\)-distributions served as reasonable probability models. For hypothesis tests involving a proportion or a difference in proportions, we could handle situations where a Normal probability model was unreasonable by using an exact test (such as the exact binomial test or Fisher’s Exact Test).

It’s beyond the scope of this course to learn about those probability models in detail, but the thing to know is that we have an exact test that we can use for those circumstances that doesn’t require any assumptions about our sample size.

In contrast, when conditions required to reasonably use the \(t\)-distribution are violated for tests involving a single mean (one-sample \(t\)-test) or a difference in means (two-sample \(t\)-test) there is no exact test we can fall back on. Instead, statisticians have developed “distribution free” or “non-parametric” testing approaches for scenarios where it is unreasonable to assume Normality.

\(~\)

Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is a non-parametric analog to the one-sample \(t\)-test for a single mean. It is most often used to evaluate whether the median difference in a paired design is zero.

Consider the “Oatbran” study, which used a paired design to evaluate changes in LDL cholesterol for two different diets.

The Wilcoxon Signed-Rank Test ranks each data-point based upon its absolute value, then data-points are grouped according to their sign (positive or negative). The test then compares a test statistic that is the sum of each data-point’s rank multiplied by its sign against a null distribution to produce a \(p\)-value.
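The sketch below traces this calculation on a few made-up paired differences. Note that R’s wilcox.test() reports V, the sum of the ranks of the positive differences, which is an equivalent formulation of the signed statistic described above.

```r
## Signed-rank statistic by hand (made-up paired differences)
d <- c(3, -1, 4, -2, 5)
r <- rank(abs(d))                         # rank each value by absolute value
V <- sum(r[d > 0])                        # sum the ranks of the positive values
V                                         # 12
unname(wilcox.test(d, mu = 0)$statistic)  # also 12
```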

The wilcox.test() function is used to perform the Wilcoxon Signed-Rank Test. It operates analogously to t.test():

oatbran = read.csv("https://remiller1450.github.io/data/Oatbran.csv")

## Perform the test
wilcox.test(x = oatbran$Difference, mu = 0)
## 
##  Wilcoxon signed rank exact test
## 
## data:  oatbran$Difference
## V = 93, p-value = 0.008545
## alternative hypothesis: true location is not equal to 0

We will not cover the form of the Wilcoxon Signed-Rank Test’s test statistic or null distribution in detail. Instead, you should know when to use the test (in place of a one-sample \(t\)-test when the conditions needed to rely upon the \(t\)-distribution as a probability model are not met) and how to interpret the test’s results.

\(~\)

Wilcoxon Rank-Sum Test

The Wilcoxon Rank-Sum Test (also known as the Mann-Whitney U-test) is a non-parametric analog to the two-sample \(t\)-test for a difference in means.

The null hypothesis is that both groups follow the same distribution, and the alternative is that the groups follow different distributions. The test is conducted using the following steps:

  1. Each data-point, regardless of group, is ranked from smallest to largest
  2. These ranks are summed within each group
  3. These two sums, along with the sample sizes in each group, are used to compute a test statistic that is compared against a null distribution to produce a \(p\)-value
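These steps can be traced by hand on a tiny made-up example. R’s W statistic subtracts the minimum possible rank sum, \(n_1(n_1+1)/2\), from the first group’s rank sum:

```r
## Rank-sum statistic by hand (made-up data)
g1 <- c(2, 5, 9)
g2 <- c(3, 7, 11)
r <- rank(c(g1, g2))                   # rank all six values together
W <- sum(r[1:3]) - 3 * (3 + 1) / 2     # group 1's rank sum minus its minimum
W                                      # 3
unname(wilcox.test(g1, g2)$statistic)  # also 3
```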

The test can be performed using the wilcox.test() function:

wetsuits = read.csv("https://remiller1450.github.io/data/Wetsuits_long.csv")

## Perform the test
wilcox.test(Velocity ~ Condition, mu = 0, data = wetsuits)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Velocity by Condition
## W = 48.5, p-value = 0.1838
## alternative hypothesis: true location shift is not equal to 0

Note: Similar to t.test(), the wilcox.test() function will determine whether to perform a two-sample (difference in groups) or a one-sample (single group) test based upon the way in which you provide the data. The signed-rank test expects a single vector provided as the x argument, while the rank-sum test can be done using formula notation.
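A sketch with made-up scores illustrates the two interfaces:

```r
## The shape of the input determines which test wilcox.test() performs
diffs <- c(1, 3, 2, 4)                  # e.g., after - before in a paired design
wilcox.test(diffs, mu = 0)              # single vector: signed-rank test
scores <- c(5, 7, 6, 9, 10, 12, 11, 14)
group <- rep(c("A", "B"), each = 4)
wilcox.test(scores ~ group)             # formula: rank-sum test
```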

Question #3: In a 1982 study, researchers had 12 subjects participate in a visual motor task where they had to steer a pencil along a moving track. Each subject was tested on two different tracks: a straight track and an oscillating one. The researchers were not interested in how well the participants kept their pencil on the track, but rather their blink rate (measured in blinks per minute) under each condition. The columns Straight and Oscillating record each subject’s blink rate in each of these conditions.

  • Part A: Considering the design of this study, is the Wilcoxon Rank-Sum Test or the Wilcoxon Signed-Rank Test more appropriate? Briefly explain.
  • Part B: Considering the assumptions of the method, would it be appropriate to perform a \(t\)-test on these data? Explain your reasoning.
  • Part C: Perform an appropriate hypothesis test evaluating whether blink rate was influenced by the experimental condition. A sufficient answer will show the R output of the test along with a 1-sentence conclusion. Hint: you may need to create a new variable to perform the test you deem appropriate.
blink = read.csv("https://remiller1450.github.io/data/blink.csv")

Question #4:

  • Part A: Earlier in this lab you used log-transformations and a \(t\)-test to evaluate differences in the average following distances of regular users of MDMA and THC. Considering your previous analysis, which nonparametric approach can be used in this application: the Wilcoxon Rank-Sum Test or the Wilcoxon Signed-Rank Test?
  • Part B: Perform the nonparametric test you deemed appropriate for this application in Part A. How does the \(p\)-value of this test compare with the one you found in Question #2?

\(~\)

Practice (required)

In our previous lab you practiced deciding upon the appropriate hypothesis test for a research question. This section will expand upon that by requiring you to decide whether a traditional parametric test (i.e., a \(t\)-test) or an exact or non-parametric test (i.e., an exact binomial test or Wilcoxon test) should be used.

The table below summarizes all of the tests that we’ve learned about so far, as well as the conditions needed to trust the probability model involved in the parametric tests:

| Explanatory Variable | Outcome Variable | Parametric Test | Conditions | Nonparametric or Exact Alternative |
|---|---|---|---|---|
| Binary Categorical | Binary Categorical | prop.test() - two-sample Z-test | 10 cases in each cell of the 2x2 contingency table | fisher.test() - Fisher’s Exact Test |
| Binary Categorical | Quantitative | t.test() - two-sample T-test | Normal data or 30 cases per group | wilcox.test() - Rank-Sum Test |
| none | Binary Categorical | prop.test() - one-sample Z-test | 10 cases experiencing each outcome | binom.test() - Exact Binomial Test |
| none | Quantitative | t.test() - one-sample T-test | Normal data or at least 30 cases | wilcox.test() - Signed-Rank Test |

For the following questions you are expected to:

  1. Determine an appropriate statistical test and provide a justification for that test. For example, if you are performing a one-sample \(t\)-test you should indicate a sufficient sample size, or you should create a histogram or Q-Q plot that shows no evidence of skew or outliers.
  2. Accurately find the \(p\)-value for the test you selected.
  3. Report a one-sentence conclusion following the guidelines discussed in our lecture slides on hypothesis testing.
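For instance, a histogram and Q-Q plot can be produced with a few lines of base R (shown here on simulated, deliberately skewed data):

```r
## Visual checks of the Normality condition (simulated right-skewed sample)
set.seed(1)
x <- rexp(25)   # small, skewed sample: t-test conditions are questionable
hist(x)         # histogram: look for skew and outliers
qqnorm(x)       # Q-Q plot: points should track the reference line
qqline(x)
```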

\(~\)

Question #5: Cloud seeding is a type of weather modification that aims to change the amount or type of precipitation that falls from clouds. The method works by dispersing a chemical into the air that alters the microphysical processes within the cloud. Whether cloud seeding produces a statistically significant increase in precipitation is an ongoing debate. In this question, you will analyze data from an experiment where clouds were randomly assigned to receive seeding. You should evaluate whether there is statistical evidence that seeding (yes/no) influences rainfall (inches).

clouds = read.csv("https://remiller1450.github.io/data/clouds.csv")

\(~\)

Question #6: The “HairEyeColor” data set records the hair color, eye color, and sex of a sample of 592 students from the University of Delaware taken in 1974. Using these data, evaluate whether there is statistical evidence that the proportion of individuals with green eyes is higher among those with brown hair than it is among those with black hair. Hint: To simplify the steps of your analysis, you might choose to use the filter() function to only include the relevant combinations of hair and eye color.

hair = read.csv("https://remiller1450.github.io/data/HairEyeColor.csv")