\(~\)

Onboarding

As suggested in our previous lab, statisticians will sometimes apply a transformation to their data before analyzing it as part of a hypothesis test or model, with log-transformations being the most common. Statisticians use the term “log” to reference what most others call the natural logarithm: \[\log(X) = t \leftrightarrow X = e^t\] The natural logarithm has a base of \(e\), a mathematical constant with a value of approximately 2.718.

Other popular bases are 2 (Log2) and 10 (Log10). A 1-unit change on the Log2 scale reflects a doubling of values on the original scale, while a 1-unit change on the Log10 scale reflects a 10-fold increase.

The figure below illustrates how the values \(\{1, 2, 4, 10, 100\}\) appear on these scales:
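
If you'd like to compute these values yourself, base R's log(), log2(), and log10() functions will do it. This is just a sketch of the transformed numbers, not the figure itself:

## The values from the figure on each scale
x <- c(1, 2, 4, 10, 100)
log2(x)   ## Log2 scale
log10(x)  ## Log10 scale
log(x)    ## natural log scale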

A few things to notice:

  1. On the original scale the values 1, 2, and 4 are bunched together, but after log-transformation they spread out.
  2. The distance between 1 and 10 on the Log10 scale is the same as the distance between 10 and 100 since both are a 10-fold increase.

These properties allow log-transformations to alleviate problems stemming from right-skew and outliers:

## Let's load some right-skewed data
library(dplyr)
library(ggplot2)
tsa_eyeglasses <- read.csv("https://remiller1450.github.io/data/tsa.csv") %>% 
  filter(Item == "Eyeglasses - (including contact lenses)")

## We use the log() function to create a new log-transformed variable
tsa_eyeglasses$log10_claim = log(tsa_eyeglasses$Claim_Amount, base = 10)

## Graph the distributions of claim amount and log10(claim amount)
ggplot(data = tsa_eyeglasses, aes(x = Claim_Amount)) + geom_histogram(bins = 30) + theme_light()
ggplot(data = tsa_eyeglasses, aes(x = log10_claim)) + geom_histogram(bins = 30) + theme_light()

Another nice feature of logarithms is that differences calculated on the log-scale are ratios on the original scale after the transformation is undone: \[\log(X) - \log(Y) = \log(X/Y) \\ e^{\log(X/Y)} = X/Y\]
So, if we perform a hypothesis test on a difference in means using log-transformed data, this test is equivalent to a test involving the ratio of means using the non-transformed data.

Note: Because \(\sum_{i=1}^{n}log(x_i)/n \ne log(\sum_{i=1}^{n}x_i/n)\) we actually get a ratio of geometric means (rather than arithmetic means) after reversing a log-transformation. However, this is a technical detail that almost never has any practical importance. The main idea is that by analyzing log-transformed data we can assess relative changes between groups.
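
As a small illustration (using made-up numbers rather than the lab's data), undoing a difference in log-scale means with exp() returns the ratio of the two groups' geometric means:

## Two made-up samples whose geometric means are 8 and 4
x <- c(2, 8, 32)
y <- c(1, 4, 16)

## Difference in means on the log-scale, undone with exp()
exp(mean(log(x)) - mean(log(y)))        ## equals 2

## The same value expressed as a ratio of geometric means
exp(mean(log(x))) / exp(mean(log(y)))   ## also equals 2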

\(~\)

Lab

Our previous lab introduced the “tailgating” data set, which records the results of a driving simulator study where regular users of various types of drugs were asked to follow an erratic lead vehicle to a certain destination. The researchers hypothesized that regular users of harder drugs would be more prone to risky behavior, and thus they would follow the erratic vehicle more closely.

tailgating = read.csv("https://remiller1450.github.io/data/Tailgating.csv")

Recall that the outcome variable in this data set is D, the average following distance of a subject during the simulation. The data also contains the log-transformed version of this variable, which is recorded as LD.
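
If you'd like to confirm this relationship yourself, the quick check below should return TRUE (assuming LD was created with the natural logarithm, as Question #1 Part F indicates; it may not match exactly if LD was rounded):

## Verify that LD is the natural log of D
all.equal(tailgating$LD, log(tailgating$D))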

\(~\)

Outliers and Transformations

The tailgating data set includes several large outliers whose average following distances were many times larger than most participants in the study:

ggplot(data = tailgating, aes(x = D, y = Drug)) + geom_boxplot()

You should remember from earlier in the semester that both the mean and standard deviation are sensitive to outliers, so naturally hypothesis tests like one-way ANOVA and the two-sample \(T\)-test will also be impacted by these outliers. In the question below you will investigate the role of the large outlier in the THC group.

Question #1: The code provided below creates a second version of the Tailgating data set using filter() to remove the large outliers whose average following distances exceed 100 feet.

tailgating_no_outlier = tailgating %>% filter(D < 100)
  • Part A: Use the filter() function to create a subset of the original data set containing the outlier (tailgating) that includes only the THC and MDMA groups (recall that we learned about filter() in Lab 4). Then, use a two-sample \(T\)-test to evaluate whether there is evidence of a difference in the average following distances of regular users of these drugs. Report an appropriate one-sentence conclusion that includes the \(p\)-value and observed sample means in each group (see Lab 4 to review example conclusions).
  • Part B: Repeat the procedure used in Part A on the version of the data where the large outlier in the THC group has been removed (tailgating_no_outlier). Report a one-sentence conclusion that includes the \(p\)-value and observed sample means in each group.
  • Part C: Consider the test statistic used in the two-sample \(T\)-test and notice that the observed difference in means was actually smaller in Part B (without the outlier) than it was in Part A (with the outlier). How do you explain the \(p\)-value being smaller in Part B despite the observed difference in means also being smaller? Hint: Think about the other components of the test statistic.
  • Part D: Assuming the outliers in this data set are real participants and not measurement errors, should these outliers be excluded from hypothesis tests? Or, more generally, is it reasonable to remove real data from the sample in a way that changes the statistical conclusions that are drawn? Briefly share your thoughts.
  • Part E: Now perform a two-sample \(T\)-test on the data containing the outlier (which you created in Part A) using log-transformed following distances, or the variable LD, as the outcome. Noting that these data include the outlier, compare the \(p\)-value of this test with those found in Parts A and B.
  • Part F: The lab’s introduction discussed how differences on the log-scale are ratios after applying an inverse transformation. Take the difference in means of the variable LD and apply the inverse transformation (the exp() function, since the natural logarithm was used to create LD). Provide a brief interpretation of the resulting ratio of means (e.g., “Average following distances were 20% higher in the THC group compared to the MDMA group”).
  • Part G: In simple linear regression, we interpret the model’s slope coefficient as telling us the expected change in the outcome variable for a 1-unit change in the explanatory variable. How would this interpretation change when the explanatory variable has been transformed using a base-two logarithm? You should explain the interpretation in the original (non-transformed) units of both variables.
  • Part H: Building off of Part G, how would the interpretation of a slope coefficient change in simple linear regression when the outcome variable has been transformed using a base-two logarithm? You should explain the interpretation in the original (non-transformed) units of both variables.

\(~\)

Multiple Comparisons

Parkinson’s disease is a progressive neurological disorder that impacts movement and coordination and worsens over time. However, if the disease can be identified before neurological symptoms appear, it is possible to slow its progression.

The “Parkinson’s biomarkers” data set records a subset of results from a study seeking to find biomarkers that could identify people in the early stages of Parkinson’s when the disease is most treatable. The researchers collected blood samples from 50 patients with early-stage Parkinson’s, as well as blood samples from 55 controls who did not have the disease but were similar in age.

For each patient microarrays were used to measure the expression levels of 15,000 genetic features.

parkinsons_biomarkers = read.csv("https://remiller1450.github.io/data/parkinsons_biomarkers.csv")
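
Before testing anything, you might want to confirm the data's dimensions and the number of cases and controls. The sketch below assumes the case/control indicator is stored in the column named status (the column used in the code for Question #2):

## Number of subjects/variables and the case vs. control group sizes
dim(parkinsons_biomarkers)
table(parkinsons_biomarkers$status)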

Question #2: The code provided below performs 15,000 different two-sample \(T\)-tests comparing the average gene expression levels of cases and controls in each genetic feature present in the data set. The \(p\)-values of these 15,000 tests are stored in the object p_values, which you are to use in the questions that follow. Note: It might take a minute or so to run this code.

p_values = numeric(15000) ## creates an empty object to store results
for(i in 1:15000){        ## loops through all the genetic features, recording a p-value for each
  p_values[i] = t.test(parkinsons_biomarkers[,i] ~ parkinsons_biomarkers$status)$p.value  
}
  • Part A: Were these data collected from an experiment or an observational study? If they came from an observational study, what type of design was used (prospective, retrospective, or cross-sectional) and why might this design have been chosen? If they came from an experiment, what variable did the researchers manipulate? Briefly explain your answers.
  • Part B: Suppose the \(p\)-values calculated by the provided code are each compared to a decision threshold of \(\alpha = 0.05\) to judge statistical significance. If the null hypothesis were true for every genetic feature tested, how many statistically significant findings would you expect to observe just by chance using this threshold?
  • Part C: If the null hypothesis were actually true for every genetic feature tested we’d expect a roughly uniform distribution of \(p\)-values. To investigate this further, create a histogram showing the distribution of the \(p\)-values calculated by the provided code. Use 25 bins and add a horizontal line using +geom_hline(yintercept = 15000/25) to display what a uniform distribution would look like. Hint: The object p_values was not created as a data frame, so you may want to use the data.frame() function to create a data frame containing it that you can give as an input to ggplot().
  • Part D: Based upon the histogram you created in Part C, do you believe it is the case that none of the genetic features tested have significantly different expression levels for cases and controls? Explain your reasoning.

\(~\)

Adjusted \(p\)-values

The p.adjust() function accepts a vector of \(p\)-values and applies an adjustment method to control either the family-wise error rate or the false discovery rate.

The code below demonstrates how to apply the Bonferroni correction to obtain adjusted \(p\)-values. Recall that these \(p\)-values are simply the original, unadjusted \(p\)-values multiplied by the number of hypothesis tests (15,000 in our case) and capped at 1, and they can be compared against a typical significance threshold (such as \(\alpha = 0.05\)) to control the family-wise type I error rate at that level.

adj_p_values = p.adjust(p_values, method = "bonferroni")
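
As a quick sanity check, these adjusted \(p\)-values should match what you'd get by multiplying each unadjusted \(p\)-value by 15,000 and capping the result at 1:

## Bonferroni adjustment done "by hand" should agree with p.adjust()
all.equal(adj_p_values, pmin(1, p_values * 15000))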

We could change the method argument to method = "fdr" to obtain a set of adjusted \(p\)-values that control the false discovery rate at a certain level when compared against the corresponding threshold value.
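
For example, storing the result in a new object (here called fdr_adj_p_values so the Bonferroni results aren't overwritten):

## Adjusted p-values that control the false discovery rate
fdr_adj_p_values = p.adjust(p_values, method = "fdr")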

Question #3: For this question you should use the \(p\)-values you found for the Parkinson’s biomarkers data in the previous section of the lab.

  • Part A: Use the Bonferroni adjustment to control the family-wise type I error rate at 20% and report the number of genetic features that are identified as statistically significant. Hint: You might try a command like sum(adj_p_values <= alpha) where alpha is the target error rate.
  • Part B: In your own words, explain what it means to control the family-wise type I error rate at 20% in this study.
  • Part C: Now use the false discovery rate approach to control the false discovery rate at 20% and report the number of genetic features that are identified as statistically significant.
  • Part D: In your own words, explain what it means to control the false discovery rate at 20% in this study.
  • Part E: In the context of this application, what is a type II error? Which approach, the Bonferroni adjustment or false discovery rate control, is likely to produce more type II errors?
  • Part F: Suppose you are the statistician in charge of analyzing these data and you have the option of using the Bonferroni adjustment, a false discovery rate control approach, or no adjustment to any of the \(p\)-values. Which approach would be most appropriate for this study? Briefly explain your reasoning.