As suggested in our previous lab, statisticians will sometimes apply a transformation to their data before analyzing it as part of a hypothesis test or model, with log-transformations being the most common. Statisticians use the term “log” to refer to what most others call the natural logarithm: \[log(X)=t \leftrightarrow X=e^t\] The natural logarithm has a base of \(e\), a mathematical constant with a value of approximately 2.718.
Other popular bases are 2 (Log2) and 10 (Log10). A 1-unit change on the Log2 scale reflects a doubling of values on the original scale, while a 1-unit change on the Log10 scale reflects a 10-fold increase.
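As a quick sanity check (this snippet is not part of the original lab), R's built-in log(), log2(), and log10() functions confirm these relationships:

log(exp(1)) ## the natural log of e equals 1
log2(4) - log2(2) ## doubling a value adds 1 on the Log2 scale
log10(100) - log10(10) ## a 10-fold increase adds 1 on the Log10 scale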
The figure below illustrates how the values \(\{1, 2, 4, 10, 100\}\) appear on these scales:
A few things to notice:

- The value 1 maps to 0 on every log scale, since the log of 1 is 0 in any base.
- Values that differ by a constant multiple are evenly spaced after transformation (for example, 1, 10, and 100 are equally spaced on the Log10 scale).
- Large values are pulled in much closer to the rest of the data, compressing the long right tail.

These properties allow log-transformations to alleviate problems stemming from right-skew and outliers:
## Let's load some right-skewed data
library(dplyr)
library(ggplot2) ## needed for the ggplot() calls below
tsa_eyeglasses <- read.csv("https://remiller1450.github.io/data/tsa.csv") %>%
  filter(Item == "Eyeglasses - (including contact lenses)")

## We use the log() function to create a new log-transformed variable
tsa_eyeglasses$log10_claim = log(tsa_eyeglasses$Claim_Amount, base = 10)

## Graph the distributions of claim amount and log10(claim amount)
ggplot(data = tsa_eyeglasses, aes(x = Claim_Amount)) + geom_histogram(bins = 30) + theme_light()
ggplot(data = tsa_eyeglasses, aes(x = log10_claim)) + geom_histogram(bins = 30) + theme_light()
Another nice feature of logarithms is that differences calculated on the log-scale become ratios on the original scale after the transformation is undone: \[log(X) - log(Y) = log(X/Y) \\
e^{log(X/Y)} = X/Y\] So, if we perform a hypothesis test on a difference in means using log-transformed data, this test is equivalent to a test involving the ratio of means using the non-transformed data.
Note: Because \(\sum_{i=1}^{n}log(x_i)/n \ne log(\sum_{i=1}^{n}x_i/n)\) we actually get a ratio of geometric means (rather than arithmetic means) after reversing a log-transformation. However, this is a technical detail that almost never has any practical importance. The main idea is that by analyzing log-transformed data we can assess relative changes between groups.
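A small numerical illustration of both points (our own toy example, not part of the original lab):

log(200) - log(20) ## equals log(10), about 2.303
log(200/20) ## the same value, confirming log(X) - log(Y) = log(X/Y)
x = c(2, 8, 32) ## hypothetical data
exp(mean(log(x))) ## geometric mean: (2*8*32)^(1/3) = 8
mean(x) ## arithmetic mean: 14, a different quantity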
Our previous lab introduced the “tailgating” data set, which records the results of a driving simulator study where regular users of various types of drugs were asked to follow an erratic lead vehicle to a certain destination. The researchers hypothesized that regular users of harder drugs would be more prone to risky behavior, and thus they would follow the erratic vehicle more closely.
tailgating = read.csv("https://remiller1450.github.io/data/Tailgating.csv")
Recall that the outcome variable in this data set is D, the average following distance of a subject during the simulation. The data set also contains the log-transformed version of this variable, which is recorded as LD.
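Because Part D of Question #1 below depends on LD being the natural logarithm of D, you might verify this yourself (a quick check, not part of the original lab):

all.equal(tailgating$LD, log(tailgating$D)) ## should return TRUE if LD = log(D)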
The tailgating data set includes several large outliers whose average following distances were many times larger than those of most participants in the study:
ggplot(data = tailgating, aes(x = D, y = Drug)) + geom_boxplot()
You should remember from earlier in the semester that both the mean and standard deviation are sensitive to outliers, so naturally hypothesis tests like one-way ANOVA and the two-sample \(T\)-test will also be impacted by these outliers. In the question below you will investigate the role of the large outlier in the THC group.
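To see this sensitivity concretely, here is a toy example (not from the original lab) showing how a single extreme value distorts both statistics:

toy = c(10, 12, 11, 13, 150) ## hypothetical data with one outlier
mean(toy) ## 39.2, pulled far above the typical values
sd(toy) ## roughly 62, heavily inflated by the outlier
mean(toy[toy < 100]) ## 11.5 once the outlier is excluded
sd(toy[toy < 100]) ## roughly 1.3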
Question #1: The code provided below creates a second version of the Tailgating data set using filter() to remove the large outliers whose average following distances exceed 100 feet.

tailgating_no_outlier = tailgating %>% filter(D < 100)

Part A: Use the filter() function to create a subset of the original data set containing the outlier (tailgating) that contains only the THC and MDMA groups. (Recall that we learned about filter() in Lab 4.) Then, use a two-sample \(T\)-test to evaluate whether there is evidence of a difference in the average following distances of regular users of these drugs. Report an appropriate one-sentence conclusion that includes the \(p\)-value and observed sample means in each group (see Lab 4 to review example conclusions).

Part B: Repeat the test from Part A using the version of the data without the outliers (tailgating_no_outlier). Report a one-sentence conclusion that includes the \(p\)-value and observed sample means in each group.

Part C: Repeat the test once more using the log-transformed variable, LD, as the outcome. Noting that these data include the outlier, compare the \(p\)-value of this test with those found in Parts A and B.

Part D: Calculate the difference in sample means of LD and apply the inverse transformation (the exp() function, since the natural logarithm was used to create LD). Provide a brief interpretation of the resulting ratio of means (ie: “Average following distances were 20% higher in the THC group compared to the MDMA group”).
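As a rough sketch of the workflow Parts A and B describe (assuming the Drug column stores the group labels as "THC" and "MDMA"; check unique(tailgating$Drug) for the exact codes in your copy of the data):

two_groups = tailgating %>% filter(Drug %in% c("THC", "MDMA")) ## subset to the two groups
t.test(D ~ Drug, data = two_groups) ## two-sample T-test comparing mean following distances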
Parkinson’s disease is a progressive neurological disorder that impacts movement and coordination and worsens over time. However, if the disease can be identified before neurological symptoms appear, it is possible to slow its progression.
The “Parkinson’s biomarkers” data set records a subset of results from a study seeking to find biomarkers that could identify people in the early stages of Parkinson’s when the disease is most treatable. The researchers collected blood samples from 50 patients with early-stage Parkinson’s, as well as blood samples from 55 controls who did not have the disease but were similar in age.
For each patient, microarrays were used to measure the expression levels of 15,000 genetic features.
parkinsons_biomarkers = read.csv("https://remiller1450.github.io/data/parkinsons_biomarkers.csv")
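Before running thousands of tests, it's sensible to confirm the data loaded as expected (a quick check, not part of the original lab; the column name status comes from the loop in Question #2 below):

dim(parkinsons_biomarkers) ## one row per patient; columns are the genetic features plus status
table(parkinsons_biomarkers$status) ## group sizes for cases and controls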
Question #2: The code provided below performs 15,000
different two-sample \(T\)-tests
comparing the average gene expression levels of cases and controls in
each genetic feature present in the data set. The \(p\)-values of these 15,000 tests are stored
in the object p_values, which you are to use in the
questions that follow. Note: It might take a minute or so to
run this code.
p_values = numeric(15000) ## creates a vector of 15000 zeros to hold the results
for(i in 1:15000){ ## loops through all the genetic features, recording a p-value for each
  p_values[i] = t.test(parkinsons_biomarkers[,i] ~ parkinsons_biomarkers$status)$p.value
}
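Once the loop finishes, a quick inspection (not part of the original lab) can confirm the vector was filled in sensibly:

summary(p_values) ## every value should fall between 0 and 1
sum(p_values < 0.05) ## count of unadjusted p-values below 0.05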
Create a histogram of these 15,000 \(p\)-values using 25 bins, adding +geom_hline(yintercept = 15000/25) to display what a uniform distribution (the pattern expected if none of the genetic features differed between cases and controls) would look like. Hint: The object p_values was not created as a data frame, so you may want to use the data.frame() function to create a data frame containing it that you can give as an input to ggplot().
The p.adjust() function accepts a vector of \(p\)-values and applies an adjustment method to control either the family-wise error rate or the false discovery rate.
The code below demonstrates how to apply the Bonferroni correction to obtain adjusted \(p\)-values. Recall that these \(p\)-values are simply the original, unadjusted \(p\)-values multiplied by the number of hypothesis tests (15,000 in our case) and capped at 1, and they can be compared against a typical significance threshold (such as \(\alpha = 0.05\)) to control the family-wise type I error rate at that level.
adj_p_values = p.adjust(p_values, method = "bonferroni")
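You can confirm this relationship directly, since R's Bonferroni adjustment is exactly the \(p\)-value times the number of tests, capped at 1:

all.equal(adj_p_values, pmin(1, 15000 * p_values)) ## should return TRUE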
We could change the method argument to method = "fdr" to obtain a set of adjusted \(p\)-values that control the false discovery rate; these are compared against a chosen threshold that represents the tolerated proportion of false discoveries rather than a family-wise error rate.
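For example (the object name fdr_p_values is our own choice):

fdr_p_values = p.adjust(p_values, method = "fdr") ## Benjamini-Hochberg adjustment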
Question #3: For this question you should use the \(p\)-values you found for the Parkinson’s biomarkers data in the previous section of the lab. After applying an adjustment method with p.adjust(), you can count the number of genetic features deemed statistically significant using sum(adj_p_values <= alpha), where alpha is the target error rate.