\(~\)
Statisticians will sometimes apply a transformation to their data before analyzing it as part of a hypothesis test or model, with log-transformations being the most common. Statisticians use the term “log” to refer to what most others call the natural logarithm: \[\log(X)=t \leftrightarrow X=e^t\] The natural logarithm has a base of \(e\), a mathematical constant with a value of approximately 2.718.
Other popular bases are 2 (Log2) and 10 (Log10). A 1-unit change on the Log2 scale reflects a doubling of values on the original scale, while a 1-unit change on the Log10 scale reflects a 10-fold increase.
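As a quick illustration (using arbitrary numbers), base R's `log()`, `log2()`, and `log10()` functions can verify these relationships:

```r
## A doubling on the original scale is a 1-unit change on the log2 scale
log2(8) - log2(4)    # 1

## A 10-fold increase is a 1-unit change on the log10 scale
log10(1000) - log10(100)    # 1

## "log" without a base refers to the natural logarithm (base e)
log(exp(1))    # 1
```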
The figure below illustrates how the values \(\{1, 2, 4, 10, 100\}\) appear on these scales. A few things to notice: values that are far apart on the original scale are pulled much closer together, and equal distances on the log scale correspond to equal multiplicative changes on the original scale. These properties make log-transformations an attractive tool for reducing the impact of right skew and outliers:
```r
## Let's load some right-skewed data
library(dplyr)
library(ggplot2)

tsa_eyeglasses <- read.csv("https://remiller1450.github.io/data/tsa.csv") %>%
  filter(Item == "Eyeglasses - (including contact lenses)")

## We use the log() function to create a new log-transformed variable
tsa_eyeglasses$log10_claim = log(tsa_eyeglasses$Claim_Amount, base = 10)

## Graph the distributions of claim amount and log10(claim amount)
ggplot(data = tsa_eyeglasses, aes(x = Claim_Amount)) + geom_histogram(bins = 30) + theme_light()
ggplot(data = tsa_eyeglasses, aes(x = log10_claim)) + geom_histogram(bins = 30) + theme_light()
```
Another nice feature of logarithms is that differences calculated on the log-scale become ratios on the original scale after the transformation is undone: \[\log(X) - \log(Y) = \log(X/Y)\] \[e^{\log(X/Y)} = X/Y\] So, if we perform a hypothesis test on a difference in means using log-transformed data, this test is equivalent to a test involving the ratio of means using the non-transformed data.
Note: Because \(\sum_{i=1}^{n}\log(x_i)/n \ne \log(\sum_{i=1}^{n}x_i/n)\), we actually get a ratio of geometric means rather than arithmetic means when undoing a log-transformation. This is a technical detail that is seldom practically important. The main idea is that by analyzing log-transformed data we can assess relative changes between groups.
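To make this concrete, here is a small sketch with made-up numbers showing that undoing a difference of log-means yields the ratio of geometric means:

```r
## Hypothetical data: every value in x is exactly double its counterpart in y
x <- c(2, 8, 32)
y <- c(1, 4, 16)

## Difference of means on the log scale, undone with exp()
exp(mean(log(x)) - mean(log(y)))    # 2

## This equals the ratio of geometric means
geo_mean <- function(v) exp(mean(log(v)))
geo_mean(x) / geo_mean(y)           # 8 / 4 = 2

## Note that the mean of the logs is NOT the log of the mean
mean(log(x)) == log(mean(x))        # FALSE
```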
\(~\)
This lab will use the “Tailgating” data set, which was introduced in our lecture on decision errors and false discoveries. This data set contains three columns:

- `Drug` - the hardest substance that subject regularly consumed
- `D` - the subject’s average following distance in the simulated drive

```r
## Libraries
library(dplyr)
library(ggplot2)

## Data used in the lab
tailgating = read.csv("https://remiller1450.github.io/data/Tailgating.csv")
```
\(~\)
In a recent lecture, several large outliers were omitted from the visualizations of the Tailgating data. Here’s what the data look like when those outliers are included:
```r
ggplot(data = tailgating, aes(x = D, y = Drug)) + geom_boxplot()
```
Recall that the \(t\)-test is used to compare the means of two groups, and the standard error used in the test statistic involves the sample standard deviation. We saw early in the semester that both the mean and standard deviation are sensitive to outliers, so what role is the large outlier in the THC group playing in the hypothesis tests involving that group?
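As a reminder of that sensitivity, here is a small sketch using made-up following distances, showing how a single large outlier shifts both the mean and the standard deviation:

```r
## Hypothetical following distances (feet)
d <- c(10, 12, 14, 16, 18)
mean(d)    # 14
sd(d)      # about 3.16

## The same data with one large outlier added
d_out <- c(d, 200)
mean(d_out)    # 45
sd(d_out)      # about 76
```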
Question #1: The code provided below creates a second version of the Tailgating data set using `filter()` to remove the large outliers whose average following distances exceed 100 feet.

```r
tailgating_filtered = tailgating %>% filter(D < 100)
```
Use the `t.test()` function on the original data set (which contains outliers) to evaluate whether there is statistical evidence of a difference in the mean following distance of drivers in the THC and MDMA groups. Report the \(p\)-value of your test and a 1-sentence conclusion. Hint: You might consider using the command `filter(Drug %in% c("MDMA","THC"))` to create a version of these data that only contains members of the MDMA and THC groups.

\(~\)
Question #2: Create a log-transformed version of the following distance variable (`D`). Next, use side-by-side box plots, similar to those seen earlier in this section, to describe the impact of the log-transformation on the outliers present in these data. Then, use `t.test()` to perform a hypothesis test comparing the mean log-transformed following distances in the THC and MDMA groups. Report the \(p\)-value of your test and a 1-sentence conclusion.

\(~\)
When we first studied confidence intervals and probability models we briefly discussed the circumstances where the Normal and \(t\)-distributions served as reasonable probability models. For hypothesis tests involving a proportion or a difference in proportions, we could handle situations where a Normal probability model was unreasonable by using an exact test:

- `binom.test()` performs an exact test for a single proportion using the binomial distribution
- `fisher.test()` performs an exact test for a difference in proportions using the hypergeometric distribution

It’s beyond the scope of this course to learn about those probability models in detail, but the thing to know is that we have an exact test that we can use for those circumstances that doesn’t require any assumptions about our sample size.
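As a quick sketch (with made-up numbers, not data from this course), suppose 3 successes were observed in 10 trials and we want to test \(H_0: p = 0.5\). The sample is far too small for a Normal model, but `binom.test()` requires no such assumption:

```r
## Exact binomial test: 3 successes in 10 trials, null proportion 0.5
res <- binom.test(x = 3, n = 10, p = 0.5)
res$p.value    # 0.34375, so no evidence against p = 0.5 here
```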
In contrast, when conditions required to reasonably use the \(t\)-distribution are violated for tests involving a single mean (one-sample \(t\)-test) or a difference in means (two-sample \(t\)-test) there is no exact test we can fall back on. Instead, statisticians have developed “distribution free” or “non-parametric” testing approaches for scenarios where it is unreasonable to assume Normality.
\(~\)
The Wilcoxon Signed-Rank Test is a non-parametric analog to the one-sample \(t\)-test for a single mean. It is most often used to evaluate whether the median difference in a paired design is zero.
Consider the “Oatbran” study, which used a paired design to evaluate changes in LDL cholesterol for two different diets.
The Wilcoxon Signed-Rank Test ranks each data-point based upon its absolute value, then data-points are grouped according to their sign (positive or negative). The test then compares a test statistic that is the sum of each data-point’s rank multiplied by its sign against a null distribution to produce a \(p\)-value.
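This procedure can be sketched by hand with a few made-up paired differences and checked against the statistic R reports (R's `V` is the sum of the positive ranks, which carries the same information as the signed sum):

```r
## Hypothetical paired differences
d <- c(2, -1, 4, 3, -5)

## Rank by absolute value, then attach each difference's sign
signed_ranks <- rank(abs(d)) * sign(d)
sum(signed_ranks)    # 2 - 1 + 4 + 3 - 5 = 3

## R reports V, the sum of the positive ranks
sum(signed_ranks[signed_ranks > 0])    # 2 + 4 + 3 = 9
wilcox.test(d, mu = 0)$statistic       # V = 9
```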
The `wilcox.test()` function is used to perform the Wilcoxon Signed-Rank Test. It operates analogously to `t.test()`:

```r
oatbran = read.csv("https://remiller1450.github.io/data/Oatbran.csv")

## Perform the test
wilcox.test(x = oatbran$Difference, mu = 0)
```

```
## 
##  Wilcoxon signed rank exact test
## 
## data:  oatbran$Difference
## V = 93, p-value = 0.008545
## alternative hypothesis: true location is not equal to 0
```
We will not cover the form of the Wilcoxon Signed-Rank Test’s test statistic or null distribution in detail. Instead, you should know when to use the test (in place of a one-sample \(t\)-test when the conditions needed to rely upon the \(t\)-distribution as a probability model are not met) and how to interpret the test’s results.
\(~\)
The Wilcoxon Rank Sum Test (also known as the Mann-Whitney U-test) is a non-parametric analog to the two-sample \(t\)-test for a difference in means.

The null hypothesis is that both groups follow the same distribution, and the alternative is that the groups follow distinct distributions. The test pools the observations from both groups, ranks them together, and compares the sum of the ranks belonging to one group against a null distribution to produce a \(p\)-value.
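This procedure can be sketched with made-up data from two small groups (R's `W` is the rank sum of the first group minus its minimum possible value):

```r
## Hypothetical measurements for two groups
a <- c(1.2, 3.4, 5.6)
b <- c(2.1, 4.3)

## Pool and rank all observations together
r <- rank(c(a, b))
sum(r[1:3])    # rank sum for group a: 1 + 3 + 5 = 9

## R reports W = (rank sum of group a) - n_a(n_a + 1)/2
wilcox.test(a, b)$statistic    # W = 9 - 6 = 3
```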
The test can be performed using the `wilcox.test()` function:

```r
wetsuits = read.csv("https://remiller1450.github.io/data/Wetsuits_long.csv")

## Perform the test
wilcox.test(Velocity ~ Condition, mu = 0, data = wetsuits)
```

```
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Velocity by Condition
## W = 48.5, p-value = 0.1838
## alternative hypothesis: true location shift is not equal to 0
```
Note: Similar to `t.test()`, the `wilcox.test()` function will determine whether to perform a difference in groups (two-sample) or a single group (one-sample) test based upon the way in which you provide the data. The signed rank test expects a single vector provided as the `x` argument, while the rank sum test can be done using formula notation.
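A brief sketch with made-up numbers shows how the two calling conventions dispatch to different tests:

```r
## A single vector of (hypothetical) paired differences -> signed rank test
d <- c(1.1, -0.5, 2.3, 0.8)
wilcox.test(x = d, mu = 0)$method    # a signed rank test

## Formula notation with (hypothetical) grouped data -> rank sum test
df <- data.frame(val = c(1.2, 3.4, 2.1, 4.3),
                 grp = c("a", "a", "b", "b"))
wilcox.test(val ~ grp, data = df)$method    # a rank sum test
```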
Question #3: In a 1982 study, researchers had 12 subjects participate in a visual motor task where they had to steer a pencil along a moving track. Each subject was tested on two different tracks: a straight track and an oscillating one. The researchers were not interested in how well the participants kept their pencil on the track, but rather their blink rate (measured in blinks per minute) under each condition. The columns `Straight` and `Oscillating` record each subject’s blink rate in each of these conditions.

Perform the hypothesis test you deem most appropriate for these data, and report the `R` output of the test along with a 1-sentence conclusion. Hint: you may need to create a new variable to perform the test you deem appropriate.

```r
blink = read.csv("https://remiller1450.github.io/data/blink.csv")
```
Question #4:
\(~\)
In our previous lab you practiced deciding upon the appropriate hypothesis test for a research question. This section will expand upon that by requiring you to decide whether a traditional parametric test (i.e., a \(t\)-test) or an exact or non-parametric test (i.e., an exact binomial test or Wilcoxon test) should be used.
The table below summarizes all of the tests that we’ve learned about so far, as well as the conditions needed to trust the probability model involved in the parametric tests:
| Explanatory Variable | Outcome Variable | Parametric Test | Conditions | Nonparametric or Exact Alternative |
|---|---|---|---|---|
| Binary Categorical | Binary Categorical | `prop.test()` - two-sample Z-test | 10 cases in each cell of the 2x2 contingency table | `fisher.test()` - Fisher’s Exact Test |
| Binary Categorical | Quantitative | `t.test()` - two-sample T-test | Normal data or 30 cases per group | `wilcox.test()` - Rank Sum Test |
| none | Binary Categorical | `prop.test()` - one-sample Z-test | 10 cases experiencing each outcome | `binom.test()` - Exact Binomial Test |
| none | Quantitative | `t.test()` - one-sample T-test | Normal data or at least 30 cases | `wilcox.test()` - Signed Rank Test |
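The sample-size conditions in the table can be checked with a quick tabulation before choosing a test. Here is a minimal sketch using a hypothetical data frame (the variable names are made up for illustration):

```r
## Hypothetical data: a binary group and a quantitative outcome
dat <- data.frame(group = rep(c("A", "B"), times = c(35, 12)),
                  outcome = rnorm(47))

## Count cases per group to check the "30 cases per group" condition
table(dat$group)
## Group B has only 12 cases, so unless the data look Normal,
## wilcox.test() would be safer than t.test() here
```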
For the following questions you are expected to choose an appropriate test, briefly check the conditions needed to trust a parametric test, and report your test's \(p\)-value along with a 1-sentence conclusion.
\(~\)
Question #5: Cloud seeding is a type of weather modification that aims to change the amount or type of precipitation that falls from clouds. The method works by dispersing a chemical into the air that alters the microphysical processes within the cloud. Whether or not cloud seeding produces a statistically significant increase in precipitation is an ongoing debate. In this question, you will analyze data from an experiment where clouds were randomly assigned to receive seeding (or not). You should evaluate whether there is statistical evidence that seeding (yes/no) influences rainfall (inches).
```r
clouds = read.csv("https://remiller1450.github.io/data/clouds.csv")
```
\(~\)
Question #6: The “HairEyeColor” data set records the hair color, eye color, and sex of a sample of 592 students from the University of Delaware taken in 1974. Using these data, evaluate whether there is statistical evidence that the proportion of individuals with green eyes is higher among individuals with brown hair than it is among individuals with black hair. Hint: To simplify the steps of your analysis, you might choose to use the `filter()` function to only include the relevant combinations of hair and eye color.

```r
hair = read.csv("https://remiller1450.github.io/data/HairEyeColor.csv")
```