When analyzing data, we need ways to describe relationships between variables.
If two variables share a relationship, we say these variables are associated. We can describe the association using descriptive statistics.
For now, we focus on finding and describing associations between two categorical variables, starting from raw data that look like this:
Name | Type | Selectivity |
---|---|---|
Capitol Technology University | Private | Not Selective |
Bard College | Private | Not Selective |
School of the Art Institute of Chicago | Private | Not Selective |
Lyon College | Private | Selective |
Morningside College | Private | Not Selective |
DeSales University | Private | Not Selective |
Wilberforce University | Private | Not Selective |
California Institute of the Arts | Private | Selective |
These data contain 2 categorical variables:
- Type: whether a college is public or private
- Selectivity: whether a college admits fewer than 50% of applicants (“selective”) or more than 50% of applicants (“not selective”)

The first step in determining whether these variables are associated is to create a two-way frequency table:
 | Not Selective | Selective |
---|---|---|
Private | 689 | 158 |
Public | 163 | 27 |
This table shows us the frequencies of every combination of our two categorical variables. Because both categorical variables are binary, we can analyze the relationship between them using methods for contingency tables, which take the format:
Generic contingency table:
 | Event | No Event |
---|---|---|
Exposure | A | B |
No Exposure | C | D |
Note that:
In our example, we might suspect that a college being “Private” is associated with an increased likelihood of the college being “Selective”, so we could opt to reorganize our two-way frequency table:
 | Selective | Not Selective |
---|---|---|
Private | 158 | 689 |
Public | 27 | 163 |
In this rearranged table, the value “Private” is the exposure of interest, and the value “Selective” is the outcome of interest.
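As a minimal sketch, the rearranged table can be entered in R directly from the four frequencies above. The object name `tab` is our own choice for illustration and is not defined elsewhere in these notes. If the raw data were available as a data frame, applying `table()` to its two categorical columns would be the usual way to get the same counts.

```r
# Enter the four frequencies row by row:
# exposure groups go in the rows, the outcome goes in the columns
tab <- matrix(c(158, 689,
                 27, 163),
              nrow = 2, byrow = TRUE,
              dimnames = list(Type = c("Private", "Public"),
                              Selectivity = c("Selective", "Not Selective")))

tab            # the 2x2 contingency table
rowSums(tab)   # total number of private and public colleges
```

Placing the outcome of interest ("Selective") in the first column and the exposure of interest ("Private") in the first row matches the layout of the generic contingency table.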
The frequencies in a contingency table are used to calculate the risk difference, relative risk, or odds ratio depending upon the application and how the data were collected. For illustrative purposes we’ll calculate all of them for this table, then we’ll discuss when each should be preferred.
The “risk” of a private college being selective is: \(\tfrac{158}{158+689}=0.187\)
The “risk” of a public college being selective is: \(\tfrac{27}{27+163}=0.142\)
The risk difference is \(0.187 - 0.142 = 0.045\). Whether this difference is large enough for us to decide that these variables are associated can be somewhat complicated. We’ll ultimately consider two criteria:
\(~\)
We already have the two risks needed to calculate the relative risk, which is found by:
\[\text{Relative Risk} = \tfrac{\text{Risk among group 1}}{\text{Risk among group 2}} = \tfrac{0.187}{0.142} = 1.32\]
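The risks, risk difference, and relative risk above can be reproduced with a few lines of R. This is only a sketch: the variable names (`selective`, `not_selective`, `risk`, and so on) are illustrative choices, and the counts are typed in by hand from the rearranged table.

```r
# Frequencies from the rearranged table, one entry per college type
selective     <- c(Private = 158, Public = 27)
not_selective <- c(Private = 689, Public = 163)

# Risk = proportion of each group that is selective
risk <- selective / (selective + not_selective)
round(risk, 3)

# Risk difference and relative risk, comparing private to public colleges
risk_diff <- risk[["Private"]] - risk[["Public"]]
rel_risk  <- risk[["Private"]] / risk[["Public"]]
round(c(risk_diff = risk_diff, rel_risk = rel_risk), 3)
```

Because R carries the unrounded risks through the calculation, it reports a risk difference of about 0.044 and a relative risk of about 1.31. The hand calculations give 0.045 and 1.32 only because the risks were rounded to three decimals first.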
Interpreting a relative risk requires less domain expertise than interpreting a risk difference:
Relative Risk (RR) | Interpretation |
---|---|
RR = 1 | There is no difference in risk between the two groups. |
RR = 1.2 | The event is 20% more likely to occur in the first group than in the second group. This is often considered a small increased risk. |
RR = 1.5 | The event is 50% more likely to occur in the first group than in the second group. This is often considered a moderate increased risk. |
RR = 2 | The event is twice as likely to occur in the first group as in the second group. This is often considered a large increased risk. |
RR = 0.6 | The event is 40% less likely to occur in the first group than in the second group. This is often considered a moderate decreased risk. |
One reason for this is that relative risk is more appropriate for rare events, though it is also reasonable to use for common events.
\(~\)
The “odds” of a private college being selective is: \(\tfrac{158}{689}=0.229\)
The “odds” of a public college being selective is: \(\tfrac{27}{163}=0.166\)
Thus the odds ratio is:
\[\text{Odds Ratio} = 0.229/0.166 = 1.38\]
Notice that we could calculate this directly from the two-way frequency table:
\[\text{Odds Ratio} = \tfrac{158 \times 163}{27 \times 689} = 1.38\]
Finally, you should notice that the odds ratio and relative risk are close to each other. This is not a coincidence: the rarer the event of interest, the closer the odds ratio and the relative risk will be.
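Both routes to the odds ratio can be checked in R. The sketch below labels the four cells A, B, C, and D to match the generic contingency table; the other names are illustrative.

```r
# Cell labels follow the generic contingency table
A <- 158; B <- 689   # Private: selective, not selective
C <- 27;  D <- 163   # Public:  selective, not selective

# Route 1: ratio of the two odds
odds_private <- A / B
odds_public  <- C / D
or_from_odds <- odds_private / odds_public

# Route 2: cross-product of the table's cells
or_cross_prod <- (A * D) / (B * C)

all.equal(or_from_odds, or_cross_prod)   # checks that the two routes agree
round(or_cross_prod, 2)
```

The cross-product form is convenient because it works straight from the frequencies, with no intermediate rounding.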
\(~\)
When deciding which descriptive statistic to use, you must consider how the data were collected. More specifically, you should be aware of 3 different data collection approaches:
In summary:
\(~\)
In 1992 an unknown illness struck several residents in a small Alaskan community. The CDC suspected fluoride poisoning in one of the town’s water sources. The data they analyzed consisted of 38 individuals with the illness, among whom 33 drank from the suspected water source. They also surveyed 54 residents who did not have symptoms, of whom only 8 reported drinking from the suspected water source.
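One way to organize these counts is sketched below. Individuals in this example were selected according to whether or not they were ill, so we assume here that the odds ratio is the statistic of interest. The object names (`fluoride`, `odds_ratio`, and the count variables) are our own.

```r
# Counts stated in the description
ill_drank     <- 33; ill_total     <- 38
not_ill_drank <- 8;  not_ill_total <- 54

# The remaining two cells follow by subtraction
ill_no_drink     <- ill_total - ill_drank           # 5
not_ill_no_drink <- not_ill_total - not_ill_drank   # 46

# Rows: exposure (water source); columns: outcome (illness)
fluoride <- matrix(c(ill_drank,    not_ill_drank,
                     ill_no_drink, not_ill_no_drink),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Source = c("Drank", "Did not drink"),
                                   Illness = c("Ill", "Not ill")))

# Odds ratio via the cross-product of the table's cells
odds_ratio <- (fluoride[1, 1] * fluoride[2, 2]) /
              (fluoride[1, 2] * fluoride[2, 1])
odds_ratio
```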
\(~\)
A major US university compiled admissions data after facing accusations of discrimination against female graduate school applicants. The data below record the outcomes of graduate school applicants to the university’s largest graduate programs (departments). Note that the data shared below are actually a simulated replicate of the original data, which are not publicly available.
adm = read.csv("https://remiller1450.github.io/data/admissions.csv")
\(~\)
A research group surveyed 665 licensed drivers, asking for the color of their car and if they’d had a speeding ticket in the last year. There were 150 drivers who drove a red car, and 15 of them reported getting a speeding ticket. In comparison, 515 drivers did not drive a red car and 45 of this group had reported a speeding ticket.
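The survey counts translate directly into risks. The following is a rough sketch with illustrative variable names, treating "drives a red car" as the exposure and "reported a speeding ticket" as the outcome.

```r
# Counts stated in the description
ticket    <- c(Red = 15,  NotRed = 45)    # drivers reporting a ticket
n_drivers <- c(Red = 150, NotRed = 515)   # drivers surveyed in each group

# Risk of a speeding ticket within each group
risk <- ticket / n_drivers
round(risk, 3)

# Risk difference and relative risk, red relative to not red
risk_diff <- risk[["Red"]] - risk[["NotRed"]]
rel_risk  <- risk[["Red"]] / risk[["NotRed"]]
round(c(risk_diff = risk_diff, rel_risk = rel_risk), 3)
```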