Overview

When analyzing data, we need ways to describe relationships between variables:

  • Data visualizations provide a way to qualitatively describe relationships
  • Descriptive statistics provide a way to quantitatively describe relationships

If two variables share a relationship, we say these variables are associated. We can describe the association using descriptive statistics.

Right now we’re focusing on finding and describing associations between two categorical variables, or raw data that looks like this:

| Name | Type | Selectivity |
|---|---|---|
| Capitol Technology University | Private | Not Selective |
| Bard College | Private | Not Selective |
| School of the Art Institute of Chicago | Private | Not Selective |
| Lyon College | Private | Selective |
| Morningside College | Private | Not Selective |
| DeSales University | Private | Not Selective |
| Wilberforce University | Private | Not Selective |
| California Institute of the Arts | Private | Selective |

These data contain 2 categorical variables:

  1. Type - whether a college is public or private
  2. Selectivity - whether a college admits fewer than 50% of applicants (“selective”) or more than 50% of applicants (“not selective”)

The first step in determining whether these variables are associated is to create a two-way frequency table:

| | Not Selective | Selective |
|---|---|---|
| Private | 689 | 158 |
| Public | 163 | 27 |
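In R, a two-way frequency table like this one can be produced with `table()`. The sketch below is only an illustration under stated assumptions: the full raw data set is not reproduced in this document, so a stand-in data frame (hypothetically named `colleges`) is rebuilt from the four counts shown above.

```r
# Rebuild one row per college from the four counts in the table above.
# `colleges` is a hypothetical stand-in for the raw data, not the original file.
colleges <- data.frame(
  Type = rep(c("Private", "Public"), times = c(689 + 158, 163 + 27)),
  Selectivity = c(
    rep(c("Not Selective", "Selective"), times = c(689, 158)),  # private colleges
    rep(c("Not Selective", "Selective"), times = c(163, 27))    # public colleges
  )
)

# Cross-tabulate the two categorical variables
# (rows = Type, columns = Selectivity, levels in alphabetical order)
freq_table <- table(colleges$Type, colleges$Selectivity)
print(freq_table)
```

With the actual raw data loaded, the same `table()` call on the two columns should give the same counts.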

This table shows us the frequencies of every combination of our two categorical variables. Because both categorical variables are binary, we can analyze the relationship between them using methods for contingency tables, which take the format:

Generic contingency table:

| | Event | No Event |
|---|---|---|
| Exposure | A | B |
| No Exposure | C | D |

Note that:

  • “Exposure”/“No Exposure” are categories of the binary explanatory variable.
  • “Event”/“No Event” are the categories of the binary response variable.

In our example, we might suspect that a college being “Private” is associated with an increased likelihood of the college being “Selective”, so we could opt to reorganize our two-way frequency table:

| | Selective | Not Selective |
|---|---|---|
| Private | 158 | 689 |
| Public | 27 | 163 |

In this rearranged table, the value “Private” is the exposure of interest, and the value “Selective” is the outcome of interest.
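The rearrangement described above amounts to reordering rows and columns by name. A minimal sketch in R, assuming the counts are entered by hand as a matrix (the object names `tab` and `contingency` are illustrative):

```r
# Counts from the original two-way frequency table, entered row by row
tab <- matrix(
  c(689, 158,
    163,  27),
  nrow = 2, byrow = TRUE,
  dimnames = list(Type = c("Private", "Public"),
                  Selectivity = c("Not Selective", "Selective"))
)

# Put the exposure of interest ("Private") in the first row and the
# outcome of interest ("Selective") in the first column
contingency <- tab[c("Private", "Public"), c("Selective", "Not Selective")]
print(contingency)
```

Indexing by name rather than by position makes it harder to swap the wrong cells by accident.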

Describing an association

The frequencies in a contingency table are used to calculate the risk difference, relative risk, or odds ratio depending upon the application and how the data were collected. For illustrative purposes we’ll calculate all of them for this table, then we’ll discuss when each should be preferred.

  1. Risk difference
  • The difference in risk (i.e., the relative frequency of the outcome of interest, “Selective”) between the groups created by the explanatory variable (i.e., “Private” and “Public” colleges).

The “risk” of a private college being selective is: \(\tfrac{158}{158+689}=0.187\)

The “risk” of a public college being selective is: \(\tfrac{27}{27+163}=0.142\)

  • You should notice that these are row proportions, or proportions that condition on the variable “Type”.

The risk difference is \(0.187 - 0.142 = 0.045\). Whether this difference is large enough for us to decide that these variables are associated can be somewhat complicated. We’ll ultimately consider two criteria:

  1. Practical significance - based upon what we know about college admissions, is an absolute difference of 0.045 large enough to mean something?
  2. Statistical significance - a topic we’ll discuss extensively this semester after our first exam
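The two risks and the risk difference above can be reproduced in R using row proportions. This is a minimal sketch that assumes the contingency table is entered by hand as a matrix. Note that when the unrounded risks are subtracted the difference comes out to roughly 0.044, slightly different from the 0.045 obtained by subtracting the rounded risks.

```r
# Rearranged contingency table: exposure in rows, outcome in columns
contingency <- matrix(
  c(158, 689,
     27, 163),
  nrow = 2, byrow = TRUE,
  dimnames = list(Type = c("Private", "Public"),
                  Selectivity = c("Selective", "Not Selective"))
)

# Row proportions: each count divided by its row total (conditions on Type)
row_props <- prop.table(contingency, margin = 1)

# Risk = proportion of "Selective" colleges within each Type
risk_private <- row_props["Private", "Selective"]
risk_public  <- row_props["Public", "Selective"]

# Risk difference, computed from the unrounded risks
risk_difference <- risk_private - risk_public
round(c(private = risk_private, public = risk_public,
        difference = risk_difference), 3)
```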

\(~\)

  2. Relative risk (Risk ratio)

We already have the two risks necessary to calculate relative risk, which is found by:

\[\text{Relative Risk} = \tfrac{\text{Risk among group 1}}{\text{Risk among group 2}} = \tfrac{0.187}{0.142} = 1.32\]
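The same ratio can be computed in R directly from the counts. This is a sketch with illustrative variable names. Because R keeps the unrounded risks, the result is about 1.31, whereas dividing the rounded risks (0.187 and 0.142) gives 1.32.

```r
# Risk of being "Selective" within each group, from the contingency table counts
risk_private <- 158 / (158 + 689)
risk_public  <- 27 / (27 + 163)

# Relative risk: risk in group 1 (Private) divided by risk in group 2 (Public)
relative_risk <- risk_private / risk_public
round(relative_risk, 2)
```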

Interpreting a relative risk requires less domain expertise than interpreting a risk difference:

| Relative Risk (RR) | Interpretation |
|---|---|
| RR = 1 | There is no difference in risk between the two groups. |
| RR = 1.2 | The event is 20% more likely to occur in the first group than in the second group. This is often considered a small increased risk. |
| RR = 1.5 | The event is 50% more likely to occur in the first group than in the second group. This is often considered a moderate increased risk. |
| RR = 2 | The event is twice as likely to occur in the first group as in the second group. This is often considered a large increased risk. |
| RR = 0.6 | The event is 40% less likely to occur in the first group than in the second group. This is often considered a moderate decreased risk. |
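The percentages in these interpretations come from the arithmetic \((\text{RR} - 1) \times 100\). As a small sketch, the hypothetical helper below (`rr_percent_change` is not part of any package) applies that formula to the example values:

```r
# Hypothetical helper: percent change in risk implied by a relative risk.
# Positive values mean "more likely" in the first group, negative "less likely".
rr_percent_change <- function(rr) {
  (rr - 1) * 100
}

rr_values <- c(1, 1.2, 1.5, 2, 0.6)
data.frame(RR = rr_values, percent_change = rr_percent_change(rr_values))
```

An RR of 2 corresponds to a 100% increase, which is the same statement as "twice as likely".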

One reason for this is that relative risk is more appropriate for rare events, though it is also reasonable to use for common events.

\(~\)

  3. Odds Ratio
  • Odds describe how many times an event is expected to occur relative to how many times the event is not expected to occur
    • An odds of “3” indicates an event would be expected to occur 3 times for every 1 time it did not occur
  • An odds ratio is a relative comparison of the odds of an event across two groups

The “odds” of a private college being selective is: \(\tfrac{158}{689}=0.229\)

The “odds” of a public college being selective is: \(\tfrac{27}{163}=0.166\)

Thus the odds ratio is:

\[\text{Odds Ratio} = 0.229/0.166 = 1.38\]

Notice that we could calculate this directly from the two-way frequency table:

\[\text{Odds Ratio} = \tfrac{158 \times 163}{27 \times 689} = 1.38\]

Finally, you should notice that the odds ratio and relative risk are close to each other. This is not a coincidence: the rarer the event of interest, the closer the odds ratio and the relative risk will be.
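Both routes to the odds ratio can be checked in R, since \(\tfrac{A/B}{C/D} = \tfrac{A \times D}{B \times C}\) in the generic table's notation. The sketch below also illustrates the point about rare events using two sets of risks that are invented purely for illustration (they are not from the college data); each pair has a relative risk of exactly 2.

```r
# Odds of being "Selective" within each group
odds_private <- 158 / 689
odds_public  <- 27 / 163

# Odds ratio two ways: ratio of odds, and the cross-product of the table cells
odds_ratio    <- odds_private / odds_public
cross_product <- (158 * 163) / (27 * 689)
round(c(ratio_of_odds = odds_ratio, cross_product = cross_product), 2)

# Convert a risk (proportion) into odds
odds <- function(p) p / (1 - p)

# Hypothetical risks, both pairs with relative risk = 2
or_rare   <- odds(0.02) / odds(0.01)  # rare event: odds ratio stays close to 2
or_common <- odds(0.60) / odds(0.30)  # common event: odds ratio is far from 2
round(c(rare = or_rare, common = or_common), 2)
```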

\(~\)

Which descriptive statistic should be used?

When deciding which descriptive statistic to use, you must consider how the data were collected. More specifically, you should be aware of 3 different data collection approaches:

  1. Prospective - recruit subjects, then follow them over time to see who gets exposed and who experiences the event.
  2. Cross-sectional - recruit everyone, then determine who was exposed and who has experienced the event.
  3. Case-control - recruit only those who experienced the event (cases) and ask about their exposure status, then do the same for an equivalent group who did not experience the event (controls).

  • We can use any measure (risk difference, relative risk, or odds ratio) for prospective or cross-sectional designs.
    • This is because the exposure groups and events are observed in proportion to how they naturally occur in the population we’re studying.
  • Case-control designs are popular for rare outcomes
    • However, because the outcomes are not observed in proportion to how they naturally occur, we cannot use this design to estimate any risk-based descriptive statistics (i.e., we cannot use risk difference or relative risk).
    • Odds ratios are an appropriate choice for this design
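To see why the odds ratio survives a case-control design while the relative risk does not, consider the idealized sketch below. All counts are invented for illustration, and the "sample" keeps every case but exactly 10% of the non-cases in each exposure group, ignoring sampling variability.

```r
# Hypothetical population (invented counts): exposure in rows, event in columns
population <- matrix(
  c(20, 980,
    10, 990),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Exposure", "No Exposure"), c("Event", "No Event"))
)

# Relative risk and odds ratio for a 2x2 table laid out as above
relative_risk <- function(tab) {
  (tab[1, 1] / sum(tab[1, ])) / (tab[2, 1] / sum(tab[2, ]))
}
odds_ratio <- function(tab) {
  (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
}

# Idealized case-control sample: all cases, but only 10% of the non-cases
case_control <- population
case_control[, "No Event"] <- population[, "No Event"] * 0.10

# The relative risk changes with the sampling scheme; the odds ratio does not
c(rr_population = relative_risk(population),
  rr_case_control = relative_risk(case_control))
c(or_population = odds_ratio(population),
  or_case_control = odds_ratio(case_control))
```

Scaling the "No Event" column multiplies both B and D by the same factor, which cancels in \(\tfrac{A \times D}{B \times C}\) but not in the risks.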

In summary:

  • Use risk difference for prospective or cross-sectional data where the event of interest is relatively common and it’s easy to interpret an absolute difference between groups.
  • Use relative risk for prospective or cross-sectional data where the event of interest is relatively rare and interpreting an absolute difference requires a lot of domain-specific knowledge.
  • Use odds ratios any time you’d use relative risk, and use them exclusively if the data come from a case-control design.
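For convenience, the three measures can be bundled into one function. The helper below is a hypothetical sketch (`describe_2x2` is not from any package) and assumes a 2x2 matrix with the exposure of interest in row 1 and the event of interest in column 1.

```r
# Hypothetical helper: all three descriptive statistics for a 2x2 table.
# Assumes row 1 = exposure of interest, column 1 = event of interest.
describe_2x2 <- function(tab) {
  stopifnot(all(dim(tab) == c(2, 2)))
  risk_exposed   <- tab[1, 1] / sum(tab[1, ])
  risk_unexposed <- tab[2, 1] / sum(tab[2, ])
  c(
    risk_difference = risk_exposed - risk_unexposed,
    relative_risk   = risk_exposed / risk_unexposed,
    odds_ratio      = (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
  )
}

# College example: rows = Private, Public; columns = Selective, Not Selective
colleges <- matrix(c(158, 689,
                      27, 163), nrow = 2, byrow = TRUE)
round(describe_2x2(colleges), 3)
```

The function always returns all three values, so choosing which one to report is still up to you: for case-control data only the odds ratio is meaningful.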

\(~\)

Practice

Question #1

In 1992 an unknown illness struck several residents in a small Alaskan community. The CDC suspected fluoride poisoning in one of the town’s water sources. The data they analyzed consisted of 38 individuals with the illness, among whom 33 drank from the suspected water source. They also surveyed 54 residents who did not have symptoms and only 8 reported drinking from the suspected water source.

  1. Determine the study design used to collect these data
  2. Organize the provided information into a contingency table.
  3. Decide upon an appropriate way to report the relationship present in the contingency table and calculate the corresponding descriptive statistic.

\(~\)

Question #2

A major US university compiled admissions data after facing accusations of discrimination against female graduate school applicants. The data below records the outcomes of graduate school applicants to the university’s largest graduate programs (departments). Note that the data shared below are actually a simulated replicate of the original data, which is not publicly available.

adm = read.csv("https://remiller1450.github.io/data/admissions.csv")

  1. Identify the explanatory and response variable in this scenario.
  2. Create a contingency table involving these variables.
  3. Decide upon an appropriate way to report the relationship present in the contingency table and calculate the corresponding descriptive statistic.

\(~\)

Question #3 (time-permitting)

A research group surveyed 665 licensed drivers, asking for the color of their car and if they’d had a speeding ticket in the last year. There were 150 drivers who drove a red car, and 15 of them reported getting a speeding ticket. In comparison, 515 drivers did not drive a red car and 45 of this group had reported a speeding ticket.

  1. Determine the study design used to collect these data
  2. Organize the provided information into a contingency table.
  3. Decide upon an appropriate way to report the relationship present in the contingency table and calculate the corresponding descriptive statistic.