Directions
In July 2012, a gunman opened fire in a movie theater in Aurora, Colorado, leading to 12 fatalities. As a result of this tragedy, Mother Jones, a liberal news organization, assembled an open-source database aimed at documenting all mass shootings in the United States.
To be recorded in this database, an incident must meet the following criteria:
A handful of additional incidents were also included that met the first and second criteria but took place over a short period of time in multiple locations. These cases are documented as “spree killings”.
To download the Mass Shootings data, click here.
For the purposes of this lab, we will focus on the following variables:
The simplest way to summarize a categorical variable is using frequencies, which are simply the number of cases in a given category. Frequencies are often presented in a one-way frequency table. Below is a frequency table of the variable “Region” in the Happy Planet data:
Region | Frequency |
---|---|
1 | 24 |
2 | 24 |
3 | 16 |
4 | 33 |
5 | 7 |
6 | 12 |
7 | 27 |
Statisticians commonly denote frequencies using \(N\), sometimes adding a subscript for the category. So, \(N_{Africa}\) denotes the frequency of countries in Africa.
Proportions are a related summary measure; they are defined as the fraction of cases in a given category:
\[\text{proportion in category } j = \frac{\text{frequency of category } j}{\text{total number of cases}}\]
Statisticians commonly denote proportions using \(p\), sometimes adding a subscript for the category. So, \(p_{Africa}\) denotes the proportion of countries in Africa.
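As a rough illustration outside of Minitab, the frequency and proportion definitions can be sketched in a few lines of Python. The category labels below are invented for the example and are not taken from the Happy Planet or Mass Shootings data.

```python
from collections import Counter

# Hypothetical category labels, invented purely for illustration.
places = ["School", "Workplace", "Other", "Workplace",
          "School", "Workplace", "Other", "Workplace"]

# Frequency: the number of cases in each category.
frequencies = Counter(places)

# Proportion: frequency of the category divided by the total number of cases.
total = len(places)
proportions = {category: count / total for category, count in frequencies.items()}

# Print a simple one-way table showing both N and p for each category.
for category in sorted(frequencies):
    print(f"{category:<10} N = {frequencies[category]}  p = {proportions[category]:.3f}")
```

Because every case falls in exactly one category, the proportions in a one-way table always sum to 1, which is a quick check on any table you build.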
In Minitab, frequencies and proportions can be calculated using these steps:
Question #1:
For the Mass Shootings data, use Minitab to construct a frequency table of the variable “Place” that shows both frequencies and proportions. Write one sentence describing what this table tells you, and whether it matches your preconceptions regarding mass shootings. Your sentence should reference either proportions or frequencies, and should use statistical notation to refer to specific values. Be sure to include a copy of your table (or a screenshot) in your lab write-up.
To summarize two categorical variables, frequencies are displayed in a two-way frequency table. The table below shows some common notation for a two-way frequency table.
Two-way frequency tables allow us to describe the data using many different types of proportions:
For any two-way frequency table there are two types of conditional proportions: those that condition upon the row variable (such as the example above), and those that condition upon the column variable, such as \(p_{Z|B} = \frac{N_{ZB}}{N_{XB} + N_{YB} + N_{ZB}}\).
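The difference between the two kinds of conditional proportions can be sketched in Python. This is only an illustration: the layout (rows X, Y, Z and columns A, B) is assumed from the notation in \(p_{Z|B}\), the counts are invented, and the function names are hypothetical.

```python
# Hypothetical two-way table with rows X, Y, Z and columns A, B.
# counts[row][column] plays the role of N_{row,column}; all numbers are invented.
counts = {
    "X": {"A": 10, "B": 5},
    "Y": {"A": 20, "B": 15},
    "Z": {"A": 30, "B": 20},
}

def proportion_given_column(counts, row, column):
    """p_{row|column}: share of the cases in `column` that fall in `row`."""
    column_total = sum(cells[column] for cells in counts.values())
    return counts[row][column] / column_total

def proportion_given_row(counts, column, row):
    """p_{column|row}: share of the cases in `row` that fall in `column`."""
    row_total = sum(counts[row].values())
    return counts[row][column] / row_total

# Same cell (row Z, column B), two different denominators.
p_Z_given_B = proportion_given_column(counts, "Z", "B")  # 20 / (5 + 15 + 20)
p_B_given_Z = proportion_given_row(counts, "B", "Z")     # 20 / (30 + 20)
print(p_Z_given_B, p_B_given_Z)
```

Both proportions use the same numerator, \(N_{ZB}\); only the denominator changes, which is why the choice of conditioning variable matters when answering a research question.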
To construct a two-way frequency table in Minitab:
Question #2:
For the Mass Shootings data, create a two-way frequency table using “Mental” as the row variable, and “Place” as the column variable. Add your table to your lab write-up.
Question #3:
Use conditional proportions to determine if perpetrators with a mental health history are more likely to carry out their shooting at a school than perpetrators with no mental health history. Use proper statistical notation to express the conditional proportions you use.
Question #4:
In Question #3 you conditioned on either the row or column variable. Briefly explain your choice, including an explanation of why conditioning on the other variable would not answer the question.
The distribution of a single categorical variable can be visualized using bar charts and pie charts. These graphics can be created in Minitab using the following steps:
Question #5:
Construct both a pie chart and bar chart showing the distribution of the variable “Place”. Include both graphics in your lab write-up, along with a sentence describing which of the two you prefer.
Bar charts tend to be superior in visualizing two categorical variables. Two popular ways that the relationship between two categorical variables can be visualized are clustered bar charts, where bars are grouped by an outer variable, and stacked bar charts, where bars are stacked onto each other for the different levels of the outer variable. Our textbook sometimes refers to these as “side-by-side” and “segmented” bar charts.
In Minitab, you can create clustered or stacked bar charts by:
Question #6:
Use the steps listed above to construct a stacked bar chart showing the relationship between the variables “Mental” and “Place”. Include your graphic and 1-2 sentences describing what it tells you about these two variables.
Note that stacked bar charts can also be used to display conditional proportions, which are often more useful than frequencies or overall proportions for answering certain research questions because they more easily illustrate associations. You can construct stacked bar charts that display conditional proportions using the following steps:
Question #7:
Construct a stacked bar chart showing the distribution of “Place” conditional upon “Mental”. Include your graphic and 1-2 sentences explaining whether you think this representation is more/less useful than the graphic you created in Question #6.
In Minitab, we can obtain these summary statistics using the “Stat” menu and selecting “Display Descriptive Statistics”.
Question #8:
Report the five-number summary of the variable “Fatalities” and provide a one-sentence interpretation of the 3rd quartile (Q3).
The most common way to summarize the relationship between two quantitative variables is via their correlation coefficient. To understand the correlation coefficient, we must first understand the concept of standardization and z-scores.
A z-score is a standardized (unit-less) measurement of the value of a particular variable for a particular case. For variable “X”, the \(i^{th}\) case’s z-score (\(z_i\)) is given by:
\[z_i = \frac{x_i - \bar{x}}{s_x}\]
where \(x_i\) is the original measurement of variable “X” for the \(i^{th}\) case, \(\bar{x}\) is the mean of that variable, and \(s_x\) is the standard deviation of that variable.
In other words, z-scores are calculated in two steps: the first is to center the measurement by subtracting its mean, and the second is to scale the measurement by dividing by its standard deviation.
Standardizing a variable (i.e., transforming the original measurements into z-scores) is useful because it allows for a consistent interpretation across variables with different units.
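The two steps, centering and then scaling, can be sketched in Python. The measurements below are invented, and the sketch assumes \(s_x\) is the sample standard deviation (the \(n - 1\) version), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Hypothetical measurements of some variable X, invented for illustration.
x = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = mean(x)   # used for centering
s_x = stdev(x)    # sample standard deviation (n - 1 denominator), used for scaling

# z_i = (x_i - x_bar) / s_x for every case
z = [(x_i - x_bar) / s_x for x_i in x]

for x_i, z_i in zip(x, z):
    print(f"x = {x_i:>2}  z = {z_i:+.2f}")
```

After standardization, the z-scores have mean 0 and standard deviation 1 regardless of the original units. A positive z-score marks a case above the mean, and a negative z-score marks a case below it.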
Question #9:
Use a Minitab formula to standardize the variable “Victims” (store your new variable as a column titled “Z”). What is the z-score of the Aurora, Colorado theater shooting? What does the z-score tell you about the shooting?
Because z-scores put different variables onto a standardized scale, they can be used to relate two quantitative variables with different units. The correlation coefficient summarizes these relationships by calculating the average product of z-scores for two variables (dividing by \(n - 1\) rather than \(n\)). Mathematically, the correlation coefficient of variables “X” and “Y” is calculated as:
\[r_{x,y} = \frac{1}{n-1} \sum_i \bigg( \frac{x_i - \bar{x}}{s_x} \bigg) \bigg( \frac{y_i - \bar{y}}{s_y} \bigg)\]
We can interpret the correlation coefficient as measuring how closely two variables are linearly associated. A correlation coefficient near 1 indicates a strong positive association, while a correlation coefficient near -1 indicates a strong inverse association. A value near 0 indicates little or no linear association, as it reflects no systematic tendency for high z-scores in one variable to correspond with either high or low z-scores in the other variable across cases.
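The formula can be followed term by term in a short Python sketch. The data and the helper names `z_scores` and `correlation` are invented for this illustration, and the sample standard deviation is assumed for \(s_x\) and \(s_y\).

```python
from statistics import mean, stdev

def z_scores(values):
    """Center each value by the mean, then scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """Sum of the products of paired z-scores, divided by n - 1."""
    n = len(x)
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises exactly in step with x
y_down = [10, 8, 6, 4, 2]    # falls exactly in step with x
y_mixed = [3, 1, 4, 1, 5]    # no exact linear pattern

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
r_mixed = correlation(x, y_mixed)
print(round(r_up, 3), round(r_down, 3), round(r_mixed, 3))
```

The perfectly linear pairs land at the two extremes of the scale, while the mixed pair falls somewhere in between. The sign of each product of z-scores determines whether a case pushes the coefficient up or down.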
Question #10:
Suppose a particular case has an above average value for variable “X” and an above average value for variable “Y”. Will this case make a positive or negative contribution to the correlation coefficient of variables “X” and “Y”? What if a case has an above average value for variable “X” but a below average value for variable “Y”?
In Minitab the correlation coefficient of two variables can be calculated using the following steps:
Question #11:
Use Minitab to find the correlation coefficient between the variables “Fatalities” and “Injured”. Provide a one-sentence interpretation of how these variables are related.
Two common ways to visualize the distribution of a quantitative variable are histograms and dotplots. Both of these graphs group the data into equally spaced intervals, then plot the frequencies within each group. These grouped intervals are called bins. By default, Minitab will label bins at their midpoint. Histograms of three different variables in the Happy Planet data are shown below:
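To make the idea of binning concrete, here is a minimal Python sketch that groups values into equally spaced intervals and counts the cases in each. The data, the bin width, and the rule for values that land exactly on a bin edge are all assumptions made for the example; Minitab's own defaults may differ.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins.

    Bin k covers [start + k*width, start + (k+1)*width).
    As an assumption of this sketch, the last bin also includes its right edge.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if k == n_bins and v == start + n_bins * width:
            k = n_bins - 1  # place the maximum edge value in the last bin
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical measurements, invented for illustration.
values = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]
start, width, n_bins = 0, 2.5, 4

counts = bin_counts(values, start, width, n_bins)
midpoints = [start + (k + 0.5) * width for k in range(n_bins)]

# A text "histogram": one row per bin, labeled at the bin midpoint.
for midpoint, count in zip(midpoints, counts):
    print(f"{midpoint:>5.2f} | {'*' * count}")
```

Changing `width` or `start` changes which cases share a bin, so the same data can look somewhat different under different binning choices.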
We are often interested in the general shape of a distribution, specifically whether it is symmetric or skewed.
In Minitab, histograms and dotplots can be constructed by:
Question #12:
Construct a histogram of the variable “Year” and describe whether the distribution of this variable is skewed right, skewed left, or approximately symmetric. Then, write 1-2 sentences describing what this tells you about the prevalence of mass shootings.
Histograms and dotplots are very effective ways of visualizing the entire distribution of a single quantitative variable, but sometimes we want some summarization. Boxplots are well-suited for this task. If you are rusty on boxplots, the diagram below provides a quick summary of their key components:
In Minitab, boxplots can be constructed by:
Question #13:
Construct a boxplot of the variable “Age”. How would you use this plot to describe the age of most mass shooters? Refer to one or more statistics illustrated by the boxplot in your answer.
So far we’ve seen how to summarize a single categorical or quantitative variable, how to describe relationships between two categorical variables, and how to describe relationships between two quantitative variables. All that remains is how to describe the relationship between one categorical variable and one quantitative variable.
The relationship between a categorical and a quantitative variable can be expressed by calculating summary measures of the quantitative variable separately for each category defined by the categorical variable. In Minitab, this is done by supplying the categorical variable as a “By variable”.
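The idea behind a “By variable” can be sketched in Python: split the quantitative values by category, then summarize each group separately. The category labels and values below are invented, and only a few summary measures are shown.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (category, value) pairs, invented for illustration.
cases = [("A", 3), ("A", 5), ("A", 7), ("B", 8), ("B", 10), ("B", 15)]

# Step 1: group the quantitative values by the categorical variable.
groups = defaultdict(list)
for category, value in cases:
    groups[category].append(value)

# Step 2: compute the same summary measures separately within each group.
summary = {
    category: {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "mean": mean(values),
    }
    for category, values in groups.items()
}

for category in sorted(summary):
    print(category, summary[category])
```

Comparing the rows of such a summary across categories is what lets us judge whether the quantitative variable tends to differ from one category to another.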
Question #14:
Report the five-number summary of the variable “Victims” using “Mental” as a by variable. Then write 1-2 sentences describing the relationship between these two variables. Do they appear associated?
Question #15:
The mean number of victims for shooters with an unclear/unknown/tbd mental health status is much higher than the number of victims for other mental health statuses. But Q1, the median, and Q3 do not reflect such a drastic difference by mental health status. Briefly explain this disparity.
Any visualization suitable for a single quantitative variable may also be used to show the relationship between a quantitative and categorical variable if the categorical variable is used as a “By variable”. However, boxplots tend to be most effective at illustrating this relationship.
In Minitab, the relationship between a quantitative and categorical variable can be visualized using a boxplot by:
Question #16:
Use boxplots to visualize the relationship between “Place” and “Victims”. Do these variables appear to be associated? (Hint: association requires only that two categories have different distributions.)
Question #17:
Are the variables “Age” and “Victims” associated? Use the appropriate summary statistic(s) or a visual to support your answer.
Question #18:
Are the variables “Age” and “Place” associated? Use the appropriate summary statistic(s) or a visual to support your answer.
Question #19:
Use a Minitab formula to create a new variable representing the fraction of victims who were killed, and verify that the mean of this variable is 0.6230. Using the 68-95-99.7 rule, which was briefly introduced in the Summarizing Data notes, characterize the middle 68% of mass shootings. That is, describe the fraction of victims who were killed in the 68% of mass shootings closest to average.
Question #20:
Using the methods and tools presented in this lab, construct an executive summary reporting the main characteristics of mass shooters. Your summary should include 2-3 insightful visuals or tables, along with a description of what these visuals/tables show and why they are important.