Sta-209 Lab #1 - Descriptive Statistics

Directions

Read through the entire lab (not just the questions). The lab will introduce course content that you will be responsible for on exams/homework.
Answer all questions in a separate document, attaching Minitab output if needed. Many groups choose to use Google Docs, or designate a group member to be in charge of the write-up, possibly using one PC for Minitab and the other PC for the write-up.
Do not use a “divide and conquer” strategy. While it is tempting to get done quicker, this approach negatively impacts you and your classmates. You are expected to work through the lab as a team. Also, you should recognize that Prof. Miller is happy to devote more class time to a lab if it is taking longer than anticipated.

The Mass Shootings Dataset

In July 2012 a gunman opened fire in a movie theater in Aurora, Colorado, leading to 12 fatalities. As a result of this tragedy, Mother Jones, a liberal news organization, assembled an open-source database aimed at documenting all mass shootings in the United States.

To be recorded in this database, an incident must meet the following criteria:

The perpetrator took the lives of at least four people, not including the shooter
The killings were carried out by a lone perpetrator (with some rare exceptions involving two shooters)
The shooting occurred in a single, public place

A handful of additional incidents were also included which met the first and second criteria, but took place over a short period of time in multiple locations. These cases are documented as “spree killings”.

To download the Mass Shootings data, click here.

Variable Descriptions

For the purposes of this lab, we will focus on the following variables:

Year: When the shooting took place
Type: Did the shooting take place in a single location (Mass), or multiple locations in a short time period (Spree)
Fatalities: How many were killed (not including the perpetrator)
Injured: How many were injured but not killed (not including the perpetrator)
Victims: In total, how many were injured or killed (not including the perpetrator)
Age: How old, in years, was the killer at the time of the shooting
Race: What racial or ethnic category describes the shooter
Gender: What gender was the shooter
Place: What category of place describes where the shooting occurred
Legally: Was the weapon(s) obtained legally
Mental: Did the shooter have a mental health history

Categorical Variables

Univariate Summaries

The simplest way to summarize a categorical variable is using frequencies, which are simply the number of cases in a given category. Frequencies are often presented in a one-way frequency table. Below is a frequency table of the variable “Region” in the Happy Planet data:

Region	Frequency
1	24
2	24
3	16
4	33
5	7
6	12
7	27

Statisticians commonly denote frequencies using $N$, sometimes adding a subscript for the category. So, $N_{Africa}$ denotes the frequency of countries in Africa.

Proportions are a related summary measure. They are defined as the fraction of cases in a given category:

\[\text{proportion in category j} = \frac{\text{frequency of category j}}{\text{total number of cases}}\]

Statisticians commonly denote proportions using $p$, sometimes adding a subscript for the category. So, $p_{Africa}$ denotes the proportion of countries in Africa.

In Minitab, frequencies and proportions can be calculated using these steps:

Go to the “Stat” menu and select “Tables” -> “Tally Individual Variables”
Select the variable you’re interested in, click “Counts” and “Percents” if you want both frequencies and proportions.
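
For anyone who wants to check a tally by hand, the proportion formula above can be sketched in a few lines of Python. This is only an illustration and not part of the Minitab workflow. It reuses the Region frequencies from the Happy Planet table above; the names `region_counts` and `region_props` are made up for this sketch.

```python
# Frequencies copied from the Region table above (Happy Planet data).
region_counts = {1: 24, 2: 24, 3: 16, 4: 33, 5: 7, 6: 12, 7: 27}

# The total number of cases is the sum of the category frequencies.
total = sum(region_counts.values())

# proportion in category j = frequency of category j / total number of cases
region_props = {region: count / total for region, count in region_counts.items()}

for region, count in region_counts.items():
    print(f"Region {region}: N = {count}, p = {region_props[region]:.3f}")
```

The proportions should always sum to 1 (up to rounding), which is a quick way to catch a mistyped frequency.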

Question #1:

For the Mass Shootings data, use Minitab to construct a frequency table of the variable “Place” that shows both frequencies and proportions. Write one sentence describing what this table tells you, and whether it matches your preconceptions regarding mass shootings. Your sentence should reference either proportions or frequencies, and should use statistical notation to refer to specific values. Be sure to include a copy of your table (or a screenshot) in your lab write-up.

Bivariate Summaries

To summarize two categorical variables, frequencies are displayed in a two-way frequency table. The table below shows some common notation for a two-way frequency table.

Two-way frequency tables allow us to describe the data using many different types of proportions:

Overall (total) proportions: such as $p_{XA} = \frac{N_{XA}}{N}$ (where N is the total number of cases)
Conditional proportions: such as $p_{A|X} = \frac{N_{XA}}{N_{XA} + N_{XB} + N_{XC}}$

For any two-way frequency table there are two types of conditional proportions: those which condition upon the row variable (such as the example above), and those which condition upon the column variable, such as $p_{Z|B} = \frac{N_{ZB}}{N_{XB} + N_{YB} + N_{ZB}}$
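
The difference between overall and conditional proportions can be sketched in Python as follows. The counts are hypothetical (made up purely for illustration), and the layout assumes rows X, Y, Z and columns A, B, C, as suggested by the notation above.

```python
# Hypothetical two-way frequency table: rows X, Y, Z and columns A, B, C.
# These counts are made up for illustration only.
table = {
    "X": {"A": 10, "B": 20, "C": 30},
    "Y": {"A": 5, "B": 15, "C": 20},
    "Z": {"A": 25, "B": 15, "C": 10},
}

# N: the total number of cases in the whole table.
n_total = sum(sum(row.values()) for row in table.values())

# Overall proportion: p_XA = N_XA / N
p_XA = table["X"]["A"] / n_total

# Conditioning on the row variable: p_{A|X} = N_XA / (N_XA + N_XB + N_XC)
p_A_given_X = table["X"]["A"] / sum(table["X"].values())

# Conditioning on the column variable: p_{Z|B} = N_ZB / (N_XB + N_YB + N_ZB)
p_Z_given_B = table["Z"]["B"] / sum(row["B"] for row in table.values())

print(f"p_XA = {p_XA:.3f}, p_A|X = {p_A_given_X:.3f}, p_Z|B = {p_Z_given_B:.3f}")
```

Notice that the numerator is a single cell in every case; only the denominator changes (the whole table, one row total, or one column total).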

To construct a two-way frequency table in Minitab:

Go to the “Stat” menu and select “Tables” -> “Cross Tabulation and Chi-Square”
Select the variables you want to use for the rows and columns of the table and click “Ok”

Question #2:

For the Mass Shootings data, create a two-way frequency table using “Mental” as the row variable, and “Place” as the column variable. Add your table to your lab write-up.

Question #3:

Use conditional proportions to determine if perpetrators with a mental health history are more likely to carry out their shooting at a school than perpetrators with no mental health history. Use proper statistical notation to express the conditional proportions you use.

Question #4:

In Question #3 you conditioned on either the row or column variable. Briefly explain your choice, including an explanation of why conditioning on the other variable would not answer the question.

Visualizations

The distribution of a single categorical variable can be visualized using bar charts and pie charts. These graphics can be created in Minitab using the following steps:

Go to the “Graph” menu and select “Bar Chart” or “Pie Chart”
Since we are displaying a single variable, select “Simple” and hit “Ok”
Select your variable and hit “Ok” to create the chart

Question #5:

Construct both a pie chart and bar chart showing the distribution of the variable “Place”. Include both graphics in your lab write-up, along with a sentence describing which of the two you prefer.

Bar charts tend to be superior in visualizing two categorical variables. Two popular ways that the relationship between two categorical variables can be visualized are clustered bar charts, where bars are grouped by an outer variable, and stacked bar charts, where bars are stacked onto each other for the different levels of the outer variable. Our textbook sometimes refers to these as “side-by-side” and “segmented” bar charts.

In Minitab, you can create clustered or stacked bar charts by:

Go to the “Graph” menu and select “Bar Chart”
Select either “Cluster” or “Stack” and hit “Ok”
Select your two categorical variables, putting the outer variable (the one that determines the clusters) first, and the inner variable (the one displayed for each cluster) second. Hit “Ok” to create the chart

Question #6:

Use the steps listed above to construct a stacked bar chart showing the relationship between the variables “Mental” and “Place”. Include your graphic and 1-2 sentences describing what it tells you about these two variables.

Note that stacked bar charts can also be used to display conditional proportions, which are often more useful than frequencies or overall proportions for answering certain research questions because they more easily illustrate associations. You can construct stacked bar charts that display conditional proportions using the following steps:

Go to the “Graph” menu and select “Bar Chart” -> “Stack”
Select your two categorical variables, putting the outer variable (the one that determines the clusters) first, and the inner variable (the one displayed for each cluster) second. Note that your graph will condition on the outer variable
Click “Chart Options” and select both “Show Y as Percent” and “Within categories at level 1 (outermost)”, then hit “Ok”
Hit “Ok” to create the chart.

Question #7:

Construct a stacked bar chart showing the distribution of “Place” conditional upon “Mental”. Include your graphic and 1-2 sentences explaining whether you think this representation is more/less useful than the graphic you created in Question #6.

Quantitative Variables

Univariate Summaries

The mean is a numerical average of the data and one way of describing the center of a distribution
- For a given dataset, we define the mean of variable “X” as $\bar{x} = \tfrac{\Sigma_i x_i}{n}$, where $n$ is the number of data-points
The median is the middle value of the data when the data values are ordered from largest to smallest (or vice-versa)
- If the data contains an even number of cases, the median is the midpoint of the middle two values
The standard deviation is a measure of how much variability there is in a distribution, or how spread out the data are around the mean
- For a given dataset, we define the standard deviation of variable “X” as $s_x = \sqrt{\tfrac{\Sigma_i (x_i - \bar{x})^2}{n-1}}$, where $n$ is the number of data-points
The $P^{th}$ percentile is the value that is greater than $P$ percent of the data (Note: sometimes percentiles are defined using greater than or equal to)
- Some frequently reported percentiles are the First Quartile or $Q_1$, which is the $25^{th}$ percentile, and the Third Quartile or $Q_3$, which is the $75^{th}$ percentile
- The median is also a percentile (the $50^{th}$)
- The Interquartile Range or $IQR$ is a measure of variability defined as $IQR = Q_3 - Q_1$
The minimum and maximum are exactly what their names suggest: the smallest and largest values
The set of numbers $\{$Min, Q1, Median, Q3, Max$\}$ is known as the five number summary; it is often reported to summarize a distribution

In Minitab we can obtain these summary statistics by going to the “Stat” menu and selecting “Basic Statistics” -> “Display Descriptive Statistics”.
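
As a rough cross-check on the definitions above, the same summaries can be sketched with Python's standard `statistics` module. The data values are hypothetical and chosen only to keep the arithmetic easy. Software packages use slightly different interpolation rules for quartiles, so $Q_1$ and $Q_3$ from this sketch should not be assumed to match Minitab's output on every dataset.

```python
import statistics

# Hypothetical values, made up for illustration only.
x = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(x)      # sum of the values divided by n
median = statistics.median(x)  # middle value of the sorted data
sd = statistics.stdev(x)       # sample standard deviation (n - 1 denominator)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1                  # IQR = Q3 - Q1

five_number_summary = (min(x), q1, median, q3, max(x))
print("mean:", mean, "sd:", round(sd, 3), "IQR:", iqr)
print("five number summary:", five_number_summary)
```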

Question #8:

Report the five number summary of the variable “Fatalities” and provide a one-sentence interpretation of the 3rd quartile (Q3).

Bivariate Summaries

The most common way to summarize the relationship between two quantitative variables is via their correlation coefficient. To understand the correlation coefficient we first must understand the concept of standardization and z-scores.

A z-score is a standardized (unit-less) measurement of the value of a particular variable for a particular case. For variable “X”, the $i^{th}$ case’s z-score ($z_i$) is given by:

\[z_i = \frac{x_i - \bar{x}}{s_x}\]

where $x_i$ is the original measurement of variable “X” for the $i^{th}$ case, $\bar{x}$ is the mean of that variable, and $s_x$ is the standard deviation of that variable.

In other words, z-scores are calculated in two steps: the first is to center the measurement by subtracting off the variable's mean, and the second is to scale the measurement by dividing by the variable's standard deviation.

Standardizing a variable (i.e., transforming the original measurements into z-scores) is useful because it allows for a consistent interpretation across variables with different units.

If a doctor tells you that your blood urea concentration is 50 mg/dl above average, you can't tell whether you should be worried: you could be only slightly above average, or substantially above average.
If a doctor tells you that your blood urea concentration is 4 standard deviations above average (a z-score of 4), you should be concerned, because you know that you are far above the average.
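
The two steps of standardization (center, then scale) can be sketched in Python as below. The measurements are hypothetical, and this is an illustration of the formula rather than the Minitab procedure.

```python
import statistics

# Hypothetical measurements, made up for illustration only.
x = [10, 12, 14, 16, 18]

x_bar = statistics.mean(x)   # used to center each measurement
s_x = statistics.stdev(x)    # used to scale (sample standard deviation)

# z_i = (x_i - x_bar) / s_x for every case
z = [(x_i - x_bar) / s_x for x_i in x]

for x_i, z_i in zip(x, z):
    print(f"x = {x_i:>2}  z = {z_i:+.3f}")
```

After standardizing, the z-scores have a mean of 0 and a standard deviation of 1 regardless of the original units, which is what makes them comparable across variables.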

Question #9:

Use a Minitab formula to standardize the variable “Victims” (store your new variable as a column titled “Z”). What is the z-score of the Aurora, Colorado theater shooting? What does the z-score tell you about the shooting?

Because z-scores put different variables onto a standardized scale, they can be used to relate two quantitative variables with different units. The correlation coefficient summarizes these relationships by calculating the average product of z-scores for two variables (using $n-1$ rather than $n$ in the denominator). Mathematically, the correlation coefficient of variables “X” and “Y” is calculated as:

\[r_{x,y} = \frac{1}{n-1} \sum_i \bigg( \frac{x_i - \bar{x}}{s_x} \bigg) \bigg( \frac{y_i - \bar{y}}{s_y} \bigg)\]

We can interpret the correlation coefficient as measuring how closely two variables are linearly associated. A correlation coefficient near 1 indicates a strong positive association, while a correlation coefficient near -1 indicates a strong inverse association. A value near 0 indicates no linear association, as it reflects no systematic tendency for high z-scores in one variable to correspond with either high or low z-scores in the other variable across cases.
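
The formula can be sketched directly in Python by multiplying paired z-scores and dividing their sum by $n-1$. The function name `correlation` and all of the data below are hypothetical, chosen only to illustrate the two extremes and the symmetry of the calculation.

```python
import statistics

def correlation(x, y):
    """Sum of paired z-score products, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    products = [
        ((x_i - x_bar) / s_x) * ((y_i - y_bar) / s_y)
        for x_i, y_i in zip(x, y)
    ]
    return sum(products) / (n - 1)

# Hypothetical paired data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises in lockstep with x
y_down = [10, 8, 6, 4, 2]  # falls in lockstep with x
y_mixed = [3, 1, 4, 1, 5]  # no exact linear pattern

print(round(correlation(x, y_up), 3))
print(round(correlation(x, y_down), 3))
print(round(correlation(x, y_mixed), 3))
```

Swapping the two arguments gives the same result, since each term is a product of two z-scores.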

Question #10:

Suppose a particular case has an above average value for variable “X” and an above average value for variable “Y”. Will this case make a positive or negative contribution to the correlation coefficient of variables “X” and “Y”? What if a case has an above average value for variable “X” but a below average value for variable “Y”?

In Minitab the correlation coefficient of two variables can be calculated using the following steps:

Go to the “Stat” menu and select “Basic Statistics” -> “Correlation”
Select your two quantitative variables, noting that the order doesn’t matter because multiplication is symmetric

Question #11:

Use Minitab to find the correlation coefficient between the variables “Fatalities” and “Injured”. Provide a one-sentence interpretation of how these variables are related.

Visualizations

Two common ways to visualize the distribution of a quantitative variable are histograms and dotplots. Both of these graphs group the data into equally spaced intervals, then plot the frequencies within each group. These grouped intervals are called bins. By default, Minitab will label bins at their midpoint. Histograms of three different variables in the Happy Planet Data are shown below:

We are often interested in the general shape of a distribution, specifically whether it is symmetric or skewed.

A symmetric distribution can be folded over a center line and the two sides will closely match each other.
A skewed distribution has most of its data piled up on one side and a long tail of smaller amounts of data in the other direction.
- The variable “LifeExpectency” is skewed to the left: most countries have life expectancies somewhere between 70 and 80 years, but there is a long tail of countries with lower life expectancies.
- The variable “GDPperCapita” is skewed to the right. It also contains an outlier in Luxembourg, whose per capita GDP of nearly $60,000 is much larger than that of any other country.
It is rare to see perfect symmetry in real data, so approximate symmetry is often good enough. The histogram of the variable “Happiness” is an example of an approximately symmetric distribution.

In Minitab, histograms and dotplots can be constructed by:

Going to the “Graph” menu and selecting “Histogram” or “Dotplot”
Selecting “Simple” under “One Y” if you want a single plot of the overall distribution of your quantitative variable
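
To make the idea of bins concrete, here is a minimal Python sketch that groups values into equally spaced intervals, counts the cases in each, and labels each bin at its midpoint. The data and the function name `histogram_bins` are hypothetical, and the rule used here for choosing bin edges is an assumption of this sketch; Minitab chooses its own bin boundaries, so its histograms may look different.

```python
# Hypothetical values, made up for illustration only.
values = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]

def histogram_bins(data, n_bins):
    """Group data into n_bins equally spaced intervals and count each bin.

    Assumes the data contain at least two distinct values.
    """
    low, high = min(data), max(data)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in data:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    midpoints = [low + (i + 0.5) * width for i in range(n_bins)]
    return midpoints, counts

midpoints, counts = histogram_bins(values, n_bins=3)
for mid, count in zip(midpoints, counts):
    print(f"bin centered at {mid:.1f}: {'*' * count}")
```

Changing `n_bins` changes how the same data look, which is one reason to try more than one bin width before describing a distribution's shape.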

Question #12:

Construct a histogram of the variable “Year” and describe whether the distribution of this variable is skewed right, skewed left, or approximately symmetric. Then, write 1-2 sentences describing what this tells you about the prevalence of mass shootings.

Histograms and dotplots are very effective ways of visualizing the entire distribution of a single quantitative variable, but sometimes we want some summarization. Boxplots are well-suited for this task. If you are rusty on boxplots, the diagram below provides a quick summary of their key components:

In Minitab, boxplots can be constructed by:

Going to the “Graph” menu and selecting “Boxplot”
Selecting “Simple” under “One Y” if you want a single boxplot of the overall distribution of your quantitative variable

Question #13:

Construct a boxplot of the variable “Age”. How would you use this plot to describe the age of most mass shooters? Refer to one or more statistics illustrated by the boxplot in your answer.

One of Each

So far we’ve seen how to summarize a single categorical or quantitative variable, how to describe relationships between two categorical variables, and how to describe relationships between two quantitative variables. All that remains is how to describe the relationship between one categorical variable and one quantitative variable.

Bivariate Summaries

The relationship between a categorical and a quantitative variable can be expressed by calculating summary measures of the quantitative variable separately for each category defined by the categorical variable. In Minitab, this is done by supplying the categorical variable as a “By variable”.
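
Conceptually, a “By variable” splits the cases into groups and then summarizes each group separately. A minimal Python sketch of that idea, using hypothetical (category, value) pairs and made-up names such as `records` and `summary`, might look like this:

```python
import statistics
from collections import defaultdict

# Hypothetical (category, value) pairs, made up for illustration only.
records = [
    ("A", 4), ("A", 6), ("A", 8),
    ("B", 10), ("B", 12), ("B", 14), ("B", 16),
]

# Step 1: split the quantitative values by category.
groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# Step 2: compute the same summary measures within each category.
summary = {
    category: {
        "n": len(vals),
        "mean": statistics.mean(vals),
        "median": statistics.median(vals),
    }
    for category, vals in groups.items()
}

for category, stats in summary.items():
    print(category, stats)
```

Comparing the rows of such a summary across categories is what lets us judge whether the two variables appear associated.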

Question #14:

Report the five number summary of the variable “Victims” using “Mental” as a “By variable”. Then write 1-2 sentences describing the relationship between these two variables. Do they appear associated?

Question #15:

The mean number of victims for shooters with an unclear/unknown/tbd mental health status is much higher than the number of victims for other mental health statuses. But Q1, the median, and Q3 do not reflect such a drastic difference by mental health status. Briefly explain this disparity.

Visualizations

Any visualization suitable for a single quantitative variable may also be used to show the relationship between a quantitative and categorical variable if the categorical variable is used as a “By variable”. However, boxplots tend to be most effective at illustrating this relationship.

In Minitab, the relationship between a quantitative and categorical variable can be visualized using a boxplot by:

Going to the “Graph” menu and selecting “Boxplot”
Selecting “With Groups” under “One Y”, then supplying the quantitative variable as the “Graph variable” and the categorical variable as the “Categorical variable for grouping”

Question #16:

Use boxplots to visualize the relationship between “Place” and “Victims”. Do these variables appear to be associated? (Hint: association only requires that two categories have different distributions)

Additional Questions

Question #17:

Are the variables “Age” and “Victims” associated? Use the appropriate summary statistic(s) or a visual to support your answer.

Question #18:

Are the variables “Age” and “Place” associated? Use the appropriate summary statistic(s) or a visual to support your answer.

Question #19:

Use a Minitab formula to create a new variable representing the fraction of victims who were killed and verify that the mean of this variable is 0.6230. Using the 68-95-99 rule, which was briefly introduced in the Summarizing Data notes, characterize the middle 68% of mass shootings. That is, describe the fraction of victims who were killed in the 68% of mass shootings closest to average.

Question #20:

Using the methods and tools presented in this lab, construct an executive summary reporting the main characteristics of mass shooters. Your summary should include 2-3 insightful visuals or tables, along with a description of what these visuals/tables show and why they are important.

Submission Directions

Double check that you’ve completed all of the lab’s questions, making sure that everyone in your group agrees with the answers you’ve provided. You will receive a single group score for the lab.
Make sure that everyone’s name is on the write-up.
Email your completed write-up to Professor Miller with a subject heading that includes the text “Sta-209-Lab1”. Please include this exact character string, including the dashes. You will lose 1 point off the top of your score if you don’t do so.
If you’d like to provide feedback on your group, fill out the optional review form at this link: https://forms.gle/wNWRFMbbra8oK4LJ8