Overview

The purpose of this project is for you to demonstrate your ability to perform a start-to-finish data analysis, including both descriptive and statistical methods.

You will be responsible for writing a 2-3 page report summarizing your analysis (including any figures and tables).

You are expected to complete the project independently, but you may use the same data from your midterm project. You may also choose an entirely new dataset, including those listed as “pre-approved” in the sections that follow.


Due Date

The project report is due at 2:00pm Eastern on Monday, May 2nd. This is the end of our assigned final exam timeslot.

You may choose to turn in your project earlier, but any extension will require prior approval and might subject your project score to a penalty (please do not email me on Sunday night or Monday morning asking for an extension unless you’re willing to accept a 10-20% late penalty).


Details

Your final report should contain the following five components:

  1. An introduction section that describes the context and purpose of your project
  2. A research question (within your introduction section) that is specific, answerable, and interesting (e.g., “Are there regional differences in the net tuition costs of private colleges?”)
  3. A methods section that describes the study design, data collection, and how you pursued your research question, such as: the types of graphs you used, how you addressed/explored possible confounding variables, the descriptive statistics you used, and the methods of statistical inference (e.g., hypothesis tests and confidence intervals) that you used.
  4. A results section that reports on each of the items that were mentioned in your methods section
  5. A discussion section that puts your results into context and acknowledges any limitations of your data and/or your analysis approach.
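
As a rough illustration of the kind of analysis a methods section might describe, the sketch below computes descriptive statistics for two groups and an approximate 95% confidence interval for the difference in their means. It is only a sketch under stated assumptions: the tuition values are made up, the function names `summarize` and `diff_in_means_ci` are hypothetical, and the interval uses a normal critical value rather than the t-based value that statistical software would typically use for samples this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(values):
    """Return the sample size, mean, and sample standard deviation."""
    return len(values), mean(values), stdev(values)


def diff_in_means_ci(group_a, group_b, confidence=0.95):
    """Approximate confidence interval for mean(group_a) - mean(group_b).

    Assumption: a normal critical value is used for simplicity; with
    small samples a t-based interval would be somewhat wider.
    """
    n_a, mean_a, sd_a = summarize(group_a)
    n_b, mean_b, sd_b = summarize(group_b)
    # Standard error of the difference, without assuming equal variances.
    std_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = mean_a - mean_b
    return diff - z * std_error, diff + z * std_error


# Hypothetical net tuition values (in $1000s), made up for illustration only.
northeast = [28.1, 31.4, 26.9, 33.0, 29.7, 30.2]
midwest = [22.5, 25.1, 24.0, 21.8, 26.3, 23.4]

low, high = diff_in_means_ci(northeast, midwest)
print(f"Difference in means: {mean(northeast) - mean(midwest):.2f}")
print(f"Approximate 95% CI: ({low:.2f}, {high:.2f})")
```

An interval that excludes zero would suggest a regional difference in these hypothetical data, which is the kind of statement a results section would then report and a discussion section would put into context.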

Additionally, I will be looking for the following:

As a reminder, the final version of your report should be no more than 3 pages (single spaced, including figures and tables). You may include references and additional figures/tables in an optional appendix that will not count towards the page limit. The purpose of this restriction is to push you to convey the main points of your analysis succinctly.


Pre-Approved Datasets:

  1. Colleges 2019-20 - This dataset contains numerous variables with no natural outcome. The data were obtained from the “College Scorecard”, a government-run database containing information on all degree-granting higher education institutions. The data are filtered to only include institutions that primarily grant bachelor’s degrees and enroll at least 500 students.
  2. XU Basketball 2020-21 - This dataset contains team-level statistics for each Xavier University men’s basketball game played during the 2020-21 season. One variable, “Margin”, which is calculated as Xavier’s score minus the opponent’s score, is a natural outcome variable.
  3. Police-Involved Deaths in 2015 - This dataset contains numerous variables with no natural outcome. The data originate from the FiveThirtyEight article “Where Police Have Killed Americans in 2015”. It contains demographic and geographic information on everyone killed by the police in the year 2015, including the person’s name, age, race, gender, cause of death, and whether the person was armed. These records were merged with the poverty, unemployment, and college education rates of the census tract where the killing took place.
  4. Mass Shootings - This dataset was assembled by Mother Jones, a liberal news organization, in response to the movie theater shooting in Aurora, Colorado. It documents shootings in the United States where a lone gunman (with a few exceptions involving two shooters) killed at least four individuals (not including themselves) at a single location (with a few exceptions involving multiple locations within a short period). Variables include: demographic characteristics of the shooter, information on when/where the shooting occurred, information on the number of victims, and information about the mental health status of the shooter.
  5. Claims Against the TSA - The Transportation Security Administration (TSA) is an agency within the US Department of Homeland Security that has authority over the safety and security of travel in the United States. This dataset records claims made by travelers against the TSA between 2003 and 2008, including information about the claim type, the claim amount, whether the claim was approved, settled, or denied, and the final amount paid.

Aside from these pre-approved datasets, you’re welcome to use a dataset of your choosing that is better aligned with your interests. However, if you decide to go with a different dataset, you must get instructor approval. Before requesting approval, you should be aware of the following guidelines:

  • No datasets taken from Kaggle.com, data.world, or any website that crowdsources datasets and analyses.
  • No datasets taken from textbooks or educational sources.
  • No datasets that do not have a primary source (i.e., you should know who collected the data).

You are also welcome to collect your own data or use data that you’ve worked with in some other context (e.g., a non-statistics course).


USCLAP Competition (Optional)

USCLAP is an undergraduate class project competition for students taking introductory courses in statistics or data science (click here for more info). The requirements of this project roughly align with those of USCLAP, and I encourage any individual or group that receives a high score to consider submitting their work to the competition. USCLAP has become very competitive in recent years, and being a finalist (or receiving an honorable mention) is an excellent recognition to include on your resume (or mention during a job interview).

If you think this is something you might be interested in, I encourage you to browse the finalists from prior semesters (you should only concern yourself with the “Introductory Statistics” category):