Directions:

  • My primary expectation is that you thoughtfully work through this activity collaboratively with your table, discussing the embedded questions and recording your responses in a shared document.
    • At times you will be asked to add screenshots to your write-up. If you are on a Windows PC, an easy way to do this is the “snipping tool”, which you can find using the search bar along the bottom of your screen. If you are on a Mac, you can find instructions on how to take a screenshot at this link.
  • Everyone should upload their own copy of the lab write-up to Canvas.
  • Only a couple of questions will be graded for accuracy, so your focus should be on learning the material rather than “getting the right answers” as quickly as possible.

\(~\)

Introduction

This lab covers univariate and bivariate graphs and summaries. The goal is to give you practice working with these topics on two different real datasets prior to the Midterm Project. Please do not use a “divide and conquer” strategy; any group that does so will be penalized proportionately.

\(~\)

Dataset #1 - The College Scorecard

The College Scorecard is a government-run database that stores institution-level data on all accredited colleges and universities in the United States. A new version of these data is published yearly, and it contains over 400 variables. We’ll use a simplified version of these data for the 2019-2020 academic year.

The College 2019 Dataset is a reduced version of the 2019-2020 College Scorecard data that contains fewer variables and is filtered to include only primarily undergraduate institutions with at least 400 enrolled students.

A brief description of each variable is given below:

  • Name - Name of the institution
  • City - City where the institution is located
  • State - State where the institution is located
  • Enrollment - Number of full-time enrolled students
  • Private - Binary indicator distinguishing public and private institutions
  • Region - Geographic region of the institution
  • Adm_Rate - Admissions rate, the proportion of applicants who are admitted
  • ACT_median - Median composite ACT score of enrolled students
  • ACT_Q1 - 25th percentile composite ACT score of enrolled students
  • ACT_Q3 - 75th percentile composite ACT score of enrolled students
  • Cost - Average yearly cost of attendance
  • Net_Tuition - Average tuition cost after discounts (scholarships, grants, etc.)
  • Avg_Fac_Salary - Average faculty salary
  • PercentFemale - Proportion of enrolled students who are female
  • PercentWhite - Proportion of enrolled students who identify as White
  • PercentBlack - Proportion of enrolled students who identify as Black
  • PercentHispanic - Proportion of enrolled students who identify as Hispanic
  • PercentAsian - Proportion of enrolled students who identify as Asian
  • FourYearComp_Males - Proportion of male students who go on to earn a degree within four years of their initial enrollment
  • FourYearComp_Females - Proportion of female students who go on to earn a degree within four years of their initial enrollment
  • Debt_median - Median student debt upon leaving the institution
  • Salary10yr_median - Median salary 10 years after graduating from the institution

Note: This dataset has been filtered further to exclude colleges with missing data for one or more of these variables (missing data is not compatible with StatKey).

\(~\)

Analysis

Question #1: Summarize the distribution of median salaries 10 years after graduation for the colleges in this dataset. In doing so, you should comment upon the shape, central tendency, and spread of the relevant variable.

Question #2: Is there an association between whether a college is private or public and the median salary of its students 10 years after graduation? Answer this question using side-by-side graphs and summary statistics. In doing so, write 1-2 sentences describing the association (or lack thereof) that is present in these data.

Question #3: Summarize the distribution of median ACT scores of the colleges in this dataset. In doing so, you should comment upon the shape, central tendency, and spread of the relevant variable(s).

Question #4: In 1-2 sentences, describe the relationship between the median ACT score of a college and the median salary of its graduates 10 years after graduation. Include any relevant StatKey output in your lab write-up.

Question #5: Using a linear regression equation, predict the expected difference in the median 10-year salary between a college with a median ACT of 30 and a college with a median ACT of 20.
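The arithmetic behind a question like this can be sketched in Python. For a fitted line of the form predicted = intercept + slope × x, the intercept cancels when two predictions are subtracted, so the difference depends only on the slope and the gap between the two x values. The function name and the slope below are hypothetical placeholders, not values from the College 2019 data; you would substitute the slope that StatKey reports.

```python
def predicted_difference(slope, x_high, x_low):
    """Difference between two predictions from the same regression line.

    (intercept + slope * x_high) - (intercept + slope * x_low)
    simplifies to slope * (x_high - x_low), so no intercept is needed.
    """
    return slope * (x_high - x_low)


# ASSUMPTION: placeholder slope for illustration only (dollars per ACT point).
# It is NOT the slope for the College 2019 data.
placeholder_slope = 1000.0

diff = predicted_difference(placeholder_slope, x_high=30, x_low=20)
print(f"Predicted difference: {diff:.0f}")
```

Because only the slope matters, the same function works for any pair of ACT values that are 10 points apart.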

Question #6: Identify any two variables that you suspect might have an interesting connection. Then, use StatKey to explore their relationship. In your lab write-up, provide a 2-3 sentence summary of your investigation, along with any relevant graphs or descriptive statistics.

\(~\)

Dataset #2 - Police Involved Deaths

The Washington Post manages a comprehensive database, dating back to 2015, of instances where police officers have used deadly force on a suspect.

These data contain the following variables:

  • name - name of the individual killed
  • date - date of the incident
  • year - year (extracted from date)
  • manner_of_death - cause of death
  • armed - weapon the killed individual was carrying (if applicable)
  • age - age of the individual killed
  • gender - gender of the individual killed
  • race - racial/ethnic group of the individual killed, using the US Census designations, “A” = Asian, “B” = Black, “H” = Hispanic (of any race), “N” = Native American, “O” = Other, “W” = Non-Hispanic White
  • city - location of the incident
  • state - state where the incident took place
  • signs_of_mental_illness - whether any past signs of mental illness were present
  • threat_level - whether the suspect was attacking the involved officers
  • flee - how the suspect fled (if applicable)
  • body_camera - whether any of the involved officers were wearing a body camera

\(~\)

Analysis

Question #7: According to the US Census, the current racial composition of the US is 61.5% Non-Hispanic White, 17.6% Hispanic (of any race), 12.3% Black, 5.3% Asian, 0.7% Native American, and 2.6% other races. Does the racial distribution of individuals killed by the police appear to mirror that of the Census? Or do certain racial groups appear to be overrepresented? Include the appropriate descriptive statistics and/or graphs created by StatKey to support your answers.
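One way to organize this comparison, sketched here in Python under stated assumptions, is to divide each group's observed share by its Census share. A ratio above 1 suggests overrepresentation relative to the Census, and a ratio below 1 suggests underrepresentation. The Census shares are the ones quoted in the question; the counts are deliberately artificial placeholders (the same count for every group), and the function name is hypothetical. Replace the counts with the frequencies that StatKey reports for the race variable.

```python
# Census shares quoted in Question #7, keyed by the dataset's race codes.
census_share = {
    "W": 0.615, "H": 0.176, "B": 0.123,
    "A": 0.053, "N": 0.007, "O": 0.026,
}


def representation_ratios(counts, reference):
    """Observed share of each group divided by its reference share."""
    total = sum(counts.values())
    return {group: (counts[group] / total) / reference[group]
            for group in reference}


# ASSUMPTION: artificial placeholder counts, NOT the real data.
placeholder_counts = {group: 100 for group in census_share}

ratios = representation_ratios(placeholder_counts, census_share)
for group, ratio in ratios.items():
    print(f"{group}: observed/Census ratio = {ratio:.2f}")
```

A ratio is easier to compare across groups than a raw difference in percentages, because the groups differ greatly in size.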

Question #8: Use these data to assess whether previous signs of mental illness are associated with a decreased likelihood of the individual attempting to flee the scene during a deadly confrontation with police. Support your answer using an appropriate set of descriptive statistics.
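The descriptive statistics that suit this kind of question are conditional proportions: the share of individuals who attempted to flee, computed separately within each signs_of_mental_illness group. The Python sketch below illustrates the calculation on a tiny made-up list of records. It assumes the flee variable has already been collapsed to a yes/no value, and the function name and toy records are hypothetical, not drawn from the Washington Post data.

```python
def flee_rate_by_group(records):
    """Proportion who fled within each mental-illness group.

    records: iterable of (signs_of_mental_illness, fled) boolean pairs.
    """
    counts = {}  # group -> [number who fled, group total]
    for signs, fled in records:
        tally = counts.setdefault(signs, [0, 0])
        tally[1] += 1
        if fled:
            tally[0] += 1
    return {group: n_fled / n_total
            for group, (n_fled, n_total) in counts.items()}


# ASSUMPTION: toy records invented for illustration, NOT the real data.
toy_records = [
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, False),
]

rates = flee_rate_by_group(toy_records)
print(rates)
```

Comparing the two within-group proportions, rather than raw counts, keeps the comparison fair when the groups are of different sizes.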

Question #9: For the span of these data, which state had the most police-involved deaths? Can you think of a possible reason (aside from issues related to policing) why this particular state had the most cases in this dataset?

Question #10: Does the proportion of police-involved deaths with a body-camera present appear to be increasing, decreasing, or remaining approximately constant over time? Justify your answer using an appropriate set of descriptive statistics.
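A trend like this is usually assessed by computing the proportion separately for each year and then reading the proportions in order. The Python sketch below shows that calculation on a handful of made-up records. The function name and toy records are hypothetical and are not drawn from the Washington Post data; in the lab itself, the year and body_camera variables would supply the inputs.

```python
from collections import defaultdict


def proportion_by_year(records):
    """Proportion of incidents with a body camera present, per year.

    records: iterable of (year, body_camera) pairs, body_camera a boolean.
    """
    totals = defaultdict(int)
    with_camera = defaultdict(int)
    for year, body_camera in records:
        totals[year] += 1
        if body_camera:
            with_camera[year] += 1
    # Sort by year so the proportions read chronologically.
    return {year: with_camera[year] / totals[year] for year in sorted(totals)}


# ASSUMPTION: toy records invented for illustration, NOT the real data.
toy_records = [
    (2015, False), (2015, True), (2015, False), (2015, False),
    (2016, True), (2016, False),
]

props = proportion_by_year(toy_records)
for year, p in props.items():
    print(f"{year}: {p:.2f}")
```

Using proportions rather than counts matters here because the number of incidents recorded can differ from year to year.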

Question #11: Identify any two variables that you suspect might have an interesting connection. Then, use StatKey to explore their relationship. In your lab write-up, provide a 2-3 sentence summary of your investigation, along with any relevant graphs or descriptive statistics.