A simplified data science project life cycle is shown below:

The cycle begins with data collection. For many projects the data has already been collected, but you still need to devote time and attention to how it was collected and to the broader context behind it. For other projects, you may be expected to contribute to the planning of data collection.

Before reaching a usable project endpoint there is an iterative cycle of:

  1. Data cleaning and manipulation
  2. Data visualization and exploration
  3. Modeling and analysis

These steps are repeated in response to new insights learned during previous passes through the cycle. It’s very difficult to achieve an ideal model, analysis, or conclusion on the first try.

At some point, results are either disseminated or deployed, or additional data is collected in the hope that it will lead to a better outcome.


Today’s activity

The rest of today will be devoted to a mini-project with three aims: helping everyone get to know one another, covering presentation guidelines, and reflecting on the data science life cycle.

The data you’ll work with is available here:

https://remiller1450.github.io/data/admissions.csv
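
As a starting point, the file can be downloaded and inspected before any cleaning or modeling. The sketch below is one possible way to do this in Python using only the standard library; it is not part of the original activity. The function names (`fetch_text`, `parse_rows`, `summarize`) are hypothetical helpers introduced here, and the UTF-8 encoding is an assumption. No column names are assumed, since the header row of the file supplies them.

```python
import csv
import io
import urllib.request
from collections import Counter

DATA_URL = "https://remiller1450.github.io/data/admissions.csv"


def fetch_text(url=DATA_URL):
    """Download the CSV and return its contents as a string (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def summarize(rows):
    """Count how often each value appears in each column."""
    if not rows:
        return {}
    return {column: Counter(row[column] for row in rows) for column in rows[0]}
```

Calling `summarize(parse_rows(fetch_text()))` returns one counter per column, which gives a quick view of the categories present and of any unexpected or missing values. For columns with many distinct values, the counters will be long and a different summary may be more useful.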

These data were collected by a public US university and were queried in response to allegations of sex-based discrimination in admissions to the university’s graduate programs.

Below are brief descriptions of the variables contained in these data:

Your group’s goal is to use these data to decide whether there is sufficient evidence of discrimination in the university’s admissions.
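
One possible first pass at this question is to compare outcome rates across groups, both overall and within subgroups, since the two views do not always agree. The sketch below illustrates the calculation in Python. The helper `rate_by_group` and the column names `group`, `unit`, and `result` are hypothetical, and the rows are made up for illustration; they are not taken from admissions.csv, so substitute the actual variable names from the data.

```python
from collections import defaultdict


def rate_by_group(rows, group_cols, outcome_col, positive):
    """Return {group key: share of rows whose outcome equals `positive`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        totals[key] += 1
        if row[outcome_col] == positive:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Made-up rows for illustration only; not taken from admissions.csv.
toy = [
    {"group": "A", "unit": "X", "result": "yes"},
    {"group": "A", "unit": "X", "result": "no"},
    {"group": "B", "unit": "X", "result": "yes"},
    {"group": "B", "unit": "Y", "result": "no"},
]

# Rates for each group overall, then for each group within each unit.
overall = rate_by_group(toy, ["group"], "result", "yes")
by_unit = rate_by_group(toy, ["group", "unit"], "result", "yes")
```

In this toy example the two groups have the same overall rate but different rates within unit `X`, which is why it can be worth looking at both levels before drawing a conclusion.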

After completing your analysis, you will be paired with another group with whom you’ll share your executive summaries, provide critiques, and attempt to reach a shared conclusion.
