A simplified data science project life cycle is shown below:

While many data scientists are closely involved in data collection, the majority of projects in our course will begin with the data already collected. If you aren’t involved in the data collection stage of a project, you should devote extra time and attention to the context, background, and purpose of the project.

Next, the process of reaching a usable project endpoint involves iteratively engaging in:

  1. Data cleaning and manipulation
  2. Data visualization and exploration
  3. Modeling and analysis

These steps are often repeated several times as new insights are learned during each pass through the cycle.

At some point, results are either disseminated/deployed, or additional data is collected in hopes of achieving a better outcome.


Today’s activity

To get to know each other (and refresh our memories on these steps), we’ll carry out a miniature project following the life cycle described above and then reflect upon it.

The data you’ll work with is available here: https://remiller1450.github.io/data/admissions.csv
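As a starting point, here is a minimal Python sketch of one way to load and take a first look at the file, assuming only the standard library is available. The helper names (`read_csv_text`, `fetch_csv`, `summarize`) are hypothetical and chosen for illustration; the course may use different tools, and nothing here assumes particular column names.

```python
import csv
import io
from collections import Counter
from urllib.request import urlopen

# URL given in the activity description above.
DATA_URL = "https://remiller1450.github.io/data/admissions.csv"


def read_csv_text(text):
    """Parse CSV text into a list of dicts, one per row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url=DATA_URL):
    """Download a CSV file and parse it (hypothetical helper; needs network access)."""
    with urlopen(url) as response:
        # "utf-8-sig" also strips a byte-order mark if the file has one.
        text = response.read().decode("utf-8-sig")
    return read_csv_text(text)


def summarize(rows):
    """Count how often each distinct value appears in every column.

    Returns a dict mapping column name -> Counter of values. All values are
    kept as strings, because the csv module does not infer types.
    """
    if not rows:
        return {}
    return {col: Counter(row[col] for row in rows) for col in rows[0]}
```

Calling `rows = fetch_csv()` followed by `summarize(rows)` gives a quick view of which columns exist and which values they take, which is a reasonable first pass before any cleaning or visualization.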

These data were collected by the routine procedures of a large public university. They were queried in response to allegations of sex-based discrimination in the admissions to the university’s graduate programs.

Below are brief descriptions of the variables contained in these data: