A simplified data science project life cycle is shown below:
While many data scientists are closely involved in data collection, most projects in our course will begin with the data already collected. If you aren’t involved in the data collection stage of a project, you should devote extra time and attention to understanding the project’s context, background, and purpose.
Next, the process of reaching a usable project endpoint involves iteratively engaging in:
These steps are often repeated several times as new insights are gained during each pass through the cycle.
At some point, results are either disseminated/deployed, or additional data is collected in hopes of achieving a better outcome.
To get to know each other (and refresh our memories on these steps), we’ll carry out a miniature project that follows this life cycle and then reflect upon it.
The data you’ll work with is available here: https://remiller1450.github.io/data/admissions.csv
These data were collected by the routine procedures of a large public university. They were queried in response to allegations of sex-based discrimination in the admissions to the university’s graduate programs.
Below are brief descriptions of the variables contained in these data:
ID - a unique applicant identifier
dept - an identifier of the graduate department the applicant applied to
sex - the sex of the applicant
gpa - the applicant’s undergraduate grade point average
admit - whether the applicant was admitted
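As a first pass at getting to know these variables, the sketch below reads the file and counts the values of admit within each level of another variable. It is only an illustration: the helper names `load_rows` and `tabulate_admit` are hypothetical, and it assumes the CSV's column headers match the variable names listed above exactly. It makes no assumption about how admit is coded, since it simply counts whatever values appear.

```python
import csv
import io
import urllib.request
from collections import Counter, defaultdict

DATA_URL = "https://remiller1450.github.io/data/admissions.csv"


def load_rows(url=DATA_URL):
    """Download the CSV and return one dict per applicant, keyed by column name."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def tabulate_admit(rows, by):
    """Count the values of 'admit' within each level of the column `by`.

    Assumes each row has an 'admit' key and a key named `by`
    (for example 'sex' or 'dept').
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[by]][row["admit"]] += 1
    return {level: dict(c) for level, c in counts.items()}


# Example usage (requires network access):
# rows = load_rows()
# print(len(rows), "applicants; columns:", list(rows[0].keys()))
# print(tabulate_admit(rows, by="sex"))
# print(tabulate_admit(rows, by="dept"))
```

Only the standard library is used here so the sketch runs without extra installs; a data frame library would work just as well for the same tabulation.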