Sta-330 Intro

While many data scientists are closely involved in data collection, most projects in our course will begin after data collection has already been done. If you aren’t involved in the data collection stage of a project, you should devote extra time and attention to the context, background, and purpose of the project.

Next, the process of reaching a usable project endpoint involves iteratively engaging in:

These steps are often repeated several times as new insights are gained during each pass through the cycle.

At some point, results are either disseminated/deployed, or additional data are collected in hopes of achieving a better outcome.

Today’s activity

To get to know each other (and refresh our memories on these steps), we’ll carry out a miniature project that follows the life cycle described above and then reflect on it.

The data you’ll work with is available here: https://remiller1450.github.io/data/admissions.csv

These data were collected by the routine procedures of a large public university. They were queried in response to allegations of sex-based discrimination in the admissions to the university’s graduate programs.

Your group’s goal is to use these data to decide whether there is evidence of discrimination in the university’s admissions.
You should aim to present one or more data visualizations and a 1-3 sentence executive summary that uses modeling or data analysis to support those visualizations.
Additionally, I’d like you to follow the data science life cycle described in this document and briefly summarize what you did on each pass through the cycle. You should aim to go through the cycle at least twice.

Below are brief descriptions of the variables contained in these data:

ID - a unique applicant identifier
dept - an identifier of the graduate department the applicant applied to
sex - the sex of the applicant
gpa - the applicant’s undergraduate grade point average
admit - whether the applicant was admitted
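
As a possible starting point for a first pass through the cycle, the sketch below reads the file and tabulates the proportion admitted within groups. It is only one way to begin, not a prescribed method. It assumes Python with pandas, and the helper names `load_admissions` and `admission_rates` are hypothetical. The column names come from the variable list above, but the label used in admit to mark an admitted applicant is not documented here, so it is passed in as an argument; inspect the column before choosing it.

```python
import pandas as pd

# URL taken from the activity description above.
DATA_URL = "https://remiller1450.github.io/data/admissions.csv"


def load_admissions(source=DATA_URL):
    """Read the admissions data.

    `source` may be a URL, a local path, or a file-like object.
    (Hypothetical helper, a thin wrapper around pandas.read_csv.)
    """
    return pd.read_csv(source)


def admission_rates(df, by, admitted_value):
    """Proportion of applicants admitted within each group defined by `by`.

    `by` is a column name or a list of column names (e.g. "sex").
    `admitted_value` is the label in the `admit` column that marks an
    admitted applicant. ASSUMPTION: the encoding is not documented, so
    check df["admit"].unique() before choosing this value.
    """
    # Convert admit to True/False so that the group mean is a proportion.
    flagged = df.assign(admitted=df["admit"] == admitted_value)
    return flagged.groupby(by)["admitted"].mean()
```

For example, after `df = load_admissions()`, calling `df["admit"].unique()` shows how admission is recorded, and `admission_rates(df, "sex", admitted_value=...)` with that label gives one summary to examine. Passing a different column, or a list of columns, as `by` gives other views of the same data to compare on a later pass.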