Description

This project is a start-to-finish data science application on a non-trivial data set of your choosing. The final product is a three-page written report accompanied by R code and documentation.

Goals

  1. Mastery: Demonstrate a high level of competence in the concepts, methods, and tools we covered during the semester (e.g., ggplot2 graphics and data wrangling)
  2. Communication: Use familiar data science tools to take a complex data set and communicate important trends, relationships, and results to a general audience
  3. Research: Learn about and apply one or more new data science tools or methods. This could involve researching a completely new method or completing an in-depth self-study of a method introduced in class.

General Details

Your project is expected to provide evidence of your skills in the following areas:

  • Data processing and cleaning
    • This includes merging/joining, string manipulations, tidying, and deriving new variables
  • Data visualization
    • This includes creating at least two professional quality visuals
    • Alternatively, you may create an R Shiny app as a substitute for this requirement
  • Modeling
    • You may use any supervised or unsupervised modeling approach suitable for your data and research question
    • You are expected to be able to answer questions about how your model works, which likely requires you to research the method. For example, if you decide that a random forest is the best model for your application, you should do enough research to be able to explain random forests at a conceptual level.
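To make these three areas concrete, here is a minimal sketch in R. Everything in it is hypothetical: the `sites` and `readings` data frames, their column names, and their values are made up for illustration, and the linear model is only a placeholder for whatever method suits your research question. It assumes the dplyr, stringr, and ggplot2 packages are installed.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Hypothetical toy data standing in for a real project data set
sites <- data.frame(
  site_id = c("a-01", "B-02", "c-03", "D-04"),
  region  = c("north", "north", "south", "south"),
  stringsAsFactors = FALSE
)
readings <- data.frame(
  site_id = c("A-01", "A-01", "B-02", "C-03", "C-03", "D-04"),
  temp_f  = c(50, 59, 68, 77, 86, 95),
  yield   = c(2.1, 2.4, 3.0, 3.3, 3.9, 4.2),
  stringsAsFactors = FALSE
)

# Data processing: fix inconsistent key casing, join, derive a new variable
cleaned <- sites %>%
  mutate(site_id = str_to_upper(site_id)) %>%
  inner_join(readings, by = "site_id") %>%
  mutate(temp_c = (temp_f - 32) * 5 / 9)

# Data visualization: a labeled scatterplot (a starting point, not a finished visual)
p <- ggplot(cleaned, aes(x = temp_c, y = yield, color = region)) +
  geom_point() +
  labs(x = "Temperature (Celsius)", y = "Yield", color = "Region")

# Modeling: a simple linear model as a placeholder for your chosen method
fit <- lm(yield ~ temp_c, data = cleaned)
```

A real project would go well beyond this, but the same pattern applies: clean and join first, visualize second, and fit a model you can explain last.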

Remarks on Balance: Not all of these components need to contribute equally to your project. Your technical proficiency will be scored separately in each of these categories, and you will receive another score for the overall level of difficulty of your project, which may be spread evenly across these areas or concentrated in only a few of them.

Groups

You will work in a group of 2 or 3 (including yourself). You may choose your group members, or you may ask to be randomly assigned to a group. Under special circumstances you may be allowed to work individually on the project, but you should consult with the instructor before doing so. As a reminder, you may not work with the same partner from the R Shiny project.

Data Sources

You are allowed to choose your own data source with the following exceptions:

  1. It cannot come from a crowd-sourcing website (such as kaggle.com or data.world)
  2. It cannot come from a textbook or R package

If you need help finding a suitable data source, here are a few places you can start:

  1. Grinnell College Libraries Data homepage
  2. This project ideas page I’ve used in the past.
  3. Data.gov, an open database run by the US federal government

I encourage you to work with data that interests you. So, if you’ve worked on a research project in another class or over the summer, you may re-use that data.

You are also allowed to re-use your data from the R Shiny project, so long as you do something substantially different for this project. If you’re going this route, I encourage you to seek out new data that you could merge/join with your existing data to increase the degree of difficulty of the project.
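As a rough illustration of merging a new source into existing data, the sketch below uses dplyr. The `existing` and `new_source` data frames, the `county` key, and all values are hypothetical placeholders, not data from any real source.

```r
library(dplyr)

# Hypothetical data re-used from an earlier project (made-up values)
existing <- data.frame(
  county = c("A", "B", "C"),
  cases  = c(120, 85, 40),
  stringsAsFactors = FALSE
)

# Hypothetical new source to merge in (made-up values)
new_source <- data.frame(
  county     = c("A", "B", "D"),
  population = c(500000, 250000, 100000),
  stringsAsFactors = FALSE
)

# Keep every row of the existing data and attach the new variable where keys match
combined <- left_join(existing, new_source, by = "county") %>%
  mutate(cases_per_100k = cases / population * 1e5)

# List rows of the existing data with no match in the new source
unmatched <- anti_join(existing, new_source, by = "county")
```

Checking `unmatched` after a join is a quick way to catch keys that are spelled or coded differently across sources.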

Report Details

The expected project report format follows the Undergraduate Statistics Class Project Competition (USCLAP) guidelines. To summarize, the project should be no more than three pages (single-spaced) and should include:

You are encouraged to add additional figures, tables, and text in a supplemental appendix. The appendix can contain the results of secondary analyses, evaluations of model assumptions, supplemental visualizations, etc. The purpose of this section is to share substantive work that you’ve done that isn’t necessary for understanding your final results.

For examples of how the paper should be formatted, you can view the submissions of past USCLAP winners. Because our course lists Sta-209 as a pre-req, the “intermediate statistics” category examples are most reflective of the expectations for this course.

If you are interested, I’d encourage you to consider submitting your project to the USCLAP competition. Numerous Grinnell students have won or placed in this competition, including students from Sta-230.

Timeline

Assessment Details

Proposal - 1 pt

Report - 70 pts

Data and Raw Code - 25 pts

Level of Difficulty - 25 pts

Progress Meeting - 3 pts

Self-review - 1 pt

Other Advice