Sta-230 Final Project

Description

This project is a start-to-finish data science application on a non-trivial data set of your choosing. The final product is a three-page written report accompanied by R code and documentation.

Goals

Mastery: Demonstrate a high level of competence in the concepts, methods, and tools we covered during the semester (e.g., ggplot2 graphics and data wrangling)
Communication: Use familiar data science tools to take a complex data set and communicate important trends, relationships, and results to a general audience
Research: Learn about and apply one or more new data science tools or methods. This could involve researching a completely new method, or an in-depth self-study of a method introduced in class.

General Details

Your project is expected to provide evidence of your skills in the following areas:

Data processing and cleaning
- This includes merging/joining, string manipulations, tidying, and deriving new variables
Data visualization
- This includes creating at least two professional quality visuals
- Alternatively, you may create an R Shiny app as a substitute for this requirement
Modeling
- You may use any supervised or unsupervised modeling approach suitable for your data and research question
- You are expected to be able to answer questions about how your model works, which likely requires you to research the method. For example, if you decide that a random forest is the best model for your application, you should do enough research to be able to explain random forests at a conceptual level.
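
As a rough illustration of the data processing and cleaning skills listed above (string manipulation, joining, and deriving a new variable), here is a minimal base R sketch. The two data frames, their column names, and their values are hypothetical and invented purely for illustration; your own project will need steps tailored to your data.

```r
# Hypothetical tables, invented for illustration only
sales <- data.frame(
  county = c(" polk", "Story ", "LINN"),
  units  = c(120, 80, 200)
)
pop <- data.frame(
  county     = c("Polk", "Story", "Linn"),
  population = c(1000, 500, 800)
)

# String manipulation: trim whitespace, then standardize capitalization
clean <- tolower(trimws(sales$county))
sales$county <- paste0(toupper(substring(clean, 1, 1)), substring(clean, 2))

# Joining: merge the two tables on the cleaned key
merged <- merge(sales, pop, by = "county")

# Deriving a new variable from existing columns
merged$units_per_1000 <- merged$units / merged$population * 1000
```

The same steps could be written with the tidyverse tools used in class; base R is used here only to keep the sketch self-contained.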

Remarks on Balance: Not all of these components need to contribute equally to your project. Your technical proficiency will be scored separately in each of these categories, and you will receive another score for the overall level of difficulty of your project, which may be spread evenly across these areas or concentrated in only a few of them.

Groups

You will work in a group of 2 or 3 (including yourself). You may choose your group members, or you may ask to be randomly assigned to a group. Under special circumstances you may be allowed to work individually on the project, but you should consult with the instructor before doing so. As a reminder, you may not work with the same partner from the R Shiny project.

Data Sources

You are allowed to choose your own data source with the following exceptions:

It cannot come from a crowd-sourcing website (such as kaggle.com or data.world)
It cannot come from a textbook or R package

If you need help finding a suitable data source, here are a few places you can start:

I encourage you to work with data that interests you. So, if you’ve worked on a research project in another class or over the summer, you may re-use that data.

You are also allowed to re-use your data from the R Shiny project, so long as you do something substantially different for this project. If you’re going this route, I encourage you to seek out new data that you could merge/join with your existing data to increase the degree of difficulty of the project.


Report Details

The expected project report format follows the Undergraduate Statistics Class Project Competition (USCLAP) guidelines. To summarize, the project should be no more than three pages (single-spaced) and should include:

A title page with a one paragraph abstract (150-word maximum)
An introduction that includes a clearly stated research question and background on the topic
A methods section that describes how the data were obtained, processed, and analyzed
A results section that presents key findings using text, figures, and tables
A discussion section that puts the results into context, addresses limitations or assumptions, and describes possible future research
A list of references (which does not contribute to the three-page limit)

You are encouraged to add additional figures, tables, and text in a supplemental appendix. The appendix can contain the results of secondary analyses, evaluations of model assumptions, supplemental visualizations, etc. The purpose of this section is to share substantive work that you’ve done that isn’t necessary to understand your final results.

For examples of how the paper should be formatted, you can view the submissions of past USCLAP winners. Because our course lists Sta-209 as a pre-req, the “intermediate statistics” category examples are most reflective of the expectations for this course.

If interested, I’d encourage you to consider submitting your project to the USCLAP competition. Numerous Grinnell students have won or placed in this competition, including students from Sta-230.

Timeline

Tuesday 11/28 - 1-paragraph proposal, including your data source, analysis goals, and group members
Week of 12/4 to 12/8 - progress meeting including documented progress
Finals Week - final report and all materials (e.g., code and data) are due, submitted as a zipped folder, by 5:00pm on Friday 12/15

Assessment Details

Proposal - 1 pt

Proposal must include your intended data source and a brief outline of your planned analyses
It is okay to modify these plans later as you begin working with your data, but you must at least describe a viable starting point

Report - 70 pts

Formatting - 10 pts - your report should include all of the sections mentioned in the “Report Details” section of this assignment. Each section should contain the proper contents.
Clarity - 15 pts - your report should be written in a professional tone using clear and concise language. A peer who has taken this course should be able to easily understand your methods and results.
Data sourcing, cleaning, and manipulation - 15 pts - your methods section should include an accurate description of your data source, the steps taken to clean it, and any other manipulations (e.g., variable transformations).
Data visualization - 15 pts - your report should include at least two publication quality data visualizations. You may instead opt to create an R Shiny application to fulfill this component.
Modeling - 15 pts - your report should describe the modeling methods you used, and your results section should report on these models in an appropriate manner (e.g., proper model accuracy criteria and proper interpretations)

Data and Raw Code - 25 pts

At minimum, I should be able to recreate all figures and tables used in your presentation and paper by running your code (assuming I have moved your raw data into the correct location)
- Ideally, I can recreate your paper in full by knitting “MyName.Rmd”, but it’s okay if you’d like to write the paper in Word (or another program)
All data processing must be done using R. Manipulating your raw data outside of the R environment is not reproducible, and points will be deducted if I suspect you did this
Your code should be neatly organized (good use of spaces and indentation) and should include comments explaining each non-trivial R command that you use
Your submitted code should include only the code needed for data cleaning and for replicating the analyses used in your paper. Exploratory code should not be submitted
Your code should also be reasonably efficient and use generalizable practices whenever possible. This means avoiding things like “magic number” indexing, repetitive approaches to tasks that could be accomplished by a single command, etc.
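
To make the last point concrete, here is a minimal sketch contrasting “magic number” indexing and repetition with a more generalizable approach. The data frame `scores` and its columns are hypothetical, invented only for this example.

```r
# Hypothetical data frame, invented for illustration only
scores <- data.frame(
  id    = 1:4,
  exam1 = c(80, 90, 70, 85),
  exam2 = c(75, 95, 65, 88),
  exam3 = c(82, 91, 72, 90)
)

# Fragile and repetitive: the column positions 2, 3, and 4 are "magic numbers"
# that silently give wrong answers if columns are added or reordered
# exam_means <- c(mean(scores[, 2]), mean(scores[, 3]), mean(scores[, 4]))

# More generalizable: select columns by name pattern, then summarize them
# all with a single command
exam_cols  <- grep("^exam", names(scores), value = TRUE)
exam_means <- colMeans(scores[, exam_cols])
```

The second version keeps working if another exam column is added later, which is the kind of practice this criterion rewards.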

Level of Difficulty - 25 pts

Your project should demonstrate a level of difficulty that exceeds homework and lab assignments
Below are a few things you can do to achieve a high score in this category:
- Choosing to work with complex, messy data
- Constructing exceptional data visualizations, or using R Shiny as a data visualization tool
- Choosing to research/self-study new model methods that are better suited for your application than the methods discussed in class
- Sophisticated comparisons of multiple analysis approaches (e.g., careful comparisons of differing clustering approaches or different modeling approaches)
- Research into new data manipulation or visualization approaches not discussed in class (such as data.table, web scraping, new types of graphics, etc.)
This score is holistic, so you may use any combination of the above strategies (or even a single one of them, given enough depth)
- You are welcome to optionally submit a level of difficulty statement to provide context on your project; however, unlike the R Shiny project, this statement is not required.

Progress Meeting - 3 pts

You must provide documented evidence of progress prior to your meeting. This typically takes the form of an R Markdown document. Greater progress and more documented content at this meeting tend to be positively correlated with final grades on the project.
You will be able to sign up for one of many time slots during the week of 12/4 to 12/8. The sign-up will be published after Thanksgiving break.

Self-review - 1 pt

Upon completing the project, you will be asked to fill out a short self-review questionnaire about the course and the statistics program at Grinnell (separate from the standard end of semester course evaluations). If you complete this questionnaire you will receive credit.


Other Advice

I strongly suggest you read through a few of the USCLAP finalist papers before you begin writing to get a sense of the formatting expectations and the variety of topics and strategies that work for this type of project.
An Introduction to Statistical Learning is an excellent resource for understanding modeling methods that we didn’t cover in class. The authors of the text also have a series of YouTube video lectures that go topic by topic through the book. You might consider researching and using one of these methods as part of your project.