Description
This project is a start-to-finish data science application on a
non-trivial data set of your choosing. The final product is a three-page
written report accompanied by R code and documentation.
Goals
- Mastery: Demonstrate a high level of competence in
concepts, methods, and tools we covered during the semester (e.g.,
ggplot2 graphics and data wrangling)
- Communication: Use familiar data science tools to
take a complex data set and communicate important trends, relationships,
and results to a general audience
- Research: Learn about and apply one or more new
data science tools or methods. This could involve researching a
completely new method, or an in-depth self-study of a method introduced
in class.
General Details
Your project is expected to provide evidence of your skills in the
following areas:
- Data processing and cleaning
- This includes merging/joining, string manipulations, tidying, and
deriving new variables
- Data visualization
- This includes creating at least two professional-quality visuals
- Alternatively, you may create an R Shiny app as a
substitute for this requirement
- Modeling
- You may use any supervised or unsupervised modeling approach
suitable for your data and research question
- You are expected to be able to answer questions about how your model
works, which will likely require you to research the method. For
example, if you decide that a random forest is the best model for your
application, you should do enough research to be able to explain random
forests at a conceptual level.
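As a rough illustration of the data processing skills listed above, the sketch below chains a string cleanup, a reshape to tidy format, a join, and a derived variable. The two small data frames and every column name in them (`site`, `y2022`, `area_km2`, and so on) are made up for the example, and it assumes the dplyr, tidyr, and stringr packages are installed. Your own project will involve real, messier data.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy tables (not real data): yearly counts per site, with
# inconsistently formatted site names, plus a lookup table of site areas
counts <- data.frame(
  site  = c(" north ", "SOUTH", "east"),
  y2022 = c(10, 20, 30),
  y2023 = c(15, 18, 33)
)
areas <- data.frame(
  site     = c("North", "South", "East"),
  area_km2 = c(2, 4, 5)
)

clean <- counts %>%
  # String manipulation: trim whitespace and standardize capitalization
  mutate(site = str_to_title(str_trim(site))) %>%
  # Tidying: one row per site-year instead of one column per year
  pivot_longer(cols = starts_with("y"),
               names_to = "year", values_to = "count") %>%
  mutate(year = as.integer(str_remove(year, "^y"))) %>%
  # Joining: bring in the area of each site by its cleaned name
  left_join(areas, by = "site") %>%
  # Deriving a new variable from existing columns
  mutate(density = count / area_km2)

clean
```

Note that the join only works because the site names were cleaned first, which is the kind of dependency between steps worth describing in your methods section.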
Remarks on Balance: Not all of these components need
to contribute equally to your project. Your technical proficiency will
be scored separately in each of these categories, and you will receive
another score for the overall level of difficulty of your project, which
may be spread evenly across these areas or concentrated in only a few
of them.
Groups
You will work in a group of 2 or 3 (including yourself). You may
choose your group members, or you may ask to be randomly assigned to
a group. Under special circumstances you may be allowed to work
individually on the project, but you should consult with the instructor
before doing so. As a reminder, you may not work with the same partner from
the R Shiny project.
Data Sources
You are allowed to choose your own data source with the following
exceptions:
- It cannot come from a crowd-sourcing website (such as kaggle.com or
data.world)
- It cannot come from a textbook or R package
If you need help finding a suitable data source, here are a few
places you can start:
- Grinnell College Libraries Data homepage
- This project ideas page I’ve used in the past.
- Data.gov, an open database run by the US
federal government
I encourage you to work with data that interests you. So, if you’ve
worked on a research project in another class or over the summer, you
may re-use that data.
You are also allowed to re-use your data from the R Shiny project, so
long as you do something substantially different for this project. If
you’re going this route, I encourage you to seek out new data that you could
merge/join with your existing data to increase the degree of difficulty
of the project.
Report Details
The expected project report format follows the Undergraduate Statistics
Class Project Competition (USCLAP) guidelines. To summarize, the project
should be no more than three pages (single-spaced) and should
include:
- A title page with a one paragraph abstract (150-word
maximum)
- An introduction that includes a clearly stated research
question and background on the topic
- A methods section that describes how the data were
obtained, processed, and analyzed
- A results section that presents key findings using text,
figures, and tables
- A discussion section that puts the results into context,
addresses limitations or assumptions, and describes possible future
research
- A list of references (which does not contribute to the
three-page limit)
You are encouraged to add additional figures, tables, and text in a
supplemental appendix. The appendix can contain the results of secondary
analyses, evaluations of model assumptions, supplemental visualizations,
etc. The purpose of this section is to share substantive work that
you’ve done that isn’t necessary for understanding your final
results.
For examples of how the paper should be formatted, you can view the
submissions of past USCLAP winners. Because our course lists Sta-209 as a
pre-req, the
“intermediate statistics” category examples are most reflective of the
expectations for this course.
If interested, I’d encourage you to consider submitting your project
to the USCLAP competition. Numerous Grinnell students have won or placed
in this competition, including students from Sta-230.
Timeline
- Tuesday 11/28 - 1-paragraph proposal, including your data source,
analysis goals, and group members
- Week of 12/4 to 12/8 - progress meeting including documented
progress
- Finals Week - final report and all materials (code, data, etc.)
are due (submit as a zipped folder) by 5:00pm on Friday 12/15
Assessment Details
Proposal - 1 pt
- Proposal must include your intended data source and a brief outline
of your planned analyses
- It is okay to modify these plans later as you begin working with
your data, but you must at least describe a viable starting point
Report - 70 pts
- Formatting - 10 pts - your report should include all of the
sections mentioned in the “Report Details” section of this assignment.
Each section should contain the proper contents.
- Clarity - 15 pts - your report should be written in a
professional tone using clear and concise language. A peer who has taken
this course should be able to easily understand your methods and
results.
- Data sourcing, cleaning, and manipulation - 15 pts - an
accurate description of your data source, steps taken to clean it, and
other manipulations (e.g., variable transformations) should be
included in your methods section.
- Data visualization - 15 pts - your report should include at
least two publication-quality data visualizations. You may instead opt
to create an R Shiny application to fulfill this component.
- Modeling - 15 pts - your report should describe the
modeling methods you used, and your results section should report on
these models in an appropriate manner (e.g., proper model accuracy
criteria and proper interpretations)
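One common reading of "proper model accuracy criteria" for a predictive model is to report error on data the model did not see during fitting. The sketch below is only an illustration of that idea using base R. The data are simulated, the 75/25 split and the use of RMSE are assumptions made for the example, and the right criterion for your project depends on your model and research question.

```r
set.seed(230)  # fix the random split so the results are reproducible

# Simulated data standing in for a real project data set
n   <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 2 * dat$x + rnorm(n, sd = 1.5)

# Hold out 25% of the rows as a test set
test_idx <- sample(seq_len(n), size = round(0.25 * n))
train    <- dat[-test_idx, ]
test     <- dat[test_idx, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error, in-sample vs. out-of-sample
rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test  <- sqrt(mean((test$y - pred)^2))
c(train = rmse_train, test = rmse_test)
```

Reporting the test-set value (or a cross-validated equivalent) is usually more informative than the in-sample value, since the latter tends to understate error for flexible models.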
Data and Raw Code - 25 pts
- At minimum, I should be able to recreate all figures and tables used
in your presentation and paper by running your code (assuming I have
moved your raw data into the correct location)
- Ideally, I can recreate your paper in full by knitting “MyName.Rmd”,
but it’s okay if you’d like to write the paper in Word (or another
program)
- All data processing must be done using R. Manipulating your raw data
outside of the R environment is not reproducible, and points will be
deducted if I suspect you did this
- Your code should be neatly organized (good use of spaces and
indentation) and should include comments explaining each non-trivial
R command that you use
- Your submitted code should only include code that is needed
for data cleaning and replicating the analyses used in your paper;
exploratory code should not be submitted
- Your code should also be reasonably efficient and use generalizable
practices whenever possible. This means avoiding things like “magic
number” indexing, repetitive approaches to tasks that could be
accomplished by a single command, etc.
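To make the last point concrete, here is a small base R sketch contrasting "magic number" indexing and copy-pasted commands with a more generalizable alternative. The `survey` data frame and its column names are hypothetical and exist only for this illustration.

```r
# Hypothetical toy data standing in for a real project data set
survey <- data.frame(
  id = 1:4,
  q1 = c(1, 2, NA, 4),
  q2 = c(5, NA, 3, 2),
  q3 = c(2, 2, 2, NA)
)

# Fragile: positions 2, 3, 4 are "magic numbers" that silently break if a
# column is added or reordered, and the same command is repeated three times
# q_means <- c(mean(survey[, 2], na.rm = TRUE),
#              mean(survey[, 3], na.rm = TRUE),
#              mean(survey[, 4], na.rm = TRUE))

# More general: pick columns by a name pattern, then apply one command to all
q_cols  <- grep("^q[0-9]+$", names(survey), value = TRUE)
q_means <- sapply(survey[q_cols], mean, na.rm = TRUE)
q_means
```

The second version keeps working if more question columns are added later, and the result is labeled by column name rather than by position.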
Level of Difficulty - 25 pts
- Your project should demonstrate a level of difficulty that exceeds
homework and lab assignments
- Below are a few things you can do to achieve a high score in this
category:
- Choosing to work with complex, messy data
- Constructing exceptional data visualizations, or using R Shiny as a
data visualization tool
- Choosing to research/self-study new model methods that are better
suited for your application than the methods discussed in class
- Sophisticated comparisons of multiple analysis approaches (e.g.,
careful comparisons of differing clustering approaches or different
modeling approaches)
- Research into new data manipulation or visualization approaches not
discussed in class (such as data.table, web scraping, new
types of graphics, etc.)
- This score is holistic, so you may use any combination of the above
strategies (or even a single one of them, given enough depth)
- You are welcome to optionally submit a level of difficulty statement
to provide context on your project; however, unlike the R Shiny project,
this statement is not required.
Progress Meeting - 3 pts
- You must provide documented evidence of progress prior to your
meeting. This typically takes the form of an R Markdown document. More
progress and documented content at this meeting tend to be
positively correlated with final grades on the project.
- You will be able to sign up for one of many time slots during the
week of 12/4 to 12/8. The sign-up will be published after
Thanksgiving break.
Self-review - 1 pt
- Upon completing the project, you will be asked to fill out a short
self-review questionnaire about the course and the statistics program at
Grinnell (separate from the standard end of semester course
evaluations). If you complete this questionnaire you will receive
credit.
Other Advice
- I strongly suggest you read through a few of the USCLAP finalist
papers before you begin writing to get a sense of the formatting
expectations and the variety of topics and strategies that work for this
type of project.
- An Introduction to Statistical Learning is an excellent resource for
understanding modeling methods that we didn’t cover in class. The
authors of the text also have a
series of YouTube video lectures that go topic by topic through the
book. You might consider researching and using one of these methods as
part of your project.