Description

In this project you will work either individually, or with up to two classmates, on an applied statistical analysis. You are free to choose your topic and dataset (pending instructor approval of your proposal). The primary output of the analysis will be a 3-page paper. While the formal project requirements do not depend upon the size of your group, expectations will be higher for larger groups.

I encourage you to choose a topic that overlaps with your other interests (either academic or non-academic). I have no issue with you using data that you’ve previously worked with in another context, such as another class, an internship, or a research project, so long as the way in which you use that data on this project is new.

Timeline

  • Friday 11/5 at 11:59pm - Your project proposal must be submitted via Canvas
  • Friday 11/19 at 11:59pm - Your dataset and data dictionary must be submitted via Canvas
  • Tuesday 11/30 at 11:59pm - Your final report is due (please submit a zipped folder containing your R code and the report)

Details

Proposal:

Please submit (via Canvas) 1-2 paragraphs outlining:

  1. The general topic or research question that you plan to explore
  2. A plan for obtaining your data (i.e., is the data available online, do you need to collect it yourself, do you need to get it from someone else on campus, etc.)
  3. Who you’ll be working with (or if you plan on working alone)

If you are working with one or more partners, each of you needs to upload the same proposal to Canvas.

If you are having trouble thinking of a topic, I suggest browsing the catalog of over 200,000 publicly available datasets at data.gov as a starting point. Some other great data sources are Sports Reference and World Bank Open Data.

Please note: you may not use data from Kaggle.com, or any other source where users frequently post their own analyses.

Dataset and Data Dictionary:

Please submit (via Canvas) the following two items:

  1. A single CSV or Excel file containing your cleaned/finalized dataset (i.e., the one you ended up analyzing, not an intermediate dataset that you needed to clean)
  2. A Word or PDF file that provides a brief definition of each variable in your finalized dataset (see our lab assignments for examples of this)
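
One possible way to prepare these two items in R is sketched below. It is only an illustration: the data frame `cleaned`, its columns, and both file names are hypothetical placeholders, and the dictionary it produces is a draft table whose definitions you would still write by hand and then move into your Word or PDF file.

```r
# Hypothetical cleaned dataset; replace this with your own data frame.
cleaned <- data.frame(
  team = c("A", "B", "C"),
  wins = c(10L, 7L, 12L),
  avg_points = c(101.2, 95.8, 110.4),
  stringsAsFactors = FALSE
)

# Item 1: save the finalized dataset as a single CSV file
# (the file name is a placeholder).
write.csv(cleaned, "final_dataset.csv", row.names = FALSE)

# Item 2: build a draft data dictionary with one row per variable.
# The "definition" column is left blank to be filled in by hand.
dictionary <- data.frame(
  variable = names(cleaned),
  type = unname(vapply(cleaned, function(col) class(col)[1], character(1))),
  definition = "",
  stringsAsFactors = FALSE
)

# Save the draft so the definitions can be written and copied into Word or PDF.
write.csv(dictionary, "data_dictionary_draft.csv", row.names = FALSE)
print(dictionary)
```

Starting the dictionary from `names(cleaned)` helps ensure that every variable in the submitted file has a matching entry.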

Paper:

Your paper should be no more than 3 pages (any spacing you find appropriate), including graphs or tables, but not including references or supplemental information. It should include the following components:

  • A brief Introduction section that clearly articulates the guiding question that motivates your research
  • A Background section that addresses the significance/relevance of your work
  • A Methods section that describes how you obtained your data and how you analyzed it
  • A Results section that includes graphs, tables, outcomes of hypothesis tests, confidence intervals, and brief interpretations
  • A Discussion section that puts your results into context, discusses limitations, and addresses possible future work
  • A References section that cites at least one source (can be related to the origin of your data or previous research on your topic)
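
As a rough illustration of the kind of output the Results section calls for, the sketch below runs a two-sample t-test and extracts its confidence interval and p-value. It assumes a two-group comparison happens to suit your question, and it uses R's built-in `mtcars` dataset purely as a stand-in for your own data.

```r
# Stand-in example: compare fuel economy (mpg) between automatic (am = 0)
# and manual (am = 1) cars in the built-in mtcars dataset.
result <- t.test(mpg ~ am, data = mtcars)

# Pieces you would typically report and interpret in the Results section.
ci <- result$conf.int       # 95% confidence interval for the difference in means
p_value <- result$p.value   # p-value of the two-sided test
group_means <- result$estimate

print(group_means)
print(ci)
print(p_value)

# A simple accompanying graph of the same comparison.
boxplot(mpg ~ am, data = mtcars,
        names = c("Automatic", "Manual"),
        xlab = "Transmission", ylab = "Miles per gallon")
```

Whatever method you use, pair each number you report with a brief interpretation in the context of your guiding question.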

I encourage you to write your paper using R Markdown.

R Code:

Along with your paper, you should submit any R code used at any point in the project (including data cleaning and statistical analysis). If you wrote your paper in R Markdown, that is great! You can simply turn in both the “knit” and “Rmd” files.

Grading:

A rubric outlining how your paper will be evaluated is available here: Rubric Link

Your final score will be out of 60 points, with 50 coming from the paper itself and 10 coming from the proposal and dataset/data dictionary (these 10 points are based entirely upon timely completion and meeting the minimum requirements listed above).

Resources

The structure of this assignment is based upon the Undergraduate Statistics Class Project Competition (USCLAP), a national competition for class projects in introductory or intermediate statistics classes. I encourage you to consider submitting your paper in the introductory category. Each year numerous students win awards and honorable mentions, which are great items to include on your resume. In addition, submissions that place are invited to present at a virtual conference (another resume booster).

If you are unsure what a good project might look like, here are a few examples of papers that would receive near-perfect scores:

Extra Credit Opportunity

The following opportunity can be used to recoup up to 6 points missed on exams during the semester. The premise is that you independently work through an R lab describing how to use methods, functions, or procedures we did not cover in class, and then you incorporate those methods into your project. You will receive 3 points for each lab/method you complete and implement. Listed below are eligible methods and their corresponding labs:

  • Customized ggplot2 graphics - Lab Link - you might make the plots and figures in your project report extra fancy with nice labels, colors, etc.
  • Merging and joining with dplyr - Lab Link - you might merge multiple datasets together and use the combined dataset in your project
  • Making maps using leaflet - Lab Link - you might use this knowledge to add a map as supplementary information to your project report
  • Data wrangling using tidyr and dplyr - Lab Link #1 and Lab Link #2 - you might use this knowledge to help you clean and process your data, or you might use it to aggregate cases into a more meaningful format
  • String processing using stringr - Lab Link - you might use this knowledge to process and analyze textual data as part of your project
  • Clustering - Lab Link - you might perform a cluster analysis in addition to the other analytic approaches you’re expected to use on the project
  • Principal components analysis - Lab Link - you might perform a PCA in addition to the other analytic approaches you’re expected to use on the project
  • Web scraping - Lab Link - you might obtain some or all of your data by using R to scrape a web source

To receive this extra credit, you must submit answers to the lab questions along with a brief explanation (just a couple of sentences is fine) describing how you incorporated content from the lab into your project. If you are working with one or more partners on the project, you are expected to complete these labs independently, though the content will naturally appear in everyone’s project (since you’re turning in the same report).