Overview
In this project you will analyze county-level data from the state of Ohio. The goal is to identify county characteristics that are predictive of cancer incidence (new cases) using aggregate data from three recent years (2015, 2016, and 2017). You will have the opportunity to model incidence for the cancer type/site of your choice as a function of the single demographic characteristic that you determine to be most important.
Data Sources
- Ohio County Cancer Cases - counts of new cancer cases (types/sites indicated via columns) for each Ohio county (expressed as total new cases in the years 2015, 2016, and 2017)
- Midwest County Demographic Data - demographic data for each county in the “Midwest” region (IL, IN, MI, OH, WI)
Guidelines
- Your model should involve only two variables: a cancer-related outcome and a demographic-related predictor (see the recommendations under Getting Started)
- We will cover multi-variable models later; do not attempt to create one for this project
- You may present a stratified analysis (this is neither expected nor required)
- You should include 1-3 high-quality data visualizations
- You should briefly describe your data exploration process
- You should include a thorough justification of your model
- You should include a thorough discussion of what makes your model interesting/useful
- You are expected to present your findings during class on Thursday 2/18
- It’s up to you how you want to organize the presentation (e.g., slides, R Markdown, etc.), but it should be polished, professional in tone, and last no longer than 7 minutes
- You are expected to turn in your presentation materials and any supporting R code no later than 1:00pm on Thursday 2/18
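As a rough illustration of the two-variable restriction above, the sketch below fits a model with one outcome and one predictor in base R. The data frame `toy`, its column names (`rate`, `pct_over_65`), and all of its values are hypothetical and exist only for illustration. Simple linear regression is used here as one possible choice, not as the required model type.

```r
# Hypothetical county-level values, for illustration only.
toy <- data.frame(
  rate        = c(62, 70, 75, 81, 90, 96),  # outcome: new cases per 100,000
  pct_over_65 = c(12, 14, 15, 17, 19, 20)   # predictor: a demographic percentage
)

# One outcome and one predictor: a simple linear regression.
fit <- lm(rate ~ pct_over_65, data = toy)

# Estimated intercept and slope, plus the proportion of variance explained.
print(coef(fit))
print(summary(fit)$r.squared)
```

Adding a second term on the right-hand side of the formula would make this a multi-variable model, which is outside the scope of this project.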
Getting Started
- This project requires you to merge the two datasets - you should do this first, but be aware that the “Midwest” dataset contains counties in states other than Ohio whose names are identical to those of Ohio counties
- Population sizes vary widely by county - you should consider accounting for this when you form your outcome variable
- You can choose a cancer type/site that you are interested in without any formal/statistical justification.
- When determining your model’s explanatory variable, please recognize that you have the freedom to report on whichever single demographic characteristic you find most interesting. You also have the freedom to choose the type of model that is best suited to what you’d like to report. That said, I expect you to justify how you made these choices. You should follow the principles exhibited in the Data Exploration lab (Lab #2) to help you choose an explanatory variable, and you should follow the principles in the Model Fitting (Lab #3) and Model Evaluation (Lab #4) labs when choosing a model and presenting your results.
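The first two points above (merging safely and accounting for population size) can be sketched in base R as follows. This is a minimal sketch on toy data: the data frames `cancer` and `demog`, every column name (`county`, `state`, `population`, `pct_poverty`, `lung_cases`), and every value are hypothetical stand-ins, and the real files may be laid out differently. Dividing the three-year total by 3 and using a single population figure per county are also assumptions, shown as one reasonable way to form a rate.

```r
# Toy stand-ins for the two datasets; all names and values are hypothetical.
cancer <- data.frame(
  county     = c("Adams", "Allen", "Union"),
  lung_cases = c(90, 270, 120),          # total new cases across 2015-2017
  stringsAsFactors = FALSE
)

demog <- data.frame(
  county      = c("Adams", "Adams", "Allen", "Allen", "Union", "Union"),
  state       = c("OH", "IL", "OH", "IN", "OH", "IN"),
  population  = c(30000, 60000, 100000, 350000, 50000, 7000),
  pct_poverty = c(22.0, 12.5, 15.0, 14.0, 6.5, 10.0),
  stringsAsFactors = FALSE
)

# Keep only Ohio rows before merging, so identically named counties in
# other states cannot be matched to the Ohio cancer counts.
demog_oh <- demog[demog$state == "OH", ]

merged <- merge(cancer, demog_oh, by = "county")

# Sanity check: every Ohio county should appear exactly once after the merge.
stopifnot(nrow(merged) == nrow(cancer), !anyDuplicated(merged$county))

# Average annual cases (three-year total divided by 3), scaled by
# population to give a rate per 100,000 residents.
merged$lung_rate <- (merged$lung_cases / 3) / merged$population * 1e5

print(merged[, c("county", "lung_cases", "population", "lung_rate")])
```

If the state filter is skipped, `merge()` returns one row for every matching county name in any state, which silently duplicates the Ohio case counts. The row-count check is a cheap guard against that.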