Midterm Project #1

Overview

In this project you will analyze county-level data from the state of Ohio. The goal is to identify county characteristics that are predictive of cancer incidence (new cases) using aggregate data from three recent years (2015, 2016, and 2017). You will have the opportunity to model incidence for the cancer type/site of your choice as a function of a demographic characteristic that you determine to be important.

Data Sources

Ohio County Cancer Cases - counts of new cancer cases (types/sites indicated via columns) for each Ohio county (expressed as total new cases in the years 2015, 2016, and 2017)
- Reference: http://publicapps.odh.ohio.gov/EDW/DataBrowser/Browse/StateLayoutLockdownCancers
Midwest County Demographic Data - demographic data for each county in the “midwest” region (IL, IN, MI, OH, WI)
- Reference: https://tidyverse.github.io/ggplot2-docs/reference/midwest.html

Guidelines

Your model should only involve two variables (a cancer-related outcome and a demographic-related predictor, see the recommendations below)
- We will cover multi-variable models later; don’t attempt to create one for this project
- You may present a stratified analysis (this is neither expected nor required)
You should include 1-3 high-quality data visualizations
You should briefly describe your data exploration process
You should include a thorough justification of your model
You should include a thorough discussion of what makes your model interesting/useful
You are expected to present your findings during class on Thursday 2/18
- It’s up to you how to organize the presentation (e.g., slides, R Markdown), but it should be polished, professional in tone, and last no longer than 7 minutes
You are expected to turn in your presentation materials and any supporting R code no later than 1:00pm on Thursday 2/18
- Please send these directly to me at millerr33@xavier.edu as an email attachment

Getting Started

This project requires you to merge the two datasets. You should do this first, but be aware that the “Midwest” dataset contains counties in states other than Ohio whose names are identical to those of Ohio counties
Population sizes vary widely by county - you should consider accounting for this when you form your outcome variable
You can choose a cancer type/site that you are interested in without any formal/statistical justification.
When determining your model’s explanatory variable, please recognize that you have the freedom to report on whichever single demographic characteristic you find most interesting. You also have the freedom to choose the type of model that is best suited to what you’d like to report. That said, I am expecting you to justify how you made these choices. You should follow the principles exhibited in the Data Exploration lab (Lab #2) to help you choose an explanatory variable, and you should follow the principles in the Model Fitting (Lab #3) and Model Evaluation (Lab #4) labs when choosing a model and presenting your results.
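
The first two points above (merging the datasets and accounting for population size) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `cancer` data frame, its `County` and `Lung` columns, and the counts in it are hypothetical placeholders standing in for the Ohio cancer file, whose actual column names may differ. Only `midwest` and its `county`, `state`, and `poptotal` columns come from ggplot2.

```r
library(ggplot2)  # provides the midwest dataset

# HYPOTHETICAL stand-in for the Ohio cancer file. The column names and the
# counts are placeholders for illustration, not real ODH figures.
cancer <- data.frame(
  County = c("Adams", "Allen", "Ashland"),
  Lung   = c(100, 300, 150)
)

# Keep only the Ohio rows first, so that same-named counties in IL, IN, MI,
# and WI cannot be matched by mistake.
ohio_demo <- subset(as.data.frame(midwest), state == "OH")

# midwest stores county names in upper case, so build a matching join key.
cancer$county <- toupper(trimws(cancer$County))

# Inner join on the county name.
merged <- merge(cancer, ohio_demo, by = "county")

# New cases per 100,000 residents. The counts span three years and poptotal
# is whatever population midwest records, so treat this as a rough rate.
merged$lung_rate <- merged$Lung / merged$poptotal * 1e5

merged[, c("county", "state", "Lung", "poptotal", "lung_rate")]
```

Filtering to Ohio before the merge is one way to handle the duplicate county names; merging on both state and county would be another, provided the cancer data carries a state column. After merging, it is worth checking that the number of rows matches the number of counties you expected.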

Grading

A rubric detailing how your project will be scored is available at this link

Assigned: Tuesday 1/26/21, Due: Thursday 2/18/21