Goals:

The purpose of this lab is to provide you with hands-on practice in the areas of summarizing and graphing data. We will pay particular attention to identifying associations between variables, describing them, and commenting on causation.

Directions:

  • You are expected to progress through the analyses described in this document as a group, recording your answers in a shared document. It’s completely up to your group how you’d like to organize this - some groups like using a shared Google Doc, while others might designate one person to be the group’s recorder.
  • You are expected to work together; any attempts to “divide and conquer” the lab questions may result in point deductions on your group’s lab score.
  • Labs are graded primarily for completion, and we will get together as a group for the last 10-15 minutes of class to discuss some of the lab questions. This means you should focus on learning the material (while also helping the teammates in your group) rather than seeing labs as an assessment (like homework or exams).
  • Please upload your responses to the Lab’s questions on Canvas. The expectation is that everyone uploads their own copy (they can be identical within your group).
  • Use the Snipping Tool on Windows or take a Mac screenshot to add screenshots to your lab write-up as requested.


Dataset #1 - Ohio Department of Health County-level Metrics

The field of biostatistics is frequently burdened by issues of confidentiality, privacy, and ethics that are less prevalent in other applications of statistics. One consequence of this is that ecological data, or data where the cases or units of observation are aggregated communities, are frequently used in the early stages of many epidemiological investigations.
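To make the idea of ecological data more concrete, the short Python sketch below collapses a handful of individual-level records into one row per community. Everything in it is an assumption made for illustration: the county labels "A" and "B", the `smoker` field, and all of the values are invented and are not drawn from any real dataset. The point is only that once cases are aggregated, information about individual people is no longer available.

```python
from collections import defaultdict

# Hypothetical individual-level records (invented for illustration only).
individuals = [
    {"county": "A", "smoker": True},
    {"county": "A", "smoker": False},
    {"county": "A", "smoker": False},
    {"county": "A", "smoker": True},
    {"county": "B", "smoker": False},
    {"county": "B", "smoker": False},
    {"county": "B", "smoker": True},
    {"county": "B", "smoker": False},
]


def aggregate_by_county(records):
    """Collapse individual records into one summary row per county."""
    counts = defaultdict(lambda: {"n": 0, "smokers": 0})
    for person in records:
        row = counts[person["county"]]
        row["n"] += 1
        row["smokers"] += int(person["smoker"])
    return {
        county: {
            "n": row["n"],
            "percent_smokers": 100 * row["smokers"] / row["n"],
        }
        for county, row in counts.items()
    }


# Each case in the aggregated data is now a county, not a person.
county_level = aggregate_by_county(individuals)
for county, summary in county_level.items():
    print(county, summary)
```

After aggregation, each case is a community rather than a person, which is the structure described above.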

The data we’ll use in this lab were obtained from the Ohio Department of Health. Click the link below to download a CSV of the data: (If you’re on a Mac, you might need to right-click and select “save as …”.)

These data record information for each county in Ohio for the year 2021. A data dictionary providing more precise definitions of each variable is provided below:

  • County - Name of the county
  • LifeExpectancy - Estimated life expectancy at birth
  • AgeAdjustedDeathRate - Number of deaths among residents under age 75 per 100,000 population (age-adjusted)
  • PercentExcessDrinking - Percentage of adults that report excess drinking
  • PercentInactive - Percentage of adults that report no leisure-time physical activity
  • PercentSmokers - Percentage of adults that report smoking
  • WaterViolation - Whether or not the county received a water violation (related to drinking water pollution)
  • AirPPM - Average daily amount of fine particulate matter in micrograms per cubic meter (a measure of air pollution)
  • PercentDiabetes - Percentage of adults aged 20 and above with diagnosed diabetes
  • PercentFoodDesert - Percentage of population who are low-income and do not live close to a grocery store
  • PercentLongCommute - Among workers who commute in their car alone, the percentage that commute more than 30 minutes (one way)
  • PercentInsufficientSleep - Percentage of adults who report fewer than 7 hours of sleep on average (age-adjusted)
  • PercentUninsured - Percentage of the working-age population without health insurance
  • HighSchoolGradRate - Percentage of high school students who graduate
  • HouseholdIncome - Median household income (in dollars)
  • JuvenileArrestRate - Rate of delinquency cases per 1,000 juveniles
  • BroadbandAccess - Percentage of the population living in an area with access to broadband internet
  • PercentUnder18 - Percentage of population under the age of 18
  • PercentOver65 - Percentage of population aged 65 or older
  • InMetro - “1” if the county is part of a census-defined metropolitan area, “0” otherwise

If you’d like the primary source for these data you can find it here: https://www.countyhealthrankings.org/app/ohio/2020/downloads


Data Basics

The following questions are intended to help you orient yourselves with the broader context and scope of these data.

Question #1: What are the cases/units of observation of these data and how many cases are there? Then, briefly explain why these data should be described as “ecological”.

Question #2: How many of the variables in these data are categorical? List all of the categorical variables.


Exploration and Hypothesis Generation

Once you understand how a dataset is structured, you’re ready to embark on more interesting investigations.

  • A rigorous scientific analysis proposes a hypothesis or research question before doing any sort of data analysis (summarization and graphing included)
  • An exploratory analysis uses the data to generate or identify interesting relationships that might inform future hypotheses (which typically get evaluated using more advanced experimental studies)

Hypothesis #1 - Food deserts impact larger segments of the population in counties that are not part of a metropolitan area

Question #3: For Hypothesis #1, identify the explanatory and response variables involved. Then, provide a univariate summary of each variable using at least one graph (per variable) and descriptive statistics. Briefly describe the distribution of any quantitative variables.

Question #4: Using graphs and comparative summary statistics as support, write 1-2 sentences describing whether or not you believe the data appear to support Hypothesis #1. Include a screenshot of your graphs/StatKey output in your write-up.
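If you are curious what “comparative summary statistics” look like outside of StatKey, the sketch below computes the sample size, mean, median, and standard deviation of a quantitative variable separately within each level of a categorical variable. This is only an illustration under stated assumptions: the column names `group` and `value`, the labels "X" and "Y", and every number are invented placeholders, not values from the lab’s data, and the helper `summarize_by_group` is a hypothetical name.

```python
import statistics
from collections import defaultdict

# Invented rows for illustration only; these are NOT values from the lab's data.
rows = [
    {"group": "X", "value": 4.0},
    {"group": "X", "value": 6.0},
    {"group": "X", "value": 8.0},
    {"group": "Y", "value": 5.0},
    {"group": "Y", "value": 9.0},
    {"group": "Y", "value": 10.0},
]


def summarize_by_group(data, group_var, value_var):
    """Return n, mean, median, and sample SD of value_var within each group."""
    groups = defaultdict(list)
    for row in data:
        groups[row[group_var]].append(row[value_var])
    return {
        name: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "sd": statistics.stdev(values),  # sample standard deviation
        }
        for name, values in groups.items()
    }


summaries = summarize_by_group(rows, "group", "value")
for name, stats in summaries.items():
    print(name, stats)
```

Comparing the rows of output side by side plays the same role as comparing the group summaries that StatKey reports next to side-by-side boxplots.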


Hypothesis #2 - Water violations are more common in counties that are part of a metropolitan area

Question #5: For Hypothesis #2, identify the explanatory and response variables involved. Then, provide a univariate summary of each variable using at least one graph (per variable) and descriptive statistics. Briefly describe the distribution of any quantitative variables.

Question #6: Using graphs and comparative summary statistics as support, write 1-2 sentences describing whether or not you believe the data appear to support Hypothesis #2. Include a screenshot of your graphs/StatKey output in your write-up.
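When both variables are categorical, the usual comparative summary is a two-way table together with conditional (row) proportions. The sketch below shows one way this could be computed. It rests on assumptions made purely for illustration: the labels "G1", "G2", "yes", and "no" and all of the counts are invented, and the helper names `two_way_counts` and `row_proportions` are hypothetical.

```python
from collections import Counter

# Invented (group, outcome) pairs for illustration only; not the lab's data.
observations = [
    ("G1", "yes"), ("G1", "no"), ("G1", "no"), ("G1", "no"),
    ("G2", "yes"), ("G2", "yes"), ("G2", "no"), ("G2", "no"),
]


def two_way_counts(pairs):
    """Count how many cases fall in each (group, outcome) cell."""
    return Counter(pairs)


def row_proportions(counts):
    """Divide each cell count by the total for its group (row)."""
    totals = Counter()
    for (group, _outcome), count in counts.items():
        totals[group] += count
    return {
        (group, outcome): count / totals[group]
        for (group, outcome), count in counts.items()
    }


counts = two_way_counts(observations)
proportions = row_proportions(counts)
for cell in sorted(proportions):
    print(cell, counts[cell], round(proportions[cell], 2))
```

Row proportions are compared rather than raw counts because the groups being compared may contain different numbers of cases.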


Dataset #2 - The College Scorecard

The College Scorecard is a government-run database that stores institution-level data on all accredited colleges and universities in the United States. A new version of these data is published yearly, and it contains over 400 variables. We’ll use a simplified version of these data for the 2019-2020 academic year.

The data linked above are a reduced version of the 2019-2020 College Scorecard data. They contain fewer variables and are filtered to include only primarily undergraduate institutions with at least 400 enrolled students. Colleges with missing data in any of the variables listed below are also excluded. A brief description of each variable is given below:

  • Name - Name of the institution
  • City - City where the institution is located
  • State - State where the institution is located
  • Enrollment - Number of full-time enrolled students
  • Private - Binary indicator distinguishing public and private institutions
  • Region - Geographic region
  • Adm_Rate - Admissions rate, the proportion of applicants who are admitted
  • ACT_median - Median composite ACT score of enrolled students
  • ACT_Q1 - 25th percentile composite ACT score of enrolled students
  • ACT_Q3 - 75th percentile composite ACT score of enrolled students
  • Cost - Average yearly cost of attendance
  • Net_Tuition - Average tuition cost after discounts (scholarships, grants, etc.)
  • Avg_Fac_Salary - Average faculty salary
  • PercentFemale - Proportion of enrolled students who are female
  • PercentWhite - Proportion of enrolled students who identify as White
  • PercentBlack - Proportion of enrolled students who identify as Black
  • PercentHispanic - Proportion of enrolled students who identify as Hispanic
  • PercentAsian - Proportion of enrolled students who identify as Asian
  • FourYearComp_Males - Proportion of male students who go on to earn a degree within four years of their initial enrollment
  • FourYearComp_Females - Proportion of female students who go on to earn a degree within four years of their initial enrollment
  • Debt_median - Median student debt upon leaving the institution
  • Salary10yr_median - Median salary 10 years after graduating from the institution

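The variables ACT_Q1, ACT_median, and ACT_Q3 are the quartiles of each institution’s ACT score distribution. As a reminder of what those quantities mean, the sketch below computes the three quartiles and the interquartile range for a small list of scores. The scores are invented for illustration and do not come from the College Scorecard. The choice of `method="inclusive"` is also an assumption; statistical software uses several slightly different quartile conventions, so StatKey may report somewhat different values for the same data.

```python
import statistics

# Invented ACT composite scores for one hypothetical institution.
scores = [18, 20, 21, 23, 24, 26, 27, 29, 31]

# statistics.quantiles with n=4 returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# The interquartile range is the width of the middle half of the data.
iqr = q3 - q1

print("Q1:", q1)
print("Median:", median)
print("Q3:", q3)
print("IQR:", iqr)
```

Because the data store only these three summaries per institution, the individual students’ scores cannot be recovered from them.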
Univariate Analysis

Question #8: The purpose of this question is for you to have the opportunity to thoughtfully report on a new dataset in the form of a 1-paragraph summary accompanied by 1-2 figures/tables. In this paragraph you should briefly describe the cases in these data, then briefly introduce a variable of interest, then provide a detailed summary of that variable using descriptive statistics and graphs.


Bivariate Analysis

Question #9: The purpose of this question is for you to have the opportunity to thoughtfully report on a new dataset in the form of a 1-paragraph summary accompanied by 1-2 figures/tables. In this paragraph you should propose a research question (hypothesis) involving the variable you identified in Question #8 and one of the categorical variables in the data. Then, you should address your research question using comparative summaries and graphs to provide support.