Sta-209 (Fall 2025) Homework #1

Question #1 (Statistical Inference using StatKey)

A study published in 2023 investigated whether dogs could be trained to detect cancer using their sense of smell. The researchers collected breath samples from healthy controls and cancer patients, then ran repeated trials in which a trained dog was presented with five bags of breath samples: one from a cancer patient and four from healthy controls. The dogs had been trained to select exactly one bag in each trial, and during training they were rewarded for identifying cancer samples. In each trial, the researchers recorded whether the trained dog correctly identified the breath sample from the cancer patient.

Part A: Suppose dogs are unable to identify cancer by smell and are selecting breath samples at random in the study. Using statistical notation, what is an appropriate null hypothesis corresponding to this possibility?
Part B: Suppose one dog from the study correctly identified the cancer sample in 30 of 33 trials. Use a StatKey simulation to find the one-sided \(p\)-value measuring how much evidence these results provide against the null hypothesis from Part A. Use at least 1000 simulated outcomes and report an estimate of the \(p\)-value to 3 decimal places.
Part C: Provide a one-sentence conclusion based upon the \(p\)-value you found in Part B. Remember that an appropriate conclusion in this class should include scientific context, information about the strength of evidence, and a description of the type of relationship found. It should not include phrases like “reject the null” or “conclude the alternative”.
Part D: Suppose the researchers had designed this study differently, such that 4 of the 5 bags came from cancer patients. Assuming the same results were observed (30 of 33 correct identifications), what would the \(p\)-value now be? Use a StatKey simulation similar to the one you conducted in Part B to estimate this \(p\)-value.
Part E: Provide a one-sentence conclusion for the \(p\)-value you found in Part D.
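
For intuition about what a simulation-based test of a single proportion is doing, the following is a minimal `R` sketch. It is not StatKey's own code, and every number in it (20 trials, 9 successes, a null proportion of 0.25) is a hypothetical value chosen for illustration rather than anything from the study above. It simulates many experiments under the null hypothesis and reports the share of simulated outcomes at least as extreme as the observed one.

```r
# Hypothetical illustration values (assumptions, not the homework's data)
set.seed(209)       # makes the simulation reproducible
n_trials <- 20      # trials in each simulated experiment
p_null   <- 0.25    # success probability assumed by the null hypothesis
observed <- 9       # observed number of successes
n_sims   <- 10000   # number of simulated experiments

# Each draw is the number of successes in one experiment simulated under the null
sim_counts <- rbinom(n_sims, size = n_trials, prob = p_null)

# One-sided p-value: proportion of simulated outcomes at least as large as observed
p_sim <- mean(sim_counts >= observed)

# Exact binomial tail probability P(X >= observed), shown for comparison
p_exact <- pbinom(observed - 1, size = n_trials, prob = p_null, lower.tail = FALSE)

round(c(simulated = p_sim, exact = p_exact), 3)
```

The simulated value varies slightly from run to run (or with a different seed), which is why an estimate from a simulation is usually reported to only a few decimal places.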

Question #2 (One-sample Data Analysis in `R`)

The Transportation Security Administration (TSA) is an agency within the US Department of Homeland Security with authority over the safety and security of travel in the United States. For this question, you will analyze data from a random sample of \(n=5000\) claims made by travelers against the TSA between 2003 and 2008, the first five years that the agency existed. These data are found at the following link:

https://remiller1450.github.io/data/tsa_small.csv

The relevant variables in this analysis are:

Status - whether the claim was approved (paid in full), settled (partially paid/negotiated), or denied (not paid at all)
Claim_Amount - the amount of monetary damages requested in the initial claim

Part A: Write R code that stores these data in a data frame object named tsa_data. Then find the average amount of monetary damages claimed by travelers.
Part B: Create a data visualization showing the distribution of Claim_Amount. Briefly describe 1 or 2 things you can learn from this distribution that you could not have known using just the average value you calculated in Part A.
Part C: Create a table displaying the frequencies of each status. Using this table, calculate the proportion of claims that are denied.
Part D: Create a data visualization showing the distribution of claim statuses.
Part E: According to research by Weiss Ratings, USAA denies the highest percentage of home insurance claims, rejecting 48.1% of the claims they receive. Suppose we use this value to inform a null hypothesis. Does the sample of claims against the TSA support the conclusion that the TSA rejects a lower percentage of claims than USAA? Report a \(p\)-value and an appropriate one-sentence conclusion.
Part F: Suppose we had used a smaller sample in Part E, such as a random sample of 500 claims rather than 5000. Would you expect the \(p\)-value to be larger or smaller? Try dividing the observed count and sample size by 10 (rounding to the nearest whole number), then entering these new “data” into StatKey to verify your expectation.
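
As a reminder of the kinds of base `R` functions that this sort of one-sample analysis can draw on, here is a minimal sketch on a small made-up data frame. The object `toy`, its columns `outcome` and `amount`, and the null value of 0.5 are all hypothetical and unrelated to the TSA data; with a real file, `read.csv()` given a path or URL returns a data frame that can be used the same way. Base graphics are an assumption here, and other plotting packages would work equally well.

```r
# Toy data standing in for a CSV file (hypothetical values, not the TSA data)
toy <- data.frame(
  outcome = c("yes", "no", "no", "yes", "no", "no", "no", "yes", "no", "no"),
  amount  = c(12.5, 40, 7.25, 300, 18, 22, 95, 5, 61, 14)
)

mean(toy$amount)               # average of a numeric column
hist(toy$amount)               # histogram showing the shape of its distribution

counts <- table(toy$outcome)   # frequency table of a categorical column
prop.table(counts)             # the same table expressed as proportions
barplot(counts)                # bar chart of the category frequencies

# Exact one-sided binomial test of H0: p = 0.5 against Ha: p < 0.5,
# where p is the proportion of "yes" outcomes (0.5 is an arbitrary choice)
result <- binom.test(x = counts[["yes"]], n = sum(counts),
                     p = 0.5, alternative = "less")
result$p.value
```

Note how a single large value in `amount` pulls the mean well above most of the observations, which is the kind of feature a histogram reveals and an average alone hides.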
