\(~\)
Before we begin let’s introduce a few definitions:
How statisticians use these terms is summarized in the diagram below (from the book “Data Science: A First Introduction”):
\(~\)
We’d like to take our estimate(s) and use it at as the basis for a decision. However, unless we’re somehow able to sample the entire population (which generally doesn’t happen) the estimate is an imperfect representation of the population parameter (even if our sampling procedure is free of bias).
However, if we could somehow repeat the way we collected our sample data many times, we could see how much variation is present in these estimates and we could use that information to make a more informed decision. A hypothetical example is given below:
Unfortunately, data collection is time consuming and expensive - it’s an inefficient use of resources to repeat it the hundreds (or even thousands) of times that are needed to understand sampling variability.
\(~\)
Bootstrapping is an ingenious statistical method intended to mimic the process of repeating a study (ie: sampling from a population). However, instead of collecting new samples from the population, bootstrap samples are obtained by randomly selecting observations from the available sample data with replacement
We can bootstrap our sample many times to come up with a variety of estimates. These estimates will center around the original estimate from our original sample (which isn’t particularly useful); however, the variation in these estimates has been mathematically proven to reliably reflect the true amount of variation in the real sampling distribution (under a few mild assumptions).
\(~\)
One method that statisticians frequency use to aide in decision making is the confidence interval. These intervals express a range of possible values with an attached “confidence level”.
Bootstrapping can be used to construct confidence intervals from sample data using a statistical method known as the percentile bootstrap. We’ll demonstrate this method, and its usefulness in decision, in the application that follows.
\(~\)
Wikipedia claims that the survival rate among babies born at 25 weeks gestation (15 weeks early) is 70%. Researchers used medical records from the Johns Hopkins University Hospital to identify 39 infants delivered at 25 weeks gestation at their facilities. Among these infants, 31 of 39 survived (79.5%).
Should a mother at risk of giving birth 15 weeks early should travel to Baltimore (where Johns Hopkins is located) to deliver their baby?
For example, if we looked at a 90% confidence interval we might feel borderline on whether traveling to Johns Hopkins is worthwhile.
\(~\)
The following questions are intended to guide you and your partner through the thought process statisticians use to make a decision from sample data.
In this activity, I want you to suppose that you (or your partner) recently gave birth to a child with a congenital heart defect. Such a defect requires surgery immediately after birth if the child is to survive. A long-standing surgical procedure to address the defect is “circulatory arrest”, which has the downside of cutting off blood flow to the brain during the surgery and potentially leading to brain damage. A more recent alternative is “low-flow bypass”, which maintains circulation to the brain using an external pump that might lead to other types of brain injuries.
Because both surgeries come with risks, and the low-flow bypass approach is newer, you want to use the more standard “circulatory arrest” approach unless you can be confident that the mental and physical development of your child will be better if the “low-flow bypass” surgery is performed.
Question #1: Given the design of this study, do you think that the average MDI and PDI scores observed in the two groups of infants (defined by the surgery they received) could be confounded by a variable other that type of surgery? Briefly explain.
Question #2: Given the design of this study, are you worried about the role of sampling bias when using these data to inform your decision? briefly explain.
Question #3: Use the “Descriptive Statistics for one Quantitative and one Categorical Variable” to find the mean PDI and MDI scores of each surgical group. Record these averages and briefly comment upon whether the new low-flow approach seems to have desirable outcomes.
Question #4: Use the “CI for Difference in Means” to find 90% percentile bootstrap confidence interval for the difference in mean PDI scores across groups. At this level of confidence, are you convinced that the observed difference is not due to sampling variability? Briefly explain.
Question #5: Use the “CI for Difference in Means” to find 90% percentile bootstrap confidence interval for the difference in mean MDI scores across groups. At this level of confidence, are you convinced that the observed difference is not due to sampling variability? Briefly explain.
After answering these questions, have one member of your group upload a copy of your responses to P-web for completion credit. You may then move onto the individual portion of today’s assignment.
\(~\)
Using the information from Questions 1-5 in the previous section, make a decision about the type of surgery you’d prefer for your child (in this hypothetical scenario). Then, write a 6-8 sentence paragraph that argues for your decision using the data introduced in today’s activity and your findings in Questions 1-5. You may assume that the reader of your paragraph has access to the same description of the data that you do, so you do not need to devote time in your paragraph towards summarizing the study.
I encourage you to review of class materials from 8/29 and 8/31 (which covered argumentation) to help shape your paragraph.
Submit a copy of your paragraph to P-web by Thursday 10/5 at 11:59pm.