Sta-209 (Spring 2025) Homework #2

Directions:

Submit your work via the “Assignments” tab on Canvas.
For this assignment you should record your answers and code using R Markdown.
- Please upload HTML, Word, or PDF output created using R Markdown and make sure it contains your code, output, and written answers. You should not include extraneous output, such as printing an entire data frame.
Homework is an individual assignment. It’s okay to check your work or collaborate with your classmates, student mentors, and others, but it is not okay to pass off their work as your own.
- Please clearly acknowledge any help you get from individuals other than yourself, or from resources other than the materials on our course website (such as external websites and AI).

Question #1

For this question you should use the data found at the link below. These data are a random sample of 1287 employed individuals collected as part of the American Community Survey (ACS) performed by the US Census Bureau.

https://remiller1450.github.io/data/EmployedACS.csv

Part A: Consider an analysis of the variable USCitizen, which records whether the respondent holds US citizenship (a value of 1) or is not a US citizen (a value of 0). Is this an example of one-sample categorical or one-sample quantitative data? You do not need to explain your answer.
Part B: Create a data visualization showing the distribution of the variable USCitizen.
Part C: Provide an appropriate set of descriptive statistics that summarize the variable USCitizen.
Part D: According to Ballotpedia, 7.1% of the US population were non-citizens as of 2014. Write the null and alternative hypotheses that could be used to test whether these data support the conclusion that the percentage of non-citizens residing in the US differs from the percentage reported by Ballotpedia.
Part E: Are the sample size conditions met to use either a Z or T test in this scenario (described in Part D)? Briefly explain.
Part F: Show the calculation of the Z or T statistic for the hypothesis test described in Part D.
Part G: Use either prop.test() or t.test() to find the \(p\)-value for the hypothesis test described in Part D. Report your \(p\)-value as part of a one-sentence conclusion. Make sure you include all of the components of a proper conclusion (context, strength of evidence, and type of relationship).
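
As a general reminder for Parts F and G (not a worked answer), a one-sample test of a proportion is commonly based on the statistic sketched below. The symbols are assumptions chosen for illustration and may differ from the course notes: \(\hat{p}\) is the sample proportion, \(p_0\) is the proportion stated in the null hypothesis, and \(n\) is the sample size.

```latex
z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0\,(1 - p_0)}{n}}}
```

Whichever category is treated as the "success" when computing \(\hat{p}\) must be the same category that \(p_0\) describes. Also note that prop.test() applies a continuity correction by default (its `correct` argument), so the X-squared value it reports will generally not equal \(z^2\) from a hand calculation unless `correct = FALSE` is used.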

Question #2

For this question you will continue working with the ACS data set provided in Question #1.

Part A: Consider an analysis of the variable HoursWk, which records the typical number of hours that each respondent reports working each week. Is this an example of one-sample categorical or one-sample quantitative data? You do not need to explain your answer.
Part B: Create a data visualization showing the distribution of the variable HoursWk. Briefly describe the shape of this distribution.
Part C: Calculate an appropriate measure of center and an appropriate measure of spread for the variable HoursWk.
Part D: French labor law defines a standard workweek in France as 35 hours per week. Write the null and alternative hypotheses that could be used to test whether these data support the conclusion that Americans on average work longer than France’s definition of a standard workweek.
Part E: Are the sample size conditions met to use either a Z or T test in this scenario (described in Part D)? Briefly explain.
Part F: Show the calculation of the Z or T statistic for the hypothesis test described in Part D.
Part G: Use either prop.test() or t.test() to find the \(p\)-value for the hypothesis test described in Part D. Report your \(p\)-value as part of a one-sentence conclusion. Make sure you include all of the components of a proper conclusion (context, strength of evidence, and type of relationship).
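
As a general reminder for Parts F and G (not a worked answer), a one-sample test of a mean is commonly based on the statistic sketched below. The symbols are assumptions chosen for illustration and may differ from the course notes: \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, \(\mu_0\) is the mean stated in the null hypothesis, and \(n\) is the sample size.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{df} = n - 1
```

For a one-sided alternative such as the one in Part D, the direction of the test is set through the `alternative` argument of t.test(), which defaults to "two.sided".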

Question #3

For this question you will use the “Lead IQ” data set found at the link below. These data were collected as part of a CDC study investigating the relationship between low-level lead absorption and neurological function in children. The researchers sampled children who lived within one mile of a large lead smelter in El Paso, Texas (the Distance = "Near" group), as well as a control group of comparable children who lived in other parts of El Paso that were at least one mile away from the smelter (the Distance = "Far" group).

https://remiller1450.github.io/data/LeadIQ.csv

Part A: Researchers in this study wanted to compare the IQ scores of the “Near” and “Far” groups. Based upon this research question and the study description, would this constitute one-sample or two-sample data? Additionally, is the outcome categorical or quantitative? Briefly explain.
Part B: Suppose the researchers would like to generalize their findings about the impact of lead exposure to all children in the United States, not just those in El Paso. Do you believe these data are 1) an unbiased sample, 2) a sample with minimal to moderate sampling bias, or 3) a sample with bias so significant that no generalizations should be made? Briefly explain your choice.
Part C: Create a boxplot displaying the distribution of the variable IQ (ignoring the variable Distance). Use your boxplot to provide estimates of the distribution’s shape, center, and spread.
Part D: IQ tests are calibrated such that a score of 100 is considered “average”. Perform a hypothesis test to evaluate whether there is statistically significant evidence that the children represented by this sample differ from the “average” IQ score. Ignore the variable Distance in your analysis. Your answer should include the R code used to perform the test as well as a one-sentence conclusion that includes the \(p\)-value. You do not need to explicitly state the hypotheses, name of the test, or any other details, but your conclusion should include all of the components of a proper conclusion (context, strength of evidence, and type of relationship).
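
The kind of R code Part D asks for can be sketched on toy data as follows. This is only an illustration under stated assumptions: the vector `scores` is simulated and is not the Lead IQ data, and the names `scores`, `result`, and `t_by_hand` are hypothetical. It shows how to extract individual pieces of a t.test() result rather than printing extraneous output, and how to check the result against a hand calculation.

```r
# Toy data only: simulated scores, NOT the Lead IQ data set.
set.seed(209)
scores <- rnorm(40, mean = 95, sd = 15)

# Two-sided one-sample t-test of H0: mu = 100.
result <- t.test(scores, mu = 100, alternative = "two.sided")

# Extract individual components instead of printing everything.
result$statistic   # t statistic
result$parameter   # degrees of freedom (n - 1)
result$p.value     # p-value

# Check the reported statistic against the by-hand formula
# t = (mean - mu0) / (sd / sqrt(n)).
t_by_hand <- (mean(scores) - 100) / (sd(scores) / sqrt(length(scores)))
all.equal(unname(result$statistic), t_by_hand)
```

For the actual assignment, the simulated vector would be replaced by the relevant column of the data frame read from the link above.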