Directions:
- Submit your work via the “Assignments” tab on Canvas
- For this assignment you should record your answers/code using
R Markdown
- Please upload HTML, Word, or PDF output created using R Markdown and
make sure it contains your code, output, and written answers. You should
not include extraneous output, such as printing an entire data
frame.
- Homework is an individual assignment. It’s okay to check
your work or collaborate with your classmates, student mentors, and
others, but it is not okay to pass off their work as your own.
- Please clearly acknowledge any help you get from individuals other
than yourself, or resources other than the materials on our course
website (such as external websites and AI).
Question #1
For this question you should use the data found at the link below.
These data are a random sample of 1287 employed individuals collected as
part of the American Community Survey (ACS) performed by the US
Census Bureau.
https://remiller1450.github.io/data/EmployedACS.csv
- Part A: Consider an analysis of the variable USCitizen, which records
whether the respondent holds US citizenship (a value of 1) or is not a
US citizen (a value of 0). Is this an example of one-sample categorical
or one-sample quantitative data? You do not need to explain your answer.
- Part B: Create a data visualization showing the
distribution of the variable USCitizen.
- Part C: Provide an appropriate set of descriptive
statistics that summarize the variable USCitizen.
- Part D: According to Ballotpedia, 7.1% of the US population were
non-citizens as of 2014. Write the null and alternative hypotheses that
could be used to test whether these data support the conclusion that the
percentage of non-citizens residing in the US differs from the percentage
reported by Ballotpedia.
- Part E: Are the sample size conditions met to use
either a Z or T test in this scenario (described in Part D)? Briefly
explain.
- Part F: Show the calculation of the Z or T
statistic for the hypothesis test described in Part D.
- Part G: Use either prop.test() or t.test() to find the \(p\)-value for
the hypothesis test described in Part D. Report your \(p\)-value as
part of a one-sentence conclusion. Make sure you include all of the
components of a proper conclusion (context, strength of evidence, and
type of relationship).
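As a rough illustration of the mechanics behind Parts E through G (not a solution), the sketch below uses made-up counts rather than the ACS data. The names x, n, and p0 are hypothetical, and correct = FALSE is an assumption chosen so that prop.test() agrees with the hand-computed Z statistic; your course may prefer the default continuity correction.

```r
# Hypothetical counts, NOT taken from the ACS data
x  <- 18     # made-up number of observations in the category of interest
n  <- 200    # made-up sample size
p0 <- 0.10   # made-up null proportion

# Sample size check: expected counts in each category under the null
expected <- c(n * p0, n * (1 - p0))
expected

# One-sample Z statistic for a proportion, computed by hand
p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z

# Same test via prop.test(); without the continuity correction,
# the reported X-squared statistic equals z^2
res <- prop.test(x, n, p = p0, alternative = "two.sided", correct = FALSE)
res$statistic
res$p.value
```

With real data, x and n would come from a table of the variable being analyzed, and p0 from the null hypothesis written in Part D.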
Question #2
For this question you will continue working with the ACS data set
provided in Question #1.
- Part A: Consider an analysis of the variable HoursWk, which records
the typical number of hours that each respondent reports working each
week. Is this an example of one-sample categorical or one-sample
quantitative data? You do not need to explain your answer.
- Part B: Create a data visualization showing the
distribution of the variable HoursWk. Briefly describe the
shape of this distribution.
- Part C: Calculate an appropriate measure of center
and an appropriate measure of spread for the variable HoursWk.
- Part D: French labor law defines a standard workweek in France as
35 hours per week. Write the null and alternative hypotheses that could
be used to test whether these data support the conclusion that Americans
on average work longer than France’s definition of a standard workweek.
- Part E: Are the sample size conditions met to use
either a Z or T test in this scenario (described in Part D)? Briefly
explain.
- Part F: Show the calculation of the Z or T
statistic for the hypothesis test described in Part D.
- Part G: Use either prop.test() or t.test() to find the \(p\)-value for
the hypothesis test described in Part D. Report your \(p\)-value as
part of a one-sentence conclusion. Make sure you include all of the
components of a proper conclusion (context, strength of evidence, and
type of relationship).
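For a quantitative outcome, the same workflow can be sketched with a one-sample T test. The example below is only an illustration on simulated values (not the ACS data); y and mu0 are hypothetical names, and the one-sided alternative is an assumption made to show how a "greater than" hypothesis is passed to t.test().

```r
# Simulated values, NOT the ACS data
set.seed(1)
y   <- rnorm(40, mean = 52, sd = 8)  # made-up quantitative sample
mu0 <- 50                            # made-up null value for the mean

# One-sample T statistic computed by hand
n      <- length(y)
t_stat <- (mean(y) - mu0) / (sd(y) / sqrt(n))
t_stat

# Same test via t.test(); "greater" tests whether the mean exceeds mu0
res <- t.test(y, mu = mu0, alternative = "greater")
res$statistic
res$p.value
```

The degrees of freedom reported by t.test() are n - 1, which is what the hand calculation would use to look up a \(p\)-value.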
Question #3
For this question you will use the “Lead IQ” data set found at the
link below. These data were collected as part of a CDC study
investigating the relationship between low-level lead absorption and
neurological function in children. The researchers sampled children who
lived within one mile of a large lead smelter in El Paso, Texas (the
Distance = "Near" group), as well as a control group of
comparable children who lived in other parts of El Paso that were at
least one mile away from the smelter (the Distance = "Far" group).
https://remiller1450.github.io/data/LeadIQ.csv
- Part A: Researchers in this study wanted to compare
the IQ scores of the “Near” and “Far” groups. Based upon this research
question and the study description, would this constitute
one-sample or two-sample data? Additionally, is the
outcome categorical or quantitative? Briefly
explain.
- Part B: Suppose the researchers would like to
generalize their findings about the impact of lead exposure to all
children in the United States, not just those in El Paso. Do you believe
these data are 1) an unbiased sample, 2) a sample with minimal to
moderate sampling bias, 3) a sample with bias so significant that no
generalizations should be made? Briefly explain your choice.
- Part C: Create a boxplot displaying the distribution of the variable
IQ (ignoring the variable Distance). Use your boxplot to provide
estimates of the distribution’s shape, center, and spread.
- Part D: IQ tests are calibrated such that a score
of 100 is considered “average”. Perform a hypothesis test to evaluate
whether there is statistically significant evidence that the children
represented by this sample differ from the “average” IQ score. Ignore
the variable Distance in your analysis. Your answer should
include the R code used to perform the test as well as a
one-sentence conclusion that includes the \(p\)-value. You do not need to explicitly
state the hypotheses, name of the test, or any other details, but your
conclusion should include all of the components of a proper conclusion
(context, strength of evidence, and type of relationship).
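As a hedged illustration of how a boxplot relates to numerical summaries, the sketch below draws a base R boxplot of simulated scores and pulls out the values it displays. The vector scores is hypothetical and is not the Lead IQ data; base R graphics are an assumption here, and a ggplot2 boxplot would work equally well if that is what your course uses.

```r
# Simulated scores, NOT the Lead IQ data
set.seed(42)
scores <- round(rnorm(60, mean = 95, sd = 12))

# boxplot() draws the plot and invisibly returns the values it displays
b <- boxplot(scores, horizontal = TRUE,
             main = "Simulated scores (hypothetical)")

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats

# Center and spread read off the same distribution
median(scores)
IQR(scores)   # quantile-based, so it can differ slightly from the hinge gap
```

Shape is judged from the picture: roughly equal whisker lengths and a median near the middle of the box suggest symmetry, while a longer whisker on one side suggests skew in that direction.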