The Armed Forces Qualification Test (AFQT) is a multiple-choice aptitude test designed to measure an individual’s reasoning and verbal skills to determine their suitability for military service. Because the test has been so broadly applied, percentile scores are frequently used as a measure of general intelligence.
The data in this question are a random sample of \(n=2584\) Americans who were first selected and tested in 1979, then later re-interviewed in 2006 and asked about their education and annual income. The focus of this analysis will be modeling 2005 income of each respondent using AFQT scores (from 1979). The data contain the following variables of interest:
afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")
Part A: Plot Income2005 and briefly describe the distribution of this variable.
Part B: Create a scatterplot of AFQT (the subject’s AFQT score) and Income2005. Add a smoothing line to the graph using geom_smooth() and state whether there appears to be a relationship between these two variables.
Part C: Test whether the simple linear regression model that uses AFQT to predict Income2005 has a significantly better fit than the intercept-only model (which implies independence between these variables). Report the \(p\)-value of this test and a brief conclusion.
Part D: Test whether the multiple regression model that uses Educ and AFQT fits the data significantly better than the simple linear regression model you considered in Part C. Report the \(p\)-value of this test and a brief conclusion.
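The nested-model comparisons in this question can be carried out in several ways; one common option in base R is an F-test via anova(). The block below is a minimal sketch on simulated data, not the AFQT data: the data frame sim and every number used to generate it are hypothetical, and only the column names AFQT, Educ, and Income2005 are borrowed from the question.

```r
# Simulated stand-in data (hypothetical values, NOT the AFQT data set).
set.seed(1)
n <- 200
sim <- data.frame(AFQT = runif(n, 0, 100))
sim$Educ <- round(10 + 0.05 * sim$AFQT + rnorm(n, sd = 2))
sim$Income2005 <- 10000 + 300 * sim$AFQT + 2000 * sim$Educ +
  rnorm(n, sd = 15000)

# Three nested models: intercept-only, simple regression, multiple regression.
fit_null <- lm(Income2005 ~ 1, data = sim)
fit_afqt <- lm(Income2005 ~ AFQT, data = sim)
fit_both <- lm(Income2005 ~ AFQT + Educ, data = sim)

# anova() on nested lm fits reports an F-test for each successive comparison.
f_tests <- anova(fit_null, fit_afqt, fit_both)
print(f_tests)

# Row 2 compares fit_afqt with fit_null; row 3 compares fit_both with fit_afqt.
p_values <- f_tests[["Pr(>F)"]][2:3]
print(p_values)
```

A small p-value in a given row suggests that the larger model in that comparison explains significantly more variation than the smaller one. The same calls should apply to the real data once afqt is substituted for sim.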
The “email spam” data set contains roughly 4000 emails received by the Gmail account of the statistician David Diez in the early months of 2012. Additional details on the data can be found here; however, this question focuses on the variables below:
emails = read.csv("https://remiller1450.github.io/data/email_spam.csv")
Part A: This question explores the usefulness of exclaim_mess, dollar, and number in predicting whether an email is spam. In your own words, briefly explain why logistic regression is the appropriate method for this task.
Part B: Fit a logistic regression model that uses dollar to predict spam. Interpret the estimated intercept in this model. Be sure you exponentiate to facilitate a meaningful interpretation.
Part C: Interpret the estimated coefficient of dollar in the model you fit in Part B. Be sure you exponentiate to facilitate a meaningful interpretation.
Part D: Test whether the model spam ~ dollar + exclaim_mess provides a significantly better fit than the model spam ~ dollar. Report the \(p\)-value and a brief conclusion.
Part E: Test whether the model spam ~ dollar + number provides a significantly better fit than the model spam ~ dollar. Report the \(p\)-value and a brief conclusion.
Part F: Interpret the estimated coefficient of numbersmall in the model spam ~ dollar + number. Be careful to recognize that this model also contains the variable dollar.
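The logistic regression steps in this question (fitting with glm(), exponentiating coefficients, and comparing nested models) can be sketched as follows. This is an illustration on simulated data, not the email data: the data frame sim, the generating coefficients, and the factor levels "none" and "big" are assumptions made for the example. Only the variable names and the level "small" (implied by the coefficient name numbersmall) come from the question. The likelihood ratio test shown is one reasonable choice for comparing nested logistic models, not necessarily the one the assignment intends.

```r
# Simulated stand-in data (hypothetical values, NOT the email_spam data set).
set.seed(2)
n <- 500
sim <- data.frame(
  dollar = rpois(n, 1),
  exclaim_mess = rpois(n, 3),
  # Levels other than "small" are assumed for illustration.
  number = factor(sample(c("none", "small", "big"), n, replace = TRUE),
                  levels = c("none", "small", "big"))
)
log_odds <- -1.5 + 0.3 * sim$dollar - 0.8 * (sim$number == "small")
sim$spam <- rbinom(n, 1, plogis(log_odds))

# Logistic regression models on the log-odds scale.
fit_dollar <- glm(spam ~ dollar, data = sim, family = binomial)
fit_number <- glm(spam ~ dollar + number, data = sim, family = binomial)

# Exponentiating moves coefficients from log-odds to the odds scale:
#   exp(intercept) = estimated odds of spam when dollar = 0
#   exp(slope)     = multiplicative change in odds per one-unit rise in dollar
odds_scale <- exp(coef(fit_dollar))
print(odds_scale)

# Likelihood ratio test comparing the nested models.
lrt <- anova(fit_dollar, fit_number, test = "Chisq")
p_value <- lrt[["Pr(>Chi)"]][2]
print(p_value)

# Odds ratio for number = "small" versus the reference level "none",
# comparing emails with the same value of dollar.
or_small <- exp(coef(fit_number))[["numbersmall"]]
print(or_small)
```

Because fit_number also contains dollar, the exponentiated numbersmall coefficient is an adjusted odds ratio: it compares emails that differ in number but share the same value of dollar.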