Sta-209 (Spring 25) Homework #10

Question #1

The Armed Forces Qualification Test (AFQT) is a multiple-choice aptitude test designed to measure an individual’s reasoning and verbal skills to determine their suitability for military service. Because the test has been so broadly applied, percentile scores are frequently used as a measure of general intelligence.

The data in this question are a random sample of \(n=2584\) Americans who were first selected and tested in 1979, then re-interviewed in 2006 and asked about their education and annual income. This analysis focuses on modeling each respondent's 2005 income using their AFQT score (from 1979). The data contain the following variables of interest:

AFQT: The percentile score of the subject on the Armed Forces Qualification Test (AFQT).
Educ: The number of years of education received by the subject.
Income2005: The subject’s annual income in 2005.

afqt <- read.csv("https://remiller1450.github.io/data/AFQT.csv")

Part A: Create a histogram using the variable Income2005 and briefly describe the distribution of this variable.
Part B: Create a scatter plot relating AFQT (the subject’s AFQT score) and Income2005. Add a smoothing line to the graph using geom_smooth() and state whether there appears to be a relationship between these two variables.
Part C: Use an appropriate test to determine whether the simple linear regression model using AFQT to predict Income2005 has a significantly better fit than the intercept-only model (which implies independence between these variables). Report the \(p\)-value of this test and a brief conclusion.
Part D: Use an appropriate test to determine if a multivariable regression model using both Educ and AFQT fits the data significantly better than the simple linear regression model you considered in Part C. Report the \(p\)-value of this test and a brief conclusion.
Part E: Check the assumptions for inference of the best model you identified across Parts C and D. You should state each assumption and justify whether or not you believe it to be met by referencing an appropriate plot.
Part F: Apply a base-2 logarithm (log2) transformation to the outcome variable in your previously selected best model. Do the assumptions for inference look better or worse after applying this transformation?
Part G: Repeat the test you performed in Part D using models involving log-transformed outcomes. Do you reach the same conclusion?
Part H: Find a 95% confidence interval estimate for the expected percentage change in 2005 income for a 1-percentile increase in AFQT (holding education constant).
Part I: Consider the effects of a 1-year increase in educational attainment and a 1-percentile increase in AFQT score. Which of these increases has a greater impact on 2005 income (assuming the other remains constant)? Can you be statistically confident that its impact is larger? Hint: Use 95% confidence interval estimates to make an argument justifying your answer.
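
The workflow behind Parts C through H can be sketched on simulated data. This is only an illustration under stated assumptions, not a solution: the data frame sim and the variables x1, x2, and y are hypothetical stand-ins, and the numbers it produces say nothing about the AFQT data. It shows how anova() compares nested lm() fits with an F-test, how plot() draws the usual diagnostic plots, and how a slope from a log2-transformed outcome can be converted into a percentage change.

```r
# Simulated stand-in data; sim, x1, x2, and y are hypothetical names
set.seed(1)
n  <- 200
x1 <- runif(n, 0, 100)            # plays the role of a percentile score
x2 <- rnorm(n, mean = 13, sd = 2) # plays the role of years of education
y  <- 2^(10 + 0.01 * x1 + 0.1 * x2 + rnorm(n, sd = 0.5))
sim <- data.frame(x1, x2, y)

# Nested models on the log2 scale
m0 <- lm(log2(y) ~ 1, data = sim)        # intercept-only
m1 <- lm(log2(y) ~ x1, data = sim)       # simple linear regression
m2 <- lm(log2(y) ~ x1 + x2, data = sim)  # multivariable regression

# F-tests comparing each model with the one nested inside it
anova(m0, m1)
f_test <- anova(m1, m2)
f_test

# Diagnostic plots commonly used to check assumptions
plot(m2, which = 1)  # residuals vs. fitted values
plot(m2, which = 2)  # normal Q-Q plot of the residuals

# A slope b on the log2 scale multiplies the outcome by 2^b per unit of x1,
# so the percentage change is 100 * (2^b - 1); apply this to the CI endpoints
ci <- confint(m2, "x1", level = 0.95)
pct_change <- 100 * (2^ci - 1)
pct_change
```

Because the model is fit on the log2 scale, the back-transformed interval describes multiplicative change in a typical (median) outcome rather than an additive change in the mean.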

Question #2

The “email spam” data set contains roughly 4000 emails received by the Gmail account of the statistician David Diez in the early months of 2012. Additional details on the data can be found here; however, this question focuses on the variables below:

spam - the outcome variable, a binary indicator of whether the user considered the email to be spam
exclaim_mess - the number of exclamation points that appear in the email
dollar - the number of times a dollar sign or the word “dollar” appeared in the email
number - a categorical variable indicating whether no number, a small number (less than 1 million), or a large number (over 1 million) was included in the email

emails <- read.csv("https://remiller1450.github.io/data/email_spam.csv")

Part A: Consider the goal of evaluating the role of the variables exclaim_mess, dollar, and number in predicting whether an email is spam. In your own words, briefly explain why logistic regression is an appropriate method for this task.
Part B: Fit a logistic regression using dollar to predict spam. Interpret the estimated intercept in this model. Be sure you exponentiate to facilitate a meaningful interpretation.
Part C: Interpret the estimated coefficient of dollar in the model you fit in Part B. Be sure you exponentiate to facilitate a meaningful interpretation.
Part D: Use a likelihood ratio test to determine whether the model you fit in Part B fits the data better than an intercept-only model. Report the \(p\)-value and a brief conclusion.
Part E: Use a likelihood ratio test to determine whether the model spam ~ dollar + exclaim_mess provides a significantly better fit than the model spam ~ dollar. Report the \(p\)-value and a brief conclusion.
Part F: Use a likelihood ratio test to determine whether the model spam ~ dollar + number provides a significantly better fit than the model spam ~ dollar. Report the \(p\)-value and a brief conclusion.
Part G: Interpret the coefficient of the re-coded variable numbersmall in the model spam ~ dollar + number. Be careful to recognize that this model also contains the variable dollar.
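
The likelihood ratio tests and exponentiated coefficients used throughout this question can be sketched on simulated data. This is an illustration under stated assumptions, not a solution: the data frame sim and the variables flag, count, and group are hypothetical, and the reference level of group is set explicitly here, which may differ from how R codes number in the email data.

```r
# Simulated stand-in data; sim, flag, count, and group are hypothetical names
set.seed(2)
n     <- 500
count <- rpois(n, lambda = 1)
group <- factor(sample(c("none", "small", "big"), n, replace = TRUE),
                levels = c("none", "small", "big"))  # "none" is the reference
eta   <- -1 + 0.4 * count + ifelse(group == "small", -0.8, 0)
flag  <- rbinom(n, size = 1, prob = plogis(eta))
sim   <- data.frame(flag, count, group)

# Nested logistic regression models
g0 <- glm(flag ~ 1, family = binomial, data = sim)
g1 <- glm(flag ~ count, family = binomial, data = sim)
g2 <- glm(flag ~ count + group, family = binomial, data = sim)

# Likelihood ratio tests between nested models
anova(g0, g1, test = "LRT")
lrt <- anova(g1, g2, test = "LRT")
lrt

# Exponentiated intercept: estimated odds when count is 0
# Exponentiated slope: odds ratio for a one-unit increase in count
odds_scale <- exp(coef(g1))
odds_scale

# In g2, exp() of the "groupsmall" coefficient is an odds ratio comparing
# "small" with the reference level "none", holding count fixed
exp(coef(g2)["groupsmall"])
```

A three-level factor contributes two indicator variables, so the second test above has 2 degrees of freedom, and each indicator's coefficient is interpreted relative to the reference level.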