Problem Set #1 (MATH-257, Spring 2020)

Directions:

On all problem sets you are expected to show sufficient work, in the form of R code and/or written or typed calculations, for a third party to understand how you arrived at your solution.
I encourage you to write your answers in R Markdown, a nifty package capable of conveniently combining R code, R output, and LaTeX typesetting into a single document.
Any submission received after the assigned due date may be subject to a late penalty unless prior approval for an adjusted due date has been given.

\(~\)

Question #1 - A Second Course In Statistics #1.10

A Gallup Youth Poll was conducted to determine the topics that teenagers most want to discuss with their parents. The findings show that 46% would like more discussion about the family’s financial situation, 37% would like to talk about school, and 30% would like to talk about religion. The survey was based on a national sampling of 505 teenagers, selected at random from all U.S. teenagers.

Describe the sample.
Describe the population from which the sample was selected.
Is the sample representative of the population?
What is the variable of interest?
How is the inference expressed?
Newspaper accounts of most polls usually give a margin of error (e.g., plus or minus 3%) for the survey result. What is the purpose of the margin of error and what is its interpretation?

\(~\)

Question #2 - Intro to Statistical Learning #2.10 (adapted)

This exercise involves the Boston housing data set. To begin, load the Boston data set, which is part of the MASS library in R (note: this package comes installed by default; a full description of the data set is available at this link).

library(MASS)
data <- Boston

How many rows are in this data set? How many columns? What do the rows and columns represent?
Create three pairwise scatterplots involving different predictors (columns) in this data set using per-capita crime rate as the \(Y\) variable. Describe your findings.
Are any of the predictors you explored associated with per-capita crime rate? If so, explain the relationship.
Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? That is, identify any tracts that are outliers in these variables.
How many of the tracts in this data set border the Charles river? Create a graph showing the distribution of tracts that border/do not border the river.
Do tracts that border the Charles river tend to have higher median home values? Create a graph supporting your answer.
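
As a possible starting point for the tasks above, the sketch below loads the data and shows the kinds of calls that might be useful. The object name boston and the choice of lstat as the plotted predictor are arbitrary illustrations, not a prescribed approach; any predictor could be substituted.

```r
library(MASS)            # Boston ships with the MASS package
boston <- Boston         # `boston` is an arbitrary name chosen for this sketch

dim(boston)              # number of rows, then number of columns
str(boston)              # column names and types; run ?Boston for descriptions

# One example scatterplot with per-capita crime rate (crim) on the Y axis.
# lstat is only an example predictor; swap in any other column.
plot(crim ~ lstat, data = boston,
     xlab = "lstat", ylab = "Per-capita crime rate (crim)")

# chas is a 0/1 indicator for tracts that bound the Charles River
chas_counts <- table(boston$chas)
chas_counts
```

Calls such as summary(), hist(), and boxplot() follow the same pattern and may help with the outlier and river questions.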

\(~\)

Question #3 - Intro to Statistical Learning #2.6

Describe the differences between a parametric and a non-parametric modeling approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages? (A satisfactory answer should be 2-5 sentences.)

\(~\)

Question #4 - A Second Course In Statistics #3.10

Mechanical engineers at the University of Newcastle (Australia) investigated the use of timber in high-efficiency small wind turbine blades (Wind Engineering, January 2004). The strengths of two types of timber, radiata pine and hoop pine, were compared. Twenty specimens (called “coupons”) of each timber blade were fatigue tested by measuring the stress (in MPa) on the blade after various numbers of blade cycles. A simple linear regression analysis of the data, one conducted for each type of timber, yielded the following results (where y = stress and x = natural logarithm of number of cycles):

Radiata Pine: \(\hat{y} = 97.37 - 2.50 x\)

Hoop Pine: \(\hat{y} = 122.03 - 2.36 x\)

Interpret the estimated slope of each line.
Interpret the estimated y-intercept of each line.
Based on these results, which type of timber blade appears to be stronger and more fatigue resistant? Explain.
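
One way to examine the two fitted lines numerically is to evaluate them at a few values of x. The sketch below uses only the coefficients reported above; the cycle counts are hypothetical values chosen for illustration and do not come from the study.

```r
# Fitted lines from the problem statement (x = natural log of cycles)
radiata_hat <- function(x) 97.37 - 2.50 * x
hoop_hat    <- function(x) 122.03 - 2.36 * x

# Hypothetical cycle counts, for illustration only
cycles <- c(1e3, 1e4, 1e5, 1e6)
x <- log(cycles)

comparison <- data.frame(cycles  = cycles,
                         x       = x,
                         radiata = radiata_hat(x),
                         hoop    = hoop_hat(x))
comparison
```

Keep in mind that predictions at x values outside the range observed in the study are extrapolations.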

\(~\)

Question #5 - A Second Course In Statistics #3.22

A study was conducted to model the thermal performance of integral-fin tubes used in the refrigeration and process industries (Journal of Heat Transfer, August 1990). Twenty-four specially manufactured integral-fin tubes with rectangular fins made of copper were used in the experiment. Vapor was released downward into each tube and the vapor-side heat transfer coefficient (based on the outside surface area of the tube) was measured.

The dependent variable for the study is the heat transfer enhancement ratio, y, defined as the ratio of the vapor-side coefficient of the fin tube to the vapor-side coefficient of a smooth tube evaluated at the same temperature. Theoretically, heat transfer will be related to the area at the top of the tube that is “unflooded” by condensation of the vapor. The data in the table are the unflooded area ratio (x) and heat transfer enhancement (y) values recorded for the 24 integral-fin tubes.

Fit a least squares line to the data.
Plot the data and graph the least squares line as a check on your calculations.
Calculate SSE and \(s^2\).
Calculate s and interpret its value.

The data needed for this question are given below:

Q5_data <- data.frame(unflooded_area_ratio = c(1.93, 1.95, 1.78, 1.64,
                                               1.54, 1.32, 2.12, 1.88,
                                               1.70, 1.58, 2.47, 2.37,
                                               2.00, 1.77, 1.62, 2.77,
                                               2.47, 2.24, 1.32, 1.26,
                                               1.21, 2.26, 2.04, 1.88),
                      heat_transfer_enhancement = c(4.4, 5.3, 4.5, 4.5,
                                                    3.7, 2.8, 6.1, 4.9,
                                                    4.9, 4.1, 7.0, 6.7,
                                                    5.2, 4.7, 4.2, 6.0,
                                                    5.8, 5.2, 3.5, 3.2,
                                                    2.9, 5.3, 5.1, 4.6))
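
The mechanics of these steps can be sketched on a different data set. The example below uses R's built-in cars data (speed and dist) purely as a stand-in, so the numbers it produces have nothing to do with the fin-tube data; the same calls should carry over once the formula and data argument are changed.

```r
# Illustration on the built-in `cars` data, NOT the fin-tube data
fit <- lm(dist ~ speed, data = cars)
coef(fit)                      # estimated intercept and slope

# Scatterplot with the least squares line overlaid as a visual check
plot(dist ~ speed, data = cars)
abline(fit)

# SSE is the sum of squared residuals; for simple linear regression
# s^2 divides SSE by the residual degrees of freedom, n - 2
sse <- sum(resid(fit)^2)
s2  <- sse / df.residual(fit)
s   <- sqrt(s2)

# s should match the residual standard error reported by summary()
all.equal(s, summary(fit)$sigma)
```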

\(~\)

Question #6 - A Second Course In Statistics #3.22

Refer to the Journal of Heat Transfer study of the straight-line relationship between heat transfer enhancement (y) and unflooded area ratio (x), Exercise 3.22 (p. 109 in the textbook, or the previous question in this assignment). Construct a 95% confidence interval for \(\beta_1\), the slope of the line. Interpret the result.
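
A confidence interval for a slope can be obtained either with confint() or by hand from the estimate, its standard error, and a t critical value. The sketch below shows both routes on the built-in cars data, used here only as a stand-in for the heat transfer data.

```r
# Illustration on the built-in `cars` data, NOT the heat transfer data
fit <- lm(dist ~ speed, data = cars)

# Route 1: built-in interval for the slope
ci_builtin <- confint(fit, "speed", level = 0.95)

# Route 2: estimate +/- t critical value (n - 2 df) * standard error
slope_est <- coef(summary(fit))["speed", "Estimate"]
slope_se  <- coef(summary(fit))["speed", "Std. Error"]
t_crit    <- qt(0.975, df = df.residual(fit))
ci_manual <- slope_est + c(-1, 1) * t_crit * slope_se

ci_builtin
ci_manual
```

The two intervals should agree, which makes the manual route a useful check on the formula.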

\(~\)

Question #7 - A Second Course In Statistics #3.28

The British Journal of Sports Medicine (April 2000) published a study of the effect of massage on boxing performance. Two variables measured on the boxers were blood lactate concentration (mM) and the boxer’s perceived recovery (28-point scale). Based on information provided in the article, the data in the table were obtained for 16 five-round boxing performances, where a massage was given to the boxer between rounds. Conduct a test to determine whether blood lactate level (y) is linearly related to perceived recovery (x). Use \(\alpha = 0.10\) as the significance threshold.

The data needed for this question are given below:

Q7_data <- data.frame(blood_lactate_level = c(3.8, 4.2, 4.8, 4.1, 5.0, 5.3,
                                              4.2, 2.4, 3.7, 5.3, 5.8, 6.0,
                                              5.9, 6.3, 5.5, 6.5),
                      perceived_recovery = c(7,7,11,12,12,12,13,17,
                                             17,17,18,18,21,21,20,24))
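
A test of whether two variables are linearly related is commonly carried out as a t-test of \(H_0: \beta_1 = 0\) on the slope. The sketch below shows that workflow on the built-in cars data, used only as a stand-in for the boxing data; the variable names are illustrative.

```r
# Illustration on the built-in `cars` data, NOT the boxing data
fit <- lm(dist ~ speed, data = cars)
slope_row <- coef(summary(fit))["speed", ]

# Test statistic: slope estimate divided by its standard error
t_stat  <- unname(slope_row["Estimate"] / slope_row["Std. Error"])

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

alpha  <- 0.10
reject <- p_value < alpha      # TRUE means reject H0 at level alpha

c(t_stat = t_stat, p_value = p_value)
reject
```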

\(~\)

Question #8 - Intro to Statistical Learning #2.7 (adapted)

The table below provides a data set containing six observations, three predictors, and one response variable.

Obs.	X1	X2	X3	Y
1	0	3	0	2.5
2	2	0	0	1
3	0	1	3	7
4	0	1	2	5
5	−1	0	1	0.5
6	1	1	1	3

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
What is our prediction with K = 1? Why?
What is our prediction with K = 3? Why?
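
The distance and prediction steps can be sketched in a few lines of R. The training data below are made up for illustration and are not the table above, and the sketch assumes the regression form of K-nearest neighbors, where the prediction is the mean response of the K closest observations.

```r
# Made-up training data for illustration (NOT the table in this question)
train_x <- rbind(c(3, 4, 0),
                 c(0, 0, 2),
                 c(1, 2, 2))
train_y <- c(10, 20, 30)
x0 <- c(0, 0, 0)               # test point

# Euclidean distance from each row of train_x to x0
dists <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
dists                          # 5, 2, 3

# Average the responses of the k nearest observations
knn_predict <- function(k) {
  nearest <- order(dists)[seq_len(k)]
  mean(train_y[nearest])
}

knn_predict(1)                 # 20: the single closest row is row 2
knn_predict(2)                 # 25: mean of rows 2 and 3
```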

Assigned: Tuesday Jan 19th, Due: Friday Feb 12th at 11:59pm
