Directions:
Some machine learning methods, such as \(k\)-nearest neighbors (KNN), are based upon calculating distances between data-points. The Minkowski distance between two points is defined as: \[D(\mathbf{x_1}, \mathbf{x_2}) = \bigg(\sum_{j=1}^{p}|x_{1,j} - x_{2,j}|^k\bigg)^{1/k}\] where \(x_{i,j}\) is the \(j^{th}\) feature of the \(i^{th}\) data-point and \(k\) is a user-specified tuning parameter.
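A direct implementation of this formula can be sketched in a few lines of Python (the function name `minkowski_dist` is illustrative, not prescribed by the assignment):

```python
def minkowski_dist(x1, x2, k):
    """Minkowski distance of order k between two equal-length sequences."""
    return sum(abs(a - b) ** k for a, b in zip(x1, x2)) ** (1 / k)

# k = 2 recovers the familiar Euclidean distance:
# sqrt(1^2 + 1.5^2 + 0.5^2) = sqrt(3.5)
print(minkowski_dist([1, 1.5, 0.5], [0, 0, 0], 2))
```

Setting \(k=1\) instead gives the Manhattan (taxicab) distance, the sum of absolute coordinate differences.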
Using the points [1, 1.5, 0.5] and [0, 0, 0], use a while loop and the function you created in Part A to find a value of \(k\) such that the Minkowski distance between these points is less than a tolerance of 0.001 from their maximum coordinate-wise distance.
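The search exploits the fact that as \(k\) grows, the Minkowski distance converges to the Chebyshev (maximum coordinate-wise) distance. A minimal sketch of the loop, assuming a `minkowski_dist` helper like the one built in Part A (names and the increment-by-one strategy are illustrative):

```python
def minkowski_dist(x1, x2, k):
    return sum(abs(a - b) ** k for a, b in zip(x1, x2)) ** (1 / k)

x1, x2 = [1, 1.5, 0.5], [0, 0, 0]
max_dist = max(abs(a - b) for a, b in zip(x1, x2))  # Chebyshev distance = 1.5

# Increase k until the Minkowski distance is within 0.001 of max_dist
k = 1
while minkowski_dist(x1, x2, k) - max_dist >= 0.001:
    k += 1
print(k, minkowski_dist(x1, x2, k))
```

The difference is always non-negative here, since the Minkowski distance is bounded below by the Chebyshev distance.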
The data in this question are from a drugged-driving experiment where subjects received various combinations of alcohol (coded P or M) and cannabis (coded X, Y, or Z). The zipped folder at this link contains numerous CSV files, each of which records time-series information for lane departures made by the subject in a simulator run. The CSV file at the URL below contains information that can be used to link these departures with dosing information for the involved subject:
https://remiller1450.github.io/data/lane_departures_key.csv
Use the `pandas` library to create a combined data frame storing every lane departure in the zipped folder. Note that this data frame should contain 27 columns and a large number of rows.

Join `lane_departures_key.csv` to the data frame you created in Part A.

Summarize the lateral distance (`lat_dist`) of each lane departure using grouped summarization to remove the time component. Hint: each `DAQ` is a unique identifier of a drive, but not an individual lane departure. The variable `instance` must be combined with the `DAQ` to provide a unique identifier.
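The concat/merge/groupby pattern for these steps can be sketched as follows. Tiny in-memory stand-in frames are used so the sketch runs anywhere; the folder path, the dosing column names (`alcohol`, `cannabis`), and the choice of `max` as the summary statistic are assumptions, while `DAQ`, `instance`, and `lat_dist` come from the question:

```python
import pandas as pd

# In practice, read every CSV from the unzipped folder, e.g.:
#   files = glob.glob("lane_departures/*.csv")
#   departures = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
# Small stand-in frames are used here so the sketch is self-contained.
departures = pd.DataFrame({
    "DAQ": [101, 101, 101, 102],
    "instance": [1, 1, 2, 1],
    "time": [0.0, 0.1, 0.0, 0.0],
    "lat_dist": [0.2, 0.5, 0.3, 0.4],
})
key = pd.DataFrame({"DAQ": [101, 102],
                    "alcohol": ["P", "M"],      # hypothetical column name
                    "cannabis": ["X", "Z"]})    # hypothetical column name

# Attach dosing information (assumes "DAQ" is the shared column)
merged = departures.merge(key, on="DAQ", how="left")

# A lane departure is identified by DAQ *and* instance, so group on both
# before summarizing away the time component
summary = merged.groupby(["DAQ", "instance"], as_index=False)["lat_dist"].max()
print(summary)
```

With the real files, the same three calls (`pd.concat`, `DataFrame.merge`, `DataFrame.groupby`) apply unchanged.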
The table below provides a training data set consisting of 6 observations, 3 predictors, and a categorical outcome:
| Observation | X1 | X2 | X3 | Y |
|---|---|---|---|---|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Green |
| 5 | -1 | 0 | 1 | Green |
| 6 | 1 | 1 | 1 | Red |
Suppose we’re interested in using \(k\)-nearest neighbors to predict the outcome of a test data-point at \(\{X_1=0, X_2=0, X_3=0\}\).
You should answer the following questions using a calculator or basic
Python functions. You should not use any functions in
sklearn
. Additionally, you do not need to perform any
standardization/scaling.
Standardize the predictors (as `StandardScaler()` does) and re-calculate the distance between the test data-point and each observation. What is the predicted probability of this data-point being “Green” using uniform weighting and three neighbors? How does this compare to Part C?
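Hand calculations like these can be checked with basic Python (no `sklearn`). The sketch below uses the raw, unscaled predictors; for the standardized version, first transform each column to mean 0 and standard deviation 1, then rerun the same loop:

```python
import math

# Training data from the table: observation -> (features, outcome)
train = {
    1: ([0, 3, 0], "Red"),
    2: ([2, 0, 0], "Red"),
    3: ([0, 1, 3], "Red"),
    4: ([0, 1, 2], "Green"),
    5: ([-1, 0, 1], "Green"),
    6: ([1, 1, 1], "Red"),
}
test_point = [0, 0, 0]

# Euclidean distance from the test point to each training observation
dists = {obs: math.dist(x, test_point) for obs, (x, _) in train.items()}

# Three nearest neighbors, then the uniform-weight proportion of "Green"
nearest = sorted(dists, key=dists.get)[:3]
p_green = sum(train[obs][1] == "Green" for obs in nearest) / 3
print(nearest, p_green)  # observations 5, 6, 2 -> P(Green) = 1/3
```

On the unscaled data the three nearest neighbors are observations 5, 6, and 2, giving a predicted Green probability of 1/3 under uniform weighting.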
For this question (which uses `sklearn`) you should use the dataset available here:
https://remiller1450.github.io/data/Ozone.csv
These data document daily Ozone concentrations in New York City in 1973. Ozone is a pollutant that has been linked to numerous health problems. The goal of this application is to develop methods for accurately predicting the Ozone concentration on future dates using that date’s expected solar radiation, wind speed, and temperature.
Split the data into training and test sets using `random_state=3`. Next, separate the outcome from the predictors (dropping the “Day” column), and create a pre-processing pipeline that performs standardization before applying a KNN regressor model.

Perform a grid search to tune the `max_depth` and `min_samples_split` hyperparameters. Your search must explore at least 3 reasonable values for each hyperparameter. Report the final set of hyperparameters and the corresponding cross-validated RMSE.
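The pipeline and grid-search steps above can be sketched with `sklearn` as follows. A small synthetic frame stands in for `Ozone.csv` so the sketch is self-contained; the column names, grid values, and the use of a decision tree (suggested by the `max_depth`/`min_samples_split` hyperparameter names) are assumptions, not the assignment's required choices:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# In practice: ozone = pd.read_csv("https://remiller1450.github.io/data/Ozone.csv")
# then drop the "Day" column. Synthetic stand-in data with illustrative names:
rng = np.random.default_rng(3)
ozone = pd.DataFrame({
    "Solar": rng.uniform(0, 300, 120),
    "Wind": rng.uniform(2, 20, 120),
    "Temp": rng.uniform(55, 95, 120),
})
ozone["Ozone"] = 0.2 * ozone["Temp"] - 2 * ozone["Wind"] + rng.normal(0, 5, 120)

# Separate the outcome from the predictors, then split with random_state=3
X = ozone.drop(columns="Ozone")
y = ozone["Ozone"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Pre-processing pipeline: standardize, then fit a KNN regressor
knn_pipe = Pipeline([("scale", StandardScaler()),
                     ("knn", KNeighborsRegressor())])
knn_pipe.fit(X_train, y_train)

# Grid search over at least 3 values per hyperparameter,
# scored by cross-validated RMSE (negated by sklearn's convention)
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=3),
    param_grid={"max_depth": [2, 4, 6], "min_samples_split": [2, 5, 10]},
    scoring="neg_root_mean_squared_error", cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)  # best hyperparameters and CV RMSE
```

Note that `GridSearchCV` reports the *negative* RMSE in `best_score_`, so it must be negated when reporting the cross-validated RMSE.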