Directions:

Question #1 (Minkowski distance/Python skills)

Some machine learning methods, such as \(k\)-nearest neighbors (KNN), are based upon calculating distances between data-points. The Minkowski distance between two points is defined as: \[D(\mathbf{x_1}, \mathbf{x_2}) = \bigg(\sum_{j=1}^{p}|x_{1,j} - x_{2,j}|^k\bigg)^{1/k}\] where \(x_{i,j}\) is the \(j^{th}\) feature of the \(i^{th}\) data-point and \(k\) is a user-specified tuning parameter.
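As a rough illustration of the formula above, the distance can be written with basic Python only. The function name `minkowski_distance` and its argument names are illustrative choices, not part of the assignment; here `k` is the exponent in the formula, not the number of neighbors.

```python
def minkowski_distance(x1, x2, k=2):
    """Minkowski distance between two equal-length sequences of numbers.

    k is the exponent from the formula (k=1 gives Manhattan distance,
    k=2 gives Euclidean distance).
    """
    if len(x1) != len(x2):
        raise ValueError("x1 and x2 must have the same number of features")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sum |x_{1,j} - x_{2,j}|^k over the p features, then take the k-th root.
    total = sum(abs(a - b) ** k for a, b in zip(x1, x2))
    return total ** (1 / k)


if __name__ == "__main__":
    print(minkowski_distance([0, 0], [3, 4], k=1))
    print(minkowski_distance([0, 0], [3, 4], k=2))
```

Running the file prints the distance between the same pair of points under two choices of `k`, which shows how the tuning parameter changes the result.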

\(~\)

Question #2 (Data preparation/Python skills)

The data in this question are from a drugged-driving experiment where subjects received various combinations of alcohol (coded P or M) and cannabis (coded X, Y, or Z). The zipped folder at this link contains numerous CSV files, each of which records time-series information for lane departures made by the subject in a simulator run. The CSV file at the URL below contains information that can be used to link these departures with dosing information for the involved subject:

https://remiller1450.github.io/data/lane_departures_key.csv
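One possible way to approach the linking step is sketched below. It assumes the CSV files have been unzipped into a local folder and that the key file has a column holding each file's name. The helper name `load_runs` and the column name `"file"` are hypothetical, so check the actual columns of the key before relying on them.

```python
from pathlib import Path

import pandas as pd


def load_runs(folder, key_df, id_column="file"):
    """Stack every CSV in `folder` and attach dosing info from `key_df`.

    `id_column` is a hypothetical column name: it is assumed to hold the
    CSV file name in `key_df`; adjust it to match the real key file.
    """
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        run = pd.read_csv(path)
        run[id_column] = path.name  # record which file each row came from
        frames.append(run)
    stacked = pd.concat(frames, ignore_index=True)
    # A left join keeps every departure row, even if its file is not in the key.
    return stacked.merge(key_df, on=id_column, how="left")
```

The key itself could be read with `pd.read_csv` using the URL above and passed in as `key_df`.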

\(~\)

Question #3 (KNN concepts and practice)

The table below provides a training data set consisting of 6 observations, 3 predictors, and a categorical outcome:

Observation   X1   X2   X3   Y
1              0    3    0   Red
2              2    0    0   Red
3              0    1    3   Red
4              0    1    2   Green
5             -1    0    1   Green
6              1    1    1   Red

Suppose we’re interested in using \(k\)-nearest neighbors to predict the outcome of a test data-point at \(\{X_1=0, X_2=0, X_3=0\}\).

You should answer the following questions using a calculator or basic Python functions. You should not use any functions in sklearn. Additionally, you do not need to perform any standardization/scaling.
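A minimal sketch of how such a calculation could be organized without sklearn is given below, assuming Euclidean distance and a simple majority vote. The names `euclidean` and `knn_predict` are illustrative, and the neighbor count is called `n_neighbors` to keep it distinct from the Minkowski exponent.

```python
import math
from collections import Counter

# Training data from the table: (X1, X2, X3) and the outcome Y.
train = [
    ((0, 3, 0), "Red"),
    ((2, 0, 0), "Red"),
    ((0, 1, 3), "Red"),
    ((0, 1, 2), "Green"),
    ((-1, 0, 1), "Green"),
    ((1, 1, 1), "Red"),
]
test_point = (0, 0, 0)


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def knn_predict(train, point, n_neighbors):
    """Majority vote among the n_neighbors closest training points."""
    ranked = sorted(train, key=lambda row: euclidean(row[0], point))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for features, label in train:
        print(features, label, round(euclidean(features, test_point), 3))
    print(knn_predict(train, test_point, n_neighbors=3))
```

Printing the individual distances first makes it easy to check each one against a calculator before trusting the vote.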

\(~\)

Question #4 (Application using sklearn)

For this question you should use the dataset available here:

https://remiller1450.github.io/data/Ozone.csv

These data document daily Ozone concentrations in New York City in 1973. Ozone is a pollutant that has been linked to numerous health problems. The goal of this application is to develop methods for accurately predicting the Ozone concentration on future dates using that date’s expected solar radiation, wind speed, and temperature.
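One way such a model might be set up in sklearn is sketched below: a pipeline that standardizes the predictors and then fits a KNN regressor, scored by cross-validated RMSE. The helper name `knn_cv_rmse` and the default settings are illustrative assumptions rather than the required approach. The function takes a predictor matrix `X` and an outcome vector `y`, so the column names in the CSV file (not stated here) need to be checked before building them.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def knn_cv_rmse(X, y, n_neighbors=5, folds=5):
    """Cross-validated RMSE of a standardized KNN regressor.

    X: array-like of predictors (e.g. solar radiation, wind speed, temperature)
    y: array-like of the numeric outcome (e.g. ozone concentration)
    """
    model = make_pipeline(
        StandardScaler(),  # put predictors on a common scale before measuring distances
        KNeighborsRegressor(n_neighbors=n_neighbors),
    )
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    # sklearn reports the negated RMSE, so flip the sign back.
    return -scores.mean()
```

Calling this function over a range of `n_neighbors` values and comparing the returned errors is one simple way to choose the tuning parameter. Rows with missing values would need to be dropped or imputed first, since `KNeighborsRegressor` does not accept missing inputs.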