Directions:
Some machine learning methods, such as \(k\)-nearest neighbors (KNN), are based upon calculating distances between data-points. The Minkowski distance between two points is defined as: \[D(\mathbf{x_1}, \mathbf{x_2}) = \bigg(\sum_{j=1}^{p}|x_{1,j} - x_{2,j}|^k\bigg)^{1/k}\] where \(x_{i,j}\) is the \(j^{th}\) feature of the \(i^{th}\) data-point and \(k\) is a user-specified tuning parameter.
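A direct implementation of this formula can be sketched in a few lines of Python (the function name `minkowski_dist` is illustrative, not prescribed by the assignment):

```python
def minkowski_dist(x1, x2, k):
    """Minkowski distance of order k between two equal-length sequences."""
    return sum(abs(a - b) ** k for a, b in zip(x1, x2)) ** (1 / k)

# k = 2 recovers the familiar Euclidean distance:
# sqrt(1^2 + 1.5^2 + 0.5^2) = sqrt(3.5)
print(minkowski_dist([1, 1.5, 0.5], [0, 0, 0], 2))
```

Setting \(k=1\) instead gives the Manhattan (taxicab) distance, the sum of absolute coordinate differences.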
Using the points [1, 1.5, 0.5] and [0, 0, 0], use a while loop and the function you created in Part A to find a value of \(k\) such that the Minkowski distance between these points is less than a tolerance of 0.001 from their maximum coordinate-wise distance.
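The search exploits the fact that as \(k\) grows, the Minkowski distance converges to the Chebyshev (maximum coordinate-wise) distance. A minimal sketch of the loop, assuming a `minkowski_dist` helper like the one built in Part A (names and the increment-by-one strategy are illustrative):

```python
def minkowski_dist(x1, x2, k):
    return sum(abs(a - b) ** k for a, b in zip(x1, x2)) ** (1 / k)

x1, x2 = [1, 1.5, 0.5], [0, 0, 0]
max_dist = max(abs(a - b) for a, b in zip(x1, x2))  # Chebyshev distance = 1.5

# Increase k until the Minkowski distance is within 0.001 of max_dist
k = 1
while minkowski_dist(x1, x2, k) - max_dist >= 0.001:
    k += 1
print(k, minkowski_dist(x1, x2, k))
```

The difference is always non-negative here, since the Minkowski distance is bounded below by the Chebyshev distance.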
The data in this question are from a drugged-driving experiment where subjects received various combinations of alcohol (coded P or M) and cannabis (coded X, Y, or Z). The zipped folder at this link contains numerous CSV files, each of which records time-series information for lane departures made by the subject in a simulator run. The CSV file at the URL below contains information that can be used to link these departures with dosing information for the involved subject:
https://remiller1450.github.io/data/lane_departures_key.csv
Use the `pandas` library to create a combined data frame storing every lane departure in the zipped folder. Note that this data frame should contain 27 columns and a large number of rows.

Join `lane_departures_key.csv` to the data frame you created in Part A.

Summarize the lateral distance (`lat_dist`) of each lane departure using grouped summarization to remove the time component. Hint: each `DAQ` is a unique identifier of a drive, but not an individual lane departure. The variable `instance` must be combined with the `DAQ` to provide a unique identifier.
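The concat/merge/groupby pattern for these steps can be sketched as follows. Tiny in-memory stand-in frames are used so the sketch runs anywhere; the folder path, the dosing column names (`alcohol`, `cannabis`), and the choice of `max` as the summary statistic are assumptions, while `DAQ`, `instance`, and `lat_dist` come from the question:

```python
import pandas as pd

# In practice, read every CSV from the unzipped folder, e.g.:
#   files = glob.glob("lane_departures/*.csv")
#   departures = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
# Small stand-in frames are used here so the sketch is self-contained.
departures = pd.DataFrame({
    "DAQ": [101, 101, 101, 102],
    "instance": [1, 1, 2, 1],
    "time": [0.0, 0.1, 0.0, 0.0],
    "lat_dist": [0.2, 0.5, 0.3, 0.4],
})
key = pd.DataFrame({"DAQ": [101, 102],
                    "alcohol": ["P", "M"],      # hypothetical column name
                    "cannabis": ["X", "Z"]})    # hypothetical column name

# Attach dosing information (assumes "DAQ" is the shared column)
merged = departures.merge(key, on="DAQ", how="left")

# A lane departure is identified by DAQ *and* instance, so group on both
# before summarizing away the time component
summary = merged.groupby(["DAQ", "instance"], as_index=False)["lat_dist"].max()
print(summary)
```

With the real files, the same three calls (`pd.concat`, `DataFrame.merge`, `DataFrame.groupby`) apply unchanged.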
The table below provides a training data set consisting of 6 observations, 3 predictors, and a categorical outcome:
| Observation | X1 | X2 | X3 | Y |
|---|---|---|---|---|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Green |
| 5 | -1 | 0 | 1 | Green |
| 6 | 1 | 1 | 1 | Red |
Suppose we’re interested in using \(k\)-nearest neighbors to predict the outcome of a test data-point at \(\{X_1=0, X_2=0, X_3=0\}\).
You should answer the following questions using a calculator or basic
Python functions. You should not use any functions in
sklearn
. Additionally, you do not need to perform any
standardization/scaling.
Standardize the predictors (as `StandardScaler()` does) and re-calculate the distance between the test data-point and each observation. What is the predicted probability of this data-point being “Green” using uniform weighting and three neighbors? How does this compare to Part C?
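Hand calculations like these can be checked with basic Python (no `sklearn`). The sketch below uses the raw, unscaled predictors; for the standardized version, first transform each column to mean 0 and standard deviation 1, then rerun the same loop:

```python
import math

# Training data from the table: observation -> (features, outcome)
train = {
    1: ([0, 3, 0], "Red"),
    2: ([2, 0, 0], "Red"),
    3: ([0, 1, 3], "Red"),
    4: ([0, 1, 2], "Green"),
    5: ([-1, 0, 1], "Green"),
    6: ([1, 1, 1], "Red"),
}
test_point = [0, 0, 0]

# Euclidean distance from the test point to each training observation
dists = {obs: math.dist(x, test_point) for obs, (x, _) in train.items()}

# Three nearest neighbors, then the uniform-weight proportion of "Green"
nearest = sorted(dists, key=dists.get)[:3]
p_green = sum(train[obs][1] == "Green" for obs in nearest) / 3
print(nearest, p_green)  # observations 5, 6, 2 -> P(Green) = 1/3
```

On the unscaled data the three nearest neighbors are observations 5, 6, and 2, giving a predicted Green probability of 1/3 under uniform weighting.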
For this question (which uses `sklearn`) you should use the dataset available here:
https://remiller1450.github.io/data/Ozone.csv
These data document daily Ozone concentrations in New York City in 1973. Ozone is a pollutant that has been linked to numerous health problems. The goal of this application is to develop methods for accurately predicting the Ozone concentration on future dates using that date’s expected solar radiation, wind speed, and temperature.
Split the data into training and test sets using `random_state=3`. Next, separate the outcome from the predictors (dropping the “Day” column), and create a pre-processing pipeline that performs standardization before applying a KNN regressor model.

Perform a grid search to tune the `max_depth` and `min_samples_split` hyperparameters. Your search must explore at least 3 reasonable values for each hyperparameter. Report the final set of hyperparameters and the corresponding cross-validated RMSE.
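The pipeline and grid-search steps above can be sketched with `sklearn` as follows. A small synthetic frame stands in for `Ozone.csv` so the sketch is self-contained; the column names, grid values, and the use of a decision tree (suggested by the `max_depth`/`min_samples_split` hyperparameter names) are assumptions, not the assignment's required choices:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# In practice: ozone = pd.read_csv("https://remiller1450.github.io/data/Ozone.csv")
# then drop the "Day" column. Synthetic stand-in data with illustrative names:
rng = np.random.default_rng(3)
ozone = pd.DataFrame({
    "Solar": rng.uniform(0, 300, 120),
    "Wind": rng.uniform(2, 20, 120),
    "Temp": rng.uniform(55, 95, 120),
})
ozone["Ozone"] = 0.2 * ozone["Temp"] - 2 * ozone["Wind"] + rng.normal(0, 5, 120)

# Separate the outcome from the predictors, then split with random_state=3
X = ozone.drop(columns="Ozone")
y = ozone["Ozone"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Pre-processing pipeline: standardize, then fit a KNN regressor
knn_pipe = Pipeline([("scale", StandardScaler()),
                     ("knn", KNeighborsRegressor())])
knn_pipe.fit(X_train, y_train)

# Grid search over at least 3 values per hyperparameter,
# scored by cross-validated RMSE (negated by sklearn's convention)
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=3),
    param_grid={"max_depth": [2, 4, 6], "min_samples_split": [2, 5, 10]},
    scoring="neg_root_mean_squared_error", cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)  # best hyperparameters and CV RMSE
```

Note that `GridSearchCV` reports the *negative* RMSE in `best_score_`, so it must be negated when reporting the cross-validated RMSE.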