Instructions:
This assignment differs from previous homework and is intended to be a stepping stone towards the class project. Your task is to create the best performing model/pipeline you can for two different applications and briefly write about your approach.
For both applications, your approach must adhere to the following requirements:
You may only use functions that are native to Python or contained in the sklearn, numpy, scipy, matplotlib, or pandas libraries, but you may go beyond the functions within these libraries that we’ve explicitly seen in our labs.
Your submitted notebook should print out the best model (i.e., best_estimator_) and its cross-validated performance score (5-fold CV) using GridSearchCV(). You do not need to perform a training-testing split for this assignment.
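A minimal sketch of the required printout is shown here. It runs on synthetic data from make_classification, and the pipeline steps, the parameter grid, and the "f1" scoring choice are illustrative assumptions, not a recommended solution for either challenge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the challenge's features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative pipeline: scaling followed by a linear classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; step-name prefixes ("clf__") route parameters to pipeline steps.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# cv=5 gives the 5-fold cross-validation the assignment asks for.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)

# The two items the notebook should print.
print(search.best_estimator_)
print(f"5-fold CV score: {search.best_score_:.3f}")
```

Because best_score_ is the mean cross-validated score of the best parameter combination, no separate training-testing split is needed to report it.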
Additionally, for each challenge you must submit a paragraph explaining your modeling philosophy and choices. This paragraph will contribute to your score on the assignment. You should write in a scientific tone and describe each choice you made when crafting your modeling approach. Your paragraph will be evaluated on its clarity, level of detail, and technical correctness.
The goal of this challenge is to obtain the best cross-validated F1 score for the SMS spam classification data set. You can find the data below:
https://remiller1450.github.io/data/sms_spam.txt
These data are unstructured and you should use features engineered from the message text to predict the ‘label’.
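One way to engineer features from message text is a bag-of-words pipeline, sketched here on a handful of made-up messages. The column name "text", the label values "spam" and "ham", and the parameter grid are assumptions for illustration; check them against the actual file before adapting this.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy messages (assumption); the real data come from sms_spam.txt.
spam_msgs = [f"WIN a FREE prize now, call {i}0800 to claim" for i in range(10)]
ham_msgs = [f"See you at {i} pm for lunch tomorrow" for i in range(10)]
toy = pd.DataFrame({
    "label": ["spam"] * 10 + ["ham"] * 10,   # assumed label values
    "text": spam_msgs + ham_msgs,            # hypothetical column name
})

# TF-IDF turns each message into a sparse vector of weighted token counts.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over n-gram range and regularization strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# With string labels, F1 needs to be told which class counts as positive.
f1_spam = make_scorer(f1_score, pos_label="spam")

search = GridSearchCV(pipe, param_grid, cv=5, scoring=f1_spam)
search.fit(toy["text"], toy["label"])

print(search.best_estimator_)
print(f"5-fold CV F1: {search.best_score_:.3f}")
```

Keeping the vectorizer inside the pipeline means its vocabulary is refit on each training fold, so the cross-validated F1 score is not inflated by tokens seen only in the held-out fold.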
The goal of this challenge is to obtain the best ROC-AUC score on a time-series classification task. The data provided below are a collection of 60-second driving samples from a simulated driving experiment where subjects drove on a curvy stretch of a 4-lane divided highway with a posted speed limit of 70 miles per hour. The driving simulator records driver inputs and vehicle states at a rate of 60 Hz, so each sample (defined by a unique value of Full_Sample_ID) consists of 3600 time steps.
https://remiller1450.github.io/data/drugdetection.csv
These data are from a high-fidelity driving simulator experiment where subjects were dosed under one of the following experimental conditions:
DosingLevel = 'XP'
DosingLevel = 'YM'
DosingLevel = 'ZM'
The goal of this application is to use vehicle input patterns to accurately predict target, which is 1 for the active drug conditions (i.e., 'YM' and 'ZM') and 0 for the placebo condition, for each Full_Sample_ID.
Your approach should consider at least one (but does not need to use all) of the following predictors:
CFS.Brake.Pedal.Force - Force (J) recorded on the brake pedal.
CFS.Accelerator.Pedal.Position - The percent depression of the accelerator pedal (a value of 1 is 100% depressed or “pedal to the floor”).
CFS.Steering.Wheel.Angle - The angle of the steering wheel relative to neutral.
CFS.Steering.Wheel.Angle.Rate - The instantaneous rate of change in the steering wheel angle.
SCC.Lane.Deviation.2 - Lateral displacement of the center of the vehicle from the center of the lane. A value of 0 means the vehicle’s center is exactly in the center of the lane.
VDS.Veh.Speed - Speed of the vehicle in miles per hour.
ID - The participant ID. You may opt to use this as a variable to capture driver-specific effects, but I would advise against retaining it as a numeric variable.
Because each sample contains 3600 time steps, you should use feature engineering to assemble a data set with 1 row per sample and then predict the variable ‘target’ from this data set.
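The step of collapsing each sample's time series into a single row could be sketched as follows. The data here are randomly generated stand-ins that reuse the column names listed above, and the summary statistics, classifier, and grid are assumptions chosen only to illustrate the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 40 samples of 100 time steps each.
# The real samples have 3600 time steps and come from drugdetection.csv.
n_samples, n_steps = 40, 100
sample_ids = np.repeat(np.arange(n_samples), n_steps)
long_df = pd.DataFrame({
    "Full_Sample_ID": sample_ids,
    "target": sample_ids % 2,  # constant within each sample
    "VDS.Veh.Speed": rng.normal(70, 3, n_samples * n_steps),
    "SCC.Lane.Deviation.2": rng.normal(0, 0.5, n_samples * n_steps),
})

# Summarize each signal within a sample; these statistics are illustrative.
signals = ["VDS.Veh.Speed", "SCC.Lane.Deviation.2"]
features = long_df.groupby("Full_Sample_ID")[signals].agg(["mean", "std", "min", "max"])

# Flatten the (signal, statistic) column MultiIndex into single names.
features.columns = ["_".join(col) for col in features.columns]

# One target value per sample, aligned to the feature rows.
y = long_df.groupby("Full_Sample_ID")["target"].first()
X = features.loc[y.index]

# Hypothetical model and grid, scored by ROC-AUC with 5-fold CV.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {"max_depth": [2, None]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)

print(search.best_estimator_)
print(f"5-fold CV ROC-AUC: {search.best_score_:.3f}")
```

Since the synthetic signals are pure noise, the printed score carries no meaning; the point is the shape of the result, with one row per Full_Sample_ID and one column per signal and statistic.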