Instructions:

This assignment differs from previous homework and is intended to be a stepping stone towards the class project. Your task is to create the best-performing model/pipeline you can for two different applications and briefly write about your approach.

For both applications, your approach must use the following strategies:

  1. Feature engineering
  2. Pipelines and data pre-processing
  3. Model selection and hyperparameter tuning

You may only use functions that are native to Python, or contained in the sklearn, numpy, scipy, matplotlib or pandas libraries, but you may go beyond the functions within these libraries that we’ve explicitly seen in our labs.

Your submitted notebook should print out the best model (i.e., best_estimator_) and its cross-validated performance score (5-fold CV) using GridSearchCV(). You do not need to perform a training-testing split for this assignment.
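As a rough illustration of the expected output, here is a minimal sketch of a pipeline tuned with GridSearchCV() using 5-fold CV. The synthetic data, the choice of logistic regression, and the parameter grid are all assumptions made for the example and are not part of the assignment; substitute your own features, steps, and the scoring metric named in each question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with your engineered features).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Pre-processing and model combined in one pipeline so that scaling is
# re-fit inside each CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: only the regularization strength is tuned here.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)

# The two items the notebook is asked to print.
print(search.best_estimator_)
print(search.best_score_)
```

Here best_score_ is the mean cross-validated score of the best parameter combination, so no separate training-testing split is involved.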

Additionally, for each challenge you must submit a paragraph explaining your modeling philosophy and choices. This paragraph will contribute to your score on the assignment. You should write in a scientific tone and describe each choice you made when crafting your modeling approach. Your paragraph will be evaluated on its clarity, level of detail, and technical correctness.


Question #1 (Spam Classification Challenge)

The goal of this challenge is to obtain the best cross-validated F1 score for the SMS spam classification data set. You can find the data below:

https://remiller1450.github.io/data/sms_spam.txt

These data are unstructured and you should use features engineered from the message text to predict the ‘label’.
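One possible way to engineer features from raw text is sketched below: TF-IDF word weights combined with two hand-made counts (message length and number of digits). The toy messages, the 'spam'/'ham' label values, and the helper name text_stats are assumptions for illustration only; check the actual file for its column names and label coding.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy messages (an assumption; not taken from the real data set).
spam = ["WIN a FREE prize now, call 0800123456",
        "URGENT! claim your 1000 cash reward",
        "Free entry: text WIN to 80085",
        "You have won 500 pounds, reply now",
        "Cheap loans approved, call 0900111222"] * 2
ham = ["Are we still meeting for lunch?",
       "I'll call you after class",
       "Can you send me the notes",
       "See you at home tonight",
       "Thanks, that sounds good"] * 2
text = pd.Series(spam + ham)
label = np.array(["spam"] * len(spam) + ["ham"] * len(ham))

def text_stats(messages):
    """Hypothetical hand-made features: character count and digit count."""
    return np.array([[len(m), sum(c.isdigit() for c in m)] for m in messages])

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("stats", FunctionTransformer(text_stats)),
])
pipe = Pipeline([("features", features),
                 ("clf", LogisticRegression(max_iter=1000))])

# With string labels, F1 needs to be told which class is the positive one.
f1_spam = make_scorer(f1_score, pos_label="spam")
scores = cross_val_score(pipe, text, label, cv=5, scoring=f1_spam)
print(scores.mean())
```

The same pipeline object can be passed to GridSearchCV() in place of cross_val_score() to tune the vectorizer and classifier settings together.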


Question #2 (Drugged Driving Detection Challenge)

The premise of this challenge is to obtain the best ROC-AUC score on a time-series classification task. The data provided below are a collection of 60-second driving samples from a simulated driving experiment where subjects drove on a curvy stretch of a 4-lane divided highway with a posted speed limit of 70 miles per hour. The driving simulator records driver inputs and vehicle states at a rate of 60 Hz, so each sample (defined by a unique value of Full_Sample_ID) consists of 3600 time steps.

https://remiller1450.github.io/data/drugdetection.csv

These data are from a high-fidelity driving simulator experiment where subjects were dosed on one of the following experimental conditions:

The goal of this application is to use vehicle input patterns to accurately predict ‘target’, which is 1 for the active drug conditions (i.e., 'YM' and 'ZM') and 0 for the placebo condition, for each Full_Sample_ID.

Your approach should consider at least one (but does not need to use all) of the following predictors:

Because each sample contains 3600 time steps, you should use feature engineering to assemble a data set with 1 row per sample and then predict the variable ‘target’ from this data set.
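A minimal sketch of this aggregation step, using pandas groupby() on a toy data frame, is shown below. The columns Full_Sample_ID and target follow the description above, while 'speed' and 'steering' are hypothetical predictor names and the summary statistics are only examples of what could be computed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples, n_steps = 4, 3600  # 60 seconds recorded at 60 Hz

# Toy long-format data (an assumption): one row per time step.
# 'speed' and 'steering' are hypothetical column names.
toy = pd.DataFrame({
    "Full_Sample_ID": np.repeat([f"S{i}" for i in range(n_samples)], n_steps),
    "target": np.repeat([0, 1, 0, 1], n_steps),
    "speed": rng.normal(70, 2, n_samples * n_steps),
    "steering": rng.normal(0, 1, n_samples * n_steps),
})

# Collapse each 3600-step series into one row of summary features.
per_sample = toy.groupby("Full_Sample_ID").agg(
    speed_mean=("speed", "mean"),
    speed_sd=("speed", "std"),
    steering_sd=("steering", "std"),
    # Mean absolute step-to-step change, a simple roughness measure.
    steering_roughness=("steering", lambda s: s.diff().abs().mean()),
    # target is constant within a sample, so the first value suffices.
    target=("target", "first"),
)

X = per_sample.drop(columns="target")
y = per_sample["target"]
print(per_sample)
```

The resulting X and y have one row per Full_Sample_ID and can be passed to a pipeline scored with ROC-AUC.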