Directions:

Instructions

This assignment is intended to prepare you for the final project. For each question your approach should involve the following strategies:

  1. Feature engineering
  2. Data preparation pipelines and pre-processing
  3. Model selection and hyperparameter tuning

For each challenge you will work with the full data set, and you should aim to maximize the cross-validated performance on the full data set. Final scores in the competition will be determined using cross-validation with a randomly chosen random_state, so it is in your best interest to develop an approach that will generalize well to new data.
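Because the final score depends on a random_state you do not control, one way to gauge how well an approach generalizes is to repeat cross-validation over several shuffles and look at the spread. The following is a minimal sketch under stated assumptions: the data are synthetic (from make_classification), and the model, number of folds, and seeds are placeholders rather than the settings used for grading.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); replace with a challenge data set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

seed_means = []
for seed in range(5):
    # A different shuffle of the folds for each seed.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    seed_means.append(scores.mean())

print(f"mean F1 across seeds: {np.mean(seed_means):.3f}")
print(f"SD of F1 across seeds: {np.std(seed_means):.3f}")
```

A large spread relative to the gap between two candidate models suggests the apparent winner may not hold up under a different random_state.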

The best scoring submission on each question will receive 2 percentage points of extra credit on the mid-semester exam.

For each challenge, please submit code that creates and stores your best performing model. You may do this by storing the output of GridSearchCV(); I will assume that the best_estimator_ of your search is your final model.
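As an illustration of what "storing the output of GridSearchCV()" might look like, here is a minimal sketch. The synthetic data, the pipeline steps, and the parameter grid are all assumptions chosen to keep the example short; they are not a recommended approach for any of the challenges.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with a challenge data set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Pre-processing and model live in one pipeline so that the scaler is
# re-fit on the training folds only during cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

# Store the fitted search object; it carries the refit best model.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)

final_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

With the default refit=True, best_estimator_ is the best pipeline refit on all of the data passed to fit().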

Additionally, for each challenge you must submit a paragraph explaining your modeling approach. This paragraph will contribute to your score on the assignment. You should write in a scientific tone and describe each choice you made when crafting your modeling approach. Your paragraph will be evaluated on its clarity, level of detail, and technical correctness.

Question #1 (Spam Classification Challenge)

The goal of this challenge is to obtain the best F1 score for the SMS spam classification data set. You can find the data below:

https://remiller1450.github.io/data/sms_spam.txt

These data are unstructured and you should use features engineered from the message text to predict the ‘label’.
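One possible way to engineer features from message text is to combine TF-IDF word weights with a few hand-built counts. The sketch below is hedged accordingly: the eight messages are invented, the column name "text" and the label values "spam" and "ham" are assumptions about the file (only the 'label' column is named above), and the two count features are illustrative choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Invented toy messages (assumption); the real file may be laid out differently.
sms = pd.DataFrame({
    "label": ["spam", "ham"] * 4,
    "text": [
        "WIN a FREE prize now! Call 0800123456",
        "Are we still meeting for lunch today?",
        "URGENT! Claim your cash reward, text WIN to 80082",
        "I'll be home late, save me some dinner",
        "Free entry to our weekly prize draw, reply YES now",
        "Can you send me the notes from class?",
        "You have WON a free phone! Call now to claim",
        "Thanks for the ride, see you tomorrow",
    ],
})

# Hand-engineered features: message length and number of digit characters.
sms["n_chars"] = sms["text"].str.len()
sms["n_digits"] = sms["text"].str.count(r"\d")

# TfidfVectorizer needs a 1-D column, so "text" is passed as a string, not a list.
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),
    ("counts", "passthrough", ["n_chars", "n_digits"]),
])
pipe = Pipeline([
    ("features", features),
    ("model", LogisticRegression(max_iter=1000)),
])

# F1 needs to know which label counts as the positive class.
spam_f1 = make_scorer(f1_score, pos_label="spam")
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(
    pipe, sms[["text", "n_chars", "n_digits"]], sms["label"],
    cv=cv, scoring=spam_f1,
)
print(scores)
```

Keeping the vectorizer inside the pipeline means its vocabulary is learned from the training folds only, which avoids leaking information from the held-out messages.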

Question #2 (Drugged Driving Detection Challenge)

The goal of this challenge is to obtain the best AUC score on a time-series classification task. These data are from a high-fidelity driving simulator experiment where subjects were dosed under one of the following experimental conditions:

  • Placebo - DosingLevel = 'XP'
  • Low THC cannabis and low-dose alcohol - DosingLevel = 'YM'
  • High THC cannabis and low-dose alcohol - DosingLevel = 'ZM'

The goal of this application is to use vehicle input patterns to accurately predict target for each Full_Sample_ID. The value of target is 1 for the active drug conditions (i.e., 'YM' and 'ZM') and 0 for the placebo condition.

Your approach should consider at least one (but does not need to use all) of the following predictors:

  • CFS.Brake.Pedal.Force - Force (J) recorded on the brake pedal.
  • CFS.Accelerator.Pedal.Position - The percent depression of the accelerator pedal (a value of 1 is 100% depressed or “pedal to the floor”).
  • CFS.Steering.Wheel.Angle - The angle of the steering wheel relative to neutral.
  • CFS.Steering.Wheel.Angle.Rate - The instantaneous rate of change in the steering wheel angle.
  • SCC.Lane.Deviation.2 - Lateral displacement of the center of the vehicle from the center of the lane. A value of 0 means the vehicle’s center is exactly in the center of the lane.
  • VDS.Veh.Speed - Speed of the vehicle in miles per hour.
  • ID - The participant ID. You may opt to use this as a variable to facilitate driver-specific effects, but I would advise against retaining it as a numeric variable.

The data provided below are a collection of 60-second driving samples from a portion of the experiment named “interstate curves” where subjects drove on a curvy section of a 4-lane divided highway with a posted speed limit of 70 miles per hour. The driving simulator records driver inputs and vehicle states at a rate of 60 Hz, so each sample (defined by a unique value of Full_Sample_ID) consists of 3600 time steps.

https://remiller1450.github.io/data/drugdetection.csv

Because each sample contains 3600 time steps, you should use feature engineering to assemble a data set with 1 row per sample and then predict the variable ‘target’ from this data set.
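The reduction to one row per sample can be sketched with a pandas groupby. The example is a hedged illustration only: the data are simulated noise, the sample IDs ("S0", "S1", ...) are invented, and the summary statistics (mean, standard deviation, mean absolute change) are arbitrary choices rather than recommended features.

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the raw data (assumption): 4 samples x 3600 steps.
rng = np.random.default_rng(0)
n_samples, n_steps = 4, 3600
raw = pd.DataFrame({
    "Full_Sample_ID": np.repeat([f"S{i}" for i in range(n_samples)], n_steps),
    "target": np.repeat([0, 1, 1, 0], n_steps),
    "VDS.Veh.Speed": rng.normal(70, 3, n_samples * n_steps),
    "SCC.Lane.Deviation.2": rng.normal(0, 0.5, n_samples * n_steps),
})

def mean_abs_change(x):
    """Average absolute step-to-step change; assumes rows are in time order."""
    return x.diff().abs().mean()

# Named aggregation: output_name=(input_column, function).
features = raw.groupby("Full_Sample_ID").agg(
    target=("target", "first"),
    speed_mean=("VDS.Veh.Speed", "mean"),
    speed_sd=("VDS.Veh.Speed", "std"),
    lane_dev_sd=("SCC.Lane.Deviation.2", "std"),
    lane_dev_mac=("SCC.Lane.Deviation.2", mean_abs_change),
)
print(features.shape)
```

The resulting table has one row per Full_Sample_ID, so its target column and feature columns can be passed to a scikit-learn pipeline and scored with scoring="roc_auc".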

Question #3 (Home Price Prediction Challenge)

The goal of this challenge is to obtain the lowest RMSE for home price prediction on the Ames Housing data set:

https://remiller1450.github.io/data/AmesHousing.csv

A detailed description of the variables in the Ames Housing data can be found at this link.

These data are the most structured in this assignment, but you should recognize that there are many categorical predictors and a substantial amount of missing data in several predictors.
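A common pattern for mixed numeric and categorical predictors with missing values is a ColumnTransformer that imputes and encodes each type separately. The sketch below makes several assumptions: the twelve rows are invented, the column names (living_area, garage_type, price) are hypothetical and do not match the actual file, and the imputation strategies and Ridge model are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data with hypothetical column names (assumption).
homes = pd.DataFrame({
    "living_area": [1500, 2100, np.nan, 1200, 1800, 2500,
                    900, 1600, np.nan, 2000, 1400, 1750],
    "garage_type": ["Attchd", np.nan, "Detchd", "Attchd", np.nan, "Attchd",
                    "Detchd", "Attchd", "Detchd", np.nan, "Attchd", "Detchd"],
    "price": [180, 260, 210, 140, 215, 320, 110, 190, 170, 240, 165, 200],
})
X, y = homes.drop(columns="price"), homes["price"]

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: treat missing as its own level, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include=np.number)),
    ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
])
model = Pipeline([("prep", prep), ("ridge", Ridge(alpha=1.0))])

# scikit-learn maximizes scores, so RMSE is exposed as a negative value.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
neg_rmse = cross_val_score(model, X, y, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse = -neg_rmse
print(rmse.mean())
```

Whether a missing value should become its own category or be filled some other way depends on what the variable description says a missing entry means for that predictor.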