Directions:
\(~\)
This assignment is intended to prepare you for the final project. For each question your approach should involve the following strategies:
For each challenge you will work with the full data set, and you should aim to maximize the cross-validated performance on the full data set. Final scores in the competition will be determined using cross-validation with a randomly chosen random_state, so it is in your best interest to develop an approach that will generalize well to new data.
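Because the scoring random_state is unknown in advance, one way to gauge how sensitive a model is to the fold assignment is to repeat cross-validation under several seeds. The following is a minimal sketch, not the official scoring procedure: the synthetic data from make_classification, the logistic regression model, the 5 folds, and the 5 seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; substitute the challenge data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Repeat 5-fold cross-validation with a different shuffling seed each time.
seed_means = []
for seed in range(5):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    seed_means.append(scores.mean())

seed_means = np.array(seed_means)
print(f"mean F1 = {seed_means.mean():.3f}, spread across seeds = {seed_means.std():.3f}")
```

A small spread across seeds suggests the estimate does not depend heavily on one particular split.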
The best scoring submission on each question will receive 2 percentage points of extra credit on the mid-semester exam.
For each challenge, please submit code that creates and stores your best performing model. You may accomplish this by storing the output of GridSearchCV(), and I will assume that the best_estimator_ of your search is your final model.
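As a rough illustration of this submission format, the sketch below fits a small grid search and keeps the fitted search object, whose best_estimator_ attribute holds the refit winning model. The synthetic data, the pipeline steps, the parameter grid, and the file name are hypothetical. Writing the object to disk with joblib is one optional way to store it; keeping it in a variable may be all that is required.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; substitute the challenge data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical pipeline and grid, chosen only to keep the example small.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X, y)

# With the default refit=True, best_estimator_ is refit on all of the data.
final_model = search.best_estimator_
print(search.best_params_)

# Optionally persist the whole search object (hypothetical file name).
path = os.path.join(tempfile.gettempdir(), "challenge_search.joblib")
joblib.dump(search, path)
restored = joblib.load(path)
```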
Additionally, for each challenge you must submit a paragraph explaining your modeling approach. This paragraph will contribute to your score on the assignment. You should write in a scientific tone and describe each choice you made when crafting your modeling approach. Your paragraph will be evaluated on its clarity, level of detail, and technical correctness.
\(~\)
The goal of this challenge is to obtain the best F1 score for the SMS spam classification data set. You can find the data below:
https://remiller1450.github.io/data/sms_spam.txt
These data are unstructured and you should use features engineered from the message text to predict the ‘label’.
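One common way to engineer features from raw message text is a TF-IDF bag-of-words representation fed into a linear classifier. The sketch below shows that idea on a handful of made-up messages; the toy messages, the label values "spam" and "ham", and the model choice are assumptions, not properties of the actual file. Because the labels here are strings, the F1 scorer is built with an explicit pos_label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Made-up messages standing in for the real data (an assumption).
messages = [
    "WIN a free prize now, text CLAIM to 80082",
    "Urgent! Your account has won cash, call now",
    "Free entry in a weekly competition, reply YES",
    "Congratulations, you have been selected for a free phone",
    "Claim your free ringtone today, txt TONE",
    "You won a holiday, call this number to claim",
    "Are we still meeting for lunch today?",
    "I'll be home late, don't wait up",
    "Can you send me the notes from class?",
    "Happy birthday! Hope you have a great day",
    "Running ten minutes behind, see you soon",
    "Did you finish the homework for tomorrow?",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Vectorizing inside the pipeline means the vocabulary is learned only
# from the training folds during cross-validation.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

spam_f1 = make_scorer(f1_score, pos_label="spam", zero_division=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, messages, labels, cv=cv, scoring=spam_f1)
print(scores.mean())
```

Hand-crafted features such as message length or digit counts could be added alongside the TF-IDF columns, but they are not shown here.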
\(~\)
The premise of this challenge is to obtain the best AUC score on a time-series classification task. These data are from a high-fidelity driving simulator experiment where subjects were dosed on one of the following experimental conditions:
DosingLevel = 'XP'
DosingLevel = 'YM'
DosingLevel = 'ZM'
The goal of this application is to use vehicle input patterns to accurately predict target, which is 1 for the active drug conditions (i.e., 'YM' and 'ZM') and 0 for the placebo condition, for each Full_Sample_ID.
Your approach should consider at least one (but does not need to use all) of the following predictors:
CFS.Brake.Pedal.Force - Force (J) recorded on the brake pedal.
CFS.Accelerator.Pedal.Position - The percent depression of the accelerator pedal (a value of 1 is 100% depressed or “pedal to the floor”).
CFS.Steering.Wheel.Angle - The angle of the steering wheel relative to neutral.
CFS.Steering.Wheel.Angle.Rate - The instantaneous rate of change in the steering wheel angle.
SCC.Lane.Deviation.2 - Lateral displacement of the center of the vehicle from the center of the lane. A value of 0 means the vehicle’s center is exactly in the center of the lane.
VDS.Veh.Speed - Speed of the vehicle in miles per hour.
ID - The participant ID. You may opt to use this as a variable to facilitate driver-specific effects, but I would advise against retaining it as a numeric variable.

The data provided below are a collection of 60-second driving samples from a portion of the experiment named “interstate curves” where subjects drove on a curvy section of a 4-lane divided highway with a posted speed limit of 70 miles per hour. The driving simulator records driver inputs and vehicle states at a rate of 60 Hz, so each sample (defined by a unique value of Full_Sample_ID) consists of 3600 time steps.
https://remiller1450.github.io/data/drugdetection.csv
Because each sample contains 3600 time steps, you should use feature engineering to assemble a data set with 1 row per sample and then predict the variable ‘target’ from this data set.
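The reshaping described above, from 3600 rows per sample down to 1 row per sample, can be sketched with a pandas groupby. The example uses randomly generated signals in place of the real file, and the choice of two signals and four summary statistics is an assumption made for illustration; other summaries may be more informative.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in for the real time series (an assumption):
# 4 samples of 3600 time steps each, with a constant target per sample.
rng = np.random.default_rng(0)
n_samples, n_steps = 4, 3600
raw = pd.DataFrame({
    "Full_Sample_ID": np.repeat(np.arange(n_samples), n_steps),
    "target": np.repeat([0, 1, 0, 1], n_steps),
    "VDS.Veh.Speed": rng.normal(70, 3, n_samples * n_steps),
    "SCC.Lane.Deviation.2": rng.normal(0, 0.5, n_samples * n_steps),
})

# Summarize each signal within each sample.
signals = ["VDS.Veh.Speed", "SCC.Lane.Deviation.2"]
features = raw.groupby("Full_Sample_ID")[signals].agg(["mean", "std", "min", "max"])

# Flatten the (signal, statistic) column pairs into single names.
features.columns = ["_".join(col) for col in features.columns]

# target is constant within a sample, so its first value is enough.
features["target"] = raw.groupby("Full_Sample_ID")["target"].first()
print(features.shape)
```

The resulting table has one row per Full_Sample_ID and can be passed to a classifier in the usual way.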
\(~\)
The premise of this challenge is to obtain the lowest RMSE for home price prediction on the Ames Housing data set:
https://remiller1450.github.io/data/AmesHousing.csv
A detailed description of the variables in the Ames Housing data can be found at this link.
These data are the most structured in this assignment, but you should recognize that there are many categorical predictors and a substantial amount of missing data in several predictors.
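One way to handle a mix of categorical predictors and missing values is to impute and encode inside a scikit-learn pipeline, so that these steps are refit within each cross-validation fold. The sketch below uses a small invented table; the column names area, quality, neighborhood, and price are hypothetical stand-ins and are not the Ames variable names. Ridge regression and the imputation strategies are likewise only illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented data with hypothetical column names (an assumption).
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "quality": rng.integers(1, 10, n).astype(float),
    "neighborhood": rng.choice(["A", "B", "C"], n),
})
toy["price"] = 100 * toy["area"] + 5000 * toy["quality"] + rng.normal(0, 10000, n)

# Knock out some values to mimic missing data.
toy.loc[rng.choice(n, 15, replace=False), "area"] = np.nan
toy.loc[rng.choice(n, 15, replace=False), "neighborhood"] = np.nan

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    # Treat missingness as its own category.
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["area", "quality"]),
    ("cat", categorical, ["neighborhood"]),
])
model = Pipeline([("prep", preprocess), ("reg", Ridge(alpha=1.0))])

# scikit-learn maximizes scores, so RMSE is reported as a negative number.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    model, toy.drop(columns="price"), toy["price"],
    cv=cv, scoring="neg_root_mean_squared_error",
)
rmse = -scores.mean()
print(f"cross-validated RMSE: {rmse:.0f}")
```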