Directions:
- Homework must be completed individually. Any guidance or help
received from mentors, classmates, or online resources other than course
materials (including AI/LLMs) must be acknowledged.
- Organize your responses so that each question (1, 2, 3) and
sub-question (A, B, C, etc.) is clearly identifiable.
- Please submit a single Jupyter notebook (.ipynb file) displaying all
output and recording textual answers using neatly formatted markdown
chunks.
- Your submission should be made via Canvas no later than 11:59pm on
the assigned due date.
Question #1 (Decision Tree concepts)
Consider a small data set containing six observations of two
predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):
| Observation | \(X_1\) | \(X_2\) | \(Y\) |
|-------------|---------|---------|-------|
| 1           | 0       | 3       | Red   |
| 2           | 2       | 0       | Red   |
| 3           | 0       | 2       | Red   |
| 4           | 0       | 1       | Green |
| 5           | -1      | 0       | Green |
| 6           | 1       | 1       | Red   |
- Part A: Calculate the Gini impurity of these data
before any splitting rules/models are applied (the standard definitions
are restated after this list for reference).
- Part B: Consider the splitting rule \(X_1 \leq -1\). What is the Gini gain
resulting from this split?
- Part C: Now consider the splitting rule \(X_1 \leq 0\). What is the Gini gain
resulting from this split?
- Part D: If only \(X_1\) is considered, are there any
splitting rules that will lead to a larger Gini gain than the ones
stated in Parts B and C? Briefly explain.
- Part E: If only \(X_2\) is considered, what is the splitting
rule that will produce the best Gini gain? State the rule and the Gini
gain it produces.
- Part F: If the DecisionTreeClassifier() function were fit using
\(X_1\) and \(X_2\) as predictors of \(Y\), what would be the first
splitting rule in the tree? Justify your answer without actually fitting
the function to these data.
- Part G: Considering both predictors, is it possible
for a decision tree with maximum depth of two to perfectly classify
these data if the first splitting rule is \(X_1 \leq -1\)? What about if the first
splitting rule is \(X_1 \leq 0\)?
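For reference, Parts A-F use the Gini impurity and Gini gain as usually defined (restated here as a convenience; confirm the exact definitions against the course slides). For a node containing proportion \(p_k\) of class \(k\),

\[ G = 1 - \sum_k p_k^2, \]

and the Gini gain of a split sending \(n_L\) of the node's \(n\) observations to the left child and \(n_R\) to the right is

\[ \text{Gain} = G_{\text{parent}} - \frac{n_L}{n}\, G_{\text{left}} - \frac{n_R}{n}\, G_{\text{right}}. \]

For example, a node with 3 observations of one class and 1 of the other has \(G = 1 - (3/4)^2 - (1/4)^2 = 3/8\).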
Question #3 (Cross-validation)
Consider a toy data set of \(n=200\) observations generated using
\(f(X) = 2x_1 + 10\) and \(Y = f(X) + \epsilon\), where
\(\epsilon \sim N(0, \sigma = 5)\). In other words, the true
relationship between \(x_1\) and \(Y\) is linear, with 5 units of
irreducible error. These data can be found at the URL given below:
https://remiller1450.github.io/data/toy_linear_data.csv
- Part A: In your own words, explain whether a KNN
regressor or decision tree model is better suited to estimating the true
\(f()\). You should rely upon
conceptual arguments in favor of your chosen method, not empirical
investigations using the provided data.
- Part B: Use a for loop to create your own implementation of 4-fold
cross-validation for a decision tree model with a maximum depth of 4.
You are encouraged to look at the pseudocode in our slides, and you
should use the numpy.random.choice() function to sample fold indices
with replacement. Use your implementation to report the cross-validated
RMSE (a minimal starter sketch appears after this list).
- Part C: Replace the decision tree model in your cross-validation loop
with a LinearRegression model from the linear_model module of sklearn.
How does the change in model impact the cross-validated RMSE?
- Part D: The decision tree model in Part B used a maximum depth of 4.
If you performed a grid search allowing this hyperparameter to be any
integer, would you expect to find one that produces a cross-validated
RMSE as low as that of the linear regression approach in Part C?
Briefly explain. Note: you should make a conceptual argument and
should not actually perform a grid search.
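Below is a minimal sketch of the loop structure Part B describes, using numpy.random.choice() to assign fold labels with replacement (so fold sizes will vary slightly). The column names x1 and y are assumptions; check the CSV header after loading. Treat this as a starting point rather than the required solution; for Part C, only the model line changes.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# from sklearn.linear_model import LinearRegression  # swap in for Part C

# Load the toy data (column names "x1" and "y" are assumed; check the header)
toy = pd.read_csv("https://remiller1450.github.io/data/toy_linear_data.csv")
X, y = toy[["x1"]], toy["y"]

# Assign every observation a fold label in {0, 1, 2, 3}, sampled with replacement
rng = np.random.default_rng(1)
folds = rng.choice(4, size=len(toy))

# Fit on three folds, predict the held-out fold, and pool the squared errors
sq_errors = []
for k in range(4):
    train, test = folds != k, folds == k
    model = DecisionTreeRegressor(max_depth=4)  # LinearRegression() for Part C
    model.fit(X[train], y[train])
    preds = model.predict(X[test])
    sq_errors.extend((y[test] - preds) ** 2)

print("Cross-validated RMSE:", np.sqrt(np.mean(sq_errors)))
```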
Question #4 (Application)
For this question you should use the dataset available here:
https://remiller1450.github.io/data/beans.csv
This data set was constructed using a computer vision system that
segmented 13,611 images of 7 types of dry beans, extracting 16 features
(12 dimensional features, 4 shape-form features). Additional details
are contained in this paper.
The following questions should be answered using Python code, including
functions in the sklearn library. Unless otherwise indicated, use the
'accuracy' scoring criterion during model tuning and evaluation.
- Part A: Read these data into Python and perform a 90-10
training-testing split using random_state=1 (minimal starter sketches
for Parts A-E appear after this list).
- Part B: Separate the outcome, “Class”, from the
predictors and create a histogram of every predictor. Based on these
histograms, do you think that re-scaling and/or transformation should be
part of a data preparation pipeline? You may assume that the pipeline
will consider models like KNN that are sensitive to the scale and
distribution of the predictors.
- Part C: Perform a cross-validated grid search and show a data frame
displaying the top 5 best-performing methods (one possible setup is
sketched after this list) while satisfying the following guidelines:
- Consideration of at least two pre-processing options in the form of
re-scaling, transformation, or “passthrough” consistent with your
assessment in Part B
- Consideration of several KNN models with at least 3 choices of \(k\) and either uniform or distance
weighting
- Consideration of several decision tree models with at least 3
choices of maximum depth
- Your approach should not explore re-scaling or
transformation for decision tree models
- Part D: Create a visualization of the confusion
matrix for your best estimator from Part C that displays
classification results for the test data. Report the most common
type of misclassification made by the model.
- Part E: Report both the macro-averaged and
micro-averaged F1-scores of the best classification approach on the
test data. Which of these approaches (macro or micro averaging) do
you believe is more appropriate for this application? Or are both
approaches reasonable? Briefly explain.
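The sketches below are one possible reading of Parts A-E and are meant only as starting points; everything other than the "Class" column and the URL given above is an assumption to adjust. First, loading, splitting, and the Part B histograms:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

beans = pd.read_csv("https://remiller1450.github.io/data/beans.csv")

# Separate the outcome from the 16 predictors (Part B)
X = beans.drop(columns="Class")
y = beans["Class"]

# 90-10 training-testing split (Part A)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.1, random_state=1
)

# Histogram of every predictor (Part B)
train_X.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()
```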
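For Part C, one way to satisfy all four guidelines is a Pipeline whose steps are swapped by the grid: re-scaling/transformation options are explored for KNN, while the decision trees see only "passthrough". The particular scalers, values of \(k\), and maximum depths below are placeholders; choose ones consistent with your Part B assessment.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Pipeline with a swappable pre-processing step and a swappable model
pipe = Pipeline([("prep", "passthrough"), ("model", KNeighborsClassifier())])

# Two separate grids, so decision trees are never paired with re-scaling
param_grid = [
    {
        "prep": [StandardScaler(), PowerTransformer()],  # placeholder options
        "model": [KNeighborsClassifier()],
        "model__n_neighbors": [5, 15, 25],
        "model__weights": ["uniform", "distance"],
    },
    {
        "prep": ["passthrough"],
        "model": [DecisionTreeClassifier(random_state=1)],
        "model__max_depth": [3, 6, 9],
    },
]

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(train_X, train_y)

# Data frame showing the top 5 best-performing combinations
results = pd.DataFrame(search.cv_results_)
print(results.sort_values("rank_test_score").head(5)[["params", "mean_test_score"]])
```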
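Finally, for Parts D and E, assuming search is the fitted grid search from the previous sketch:

```python
from sklearn.metrics import ConfusionMatrixDisplay, f1_score

# Visualize the test-set confusion matrix for the best estimator (Part D)
ConfusionMatrixDisplay.from_estimator(search.best_estimator_, test_X, test_y)
plt.show()

# Macro- and micro-averaged F1-scores on the test data (Part E)
test_preds = search.best_estimator_.predict(test_X)
print("Macro F1:", f1_score(test_y, test_preds, average="macro"))
print("Micro F1:", f1_score(test_y, test_preds, average="micro"))
```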