Directions:
- Homework must be completed individually
- Please submit a single Jupyter Notebook containing your responses
via P-web. Use markdown chunks to format the assignment and record
responses to questions that involve written answers.
\(~\)
Question #1 (\(k\)-Nearest Neighbors)
Shown below is a simple training data set consisting of 6
observations, 3 predictors, and a categorical outcome:
| Observation | X1 | X2 | X3 | Y |
|-------------|----|----|----|-------|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Green |
| 5 | -1 | 0 | 1 | Green |
| 6 | 1 | 1 | 1 | Red |
Suppose we're interested in using \(k\)-nearest neighbors to predict the outcome of a test data-point at \(\{X_1=0, X_2=0, X_3=0\}\).
You should answer the following questions using a calculator or basic Python functions (i.e., addition, subtraction, powers, roots, etc.). You should not use any functions in `sklearn`. Additionally, you should not perform any standardization/scaling when answering Parts A - D.
- Part A: Calculate the Euclidean distance between each observation and the test data-point of \(\{X_1=0, X_2=0, X_3=0\}\).
- Part B: What is the predicted class of
\(Y\) for the test data-point if
uniform weighting and \(k=1\) (one
neighbor) are used? Why?
- Part C: What is the predicted probability
that \(Y=\text{Green}\) if uniform
weighting and \(k=3\) (three
neighbors) are used? Why?
- Part D: Using \(k=3\), will the predicted
probability of \(Y=\text{Green}\)
be higher if distance weighting is used instead of uniform
weighting? Briefly explain (you do not need to perform the
calculation).
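Distances of this size can be checked with nothing beyond basic Python. The sketch below is a generic helper for the Euclidean distance \(\sqrt{\sum_j (p_j - q_j)^2}\); the example point is made up for illustration and is not one of the homework rows:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative example (not a homework observation):
print(euclidean((1, 2, 2), (0, 0, 0)))  # sqrt(1 + 4 + 4) = 3.0
```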
Now consider these same data after re-scaling:
| Observation | X1 | X2 | X3 | Y |
|-------------|-----------|-----------|-----------|-------|
| 1 | 0.3333333 | 1.0000000 | 0.0000000 | Red |
| 2 | 1.0000000 | 0.0000000 | 0.0000000 | Red |
| 3 | 0.3333333 | 0.3333333 | 1.0000000 | Red |
| 4 | 0.3333333 | 0.3333333 | 0.6666667 | Green |
| 5 | 0.0000000 | 0.0000000 | 0.3333333 | Green |
| 6 | 0.6666667 | 0.3333333 | 0.3333333 | Red |
- Part E: Calculate the Euclidean distance between each observation and the test data-point \(\{X_1=0, X_2=0, X_3=0\}\).
- Part F: What is the predicted probability
of \(Y=\text{Green}\) using uniform
weighting and \(k=3\) (three
neighbors)? Why?
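The rescaled values above appear consistent with min-max scaling applied column-wise, i.e. \((x - \min)/(\max - \min)\). A minimal sketch of that transformation (assuming min-max scaling, which matches the table; e.g. for the X1 column, \((0 - (-1))/(2 - (-1)) = 0.3333333\)):

```python
def min_max(column):
    """Rescale a list of values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

x1 = [0, 2, 0, 0, -1, 1]  # the X1 column from the original (unscaled) table
print([round(v, 7) for v in min_max(x1)])
# matches the rescaled X1 column shown above
```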
\(~\)
Question #2 (Decision Tree concepts)
Consider a small data set containing six observations of two
predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):
| Observation | X1 | X2 | Y |
|-------------|----|----|-------|
| 1 | 0 | 3 | Red |
| 2 | 2 | 0 | Red |
| 3 | 0 | 2 | Red |
| 4 | 0 | 1 | Green |
| 5 | -1 | 0 | Green |
| 6 | 1 | 1 | Red |
- Part A: Calculate the Gini impurity of these data
before any splitting rules/models are applied.
- Part B: Consider the splitting rule \(X_1 \leq -1\). What is the Gini gain
resulting from this split?
- Part C: Now consider the splitting rule \(X_1 \leq 0\). What is the Gini gain
resulting from this split?
- Part D: If only \(X_1\) is considered, are there any
splitting rules that will lead to a larger Gini gain than the ones
stated in Parts B and C? Briefly explain.
- Part E: If only \(X_2\) is considered, what is the splitting
rule that will produce the best Gini gain? State the rule and the Gini
gain it produces.
- Part F: If the `DecisionTreeClassifier()` function were fit using \(X_1\) and \(X_2\) as predictors of \(Y\), what would be the first splitting rule in the tree? Justify your answer without actually fitting the function to these data.
- Part G: Considering both predictors, is it possible
for a decision tree with maximum depth of two to perfectly classify
these data if the first splitting rule is \(X_1 \leq -1\)? What about if the first
splitting rule is \(X_1 \leq 0\)?
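As a reminder, the Gini impurity of a node with class proportions \(p_k\) is \(1 - \sum_k p_k^2\). A minimal sketch using only the standard library (the example labels are generic, not the homework data):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Red", "Red", "Red"]))  # a pure node: 0.0
print(gini(["Red", "Green"]))       # a maximally impure two-class node: 0.5
```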
\(~\)
Question #4 (Application, Pipelines)
For this question you should use the dataset available here: https://remiller1450.github.io/data/beans.csv

This data set was constructed using a computer vision system that segmented 13,611 images of 7 types of dry beans, extracting 16 features (12 dimensional features, 4 shape-form features). Additional details are contained in this paper.
The following questions should be answered using Python code, including functions in the `sklearn` library. Unless otherwise indicated, use the `'accuracy'` scoring criterion during model tuning and evaluation.
- Part A: Read these data into Python and perform a 90-10 training-testing split using `random_state=1`.
- Part B: Separate the outcome, "Class", from the predictors and graph a histogram of every predictor. Based upon these histograms, do you think that re-scaling and/or transformation should be part of a data preparation pipeline? You should assume that \(k\)-nearest neighbors is one of several models that will be considered.
- Part C: Use the `corrcoef()` function in `numpy` to explore the pairwise correlations between predictors. Based upon these correlations, do you think dimension reduction via principal component analysis should be considered as part of your data preparation pipeline?
- Part D: Create a machine learning pipeline that includes the data preparation steps you deemed important in Parts B and C. Then, perform a grid search using 5-fold cross-validation to find a well-fitting \(k\)-nearest neighbors model. Your search should explore at least two variations of your data preparation steps (i.e., two different scalers, or two different numbers of retained principal components), at least three values of \(k\), both Euclidean and Manhattan distance, and both uniform and distance weighting. Report the hyperparameters of the best KNN model.
- Part E: Repeat the same basic steps of Part D to find a well-fitting decision tree model. Your search should explore at least two variations of your data preparation steps (i.e., two different scalers, or two different numbers of retained principal components), at least three different maximum depths, and at least two values of the minimum number of samples required to split a node. Report the hyperparameters of the best decision tree model.
- Part F: Repeat the same basic steps of Part D to find a well-fitting support vector machine. Your search should explore at least two variations of your data preparation steps (i.e., two different scalers, or two different numbers of retained principal components) and at least three different kernel types. You may tune any other hyperparameters as desired. Report the hyperparameters of the best-fitting support vector machine.
- Part G: Use a pipeline to choose between the best
approaches identified in Parts D, E and F. If two approaches have
exactly equal performance you may choose either of them.
- Part H: Create a visualization of the confusion
matrix for your best estimator from Part G that displays
classification results for the test data. Report the most common
type of misclassification made by the model.
- Part I: Report both the macro-averaged and micro-averaged F1-scores of the classifier on the test data. Which of these approaches (macro or micro averaging) do you believe is more appropriate for this application? Or are both approaches reasonable? Briefly explain.
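For orientation, the kind of pipeline and grid search that Parts D-F describe can be sketched as follows. This is a generic template, not an answer key: the preprocessing choices, grid values, and component counts below are placeholders you would replace with your own decisions from Parts B and C.

```python
# Sketch of a preprocessing + KNN pipeline tuned via 5-fold grid search.
# All grid values here are illustrative placeholders, not recommended settings.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("model", KNeighborsClassifier()),
])

param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],  # two preparation variations
    "pca__n_components": [5, 10],                  # two PCA variations
    "model__n_neighbors": [3, 5, 7],               # at least three values of k
    "model__metric": ["euclidean", "manhattan"],
    "model__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
# search.fit(train_X, train_y)   # then inspect search.best_params_
```

Swapping the `"model"` step for a `DecisionTreeClassifier` or `SVC` (with the corresponding `model__...` grid keys) adapts the same template to Parts E and F.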