Directions:
- Homework must be completed individually
- Please type your responses, clearly separating each question and
sub-question (A, B, C, etc.)
- You may type your written answers using Markdown chunks in a Jupyter Notebook, or you may use any word processing software and submit your Python code separately
- Questions that require Python coding should include all commands
used to reach the answer, nothing more and nothing less
- Submit your work via P-web
Question #1
The table below provides a training data set consisting of 6
observations, 3 predictors, and a categorical outcome:
| Observation | X1 | X2 | X3 | Y     |
|-------------|----|----|----|-------|
| 1           | 0  | 3  | 0  | Red   |
| 2           | 2  | 0  | 0  | Red   |
| 3           | 0  | 1  | 3  | Red   |
| 4           | 0  | 1  | 2  | Green |
| 5           | -1 | 0  | 1  | Green |
| 6           | 1  | 1  | 1  | Red   |
Suppose we’re interested in using \(k\)-nearest neighbors to predict the outcome of a test data-point at \(\{X_1=0, X_2=0, X_3=0\}\).
You should answer the following questions using a calculator or basic Python functions. You should not use any functions in `sklearn`. Additionally, you do not need to perform any standardization/scaling.
- Calculate the Euclidean distance between each observation and the test data-point of \(\{X_1=0, X_2=0, X_3=0\}\) (a short Python sketch follows this list).
- What is the predicted class of \(Y\) for the test data-point if uniform
weighting and \(k=1\) (one neighbor)
are used? Why?
- What is the predicted probability that \(Y=\text{Green}\) if uniform
weighting and \(k=3\) (three
neighbors) are used? Why?
- Using \(k=3\), will the
predicted probability of \(Y=\text{Green}\) be higher if distance
weighting is used instead of uniform weighting? Briefly
explain (you do not need to perform the calculation).
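For Part A, a minimal sketch using only basic Python (no `sklearn`) is shown below; the hard-coded tuples simply restate the training observations from the table above.

```python
import math

# Training observations (X1, X2, X3), copied from the table above, and the test point
train = [(0, 3, 0), (2, 0, 0), (0, 1, 3), (0, 1, 2), (-1, 0, 1), (1, 1, 1)]
test = (0, 0, 0)

# Euclidean distance: square root of the sum of squared coordinate differences
for i, obs in enumerate(train, start=1):
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(obs, test)))
    print(f"Observation {i}: distance = {d:.4f}")
```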
\(~\)
Question #2
The table below displays the same data described in Question #1 after
a scaling procedure has been applied:
| Observation | X1        | X2        | X3        | Y     |
|-------------|-----------|-----------|-----------|-------|
| 1           | 0.3333333 | 1.0000000 | 0.0000000 | Red   |
| 2           | 1.0000000 | 0.0000000 | 0.0000000 | Red   |
| 3           | 0.3333333 | 0.3333333 | 1.0000000 | Red   |
| 4           | 0.3333333 | 0.3333333 | 0.6666667 | Green |
| 5           | 0.0000000 | 0.0000000 | 0.3333333 | Green |
| 6           | 0.6666667 | 0.3333333 | 0.3333333 | Red   |
Suppose we’re interested in using \(k\)-nearest neighbors to predict an outcome for \(\{X_1=0, X_2=0, X_3=0\}\), and assume this new observation has already been appropriately scaled.
You should answer the following questions using a calculator or basic Python functions. You should not use any functions in `sklearn`.
- Identify the scaling method that was used. Briefly explain how you made this determination (a sketch of one candidate method follows this list).
- Calculate the Euclidean distance between each observation and the test data-point \(\{X_1=0, X_2=0, X_3=0\}\).
- What is the predicted probability of \(Y=\text{Green}\) using uniform
weighting and \(k=3\) (three
neighbors)? Why?
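As one candidate to test for Part A, the sketch below implements min-max scaling on a single column; the function name and the example column are illustrative, and you can compare its output against the table above to evaluate your hypothesis.

```python
# Min-max scaling maps x to (x - min) / (max - min), so each column spans [0, 1]
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

# X1 from the unscaled table in Question #1; compare the output to the X1 column above
x1 = [0, 2, 0, 0, -1, 1]
print(min_max_scale(x1))
```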
\(~\)
Question #3
Consider an application where machine learning is used to predict
loan default (yes or no) based upon information relating to the loan’s
parameters, the customer’s income, and the customer’s credit history.
Suppose a sample of 10,000 loans is obtained as a training set, where
333 of these loans ended in default and the remaining 9667 were repaid
in full. Further, suppose the institution’s greatest concern is
identifying the loans that end in default.
For the questions that follow, denote “yes” as the positive class and
“no” as the negative class.
- Suppose a machine learning model yields 195 true positives and 9432
true negatives when applied to the training data. Construct a confusion
matrix summarizing the performance of this model on the training data.
Label the rows and columns of the matrix using “yes” and “no”, and
follow the organizational conventions used in our lecture slides.
- Calculate classification accuracy from the confusion matrix
you found in Part A. Show your work.
- Calculate balanced accuracy from the confusion matrix you
found in Part A. Show your work.
- Calculate the F1-score from the confusion matrix you found in Part A. Show your work (a sketch computing these metrics from placeholder counts follows this list).
- Of the three performance metrics you calculated in Parts B-D, which
would you recommend as most appropriate for this application?
Briefly explain.
- Of the three performance metrics you calculated in Parts B-D, which
would you deem as least appropriate for this application?
Briefly explain.
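As a reference for Parts B-D, the sketch below computes all three metrics from generic confusion-matrix counts; the counts used here are placeholders, not the answer to Part A.

```python
# Placeholder counts; substitute the values from your Part A confusion matrix
tp, fp, fn, tn = 50, 10, 5, 100

# Classification accuracy: proportion of all predictions that are correct
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Balanced accuracy: average of sensitivity (recall) and specificity
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2

# F1-score: harmonic mean of precision and recall
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(accuracy, balanced_accuracy, f1)
```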
\(~\)
Question #4
For this question you should use the dataset available here:
https://remiller1450.github.io/data/beans.csv
This data set was constructed using a computer vision system that segmented 13,611 images of 7 types of dry beans, extracting 16 features (12 dimension features and 4 shape-form features). Additional details are contained in this paper.
The following questions should be answered using Python code, including functions in the `sklearn` library.
- Read these data into Python and perform a 90-10 training-testing split with `random_state=1`.
- Separate the outcome, “Class”, from the predictors and graph a
histogram of every predictor.
- Create a pipeline and perform 20 iterations of random search using 5-fold cross-validation to find a well-fitting \(k\)-nearest neighbors model. Your search should consider at least two different choices of scaling functions, values of \(k\) sampled from a Poisson distribution with a mean of 50, Euclidean or Manhattan distance, and uniform or distance weighting. Report the best estimator (see the workflow sketch after this list).
- Create a visualization of the confusion matrix for your best
estimator using the test data. Report the most common type of
misclassification made by the model.
- Report both the macro-averaged and micro-averaged F1-scores of the classifier on the test data. Which of these approaches (macro or micro averaging) do you believe is more appropriate for this application? Or are both approaches reasonable? Briefly explain.
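For orientation, a minimal sketch of one possible end-to-end workflow is given below. It relies only on details stated in the question (the data URL, the "Class" outcome column, and `random_state=1` for the split); the particular scalers, the search's `random_state`, and the plotting details are assumptions you should adapt.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import poisson
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay, f1_score

# Part A: read the data and perform a 90-10 training-testing split
beans = pd.read_csv("https://remiller1450.github.io/data/beans.csv")
X = beans.drop("Class", axis=1)  # Part B: separate the outcome from the predictors
y = beans["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

# Part B: histogram of every predictor
X_train.hist(figsize=(12, 10))
plt.show()

# Part C: pipeline plus 20 iterations of random search with 5-fold cross-validation
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_dist = {
    "scaler": [StandardScaler(), MinMaxScaler()],  # two choices of scaling function
    "knn__n_neighbors": poisson(mu=50),            # k sampled from Poisson(50)
    "knn__p": [1, 2],                              # 1 = Manhattan, 2 = Euclidean
    "knn__weights": ["uniform", "distance"],
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5, random_state=1)
search.fit(X_train, y_train)
print(search.best_estimator_)

# Part D: visualize the confusion matrix for the best estimator on the test data
ConfusionMatrixDisplay.from_estimator(search.best_estimator_, X_test, y_test)
plt.show()

# Part E: macro- and micro-averaged F1-scores on the test data
preds = search.best_estimator_.predict(X_test)
print(f1_score(y_test, preds, average="macro"), f1_score(y_test, preds, average="micro"))
```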