Directions:

Question #1

The table below provides a training data set consisting of 6 observations, 3 predictors, and a categorical outcome:

Observation    X1    X2    X3    Y
          1     0     3     0    Red
          2     2     0     0    Red
          3     0     1     3    Red
          4     0     1     2    Green
          5    -1     0     1    Green
          6     1     1     1    Red

Suppose we’re interested in using \(k\)-nearest neighbors to predict the outcome of a test data-point at \(\{X_1=0, X_2=0, X_3=0\}\).

You should answer the following questions using a calculator or basic Python functions. You should not use any functions in sklearn. Additionally, you do not need to perform any standardization/scaling.

  1. Calculate the Euclidean distance between each observation and the test data-point of \(\{X_1=0, X_2=0, X_3=0\}\) (a short sketch of this calculation in basic Python appears after this list).
  2. What is the predicted class of \(Y\) for the test data-point if uniform weighting and \(k=1\) (one neighbor) are used? Why?
  3. What is the predicted probability that \(Y=\text{Green}\) if uniform weighting and \(k=3\) (three neighbors) are used? Why?
  4. Using \(k=3\), will the predicted probability of \(Y=\text{Green}\) be higher if distance weighting is used instead of uniform weighting? Briefly explain (you do not need to perform the calculation).
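
Since sklearn is off limits here, below is a minimal sketch of how the distance calculation in Part 1 could be set up with basic Python; the data are copied from the table above, and the helper function is simply one way to organize the computation.

```python
import math

# Training predictors (X1, X2, X3) and labels, copied from the table above
X_train = [(0, 3, 0), (2, 0, 0), (0, 1, 3), (0, 1, 2), (-1, 0, 1), (1, 1, 1)]
y_train = ["Red", "Red", "Red", "Green", "Green", "Red"]
test_point = (0, 0, 0)

# Euclidean distance: square root of the summed squared coordinate differences
def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

for i, x in enumerate(X_train, start=1):
    print(f"Observation {i}: distance = {euclidean(x, test_point):.3f}")
```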

\(~\)

Question #2

The table below displays the same data described in Question #1 after a scaling procedure has been applied:

Observation          X1          X2          X3    Y
          1   0.3333333   1.0000000   0.0000000    Red
          2   1.0000000   0.0000000   0.0000000    Red
          3   0.3333333   0.3333333   1.0000000    Red
          4   0.3333333   0.3333333   0.6666667    Green
          5   0.0000000   0.0000000   0.3333333    Green
          6   0.6666667   0.3333333   0.3333333    Red

Suppose we’re interested in using \(k\)-nearest neighbors to predict an outcome for \(\{X_1=0, X_2=0, X_3=0\}\), and assume this new observation has already been appropriately scaled.

You should answer the following questions using a calculator or basic Python functions. You should not use any functions in sklearn.

  1. Identify the scaling method that was used. Briefly explain how you made this determination.
  2. Calculate the Euclidean distance between each observation and the test data-point \(\{X_1=0, X_2=0, X_3=0\}\).
  3. What is the predicted probability of \(Y=\text{Green}\) using uniform weighting and \(k=3\) (three neighbors)? Why? (A generic sketch of this type of calculation appears after this list.)
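
Once the distances in Part 2 are available, a small generic sketch of how a class probability could be assembled from the \(k\) nearest neighbors is shown below; the function is illustrative, and the values in the usage comment are placeholders rather than answers.

```python
# Illustrative helper: predicted probability of a target class from the k
# nearest neighbors, under either uniform or distance weighting.
def knn_probability(distances, labels, target_class, k, weighting="uniform"):
    neighbors = sorted(zip(distances, labels))[:k]        # keep the k closest
    if weighting == "uniform":
        weights = [1.0] * k                               # each neighbor counts equally
    else:
        weights = [1.0 / d for d, _ in neighbors]         # closer neighbors count more
    matching = sum(w for w, (_, lab) in zip(weights, neighbors) if lab == target_class)
    return matching / sum(weights)

# Hypothetical usage with placeholder distances (not the answer to Part 2):
# knn_probability([1.2, 0.8, 2.0, 1.5, 0.6, 1.1],
#                 ["Red", "Red", "Red", "Green", "Green", "Red"],
#                 "Green", k=3, weighting="uniform")
```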

\(~\)

Question #3

Consider an application where machine learning is used to predict loan default (yes or no) based upon information relating to the loan’s parameters, the customer’s income, and the customer’s credit history. Suppose a sample of 10,000 loans is obtained as a training set, where 333 of these loans ended in default and the remaining 9667 were repaid in full. Further, suppose the institution’s greatest concern is identifying the loans that end in default.

For the questions that follow, denote “yes” as the positive class and “no” as the negative class.

  1. Suppose a machine learning model yields 195 true positives and 9432 true negatives when applied to the training data. Construct a confusion matrix summarizing the performance of this model on the training data. Label the rows and columns of the matrix using “yes” and “no”, and follow the organizational conventions used in our lecture slides.
  2. Calculate classification accuracy from the confusion matrix you found in Part A. Show your work.
  3. Calculate balanced accuracy from the confusion matrix you found in Part A. Show your work.
  4. Calculate the F1-score from the confusion matrix you found in Part A. Show your work. (Generic formulas for the metrics in Parts B-D are sketched after this list.)
  5. Of the three performance metrics you calculated in Parts B-D, which would you recommend as most appropriate for this application? Briefly explain.
  6. Of the three performance metrics you calculated in Parts B-D, which would you deem as least appropriate for this application? Briefly explain.
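
For Parts B-D, here is a generic sketch of the three metric formulas written in terms of the confusion-matrix counts (TP, FP, FN, TN), with “yes” treated as the positive class; the counts from Part A can be plugged into these.

```python
# Classification accuracy: proportion of all loans classified correctly
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Balanced accuracy: average of the true positive and true negative rates
def balanced_accuracy(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    return (sensitivity + specificity) / 2

# F1-score: harmonic mean of precision and recall
def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```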

\(~\)

Question #4

For this question you should use the dataset available here:

https://remiller1450.github.io/data/beans.csv

This data set was constructed using a computer vision system that segmented 13,611 images of 7 types of dry beans, extracting 16 features (12 dimensional features and 4 shape form features). Additional details are contained in this paper.

The following questions should be answered using Python code, including functions in the sklearn library.

  1. Read these data into Python and perform a 90-10 training-testing split with random_state=1.
  2. Separate the outcome, “Class”, from the predictors and graph a histogram of every predictor.
  3. Create a pipeline and perform 20 iterations of random search using 5-fold cross-validation to find a well-fitting \(k\)-nearest neighbors model. Your search should consider at least two different choices of scaling functions, values of \(k\) sampled from a Poisson distribution with a mean of 50, Euclidean or Manhattan distance, and uniform or distance weighting. Report the best estimator. (One possible implementation of Parts 1-5 is sketched after this list.)
  4. Create a visualization of the confusion matrix for your best estimator using the test data. Report the most common type of misclassification made by the model.
  5. Report both the macro-averaged and micro-averaged F1-scores of the classifier on the test data. Which of these approaches (macro or micro averaging) do you believe is more appropriate for this application? Or are both approaches reasonable? Briefly explain.
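
One possible end-to-end sketch for Parts 1-5 is given below. It assumes the outcome column in beans.csv is named "Class" (as stated in Part 2), that scipy and matplotlib are available, and that a reasonably recent version of sklearn is installed (for ConfusionMatrixDisplay.from_estimator); the particular scalers shown and the random_state used inside the search are illustrative choices rather than requirements.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import poisson
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import ConfusionMatrixDisplay, f1_score

# Part 1: read the data and create a 90-10 training-testing split
beans = pd.read_csv("https://remiller1450.github.io/data/beans.csv")
train, test = train_test_split(beans, test_size=0.1, random_state=1)

# Part 2: separate the outcome from the predictors and plot histograms
train_X, train_y = train.drop("Class", axis=1), train["Class"]
test_X, test_y = test.drop("Class", axis=1), test["Class"]
train_X.hist(figsize=(12, 10))
plt.show()

# Part 3: pipeline + 20 iterations of random search with 5-fold CV
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_dist = {
    "scaler": [StandardScaler(), MinMaxScaler()],  # two scaling choices
    "knn__n_neighbors": poisson(mu=50),            # k sampled from a Poisson(50)
    "knn__metric": ["euclidean", "manhattan"],
    "knn__weights": ["uniform", "distance"],
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5, random_state=1)
search.fit(train_X, train_y)
print(search.best_estimator_)

# Part 4: visualize the confusion matrix on the test data
ConfusionMatrixDisplay.from_estimator(search.best_estimator_, test_X, test_y)
plt.show()

# Part 5: macro- and micro-averaged F1-scores on the test data
preds = search.best_estimator_.predict(test_X)
print("Macro F1:", f1_score(test_y, preds, average="macro"))
print("Micro F1:", f1_score(test_y, preds, average="micro"))
```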