Directions:

\(~\)

Question #1 (\(k\)-Nearest Neighbors)

Shown below is a simple training data set consisting of 6 observations, 3 predictors, and a categorical outcome:

Observation   X1   X2   X3   Y
1              0    3    0   Red
2              2    0    0   Red
3              0    1    3   Red
4              0    1    2   Green
5             -1    0    1   Green
6              1    1    1   Red

Suppose we’re interested in using \(k\)-nearest neighbors to predict the outcome of a test data point at \(\{X_1=0, X_2=0, X_3=0\}\).

You should answer the following questions using a calculator or basic Python functions (i.e., addition, subtraction, powers, roots, etc.). You should not use any functions in sklearn. Additionally, you should not perform any standardization or scaling when answering Parts A-D.
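As a rough illustration of the "basic Python functions" approach, the sketch below computes the distance from each training observation to the test point and takes a majority vote among the \(k\) nearest. It assumes Euclidean distance, which the question does not state explicitly; the helper names `euclidean` and `knn_predict` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

# Training observations copied from the table above: (X1, X2, X3, Y)
train = [
    (0, 3, 0, "Red"),
    (2, 0, 0, "Red"),
    (0, 1, 3, "Red"),
    (0, 1, 2, "Green"),
    (-1, 0, 1, "Green"),
    (1, 1, 1, "Red"),
]
test_point = (0, 0, 0)


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples (assumed metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# (distance, observation number, label), sorted from nearest to farthest
distances = sorted(
    (euclidean(row[:3], test_point), obs, row[3])
    for obs, row in enumerate(train, start=1)
)


def knn_predict(k):
    """Majority vote among the k nearest observations.

    Ties go to the label encountered first among the nearest neighbors.
    """
    labels = [label for _, _, label in distances[:k]]
    return Counter(labels).most_common(1)[0][0]


for d, obs, label in distances:
    print(f"Observation {obs}: distance = {d:.3f}, Y = {label}")
```

Calling `knn_predict(k)` with different values of `k` shows how the prediction can change as more neighbors are included in the vote.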

Now consider these same data after re-scaling:

Observation   X1          X2          X3          Y
1             0.3333333   1.0000000   0.0000000   Red
2             1.0000000   0.0000000   0.0000000   Red
3             0.3333333   0.3333333   1.0000000   Red
4             0.3333333   0.3333333   0.6666667   Green
5             0.0000000   0.0000000   0.3333333   Green
6             0.6666667   0.3333333   0.3333333   Red
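The rescaled values above are consistent with min-max scaling, where each predictor is mapped to \((x - \min)/(\max - \min)\) using that column's minimum and maximum in the training data. The question does not name the method, so treat this as an inference from the numbers. A minimal sketch under that assumption, using only basic Python:

```python
# Unscaled predictors (X1, X2, X3) for observations 1-6
raw = [
    (0, 3, 0),
    (2, 0, 0),
    (0, 1, 3),
    (0, 1, 2),
    (-1, 0, 1),
    (1, 1, 1),
]

# Column-wise minimum and maximum, computed from the training data only
columns = list(zip(*raw))
mins = [min(col) for col in columns]
maxs = [max(col) for col in columns]


def min_max_scale(row):
    """Apply (x - min) / (max - min) to each coordinate of one observation."""
    return tuple((x - lo) / (hi - lo) for x, lo, hi in zip(row, mins, maxs))


scaled = [min_max_scale(row) for row in raw]

# Under this assumption the test point must be transformed with the same
# training minimums and maximums before any distances are computed.
scaled_test_point = min_max_scale((0, 0, 0))

for obs, row in enumerate(scaled, start=1):
    print(obs, " ".join(f"{value:.7f}" for value in row))
```

One consequence worth noticing is that the test point at \(\{0, 0, 0\}\) does not necessarily stay at the origin after rescaling, because \(X_1\) has a negative minimum in the training data.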

\(~\)

Question #2 (Decision Tree concepts)

Consider a small data set containing six observations of two predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):

Observation   X1   X2   Y
1              0    3   Red
2              2    0   Red
3              0    2   Red
4              0    1   Green
5             -1    0   Green
6              1    1   Red
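When reasoning about candidate splits on a data set this small, it can help to compute node impurity by hand. The sketch below assumes Gini impurity as the splitting criterion, which this question does not specify; the functions `gini` and `weighted_gini` and the threshold \(X_1 < 0.5\) are hypothetical choices made only for illustration.

```python
from collections import Counter

# Observations copied from the table above: (X1, X2, Y)
data = [
    (0, 3, "Red"),
    (2, 0, "Red"),
    (0, 2, "Red"),
    (0, 1, "Green"),
    (-1, 0, "Green"),
    (1, 1, "Red"),
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, feature_index, threshold):
    """Size-weighted Gini impurity of the two children of a split.

    Rows with feature value < threshold go left; all others go right.
    """
    left = [row[-1] for row in rows if row[feature_index] < threshold]
    right = [row[-1] for row in rows if row[feature_index] >= threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


root_impurity = gini([row[-1] for row in data])
split_impurity = weighted_gini(data, feature_index=0, threshold=0.5)
print(f"Root Gini impurity: {root_impurity:.4f}")
print(f"Weighted Gini after splitting on X1 < 0.5: {split_impurity:.4f}")
```

Comparing `weighted_gini` across several thresholds on \(X_1\) and \(X_2\) gives a sense of which first split a greedy tree-growing procedure would prefer under this criterion.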

\(~\)

Question #3 (Performance Metrics)

Consider an application where machine learning is used to predict loan default (yes or no) based upon information relating to the loan’s parameters, the customer’s income, and the customer’s credit history. Suppose a sample of 10,000 loans is obtained as a training set, where 333 of these loans ended in default and the remaining 9667 were repaid in full. Further, suppose the institution’s greatest concern is identifying loans that are likely to end in default.

For the questions that follow, denote “yes” as the positive class and “no” as the negative class.
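Because only 333 of the 10,000 loans are in the positive class, accuracy alone can be misleading here. The sketch below works through the confusion-matrix arithmetic for a hypothetical baseline that predicts "no" for every loan; this baseline is not part of the question and is introduced only to illustrate how accuracy and recall can diverge on imbalanced data.

```python
# Class counts taken from the problem statement
n_default = 333    # positive class ("yes")
n_repaid = 9667    # negative class ("no")
n_total = n_default + n_repaid

# Hypothetical baseline: predict "no" (negative) for every loan
true_positives = 0
false_negatives = n_default   # every actual default is missed
true_negatives = n_repaid     # every repaid loan is correctly labeled
false_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall (sensitivity): {recall:.4f}")
```

The baseline looks strong by accuracy yet identifies none of the defaults, which is exactly the outcome the institution cares most about.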

\(~\)

Question #4 (Application, Pipelines)

For this question you should use the dataset available here:

https://remiller1450.github.io/data/beans.csv

This data set was constructed using a computer vision system that segmented 13,611 images of 7 types of dry beans, extracting 16 features (12 dimensional features, 4 shape-form features). Additional details are contained in this paper

The following questions should be answered using Python code, including functions in the sklearn library. Unless otherwise indicated, use the 'accuracy' scoring criterion during model tuning and evaluation.
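A possible starting point is sketched below: a pipeline that standardizes the predictors and then fits a classifier, evaluated by cross-validated accuracy. Several details are assumptions rather than requirements of the question: the label column is assumed to be named `Class` (check the CSV header), and the choice of `StandardScaler`, `KNeighborsClassifier`, and 5 folds is illustrative only. The helper names `load_beans`, `build_pipeline`, and `cv_accuracy` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

DATA_URL = "https://remiller1450.github.io/data/beans.csv"
LABEL_COLUMN = "Class"  # assumption: verify against the actual CSV header


def load_beans(url=DATA_URL, label_column=LABEL_COLUMN):
    """Read the CSV and split it into predictors X and outcome y."""
    beans = pd.read_csv(url)
    y = beans[label_column]
    X = beans.drop(columns=[label_column])
    return X, y


def build_pipeline(n_neighbors=5):
    """Scaling happens inside the pipeline, so it is re-fit within each CV fold."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=n_neighbors)),
    ])


def cv_accuracy(X, y, n_neighbors=5, folds=5):
    """Mean cross-validated accuracy of the pipeline."""
    scores = cross_val_score(
        build_pipeline(n_neighbors), X, y, cv=folds, scoring="accuracy"
    )
    return scores.mean()
```

Typical usage would be `X, y = load_beans()` followed by `cv_accuracy(X, y)`. Keeping the scaler inside the pipeline avoids leaking information from the held-out folds into the preprocessing step.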