Directions:
- Homework must be completed individually
- Please submit a single Jupyter Notebook containing your responses
via P-web. Use markdown chunks to format the assignment and record
responses to questions that involve written answers.
\(~\)
Question #1 (\(k\)-Nearest Neighbors)
Shown below is a simple training data set consisting of 6
observations, 3 predictors, and a categorical outcome:
| Observation | X1 | X2 | X3 | Y |
|-------------|----|----|----|-------|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Green |
| 5 | -1 | 0 | 1 | Green |
| 6 | 1 | 1 | 1 | Red |
Suppose we're interested in using \(k\)-nearest neighbors to predict the outcome of a test data-point at \(\{X_1=0, X_2=0, X_3=0\}\).
You should answer the following questions using a calculator or basic Python functions (i.e., addition, subtraction, powers, roots, etc.). You should not use any functions in `sklearn`. Additionally, you should not perform any standardization/scaling when answering Parts A - D.
- Part A: Calculate the Euclidean distance between each observation and the test data-point of \(\{X_1=0, X_2=0, X_3=0\}\).
- Part B: What is the predicted class of
\(Y\) for the test data-point if
uniform weighting and \(k=1\) (one
neighbor) are used? Why?
- Part C: What is the predicted probability
that \(Y=\text{Green}\) if uniform
weighting and \(k=3\) (three
neighbors) are used? Why?
- Part D: Using \(k=3\), will the predicted
probability of \(Y=\text{Green}\)
be higher if distance weighting is used instead of uniform
weighting? Briefly explain (you do not need to perform the
calculation).
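Distances of this size can be checked with nothing beyond basic Python. The sketch below is a generic helper for the Euclidean distance \(\sqrt{\sum_j (p_j - q_j)^2}\); the example point is made up for illustration and is not one of the homework rows:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative example (not a homework observation):
print(euclidean((1, 2, 2), (0, 0, 0)))  # sqrt(1 + 4 + 4) = 3.0
```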
Now consider these same data after re-scaling:
| Observation | X1 | X2 | X3 | Y |
|-------------|-----------|-----------|-----------|-------|
| 1 | 0.3333333 | 1.0000000 | 0.0000000 | Red |
| 2 | 1.0000000 | 0.0000000 | 0.0000000 | Red |
| 3 | 0.3333333 | 0.3333333 | 1.0000000 | Red |
| 4 | 0.3333333 | 0.3333333 | 0.6666667 | Green |
| 5 | 0.0000000 | 0.0000000 | 0.3333333 | Green |
| 6 | 0.6666667 | 0.3333333 | 0.3333333 | Red |
- Part E: Calculate the Euclidean distance between each observation and the test data-point \(\{X_1=0, X_2=0, X_3=0\}\).
- Part F: What is the predicted probability
of \(Y=\text{Green}\) using uniform
weighting and \(k=3\) (three
neighbors)? Why?
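The rescaled values above appear consistent with min-max scaling applied column-wise, i.e. \((x - \min)/(\max - \min)\). A minimal sketch of that transformation (assuming min-max scaling, which matches the table; e.g. for the X1 column, \((0 - (-1))/(2 - (-1)) = 0.3333333\)):

```python
def min_max(column):
    """Rescale a list of values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

x1 = [0, 2, 0, 0, -1, 1]  # the X1 column from the original (unscaled) table
print([round(v, 7) for v in min_max(x1)])
# matches the rescaled X1 column shown above
```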
\(~\)
Question #2 (Decision Tree concepts)
Consider a small data set containing six observations of two
predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):
| Observation | X1 | X2 | Y |
|-------------|----|----|-------|
| 1 | 0 | 3 | Red |
| 2 | 2 | 0 | Red |
| 3 | 0 | 2 | Red |
| 4 | 0 | 1 | Green |
| 5 | -1 | 0 | Green |
| 6 | 1 | 1 | Red |
- Part A: Calculate the Gini impurity of these data
before any splitting rules/models are applied.
- Part B: Consider the splitting rule \(X_1 \leq -1\). What is the Gini gain
resulting from this split?
- Part C: Now consider the splitting rule \(X_1 \leq 0\). What is the Gini gain
resulting from this split?
- Part D: If only \(X_1\) is considered, are there any
splitting rules that will lead to a larger Gini gain than the ones
stated in Parts B and C? Briefly explain.
- Part E: If only \(X_2\) is considered, what is the splitting
rule that will produce the best Gini gain? State the rule and the Gini
gain it produces.
- Part F: If the `DecisionTreeClassifier()` function were fit using \(X_1\) and \(X_2\) as predictors of \(Y\), what would be the first splitting rule in the tree? Justify your answer without actually fitting the function to these data.
- Part G: Considering both predictors, is it possible
for a decision tree with maximum depth of two to perfectly classify
these data if the first splitting rule is \(X_1 \leq -1\)? What about if the first
splitting rule is \(X_1 \leq 0\)?
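As a reminder, the Gini impurity of a node with class proportions \(p_k\) is \(1 - \sum_k p_k^2\). A minimal sketch using only the standard library (the example labels are generic, not the homework data):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Red", "Red", "Red"]))  # a pure node: 0.0
print(gini(["Red", "Green"]))       # a maximally impure two-class node: 0.5
```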
\(~\)
Question #4 (Application, Pipelines)
For this question you should use the dataset available here: https://remiller1450.github.io/data/beans.csv

This data set was constructed using a computer vision system that segmented 13,611 images of 7 types of dry beans, extracting 16 features (12 dimensional features, 4 shape-form features). Additional details are contained in this paper.
The following questions should be answered using Python code, including functions in the `sklearn` library. Unless otherwise indicated, use the `'accuracy'` scoring criterion during model tuning and evaluation.
- Part A: Read these data into Python and perform a 90-10 training-testing split using `random_state=1`.
- Part B: Separate the outcome, "Class", from the predictors and graph a histogram of every predictor. Based upon these histograms, do you think that re-scaling and/or transformation should be part of a data preparation pipeline? You should assume that \(k\)-nearest neighbors is one of several models that will be considered.
- Part C: Use the `corrcoef()` function in `numpy` to explore the pairwise correlations between predictors. Based upon these correlations, do you think dimension reduction via principal component analysis should be considered as part of your data preparation pipeline?
- Part D: Create a machine learning pipeline that includes the data preparation steps you deemed important in Parts B and C. Then, perform a grid search using 5-fold cross-validation to find a well-fitting \(k\)-nearest neighbors model. Your search should explore at least two variations of your data preparation steps (i.e., two different scalers, or two different numbers of retained principal components), at least three values of \(k\), both Euclidean and Manhattan distance, and both uniform and distance weighting. Report the hyperparameters of the best KNN model.
- Part E: Repeat the same basic steps of Part D to find a well-fitting decision tree model. Your search should explore at least two variations of your data preparation steps (i.e., two different scalers, or two different numbers of retained principal components), at least three different maximum depths, and at least two values of the minimum number of samples required to split a node. Report the hyperparameters of the best decision tree model.
- Part F: Repeat the same basic steps of Part D to find a well-fitting support vector machine. Your search should explore at least two variations of your data preparation steps (i.e., two different scalers, or two different numbers of retained principal components) and at least three different kernel types. You may tune any other hyperparameters as desired. Report the hyperparameters of the best-fitting support vector machine.
- Part G: Use a pipeline to choose between the best
approaches identified in Parts D, E and F. If two approaches have
exactly equal performance you may choose either of them.
- Part H: Create a visualization of the confusion
matrix for your best estimator from Part G that displays
classification results for the test data. Report the most common
type of misclassification made by the model.
- Part I: Report both the macro-averaged and micro-averaged F1-scores of the classifier on the test data. Which of these approaches (macro or micro averaging) do you believe is more appropriate for this application? Or are both approaches reasonable? Briefly explain.
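For orientation, the kind of pipeline and grid search that Parts D-F describe can be sketched as follows. This is a generic template, not an answer key: the preprocessing choices, grid values, and component counts below are placeholders you would replace with your own decisions from Parts B and C.

```python
# Sketch of a preprocessing + KNN pipeline tuned via 5-fold grid search.
# All grid values here are illustrative placeholders, not recommended settings.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("model", KNeighborsClassifier()),
])

param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],  # two preparation variations
    "pca__n_components": [5, 10],                  # two PCA variations
    "model__n_neighbors": [3, 5, 7],               # at least three values of k
    "model__metric": ["euclidean", "manhattan"],
    "model__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
# search.fit(train_X, train_y)   # then inspect search.best_params_
```

Swapping the `"model"` step for a `DecisionTreeClassifier` or `SVC` (with the corresponding `model__...` grid keys) adapts the same template to Parts E and F.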