Directions:
- Homework must be completed individually. Any guidance or help
received from mentors, classmates, or online resources other than course
materials (including AI/LLMs) must be acknowledged.
- Organize your responses so that each question (1, 2, 3) and
sub-question (A, B, C, etc.) is clearly identifiable.
- Please submit a single Jupyter notebook (.ipynb file) displaying all
output and recording textual answers using neatly formatted markdown
chunks.
- Your submission should be made via Canvas no later than 11:59pm on
the assigned due date.
Question #1 (Decision Tree concepts)
Consider a small data set containing six observations of two
predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):
| Observation | \(X_1\) | \(X_2\) | \(Y\) |
|-------------|---------|---------|-------|
| 1           | 0       | 3       | Red   |
| 2           | 2       | 0       | Red   |
| 3           | 0       | 2       | Red   |
| 4           | 0       | 1       | Green |
| 5           | -1      | 0       | Green |
| 6           | 1       | 1       | Red   |
- Part A: Calculate the Gini impurity of these data
before any splitting rules/models are applied (the standard definitions
are restated after this list for reference).
- Part B: Consider the splitting rule \(X_1 \leq -1\). What is the Gini gain
resulting from this split?
- Part C: Now consider the splitting rule \(X_1 \leq 0\). What is the Gini gain
resulting from this split?
- Part D: If only \(X_1\) is considered, are there any
splitting rules that will lead to a larger Gini gain than the ones
stated in Parts B and C? Briefly explain.
- Part E: If only \(X_2\) is considered, what is the splitting
rule that will produce the best Gini gain? State the rule and the Gini
gain it produces.
- Part F: If the DecisionTreeClassifier() function were fit using
\(X_1\) and \(X_2\) as predictors of \(Y\), what would be the first
splitting rule in the tree? Justify your answer without actually fitting
the function to these data.
- Part G: Considering both predictors, is it possible
for a decision tree with maximum depth of two to perfectly classify
these data if the first splitting rule is \(X_1 \leq -1\)? What about if the first
splitting rule is \(X_1 \leq 0\)?
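For reference, Parts A-F use the Gini impurity and Gini gain as usually defined (restated here as a convenience; confirm the exact definitions against the course slides). For a node containing proportion \(p_k\) of class \(k\),

\[ G = 1 - \sum_k p_k^2, \]

and the Gini gain of a split sending \(n_L\) of the node's \(n\) observations to the left child and \(n_R\) to the right is

\[ \text{Gain} = G_{\text{parent}} - \frac{n_L}{n}\, G_{\text{left}} - \frac{n_R}{n}\, G_{\text{right}}. \]

For example, a node with 3 observations of one class and 1 of the other has \(G = 1 - (3/4)^2 - (1/4)^2 = 3/8\).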
Question #3 (Cross-validation)
Consider a toy data set of \(n=200\) observations generated using
\(f(X) = 2x_1 + 10\) and \(Y = f(X) + \epsilon\), where
\(\epsilon \sim N(0, \sigma = 5)\). In other words, the true
relationship between \(x_1\) and \(Y\) is linear, with 5 units of
irreducible error. These data can be found at the URL given below:
https://remiller1450.github.io/data/toy_linear_data.csv
- Part A: In your own words, explain whether a KNN
regressor or decision tree model is better suited to estimating the true
\(f()\). You should rely upon
conceptual arguments in favor of your chosen method, not empirical
investigations using the provided data.
- Part B: Use a for loop to create your own implementation of 4-fold
cross-validation for a decision tree model with a maximum depth of 4.
You are encouraged to look at the pseudocode in our slides, and you
should use the numpy.random.choice() function to sample fold indices
with replacement. Use your implementation to report the cross-validated
RMSE (a minimal starter sketch appears after this list).
- Part C: Replace the decision tree model in your cross-validation loop
with a LinearRegression model from the linear_model module of sklearn.
How does the change in model impact the cross-validated RMSE?
- Part D: The decision tree model in Part B used a maximum depth of 4.
If you performed a grid search allowing this hyperparameter to be any
integer, would you expect to find one that produces a cross-validated
RMSE as low as that of the linear regression approach in Part C?
Briefly explain. Note: you should make a conceptual argument and
should not actually perform a grid search.
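Below is a minimal sketch of the loop structure Part B describes, using numpy.random.choice() to assign fold labels with replacement (so fold sizes will vary slightly). The column names x1 and y are assumptions; check the CSV header after loading. Treat this as a starting point rather than the required solution; for Part C, only the model line changes.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# from sklearn.linear_model import LinearRegression  # swap in for Part C

# Load the toy data (column names "x1" and "y" are assumed; check the header)
toy = pd.read_csv("https://remiller1450.github.io/data/toy_linear_data.csv")
X, y = toy[["x1"]], toy["y"]

# Assign every observation a fold label in {0, 1, 2, 3}, sampled with replacement
rng = np.random.default_rng(1)
folds = rng.choice(4, size=len(toy))

# Fit on three folds, predict the held-out fold, and pool the squared errors
sq_errors = []
for k in range(4):
    train, test = folds != k, folds == k
    model = DecisionTreeRegressor(max_depth=4)  # LinearRegression() for Part C
    model.fit(X[train], y[train])
    preds = model.predict(X[test])
    sq_errors.extend((y[test] - preds) ** 2)

print("Cross-validated RMSE:", np.sqrt(np.mean(sq_errors)))
```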
Question #4 (Application)
For this question you should use the dataset available here:
https://remiller1450.github.io/data/beans.csv
This data set was constructed using a computer vision system that
segmented 13,611 images of 7 types of dry beans, extracting 16 features
(12 dimensional features, 4 shape-form features). Additional details
are contained in this paper.
The following questions should be answered using Python code, including
functions in the sklearn library. Unless otherwise indicated, use the
'accuracy' scoring criterion during model tuning and evaluation.
- Part A: Read these data into Python and perform a 90-10
training-testing split using random_state=1 (minimal starter sketches
for Parts A-E appear after this list).
- Part B: Separate the outcome, “Class”, from the
predictors and create a histogram of every predictor. Based on these
histograms, do you think that re-scaling and/or transformation should be
part of a data preparation pipeline? You may assume that the pipeline
will consider models like KNN that are sensitive to the scale and
distribution of the predictors.
- Part C: Perform a cross-validated grid search and show a data frame
displaying the top 5 best-performing methods (one possible setup is
sketched after this list) while satisfying the following guidelines:
- Consideration of at least two pre-processing options in the form of
re-scaling, transformation, or “passthrough” consistent with your
assessment in Part B
- Consideration of several KNN models with at least 3 choices of \(k\) and either uniform or distance
weighting
- Consideration of several decision tree models with at least 3
choices of maximum depth
- Your approach should not explore re-scaling or
transformation for decision tree models
- Part D: Create a visualization of the confusion
matrix for your best estimator from Part C that displays
classification results for the test data. Report the most common
type of misclassification made by the model.
- Part E: Report both the macro-averaged and
micro-averaged F1-scores of the best classification approach on the
test data. Which of these approaches (macro or micro averaging) do
you believe is more appropriate for this application? Or are both
approaches reasonable? Briefly explain.
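The sketches below are one possible reading of Parts A-E and are meant only as starting points; everything other than the "Class" column and the URL given above is an assumption to adjust. First, loading, splitting, and the Part B histograms:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

beans = pd.read_csv("https://remiller1450.github.io/data/beans.csv")

# Separate the outcome from the 16 predictors (Part B)
X = beans.drop(columns="Class")
y = beans["Class"]

# 90-10 training-testing split (Part A)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.1, random_state=1
)

# Histogram of every predictor (Part B)
train_X.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()
```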
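For Part C, one way to satisfy all four guidelines is a Pipeline whose steps are swapped by the grid: re-scaling/transformation options are explored for KNN, while the decision trees see only "passthrough". The particular scalers, values of \(k\), and maximum depths below are placeholders; choose ones consistent with your Part B assessment.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Pipeline with a swappable pre-processing step and a swappable model
pipe = Pipeline([("prep", "passthrough"), ("model", KNeighborsClassifier())])

# Two separate grids, so decision trees are never paired with re-scaling
param_grid = [
    {
        "prep": [StandardScaler(), PowerTransformer()],  # placeholder options
        "model": [KNeighborsClassifier()],
        "model__n_neighbors": [5, 15, 25],
        "model__weights": ["uniform", "distance"],
    },
    {
        "prep": ["passthrough"],
        "model": [DecisionTreeClassifier(random_state=1)],
        "model__max_depth": [3, 6, 9],
    },
]

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(train_X, train_y)

# Data frame showing the top 5 best-performing combinations
results = pd.DataFrame(search.cv_results_)
print(results.sort_values("rank_test_score").head(5)[["params", "mean_test_score"]])
```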
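Finally, for Parts D and E, assuming search is the fitted grid search from the previous sketch:

```python
from sklearn.metrics import ConfusionMatrixDisplay, f1_score

# Visualize the test-set confusion matrix for the best estimator (Part D)
ConfusionMatrixDisplay.from_estimator(search.best_estimator_, test_X, test_y)
plt.show()

# Macro- and micro-averaged F1-scores on the test data (Part E)
test_preds = search.best_estimator_.predict(test_X)
print("Macro F1:", f1_score(test_y, test_preds, average="macro"))
print("Micro F1:", f1_score(test_y, test_preds, average="micro"))
```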