Directions:

Question #1 (Decision Tree concepts)

Consider a small data set containing six observations of two predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):

| Observation | \(X_1\) | \(X_2\) | \(Y\)  |
|-------------|---------|---------|--------|
| 1           | 0       | 3       | Red    |
| 2           | 2       | 0       | Red    |
| 3           | 0       | 2       | Red    |
| 4           | 0       | 1       | Green  |
| 5           | -1      | 0       | Green  |
| 6           | 1       | 1       | Red    |
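
If it helps to explore these observations programmatically, the sketch below enters the table into Python and fits a small classification tree; this is a minimal example, and the `max_depth` value is an arbitrary choice made only so the splits are easy to read.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The six observations from the table above
toy = pd.DataFrame({
    "X1": [0, 2, 0, 0, -1, 1],
    "X2": [3, 0, 2, 1, 0, 1],
    "Y":  ["Red", "Red", "Red", "Green", "Green", "Red"]
})

# Fit a shallow tree so the resulting splits are easy to inspect
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(toy[["X1", "X2"]], toy["Y"])

# Display the learned splits as text
print(export_text(tree, feature_names=["X1", "X2"]))
```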

\(~\)

Question #2 (Performance Metrics)

Consider an application where machine learning is used to predict loan default (yes or no) based on information about the loan’s parameters, the customer’s income, and the customer’s credit history. Suppose a sample of 10,000 loans is obtained as a training set, of which 333 ended in default and the remaining 9,667 were repaid in full. Further, suppose the institution’s greatest concern is identifying loans that are likely to end in default.

For the questions that follow, denote “yes” as the positive class and “no” as the negative class.
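
As a refresher on how these quantities are computed with sklearn, the sketch below uses a small set of hypothetical true and predicted labels (not taken from the question) and marks "yes" as the positive class via `pos_label`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for a handful of loans, for illustration only
y_true = ["yes", "no", "no", "yes", "no", "yes", "no", "no"]
y_pred = ["yes", "no", "yes", "no", "no", "yes", "no", "no"]

# Confusion matrix with "yes" (default) listed first as the positive class
print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))

# Recall asks: of the loans that actually defaulted, what fraction were flagged?
# This is the metric most directly tied to the institution's stated concern.
print(recall_score(y_true, y_pred, pos_label="yes"))
print(precision_score(y_true, y_pred, pos_label="yes"))
```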

\(~\)

Question #3 (Cross-validation)

Consider a toy data set of \(n=200\) observations generated using \(f(X) = 2x_1 + 10\) and \(Y = f(X) + \epsilon\), where \(\epsilon \sim N(0, \sigma = 5)\). In other words, the true relationship between \(x_1\) and \(Y\) is linear, with irreducible error having a standard deviation of 5. These data can be found at the URL given below:

https://remiller1450.github.io/data/toy_linear_data.csv
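
As a starting point, a minimal sketch for reading these data and estimating test error with 5-fold cross-validation is given below; the column names `x1` and `y` are assumptions, so check `toy.columns` against the actual file.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Read the toy data directly from the course URL
toy = pd.read_csv("https://remiller1450.github.io/data/toy_linear_data.csv")

# Column names are assumed; verify with toy.columns
X = toy[["x1"]]
y = toy["y"]

# 5-fold cross-validated mean squared error for a simple linear regression
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())
```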

\(~\)

Question #4 (Application)

For this question you should use the dataset available here:

https://remiller1450.github.io/data/beans.csv

This data set was constructed using a computer vision system that segmented 13,611 images of 7 types of dry beans, extracting 16 features (12 dimensional features and 4 shape-form features). Additional details are contained in this paper.

The following questions should be answered using Python code, including functions from the sklearn library. Unless otherwise indicated, use the 'accuracy' scoring criterion during model tuning and evaluation.
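
A minimal starting sketch is given below; the name of the outcome column ("Class") and the choice of a k-nearest neighbors pipeline are assumptions made for illustration, not requirements of the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Read the beans data; the outcome column is assumed to be named "Class"
beans = pd.read_csv("https://remiller1450.github.io/data/beans.csv")
X = beans.drop(columns=["Class"])
y = beans["Class"]

# Hold out a test set for a final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# Tune a simple pipeline using accuracy, as the directions specify
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [5, 15, 25]},
                    scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```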