Directions:
- Homework must be completed individually
- Please type your responses, clearly separating each question and
sub-question (A, B, C, etc.)
- You may type your written answers using Markdown chunks in a
Jupyter Notebook, or you may use any word processing software and submit
your Python code separately
- Questions that require Python coding should include all commands
used to reach the answer, nothing more and nothing less
- Submit your work via P-web
Question #1 (Decision Trees)
Consider a small data set containing six observations of two
predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):
| Observation | \(X_1\) | \(X_2\) | \(Y\) |
|---|---|---|---|
| 1 | 0 | 3 | Red |
| 2 | 2 | 0 | Red |
| 3 | 0 | 2 | Red |
| 4 | 0 | 1 | Green |
| 5 | -1 | 0 | Green |
| 6 | 1 | 1 | Red |
- Part A: Calculate the Gini impurity of these data
before any splitting rules/models are applied. (The Gini impurity and Gini
gain formulas are restated for reference just after this list.)
- Part B: Consider the splitting rule \(X_1 \leq -1\). What is the Gini gain
resulting from this split?
- Part C: Now consider the splitting rule \(X_1 \leq 0\). What is the Gini gain
resulting from this split?
- Part D: If only \(X_1\) is considered, are there any
splitting rules that will lead to a larger Gini gain than the ones
stated in Parts B and C? Briefly explain.
- Part E: If only \(X_2\) is considered, what is the splitting
rule that will produce the best Gini gain? State the rule and the Gini
gain it produces.
- Part F: If the DecisionTreeClassifier() function were fit using \(X_1\)
and \(X_2\) as predictors of \(Y\), what would be the first splitting rule
in the tree? Justify your answer without actually fitting the function
to these data.
- Part G: Considering both predictors, is it possible
for a decision tree with maximum depth of two to perfectly classify
these data if the first splitting rule is \(X_1 \leq -1\)? What about if the first
splitting rule is \(X_1 \leq 0\)?
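As a reminder, the standard definitions (assumed here, since the assignment does not restate them) are as follows. For a node whose observations fall into classes with proportions \(p_1, \dots, p_K\), the Gini impurity is
\[G = 1 - \sum_{k=1}^{K} p_k^2,\]
and the Gini gain of a split sending \(n_L\) of the node's \(n\) observations to the left child and \(n_R\) to the right child is
\[\text{Gini gain} = G - \left(\frac{n_L}{n}\,G_L + \frac{n_R}{n}\,G_R\right),\]
where \(G_L\) and \(G_R\) are the impurities of the two child nodes.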
\(~\)
Question #2 (Application)
Protein localization describes the prediction of where a protein
resides in a cell. The motivation behind protein localization is to help
inform tools, such as targeted drugs, whose behavior can then be
predicted.
The data used in this application are housed at the UC-Irvine machine
learning repository and can be accessed using the code provided below:
import pandas as pd
colnames = ['seq_id', 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', 'class']
yeast = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data', sep=r'\s+', names=colnames)
The goal of this application is to develop models that predict
class, the protein localization site, given other
information about the protein.
You may find additional information at this page and a
description of the variables can
be found here.
- Part A: Perform an 80-20 training-testing split using random_state=1,
then separate the outcome variable, class, and remove sequence ID from
the predictors. (A minimal code sketch of this step and Part B appears
after this list.)
- Part B: Fit a decision tree with a maximum depth of
2 to the training data. If this model were used on new data, what would
the true positive rate be for proteins whose localization site is MIT
(mitochondrial)?
- Part C: Use either cross-validated grid search or
out-of-bag accuracy to find a well-fitting random forest model. You should
consider maximum depths of 2, 3, and 4; minimum samples per split of 20
and 50; and max features of 2 and 4. Report the tuning parameters of
your final model and its cross-validated or out-of-bag accuracy. (A sketch
of the general grid-search and ensemble pattern used in Parts C-E appears
after this list.)
- Part D: Using a pipeline with a scaling step, find
a well-fitting k-nearest neighbors model. You should find this model
using cross-validated grid search over any set of tuning parameters you
consider to be appropriate.
- Part E: Construct a weighted ensemble classifier
that uses the random forest from Part C with a weight of 1.1 and the KNN
classifier from Part D with a weight of 0.9. Does this ensemble perform
better than the random forest by itself?
- Part F: Use stacked generalization to take the
outputs of the weighted ensemble you considered in Part E and map them
to predicted classes via a decision tree. Does this approach lead to
improved classification accuracy?
- Part G: Now consider a gradient boosted ensemble built using xgboost.
Use a cross-validated grid search to tune the parameters eta, gamma,
reg_alpha, reg_lambda, max_depth, and colsample_bytree. A satisfactory
answer will try at least 3 reasonably chosen values for each of these
tuning parameters. Report the classification accuracy of this model.
- Part H: Based on your work in Parts A-G, select an
appropriate final model and evaluate it on the test data. Report the
classification accuracy.
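For Parts A and B, the general pattern of scikit-learn calls is sketched below. This is a minimal illustration under stated assumptions: the variable names (X, y, X_train, and so on) are placeholders rather than anything prescribed by the assignment, and all other arguments are left at their defaults.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Part A: drop the sequence ID, separate the outcome, and hold out 20% of the observations
X = yeast.drop(columns=['seq_id', 'class'])
y = yeast['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Part B: depth-2 decision tree fit to the training data
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, y_train)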
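One possible skeleton for Parts C-E is sketched below. The random forest tuning grid comes from the assignment itself; the number of cross-validation folds, the candidate numbers of neighbors, the random_state, and the use of soft voting are illustrative assumptions, not requirements.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Part C: grid search over the depths, minimum samples per split, and max features listed above
rf_grid = {'max_depth': [2, 3, 4], 'min_samples_split': [20, 50], 'max_features': [2, 4]}
rf_search = GridSearchCV(RandomForestClassifier(random_state=1), rf_grid, cv=5)
rf_search.fit(X_train, y_train)
# Part D: scaling step followed by KNN; the candidate k values here are placeholders
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
knn_search = GridSearchCV(knn_pipe, {'knn__n_neighbors': [5, 15, 25]}, cv=5)
knn_search.fit(X_train, y_train)
# Part E: weighted soft-voting ensemble of the two tuned models
ensemble = VotingClassifier(
    estimators=[('rf', rf_search.best_estimator_), ('knn', knn_search.best_estimator_)],
    voting='soft', weights=[1.1, 0.9])
ensemble.fit(X_train, y_train)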
\(~\)
Question #3 (Boosting and Linear Regression)
Consider gradient boosting applied to the linear regression model. To
begin, recall that linear regression has a closed-form solution:
\[\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
- Part A: After fitting a regression model using the
weights given above (call them \(\hat{\mathbf{w}}^{(1)}\)), denote the
residuals by \(\mathbf{r}^{(1)} = \mathbf{y} - \mathbf{X}\hat{\mathbf{w}}^{(1)}\),
then determine the weights of the second model in the ensemble (call them
\(\hat{\mathbf{w}}^{(2)}\)) in terms of \(\mathbf{r}^{(1)}\).
- Part B: Substitute \(\mathbf{r}^{(1)} = \mathbf{y} -
\mathbf{X}\hat{\mathbf{w}}^{(1)}\) into the expression you
provided in Part A, then argue that \(\hat{\mathbf{w}}^{(2)} = \mathbf{0}\).
- Part C: Provide a conceptual argument for why
gradient boosting fails to improve the linear regression model.