Directions:
- Homework must be completed individually
- Please type your responses, clearly separating each question and
sub-question (A, B, C, etc.)
- You may type your written answers using Markdown chunks in a
Jupyter Notebook, or you may use any word processing software and submit
your Python code separately
- Questions that require Python coding should include all commands
used to reach the answer, nothing more and nothing less
- Submit your work via P-web
Question #1 (Decision Trees)
Consider a small data set containing six observations of two
predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):
| Observation | \(X_1\) | \(X_2\) | \(Y\) |
|---|---|---|---|
| 1 | 0 | 3 | Red |
| 2 | 2 | 0 | Red |
| 3 | 0 | 2 | Red |
| 4 | 0 | 1 | Green |
| 5 | -1 | 0 | Green |
| 6 | 1 | 1 | Red |
- Part A: Calculate the Gini impurity of these data
before any splitting rules/models are applied. (The Gini impurity and Gini
gain formulas are restated for reference just after this list.)
- Part B: Consider the splitting rule \(X_1 \leq -1\). What is the Gini gain
resulting from this split?
- Part C: Now consider the splitting rule \(X_1 \leq 0\). What is the Gini gain
resulting from this split?
- Part D: If only \(X_1\) is considered, are there any
splitting rules that will lead to a larger Gini gain than the ones
stated in Parts B and C? Briefly explain.
- Part E: If only \(X_2\) is considered, what is the splitting
rule that will produce the best Gini gain? State the rule and the Gini
gain it produces.
- Part F: If the DecisionTreeClassifier() function were fit using \(X_1\)
and \(X_2\) as predictors of \(Y\), what would be the first splitting rule
in the tree? Justify your answer without actually fitting the function
to these data.
- Part G: Considering both predictors, is it possible
for a decision tree with maximum depth of two to perfectly classify
these data if the first splitting rule is \(X_1 \leq -1\)? What about if the first
splitting rule is \(X_1 \leq 0\)?
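As a reminder, the standard definitions (assumed here, since the assignment does not restate them) are as follows. For a node whose observations fall into classes with proportions \(p_1, \dots, p_K\), the Gini impurity is
\[G = 1 - \sum_{k=1}^{K} p_k^2,\]
and the Gini gain of a split sending \(n_L\) of the node's \(n\) observations to the left child and \(n_R\) to the right child is
\[\text{Gini gain} = G - \left(\frac{n_L}{n}\,G_L + \frac{n_R}{n}\,G_R\right),\]
where \(G_L\) and \(G_R\) are the impurities of the two child nodes.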
\(~\)
Question #2 (Application)
Protein localization describes the prediction of where a protein
resides in a cell. The motivation behind protein localization is to help
inform tools, such as targeted drugs, whose behavior can then be
predicted.
The data used in this application are housed at the UC-Irvine machine
learning repository and can be accessed using the code provided below:
import pandas as pd
colnames = ['seq_id', 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', 'class']
yeast = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data', sep=r'\s+', names=colnames)
The goal of this application is to develop models that predict
class, the protein localization site, given other
information about the protein.
You may find additional information at this page and a
description of the variables can
be found here.
- Part A: Perform an 80-20 training-testing split using random_state=1,
then separate the outcome variable, class, and remove sequence ID from
the predictors. (A minimal code sketch of this step and Part B appears
after this list.)
- Part B: Fit a decision tree with a maximum depth of
2 to the training data. If this model were used on new data, what would
the true positive rate be for proteins whose localization site is MIT
(mitochondrial)?
- Part C: Use either cross-validated grid search or
out-of-bag accuracy to find a well-fitting random forest model. You should
consider maximum depths of 2, 3, and 4; minimum samples per split of 20
and 50; and max features of 2 and 4. Report the tuning parameters of
your final model and its cross-validated or out-of-bag accuracy. (A sketch
of the general grid-search and ensemble pattern used in Parts C-E appears
after this list.)
- Part D: Using a pipeline with a scaling step, find
a well-fitting k-nearest neighbors model. You should find this model
using cross-validated grid search over any set of tuning parameters you
consider to be appropriate.
- Part E: Construct a weighted ensemble classifier
that uses the random forest from Part C with a weight of 1.1 and the KNN
classifier from Part D with a weight of 0.9. Does this ensemble perform
better than the random forest by itself?
- Part F: Use stacked generalization to take the
outputs of the weighted ensemble you considered in Part E and map them
to predicted classes via a decision tree. Does this approach lead to
improved classification accuracy?
- Part G: Now consider a gradient boosted ensemble built using xgboost.
Use a cross-validated grid search to tune the parameters eta, gamma,
reg_alpha, reg_lambda, max_depth, and colsample_bytree. A satisfactory
answer will try at least 3 reasonably chosen values for each of these
tuning parameters. Report the classification accuracy of this model.
- Part H: Based on your work in Parts A-G, select an
appropriate final model and evaluate it on the test data. Report the
classification accuracy.
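For Parts A and B, the general pattern of scikit-learn calls is sketched below. This is a minimal illustration under stated assumptions: the variable names (X, y, X_train, and so on) are placeholders rather than anything prescribed by the assignment, and all other arguments are left at their defaults.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Part A: drop the sequence ID, separate the outcome, and hold out 20% of the observations
X = yeast.drop(columns=['seq_id', 'class'])
y = yeast['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Part B: depth-2 decision tree fit to the training data
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, y_train)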
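One possible skeleton for Parts C-E is sketched below. The random forest tuning grid comes from the assignment itself; the number of cross-validation folds, the candidate numbers of neighbors, the random_state, and the use of soft voting are illustrative assumptions, not requirements.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Part C: grid search over the depths, minimum samples per split, and max features listed above
rf_grid = {'max_depth': [2, 3, 4], 'min_samples_split': [20, 50], 'max_features': [2, 4]}
rf_search = GridSearchCV(RandomForestClassifier(random_state=1), rf_grid, cv=5)
rf_search.fit(X_train, y_train)
# Part D: scaling step followed by KNN; the candidate k values here are placeholders
knn_pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
knn_search = GridSearchCV(knn_pipe, {'knn__n_neighbors': [5, 15, 25]}, cv=5)
knn_search.fit(X_train, y_train)
# Part E: weighted soft-voting ensemble of the two tuned models
ensemble = VotingClassifier(
    estimators=[('rf', rf_search.best_estimator_), ('knn', knn_search.best_estimator_)],
    voting='soft', weights=[1.1, 0.9])
ensemble.fit(X_train, y_train)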
\(~\)
Question #3 (Boosting and Linear Regression)
Consider gradient boosting applied to the linear regression model. To
begin, recall that linear regression has a closed-form solution:
\[\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
- Part A: After fitting a regression model using the
weights given above (call them \(\hat{\mathbf{w}}^{(1)}\)), denote the
residuals by \(\mathbf{r}^{(1)} = \mathbf{y} - \mathbf{X}\hat{\mathbf{w}}^{(1)}\),
then determine the weights of the second model in the ensemble (call them
\(\hat{\mathbf{w}}^{(2)}\)) in terms of \(\mathbf{r}^{(1)}\).
- Part B: Substitute \(\mathbf{r}^{(1)} = \mathbf{y} -
\mathbf{X}\hat{\mathbf{w}}^{(1)}\) into the expression you
provided in Part A, then argue that \(\hat{\mathbf{w}}^{(2)} = \mathbf{0}\).
- Part C: Provide a conceptual argument for why
gradient boosting fails to improve the linear regression model.