Directions:

Question #1 (Decision Trees)

Consider a small data set containing six observations of two predictors, \(X_1\) and \(X_2\), and a binary outcome, \(Y\):

Observation   X1   X2   Y
1              0    3   Red
2              2    0   Red
3              0    2   Red
4              0    1   Green
5             -1    0   Green
6              1    1   Red
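
For convenience, the table above can be entered as plain Python data. The sketch below also computes the Gini impurity of the root node (all six observations, before any split). This assumes Gini impurity is the splitting criterion of interest, which the question as shown does not state; the names `toy` and `gini` are illustrative only.

```python
from collections import Counter

# Rows are (X1, X2, Y), in the order of observations 1 through 6.
toy = [
    (0, 3, "Red"),
    (2, 0, "Red"),
    (0, 2, "Red"),
    (0, 1, "Green"),
    (-1, 0, "Green"),
    (1, 1, "Red"),
]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = list(labels)
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Impurity of the root node: 4 Red and 2 Green out of 6 observations.
root_gini = gini(y for _, _, y in toy)
print(round(root_gini, 4))
```

The same `gini` helper can be applied to the subsets of labels on either side of a candidate split to compare splits by hand.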

Question #2 (Application)

Protein localization describes the prediction of where a protein resides in a cell. The motivation behind protein localization is to help inform tools, such as targeted drugs, whose behavior can be predicted.

The data used in this application are housed at the UC-Irvine machine learning repository and can be accessed using the code provided below:

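import pandas as pd
# Column names are assigned positionally because the file has no header row
colnames = ['seq_id', 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', 'class']
# The file is whitespace-delimited; the raw string avoids an invalid-escape warning for '\s'
yeast = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data', sep=r'\s+', names=colnames)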

The goal of this application is to develop models that predict class, the protein localization site, given other information about the protein.

You may find additional information at this page and a description of the variables can be found here.

Question #3 (Boosting and Linear Regression)

Consider gradient boosting applied to the linear regression model. To begin, recall that linear regression has a closed-form solution: \[\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
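
As a reminder of what the closed-form solution computes, here is a minimal NumPy sketch. It assumes \(\mathbf{X}\) has full column rank and already includes a column of ones if an intercept is wanted. The function name `fit_linear_regression` and the synthetic data are illustrative and not part of the assignment.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return w_hat solving the normal equations (X^T X) w = X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving the linear system gives the same w_hat as the formula with the
    # explicit inverse, but is numerically more stable.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic, noise-free data (an assumption made for illustration only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_hat = fit_linear_regression(X, y)
print(np.round(w_hat, 6))
```

Because the synthetic response has no noise, `w_hat` recovers `w_true` up to floating-point error; with noisy data it would instead be the least-squares estimate.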