Directions:
\(~\)
Consider gradient boosting applied to the linear regression model using the squared error cost function. To begin, recall that linear regression has a closed form solution in this scenario: \[\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
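For concreteness, this closed-form estimate can be computed directly with NumPy; the sketch below uses small simulated placeholder data rather than the course data.

## Closed-form least squares estimate (placeholder data, not the course data)
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=100)
## Solve (X^T X) w = X^T y rather than inverting X^T X explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)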
\(~\)
Recall that the squared error cost function can be written:
\[Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]
Here \(\mathbf{y}\) is the vector of observed outcomes and \(\mathbf{\hat{y}}\) is a vector of model predictions.
In Poisson regression, \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{\hat{w}}}\).
The standard approach to Poisson regression is to estimate the unknown weights using maximum likelihood estimation. However, for this question I’ll ask you to estimate a reasonable set of weights by differentiating the squared error cost function and optimizing it via gradient descent (which is not equivalent to maximum likelihood estimation for this scenario).
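To make that concrete, differentiating the cost with respect to the weight vector \(\mathbf{w}\) (applying the chain rule, since \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{w}}\)) gives a gradient that can be written in matrix form as: \[\nabla_{\mathbf{w}} Cost = -\tfrac{2}{n}\mathbf{X}^T\,\text{diag}(\mathbf{\hat{y}})\,(\mathbf{y} - \mathbf{\hat{y}})\] where \(\text{diag}(\mathbf{\hat{y}})\) is the diagonal matrix whose entries are the elements of \(\mathbf{\hat{y}}\).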
You may find it helpful to use np.diag() to set up a diagonal matrix. For reference, the minimum cost for the example data (see below) should be between 4.69 and 4.70.

## Setup trial data
import numpy as np
import pandas as pd
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_y = ic['bedrooms']
ic_X = ic[['assessed','area.living']]
## Scale X
from sklearn.preprocessing import StandardScaler
ic_Xs = StandardScaler().fit_transform(ic_X)
## Fit via grad descent, 250 iter w/ 0.01 learning rate
gdres = grad_descent(X=ic_Xs, y=ic_y, w=np.zeros(2), alpha=0.01, n_iter=250)
## Min of cost function
print(min(gdres[1]))
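For reference, a minimal sketch of a grad_descent() function consistent with the call above is shown below; it assumes the function returns the sequence of weight vectors and the cost at each iteration (so gdres[1] is the cost history) and uses the gradient given earlier.

import numpy as np

def grad_descent(X, y, w, alpha, n_iter):
    ## Gradient descent for Poisson regression with the squared error cost
    n = X.shape[0]
    y = np.asarray(y)
    w_hist, cost_hist = [], []
    for _ in range(n_iter):
        y_hat = np.exp(X @ w)                            ## Poisson predictions
        resid = y - y_hat
        cost_hist.append((resid @ resid) / n)            ## squared error cost
        grad = -(2 / n) * X.T @ np.diag(y_hat) @ resid   ## gradient of the cost
        w = w - alpha * grad                             ## descent update
        w_hist.append(w)
    return w_hist, cost_hist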
Comments: Poisson regression can be fit via maximum likelihood in sklearn using the PoissonRegressor function. The arguments alpha=0 and fit_intercept=False can be used to mimic the model fit to the example data in this question. However, you should expect somewhat different weight estimates since maximum likelihood estimation is not equivalent to minimizing squared error loss for the Poisson regression model. Further, in this example, the variables “assessed” and “area.living” are highly correlated, so there are many combinations of weights that will fit the training data similarly well.
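For example, this comparison might look something like the sketch below, which reuses the ic_Xs and ic_y objects created above; the coefficients it prints will not exactly match the gradient descent estimates for the reasons just described.

from sklearn.linear_model import PoissonRegressor
## Fit Poisson regression by maximum likelihood (no penalty, no intercept)
pr = PoissonRegressor(alpha=0, fit_intercept=False)
pr.fit(ic_Xs, ic_y)
print(pr.coef_)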
\(~\)
Consider a simple neural network consisting of 1 hidden layer containing 4 neurons that use the sigmoid activation function. The network will predict a numeric outcome, so the weighted outputs of each neuron contribute directly to the outcome (rather than being passed into another sigmoid function).
For future reference, this network will be applied to the Iowa City home sales data set, using sale.amount as the outcome and area.living, area.lot, and bedrooms as the predictors.
Because the outcome is numeric, you will use the squared error cost function: \[Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]
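As a rough sketch of what this architecture computes, the forward pass below uses placeholder weights and includes bias terms in both layers (adjust this to match however your own implementation parameterizes the network).

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, w2, b2):
    H = sigmoid(X @ W1 + b1)   ## hidden layer: 4 sigmoid neurons, shape (n, 4)
    return H @ w2 + b2         ## output: weighted sum of neuron outputs, no activation

## Placeholder weights for 3 predictors and 4 hidden neurons
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0
y_hat = forward(X, W1, b1, w2, b2)
print(np.mean((y - y_hat) ** 2))   ## squared error cost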
Split the data using random_state=7, then separate the predictors from the outcome.

Standardize the outcome, sale.amount, by subtracting its mean then dividing by its standard deviation (given by np.std), and re-scale the predictors using MinMaxScaler(). Note: these steps provide numerical stability.

Use MLPRegressor, the implementation of neural networks in sklearn, to fit the same model to the same data. Be sure to specify the proper activation function, number of hidden layers, and number of neurons. Then, print the fitted model’s final loss and verify that it’s roughly 0.45.
Note: this loss should be half of what you found because
sklearn
defines squared error cost as \(\tfrac{1}{2n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2\),
but we haven’t included the “2”.
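For reference, a sketch of these steps might look like the code below. The solver settings, iteration limit, and random_state passed to MLPRegressor are assumptions not specified above (and the data split is omitted here for brevity), so your exact final loss may differ somewhat from 0.45.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## Standardize the outcome and re-scale the predictors to [0, 1]
y = (ic['sale.amount'] - ic['sale.amount'].mean()) / np.std(ic['sale.amount'])
X = MinMaxScaler().fit_transform(ic[['area.living', 'area.lot', 'bedrooms']])

## 1 hidden layer of 4 sigmoid ("logistic") neurons, linear output
net = MLPRegressor(hidden_layer_sizes=(4,), activation='logistic',
                   random_state=7, max_iter=2000)
net.fit(X, y)
print(net.loss_)   ## uses sklearn's 1/(2n) squared error convention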