Directions:
- Remember that homework assignments are to be completed individually. You should cite any classmates you discuss the questions with and any outside resources (besides our course materials) you consult for assistance.
- Please submit a single Jupyter Notebook containing your responses
via P-web. Use markdown chunks to format the assignment and record
responses to questions that involve written answers. Note that you may
type LaTeX-style equations in markdown chunks.
Question 1
Consider gradient boosting using the squared error cost function with
linear regression as the base learner. You may use the fact that linear
regression has a closed form solution for its weights, and you do not
need to re-derive this solution yourself.
- Part A: Let \(\mathbf{r}^{(1)}\) denote the vector of residuals for the first base learner in the ensemble. Write an expression for the weights of the second model in the ensemble (call them \(\hat{\mathbf{w}}^{(2)}\)) in terms of \(\mathbf{r}^{(1)}\).
- Part B: Substitute \(\mathbf{r}^{(1)} = \mathbf{y} - \mathbf{X}\hat{\mathbf{w}}^{(1)}\) (the definition of the residual vector) into the expression you provided in Part A, then argue that \(\hat{\mathbf{w}}^{(2)} = \mathbf{0}\).
- Part C: Provide a conceptual argument for why
gradient boosting fails to improve the linear regression model.
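If you'd like a numerical sanity check before writing your argument, here is a minimal sketch (not required for your submission; the toy data and seed are purely illustrative assumptions) that fits one linear regression and then fits a second linear regression to its residuals:
import numpy as np
from sklearn.linear_model import LinearRegression
## Toy data: linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=100)
m1 = LinearRegression().fit(X, y)      ## first base learner
r1 = y - m1.predict(X)                 ## residual vector r^(1)
m2 = LinearRegression().fit(X, r1)     ## second base learner fit to the residuals
print(m2.coef_, m2.intercept_)         ## both are numerically ~0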
Question 2
Recall that the squared error cost function can be written:
\[Cost = \tfrac{1}{n}(\mathbf{y} -
\mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]
Here \(\mathbf{y}\) is the vector of
observed outcomes and \(\mathbf{\hat{y}}\) is a vector of model
predictions.
Poisson regression uses the form: \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{\hat{w}}}\)
The standard approach to Poisson regression is to estimate the unknown
weights using maximum likelihood estimation. However, for this
question I’ll ask you to estimate a reasonable set of weights by
differentiating the squared error cost function and optimizing it via
gradient descent (which is not equivalent to maximum likelihood
estimation in this scenario).
- Part A: Use the chain rule to show that the gradient of the cost with respect to the weight parameters is \(\tfrac{-2}{n}(\mathbf{y} - e^{\mathbf{X}\mathbf{\hat{w}}})^T \cdot \text{diag}(e^{\mathbf{X}\mathbf{\hat{w}}}) \cdot \mathbf{X}\), where \(\text{diag}(\mathbf{v})\) denotes a diagonal matrix whose diagonal elements are given by \(\mathbf{v}\) and whose off-diagonal elements are zero.
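One way to organize the derivation (a sketch of the outer step only, using the substitution \(\mathbf{u} = \mathbf{y} - e^{\mathbf{X}\mathbf{\hat{w}}}\); you should justify each piece yourself):
\[\frac{\partial Cost}{\partial \mathbf{\hat{w}}} = \frac{2}{n}\,\mathbf{u}^T \frac{\partial \mathbf{u}}{\partial \mathbf{\hat{w}}}, \qquad \frac{\partial \mathbf{u}}{\partial \mathbf{\hat{w}}} = -\,\text{diag}(e^{\mathbf{X}\mathbf{\hat{w}}})\,\mathbf{X}\]
Combining the two pieces yields the stated gradient.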
- Part B: Write your own Python functions to calculate the cost (given \(\mathbf{X}\), \(\mathbf{\hat{w}}\), and \(\mathbf{y}\)) and perform gradient descent for this scenario. You should use the functions provided in this unit's labs as examples, and you should consider using np.diag() to set up the diagonal matrix. For reference, the minimum cost for the example data (see below) should be between 4.69 and 4.70; a minimal skeleton consistent with the example call is sketched after the code.
## Set up trial data
import pandas as pd
import numpy as np
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_y = ic['bedrooms']
ic_X = ic[['assessed','area.living']]
## Scale X
from sklearn.preprocessing import StandardScaler
ic_Xs = StandardScaler().fit_transform(ic_X)
## Fit via grad descent, 250 iter w/ 0.01 learning rate
gdres = grad_descent(X=ic_Xs, y=ic_y, w=np.zeros(2), alpha=0.01, n_iter=250)
## Min of cost function
print(min(gdres[1]))
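As promised above, here is a minimal skeleton consistent with the example call. It is only a sketch: the return structure (a tuple of final weights and per-iteration costs) is an assumption inferred from the gdres[1] usage, and your lab functions may be organized differently.
import numpy as np
def cost(X, w, y):
    ## Squared error cost with Poisson-style predictions y_hat = exp(Xw)
    resid = y - np.exp(X @ w)
    return (resid @ resid) / len(y)
def grad_descent(X, y, w, alpha, n_iter):
    costs = [cost(X, w, y)]
    for _ in range(n_iter):
        y_hat = np.exp(X @ w)
        ## Gradient from Part A: (-2/n) (y - y_hat)^T diag(y_hat) X
        grad = (-2 / len(y)) * (y - y_hat) @ np.diag(y_hat) @ X
        w = w - alpha * grad
        costs.append(cost(X, w, y))
    return w, costs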
- Part C: Graph the cost function for 300 iterations
of gradient descent with a learning rate of 0.001. Comment on
whether this learning rate is appropriate.
- Part D: Graph the cost function for 300 iterations of gradient descent with a learning rate of 0.1. Comment on whether this learning rate is appropriate. (A plotting sketch for Parts C and D appears below.)
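A minimal plotting sketch for Parts C and D, assuming grad_descent records its cost history as sketched above (adjust alpha for each part):
import matplotlib.pyplot as plt
gdres = grad_descent(X=ic_Xs, y=ic_y, w=np.zeros(2), alpha=0.001, n_iter=300)
plt.plot(gdres[1])                 ## cost at each iteration
plt.xlabel('Iteration')
plt.ylabel('Squared error cost')
plt.show()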
Comments: Poisson regression can be fit via maximum
likelihood in sklearn using the
PoissonRegressor function. The arguments
alpha=0 and fit_intercept=False can be used to
mimic the model fit to the example data in this question. However, you
should expect somewhat different weight estimates since maximum
likelihood estimation is not equivalent to minimizing squared error loss
for the Poisson regression model. Further, in this example, the
variables “assessed” and “area.living” are highly correlated, so there
are many combinations of weights that will fit the training data
similarly well.
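For instance, a quick comparison might look like the following (reusing ic_Xs and ic_y from above):
from sklearn.linear_model import PoissonRegressor
## Maximum likelihood fit with no penalty and no intercept, mimicking the example
pr = PoissonRegressor(alpha=0, fit_intercept=False).fit(ic_Xs, ic_y)
print(pr.coef_)   ## expect weights that differ somewhat from gradient descent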
Question 3
Consider a simple neural network consisting of 1 hidden layer
containing 4 neurons that use the sigmoid activation function, and a
final output layer that applies the sigmoid function to a linear
combination of activated outputs from the hidden layer. You are to apply
the squared error cost function to the final activated
output.
The input data used in your network will be generated from the
make_moons() function in sklearn:
## Set up trial data
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.1)
Note that the data have 2 input features and a binary outcome.
- Part A: How many weight and bias parameters are
used to generate the neurons in this network from the input
features? Briefly explain.
- Part B: How many weight and bias parameters are
used to generate the predicted outcome from this network from
the hidden neurons? Briefly explain.
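If it helps to organize your explanation, one possible array layout for this network's parameters (the names and shapes are an assumption for illustration, not required notation):
import numpy as np
## 2 input features -> 4 hidden sigmoid neurons -> 1 sigmoid output
W1 = np.zeros((4, 2)); b1 = np.zeros(4)   ## hidden-layer weights and biases
W2 = np.zeros((1, 4)); b2 = np.zeros(1)   ## output-layer weights and bias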
- Part C: Consider data on a single training example with its predictors stored in the vector \(\mathbf{x}_i\). To prepare for a single step of stochastic gradient descent, differentiate the cost function with respect to each weight parameter in the final layer of the network.
- Part D: Now find the gradient components for each of the remaining parameters: all of the biases and the remaining weights. Clearly show your expression for each component, using the notation from our class and using matrices and vectors wherever possible. Significant deviations from the notation we've used during class will be penalized.
- Part E: Using functions from this unit's labs as a guide, write a function that trains this network via stochastic gradient descent, updating the parameters using one training example at a time. Your function does not need to generalize to scenarios other than this toy data set, so it is okay if certain attributes (such as network architecture and object sizes) are hard-coded into it. (A forward-pass helper you might build from is sketched below.)
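A minimal forward-pass sketch, assuming the array layout suggested after Part B (adapt the names and shapes to the lab's conventions):
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def forward(x, W1, b1, W2, b2):
    ## Forward pass for the 2 -> 4 -> 1 network on a single training example x
    a1 = sigmoid(W1 @ x + b1)    ## hidden-layer activations (length 4)
    a2 = sigmoid(W2 @ a1 + b2)   ## final activated output (length 1)
    return a1, a2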
- Part F: Train your network using an appropriately
selected learning rate (try something small and increase as needed) and
graph the cost function over 100 passes through the entire training data
set for a learning rate you identify as reasonable.
- Part G: Once you think you’ve found an appropriate
learning rate in Part F, create an independent validation set by using
make_moons() again with the same parameters. Add a line to
your previous graph showing the cost function evaluated using this new
data set at each training iteration.
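A plotting sketch for Part G, assuming your training function records one training cost and one validation cost per pass through the data (the cost histories below are placeholders; substitute the values your function records):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
X_val, y_val = make_moons(n_samples=200, noise=0.1)   ## independent validation set
## Placeholder histories, illustrative only
train_costs = np.linspace(0.25, 0.05, 100)
val_costs = np.linspace(0.26, 0.08, 100)
plt.plot(train_costs, label='Training')
plt.plot(val_costs, label='Validation')
plt.xlabel('Pass through the training data')
plt.ylabel('Squared error cost')
plt.legend()
plt.show()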