

Question 1

Consider gradient boosting applied to the linear regression model using the squared error cost function. To begin, recall that linear regression has a closed form solution in this scenario: \[\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
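As a quick check on the closed-form solution above, here is a minimal NumPy sketch. The synthetic data, the seed, and the names `X`, `y`, `w_true`, and `w_hat` are illustrative assumptions, not part of the assignment data. It solves the normal equations directly rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept column + one predictor
w_true = np.array([2.0, -1.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Closed-form least squares: solve (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes should agree up to floating-point error whenever \(\mathbf{X}^T\mathbf{X}\) is invertible.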


Question 2

Recall that the squared error cost function can be written:

\[Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]

Here \(\mathbf{y}\) is the vector of observed outcomes and \(\mathbf{\hat{y}}\) is a vector of model predictions.
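The cost formula above translates almost directly into NumPy. The helper name `squared_error_cost` and the toy vectors below are assumptions made for illustration.

```python
import numpy as np

def squared_error_cost(y, y_hat):
    """Mean squared error: (1/n) * (y - y_hat)^T (y - y_hat)."""
    resid = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return resid @ resid / resid.size

# Toy example: residuals are 0.5, 0.0, and -1.0
y = np.array([3.0, 2.0, 4.0])
y_hat = np.array([2.5, 2.0, 5.0])
cost = squared_error_cost(y, y_hat)  # (0.25 + 0.0 + 1.0) / 3
```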

In Poisson Regression, \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{\hat{w}}}\)

The standard approach to Poisson regression is to estimate the unknown weights using maximum likelihood estimation. However, for this question I’ll ask you to estimate a reasonable set of weights by differentiating the squared error cost function and optimizing it via gradient descent (which is not equivalent to maximum likelihood estimation for this scenario).
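One possible shape for such a routine is sketched below. Differentiating the cost with \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{w}}\) gives the gradient \(-\tfrac{2}{n}\mathbf{X}^T\big[(\mathbf{y} - \mathbf{\hat{y}}) \odot \mathbf{\hat{y}}\big]\), where \(\odot\) is the elementwise product. The helper names `poisson_sq_cost` and `poisson_sq_grad`, the return value of `grad_descent` (final weights plus a list of costs), and the synthetic data are all assumptions; only the argument names `X`, `y`, `w`, `alpha`, and `n_iter` come from the call used in this question.

```python
import numpy as np

def poisson_sq_cost(X, y, w):
    """Squared error cost with predictions y_hat = exp(X w)."""
    resid = y - np.exp(X @ w)
    return resid @ resid / len(y)

def poisson_sq_grad(X, y, w):
    """Gradient of the cost above: -(2/n) X^T [(y - y_hat) * y_hat]."""
    y_hat = np.exp(X @ w)
    return -2.0 / len(y) * (X.T @ ((y - y_hat) * y_hat))

def grad_descent(X, y, w, alpha, n_iter):
    """Plain gradient descent; returns final weights and the cost after each step."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float).copy()
    costs = []
    for _ in range(n_iter):
        w = w - alpha * poisson_sq_grad(X, y, w)
        costs.append(poisson_sq_cost(X, y, w))
    return w, costs

# Synthetic count data (illustration only, not the assignment data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = rng.poisson(np.exp(X_demo @ np.array([0.3, -0.2])))
w_demo, costs_demo = grad_descent(X_demo, y_demo, np.zeros(2), alpha=0.01, n_iter=250)
```

Because \(\mathbf{\hat{y}}\) is an exponential of the linear predictor, a learning rate that is too large can make the predictions and the cost blow up, so a small `alpha` on scaled predictors is a sensible starting point.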

## Setup trial data
import numpy as np
import pandas as pd

ic = pd.read_csv("")
ic_y = ic['bedrooms']
ic_X = ic[['assessed','']]

## Scale X
from sklearn.preprocessing import StandardScaler
ic_Xs = StandardScaler().fit_transform(ic_X)

## Fit via gradient descent: 250 iterations with a 0.01 learning rate
gdres = grad_descent(X=ic_Xs, y=ic_y, w=np.zeros(2), alpha=0.01, n_iter=250)

## Min of cost function

Comments: Poisson regression can be fit via maximum likelihood in sklearn using the PoissonRegressor class. The arguments alpha=0 and fit_intercept=False can be used to mimic the model fit to the example data in this question. However, you should expect somewhat different weight estimates, since maximum likelihood estimation is not equivalent to minimizing squared error loss for the Poisson regression model. Further, in this example, the variables “assessed” and “” are highly correlated, so there are many combinations of weights that will fit the training data similarly well.
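A minimal sketch of that maximum likelihood comparison follows, assuming scikit-learn 0.23 or later (where `PoissonRegressor` is available). The synthetic data and the names `X_demo`, `y_demo`, `mle`, and `w_mle` are assumptions for illustration; swap in the scaled predictors and outcome from this question to compare against your gradient descent weights.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic count data (illustration only, not the assignment data)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 2))
y_demo = rng.poisson(np.exp(X_demo @ np.array([0.3, -0.2])))

# alpha=0 turns off the L2 penalty; fit_intercept=False matches a weights-only model
mle = PoissonRegressor(alpha=0, fit_intercept=False).fit(X_demo, y_demo)
w_mle = mle.coef_  # maximum likelihood weight estimates, one per column of X_demo
```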


Question 3

Consider a simple neural network consisting of 1 hidden layer containing 4 neurons that use the sigmoid activation function. The network will predict a numeric outcome, so the weighted outputs of each neuron contribute directly to the outcome (rather than being passed into another sigmoid function).

For future reference, this network will be applied to the Iowa City home sales data set, using sale.amount as the outcome and, area.lot, and bedrooms as the predictors.

Because the outcome is numeric, you will use the squared error cost function: \[Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]
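To make the architecture concrete, here is a minimal sketch of a forward pass and the cost for a network with 4 sigmoid hidden neurons and a linear output. The parameter names `W1`, `b1`, `w2`, and `b2`, the inclusion of bias terms, the number of predictors `p`, and the random inputs are all assumptions for illustration; the question itself may define the parameters differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, w2, b2):
    """Hidden layer of sigmoid neurons, then a weighted sum with no output activation."""
    H = sigmoid(X @ W1 + b1)  # shape (n, 4): one column per hidden neuron
    return H @ w2 + b2        # shape (n,): numeric predictions

def squared_error_cost(y, y_hat):
    resid = y - y_hat
    return resid @ resid / len(y)

# Random stand-in data and small random starting weights (illustration only)
rng = np.random.default_rng(3)
n, p, h = 6, 3, 4  # p is a hypothetical predictor count; h = 4 hidden neurons
X_demo = rng.normal(size=(n, p))
y_demo = rng.normal(size=n)
W1 = rng.normal(scale=0.1, size=(p, h))
b1 = np.zeros(h)
w2 = rng.normal(scale=0.1, size=h)
b2 = 0.0

y_hat = forward(X_demo, W1, b1, w2, b2)
cost = squared_error_cost(y_demo, y_hat)
```

Since the hidden outputs feed a plain weighted sum, the predictions are not confined to the interval (0, 1), which is what a numeric outcome such as a sale amount requires.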