Directions:
\(~\)
Consider gradient boosting applied to the linear regression model using the squared error cost function. To begin, recall that linear regression has a closed form solution in this scenario: \[\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
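For concreteness, this closed-form estimate can be computed directly with NumPy; the sketch below uses small simulated placeholder data rather than the course data.

## Closed-form least squares estimate (placeholder data, not the course data)
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=100)
## Solve (X^T X) w = X^T y rather than inverting X^T X explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)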
\(~\)
Recall that the squared error cost function can be written:
\[Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]
Here \(\mathbf{y}\) is the vector of observed outcomes and \(\mathbf{\hat{y}}\) is a vector of model predictions.
In Poisson regression, \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{\hat{w}}}\).
The standard approach to Poisson regression is to estimate the unknown weights using maximum likelihood estimation. However, for this question I’ll ask you to estimate a reasonable set of weights by differentiating the squared error cost function and optimizing it via gradient descent (which is not equivalent to maximum likelihood estimation for this scenario).
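To make that concrete, differentiating the cost with respect to the weight vector \(\mathbf{w}\) (applying the chain rule, since \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{w}}\)) gives a gradient that can be written in matrix form as: \[\nabla_{\mathbf{w}} Cost = -\tfrac{2}{n}\mathbf{X}^T\,\text{diag}(\mathbf{\hat{y}})\,(\mathbf{y} - \mathbf{\hat{y}})\] where \(\text{diag}(\mathbf{\hat{y}})\) is the diagonal matrix whose entries are the elements of \(\mathbf{\hat{y}}\).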
You may find it helpful to use np.diag() to set up a diagonal matrix. For reference, the minimum cost for the example data (see below) should be between 4.69 and 4.70.

## Setup trial data
import numpy as np
import pandas as pd
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_y = ic['bedrooms']
ic_X = ic[['assessed','area.living']]
## Scale X
from sklearn.preprocessing import StandardScaler
ic_Xs = StandardScaler().fit_transform(ic_X)
## Fit via grad descent, 250 iter w/ 0.01 learning rate
gdres = grad_descent(X=ic_Xs, y=ic_y, w=np.zeros(2), alpha=0.01, n_iter=250)
## Min of cost function
print(min(gdres[1]))
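For reference, a minimal sketch of a grad_descent() function consistent with the call above is shown below; it assumes the function returns the sequence of weight vectors and the cost at each iteration (so gdres[1] is the cost history) and uses the gradient given earlier.

import numpy as np

def grad_descent(X, y, w, alpha, n_iter):
    ## Gradient descent for Poisson regression with the squared error cost
    n = X.shape[0]
    y = np.asarray(y)
    w_hist, cost_hist = [], []
    for _ in range(n_iter):
        y_hat = np.exp(X @ w)                            ## Poisson predictions
        resid = y - y_hat
        cost_hist.append((resid @ resid) / n)            ## squared error cost
        grad = -(2 / n) * X.T @ np.diag(y_hat) @ resid   ## gradient of the cost
        w = w - alpha * grad                             ## descent update
        w_hist.append(w)
    return w_hist, cost_hist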
Comments: Poisson regression can be fit via maximum likelihood in sklearn using the PoissonRegressor function. The arguments alpha=0 and fit_intercept=False can be used to mimic the model fit to the example data in this question. However, you should expect somewhat different weight estimates since maximum likelihood estimation is not equivalent to minimizing squared error loss for the Poisson regression model. Further, in this example, the variables “assessed” and “area.living” are highly correlated, so there are many combinations of weights that will fit the training data similarly well.
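For example, this comparison might look something like the sketch below, which reuses the ic_Xs and ic_y objects created above; the coefficients it prints will not exactly match the gradient descent estimates for the reasons just described.

from sklearn.linear_model import PoissonRegressor
## Fit Poisson regression by maximum likelihood (no penalty, no intercept)
pr = PoissonRegressor(alpha=0, fit_intercept=False)
pr.fit(ic_Xs, ic_y)
print(pr.coef_)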
\(~\)
Consider a simple neural network consisting of 1 hidden layer containing 4 neurons that use the sigmoid activation function. The network will predict a numeric outcome, so the weighted outputs of each neuron contribute directly to the outcome (rather than being passed into another sigmoid function).
For future reference, this network will be applied to the Iowa City home sales data set, using sale.amount as the outcome and area.living, area.lot, and bedrooms as the predictors.
Because the outcome is numeric, you will use the squared error cost function: \[Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]
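As a rough sketch of what this architecture computes, the forward pass below uses placeholder weights and includes bias terms in both layers (adjust this to match however your own implementation parameterizes the network).

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, w2, b2):
    H = sigmoid(X @ W1 + b1)   ## hidden layer: 4 sigmoid neurons, shape (n, 4)
    return H @ w2 + b2         ## output: weighted sum of neuron outputs, no activation

## Placeholder weights for 3 predictors and 4 hidden neurons
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0
y_hat = forward(X, W1, b1, w2, b2)
print(np.mean((y - y_hat) ** 2))   ## squared error cost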
Split the data using random_state=7, then separate the predictors from the outcome.

Standardize the outcome, sale.amount, by subtracting its mean then dividing by its standard deviation (given by np.std), and re-scale the predictors using MinMaxScaler(). Note: these steps provide numerical stability.

Use MLPRegressor, the implementation of neural networks in sklearn, to fit the same model to the same data. Be sure to specify the proper activation function, number of hidden layers, and number of neurons. Then, print the fitted model’s final loss and verify that it’s roughly 0.45.
Note: this loss should be half of what you found because
sklearn
defines squared error cost as \(\tfrac{1}{2n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2\),
but we haven’t included the “2”.
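For reference, a sketch of these steps might look like the code below. The solver settings, iteration limit, and random_state passed to MLPRegressor are assumptions not specified above (and the data split is omitted here for brevity), so your exact final loss may differ somewhat from 0.45.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor

ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

## Standardize the outcome and re-scale the predictors to [0, 1]
y = (ic['sale.amount'] - ic['sale.amount'].mean()) / np.std(ic['sale.amount'])
X = MinMaxScaler().fit_transform(ic[['area.living', 'area.lot', 'bedrooms']])

## 1 hidden layer of 4 sigmoid ("logistic") neurons, linear output
net = MLPRegressor(hidden_layer_sizes=(4,), activation='logistic',
                   random_state=7, max_iter=2000)
net.fit(X, y)
print(net.loss_)   ## uses sklearn's 1/(2n) squared error convention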