Directions:
- Remember that homework assignments are to be completed individually. You should cite any classmates you discuss the questions with and any outside resources (besides our course materials) you consult for assistance.
- Please submit a single Jupyter Notebook containing your responses
via P-web. Use markdown chunks to format the assignment and record
responses to questions that involve written answers. Note that you may
type LaTeX-style equations in markdown chunks.
Question 1
Consider gradient boosting using the squared error cost function with
linear regression as the base learner. You may use the fact that linear
regression has a closed form solution for its weights, and you do not
need to re-derive this solution yourself.
- Part A: Let \(\mathbf{r}^{(1)}\) denote the vector of residuals for the first base learner in the ensemble. Write an expression for the weights of the second model in the ensemble (call them \(\hat{\mathbf{w}}^{(2)}\)) in terms of \(\mathbf{r}^{(1)}\).
- Part B: Substitute \(\mathbf{r}^{(1)} = \mathbf{y} - \mathbf{X}\hat{\mathbf{w}}^{(1)}\) (the definition of the residual vector) into the expression you provided in Part A, then argue that \(\hat{\mathbf{w}}^{(2)} = \mathbf{0}\).
- Part C: Provide a conceptual argument for why
gradient boosting fails to improve the linear regression model.
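If you'd like a numerical sanity check before writing your argument, here is a minimal sketch (not required for your submission; the toy data and seed are purely illustrative assumptions) that fits one linear regression and then fits a second linear regression to its residuals:
import numpy as np
from sklearn.linear_model import LinearRegression
## Toy data: linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=100)
m1 = LinearRegression().fit(X, y)      ## first base learner
r1 = y - m1.predict(X)                 ## residual vector r^(1)
m2 = LinearRegression().fit(X, r1)     ## second base learner fit to the residuals
print(m2.coef_, m2.intercept_)         ## both are numerically ~0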
Question 2
Recall that the squared error cost function can be written:
\[Cost = \tfrac{1}{n}(\mathbf{y} -
\mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]
Here \(\mathbf{y}\) is the vector of
observed outcomes and \(\mathbf{\hat{y}}\) is a vector of model
predictions.
Poisson regression uses the form: \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{\hat{w}}}\)
The standard approach to Poisson regression is to estimate the unknown
weights using maximum likelihood estimation. However, for this
question I’ll ask you to estimate a reasonable set of weights by
differentiating the squared error cost function and optimizing it via
gradient descent (which is not equivalent to maximum likelihood
estimation in this scenario).
- Part A: Use the chain rule to show that the gradient of the cost with respect to the weight parameters is \(\tfrac{-2}{n}(\mathbf{y} - e^{\mathbf{X}\mathbf{\hat{w}}})^T \cdot \text{diag}(e^{\mathbf{X}\mathbf{\hat{w}}}) \cdot \mathbf{X}\), where \(\text{diag}(\mathbf{v})\) denotes a diagonal matrix whose diagonal elements are given by \(\mathbf{v}\) and whose off-diagonal elements are zero.
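One way to organize the derivation (a sketch of the outer step only, using the substitution \(\mathbf{u} = \mathbf{y} - e^{\mathbf{X}\mathbf{\hat{w}}}\); you should justify each piece yourself):
\[\frac{\partial Cost}{\partial \mathbf{\hat{w}}} = \frac{2}{n}\,\mathbf{u}^T \frac{\partial \mathbf{u}}{\partial \mathbf{\hat{w}}}, \qquad \frac{\partial \mathbf{u}}{\partial \mathbf{\hat{w}}} = -\,\text{diag}(e^{\mathbf{X}\mathbf{\hat{w}}})\,\mathbf{X}\]
Combining the two pieces yields the stated gradient.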
- Part B: Write your own Python functions to calculate the cost (given \(\mathbf{X}\), \(\mathbf{\hat{w}}\), and \(\mathbf{y}\)) and perform gradient descent for this scenario. You should use the functions provided in this unit's labs as examples, and you should consider using np.diag() to set up the diagonal matrix. For reference, the minimum cost for the example data (see below) should be between 4.69 and 4.70; a minimal skeleton consistent with the example call is sketched after the code.
## Set up trial data
import pandas as pd
import numpy as np
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_y = ic['bedrooms']
ic_X = ic[['assessed','area.living']]
## Scale X
from sklearn.preprocessing import StandardScaler
ic_Xs = StandardScaler().fit_transform(ic_X)
## Fit via grad descent, 250 iter w/ 0.01 learning rate
gdres = grad_descent(X=ic_Xs, y=ic_y, w=np.zeros(2), alpha=0.01, n_iter=250)
## Min of cost function
print(min(gdres[1]))
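As promised above, here is a minimal skeleton consistent with the example call. It is only a sketch: the return structure (a tuple of final weights and per-iteration costs) is an assumption inferred from the gdres[1] usage, and your lab functions may be organized differently.
import numpy as np
def cost(X, w, y):
    ## Squared error cost with Poisson-style predictions y_hat = exp(Xw)
    resid = y - np.exp(X @ w)
    return (resid @ resid) / len(y)
def grad_descent(X, y, w, alpha, n_iter):
    costs = [cost(X, w, y)]
    for _ in range(n_iter):
        y_hat = np.exp(X @ w)
        ## Gradient from Part A: (-2/n) (y - y_hat)^T diag(y_hat) X
        grad = (-2 / len(y)) * (y - y_hat) @ np.diag(y_hat) @ X
        w = w - alpha * grad
        costs.append(cost(X, w, y))
    return w, costs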
- Part C: Graph the cost function for 300 iterations
of gradient descent with a learning rate of 0.001. Comment on
whether this learning rate is appropriate.
- Part D: Graph the cost function for 300 iterations of gradient descent with a learning rate of 0.1. Comment on whether this learning rate is appropriate. (A plotting sketch for Parts C and D appears below.)
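A minimal plotting sketch for Parts C and D, assuming grad_descent records its cost history as sketched above (adjust alpha for each part):
import matplotlib.pyplot as plt
gdres = grad_descent(X=ic_Xs, y=ic_y, w=np.zeros(2), alpha=0.001, n_iter=300)
plt.plot(gdres[1])                 ## cost at each iteration
plt.xlabel('Iteration')
plt.ylabel('Squared error cost')
plt.show()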
Comments: Poisson regression can be fit via maximum
likelihood in sklearn using the
PoissonRegressor function. The arguments
alpha=0 and fit_intercept=False can be used to
mimic the model fit to the example data in this question. However, you
should expect somewhat different weight estimates since maximum
likelihood estimation is not equivalent to minimizing squared error loss
for the Poisson regression model. Further, in this example, the
variables “assessed” and “area.living” are highly correlated, so there
are many combinations of weights that will fit the training data
similarly well.
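For instance, a quick comparison might look like the following (reusing ic_Xs and ic_y from above):
from sklearn.linear_model import PoissonRegressor
## Maximum likelihood fit with no penalty and no intercept, mimicking the example
pr = PoissonRegressor(alpha=0, fit_intercept=False).fit(ic_Xs, ic_y)
print(pr.coef_)   ## expect weights that differ somewhat from gradient descent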
Question 3
Consider a simple neural network consisting of 1 hidden layer
containing 4 neurons that use the sigmoid activation function, and a
final output layer that applies the sigmoid function to a linear
combination of activated outputs from the hidden layer. You are to apply
the squared error cost function to the final activated
output.
The input data used in your network will be generated from the
make_moons() function in sklearn:
## Set up trial data
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, noise=0.1)
Note that the data have 2 input features and a binary outcome.
- Part A: How many weight and bias parameters are
used to generate the neurons in this network from the input
features? Briefly explain.
- Part B: How many weight and bias parameters are
used to generate the predicted outcome from this network from
the hidden neurons? Briefly explain.
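If it helps to organize your explanation, one possible array layout for this network's parameters (the names and shapes are an assumption for illustration, not required notation):
import numpy as np
## 2 input features -> 4 hidden sigmoid neurons -> 1 sigmoid output
W1 = np.zeros((4, 2)); b1 = np.zeros(4)   ## hidden-layer weights and biases
W2 = np.zeros((1, 4)); b2 = np.zeros(1)   ## output-layer weights and bias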
- Part C: Consider data on a single training example with its predictors stored in the vector \(\mathbf{x}_i\). To prepare for a single step of stochastic gradient descent, differentiate the cost function with respect to each weight parameter in the final layer of the network.
- Part D: Now find the gradient components for each of the remaining parameters: all of the biases and the remaining weights. Clearly show your expression for each component, using the notation from our class and using matrices and vectors wherever possible. Significant deviations from the notation we've used during class will be penalized.
- Part E: Using functions from this unit's labs as a guide, write a function that trains this network via stochastic gradient descent, updating the parameters using one training example at a time. Your function does not need to generalize to scenarios other than this toy data set, so it is okay if certain attributes (such as network architecture and object sizes) are hard-coded into it. (A forward-pass helper you might build from is sketched below.)
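A minimal forward-pass sketch, assuming the array layout suggested after Part B (adapt the names and shapes to the lab's conventions):
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def forward(x, W1, b1, W2, b2):
    ## Forward pass for the 2 -> 4 -> 1 network on a single training example x
    a1 = sigmoid(W1 @ x + b1)    ## hidden-layer activations (length 4)
    a2 = sigmoid(W2 @ a1 + b2)   ## final activated output (length 1)
    return a1, a2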
- Part F: Train your network using an appropriately
selected learning rate (try something small and increase as needed) and
graph the cost function over 100 passes through the entire training data
set for a learning rate you identify as reasonable.
- Part G: Once you think you’ve found an appropriate
learning rate in Part F, create an independent validation set by using
make_moons() again with the same parameters. Add a line to
your previous graph showing the cost function evaluated using this new
data set at each training iteration.
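A plotting sketch for Part G, assuming your training function records one training cost and one validation cost per pass through the data (the cost histories below are placeholders; substitute the values your function records):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
X_val, y_val = make_moons(n_samples=200, noise=0.1)   ## independent validation set
## Placeholder histories, illustrative only
train_costs = np.linspace(0.25, 0.05, 100)
val_costs = np.linspace(0.26, 0.08, 100)
plt.plot(train_costs, label='Training')
plt.plot(val_costs, label='Validation')
plt.xlabel('Pass through the training data')
plt.ylabel('Squared error cost')
plt.legend()
plt.show()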