Homework #2

Directions:

Homework must be completed individually
Please type your responses, clearly separating each question and sub-question (A, B, C, etc.)
You may type your written answers using Markdown chucks in a JupyterNotebook, or you may use any word processing software and submit your Python code separately
Questions that require Python coding should include all commands used to reach the answer, nothing more and nothing less
Submit your work via P-web

Question #1 (Gradient Descent)

Recall that the cost function for squared error loss can be written as:

\[Cost = \tfrac{1}{n}(\mathbf{y} - \mathbf{\hat{y}})^T(\mathbf{y} - \mathbf{\hat{y}})\]

Here \(\mathbf{y}\) is the vector of observed outcomes and \(\mathbf{\hat{y}}\) is a vector of model predictions.

In Poisson Regression, \(\mathbf{\hat{y}} = e^{\mathbf{X}\mathbf{\hat{w}}}\)

Standard procedure for Poisson regression is estimate the unknown weights using maximum likelihood estimation. However, for this question I’ll ask you to estimate a reasonable set of weights by differentiating the squared error cost function and optimizing it via gradient descent (which is not equivalent to maximum likelihood estimation for this scenario).

Part A: Use chain rule to show that the \(Gradient = \tfrac{-2}{n}\mathbf{X}^T(diag(e^{\mathbf{X}\mathbf{\hat{w}}}))(\mathbf{y} -e^{\mathbf{X}\mathbf{\hat{w}}})\) where \(diag(v)\) is used to denote a diagonal matrix whose diagonal elements are given by \(v\) and whose off-diagonal elements are zero.
Part B: Write Python functions that calculate the cost (given \(X\), \(\hat{w}\), and \(y\)) and perform gradient descent for this scenario. You should use the functions in Lab 4 (part 1) as examples. You may also consider using np.diag() to set up a diagonal matrix. For reference, the minimum cost for the example data (see below) should be between 4.69 and 4.70.

## Setup trial data
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic_y = ic['bedrooms']
ic_X = ic[['assessed','area.living']]

## Scale X
from sklearn.preprocessing import StandardScaler
ic_Xs = StandardScaler().fit_transform(ic_X)

## Fit via grad descent, 250 iter w/ 0.01 learning rate 
gdres = grad_descent(X=ic_Xs,y=ic_y,w=np.zeros(2),alpha=0.01, n_iter=250)

## Min of cost function
print(min(gdres[1]))

Part C: Graph the cost function for 300 iterations of gradient descent with a learning rate of 0.001. Comment on whether this learning rate is appropriate.
Part D: Graph the cost function for 300 iterations of gradient descent with a learning rate of 0.1. Comment on whether this learning rate is appropriate.

Comments: Poisson regression can be fit via maximum likelihood in sklearn using the PoissonRegressor function. The arguments alpha=0 and fit_intercept=False can be used to mimic the model fit to the example data in this question. However, you should expect somewhat different weight estimates since maximum likelihood estimation is not equivalent to minimizing squared error loss for the Poisson regression model. Further, in this example, the variables “assessed” and “area.living” are highly collinear, so there are many combinations of weights that will fit the training data similarly well.

\(~\)

Question #2 (Softmax Regression vs. kNN)

For this question you should use the dataset available here:

https://remiller1450.github.io/data/beans.csv

These data were originally introduced in Homework #1. They were constructed using a computer vision system that segmented 13,611 images of 7 types of dry beans, extracting 16 features (12 dimensional features, a 4 shape form features). Additional details are contained in this paper.

Part A: Read these data into Python, perform a 90-10 training-testing split with random_state=1, and separate the outcome (Class) from the predictors. Then setup a pre-processing pipeline that includes three steps: 1. Normalizing transformation using the Yeo-Johnson method, 2. Rescaling using a min-max scaler, 3. Model fitting using Softmax regression with no regularization. Use the “saga” solver with a maximum of 1000 iterations. You may ignore the “divide by zero encountered …” and “overflow encountered …” warning messages.
Part B: Use 5-fold cross-validation to estimate the out-of-sample classification accuracy achieved by the modeling pipeline from Part A.
Part C: Use the GridSearchCV function to compare the out-of-sample classification accuracy of the Softmax regression model from Parts A/B with a k-nearest neighbors classifier using 30 neighbors, distance weighting, and euclidean distance. Which model performs better?
Part D: Reconfigure the pipeline you created in Part A to use the best model from Part C, then find the classification accuracy of this model on the test set.

\(~\)

Question #3 (Linear Regression, Feature Expansion, and Regularization)

For this question you should use the dataset available here:

https://remiller1450.github.io/data/Ozone.csv

These data contain Ozone concentrations in New York City in 1973. For context, Ozone is a pollutant that is has been linked to numerous health problems. The goal of this application is develop methods for accurately predicting the Ozone concentration on a future date based upon the expected solar radiation, wind speed, and temperature on that date.

Part A: Read these data into Python and perform an 80-20 train-test split using random_state=3. Then, separate the outcome from the predictors (dropping the “Day” column), and create a pre-processing pipeline that performs standardization before fitting a linear regression model.
Part B: Use 5-fold cross-validation to estimate ‘neg_root_mean_squared_error’ score of the modeling steps described in Part A.
Part C: Add a step to the pipeline you created in Part A that uses SplineTransformer to expand original set of features to facilitate non-linear relationships. Then evaluate this new approach using 5-fold cross-validation (using the ‘neg_root_mean_squared_error’ score). Does the feature expansion seem beneficial, or does it appear to result in overfitting?
Part D: Modify the model step of your pipeline from Part C to use RidgeCV with a log-spaced sequence of regularization amounts going from 0.01 (1.0e-02) to 100 (1.0e+03). Use 5-fold cross-validation and neg_root_mean_squared_error as the scoring metric. Report the optimal amount of regularization.
Part E: Does the introduction of regularization improve the predictive ability of the model fit during Part C? Briefly explain.

\(~\)

Question #4 (Modeling Choices)

Shown below are a hypothetical example data set that we discussed in the early stages of the semester. You will not be given access to the underlying data, so you should base your assessments on the information that is visibly available:

Part A: Consider a k-nearest neighbors classifier (with sensibly chosen tuning parameters) and an ordinary logistic regression model without any modifications to the input features, \(x_1\) and \(x_2\). Which model would you expect to perform better on these data?
Part B: Now consider adding regularization to the logistic regression model described in Part A. Would expect this model to perform better than both, one, or none of the models described in Part A if it were applied to new data that was not used for model training? Briefly explain.
Part C: Is a polynomial expansion of \(x_1\) and \(x_2\) likely to improve the performance of a logistic regression classifier? Briefly explain.
Part D: Is discretization as a feature expansion strategy likely to improve the performance of logistic regression classifier? Briefly explain.
Part E: Suppose the data shown in this visualization are used as a training set for a model that will make predictions on new data points that encountered one-at-a-time. Between k-nearest neighbors and logistic regression, which approach needs to store more information to generate its prediction on a new observation? Briefly explain.