Directions:
The Minkowski distance between two points is defined as:
\[D(\mathbf{x_1}, \mathbf{x_2}) = \bigg(\sum_{j=1}^{p}|x_{1,j} - x_{2,j}|^k\bigg)^{1/k}\]
Part B: Write a function that calculates the Minkowski distance between an input data point, x, and an input prototype, p, for a user-specified value of \(k\).
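As an illustration, a minimal sketch of such a function might look like the following (the name minkowski_dist and its argument names are placeholders, not something required by the assignment):

import numpy as np

def minkowski_dist(x, p, k):
    # Minkowski distance between a data point x and a prototype p
    # for a user-specified exponent k
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.sum(np.abs(x - p) ** k) ** (1 / k)

With k = 1 this reduces to the Manhattan distance and with k = 2 to the Euclidean distance, which is a quick way to sanity-check the function.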
Part C: Considering the data point [1,1.5,0.5] and the prototype [0,0,0], use a while loop and the function you created in Part B to find a value of \(k\) such that the Minkowski distance between these points is within a tolerance of 0.001 of their maximum coordinate-wise distance.
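One possible shape for this search, assuming the Part B function is named minkowski_dist as in the sketch above and that \(k\) is increased in whole-number steps, is:

import numpy as np

x = np.array([1, 1.5, 0.5])
proto = np.array([0, 0, 0])

# the Minkowski distance approaches the maximum coordinate-wise
# (Chebyshev) distance as k grows large
max_dist = np.max(np.abs(x - proto))

k = 1
while abs(minkowski_dist(x, proto, k) - max_dist) >= 0.001:
    k += 1  # a step size of 1 is one simple choice; finer steps would also work

print(k)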
Part D: Consider the data given below:

X = pd.DataFrame([[1,1.5,0.5], [-1,0,0.5], [-1,0.5,0], [-0.5,0.5,2], [2,2,0.5]], columns = ['x1','x2','x3'])
Implement your own \(k\)-means
algorithm using Minkowski distance and the value of \(k\) (in reference to Minkowski distance)
you identified in Part C to group these data-points into 2 clusters. Do
not worry about standardization or re-scaling.
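A bare-bones sketch of one such implementation is given below. It assumes the minkowski_dist function from the earlier sketch, the DataFrame X defined above, and a placeholder value k = 12 where the value found in Part C belongs; it also keeps the usual mean-based prototype update, which strictly speaking is only optimal for squared Euclidean distance.

import numpy as np

def my_kmeans(data, n_clusters=2, k=12, n_iter=20, seed=0):
    # basic Lloyd-style k-means using the Minkowski distance for assignments
    pts = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize the prototypes by sampling data points without replacement
    centers = pts[rng.choice(len(pts), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins its nearest prototype
        dists = np.array([[minkowski_dist(p, c, k) for c in centers] for p in pts])
        labels = dists.argmin(axis=1)
        # update step: each prototype becomes the mean of its assigned points
        new_centers = np.array([pts[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(n_clusters)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

cluster_labels, cluster_centers = my_kmeans(X, n_clusters=2, k=12)
print(cluster_labels)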
The data stored at this URL: https://remiller1450.github.io/data/Colleges2019_Complete.csv
contain information on all primarily undergraduate colleges in the
United States with at least 400 enrolled students for the year 2019. The
given data have been further restricted to exclude any colleges that do
not report one or more of the recorded variables.
Use the variables ['Cost','Net_Tuition','Debt_median','Salary10yr_median']
to
identify a group of “high value” colleges that tend to yield higher than
average median salaries for their alumni with lower debt levels and
cheaper costs/net tuition. Your analysis should include results using
both \(k\)-means and hierarchical
clustering. You should record the members of the “high value” cluster along with their silhouette scores for each method, and present this information via a table (descriptive statistics) or a visualization (a graph involving the silhouette scores and the members of this cluster for each clustering method). Write a short paragraph using a scientific tone and an appropriate level of detail describing your methods and results.
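A rough sketch of this workflow is shown below; the choice of two clusters, the absence of re-scaling, and the random seed are all assumptions that your own analysis should justify or revisit.

import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_samples

colleges = pd.read_csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
features = colleges[['Cost', 'Net_Tuition', 'Debt_median', 'Salary10yr_median']]

# k-means clustering
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
km_sil = silhouette_samples(features, km.labels_)

# hierarchical (agglomerative) clustering
hc = AgglomerativeClustering(n_clusters=2).fit(features)
hc_sil = silhouette_samples(features, hc.labels_)

# cluster-wise means can help decide which cluster looks "high value"
print(features.groupby(km.labels_).mean())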
The data stored at this URL:
https://remiller1450.github.io/data/mnist_small.csv
are a
subset of size \(n = 6000\) from the MNIST database of
handwritten digits. This database contains 28 by 28 pixel grayscale
images of handwritten digits 0-9. The sample contained at the given URL
has been flattened, meaning each row represents an example image with
784 columns used to represent grayscale intensities of that image’s
pixels.
Part A: Separate the label column from the 784 pixel columns, thereby creating the objects pixels and labels.
Part B: Create a reshaped version of pixels that is a 3-dimensional numpy array where the first axis (axis 0) represents the number of samples. Use the function io.imshow() with the argument cmap = plt.cm.Greys (assuming you loaded the matplotlib library under the commonly used alias “plt”) to confirm that the 6th sample in this array is the digit “3”.
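A sketch of these two steps is given below; it assumes the column holding the digit labels is literally named label and that the skimage io module and matplotlib are available.

import pandas as pd
import matplotlib.pyplot as plt
from skimage import io

mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")

# Part A: separate the label column from the 784 pixel columns
labels = mnist['label']
pixels = mnist.drop('label', axis=1)

# Part B: reshape the flattened rows into a (samples, 28, 28) array
pixels_3d = pixels.to_numpy().reshape(-1, 28, 28)

# the 6th sample sits at index 5
io.imshow(pixels_3d[5], cmap=plt.cm.Greys)
plt.show()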
Part C: Perform PCA with n_components = 10 on pixels. Next, use the inverse_transform() method and reshape the result to match the dimensions of the 3-d numpy array you created in Part B.
Part D: Use the np.maximum() function to replace any negative intensities produced by the inverse transformation with 0, then visualize the 6th sample. How does this compare with the image you created in Part B? Briefly explain how these images are related.
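Continuing the sketch above, the PCA, reconstruction, clipping, and visualization steps could look like this (scores and reconstructed are placeholder names):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
scores = pca.fit_transform(pixels)          # 10-dimensional representation of each image

# map the scores back to 784 intensities and restore the (samples, 28, 28) shape
reconstructed = pca.inverse_transform(scores).reshape(-1, 28, 28)

# replace any negative intensities produced by the inverse transformation with 0
reconstructed = np.maximum(reconstructed, 0)

io.imshow(reconstructed[5], cmap=plt.cm.Greys)
plt.show()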
Part E: If n_components = 10 is used to “compress” these data, how many values need to be stored? Hint: Think about the dimensions of the data after using PCA for dimension reduction as well as the number of values that are necessary to reconstruct the original shape.
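When reasoning about the hint, it may help to print the shapes of the pieces the fitted PCA object stores (again continuing the sketch above); the final count is left for you to work out.

print(scores.shape)           # the reduced representation of the samples
print(pca.components_.shape)  # the principal direction vectors
print(pca.mean_.shape)        # the per-pixel means subtracted before projection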