Directions:
The Minkowski distance between two points is defined as:
\[D(\mathbf{x_1}, \mathbf{x_2}) = \bigg(\sum_{j=1}^{p}|x_{1,j} - x_{2,j}|^k\bigg)^{1/k}\]
Part B: Write a function that calculates the Minkowski distance between an input data point, x, and an input prototype, p, for a user-specified value of \(k\).
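As an illustration, a minimal sketch of such a function might look like the following (the name minkowski_dist and its argument names are placeholders, not something required by the assignment):

import numpy as np

def minkowski_dist(x, p, k):
    # Minkowski distance between a data point x and a prototype p
    # for a user-specified exponent k
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.sum(np.abs(x - p) ** k) ** (1 / k)

With k = 1 this reduces to the Manhattan distance and with k = 2 to the Euclidean distance, which is a quick way to sanity-check the function.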
Part C: Considering the data point [1,1.5,0.5] and the prototype [0,0,0], use a while loop and the function you created in Part B to find a value of \(k\) such that the Minkowski distance between these points is within a tolerance of 0.001 of their maximum coordinate-wise distance.
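One possible shape for this search, assuming the Part B function is named minkowski_dist as in the sketch above and that \(k\) is increased in whole-number steps, is:

import numpy as np

x = np.array([1, 1.5, 0.5])
proto = np.array([0, 0, 0])

# the Minkowski distance approaches the maximum coordinate-wise
# (Chebyshev) distance as k grows large
max_dist = np.max(np.abs(x - proto))

k = 1
while abs(minkowski_dist(x, proto, k) - max_dist) >= 0.001:
    k += 1  # a step size of 1 is one simple choice; finer steps would also work

print(k)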
Part D: Consider the data given below:

X = pd.DataFrame([[1,1.5,0.5], [-1,0,0.5], [-1,0.5,0], [-0.5,0.5,2], [2,2,0.5]], columns = ['x1','x2','x3'])
Implement your own \(k\)-means
algorithm using Minkowski distance and the value of \(k\) (in reference to Minkowski distance)
you identified in Part C to group these data-points into 2 clusters. Do
not worry about standardization or re-scaling.
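A bare-bones sketch of one such implementation is given below. It assumes the minkowski_dist function from the earlier sketch, the DataFrame X defined above, and a placeholder value k = 12 where the value found in Part C belongs; it also keeps the usual mean-based prototype update, which strictly speaking is only optimal for squared Euclidean distance.

import numpy as np

def my_kmeans(data, n_clusters=2, k=12, n_iter=20, seed=0):
    # basic Lloyd-style k-means using the Minkowski distance for assignments
    pts = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize the prototypes by sampling data points without replacement
    centers = pts[rng.choice(len(pts), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins its nearest prototype
        dists = np.array([[minkowski_dist(p, c, k) for c in centers] for p in pts])
        labels = dists.argmin(axis=1)
        # update step: each prototype becomes the mean of its assigned points
        new_centers = np.array([pts[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(n_clusters)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

cluster_labels, cluster_centers = my_kmeans(X, n_clusters=2, k=12)
print(cluster_labels)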
The data stored at this URL: https://remiller1450.github.io/data/Colleges2019_Complete.csv
contain information on all primarily undergraduate colleges in the
United States with at least 400 enrolled students for the year 2019. The
given data have been further restricted to exclude any colleges that do
not report one or more of the recorded variables.
Use the variables ['Cost','Net_Tuition','Debt_median','Salary10yr_median']
to
identify a group of “high value” colleges that tend to yield higher than
average median salaries for their alumni with lower debt levels and
cheaper costs/net tuition. Your analysis should include results using
both \(k\)-means and hierarchical
clustering. You should record the members of the “high value” cluster along with their silhouette scores for each method, and present this information via a table (descriptive statistics) or a visualization (a graph involving the silhouette scores and the members of this cluster for each clustering method). Write a short paragraph using a scientific tone and an appropriate level of detail describing your methods and results.
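A rough sketch of this workflow is shown below; the choice of two clusters, the absence of re-scaling, and the random seed are all assumptions that your own analysis should justify or revisit.

import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_samples

colleges = pd.read_csv("https://remiller1450.github.io/data/Colleges2019_Complete.csv")
features = colleges[['Cost', 'Net_Tuition', 'Debt_median', 'Salary10yr_median']]

# k-means clustering
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
km_sil = silhouette_samples(features, km.labels_)

# hierarchical (agglomerative) clustering
hc = AgglomerativeClustering(n_clusters=2).fit(features)
hc_sil = silhouette_samples(features, hc.labels_)

# cluster-wise means can help decide which cluster looks "high value"
print(features.groupby(km.labels_).mean())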
The data stored at this URL:
https://remiller1450.github.io/data/mnist_small.csv
are a
subset of size \(n = 6000\) from the MNIST database of
handwritten digits. This database contains 28 by 28 pixel grayscale
images of handwritten digits 0-9. The sample contained at the given URL
has been flattened, meaning each row represents an example image with
784 columns used to represent grayscale intensities of that image’s
pixels.
Part A: Separate the label column from the 784 pixel columns, thereby creating the objects pixels and labels.
Part B: Create a reshaped version of pixels that is a 3-dimensional numpy array where the first axis (axis 0) represents the number of samples. Use the function io.imshow() with the argument cmap = plt.cm.Greys (assuming you loaded the matplotlib library under the commonly used alias “plt”) to confirm that the 6th sample in this array is the digit “3”.
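A sketch of these two steps is given below; it assumes the column holding the digit labels is literally named label and that the skimage io module and matplotlib are available.

import pandas as pd
import matplotlib.pyplot as plt
from skimage import io

mnist = pd.read_csv("https://remiller1450.github.io/data/mnist_small.csv")

# Part A: separate the label column from the 784 pixel columns
labels = mnist['label']
pixels = mnist.drop('label', axis=1)

# Part B: reshape the flattened rows into a (samples, 28, 28) array
pixels_3d = pixels.to_numpy().reshape(-1, 28, 28)

# the 6th sample sits at index 5
io.imshow(pixels_3d[5], cmap=plt.cm.Greys)
plt.show()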
Part C: Perform PCA with n_components = 10 on pixels. Next, use the inverse_transform() method and reshape the result to match the dimensions of the 3-d numpy array you created in Part B.
Part D: Use the np.maximum() function to replace any negative intensities produced by the inverse transformation with 0, then visualize the 6th sample. How does this compare with the image you created in Part B? Briefly explain how these images are related.
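Continuing the sketch above, the PCA, reconstruction, clipping, and visualization steps could look like this (scores and reconstructed are placeholder names):

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
scores = pca.fit_transform(pixels)          # 10-dimensional representation of each image

# map the scores back to 784 intensities and restore the (samples, 28, 28) shape
reconstructed = pca.inverse_transform(scores).reshape(-1, 28, 28)

# replace any negative intensities produced by the inverse transformation with 0
reconstructed = np.maximum(reconstructed, 0)

io.imshow(reconstructed[5], cmap=plt.cm.Greys)
plt.show()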
Part E: If n_components = 10 is used to “compress” these data, how many values need to be stored? Hint: Think about the dimensions of the data after using PCA for dimension reduction as well as the number of values that are necessary to reconstruct the original shape.
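When reasoning about the hint, it may help to print the shapes of the pieces the fitted PCA object stores (again continuing the sketch above); the final count is left for you to work out.

print(scores.shape)           # the reduced representation of the samples
print(pca.components_.shape)  # the principal direction vectors
print(pca.mean_.shape)        # the per-pixel means subtracted before projection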