This lab focuses on clustering, or methods for identifying meaningful groupings of similar data-points.
Directions (Please read before starting)
\(~\)
This lab will continue using the factoextra package, and will also use functions from the cluster package.
# load the following packages
# install.packages("cluster")
library(cluster)
library(factoextra)
library(ggplot2)
library(dplyr)
\(~\)
Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]
Clustering approaches fall into two broad categories: partitional methods, which divide the data into a pre-specified number of non-overlapping clusters, and hierarchical methods, which build a nested hierarchy of clusters.
This lab will focus on partitional clustering, in particular \(k\)-means and PAM clustering.
\(~\)
As a motivating example, consider the “USArrests” data from the PCA lab. Shown below are two clusters (red and blue) from a partitional method.
Which cluster should contain Virginia? Why?
Good clusters contain data-points that are similar to each other (cohesive), and also dissimilar from the members of other clusters (distinct).
To assess “similarity” we’ll use distance. There are many possible ways to define distance; one of the most popular is Euclidean distance:
\[d_{a,b} = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]
In words, the Euclidean distance between two data-points, \(x_a\) and \(x_b\), is the square root of the sum of squared differences in their values across each variable (we index these variables by \(j \in \{1, \ldots, p\}\)).
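If you'd like to compute such a distance in R, the dist() function uses Euclidean distance by default. For example, a quick sketch of the distance between two states in the USArrests data (used throughout this lab):
data("USArrests")
# Euclidean distance between the rows for Iowa and Illinois
dist(USArrests[c("Iowa", "Illinois"), ], method = "euclidean")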
\(~\)
Perhaps the most well-known clustering approach is k-means clustering, which groups \(n\) observations into \(k\) distinct groups (clusters) such that within cluster variation is minimized.
The original \(k\)-means algorithm solves for the \(k\) cluster centroids (\(\mu_1, \ldots, \mu_k\)) that minimize the total within-cluster sum of squares (“Total Withinss”):
\[ \text{Total Withinss} = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2 \]
To apply \(k\)-means, the analyst must specify \(k\), the number of clusters; the algorithm then groups the observations so that the total squared distance from each data point to the center of its assigned cluster is minimized.
The gif below illustrates the progression of the \(k\)-means algorithm for \(k = 2\):
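To make the algorithm's two alternating steps concrete (assign each observation to its nearest centroid, then recompute each centroid as the mean of its assigned observations), here is a minimal hand-rolled sketch. It is only an illustration; in practice you should use the kmeans() function introduced in the next section.
# Toy illustration of the k-means algorithm (illustration only; use kmeans() in practice)
set.seed(1)
X <- as.matrix(USArrests)                             # data matrix (n rows, p columns)
k <- 2
centroids <- X[sample(nrow(X), k), , drop = FALSE]    # random starting centroids
for (iter in 1:10) {
  # Step 1: assign each observation to its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  assignment <- apply(d, 1, which.min)
  # Step 2: recompute each centroid as the mean of its assigned observations
  centroids <- apply(X, 2, function(col) tapply(col, assignment, mean))
}
table(assignment)   # cluster sizes after 10 iterations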
\(~\)
In R, the kmeans() function is used to perform \(k\)-means clustering. For example, we can use the code below to assign the states in the USArrests data to two clusters:
data("USArrests")
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
The centers argument is the number of clusters we want (ie: \(k\)), while nstart is the number of random starting configurations to try. There is no guarantee that \(k\)-means finds a globally optimal set of cluster assignments from any single starting point, so kmeans() runs the algorithm nstart times and keeps the best solution (the one with the lowest total within-cluster sum of squares). That said, in most circumstances the starting point has little or no impact on the clustering results.
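To see what nstart is doing, one option is to compare the total within-cluster sum of squares (stored as tot.withinss in the output) for a single random start versus 25 random starts. For these data the two runs will usually land on the same solution, but with more clusters or messier data the difference can be noticeable. A quick sketch (set.seed() is used only so the random starts are reproducible):
set.seed(123)
# best solution found from a single random start vs. from 25 random starts
kmeans(USArrests, centers = 2, nstart = 1)$tot.withinss
kmeans(USArrests, centers = 2, nstart = 25)$tot.withinss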
Next, notice that kmeans() outputs a list containing 9 different objects. The code below prints the coordinates of the cluster centroids (multi-dimensional means), which are stored in the centers object within that list:
k2$centers
## Murder Assault UrbanPop Rape
## 1 11.857143 255.0000 67.61905 28.11429
## 2 4.841379 109.7586 64.03448 16.24828
These centroids are often called prototypes since each of them is a central point that can be used to describe the average characteristics of that cluster’s members.
If we are interested in the cluster assignments of individual data-points, we can find them in the cluster object returned by kmeans(). The code below demonstrates how to find the assigned clusters for the states “Iowa” and “Illinois”, which are row names in the USArrests data frame.
k2$cluster[c("Iowa", "Illinois")]
## Iowa Illinois
## 2 1
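Similarly, the size component of that list records how many observations were assigned to each cluster:
k2$size   # number of states in each cluster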
Question #1: Apply \(k\)-means clustering with \(k = 4\) to the USArrests data. Then, find the centroid prototype of the cluster that contains Minnesota. How does the cluster containing Minnesota compare to the other clusters (in terms of labels you derive from the cluster centroids)?
\(~\)
We can visualize the results of \(k\)-means using fviz_cluster(), which displays the clusters on a scatter plot that uses the first two principal component scores as the X-Y coordinates for each observation.
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
fviz_cluster(k2, data = USArrests)
Notice that fviz_cluster() labels each data-point using the rownames of the data object. You should recall that you can change these using the rownames() function. It’s also worth knowing that you can use the argument repel = TRUE (in fviz_cluster()) to prevent overlapping labels.
Finally, instead of using PC1 and PC2, you can specify any of the original variables:
fviz_cluster(k2, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = FALSE, repel = TRUE, title = "Without Standardization")
You should note that clusters will generally appear most distinct when viewed using the PC1 and PC2 axes.
Question #2 (Part A): Use the scale() function to standardize the USArrests data, then use fviz_cluster() to graph the \(k\)-means clusters (for \(k = 2\)) found using the standardized data, with “UrbanPop” and “Murder” as the \(X\) and \(Y\) variables (similar to the graph in the previous example).
Question #2 (Part B): Explain why the clusters are somewhat different after standardizing the data. Do you think the data should be standardized prior to clustering? Briefly explain.
\(~\)
As previously mentioned, the \(k\)-means algorithm requires the user to supply the value of \(k\) before the algorithm can be run.
As a general guideline, \(k\) should be large enough that clusters are relatively homogeneous, but it should also be small enough to avoid unnecessary complexity. The next few sections will cover three different methods for choosing a reasonable value of \(k\).
\(~\)
The \(k\)-means algorithm solves for centroids that minimize the total within-cluster sum of squares, \(\sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j}(x_i - \mu_j)^2\). This quantity always decreases as \(k\) increases, but there are diminishing returns. Eventually, adding more cluster centroids won’t lead to much improvement in total withinss.
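If you'd like to see this pattern directly, one option is to compute the total withinss yourself for several values of \(k\); a minimal sketch (the fviz_nbclust() plot below conveys essentially the same information):
# total within-cluster sum of squares for k = 1 through 8
sapply(1:8, function(k) kmeans(USArrests, centers = k, nstart = 25)$tot.withinss)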
With this relationship in mind, we can graph total withinss, abbreviated “wss”, versus different choices of \(k\) using the fviz_nbclust() function:
fviz_nbclust(USArrests, FUNcluster = kmeans, method = "wss", k.max = 8)
The Elbow Method suggests the optimal number of clusters corresponds with this plot’s “elbow”, or the point where the curve starts flattening/bending towards more gradual decreases in wss as \(k\) increases.
For the USArrests data (shown in this example), the elbow method suggests either 2 or 3 clusters would be reasonable, as going further doesn’t lead to much of a decrease in wss.
\(~\)
The Silhouette Method measures the fit of observations within their assigned clusters using quantities known as silhouette values, which are a comparison of the similarity of an observation to its assigned cluster (cohesion) relative to its similarity to other clusters (separation).
The difference between these two quantities (the measure of cohesion and the measure of separation) is standardized to produce a silhouette value between -1 and +1, with high values indicating an observation has “good fit” in its assigned cluster, and low values indicating “poor fit” within the assigned cluster.
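For reference, the silhouette of observation \(i\) is typically written as:
\[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \]
where \(a_i\) is the average distance from observation \(i\) to the other members of its own cluster (cohesion) and \(b_i\) is the average distance from observation \(i\) to the members of the nearest other cluster (separation).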
If you’d like a more mathematical description of how silhouette values are calculated, the Wikipedia page is fairly accessible.
The average silhouette (across the entire data set) can be used to evaluate the effectiveness of different values of \(k\) in creating clusters that are distinctive and cohesive. We can visualize this relationship by supplying the “silhouette” method to fviz_nbclust().
fviz_nbclust(USArrests, FUNcluster = kmeans, method = "silhouette", k.max = 8)
For the USArrests data, the average silhouette metric suggests that \(k = 2\) is optimal.
Silhouettes are also useful for evaluating the cluster membership of individual observations:
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(USArrests), ordered = FALSE)
row.names(sil) <- row.names(USArrests) # Needed to use label option
fviz_silhouette(sil, label = TRUE)
## cluster size ave.sil.width
## 1 1 21 0.58
## 2 2 29 0.60
Here we see there are a handful of states that aren’t great fits in their assigned clusters, so we’d want to avoid over-interpreting their assignments when sharing our results.
If we’d like to take a closer look, we can always filter the silhouette information:
## Filter to show low sil states
as.data.frame(sil) %>% filter(sil_width < 0.3)
## cluster neighbor sil_width
## Arkansas 1 2 0.11160017
## Missouri 2 1 0.06586521
## Rhode Island 2 1 0.17099027
## Tennessee 1 2 0.09530834
Note that the output of the silhouette() function is not a data frame, so we must coerce it before filtering.
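Because later questions ask you to identify the worst-fitting observations, it can also be handy to keep the names as a column and sort by silhouette width. A small sketch building on the objects created above:
sil_df <- as.data.frame(sil)              # coerce the silhouette object
sil_df$State <- row.names(USArrests)      # keep the state names as a column
sil_df %>% arrange(sil_width) %>% head(4) # four lowest silhouette values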
Question #3: For each part of the following question you should use cds, which is a standardized subset of the college scorecard data created by the code given below:
## Keep this code as pre-processing
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
select(Name, Enrollment, Adm_Rate, ACT_median,
Net_Tuition, FourYearComp_Males, FourYearComp_Females,
Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),] ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name
## Use the data below for this question
cds <- scale(select(cd, -Name))
Part A: Using the Elbow Method, determine an appropriate number of \(k\)-means clusters for these data. Then, report the centroids of each cluster. Note: because the data have been standardized, the units describing these centroids are standard deviations.
Part B: Use the Silhouette Method to determine an appropriate number of \(k\)-means clusters for these data.
Part C: Create a silhouette plot using the value of \(k\) selected in Part B. Which college(s) have negative silhouettes? Briefly explain what these values say about these colleges and their cluster membership.
\(~\)
The gap statistic is a statistical approach that compares the observed wss for a certain choice of \(k\) versus a distribution of wss that would be expected if there were no actual relationships between observations (ie: a null hypothesis of no clustering). A larger gap statistic indicates a greater difference between the real data and what could be explained by random chance.
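In symbols, for a given \(k\) the gap statistic compares the logarithm of the observed total withinss, \(W_k\), to its expected value under the reference (no-cluster) distribution:
\[ \text{Gap}(k) = E^*\left[\log(W_k)\right] - \log(W_k) \]
where the expectation \(E^*\) is estimated by repeatedly clustering data simulated from the reference distribution.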
For additional mathematical details, see this paper.
The code below calculates the gap statistic for \(k=1\) through \(k=5\) for the “USArrests” data:
fviz_nbclust(USArrests, FUNcluster = kmeans, method = "gap", k.max = 5)
By default, fviz_nbclust() will use two criteria when recommending a value of \(k\) from the gap statistic: the size of the gap statistic and its standard error. Rather than simply choosing the \(k\) with the largest gap, it favors a smaller \(k\) whose gap statistic is within one standard error of a (local) maximum.
The aim is to provide more parsimonious results (fewer clusters) while still achieving performance that isn’t significantly different from the maximum.
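If you'd like more control over how the gap statistic is computed (for example, the number of reference samples), one option is to call the clusGap() function from the cluster package directly and plot the result with fviz_gap_stat(). A minimal sketch, assuming 50 reference samples is sufficient for a quick look:
# compute the gap statistic using 50 simulated reference samples
gap_stat <- clusGap(USArrests, FUNcluster = kmeans, nstart = 25, K.max = 5, B = 50)
fviz_gap_stat(gap_stat)   # plotting function from factoextra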
Remark: You’ll notice in the USArrests example that each method suggested a slightly different number of clusters. This is a common occurrence, and the choice of \(k\) ultimately comes down to a decision made by the analyst. You should view the elbow, silhouette, and gap statistic as different tools that can help guide your choice.
Question #4: Use the gap statistic to determine the optimal choice of \(k\) for the cds data set used in Question #3. Does this method agree with the Elbow and Silhouette Methods? Briefly explain.
\(~\)
\(k\)-means clustering is a straightforward and widely used method, but it has a few weaknesses: it is sensitive to outliers, its cluster centers are averages rather than actual data-points (which can make clusters harder to interpret), and it can only be applied to numeric data.
Partitioning Around Medoids (PAM) is an alternative clustering approach that searches for \(k\) data-points called medoids, which serve as cluster centers. The remaining data-points are assigned to the cluster defined by the nearest medoid.
usarrests_std <- scale(USArrests)
us_pam <- pam(usarrests_std, k = 3) ## Apply PAM clustering
us_pam$medoids ## Print the medoids
## Murder Assault UrbanPop Rape
## New Mexico 0.8292944 1.3708088 0.3081225 1.1603196
## Oklahoma -0.2727580 -0.2371077 0.1699510 -0.1315342
## New Hampshire -1.3059321 -1.3650491 -0.6590781 -1.2525644
fviz_cluster(us_pam) ## Plot the clusters
PAM addresses many of the drawbacks of \(k\)-means: it is more robust to outliers, and the cluster centers are themselves data-points, making the results more interpretable. We’ll also later see that PAM can be used to cluster data with categorical variables.
The biggest weakness of PAM is its computational cost. Because the cluster centers must be actual data-points, the medoid update step is much more expensive than the centroid update in \(k\)-means, so PAM doesn’t scale well to large datasets. Additionally, PAM tends to be more sensitive to the starting medoids than \(k\)-means is to the starting centroids.
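When working with a fitted PAM object, note that it stores the cluster assignments (similar to kmeans()) as well as silhouette information. A quick sketch using the object created above:
us_pam$clustering[c("Iowa", "Virginia")]   # cluster assignments for individual states
us_pam$silinfo$avg.width                   # average silhouette width for this choice of k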
Question #5: Apply PAM clustering to the cds data set used in Questions #3 and #4. Your analysis should choose \(k\) using one of the three approaches from the previous section, and you should report the prototypes (cluster medoids) for each cluster.
\(~\)
Question #6: The “mlb” data set contains offensive statistics for all Major League Baseball players with at least 200 at bats during the 2022 season. Aside from the player ID and “AB” (number of at bats), all other columns are reported as rates (per at bat).
mlb <- read.csv("https://remiller1450.github.io/data/mlb_hitters22.csv")
For those unfamiliar with baseball, here are some brief definitions of the remaining variables:
Part A: Create a new data frame, mlb_std, where the player IDs are used as row names, the variable “AB” is removed, and all other variables have been standardized by the scale() function.
Part B: These data were standardized before you received them to reflect the rate of each event per batting attempt. Given this standardization has already taken place, briefly explain why the data should be further standardized (into Z-scores by subtracting each variable’s mean and scaling by its standard deviation) prior to clustering.
Part C: Use the Silhouette Method to determine the number of clusters that should be used when applying \(k\)-means clustering to these data. Then, apply the \(k\)-means algorithm and print the cluster centroids.
Part D: Use the cluster centroids from Part C to provide a brief description of the members of Cluster #1 and Cluster #2. That is, how would you characterize the hitters in each of these clusters?
Part E: Suppose a sports columnist wants to write an article titled “7 different types of hitters”. Use PAM clustering to find prototype players to use in this article (you can use their Player IDs). Then, report the number of hitters belonging to each of these clusters.
Part F: Use the silhouette() function to find the silhouette values for each hitter using the clusters from Part D. Then, report the player with the lowest value (worst fit) for each cluster. Hint: Use as.data.frame() to coerce the output of silhouette() to a data.frame object, then add the Player IDs. Next, use group_by() and filter() to find the minimum silhouettes in each cluster.