This lab focuses on clustering, or methods for identifying meaningful groupings of similar data-points.

Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Preamble

Packages and Datasets

This lab will continue using the factoextra package, and will also use functions from the cluster package.

# load the following packages
# install.packages("cluster")
library(cluster)
library(factoextra)
library(ggplot2)
library(dplyr)

\(~\)

Introduction

Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):

\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]

  • PCA can be used to identify relationships in the columns of \(\mathbf{X}\) (ie: groupings/patterns among variables).
  • Clustering can be used to identify relationships in the rows of \(\mathbf{X}\) (ie: groupings/patterns among observations).

Clustering approaches fall into two broad categories:

  • Partitional Clustering divides the data-points into non-overlapping subsets (regions in space)
  • Hierarchical Clustering organizes the data-points into nested subsets (tree-like structures)

This lab will focus on partitional clustering, in particular \(k\)-means and PAM clustering.

\(~\)

Distance

As a motivating example, consider the “USArrests” data from the PCA lab. Shown below are two clusters (red and blue) from a partitional method.

Which cluster should contain Virginia? Why?

Good clusters contain data-points that are similar to each other (cohesive), and also dissimilar from the members of other clusters (distinct).

To assess “similarity” we’ll use distance. There are many possible ways to define distance; one of the most popular is Euclidean distance:

\[d_{a,b} = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]

In words, the Euclidean distance between two data-points, \(x_a\) and \(x_b\), is the square root of the sum of squared differences in their values across each variable (we index these variables by \(j \in \{1, \ldots, p\}\)).
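
For example, the code below computes the Euclidean distance between Virginia and Maryland in the USArrests data, first directly from the formula and then with R’s dist() function (which reappears later in this lab when we compute silhouettes):

data("USArrests")
x_a <- unlist(USArrests["Virginia", ])        # the p = 4 measurements for Virginia
x_b <- unlist(USArrests["Maryland", ])        # the p = 4 measurements for Maryland
sqrt(sum((x_a - x_b)^2))                      # Euclidean distance from the formula
dist(USArrests[c("Virginia", "Maryland"), ])  # the same distance via dist()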

\(~\)

K-means

Perhaps the most well-known clustering approach is k-means clustering, which partitions the \(n\) observations into \(k\) distinct groups (clusters) such that within-cluster variation is minimized.

The original \(k\)-means algorithm solves for the \(k\) cluster centroids (\(\mu_1, \ldots, \mu_k\)) that minimize the “total withinness” (the total within-cluster sum of squares):

\[ \text{Total Withinness} = \sum_{c=1}^{k} W(C_c) = \sum_{c=1}^{k} \sum_{x_i \in C_c} (x_i - \mu_c)^2 \]

  • \(x_i\) is a data-point belonging to cluster \(c\)
  • \(\mu_c\) is the centroid (multi-dimensional mean) of cluster \(c\)
  • \(W(C_c)\) is the “withinness” of cluster \(c\), the sum of squared distances from each member to the cluster centroid

To apply \(k\)-means, the analyst must first specify \(k\), the number of clusters. The \(k\)-means algorithm is then used to group the observations so that the total squared distance from each data-point to the centroid of its assigned cluster is minimized.

The \(k\)-means Algorithm

  1. Initialize the positions of \(k\) centroids
  2. Calculate the distance between each data-point and each centroid (using Euclidean distance)
  3. Assign data-points to the nearest centroid
  4. Update the cluster centroids (to the mean of its assigned data-points)
  5. Repeat (2) - (4) until there is no change
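
To make these steps concrete, here is a minimal sketch of the algorithm written in base R. This is for illustration only: it is not the implementation used by R’s kmeans() function, and it assumes no cluster ever ends up empty.

data("USArrests")
set.seed(123)                                        # the starting centroids are random
X <- as.matrix(USArrests)
k <- 2
centroids <- X[sample(nrow(X), k), , drop = FALSE]   # (1) initialize k centroids

repeat {
  ## (2) Euclidean distance from every data-point to every centroid
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
  ## (3) assign each data-point to its nearest centroid
  assignment <- apply(d, 1, which.min)
  ## (4) update each centroid to the mean of its assigned data-points
  new_centroids <- apply(X, 2, function(col) tapply(col, assignment, mean))
  ## (5) stop once the centroids no longer change
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}

table(assignment)   # cluster sizes found by this sketch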

The gif below illustrates the progression of the \(k\)-means algorithm for \(k = 2\):

\(~\)

Lab

K-means in R

In R, the kmeans() function is used to perform \(k\)-means clustering. For example, we can use the code below to assign states in the USArrests data into two clusters:

data("USArrests")
k2 <- kmeans(USArrests, centers = 2, nstart = 25)

The centers argument is the number of clusters we want (ie: \(k\)), while nstart is the number of random starting configurations to try; kmeans() runs the algorithm from each start and keeps the solution with the lowest total withinness. There is no guarantee that a single run of k-means finds a globally optimal set of cluster assignments, so using several starts makes the results more stable (for exact reproducibility you would also set a random seed with set.seed()). That said, in most circumstances the starting point has little or no impact on the clustering results.
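
If you’d like to see what nstart is doing, you can compare the best total withinness found using a single random start versus 25 of them (for these data the two values will typically agree, which is why the starting point rarely matters here):

set.seed(1)   # needed for exact reproducibility, since the starting centroids are random
kmeans(USArrests, centers = 2, nstart = 1)$tot.withinss
kmeans(USArrests, centers = 2, nstart = 25)$tot.withinss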

Next, notice that kmeans() outputs a list containing 9 different objects. The code below prints the coordinates of the cluster centroids (multi-dimensional means), which are stored in the centers object within that list:

k2$centers
##      Murder  Assault UrbanPop     Rape
## 1 11.857143 255.0000 67.61905 28.11429
## 2  4.841379 109.7586 64.03448 16.24828

These centroids are often called prototypes since each of them is a central point that can be used to describe the average characteristics of that cluster’s members.
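
One way to turn these prototypes into descriptive labels is to compare each centroid against the overall (un-clustered) averages:

colMeans(USArrests)   # overall averages across all 50 states
k2$centers            # in the output shown above, cluster 1 sits well above these
                      # averages on the crime variables (Murder, Assault, and Rape)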

If we are interested in the cluster assignments of individual data-points, we can find them in the cluster object returned by kmeans(). The code below demonstrates how to find the assigned clusters for the states “Iowa” and “Illinois”, which are row names in the USArrests data frame.

k2$cluster[c("Iowa", "Illinois")]
##     Iowa Illinois 
##        2        1
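
Relatedly, the size component (or a simple table of the assignment vector) shows how many states were placed in each cluster:

k2$size             # number of states assigned to each cluster
table(k2$cluster)   # the same counts, computed from the assignment vector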

Question #1: Apply \(k\)-means clustering with \(k = 4\) to the USArrests data. Then, find the centroid prototype of the cluster that contains Minnesota. How does the cluster containing Minnesota compare to the other clusters (in terms of labels you derive from the cluster centroids)?

\(~\)

Visualizing K-means

We can visualize the results of \(k\)-means clustering using fviz_cluster(), which displays the clusters on a scatter plot that uses the first two principal component scores as the X-Y coordinates for each observation.

k2 <- kmeans(USArrests, centers = 2, nstart = 25)
fviz_cluster(k2, data = USArrests)

Notice that fviz_cluster() labels each data-point using the rownames of the data object. You should recall that you can change these using the rownames() function. It’s also worth knowing that you can use the argument repel = TRUE (in fviz_cluster()) to prevent overlapping labels.
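
For example, the plot above can be redrawn with non-overlapping labels:

fviz_cluster(k2, data = USArrests, repel = TRUE)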

Finally, instead of using PC1 and PC2, you can specify any of the original variables:

fviz_cluster(k2, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = FALSE, repel = TRUE, title = "Without Standardization")

You should note that clusters will generally appear most distinct when viewed using the PC1 and PC2 axes.

Question #2 (Part A): Use the scale() function to standardize the USArrests data, then use fviz_cluster() to graph the \(k\)-means clusters (for \(k = 2\)) found using the standardized data with “UrbanPop” and “Murder” as the \(X\) and \(Y\) variables (similar to the graph in the previous example).

Question #2 (Part B): Explain why the clusters are somewhat different after standardizing the data. Do you think the data should be standardized prior to clustering? Briefly explain.

\(~\)

Choosing \(k\)

As previously mentioned, the \(k\)-means algorithm requires the user to supply the value of \(k\) before the algorithm can be run.

As a general guideline, \(k\) should be large enough that clusters are relatively homogeneous, but it should also be small enough to avoid unnecessary complexity. The next few sections will cover three different methods for choosing a reasonable value of \(k\).

\(~\)

The Elbow Method

The \(k\)-means algorithm solves for centroids that minimize the total withinness, \(\sum_{c=1}^{k} W(C_c) = \sum_{c=1}^{k} \sum_{x_i \in C_c}(x_i - \mu_c)^2\). This quantity always decreases as \(k\) increases, but there are diminishing returns: eventually, adding more cluster centroids won’t lead to much improvement in the total withinness.

With this relationship in mind, we can graph the total withinness (abbreviated “wss”, for within-cluster sum of squares) against different choices of \(k\) using the fviz_nbclust() function:

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "wss", k.max = 8)

The Elbow Method suggests the optimal number of clusters corresponds with this plot’s “elbow”, or the point where the curve starts flattening/bending towards more gradual decreases in wss as \(k\) increases.

For the USArrests data (shown in this example), the elbow method suggests either 2 or 3 clusters would be reasonable, as going further doesn’t lead to much of a decrease in wss.
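
If you’re curious what fviz_nbclust() is computing here, the same curve can be built by hand by running kmeans() once for each candidate \(k\) and recording its tot.withinss component:

wss <- sapply(1:8, function(k) kmeans(USArrests, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Total withinness (wss)")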

\(~\)

The Silhouette Method

The Silhouette Method measures the fit of observations within their assigned clusters using quantities known as silhouette values, which are a comparison of the similarity of an observation to its assigned cluster (cohesion) relative to its similarity to other clusters (separation).

  • Cohesion is defined as the mean distance of a data-point to the other members of its assigned cluster
  • Separation is defined as the mean distance of a data-point to the members of the next closest cluster.

The difference between these two quantities (separation minus cohesion) is divided by the larger of the two, producing a silhouette value between -1 and +1, with high values indicating an observation has “good fit” in its assigned cluster, and low values indicating “poor fit” within the assigned cluster.
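
Concretely, if \(a_i\) denotes the cohesion of observation \(i\) and \(b_i\) denotes its separation (as defined above), the silhouette value is:

\[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \]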

If you’d like a more mathematical description of how silhouette values are calculated, the Wikipedia page is fairly accessible.

The average silhouette (across the entire data set) can be used to evaluate the effectiveness of different values of \(k\) in creating clusters that are distinctive and cohesive. We can visualize this relationship by supplying the “silhouette” method to fviz_nbclust().

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "silhouette", k.max = 8)

For the USArrests data, the average silhouette metric suggests that \(k = 2\) is optimal.

Silhouettes are also useful for evaluating the cluster membership of individual observations:

k2 <- kmeans(USArrests, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(USArrests), ordered = FALSE)
row.names(sil) <- row.names(USArrests) # Needed to use label option
fviz_silhouette(sil, label = TRUE)
##   cluster size ave.sil.width
## 1       1   21          0.58
## 2       2   29          0.60

Here we see there are a handful of states that aren’t great fits in their assigned clusters, so we’d want to avoid over-interpreting their assignments when sharing our results.

If we’d like to take a closer look, we can always filter the silhouette information:

## Filter to show low sil states
as.data.frame(sil) %>% filter(sil_width < 0.3)
##              cluster neighbor  sil_width
## Arkansas           1        2 0.11160017
## Missouri           2        1 0.06586521
## Rhode Island       2        1 0.17099027
## Tennessee          1        2 0.09530834

Note that the output of the silhouette() function is not a data frame, so we must coerce it before filtering.

Question #3: For each part of the following question you should use cds, which is a standardized subset of the college scorecard data created by the code given below:

## Keep this code as pre-processing
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
  select(Name, Enrollment, Adm_Rate, ACT_median, 
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
        Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),]  ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name

## Use the data below for this question
cds <- scale(select(cd, -Name))

Part A: Using the Elbow Method, determine an appropriate number of \(k\)-means clusters for these data. Then, report the centroids of each cluster. Note: because the data have been standardized, the units describing these centroids are standard deviations.

Part B: Use the Silhouette Method to determine an appropriate number of \(k\)-means clusters for these data.

Part C: Create a silhouette plot using the value of \(k\) selected in Part B. Which college(s) have negative silhouettes? Briefly explain what these values say about these colleges and their cluster membership.

\(~\)

The Gap Statistic

The gap statistic is a statistical approach that compares the observed wss for a certain choice of \(k\) versus a distribution of wss that would be expected if there were no actual relationships between observations (ie: a null hypothesis of no clustering). A larger gap statistic indicates a greater difference between the real data and what could be explained by random chance.

For additional mathematical details, see this paper.

The code below calculates the gap statistic for \(k=1\) through \(k=5\) for the “USArrests” data:

fviz_nbclust(USArrests, FUNcluster = kmeans, method = "gap_stat", k.max = 5)

By default, fviz_nbclust will use two criteria when recommending a value of \(k\) using the gap statistic:

  1. The recommended \(k\) cannot appear after a decrease in the gap statistic (so only \(k \in \{1, \ldots, 4\}\) are eligible in this example).
  2. Among the eligible values, the recommendation is the smallest \(k\) where the gap statistic is within one standard error of its maximum value.

The aim is to provide more parsimonious results (fewer clusters) while still achieving performance that isn’t significantly different from the maximum.
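
The gap statistic values themselves can be computed with the clusGap() function from the cluster package (which fviz_nbclust() builds on), in case you want the raw numbers. A quick sketch, where B controls how many reference data sets are simulated:

set.seed(123)   # the reference data sets are simulated, so set a seed for reproducibility
gap <- clusGap(USArrests, FUNcluster = kmeans, nstart = 25, K.max = 5, B = 50)
print(gap, method = "firstSEmax")   # gap statistics plus a suggested k
fviz_gap_stat(gap)                  # same style of plot produced by fviz_nbclust()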

Remark: You’ll notice in the USArrests example that each method suggested a slightly different number of clusters. This is a common occurrence, and the choice of \(k\) ultimately comes down to a decision made by the analyst. You should view the elbow, silhouette, and gap statistic as different tools that can help guide your choice.

Question #4: Use the gap statistic to determine the optimal choice of \(k\) for the cds data set used in Question #3. Does this method agree with the Elbow and Silhouette Methods? Briefly explain.

\(~\)

PAM Clustering

\(k\)-means clustering is a straightforward and widely used method, but it has a few weaknesses:

  1. As a measure of central tendency, the mean is sensitive to outliers, so \(k\)-means centroids tend to be strongly influenced by unusual data-points.
  2. The centroids found by \(k\)-means aren’t actual data-points, so using them to describe the typical members of a cluster isn’t always accurate, especially when the number of variables is large and the data contain outliers.
  3. Distances that rely upon \(x_i - \mu_k\) aren’t defined for categorical variables, so \(k\)-means will only work when all variables are numeric.

Partitioning Around Medoids (PAM) is an alternative clustering approach that searches for \(k\) data-points called medoids, which serve as cluster centers. The remaining data-points are assigned to the cluster defined by the nearest medoid.

usarrests_std <- scale(USArrests)
us_pam <- pam(usarrests_std, k = 3)  ## Apply PAM clustering
us_pam$medoids  ## Print the medoids
##                   Murder    Assault   UrbanPop       Rape
## New Mexico     0.8292944  1.3708088  0.3081225  1.1603196
## Oklahoma      -0.2727580 -0.2371077  0.1699510 -0.1315342
## New Hampshire -1.3059321 -1.3650491 -0.6590781 -1.2525644
fviz_cluster(us_pam) ## Plot the clusters

PAM addresses many of the drawbacks of \(k\)-means: it is more robust to outliers, and the cluster centers are themselves data-points, which makes the results more interpretable. We’ll also see later that PAM can be used to cluster data containing categorical variables.
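
Because the cluster centers are actual observations, the pam() output records which rows they are, along with the full assignment vector (analogous to k2$cluster from kmeans()):

us_pam$id.med                              # row numbers of the three medoid states
rownames(usarrests_std)[us_pam$id.med]     # which states serve as the prototypes
us_pam$clustering[c("Virginia", "Iowa")]   # cluster assignments for individual states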

The biggest weakness of PAM is its computational cost. PAM doesn’t scale well to large datasets because updating the medoids requires evaluating many candidate swaps between medoid and non-medoid data-points. Additionally, PAM tends to be more sensitive to the starting medoids than k-means is to the starting centroids.

Question #5: Apply PAM clustering to the cds data set used in Questions #3 and #4. Your analysis should choose \(k\) using one of the three approaches from the previous section, and you should report the prototypes (cluster medoids) for each cluster.

\(~\)

Practice (required)

Question #6: The “mlb” data set contains offensive statistics for all Major League Baseball players with at least 200 at bats during the 2022 season. Aside from the player ID and “AB” (number of at bats), all other columns are reported as rates (per at bat).

mlb <- read.csv("https://remiller1450.github.io/data/mlb_hitters22.csv")

For those unfamiliar with baseball, here are some brief definitions of the remaining variables:

  • R - runs scored per at bat
  • H - hit rate (per at bat)
  • X2B - double rate (per at bat)
  • HR - home run rate (per at bat)
  • RBI - runs batted in per at bat
  • SB - stolen bases (scaled by number of batting attempts)
  • CS - caught stealing rate (per at bat)
  • BB - bases on balls (walks) per at bat
  • SO - strike outs per at bat

Part A: Create a new data frame, mlb_std, where the player IDs are used as row names, the variable “AB” is removed, and all other variables have been standardized by the scale() function.

Part B: These data were standardized prior to you receiving them to reflect the rate of each event per batting attempt. Given this standardization has taken place, briefly explain why the data should be further standardized (into Z-scores by subtracting each variable’s mean and scaling by its standard deviation) prior to clustering.

Part C: Use the Silhouette Method to determine the number of clusters that should be used when applying \(k\)-means clustering to these data. Then, apply the \(k\)-means algorithm and print the cluster centroids.

Part D: Use the cluster centroids from Part C to provide a brief description of the members of Cluster #1 and Cluster #2. That is, how would you characterize the hitters in each of these clusters?

Part E: Suppose a sports columnist wants to write an article titled “7 different types of hitters”. Use PAM clustering to find prototype players to use in this article (you can identify them by their Player ID). Then, report the number of hitters belonging to each of these clusters.

Part F: Use the silhouette() function to find the silhouette values for each hitter using the clusters from Part D. Then, report the player with the lowest value (worst fit) for each cluster. Hint: Use as.data.frame() to coerce the output of silhoutte() to a data.frame object, then add the Player IDs. Next, use group_by() and filter() to find the minimum silhouettes in each cluster.