Principal components analysis (PCA) can aid an analyst in visualizing and identifying patterns in data with many variables. PCA is an unsupervised learning method because it does not require the analyst to specify an outcome variable.
Generally speaking, PCA groups related variables together as derived dimensions.
Clustering is another unsupervised learning method, but its goal is to group related observations together as clusters.
As an example, suppose a dataset consists of \(n = 100\) observations and \(p = 10\) variables. PCA might suggest those 10 variables can be reasonably represented by 3 dimensions (the first 3 principal components), while clustering might tell us that those 100 observations can be reasonably represented by 5 groups (clusters).
In this lab we will use the cluster and factoextra packages to perform clustering and visualize clustering results.
library(cluster) ## Functions to perform clustering
library(factoextra) ## Functions for cluster visualization
library(tidyr) ## Used to tidy data
library(dplyr) ## Used to manipulate data
Perhaps the most popular clustering approach is k-means clustering, which groups \(n\) observations into \(k\) groups (clusters) such that within-cluster variation is minimized. Since its inception, many k-means variations and algorithms have been proposed, but the original method minimizes the total within-cluster sum of squares (WSS): \[ \text{Total WSS} = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2 \] where \(C_j\) is the set of observations in cluster \(j\) and \(\mu_j\) is that cluster’s center (mean).
In order to use \(k\)-means, the analyst must specify \(k\), the number of clusters. For a fixed \(k\), the \(k\)-means algorithm will group the observations such that total squared distance from each data point to the center of its assigned cluster is minimized. The gif below illustrates the behavior of the algorithm:
Put briefly, the \(k\)-means algorithm alternates between an assignment step, where each observation is assigned to the nearest cluster center, and an update step, where each cluster center is recomputed as the mean of the observations currently assigned to it.
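To make the alternation concrete, below is a bare-bones sketch of these two steps for \(k = 2\) on the USArrests data. This is an illustration only (it omits convergence checks and the empty-cluster edge case), and kmeans() itself uses a more efficient algorithm.
set.seed(123)
X <- as.matrix(USArrests)
centers <- X[sample(nrow(X), 2), ] ## Start from 2 randomly chosen observations
for (iter in 1:10) {
  ## Assignment step: squared distance from every observation to each center
  d1 <- rowSums((X - matrix(centers[1, ], nrow(X), ncol(X), byrow = TRUE))^2)
  d2 <- rowSums((X - matrix(centers[2, ], nrow(X), ncol(X), byrow = TRUE))^2)
  assignment <- ifelse(d1 <= d2, 1, 2)
  ## Update step: each center becomes the mean of its currently assigned observations
  centers <- rbind(colMeans(X[assignment == 1, , drop = FALSE]),
                   colMeans(X[assignment == 2, , drop = FALSE]))
}
table(assignment) ## Cluster sizes after 10 iterations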
In R, the kmeans function is used to perform \(k\)-means clustering. For example, we can group the states in the USArrests data into two clusters:
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
The centers argument specifies the number of clusters we want (i.e., \(k\)), while nstart specifies how many random starting configurations the algorithm should try; the solution with the lowest total WSS is kept. Using several starts makes the results more stable, since a single random start can occasionally converge to a poor solution.
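For instance (a rough sketch, and the exact numbers will vary from run to run), the tot.withinss component of the kmeans() output records the achieved total WSS, so we can compare a single random start to 25 starts:
## Compare the achieved total WSS for 1 random start vs. 25 random starts
kmeans(USArrests, centers = 2, nstart = 1)$tot.withinss
kmeans(USArrests, centers = 2, nstart = 25)$tot.withinss ## Never worse, and more stable across runs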
We can visualize these clusters using fviz_cluster, which displays the clusters (which, by default, were created using all columns of USArrests) in a scatterplot where the first two principal component scores define the x-y coordinates of each observation.
fviz_cluster(k2, data = USArrests)
Notice that fviz_cluster labeled each point. By default, these labels are determined by the rownames of the data object, which you can change using the rownames function. Additionally, you can specify repel = TRUE to ask fviz_cluster to prevent overlapping labels.
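For example, the sketch below relabels the points with state abbreviations and repels the labels (it assumes the rows of USArrests are in the same alphabetical order as R’s built-in state.abb vector, which is the case for this dataset):
USArrests_abbr <- USArrests ## Work on a copy so the original rownames are preserved
rownames(USArrests_abbr) <- state.abb ## Both are ordered alphabetically by state name
k2_abbr <- kmeans(USArrests_abbr, centers = 2, nstart = 25)
fviz_cluster(k2_abbr, data = USArrests_abbr, repel = TRUE)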
Rather than using PC1 and PC2, we could choose to display the clusters using any two of the original variables:
fviz_cluster(k2, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = FALSE, repel = TRUE)
Because the variables in USArrests are on different scales, we should consider standardizing the data using the scale function prior to clustering:
Std_USArrests <- scale(USArrests)
ks <- kmeans(Std_USArrests, centers = 2)
fviz_cluster(ks, data = Std_USArrests)
Standardization ensures that each variable is given equal importance when determining the clusters. In many cases, standardizing the data is desirable, but there are some situations where it makes sense to cluster on unstandardized data. We’ll see an example of this soon.
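As a quick check (optional), each standardized column should now have mean 0 and standard deviation 1:
round(colMeans(Std_USArrests), 10) ## Column means are (numerically) zero
apply(Std_USArrests, 2, sd) ## Column standard deviations are one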
After we’ve applied \(k\)-means clustering, we might be interested in describing or summarizing each cluster. One way to do this is using the cluster centers; for \(k\)-means clustering, these are the averages of each variable across the members of a cluster:
ks$centers
## Murder Assault UrbanPop Rape
## 1 1.004934 1.0138274 0.1975853 0.8469650
## 2 -0.669956 -0.6758849 -0.1317235 -0.5646433
In this example, we see that the first cluster contains states with higher crime rates and somewhat higher urbanization, while the second cluster contains states with lower crime rates and less urbanization.
When interpreting the cluster centers, it is important to consider whether the data were standardized prior to clustering. In our example we did standardize, so the cluster centers are reported in standardized units (z-scores) and should be described and interpreted accordingly.
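If standardized units are awkward to communicate, the centers can be converted back to the original units using the column means and standard deviations that scale() stores as attributes of Std_USArrests (a short sketch):
## Convert the standardized centers back to the original units of USArrests
orig_centers <- sweep(ks$centers, 2, attr(Std_USArrests, "scaled:scale"), "*")
orig_centers <- sweep(orig_centers, 2, attr(Std_USArrests, "scaled:center"), "+")
orig_centers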
As previously mentioned, the \(k\)-means algorithm requires the analyst to specify \(k\), but choosing the optimal number is tricky.
As a general guideline, \(k\) should be large enough to yield relatively homogenous clusters, but small enough to limit unnecessary complexity. Sometimes \(k\) might be based on the needs of a particular analysis or an expert opinion. In other situations, \(k\) might be chosen based upon empirical methods.
Several methods exist to guide the choice of \(k\). We will briefly cover three of the most popular strategies:
Recall that the \(k\)-means algorithm minimizes the total within-cluster sum of squares, \(\sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2\). This quantity always decreases as \(k\) increases, but past a certain point adding more clusters tends not to provide much improvement.
We can visualize the reduction in WSS versus different choices of \(k\) using the fviz_nbclust function:
fviz_nbclust(Std_USArrests, kmeans, method = "wss", k.max = 8)
The Elbow Method suggests the optimal number of clusters should be chosen by this plot’s “elbow”, or where the curve appears to bend.
In the USArrests example, the elbow method suggests 2 or 3 clusters would be reasonable, since more clusters don’t do much to decrease the WSS.
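The quantity being plotted can also be computed directly, since kmeans() stores the achieved total WSS in its tot.withinss component. The sketch below is a rough manual version of the same plot:
## Compute the total WSS for k = 1 through 8 and plot it against k
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(Std_USArrests, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")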
The Silhouette Method measures how well each observation fits within its assigned cluster by calculating a silhouette value. Silhouette values measure how similar an observation is to its assigned cluster (cohesion) relative to how similar it is to the other clusters (separation).
If you’d like a more mathematical description of how silhouette values are calculated, the Wikipedia page is fairly accessible.
Silhouette values range from -1 to +1, with high values indicating an observation fits well within its assigned cluster, and low values indicating poor fit within its assigned cluster. They can be used to assess individual observations, or the average silhouette can be used to assess the choice of \(k\).
fviz_nbclust(Std_USArrests, kmeans, method = "silhouette", k.max = 8)
For the USArrests data, the average silhouette metric suggests that \(k = 2\) is optimal.
Silhouettes are also used to better understand the cluster membership of individual observations:
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(USArrests), ordered = FALSE)
row.names(sil) <- row.names(USArrests) # Needed to use label option
fviz_silhouette(sil, label = TRUE)
## cluster size ave.sil.width
## 1 1 21 0.58
## 2 2 29 0.60
In this example we see there are a handful of states that aren’t great fits in their assigned clusters, and we’d want to avoid over-emphasizing the cluster membership for those states when reporting on the trends in these data.
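To see exactly which states those are, we can sort the silhouette widths stored in the sil object created above (a short sketch):
## List the states with the smallest silhouette widths (the poorest fits)
sil_df <- data.frame(state = row.names(sil),
                     cluster = sil[, "cluster"],
                     sil_width = sil[, "sil_width"])
head(sil_df[order(sil_df$sil_width), ], 5)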
The gap statistic compares the WSS achieved by a certain choice of \(k\) with the WSS that would be expected for that \(k\) if the data had no real cluster structure. The code below calculates the gap statistic for \(k=1\) through \(k=8\):
fviz_nbclust(Std_USArrests, kmeans, method = "gap", k.max = 8)
By default, fviz_nbclust suggests the smallest \(k\) for which the gap statistic is within one standard error of its maximum. The rationale is to provide more interpretable results (fewer clusters) while still achieving performance that isn’t substantially different from the maximum.
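The gap statistic can also be computed directly with the clusGap function from the cluster package and then plotted with factoextra (a sketch; the B argument controls how many reference datasets are simulated):
## Compute the gap statistic for k = 1 through 8 using 50 simulated reference datasets
set.seed(123)
gap <- clusGap(Std_USArrests, FUNcluster = kmeans, nstart = 25, K.max = 8, B = 50)
fviz_gap_stat(gap)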
Remark: You’ll notice in the USArrests example that each method suggested a slightly different number of clusters. This is a common occurrence, and the exact value of \(k\) usually comes down to a decision made by the analyst. You should view the elbow, silhouette, and gap statistic as tools to help guide your choice.
\(k\)-means clustering is an interpretable and widely used method, but it does have a few weaknesses: it is sensitive to outliers, its cluster centers are not actual data-points, and it cannot directly handle categorical variables.
Partitioning Around Medoids (PAM) is an alternative clustering approach that searches for \(k\) data-points called medoids, which serve as cluster centers. The remaining data-points are assigned to the cluster defined by the nearest medoid.
pam_std <- pam(Std_USArrests, k = 3)
pam_std$medoids ## Print the medoids
## Murder Assault UrbanPop Rape
## New Mexico 0.8292944 1.3708088 0.3081225 1.1603196
## Oklahoma -0.2727580 -0.2371077 0.1699510 -0.1315342
## New Hampshire -1.3059321 -1.3650491 -0.6590781 -1.2525644
fviz_cluster(pam_std) ## Plot the clusters
PAM addresses many of the drawbacks of \(k\)-means: it is more robust to outliers, its cluster centers are themselves data-points (making the results more interpretable), and, as we’ll see later, PAM can be used to cluster data containing categorical variables.
These advantages don’t come without weaknesses, the biggest of which is computational efficiency: PAM doesn’t scale well to large datasets. Additionally, PAM can be sensitive to the starting medoids.
The OnlineRetail dataset contains over 500,000 transactions from a UK-based retailer occurring between 1/12/2010 and 9/12/2011. For details on these data, visit the documentation at the UC-Irvine machine learning repository.
## Reading the data takes a long time, they contain >500k purchase records
data <- read.csv("https://remiller1450.github.io/data/OnlineRetail.csv")
The goal of this application will be to perform an RFM (Recency, Frequency, and Monetary value) customer segmentation analysis (see https://en.wikipedia.org/wiki/RFM_(customer_value)). The idea is to group the customers into clusters using the RFM variables. Cluster characteristics could then be used to evaluate customer value and/or guide marketing efforts.
For this application we will restrict our analysis to only include records with complete information:
data <- data[complete.cases(data),]
To perform the analysis, we’ll need to derive suitable “Recency”, “Frequency”, and “Monetary” value variables ourselves. The most challenging to derive is Recency, which we will choose to define as the number of days since a customer’s most recent purchase. Setting up this variable requires some knowledge of the “date” variable type, so the code to create it, along with some comments, is provided below:
dates <- as.Date(substr(data$InvoiceDate,1,10), "%m/%d/%Y") ## Extract only the date, specifying its format
data$days_since <- as.Date("2011-12-10") - dates ## Subtract from a reference date
Rec <- data %>% group_by(CustomerID) %>% summarise(min = min(days_since)) ## Keep only the minimum per customer
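A couple of quick sanity checks (optional): subtracting two Date objects produces a “difftime” measured in days, which is why the recency values get wrapped in as.numeric() when the RFM data frame is assembled below.
class(data$days_since) ## A difftime, measured in days
summary(as.numeric(Rec$min)) ## Recency (days since last purchase) across customers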
Next we’ll create the Frequency variable, which we will define as the number of purchases a customer made during the span of these data.
Freq <- data %>% group_by(CustomerID) %>% summarise(npurchases = n()) ## Derive a variable for customer frequency
Finally we create the Monetary value variable, which we will define as the total purchase amount made by each customer:
## Derive a variable for monetary value
data <- mutate(data, total = Quantity*UnitPrice)
Mon <- data %>% group_by(CustomerID) %>% summarise(mon = sum(total))
Putting these three variables together:
RFM_data <- data.frame(R = as.numeric(Rec$min), F = Freq$npurchases, M = Mon$mon)
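Note that stacking the three summaries side-by-side relies on each of them being sorted by the same CustomerID values. A slightly more defensive alternative (a sketch; RFM_data2 is just an illustrative name) merges them explicitly by CustomerID using dplyr joins:
RFM_data2 <- Rec %>%
  inner_join(Freq, by = "CustomerID") %>% ## Merge frequency onto recency by customer
  inner_join(Mon, by = "CustomerID") %>% ## Then merge monetary value
  transmute(R = as.numeric(min), F = npurchases, M = mon)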
For our analysis of these data we’ll exclude customers with negative monetary values. These mostly represent customers who returned a purchase from the prior year who did not make a purchase during the year in question.
RFM_data <- filter(RFM_data, M > 0)
Question #1:
Investigate the distribution of these variables using visuals or summary statistics. Based upon your assessment, should any of these variables be transformed to mitigate the influence of outliers? Provide a brief justification.
Question #2:
Analyze these data using \(k\)-means clustering (transforming variables when necessary). Be sure to report on how many clusters you used, and how you might describe the members of each cluster to a non-statistician. A satisfactory answer should include 2-4 sentences and at least one visualization.
Question #3:
Analyze these data using PAM clustering. Be sure to report on how many clusters you used, and how you might describe the members of each cluster to a non-statistician. A satisfactory answer should include 2-4 sentences and at least one visualization.
\(k\)-means and PAM are both partitioning algorithms that require a pre-specified number of clusters. In contrast, hierarchical clustering aggregates (or splits) the data into a tree-like representation known as a dendrogram, which can be cut to define a desired number of clusters.
There are two categories of approaches to hierarchical clustering: agglomerative (bottom-up) methods, which begin with each observation as its own cluster and repeatedly merge the most similar clusters, and divisive (top-down) methods, which begin with all observations in a single cluster and repeatedly split it.
There are many different implementations and algorithms falling under these two general schemes. The example below illustrates Agglomerative Nesting (AGNES), an agglomerative algorithm, and Divisive Analysis (DIANA), a divisive algorithm.
d <- get_dist(scale(USArrests)) ## Hierarchical Clustering requires a distance matrix
ag <- agnes(d) ## AGNES
fviz_dend(ag, cex = 0.4, k = 4)
di <- diana(d) ## DIANA
fviz_dend(di, cex = 0.4, k = 4)
This example can be used to highlight a few key aspects of hierarchical clustering. First, hierarchical clustering operates on a distance matrix rather than the raw data; here we calculated it using get_dist (you can visit this link for additional details). Second, in fviz_dend, the argument k = 4 is used to cut the dendrogram at a point yielding 4 clusters.
Until now we’ve exclusively looked at datasets containing only numeric variables, since the clustering methods we’ve covered are based upon distance calculations that aren’t defined for categories. In order to generalize clustering to datasets containing categorical variables, we need an alternative method for calculating distance.
Gower distance is one way to calculate distance when the data contain both numeric and categorical variables. For any two data-points, \(a\) and \(b\), measured on \(p\) variables, it is defined (assuming complete data and equal weighting of the variables) as the average of variable-wise dissimilarities: \[ d(a, b) = \frac{1}{p} \sum_{j=1}^{p} h_j(a, b) \] The function \(h_j()\) depends on the type of variable \(j\): for a numeric variable, \(h_j(a, b) = |a_j - b_j| / R_j\), where \(R_j\) is the range of variable \(j\) in the data; for a categorical variable, \(h_j(a, b) = 0\) if \(a\) and \(b\) share the same category and \(1\) otherwise.
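As a tiny worked example (hypothetical data with one numeric and one categorical variable), the hand calculation below matches what the daisy function, introduced next, reports:
## Hand-computed Gower distance between observations 1 and 2 of a toy dataset
x <- data.frame(bedrooms = c(3, 2, 5), ac = factor(c("Yes", "No", "Yes")))
num_part <- abs(3 - 2) / diff(range(x$bedrooms)) ## Numeric: |3 - 2| / (5 - 2) = 1/3
cat_part <- 1 ## Categorical: "Yes" and "No" differ, so 1
(num_part + cat_part) / 2 ## Average over the 2 variables = 0.667
daisy(x, metric = "gower") ## The entry for observations 1 and 2 agrees with the hand calculation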
Gower distance can be used to construct a distance matrix, allowing clustering methods like PAM, AGNES, and DIANA to be used. The daisy function can be used to construct the Gower distance matrix, which we could plot or use to perform clustering:
# Example
homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
homes2 <- select(homes, style, built, bedrooms, bsmt, ac, area.living, area.lot)
D <- daisy(homes2, metric = "gower") ## Use daisy to calculate distance matrix
fviz_dist(D, show_labels = FALSE) ## We could view the distance matrix
pam_homes <- pam(D, k = 3)
homes2[pam_homes$medoids, ] ## Print the medoids
## style built bedrooms bsmt ac area.living area.lot
## 219 1 Story Frame 1965 3 Full Yes 1128 7875
## 431 1 Story Frame 1999 2 None Yes 1182 5087
## 108 2 Story Frame 1994 3 Full Yes 1359 8313
For many applications we’d want to check whether using \(k = 3\) clusters is appropriate. Unfortunately, the fviz_nbclust function only accepts numeric data matrices, so we’ll need to do this manually. Additionally, the gap statistic cannot be calculated for non-numeric data. Luckily, we can still use the average silhouette width and the elbow method with a little bit of extra code:
### Average Silhouette Width
avg_sil = numeric(9) ## Setup object to store avg sil width for each possible k
for(k in 2:10){
pam_homes <- pam(D, k = k)
avg_sil[k-1] <- pam_homes$silinfo$avg.width
}
plot(x = 2:10, y = avg_sil, type = "b", xlab = "k", ylab = "Avg Silhouette")
### Elbow Method
elbow = numeric(9)
for(k in 2:10){
pam_homes <- pam(D, k = k)
elbow[k-1] <- pam_homes$objective[1]
}
plot(x = 2:10, y = elbow, type = "b", xlab = "k", ylab = "Objective")
The colleges2015 dataset contains information on predominantly bachelor’s-degree granting institutions. College applicants might be interested in seeing groups of similar schools, which we will determine using clustering. The colleges_reduced data.frame contains a subset of variables that will be used for clustering.
colleges2015 <- read.csv('https://raw.githubusercontent.com/ds4stats/r-tutorials/master/data-wrangling/data/colleges2015.csv')
colleges2015 <- colleges2015[complete.cases(colleges2015),]
colleges_reduced <- select(colleges2015, type, admissionRate, ACTmath, ACTenglish, undergrads, cost, gradRate, debt)
rownames(colleges_reduced) <- paste(colleges2015$college, colleges2015$state) ## Add school names as rownames
Question #4:
Use the daisy function to calculate the Gower distance matrix for the colleges_reduced data. Then perform a clustering analysis using PAM. You should have a justification for the number of clusters you used, and you should interpret the trends you see in your clusters.
Question #5:
Filter the colleges2015 data to only include schools in the “New England” region and select the same variables that were contained in the colleges_reduced data. Then cluster your filtered dataset using a hierarchical clustering algorithm of your choice (AGNES or DIANA). Plot a dendrogram displaying \(k = 4\) clusters, and interpret these clusters.