Principal components analysis (PCA) can aid an analyst in visualizing and identifying patterns in data with many variables. PCA is an unsupervised learning method because it does not require the analyst to specify an outcome variable.
Generally speaking, PCA groups related variables together as derived dimensions.
Clustering is another unsupervised learning method, but its goal is to group related observations together as clusters.
As an example, suppose a dataset consists of \(n = 100\) observations and \(p = 10\) variables. PCA might suggest those 10 variables can be reasonably represented by 3 dimensions (the first 3 principal components), while clustering might tell us that those 100 observations can be reasonably represented by 5 groups (clusters).
In this lab we will use the cluster and factoextra packages to perform clustering and visualize clustering results.
library(cluster) ## Functions to perform clustering
library(factoextra) ## Functions for cluster visualization
library(tidyr) ## Used to tidy data
library(dplyr) ## Used to manipulate data
Perhaps the most popular clustering approach is k-means clustering, which groups \(n\) observations into \(k\) groups (clusters) such that within-cluster variation is minimized. Since its inception, many k-means variations and algorithms have been proposed, but the original method minimizes the total within-cluster sum of squares (WSS): \[ \text{Total WSS} = \sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2 \] where \(C_j\) is the set of observations in cluster \(j\) and \(\mu_j\) is that cluster’s center (mean).
In order to use \(k\)-means, the analyst must specify \(k\), the number of clusters. For a fixed \(k\), the \(k\)-means algorithm will group the observations such that total squared distance from each data point to the center of its assigned cluster is minimized. The gif below illustrates the behavior of the algorithm:
Put briefly, the \(k\)-means algorithm alternates between an assignment step, where each observation is assigned to the nearest cluster center, and an update step, where each cluster center is recomputed as the mean of the observations currently assigned to it.
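To make the alternation concrete, below is a bare-bones sketch of these two steps for \(k = 2\) on the USArrests data. This is an illustration only (it omits convergence checks and the empty-cluster edge case), and kmeans() itself uses a more efficient algorithm.
set.seed(123)
X <- as.matrix(USArrests)
centers <- X[sample(nrow(X), 2), ] ## Start from 2 randomly chosen observations
for (iter in 1:10) {
  ## Assignment step: squared distance from every observation to each center
  d1 <- rowSums((X - matrix(centers[1, ], nrow(X), ncol(X), byrow = TRUE))^2)
  d2 <- rowSums((X - matrix(centers[2, ], nrow(X), ncol(X), byrow = TRUE))^2)
  assignment <- ifelse(d1 <= d2, 1, 2)
  ## Update step: each center becomes the mean of its currently assigned observations
  centers <- rbind(colMeans(X[assignment == 1, , drop = FALSE]),
                   colMeans(X[assignment == 2, , drop = FALSE]))
}
table(assignment) ## Cluster sizes after 10 iterations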
In R, the kmeans function is used to perform \(k\)-means clustering. For example, we can group the states in the USArrests data into two clusters:
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
The centers argument specifies the number of clusters we want (i.e., \(k\)), while nstart specifies how many random starting configurations the algorithm should try; the solution with the lowest total WSS is kept. Using several starts makes the results more stable, since a single random start can occasionally converge to a poor solution.
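For instance (a rough sketch, and the exact numbers will vary from run to run), the tot.withinss component of the kmeans() output records the achieved total WSS, so we can compare a single random start to 25 starts:
## Compare the achieved total WSS for 1 random start vs. 25 random starts
kmeans(USArrests, centers = 2, nstart = 1)$tot.withinss
kmeans(USArrests, centers = 2, nstart = 25)$tot.withinss ## Never worse, and more stable across runs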
We can visualize these clusters using fviz_cluster, which displays the clusters (which, by default, were created using all columns of USArrests) in a scatterplot where the first two principal component scores define the x-y coordinates of each observation.
fviz_cluster(k2, data = USArrests)
Notice that fviz_cluster labeled each point. By default, these labels are determined by the rownames of the data object, which you can change using the rownames function. Additionally, you can specify repel = TRUE to ask fviz_cluster to prevent overlapping labels.
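For example, the sketch below relabels the points with state abbreviations and repels the labels (it assumes the rows of USArrests are in the same alphabetical order as R’s built-in state.abb vector, which is the case for this dataset):
USArrests_abbr <- USArrests ## Work on a copy so the original rownames are preserved
rownames(USArrests_abbr) <- state.abb ## Both are ordered alphabetically by state name
k2_abbr <- kmeans(USArrests_abbr, centers = 2, nstart = 25)
fviz_cluster(k2_abbr, data = USArrests_abbr, repel = TRUE)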
Rather than using PC1 and PC2, we could choose to display the clusters using any two of the original variables:
fviz_cluster(k2, data = USArrests, choose.vars = c("UrbanPop", "Murder"), stand = FALSE, repel = TRUE)
Because the variables in USArrests are on different scales, we should consider standardizing the data using the scale function prior to clustering:
Std_USArrests <- scale(USArrests)
ks <- kmeans(Std_USArrests, centers = 2)
fviz_cluster(ks, data = Std_USArrests)
Standardization ensures that each variable is given equal importance when determining the clusters. In many cases, standardizing the data is desirable, but there are some situations where it makes sense to cluster on unstandardized data. We’ll see an example of this soon.
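As a quick check (optional), each standardized column should now have mean 0 and standard deviation 1:
round(colMeans(Std_USArrests), 10) ## Column means are (numerically) zero
apply(Std_USArrests, 2, sd) ## Column standard deviations are one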
After we’ve applied \(k\)-means clustering, we might be interested in describing or summarizing each cluster. One way to do this is using the cluster centers; for \(k\)-means clustering, these are the averages of each variable across the members of a cluster:
ks$centers
## Murder Assault UrbanPop Rape
## 1 1.004934 1.0138274 0.1975853 0.8469650
## 2 -0.669956 -0.6758849 -0.1317235 -0.5646433
In this example, we see that the first cluster contains states with higher crime rates and somewhat higher urbanization, while the second cluster contains states with lower crime rates and less urbanization.
When interpreting the cluster centers, it is important to consider whether the data were standardized prior to clustering. In our example we did standardize, so the cluster centers are reported in standardized units (z-scores) and should be described and interpreted accordingly.
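If standardized units are awkward to communicate, the centers can be converted back to the original units using the column means and standard deviations that scale() stores as attributes of Std_USArrests (a short sketch):
## Convert the standardized centers back to the original units of USArrests
orig_centers <- sweep(ks$centers, 2, attr(Std_USArrests, "scaled:scale"), "*")
orig_centers <- sweep(orig_centers, 2, attr(Std_USArrests, "scaled:center"), "+")
orig_centers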
As previously mentioned, the \(k\)-means algorithm requires the analyst to specify \(k\), but choosing the optimal number is tricky.
As a general guideline, \(k\) should be large enough to yield relatively homogenous clusters, but small enough to limit unnecessary complexity. Sometimes \(k\) might be based on the needs of a particular analysis or an expert opinion. In other situations, \(k\) might be chosen based upon empirical methods.
Several methods exist to guide the choice of \(k\). We will briefly cover three of the most popular strategies:
Recall that the \(k\)-means algorithm minimizes the total within-cluster sum of squares, \(\sum_{j=1}^{k} W(C_j) = \sum_{j=1}^{k} \sum_{x_i \in C_j} (x_i - \mu_j)^2\). This quantity always decreases as \(k\) increases, but past a certain point adding more clusters tends not to provide much improvement.
We can visualize the reduction in WSS versus different choices of \(k\) using the fviz_nbclust function:
fviz_nbclust(Std_USArrests, kmeans, method = "wss", k.max = 8)
The Elbow Method suggests the optimal number of clusters should be chosen by this plot’s “elbow”, or where the curve appears to bend.
In the USArrests example, the elbow method suggests 2 or 3 clusters would be reasonable, since more clusters don’t do much to decrease the WSS.
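The quantity being plotted can also be computed directly, since kmeans() stores the achieved total WSS in its tot.withinss component. The sketch below is a rough manual version of the same plot:
## Compute the total WSS for k = 1 through 8 and plot it against k
set.seed(123)
wss <- sapply(1:8, function(k) kmeans(Std_USArrests, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")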
The Silhouette Method measures how well each observation fits within its assigned cluster by calculating a silhouette value. Silhouette values measure how similar an observation is to its assigned cluster (cohesion) relative to how similar it is to the other clusters (separation).
If you’d like a more mathematical description of how silhouette values are calculated, the Wikipedia page is fairly accessible.
Silhouette values range from -1 to +1, with high values indicating an observation fits well within its assigned cluster, and low values indicating poor fit within its assigned cluster. They can be used to assess individual observations, or the average silhouette can be used to assess the choice of \(k\).
fviz_nbclust(Std_USArrests, kmeans, method = "silhouette", k.max = 8)
For the USArrests data, the average silhouette metric suggests that \(k = 2\) is optimal.
Silhouettes are also used to better understand the cluster membership of individual observations:
k2 <- kmeans(USArrests, centers = 2, nstart = 25)
sil <- silhouette(k2$cluster, dist(USArrests), ordered = FALSE)
row.names(sil) <- row.names(USArrests) # Needed to use label option
fviz_silhouette(sil, label = TRUE)
## cluster size ave.sil.width
## 1 1 21 0.58
## 2 2 29 0.60
In this example we see there are a handful of states that aren’t great fits in their assigned clusters, and we’d want to avoid over-emphasizing the cluster membership for those states when reporting on the trends in these data.
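To see exactly which states those are, we can sort the silhouette widths stored in the sil object created above (a short sketch):
## List the states with the smallest silhouette widths (the poorest fits)
sil_df <- data.frame(state = row.names(sil),
                     cluster = sil[, "cluster"],
                     sil_width = sil[, "sil_width"])
head(sil_df[order(sil_df$sil_width), ], 5)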
The gap statistic compares the WSS achieved by a certain choice of \(k\) with the WSS that would be expected for that \(k\) if the data had no real cluster structure. The code below calculates the gap statistic for \(k=1\) through \(k=8\):
fviz_nbclust(Std_USArrests, kmeans, method = "gap", k.max = 8)
By default, fviz_nbclust suggests the smallest \(k\) for which the gap statistic is within one standard error of its maximum. The rationale is to provide more interpretable results (fewer clusters) while still achieving performance that isn’t substantially different from the maximum.
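The gap statistic can also be computed directly with the clusGap function from the cluster package and then plotted with factoextra (a sketch; the B argument controls how many reference datasets are simulated):
## Compute the gap statistic for k = 1 through 8 using 50 simulated reference datasets
set.seed(123)
gap <- clusGap(Std_USArrests, FUNcluster = kmeans, nstart = 25, K.max = 8, B = 50)
fviz_gap_stat(gap)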
Remark: You’ll notice in the USArrests example that each method suggested a slightly different number of clusters. This is a common occurrence, and the exact value of \(k\) usually comes down to a decision made by the analyst. You should view the elbow, silhouette, and gap statistic as tools to help guide your choice.
\(k\)-means clustering is an interpretable and widely used method, but it does have a few weaknesses: it is sensitive to outliers, its cluster centers are not actual data-points, and it cannot directly handle categorical variables.
Partitioning Around Medoids (PAM) is an alternative clustering approach that searches for \(k\) data-points called medoids, which serve as cluster centers. The remaining data-points are assigned to the cluster defined by the nearest medoid.
pam_std <- pam(Std_USArrests, k = 3)
pam_std$medoids ## Print the medoids
## Murder Assault UrbanPop Rape
## New Mexico 0.8292944 1.3708088 0.3081225 1.1603196
## Oklahoma -0.2727580 -0.2371077 0.1699510 -0.1315342
## New Hampshire -1.3059321 -1.3650491 -0.6590781 -1.2525644
fviz_cluster(pam_std) ## Plot the clusters
PAM addresses many of the drawbacks of \(k\)-means: it is more robust to outliers, its cluster centers are themselves data-points (making the results more interpretable), and, as we’ll see later, PAM can be used to cluster data containing categorical variables.
These advantages don’t come without weaknesses, the biggest of which is computational efficiency: PAM doesn’t scale well to large datasets. Additionally, PAM can be sensitive to the starting medoids.
The OnlineRetail dataset contains over 500,000 transactions from a UK-based retailer occurring between 1/12/2010 and 9/12/2011. For details on these data, visit the documentation at the UC-Irvine machine learning repository.
## Reading the data takes a long time, they contain >500k purchase records
data <- read.csv("https://remiller1450.github.io/data/OnlineRetail.csv")
The goal of this application will be to perform an RFM (Recency, Frequency, and Monetary value) customer segmentation analysis (see https://en.wikipedia.org/wiki/RFM_(customer_value)). The idea is to group the customers into clusters using the RFM variables. Cluster characteristics could then be used to evaluate customer value and/or guide marketing efforts.
For this application we will restrict our analysis to only include records with complete information:
data <- data[complete.cases(data),]
To perform the analysis, we’ll need to derive suitable “Recency”, “Frequency”, and “Monetary” value variables ourselves. The most challenging to derive is Recency, which we will choose to define as the number of days since a customer’s most recent purchase. Setting up this variable requires some knowledge of the “date” variable type, so the code to create it, along with some comments, is provided below:
dates <- as.Date(substr(data$InvoiceDate,1,10), "%m/%d/%Y") ## Extract only the date, specifying its format
data$days_since <- as.Date("2011-12-10") - dates ## Subtract from a reference date
Rec <- data %>% group_by(CustomerID) %>% summarise(min = min(days_since)) ## Keep only the minimum per customer
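A couple of quick sanity checks (optional): subtracting two Date objects produces a “difftime” measured in days, which is why the recency values get wrapped in as.numeric() when the RFM data frame is assembled below.
class(data$days_since) ## A difftime, measured in days
summary(as.numeric(Rec$min)) ## Recency (days since last purchase) across customers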
Next we’ll create the Frequency variable, which we will define as the number of purchases a customer made during the span of these data.
Freq <- data %>% group_by(CustomerID) %>% summarise(npurchases = n()) ## Derive a variable for customer frequency
Finally we create the Monetary value variable, which we will define as the total purchase amount made by each customer:
## Derive a variable for monetary value
data <- mutate(data, total = Quantity*UnitPrice)
Mon <- data %>% group_by(CustomerID) %>% summarise(mon = sum(total))
Putting these three variables together:
RFM_data <- data.frame(R = as.numeric(Rec$min), F = Freq$npurchases, M = Mon$mon)
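Note that stacking the three summaries side-by-side relies on each of them being sorted by the same CustomerID values. A slightly more defensive alternative (a sketch; RFM_data2 is just an illustrative name) merges them explicitly by CustomerID using dplyr joins:
RFM_data2 <- Rec %>%
  inner_join(Freq, by = "CustomerID") %>% ## Merge frequency onto recency by customer
  inner_join(Mon, by = "CustomerID") %>% ## Then merge monetary value
  transmute(R = as.numeric(min), F = npurchases, M = mon)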
For our analysis of these data we’ll exclude customers with negative monetary values. These mostly represent customers who returned a purchase from the prior year who did not make a purchase during the year in question.
RFM_data <- filter(RFM_data, M > 0)
Question #1:
Investigate the distribution of these variables using visuals or summary statistics. Based upon your assessment, should any of these variables be transformed to mitigate the influence of outliers? Provide a brief justification.
Question #2:
Analyze these data using \(k\)-means clustering (transforming variables when necessary). Be sure to report on how many clusters you used, and how you might describe the members of each cluster to a non-statistician. A satisfactory answer should include 2-4 sentences and at least one visualization.
Question #3:
Analyze these data using PAM clustering. Be sure to report on how many clusters you used, and how you might describe the members of each cluster to a non-statistician. A satisfactory answer should include 2-4 sentences and at least one visualization.
\(k\)-means and PAM are both partitioning algorithms that require a pre-specified number of clusters. In contrast, hierarchical clustering aggregates (or splits) the data into a tree-like representation known as a dendrogram, which can be cut to define a desired number of clusters.
There are two categories of approaches to hierarchical clustering: agglomerative (bottom-up) methods, which begin with each observation as its own cluster and repeatedly merge the most similar clusters, and divisive (top-down) methods, which begin with all observations in a single cluster and repeatedly split it.
There are many different implementations and algorithms falling under these two general schemes. The example below illustrates Agglomerative Nesting (AGNES), an agglomerative algorithm, and Divisive Analysis (DIANA), a divisive algorithm.
d <- get_dist(scale(USArrests)) ## Hierarchical Clustering requires a distance matrix
ag <- agnes(d) ## AGNES
fviz_dend(ag, cex = 0.4, k = 4)
di <- diana(d) ## DIANA
fviz_dend(di, cex = 0.4, k = 4)
This example can be used to highlight a few key aspects of hierarchical clustering. First, hierarchical clustering operates on a distance matrix rather than the raw data; here we calculated it using get_dist (you can visit this link for additional details). Second, in fviz_dend, the argument k = 4 is used to cut the dendrogram at a point yielding 4 clusters.
Until now we’ve exclusively looked at datasets containing only numeric variables, since the clustering methods we’ve covered are based upon distance calculations that aren’t defined for categories. In order to generalize clustering to datasets containing categorical variables, we need an alternative method for calculating distance.
Gower distance is one way to calculate distance when the data contain both numeric and categorical variables. For any two data-points, \(a\) and \(b\), measured on \(p\) variables, it is defined (assuming complete data and equal weighting of the variables) as the average of variable-wise dissimilarities: \[ d(a, b) = \frac{1}{p} \sum_{j=1}^{p} h_j(a, b) \] The function \(h_j()\) depends on the type of variable \(j\): for a numeric variable, \(h_j(a, b) = |a_j - b_j| / R_j\), where \(R_j\) is the range of variable \(j\) in the data; for a categorical variable, \(h_j(a, b) = 0\) if \(a\) and \(b\) share the same category and \(1\) otherwise.
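As a tiny worked example (hypothetical data with one numeric and one categorical variable), the hand calculation below matches what the daisy function, introduced next, reports:
## Hand-computed Gower distance between observations 1 and 2 of a toy dataset
x <- data.frame(bedrooms = c(3, 2, 5), ac = factor(c("Yes", "No", "Yes")))
num_part <- abs(3 - 2) / diff(range(x$bedrooms)) ## Numeric: |3 - 2| / (5 - 2) = 1/3
cat_part <- 1 ## Categorical: "Yes" and "No" differ, so 1
(num_part + cat_part) / 2 ## Average over the 2 variables = 0.667
daisy(x, metric = "gower") ## The entry for observations 1 and 2 agrees with the hand calculation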
Gower distance can be used to construct a distance matrix, allowing clustering methods like PAM, AGNES, and DIANA to be used. The daisy function can be used to construct the Gower distance matrix, which we could plot or use to perform clustering:
# Example
homes <- read.csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
homes2 <- select(homes, style, built, bedrooms, bsmt, ac, area.living, area.lot)
D <- daisy(homes2, metric = "gower") ## Use daisy to calculate distance matrix
fviz_dist(D, show_labels = FALSE) ## We could view the distance matrix
pam_homes <- pam(D, k = 3)
homes2[pam_homes$medoids, ] ## Print the medoids
## style built bedrooms bsmt ac area.living area.lot
## 219 1 Story Frame 1965 3 Full Yes 1128 7875
## 431 1 Story Frame 1999 2 None Yes 1182 5087
## 108 2 Story Frame 1994 3 Full Yes 1359 8313
For many applications we’d want to check whether using \(k = 3\) clusters is appropriate. Unfortunately, the fviz_nbclust function only accepts numeric data matrices, so we’ll need to do this manually. Additionally, the gap statistic cannot be calculated for non-numeric data. Luckily, we can still use the average silhouette width and the elbow method with a little bit of extra code:
### Average Silhouette Width
avg_sil = numeric(9) ## Setup object to store avg sil width for each possible k
for(k in 2:10){
pam_homes <- pam(D, k = k)
avg_sil[k-1] <- pam_homes$silinfo$avg.width
}
plot(x = 2:10, y = avg_sil, type = "b", xlab = "k", ylab = "Avg Silhouette")
### Elbow Method
elbow = numeric(9)
for(k in 2:10){
pam_homes <- pam(D, k = k)
elbow[k-1] <- pam_homes$objective[1]
}
plot(x = 2:10, y = elbow, type = "b", xlab = "k", ylab = "Objective")
The colleges2015 dataset contains information on predominantly bachelor’s-degree granting institutions. College applicants might be interested in seeing groups of similar schools, which we will determine using clustering. The colleges_reduced data.frame contains a subset of variables that will be used for clustering.
colleges2015 <- read.csv('https://raw.githubusercontent.com/ds4stats/r-tutorials/master/data-wrangling/data/colleges2015.csv')
colleges2015 <- colleges2015[complete.cases(colleges2015),]
colleges_reduced <- select(colleges2015, type, admissionRate, ACTmath, ACTenglish, undergrads, cost, gradRate, debt)
rownames(colleges_reduced) <- paste(colleges2015$college, colleges2015$state) ## Add school names as rownames
Question #4:
Use the daisy function to calculate the Gower distance matrix for the colleges_reduced data. Then perform a clustering analysis using PAM. You should have a justification for the number of clusters you used, and you should interpret the trends you see in your clusters.
Question #5:
Filter the colleges2015 data to only include schools in the “New England” region and select the same variables that were contained in the colleges_reduced data. Then cluster your filtered dataset using a hierarchical clustering algorithm of your choice (AGNES or DIANA). Plot a dendrogram displaying \(k = 4\) clusters, and interpret these clusters.