This lab continues our study of clustering, or methods for identifying meaningful groupings of similar data-points.
Directions (Please read before starting)
\(~\)
This lab will continue using the factoextra package, and will also use some functions from the cluster package.
# load the following packages
# install.packages("cluster")
library(cluster)
library(factoextra)
library(ggplot2)
library(dplyr)
\(~\)
Clustering finds groups of similar data-points. Ideally, clusters are cohesive (all members are close to each other) with separation from each other (cluster members are far from the members of other clusters).
Our previous lab relied upon Euclidean distance: \[d_{euclidean}(a,b) = \sqrt{\sum_{j = 1}^{p} (x_{aj} - x_{bj})^2}\]
However, this reliance is very restrictive because Euclidean distance is not defined for categorical variables.
An alternative distance measure is Gower distance: \[d_{Gower}(a,b) = \tfrac{1}{p}\sum_{j = 1}^{p} h_j(x_{aj}, x_{bj})\]
The function \(h_j()\) differs depending upon whether \(X_j\) is numeric or categorical.
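In Gower's standard formulation (the version implemented by daisy()), the per-variable contributions are:
\[h_j(x_{aj}, x_{bj}) = \begin{cases} 0 & \text{if } X_j \text{ is categorical and } x_{aj} = x_{bj} \\ 1 & \text{if } X_j \text{ is categorical and } x_{aj} \neq x_{bj} \\ \frac{|x_{aj} - x_{bj}|}{\max(X_j) - \min(X_j)} & \text{if } X_j \text{ is numeric} \end{cases}\]
When a value is missing, the comparison for that variable is skipped and the average is taken over the variables available for both data-points (this is how daisy() handles the NA values in the animals data below).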
Notice that \(h_j()\) has a range of \([0,1]\) regardless of whether \(X_j\) is categorical or numeric, so Gower distance implicitly standardizes our data (meaning we don't need to use the scale() function when clustering with Gower distance).
\(~\)
A distance matrix, sometimes called a dissimilarity matrix, is an \(n\) by \(n\) matrix containing the pairwise distances between all pairs of data-points.
To better understand Gower distance and distance matrices, let’s look at a simple data set consisting of 6 yes/no questions that describe \(n = 20\) different animal species:
| | Warm_blooded? | Can_fly? | Vertebrate? | Endangered? | Lives_in_groups? | Has_hair? |
|---|---|---|---|---|---|---|
| ant | No | No | No | No | Yes | No |
| bee | No | Yes | No | No | Yes | Yes |
| cat | Yes | No | Yes | No | No | Yes |
| cephalopod | No | No | No | No | No | Yes |
| chimp | Yes | No | Yes | Yes | Yes | Yes |
| cow | Yes | No | Yes | No | Yes | Yes |
| duck | Yes | Yes | Yes | No | Yes | No |
| eagle | Yes | Yes | Yes | Yes | No | No |
| elephant | Yes | No | Yes | Yes | Yes | No |
| fly | No | Yes | No | No | No | No |
| frog | No | No | Yes | Yes | NA | No |
| hermit | No | No | Yes | No | Yes | No |
| lion | Yes | No | Yes | NA | Yes | Yes |
| lizard | No | No | Yes | No | No | No |
| lobster | No | No | No | No | NA | No |
| human | Yes | No | Yes | Yes | Yes | Yes |
| rabbit | Yes | No | Yes | No | Yes | Yes |
| salamander | No | No | Yes | No | NA | No |
| spider | No | No | No | NA | No | Yes |
| whale | Yes | No | Yes | Yes | Yes | No |
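For example, ant and bee give different answers to exactly two of the six questions (Can_fly? and Has_hair?), so their Gower distance is \(d_{Gower}(\text{ant}, \text{bee}) = \tfrac{2}{6} \approx 0.33\), while two identical rows would have a distance of 0.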
\(~\)
We can calculate a matrix of Gower distances using the daisy() function in the cluster package:
animals <- read.csv("https://remiller1450.github.io/data/animals.csv", row.names = 1, stringsAsFactors = TRUE)
D <- daisy(animals, metric = "gower") ## Calculate distance matrix
fviz_dist(D) ## Visualize
We can then visualize the distance matrix using fviz_dist().
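Individual entries of the distance matrix can also be inspected by coercing D to an ordinary matrix; a small sketch (the animal names used for indexing are just examples):
D_mat <- as.matrix(D)   ## coerce the dissimilarity object to a plain matrix
D_mat["cat", "lion"]    ## Gower distance between two specific animals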
\(~\)
Some clustering algorithms can use the distance matrix, rather than the raw data, to identify clusters. In this lab we’ll introduce hierarchical clustering, which can be applied to any distance matrix.
In contrast with prototype-based approaches like \(k\)-means or PAM, hierarchical clustering organizes observations into nested sets. These sets can be displayed using a tree-like structure known as a dendrogram:
ag <- agnes(D) ## Apply a hierarchical clustering algorithm
fviz_dend(ag) + labs(title = "Clustering Animals") ## Display the dendrogram
Note that we can always obtain partitional clusters by “cutting” a dendrogram:
fviz_dend(ag, k = 4) + labs(title = "Clustering Animals (k = 4)")
In this example, we asked for \(k = 4\) clusters, so the dendrogram was cut at a height of approximately 0.4. Recall that this height indicates that the members of these clusters differ by an average Gower distance of around 0.4, so we can say that they're relatively different when considering the scale of Gower distance.
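If you need the cluster labels themselves (not just the colored dendrogram), the tree can be cut programmatically. A minimal sketch, assuming the ag object created above (the ag_hc name is just an example):
ag_hc <- as.hclust(ag)   ## convert the agnes result to an hclust object
cutree(ag_hc, k = 4)     ## cluster membership (1 through 4) for each animal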
\(~\)
To begin, we'll introduce the two basic philosophies behind different types of hierarchical clustering algorithms:
- Agglomerative (bottom-up) clustering begins with each data-point as its own cluster and repeatedly merges the most similar clusters.
- Divisive (top-down) clustering begins with all data-points in a single cluster and repeatedly splits existing clusters apart.
The acronyms AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) describe two of the most popular methods of each type. The code below gives an example of each:
D <- daisy(animals, metric = "gower")
ag <- agnes(D)
fviz_dend(ag, k = 3) + labs(title = "Clustering Animals (AGNES)")
di <- diana(D)
fviz_dend(di, k = 3) + labs(title = "Clustering Animals (DIANA)")
Notice that AGNES and DIANA are greedy algorithms (meaning each cluster merge/split is locally optimal, without any consideration of the overall hierarchy). Consequently, agglomerative and divisive approaches will usually produce different dendrograms.
Specific details of the AGNES and DIANA algorithms are summarized below:
AGNES Algorithm:
At each step, AGNES first identifies the pair of most similar clusters using one of the following methods (compared in a short sketch below):
- method = "average" - the average distance between all data-points in one cluster and all data-points in another cluster
- method = "single" - the smallest distance between a data-point in one cluster and a data-point in another cluster (nearest neighbor)
- method = "complete" - the largest distance between a data-point in one cluster and a data-point in another cluster (furthest neighbor)
The two clusters deemed most similar (i.e., the smallest distance according to the selected method) are merged, and the algorithm moves on to determine the next merge. This is done until all data-points have been merged into a single cluster.
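The choice of linkage method can noticeably change the resulting tree. A small sketch comparing two of these methods on the animal distance matrix D created above (the object names below are just examples):
ag_single   <- agnes(D, method = "single")    ## nearest-neighbor linkage
ag_complete <- agnes(D, method = "complete")  ## furthest-neighbor linkage
fviz_dend(ag_single) + labs(title = "AGNES (single linkage)")
fviz_dend(ag_complete) + labs(title = "AGNES (complete linkage)")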
DIANA Algorithm:
At each step, DIANA selects the cluster with the largest diameter (i.e., the largest distance between any two of its members) and splits it: the data-point with the largest average distance to the cluster's other members breaks off to form a "splinter group", and any remaining data-points that are closer (on average) to the splinter group than to the rest of the cluster are moved into it. This process is repeated until each data-point is its own cluster.
\(~\)
Question #1: The code below applies the AGNES algorithm with method = "single" to the USArrests data using the matrix of Euclidean distances calculated after the data are standardized. For Question #1, use the provided code and its output to answer the following questions:
Part A: How many clusters were present at the initialization (start) of this clustering algorithm?
Part B: What was the first merge/split that occurred? Use the distance matrix, D, to briefly justify why this merge/split happened first. Hint: min(D) will return the distance between the closest two data-points. You can confirm that this distance corresponds to the merge/split you identified by using View(as.matrix(D)).
Part C: If this dendrogram were cut to obtain \(k = 3\) clusters, how many members would belong to the largest of these clusters? Note: You may answer this question by inspecting the given output without writing any of your own code.
data("USArrests")
USArrests_std <- scale(USArrests)
D <- daisy(USArrests_std, metric = "euclidean")
ag1 <- agnes(D, method = "single")
fviz_dend(ag1, cex = 0.25) + labs(title = 'AGNES, Method = "Single"')
\(~\)
Question #2: The code below applies the DIANA algorithm to the distance matrix calculated from the standardized USArrests data. Using the code and output provided below, answer the following questions:
Part A: How many clusters were present at the initialization (start) of this algorithm?
Part B: Briefly describe the merge/split resulting from the first step of the algorithm.
Part C: Which observation do you believe was used to form the "splinter group" in the first step of the algorithm? Justify your answer. Hint: Because each row/column of the distance matrix corresponds to an observation, you might look at row or column means. Before doing so, you might need to coerce D into a matrix using as.matrix().
di1 <- diana(D)
fviz_dend(di1, cex = 0.25) + labs(title = 'DIANA')
\(~\)
Dendrograms provide a unique perspective on the relationships between data-points, but oftentimes we’d still like to report cluster membership for an optimized choice of \(k\).
Fortunately, all of the methods for choosing \(k\) described in our previous clustering lab can also be applied to hierarchical clustering. Unfortunately, some of them require that all of our variables are numeric.
The code below demonstrates these three methods using DIANA clustering on the USArrests data described in Question #2.
### The elbow method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "wss")
### The average silhouette method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "silhouette")
## The gap statistic method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "gap",
k.max = 6, verbose = FALSE)
The argument FUNcluster = hcut indicates that hierarchical clustering is to be used and then cut into clusters, while the argument hc_func specifies which hierarchical clustering algorithm should be used.
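The hcut() function can also be called directly once a value of \(k\) has been chosen; a minimal sketch using the USArrests_std object created earlier (the choice of k = 4 and the hc4 name are just examples):
hc4 <- hcut(USArrests_std, k = 4, hc_func = "diana")   ## DIANA clustering cut into 4 clusters
hc4$cluster                                            ## cluster membership for each state
fviz_dend(hc4) + labs(title = "DIANA cut at k = 4")    ## dendrogram colored by the 4 clusters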
Question #3: The code below creates cds, a standardized subset of the Colleges 2019 data set that was used in our previous clustering lab. For this question, use the average silhouette method to find the optimal value of \(k\) for AGNES clustering performed on these data. Then, create a dendrogram displaying the AGNES clustering results with branches colored according to the value of \(k\) you selected. You should use Euclidean distance for this analysis.
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
select(Name, Enrollment, Adm_Rate, ACT_median,
Net_Tuition, FourYearComp_Males, FourYearComp_Females,
Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),] ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name
cds <- scale(select(cd, -Name))
\(~\)
The aim of this section is to provide you with an opportunity to apply clustering algorithms to new data sets and to practice identifying scenarios where clustering might be a useful analysis approach.
Question #4: The “European foods” data set summarizes the results of a household survey administered to 16 European nations in 1975. The survey asked households whether they had various foods in their house at the time of the questionnaire. The data below display the percentage of households in each country that had the food in question (expressed as a numeric value between 0 and 100).
euro <- read.csv("https://remiller1450.github.io/data/europeanfoods.csv", row.names = 1, stringsAsFactors = TRUE)
Part A: Create a distance matrix using Euclidean distance without standardizing (since all variables are already on a 0-100 scale). Then, visualize this matrix using fviz_dist(). Which country's household food habits are most similar to Britain's?
Part B: Use AGNES clustering with average distance as the merging method to create a dendrogram displaying these data.
Part C: Use DIANA clustering to create a dendrogram displaying these data.
Part D: Sweden, Norway, Denmark, and Finland are commonly grouped together as a region known as the Nordic Countries. Which algorithm, AGNES or DIANA, seems to do a better job capturing this grouping? Briefly explain your answer.
Part E: Use the average silhouette method to determine an optimal number of clusters for the application of AGNES clustering in Part B. Then, use this value to color the dendrogram you created in Part B.
Part F: Now use the gap statistic to determine an optimal number of clusters (again using the clustering results from Part B). Does this method agree with the average silhouette approach?
\(~\)
Question #5: The "starwars" data set documents the attributes of major characters in the Star Wars film series. For this question, you will perform a cluster analysis on the subset created below. The variable definitions are largely self-explanatory, but you can run ?starwars in the console for descriptions.
data("starwars")
sw <- starwars %>%
select(name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species) %>%
mutate_if(is.character, as.factor)
sw <- as.data.frame(sw[complete.cases(sw),])
rownames(sw) <- sw$name
sw <- select(sw, -name)
Part A: After inspecting the contents of sw, briefly explain why Gower distance should be used to cluster these data.
Part B: Create a distance matrix and visualize it using fviz_dist().
Part C: Using the AGNES algorithm, create a dendrogram displaying these data.
Part D (optional, not graded): If you are familiar with Star Wars, do any relationships displayed in the dendrogram from Part C stand out as particularly interesting?