This lab continues our study of clustering, or methods for identifying meaningful groupings of similar data-points.

Directions (Please read before starting)

  1. Please work together with your assigned partner. Make sure you both fully understand each concept before you move on.
  2. Please record your answers and any related code for all embedded lab questions. I encourage you to try out the embedded examples, but you shouldn’t turn them in.
  3. Please ask for help, clarification, or even just a check-in if anything seems unclear.

\(~\)

Preamble

Packages and Datasets

This lab will continue using the factoextra package, and will also use some functions from the cluster package.

# load the following packages
# install.packages("cluster")
library(cluster)
library(factoextra)
library(ggplot2)
library(dplyr)

\(~\)

Distance and Categorical Variables

Clustering finds groups of similar data-points. Ideally, clusters are cohesive (members of a cluster are close to one another) and well separated (members of one cluster are far from the members of other clusters).

Our previous lab relied upon Euclidean distance: \[d_{Euclidean}(a,b) = \sqrt{\sum_{j = 1}^{p} (x_{a,j} - x_{b,j})^2}\]
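As a quick sanity check, we can compute this distance “by hand” for a pair of made-up numeric vectors (the values below are arbitrary) and compare against R’s built-in dist() function:

## Two made-up observations measured on p = 3 numeric variables
a <- c(1, 4, 2)
b <- c(3, 0, 2)

sqrt(sum((a - b)^2))  ## Euclidean distance calculated directly from the formula
dist(rbind(a, b))     ## the same value from R's dist() function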

However, relying on Euclidean distance is restrictive because it is not defined for categorical variables.

An alternative distance measure is Gower distance: \[d_{Gower}(a,b) = \tfrac{1}{p}\sum_{j = 1}^{p} h_j( x_{a,j}, x_{b,j})\]

The function \(h_j()\) differs depending upon whether \(X_j\) is numeric or categorical:

  • For numeric variables, \(h_j(x_{a,j}, x_{b,j}) = \tfrac{|x_{a,j} - x_{b,j}|}{R_j}\), where \(R_j\) is the observed range of \(X_j\)
  • For categorical variables, \(h_j(x_{a,j}, x_{b,j}) = 0\) if \(x_{a,j}\) and \(x_{b,j}\) are the same category, and \(h_j(x_{a,j}, x_{b,j}) = 1\) otherwise

Notice that \(h_j()\) has a range of \([0,1]\) regardless of whether \(X_j\) is categorical or numeric, so Gower distance implicitly standardizes our data. This means we don’t need to worry about using the scale() function when clustering with Gower distance.
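As a small illustration of how these pieces fit together, consider the made-up data frame below (the values are arbitrary), which has one numeric and one categorical variable. The daisy() function from the cluster package, which we’ll use throughout this lab, computes the same Gower distances we’d get by applying the formula by hand:

toy <- data.frame(age = c(20, 30, 50),
                  pet = factor(c("cat", "cat", "dog")))
daisy(toy, metric = "gower")  ## Gower distances between all pairs of rows

## By hand, the distance between rows 1 and 2 is (|20 - 30|/30 + 0)/2 = 0.167,
## since the range of age is 50 - 20 = 30 and both rows share the category "cat"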

\(~\)

Distance Matrices

A distance matrix, sometimes called a dissimilarity matrix, is an \(n\) by \(n\) matrix containing the distance between every pair of data-points.

To better understand Gower distance and distance matrices, let’s look at a simple data set consisting of 6 yes/no questions that describe \(n = 20\) different animal species:

Warm_blooded? Can_fly? Vertebrate? Endangered? Lives_in_groups? Has_hair?
ant No No No No Yes No
bee No Yes No No Yes Yes
cat Yes No Yes No No Yes
cephalopod No No No No No Yes
chimp Yes No Yes Yes Yes Yes
cow Yes No Yes No Yes Yes
duck Yes Yes Yes No Yes No
eagle Yes Yes Yes Yes No No
elephant Yes No Yes Yes Yes No
fly No Yes No No No No
frog No No Yes Yes NA No
hermit No No Yes No Yes No
lion Yes No Yes NA Yes Yes
lizard No No Yes No No No
lobster No No No No NA No
human Yes No Yes Yes Yes Yes
rabbit Yes No Yes No Yes Yes
salamander No No Yes No NA No
spider No No No NA No Yes
whale Yes No Yes Yes Yes No

\(~\)

We can calculate a matrix of Gower distances using the daisy() function in the cluster package:

animals <- read.csv("https://remiller1450.github.io/data/animals.csv", row.names = 1, stringsAsFactors = TRUE)
D <- daisy(animals, metric = "gower")  ## Calculate distance matrix
fviz_dist(D)  ## Visualize

The fviz_dist() call above displays the distance matrix as a heatmap, with each cell colored according to the distance between a pair of animals. Looking at this visualization, consider the following:

  • Why are all entries on the diagonal of this matrix exactly 0?
  • What must be true of lions (“lio”) and flies (“fly”) for their distance to be 1?
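If it helps, you can also look up individual entries of the distance matrix directly by coercing it into an ordinary matrix (a short sketch using the D object created above):

D_mat <- as.matrix(D)  ## coerce the dissimilarity object into an ordinary matrix
D_mat["cat", "duck"]   ## Gower distance between two specific animals
diag(D_mat)            ## the diagonal entries (each animal compared with itself)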

\(~\)

Hierarchical Clustering

Some clustering algorithms can use the distance matrix, rather than the raw data, to identify clusters. In this lab we’ll introduce hierarchical clustering, which can be applied to any distance matrix.

In contrast with prototype-based approaches like \(k\)-means or PAM, hierarchical clustering organizes observations into nested sets of clusters. These sets can be displayed using a tree-like structure known as a dendrogram:

ag <- agnes(D)  ## Apply a hierarchical clustering algorithm
fviz_dend(ag) + labs(title = "Clustering Animals")  ## Display the dendrogram

  • The top of the dendrogram is a single cluster containing all observations (i.e., \(k = 1\))
  • The bottom of the dendrogram has every observation in its own cluster (i.e., \(k = n\))
  • The y-value at which two clusters merge reflects the distance/dissimilarity between those two clusters

Note that we can always obtain partitional clusters by “cutting” a dendrogram:

fviz_dend(ag, k = 4) + labs(title = "Clustering Animals (k = 4)")

In this example, we asked for \(k = 4\) clusters, so the dendrogram was cut at a height of approximately 0.4. Recall that this height indicates that the members of these clusters differ by an average Gower distance of around 0.4, so we can say they’re relatively different when considering the scale of Gower distance (which is bounded by 1).
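If we want the cluster labels themselves (rather than just a colored dendrogram), one option is to convert the AGNES result into an hclust object and cut it with cutree() (a brief sketch using the ag object from above):

ag_hc <- as.hclust(ag)          ## convert the agnes result to an hclust object
groups <- cutree(ag_hc, k = 4)  ## integer cluster label for each animal
table(groups)                   ## number of animals in each of the 4 clusters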

\(~\)

Lab

Hierarchical Clustering Algorithms

To begin, we’ll introduce two basic philosophies behind different types of hierarchical clustering algorithms:

  • Agglomerative, or “bottom-up”, approaches start with each data-point as its own cluster (leaf) and combine similar clusters (branches) until all of the data-points are grouped together in a single cluster (root).
  • Divisive, or “top-down”, approaches start with all data-points in a single cluster (root), then divide that cluster into smaller clusters (branches). The resulting branches are further divided until each data-point is its own cluster (leaf).

The acronyms AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) describe two of the most popular methods. The code below gives an example of each:

D <- daisy(animals, metric = "gower")

ag <- agnes(D)
fviz_dend(ag, k = 3) + labs(title = "Clustering Animals (AGNES)")

di <- diana(D)
fviz_dend(di, k = 3) + labs(title = "Clustering Animals (DIANA)")

Notice that AGNES and DIANA are greedy algorithms (meaning each cluster merge/split is locally optimal, without any consideration of the overall hierarchy). Consequently, agglomerative and divisive approaches will usually produce different dendrograms.

Specific details of the AGNES and DIANA algorithms are summarized below:

AGNES Algorithm:

At each step, AGNES first identifies the pair of most similar clusters using one of the following methods:

  • method = "average" - the average distance between all data-points in one cluster relative to those in another cluster
  • method = "single" - the smallest distance between a single data-point in one cluster relative to another cluster (nearest neighbor)
  • method = "complete - the smallest instance of the largest distance between a single data-point in one cluster relative to another cluster (furthest neighbor)

The two clusters deemed most similar (i.e., having the smallest distance according to the selected method) are merged, and the algorithm moves on to determine the next merge. This continues until all observations have been merged into a single cluster.
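For example, we could refit the animals clustering under each merging method and compare the resulting dendrograms (a quick sketch that reuses the distance matrix D from earlier; agnes() uses method = "average" by default):

ag_single   <- agnes(D, method = "single")
ag_complete <- agnes(D, method = "complete")
fviz_dend(ag_single) + labs(title = "AGNES, single linkage")
fviz_dend(ag_complete) + labs(title = "AGNES, complete linkage")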

DIANA Algorithm:

  • DIANA begins by selecting the existing cluster with the largest diameter, defined as the largest pairwise distance between its members.
  • Within the selected cluster, DIANA locates the most disparate observation, defined as the member with the largest average distance to the other members of the cluster.
  • The most disparate observation is moved out of the cluster into a “splinter group”, which is the starting point of a new cluster (branch).
  • The remaining members of the original cluster are then iteratively assigned to either the “splinter group” or the “original party” (the initial cluster), depending on which group they are closer to (usually in terms of average pairwise distance).

This process is repeated until each data-point is its own cluster.
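As an illustration of the first step, the observation that initiates the “splinter group” is the one with the largest average distance to everything else. A short sketch using the animals distance matrix D from earlier:

D_mat <- as.matrix(D)
avg_dist <- rowSums(D_mat) / (nrow(D_mat) - 1)  ## average distance to the other animals
names(which.max(avg_dist))                      ## first candidate for the splinter group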

\(~\)

Question #1: The code below applies the AGNES algorithm with method = "single" to the USArrests data using the matrix of Euclidean distances calculated after the data are standardized. For Question #1, use the provided code and its output to answer the following questions:

Part A: How many clusters were present at the initialization (start) of this clustering algorithm?

Part B: What was the first merge/split that occurred? Use the distance matrix, D, to briefly justify why this merge/split happened first. Hint: min(D) will return the distance between the closest two data-points. You can confirm that this distance corresponds to the merge/split you identified by using View(as.matrix(D)).

Part C: If this dendrogram were cut to obtain \(k = 3\) clusters, how many members would belong to the largest of these clusters? Note: You may answer this question by inspecting the given output without writing any of your own code.

data("USArrests")
USArrests_std <- scale(USArrests)
D <- daisy(USArrests_std, metric = "euclidean")

ag1 <- agnes(D, method = "single")
fviz_dend(ag1, cex = 0.25) + labs(title = 'AGNES, Method = "Single"')

\(~\)

Question #2: The code below applies the DIANA algorithm to the distance matrix calculated from the standardized USArrests data. Using the code and output provided below, answer the following questions:

Part A: How many clusters were present at the initialization (start) of this algorithm?

Part B: Briefly describe the merge/split resulting from the first step of the algorithm.

Part C: Which observation do you believe was used to form the “splinter group” in the first step of the algorithm? Justify your answer. Hint: Because each row/column of the distance matrix corresponds to an observation, you might look at row or column means. Before doing so, you might need to coerce D into a matrix using as.matrix().

di1 <- diana(D)
fviz_dend(di1, cex = 0.25) + labs(title = 'DIANA')

\(~\)

Where to cut?

Dendrograms provide a unique perspective on the relationships between data-points, but oftentimes we’d still like to report cluster membership for an optimized choice of \(k\).

Fortunately, all of the methods described in our previous clustering lab can also be applied to hierarchical clustering. Unfortunately, some of them require that all of our variables be numeric.

The code below demonstrates three of these methods (the elbow, average silhouette, and gap statistic methods) using DIANA clustering on the standardized USArrests data from Question #2.

### The elbow method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "wss")

### The average silhouette method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "silhouette")

## The gap statistic method
fviz_nbclust(USArrests_std, FUNcluster = hcut, hc_func = "diana", method = "gap", 
             k.max = 6, verbose = FALSE)

The argument FUNcluster = hcut indicates that hierarchical clustering should be performed and then cut into clusters, while the argument hc_func specifies which hierarchical clustering algorithm should be used.
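Once you’ve settled on a value of \(k\), the hcut() function will both perform the hierarchical clustering and cut it into that many clusters in one step (a brief sketch; the choice k = 2 below is just a placeholder, not necessarily the optimal value):

hc_res <- hcut(USArrests_std, k = 2, hc_func = "diana")  ## fit DIANA, then cut at k = 2
fviz_dend(hc_res) + labs(title = "DIANA cut into k = 2 clusters")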

Question #3: The code below creates cds, a standardized subset of the Colleges 2019 data set that was used in our previous clustering lab. For this question, use the average silhouette method to find the optimal value of \(k\) for AGNES clustering performed on these data. Then, create a dendrogram displaying the AGNES clustering results with branches colored according to the value of \(k\) you selected. You should use Euclidean distance for this analysis.

colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>% filter(State == "IA") %>%
  select(Name, Enrollment, Adm_Rate, ACT_median, 
         Net_Tuition, FourYearComp_Males, FourYearComp_Females,
         Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),]  ## Keeps only complete cases (no missing values)
rownames(cd) <- cd$Name
cds <- scale(select(cd, -Name))

\(~\)

Applications

The aim of this section is to provide you with an opportunity to apply clustering algorithms to new data sets and to practice identifying scenarios where clustering might be a useful analysis approach.

Question #4: The “European foods” data set summarizes the results of a household survey administered in 16 European nations in 1975. The survey asked households whether they had various foods in their house at the time of the questionnaire. The data below display the percentage of households in each country that had the food in question (expressed as a numeric value between 0 and 100).

euro <- read.csv("https://remiller1450.github.io/data/europeanfoods.csv", row.names = 1, stringsAsFactors = TRUE)

Part A: Create a distance matrix using Euclidean distance without standardizing (since all variables are already on a 0-100 scale). Then, visualize this matrix using fviz_dist(). Which country’s household food habits are most similar to Britain’s?

Part B: Use AGNES clustering with average distance as the merging method to create a dendrogram displaying these data.

Part C: Use DIANA clustering to create a dendrogram displaying these data.

Part D: Sweden, Norway, Denmark, and Finland are commonly grouped together as a region known as the Nordic Countries. Which algorithm, AGNES or DIANA, seems to do a better job capturing this grouping? Briefly explain your answer.

Part E: Use the average silhouette method to determine an optimal number of clusters for the application of AGNES clustering in Part B. Then, use this value to color the dendrogram you created in Part B.

Part F: Now use the gap statistic to determine an optimal number of clusters (again using the clustering results from Part B). Does this method agree with the average silhouette approach?

\(~\)

Question #5: The “starwars” data set documents the attributes of major characters in the Star Wars film series. For this question, you will perform a cluster analysis on the subset created below. The variable definitions are largely self-explanatory, but you can run ?starwars in the console for descriptions.

data("starwars")
sw <- starwars %>%
  select(name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species) %>% 
  mutate_if(is.character, as.factor)
sw <- as.data.frame(sw[complete.cases(sw),])
rownames(sw) <- sw$name
sw <- select(sw, -name)

Part A: After inspecting the contents of sw, briefly explain why Gower distance should be used to cluster these data.

Part B: Create a distance matrix and visualize it using fviz_dist().

Part C: Using the AGNES algorithm, create a dendrogram displaying these data.

Part D (optional, not graded): If you are familiar with Star Wars, do any relationships displayed in the dendrogram from Part C stand out as particularly interesting?