Lab 2 - Hierarchical Clustering and DBSCAN¶

In this lab we will shift our focus towards machine learning methods themselves and away from the inner workings of Python, numpy, and pandas. You should be prepared to use information from Lab 1, but also to consult published documentation to learn about useful functions, methods, and attributes on your own.

In [1]:
## Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

The data used in this lab's applications are loaded below. These data describe the workforce characteristics of college degree holders in a variety of fields:

In [2]:
jobs = pd.read_csv("https://remiller1450.github.io/data/majors.csv")
jobs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Major                    34 non-null     object 
 1   Category                 34 non-null     object 
 2   Workforce                34 non-null     float64
 3   Per_Male                 34 non-null     float64
 4   Per_Female               34 non-null     float64
 5   Age_25_34                34 non-null     float64
 6   Age_35_44                34 non-null     float64
 7   Age_45_54                34 non-null     float64
 8   Age_55_64                34 non-null     float64
 9   Per_White                34 non-null     float64
 10  Per_Black                34 non-null     float64
 11  Per_Asian                34 non-null     float64
 12  Per_Hispanic             34 non-null     float64
 13  Per_Other_Race           34 non-null     float64
 14  Per_Bachelors            34 non-null     float64
 15  Per_Masters              34 non-null     float64
 16  Per_Professional_Degree  34 non-null     float64
 17  Per_PhD                  34 non-null     float64
 18  Per_Employed             34 non-null     float64
 19  Per_Unemployed           34 non-null     float64
 20  Per_Non_Labor            34 non-null     float64
 21  Bach_Med_Income          34 non-null     float64
 22  Grad_Med_Income          34 non-null     float64
dtypes: float64(21), object(2)
memory usage: 6.2+ KB

Examples throughout the lab will use the "moons" and "blobs" toy data sets that were introduced in Lab 1:

In [3]:
## Datasets module
from sklearn import datasets

## Toy Dataset #1 - "moons"
moons = datasets.make_moons(n_samples=500, noise=0.11, random_state=8)
moons_X = moons[0]         # features to use in clustering
moons_labels = moons[1]    # true cluster identities

## Toy Dataset #2 - "blobs"
blobs = datasets.make_blobs(n_samples=500, cluster_std=2, random_state=8)
blobs_X = blobs[0]         # features to use in clustering
blobs_labels = blobs[1]    # true cluster identities
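
If you'd like a quick visual reminder of these two data sets, the optional sketch below plots each one colored by its true cluster labels (this step isn't required anywhere in the lab):

## Optional: visualize the toy data sets colored by their true cluster labels
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(moons_X[:, 0], moons_X[:, 1], c=moons_labels)
axes[0].set_title("moons")
axes[1].scatter(blobs_X[:, 0], blobs_X[:, 1], c=blobs_labels)
axes[1].set_title("blobs")
plt.show()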

Part 1 - Agglomerative clustering¶

Model Fitting and Dendrograms¶

Agglomerative clustering is implemented in the cluster module of sklearn. The basic workflow of many models implemented in sklearn is to first initialize a model, then fit it to your data using its fit() method. Later in the lab we will also use the fit_predict() method, which fits the model and additionally returns each sample's cluster label:

In [4]:
from sklearn.cluster import AgglomerativeClustering
agg_cl = AgglomerativeClustering(linkage = 'single', distance_threshold=0, n_clusters=None).fit(moons_X)

The arguments distance_threshold=0 and n_clusters=None are used to ensure that the full dendrogram is fit to the data.
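
If you're curious about what the fitted object stores, the attributes that the dendrogram-plotting function below relies on can be inspected directly (a quick exploratory sketch, not part of the required workflow):

## Attributes of the fitted model that describe the full merge tree
print(agg_cl.children_.shape)    # (n_samples - 1, 2): the two nodes merged at each step
print(agg_cl.distances_.shape)   # (n_samples - 1,): the distance at which each merge occurred
print(agg_cl.labels_[:10])       # cluster labels assigned to the first 10 samples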

Unfortunately, there is not currently an official function to plot dendrograms in sklearn, but the following function, which is described here, can be used to create a basic dendrogram:

In [5]:
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
In [6]:
## Part of the dendrogram for "moons"
plot_dendrogram(model = agg_cl, truncate_mode="level", p=10) # Plot is truncated at "level 10"

A few things you should notice in this example:

  1. The function relies upon the dendrogram() function in the scipy library. This function assigns colors to the branches of the dendrogram via its default arguments; a short sketch showing how to adjust this coloring appears after this list. You can read more here. You might also note that $k=2$ is the default number of clusters for AgglomerativeClustering().
  2. Along the x-axis of the dendrogram, numbers appearing in parentheses indicate the number of samples in that branch, while a number without parentheses is the index position of the single sample represented by that branch.
  3. We used the single linkage criterion, which is what causes the "chaining" behavior that can be seen in this subsection of the dendrogram (notably on the right side of the graph, where samples are merged into the cluster one by one).
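
Because plot_dendrogram() forwards any extra keyword arguments to scipy's dendrogram(), the branch coloring mentioned in the first point can be adjusted via the color_threshold argument. A small illustrative sketch (the threshold value chosen here is arbitrary):

## Branches that merge below color_threshold are drawn in a shared cluster color
## (agg_cl is still the full-tree fit from above at this point)
plot_dendrogram(model = agg_cl, truncate_mode="level", p=10, color_threshold=0.2)
plt.show()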

Question #1:

  • Part A: Apply agglomerative clustering using ward linkage to the "moons" data set. Display the first 10 levels of the resulting dendrogram.
  • Part B: How does this dendrogram compare to the example that used single linkage? Highlight any noticeable differences you see.

Cluster labels¶

It is possible to "cut" a hierarchical clustering model to obtain a set of $k$ non-overlapping clusters.

This practice allows us to compare the performance of hierarchical methods to alternatives like $k$-means, at least when we know the true clusters like we do in the simulated "moons" and "blobs" examples. In real-world applications, we must use our own judgement to decide if a clustering algorithm is producing useful and appropriate results.

The number of clusters can be chosen based upon the context of the application, or by inspecting the dendrogram to find a location where the height differences between merges begin to become large (an approach analogous to the "elbow" approach in $k$-means).
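
One rough way to look for such a location is to inspect the merge heights stored by a full-tree fit. The sketch below (the variable name full_tree is introduced here only for illustration) uses the distances_ attribute; a large gap between consecutive heights suggests a natural place to cut:

## Refit the full tree and inspect the largest merge heights (sketch)
full_tree = AgglomerativeClustering(linkage='single', distance_threshold=0, n_clusters=None).fit(moons_X)
top_heights = np.sort(full_tree.distances_)[-10:]
print(top_heights)            # the 10 tallest merges in the dendrogram
print(np.diff(top_heights))   # gaps between consecutive merge heights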

Below we refit our single linkage agglomerative clustering using $k=3$ and display the results. We use the fit_predict() method to get the predicted cluster labels for the given data:

In [7]:
agg_cl = AgglomerativeClustering(linkage = 'single', n_clusters=3).fit_predict(moons_X)
plt.scatter(moons_X[:,0], moons_X[:,1], c = agg_cl) 
plt.show()

In this example we see the failure of the single linkage method. Other linkage criteria, such as 'complete', tend to perform a little better, but still not well:

In [8]:
agg_cl = AgglomerativeClustering(linkage = 'complete', n_clusters=3).fit_predict(moons_X)
plt.scatter(moons_X[:,0], moons_X[:,1], c = agg_cl) 
plt.show()

You might also have noticed that we did not standardize our features, so these clusters might be disproportionately influenced by our first feature (shown on the x-axis). Unfortunately, standardization isn't enough to yield the results we desire:

In [9]:
from sklearn.preprocessing import StandardScaler
moons_XS = StandardScaler().fit_transform(moons_X) 

agg_cl = AgglomerativeClustering(linkage = 'complete', n_clusters=3).fit_predict(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = agg_cl) 
plt.show()

Question #2:

  • Part A: Analyze the "blobs" data set using agglomerative clustering to obtain $k=3$ clusters. Use scatter plots and your own judgements to decide upon an acceptable linkage criterion. Display a scatter plot showing cluster label assignments of your choice. Standardize the data before you begin your analysis.
  • Part B: Analyze the "blobs" data set using $k$-means clustering to obtain $k=3$ clusters. Display a scatter plot showing cluster label assignments of your choice. Standardize the data before you begin your analysis.
  • Part C: Compare the percentage of labels that $k$-means correctly identified with the percentage that agglomerative clustering correctly identified. Be careful to note that these labels are exchangeable, so you'll need to pay careful attention to their alignment (see the sketch after this list for one way to think about this).
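
To see what "exchangeable" means here, note that an algorithm may recover the right groups but number them differently than the true labels. One possible (not required) approach is to compute the match rate under every relabeling and keep the best one, sketched below on a tiny made-up example rather than the "blobs" data:

from itertools import permutations

## Tiny made-up example: the same grouping, numbered differently
true_labels = np.array([0, 0, 1, 1, 2, 2])
found_labels = np.array([2, 2, 0, 0, 1, 1])

## Try every mapping of found labels onto true labels and keep the best match rate
best = 0
for perm in permutations(np.unique(true_labels)):
    relabeled = np.array([perm[lab] for lab in found_labels])
    best = max(best, np.mean(relabeled == true_labels))
print(best)   # 1.0 -- a perfect match once the labels are aligned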

Part 2 - DBSCAN¶

In this section we'll demonstrate DBSCAN clustering using the "moons" data. The code below fits DBSCAN clusters to these data using the default arguments:

In [10]:
from sklearn.cluster import DBSCAN
results_dbscan = DBSCAN().fit(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = results_dbscan.labels_)
Out[10]:
<matplotlib.collections.PathCollection at 0x1643349ce50>

We see that the default values of eps and min_samples are not well suited to our data: with the default eps, all of the points end up grouped into a single cluster. We can try some smaller values:

In [11]:
results_dbscan = DBSCAN(eps=0.1, min_samples=2).fit(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = results_dbscan.labels_)
Out[11]:
<matplotlib.collections.PathCollection at 0x164323ca850>

This time we see that our hyperparameter choices are too small. For a simple data set like "moons" we can manually tune them through trial and error to get some nice-looking results:

In [12]:
results_dbscan = DBSCAN(eps=0.28, min_samples=10).fit(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = results_dbscan.labels_)
Out[12]:
<matplotlib.collections.PathCollection at 0x16432442160>

We notice that we were able to get a reasonable fit with 3 detected outliers. Below we calculate the proportion of the data that was flagged as an outlier:

In [13]:
## Find predicted assignments
outliers_dbscan = DBSCAN(eps=0.28, min_samples=10).fit_predict(moons_XS)

## Outliers are denoted by the label '-1', so we can see that 0.6% of the data were deemed outliers
np.sum(outliers_dbscan == -1)/outliers_dbscan.shape[0]
Out[13]:
0.006
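
If we want the actual index positions of these outlying points (rather than just their proportion), np.where can recover them (a small follow-up sketch):

## Index positions of the points labeled as outliers (-1)
outlier_idx = np.where(outliers_dbscan == -1)[0]
print(outlier_idx)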

Question #3:

  • Part A: Write a simple function that accepts the inputs: eps, min_samples, data and applies DBSCAN clustering to data using the given parameters. The function should then return the percentage of data that was deemed to be an outlier for that combination of parameters.
  • Part B: Use looping and the function you created in Part A to explore a sequence of parameter values and find a combination of parameters that identifies 1% of the "moons" data set as outliers. You may choose to fix one parameter, such as setting min_samples=10, then search over the other parameter.
  • Part C: Display a scatter plot showing the DBSCAN results you obtained in Part B.

Part 3 - Application¶

Question #4:

Your task in this portion of the lab is to apply your knowledge of clustering to analyze the workforce data loaded at the start of the lab. Your goal is to prepare a brief report that groups the workforce in different degree fields into clusters based upon commonalities, such as similar salaries, similar racial/ethnic/gender distributions, similar rates of graduate degree attainment, etc.

You should explore at least two clustering algorithms (i.e., choose two of the following: $k$-means, agglomerative clustering, DBSCAN). You should be careful to standardize the data if necessary and only cluster using variables you deem relevant to the analysis. You must explain and justify all of your choices in a short "methods" paragraph, using graphs or numerical information as appropriate. You should then describe your final cluster assignments in a "results" paragraph, again using graphs or numerical information as appropriate.
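
As a possible starting point (a minimal sketch only; which columns to keep is a choice you must make and justify yourself), you might separate the numeric workforce characteristics from the identifying columns and standardize them before clustering:

from sklearn.preprocessing import StandardScaler

## Possible starting point: keep the numeric columns and standardize them
jobs_numeric = jobs.drop(columns=['Major', 'Category'])    # drop the identifier columns
jobs_scaled = StandardScaler().fit_transform(jobs_numeric)
print(jobs_scaled.shape)   # one row per major, standardized features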

This question is intentionally open-ended and is aimed at preparing you to use clustering methods on your own in the future. I encourage you to check-in with me if you ever feel unsure about what you've done.