In this lab we will shift our focus toward machine learning methods themselves and away from the inner workings of Python, numpy, and pandas. You should be prepared to use information from Lab 1, but also to consult published documentation to learn about useful functions, methods, and attributes on your own.
## Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
The data used in this lab's applications are loaded below. These data describe the workforce characteristics of college degree holders in a variety of fields:
jobs = pd.read_csv("https://remiller1450.github.io/data/majors.csv")
jobs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Major                    34 non-null     object 
 1   Category                 34 non-null     object 
 2   Workforce                34 non-null     float64
 3   Per_Male                 34 non-null     float64
 4   Per_Female               34 non-null     float64
 5   Age_25_34                34 non-null     float64
 6   Age_35_44                34 non-null     float64
 7   Age_45_54                34 non-null     float64
 8   Age_55_64                34 non-null     float64
 9   Per_White                34 non-null     float64
 10  Per_Black                34 non-null     float64
 11  Per_Asian                34 non-null     float64
 12  Per_Hispanic             34 non-null     float64
 13  Per_Other_Race           34 non-null     float64
 14  Per_Bachelors            34 non-null     float64
 15  Per_Masters              34 non-null     float64
 16  Per_Professional_Degree  34 non-null     float64
 17  Per_PhD                  34 non-null     float64
 18  Per_Employed             34 non-null     float64
 19  Per_Unemployed           34 non-null     float64
 20  Per_Non_Labor            34 non-null     float64
 21  Bach_Med_Income          34 non-null     float64
 22  Grad_Med_Income          34 non-null     float64
dtypes: float64(21), object(2)
memory usage: 6.2+ KB
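Before clustering, it can help to glance at a few rows of these data. As a quick, illustrative sketch (the columns referenced below all appear in the info() output above):
## Preview each major's category alongside a couple of workforce variables
jobs[['Major', 'Category', 'Per_Bachelors', 'Bach_Med_Income']].head()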
Examples throughout the lab will use the "moons" and "blobs" toy data sets that were introduced in Lab 1:
## Datasets module
from sklearn import datasets
## Toy Dataset #1 - "moons"
moons = datasets.make_moons(n_samples=500, noise=0.11, random_state=8)
moons_X = moons[0] # features to use in clustering
moons_labels = moons[1] # true cluster identities
## Toy Dataset #2 - "blobs"
blobs = datasets.make_blobs(n_samples=500, cluster_std=2, random_state=8)
blobs_X = blobs[0] # features to use in clustering
blobs_labels = blobs[1] # true cluster identities
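If you'd like to see the shapes of these toy data sets before clustering them, a quick scatter plot colored by the true labels (mirroring the plotting pattern used later in this lab) will do:
## Visualize the "moons" data using its true cluster labels
plt.scatter(moons_X[:,0], moons_X[:,1], c = moons_labels)
plt.show()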
Agglomerative clustering is implemented in the cluster module of sklearn. The basic workflow for many models implemented in sklearn is to first initialize a model, then fit it to your data using its fit() method. Later in this lab we will use the fit_predict() method instead, since it also returns each sample's cluster label:
from sklearn.cluster import AgglomerativeClustering
agg_cl = AgglomerativeClustering(linkage = 'single', distance_threshold=0, n_clusters=None).fit(moons_X)
The arguments distance_threshold=0 and n_clusters=None are used to ensure that the full dendrogram is fit to the data.
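Fitting the full tree this way populates the attributes that a dendrogram plot will need. As a quick check (both attributes below belong to a fitted AgglomerativeClustering object):
## Each row of children_ records one merge; distances_ gives the corresponding merge heights
print(agg_cl.children_.shape)  # (n_samples - 1, 2)
print(agg_cl.distances_[:5])   # distances of the first five merges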
Unfortunately, there is not currently an official function to plot dendrograms in sklearn, but the following function, which is described here, can be used to create a basic dendrogram:
from scipy.cluster.hierarchy import dendrogram
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)
## Part of the dendrogram for "moons"
plot_dendrogram(model = agg_cl, truncate_mode="level", p=10) # Plot is truncated at "level 10"
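The p argument controls how many merge levels are displayed, so a smaller value gives a coarser view of the same tree. For example:
## A coarser view of the same tree (top 3 levels only)
plot_dendrogram(model = agg_cl, truncate_mode="level", p=3)
plt.show()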
A few things you should notice in this example:
- Our plot_dendrogram() helper relies upon the dendrogram() function in the scipy library. This function is assigning colors to the branches of the dendrogram via its default arguments. You can read more here.
- You might also note that $k=2$ is the default number of clusters for AgglomerativeClustering(), as the quick check below confirms.
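To verify that default for yourself, a minimal sketch:
## With no arguments, AgglomerativeClustering defaults to k=2 clusters (with ward linkage)
default_cl = AgglomerativeClustering().fit_predict(moons_X)
print(np.unique(default_cl))  # two cluster labels: 0 and 1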
Question #1:
- Apply agglomerative clustering with ward linkage to the "moons" data set. Display the first 10 levels of the resulting dendrogram.
- How does this dendrogram compare to the one produced using single linkage? Highlight any noticeable differences you see.

It is possible to "cut" a hierarchical clustering model to obtain a set of $k$ non-overlapping clusters.
This practice allows us to compare the performance of hierarchical methods to alternatives like $k$-means, at least when we know the true clusters, as we do in the simulated "moons" and "blobs" examples. In real-world applications, we must use our own judgment to decide whether a clustering algorithm is producing useful and appropriate results.
The number of clusters can be chosen based upon the context of the application, or by inspecting the dendrogram to find a location where the height differences between merges begin to become large (an approach analogous to the "elbow" approach in $k$-means).
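One way to apply this idea programmatically is to look for a large jump between successive merge distances in the full tree; a rough sketch, reusing the full-tree model fit earlier:
## A large gap between the final merge distances suggests a natural cut point
full_tree = AgglomerativeClustering(linkage='single', distance_threshold=0, n_clusters=None).fit(moons_X)
print(full_tree.distances_[-5:])  # the five largest (final) merge distances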
Below we refit our single linkage agglomerative clustering using $k=3$ and display the results. We use the fit_predict() method to get the predicted cluster labels for the given data:
agg_cl = AgglomerativeClustering(linkage = 'single', n_clusters=3).fit_predict(moons_X)
plt.scatter(moons_X[:,0], moons_X[:,1], c = agg_cl)
plt.show()
In this example we see the failure of the single linkage method. Other linkage criteria, such as 'complete', tend to perform a little better, but still not well:
agg_cl = AgglomerativeClustering(linkage = 'complete', n_clusters=3).fit_predict(moons_X)
plt.scatter(moons_X[:,0], moons_X[:,1], c = agg_cl)
plt.show()
You might also have noticed that we did not standardize our features, so these clusters might be disproportionately influenced by our first feature (shown on the x-axis). Unfortunately, standardization isn't enough to yield the results we desire:
from sklearn.preprocessing import StandardScaler
moons_XS = StandardScaler().fit_transform(moons_X)
agg_cl = AgglomerativeClustering(linkage = 'complete', n_clusters=3).fit_predict(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = agg_cl)
plt.show()
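Because we know the true labels for these simulated data, one quick way to quantify the mismatch is a cross-tabulation of true versus predicted labels; a minimal sketch:
## Rows are the true "moons" labels; columns are the predicted clusters
print(pd.crosstab(moons_labels, agg_cl))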
Question #2:
In this section we'll demonstrate DBSCAN clustering using the "moons" data. The code below fits DBSCAN clusters to these data using the default arguments:
from sklearn.cluster import DBSCAN
results_dbscan = DBSCAN().fit(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = results_dbscan.labels_)
plt.show()
We see that the default values of eps and min_samples are far too large for our data, as they group all of the data into the same cluster. We can try some smaller values:
results_dbscan = DBSCAN(eps=0.1, min_samples=2).fit(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = results_dbscan.labels_)
plt.show()
This time we see that our hyperparameter choices are too small. For a simple data set like "moons", we can tune them manually through trial and error until we get very clean results:
results_dbscan = DBSCAN(eps=0.28, min_samples=10).fit(moons_XS)
plt.scatter(moons_XS[:,0], moons_XS[:,1], c = results_dbscan.labels_)
plt.show()
We were able to get a reasonable fit with 3 detected outliers. We can calculate what fraction of the data was flagged as an outlier:
## Find predicted assignments
outliers_dbscan = DBSCAN(eps=0.28, min_samples=10).fit_predict(moons_XS)
## Outliers are denoted by the label '-1', so we can see that 0.6% of the data were deemed outliers
np.sum(outliers_dbscan == -1)/outliers_dbscan.shape[0]
0.006
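If you also want the index positions of those outliers, rather than just their overall frequency, np.where can recover them; a quick sketch:
## Index positions of the observations labeled as outliers (-1)
print(np.where(outliers_dbscan == -1)[0])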
Question #3:
- Write a function that accepts three arguments, eps, min_samples, and data, and applies DBSCAN clustering to data using the given parameters. The function should then return the percentage of data that was deemed to be an outlier for that combination of parameters.
- Use your function to tune the DBSCAN hyperparameters by fixing min_samples=10, then searching over the other parameter.

Question #4:
Your task in this portion of the lab is to apply your knowledge of clustering to analyze the workforce data loaded at the start of the lab. Your goal is to prepare a brief report that groups the workforce in different degree fields into clusters based upon commonalities, such as similar salaries, similar racial/ethnic/gender distributions, similar rates of graduate degree attainment, etc.
You should explore at least two clustering algorithms (i.e., choose two of the following: $k$-means, agglomerative clustering, DBSCAN). You should be careful to standardize the data if necessary and to cluster using only the variables you deem relevant to the analysis. You must explain and justify all of your choices in a short "methods" paragraph, using graphs or numerical information as appropriate. You should then describe your final cluster assignments in a "results" paragraph, again using graphs or numerical information as appropriate.
This question is intentionally open-ended and is aimed at preparing you to use clustering methods on your own in the future. I encourage you to check in with me if you ever feel unsure about what you've done.