This lab aims to provide a brief overview of a small number of important unsupervised learning methods. The content is completely optional, but you may receive up to 2 points of extra credit (on the in-class quiz grade component) for completing it.
To begin, you'll need to import the following libraries:
import pandas as pd
import sklearn
import numpy as np
Throughout the lab, we'll work with a few simulated datasets, along with the Big 5 personality data introduced in today's lecture:
## Big 5 data
bf = pd.read_csv("https://remiller1450.github.io/data/big5data.csv", sep='\t')
## Separate the questions
bfq = bf.drop(['race','age','engnat','gender','hand','source','country'], axis=1)
## Note the encoding of survey responses
bfq.head(4)
|   | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | E10 | ... | O1 | O2 | O3 | O4 | O5 | O6 | O7 | O8 | O9 | O10 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|----|----|----|----|----|----|----|----|----|-----|
| 0 | 4 | 2 | 5 | 2 | 5 | 1 | 4 | 3 | 5 | 1 | ... | 4 | 1 | 3 | 1 | 5 | 1 | 4 | 2 | 5 | 5 |
| 1 | 2 | 2 | 3 | 3 | 3 | 3 | 1 | 5 | 1 | 5 | ... | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 1 | 3 | 2 |
| 2 | 5 | 1 | 1 | 4 | 5 | 1 | 1 | 5 | 5 | 1 | ... | 4 | 5 | 5 | 1 | 5 | 1 | 5 | 5 | 5 | 5 |
| 3 | 2 | 5 | 2 | 4 | 3 | 4 | 3 | 4 | 4 | 5 | ... | 4 | 3 | 5 | 2 | 4 | 2 | 5 | 2 | 5 | 5 |

4 rows × 50 columns
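To double-check how the responses are coded, one quick sanity check (a small sketch; any question column would work) is to tabulate the values in a single column:
## Tabulate the responses to one question to inspect the coding scheme
print(bfq['E1'].value_counts().sort_index())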
## Two different simulated datasets
from sklearn import datasets
moons = datasets.make_moons(n_samples=500, noise=0.11, random_state=8)
blobs = datasets.make_blobs(n_samples=500, cluster_std=2, random_state=8)
Shown below are the simulated data that follow a "half moon" grouping:
import matplotlib.pyplot as plt
plt.scatter(moons[0][:,0], moons[0][:,1], c = moons[1])
And shown below are the simulated data that follow a "blobs" grouping:
import matplotlib.pyplot as plt
plt.scatter(blobs[0][:,0], blobs[0][:,1], c = blobs[1])
We'll demonstrate PCA using the Big 5 data, and we'll demonstrate clustering using the simulated data.
Recall that PCA is a dimension reduction technique that can be used to derive a lower-dimensional representation of higher-dimensional data. For the Big 5 data, this might be a representation of the survey responses to the 50 items in the questionnaire that explains most of the variation across participants. Based upon the construction of these 50 items, we'd expect most of the variation to exist in 5 derived dimensions, but we can use PCA to assess that.
To begin, we'll note that the PCA function in sklearn is used to perform principal component analysis.
from sklearn.decomposition import PCA
pca_bfq = PCA().fit(bfq)
Similar to other functions in sklearn, objects created by PCA have fit and transform methods that must be used when working with data. In our example, we use the fit method to perform PCA on the entire Big 5 dataset.
Next, we might ask ourselves how much of the variation in the 50 items present in the original dataset is present in each of the principal component dimensions. These explained variances are contained in the explained_variance_ratio_ attribute, and we can use them to create our own scree plot:
## Variance Explained by number of components - notice the elbow at either 5 or 7
plt.scatter(range(0,50), pca_bfq.explained_variance_ratio_)
plt.plot(pca_bfq.explained_variance_ratio_)
plt.show()
In this plot we're looking for the point at which components begin to contain a negligible amount of the variation that exists within the original data. For our dataset, it seems like this might occur at either 5 or perhaps 7 components. Something to consider is that if there were no patterns/groupings amongst the variables, we'd expect each component to contain roughly 2% of the total variation in the data (since there are 50 items). Thus, principal components that explain substantially more than 2% of the variance can be viewed as meaningful.
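To make that baseline concrete, here's a small sketch (using only quantities computed above) that counts how many components explain more than a uniform 1/50 share of the variance:
## Uniform baseline: with 50 items and no structure, each component would carry about 1/50 = 2% of the variance
baseline = 1 / bfq.shape[1]
print((pca_bfq.explained_variance_ratio_ > baseline).sum())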
After deciding upon a number of components to retain, we can look at the contributions of the original variables onto each of the retained components. Recall that these contributions are called loadings and they are coefficients in a linear combination. The code below creates a DataFrame containing the loadings for PC1, then it prints the top 10 loadings after sorting them by their absolute value.
## Loadings for a particular principal component (sorted by abs magnitude)
PC1_loadings = pd.DataFrame({'Question': bfq.columns, 'PC1': pca_bfq.components_[0]})
print(PC1_loadings.sort_values(by='PC1', ascending=False, key=abs).head(10))
   Question       PC1
6        E7 -0.263503
2        E3 -0.252589
4        E5 -0.240867
9       E10  0.223513
19      N10  0.222202
3        E4  0.211569
17       N8  0.205233
18       N9  0.197345
5        E6  0.196707
1        E2  0.195601
From these loadings, we can determine that PC1 is primarily expressing variation in the questions related to extraversion, as 4 of the top 5 contributors (and 7 of the top 10 contributors) are questions that were designed to measure this personality trait.
If desired, we could perform a similar investigation to assign labels/meanings to each of the other principal components we decided to retain.
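For instance, a minimal sketch of that investigation (assuming we retain 5 components) could loop over the first five components and print the largest loadings for each:
## Top 5 loadings (by absolute value) for each of the first five principal components
for j in range(5):
    loadings_j = pd.DataFrame({'Question': bfq.columns, 'Loading': pca_bfq.components_[j]})
    print('PC' + str(j + 1))
    print(loadings_j.sort_values(by='Loading', ascending=False, key=abs).head(5))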
Next, we might be interested in how individual data-points score in each principal component dimension. We can determine these scores using the transform method of our fitted PCA object. This is demonstrated by the code below, which also graphs the 3rd and 4th principal components colored by gender. In this graph we might notice a possible overrepresentation of blue points with above average scores in PC3.
## Scores for observed data
PC_scores = pca_bfq.transform(bfq)
## Plot PC3 vs. PC4 by gender
plt.scatter(PC_scores[:,2], PC_scores[:,3], c = bf['gender'], alpha = 0.2)
plt.show()
To understand what this means, we'd need to explore the loadings of PC3 to determine how an individual would need to respond to the survey items to end up with an above average score in this dimension.
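As a sketch of that follow-up, we could repeat the earlier loading inspection for PC3 (the third row of components_):
## Loadings for PC3 (sorted by absolute magnitude), which we'd use to interpret the pattern seen above
PC3_loadings = pd.DataFrame({'Question': bfq.columns, 'PC3': pca_bfq.components_[2]})
print(PC3_loadings.sort_values(by='PC3', ascending=False, key=abs).head(10))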
The code given below loads responses to an online questionnaire aimed at capturing "dark triad" personality traits, which are Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (a lack of empathy). For more information you can visit this link: http://openpsychometrics.org/tests/SD3/
Similar to the Big 5 dataset, the questions are labeled according to which trait they were intended to measure.
Question: Use the fit method to perform PCA on the dark triad data. Next, plot the variance explained by the components. Based upon this plot, does it seem reasonable to conclude that 3 underlying factors are reflected in the responses to these questions?
## Dark Triad Data
dt = pd.read_csv("https://remiller1450.github.io/data/dark_triad.csv", sep='\t')
dt.head(4)
|   | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | M9 | N1 | ... | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | country | source |
|---|----|----|----|----|----|----|----|----|----|----|-----|----|----|----|----|----|----|----|----|---------|--------|
| 0 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 4 | 2 | ... | 4 | 3 | 2 | 4 | 4 | 4 | 4 | 4 | GB | 1 |
| 1 | 2 | 1 | 5 | 2 | 2 | 1 | 2 | 2 | 3 | 1 | ... | 1 | 1 | 5 | 4 | 1 | 5 | 3 | 2 | US | 1 |
| 2 | 3 | 3 | 3 | 5 | 1 | 1 | 5 | 5 | 3 | 2 | ... | 5 | 3 | 1 | 3 | 1 | 2 | 3 | 1 | US | 1 |
| 3 | 5 | 5 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | ... | 1 | 5 | 2 | 5 | 5 | 5 | 1 | 5 | GB | 3 |

4 rows × 29 columns
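If you'd like a starting point for this question, one reasonable first step (mirroring the preparation of the Big 5 data above) is to separate the question columns from the country and source columns before fitting PCA:
## Possible starting point: keep only the question columns
dtq = dt.drop(['country', 'source'], axis=1)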
This section will provide a brief overview of the k-means and DBSCAN clustering implementations in sklearn. We'll first see how to apply k-means clustering to the "blobs" simulated dataset, a scenario in which we'd expect the algorithm to do a nice job identifying the underlying groupings of data-points.
Similar to most models in sklearn, we'll need to use the fit method to find the k-means cluster labels for our data:
from sklearn.cluster import KMeans
kmeans_fit = KMeans(n_clusters=2, random_state=0).fit(blobs[0])
In this example, we requested $k=2$ clusters. Let's see how they look:
## Scatterplot coloring by assigned cluster label
plt.scatter(blobs[0][:,0], blobs[0][:,1], c = kmeans_fit.labels_)
plt.show()
This is the result you might have expected when visually inspecting the data. However, we should recognize that our decision to fit $k=2$ clusters was somewhat arbitrary.
We can make a data-driven decision by looking at the inertia_ attribute for various choices of $k$. This attribute records the sum of squared distances of data-points to their closest cluster center. The code given below loops through various possible choices of $k$ and stores the inertia for each of them.
## Loop through choices of k ranging from 2 to 19
scores = []
for k in range(2, 20):
kmeans = KMeans(n_clusters=k)
kmeans.fit(blobs[0])
scores.append(kmeans.inertia_)
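As an optional check on that definition of inertia, the sketch below recomputes it by hand for the k=2 fit from earlier by summing each point's squared distance to its closest centroid; the two printed values should agree up to rounding:
## Recompute inertia by hand for the k=2 fit stored in kmeans_fit
centers = kmeans_fit.cluster_centers_
dists = np.linalg.norm(blobs[0][:, None, :] - centers[None, :, :], axis=2)  ## distance from every point to every center
manual_inertia = (dists.min(axis=1) ** 2).sum()  ## squared distance to the closest center, summed over points
print(manual_inertia, kmeans_fit.inertia_)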
We can plot the stored inertia values to decide the point at which adding additional cluster centers no longer appreciably improves how well our groups fit the data:
## Inertia by number of clusters - notice the elbow at k=3
plt.scatter(range(2, 20), scores)
plt.plot(range(2, 20), scores)
plt.show()
In this example, it's clear that 3 clusters do much better than 2; however, adding clusters beyond 3 doesn't seem to lead to much of an improvement in fit.
To wrap up this example, we can apply k-means with $k=3$ to the data and see how well it captures the true groupings that were used to simulate these data:
## Refit using proper choice of k=3, then visualize
kmeans_fit = KMeans(n_clusters=3, random_state=0).fit(blobs[0])
plt.scatter(blobs[0][:,0], blobs[0][:,1], c = kmeans_fit.labels_)
plt.show()
We can see that k-means did a good job with these data, but this is something that we'd expect. Finally, it's worth noting that you might want to consider the cluster centroids as a set of "prototypes" that you could report as a summary of the three groups that the algorithm identified. These centroids are contained in the cluster_centers_ attribute of the fitted model:
## Print cluster centroids (ie: X, Y coordinates in this example)
kmeans_fit.cluster_centers_
array([[ 7.28631075,  9.45081924],
       [-5.37639817, -9.66928774],
       [ 7.65759339,  0.79102154]])
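For example, treating the centroids as prototypes, the predict method assigns any new data-point to the cluster whose centroid is closest (the coordinates below are made up purely for illustration):
## Assign a hypothetical new point to its nearest cluster centroid
new_point = np.array([[7.0, 9.0]])
print(kmeans_fit.predict(new_point))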
In this section we'll demonstrate DBSCAN clustering using the "half moons" simulated data. The code below fits DBSCAN clusters to these data using the default arguments:
from sklearn.cluster import DBSCAN
results_dbscan = DBSCAN().fit(moons[0])
plt.scatter(moons[0][:,0], moons[0][:,1], c = results_dbscan.labels_)
Clearly the default arguments were not appropriate for the half moons example, so let's try tuning the algorithm. We'll begin by exploring smaller values for eps and min_samples:
results_dbscan = DBSCAN(eps=0.1, min_samples=2).fit(moons[0])
plt.scatter(moons[0][:,0], moons[0][:,1], c = results_dbscan.labels_)
This looks a little better, but there now seem to be too many clusters given the pattern that we know is present in these data. Based upon this inspection, we might adjust eps and min_samples upward.
For an illustration of just what DBSCAN can do, the tuning parameters below come extremely close to recapturing the true groupings in these data:
results_dbscan = DBSCAN(eps=0.2, min_samples=8).fit(moons[0])
plt.scatter(moons[0][:,0], moons[0][:,1], c = results_dbscan.labels_)
While we normally wouldn't know the true groupings we are seeking to capture, a useful feature of DBSCAN is its ability to detect outliers. In this example, we can see that there were three outliers that did not belong to either of the two clusters that were identified.
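These outliers are given the cluster label -1 in the labels_ attribute, so a quick sketch for locating them is:
## Noise points (outliers) receive the label -1 from DBSCAN
outlier_mask = results_dbscan.labels_ == -1
print(outlier_mask.sum())        ## number of points not assigned to a cluster
print(moons[0][outlier_mask])    ## coordinates of those points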
Additionally, it's worth noting that DBSCAN is not accompanied by any particular goodness-of-fit metric in sklearn, so the algorithm should be tuned to meet the practical needs of your application.