This short lab covers the tools available in sklearn for principal component analysis, with a focus on dimension reduction.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
Examples throughout the lab will use a large-scale personality data set obtained from the Open-Source Psychometrics Project: https://openpsychometrics.org/. If you took Sta-230 with me, you might remember this data set.
These data contain several thousand responses to an online personality survey consisting of 50 statements rated on a 5-point Likert scale.
You can see the statements themselves at this link.
## Big 5 data
bf = pd.read_csv("https://remiller1450.github.io/data/big5data.csv", sep='\t')
## Split the personality questions from the demographics
bf_q = bf.drop(['race','age','engnat','gender','hand','source','country'], axis=1)
bf_demo = bf[['race','age','engnat','gender','hand','source','country']]
## Notice how the survey responses are coded (5 = "Agree", 1 = "Disagree")
bf_q.head(4)
| | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | E10 | ... | O1 | O2 | O3 | O4 | O5 | O6 | O7 | O8 | O9 | O10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 2 | 5 | 2 | 5 | 1 | 4 | 3 | 5 | 1 | ... | 4 | 1 | 3 | 1 | 5 | 1 | 4 | 2 | 5 | 5 |
1 | 2 | 2 | 3 | 3 | 3 | 3 | 1 | 5 | 1 | 5 | ... | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 1 | 3 | 2 |
2 | 5 | 1 | 1 | 4 | 5 | 1 | 1 | 5 | 5 | 1 | ... | 4 | 5 | 5 | 1 | 5 | 1 | 5 | 5 | 5 | 5 |
3 | 2 | 5 | 2 | 4 | 3 | 4 | 3 | 4 | 4 | 5 | ... | 4 | 3 | 5 | 2 | 4 | 2 | 5 | 2 | 5 | 5 |
4 rows × 50 columns
We'd expect a high degree of covariance in how participants tend to rate many of these statements. Thus, we might seek a lower-dimensional representation of these data, since we can retain most of the information about the personalities of respondents using fewer than 50 variables.
In sklearn, the primary class used to perform principal component analysis is PCA():
from sklearn.decomposition import PCA
pca_bfq = PCA().fit(bf_q)
Similar to the clustering functions we've worked with, we must use the fit()
method to perform PCA on our data.
You should also note that we did not standardize these data prior to fitting because every item is already recorded on the same 1-5 scale.
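If the variables had not shared a common scale, we could have standardized them before fitting. A minimal sketch using sklearn's StandardScaler is shown below purely for illustration (the names bf_q_std and pca_bfq_std are hypothetical):
## Rescale each question to mean 0 and standard deviation 1, then fit PCA (illustration only)
from sklearn.preprocessing import StandardScaler
bf_q_std = StandardScaler().fit_transform(bf_q)
pca_bfq_std = PCA().fit(bf_q_std)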
The explained_variance_ratio_
attribute of our fitted PCA object stores the variance explained by each component:
## Variance explained by components 1-5
pca_bfq.explained_variance_ratio_[0:5]
array([0.17308422, 0.10124683, 0.07403848, 0.05951973, 0.05665616])
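Because explained_variance_ratio_ is ordered from the largest component to the smallest, it can also be useful to accumulate these ratios to see how much total variance the first few components capture. A minimal sketch using numpy (no new objects beyond those created above):
## Cumulative proportion of variance explained by components 1-5
print(np.cumsum(pca_bfq.explained_variance_ratio_[0:5]))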
Question 1: Create a scree plot showing the variance explained by each component. Use this plot to identify the number of components that you think should be retained for these data.
This questionnaire was designed to measure the Big 5 Personality Traits, so we might opt to retain only 5 principal components.
To achieve this we need to refit the PCA with the argument n_components=5
, as the fit_transform()
method of PCA depends upon the number of components specified when the object is created:
## Perform the dimension reduction
pca_bfq_5comp = PCA(n_components=5).fit_transform(bf_q)
## Verify results
pca_bfq_5comp.shape
(19719, 5)
The fit_transform()
method is used to fit a PCA model with a specified number of principal components and return the lower-dimensional representation of the input data using those components. You should note that pca_bfq_5comp
contains the "scores" (i.e., coordinates) of each data-point in the retained principal component dimensions.
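If you find it convenient, these scores can be wrapped in a labeled DataFrame before plotting. A minimal sketch (the name scores_df and the PC1-PC5 labels are just illustrative choices):
## Store the component scores with readable column names (illustrative)
scores_df = pd.DataFrame(pca_bfq_5comp, columns=['PC1','PC2','PC3','PC4','PC5'])
scores_df.head()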
Question 2: Create a scatterplot displaying scores for the first and second principal components. Color each data-point by the demographic variable 'engnat' (whether English is the respondent's native language) and briefly comment upon whether you can visually see any noticeable differences in these scores between native English speakers and non-native speakers.
In this application we're interested in assigning meaningful labels to these 5 dimensions that were derived using PCA. This can be done by inspecting the most influential loadings in each component.
## Loadings for a particular principal component (sorted by abs magnitude)
PC1_loadings = pd.DataFrame({'Question': bf_q.columns, 'PC1': pca_bfq.components_[0]})
print(PC1_loadings.sort_values(by='PC1', ascending=False, key=abs).head(10))
   Question       PC1
6        E7 -0.263503
2        E3 -0.252589
4        E5 -0.240867
9       E10  0.223513
19      N10  0.222202
3        E4  0.211569
17       N8  0.205233
18       N9  0.197345
5        E6  0.196707
1        E2  0.195601
Here we can see that the 4 most influential statements in determining where an individual falls along the first principal component dimension are those labeled E7, E3, E5, and E10.
Additionally, you might notice that 7 of the 10 top contributors are statements with the "E" label. This is intentional: the creators of this questionnaire designed these statements to measure extroversion, which emerges as the most prominent personality dimension in these data.
Question 3:
Part A: Following the example above, inspect the most influential loadings of each of the five retained components of pca_bfq.
Part B: Based upon these loadings, decide whether you can label each component as one of the Big Five personality traits: extroversion, openness, agreeableness, conscientiousness, and neuroticism.
Part C: Regardless of your decision in Part B, apply $k$-means clustering to the scores in the first five components using $k=6$ clusters (a starter code sketch is given after the note below). For each prototype, briefly describe the defining personality characteristics of that cluster using the following template:
Prototype 1: [x1, x2, x3, x4, x5] Characteristics: High openness and low neuroticism
Note: you'd replace x1
- x5
with the central coordinates of the prototype in the five retained principal component dimensions. You should also note that the numerical label given to the cluster is arbitrary, which is why I want you to provide the prototype's coordinates in your answer.
The code given below loads responses to an online questionnaire aimed at capturing "dark triad" personality traits, which are Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy). For more information, you can visit this link: http://openpsychometrics.org/tests/SD3/
Like the Big 5 dataset, the statements are labeled according to which trait they were intended to measure.
Question 4:
For your reference, the image below (from the appendix of Jones and Paulhus (2014) https://journals.sagepub.com/doi/full/10.1177/1073191113514105) displays the individual survey items.
## Code to load the data
dt = pd.read_csv("https://remiller1450.github.io/data/dark_triad.csv", sep='\t')
## Screenshot of the Dark Triad survey items
from IPython.display import Image
Image("C:\\Users\\millerry\\OneDrive - Grinnell College\\Documents\\STA-395_Intro_ML\\Spring24\\Labs\\dark.PNG")