This short lab covers the tools available in sklearn for principal component analysis, with a focus on dimension reduction.
## Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
Examples throughout the lab will use a large-scale personality data set obtained from the Open-Source Psychometrics Project: https://openpsychometrics.org/. If you took Sta-230 with me, you might remember this data set.
These data contain several thousand responses to an online personality survey consisting of 50 statements rated on a 5-point Likert scale.
You can see the statements themselves at this link.
## Big 5 data
bf = pd.read_csv("https://remiller1450.github.io/data/big5data.csv", sep='\t')
## Split the personality questions from the demographics
bf_q = bf.drop(['race','age','engnat','gender','hand','source','country'], axis=1)
bf_demo = bf[['race','age','engnat','gender','hand','source','country']]
## Notice how the survey responses are coded (5 = "Agree", 1 = "Disagree")
bf_q.head(4)
| | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 | E9 | E10 | ... | O1 | O2 | O3 | O4 | O5 | O6 | O7 | O8 | O9 | O10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 2 | 5 | 2 | 5 | 1 | 4 | 3 | 5 | 1 | ... | 4 | 1 | 3 | 1 | 5 | 1 | 4 | 2 | 5 | 5 |
1 | 2 | 2 | 3 | 3 | 3 | 3 | 1 | 5 | 1 | 5 | ... | 3 | 3 | 3 | 3 | 2 | 3 | 3 | 1 | 3 | 2 |
2 | 5 | 1 | 1 | 4 | 5 | 1 | 1 | 5 | 5 | 1 | ... | 4 | 5 | 5 | 1 | 5 | 1 | 5 | 5 | 5 | 5 |
3 | 2 | 5 | 2 | 4 | 3 | 4 | 3 | 4 | 4 | 5 | ... | 4 | 3 | 5 | 2 | 4 | 2 | 5 | 2 | 5 | 5 |
4 rows × 50 columns
We'd expect a high degree of covariance in how participants tend to rate many of these statements. Thus, we might seek a lower-dimensional representation of these data, since we can retain most of the information about the personalities of respondents using fewer than 50 variables.
In sklearn, the primary class used to perform principal component analysis is PCA():
from sklearn.decomposition import PCA
pca_bfq = PCA().fit(bf_q)
Similar to the clustering functions we've worked with, we must use the fit()
method to perform PCA on our data.
You should also note that we did not standardize these data prior to fitting because every item is already recorded on the same 1-5 scale.
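If the variables had not shared a common scale, we could have standardized them before fitting. A minimal sketch using sklearn's StandardScaler is shown below purely for illustration (the names bf_q_std and pca_bfq_std are hypothetical):
## Rescale each question to mean 0 and standard deviation 1, then fit PCA (illustration only)
from sklearn.preprocessing import StandardScaler
bf_q_std = StandardScaler().fit_transform(bf_q)
pca_bfq_std = PCA().fit(bf_q_std)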
The explained_variance_ratio_
attribute of our fitted PCA object stores the variance explained by each component:
## Variance explained by components 1-5
pca_bfq.explained_variance_ratio_[0:5]
array([0.17308422, 0.10124683, 0.07403848, 0.05951973, 0.05665616])
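Because explained_variance_ratio_ is ordered from the largest component to the smallest, it can also be useful to accumulate these ratios to see how much total variance the first few components capture. A minimal sketch using numpy (no new objects beyond those created above):
## Cumulative proportion of variance explained by components 1-5
print(np.cumsum(pca_bfq.explained_variance_ratio_[0:5]))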
Question 1: Create a scree plot showing the variance explained by each component. Use this plot to identify the number of components that you think should be retained for these data.
This questionnaire was designed to measure the Big 5 Personality Traits, so we might opt to retain only 5 principal components.
To achieve this we need to refit the PCA with the argument n_components=5
, as the fit_transform()
method of PCA depends upon the number of components specified when the object is created:
## Perform the dimension reduction
pca_bfq_5comp = PCA(n_components=5).fit_transform(bf_q)
## Verify results
pca_bfq_5comp.shape
(19719, 5)
The fit_transform()
method is used to fit a PCA model with a specified number of principal components and return the lower-dimensional representation of the input data using those components. You should note that pca_bfq_5comp
contains the "scores" (i.e., coordinates) of each data-point in the retained principal component dimensions.
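If you find it convenient, these scores can be wrapped in a labeled DataFrame before plotting. A minimal sketch (the name scores_df and the PC1-PC5 labels are just illustrative choices):
## Store the component scores with readable column names (illustrative)
scores_df = pd.DataFrame(pca_bfq_5comp, columns=['PC1','PC2','PC3','PC4','PC5'])
scores_df.head()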
Question 2: Create a scatterplot displaying scores for the first and second principal components. Color each data-point by the demographic variable 'engnat' (whether English is the respondent's native language) and briefly comment upon whether you can visually see any noticeable differences in these scores between native English speakers and non-native speakers.
In this application we're interested in assigning meaningful labels to these 5 dimensions that were derived using PCA. This can be done by inspecting the most influential loadings in each component.
## Loadings for a particular principal component (sorted by abs magnitude)
PC1_loadings = pd.DataFrame({'Question': bf_q.columns, 'PC1': pca_bfq.components_[0]})
print(PC1_loadings.sort_values(by='PC1', ascending=False, key=abs).head(10))
   Question       PC1
6        E7 -0.263503
2        E3 -0.252589
4        E5 -0.240867
9       E10  0.223513
19      N10  0.222202
3        E4  0.211569
17       N8  0.205233
18       N9  0.197345
5        E6  0.196707
1        E2  0.195601
Here we can see that the 4 most influential statements in determining where an individual falls along the first principal component dimension are those labeled E7, E3, E5, and E10.
Additionally, you might notice that 7 of the 10 top contributors are statements with the "E" label. This is intentional: the creators of this questionnaire designed these statements to measure extroversion, which emerges as the most prominent personality dimension in these data.
Question 3:
Part A: Following the example above, inspect the most influential loadings of each of the five retained components of pca_bfq.
Part B: Based upon these loadings, decide whether you can label each component as one of the Big Five personality traits: extroversion, openness, agreeableness, conscientiousness, and neuroticism.
Part C: Regardless of your decision in Part B, apply $k$-means clustering to the scores in the first five components using $k=6$ clusters (a starter code sketch is given after the note below). For each prototype, briefly describe the defining personality characteristics of that cluster using the following template:
Prototype 1: [x1, x2, x3, x4, x5] Characteristics: High openness and low neuroticism
Note: you'd replace x1
- x5
with the central coordinates of the prototype in the five retained principal component dimensions. You should also note that the numerical label given to the cluster is arbitrary, which is why I want you to provide the prototype's coordinates in your answer.
The code given below loads responses to an online questionnaire aimed at capturing "dark triad" personality traits, which are Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy). For more information, you can visit this link: http://openpsychometrics.org/tests/SD3/
Like the Big 5 dataset, the statements are labeled according to which trait they were intended to measure.
Question 4:
For your reference, the image below (from the appendix of Jones and Paulhus (2014) https://journals.sagepub.com/doi/full/10.1177/1073191113514105) displays the individual survey items.
## Code to load the data
dt = pd.read_csv("https://remiller1450.github.io/data/dark_triad.csv", sep='\t')
## Screenshot of the Dark Triad survey items
from IPython.display import Image
Image("C:\\Users\\millerry\\OneDrive - Grinnell College\\Documents\\STA-395_Intro_ML\\Spring24\\Labs\\dark.PNG")