This lab focuses on principal component analysis (PCA), a method with many applications, most of which involve finding lower dimensional representations of complex data.
Directions (Please read before starting)
\(~\)
This lab will rely on the factoextra package for creating graphics that display the results of PCA. It will also use a few functions from the psych package and the corrplot package.
# load the following packages
# install.packages("factoextra")
# install.packages("psych")
# install.packages("corrplot")
library(psych)
library(factoextra)
library(ggplot2)
library(dplyr)
library(corrplot)
\(~\)
So far we’ve relied on data visualization as our primary method for identifying trends, but this has its limitations, especially when the number of variables is large.
Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]
Our goal in this lab is to find relationships in the columns of \(\mathbf{X}\) (ie: find patterns across variables) using principal component analysis (PCA). In doing so, we hope to come up with interpretable, low-dimensional representations of the key trends contained in our data.
\(~\)
The main idea behind PCA for dimension reduction is to construct new, uncorrelated dimensions (the principal components) that capture as much of the data's variation as possible, so that the dimensions capturing little variation can be dropped.
Consider the following example:
In Scenario A, the variables X and Y are redundant (knowing one tells you most of what there is to know about the other).
In Scenario B, there isn't much redundancy in X and Y.
The figure below shows the first principal component (red) and second principal component (blue) for each scenario:
The first (red) component is constructed to contain the maximum amount of variation in the data, and the second (blue) component is orthogonal to the first.
We can now define the location of each data-point using a coordinate scale in these new dimensions (ie: \(\{PC_1, PC_2\}\) rather than \(\{X, Y\}\)). In Scenario A, the \(PC_2\) coordinates have so little variation that we don't lose much by ignoring them and only considering the \(PC_1\) coordinate of an observation (reducing two dimensions down to just one).
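To make this idea concrete, here is a small simulated sketch (the data below are made up purely for illustration and are not part of the lab): two highly correlated variables are generated, and prcomp(), a function introduced later in this lab, shows that nearly all of their variation is captured by the first principal component, just as in Scenario A.
## Hypothetical illustration of Scenario A (redundant variables)
set.seed(123)
X <- rnorm(100)                   ## simulate X
Y <- X + rnorm(100, sd = 0.2)     ## Y is nearly a copy of X (redundant)
toy_pca <- prcomp(cbind(X, Y), scale. = TRUE)
summary(toy_pca)                  ## PC1 should explain nearly all of the variance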
\(~\)
Principal components can be expressed by linear combinations of the original variables:
\[PC_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p \\ PC_2 = \phi_{12} X_1 + \phi_{22} X_2 + \ldots + \phi_{p2} X_p \\ \ldots \]
The coefficients, \(\phi_{11}, \phi_{21}, \ldots\), are called loadings. They describe the contribution of a variable to a principal component. The matrix containing these loadings is sometimes called the rotation matrix.
If you plug in the values of \(X_1, \ldots, X_p\) for an observation and multiply by the loadings, the resulting value of \(PC_1\) is called a score. This is the location of that observation in the \(PC_1\) dimension (ie: its position on the \(PC_1\) axis).
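As a small sketch of this calculation (the numbers below are hypothetical, not taken from any data set in this lab), scores are found by multiplying the standardized data matrix by the matrix of loadings:
## Hypothetical standardized data (3 observations, 2 variables)
toy_Xstd <- matrix(c(-1, 0, 1,
                     -1.2, 0.1, 1.1), nrow = 3)
## Hypothetical loading (rotation) matrix, columns = PC1 and PC2
toy_phi <- matrix(c(0.707, 0.707,
                    -0.707, 0.707), nrow = 2)
toy_Xstd %*% toy_phi   ## entry [i, j] is the score of observation i on PC_j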
The first step in PCA is to standardize \(\mathbf{X}\), which prevents variables with larger numeric values from dominating. We want the difference between 1 m and 10 m (a numeric difference of 9) to be treated the same as the difference between 1,000 mm and 10,000 mm (a numeric difference of 9,000), since both describe the same physical change.
After standardization, we'll consider the covariance matrix \(\mathbf{C}= \tfrac{1}{n-1}\mathbf{X}^T\mathbf{X}\) (which, for standardized data, is the same as the correlation matrix). Using techniques from linear algebra, we can factor this matrix:
\[\mathbf{C} = \mathbf{V}\mathbf{L}\mathbf{V}^T\]
Here the columns of \(\mathbf{V}\) are eigenvectors of \(\mathbf{C}\) (the loadings, which form the rotation matrix), and \(\mathbf{L}\) is a diagonal matrix containing the corresponding eigenvalues (the variances of the principal components).
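In R, this factorization can be computed with the eigen() function. The sketch below applies it to a small made-up data matrix (the values are arbitrary and only for illustration):
## Sketch: eigendecomposition of the covariance matrix of made-up standardized data
set.seed(1)
toy_std <- scale(matrix(rnorm(200), ncol = 2))   ## 100 observations, 2 variables
C_toy <- cov(toy_std)                            ## same as t(toy_std) %*% toy_std / (n - 1), with n = 100
E <- eigen(C_toy)
E$vectors   ## columns of V: the loadings (rotation matrix)
E$values    ## diagonal of L: eigenvalues (variances of the principal components)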
\(~\)
The USArrests data set contains four variables for each of the 50 states.
data("USArrests")
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
If we create a scatterplot matrix (using pairs()), we see signs of redundancy, particularly between “Assault” and “Murder”:
## Step 1 - check for signs of redundancy
pairs(USArrests, lower.panel = NULL)
To perform PCA, we first must standardize the data (using scale()):
## Step 2 - standardize the numeric variables
USArrests_std <- scale(USArrests)
We then use the prcomp() function to find the principal components:
## Step 3 - perform PCA on the standardized data
P <- prcomp(USArrests_std)
P$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
Next, let's quickly see how we could have used linear algebra instead of prcomp():
n <- nrow(USArrests) ## find n obs
C <- 1/(n-1) * t(USArrests_std) %*% USArrests_std ## find Cov matrix
P2 <- svd(C) ## singular value decomposition (C = UDV^T)
P2$u ## rotation matrix
## [,1] [,2] [,3] [,4]
## [1,] -0.5358995 0.4181809 -0.3412327 0.64922780
## [2,] -0.5831836 0.1879856 -0.2681484 -0.74340748
## [3,] -0.2781909 -0.8728062 -0.3780158 0.13387773
## [4,] -0.5434321 -0.1673186 0.8177779 0.08902432
Finally, let’s find the scores of each state in \(PC_1\) and \(PC_2\) and graph the states in these new dimensions:
scores <- as.data.frame(P$x)
scores$State <- rownames(P$x) ### row names were used to store states
ggplot(data = scores, aes(x = PC1, y = PC2, label = State)) + geom_text()
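As an aside, the factoextra package also provides fviz_pca_ind(), which plots these same \(PC_1\) and \(PC_2\) scores directly from the prcomp object. The line below is an optional sketch; it conveys the same information as the ggplot graphic above.
## Optional: the same score plot created with factoextra
fviz_pca_ind(P, repel = TRUE)   ## repel = TRUE reduces label overlap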
We can look at a few states to help us further understand the usefulness of PCA:
Question #1: Describe the states South Carolina and Wisconsin using the graph displayed above.
\(~\)
The code below creates a subset of the “colleges2019” data set with no missing data in the variables that are selected:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>%
select(Name, State, Enrollment, Adm_Rate, ACT_median,
Net_Tuition, FourYearComp_Males, FourYearComp_Females,
Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),] ## Keeps only complete cases (no missing values)
In the questions that follow you will progress through the basic steps of PCA:
Question #2 (Part A): Create a new data frame where all the numeric variables are standardized. To do this you must remove the “Name” and “State” columns.
Question #2 (Part B): Perform PCA using the prcomp() function. Then, considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the first principal component.
Question #2 (Part C): Considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the second principal component.
Question #2 (Part D): Using geom_text(), create a graphic that displays the \(PC_1\) and \(PC_2\) scores of all colleges in Iowa. Briefly comment on the position of Grinnell College in this graph, mentioning why you think its location is so far from the other Iowa colleges.
\(~\)
Principal component analysis is useful when your data contain several numeric variables that are related to one another in ways that suggest they might be measuring variation in the same underlying factor.
The first step in determining if PCA is viable is to look at the pairwise relationships between variables. The corrplot package provides several useful functions for this. Demonstrated below are a few of the package's functions:
## Create the correlation matrix (dropping identifiers)
cormat <- cor(select(cd, -Name, -State))
## Default Corrplot
corrplot(cormat)
## Mixed corrplot
corrplot.mixed(cormat, tl.pos = 'lt', tl.cex = 0.5, number.cex = 0.55)
In the second example, the argument tl.pos moves the text labels to the left and top (“lt”). You could also place them on the diagonal using “d”. The arguments tl.cex and number.cex change the size of the text labels and correlation coefficients (respectively).
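If the grouping of related variables is difficult to see, corrplot() can also reorder the variables so that strongly correlated ones appear next to each other. The line below is an optional sketch using the order argument (hierarchical clustering ordering), not a required step.
## Optional: reorder variables so correlated groups sit together
corrplot(cormat, order = "hclust")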
Question #3: Notice the grouping of variables with dark blue circles in the center of the corrplot. How does this grouping relate to the first principal component you described in Question #2? How would you explain this relationship?
\(~\)
Principal components are linear combinations of the original variables, so when the data are standardized, a variable's loading is a direct reflection of its importance to a given component.
The fviz_contrib() function provides a convenient way to visualize variable contributions:
P <- prcomp(USArrests_std)
fviz_contrib(P, choice = "var", axes = 1, top = 4) ## PC1
fviz_contrib(P, choice = "var", axes = 2, top = 3) ## PC2
The horizontal reference line shows the expected contribution if every variable contributed equally to each principal component. Any variable whose contribution falls above the reference line can be viewed as “important” in that dimension.
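If it helps to see where these bar heights come from, the sketch below reproduces them by hand under the assumption that factoextra defines a variable's contribution as its squared loading expressed as a percentage:
## Sketch: contributions to PC1 computed by hand (assumes contribution = 100 * squared loading)
100 * P$rotation[, "PC1"]^2   ## compare to the bar heights from fviz_contrib()
100 / ncol(USArrests_std)     ## reference line: equal contribution from each of the 4 variables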
Question #4: Use fviz_contrib() to display the contributions of variables to the first principal component in the PCA of the “colleges” data you performed in Question #2. Do the variables that appear above the reference line correspond with the ones you mentioned in Part B of Question #2? Briefly explain.
\(~\)
Principal components are constructed so that earlier components explain more variability in the data than the later components. Thus, not all components are equally valuable - but how should we decide which to keep and interpret and which to discard?
Recall that the matrix \(\mathbf{L}\) contains the eigenvalues of the covariance matrix. Because the data were standardized, these values must sum to \(p\) (the number of variables in the original data), so dividing an eigenvalue by \(p\) gives the proportion of variation explained by the corresponding component.
Note that the prcomp() function reports the square roots of these eigenvalues (the component standard deviations stored in sdev):
P <- prcomp(USArrests_std) ## US Arrests data
sum(P$sdev^2) ## Recall there were 4 variables
## [1] 4
A popular graphic for visualizing explained variation is the scree plot:
fviz_screeplot(P)
The x-axis displays each principal component, while the y-axis displays the corresponding percentage of variance explained by that component. For the “USArrests” data, the first two principal components account for nearly 90% of variation across different states.
We can find the precise y-values of this graph ourselves by dividing the eigenvalues by \(p\) (the number of columns of the original data):
P$sdev^2/ncol(USArrests_std)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
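Summing these values gives the cumulative proportion of variation explained, which is the same calculation Question #5 asks you to repeat for the colleges data. For the USArrests example:
## Cumulative proportion of variation explained (USArrests example)
cumsum(P$sdev^2) / ncol(USArrests_std)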
Question #5: Construct a scree plot for the PCA you performed on the “colleges” data in Question #2. Next, use the PCA results to find the exact proportion of variation explained by the first three components (ie: calculate and sum the y-values displayed on your scree plot for \(PC_1\), \(PC_2\), and \(PC_3\)).
\(~\)
Depending on the goals of an analysis, you might subjectively choose to retain a certain number of principal components by inspecting the scree plot.
Alternatively, some suggest retaining all components that explain at least as much variation as a variable in the original data. This corresponds to an eigenvalue \(\geq 1\) or explained variation \(\geq 1/p\).
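As a quick sketch, this rule can be checked directly from the prcomp() output for the USArrests data:
## Sketch: the eigenvalue >= 1 rule applied to the USArrests PCA
P$sdev^2             ## eigenvalues (variances of the components)
sum(P$sdev^2 >= 1)   ## number of components this rule would retain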
While popular, neither of these methods is particularly rigorous. A more reliable way to determine the number of components to retain is parallel analysis.
Parallel analysis is a statistical approach that compares the eigenvalues of the real data to a distribution of eigenvalues found in randomly generated data of the same size (think of a randomization test from Sta-209). We can perform parallel analysis using the fa.parallel() function in the psych library:
fa.parallel(USArrests_std, fa = "pc") ## fa = "pc" indicates PCA
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
For the “USArrests” data, parallel analysis suggests we should only retain the first principal component (which we previously labeled “overall violent crime”).
Question #6: Perform parallel analysis on the “colleges” data that you've used in the previous several questions. Report the number of components that the method suggests should be retained.
\(~\)
A biplot is a commonly used graph that combines the scores of individual observations and the loadings of the original variables on two of the principal component axes.
fviz_pca_biplot(P, axes = c(1,2), repel = TRUE) ## PC1 and PC2 for USArrests
Notice how the argument repel = TRUE can be used to prevent text labels from overlapping.
\(~\)
In this example you’ll explore data collected from an online personality test. These data measure the big-five personality traits, which are: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.
The personality test asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers designed 10 statements to target each of the big five traits. For example, questions tagged with “E” were designed to measure extraversion, while questions tagged with “A” were designed to measure agreeableness.
The exact questions used, along with a more detailed description of the data, can be found here: codebook link
big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")
b5 <- select(big5, -race, -age, -engnat, -gender, -hand, -source, -country) ## Exclude demographics
Question #7 (Part A): Use the prcomp() function to perform PCA on the big five data set. Be sure to standardize the data. Then, use parallel analysis to determine the number of components that should be retained.
Question #7 (Part B): Construct a scree plot to display the PCA results from Part A and report the variation explained by the first five principal components (the big five personality traits).
Question #7 (Part C): Use fviz_contrib() with the argument top = 10 to visualize the top 10 questions that contribute to the first principal component. What label would you give this component?
Question #7 (Part D): Next, use fviz_contrib() on components 2 through 5 (you don't need to record these results). How would you label these principal components?
Question #7 (Part E): Create a data frame of scores in components 1 through 5 and add the “gender” variable from the original data to this data frame. Then, use group_by() and summarize() to find the average big five personality factor scores of male and female respondents (1 = Male, 2 = Female). Which two personality factors appear to be the most related to gender? Which two personality factors appear to be the least related to gender? Briefly explain.