This lab focuses on principal component analysis (PCA), a method with many applications, most of which involve finding lower dimensional representations of complex data.
Directions (Please read before starting)
\(~\)
This lab will rely on the factoextra package for creating graphics that display the results of PCA. It will also use a few functions from the psych package and the corrplot package.
# load the following packages
# install.packages("factoextra")
# install.packages("psych")
# install.packages("corrplot")
library(psych)
library(factoextra)
library(ggplot2)
library(dplyr)
library(corrplot)
\(~\)
So far we’ve relied on data visualization as our primary method for identifying trends, but this has its limitations, especially when the number of variables is large.
Consider data consisting of \(n\) measurements (observations) of \(p\) variables, represented by the matrix \(\mathbf{X}\):
\[ \mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix} \]
Our goal in this lab is to find relationships in the columns of \(\mathbf{X}\) (ie: find patterns across variables) using principal component analysis (PCA). In doing so, we hope to come up with interpretable, low-dimensional representations of the key trends contained in our data.
\(~\)
The main idea behind PCA for dimension reduction is to construct new, uncorrelated dimensions (the principal components) that capture as much of the data's variation as possible, so that the dimensions capturing little variation can be dropped.
Consider the following example:
In Scenario A, the variables X and Y are redundant (knowing one tells you most of what there is to know about the other).
In Scenario B, there isn't much redundancy in X and Y.
The figure below shows the first principal component (red) and second principal component (blue) for each scenario:
The first (red) component is constructed to contain the maximum amount of variation in the data, and the second (blue) component is orthogonal to the first.
We can now define the location of each data-point using a coordinate scale in these new dimensions (ie: \(\{PC_1, PC_2\}\) rather than \(\{X, Y\}\)). In Scenario A, the \(PC_2\) coordinates have so little variation that we don't lose much by ignoring them and only considering the \(PC_1\) coordinate of an observation (reducing two dimensions down to just one).
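To make this idea concrete, here is a small simulated sketch (the data below are made up purely for illustration and are not part of the lab): two highly correlated variables are generated, and prcomp(), a function introduced later in this lab, shows that nearly all of their variation is captured by the first principal component, just as in Scenario A.
## Hypothetical illustration of Scenario A (redundant variables)
set.seed(123)
X <- rnorm(100)                   ## simulate X
Y <- X + rnorm(100, sd = 0.2)     ## Y is nearly a copy of X (redundant)
toy_pca <- prcomp(cbind(X, Y), scale. = TRUE)
summary(toy_pca)                  ## PC1 should explain nearly all of the variance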
\(~\)
Principal components can be expressed by linear combinations of the original variables:
\[PC_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p \\ PC_2 = \phi_{12} X_1 + \phi_{22} X_2 + \ldots + \phi_{p2} X_p \\ \ldots \]
The coefficients, \(\phi_{11}, \phi_{21}, \ldots\), are called loadings. They describe the contribution of a variable to a principal component. The matrix containing these loadings is sometimes called the rotation matrix.
If you plug in the values of \(X_1, \ldots, X_p\) for an observation and multiply by the loadings, the resulting value of \(PC_1\) is called a score. This is the location of that observation in the \(PC_1\) dimension (ie: its position on the \(PC_1\) axis).
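As a small sketch of this calculation (the numbers below are hypothetical, not taken from any data set in this lab), scores are found by multiplying the standardized data matrix by the matrix of loadings:
## Hypothetical standardized data (3 observations, 2 variables)
toy_Xstd <- matrix(c(-1, 0, 1,
                     -1.2, 0.1, 1.1), nrow = 3)
## Hypothetical loading (rotation) matrix, columns = PC1 and PC2
toy_phi <- matrix(c(0.707, 0.707,
                    -0.707, 0.707), nrow = 2)
toy_Xstd %*% toy_phi   ## entry [i, j] is the score of observation i on PC_j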
The first step in PCA is to standardize \(\mathbf{X}\), which prevents variables with larger numeric values from dominating. We want the difference between 1 m and 10 m (a numeric difference of 9) to be treated the same as the difference between 1,000 mm and 10,000 mm (a numeric difference of 9,000), since both describe the same physical change.
After standardization, we'll consider the covariance matrix \(\mathbf{C}= \tfrac{1}{n-1}\mathbf{X}^T\mathbf{X}\) (which, for standardized data, is the same as the correlation matrix). Using techniques from linear algebra, we can factor this matrix:
\[\mathbf{C} = \mathbf{V}\mathbf{L}\mathbf{V}^T\]
Here the columns of \(\mathbf{V}\) are eigenvectors of \(\mathbf{C}\) (the loadings, which form the rotation matrix), and \(\mathbf{L}\) is a diagonal matrix containing the corresponding eigenvalues (the variances of the principal components).
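In R, this factorization can be computed with the eigen() function. The sketch below applies it to a small made-up data matrix (the values are arbitrary and only for illustration):
## Sketch: eigendecomposition of the covariance matrix of made-up standardized data
set.seed(1)
toy_std <- scale(matrix(rnorm(200), ncol = 2))   ## 100 observations, 2 variables
C_toy <- cov(toy_std)                            ## same as t(toy_std) %*% toy_std / (n - 1), with n = 100
E <- eigen(C_toy)
E$vectors   ## columns of V: the loadings (rotation matrix)
E$values    ## diagonal of L: eigenvalues (variances of the principal components)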
\(~\)
The USArrests data set contains four variables for each of the 50 states.
data("USArrests")
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
If we create a scatterplot matrix (using pairs()), we see signs of redundancy, particularly between “Assault” and “Murder”:
## Step 1 - check for signs of redundancy
pairs(USArrests, lower.panel = NULL)
To perform PCA, we first must standardize the data (using scale()):
## Step 2 - standardize the numeric variables
USArrests_std <- scale(USArrests)
We then use the prcomp() function to find the principal components:
## Step 3 - perform PCA on the standardized data
P <- prcomp(USArrests_std)
P$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
Next, let's quickly see how we could have used linear algebra instead of prcomp():
n <- nrow(USArrests) ## find n obs
C <- 1/(n-1) * t(USArrests_std) %*% USArrests_std ## find Cov matrix
P2 <- svd(C) ## singular value decomposition (C = UDV^T)
P2$u ## rotation matrix
## [,1] [,2] [,3] [,4]
## [1,] -0.5358995 0.4181809 -0.3412327 0.64922780
## [2,] -0.5831836 0.1879856 -0.2681484 -0.74340748
## [3,] -0.2781909 -0.8728062 -0.3780158 0.13387773
## [4,] -0.5434321 -0.1673186 0.8177779 0.08902432
Finally, let’s find the scores of each state in \(PC_1\) and \(PC_2\) and graph the states in these new dimensions:
scores <- as.data.frame(P$x)
scores$State <- rownames(P$x) ### row names were used to store states
ggplot(data = scores, aes(x = PC1, y = PC2, label = State)) + geom_text()
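As an aside, the factoextra package also provides fviz_pca_ind(), which plots these same \(PC_1\) and \(PC_2\) scores directly from the prcomp object. The line below is an optional sketch; it conveys the same information as the ggplot graphic above.
## Optional: the same score plot created with factoextra
fviz_pca_ind(P, repel = TRUE)   ## repel = TRUE reduces label overlap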
We can look at a few states to help us further understand the usefulness of PCA:
Question #1: Describe the states South Carolina and Wisconsin using the graph displayed above.
\(~\)
The code below creates a subset of the “colleges2019” data set with no missing data in the variables that are selected:
colleges <- read.csv("https://remiller1450.github.io/data/Colleges2019.csv")
cd <- colleges %>%
select(Name, State, Enrollment, Adm_Rate, ACT_median,
Net_Tuition, FourYearComp_Males, FourYearComp_Females,
Debt_median, Salary10yr_median)
cd <- cd[complete.cases(cd),] ## Keeps only complete cases (no missing values)
In the questions that follow you will progress through the basic steps of PCA:
Question #2 (Part A): Create a new data frame where all the numeric variables are standardized. To do this you must remove the “Name” and “State” columns.
Question #2 (Part B): Perform PCA using the prcomp() function. Then, considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the first principal component.
Question #2 (Part C): Considering only loadings above 0.4 in absolute value (a commonly used threshold), come up with a label for the second principal component.
Question #2 (Part D): Using geom_text(), create a graphic that displays the \(PC_1\) and \(PC_2\) scores of all colleges in Iowa. Briefly comment on the position of Grinnell College in this graph, mentioning why you think its location is so far from the other Iowa colleges.
\(~\)
Principal component analysis is useful when your data contain several numeric variables that are related to one another in ways that suggest they might be measuring variation in the same underlying factor.
The first step in determining if PCA is viable is to look at the pairwise relationships between variables. The corrplot package provides several useful functions for this. Demonstrated below are a few of the package's functions:
## Create the correlation matrix (dropping identifiers)
cormat <- cor(select(cd, -Name, -State))
## Default Corrplot
corrplot(cormat)
## Mixed corrplot
corrplot.mixed(cormat, tl.pos = 'lt', tl.cex = 0.5, number.cex = 0.55)
In the second example, the argument tl.pos moves the text labels to the left and top (“lt”). You could also place them on the diagonal using “d”. The arguments tl.cex and number.cex change the size of the text labels and correlation coefficients (respectively).
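If the grouping of related variables is difficult to see, corrplot() can also reorder the variables so that strongly correlated ones appear next to each other. The line below is an optional sketch using the order argument (hierarchical clustering ordering), not a required step.
## Optional: reorder variables so correlated groups sit together
corrplot(cormat, order = "hclust")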
Question #3: Notice the grouping of variables with dark blue circles in the center of the corrplot. How does this grouping relate to the first principal component you described in Question #2? How would you explain this relationship?
\(~\)
Principal components are linear combinations of the original variables, so when the data are standardized, a variable's loading is a direct reflection of its importance to a given component.
The fviz_contrib() function provides a convenient way to visualize variable contributions:
P <- prcomp(USArrests_std)
fviz_contrib(P, choice = "var", axes = 1, top = 4) ## PC1
fviz_contrib(P, choice = "var", axes = 2, top = 3) ## PC2
The horizontal reference line shows the expected contribution if every variable contributed equally to each principal component. Any variable whose contribution falls above the reference line can be viewed as “important” in that dimension.
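If it helps to see where these bar heights come from, the sketch below reproduces them by hand under the assumption that factoextra defines a variable's contribution as its squared loading expressed as a percentage:
## Sketch: contributions to PC1 computed by hand (assumes contribution = 100 * squared loading)
100 * P$rotation[, "PC1"]^2   ## compare to the bar heights from fviz_contrib()
100 / ncol(USArrests_std)     ## reference line: equal contribution from each of the 4 variables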
Question #4: Use fviz_contrib() to display the contributions of variables to the first principal component in the PCA of the “colleges” data you performed in Question #2. Do the variables that appear above the reference line correspond with the ones you mentioned in Part B of Question #2? Briefly explain.
\(~\)
Principal components are constructed so that earlier components explain more variability in the data than the later components. Thus, not all components are equally valuable - but how should we decide which to keep and interpret and which to discard?
Recall that the matrix \(\mathbf{L}\) contains the eigenvalues of the covariance matrix. Because the data were standardized, these values must sum to \(p\) (the number of variables in the original data), so dividing an eigenvalue by \(p\) gives the proportion of variation explained by the corresponding component.
Note that the prcomp() function reports the square roots of these eigenvalues (the component standard deviations stored in sdev):
P <- prcomp(USArrests_std) ## US Arrests data
sum(P$sdev^2) ## Recall there were 4 variables
## [1] 4
A popular graphic for visualizing explained variation is the scree plot:
fviz_screeplot(P)
The x-axis displays each principal component, while the y-axis displays the corresponding percentage of variance explained by that component. For the “USArrests” data, the first two principal components account for nearly 90% of variation across different states.
We can find the precise y-values of this graph ourselves by dividing the eigenvalues by \(p\) (the number of columns of the original data):
P$sdev^2/ncol(USArrests_std)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
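Summing these values gives the cumulative proportion of variation explained, which is the same calculation Question #5 asks you to repeat for the colleges data. For the USArrests example:
## Cumulative proportion of variation explained (USArrests example)
cumsum(P$sdev^2) / ncol(USArrests_std)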
Question #5: Construct a scree plot for the PCA you performed on the “colleges” data in Question #2. Next, use the PCA results to find the exact proportion of variation explained by the first three components (ie: calculate and sum the y-values displayed on your scree plot for \(PC_1\), \(PC_2\), and \(PC_3\)).
\(~\)
Depending on the goals of an analysis, you might subjectively choose to retain a certain number of principal components by inspecting the scree plot.
Alternatively, some suggest retaining all components that explain at least as much variation as a variable in the original data. This corresponds to an eigenvalue \(\geq 1\) or explained variation \(\geq 1/p\).
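As a quick sketch, this rule can be checked directly from the prcomp() output for the USArrests data:
## Sketch: the eigenvalue >= 1 rule applied to the USArrests PCA
P$sdev^2             ## eigenvalues (variances of the components)
sum(P$sdev^2 >= 1)   ## number of components this rule would retain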
While popular, neither of these methods is particularly rigorous. A more reliable way to determine the number of components to retain is parallel analysis.
Parallel analysis is a statistical approach that compares the eigenvalues of the real data to a distribution of eigenvalues found in randomly generated data of the same size (think of a randomization test from Sta-209). We can perform parallel analysis using the fa.parallel() function in the psych library:
fa.parallel(USArrests_std, fa = "pc") ## fa = "pc" indicates PCA
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
For the “USArrests” data, parallel analysis suggests we should only retain the first principal component (which we previously labeled “overall violent crime”).
Question #6: Perform parallel analysis on the “colleges” data that you've used in the previous several questions. Report the number of components that the method suggests should be retained.
\(~\)
A biplot is a commonly used graph that combines the scores of individual observations and the loadings of the original variables on two of the principal component axes.
fviz_pca_biplot(P, axes = c(1,2), repel = TRUE) ## PC1 and PC2 for USArrests
Notice how the argument repel = TRUE can be used to prevent text labels from overlapping.
\(~\)
In this example you’ll explore data collected from an online personality test. These data measure the big-five personality traits, which are: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.
The personality test asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers designed 10 statements to target each of the big five traits. For example, questions tagged with “E” were designed to measure extraversion, while questions tagged with “A” were designed to measure agreeableness.
The exact questions used, along with a more detailed description of the data, can be found here: codebook link
big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")
b5 <- select(big5, -race, -age, -engnat, -gender, -hand, -source, -country) ## Exclude demographics
Question #7 (Part A): Use the prcomp() function to perform PCA on the big five data set. Be sure to standardize the data. Then, use parallel analysis to determine the number of components that should be retained.
Question #7 (Part B): Construct a scree plot to display the PCA results from Part A and report the variation explained by the first five principal components (the big five personality traits).
Question #7 (Part C): Use fviz_contrib() with the argument top = 10 to visualize the top 10 questions that contribute to the first principal component. What label would you give this component?
Question #7 (Part D): Next, use fviz_contrib() on components 2 through 5 (you don't need to record these results). How would you label these principal components?
Question #7 (Part E): Create a data frame of scores in components 1 through 5 and add the “gender” variable from the original data to this data frame. Then, use group_by() and summarize() to find the average big five personality factor scores of male and female respondents (1 = Male, 2 = Female). Which two personality factors appear to be the most related to gender? Which two personality factors appear to be the least related to gender? Briefly explain.