Principal Components Analysis (PCA) is an unsupervised learning method used to understand relationships amongst a set of variables. The method is used in many disciplines; see the very detailed PCA Wikipedia page for descriptions of several different applications.

In this lab we will use the factoextra package to perform principal component analysis and visualize the results.

library(factoextra)  ## Used for PCA visualizations
library(dplyr)  ## Used for data cleaning

1. What are Principal Components?

Consider data consisting of \(n\) measurements of \(p\) variables. If \(p\) is large, it becomes difficult to thoroughly visualize the relationships between variables. For \(p = 10\) there are already 45 distinct two-variable combinations, creating a major barrier to using bivariate methods like scatterplots or correlation coefficients to find patterns. PCA avoids the issue of having to sort through many different combinations of variables by instead finding low-dimensional representations of the data that contain as much of the original variation as possible.
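As a quick check of this count, R's choose function gives the number of distinct variable pairs:

choose(10, 2)  ## 45 distinct two-variable combinations when p = 10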

The main idea is as follows: rather than working with all \(p\) variables directly, we derive a small number of new dimensions, each chosen to capture as much of the variation in the original variables as possible.

The new derived dimensions mentioned above are called principal components, and they are linear combinations of the original variables. The first principal component is defined:

\[Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p\]

The coefficients \(\phi_{11}, \ldots, \phi_{p1}\) are the loadings of the first principal component. We won't go into detail on how the loadings of a principal component are calculated, but they can be found quite easily using an eigenvalue decomposition, a standard technique in linear algebra.
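As a small illustration of that connection, the loadings can be recovered as the eigenvectors of the correlation matrix of the variables; the sketch below uses R's built-in USArrests data, which we analyze in Section 2:

eig <- eigen(cor(USArrests))
eig$vectors        ## columns match the prcomp loadings (up to sign)
sqrt(eig$values)   ## match the component standard deviations reported by prcomp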

Conceptually, PCA can be understood through the two scenarios below:

In Scenario A, the variables X and Y are redundant, meaning that if you know an observation's X-value, you essentially know its Y-value too. For these variables, we could effectively use one dimension (i.e., a line) to capture most of the variation (differences amongst data points) in the original X and Y dimensions.

Contrast this with Scenario B, where there isn't much redundancy in X and Y. In this scenario, knowing an observation's X-value tells us very little about its Y-value, implying that one dimension cannot effectively express most of the variation in the two variables.

The figure below shows the first principal component (red) and second principal component (blue) for each scenario:

Because \(p = 2\), there are only two principal components in these scenarios. The first component is the dimension containing the maximum amount of variation in the two variables; the second component is the dimension containing the maximum amount of variation that isn't captured by the first. It is not a coincidence that these two dimensions are orthogonal: all principal components are constructed to be orthogonal to one another.

In Scenario A, we can justifiably discard the second principal component when analyzing the data, as this dimension contains very little new information. Using only the first principal component is akin to projecting the data onto the red line.

In scenario B, using only the first component in analysis is questionable, since almost half of the variation in these data is found in the second component!
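These two scenarios can also be seen numerically. The sketch below simulates hypothetical data mimicking each scenario; in the redundant case nearly all of the variation falls in the first component, while in the non-redundant case it splits roughly evenly:

set.seed(123)
x <- rnorm(100)
scenA <- data.frame(X = x, Y = x + rnorm(100, sd = 0.2))  ## Scenario A: highly redundant
scenB <- data.frame(X = x, Y = rnorm(100))                ## Scenario B: little redundancy
summary(prcomp(scenA, scale = TRUE))  ## PC1 captures nearly all of the variation
summary(prcomp(scenB, scale = TRUE))  ## variation splits roughly evenly between components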

2. US Arrest Data Example

The loadings of the top principal components can be used to understand patterns in the data, especially when these components describe a high percentage of the variation in the data.

As an example, we’ll analyze data from the 50 US States, looking for relationships between the assault, murder, and rape arrest rates of each state per 100,000 residents. In our analysis we’ll also include a variable documenting the proportion of a state’s population living in urban areas.

Before looking at the data, it is important to point out that we expect some or all of these variables to be correlated, which makes PCA a sensible choice.

data(USArrests)
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
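A quick check of the correlation matrix supports this expectation: the three crime rates are strongly, positively correlated with one another, while "UrbanPop" is more weakly related to the others:

round(cor(USArrests), 2)  ## pairwise correlations among the four variables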

In R, PCA can be performed using the prcomp function. In this example, the output of prcomp is stored in the object P; the loadings of each principal component appear in the "Rotation" matrix printed below.

P <- prcomp(USArrests, scale = TRUE)
P
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## 
## Rotation (n x k) = (4 x 4):
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

The argument scale = TRUE is used to rescale each variable to have a variance of 1, which is needed to put variables with different units on the same scale. In most applications rescaling is a necessary step if PCA is to produce sensible results.
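To see why rescaling matters here, compare the variances of the raw variables; "Assault" is measured on a much larger numeric scale and would otherwise dominate the leading components:

apply(USArrests, 2, var)  ## Assault's variance dwarfs those of the other variables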

By studying the contributions of each of the original variables to a component we can come up with an interpretation of the newly derived dimension. In this example, the first component (PC1) can be understood as the “overall crime rate” because the loadings are large, roughly equal in magnitude, and occur in the same direction for all three crime variables, while the loading for “UrbanPop” is somewhat smaller. Here the negative coefficients indicate that high values in the original variables are mapped to low scores in the PC1 dimension. This means that higher crime rates correspond with lower PC1 scores.

The second principal component roughly describes the urbanization of a state, since the variable “UrbanPop” receives most of the weight in this component.
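Each state's scores on these components are stored in P$x. As a rough check of the interpretation above, a high-crime state such as Florida should receive a low (negative) PC1 score:

head(P$x[, 1:2])  ## scores on the first two components
## A score is the stated linear combination of the centered and scaled variables:
sum(scale(USArrests)["Florida", ] * P$rotation[, "PC1"])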

A quick way to visualize the first two principal components is the biplot function. The argument scale = 0 adjusts the scale of the component directions (the upper and right axes) to match their loadings, and cex = .75 adjusts the size of each plotted point's text.

biplot(P, scale = 0, cex = .75)

A more elegant visual can be created using the factoextra package:

fviz_pca_biplot(P, repel = TRUE, # Avoid text overlapping (doesn't scale to large datasets)
                col.var = "red", # Variables color
                col.ind = "black")  # Individuals color

The state names indicate each observation’s normalized score on the first two principal components. States in the upper left have low scores in the first principal component (meaning they have high overall rates of crime) and high scores in the second principal component (meaning they have low urbanization).

The red arrows depict the loadings of each variable. For example, the loading for "UrbanPop" is -0.28 in the first principal component and -0.87 in the second, so the red label of "UrbanPop" is centered at (-0.28, -0.87) on the scale depicted on the top and right axes of the graph, which describes the loadings.
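These loading values can be read directly from the rotation matrix returned by prcomp:

P$rotation["UrbanPop", 1:2]  ## loadings plotted for the "UrbanPop" arrow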

We can use the biplot to describe the key characteristics of each data point. For example, Florida is high crime and about average in urbanization; while West Virginia is low crime and not very urban.

Question 1:

Use the biplot to describe the crime and urbanization of California and Iowa.

3. How Many Components?

In the US states example, we used PCA to represent four variables using just two dimensions. When doing this it is important to consider how much information was lost by using only the first two principal components. Or, phrased differently, how much variability in the data is not contained in the first two principal components?

The fviz_eig function produces a scree plot, which visualizes the amount of variation captured by each principal component:

fviz_eig(P, addlabels = TRUE)

From the scree plot, we can see that the first two principal components contain roughly 87% of the variability in these data. This suggests that plotting the data in two dimensions (those derived from our first two principal components) is an effective summary of all of the variables. Putting these results into context, we can conclude that the USArrests data can be effectively characterized by "overall crime rates" and "urbanization".

Deciding how many principal components to use is heavily dependent on the application. Most often analysts will examine the scree plot and look for a natural leveling-off point. In the US Arrests example there is a sizeable drop-off after the second component, so it is reasonable to use just the first two components.

Note: The percentages shown on the scree plot come directly from the eigenvalues of the correlation matrix (the squared standard deviations of the components); we can find these percentages ourselves using the standard deviations from the output of prcomp:

P <- prcomp(USArrests, scale = TRUE)
P$sdev^2/sum(P$sdev^2)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
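Taking the cumulative sum of these proportions confirms the roughly 87% figure quoted above; summary(P) reports the same quantities:

cumsum(P$sdev^2/sum(P$sdev^2))  ## cumulative proportion of variance
summary(P)                      ## reports the proportion and cumulative proportion for each component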

4. An Example From Psychology

Reducing a dataset to two or three dimensions prior to visualization is one use of principal components, but sometimes the loadings themselves represent the most interesting portion of the analysis.

In this example we will explore data collected from an interactive online personality test. These data measure the big-five personality traits, which are: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.

The personality questionnaire used in the collection of these data asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers specifically designed 10 statements targeting each of the big five traits. The questions used, along with a more detailed description of the data, can be found here: big five codebook link

big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")
big5 <- select(big5, -race, -age, -engnat, -gender, -hand, -source, -country) ## Exclude demographics

P2 <- prcomp(big5)

Note: We did not use the argument scale = TRUE in prcomp for this application because these questions are already measured on the same 1-5 scale.
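As a quick sketch of why rescaling is unnecessary here, we can compare the variances of the individual items, which should be of broadly similar magnitude since every question uses the same response scale:

summary(apply(big5, 2, var))  ## item variances should be of broadly similar size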

fviz_eig(P2, addlabels = TRUE)

From the scree plot, five components clearly stand out (with the first really standing out) before there is a noticeable drop in explained variance.

The next step in our analysis is to investigate which personality questions are contributing to each of these five components. We could do this by inspecting the loadings, but sorting through the loadings for all 50 questions is time consuming. Instead, we'll use the function fviz_contrib to find the top contributing questions for each principal component of interest.

fviz_contrib(P2, choice = "var", axes = 1, top = 6) # Top 6 contributors to the first component
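For prcomp output, a variable's contribution to a component is simply its squared loading expressed as a percentage, so the bar chart above can be roughly reproduced by hand (a sketch; contrib_pc1 is a hypothetical name):

contrib_pc1 <- 100 * P2$rotation[, 1]^2 / sum(P2$rotation[, 1]^2)
head(sort(contrib_pc1, decreasing = TRUE), 6)  ## should roughly match the plotted top contributors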