Principal Components Analysis (PCA) is an unsupervised learning method used to understand relationships amongst a set of variables. The method is used in many disciplines; see the very detailed PCA Wikipedia page for descriptions of several different applications.
In this lab we will use the factoextra package to perform principal component analysis and visualize the results.
library(factoextra) ## Used for PCA visualizations
library(dplyr) ## Used for data cleaning
Consider data consisting of \(n\) measurements of \(p\) variables. If \(p\) is large, it becomes difficult to thoroughly visualize the relationships between variables. For \(p = 10\) there are already 45 distinct two-variable combinations, creating a major barrier to using bivariate methods like scatterplots or correlation coefficients to find patterns. PCA avoids the issue of having to sort through many different combinations of variables by instead finding low-dimensional representations of the data that contain as much of the original variation as possible.
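As a quick check of this count, the number of distinct pairs of \(p\) variables is \(p(p-1)/2\), which we can compute in R:

```r
choose(10, 2)  # number of distinct two-variable pairs when p = 10
## [1] 45
```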
The main idea is to derive a small number of new dimensions that, together, capture most of the variation in the original \(p\) variables.
These new derived dimensions are called principal components, and they are linear combinations of the original variables. The first principal component is defined as:
\[Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p\]
The coefficients \(\phi_{11}, \ldots, \phi_{p1}\) are the loadings of the first principal component. We won’t go into detail on how the loadings of a principal component are calculated, but they can be found quite easily using an eigenvalue decomposition, a standard technique in linear algebra.
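As a sketch of this connection, the loadings are the eigenvectors of the correlation matrix of the (rescaled) data. Using the USArrests data that appears later in this lab, we can compare the first eigenvector against the loadings reported by prcomp:

```r
data(USArrests)
S <- cor(USArrests)   # correlation matrix (equivalent to using rescaled data)
ev <- eigen(S)        # eigenvalue decomposition
ev$vectors[, 1]       # loadings of the first principal component

P <- prcomp(USArrests, scale = TRUE)
P$rotation[, 1]       # matches, up to an arbitrary sign flip
```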
Conceptually, PCA can be understood through the two scenarios below:
In Scenario A, the variables X and Y are redundant: if you know an observation’s X-value, you essentially know its Y-value too. For these variables, we could effectively use one dimension (i.e., a line) to capture most of the variation (differences amongst data points) in the original X and Y dimensions.
Contrast this with Scenario B, where there isn’t much redundancy in X and Y. In this scenario, knowing an observation’s X-value tells us very little about its Y-value, implying that one dimension cannot effectively express most of the variation in the two variables.
The figure below shows the first principal component (red) and second principal component (blue) for each scenario:
Because \(p = 2\), there are only two principal components in these scenarios. The first component is the dimension containing the maximum amount of variation in the two variables; the second component is the dimension containing the maximum amount of variation that isn’t captured by the first. It is not a coincidence that these two dimensions are orthogonal: all principal components are constructed to be mutually orthogonal, and hence linearly independent.
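We can verify this orthogonality numerically. A minimal check using the prcomp function (introduced below) on the USArrests data:

```r
data(USArrests)
P <- prcomp(USArrests, scale = TRUE)
crossprod(P$rotation)  # t(rotation) %*% rotation is the identity matrix, so loading vectors are orthonormal
cor(P$x)               # component scores are uncorrelated (off-diagonal entries are ~0)
```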
In Scenario A, we can justifiably discard the second principal component when analyzing the data, as this dimension contains very little new information. Using only the first principal component is akin to projecting the data onto the red line.
In scenario B, using only the first component in analysis is questionable, since almost half of the variation in these data is found in the second component!
The loadings of the top principal components can be used to understand patterns in the data, especially when these components describe a high percentage of the variation in the data.
As an example, we’ll analyze data from the 50 US States, looking for relationships between the assault, murder, and rape arrest rates of each state per 100,000 residents. In our analysis we’ll also include a variable documenting the proportion of a state’s population living in urban areas.
Before looking at the data, it is important to point out that we expect some or all of these variables are correlated, which makes PCA a sensible choice.
data(USArrests)
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
In R, PCA can be performed using the prcomp function. In this example, the loadings of each principal component are stored in the object P.
P <- prcomp(USArrests, scale = TRUE)
P
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
The argument scale = TRUE is used to rescale each variable to have a variance of 1, which is needed to put variables with different units on the same scale. In most applications rescaling is a necessary step if PCA is to produce sensible results.
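To illustrate what scale = TRUE is doing, the sketch below standardizes the variables manually with the scale function and confirms that the loadings agree (up to sign) with the scale = TRUE fit:

```r
data(USArrests)
P1 <- prcomp(USArrests, scale = TRUE)
P2 <- prcomp(scale(USArrests))  # center and scale the data ourselves, then run PCA
all.equal(abs(unname(P1$rotation)), abs(unname(P2$rotation)))  # loadings agree up to sign
```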
By studying the contributions of each of the original variables to a component we can come up with an interpretation of the newly derived dimension. In this example, the first component (PC1) can be understood as the “overall crime rate”: the loadings are large, roughly equal in magnitude, and share the same direction for all three crime variables, while the loading for “UrbanPop” is somewhat smaller. The negative coefficients indicate that high values in the original variables map to low scores in the PC1 dimension, so higher crime rates correspond with lower PC1 scores.
The second principal component roughly describes the urbanization of a state, since the variable “UrbanPop” receives most of the weight in this component.
A quick way to visualize the first two principal components is the biplot function. The argument scale = 0 adjusts the scale of the component directions (the upper and right axes) to match their loadings, while cex = .75 adjusts the size of each plotted point’s text.
biplot(P, scale = 0, cex = .75)
A more elegant visual can be created using the factoextra package:
fviz_pca_biplot(P, repel = TRUE, # Avoid text overlapping (doesn't scale to large datasets)
col.var = "red", # Variables color
col.ind = "black") # Individuals color
The state names indicate each observation’s normalized score on the first two principal components. States in the upper left have low scores in the first principal component (meaning they have high overall rates of crime) and high scores in the second principal component (meaning they have low urbanization).
The red arrows depict the loadings of each variable. For example, the loadings for “UrbanPop” are -0.28 in the first principal component and -0.87 in the second, so the red label of “UrbanPop” is centered at (-0.28, -0.87) on the scale depicted on the top and right axes of the graph, which describes the loadings.
We can use the biplot to describe the key characteristics of each data point. For example, Florida is high crime and about average in urbanization; while West Virginia is low crime and not very urban.
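The scores being plotted can also be read directly from the fitted object, which prcomp stores in P$x. For example (signs may flip between fits, but a high-crime state like Florida should always land on the same side of PC1 as the crime loadings):

```r
data(USArrests)
P <- prcomp(USArrests, scale = TRUE)
P$x[c("Florida", "West Virginia"), 1:2]  # first two principal component scores
```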
Question 1:
Use the biplot to describe the crime and urbanization of California and Iowa.
In the US states example, we used PCA to represent four variables using just two dimensions. When doing this it is important to consider how much information was lost by using only the first two principal components. Or, phrased differently, how much variability in the data is not contained in the first two principal components?
The fviz_eig function produces a scree plot, which visualizes the amount of variation captured by each principal component:
fviz_eig(P, addlabels = TRUE)
From the scree plot, we can see that the first two principal components contain roughly 87% of the variability in these data. This suggests that plotting the data in two dimensions (those derived from our first two principal components) is an effective summary of all of the variables. Putting these results into context, we can conclude that the USArrests data can be effectively characterized by “overall crime rates” and “urbanization”.
Deciding how many principal components to use is heavily dependent on the application. Most often analysts will examine the scree plot and look for a natural leveling-off point. In the US Arrests example there is a sizeable drop-off after the second component, so it is reasonable to use just the first two components.
Note: The percentages shown on the scree plot come directly from the eigenvalues of the correlation matrix of the data; we could find these percentages ourselves using the standard deviations from the output of prcomp:
P <- prcomp(USArrests, scale = TRUE)
P$sdev^2/sum(P$sdev^2)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
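A related check: because each rescaled variable has variance 1, the eigenvalues (the squared standard deviations) sum to \(p = 4\), and summary reports the same proportions along with their cumulative sum:

```r
data(USArrests)
P <- prcomp(USArrests, scale = TRUE)
sum(P$sdev^2)  # total variance equals the number of variables (4)
summary(P)     # standard deviations, proportion of variance, cumulative proportion
```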
Reducing a dataset to two or three dimensions prior to visualization is one use of principal components, but sometimes the loadings themselves represent the most interesting portion of the analysis.
In this example we will explore data collected from an interactive online personality test. These data measure the big-five personality traits, which are: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.
The personality questionnaire used in the collection of these data asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers specifically designed 10 statements targeting each of the big five traits. The questions used, along with a more detailed description of the data, can be found here: big five codebook link
big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")
big5 <- select(big5, -race, -age, -engnat, -gender, -hand, -source, -country) ## Exclude demographics
P2 <- prcomp(big5)
Note: We did not use the argument scale = TRUE in prcomp for this application because these questions are already on the same 1-5 scale.
fviz_eig(P2, addlabels = TRUE)
From the scree plot, five components clearly stand out (with the first really standing out) before there is a noticeable drop in explained variance.
The next step in our analysis is to investigate which personality questions contribute to each of these five components. We could do this by inspecting the loadings, but sorting through loadings for all 50 questions is time-consuming. Instead we’ll use the function fviz_contrib to find the top contributing questions for each principal component of interest.
fviz_contrib(P2, choice = "var", axes = 1, top = 6) # Top 6 contributors to the first component
We see that five of the top six contributors (and all of the first four) are questions from the “E” series. This is not a coincidence: psychologists have named this personality dimension “Extroversion”, and the E-series questions were specifically designed to measure introversion/extroversion!
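As a rough sketch of what fviz_contrib is reporting: because each loading vector has unit length, the squared loadings of a component sum to 1, so each variable’s percent contribution can be computed directly from its squared loading. The example below uses the USArrests fit from earlier (rather than the 50-question big-five fit) to keep the output small:

```r
data(USArrests)
P <- prcomp(USArrests, scale = TRUE)
contrib_pc1 <- 100 * P$rotation[, 1]^2 / sum(P$rotation[, 1]^2)  # percent contributions to PC1
sort(contrib_pc1, decreasing = TRUE)  # same ranking of variables that fviz_contrib plots
```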
The remainder of the big five are shown below; we see that the specifically designed question categories are most distinguished in certain components:
library(gridExtra) ## Used to arrange multiple ggplots
n <- fviz_contrib(P2, choice = "var", axes = 2, top = 6) # Second component (Neuroticism)
c <- fviz_contrib(P2, choice = "var", axes = 3, top = 6) # Third component (Conscientiousness)
e <- fviz_contrib(P2, choice = "var", axes = 4, top = 6) # Fourth component (Openness to Experiences)
a <- fviz_contrib(P2, choice = "var", axes = 5, top = 6) # Fifth component (Agreeableness)
grid.arrange(n,c,e,a, nrow = 2) ## Function for arranging ggplots into one grid
Putting the findings of this application in context, we can conclude that while the researchers asked each participant 50 questions, most of the person-to-person variation in responses occurs in only five dimensions: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.
The dataset decathlon2 in the factoextra package contains data on the performances of professional decathletes in two different meets:
data(decathlon2)
head(decathlon2)
## X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75
## CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72
## BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87
## YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26
## ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17 45.67
## McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38 44.41
## Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE 5.02 63.19 291.7 1 8217 Decastar
## CLAY 4.92 60.15 301.5 2 8122 Decastar
## BERNARD 5.32 62.77 280.1 4 8067 Decastar
## YURKOV 4.72 63.44 276.4 5 8036 Decastar
## ZSIVOCZKY 4.42 55.37 268.0 7 8004 Decastar
## McMULLEN 4.42 56.37 285.1 8 7995 Decastar
Decathlons are scored using a point system where each athlete is awarded points for their individual performance in ten track and field events based upon a table. For our analysis, we will try to reduce the dimensions of these data to better understand elite decathletes.
Question 2:
Use the prcomp function to perform PCA on the decathlon data; be sure to remove irrelevant variables and rescale the data as necessary. Next use fviz_pca_var to construct a plot showing variable contributions to the first two components. Use this plot (along with the loadings) to interpret the first two components (i.e., come up with a meaning for these new dimensions).
Question 3:
Use fviz_pca_ind to plot the scores of each observed data point in the first two principal components. Use this plot to characterize the athletes “Macey” and “BOURGUIGNON”.
Question 4:
Use the col.ind argument of fviz_pca_ind to color your previous plot according to how many points the athlete scored in that decathlon. Interpret this plot; specifically, how do top athletes appear to be achieving their superior scores?
The “Places Rated” data were assembled by Boyer and Savageau (1981), who scored various cities across the US with the goal of objectively classifying different cities.
places <- read.csv("https://remiller1450.github.io/data/places.csv")
head(places)
## City Climate HousingCost HlthCare Crime Transp
## 1 Abilene,TX 521 6200 237 923 4031
## 2 Akron,OH 575 8138 1656 886 4883
## 3 Albany,GA 468 7339 618 970 2531
## 4 Albany-Schenectady-Troy,NY 476 7908 1431 610 6883
## 5 Albuquerque,NM 659 8393 1853 1483 6558
## 6 Alexandria,LA 520 5819 640 727 2444
## Educ Arts Recreat Econ CaseNum Long Lat Pop StNum
## 1 2757 996 1405 7633 1 -99.6890 32.5590 110932 44
## 2 2438 5564 2632 4350 2 -81.5180 41.0850 660328 36
## 3 2560 237 859 5250 3 -84.1580 31.5750 112402 11
## 4 3399 4655 1617 5864 4 -73.7983 42.7327 835880 35
## 5 3026 4496 2612 5727 5 -106.6500 35.0830 419700 33
## 6 2972 334 1018 5254 6 -92.4530 31.3020 135282 19
The nine rating criteria used are: Climate and Terrain, Housing, Health Care and Environment, Crime, Transportation, Education, The Arts, Recreation, Economics.
For all but two of the above criteria, the higher the score, the better; for Housing and Crime, the lower the score the better.
The scores are computed based upon the following items:
Our analysis goal is to find trends and characterize individual cities using fewer variables.
Question 5:
Use the prcomp function to perform PCA on the Places Rated data; be sure to remove the city’s name, case number, lat/long, population, and StNum. Use the loadings to come up with an interpretation for the first three principal components.
Question 6:
Construct a scree plot and determine how much of the variation in the data is contained in the first three principal components. Also, provide an argument for reducing these data to only the first three principal components.
Question 7:
Use fviz_pca_ind to construct a plot displaying the relationship between the first two principal component scores of these cities, colored by log-population (use base 10). How does population appear to be related to these components?
Question 8:
Place 213 is New York City; use the first two principal components to characterize New York City.