Principal Components Analysis (PCA) is an unsupervised learning method used to understand relationships among a set of variables. The method is used in many disciplines; see the very detailed PCA Wikipedia page for descriptions of several different applications.

In this lab we will use the factoextra package to perform principal component analysis and visualize the results.

library(factoextra)  ## Used for PCA visualizations
library(dplyr)  ## Used for data cleaning

1. What are Principal Components?

Consider data consisting of \(n\) measurements of \(p\) variables. If \(p\) is large, it becomes difficult to thoroughly visualize the relationships between variables. For \(p = 10\) there are already 45 distinct two-variable combinations, creating a major barrier to using bivariate methods like scatterplots or correlation coefficients to find patterns. PCA avoids the issue of having to sort through many different combinations of variables by instead finding low-dimensional representations of the data that contain as much of the original variation as possible.

The main idea is to replace the original \(p\) variables with a small number of new, derived dimensions that together contain as much of the original variation as possible.

These new derived dimensions are called principal components, and they are linear combinations of the original variables. The first principal component is defined as:

\[Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \ldots + \phi_{p1} X_p\]
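
To complete this definition, the coefficients are normalized so that their squares sum to one:

\[\phi_{11}^2 + \phi_{21}^2 + \ldots + \phi_{p1}^2 = 1\]

Subject to this constraint, the coefficients are chosen so that \(Z_1\) has the largest possible variance.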

The coefficients \(\phi_{11}, \ldots, \phi_{p1}\) are the loadings of the first principal component. We won’t go into detail on how the loadings of a principal component are calculated, but they can be found quite easily using an eigenvalue decomposition, a standard technique in linear algebra.
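
As a quick, optional illustration of this fact (using simulated data, since we haven’t loaded a dataset yet), the eigenvectors of a correlation matrix match the loadings returned by prcomp, up to sign:

set.seed(1)
X <- matrix(rnorm(200), ncol = 4)  ## 50 simulated observations of 4 variables
eigen(cor(X))$vectors              ## Loadings from the eigenvalue decomposition
prcomp(X, scale = TRUE)$rotation   ## The same loadings (up to sign) from prcomp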

Conceptually, PCA can be understood through the two scenarios below:

In Scenario A, the variables X and Y are redundant, meaning that if you know an observation’s X-value, you essentially know its Y-value too. For these variables, we could effectively use one dimension (i.e., a line) to capture most of the variation (differences among data points) in the original X and Y dimensions.

Contrast this with Scenario B, where there isn’t much redundancy in X and Y. In this scenario, knowing an observation’s X-value tells us very little about its Y-value, implying that one dimension cannot effectively express most of the variation in the two variables.

The figure below shows the first principal component (red) and second principal component (blue) for each scenario:

Because \(p = 2\), there are only two principal components in these scenarios. The first component is the dimension containing the maximum amount of variation in the two variables; the second component is the dimension containing the maximum amount of variation that isn’t captured by the first. It is not a coincidence that these two dimensions are orthogonal: all principal components are constructed to be orthogonal to one another.

In Scenario A, we can justifiably discard the second principal component when analyzing the data, as this dimension contains very little new information. Using only the first principal component is akin to projecting the data onto the red line.

In Scenario B, using only the first component in an analysis is questionable, since almost half of the variation in these data is found in the second component!
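
If you’d like to see these two scenarios numerically, the small simulation below (not part of the lab data; the object names are made up for illustration) mimics them and reports how much variation each component captures:

set.seed(123)
x <- rnorm(100)
scenA <- data.frame(X = x, Y = 2*x + rnorm(100, sd = 0.3))  ## Scenario A: Y is nearly determined by X
scenB <- data.frame(X = x, Y = rnorm(100))                  ## Scenario B: Y is unrelated to X
summary(prcomp(scenA, scale = TRUE))  ## PC1 captures nearly all of the variation
summary(prcomp(scenB, scale = TRUE))  ## The variation is split roughly evenly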

2. US Arrest Data Example

The loadings of the top principal components can be used to understand patterns in the data, especially when these components describe a high percentage of the variation in the data.

As an example, we’ll analyze data from the 50 US States, looking for relationships between the assault, murder, and rape arrest rates of each state per 100,000 residents. In our analysis we’ll also include a variable documenting the proportion of a state’s population living in urban areas.

Before looking at the data, it is important to point out that we expect some or all of these variables to be correlated, which makes PCA a sensible choice.

data(USArrests)
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

In R, PCA can be performed using the prcomp function. In this example, the results (including the loadings of each principal component) are stored in the object P.

P <- prcomp(USArrests, scale = TRUE)
P
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## 
## Rotation (n x k) = (4 x 4):
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

The argument scale = TRUE is used to rescale each variable to have a variance of 1, which is needed to put variables with different units on the same scale. In most applications rescaling is a necessary step if PCA is to produce sensible results.
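
As a quick, optional check of why rescaling matters here, compare the variances of the original variables; Assault is recorded on a much larger numeric scale than the other variables, so without rescaling it would dominate the first component:

apply(USArrests, 2, var)  ## Variances of the original (unscaled) variables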

By studying the contributions of each of the original variables to a component we can come up with an interpretation of the newly derived dimension. In this example, the first component (PC1) can be understood as the “overall crime rate” because the loadings are large, roughly equal in magnitude, and occur in the same direction for all three crime variables, while the loading for “UrbanPop” is somewhat smaller. Here the negative coefficients indicate that high values in the original variables are mapped to low scores in the PC1 dimension. This means that higher crime rates correspond with lower PC1 scores.
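
To make this mapping concrete, each state’s score on a component is just the linear combination from Section 1 applied to that state’s centered and scaled values. A short sketch (the object name scores is ours, chosen for illustration):

scores <- scale(USArrests) %*% P$rotation  ## Apply the loadings to the scaled data
head(scores[, 1])  ## First principal component scores
head(P$x[, 1])     ## Matches the scores that prcomp stores in P$x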

The second principal component roughly describes the urbanization of a state, since the variable “UrbanPop” receives most of the weight in this component.

A quick way to visualize the first two principal components is the biplot function. The argument scale = 0 ensures the component directions (the upper and right axes) are scaled to match the loadings, while cex = .75 adjusts the size of each plotted point’s text.

biplot(P, scale = 0, cex = .75)

A more elegant visual can be created using the factoextra package:

fviz_pca_biplot(P, repel = TRUE, # Avoid text overlapping (doesn't scale to large datasets)
                col.var = "red", # Variables color
                col.ind = "black")  # Individuals color

The state names indicate each observation’s normalized score on the first two principal components. States in the upper left have low scores in the first principal component (meaning they have high overall rates of crime) and high scores in the second principal component (meaning they have low urbanization).

The red arrows depict the loadings of each variable. For example, the loading for “UrbanPop” is -0.28 in the first principal component and -0.87 in the second, so the red label for “UrbanPop” is centered at (-0.28, -0.87) on the scale shown on the top and right axes of the graph, which describes the loadings.

We can use the biplot to describe the key characteristics of each data point. For example, Florida is high crime and about average in urbanization, while West Virginia is low crime and not very urban.
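
If you’d like to check these readings numerically, the scores displayed in the biplot are stored in P$x:

P$x[c("Florida", "West Virginia"), 1:2]  ## Scores on the first two components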

Question 1:

Use the biplot to describe the crime and urbanization of California and Iowa.

3. How Many Components?

In the US states example, we used PCA to represent four variables using just two dimensions. When doing this it is important to consider how much information was lost by using only the first two principal components. Or, phrased differently, how much variability in the data is not contained in the first two principal components?

The fviz_eig function produces a scree plot, which visualizes the amount of variation captured by each principal component:

fviz_eig(P, addlabels = TRUE)

From the scree plot, we can see that the first two principal components contain just under 90% of the variability in these data (roughly 62% + 25% = 87%). This suggests that plotting the data in two dimensions (those derived from our first two principal components) is an effective summary of all of the variables. Putting these results into context, we can conclude that the USArrests data can be effectively characterized by “overall crime rate” and “urbanization”.

Deciding how many principal components to use is heavily dependent on the application. Most often, analysts will examine the scree plot and look for a natural leveling-off point. In the US Arrests example there is a sizeable drop-off after the second component, so it is reasonable to use just the first two components.
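
The factoextra package can also report the same information as the scree plot in tabular form, which is convenient for reading off exact percentages:

get_eigenvalue(P)  ## Eigenvalues with percentage and cumulative percentage of variance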

Note: The percentages shown on the scree plot come directly from the eigenvalues of the correlation matrix of the data; we could find these percentages ourselves using the standard deviations from the output of prcomp:

P <- prcomp(USArrests, scale = TRUE)
P$sdev^2/sum(P$sdev^2)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
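
To connect this to the “just under 90%” figure quoted above, we can also accumulate these proportions:

cumsum(P$sdev^2)/sum(P$sdev^2)  ## Cumulative proportion of variance explained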

4. An Example From Psychology

Reducing a dataset to two or three dimensions prior to visualization is one use of principal components, but sometimes the loadings themselves represent the most interesting portion of the analysis.

In this example we will explore data collected from an interactive online personality test. These data measure the big-five personality traits, which are: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.

The personality questionnaire used in the collection of these data asked participants to rate 50 statements on a five-point scale (1 = disagree, 3 = neutral, 5 = agree). The researchers specifically designed 10 statements targeting each of the big five traits. The questions used, along with a more detailed description of the data, can be found in the big five codebook.

big5 <- read.delim("https://remiller1450.github.io/data/big5data.csv")  ## Read in the big five survey responses
big5 <- select(big5, -race, -age, -engnat, -gender, -hand, -source, -country) ## Exclude demographics

P2 <- prcomp(big5)  ## PCA on the 50 question responses (no rescaling; see the note below)

Note: We did not use the argument scale = TRUE in prcomp for this application because these questions are already on the same 1-5 scale.
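
If you want to convince yourself that rescaling isn’t essential here, one quick optional check is to look at the spread of the question variances, which should be broadly similar since every question uses the same 1-5 scale:

summary(apply(big5, 2, var))  ## Variances of the 50 question responses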

fviz_eig(P2, addlabels = TRUE)

From the scree plot, five components clearly stand out (with the first really standing out) before there is a noticeable drop in explained variance.

The next step in our analysis is to investigate which personality questions are contributing to each of these five components. We could do this by inspecting the loadings, but sorting through the loadings for all 50 questions is time-consuming. Instead, we’ll use the function fviz_contrib to find the top contributing questions for each principal component of interest.

fviz_contrib(P2, choice = "var", axes = 1, top = 6) # Top 6 contributors to the first component

We see that five of the top six contributors (and all of the first four) are questions from the “E” series. This is not a coincidence: psychologists have named this personality dimension “Extroversion”, and the E-series questions were specifically designed to measure introversion/extroversion!

The remaining big-five dimensions are shown below; we see that each specifically designed question category stands out in a particular component:

library(gridExtra) ## Used to arrange multiple ggplots

n <- fviz_contrib(P2, choice = "var", axes = 2, top = 6)  # Second component (Neuroticism)

c <- fviz_contrib(P2, choice = "var", axes = 3, top = 6) # Third component (Conscientiousness)

e <- fviz_contrib(P2, choice = "var", axes = 4, top = 6) # Fourth component (Openness to Experiences)

a <- fviz_contrib(P2, choice = "var", axes = 5, top = 6) # Fifth component (Agreeableness)

grid.arrange(n,c,e,a, nrow = 2) ## Function for arranging ggplots into one grid

Putting the findings of this application in context, we can conclude that while the researchers asked each participant 50 questions, most of the person-to-person variation in responses occurs in only five dimensions: “openness to experiences”, “conscientiousness”, “extraversion”, “agreeableness”, and “neuroticism”.

5. Practice - Understanding Elite Decathletes

The dataset decathlon2 in the factoextra package contains data on the performances of professional decathletes in two different meets:

data(decathlon2)
head(decathlon2)
##           X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
## CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
## BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
## YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
## ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
## McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
##           Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE          5.02    63.19  291.7    1   8217    Decastar
## CLAY            4.92    60.15  301.5    2   8122    Decastar
## BERNARD         5.32    62.77  280.1    4   8067    Decastar
## YURKOV          4.72    63.44  276.4    5   8036    Decastar
## ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
## McMULLEN        4.42    56.37  285.1    8   7995    Decastar

Decathlons are scored using a point system in which each athlete is awarded points for their performance in each of ten track and field events according to a standard scoring table. For our analysis, we will try to reduce the dimensions of these data to better understand elite decathletes.
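
As a starting point for the practice questions below, one possible way to set up the data (this is only a sketch; the object name dec_events is our own) is to drop the columns that describe scoring outcomes rather than event performances:

dec_events <- decathlon2[, !(names(decathlon2) %in% c("Rank", "Points", "Competition"))]
head(dec_events)  ## The ten event performances remain for each athlete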

Question 2:

Use the prcomp function to perform PCA on the decathlon data, being sure to remove irrelevant variables and rescale the data as necessary. Next, use fviz_pca_var to construct a plot showing variable contributions to the first two components. Use this plot (along with the loadings) to interpret the first two components (i.e., come up with a meaning for these new dimensions).

Question 3:

Use fviz_pca_ind to plot the scores of each observed data point in the first two principal components. Use this plot to characterize the athletes “Macey” and “BOURGUIGNON”.

Question 4:

Use the col.ind argument of fviz_pca_ind to color your previous plot according to how many points the athlete scored in that decathlon. Interpret this plot; specifically, how do top athletes appear to be achieving their superior scores?

6. Practice - Characterizing Places to Live

The “Places Rated” data were assembled by Boyer and Savageau (1981), who scored various cities across the US with the goal of objectively classifying different cities.

places <- read.csv("https://remiller1450.github.io/data/places.csv")
head(places)
##                         City Climate HousingCost HlthCare Crime Transp
## 1                 Abilene,TX     521        6200      237   923   4031
## 2                   Akron,OH     575        8138     1656   886   4883
## 3                  Albany,GA     468        7339      618   970   2531
## 4 Albany-Schenectady-Troy,NY     476        7908     1431   610   6883
## 5             Albuquerque,NM     659        8393     1853  1483   6558
## 6              Alexandria,LA     520        5819      640   727   2444
##   Educ Arts Recreat Econ CaseNum      Long     Lat    Pop StNum
## 1 2757  996    1405 7633       1  -99.6890 32.5590 110932    44
## 2 2438 5564    2632 4350       2  -81.5180 41.0850 660328    36
## 3 2560  237     859 5250       3  -84.1580 31.5750 112402    11
## 4 3399 4655    1617 5864       4  -73.7983 42.7327 835880    35
## 5 3026 4496    2612 5727       5 -106.6500 35.0830 419700    33
## 6 2972  334    1018 5254       6  -92.4530 31.3020 135282    19

The nine rating criteria used are: Climate and Terrain, Housing, Health Care and Environment, Crime, Transportation, Education, The Arts, Recreation, Economics.

For all but two of the above criteria, the higher the score, the better; for Housing and Crime, the lower the score the better.

The scores are computed based upon the following items:

Our analysis goal is to find trends and characterize individual cities using fewer variables.

Question 5:

Use the prcomp function to perform PCA on the Places Rated data, being sure to remove the city’s name, case number, lat/long, population, and StNum. Use the loadings to come up with an interpretation for the first three principal components.

Question 6:

Construct a scree plot and determine how much of the variation in the data is contained in the first three principal components. Also, provide an argument for reducing these data to only the first three principal components.

Question 7:

Use fviz_pca_ind to construct a plot displaying the relationship between the first two principal component scores of these cities, colored by log-population (use base 10). How does population appear to be related to these components?

Question 8:

Place 213 is New York City. Use the first two principal components to characterize New York City.