Please document your answers to all homework questions using R Markdown, submitting your compiled output on P-web.
\(~\)
Unsupervised learning approaches, such as PCA, are frequently used in the analysis of genomic data, which often contain thousands of genetic variables.
The dataset NCI60 in the ISLR package contains 6830 gene expression measurements (variables) for 64 cancer cell lines (observations). Each cell line has a known cancer type; however, the goal of this analysis is to explore the extent to which gene expression data can be used to characterize and identify different types of cancer.
library(ISLR)
data("NCI60")
nciData <- NCI60$data
labels <- NCI60$labs
Part A: Perform PCA on the gene expression data (be sure to standardize), storing your results in an object named nci_pca. Report the amount of variation explained by the first three principal components.
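As a rough illustration of how the variation explained by each component can be computed, here is a minimal sketch on R's built-in USArrests data. The data set and the object names pca_demo, pc_var, and prop_var are assumptions chosen for demonstration; this is not the homework data.

```r
# Illustration on the built-in USArrests data (not the homework data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation
pc_var <- pca_demo$sdev^2

# Proportion of total variance explained by each component
prop_var <- pc_var / sum(pc_var)

round(prop_var, 3)          # per-component proportion
round(cumsum(prop_var), 3)  # cumulative proportion
```

The same proportions also appear in the "Proportion of Variance" row printed by summary(pca_demo).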
Part B: Construct a data frame containing scores for the first three principal components and add the “labels” vector as an additional column. Next, filter this data frame to include only the “PROSTATE”, “OVARIAN”, “COLON”, and “MELANOMA” cancer types. Then, using the filtered data, create a 3-D scatter plot (using plotly) that displays each sample’s scores in \(PC_1\), \(PC_2\), and \(PC_3\), with points colored by cancer type.
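The general workflow of building a score data frame, filtering it, and plotting in 3-D might look like the following sketch. It uses the built-in iris data rather than the homework data, assumes the dplyr and plotly packages are installed, and the names iris_pca, scores, scores_sub, and fig are hypothetical.

```r
library(dplyr)
library(plotly)

# Illustration on the built-in iris data (not the homework data)
iris_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Scores for the first three components plus a label column
scores <- data.frame(iris_pca$x[, 1:3], label = as.character(iris$Species))

# Keep only a subset of the groups
scores_sub <- scores %>%
  filter(label %in% c("setosa", "virginica"))

# 3-D scatter plot of the scores, colored by group
fig <- plot_ly(scores_sub, x = ~PC1, y = ~PC2, z = ~PC3, color = ~label,
               type = "scatter3d", mode = "markers")
fig
```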
\(~\)
The “wines” data set records the results of a chemical analysis on wine samples produced by three different cultivators in the same region of Italy.
wines <- read.csv("https://remiller1450.github.io/data/wines.csv")
The goal of this application is to identify and assess differences across wines and cultivators.
Part A: Use functions in the corrplot package to visualize a correlation matrix of these data. Then, briefly justify principal component analysis as a reasonable approach to analyzing these data.
Part B: Perform PCA on the wines data (be sure to standardize and remove the “Origin” variable), storing your results in an object named wines_pca. Then, use parallel analysis to determine how many components should be retained.
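To make the idea behind parallel analysis concrete, the sketch below implements a simple version by hand in base R: observed eigenvalues are compared against eigenvalues from random, uncorrelated data of the same dimensions. It runs on the built-in USArrests data (not the wines data). The number of simulations and the 95th-percentile threshold are assumed choices, and a packaged implementation may differ in its details.

```r
set.seed(1)

# Illustration on the built-in USArrests data (not the homework data)
X <- scale(USArrests)
n <- nrow(X)
p <- ncol(X)

# Observed eigenvalues of the correlation matrix
obs_eig <- prcomp(X)$sdev^2

# Eigenvalues from random, uncorrelated data of the same dimensions
n_sims <- 200  # assumed number of simulations
sim_eig <- replicate(n_sims, {
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values
})

# Reference threshold: 95th percentile of simulated eigenvalues per component
threshold <- apply(sim_eig, 1, quantile, probs = 0.95)

# Retain the leading components whose eigenvalue exceeds the random threshold
exceeds <- obs_eig > threshold
n_retain <- if (all(exceeds)) p else which(!exceeds)[1] - 1
n_retain
```

Components are retained only while their eigenvalues are larger than what random noise alone would typically produce.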
Part C: Calculate and report the amount of variation explained by each of the components you chose to retain in Part B.
Part D: Create a graph that displays the top contributors to \(PC_1\). Using this graph, determine a label for describing this component (Hint: phenols are a group of phytochemicals that account for antioxidant activity, flavonoids are the largest group of phenolic compounds, and proanthocyanins are a type of polyphenol).
Part E: Create a graph that displays the top contributors to \(PC_2\). Briefly describe how this component appears to capture wine characteristics that are different from those that most strongly contribute to \(PC_1\).
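One way to see what "top contributors" means is to compute contributions directly from the loadings. Because each loading vector has unit length, the squared loadings of a component sum to one and can be read as percent contributions. The sketch below shows this in base R on the built-in USArrests data (not the wines data). The names load_pc1 and contrib_pc1 are hypothetical, and a dedicated plotting package may present the same quantities differently.

```r
# Illustration on the built-in USArrests data (not the homework data)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings for the first component (each column of rotation has unit length)
load_pc1 <- pca_demo$rotation[, "PC1"]

# Percent contribution of each variable: squared loading times 100
contrib_pc1 <- sort(100 * load_pc1^2, decreasing = TRUE)

# Bar chart of contributions, largest first
barplot(contrib_pc1, las = 2, ylab = "Contribution to PC1 (%)")

# Signed loadings show the direction of each variable's association
round(load_pc1[names(contrib_pc1)], 3)
```

Contributions show which variables matter most for a component, while the signs of the loadings help in choosing a label for it.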
Part F: Create a biplot displaying scores and variable loadings for the first two principal components. Use the col.ind argument to color the individual observations according to the 3 different values of the “Origin” variable in the original data set. You might also use the argument label = FALSE to suppress labeling. Then, using the biplot, briefly interpret the differences you see across these three cultivators.
Part G: Use group_by and summarize to find the average scores in \(PC_1\), \(PC_2\), and \(PC_3\) for each of the three cultivators. Briefly comment on how this information relates to the biplot you created in Part F.
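The grouped-average pattern can be sketched as follows on the built-in iris data (not the wines data). The sketch assumes the dplyr package is installed, and the names iris_pca, scores, and score_means are hypothetical.

```r
library(dplyr)

# Illustration on the built-in iris data (not the homework data)
iris_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Combine the first three score columns with the grouping variable
scores <- data.frame(iris_pca$x[, 1:3], Species = iris$Species)

# Average score in each component, by group
score_means <- scores %>%
  group_by(Species) %>%
  summarize(mean_PC1 = mean(PC1),
            mean_PC2 = mean(PC2),
            mean_PC3 = mean(PC3))

score_means
```

Each row of the result gives the center of one group in the space of the first three components, which is what a biplot shows visually for the first two.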
\(~\)
The code given below loads the responses to an online questionnaire aimed at measuring the “dark triad” personality traits: Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy).
For more information you can visit this link: http://openpsychometrics.org/tests/SD3/
dark = read.delim("https://remiller1450.github.io/data/dark_triad.csv", sep='\t')
Part A: After removing variables that are not survey items and standardizing, perform PCA and create a scree plot showing the variance explained in each principal component. Based upon this plot, does it seem like 3 underlying factors capture most of the variation in the survey items? Briefly explain.
Part B: Using the principal component loadings, relate each of the “dark triad” traits to the most appropriate principal component. Your submitted answer should summarize your label assignments and how you came up with them.
Part C: The graphic below displays the distribution of scores in the first three principal components for respondents from 3 different countries of origin: Germany (DE), India (IN), and the Philippines (PH). Recreate this graphic, replacing the names “PC1”, “PC2”, and “PC3” with the names of the dark triad traits you determined in Part B. It’s acceptable if your scores are mirrored across the y-axis (vertical line at zero) relative to the target visualization shown below.
Part D: Using the principal component loadings and the survey item/scale information shown below to inform your answer, which of the 3 countries of origin depicted in the above visualization tends to exhibit the most narcissism (excessive self-love) among its respondents? Explain your answer.
This image is taken from the Appendix of Jones and Paulhus (2014) https://journals.sagepub.com/doi/full/10.1177/1073191113514105