Directions

Please document your answers to all homework questions using R Markdown, submitting your compiled output on P-web.

\(~\)

Question #1

Unsupervised learning approaches, such as PCA, are frequently used in the analysis of genomic data, which often contain thousands of genetic variables.

The dataset NCI60 in the ISLR package contains 6830 gene expression measurements (variables) for 64 cancer cell lines (observations). Each cell line has a known cancer type; however, the goal of this analysis is to explore the extent to which gene expression data can be used to characterize and identify different types of cancer.

library(ISLR)
data("NCI60")
nciData <- NCI60$data
labels <- NCI60$labs

Part A: Perform PCA on the gene expression data (be sure to standardize), storing your results in an object named nci_pca. Report the amount of variation explained by the first three principal components.
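
As a minimal sketch of the general workflow, the code below runs a standardized PCA and computes the proportion of variance explained. It uses the built-in USArrests data as a stand-in, so its numbers are not the answer to this question, and the names demo_pca, pc_var, and pve are illustrative.

```r
# Illustrative only: built-in USArrests data stands in for the gene expression matrix
demo_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation
pc_var <- demo_pca$sdev^2

# Proportion of variance explained (PVE) by each component
pve <- pc_var / sum(pc_var)
round(pve, 3)

# Cumulative PVE for the first three components
cumsum(pve)[1:3]
```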

Part B: Construct a data frame containing scores for the first three principal components and add the “labels” vector as an additional column. Next, filter this data frame to include only the “PROSTATE”, “OVARIAN”, “COLON”, and “MELANOMA” cancer types. Then, using the filtered data, create a 3-D scatter plot (plotly) that displays each sample’s scores in \(PC_1\), \(PC_2\), and \(PC_3\) and is colored by cancer type.
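
The steps in this part (build a score data frame, attach labels, filter, plot) can be sketched as follows. This is a hedged example on the built-in iris data, not the NCI60 solution; the names scores, keep, and scores_sub are hypothetical, and the plotly call is skipped when the package is not installed.

```r
# Illustrative only: iris stands in for the expression data and its labels
demo_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Scores for the first three components, plus the label vector as a column
scores <- data.frame(demo_pca$x[, 1:3], label = iris$Species)

# Keep only a subset of the groups
keep <- c("setosa", "virginica")
scores_sub <- subset(scores, label %in% keep)
scores_sub$label <- droplevels(scores_sub$label)

# 3-D scatter plot colored by group (skipped if plotly is not installed)
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly(scores_sub, x = ~PC1, y = ~PC2, z = ~PC3,
                       color = ~label, type = "scatter3d", mode = "markers")
  # Print p in an interactive session to display the plot
}
```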

\(~\)

Question #2

The “wines” data set records the results of a chemical analysis on wine samples produced by three different cultivators in the same region of Italy.

wines <- read.csv("https://remiller1450.github.io/data/wines.csv")

The goal of this application is to identify and assess differences across wines and cultivators.

Part A: Use functions in the corrplot package to visualize a correlation matrix of these data. Then, briefly justify principal component analysis as a reasonable approach to analyzing these data.
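
A rough sketch of a correlation-matrix visualization is shown below, using the built-in mtcars data as a stand-in for the wines data. The method and order settings are just one reasonable choice, and the plot is skipped when corrplot is not installed.

```r
# Illustrative only: mtcars stands in for the wines data
cor_mat <- cor(mtcars)

# Strong off-diagonal correlations suggest redundancy that PCA can summarize
off_diag <- cor_mat[upper.tri(cor_mat)]
summary(abs(off_diag))

# Visualize the matrix (skipped if corrplot is not installed)
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle", order = "hclust")
}
```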

Part B: Perform PCA on the wines data (be sure to standardize and remove the “Origin” variable), storing your results in an object named wines_pca. Then, use parallel analysis to determine how many components should be retained.
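
To illustrate the idea behind parallel analysis, here is a hand-rolled sketch on the built-in mtcars data. It compares observed eigenvalues with those from random, uncorrelated data of the same size. The 200 simulations and the 95th-percentile threshold are assumptions chosen for illustration; packaged versions such as fa.parallel in the psych package may use different defaults and give somewhat different answers.

```r
# Illustrative only: a hand-rolled version of Horn's parallel analysis on mtcars
set.seed(1)
X <- scale(mtcars)
n <- nrow(X)
p <- ncol(X)

# Observed eigenvalues of the correlation matrix
obs_eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from random, uncorrelated data of the same dimensions
n_sims <- 200
sim_eig <- replicate(n_sims, {
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values
})

# 95th percentile of the simulated eigenvalues at each position
threshold <- apply(sim_eig, 1, quantile, probs = 0.95)

# Retain the leading components whose eigenvalues exceed the threshold
exceeds <- obs_eig > threshold
n_retain <- if (all(exceeds)) p else which(!exceeds)[1] - 1
n_retain
```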

Part C: Calculate and report the amount of variation explained by each of the components you chose to retain in Part B.

Part D: Create a graph that displays the top contributors to \(PC_1\). Using this graph, determine a label for describing this component (Hint: phenols are a group of phytochemicals that account for antioxidant activity, flavonoids are the largest group of phenolic compounds, and proanthocyanins are a type of polyphenol).
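
One way to think about "contributions" is as squared loadings expressed as percentages. The base R sketch below shows this on the built-in USArrests data; it is an illustration of the calculation, not the wines answer, and packaged helpers such as fviz_contrib in factoextra produce a similar display.

```r
# Illustrative only: USArrests stands in for the wines data
demo_pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings are unit-length vectors, so squared loadings sum to 1 within a component
contrib_pc1 <- 100 * demo_pca$rotation[, "PC1"]^2

# Sort from largest to smallest contribution
contrib_pc1 <- sort(contrib_pc1, decreasing = TRUE)

barplot(contrib_pc1, las = 2, ylab = "Contribution to PC1 (%)")
# Reference line: the contribution expected if all variables contributed equally
abline(h = 100 / length(contrib_pc1), lty = 2)
```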

Part E: Create a graph that displays the top contributors to \(PC_2\). Briefly describe how this component appears to capture wine characteristics that are different from those that most strongly contribute to \(PC_1\).

Part F: Create a biplot displaying scores and variable loadings for the first two principal components. Use the col.ind argument to color the individual observations according to the 3 different values of the “Origin” variable in the original data set. You might also use the argument label = FALSE to suppress labeling. Then, using the biplot, briefly interpret the differences you see across these three cultivators.
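
A hedged sketch of a grouped biplot is given below on the built-in iris data, assuming the factoextra package is the source of the col.ind argument mentioned above. The grouping variable is kept outside the PCA and passed in only for coloring; here label = "var" keeps the variable labels and drops the point labels.

```r
# Illustrative only: iris stands in for the wines data
demo_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
groups <- iris$Species  # grouping variable kept outside the PCA

# Biplot with points colored by group (skipped if factoextra is not installed)
if (requireNamespace("factoextra", quietly = TRUE)) {
  bp <- factoextra::fviz_pca_biplot(demo_pca, col.ind = groups, label = "var")
  # Print bp in an interactive session to display the plot
}
```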

Part G: Use group_by and summarize to find the average scores in \(PC_1\), \(PC_2\), and \(PC_3\) for each of the three cultivators. Briefly comment on how this information relates to the biplot you created in Part F.
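
The group_by and summarize pattern can be sketched as follows on the built-in iris data. The column names group and mean_PC1 through mean_PC3 are illustrative choices, not required names.

```r
library(dplyr)

# Illustrative only: iris stands in for the wines data
demo_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
scores <- data.frame(demo_pca$x[, 1:3], group = iris$Species)

# Average score on each component within each group
avg_scores <- scores %>%
  group_by(group) %>%
  summarize(mean_PC1 = mean(PC1), mean_PC2 = mean(PC2), mean_PC3 = mean(PC3))
avg_scores
```

Group averages that sit far apart on a component correspond to groups that separate along that axis in a biplot.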

\(~\)

Question #3

The code given below loads the responses to an online questionnaire aimed at measuring “dark triad” personality traits, which are machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy).

For more information you can visit this link: http://openpsychometrics.org/tests/SD3/

dark = read.delim("https://remiller1450.github.io/data/dark_triad.csv", sep='\t')

Part A: After removing variables that are not survey items and standardizing, perform PCA and create a scree plot showing the variance explained in each principal component. Based upon this plot, does it seem like 3 underlying factors capture most of the variation in the survey items? Briefly explain.
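
A scree plot can be drawn in base R from the component variances, as in this sketch on the built-in mtcars data (a stand-in for the survey items). Other plotting functions would work equally well.

```r
# Illustrative only: mtcars stands in for the survey items
demo_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
pve <- demo_pca$sdev^2 / sum(demo_pca$sdev^2)

# Scree plot: look for an "elbow" where additional components add little
plot(seq_along(pve), pve, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained")
```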

Part B: Using the principal component loadings, relate each of the “dark triad” traits to the most appropriate principal component. Your submitted answer should summarize your label assignments and how you came up with them.
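
One possible way to inspect loadings is to ask, for each variable, which of the first few components it loads on most strongly. The sketch below does this on the built-in mtcars data; the names load3 and strongest are hypothetical, and this heuristic is only a starting point for interpretation.

```r
# Illustrative only: mtcars stands in for the survey items
demo_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
load3 <- demo_pca$rotation[, 1:3]

# For each variable, find the component on which it loads most strongly
strongest <- colnames(load3)[apply(abs(load3), 1, which.max)]
data.frame(variable = rownames(load3), strongest_pc = strongest)

# Variables with the largest absolute loadings on PC1
head(sort(abs(load3[, "PC1"]), decreasing = TRUE), 5)
```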

Part C: The graphic below displays the distribution of scores in the first three principal components for respondents from 3 different countries of origin: Germany (DE), India (IN), and the Philippines (PH). Recreate this graphic, replacing the names “PC1”, “PC2”, and “PC3” with the names of the dark triad traits you determined in Part B. It’s acceptable if your scores are mirrored across the y-axis (vertical line at zero) relative to the target visualization shown below.

Part D: Using the principal component loadings and the survey item/scale information shown below to inform your answer, which of the 3 countries of origin depicted in the above visualization tends to exhibit the most narcissism (excessive self-love) among its respondents? Explain your answer.

This image is taken from the Appendix of Jones and Paulhus (2014) https://journals.sagepub.com/doi/full/10.1177/1073191113514105