Please document your answers to all homework questions using R Markdown, submitting your compiled output on P-web.
Customer segmentation is a popular method businesses use to sort customers into groups so that they can engage with them more effectively (targeted sales, advertisements, promotions, etc.). Most approaches to customer segmentation derive measures of recency, frequency, and monetary value (RFM) and perform a clustering analysis using these derived variables.
The “online retail” data set, or, loaded below documents all transactions taking place between December 1st 2010 and December 9th 2011 at a UK-based non-store online retailer. The company sells a range of all-occasion gifts, largely to wholesalers.
or <- read.csv("https://remiller1450.github.io/data/OnlineRetail.csv")
Part A: The variable “ElapsedTime” records the number of days since the purchase, measured using the last recorded date in the data set (12/09/2011) as the reference, so that a value of “0” indicates a purchase on Dec 9th 2011 and a value of 373 indicates a purchase on Dec 1st 2010. Using this variable, create a data frame that contains the “ElapsedTime” of the most recent purchase for each customer. Name this variable “Recency”, and make sure your data frame contains only 1 row per unique customer.
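A minimal sketch of one approach using dplyr (assuming the column names described above match those in the CSV):
library(dplyr)
## The most recent purchase is the one with the smallest elapsed time
recency <- or %>%
  group_by(CustomerID) %>%
  summarize(Recency = min(ElapsedTime))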
Part B: The variable “UnitPrice” indicates the sale price (in GBP) of each unit, with the number of units sold in a transaction recorded as “Quantity”. Using these variables, create a data frame that contains the total monetary value of each customer, defined by the total amount paid to the retailer across the entire data set. Name this variable “MonetaryValue”, and make sure your data frame contains only 1 row per unique customer.
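A similar hedged sketch, again assuming the column names above:
## Total amount paid per customer across all transactions
monetary <- or %>%
  group_by(CustomerID) %>%
  summarize(MonetaryValue = sum(UnitPrice * Quantity))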
Part C: The variable “InvoiceNumber” is a unique identifier for every order placed. Using this variable, create a data frame that contains the total number of invoices for each customer. Name this variable “Frequency”, and make sure your data frame contains only 1 row per unique customer. Hint: The unique() function will return a vector of unique values (such as unique InvoiceNumbers), and the length of this vector reflects the number of distinct orders.
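One possible sketch following the hint (the invoice column is assumed here to be named “InvoiceNumber”; check names(or), since the CSV may use a different spelling):
## Number of distinct invoices per customer
frequency <- or %>%
  group_by(CustomerID) %>%
  summarize(Frequency = length(unique(InvoiceNumber)))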
Part D: Use the results of Parts A-C to construct a data frame containing the numeric variables “Recency”, “MonetaryValue”, and “Frequency” by properly merging the results of Parts A-C using the variable “CustomerID”. Then, use the “Elbow Method” to determine a reasonable number of \(k\)-means clusters for these data. Hint: Remember to standardize the data prior to clustering.
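A sketch of the merging and elbow-method steps, assuming the recency, monetary, and frequency data frames from the sketches above and the factoextra package:
## Merge the three summaries on CustomerID
rfm <- recency %>%
  inner_join(monetary, by = "CustomerID") %>%
  inner_join(frequency, by = "CustomerID")
## Standardize, then look for the "elbow" in total within-cluster SS
rfm_std <- scale(rfm[, c("Recency", "MonetaryValue", "Frequency")])
library(factoextra)
fviz_nbclust(rfm_std, kmeans, method = "wss")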
Part E: Using the value of \(k\) you selected in Part D, perform \(k\)-means clustering and report each cluster centroid. Then, using these centroids, provide a brief written description of the members of each cluster (ie: “Infrequent large buyers”, etc.)
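A minimal sketch (the value of centers is a placeholder; substitute the \(k\) you selected in Part D):
set.seed(1)  ## arbitrary seed so the clustering is reproducible
km_res <- kmeans(rfm_std, centers = 3)  ## placeholder k
km_res$centers  ## one centroid per cluster, on the standardized scale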
Part F: Consider the distribution of the variable “MonetaryValue”. With this in mind, would you expect PAM clustering to be more appropriate for these data? Briefly explain.
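One quick way to inspect the distribution before answering, using the merged data frame from the sketch above:
hist(rfm$MonetaryValue, xlab = "Total spent (GBP)", main = "MonetaryValue")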
Part G: Perform PAM clustering with the value of \(k\) determined in Part D. Report each cluster medoid, and contrast these results with the \(k\)-means clustering reported in Part E.
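A sketch using pam() from the cluster package (again, the k shown is a placeholder for your choice from Part D):
library(cluster)
pam_res <- pam(rfm_std, k = 3)  ## placeholder k; match Part D
pam_res$medoids  ## medoids on the standardized scale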
The goal in this question is to apply clustering methods to find possible groupings/patterns among individuals killed during police interactions in 2015. The data for this application come from the FiveThirtyEight article “Where Have Police Killed Americans in 2015”.
For this question, you should use the processed data stored in the data frame pk, which is created below:
## Source data from FiveThirtyEight
police <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/police-killings/police_killings.csv", stringsAsFactors = TRUE)
library(dplyr)  ## select(), mutate(), and the %>% pipe come from dplyr
## Select important variables and coerce misread var types
pk <- select(police, age, gender, raceethnicity, state, cause, armed, share_white, share_black, share_hispanic, h_income, pov, urate, college) %>%
mutate(age = as.numeric(as.character(age)), pov = as.numeric(as.character(pov)),
share_white = as.numeric(as.character(share_white)),
share_black = as.numeric(as.character(share_black)),
share_hispanic = as.numeric(as.character(share_hispanic)))
## Add individual name, city as rownames
rownames(pk) <- paste0(police$name, " (", police$city, ")")
## Remove cases with missing data in any of the kept variables
pk <- pk[complete.cases(pk),]
Part A: Based upon an inspection of the data frame pk, should Euclidean distance or Gower distance be used if the goal is to display a distance matrix summarizing the similarities/differences of these data points? Briefly explain.
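A quick way to inspect the variable types before answering, assuming pk as built above:
str(pk)  ## note the mix of numeric and factor columns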
Part B: Find the distance matrix using the metric you specified in Part A and store it in an object named D, then apply PAM clustering with \(k = 4\) and print the medoids.
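A minimal sketch using daisy() and pam() from the cluster package (Gower is shown here; use whichever metric you justified in Part A):
library(cluster)
D <- daisy(pk, metric = "gower")
pam4 <- pam(D, k = 4)
pk[pam4$id.med, ]  ## the rows corresponding to the four medoids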
Part C: The fviz_nbclust() function is not compatible with clustering done using Gower distance, so the choice of \(k\) must be optimized manually. The code below uses a for loop to display average silhouette widths for several possible choices of \(k\). Based upon these results, which choice of \(k\) would you recommend? Briefly explain.
## Average silhouette width for each candidate k
## (pam() is from the cluster package; D is the distance matrix from Part B)
k_seq <- 2:10
for(i in 1:length(k_seq)){
  pam_res <- pam(D, k = k_seq[i])
  print(paste("k=", k_seq[i], "Avg sil", round(pam_res$silinfo$avg.width, 3)))
}
## [1] "k= 2 Avg sil 0.189"
## [1] "k= 3 Avg sil 0.115"
## [1] "k= 4 Avg sil 0.138"
## [1] "k= 5 Avg sil 0.133"
## [1] "k= 6 Avg sil 0.135"
## [1] "k= 7 Avg sil 0.124"
## [1] "k= 8 Avg sil 0.105"
## [1] "k= 9 Avg sil 0.103"
## [1] "k= 10 Avg sil 0.095"
Part D: Apply PAM clustering using the choice of \(k\) you identified in Part C. Then, using the medoids of each cluster, come up with a brief description of the different clusters found by the algorithm. Your description may focus on three or four variables that appear most different.
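A sketch, with the value of k left as a placeholder for your choice from Part C:
pam_best <- pam(D, k = 2)  ## placeholder k
pk[pam_best$id.med, ]  ## medoid rows to base your descriptions on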
Part E: Apply DIANA clustering to these data, storing the results as pk_diana, then use the command cutree(pk_diana, k = 2) to find the cluster assignment of each observation. Add these clusters as an additional variable to pk.
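A minimal sketch, reusing the distance matrix D from Part B and adding a hypothetical column named “cluster”:
pk_diana <- diana(D)  ## DIANA accepts a dissimilarity object
pk$cluster <- cutree(pk_diana, k = 2)  ## "cluster" is a hypothetical column name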
Part F: Use the table() function to create a table displaying cluster assignments as rows and frequencies of each category of the variable “raceethnicity” as columns. Do these distributions appear to align with the characteristics of the clusters reported in Part D?
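A one-line sketch, assuming the assignments were stored in the hypothetical column pk$cluster from the Part E sketch:
table(pk$cluster, pk$raceethnicity)  ## clusters as rows, race/ethnicity as columns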