MATH-256 - Lab #3 - Probability

This lab is intended to provide practice applying a few basic concepts in probability towards a meaningful analysis of real data.

Directions (Please read before starting)

Please work together with your assigned groups. Even though you turn in a write-up that is later scored, labs are intended to be formative, and a substantial portion of the credit you’ll receive is based on completion and effort.
Please record your solutions in an R Markdown document using the conventions we’ve used in previous labs.

$~$

Background - The Hot Hand

Basketball players who make several shots in succession are often described as having a hot hand. While many fans and players are strong believers in the hot hand phenomenon, a 1985 research paper concluded that successive shots are actually independent events (Link).

In this lab, we’ll analyze the performance of the late Kobe Bryant (Rest in Peace) against the Orlando Magic in the 2009 NBA Finals en route to his fourth championship (a performance that earned him the 2009 Finals MVP award).

More specifically, we will look at sequences of Kobe’s made and missed shot attempts to investigate whether there is any evidence that Kobe went on hot-handed shooting streaks during the finals series.

$~$

First Steps

In total, Kobe attempted 133 shots in the five-game finals series. In the data below, the outcome of each shot is recorded via the variable “basket” as either “H” for a hit (made shot) or “M” for a missed shot. Additional information, such as game, quarter, time, and description, is also included in the dataset.

## This reads the data
kobe <- read.csv("https://remiller1450.github.io/data/Kobe.csv")

## This will load a custom function that we'll use later this lab
source("https://remiller1450.github.io/m256f21/functions.R")

## Print the first few rows of the Kobe dataset
head(kobe)

##    vs game quarter time                                             description
## 1 ORL    1       1 9:47                 Kobe Bryant makes 4-foot two point shot
## 2 ORL    1       1 9:07                               Kobe Bryant misses jumper
## 3 ORL    1       1 8:11                        Kobe Bryant misses 7-foot jumper
## 4 ORL    1       1 7:41 Kobe Bryant makes 16-foot jumper (Derek Fisher assists)
## 5 ORL    1       1 7:03                         Kobe Bryant makes driving layup
## 6 ORL    1       1 6:01                               Kobe Bryant misses jumper
##   basket
## 1      H
## 2      M
## 3      M
## 4      H
## 5      H
## 6      M
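As a small warm-up on made-up data, the sketch below shows how a character vector of outcomes coded like “basket” can be tabulated and turned into proportions. The vector toy_shots is hypothetical and is not part of the Kobe dataset.

```r
## Hypothetical toy data (NOT from the Kobe dataset), coded like "basket"
toy_shots <- c("H", "M", "M", "H", "H", "M")

## Counts of each outcome
toy_counts <- table(toy_shots)
toy_counts

## Proportions of each outcome
prop.table(toy_counts)

## A logical comparison averages to the proportion of TRUE values
toy_hit_rate <- mean(toy_shots == "H")
toy_hit_rate
```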

Question #1: Overall, what proportion of Kobe’s shot attempts did he end up hitting? Include your answer, as well as the R code you used to find it, in your write-up.

$~$

Shooting Streaks

To determine whether or not Kobe exhibited a “hot hand”, we need to look at sequences of made/missed shots.

To make this task more manageable, we’ll aggregate the data by looking at consecutive made shots until a miss occurs using the calc_streak function (a custom function sourced in the previous section).

To demonstrate how the function works, we’ll begin by looking at the raw data for Kobe’s first 9 shots in Game 1:

kobe$basket[1:9]  ## first 9 shots

## [1] "H" "M" "M" "H" "H" "M" "M" "M" "M"

We can see the first “shooting streak” consists of 2 shots (HM) with only 1 hit, the second streak was 1 shot (M) with 0 hits, while the third was 3 shots (HHM) with 2 hits. The fourth, fifth, and sixth streaks were all 1 shot (M) with 0 hits. Compare this information with the output of the calc_streak function shown below:

calc_streak(kobe$basket[1:9])

## [1] 1 0 2 0 0 0 0
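For readers curious about how such a function could work, here is a minimal sketch. The name calc_streak_sketch is hypothetical, and the approach (padding the sequence with a miss at each end and counting the hits between consecutive misses) is an assumption; the sourced calc_streak may be implemented differently.

```r
## Hypothetical re-implementation for illustration only;
## the sourced calc_streak may differ internally.
calc_streak_sketch <- function(x) {
  hits <- as.integer(x == "H")       # 1 for a hit, 0 for a miss
  padded <- c(0, hits, 0)            # add a "sentinel" miss at each end
  miss_pos <- which(padded == 0)     # positions of all misses (incl. sentinels)
  diff(miss_pos) - 1                 # number of hits between consecutive misses
}

## Same first 9 outcomes as printed above, typed in by hand
first9 <- c("H", "M", "M", "H", "H", "M", "M", "M", "M")
sketch_result <- calc_streak_sketch(first9)
sketch_result
```

In this sketch, the sentinel miss appended to the end of the sequence produces one extra value after the last real miss, which is one way a vector of seven values could arise from the six streaks described above.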

Question #2: What does each element in the vector created by the code calc_streak(kobe$basket[1:9]) represent? Briefly explain how this relates to the meaning of each element that results from the code kobe$basket[1:9]. In doing so, address why these two vectors do not have the same length.

Question #3: Apply the calc_streak function to the entire dataset (not just the first 9 shots as was shown above), storing the result in an object called “kobe_streaks”. Then, use the table function to calculate a frequency table of Kobe’s different shooting streaks. Based upon this table, comment upon whether you believe Kobe may have had a “hot hand” at any point during the finals.

$~$

Applying a Probability Model

Kobe went on several scoring streaks involving either 3 or 4 consecutive made baskets, but this by itself does not prove the “hot hand” is real. Statisticians typically approach this type of question in two steps:

1. Come up with a probability model for an aspect of the data that addresses the research question being posed.
2. Use that probability model to perform statistical inference (i.e., evaluate the compatibility of the observed data with a particular hypothesis).

In this application, we’ll consider a probability model that assumes consecutive shots are independent, and we’ll measure how compatible this model is with Kobe’s performance.

To begin, let $S_1$, $S_2$, $\ldots$ denote a sequence of consecutive shot attempts. Since Kobe hit approximately 44% of his shots, it’s reasonable for our independence model to assume $P(S_i = H) = 0.44$.
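As a quick illustration of the multiplication rule for independent events, the sketch below computes the probability of a hit followed immediately by a miss, a shorter sequence than the one asked about in the questions that follow. The variable names are chosen here for illustration, and the value 0.44 is the model assumption stated above.

```r
p_hit  <- 0.44          # assumed P(S_i = H) under the independence model
p_miss <- 1 - p_hit     # P(S_i = M), since each shot is either a hit or a miss

## Under independence, the joint probability is the product of the
## individual probabilities: P(S_1 = H and S_2 = M) = P(S_1 = H) * P(S_2 = M)
p_hit_then_miss <- p_hit * p_miss
p_hit_then_miss
## [1] 0.2464
```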

Question #4: Using the information given above, find: \[P(S_1 = H \cap S_2 = H \cap S_3 = H \cap S_4 = M)\]

Question #5: Briefly comment on how the probability you calculated in Question #4 might relate to the table you created from the output produced by the calc_streak() function.

$~$

Model Evaluation

In Question #4 you calculated the probability that a shooting streak contains exactly three hits (three made shots followed by a miss), under the assumption of independent shots.

To determine whether our observed data are consistent with this theoretical probability, we’ll perform a simulation study. The general idea is to recreate Kobe’s finals performance many times, with each replication using the independence model to generate the simulated data. In order to do this, we’ll need to introduce a few new R functions.

$~$

Simulating a Random Process

The code below demonstrates how to use R to simulate flipping a fair coin two times:

outcomes <- c("heads", "tails")
sample(outcomes, size = 2, replace = TRUE)

## [1] "tails" "heads"

In this example, the vector outcomes can be viewed as a container holding slips of paper with the labels “heads” and “tails”, and the sample function will draw from this container a certain number of times (size) with or without replacement (notice we asked for replacement here).
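To make the role of the replace argument concrete, the sketch below contrasts the two settings. The seed value is arbitrary and is included only so that results are reproducible; the tryCatch wrapper is used here just to show that the second call fails rather than to suggest a recommended workflow.

```r
set.seed(256)  # arbitrary seed so the draws can be reproduced
outcomes <- c("heads", "tails")

## Without replacement, each slip of paper can be drawn only once, so two
## draws always yield one "heads" and one "tails" (in a random order)
no_repl <- sample(outcomes, size = 2, replace = FALSE)
no_repl

## Asking for more draws than there are slips fails without replacement
too_many <- tryCatch(
  sample(outcomes, size = 3, replace = FALSE),
  error = function(e) "error"
)
too_many

## With replacement, any number of draws is allowed
with_repl <- sample(outcomes, size = 3, replace = TRUE)
with_repl
```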

Question #6: Modify the code above to flip a fair coin 1000 times, storing the results in an object named “coin_results”. Then, use the table function to summarize the results of your simulated set of coin flips.

$~$

Sometimes we’ll want to simulate scenarios where the outcomes aren’t 50-50. This can be done by specifying the probability of each individual outcome within the sample function:

outcomes <- c("heads", "tails")
coin_outcomes <- sample(outcomes, size = 1000, replace = TRUE, prob = c(.8,.2))
table(coin_outcomes)

## coin_outcomes
## heads tails 
##   813   187

Notice the difference in outcomes when specifying an $80\%$ probability of heads and a $20\%$ probability of tails. Additionally, you should recognize that sample will expect you to provide a vector input to the prob argument whose length matches that of the vector outcomes (so we can’t just say prob = .8 here, we instead need to specify probabilities for every outcome we listed).

Note: if the prob argument is not used, sample will assume that all of the outcomes provided are equally likely.
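One way to check that a simulation behaves as intended is to convert the simulated counts to proportions and compare them with the probabilities supplied to prob. The sketch below does this for the 80/20 coin; the seed and the larger sample size of 10,000 are arbitrary choices made for this illustration.

```r
set.seed(256)  # arbitrary seed so the draws can be reproduced
outcomes <- c("heads", "tails")

## One probability per outcome, listed in the same order as `outcomes`
flips <- sample(outcomes, size = 10000, replace = TRUE, prob = c(0.8, 0.2))

## Observed proportions should land close to 0.8 and 0.2
observed <- prop.table(table(flips))
observed
```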

Question #7: Using the code above as a starting point, simulate the results of all 133 shots taken by Kobe during the finals series under the independence model. Then, use the calc_streak and table functions to display your simulated results. (Hint: be sure to use the character strings “H” and “M” to denote your outcomes, otherwise calc_streak won’t work properly).

Question #8: Compare the distribution of the length of simulated streaks you obtained in Question #7 with that of the real streaks you found in Question #3. Do the results of this simulation support or refute the “hot hand” theory? Briefly explain.

$~$

Replication

Simulating an independent shooter once and comparing it to Kobe’s performance doesn’t provide sufficient evidence to determine whether the observed data are compatible with the independence model. Instead, we need to repeat this simulation many times and see if the real data appear to be sufficiently different from what we’d expect under the independence model.

The histogram below displays the number of 3+ hit shooting streaks that occurred in 1,000 simulation repetitions (i.e., there are 1,000 values used to create this histogram, each being the number of 3+ hit streaks in a different simulation):
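A histogram of this kind could be produced along the following lines. This is only a sketch under stated assumptions: the helper count_long_streaks is hypothetical, it counts runs of hits with base R’s rle function rather than with calc_streak, and the seed is arbitrary, so it will not necessarily reproduce the exact figure used in this lab.

```r
set.seed(256)  # arbitrary seed so the simulation can be reproduced

## Hypothetical helper: count runs of consecutive hits of at least `min_hits`
count_long_streaks <- function(shots, min_hits = 3) {
  runs <- rle(shots)  # lengths and values of runs of identical outcomes
  sum(runs$values == "H" & runs$lengths >= min_hits)
}

## Repeat the 133-shot simulation 1,000 times under the independence model,
## recording the number of 3+ hit streaks in each repetition
n_streaks <- replicate(1000, {
  sim_shots <- sample(c("H", "M"), size = 133, replace = TRUE,
                      prob = c(0.44, 0.56))
  count_long_streaks(sim_shots)
})

## Distribution of the number of 3+ hit streaks across repetitions
hist(n_streaks,
     main = "Simulated 3+ hit streaks (independence model)",
     xlab = "Number of 3+ hit streaks in 133 shots")
```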

Question #9: Based upon the histogram, estimate the probability that Kobe has 10 or more shooting streaks of 3+ hits in a sequence of 133 shots (under the independence model).

Question #10: Compare the number of 3+ hit shooting streaks in the observed data with the distribution of such streaks that would be expected if his shots were independent. Do you believe Kobe’s finals performance supports the notion that the “hot hand” is real? Briefly explain.