Overview
In this project you will analyze the outcomes of Xavier University men’s basketball games in the 2020-21 season. The specific question you will seek to address is:
Can using individual-level player statistics meaningfully improve a model for Xavier’s margin of victory/loss beyond simply using team-level statistics for that game?
In answering this question, you will first build a satisfactory model using team-level game data, and then explore whether that model can be improved using individual-level game data for one or more players.
Data Sources
All of the data involved in this project were obtained from Sports Reference.
- XU Basketball Team-level Dataset
- Seven players have averaged at least 18 minutes per game this season (through 19 games); you should only consider individual statistics for those players:
Note: You do not need to consider all of these players, but you are expected to explore using the data from at least one of them to improve your initial model.
Guidelines
- Similar to the first midterm, you will present your analysis during class on Thursday 3/18. This presentation should be no longer than 7 minutes.
- You should look to apply the multiple regression concepts we’ve been studying in this project, even if you decide that a simple linear regression involving no individual-level data is the best model.
- You should include 1-3 high quality data visualizations in your presentation. You might also consider including summary statistics or tables.
- Your presentation should thoroughly explain the steps involved in your analysis, beginning with an introduction to the data and the project’s guiding question, and ending with a conclusion that revisits the guiding question.
- You are expected to document your entire analysis and turn in your R code, along with any presentation materials, via Canvas by the end of the day on Thursday 3/18.
Getting Started
- To appropriately answer this project’s guiding question, you should begin by coming up with a model that uses team-level statistics to predict the variable “Margin”. Because Xavier has only played 19 games this season, you will need to work hard to balance accuracy and parsimony in this model. Put more bluntly, you should avoid including too many variables in this model. A commonly cited rule of thumb suggests a ratio of 10:1 for data points to predictors (suggesting a model should contain only about 2 predictors in this application, since 19/10 is roughly 2); recognize that this is merely a guideline and shouldn’t necessarily be strictly adhered to in all circumstances (i.e., your model can have more than 2 predictors, so long as the model is properly justified).
- Once you’ve found a satisfactory model that uses only team-level data, you should then explore whether that model can be improved by using individual-level data from one or more of the players listed above. This will require you to merge that player’s data with the team-level data using the “Date” variable. Be aware that some players did not play in all 19 games, so using data from these players will reduce the already small sample size.
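The two steps above could be sketched in R roughly as follows. This is only an illustration on simulated data, not a recommended model: the data frames `team` and `player`, the player column `PTS`, the three missed games, and the choice of predictors are all assumptions made for the example. Only the names `Margin`, `Date`, `FG.` and `X3PA` come from this page.

```r
# Simulated stand-ins for the real datasets (all values are made up).
set.seed(1)
n_games <- 19
team <- data.frame(
  Date = as.Date("2020-11-25") + seq(0, by = 5, length.out = n_games),
  FG.  = runif(n_games, 0.38, 0.55),    # team field goal percentage
  X3PA = round(runif(n_games, 15, 30))  # team three-point attempts
)
team$Margin <- round(-60 + 150 * team$FG. + rnorm(n_games, sd = 8))

# Hypothetical player game log: this player is assumed to have missed 3 games.
player <- data.frame(
  Date = team$Date[-c(4, 9, 15)],
  PTS  = rpois(n_games - 3, lambda = 12)  # hypothetical points column
)

# Step 1: a team-level model with a small number of predictors.
fit_team <- lm(Margin ~ FG. + X3PA, data = team)
summary(fit_team)

# Step 2: merge on Date. merge() performs an inner join by default, so
# games the player missed are dropped from the merged data.
both <- merge(team, player, by = "Date")
nrow(team)  # all games
nrow(both)  # only games in which the player appeared

# Refit the team-level model on the reduced data so that both models
# are compared on exactly the same games.
fit_team_sub <- lm(Margin ~ FG. + X3PA, data = both)
fit_player   <- update(fit_team_sub, . ~ . + PTS)
summary(fit_player)
```

Refitting the team-level model on the merged rows is a design choice in this sketch: comparing a model fit to all 19 games against one fit to fewer games would mix the effect of the extra predictor with the effect of the lost observations.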
General Recommendations
- I strongly advise you to consider variable transformations to aid in model fit or model interpretation (e.g., log2(Tm) to address rightward skew, or 100*FG. to improve interpretation of the slope coefficient)
- I also encourage you to consider forming your own predictors as functions of the original data (e.g., constructing a new variable FGto3P_attempt_ratio = FGA/X3PA)
- You should use graphs like the scatterplot matrix to assist you in screening over a large number of predictors
- You should use model selection criteria to assist you in comparing many different non-nested models
- You should use ANOVA to formally establish the statistical superiority of one model over another nested sub-model
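As a rough sketch of how these recommendations might look in R, the example below uses simulated data. The data frame `games`, the new column names `log2_Tm` and `FG_pct`, and the particular models being compared are assumptions made for illustration; `Tm`, `FG.`, `FGA`, `X3PA`, `Margin` and `FGto3P_attempt_ratio` are the names used on this page.

```r
# Simulated stand-in for the team-level dataset (all values are made up).
set.seed(2)
n <- 19
games <- data.frame(
  Tm   = round(rnorm(n, mean = 74, sd = 9)),  # team points scored
  FG.  = runif(n, 0.38, 0.55),                # field goal percentage (0-1)
  FGA  = round(runif(n, 50, 68)),             # field goal attempts
  X3PA = round(runif(n, 15, 30))              # three-point attempts
)
games$Margin <- round(-60 + 150 * games$FG. + rnorm(n, sd = 8))

# Transformations and a derived predictor.
games$log2_Tm <- log2(games$Tm)
games$FG_pct  <- 100 * games$FG.  # slope now reads "per percentage point"
games$FGto3P_attempt_ratio <- games$FGA / games$X3PA

# Scatterplot matrix for screening candidate predictors.
pairs(games[, c("Margin", "FG_pct", "log2_Tm", "FGto3P_attempt_ratio")])

# Non-nested models: compare with selection criteria (lower is better).
m_a <- lm(Margin ~ FG_pct, data = games)
m_b <- lm(Margin ~ log2_Tm, data = games)
AIC(m_a, m_b)
BIC(m_a, m_b)

# Nested models: F-test of the smaller model against the larger one.
m_big <- lm(Margin ~ FG_pct + FGto3P_attempt_ratio, data = games)
anova(m_a, m_big)
```

Both the criteria and the F-test assume the models were fit to the same set of games, which matters once player data with missing games are merged in.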
Grading
This project will be evaluated using the same rubric and grading scheme as the first midterm, which is available at this link.