Homework #2

Directions

For this assignment, record your answers in an R Markdown file and submit the compiled output (an HTML file in a zipped folder, a PDF, or a Word document).

Question #1

The data frame diamonds is contained within the ggplot2 package. These data record the attributes of almost 54,000 diamonds sold by a wholesale online retailer. For this question, your goal is to recreate the graph shown below as closely as possible. A few hints:

Pay attention to the scales, theme, and labels.
The argument alpha = 0.3 is used to give each point 30% opacity.
The default colors are used.

library(ggplot2)  # Make sure you have this package installed
data("diamonds") # loads the built-in "diamonds" data set
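As a small illustration of the alpha hint, here is a minimal sketch on the built-in mtcars data, used only as a stand-in and not as the graph you are asked to recreate. The object name p is an assumption made for this example.

```r
library(ggplot2)

# Stand-in data: mtcars ships with base R and is unrelated to the assignment
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.3)  # alpha set outside aes(): every point gets 30% opacity

p
```

Because alpha is set outside of aes(), it is a fixed value applied to every point rather than a mapping to a variable. Overlapping points then appear darker, which helps when many observations share similar values.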

Question #2 - Part A

The “babynames” package contains a data set documenting the number and frequency of all names that appear at least 5 times within a given year as recorded by the United States Social Security Administration.

The code below will load this dataset. You will likely need to install the package unless you’ve previously used it.

#install.packages("babynames")
library(babynames)

data("babynames")

Create a subset of babynames named my_subset that contains information on the names: "Ryan", "Jeff", "Shonda", "Jonathan", "Collin", "Anna"

Next, run the ggplot code given below, which seems like it should create a line chart of each name’s frequency by year. What is happening that makes this graph look so horrible? Take a look at the data frame and explain the issue in 1-2 sentences.

ggplot(my_subset, aes(x = year, y = n, color = name)) + geom_line()

Question #2 - Part B

Create a new graph that fixes the problem you identified in Part A and appropriately displays the frequency of each name over time.

Question #3 - Part A

The data.frame economics in the ggplot2 package contains US economic data provided by the US Federal Reserve.

For Part A of this question, write code that transforms the data from a wide format to a long format, where each row is an economic outcome (pce, psavert, uempmed, unemploy) on a specific date. See Part B for details on how the long-format data will be used. For simplicity, you may drop the variables not used in Part B prior to reshaping the data.

library(ggplot2)  # Make sure you have this package installed
data("economics") # Puts the "economics" data frame in your environment
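For reference, a wide-to-long reshape can be sketched on a small made-up data frame. This assumes the tidyr package is installed. The data frame wide and its columns temp and precip are hypothetical and are not part of the economics data.

```r
library(tidyr)

# Hypothetical wide data: one row per date, one column per measurement
wide <- data.frame(
  date   = as.Date(c("2020-01-01", "2020-02-01")),
  temp   = c(31.5, 35.2),
  precip = c(2.1, 1.4)
)

# Stack the measurement columns so each row is one (date, variable) pair
long <- pivot_longer(
  wide,
  cols      = c(temp, precip),
  names_to  = "variable",
  values_to = "value"
)

long
```

The long version has one row for every combination of date and measurement, with the former column names stored in variable. This is the shape ggplot expects when a single aesthetic such as color should distinguish several series.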

Question #3 - Part B

Using the data frame you created in Part A, write code that uses ggplot to create a line graph that displays the variables uempmed and psavert on the y-axis and date on the x-axis. Use the color aesthetic to differentiate the lines for each variable.

In the space below your code, use your visual to briefly describe whether these variables appear to be related.

Question #4

The data.frame Loblolly in the datasets package records the heights of several Loblolly pine tree seedlings at various ages (time points).

For this question, you should construct a data frame containing the average height of all trees in the sample at each age/time (i.e., 1 row per age with average height as a variable).

Next, you should use ggplot to construct a line graph displaying the mean height by age (Hint: You shouldn’t be using geom_smooth to make your plot; you should plot the means directly). Color this line red, and increase its thickness using the argument lwd = 2.

Finally, you should add another layer displaying the heights of the individual trees over time. Change the opacity of these lines to 50% using the argument alpha = 0.5.

A sample graph is shown below for your reference.

library(datasets)  # Part of base R, so no installation is needed
data("Loblolly")
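The "one row per group with an average" step described above can be sketched on a made-up data frame, assuming the dplyr package is installed. The names toy, plant, week, and weekly_means are hypothetical and are chosen so that this is not the Loblolly solution itself.

```r
library(dplyr)

# Hypothetical data: two plants, each measured at two time points
toy <- data.frame(
  plant  = c("a", "a", "b", "b"),
  week   = c(1, 2, 1, 2),
  height = c(2.0, 5.0, 4.0, 7.0)
)

# Collapse to one row per week, averaging over plants
weekly_means <- toy %>%
  group_by(week) %>%
  summarize(mean_height = mean(height))

weekly_means
```

The summarized data frame can be passed to its own geom layer through that layer's data argument, which is one way to draw group means on top of the individual observations.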

Question #5 - Part A

Sean Lahman compiled a comprehensive database containing pitching, hitting, and fielding statistics from all Major League Baseball games from 1871 through 2016. The database is available in the package Lahman and contains several tables.

library(Lahman)
data("Teams")

Using the Teams data frame in the Lahman package, display the top ten teams in slugging percentage (SLG) since 1969. You should use select so that only the “yearID”, “teamID”, and “SLG” are printed.

SLG should be computed as a team’s total bases divided by at bats (the variable “AB”) for that season. To find the total bases, you should assign a value of 1 for singles, 2 for doubles, 3 for triples, and 4 for home runs (summing these values up to get the total).

Hint: The column “H” is the team’s total hits, which can be used to find the number of singles by subtracting the doubles, “X2B”, triples, “X3B”, and home runs, “HR”.
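Written out as a formula, the definition and hint above amount to the following, where the symbols are the column names from the Teams data frame:

\[
\text{singles} = H - X2B - X3B - HR, \qquad
\text{SLG} = \frac{1 \cdot \text{singles} + 2 \cdot X2B + 3 \cdot X3B + 4 \cdot HR}{AB}
\]

As a check on the arithmetic with made-up numbers (not taken from any real team), suppose H = 1500, X2B = 300, X3B = 30, HR = 200, and AB = 5500. Then singles = 1500 - 300 - 30 - 200 = 970, total bases = 970 + 600 + 90 + 800 = 2460, and SLG = 2460 / 5500 ≈ 0.447.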

Sample output showing only the first 3 teams is printed below for your reference. You should print the top 10.

##   yearID teamID       SLG
## 1   2019    HOU 0.4954570
## 2   2019    MIN 0.4940684
## 3   2003    BOS 0.4908996

Question #5 - Part B

For the ten teams you identified in Part A, create a data frame containing the playerID and number of home runs (HR) for the player who had the most home runs for that team in that year. This data frame should contain only 10 rows (1 per team from Part A), and it should have the yearID and teamID for each player. It is acceptable if your data frame is not in the same order as the one in Part A, so long as it contains one entry per team.

Batting information, which includes the number of home runs in a season, can be found in the Batting data frame loaded below:

data("Batting")

Hint: I suggest looking into the top_n() function as part of your data wrangling pipeline for this question. It is compatible with group_by().
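To show how top_n() behaves after group_by(), here is a minimal sketch on a made-up data frame, assuming the dplyr package is installed. The names scores, team, player, points, and top_scorers are hypothetical.

```r
library(dplyr)

# Hypothetical data: several players per team with a numeric score
scores <- data.frame(
  team   = c("x", "x", "x", "y", "y"),
  player = c("p1", "p2", "p3", "p4", "p5"),
  points = c(10, 25, 7, 14, 3)
)

# Within each team, keep the row(s) with the largest value of points
top_scorers <- scores %>%
  group_by(team) %>%
  top_n(1, points) %>%
  ungroup()

top_scorers
```

Note that top_n() keeps ties, so a group can return more than one row when two entries share the maximum. It is worth checking the row count of your result. In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which offers a with_ties argument to control this behavior.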