For this assignment, record your answers in an R Markdown file and submit the compiled output (an HTML file in a zipped folder, a PDF, or a Word document).
The data frame diamonds is contained within the ggplot2 package. These data record the attributes of nearly 54,000 diamonds sold by a wholesale online retailer. For this question, your goal is to recreate the graph shown below as closely as possible. A few hints:
alpha = 0.3 is used to give each point 30% opacity.
library(ggplot2) # Make sure you have this package installed
data("diamonds") # loads the built-in "diamonds" data set
\(~\)
The “babynames” package contains a data set documenting the number and frequency of all names that appear at least 5 times within a given year as recorded by the United States Social Security Administration.
The code below will load this dataset. You will likely need to install the package unless you’ve previously used it.
#install.packages("babynames")
library(babynames)
data("babynames")
Create a subset of babynames named my_subset that contains information on the names:
"Ryan", "Jeff", "Shonda", "Jonathan", "Collin", "Anna"
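One common way to keep only the rows that match a set of values is the %in% operator. The sketch below is a minimal illustration on a small made-up data frame (toy_names, keep, and all of the values are hypothetical, not taken from babynames). It uses base R subsetting; dplyr's filter() accepts the same condition.

```r
# Hypothetical toy data; not taken from babynames
toy_names <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  name = c("Ada", "Bo", "Ada", "Cy"),
  n    = c(10, 7, 12, 5)
)

# Illustrative set of names to keep
keep <- c("Ada", "Cy")

# %in% is TRUE for each row whose name appears in `keep`
toy_subset <- toy_names[toy_names$name %in% keep, ]
print(toy_subset)
```

Rows for any name outside the keep vector are dropped, while every year (and any other grouping column) for the retained names is preserved.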
Next, run the ggplot code given below, which seems like it should create a line chart of each name’s frequency by year. What is happening that makes this graph look so horrible? Take a look at the data frame and explain the issue in 1-2 sentences.
ggplot(my_subset, aes(x = year, y = n, color = name)) + geom_line()
\(~\)
Create a new graph that fixes the problem you identified in Part A and appropriately displays the frequency of each name over time.
\(~\)
The data frame economics in the ggplot2 package contains US economic data provided by the US Federal Reserve. For Part A of this question, write code that transforms the data from a wide format to a long format (see Part B for details on how the long formatted data will be used) where each row is an economic outcome (pce, psavert, uempmed, unemploy) on a specific date. For simplicity, you may drop the variables not used in Part B prior to reshaping the data.
library(ggplot2) # Make sure you have this package installed
data("economics") # Puts the "economics" data frame in your environment
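As a reminder of what a wide-to-long reshape looks like, here is a minimal sketch using pivot_longer() from the tidyr package (assumed to be installed). The data frame toy_wide and its columns rate_a and rate_b are hypothetical and only stand in for real measurements.

```r
library(tidyr)  # assumption: tidyr is installed

# Hypothetical wide data: one row per date, one column per measure
toy_wide <- data.frame(
  date   = as.Date(c("2020-01-01", "2020-02-01")),
  rate_a = c(1.5, 1.7),
  rate_b = c(10, 12)
)

# Stack the measure columns: one row per (date, variable) pair
toy_long <- pivot_longer(
  toy_wide,
  cols      = c(rate_a, rate_b),
  names_to  = "variable",
  values_to = "value"
)
print(toy_long)
```

The two wide rows become four long rows, and the former column names end up in a single column that can later be mapped to an aesthetic such as color.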
Using the data frame you created in Part A, write code that uses ggplot to create a line graph which displays the variables uempmed and psavert on the y-axis and date on the x-axis. Use the color aesthetic to differentiate the lines of each variable.
In the space below your code, use your visual to briefly describe whether these variables appear to be related.
\(~\)
The data frame Loblolly in the datasets package records the heights of several Loblolly pine tree seedlings at various ages (time points).
For this question, you should construct a data frame containing the average height of all trees in the sample at each age/time (i.e., one row per age, with average height as a variable).
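A grouped summary of this kind is often written with group_by() and summarize() from dplyr (assumed to be installed). The sketch below uses a made-up data frame; toy_growth, its columns, and mean_height are hypothetical names chosen for illustration.

```r
library(dplyr)  # assumption: dplyr is installed

# Hypothetical toy data: two individuals measured at two ages
toy_growth <- data.frame(
  id     = c("a", "a", "b", "b"),
  age    = c(1, 2, 1, 2),
  height = c(2.0, 5.0, 4.0, 7.0)
)

# One row per age, holding the mean height across individuals
toy_means <- toy_growth %>%
  group_by(age) %>%
  summarize(mean_height = mean(height))
print(toy_means)
```

Because the summary lives in its own data frame, it can be passed to a separate plot layer through that layer's data argument.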
Next, you should use ggplot to construct a line graph displaying the mean height by age (Hint: You shouldn’t be using geom_smooth to make your plot, you should plot the means directly). Color this line red, and increase its thickness using the argument lwd = 2.
Finally, you should add another layer displaying the heights of the individual trees over time. Change the opacity of these lines to 50% using the argument alpha = 0.5.
A sample graph is shown below for your reference.
library(datasets) # Make sure you have this package installed
data("Loblolly")
\(~\)
Sean Lahman compiled a comprehensive database containing pitching, hitting, and fielding statistics from all Major League Baseball games from 1871 through 2016. The database is available in the package Lahman and contains several tables.
library(Lahman)
data("Teams")
Using the Teams data frame in the Lahman package, display the top ten teams in slugging percentage (SLG) since 1969. You should use select so that only the “yearID”, “teamID”, and “SLG” columns are printed.
SLG should be computed as a team’s total bases divided by at bats (the variable “AB”) for that season. To find the total bases, you should assign a value of 1 for singles, 2 for doubles, 3 for triples, and 4 for home runs (summing these values up to get the total).
Hint: The column “H” is the team’s total hits, which can be used to find the number of singles by subtracting the doubles, “X2B”, triples, “X3B”, and home runs, “HR”.
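To make the arithmetic concrete, here is a small worked example of the slugging calculation described above. All of the season totals are made-up numbers for illustration, not values from the Teams data frame, and the variable names are hypothetical.

```r
# Made-up season totals (illustrative only)
hits      <- 1400
doubles   <- 280
triples   <- 30
home_runs <- 190
at_bats   <- 5500

# Singles are the hits that were not doubles, triples, or home runs
singles <- hits - doubles - triples - home_runs

# Weight each hit type by the number of bases it is worth
total_bases <- 1 * singles + 2 * doubles + 3 * triples + 4 * home_runs

slg <- total_bases / at_bats
print(c(singles = singles, total_bases = total_bases, slg = slg))
```

With these numbers there are 900 singles and 2310 total bases, giving a slugging percentage of 0.42.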
Sample output showing only the first 3 teams is printed below for your reference. You should print the top 10.
##   yearID teamID       SLG
## 1   2019    HOU 0.4954570
## 2   2019    MIN 0.4940684
## 3   2003    BOS 0.4908996
For the ten teams you identified in Part A, create a data frame containing the playerID and number of home runs (HR) for the player who had the most home runs for that team in that year. This data frame should contain only 10 rows (1 per team from Part A), and it should have the yearID and teamID for each player. It is acceptable if your data frame is not in the same order as the one in Part A, so long as it contains one entry per team.
Batting information, which includes the number of home runs in a season, can be found in the Batting data frame loaded below:
data("Batting")
Hint: I suggest looking into the top_n() function as part of your data wrangling pipeline for this question. It is compatible with group_by().
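As a rough illustration of how top_n() behaves after group_by(), the sketch below keeps the highest-valued row within each group of a made-up data frame. The names toy_batting, team, player, and hr are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)  # assumption: dplyr is installed

# Hypothetical toy data: several players per team
toy_batting <- data.frame(
  team   = c("X", "X", "Y", "Y", "Y"),
  player = c("p1", "p2", "p3", "p4", "p5"),
  hr     = c(30, 41, 12, 25, 19)
)

# Within each team, keep the row with the largest hr value
toy_leaders <- toy_batting %>%
  group_by(team) %>%
  top_n(1, hr) %>%
  ungroup()
print(toy_leaders)
```

Note that top_n() keeps every row tied for the top value, so a group can return more than one row. Recent dplyr versions mark top_n() as superseded by slice_max(), which behaves similarly and offers a with_ties argument.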