Please document your answers to all homework questions using R Markdown, submitting your compiled output as a zipped .html folder (this is necessary when using plotly).
Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search for the word “ghostbusters”.
## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
data <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")
For Part A, use the stringr package to write code that cleans these data by removing the Unicode values (strings like <U+00A0>). To do this, you should assume that anything enclosed by the characters < and > (along with the brackets themselves) can be removed.
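As a minimal sketch of this kind of cleaning, the example below strips bracketed codes from a small made-up vector. The vector `toy` and the pattern `"<[^>]*>"` are illustrative assumptions, not part of the assignment data, and other patterns would work equally well.

```r
library(stringr)

# Hypothetical toy strings (not taken from Ghostbusters.txt)
toy <- c("Who<U+00A0>you gonna call?",
         "so<U+00A0>good<ed><U+00A0>",
         "no tags here")

# "<[^>]*>" matches a "<", then any run of characters that are not ">",
# then the closing ">"; str_remove_all() deletes every such match
cleaned <- str_remove_all(toy, "<[^>]*>")

print(cleaned)
```

Using `[^>]*` rather than `.*` keeps each match from running past the first closing bracket, which matters when a string contains several codes.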
On Twitter, a user may echo another user’s tweet to share it with their own followers by “retweeting”. In these data, all retweets begin with the letters “RT” followed by “@” and the original user’s Twitter name. For this question, write code that stores retweets in a separate dataset, then use the length function to find the number of tweets in this dataset. Be sure to anchor the regex used to identify retweets.
After excluding retweets, find the number of tweets where “hate” or “hated” (of any capitalization) appear, and the number of tweets where “love”, “loved”, or “looved” (and all variants with more “o”s or other capitalization) appear. Hint: the sum() function can be used to count the number of TRUE elements in a logical vector, which can be used in conjunction with str_detect() to answer this question. You might also find logical negation, achieved using the ! character, to be helpful in creating a subset of non-retweets. We’ve seen this before in class with the command !is.na(...) being used to select cases without missing values.
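The hint above can be illustrated on a small made-up vector. Everything here (the vector `toy` and the specific patterns) is an assumption chosen for demonstration; it is one possible approach, not the required solution.

```r
library(stringr)

# Hypothetical toy tweets (not taken from Ghostbusters.txt)
toy <- c("RT @fan1: I LOVE this movie",
         "I looooved the new cast",
         "Not a fan, I hated it",
         "ART @ the museum was great")

# The ^ anchor matches "RT @" only at the start of a string,
# so the fourth element ("ART @ ...") is not flagged
is_retweet <- str_detect(toy, "^RT @")

# Logical negation keeps the elements that are NOT retweets
originals <- toy[!is_retweet]

# sum() counts the TRUE elements of a logical vector;
# regex(..., ignore_case = TRUE) handles any capitalization
n_love <- sum(str_detect(originals, regex("lo+ved?", ignore_case = TRUE)))
n_hate <- sum(str_detect(originals, regex("hated?", ignore_case = TRUE)))

print(c(love = n_love, hate = n_hate))
```

Note that these patterns match substrings, so a word such as "lovely" or "glove" would also count; adding word boundaries (`\\b`) is one way to tighten them if that matters for your answer.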
The following datasets, sourced from Northeastern University’s Storybench, contain six months of news article titles, dates, and teasers for two news organizations: ABC 7 New York and KCRA in California.
ny_stories = read.csv("https://storybench.org/reinventingtv/abc7ny.csv")
ca_stories = read.csv("https://storybench.org/reinventingtv/kcra.csv")
combined = rbind(data.frame(ny_stories, location = "NY"), data.frame(ca_stories, location = "CA"))
The data frame “combined” contains stories from both organizations with an additional column, “location”, indicating the source.
Find the proportion of stories from each news organization whose teaser includes the name “Donald Trump”.
Using group_by(), summarize(), and appropriate stringr functions, recreate the following table displaying the number of headlines that include the strings “United States” and “Russia” for the two news organizations, grouped according to whether the headline also contains the string “Trump”.
location | Trump | US | Russia |
---|---|---|---|
CA | FALSE | 4 | 43 |
CA | TRUE | 0 | 24 |
NY | FALSE | 6 | 9 |
NY | TRUE | 0 | 8 |
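A grouped count of this kind might be sketched as follows on a made-up data frame. The data frame `toy_news` and its column name `headline` are hypothetical; check the actual column names in the data frame “combined” before adapting it.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; the column name "headline" is an assumption
toy_news <- data.frame(
  location = c("NY", "NY", "CA", "CA", "CA"),
  headline = c("Trump meets Russia envoy",
               "Storm hits coast",
               "Russia talks stall",
               "Trump visits California",
               "Russia and Trump summit"),
  stringsAsFactors = FALSE
)

toy_counts <- toy_news %>%
  # Logical flag used as a grouping variable
  mutate(Trump = str_detect(headline, "Trump")) %>%
  group_by(location, Trump) %>%
  # sum() of a logical vector counts the matching headlines in each group
  summarize(Russia = sum(str_detect(headline, "Russia")), .groups = "drop")

print(toy_counts)
```

Further columns (for example one for “United States”) can be added as extra arguments inside the same `summarize()` call.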
Manipulate the data to determine the total number of capitalized words included in the teasers each month. Which month had the most capitalized words? Which had the fewest?
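One way to count words by month is sketched below on made-up data. It assumes that a “capitalized word” means a word beginning with an uppercase letter, that dates are already stored as Date objects, and that the columns are named `date` and `teaser`; none of these is confirmed by the assignment, so adjust them to the real data.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; column names and date format are assumptions
toy_teasers <- data.frame(
  date = as.Date(c("2017-01-05", "2017-01-20", "2017-02-03")),
  teaser = c("Heavy snow hits Boston",
             "the mayor spoke Today",
             "no capitals here"),
  stringsAsFactors = FALSE
)

caps_by_month <- toy_teasers %>%
  mutate(
    # Two-digit month extracted from the date
    month = format(date, "%m"),
    # Count words that start with an uppercase letter
    n_caps = str_count(teaser, "\\b[A-Z][A-Za-z]*")
  ) %>%
  group_by(month) %>%
  summarize(total_caps = sum(n_caps), .groups = "drop")

print(caps_by_month)
```

If the dates in the real data are stored as text, they would first need converting with `as.Date()` and a format string that matches how they are written.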
Find the number of headlines published on each weekday by these two organizations. Display the number of headlines using a column chart (similar to the one given below).
The Happy Planet Index is an attempt to measure how well different world nations are doing at achieving long, happy, and sustainable lives for their citizens using data compiled from various sources. A description of the dataset’s variables can be found on slide 11 here.
For this question, use the plot_ly function in the plotly package to construct a 3-D scatter plot of “LifeExpectancy” and “GDPperCapita” vs “Happiness” with a fitted linear regression plane (found using lm()) depicting the model Happiness ~ LifeExpectancy + GDPperCapita. Your graph should include hoverable labels displaying the country represented by each data point. You may use the argument hoverinfo = "text" so that these labels only provide the text label you specify (and not the x, y, z coordinates of the point).
HappyPlanet <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
Your final result should look something like the graphic shown below (it does not need to resemble it exactly, but it should be similar):
Note: colorscale = "RdBu" was used for the surface.
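The general recipe for overlaying a regression plane on a 3-D scatter plot can be sketched with simulated data. The data frame `toy`, its columns `x1`, `x2`, `y`, and `label`, and the grid size are all hypothetical stand-ins; only `hoverinfo = "text"` and `colorscale = "RdBu"` come from the instructions above.

```r
library(plotly)

# Simulated stand-in data (not the HappyPlanet dataset)
set.seed(1)
toy <- data.frame(x1 = runif(30, 50, 80), x2 = runif(30, 1000, 40000))
toy$y <- 1 + 0.05 * toy$x1 + 0.00005 * toy$x2 + rnorm(30, sd = 0.3)
toy$label <- paste("Country", seq_len(30))

# Fit the plane y ~ x1 + x2
fit <- lm(y ~ x1 + x2, data = toy)

# Evaluate the fitted plane on a regular grid of predictor values
x1_grid <- seq(min(toy$x1), max(toy$x1), length.out = 20)
x2_grid <- seq(min(toy$x2), max(toy$x2), length.out = 20)
grid <- expand.grid(x1 = x1_grid, x2 = x2_grid)

# plotly surfaces expect z[i, j] to correspond to y[i] and x[j].
# expand.grid() varies x1 fastest, so filling by row puts x2 on the
# rows and x1 on the columns
z_mat <- matrix(predict(fit, newdata = grid),
                nrow = length(x2_grid), ncol = length(x1_grid),
                byrow = TRUE)

p <- plot_ly() %>%
  # Points: hoverinfo = "text" shows only the supplied label
  add_markers(data = toy, x = ~x1, y = ~x2, z = ~y,
              text = ~label, hoverinfo = "text") %>%
  # Fitted plane drawn as a semi-transparent surface
  add_surface(x = x1_grid, y = x2_grid, z = z_mat,
              colorscale = "RdBu", opacity = 0.6, showscale = FALSE)

p
```

The orientation of `z_mat` is the step most easily gotten wrong: if the plane appears tilted the wrong way relative to the points, the matrix likely needs transposing.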