Directions

Please document your answers to all homework questions using R Markdown, submitting your compiled output as a zipped .html folder (this is necessary when using plotly).

\(~\)

Question #1

Twitter is a popular social media network in which users can send and receive short messages (called “tweets”) on any topic they wish. For this question you will analyze data contained in the file Ghostbusters.txt (which can be read using the code below). This file contains 5000 tweets downloaded from Twitter on July 18, 2016, based on a search of the word “ghostbusters”.

## Because the format of Twitter data is different from what we're used to,
## we'll need the "scan" function to read it into R. "scan" will search for particular
## characters and use them to define each element of the object it returns
data <- scan("https://raw.githubusercontent.com/ds4stats/case-studies/master/twitter-sentiment/Ghostbusters.txt", what = "")

Part A

For Part A, use the stringr package to write code that cleans these data by removing the Unicode values (strings like <U+00A0>). To do this, you may assume that anything appearing between the characters < and > can be removed.
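As an illustration of the idea, here is a minimal sketch on a small made-up vector (the object `toy` is hypothetical and is not the homework data). It assumes that a pattern matching "<", then any characters other than ">", then ">" captures the codes to be removed.

```r
library(stringr)

# Hypothetical toy vector standing in for the scanned tweet data
toy <- c("Ghostbusters<U+00A0>was", "great<U+2026>", "no codes here")

# "<[^>]*>" matches "<", any run of characters that are not ">", then ">"
cleaned <- str_remove_all(toy, "<[^>]*>")
cleaned
```

Using `[^>]*` rather than `.*` keeps each match from running past the first closing `>` when a string contains several codes.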

Part B

On Twitter, a user may echo another user’s tweet to share it with their own followers by “retweeting”. In these data, all retweets begin with the letters “RT” followed by “@” and the original user’s Twitter name. For this question, write code that stores the retweets in a separate dataset, then use the length function to find the number of tweets in this dataset. Be sure to anchor the regex used to identify retweets.
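To show what anchoring does, here is a small sketch on made-up strings (the vector `toy` is hypothetical). It assumes a single space separates "RT" from "@"; the `^` anchor restricts the match to the start of each string.

```r
library(stringr)

# Hypothetical toy tweets: only the first one starts with "RT @"
toy <- c("RT @user1: Ghostbusters was fun",
         "I saw Ghostbusters. RT @user2 agrees",
         "ART @museum tonight")

# "^" anchors the pattern to the beginning of the string
is_rt <- str_detect(toy, "^RT @")
retweets <- toy[is_rt]
length(retweets)
```

Without the anchor, the second and third strings would also match, because "RT @" appears later in each of them.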

Part C

After excluding retweets, find the number of tweets where “hate” or “hated” (of any capitalization) appear, and the number of tweets where “love”, “loved”, or “looved” (and all variants with more “o”s or other capitalization) appear. Hint: the sum() function can be used to count the number of TRUE elements in a logical vector, which can be used in conjunction with str_detect() to answer this question. You might also find logical negation, achieved using the ! character, to be helpful in creating a subset of non-retweets. We’ve seen this before in class with the command !is.na(...) being used to select cases without missing values.
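The hint can be sketched on a made-up vector as follows (the object `toy` and the exact patterns are illustrative assumptions, not a prescribed answer). In particular, the word boundaries `\\b` are one possible way to avoid matching words such as "gloves"; whether that is desired is a judgment call.

```r
library(stringr)

# Hypothetical toy tweets
toy <- c("RT @a: I LOVE it", "I hated it", "Loooved every minute",
         "gloves are warm", "Hate is a strong word", "meh")

# Logical negation with "!" keeps only the strings that are not retweets
originals <- toy[!str_detect(toy, "^RT @")]

# regex(..., ignore_case = TRUE) makes the match case-insensitive;
# "o+" allows one or more "o"s and "d?" makes the trailing "d" optional
hate_pat <- regex("\\bhated?\\b", ignore_case = TRUE)
love_pat <- regex("\\blo+ved?\\b", ignore_case = TRUE)

# sum() counts the TRUE values returned by str_detect()
n_hate <- sum(str_detect(originals, hate_pat))
n_love <- sum(str_detect(originals, love_pat))
```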

\(~\)

Question #2

The following datasets, sourced from Northwestern University’s Storybench, contain six months of news article titles, dates, and teasers for two news organizations: ABC 7 New York and KCRA in California.

ny_stories = read.csv("https://storybench.org/reinventingtv/abc7ny.csv")
ca_stories = read.csv("https://storybench.org/reinventingtv/kcra.csv")
combined = rbind(data.frame(ny_stories, location = "NY"), data.frame(ca_stories, location = "CA"))

The data frame “combined” contains stories from both organizations with an additional column, “location”, indicating the source.

Part A

Find the proportion of stories from each news organization whose teaser includes the name “Donald Trump”.
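One way to think about a grouped proportion is that the mean of a logical vector is the fraction of TRUE values. The sketch below uses a made-up data frame; the column name `teaser` is an assumption about the real data and should be checked with names(combined).

```r
library(dplyr)
library(stringr)

# Hypothetical toy data frame with an assumed "teaser" column
toy <- data.frame(
  teaser = c("Donald Trump spoke today", "Weather update",
             "Traffic report", "Donald Trump visits", "Local sports"),
  location = c("NY", "NY", "CA", "CA", "CA")
)

# mean() of a logical vector gives the proportion of TRUE values per group
props <- toy %>%
  group_by(location) %>%
  summarize(prop_trump = mean(str_detect(teaser, "Donald Trump")))
props
```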

Part B

Using group_by(), summarize() and appropriate stringr functions, recreate the following table displaying the number of headlines that include the strings “United States” and “Russia” for the two news organizations, grouped according to whether the headline also contains the string “Trump”.

location  Trump  US  Russia
CA        FALSE   4      43
CA        TRUE    0      24
NY        FALSE   6       9
NY        TRUE    0       8
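The general shape of such a table can be sketched on made-up headlines. The data frame `toy` and the column name `headline` are assumptions for illustration; the counts below are not the ones in the table above.

```r
library(dplyr)
library(stringr)

# Hypothetical toy headlines with an assumed "headline" column
toy <- data.frame(
  headline = c("Trump meets Russia envoy", "United States wins gold",
               "Russia talks continue", "Trump rally in Sacramento",
               "Russia and United States sign deal"),
  location = c("NY", "NY", "CA", "CA", "CA")
)

# A logical grouping variable can be created inside group_by();
# sum() then counts the matching headlines within each group
tab <- toy %>%
  group_by(location, Trump = str_detect(headline, "Trump")) %>%
  summarize(US = sum(str_detect(headline, "United States")),
            Russia = sum(str_detect(headline, "Russia")),
            .groups = "drop")
tab
```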

Part C

Manipulate the data to determine the total number of capitalized words included in the teasers each month. Which month had the most capitalized words? Which had the fewest?
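A possible approach, sketched on made-up data, is to count matches per teaser with str_count() and then total them by month. Several things here are assumptions: the column names `datetime` and `teaser`, the date format shown in the toy strings, and the reading of "capitalized word" as any word that begins with an uppercase letter.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; the real date format may differ
toy <- data.frame(
  datetime = c("July 18, 2017 at 01:52PM", "July 19, 2017 at 09:00AM",
               "August 2, 2017 at 11:15AM"),
  teaser = c("Mayor visits Queens", "a quiet day", "NASA Launches New Rocket")
)

caps_by_month <- toy %>%
  mutate(month = str_extract(datetime, "^[A-Za-z]+"),    # leading month name
         n_caps = str_count(teaser, "\\b[A-Z]\\w*")) %>%  # words starting uppercase
  group_by(month) %>%
  summarize(total_caps = sum(n_caps))
caps_by_month
```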

Part D

Find the number of headlines published on each weekday by these two organizations. Display the number of headlines using a column chart (similar to the one given below).
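The counting and plotting steps might look like the following sketch on made-up data. It assumes the dates have already been converted to Date objects (the real column would first need parsing), and the object names are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with dates already stored as Date objects
toy <- data.frame(
  date = as.Date(c("2017-07-17", "2017-07-17", "2017-07-18", "2017-07-22")),
  location = c("NY", "CA", "NY", "CA")
)

by_day <- toy %>%
  mutate(weekday = weekdays(date)) %>%   # day names depend on the system locale
  count(location, weekday, name = "n_headlines")

# geom_col() draws bars whose heights are the precomputed counts
p <- ggplot(by_day, aes(x = weekday, y = n_headlines, fill = location)) +
  geom_col(position = "dodge")
```

By default the weekdays appear in alphabetical order on the axis; converting `weekday` to a factor with explicit levels would put them in calendar order.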

\(~\)

Question #3

The Happy Planet Index is an attempt to measure how well different world nations are doing at achieving long, happy, and sustainable lives for their citizens using data compiled from various sources. A description of the dataset’s variables can be found on slide 11 here.

For this question, use the plot_ly function in the plotly package to construct a 3-D scatter plot of “LifeExpectancy” and “GDPperCapita” vs “Happiness” with a fitted linear regression plane (found using lm()) depicting the model Happiness ~ LifeExpectancy + GDPperCapita. Your graph should include hoverable labels displaying the country represented by each data point. You may use the argument hoverinfo = "text" so that these labels only provide the text label you specify (and not the x, y, z coordinates of the point).

HappyPlanet <- read.csv("https://remiller1450.github.io/data/HappyPlanet.csv")
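The mechanics of drawing a fitted plane can be sketched on simulated data. Everything about `toy` below is made up, including the `Country` column name and the coefficients used to simulate it. The main idea is to predict the response over a grid of predictor values and reshape the predictions into a matrix for the surface.

```r
library(plotly)

# Hypothetical simulated data standing in for HappyPlanet
set.seed(1)
toy <- data.frame(
  Country = paste("Country", 1:30),
  LifeExpectancy = runif(30, 50, 85),
  GDPperCapita = runif(30, 500, 40000)
)
toy$Happiness <- 1 + 0.06 * toy$LifeExpectancy +
  0.00003 * toy$GDPperCapita + rnorm(30, sd = 0.3)

fit <- lm(Happiness ~ LifeExpectancy + GDPperCapita, data = toy)

# Grid of predictor values over which to evaluate the fitted plane
x_seq <- seq(min(toy$LifeExpectancy), max(toy$LifeExpectancy), length.out = 25)
y_seq <- seq(min(toy$GDPperCapita), max(toy$GDPperCapita), length.out = 25)
grid <- expand.grid(LifeExpectancy = x_seq, GDPperCapita = y_seq)

# A surface trace reads z[i, j] as the height at (x[j], y[i]); expand.grid
# varies LifeExpectancy fastest, so the predictions are filled in by row
z_mat <- matrix(predict(fit, newdata = grid),
                nrow = length(y_seq), ncol = length(x_seq), byrow = TRUE)

p <- plot_ly() %>%
  add_markers(data = toy, x = ~LifeExpectancy, y = ~GDPperCapita,
              z = ~Happiness, text = ~Country, hoverinfo = "text") %>%
  add_surface(x = x_seq, y = y_seq, z = z_mat,
              colorscale = "RdBu", opacity = 0.6)
```

Printing `p` in an R Markdown chunk renders the interactive figure. The orientation of `z_mat` is the step most worth double-checking, since a transposed matrix still draws a plane but in the wrong position.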

Your final result should look something like the graphic shown below (it does not need to resemble it exactly, but it should be similar):

Note: colorscale = "RdBu" was used in the surface.