R for Political Data Science Week 7: The 2020 Twitter Primary

Let’s mine Twitter data to check patterns of speech for differences between Democratic presidential candidates.

By G. Elliott Morris / February 15, 2019



This is part of a series of short posts about politics that seeks to show how we use data science to learn more about the real world. Follow along here.


There are lots of journalists on Twitter. We like to follow political news. While studies have shown that the site is not broadly representative of the US as a whole (something anyone who has spent more than 5 minutes outside the east coast bubble could realize in a heartbeat), we still turn to it for hints about political trends and agenda setting. The authors of a 2011 paper write:

Overall, we found that Twitter users significantly overrepresent the densely populated regions of the U.S., are predominantly male, and represent a highly non-random sample of the overall race/ethnicity distribution.

So let’s just run a quick look at who is tweeting the most about their race for president, what they’re saying, and how users are reacting.

I get these data by using Mike Kearney’s excellent rtweet package:

library(rtweet)
library(tidyverse)  # dplyr, ggplot2, stringr, etc.
library(lubridate)  # ymd()
library(ggthemes)   # scale_color_pander(), scale_fill_pander()
library(knitr)      # kable()

# get twitter handles for each candidate
candidates <- c('@CoryBooker','@PeteButtigieg','@JulianCastro',
                '@JohnKDelaney','@TulsiGabbard','@SenGillibrand',
                '@KamalaHarris','@amyklobuchar','@ewarren')


# get most recent 3200 tweets for each candidate
#tweets_2020_cands <- get_timelines(candidates, n = 3200, retryonratelimit =TRUE)

# (I'm just reading them in since this takes a while)
load(file="../../data_no_export/post/2019_02_15-2020-twitter-primary/tweets.Rdata")

The first question I want to answer is an easy one: who is using the platform most to communicate? I’ve created the plot below to show the total number of tweets each officially declared candidate has sent since January 1, 2019, limiting the sample to (a) original statuses that (b) weren’t replies to other users (such as saying “thanks for your support!” a million and a half times; I’m looking at you, Julian Castro):

# filter dataset to tweets from Jan 1
since_jan <- tweets_2020_cands %>%
  filter(created_at > ymd('2019-01-01'))

# no retweets or replies
since_jan <- since_jan %>%
  filter(is.na(reply_to_status_id),
         is.na(retweet_status_id))

# get cumulative number of tweets
cumul_tweets <- since_jan %>%
  group_by(screen_name) %>%
  summarise(num_tweets = n())


# plot
gg <- ggplot(cumul_tweets,aes(x=reorder(screen_name,num_tweets),
                              y=num_tweets,col=screen_name,fill=screen_name)) +
  geom_col(alpha=0.8) +
  scale_color_pander() +
  scale_fill_pander() +
  theme(legend.position = 'none') +
  coord_flip() +
  labs(title="Kamala Harris is Winning the Social Media Primary",
       subtitle="...for whatever that's worth",
       x="Candidate",
       y="Cumulative Number of Tweets\n(Since Jan. 1, 2019)",
       caption="Source: Tweets scraped using @kearneymw `rtweet` package")
preview(gg,themearg = theme(panel.grid.major.y = element_blank()))

In my opinion, John Delaney, a low-tier candidate who has been running for president since spring 2017 but rarely breaks 2% in national polling, is the surprise here: he sends a lot of tweets. Most seem to be directed at President Trump (we’ll dive into text analysis below), and he communicates a lot on the platform for how little he seems to matter, both to the other candidates and, as I mentioned above, to the real population of Twitter: journalists and media types.
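As a quick sanity check on that claim, one could count the share of each candidate's tweets that mention Trump. This is a sketch of mine, not part of the original post; it assumes the `since_jan` data frame built in the filtering step above:

```r
library(dplyr)
library(stringr)

# Share of each candidate's original tweets whose text mentions "Trump"
# (case-insensitive). Assumes `since_jan` from the chunk above.
trump_share <- since_jan %>%
  group_by(screen_name) %>%
  summarise(share_trump = mean(str_detect(text, regex("trump", ignore_case = TRUE)))) %>%
  arrange(desc(share_trump))

trump_share
```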

Let’s look at what types of words the candidates are using. What subjects? Are they happy or sad? Etc.

library(tidytext)
bing <- get_sentiments("bing")
# initial text cleaning and word tokens
remove_reg <- "&amp;|&lt;|&gt;"

tidy_tweets <- since_jan %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))


# keep only words that appear in the Bing sentiment lexicon
tidy_tweets.bing <- tidy_tweets %>%
  inner_join(bing, by = "word")

# look at frequencies
tidy_tweets.bing %>%
  group_by(screen_name) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  summarise(most_frequent_word = first(word)) %>%
  kable()
| screen_name   | most_frequent_word |
|---------------|--------------------|
| amyklobuchar  | trump              |
| CoryBooker    | criminal           |
| ewarren       | hard               |
| JohnKDelaney  | trump              |
| JulianCastro  | crisis             |
| KamalaHarris  | afford             |
| PeteButtigieg | freedom            |
| SenGillibrand | wrong              |
| TulsiGabbard  | love               |

Looking at each candidate’s most commonly used word offers a simple glimpse into what they’re leaning on in their bid for the presidency. This is of course a crude analysis, but it can be informative in a supplemental way (in other words, we’re not making any predictions here). Notice that “trump” comes up twice: once for John Delaney, who tweets a lot, and once for Amy Klobuchar, who doesn’t tweet quite as much.
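A natural extension (again, my sketch rather than part of the original post) is to look past the single most frequent word and pull each candidate's top three lexicon words, reusing the `tidy_tweets.bing` data frame from above:

```r
library(dplyr)

# Top 3 Bing-lexicon words per candidate, by usage count.
# Assumes `tidy_tweets.bing` from the sentiment-matching step above.
top_words <- tidy_tweets.bing %>%
  count(screen_name, word, sort = TRUE) %>%
  group_by(screen_name) %>%
  top_n(3, n) %>%
  ungroup() %>%
  arrange(screen_name, desc(n))

top_words
```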

If their word usage isn’t offering a revolutionary look into their campaigns, perhaps the balance of positive and negative sentiment in their speech can. A candidate who focuses frequently on criminal justice reform and injustice might be portrayed as running a “darker,” more negative campaign. I’ve simply tallied up the positive words each candidate uses and divided by their total count of sentiment-bearing words:

# grouped positive/negative by word
sentiments_counts <- tidy_tweets.bing %>%
  group_by(screen_name,word) %>%
  count(sentiment) %>%
  arrange(-n)

# positive frequency
pos_neg <-  sentiments_counts %>%
  group_by(screen_name,sentiment) %>%
  summarise(n = sum(n)) %>%
  as.data.frame() %>%
  group_by(screen_name) %>%
  mutate(freq = case_when(sentiment=="positive" ~n/sum(n))) %>%
  filter(!is.na(freq))

# plot
gg <- ggplot(pos_neg,aes(x=reorder(screen_name,freq),
                              y=freq,col=screen_name,fill=screen_name)) +
  geom_col(alpha=0.8) +
  scale_color_pander() +
  scale_fill_pander() +
  theme(legend.position = 'none') +
  coord_flip() +
  labs(title="The 2020 Democratic Candidates Are Pretty\nEvenly Positive/Negative on Twitter",
       subtitle="Their word usage doesn't differ that much",
       x="Candidate",
       y="Share of Sentiment-Bearing Words That Are Positive",
       caption="Source: Tweets scraped using @kearneymw `rtweet` package")
preview(gg)

This sentiment analysis suggests two groups of 2020 candidates: those who are more positive than negative on Twitter and those who are more negative than positive. Tulsi Gabbard, who has faced considerable controversy over her ties to the Assad regime in Syria, trails everyone else in positivity. I wonder why.
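To make those two groups explicit, one could label each candidate by whether a majority of their sentiment-bearing words are positive. A sketch of mine, reusing the `pos_neg` data frame built above (where `freq` is the positive share):

```r
library(dplyr)

# Label candidates as majority-positive or majority-negative.
# Assumes `pos_neg` from the chunk above.
pos_neg %>%
  ungroup() %>%
  mutate(group = if_else(freq > 0.5, "mostly positive", "mostly negative")) %>%
  arrange(desc(freq))
```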

However, categorizing words as only positive or negative is a naive way to analyze emotionality; that is not how people interpret language, and it limits the scale with which we can analyze word usage. We could instead look at how angry, anticipatory, disgusted, fearful, joyful, sad, surprised, and trusting a candidate’s word usage is. Below I’ve calculated the percentage of each candidate’s words that fall into the NRC lexicon’s 8 categories of emotion:

nrc <- get_sentiments("nrc")

# sentiment by candidate
tidy_tweets.nrc <- tidy_tweets %>%
  # match to lexicon
  inner_join(nrc, by = "word")

# count each NRC sentiment by candidate and word
emotions <- tidy_tweets.nrc %>%
  group_by(screen_name,word) %>%
  count(sentiment) %>%
  arrange(-n)

# frequency of each emotion
nrc_grouped <- emotions %>%
  group_by(screen_name) %>%
  count(sentiment, sort = TRUE) %>%
  mutate(freq = n/sum(n)) %>%
  as.data.frame() %>%
  # keep the 8 emotion categories; drop NRC's overall positive/negative tags
  filter(!sentiment %in% c('positive','negative'))

gg <- ggplot(nrc_grouped,aes(x=reorder(screen_name,freq),
                             y=freq,col=screen_name,fill=screen_name)) +
  geom_col(alpha=0.8) +
  facet_wrap(~sentiment,scales = 'free_x',ncol=3,nrow=3) +
  coord_flip() +
  labs(title="A Mixed Bag of Emotions",
       subtitle="Some small differences, but no real outliers, in the 2020 candidates' online emotions",
       x="Candidate",
       y="Percentage of Total Words",
       caption="Source: Tweets scraped using @kearneymw `rtweet` package")
preview(gg, themearg = theme(panel.spacing = unit(2, "lines")))

I don’t know about you, but no clear patterns in these data stand out to me. I personally think this is a wash, save for Gabbard’s outsized usage of disgust words like “hate,” “greed,” and “cruel.” Amy Klobuchar, Pete Buttigieg, and Julian Castro also consistently use words in the “happy” categories (joy and trust) more than the average candidate, which matches their above-average positivity scores.

Overall, it’s hard to know how much word usage or Twitter profiles matter. But it at least offers us a glimpse into some of what sets the candidates apart. For that reason, I think this is valuable. Besides, we always want a good reason to mine Twitter data, right? Right?


