R.1 — Introduction, and Averaging Polls of the 2016 U.S. Presidential Election

May 10, 2018

Categories: R-Posts Tags: R Tidyverse Politics



Introduction to the Series


Political analysis manifests itself via a variety of methods. There’s the conventional journalist, often embedded in campaigns or tracking a specific grassroots movement to assess electability or traction. There’s the political scientist, who spends years learning a wide range of theory and methods to solve some of the most salient problems of our time. Campaign analysts, too, spend hundreds of hours combing through individual-level voter data and private polls to discuss strategy. Then there’s the election analyst, who turns a variety of survey data, economic trends, and other indicators into predictions of candidate performance and extrapolations of causality.

The Crosstab (coming up on its second birthday!) has so far been much of the latter: a marriage between methods learned in quantitative social and political science, data journalism, and (by necessity) data science. Suffice it to say, the work that goes into creating the posts, models, and analysis featured on the site involves a very healthy amount of computer programming. Whether it’s finding and transforming data from online sources or writing sophisticated simulation models to predict phenomena, a variety of tools go into the insights I present. I code all of my work in the R programming language — a very powerful language for statistical analysis of data big and small.

The community of online programmers and bloggers that write about R (#rstats on Twitter, for the uninitiated) is quite large. You can learn almost anything from the folks who contribute their content to the masses (especially at the excellent R-Bloggers aggregation website and DataCamp course library). What is lacking are solutions — both simple and complex — to common problems that political analysts may face, written in the R language.

In this vein, I’m introducing a series (of undetermined length) of blog posts about generalizable uses for the R techniques I’ve learned over the years. My hope is that this category of posts will grow into a large subset of this blog’s content that (A) teaches new analysts, journalists, and professors alike a toolset they can use to better their own approaches to coding solutions; and (B) prompts experienced users to think of new ways to “tidy” their code, improve past approaches, etc. Along the way, I’m sure I’ll learn something from you all, too.

As a note: these posts are not designed to give you a complete introduction to programming in R. Rather, I assume most readers will know a few basic concepts: how to write code in RStudio, what a “data type” is, the difference between a vector and a data frame, etc. For that info, direct your attention to this excellent stats101 course from Stanford.
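If you need a refresher on that last distinction, here’s a minimal illustration (the object names and numbers are made up for the example):

# A vector is a one-dimensional collection of values of a single type
candidates <- c("Clinton", "Trump", "Johnson")

# A data frame is a table whose columns are equal-length vectors
results <- data.frame(candidate = candidates,
                      pct = c(48.2, 46.1, 3.3))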

Today’s topic? An analysis of polls from the 2016 presidential election. We’ll learn (A) where to find the data (for free!) online, (B) how to wrangle the data into usable formats, and (C) how to average polls over time. Because the best way to learn how to code is to dive right in (and in small chunks), let’s get down to business!


Averaging Polls from the 2016 Election

Part 1. Getting the polls

To start, we’re going to download the data for our analysis from HuffPost’s Pollster website, an excellent page that aggregates individual polls from a variety of sources. The R code for this is relatively simple thanks to the read_tsv() function from the readr package: once we load the package, a single call imports all the polls from the 2016 election into our environment.

library(readr)

# Read the tab-separated poll file straight from HuffPost Pollster
polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

With this simple block of code, we now have 1905 polls of the Clinton vs. Trump (vs. Johnson) presidential race.
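If you’d like to confirm the download before moving on, a quick look at the data frame’s dimensions with base R’s dim() function (rows, then columns) does the trick:

dim(polls_2016)  # 1905 rows, one per poll reading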


Part 2. Wrangling the polls

However, we can’t go off to the races just yet. We have to make sure that the polls we’re grabbing are all of useful “populations.” That is, we don’t want polls that just ask Republicans who they’re voting for because Democrats and Independents can cast ballots too. We wouldn’t be getting a reliable picture of the electorate if we just looked at those numbers.

There are three variants of polls that ask the entire electorate (or holistic subsets thereof) how they will vote: polls of “all adults,” of “likely voters,” and of “registered voters.” We will apply this restriction to our dataset using the filter() function from the dplyr package (which can also be found in the tidyverse set of packages).

library(dplyr)

# Keep only polls of electorate-wide populations
polls_2016 <- polls_2016 %>%
  filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))

Now, calling the nrow() function on our polls_2016 dataset, we can check whether the filter actually took out the observations from the undesired polls. We can double-check with the unique() function, which lists the distinct values of the sample_subpopulation variable.

nrow(polls_2016)
## [1] 816
unique(polls_2016$sample_subpopulation)
## [1] "Likely Voters"     "Registered Voters" "Adults"

Before we finish the wrangling stage, we also want to make sure R knows that the end_date variable, which records the last day on which a pollster fielded a poll to respondents, is read as a date object rather than a string. That’s very easy to do with various functions from the lubridate package. Specifically, we want to use the ymd() function, which turns strings in yyyy-mm-dd format into date objects. We’ll pack this transformation into dplyr’s mutate() function, which “mutates” (adds or changes) variables.

library(lubridate)

# Parse yyyy-mm-dd strings into proper Date objects
polls_2016 <- polls_2016 %>%
  mutate(end_date = ymd(end_date))
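A quick check confirms the conversion took:

class(polls_2016$end_date)  # should now read "Date"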

One final change we have to make: expanding the data to cover every day in the campaign. Because there isn’t a poll released every day, but we eventually want our moving average to look at readings over the past two weeks (not the past 14 polls), we have to change our dataset to include “polls” (even if they don’t exist) for every day from the beginning to end of our data. Rest assured we’re not creating polls — the numbers for Trump and Clinton get set to NA here.

To do this, we’re going to use dplyr’s right_join() function to combine our polls_2016 dataset with a one-variable dataset, end_date, containing the sequence of dates from the first to the last day on which polls were actually fielded. Again, this is just two simple lines of code.

# Join onto a complete daily sequence so every day of the campaign appears
polls_2016 <- polls_2016 %>%
  right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date), max(polls_2016$end_date), by = "days")))
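
To verify that the join filled in the missing days, a quick base R check like the following should do (the NA count is exactly the number of poll-less days we just added):

sum(is.na(polls_2016$Clinton))  # rows with no poll have NA readings
range(polls_2016$end_date)      # first and last day covered by the data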


Part 3. Averaging the polls

Now that we’re sure the data we’re using (1) consists of the right observations, (2) is in the right format, and (3) includes readings (even if they’re NA) for every day of the campaign, we can create a rolling average over the past 14 days of polls across the 2016 campaign cycle.

However, we have a final bit of wrangling to do. Recall that we want the rolling averaging function we use (rollapply() from the zoo package) to average over the past 14 days of polls. The first step was making sure there is a date for every day of the campaign in our end_date variable. The second is ensuring there are not multiple readings for any single day in our data.

To do so, we use dplyr’s group_by() and summarise() functions to trim the data down to daily readings. This is easy: for each day on which multiple polls were released, we want the Clinton and Trump variables to be the average, or mean(), of those polls. Again, this takes just four simple lines of code.

# Collapse to one reading per day: average all polls ending on the same date
polls_2016 <- polls_2016 %>%
  group_by(end_date) %>%
  summarise(Clinton = mean(Clinton),
            Trump = mean(Trump))

We can use base R’s head() function to look at our new data, saved back to the polls_2016 object.

head(polls_2016)
## # A tibble: 6 x 3
##   end_date   Clinton Trump
##   <date>       <dbl> <dbl>
## 1 2015-05-26      50    32
## 2 2015-05-27      NA    NA
## 3 2015-05-28      NA    NA
## 4 2015-05-29      NA    NA
## 5 2015-05-30      NA    NA
## 6 2015-05-31      NA    NA

Now for the finale: actually averaging the polls! The tool we want to use here is the rollapply() function from the zoo package, which takes a few arguments from us: the variable we’re averaging, the width of the window over which to average (in this case, the number of days), the function to apply to each window (ours ignores NA days), how often to compute the average, whether to allow partial windows at the edges, what to fill non-computed values with, and which way to align the window.

Again, since we’re using functions written by other developers, this is easy to do in just four lines of code! We will mutate() two new variables: Clinton.Margin, Clinton’s lead over Trump in each daily reading, and Clinton.Avg, which stores a rolling average of that margin across the 2016 U.S. presidential election campaign.

library(zoo)

rolling_average <- polls_2016 %>%
  mutate(Clinton.Margin = Clinton - Trump,
         Clinton.Avg = rollapply(Clinton.Margin,
                                 width = 14,  # average over the past 14 days
                                 FUN = function(x){mean(x, na.rm = TRUE)},  # ignore NA days
                                 by = 1,  # recompute the average every day
                                 partial = TRUE, fill = NA,
                                 align = "right"))  # window looks backward in time

Here’s what the tail() (the opposite of head()) of our daily averages dataset looks like now:

tail(rolling_average)
## # A tibble: 6 x 5
##   end_date   Clinton Trump Clinton.Margin Clinton.Avg
##   <date>       <dbl> <dbl>          <dbl>       <dbl>
## 1 2016-11-02    45.4  42.9           2.56        3.34
## 2 2016-11-03    45.5  42.9           2.54        3.24
## 3 2016-11-04    46.2  42.7           3.56        3.34
## 4 2016-11-05    46.3  43.1           3.20        3.20
## 5 2016-11-06    46.4  42.6           3.74        3.12
## 6 2016-11-07    44.6  42.4           2.2         2.75

We can use the ggplot2 package to graph this trend and get a better look at the polls, passing our data frame to the ggplot() function and adding a line for the average and points for the polls themselves with the geom_line() and geom_point() functions:

library(ggplot2)

ggplot(rolling_average) +
  geom_point(aes(x = end_date, y = Clinton.Margin)) +  # individual polls
  geom_line(aes(x = end_date, y = Clinton.Avg), col = "blue")  # rolling average
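
If you want to label the chart before sharing it, a sketch along these lines (using ggplot2’s labs() function; the label text is just a suggestion) adds axis names and a title:

ggplot(rolling_average) +
  geom_point(aes(x = end_date, y = Clinton.Margin)) +
  geom_line(aes(x = end_date, y = Clinton.Avg), col = "blue") +
  labs(x = "Poll end date",
       y = "Clinton margin (points)",
       title = "Clinton vs. Trump: polls and a 14-day rolling average")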

Voilà: a moving average displayed over every poll of the 2016 presidential election. With just three simple steps, we turned free data available online into an incredibly insightful analysis, in code that runs in a fraction of a second. That’s the beauty of a fast, powerful, well-sourced language such as R.

All told, we wrote this entire analysis — from data gathering to wrangling to plotting — in less than 40 lines of code.

In the next post, we’ll familiarize ourselves with predictive models and errors via the power of presidential election polls as far back as the 1970s.



Have any questions, comments, or concerns? You know where to find me.

— Elliott





