LOESS vs Bayesian GAM for Finding Trends in Data

Oct. 10, 2018

Categories: Rstats US Politics Tags: Polls Bayesian stan R


Prompted by new data from our weekly polling with YouGov, I shared a graph of the growing gender gap in American politics yesterday:

The gender gap in the generic ballot grew from 35 to 45 pts (!!) among college-educated voters in our polling from last week to this week. Seems to be corresponding with a better overall generic ballot for Democrats. It is not a good long-term strategy to tick off suburban women. pic.twitter.com/3PNg6LVhIn

— G. Elliott Morris (@gelliottmorris) October 9, 2018

The growing divide in the preference for Democrats over Republicans among men versus women — here shown among only college-educated Americans — is a striking development in US politics. An electorate divided as deeply by gender as other demographics, like educational attainment or race, would certainly push us into uncharted territory. Thus, it’s important to get the trend-detection right.

To that end, a discussion of methods follows.

Typically when I am analyzing time-series polling data, I use a method called LOESS to smooth the estimates and adjust for noise. LOESS, short for local regression (and akin to locally weighted scatterplot smoothing, or LOWESS), is a form of nonparametric regression that can be used to uncover and explore nonlinear trends in data. I use it primarily to show trends on scatterplots when the relationship is clearly not linear.
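For readers who want to try LOESS directly, here is a minimal sketch in base R. The data and variable names are illustrative, not from the polling analysis in this post:

```r
# A minimal sketch of LOESS on synthetic nonlinear data using base R's loess().
# All names and numbers here are illustrative, not from the polling analysis.
set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)        # a noisy nonlinear trend
fit_wide   <- loess(y ~ x, span = 0.75)   # ggplot2's default span for loess
fit_narrow <- loess(y ~ x, span = 0.25)   # more reactive to local bumps
head(fitted(fit_narrow))                  # smoothed values at the observed x
```

The `span` argument controls how much of the data each local fit uses, which is the main dial for how reactive the trend line is.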

However, LOESS is not always the optimal pick. As people pointed out after I posted the original graph, LOESS smoothing can be an imperfect representation of trends and uncertainty in polling data. To be clear, there is no single answer for which technique is best. Here, I compare my typical LOESS approach with a much more sophisticated one: a Bayesian implementation of generalized additive models (GAMs).

Instead of using the LOESS smoother built into the ggplot2 package (Wickham 2016), I created my own using a Bayesian GAM smoother with the brms package (Bürkner 2017) as an interface to stan to fit the model.

The data were simple enough: I used the Economist/YouGov polling tables to grab the share of men and women who chose Democrats over Republicans in the House generic ballot on each polling date. The standard deviation I supply to the model is 0.10, about where we would expect it to be for subgroup estimates (n ≈ 400) of marginal data.
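As a rough back-of-the-envelope check on that 0.10 assumption (my own sketch, not part of the original pipeline), the pure sampling error of a Dem-minus-Rep margin for a subgroup of this size can be computed directly from the multinomial variance:

```r
# For a subgroup of n respondents with Dem and Rep shares p_d and p_r,
# the multinomial sampling variance of the margin p_d - p_r is
# (p_d*(1-p_d) + p_r*(1-p_r) + 2*p_d*p_r) / n.
margin_se <- function(p_d, p_r, n) {
  sqrt((p_d * (1 - p_d) + p_r * (1 - p_r) + 2 * p_d * p_r) / n)
}
margin_se(0.45, 0.45, 400)  # roughly 0.047: pure sampling error
```

The assumed 0.10 is roughly double the pure sampling error, which leaves headroom for design effects and other non-sampling error in the subgroup estimates.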

head(data.brms, 10)
   gender poll_date  dem_margin sdy
   Female         0  0.20350786 0.1
   Female         7  0.16189866 0.1
   Female        14  0.18249686 0.1
   Female        21  0.16291266 0.1
   Female        28  0.14785891 0.1
   Female        35 -0.01650884 0.1
   Female        42  0.16244296 0.1
   Female        49  0.21555512 0.1
   Female        56  0.25216574 0.1
   Female        63  0.39003819 0.1

The GAM is fit with Stan, which draws posterior samples via Markov chain Monte Carlo; the likelihood is Gaussian. I specify the model as follows: the Democrats' margin is predicted by the subgroup's gender (men or women) plus a tensor-product smooth of the poll date by gender, with the known polling uncertainty supplied through se(). The other arguments are passed on to Stan.

library(brms)

fit <- brm(bf(dem_margin | se(sdy, sigma=TRUE) ~
                gender + 
                t2(poll_date,by=gender)) ,
           data = data.brms,
           family = gaussian(), 
           cores = 4, seed = 17,
           iter = 4000, warmup = 1000, thin = 10, refresh = 0,
           control = list(adapt_delta = 0.99))

The summary() of the model is as follows:

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: dem_margin | se(sdy, sigma = TRUE) ~ gender + t2(poll_date, by = gender) 
   Data: data.brms (Number of observations: 156) 
Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 10;
         total post-warmup samples = 1200

Smooth Terms: 
                               Estimate Est.Error l-95% CI u-95% CI Eff.Sample
sds(t2poll_dategenderFemale_1)     0.11      0.15     0.00     0.52        996
sds(t2poll_dategenderMale_1)       0.13      0.16     0.00     0.53       1200
                               Rhat
sds(t2poll_dategenderFemale_1) 1.00
sds(t2poll_dategenderMale_1)   1.00

Population-Level Effects: 
                           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept                      0.23      0.01     0.21     0.26       1200 1.00
genderMale                    -0.18      0.02    -0.22    -0.15       1200 1.00
t2poll_date:genderFemale_1    -0.03      0.01    -0.06    -0.00       1200 1.00
t2poll_date:genderMale_1       0.01      0.01    -0.01     0.04       1200 1.00

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.01      0.01     0.00     0.02       1174 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

And if we draw from the posterior predictive distribution, we see that the model does a rather good job of predicting the data, if perhaps a little too uncertain (notice the fat tails of y_rep).

pp_check(fit,nsamples=50)

Ultimately, what we want is a plot that looks similar to the original but draws its trend based off the Bayesian GAM. To get the smoothed lines, I use the marginal_effects() function from brms, and then do some wrangling to set up two data frames for my plot:

# extract the marginal effects
msms <- marginal_effects(fit,method='fitted')

# make dataset of polls and trendline for the ggplot
trendline <- msms[[3]] %>% as.data.frame()

# add points to trendline dataset
trendline_for_gg <- trendline %>%
  mutate(poll_date = as.character(poll_date + min(turnout_byparty.ts$poll_date))) %>%
  select(-c(dem_margin)) %>%
  group_by(poll_date,gender) %>%
  summarise_all(first) %>%
  as.data.frame() %>%
  mutate(poll_date = ymd(poll_date))

points_for_gg <- turnout_byparty.ts %>%
  mutate(poll_date = as.character(poll_date),
         dem_margin = TheDemocraticPartyCandidate-TheRepublicanPartyCandidate) %>%
  mutate(poll_date = ymd(poll_date))

Here is the ggplot2 code to make the plot, which draws the GAM smooth as a solid line with a colored ribbon alongside the LOESS trend, drawn as a dotted line with a grey ribbon, for each gender.

# plot
ggplot() +
  geom_point(data=points_for_gg,
             aes(x=poll_date,y=dem_margin,
                 col=gender)) +
  geom_smooth(data=points_for_gg, aes(x=poll_date,y=dem_margin, group=gender),
              method='loess', linetype=2) +
  geom_hline(yintercept = 0,linetype=2,col='gray60') +
  geom_line(data=trendline_for_gg,
            aes(x=poll_date,y=estimate__,
                col=gender),size=1) +
  geom_ribbon(data=trendline_for_gg,
              aes(x=poll_date,ymin=lower__,ymax=upper__,
                  fill=gender),alpha=0.3) +
  labs(title="Comparing LOESS and Bayesian GAM in R",
       subtitle="Local regression can be more uncertain and reactive, but it depends on your standard errors",
       x="Date",
       y="Preference for Democrats over Republicans\nin the 2018 Generic Ballot" ,
       caption="Economist/YouGov Weekly Polls (public tables)"
  ) +
  scale_y_continuous(labels=scales::percent_format(accuracy = 2)) +
  scale_x_date(date_labels = "%b '%y") +
  scale_color_manual(" ",values=c("Female"="#85C1E9","Male"="#5499C7")) +
  scale_fill_manual(" ",values=c("Female"="#85C1E9","Male"="#5499C7"))

I notice two things. First, the LOESS trend is a little more sensitive to the data. Second, while the patterns generally trace each other quite well, there are bumps around the fall of 2017 and the spring of 2018 where the LOESS diverges from the Bayesian GAM. This could be due to the relatively high standard deviation supplied to the brms model, or to the short default span of the LOESS.
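To see how the span drives that reactivity, here is a small base-R sketch on a synthetic, poll-like series (not the real polling data; geom_smooth() uses span = 0.75 by default for its loess method):

```r
# Synthetic illustration of how the loess span controls reactivity.
set.seed(1)
x <- 1:60
y <- 0.2 + cumsum(rnorm(60, sd = 0.02))    # a wandering, poll-like series
f_default <- loess(y ~ x, span = 0.75)
f_short   <- loess(y ~ x, span = 0.30)
# the shorter span hugs the local bumps more tightly:
mean(abs(fitted(f_short) - y)) < mean(abs(fitted(f_default) - y))
```

A shorter span trades smoothness for responsiveness, which is one plausible source of the divergence between the two trend lines above.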

Next steps:

  1. Dynamic standard error in the model: right now, I use the same standard error for the y variable at every value of x. I am not sure this is correct, and my original instinct was to supply the model with y | se(y) ~ x + t2(x, by=x2), but that produced some funky results. This part of the trend-fitting is very much in progress. If you have an idea of how to solve this problem, ping me!
  2. Some more fine-tuning: I’m a novice Bayesian, so not all parameters are optimized (especially in the algorithmic sense) — it is the constant struggle of a data scientist to fit the best model possible!
  3. Make more trends! (Like for voter turnout.)
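One hedged sketch for next-step 1 (all column names and numbers below are invented for illustration): compute a per-poll standard error from each poll's subgroup shares and sample size, then pass that varying column through se() exactly as before, since se() accepts a column that differs by observation:

```r
# Invented example data: two polls with subgroup shares and sample sizes.
polls <- data.frame(dem_share = c(0.55, 0.52),
                    rep_share = c(0.38, 0.42),
                    n         = c(412, 389))
# Per-poll sampling SE of the margin from the multinomial variance.
polls$sdy <- with(polls,
  sqrt((dem_share * (1 - dem_share) + rep_share * (1 - rep_share) +
        2 * dem_share * rep_share) / n))
# Then the same brms call as above would use the varying column:
# fit <- brm(bf(dem_margin | se(sdy, sigma = TRUE) ~
#                 gender + t2(poll_date, by = gender)),
#            data = polls, family = gaussian(), ...)
```

This keeps the model structure unchanged and only replaces the flat 0.10 with an observation-level measurement error.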

References

Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
