LOESS vs Bayesian GAM for Finding Trends in Data
Oct. 10, 2018Rstats US Politics Polls Bayesian stan R
Prompted by new data from our weekly polling with YouGov, I shared a graph of the growing gender gap in American politics yesterday:
The gender gap in the generic ballot grew from 35 to 45 pts (!!) among college-educated voters in our polling from last week to this week. Seems to be corresponding with a better overall generic ballot for Democrats. It is not a good long-term strategy to tick off suburban women. pic.twitter.com/3PNg6LVhIn— G. Elliott Morris (@gelliottmorris) October 9, 2018
The growing divide in the preference for Democrats over Republicans among men versus women — here shown among only college-educated Americans — is a striking development in US politics. An electorate divided as deeply by gender as other demographics, like educational attainment or race, would certainly push us into uncharted territory. Thus, it’s important to get the trend-detection right.
To that end, a discussion of methods follows.
Typically when I am analyzing time-series polling, I use a method called LOESS to smooth out the estimates and adjust for noisy data. LOESS smoothing, short for local regression (and akin to locally weighted scatterplot smoothing, or LOWESS), is a form of nonparametric regression that can be used to uncover and explore nonlinear trends in data. Primarily, I use LOESS smoothing to show trends on scatterplots when relationships are clearly not 1:1.
However, they are not always the optimal pick. As people shared with me after I posted the original graph, LOESS smoothing might be an imperfect representation of trends and uncertainty in polling data. Allow me to be clear in saying that there is no single answer for which technique is best. Here, I compare my typical LOESS approach with a much more sophisticated one: a Bayesian implementation of generalized additive models.
Instead of using the LOESS smoother built into the
ggplot2 package (Wickham 2016), I created my own using a Bayesian GAM smoother with the
brms package (Bürkner 2017) as an interface to
stan to fit the model.
The data was simple enough: I used the Economist/YouGov polling tables to grab the share of men and women who chose Democrats over Republicans in the House generic ballot on any given date. The default standard deviation used for this model is 0.10 — about where we would expect it to be for subgroup estimates (n~400) for marginal data.
gender poll_date dem_margin sdy Female 0 0.20350786 0.1 Female 7 0.16189866 0.1 Female 14 0.18249686 0.1 Female 21 0.16291266 0.1 Female 28 0.14785891 0.1 Female 35 -0.01650884 0.1 Female 42 0.16244296 0.1 Female 49 0.21555512 0.1 Female 56 0.25216574 0.1 Female 63 0.39003819 0.1
The GAM model that I fit uses the Bayesian
stan programming language to sample from a chain of Markov Chain Monte Carlo draws of a Gaussian distribution. I specify the model as thus, predicting the Democrats’ margin with the subgroup (either men or women) given that the polling is inherently uncertain with the gender of the group and a smooth term for the poll date. The other arguments are ones passed to
library(brms) fit <- brm(bf(dem_margin | se(sdy, sigma=TRUE) ~ gender + t2(poll_date,by=gender)) , data = data.brms, family = gaussian(), cores = 4, seed = 17, iter = 4000, warmup = 1000, thin = 10, refresh = 0, control = list(adapt_delta = 0.99))
summary() of the model is thus:
Family: gaussian Links: mu = identity; sigma = identity Formula: dem_margin | se(sdy, sigma = TRUE) ~ gender + t2(poll_date, by = gender) Data: data.brms (Number of observations: 156) Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 10; total post-warmup samples = 1200 Smooth Terms: Estimate Est.Error l-95% CI u-95% CI Eff.Sample sds(t2poll_dategenderFemale_1) 0.11 0.15 0.00 0.52 996 sds(t2poll_dategenderMale_1) 0.13 0.16 0.00 0.53 1200 Rhat sds(t2poll_dategenderFemale_1) 1.00 sds(t2poll_dategenderMale_1) 1.00 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat Intercept 0.23 0.01 0.21 0.26 1200 1.00 genderMale -0.18 0.02 -0.22 -0.15 1200 1.00 t2poll_date:genderFemale_1 -0.03 0.01 -0.06 -0.00 1200 1.00 t2poll_date:genderMale_1 0.01 0.01 -0.01 0.04 1200 1.00 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat sigma 0.01 0.01 0.00 0.02 1174 1.00 Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
And if we draw from the predictive posterior distribution, we see that the equation does a rather good job of predicting the data — if not a little too uncertain (notice the fat tails of y-rep).
Ultimately, what we want is a plot that looks similar to the original but draws its trend based off the Bayesian GAM. To get the smoothed lines, I use the
marginal_effects() function from brms, and then do some wrangling to set up two data frames for my plot:
# extract the marginal effects msms <- marginal_effects(fit,method='fitted') # make dataset of polls and trendline for the ggplot trendline <- msms[] %>% as.data.frame() # add points to trendline dataset trendline_for_gg <- trendline %>% mutate(poll_date = as.character(poll_date + min(turnout_byparty.ts$poll_date))) %>% select(-c(dem_margin)) %>% group_by(poll_date,gender) %>% summarise_all(first) %>% as.data.frame() %>% mutate(poll_date = ymd(poll_date)) points_for_gg <- turnout_byparty.ts %>% mutate(poll_date = as.character(poll_date), dem_margin = TheDemocraticPartyCandidate-TheRepublicanPartyCandidate) %>% mutate(poll_date = ymd(poll_date))
Here is the
ggplot2 code to make the plot, which graphs the GAM smooth with a filled line and colored fill alongside a LOESS trend, with a dotted line and grey fill for either gender.
# plot ggplot() + geom_point(data=points_for_gg, aes(x=poll_date,y=dem_margin, col=gender)) + geom_smooth(data=points_for_gg, aes(x=poll_date,y=dem_margin, group=gender),linetype=2) + geom_hline(yintercept = 0,linetype=2,col='gray60') + geom_line(data=trendline_for_gg, aes(x=poll_date,y=estimate__, col=gender),size=1) + geom_ribbon(data=trendline_for_gg, aes(x=poll_date,ymin=lower__,ymax=upper__, fill=gender),alpha=0.3) + labs(title="Comparing LOESS and Bayesian GAM in R", subtitle="Local regression can more uncertain and reactionary, but it depends on your standard errors", x="Date", y="Preference for Democrats over Republicans\nin the 2018 Generic Ballot" , caption="Economist/YouGov Weekly Polls (public tables)" ) + scale_y_continuous(labels=scales::percent_format(accuracy = 2)) + scale_x_date(date_labels = "%b '%y") + scale_color_manual(" ",values=c("Female"="#85C1E9","Male"="#5499C7")) + scale_fill_manual(" ",values=c("Female"="#85C1E9","Male"="#5499C7"))
I notice two things: first, the LOESS trend is a little more sensitive to the data. While the patterns generally trace each other quite well, there are bumps around the fall of 2017 and spring of 2018 where the LOESS diverges from the Bayesian GAM. This could be due to the relatively high standard deviation of the brms equation, or to the the short default span of the LOESS.
- Dynamic standard error in the model: right now, I use the same standard error of the y variable at every point for the x. I am not sure this is correct, and my originally instinct was to supply the model with
y | se(y) ~ x + tx(x, by=x2), but that produced some funky results. This part of the trend-fitting is very much in progress. If you have an idea of how to solve this problem, ping me!
- Some more fine-tuning: I’m a novice Bayesian, so not all parameters are optimized (especially in the algorithmic sense) — it is the constant struggle of a data scientist to fit the best model possible!
- Make more trends! (Like for voter turnout.)
- The 2018 Midterms are Upon Us: Final Thoughts Beforehand
- Now Out: My Course "Analyzing Polling and Election Data in R" at DataCamp!
- The Political News That We're Googling Ahead of the Midterms