House 2018 Model Talk: Regression versus Simulation

"All models are wrong, but some are useful."

G. Elliott Morris

9 minute read

Up until now, my forecast of the 2018 House midterms has been pretty grim for Democrats. It has noted a 7-8 percentage point Democratic advantage in the generic ballot and calculated a 32% chance of Democrats winning control of the House majority. It has come to that conclusion via a mix of national polls, district-level predictions, and simulations designed to parameterize the error in our measurement.

Today, I made an adjustment to that model, and when I ran the computer code it spit out the following:

> ----------------- Model ------------------------
> Model run: October 13, 2017
> Projected Dem. Vote (%): 54.1%
> Projected Seats: 208
> -------------- Simulation ----------------------
> Median # Dem. Seats: 215
> Most Likely # Dem. Seats: 211
> Dem. Probability of Victory: 46%

That 46% probability of victory is 14 percentage points higher than what I was reporting yesterday.

So… what in the hell is going on?

Let’s rewind a bit. First, we’ll talk through some of the principles we keep in mind when doing this sort of forecasting. Then we’ll analyze what that means for the change I made. Briefly, here’s what I think the forecast is intended to accomplish (an extended version is included in my methodology post for this forecast):

  1. Forecasts should calculate a range of possible outcomes, rather than labeling one single outcome the most likely and conveying that that's the one that will happen.
  2. The numbers should reflect the expectations anyone could draw from analyzing the same underlying data with a similar approach.
  3. The real power in a forecast is in accounting for the error we've encountered in the past.
  4. Any forecast is built on a foundation of assumptions we make about the target event (in this case, the 2018 House midterms), which brings in all the error contained in those assumptions.

For the most part, the way I have constructed the model so far adheres to these principles — but there is one assumption I made that needs to be reversed. This reversal has some important consequences (as you have seen!) that need to be adequately explained, perhaps even justified. It falls into the domain of the third bullet point above.

Again, I strongly recommend you read my methodology post for the model, as most of the rest of this piece will make more sense if you understand the underlying prediction process. You don’t need to read all of it! Focus on Steps 4 and 5.

There are three parts to this explainer: (1) the model, (2) the problem, (3) the remedy.

1. The Model

Again, as detailed in a separate post, step 4 of the model is the part that covers our prediction for every single congressional district. At a basic level, I am predicting the 2018 Democratic candidate’s vote in each House seat using the 2016 and 2014 Democratic vote share in that district, Clinton’s vote share there, and a variable indicating if there’s an incumbent running. As I’ve written before, this method is pretty basic and has its own error. That being said, when we run it against past elections (we’ll call this “back-testing” to save time), it still produces seat totals that land within 4-5 seats of the result, with roughly 5% error on average.
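To make the shape of that step 4 predictor concrete, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the `District` fields, the coefficient values in `COEF`, and the incumbency coding (+1, 0, -1) are invented, not the fitted values from the actual forecast.

```python
from dataclasses import dataclass


@dataclass
class District:
    name: str
    dem_2016: float      # Democratic House vote share in 2016 (%)
    dem_2014: float      # Democratic House vote share in 2014 (%)
    clinton_2016: float  # Clinton's vote share in the district (%)
    incumbency: int      # assumed coding: +1 Dem incumbent, -1 Rep incumbent, 0 open


# Hypothetical coefficients, NOT the fitted values from the forecast.
COEF = {
    "intercept": 2.0,
    "dem_2016": 0.45,
    "dem_2014": 0.15,
    "clinton_2016": 0.35,
    "incumbency": 2.5,
}


def predict_dem_share(d: District) -> float:
    """Linear prediction of the 2018 Democratic vote share (%) in one district."""
    return (
        COEF["intercept"]
        + COEF["dem_2016"] * d.dem_2016
        + COEF["dem_2014"] * d.dem_2014
        + COEF["clinton_2016"] * d.clinton_2016
        + COEF["incumbency"] * d.incumbency
    )


def seats_favored(districts) -> int:
    """Step 4 style seat count: districts where the predicted share tops 50%."""
    return sum(predict_dem_share(d) > 50.0 for d in districts)


if __name__ == "__main__":
    toy = [
        District("Safe D", 68.0, 65.0, 70.0, +1),
        District("Toss-up", 50.0, 50.0, 50.0, 0),
        District("Lean R", 45.0, 44.0, 47.0, -1),
    ]
    for d in toy:
        print(f"{d.name}: {predict_dem_share(d):.1f}%")
    print("Seats favored:", seats_favored(toy))
```

The point of the sketch is only that step 4 yields one number per district and a single seat total, with no sense of how wrong either might be.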

Step 5 of the model takes these estimates, the average error from when we back-tested, and creates (or “simulates”) tens of thousands of fake elections. Each of these fake elections (what we call “trials”) randomly gives a boost to the Democrat or Republican in each district within the bounds of the margin of error from past elections. (We do some fancy-ish stuff with these errors; namely, they’re correlated across districts and we also use error in national polls and add uncertainty for a high number of undecideds to estimate what different “boosts” are appropriate.) At the end of each simulation, we simply add up the number of districts where Democrats win more votes than Republicans. We write that number down in a list and repeat however many times we want.
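A stripped-down version of that loop might look like the following Python sketch. It is not the forecast's actual code: the function name, the two standard deviations, and the choice of normally distributed errors are assumptions, and the adjustments for national polling error and undecided voters are left out. Correlation across districts is approximated here by one shared "national swing" per trial.

```python
import random


def simulate_seat_counts(predicted_shares, n_trials=10_000,
                         national_sd=3.0, district_sd=4.0, seed=2018):
    """Return a list with one simulated Democratic seat total per trial.

    predicted_shares: step 4 predictions (Democratic vote share, in %).
    national_sd / district_sd: assumed error sizes, in percentage points.
    """
    rng = random.Random(seed)
    seat_counts = []
    for _ in range(n_trials):
        # One shared draw per trial: this is what correlates errors across districts.
        national_swing = rng.gauss(0.0, national_sd)
        dem_seats = 0
        for share in predicted_shares:
            # District-specific noise on top of the shared swing.
            simulated = share + national_swing + rng.gauss(0.0, district_sd)
            if simulated > 50.0:
                dem_seats += 1
        # "Write that number down in a list."
        seat_counts.append(dem_seats)
    return seat_counts


if __name__ == "__main__":
    toy_predictions = [70.0, 55.0, 51.0, 48.0, 45.0, 30.0]  # invented districts
    counts = simulate_seat_counts(toy_predictions, n_trials=2_000)
    print("First ten trials:", counts[:10])
```

Fixing the seed makes a run reproducible, which helps when comparing one day's forecast to the next.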

After all of our trials are run we are left with a long list of the numbers of seats Democrats could win on November 6, 2018 (you can actually download this list on my forecast homepage!) with which we can do some cool math:

  • First, we use it to figure out the chance that Democrats win a House majority, or more than 217 seats. That's the probability of victory you see on the forecast homepage.
  • We can also use it to compute the "most likely" result. Usually, that's the result that happens most often (the "mode," in statistics language), but there is also value in reporting out the average (or "mean") and median outcomes.
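The "cool math" in the two bullets above comes down to a few lines. This is a minimal Python sketch; the function name `summarize` and the toy list of seat totals are made up for illustration.

```python
from statistics import mean, median, mode


def summarize(seat_counts, majority=218):
    """Summarize a list of simulated Democratic seat totals."""
    n = len(seat_counts)
    return {
        # Share of trials in which Democrats win more than 217 seats.
        "p_majority": sum(c >= majority for c in seat_counts) / n,
        # The single most frequent seat total (the "most likely" result).
        "mode": mode(seat_counts),
        "mean": mean(seat_counts),
        "median": median(seat_counts),
    }


if __name__ == "__main__":
    toy_counts = [210, 215, 215, 220, 230]  # invented trial results
    print(summarize(toy_counts))
```

With a real run the list would hold tens of thousands of entries, but the calculations are the same.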

That’s great! Except for one thing…

What if the mean, median, and/or mode don’t match what our linear regression, as per step 4, says we should expect come November 2018?

2. The Problem

Therein lies the crux of our problem: the two estimates almost never match. That’s because of a particular phenomenon occurring in the present distribution of Democratic voters across the United States. Mostly because of gerrymandering and a process called sorting (whereby left-leaning partisans are clustering in particular areas, mostly cities), the Democrats enjoy a huge advantage in the number of Safe seats they currently possess. By our calculations, seats historically categorized as “Safe Democratic” will be won by Republicans less than 1% of the time. That is, after all, why we call them safe!

But on the other side of the aisle, Republicans are spread out over more competitive seats, ones we call “Lean Republican”. The following graphics illustrate this.

There are 5-6 times as many seats that Republicans are expected to win by 5-15% than there are for Democrats. And since our simulations randomly pick an error within a 10% margin of error for these seat-level estimates, there are many more seats that can be flipped Democratic in each of our trial elections than can be moved to the “R” column.

In other words, the simulations are giving a boost to the Democratic party over what our simple step 4 predictor tells us. Intuitively, this is because the chance that Democrats overperform expectations is inherently higher than the chance Republicans do. That boost currently amounts to a median of 8 (and an average of 12!) seats for the party, equal to a 12-15 percentage point boost in the chance that Republicans lose their House majority.
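Here is a small Python illustration of why the two totals diverge. It compares the step 4 style count (seats where the prediction tops 50%) with the expected seat count once each district gets a win probability. The toy map, the 5-point error size, and the assumption of independent, normally distributed errors are all invented for the example; the real simulations use correlated errors.

```python
from statistics import NormalDist


def win_probability(predicted_share, error_sd=5.0):
    """Chance the Democrat tops 50% if the prediction error is Normal(0, error_sd)."""
    return 1.0 - NormalDist(mu=predicted_share, sigma=error_sd).cdf(50.0)


def point_seats(shares):
    """Seats where Democrats are favored outright (the step 4 view)."""
    return sum(s > 50.0 for s in shares)


def expected_seats(shares, error_sd=5.0):
    """Average seat total once every district can go either way (the step 5 view)."""
    return sum(win_probability(s, error_sd) for s in shares)


# Invented 435-seat map with 5.5 times as many lean Republican seats
# as lean Democratic ones.
TOY_MAP = (
    [70.0] * 198    # safe Democratic
    + [55.0] * 10   # lean Democratic
    + [45.0] * 55   # lean Republican
    + [30.0] * 172  # safe Republican
)

if __name__ == "__main__":
    print("Seats favored:", point_seats(TOY_MAP))
    print("Expected seats:", round(expected_seats(TOY_MAP), 1))
```

Each lean Republican seat in this toy map is only a modest long shot for Democrats, but there are many more of them than there are lean Democratic seats at risk, so the long shots add up to a net gain in the expected total.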

So, again, do we pick the step 4 total or the step 5 total? Or do we do something else entirely, adjusting the step 5 total Democratic seat share to match the step 4 number (this is what the forecast has done so far)?

There are a few ways to think through this.

First, one could simply say that the number of expected Democratic seats computed by the step 4 model is the correct number. After all, that’s what Ockham’s Razor would tell us to do: “the simplest explanation is the best one.” In terms of 2018 predictions, a district-level model isn’t the simplest one you could build, but it is close.

You could also say (indeed, this is how I was initially justifying this position) that the linear model is the best expectation of the election because it’s the most tried method. The step 4 estimate is the best estimate because that is what has been most adjusted and readjusted, tested and retested in the past. This is just another assumption we have to make about the nature of the forecasting model.


I think these people — yes, including myself in crafting forecast model 1.0 — are wrong. Here’s why.

The entire point of a simulation-based forecasting model is to take into account the errors in our past predictions and create a probabilistic expectation of what may happen. But we also create some other measurements for the “best estimate” of Democratic performance. These measurements are slightly more optimistic for Democrats precisely because of that disproportionately higher number of lean Republican seats. It makes sense to say that Democrats have an outsized advantage at picking up seats because they do. Yet, the forecasting technique at present doesn’t represent that.

What are we to do?

3. The Remedy

To restate: the issue at hand is the correction of the simulation’s calculation of the “best estimate” for the number of seats Democrats should win next year. Again, that correction has been in place to move the number to match the estimate from the linear model. This is not a bad choice at face value and not a wrong assumption per se, but it definitely has its own drawbacks.

To remedy the present issue I am simply deleting that adjustment. The new reported “best estimate” of Democratic seat share will simply be the median of the simulated number of Democratic seats. We will also be reporting the more strict estimate of seats based on our step 4 linear regression, though that’s not the headline number we should be analyzing.
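To see what deleting the adjustment does to the headline probability, here is a Python sketch. I'm not reproducing the old correction exactly here; the sketch assumes, purely for illustration, that it shifted every simulated seat total by a constant so the median landed on the step 4 number. The function names and the nine toy trials are made up.

```python
from statistics import median


def p_majority(seat_counts, majority=218):
    """Share of trials in which Democrats reach a House majority."""
    return sum(c >= majority for c in seat_counts) / len(seat_counts)


def recenter_to(seat_counts, target):
    """ASSUMED form of the old adjustment: shift all trials so the median hits target."""
    shift = target - median(seat_counts)
    return [c + shift for c in seat_counts]


if __name__ == "__main__":
    # Invented trial results with a median of 215 seats.
    raw = [200, 205, 210, 212, 215, 218, 220, 225, 230]
    adjusted = recenter_to(raw, target=208)  # forced to match the step 4 total

    print("Old (adjusted) median:", median(adjusted),
          "P(majority):", round(p_majority(adjusted), 2))
    print("New (raw) median:     ", median(raw),
          "P(majority):", round(p_majority(raw), 2))
```

Under that assumption, dragging the whole distribution down to the regression total pushes trials below the 218-seat line, which is why the probability rises once the simulations are left alone.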

You can imagine that the reported share of seats is moving up roughly 8 seats on average. You can see this in action via the difference in the two graphics below.



Again, we think that this difference in reporting makes a big (and better) difference in how we’re communicating expectations for the 2018 midterms. Whereas before we were saying the expected Democratic performance in 2018 is a hard number determined by a less sophisticated, simple model, we’re now saying that our expectation of reality can be better determined by our simulated number of seats. In this case, this (easy to make in hindsight) decision benefits Democrats by about 14 percentage points in our probabilistic forecast.

Wrapping up (Phew, we’re almost done)

So, to recap (TL;DR): Our initial reporting on the 2018 midterm race emphasized the seat share estimates produced by a linear model. Reporting also favored the probabilities produced by a simulated range of possibilities that had been adjusted to match that model, rather than left to speak for itself. We’ve done away with that adjustment now. As a result, the probability that Democrats win back the House in 2018 has increased from 32% to 46%, a very meaningful difference. We are also choosing to emphasize reporting the median and mode (most likely outcome) of these simulations rather than the strict estimate of the aforementioned linear model: again, a median of 215 seats now instead of 208.

We think that the changes more accurately reflect the distributional dynamics at play in the 2018 House midterms: notably, the 5-6x greater number of possibly vulnerable Republicans than possibly vulnerable Democrats.

You can imagine that before our change, we were saying “Our best guess of the Democratic Party’s share of House seats after the 2018 midterms, according to a simple model, is 208 seats. A simulation that accounts for the chance that we are wrong, adjusted to match that model, says they have a 32% chance of winning 218 or more seats.”

Now we’re saying “Our best guess of the Democratic Party’s share of House seats after the 2018 midterms is around 215 seats. Strictly speaking, they are favored in 208 seats, but when we account for the higher number of Lean Republican districts than Lean Democratic districts, Democrats get a slight boost and they win a House majority about 46% of the time.”

We think the latter makes more sense.

Questions or comments about this shift in our forecast? We know it’s a little confusing, so we’d be happy to talk it over with you via email.
