It has been a hot minute since I launched this site’s forecasting model for the 2018 midterm elections to the United States House of Representatives. Current generic ballot polling gives Democrats a healthy edge in the national popular vote, and forecasts of a “wave” election are likely to continue. The operative question is: will the wave be high enough?
This post will answer some specific questions about how I’m projecting the outcome of the 2018 midterms. Among them…
- How do you average your polls?
- How do you project the election day national vote share using polls?
- How do you use that national vote share to project outcomes at the district level?
- Why are Democrats expected to win fewer seats than their national vote share suggests?
Before I do that, I’d like to draw attention to a few things to keep in mind when interpreting the forecast. The tips come from books written by Nate Silver and “superforecaster” and professor Philip E. Tetlock:
1. Outcomes are probabilistic.
Every outcome has some chance of happening. Sometimes that chance is zero and sometimes it is one hundred percent, but more often than not almost anything is possible, even if the chance is remote enough that we would round it to zero.
This can also be turned on its head; if we often understate the possibility of thing x occurring, we may also overstate the probability that thing y occurs. Don’t assume 80% is 100% – that 20% could make a big difference if an “unexpected” event happens (see: the 2016 election).
So when a model says that Democrats are likely to win 55% of the vote share, it’s very well possible that they win 57 or 53 percent instead. Don’t count on exact outcomes — expect a distribution of possibilities.
2. The output data are quantitative, but ought to be combined with qualitative analysis.
Being told that a given House majority is only fifty percent likely is not very informative when taken at face value. Instead of just spouting off numbers, analysts should utilize forecasts to explain what we can expect in certain events, often based on what (properly calibrated) forecasts have said in the past.
The forecast can also be useful in gauging the impact of certain events, or the wiggle-room candidates have when making important political choices. Likewise, in an election to the United States House of Representatives, the attitudes or appearance of a particular candidate may play a large role. In that scenario, where a model inherently assumes that the candidate is close to typical, the qualitative assessment may deserve more weight than the forecast win/loss percentage.
These are just a couple of the scenarios in which the output data of forecasts can be helpful.
3. The model responds to new data quickly (when it ought to) but keeps old numbers in mind.
It only makes sense for us to take recent data more seriously than old. Events can render old data obsolete quickly (see: Comey, July 2016), but it is often the case that political environments change without big events to spur that shift. Not only does it make logical sense to “care more” about recent data, but doing so has helped our model make better predictions in the past.
This also causes forecasts to look rather volatile during some uncertain races, or at the beginning of elections when people (might) not have their minds made up. However, we would rather have this volatility than a model that treats one-week-old information the same as information coming out on the day before Election Day.
4. The past is a pretty good indicator of what we do (and don't!) know about the future.
The oft-repeated adage that “the past is not indicative of the future” may be right for some things — short term poker odds and votes on unnoticed congressional legislation, for example — but in forecasting, there is one thing the past conveys very well: error. We should often use past prediction misses as an indicator of the error, or uncertainty, that we might encounter this year.
Finally, any good modeler will give you the following advice from statistician George Box: “all models are wrong, but some are useful.” This is to say that a model is, ultimately, a tool. If our tools break or are misused, they may not be helpful. Yet when used the correct way they can be incredibly informative.
Keeping the above in mind, let’s get into the weeds.
A Forecasting Method for the 2018 House Midterm Election
Much like the more interpretive side of the model, the technical details rely on seasoned work from field experts. The model described below draws heavily from academic papers published by political scientists Joseph Bafumi of Dartmouth College, Robert Erikson of Columbia University, and Christopher Wlezien of the University of Texas at Austin. A few technical details have changed (such as the use of weighted polling averages early in the cycle, the inclusion of some more district-level information, and an altered simulation model), but the overall technique follows their work.
With that, let’s explain each of the five steps of the model:
- Average polls of the national congressional race
- Predict what polls might say on election day
- Combine the polls with data from special elections to predict the popular vote on election day
- Add the swing in Democratic votes implied by the projected national popular vote to every congressional district with a contested election
- Simulate 50,000 random elections, using these “trial” contests to compute probabilistic expectations for every seat
Step 1. Averaging the polls
The first step in understanding House elections is understanding the national environment. That is, if Democrats have the lead in polls overall then we can assume (wrongly, in some cases) that they will perform well in the House of Representatives. Once we know how one party is doing nationally we can use that data to project races locally.
To assess the partisan leaning of the national environment, I primarily use polls of the generic Congressional ballot, a question used in public opinion polls that asks Americans which party they will be voting for in their congressional district. Of course, I don’t want to use just one poll as our benchmark, as any single data point could miss the true mark due to random statistical error, a tendency of any one firm to lean more toward Democrats or Republicans, or similar factors that could potentially throw off our expectations — so I take an average of comparable measures.
Specifically, the daily polling average is calculated by taking an exponentially weighted average of all polls conducted in the election cycle. This method allows us to put more emphasis on polls conducted recently, but still keep old information in mind when averaging all the data. I decide which weighting is best by having my computer test which one produces the most predictive average on any given day in the cycle.
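To make the averaging concrete, here is a minimal sketch of one way to compute an exponentially weighted average and pick its weighting. It is not the model's actual code: the `daily_decay` parameter, the function names, the absolute-miss scoring rule, and the two example polls are all assumptions made for illustration.

```python
from datetime import date


def weighted_poll_average(polls, as_of, daily_decay):
    """Exponentially weighted average of poll margins as of a given day.

    polls: list of (poll_date, dem_margin_in_points) tuples.
    daily_decay: hypothetical tuning parameter in (0, 1]; a poll that is
        d days old receives weight daily_decay ** d.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for poll_date, margin in polls:
        age_in_days = (as_of - poll_date).days
        if age_in_days < 0:
            continue  # skip polls taken after the day being averaged
        weight = daily_decay ** age_in_days
        weighted_sum += weight * margin
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no polls on or before as_of")
    return weighted_sum / weight_total


def most_predictive_decay(past_cycles, candidates):
    """Return the candidate decay with the smallest total absolute miss.

    past_cycles: list of (polls, as_of, final_margin) tuples taken from
        earlier elections (assumed input format).
    """
    def total_miss(decay):
        return sum(
            abs(weighted_poll_average(polls, as_of, decay) - final_margin)
            for polls, as_of, final_margin in past_cycles
        )
    return min(candidates, key=total_miss)


# Made-up polls for illustration only (Democratic margin, in points).
example_polls = [(date(2018, 1, 1), 10.0), (date(2018, 1, 11), 4.0)]
half_life_10_days = 0.5 ** (1 / 10)  # a 10-day-old poll counts half as much
example_average = weighted_poll_average(
    example_polls, date(2018, 1, 11), half_life_10_days
)
```

With a decay of 1.0 every poll counts equally and the result is a plain average; smaller values make the average respond faster to new polls at the cost of more volatility.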
What we’re left with is a snapshot of national preferences for the Democratic and Republican parties. On its face, this is very useful for predicting what may happen nationally on election day. Polls aren’t perfect, however, always incurring some error, and so I combine them with other data to get a better forecast of the national popular vote.
Step 2. Forecasting polls on election day
I now move to the method used to predict the election day margin in the polls for Democrats. The method is rather simple: on each day, run a prediction model of the final polling average using the average of polls on that day from 1940 to 2016. I use the polling margin for the party of the presidency and a variable indicating whether or not the election is a midterm election to predict the likely final polling average.
For every day the model is run, it uses the weighted average of the polls for every House election up to that point in the cycle to predict the outcome of those elections. Then, it uses the equation from that prediction, combined with polls for this cycle, to predict the final 2018 polls.
As of writing, this increases the Democrats’ chances in the election, guessing that they’ll gain about 3 points in the generic ballot between now and next November.
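The prediction step in this section can be sketched as an ordinary least squares regression of the final polling margin on the same-day polling margin and a midterm indicator. The solver, the variable names, and the five-row history below are assumptions for illustration; the numbers are invented so the fit is easy to check by hand and are not historical data.

```python
def fit_least_squares(rows, targets):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Invented history: (presidential party's poll margin on this day of the
# cycle, midterm flag, final poll margin). Not real election data.
history = [
    (4.0, 0, 3.0),
    (-2.0, 1, -2.0),
    (6.0, 1, 2.0),
    (0.0, 0, 1.0),
    (3.0, 1, 0.5),
]
rows = [[1.0, margin, float(midterm)] for margin, midterm, _ in history]
targets = [final for _, _, final in history]
intercept, slope, midterm_effect = fit_least_squares(rows, targets)


def predict_final_margin(margin_today, is_midterm):
    """Apply the fitted equation to the current cycle's polling margin."""
    return intercept + slope * margin_today + midterm_effect * is_midterm


# The presidential party trails by 8 points today in a midterm year.
predicted = predict_final_margin(-8.0, 1)
```

Because a separate equation is fitted for each day of the cycle, the same function would be refit with that day's historical averages every time the model runs.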
However, the thing we really want to predict is the votes on election day, not the polls, and the generic ballot polling average is not the only information we have that can be useful to this end.
Step 3. Projecting the national popular vote on election day
In predicting the Democratic margin of votes on election day, the generic ballot only provides us with part of an answer. On election day, my average of polls has a margin of error of roughly 12 percentage points, which is rather large, to say the least. Any good forecaster will strive for better accuracy.
To that end, I model the final popular vote by combining both the generic ballot polling average and the average swing toward Democrats in special elections that have taken place since the most recent national election. These data add accuracy to the forecast, usually overstating the swings against the party in power — a helpful correction, as polling can often understate those swings.
However, predictions are never 100% accurate. On any given day, there is a certain amount of error in the estimates. Running the model 309 days out, there is about an 8% margin of error (in either direction) for our guess of the final election day polls. We can be reasonably good forecasters of the national environment, but notes of caution are warranted. (Note that, because a new model for the final polls is defined each day, each day also has its own particular errors, some days more, some days less.)
Of course, the national popular vote does not matter much in deciding the outcome of the election. What really matters is what happens in all 435 of the single-member districts of the United States House. What we really want to know is the number of seats each party will receive. So let’s see what past vote shares have said about the number of seats won by Democrats.
Step 4. Projecting the election outcomes of all 435 congressional districts
The clearest path to projecting seat share would be to use past information about the relationship between votes and seats to project 2018’s relationship. That is, when Democrats have earned 54% of the vote in the past, how many seats did they win? Put a better way: if I try to predict past Democratic seat share with their national vote share, what would 54% vote share predict?
This method actually yields pretty good results. From a statistical perspective, this model is sound.
But theory says something else.
The problem with forecasting national seat share with nationwide vote share is that House seats are not a perfect portrait of the vote. For example, in 2016 Democrats won 49.4% of the two-party popular vote in the House, yet they only won 44.5% of seats. In 2014, Republicans won 53% of the vote yet were awarded 57% of the seats. Way back in the 2006 midterm, Democrats won 55% of the vote but only received 53% of seats. There is a clear bias against Democrats in the House of Representatives. There are multiple explanations for this gap between votes and seats.
The first is the obvious explanation: Republicans have, over the past decade, built up control of a majority of state legislatures, allowing them to gerrymander House district maps and give themselves a leg up on Democrats competing for those seats. Be it a racial gerrymander, where the power of minority voters (who are often Democrats) is diluted, or a partisan gerrymander, whereby Republicans draw district maps to explicitly benefit themselves, Democrats face a harsh uphill battle.
Secondly, and perhaps more gravely, Democrats face an “unintentional gerrymander”, a structural disadvantage they may have crafted themselves. The data show that Democrats have become clustered in tight geographic areas, weakening their influence over the inherently geographically controlled House. This is commonly attributed to a phenomenon called sorting, the process by which partisans move to areas where they feel they fit in. For Democrats, this concentrates their electoral power in compact urban areas. Republicans, on the other hand, spread out their influence over massive swaths of sagebrush and pick up a disproportionate number of seats.
It is likely that both of these hypotheses explain the Democrats’ geographic electoral disadvantage. Whatever the cause, it remains true that the electoral map is heavily tilted towards the GOP, causing big problems for Democrats in picking up a proportional number of seats. I see some evidence of this trouble emerging in 2012, 2014, and 2016. In the graphic above there is a noticeable national overperformance for the GOP. In sum, it is possible that in 2018 the Democratic Party faces such a harsh national map that past indicators won’t be reliable in predicting the Democratic Party’s performance.
To remedy this issue, I employ and improve another of Bafumi et al.’s models, one that takes into account individual-level seat characteristics and combines them with what I estimate about the national environment…
… by combining individual district-level information with national electoral preferences
Ultimately, I want to use the model to obtain individual forecasts for every congressional district, counting up each party’s seats to get our final projection of who will control the House. To do that I use the following information to predict the Democratic/Republican margin of victory in each seat.
- The Democratic vote margin in the district’s previous congressional election*
- A weighted average of the vote margin for the Democratic presidential candidate in the past two cycles (an approximate measure of district partisanship)
- A variable indicating whether the seat has an incumbent running, or if it’s open, and whether that incumbent is a Democrat or a Republican
- The “national swing,” the difference between Democrats’ projected national two-party vote share and their national two-party vote share last cycle
*In seats that were previously uncontested, the forecast model uses a prediction of what the Democratic vote margin would have been.
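The predictors listed above can be combined in a linear equation along the following lines. This is only a sketch: the coefficient values, the 0.75 weight on the more recent presidential result, the `District` fields, and the two toy districts are all hypothetical, and every quantity is expressed as a Democratic margin in points as a simplifying assumption. In the real model the coefficients would be estimated from past elections.

```python
from dataclasses import dataclass


@dataclass
class District:
    name: str
    prev_house_margin: float   # Democratic margin in the last House race
    pres_margin_recent: float  # Democratic presidential margin, latest cycle
    pres_margin_older: float   # Democratic presidential margin, cycle before
    incumbency: int            # +1 Democratic incumbent, -1 Republican, 0 open


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
COEFFICIENTS = {
    "intercept": 0.0,
    "prev_house": 0.5,
    "partisanship": 0.5,
    "incumbency": 3.0,
    "national_swing": 1.0,
}


def predict_margin(district, national_swing, recent_weight=0.75,
                   coef=COEFFICIENTS):
    """Predicted Democratic margin in one district, in points."""
    # Weighted average of the two most recent presidential margins.
    partisanship = (recent_weight * district.pres_margin_recent
                    + (1.0 - recent_weight) * district.pres_margin_older)
    return (coef["intercept"]
            + coef["prev_house"] * district.prev_house_margin
            + coef["partisanship"] * partisanship
            + coef["incumbency"] * district.incumbency
            + coef["national_swing"] * national_swing)


def count_democratic_seats(districts, national_swing):
    """Count districts where the predicted Democratic margin is positive."""
    return sum(1 for d in districts if predict_margin(d, national_swing) > 0)


# Two made-up districts, not real seats.
toy_districts = [
    District("Toy-1", -10.0, -4.0, -8.0, -1),
    District("Toy-2", 12.0, 9.0, 5.0, 1),
]
toy_margin = predict_margin(toy_districts[0], national_swing=4.0)
toy_seats = count_democratic_seats(toy_districts, national_swing=4.0)
```

Summing the positive predictions across all 435 districts gives the point estimate of Democratic seats before any uncertainty is added.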
Running the model retrospectively to assess validity, a forecast in 2016 would have missed the outcome by three seats. Instead of picking 194 seats for the Democrats, the model predicted 191.
Turning to the current midterm cycle, the forecasting model (run today) projects that Democrats will win 220 seats, 14 more than they did in 2016, despite a 4 percentage point swing towards them in the popular vote. They’re projected to win the popular vote by 5% but lose the House by 8 districts. Recall the discussion of the Democrats’ geographic disadvantage earlier in this step.
Of course, there is error in our forecasts (as we’ve discussed), and accounting for that error is crucial in producing what we all really want: a probabilistic estimate of how likely it is Democrats take back the majority in the House of Representatives on November 6th, 2018.
Step 5. Simulate the election
The final step is to simulate the election, accounting for uncertainty from our estimates and assessing the probability that the Democrats or Republicans could over/underperform and move the outcome away from what I estimate. The framework for this simulation is Monte Carlo simulation, a common statistical technique used to account for possible error in data.
This process has 3 major steps:
- Choose a random national error, according to the root-mean-square error from the formula predicting final national poll margins on that day
- Choose correlated district error for all 435 congressional districts. This way, I can account for the likely scenario that if my projections are wrong they will likely systematically err in one direction (either pro-Democrat or pro-Republican).
- Add up the number of seats won by Democrats and record it along with a “win” counter for each seat
I repeat this process many thousands of times, in the end counting up the number of “elections” in which Democrats win at least 218 seats in the House and dividing that by the number of total simulations. That becomes their probability of victory. (I do the same for each seat).
The graph above displays a distribution of the possible number of seats the Democratic Party could win, given current national vote share and district characteristics. The red area is where Republicans hold the House. The blue area is where Democrats win a majority. The height of each bar represents how likely that individual outcome is to occur (a taller bar means a more likely outcome).
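The three steps above can be sketched as a short Monte Carlo loop. The way district errors are correlated here (one shared shock plus an independent shock per district, mixed by a `correlation` parameter) is an assumption, as are the function name, the standard deviations, and the five toy margins.

```python
import random


def simulate_house(expected_margins, national_sd, district_sd, correlation,
                   n_sims, majority, seed=0):
    """Return (P(Democratic majority), per-seat Democratic win probabilities).

    expected_margins: predicted Democratic margin in each district (points).
    national_sd: spread of the national error, e.g. a model's RMSE.
    district_sd, correlation: hypothetical parameters for district errors;
        any two districts' errors share the given pairwise correlation.
    """
    rng = random.Random(seed)
    seat_wins = [0] * len(expected_margins)
    majorities = 0
    shared_scale = district_sd * correlation ** 0.5
    own_scale = district_sd * (1.0 - correlation) ** 0.5
    for _ in range(n_sims):
        # Step 1: one national error applied to every district.
        national_error = rng.gauss(0.0, national_sd)
        # Step 2: correlated district errors via a shared shock.
        shared_shock = rng.gauss(0.0, 1.0)
        seats = 0
        for i, margin in enumerate(expected_margins):
            district_error = (shared_scale * shared_shock
                              + own_scale * rng.gauss(0.0, 1.0))
            if margin + national_error + district_error > 0:
                seats += 1
                seat_wins[i] += 1
        # Step 3: record whether this simulated election produced a majority.
        if seats >= majority:
            majorities += 1
    return majorities / n_sims, [wins / n_sims for wins in seat_wins]


# Toy five-district chamber with made-up margins; 3 seats is a majority.
toy_margins = [12.0, 4.0, -1.0, -6.0, -15.0]
win_probability, seat_probabilities = simulate_house(
    toy_margins, national_sd=4.0, district_sd=5.0, correlation=0.5,
    n_sims=2000, majority=3, seed=42,
)
```

For the full House the same call would take 435 margins and `majority=218`; a pure-Python loop like this one is slow at 50,000 simulations, so a vectorized implementation would be the practical choice.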
I can also “game this out,” forecasting the different seats Democrats could gain (or lose) given any national vote margin I like. Inputting every possibility from a 15 percentage point win to a 15 point loss, we learn that (if current district characteristics hold, and they likely won’t) the Democratic Party would need a margin of roughly 8% to have majority odds of winning the House this year. Again, there is uncertainty there: Democrats could win with a 4% margin or lose with a 9% one. Details in the graphic below:
The discrepancy between the percentage of votes won and the percentage of seats won, as seen at any point on the blue line above between 0% and 10% on the x-axis, can be almost entirely attributed to our earlier discussion about the geographic disadvantage Democrats face as a result of gerrymandering and clustering.
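A stripped-down version of this exercise, with the randomness removed, shows why the required margin sits above zero. The sketch assumes each district's margin equals the national margin plus a fixed lean; the five leans are invented, and the function names are hypothetical.

```python
def seats_won(district_leans, national_margin):
    """Seats with a positive Democratic margin under a uniform national swing."""
    return sum(1 for lean in district_leans if national_margin + lean > 0)


def tipping_margin(district_leans, margins_to_try):
    """Smallest national margin tried that yields a Democratic majority."""
    majority = len(district_leans) // 2 + 1
    for margin in sorted(margins_to_try):
        if seats_won(district_leans, margin) >= majority:
            return margin
    return None  # no margin in the range was enough


# Invented leans relative to the nation (points). The median district
# leans 2 points Republican, mimicking a map tilted against Democrats.
toy_leans = [20.0, 3.0, -2.0, -6.0, -25.0]
sweep = {margin: seats_won(toy_leans, margin) for margin in range(-15, 16)}
needed = tipping_margin(toy_leans, range(-15, 16))
```

In this toy chamber a tied national vote yields only 2 of 5 seats, and Democrats need a national margin of 3 points to win a majority, because the median district leans away from them.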
That’s about it for this post! If you’re sticking around to the three thousandth word, there are a few things you should know:
- The model updates every day of the election cycle, twice a day, to account for changing information.
- However, each time the model is run, it produces slightly different numbers as a result of statistical randomness in the simulations. Sometimes the probabilistic estimates will move as much as 0.5%! Don’t panic if you see small movements without observable shifts; it’s likely just “noise” in the forecast.
- What about district-level polls? Well, polls for individual congressional districts have been rather errant in the past, and the way this model is devised doesn’t necessitate updated district numbers; those would be “baked in,” so to speak, in changes in the national polling.
- I’m not perfect, and modeling takes so much time that I find it impossible that I have not committed at least one error. If I catch a bug in the model, I’ll update this page with the correct information.
- Race ratings (the traditional "Safe Dem", "Likely Dem", "Lean Dem", "Tossup", etc.) will be added in future posts. You can find them on the forecast homepage.
- Lastly, come back here frequently! I often re-read these posts and find places where I can clear up some sentence structure or incorporate readers’ feedback. This will only get better with time.
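As a rough check on the simulation noise mentioned in the list above, the standard error of a simulated probability can be approximated with the binomial formula. This is a back-of-the-envelope sketch that treats each simulated election as an independent trial; the function name is hypothetical.

```python
import math


def simulation_standard_error(probability, n_sims):
    """Approximate standard error of a probability estimated from n_sims trials."""
    return math.sqrt(probability * (1.0 - probability) / n_sims)


# Noise in a 50% estimate from 50,000 simulated elections.
noise = simulation_standard_error(0.5, 50_000)
noise_in_points = 100.0 * noise
```

For a probability near 50% and 50,000 simulations this comes to roughly 0.22 percentage points per run, so run-to-run movements of a few tenths of a point are what pure chance would produce.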
Questions? Comments? Concerns? Please be sure to pass them along in a tweet or email.
- Sep. 10, 2017: The previous version of the model used 0s/100s in place of 2016 vote share in places where no Democrat/Republican ran in 2016. For those previously uncontested districts, the 2016 vote shares have been replaced with 2014 vote shares. This makes little difference overall --- causing just a 2% decrease, for example, to Pete Sessions' (TX-32) projected 2018 vote share --- but small changes could make a big difference in a tight election.
- Oct. 13, 2017: I made a big update to how the model reports the probability that Democrats will win the House. Read more here.
- Jan. 1, 2018: I changed the model from estimating the final vote on election day to the final polls. This shift represents our belief that poll bias might not reflect past statistical patterns, and we should treat estimates with slightly more uncertainty instead. The models also now express data in terms of Democratic margins, not Democratic two-party vote shares, which is both (A) more intuitive for the reader and (B) slightly more statistically accurate. This had the effect of increasing our expectations of the Democrats' 2018 performance by roughly one percentage point.
- June 1, 2018: The model now uses the redrawn congressional map in Pennsylvania and uses special election data when predicting the popular vote.