It has been roughly two weeks since I launched my forecasting model for the 2018 midterm elections to the House of Representatives. Current generic ballot polling gives Democrats a 7.2 percentage point edge in the national popular vote, roughly where it has been since a month after Donald Trump became president. The forecasting model estimates, contrary to what a 7.2% lead would imply, that Democrats have just a 22% chance of winning the House majority on November 6, 2018.
This post will answer some specific questions about how I’m projecting the outcome of the 2018 midterms. Among them…
- How do you average your polls?
- How do you project the election day national vote share using polls?
- How do you use that national vote share to project outcomes at the district level?
- Why are Democrats expected to win fewer seats than their national vote share suggests?
Before I do that, it helps to understand why we may want to go about forecasting an election. I like to give credit where credit is due, and so I’ll include the two books by Nate Silver (of course) and “superforecaster” and professor Philip E. Tetlock from which most of these modeling underpinnings come. With that being said, let’s explain the usefulness of the forecast itself. Here are some guiding principles:
1. Outcomes are probabilistic.
There is always some chance that an outcome can happen. Sometimes that chance is zero, and sometimes it is one hundred percent, but more often than not it falls somewhere in between: almost anything is possible, even if the chance is remote enough that we would round it to zero.
This can also be turned on its head; if we often understate the possibility of thing x occurring, we may also overstate the probability that thing y occurs. Don’t assume 80% is 100% – that 20% could make a big difference if an “unexpected” event happens (see: the 2016 election).
So when a model says that Democrats are likely to win 55% of the vote share, it’s very well possible that they win 57 or 53 percent instead. Don’t count on exact outcomes — expect a distribution of possibilities.
2. The output data are quantitative, but ought to be combined with qualitative analysis.
Being told that a given House majority is only fifty percent likely is not very informative when taken at face value. Instead of just spouting off numbers, analysts should utilize forecasts to explain what we can expect in certain events, often based on what (properly calibrated) forecasts have said in the past.
The forecast can also be useful in gauging the impact of certain events, or the wiggle-room candidates have when making important political choices. Just as well, in a forecast of an election to the United States House of Representatives there may be a large role to play for the attitudes or appearance of a select candidate. In that scenario, when a model would inherently assume that someone exists near some spectrum of normality, the qualitative assessment may deserve to be taken into consideration more than their forecast win/loss percentage.
These are just a couple of the scenarios in which the output data of forecasts can be helpful.
3. The model responds to new data quickly, but keeps old numbers in mind.
It only makes sense for us to take recent data more seriously than old data. Events can render old data obsolete quickly (see: Comey, July 2016), but it is often the case that political environments change without big events to spur that shift. Not only does it make logical sense to “care more” about recent data, but doing so has helped our model make better predictions in the past.
This also causes forecasts to look rather volatile during some uncertain races, or at the beginning of elections when people (might) not have their minds made up. However, we would rather have this volatility than a model that treats one-week-old information the same as information coming out on the day before Election Day.
4. The past is a pretty good indicator of what we do (and don't!) know about the future.
The oft-repeated adage that “the past is not indicative of the future” may be right for some things — short term poker odds and votes on unnoticed congressional legislation, for example — but in forecasting there is one thing the past conveys very well: error. We should often use past prediction misses as an indicator of the error, or uncertainty, that we might encounter this year.
Finally, any good modeler will give you the following advice from statistician George Box: “all models are wrong, but some are useful.” This is to say that a model is, ultimately, a tool. If our tools break or are misused, they may not be helpful. Yet when used the correct way they can be incredibly informative.
Keeping the above in mind, let’s get into the weeds.
A Forecasting Method for the 2018 House Midterm Election
Step 1. Averaging the polls
The first step in understanding House elections is understanding the national environment. That is, if Democrats have the lead in polls of Americans overall then we can assume (wrongly, in some cases) that they will perform well in the House of Representatives. Once we know how one party is doing nationally we can use that data to project races locally.
To assess that national environment we use polls of the generic Congressional ballot, a measure of public opinion that asks Americans which party they will be voting for in their congressional district. Of course, we don’t want to use just one poll as our benchmark, as any single poll could miss the true mark due to random statistical error, a poorly conducted survey, or similar factors that could throw us off, so we take an average of comparable measures. Specifically, we average the polls according to a method put forth by political scientist and statistician Simon Jackman called “pooling the polls.”
The daily poll-of-polls is simply a weighted rolling average of electoral preferences. To compute this, I first select all polls collected in the prior 30 days, converting every poll into its two-party equivalent. For example, if a poll pegged 40% support for the Democrats, 40% for Republicans, and 20% for other/undecided I would transform the 40% for each of the major parties into their share of the voters choosing one or the other. In this case, both parties would receive 50% of the vote (40/80 = 0.5, or 50%).
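The two-party conversion is simple enough to sketch in a few lines. The function name `two_party_share` is mine, chosen for illustration; it is not taken from the model's actual code.

```python
def two_party_share(dem_pct: float, rep_pct: float) -> tuple[float, float]:
    """Convert raw poll percentages into shares of the two-party vote."""
    total = dem_pct + rep_pct
    if total <= 0:
        raise ValueError("poll must report positive major-party support")
    return dem_pct / total, rep_pct / total


# The example from the text: 40% Democrats, 40% Republicans, 20% other/undecided.
dem, rep = two_party_share(40, 40)
print(f"{dem:.1%} / {rep:.1%}")  # prints "50.0% / 50.0%"
```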
After collecting and converting the polls I split up each of them into individual readings for each day the survey was in the field, allocating the number of respondents to these new “polls” accordingly. (If a 300-person poll with 50-50 vote share was conducted over 3 days, I create 3 100-person tied polls.) Thirdly I collapse each day’s collective polling into an average of every reading on that day, putting more stock in the “polls” with more respondents. Finally, I average the entire month’s worth of polling, putting much more weight on the responses from one day ago than those a month old. In fact, the weighting formula for this “poll of polls” almost entirely discounts any poll taken more than 14 days ago — polls taken in the last 4 days receive nearly 65% of the weight.
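Here is a minimal sketch of those pooling steps. The `Poll` class, the function names, and the sample dates are hypothetical. The recency weight is also an assumption: the exact weighting formula is not given above, so the sketch uses a simple per-day exponential discount (`decay=0.77`), picked only because it roughly matches the two figures quoted above when polls arrive every day.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Poll:
    start: date       # first day in the field
    end: date         # last day in the field
    n: int            # number of respondents
    dem_share: float  # Democratic share of the two-party vote, 0 to 1


def daily_readings(poll):
    """Split one poll into equal-sized single-day readings."""
    days = (poll.end - poll.start).days + 1
    for offset in range(days):
        yield poll.start + timedelta(days=offset), poll.n / days, poll.dem_share


def poll_of_polls(polls, today, window=30, decay=0.77):
    """Average daily readings by respondents, then average days by recency.

    `decay` is a hypothetical per-day discount, not the published formula.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # day -> [sum of n * share, sum of n]
    for poll in polls:
        for day, n, share in daily_readings(poll):
            if 0 <= (today - day).days <= window:
                totals[day][0] += n * share
                totals[day][1] += n
    if not totals:
        raise ValueError("no polls inside the window")
    weighted = weight_sum = 0.0
    for day, (share_sum, n_sum) in totals.items():
        weight = decay ** (today - day).days  # older days count for less
        weighted += weight * (share_sum / n_sum)
        weight_sum += weight
    return weighted / weight_sum


# Made-up polls, only to exercise the code.
polls = [
    Poll(date(2017, 9, 1), date(2017, 9, 3), 300, 0.50),  # becomes three 100-person readings
    Poll(date(2017, 9, 3), date(2017, 9, 3), 300, 0.60),
]
print(round(poll_of_polls(polls, today=date(2017, 9, 3)), 3))
```

Splitting a multi-day poll evenly across its field dates is the allocation described above; any other allocation would only change `daily_readings`.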
What we’re left with is a snapshot of national preferences for the Democratic and Republican parties. On its face, this is very useful for predicting what may happen nationally on election day. Polls aren’t perfect, however, always incurring some error, and so we combine them with other data to get a better forecast of the national popular vote.
Step 2. Forecasting the national popular vote
The model I use to forecast the national popular vote from polls is an easily understood one, and it comes from political scientists Joseph Bafumi of Dartmouth College, Robert Erikson of Columbia University, and Christopher Wlezien of the University of Texas at Austin. Their linear model to predict the eventual Democratic share of the national vote uses just two variables: Democrats’ polled two-party vote share and an indicator of the party in the White House. Their theory is that early generic ballot polling underestimates backlash to the party of the president (a finding certainly corroborated by the evidence), so the equation essentially gives a bonus to whatever party is not in control of the White House.
Bafumi et al. also find that the relationship between polls, the vote, and the president changes over time. Specifically, they present separate models to forecast the outcome from different points in time: February, April, June, August, September, and October of the election year. I use the idea of their model, but with updated data from 2014 and 2016. Special thanks to political scientist and pollster Charles Franklin for this data.
For every day the model is run, it uses the polls for every House election up to that point in the cycle to predict the outcome of those elections. Then, it uses the equation from that prediction, combined with polls for this cycle, to predict the 2018 vote share.
As of writing, this slightly dampens the Democrats’ advantage in generic ballot polls, moving them from a projected 53.7% of the two-party vote to 53.6%. It’s a slight correction, but over time (and in different scenarios) it is statistically significant.
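To make the mechanics concrete, here is a rough sketch of a two-variable linear fit of this kind using ordinary least squares. Everything specific in it is an assumption for illustration: the function names, the +1/-1 coding of the president's party, and above all the input numbers, which are synthetic and are not historical polls or results.

```python
import numpy as np


def fit_vote_model(poll_share, dem_president, vote_share):
    """Least-squares fit of vote = b0 + b1 * poll + b2 * president_party."""
    X = np.column_stack([np.ones(len(poll_share)), poll_share, dem_president])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(vote_share, dtype=float), rcond=None)
    return coefs


def predict_vote(coefs, poll_share, dem_president):
    """Apply the fitted equation to a new poll average."""
    return coefs[0] + coefs[1] * poll_share + coefs[2] * dem_president


# Synthetic inputs, NOT real election data.
polls = np.array([52.0, 48.0, 55.0, 50.0, 47.0, 53.0])  # Dem two-party poll share
pres = np.array([1, -1, -1, 1, 1, -1])                   # +1 Democrat in White House, -1 Republican
votes = 20.0 + 0.6 * polls - 1.5 * pres                  # invented relationship with an out-party bonus

coefs = fit_vote_model(polls, pres, votes)
print(predict_vote(coefs, 53.7, -1))
```

Refitting on each day's polls, as described above, would amount to calling `fit_vote_model` once per day with the polls available at that point in each past cycle.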
However, our predictions are never 100% accurate. On any given day, there is a certain amount of error. Running the model for the day before the election, there is about 2.2 percentage points of error (in terms of root-mean-square error, which I’ll talk more about in Step 4) in the model of the national popular vote, equaling a margin of error of around 4.5 points, so we can be reasonably good forecasters of the national environment! (Note that, because a new model for the vote is defined each day, each day also has its own particular national polling error.)
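For readers who want the arithmetic, here is a small sketch of how a root-mean-square error is computed and how it maps to a margin of error. The conversion assumes roughly normal errors and a 95% interval (z = 1.96), which is my assumption; it gives a figure in the same neighborhood as the one quoted above.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between forecasts and outcomes."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def margin_of_error(rmse_value, z=1.96):
    """Approximate 95% margin of error, assuming normally distributed errors."""
    return z * rmse_value


print(round(margin_of_error(2.2), 1))  # prints 4.3
```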
Of course, the national popular vote does not matter much at all; the number of seats each party receives will determine the fate of the House. So let’s see what past vote shares have said about the number of seats won by Democrats.
Step 3. Projecting the Democrats’ House seat share — the easy way
The clearest path to projecting seat share would be to use past information about the relationship between votes and seats to project 2018’s relationship. That is, when Democrats have earned 54% of the vote in the past, how many seats did they win? Put a better way: if we try to predict past Democratic seat share with their national vote share, what would 54% vote share predict?
This method actually yields pretty good results. From a statistical perspective, this model is sound.
But theory says something else.
The problem with forecasting national seat share with nationwide vote share is that House seats are not a perfect portrait of the vote. For example, in 2016 Democrats won 49.4% of the two-party popular vote in the House, yet they only won 44.5% of seats. In 2014, Republicans won 53% of the vote yet were awarded 57% of the seats. Way back in the 2006 midterm, Democrats won 55% of the vote but only received 53% of seats. There is a clear bias against Democrats in the House of Representatives. There are multiple explanations for this malapportionment.
First is the obvious explanation: Republicans have, over the past decade, built up control of a majority of state legislatures, allowing them to gerrymander House district maps and give themselves a leg up on Democrats competing for those seats. Be it a racial gerrymander, where the power of minority voters (who are often Democrats) is diluted, or a partisan gerrymander, whereby Republicans draw district maps to explicitly benefit themselves, Democrats face a harsh uphill battle.
Secondly, and perhaps more gravely, Democrats face an “unintentional gerrymander”, a structural disadvantage they may have crafted themselves. The data show that Democrats have become clustered in tight geographic areas, weakening their influence over the inherently geographically controlled House. This is commonly attributed to a phenomenon called sorting, the process by which partisans move to areas where they feel they fit in. For Democrats, this concentrates their electoral power in compact urban areas. Republicans, on the other hand, spread out their influence over massive swaths of sagebrush and pick up a disproportionate number of seats.
It is likely that both of these hypotheses help explain the Democrats’ geographic electoral disadvantage. Whatever the cause, it remains true that the electoral map is heavily tilted towards the GOP, causing big problems for Dems in picking up the right number of seats. We see some evidence of this trouble emerge in 2012, 2014, and 2016. In the graphic above there is a noticeable national overperformance for the GOP. In sum, it is possible that in 2018 the Democratic Party faces such a harsh national map that past indicators won’t be reliable in predicting the Democratic Party’s performance.
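To put numbers on the gap, here is a small sketch that tabulates the vote and seat shares quoted earlier in this step. The 2014 Democratic figures are my assumption: the text gives the Republican side (53% of votes, 57% of seats), so I take the two-party complements.

```python
# (year, Democratic two-party vote %, Democratic seat %), as quoted in the text.
# 2014 is derived as the complement of the Republican figures (assumption).
results = [(2006, 55.0, 53.0), (2014, 47.0, 43.0), (2016, 49.4, 44.5)]


def seat_gap(vote_pct, seat_pct):
    """Seat share minus vote share; negative means fewer seats than votes."""
    return seat_pct - vote_pct


for year, vote, seats in results:
    print(year, f"{seat_gap(vote, seats):+.1f}")
# prints:
# 2006 -2.0
# 2014 -4.0
# 2016 -4.9
```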
To remedy this issue, I employ and improve another of Bafumi et al.’s models that takes into account individual level seat characteristics and combines them with what we estimate about the national environment.
Step 4. Projecting the Democrats’ House seat share — the better way
… by combining individual district-level information with national electoral preferences
Ultimately, we want to use the model to obtain individual forecasts for every congressional district, counting up each party’s seats to get our final projection of who will control the House. To do that we use the following variables in two models used to predict how Democrats have performed in every House election since 1952 (technically not every district election; only ones where two major-party candidates have run against each other are counted):
- Vote share for the district’s previous Democratic candidate
- Vote share for the Democratic presidential candidate last cycle (or two cycles ago if no Democrat ran)
- A variable indicating whether the seat has an incumbent running, or if it’s open, and whether that incumbent is a Democrat or a Republican
- If there is an incumbent running, a variable indicating whether the incumbent is a freshman Representative
- The “national swing,” the difference between Democrats’ projected national two-party vote share and their national two-party vote share last cycle
Because being a freshman only matters if you’re also an incumbent (an open seat can’t have a freshman Representative because it has no incumbent), we need to forecast these categories of seats separately. As such, the resulting two formulas to project seat outcomes are:
- Open seat: Democratic %now = + Preslag % + National Swing %
- Incumbent: Democratic %now = Democratic %lag + Preslag % + Incumbent + Freshman + National Swing %
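As a rough illustration of how the incumbent-seat formula turns inputs into a district projection, here is a sketch with placeholder weights. The fitted coefficients are not published in this post, so every number in `IncumbentCoefs` below is hypothetical, as are the intercept term, the +1/-1 party coding, and the example district. Only the national swing inputs (53.6% projected versus 49.4% in 2016) come from this post.

```python
from dataclasses import dataclass


@dataclass
class IncumbentCoefs:
    """Hypothetical regression weights, not the model's fitted values."""
    intercept: float
    dem_lag: float
    pres_lag: float
    incumbent: float
    freshman: float
    swing: float


def national_swing(projected_share, previous_share):
    """Projected national two-party share minus last cycle's share."""
    return projected_share - previous_share


def predict_incumbent_seat(c, dem_lag, pres_lag, dem_incumbent, freshman, swing):
    """Projected Democratic two-party share in a seat with an incumbent running."""
    party = 1 if dem_incumbent else -1  # assumed coding: +1 Democrat, -1 Republican
    return (c.intercept
            + c.dem_lag * dem_lag
            + c.pres_lag * pres_lag
            + c.incumbent * party
            + c.freshman * party * freshman
            + c.swing * swing)


coefs = IncumbentCoefs(intercept=5.0, dem_lag=0.5, pres_lag=0.4,
                       incumbent=3.0, freshman=1.0, swing=1.0)  # placeholders
swing = national_swing(53.6, 49.4)
# Hypothetical district: Democrat won 45% last time, presidential candidate 48%,
# Republican incumbent who is not a freshman.
print(round(predict_incumbent_seat(coefs, 45.0, 48.0, False, False, swing), 1))
```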
Using election results from 1952 to 2016 kindly made available by political science professor Gary Jacobson (which include over 13,000 district races!) we can predict the outcome of elections for each year using the past two cycles’ election data. That is, if I wanted to predict the 2014 elections, I would use data from 2012 and 2010. The model can predict election outcomes with up to 94% of the variation in results accounted for. Since 2008, the root-mean-square error (a measure of forecasting accuracy) for any House election with an open seat is just 5 percentage points. For incumbent seats, the error is a slightly better 3.8 points.
Running the model retrospectively, a forecast in 2014 would have mispredicted 4 seats. Instead of picking 188 seats for the Democrats, the model predicted a slightly harsher 184. In 2014 this error did not matter much, however, as Democrats had less than a 5% chance of winning either way.
Turning to the current midterm cycle, the forecasting model projects that Democrats will win 202 seats — just 3 more than they did in 2016 — despite a 4 percentage point swing towards them in the popular vote. They’re projected to win 53.6% of the popular vote yet only 45% of the seats. Recall the malapportionment discussion from the third section of this post.
Of course, there is error in our forecasts (as we’ve discussed), and accounting for that error is crucial in producing what we all really want: a probabilistic estimate of how likely it is Democrats take back the majority in the House of Representatives on November 6th, 2018.
Step 5. Simulate the election
The final step is to simulate the election, accounting for uncertainty from our estimates and assessing the probability that the Democrats or Republicans could over/underperform and move the outcome away from what we estimate. The framework for this simulation comes from a common statistical technique used to account for possible error in data called Monte Carlo simulation.
This process has 3 major steps:
- Choose a random national error, according to the root-mean-square error from the formula predicting national vote share on that day
- Choose correlated district error for all 435 congressional districts. This way, we can account for the scenario that if my projections are wrong they will likely err systematically in one direction (either pro-Democrat or pro-Republican).
- Add up the number of seats won by Democrats and record it along with a “win” counter for each seat
I repeat this process many thousands of times, in the end counting up the number of “elections” in which Democrats win at least 218 seats in the House and dividing that by the number of total simulations. That becomes their probability of victory. (I do the same for each seat.)
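The simulation loop can be sketched as below. This is a sketch under stated assumptions, not the model's actual code: errors are drawn from normal distributions, the district errors are made correlated through a shared component with a hypothetical correlation `rho`, and the district projections are made-up numbers. A majority is taken to be 218 of 435 seats.

```python
import numpy as np


def simulate_house(district_means, national_rmse, district_rmse,
                   rho=0.5, n_sims=10_000, seed=0):
    """Return (P(Dem majority), per-seat win probabilities, simulated seat totals)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(district_means, dtype=float)
    n_seats = means.size
    majority = n_seats // 2 + 1  # 218 when there are 435 seats

    # Step 1: one national error per simulated election.
    national = rng.normal(0.0, national_rmse, size=(n_sims, 1))
    # Step 2: district errors that share a common component, so they are correlated.
    common = rng.normal(0.0, 1.0, size=(n_sims, 1))
    own = rng.normal(0.0, 1.0, size=(n_sims, n_seats))
    district = district_rmse * (np.sqrt(rho) * common + np.sqrt(1.0 - rho) * own)

    # Step 3: count Democratic wins in each simulated election.
    dem_wins = (means + national + district) > 50.0
    seats = dem_wins.sum(axis=1)
    return float((seats >= majority).mean()), dem_wins.mean(axis=0), seats


# Made-up district projections, only to exercise the code.
fake_means = np.linspace(25.0, 72.0, 435)
p_majority, seat_probs, seats = simulate_house(fake_means, national_rmse=2.2,
                                               district_rmse=4.0)
print(f"P(Democratic majority) = {p_majority:.1%}; median seats = {int(np.median(seats))}")
```

Passing an array for `district_rmse` (one value per seat) would let open and incumbent seats carry different error sizes.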
The graph above displays a distribution of the possible number of seats the Democratic Party could win, given current national vote share and district characteristics. The red area is where Republicans hold the House. The blue area is where Democrats win a majority. The height of each bar represents how likely that individual outcome is to occur (higher means more likely).
We can also “game this out,” forecasting the different seats Democrats could gain (or lose) given any national vote share we like. Inputting every possibility from 35 percent to 65 percent, we learn that (if current district characteristics hold, and they won’t) the Democratic Party would need the support of 56% of United States voters to have better-than-even odds of winning back the House. Of course, there is uncertainty there: Democrats could clearly win with 55% of the national vote (or they could lose with 57%). Details in the graphic below:
The discrepancy between the percentage of votes won and the percentage of seats won, as seen at any point on the blue line above between 50% and 56% on the x-axis, can be almost entirely attributed to our earlier discussion about the geographic disadvantage Democrats face as a result of gerrymandering, clustering, and sorting.
That’s about it for this post! If you’re sticking around to the three thousandth word, there are a few things you should know:
- The model updates only if there are new generic ballot polls available OR an incumbent has decided not to run for reelection. Both of those things will undoubtedly happen between now and Election Day 2018, but don’t panic if you see no update for a few days. There’s just no reason to change things. As a note, when new seat information does become available, I’ll note it in a post.
- Even when I do run an update without much new information, the model will inherently produce slightly different numbers as a result of statistical randomness in the simulations. Sometimes the probabilistic estimates will move as much as 0.5%! Don’t panic if you see small movements without observable shifts; it’s likely “noise” in the forecast.
- What about district-level polls? Well, congressional ballot polls have been pretty crappy in the past, and the way this model is devised doesn’t necessitate updated district numbers — those would be “baked in,” so to speak, in changes in the national polling.
- I’m not perfect, and modeling takes so much time that I find it impossible that I have not committed at least one error. If I catch a bug in the model, I’ll update this page with the correct information.
- Race ratings, the traditional "Safe Dem", "Likely Dem", "Lean Dem", "Tossup", etc., will be added in future posts. You can find them on the forecast homepage.
- Lastly, come back here frequently! I often re-read these posts and find places where I can clear up some sentence structure or incorporate readers’ feedback. This will only get better with time.
Questions? Comments? Concerns? Please be sure to pass them along in tweet or email form.
- Sep. 10, 2017: The previous version of the model used 0s/100s in place of 2016 vote share in places where no Democrat/Republican ran in 2016. For those previously uncontested districts, the 2016 vote shares have been replaced with 2014 vote shares. This makes little difference overall, causing just a 2% decrease, for example, to Pete Sessions' (TX-32) projected 2018 vote share, but small changes could make a big difference in a tight election.
- Oct. 13, 2017: We made a big update to how the model reports the probability that Democrats will win the House. Read more here.