2020 Presidential Elections Analysis

Cassidy Bargell

View the Project on GitHub cassidybargell/election_analytics


Final Election Prediction


The Model

For my final election prediction I have chosen to use a weighted ensemble that combines generalized linear models built on data from polls, demographics, unemployment rates, and COVID-19 deaths.

The individual models that make up the weighted ensemble are as follows:

The final weighted ensemble is:

Predicted Incumbent Vote Share = (pwt * Poll-Model) + (ewt * Unemploy-Model) + (dwt * Demographic-Model) + (cwt * COVID-Model)

Where pwt, ewt, dwt, and cwt are the weights assigned to each model. The more heavily a model is weighted, the more influence it has over the final prediction produced by the ensemble.
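As a minimal sketch of the formula above, the ensemble for a single state can be written as a weighted sum. The function name, the example inputs, and the check that the weights sum to 1 are assumptions made for illustration; they are not taken from the project code.

```python
def ensemble_vote_share(poll, unemploy, demographic, covid, weights):
    """Combine four model predictions (incumbent vote share, in percent).

    `weights` is a tuple (pwt, ewt, dwt, cwt). Requiring the weights to sum
    to 1 is an assumption of this sketch, so the result stays on the same
    scale as the individual predictions.
    """
    pwt, ewt, dwt, cwt = weights
    if abs(pwt + ewt + dwt + cwt - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return pwt * poll + ewt * unemploy + dwt * demographic + cwt * covid


# Hypothetical predictions for one state (not output from the fitted models).
example_share = ensemble_vote_share(48.0, 52.0, 50.0, 44.0, (0.85, 0.05, 0.05, 0.05))
```

With polls weighted at 0.85, the result stays close to the poll model's prediction even when the other three models disagree with it.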

Below is the distribution of coefficients for each state included in the final ensemble.

Positive coefficients suggest a positive relationship between independent variable and predicted vote share, and the opposite is true for negative coefficients. Therefore the coefficients can be interpreted as follows:

The outlier in demographic coefficients is West Virginia, a solidly red state. The white population in West Virginia has steadily declined since 1992, from ~96% to ~94%, while the state has become increasingly Republican. The coefficient is therefore not very informative about the demographic variable itself; rather, it reflects the rapid increase in Republican vote share in West Virginia, from ~42% in 1992 to ~72% in 2016.

Why include these variables in the weighted ensemble?

I have explored each of these variables in previous weeks, and believe the combination has the potential to accurately capture the complexities of the 2020 election.

Weighting the Models

I explored two ways to weight the models in the ensemble. One method was a somewhat arbitrary choice of weights. The other was weighting based on the root mean squared error of each generalized linear model. Both options are explained further below.

Choice Weights

My first set of weights was somewhat arbitrary, but it reflects what I think should logically receive the most weight in the ensemble. Because the polling data used is from one week or less before the election, the polls should be less variable and in turn more predictive of the actual outcome. Nate Silver, for example, weights his model almost entirely on polls as election day approaches (538).

For this reason, I weighted polls most heavily at 0.85, and weighted the rest of the models equally at 0.05.

This choice of weights predicts a Biden win with 279 electoral college votes over Trump’s 259. This is the same outcome predicted using only polling data (a weight of 1 for polls and 0 for the other variables).

If all models are weighted equally at 0.25, the ensemble predicts a Biden win with 323 electoral college votes over Trump’s 215. When each of the other models is in turn weighted most heavily at 0.85, only the unemployment-heavy ensemble predicts a Trump electoral college victory, with 304 votes to Biden’s 234. The COVID-19 death data, on the other hand, provides the most extreme prediction: a Biden win with 418 electoral college votes.

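The variations of the weighted ensemble using simple choices of weights are below. Each line shows the weights applied to the state-level vote share models and the Trump electoral college total that results: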

Predicted Trump Electoral College Votes = (0.85 * Poll-Model) + (0.05 * Unemploy-Model) + (0.05 * Demographic-Model) + (0.05 * COVID-Model) = 259

Predicted Trump Electoral College Votes = (0.25 * Poll-Model) + (0.25 * Unemploy-Model) + (0.25 * Demographic-Model) + (0.25 * COVID-Model) = 215

Predicted Trump Electoral College Votes = (0.05 * Poll-Model) + (0.85 * Unemploy-Model) + (0.05 * Demographic-Model) + (0.05 * COVID-Model) = 304

Predicted Trump Electoral College Votes = (0.05 * Poll-Model) + (0.05 * Unemploy-Model) + (0.85 * Demographic-Model) + (0.05 * COVID-Model) = 230

Predicted Trump Electoral College Votes = (0.05 * Poll-Model) + (0.05 * Unemploy-Model) + (0.05 * Demographic-Model) + (0.85 * COVID-Model) = 120

Weight by Root Mean Squared Error

The second way I have weighted the models uses root mean squared error (RMSE). RMSE measures the differences between the values predicted by a model and the true values, so here it serves as a measure of each model’s in-sample performance. The smaller the RMSE, the better that model has performed historically.
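For reference, RMSE can be computed as in the short sketch below. The fitted and observed values are made up for illustration and do not come from any of the state models.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical in-sample fitted and observed vote shares for one state.
fitted = [50.0, 47.0, 53.0]
observed = [52.0, 46.0, 51.0]
example_rmse = rmse(fitted, observed)  # errors of -2, 1, 2 give sqrt(9 / 3)
```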

I have therefore weighted the models individually for each state. The weights are inversely proportional to the models’ RMSE. If a model’s RMSE was higher in comparison to the other models for that state, it was weighted less, and if a model had a relatively lower RMSE, it was weighted more heavily in that state’s weighted ensemble. This method allows for dynamic weighting based on state.
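One way to turn per-state RMSE values into weights is sketched below. The text above only says the weights are inversely proportional to RMSE; normalizing them to sum to 1 is an assumption of this sketch, as are the function name and the example RMSE values.

```python
def inverse_rmse_weights(rmse_by_model):
    """Return weights proportional to 1 / RMSE, normalized to sum to 1.

    The normalization is an assumption of this sketch, not something the
    source specifies.
    """
    if any(value <= 0 for value in rmse_by_model.values()):
        raise ValueError("RMSE values must be positive")
    inverse = {name: 1.0 / value for name, value in rmse_by_model.items()}
    total = sum(inverse.values())
    return {name: inv / total for name, inv in inverse.items()}


# Hypothetical RMSE values for the four models in a single state.
state_rmse = {"poll": 2.0, "unemploy": 4.0, "demographic": 4.0, "covid": 1.0}
state_weights = inverse_rmse_weights(state_rmse)
```

In this made-up state the COVID-19 model has the lowest RMSE and so receives half of the total weight, while the two models with the highest RMSE receive one eighth each.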

The distribution of RMSEs is visualized below. ‘Choice weights’ represents the weighted total RMSE produced by the ensembles with polls weighted at 0.85. ‘RMSE-Weights’ is the weighted total RMSE produced by using each model’s RMSE value for varying weights by state.

COVID-19 death models generally have the lowest RMSE values, whereas the economic models generally have higher RMSE values. Therefore, in the RMSE-weighted ensemble, COVID-19 deaths are generally weighted more heavily than economic data.

The weighted total RMSE for each state is also lower using the RMSE weights rather than the Choice weights. This suggests the ensembles that use RMSE values for weights have lower root mean squared errors than the ensembles that weight polls most heavily.

Note that D.C. is not modelled but is assumed to be a guaranteed Democratic win.

The RMSE-weighted ensemble predicts a more secure Biden win, with 368 electoral college votes to Trump’s 170.

I believe the initial choice weighted ensemble (polls at 0.85) and the RMSE-weighted ensemble provide the most promising predictions, so those two will be explored further below.

Prediction Interval

For these two models I have constructed a 95% confidence interval for predicted vote share in each state.

For the choice weighted ensemble, there are 8 states whose confidence intervals include the 50% vote share mark. They are Wisconsin, Virginia, New Hampshire, Nevada, Iowa, Georgia, Florida and Colorado. In this prediction model electoral college votes are determined by winning above 50% of the popular vote share. So, within the 95% confidence interval these states are most likely to “flip” from their predicted party winner to the other.
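The 50% rule and the "could flip" check described above can be sketched as follows. The vote shares and interval bounds are hypothetical placeholders, not the model's output; only the electoral vote counts are the actual 2020 allocations.

```python
def tally_incumbent_votes(vote_share_by_state, electoral_votes, threshold=50.0):
    """Sum electoral votes for states where the predicted incumbent share exceeds the threshold."""
    return sum(
        electoral_votes[state]
        for state, share in vote_share_by_state.items()
        if share > threshold
    )


def could_flip(lower, upper, threshold=50.0):
    """True when a state's interval for incumbent vote share includes the threshold."""
    return lower <= threshold <= upper


# 2020 electoral vote allocations for four states.
electoral_votes = {"Iowa": 6, "Georgia": 16, "Wisconsin": 10, "Texas": 38}

# Hypothetical point predictions and 95% interval bounds (incumbent share, in percent).
predicted_share = {"Iowa": 51.2, "Georgia": 50.4, "Wisconsin": 48.9, "Texas": 54.0}
intervals = {
    "Iowa": (49.0, 53.4),
    "Georgia": (48.1, 52.7),
    "Wisconsin": (46.5, 51.3),
    "Texas": (51.5, 56.5),
}

incumbent_votes = tally_incumbent_votes(predicted_share, electoral_votes)
swing_states = [state for state, (lo, hi) in intervals.items() if could_flip(lo, hi)]
```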

To illustrate this point, a prediction interval can be created using the upper and lower bounds of the confidence intervals. In the choice weighted ensemble, this means that for the lower bound of the prediction interval, Iowa, Georgia and Florida flip blue. The opposite would happen for the upper bound of the prediction interval.

For my choice weighted ensemble, the prediction interval for Trump electoral college votes would range from 208 to 301. For Biden, this translates to a range of 237 to 330.
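The arithmetic behind this interval can be checked with the 2020 electoral vote allocations. The sketch assumes that the five remaining swing states (Wisconsin, Virginia, New Hampshire, Nevada and Colorado) are the ones projected for Biden that flip to Trump at the upper bound; the text names only the three states that flip blue at the lower bound, so that split is an inference, though it reproduces the stated range.

```python
TOTAL_ELECTORAL_VOTES = 538
TRUMP_POINT_ESTIMATE = 259  # choice weighted ensemble, polls at 0.85

# States named in the text as flipping blue at the lower bound.
flip_to_biden = {"Iowa": 6, "Georgia": 16, "Florida": 29}

# Assumed to flip red at the upper bound (an inference, see lead-in).
flip_to_trump = {
    "Wisconsin": 10,
    "Virginia": 13,
    "New Hampshire": 4,
    "Nevada": 6,
    "Colorado": 9,
}

trump_low = TRUMP_POINT_ESTIMATE - sum(flip_to_biden.values())
trump_high = TRUMP_POINT_ESTIMATE + sum(flip_to_trump.values())

# Biden's range is the complement of Trump's out of 538.
biden_low = TOTAL_ELECTORAL_VOTES - trump_high
biden_high = TOTAL_ELECTORAL_VOTES - trump_low
```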

This means that within my 95% confidence intervals there is a path for Trump to win, but it would require flipping states currently projected to vote blue.

A prediction interval can also be constructed for the RMSE-weighted ensemble using 95% confidence intervals for each state.

Seven states from this model have confidence intervals that include the 50% vote share tipping point. They are Wisconsin, Texas, Ohio, Iowa, Georgia, Florida and Arizona.

The prediction interval for Trump from this model is constructed in the same way, using the lower and upper bounds of the state confidence intervals, which in turn flip these potential swing states one way or the other. The prediction interval for Trump electoral college votes would therefore range from 126 to 254. For Biden this would be from 284 to 412. Although this suggests that a Trump win would be highly unlikely, the interval is built from 95% confidence intervals, so a Trump win is not impossible.

Final Point Estimate of Electoral College

Exploring both of these weighted ensembles has been helpful in understanding what factors might pull the election one way or another. For my final point prediction I have chosen to use the RMSE-weighted model. Although this model predicts a much more extreme Biden win, I think that weighting states dynamically is valuable. I believe it better reflects how different states are influenced by, and best predicted by, different variables.

Polls may be over-represented in this model, as both the polling and COVID-19 models rely heavily on 2020 polls. Even so, given the uniqueness of this election in terms of economic fundamentals and other shocks, I believe polling data is the strongest method for understanding public opinion in 2020.

In that case, I predict a Biden victory, with a point estimate of 368 electoral college votes and a prediction interval of 284 to 412.

In turn, I predict Trump to receive 170 electoral college votes, with a prediction interval of 126 to 254.

There is of course uncertainty in this model, and a Biden victory is not necessarily guaranteed, although I am predicting it to be highly likely.

Thanks to Alison Hu for collaboration and help in building my final prediction model.