Data Whisperers

by Taylor Saunders and Annette Przygoda

Introduction

On May 13, 2013, one day before the provincial elections in British Columbia (B.C.), all major polling companies projected a landslide win for the NDP. While media reports showed a slight tightening of the race in the last two weeks of the campaign, as of 24 hours before the elections, pollsters predicted the NDP's share of the popular vote to be between 41% and 46%, compared to 31% to 37% for the B.C. Liberals. On election day, the Liberals won convincingly, taking 44.1% of the popular vote compared to 39.7% for the NDP.

Election observers, the media and voters were stunned. What happened? On average, polling companies were off by 18 percentage points with their predictions. Forum Research came closest, with an 8 point difference, but still predicted a win for the NDP that never materialized. Big-name polling companies like Ipsos, Angus Reid and EKOS were off by 15 to 20 percentage points.

In the immediate aftermath of the election, this stunning shift made for a lot of good media stories that were mostly focussed on Christy Clark as the “Comeback Kid”. Major news outlets reported on the particular strength of Clark’s last days of the campaign or discussed the fact that her negative campaign appeared to have worked. Other reports focussed on the failure of the NDP to capitalize on their lead in the polls. Discussions of the accuracy of polls and predictions were rare. Where they did happen, almost all reports attributed the dismal performance of the predictions to a massive election-day opinion shift among voters.

These reports were bolstered further by the early reactions of the pollsters themselves. Those reactions included statements like Angus Reid’s assertion that the days “of co-operative respondents who want to tell pollsters what they think and of good citizens who show up to vote” are over. Essentially, the polling companies wanted us to believe that predicting people’s actions is impossible, that people change their opinions and political affiliations like they change their underwear.

If that were true, millions of statisticians and researchers worldwide would be out of a job. Rather than collectively waving the white flag and giving up on our job of understanding and predicting behaviour, we suggest that there are ways to explain what happened and why the election predictions failed so spectacularly. The explanation lies in flawed polling methodology: in particular, the lack of a “likely voter” model and the failure to adequately account for undecided voters in the predictions.

Standing On The Shoulders Of Giants

Some people believe that election polling can be done without having much in-depth subject matter expertise about voting behaviour or politics; that it is purely about crunching some numbers. When looking at successful polling and predictions, we quickly see that that is not true. Polling relies on identifying not only what party or candidate someone would vote for in a given election but also whether someone will actually turn out on election day to cast a ballot. For example, major polling companies in the USA all employ methods to identify “likely voters” in their models to predict election outcomes. Some pollsters make it as simple as using projected turnout as a variable in their models which implies that voters across all parties have the same likelihood of showing up on election day. Others go as far as including an entire battery of questions to identify which voters are more likely to show up than others. A closer look at those questions reveals that almost all US pollsters collect data on:

Eligibility to vote
Registered to vote
Past voting behaviour
Level of knowledge about the electoral process
Level of interest in the current political campaign
Actual intention to vote in current elections
Party affiliation
Demographic information

The above is a list we could easily find in any first-year Political Science textbook on voting behaviour and elections. This is where subject matter expertise comes in. Electoral polling doesn’t happen in an isolation chamber without context. In fact, it builds on five decades of existing research dedicated to understanding voting behaviour. In this field, early theories and research focussed heavily on aspects like longstanding political loyalties that are often based on socio-economic factors (which is why it is still important to collect demographic information and ask about existing party affiliation). Over time, research and theories shifted towards explanations that view voting as a rational choice made by informed individuals who weigh the pros and cons of each party and candidate against their own interests. These theories are reflected in the questions about someone’s level of knowledge of the electoral process and their interest in the current campaign. Contemporary theories developed by Political Scientists also take into account institutional factors like systemic barriers to voting, which is why it is important to check whether poll respondents are eligible and registered to vote, and to ask about past voting behaviour.

In addition to determining the likelihood of voting, the list above and the underlying theories can also be used to better understand undecided voters and incorporate that knowledge into our predictions. For example, someone might be undecided at the time of a poll, maybe early in the campaign, but have a history of consistently voting for the same party. Another respondent might identify as undecided, score low on general interest in the current campaign and state that they do not know where the polling place in their district is. If we know these things, we can more accurately estimate not only the likelihood that someone will vote at all, but also how undecided voters should factor into the model driving our predictions of the eventual election result.
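To make the idea concrete, here is a minimal sketch of how past voting behaviour could be used to reassign some undecided respondents. It is an illustration only, not the method of any pollster mentioned here. The field names `current_choice` and `past_vote`, the function `allocate_undecided` and the four sample respondents are all hypothetical, and the rule "an undecided respondent leans towards the party they voted for last time" is a simplifying assumption.

```python
from collections import Counter


def allocate_undecided(respondents):
    """Return estimated vote shares after reassigning undecided respondents
    to the party they report having voted for previously, when known."""
    counts = Counter()
    for r in respondents:
        choice = r["current_choice"]
        # Assumption: a known past vote is treated as the respondent's lean.
        if choice == "undecided" and r.get("past_vote"):
            choice = r["past_vote"]
        counts[choice] += 1
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()}


# Hypothetical respondents, invented purely for illustration.
sample = [
    {"current_choice": "NDP", "past_vote": "NDP"},
    {"current_choice": "Liberal", "past_vote": "Liberal"},
    {"current_choice": "undecided", "past_vote": "Liberal"},
    {"current_choice": "undecided", "past_vote": None},
]

shares = allocate_undecided(sample)
for party, share in shares.items():
    print(f"{party}: {share:.0%}")
```

In this toy sample, only the respondent with no voting history remains in the undecided group. A real model would weigh voting history against the other factors in the list rather than apply a single hard rule.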

We see five decades of Political Science reflected in questions used to predict voting behaviour. We also see carefully worded questions that likely reflect years of experience in survey research. For example, Gallup’s question to gauge voting intent sounds like this: “I'd like you to rate your chances of voting in November's election for president on a scale of 1 to 10. If 1 represents someone who definitely will not vote and 10 represents someone who definitely will vote, where on this scale of 1 to 10 would you place yourself?” AP-IPSOS used a similar question: “On November 2nd, the election for President will be held. Using a 1-to-10 scale, where 10 means you are completely certain you will vote and 1 means you are completely certain you will NOT vote, how likely are you to vote in the upcoming presidential election? You can use any number between 1 and 10, to indicate how strongly you feel about your likelihood to vote.” RAND’s question was a bit shorter, but no less concrete: “What is the percent chance that you will vote in the Presidential election?”

Those questions might seem long and clunky, but they represent the key rules for crafting survey questions. Good questions need to be clear, concise, and measure only one thing at a time. If the idea is to measure how likely it is for someone to vote on election day, the question should be about the likelihood to vote on election day. Granted, there are some areas where asking a direct question is not the preferred scenario. For example, survey respondents are unlikely to provide honest answers when asked about criminal activities or behaviour that is generally deemed socially unacceptable or undesirable. This is why some pollsters phrase their question about past voting behaviour like this: “Sometimes things come up and people are not able to vote. In the 2000 election for President, did you happen to vote?” The wording is still clear and measures only one thing, but provides respondents with an option to save face if they did not vote in the past. Sometimes things come up.
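The 1-to-10 likelihood questions quoted above lend themselves to a simple turnout weighting. The sketch below shows one way such answers could feed into a prediction. It assumes a linear mapping from the 1-to-10 rating to a probability of voting, which is our simplification and not the documented method of Gallup, AP-IPSOS or RAND. The function name `turnout_weighted_shares`, the field names and the sample data are hypothetical.

```python
def turnout_weighted_shares(respondents):
    """Estimate vote shares, weighting each respondent's party choice by a
    turnout probability derived from a 1-to-10 self-rated likelihood."""
    weights = {}
    for r in respondents:
        # Assumption: map the rating linearly, so 1 -> 0.0 and 10 -> 1.0.
        p_vote = (r["likelihood"] - 1) / 9
        weights[r["party"]] = weights.get(r["party"], 0.0) + p_vote
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no respondent is expected to vote")
    return {party: w / total for party, w in weights.items()}


# Hypothetical respondents: stated support is tied at two each, but one
# NDP supporter says they will definitely not vote.
sample = [
    {"party": "NDP", "likelihood": 10},
    {"party": "NDP", "likelihood": 1},
    {"party": "Liberal", "likelihood": 10},
    {"party": "Liberal", "likelihood": 10},
]

weighted = turnout_weighted_shares(sample)
for party, share in weighted.items():
    print(f"{party}: {share:.1%}")
```

In this invented sample, stated support is evenly split, yet the turnout-weighted estimate favours the Liberals two to one. Without a turnout question, that difference is invisible.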

What is clear when looking at political polling in the US is that polling companies invest large amounts of time and expertise in crafting their questions and dealing with the data they derive from it. That doesn’t mean that they are infallible. In fact, many pollsters in the US got chided for not accurately predicting a clear victory for Barack Obama in the 2012 elections. However, those pollsters were a lot more willing to discuss their shortcomings and methodological challenges in the aftermath of that election. In British Columbia, we saw an entirely different approach to both the polling itself and to the discussions of their failure to accurately predict the election results.

Likely and Undecided Voters in British Columbia

We know that asking survey respondents whether they will actually make the effort to show up on election day is a crucial piece in the puzzle of predicting election results. We looked at the three major polling companies that provided polls and predictions leading up to the 2013 BC election and examined their approaches to determining who is a “likely voter”.[1] Interestingly, none of them asked respondents a direct question about their intention to vote on election day.

In the four weeks leading up to the 2013 British Columbia provincial election, Ipsos Reid conducted four separate polls of BC voters. Each poll consisted of an online panel survey in which a sample of 800 or more adults was asked to indicate which political party they intend to vote for. Specifically, IPSOS asked respondents the following: “Thinking of how you feel right now, if a provincial election were held tomorrow here in BC, which of the following parties candidates would you be most likely to support, or lean towards?” Similar questions were asked by EKOS and Angus Reid in their polls leading up to the election. At first glance, this might seem appropriate as a measure of voter intent. However, when comparing this approach to the approach taken by US polling companies and to the approaches supported by existing research, the following issues stand out:

The questions used by all three companies get at party or candidate affiliation, but do not directly determine whether a survey respondent is a) eligible to vote and b) determined to show up on election day.
Only EKOS appears to also ask about respondents’ past voting behaviour, but, interestingly, the question focuses on voting in past federal elections. This is problematic because voter turnout generally tends to be higher in federal elections than in provincial elections. Someone may well be inclined to vote federally yet have no history of, or intention of, participating in provincial elections.
All three polling companies used questions that included the wording "if the election were held tomorrow....", or a variation thereof. Research on voting behaviour has shown that this wording doesn't accurately capture voter intent. In fact, in surveys where respondents are asked two questions, one about a hypothetical election day "tomorrow" and one about the actual election day, the results often differ to some extent.

It is unclear, based on the limited information that is publicly available, what other sources of information the three polling companies use. They may or may not include information on demographics, knowledge about the electoral process or other factors. What is clear from the available information is that all three companies failed to ask a question similar to those used by all major US polling companies that focuses solely on someone’s likelihood to turn out. Lacking this information, polling companies in British Columbia would have had a difficult time estimating voter turnout and determining the extent to which stated NDP or Liberal supporters were actually planning to or able to cast a ballot on election day.

What about undecided voters? In a way, this is a related issue. As we have mentioned above, US pollsters use the information they collect about likely voters and voting intention to predict voting behaviour on election day. Granted, for each election, there will be a group of truly undecided voters or swing voters with no clear party affiliation. However, for a certain percentage of the polled population, it will be possible to use past voting behaviour to predict with some level of certainty whether and how an undecided voter will vote on election day.

Looking at the 2013 British Columbia provincial election, we found that all three polling companies essentially kept undecided voters in a separate group, treating all of them as true swing voters that could jump to any of the parties on election day. Using IPSOS polls as an example, the graph below demonstrates that as time passed between the three polling cycles used by IPSOS, the share of undecided voters decreased steadily, while the share of Liberal supporters grew.

The data is clearly telling a story here that did not make it into the predictions or the media reports prior to the elections. Undecided voters in this scenario were likely not truly undecided or swing voters, but Liberal leaning potential voters. Furthermore, the graph shows a trend over time that not only refutes the stories of a very large NDP lead (note the margin of error shown in the graph), but also provides proof that the final election result was not due to a last minute opinion swing (we will discuss the margin of error and missed story in the data in much more detail in our next post). For now, it is important to note here that BC pollsters were not able to predict the behaviour of undecided voters, and we suggest that this is because they did not collect the right information on voting intent, voting history and other factors that have been shown in the research to greatly influence voting behaviour.
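As a rough guide to what a margin of error means for a sample of this size, the sketch below applies the textbook formula for a simple random sample. This is an approximation under stated assumptions: the polls discussed here were online panel surveys, for which the formula does not strictly apply, and the 45% and 37% support figures are illustrative values chosen from within the ranges quoted in the introduction, not results from any specific poll. The function names are our own.

```python
import math


def margin_of_error(p, n, z=1.96):
    """95% margin of error for one proportion p from a simple random
    sample of size n (z = 1.96 is the usual normal critical value)."""
    return z * math.sqrt(p * (1 - p) / n)


def lead_margin_of_error(p1, p2, n, z=1.96):
    """95% margin of error for the lead p1 - p2 when both shares come
    from the same sample, which makes them negatively correlated."""
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * math.sqrt(variance)


n = 800  # sample size of 800 adults, as described for the Ipsos polls
moe = margin_of_error(0.5, n)  # worst case, p = 0.5
lead_moe = lead_margin_of_error(0.45, 0.37, n)  # illustrative shares

print(f"Single share: +/- {moe * 100:.1f} points")
print(f"Lead between two parties: +/- {lead_moe * 100:.1f} points")
```

The uncertainty on the lead between two parties is nearly twice the uncertainty on a single party's share. An eight-point lead in a sample of 800 is therefore far less secure than the headline margin of error suggests.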

Can We Trust The Polls?

In the immediate aftermath of the election, reactions to the inability of the polls to predict the results in British Columbia focused on one thing: a massive overnight opinion swing. The day after the election, in an interview with the Globe and Mail, Angus Reid said “First of all, I don't think the polls were wrong”. Instead, he noted that pollsters simply missed the late Liberal surge, a.k.a. massive overnight opinion swing. Later in the interview, he stated that "The amount of effort really required to do this properly probably exceeds the budget these days of media organizations that are used to paying nothing for polls."

Reid’s statements highlight a key problem. Accurate polling requires effort: the effort to know the research, the effort to ask the right questions and the effort to listen to the story in the data using appropriate methodologies.

When looking at the British Columbia polls leading up to the election, we can see that people didn’t change their minds overnight. The trend of Liberal-leaning “undecided” voters supporting Christy Clark’s government was there starting in early April, if not sooner. What is not as visible in the numbers themselves, but glaringly clear in the various methodologies of the polling companies, is that the pollsters failed to actually ask their respondents about their likelihood of showing up on election day. The combination of methodology and Liberal-leaning “undecided” voters led to a scenario where the predictions not only failed, but ended up misleading the public.

Electoral polling has consequences. Telling NDP supporters over months that their party will win the election decisively might have convinced NDP leaning voters to not turn out on election day. We all know that it is highly unlikely that our own one ballot will decide an election (a fact that Political Scientists call the “voting paradox”), so maybe some people didn’t see the need to invest time and effort to vote if the result is already clear ahead of time. Additionally, strategic voting may have been influenced by the polls. Assuming the NDP win was a foregone conclusion might have led some NDP supporters to cast their ballot for the Green party in their district. When asked about his polls, Mr. Reid said “I thought it was really a marvellous example of maybe polling at its finest, in the sense that we and Ipsos and others were saying to the province of British Columbia in the days leading up to the final vote that there was an NDP train coming down the track, and that obviously got the attention of a lot of people and may have actually lulled some of the NDP supporters into thinking it was a fait accompli.” Polling has consequences.

Many predictions fail. In fact, the whole premise of statistical analysis is that we produce estimates that are never 100% correct. Instead, we use past and present data to detect trends and to estimate a level of confidence that we have about these trends holding true in the future. It is not a bad thing when predictions fail. It should prompt us to examine our work and adjust approaches where necessary, rather than assert that people and behaviours are unpredictable. It is our job to learn from failed predictions and to improve upon our methods. And that means that we need an open and transparent discussion about our methodologies and assumptions, as well as a willingness to connect with others who might be doing similar work. Polling in British Columbia does not exist in a bubble. We have opportunities to compare our approaches to those of pollsters in other countries. We also have opportunities to learn and expand upon our own subject matter expertise, instead of assuming that it is all just a numbers game. It is not.

Accurate polling requires effort. Indeed!

[1] It should be stated clearly here that the three companies do not necessarily make all of their methodology publicly available. However, all three companies posted information on their survey questions, their samples and overall methodology. As such, it is possible to examine their general approaches, even when a detailed discussion of their modelling is missing.

Data Confusion

Wednesday, 31 July 2013

Why It Matters To Ask The Right Questions – Lessons From Polling and Survey Research
