Lessons from Brexit

 


Time will tell whether or not Brexit is a disaster for the UK, but in any case it is hardly one for prediction markets.

Bremain was certainly the favorite of the bookmakers all along, while polls were inconclusive or wildly fluctuating, with loads of undecideds. The day before the poll, Hypermind gave Bremain a probability of 75%, and Brexit only 25%. In view of the result, some are questioning the reliability and relevance of forecasts from prediction markets. Fair enough.


Probabilities of Brexit and Bremain on Hypermind from June 16th (just before Jo Cox’s murder) to the announcement of the results on June 24. Just before election day (June 23), the probability of Brexit was hovering around 25%.

So let’s take advantage of what Americans call a “teachable moment” to explain again what prediction market forecasts are, what they are not, and why Hypermind’s are particularly reliable.

Probabilities vs certainties

It can’t be said that Hypermind was “right” on Brexit. But to argue that it was “wrong” requires a total disregard for what probabilities mean. In fact, the very idea that a probabilistic forecast – a 25% chance – can be proved right or wrong with a single observation is absurd. At the end of an interview in the French weekly Le Point just two days before the vote, I was asked the question “If Brexit wins, what conclusions will you draw?” Here’s my answer:

Hypermind’s forecasts are accurate probabilities, not certainties. Of all the events that we believe to have “only” a 25% chance of happening, like Brexit today, we can guarantee that about one in four will happen, even if it was not the most likely outcome. Maybe Brexit will be that one … but there is a three in four chance that it won’t be.

Well, Brexit was that one … there was a one in four chance. Only those who make the mistake of confusing 25% (unlikely) with 0% (not a chance) could blame Hypermind.

The Curse

In fact, we probabilistic forecasters must live under a particularly ironic curse: we know full well that whenever an unlikely event happens – and it must, eventually, otherwise probabilities would be meaningless – we will be loudly (but wrongly) criticized.

How to assess the reliability of probabilistic forecasts

But then how do we know if the probability of 25% for Brexit was correctly estimated? Ideally, we would be able to re-run the referendum dozens of times and observe the frequency of Brexit outcomes: if it won about 1 in 4 times, the prediction of 25% likelihood would be validated. Conversely, if the results deviated too much from that 1/4 proportion of Brexit outcomes, we could conclude that the prediction was wrong. The correspondence between the predicted probability of an event and its actual frequency of occurrence is what is called “calibration”. The better calibrated a forecasting system is, the more its probabilities can be trusted.
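The thought experiment above is easy to sketch in code. What follows is only a minimal simulation, assuming each hypothetical re-run is an independent draw with a true Brexit probability of 25%. The function name `simulate_reruns` and the run counts are illustrative choices, not anything Hypermind uses.

```python
import random


def simulate_reruns(p_event: float, n_runs: int, seed: int = 0) -> float:
    """Observed frequency of an event of true probability p_event
    over n_runs independent, hypothetical re-runs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = sum(rng.random() < p_event for _ in range(n_runs))
    return hits / n_runs


# Compare a single run, a few dozen runs, and a very large number of runs.
for n in (1, 40, 10_000):
    print(n, simulate_reruns(0.25, n))
```

With a single run the observed frequency can only be 0 or 1, which is why one outcome can neither validate nor refute a 25% forecast. Only over many runs does the frequency settle near one in four.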

Unfortunately, of course, we can’t ever re-run the referendum, nor any other event predicted by Hypermind. Each one is unique. So how can we measure the reliability of our forecasts? The accepted way of doing this is the next best thing: consider as a group all the questions ever addressed by Hypermind over the past two years, including Brexit. The market forecasted 181 political, geopolitical and macroeconomic questions, with 472 possible outcomes. Some were naturally more difficult to forecast than others, but none was trivial, as each question was sponsored by at least one government, bank, or media organization facing some strategic uncertainty.

The calibration results are illustrated by the graph below. The closer the data points are to the diagonal, the better calibrated the forecasts are. The probabilities generated by Hypermind are generally quite reliable: events that are given about a 25% chance of happening do happen about 20-25% of the time. Events estimated at 50% occur about half the time. Events assigned a probability of 90% occur about nine times out of ten, and fail to occur about one time in ten… The correlation is not perfect, but it is quite remarkable. It’s hard to do much better.


Hypermind forecast calibration over 2 years on 181 questions and 472 possible event outcomes. Every day at noon, the estimated probability of each outcome was recorded. Once all the questions are settled, we can compare, at each level of probability, the percentage of events predicted to occur and the percentage that actually occurred. The size of the data points indicates the number of forecasts recorded at each level of probability.

You will notice that the data also exhibit the so-called “favorite-longshot bias”, a slight S-curve pattern which results from overestimating improbable events and underestimating the more probable ones. Calibration would be better without this systematic distortion at the extremes. It is perhaps a bit ironic to note that the data from the Brexit question went against this pattern and thus helped slightly improve Hypermind’s overall calibration (from .007 to .006). It is as if the occurrence of an unlikely event was long overdue in order to better match predicted probabilities to observed outcomes.

What does not kill you makes you stronger

A final lesson is that every confrontation with reality makes the system more reliable, whatever the outcome, because it learns. For every bettor that took a position against Brexit, there was necessarily at least another that bet on it. Everyone who lost that bet will now have less influence on the odds for future forecasts, since he or she will have less money to bet with. Conversely, the forecasts of those who bet correctly will henceforth weigh more on the consensus, because they have more money than ever to move the market prices. Thus the quality of future collective forecasts continuously improves.
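The learning effect described above can be illustrated with a toy model. This is only a sketch under strong assumptions, not a description of Hypermind’s actual trading mechanism: every trader is assumed to stake in proportion to wealth and conviction (so-called Kelly betting), in which case the market price is the wealth-weighted average of beliefs. The traders, their beliefs and the function names `market_price` and `settle` are all hypothetical.

```python
def market_price(wealth, beliefs):
    """Wealth-weighted average belief, used here as the market probability."""
    total = sum(wealth)
    return sum(w * b for w, b in zip(wealth, beliefs)) / total


def settle(wealth, beliefs, occurred):
    """Redistribute wealth once the event is known.

    Under the Kelly-betting assumption each trader's wealth is multiplied by
    (own probability of what happened) / (market probability of what happened).
    Assumes the market price is strictly between 0 and 1.
    """
    price = market_price(wealth, beliefs)
    if occurred:
        return [w * b / price for w, b in zip(wealth, beliefs)]
    return [w * (1 - b) / (1 - price) for w, b in zip(wealth, beliefs)]


# Three hypothetical traders with equal starting wealth and different
# beliefs about Brexit; together they price it at about 25%.
wealth = [100.0, 100.0, 100.0]
beliefs = [0.10, 0.25, 0.40]

print(market_price(wealth, beliefs))
after = settle(wealth, beliefs, occurred=True)
print(after)                         # wealth shifts toward the trader who rated Brexit highest
print(market_price(after, beliefs))  # same beliefs, but the better forecaster now weighs more
```

In this toy model total wealth is conserved: what the losers give up is exactly what the winners gain, so settlement changes who moves the price rather than how much money is in the market.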

The lessons of Brexit

Time will tell whether or not Brexit is a disaster for the English, but in any case it is not one for prediction markets.

Admittedly, Bremain was the bookmakers’ clear favorite, while the polls were contradictory, with at least 10% of voters still undecided. The day before the vote, Hypermind gave Bremain a 75% probability and Brexit a 25% probability. In view of the result, some are questioning the reliability and relevance of forecasts from prediction markets. That is legitimate.

So I will take advantage of what Americans call a “teachable moment” to explain once again what prediction market forecasts are, what they are not, and why Hypermind’s are reliable.

Probabilities, not certainties

It cannot be said that Hypermind was “right” about Brexit. But one must understand nothing about probabilities to claim, conversely, that Hypermind was “wrong”. In fact, the very idea that a probabilistic forecast – a 25% chance – can be validated or invalidated by a single observation is absurd. At the end of the interview in Le Point, I answered precisely the question “If Brexit wins, what conclusions will you draw?” as follows:

Hypermind’s forecasts are reliable probabilities, not certainties. Of all the events that we estimate to have “only” a 25% chance of occurring, like Brexit today, we can guarantee that about one in four will occur, even if it was not the “favorite”. Maybe Brexit will be that one…, but there is a three in four chance that it won’t.

So Brexit was that one… there was a one in four chance. If there is an error, it lies not so much with Hypermind as with those who fail to distinguish between 25% (“unlikely”) and 0% (“no chance”).

The curse

Indeed, it is the particular curse of the probabilistic forecaster to know full well that every time an unlikely event occurs – and some must, otherwise the probability would be meaningless – he will be loudly blamed for it. Wrongly.

How to assess the reliability of forecasts

Fine, you will say, but then how do we know whether the 25% probability was correctly estimated? Ideally, we would observe the result not of a single referendum but of several dozen: would about one in four hand victory to Brexit, and three in four to Bremain? If so, the 25% forecast would be verified. But if the results deviated too much from these proportions, then we could say that the forecast was poor. The match between the percentage of events predicted and the percentage of events that occurred is what is called the “calibration” of forecasts. The better calibrated the forecaster, the more reliable its probabilities.

Unfortunately, there will be no other identical referendums, and every event handled by Hypermind is unique. So how can we assess the calibration and reliability of the forecasts? The best we can do is to consider all the questions handled by Hypermind over the past two years, Brexit included: there were 181 of them, on political, geopolitical and macroeconomic topics, with 472 possible answers. The levels of difficulty varied, naturally, but none was trivial, since each was commissioned by at least one sponsor (government, bank, media outlet, etc.) facing some strategic uncertainty.

The results are illustrated in the graph below. The less the data deviate from the diagonal, the better calibrated the forecasts are. We can see that the probabilities generated by Hypermind are reliable overall: events given a 25% chance occur about one time in four or five. Events given a 50% chance occur one time in two. Events estimated at a 90% chance occur nine times out of ten and fail to occur one time in ten, and so on. The correlation is not perfect, but it is quite remarkable. It is hard to do much better.


Calibration of Hypermind’s forecasts on 181 questions with 472 possible answers (events) over a two-year period. Every day at noon, the estimated probabilities of all the answers are recorded. Once the results are known, we can compare, at each level of probability, how well the percentage of events predicted matches the percentage of events observed. The size of the points indicates the number of forecasts recorded at each level of probability.


It is perhaps a little ironic, and certainly counterintuitive, to realize that the results of the Brexit question slightly improved, rather than degraded, Hypermind’s overall calibration. It is as if the system had long been waiting for an improbable event to occur in order to better calibrate its probabilities!

What does not kill you makes you stronger

A final lesson is that every confrontation with reality makes the system more reliable, whatever the outcome, because it learns. For every bettor who took a position against Brexit, there was at least one other who bet on it. Each loser’s voice is diminished, since he or she will have less money to bet on subsequent questions, and therefore less influence on the odds. Conversely, the opinions of those who got it right gain influence, since they now have more money to stake on their forecasts (presumably better informed than those of others). The quality of future collective forecasts is thereby refined.

Hypermind wins the 2016 Republican nomination race


Last week the Associated Press reported that Donald Trump had finally acquired enough delegates to lock in the GOP nomination. But he is not the only winner of this extraordinary primary season: of all the leading prediction markets, Hypermind was the most accurate by far. It outperformed Betfair, the Iowa Electronic Markets (IEM), and PredictIt, respectively the largest prediction market in the world (based in the UK), the longest-running and the newest US-based political markets.

Figure 1 below details the forecasts of each prediction market starting from January 25, a week before the Iowa caucuses, and ending on May 3, 2016, on the eve of the Indiana primary which proved fatal to Trump’s last two rivals. (No data is available for the IEM before January 25, so this is also the longest period over which we can compare the performance of all four markets.)


Figure 1 – Probability of winning the GOP presidential nomination for Trump, Cruz, Rubio, or somebody else (Other), according to the four prediction markets, from January 25 to May 3, 2016.

On his way to victory, Trump crushed the hopes of 16 other candidates, and defied the expert forecasts of countless political pundits. However, as Figure 1 shows, even before the first ballot was cast in Iowa, the markets had already anointed Trump the favorite. Then, except for a short week between his Iowa stumble and his New Hampshire comeback in early February, he remained the favorite throughout the campaign until his last rivals finally quit.

Figure 1 also shows that Hypermind was systematically more bullish on Trump than the other markets were, and much less likely to lose confidence and overreact when he stumbled. The contrast is especially vivid in April, when the establishment-fueled fantasy of denying Trump the nomination at a contested convention got a lot of traction in all the markets, but much less so in Hypermind.

For a quantitative measure of accuracy it is customary to use the Brier score, which sums the squared errors between the predictions and the true outcomes. The smaller the Brier score, the better the prediction: in a 4-way prediction like this one, a perfect prediction has a Brier score of 0, a chance prediction (i.e., 25% for each option) scores 0.75, while a totally wrong prediction scores 2.
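As a quick check of those three reference values, here is a minimal sketch of the multi-outcome Brier score as defined above: the sum, over all options, of the squared difference between the forecast probability and the outcome (1 for the option that won, 0 for the others). The function name `brier_score` and the ordering of the options are illustrative assumptions.

```python
def brier_score(forecast, winner_index):
    """Sum of squared errors between forecast probabilities and the outcome.

    forecast: one probability per option (they should sum to 1).
    winner_index: position of the option that actually won.
    """
    return sum(
        (p - (1.0 if i == winner_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


# Options in a fixed order, e.g. Trump, Cruz, Rubio, Other; index 0 wins.
perfect = brier_score([1.0, 0.0, 0.0, 0.0], winner_index=0)
chance = brier_score([0.25, 0.25, 0.25, 0.25], winner_index=0)
totally_wrong = brier_score([0.0, 1.0, 0.0, 0.0], winner_index=0)
print(perfect, chance, totally_wrong)
```

The chance case works out as 0.75² + 3 × 0.25² = 0.75, and the totally wrong case as 1² + 1² = 2, matching the figures in the text.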

To get a sense of how accurate the markets were throughout the comparison period, we compute each market’s Brier score on a daily basis. Then we average those daily Brier scores into a mean daily Brier score for each market. The results are plotted in Figure 2: Hypermind was 35% more accurate than Betfair, and 40% more accurate than IEM and PredictIt.


Figure 2 – Mean daily Brier score for each prediction market from January 25 to May 3, 2016. Lower scores mean better accuracy.
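The averaging step can be sketched as follows. The daily forecasts below are made-up numbers for two hypothetical markets, not the actual data behind Figure 2. Reading “X% more accurate” as the relative reduction in mean daily Brier score is an assumption on our part, as are the function names.

```python
from statistics import mean


def brier_score(forecast, winner_index):
    """Sum of squared errors between forecast probabilities and the outcome."""
    return sum(
        (p - (1.0 if i == winner_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


def mean_daily_brier(daily_forecasts, winner_index):
    """Average of the Brier scores computed on each day's forecast."""
    return mean(brier_score(day, winner_index) for day in daily_forecasts)


def relative_improvement(score_a, score_b):
    """How much lower (better) score_a is than score_b, as a fraction of score_b."""
    return (score_b - score_a) / score_b


# Two days of invented forecasts for two hypothetical markets; option 0 wins.
market_a = [[0.5, 0.3, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]]
market_b = [[0.4, 0.3, 0.2, 0.1], [0.6, 0.2, 0.1, 0.1]]

score_a = mean_daily_brier(market_a, winner_index=0)
score_b = mean_daily_brier(market_b, winner_index=0)
print(score_a, score_b, relative_improvement(score_a, score_b))
```

Averaging daily scores rewards a market for being right early and staying right, not just for its final forecast.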

It is remarkable that a play-money market like Hypermind could significantly outperform the leading real-money markets on a question that made daily front-page news all over the world for many months. But it is not overly surprising. Consider this:

  1. It isn’t the first time that Hypermind has forecasted U.S. elections more accurately than more often-quoted outfits. It also did so in the 2014 midterm elections (Servan-Schreiber & Atanasov, 2015).
  2. The idea that prediction markets work better when traders must “put their money where their mouth is” is a hard-to-kill cliché that has no basis in fact, as Servan-Schreiber et al. (2004) showed more than a decade ago. Hard currency need not be involved as long as traders risk something that is valuable to them: reputation, status and self-satisfaction will do just fine for many, especially among the smartest. One particular advantage of play-money markets over their real-money counterparts is that they can better match influence with past success: everyone starts at the same level of wealth, and the only way to amass more play money than others, and thus weigh more on the market prices, is to bet successfully. There is less dumb money than in real-money markets.
  3. Hypermind is much more than just a play-money version of Betfair, IEM or PredictIt. Spawned from Lumenogic’s multi-year collaboration with the Good Judgment Project, winner of the IARPA ACE forecasting competition, Hypermind’s sole purpose is to make the best possible predictions, rather than enriching a bookmaker, conducting academic research, or providing entertainment. Its few thousand traders are carefully selected and rewarded (with cash prizes) solely based on actual performance. Good forecasters thrive, while poor forecasters wither and drop out. In this competitive environment, there are no second chances, which makes the Hypermind community an elite bunch, not just any crowd.

References:

 

Live Webinar with Dr. Emile Servan-Schreiber

I was recently invited by our friends at PredictIt to discuss the accuracy and significance of prediction markets and collective intelligence.

During this live 30-minute webinar, I dive into why markets like PredictIt and Hypermind have the ability to forecast the future by pooling the speculation of many. I go into the types of individuals who thrive in prediction markets, and why diversity and independent thinking are required for accuracy. I also discuss why the possibility of reward and loss promotes more objective and less passionate thinking, enhancing the quality of the opinions that can be aggregated. And more!

 

 

Hypermind accuracy over its first 18 months

Hypermind was launched in May 2014. The chart below plots the accuracy of its predictions over the 151 questions and 389 outcomes that have expired as of this writing. All the predictions so far have been about politics, geopolitics, macroeconomics, business issues, and some current events. No sports.

To generate this chart, we proceeded as follows. The data was collected daily: every day at noon we recorded the latest transaction price on each traded outcome and treated it as a probability for this outcome. These observations were then grouped into 20 probability bins: 1-5%, 6-10%, 11-15%, …, 96-99%. Then, we just plotted the average of the probabilities in each bin against the percentage of the outcomes represented in the bin that actually occurred.
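The binning procedure just described can be sketched in a few lines. This is a minimal illustration, assuming each daily observation is stored as a pair (price in whole percent from 1 to 99, whether the outcome eventually occurred). The function name `calibration_table` and the sample observations are hypothetical, not Hypermind’s actual code or data.

```python
from collections import defaultdict


def calibration_table(observations, bin_width=5):
    """Group (percent, occurred) observations into bins 1-5, 6-10, ..., 96-99.

    Returns one (mean forecast in %, observed frequency in %, count) row
    per non-empty bin, in increasing order of probability.
    """
    bins = defaultdict(list)
    for pct, occurred in observations:
        bins[(pct - 1) // bin_width].append((pct, occurred))

    table = []
    for index in sorted(bins):
        members = bins[index]
        count = len(members)
        mean_forecast = sum(pct for pct, _ in members) / count
        observed = 100.0 * sum(1 for _, occurred in members if occurred) / count
        table.append((mean_forecast, observed, count))
    return table


# Invented daily observations: four in the 21-25% bin, two in the 86-90% bin.
sample = [(25, True), (24, False), (23, False), (22, False), (90, True), (88, True)]
for row in calibration_table(sample):
    print(row)
```

Each row is one data point of the chart: the first value goes on the horizontal axis, the second on the vertical axis, and the count determines how much weight that point deserves.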

The market is accurate to the extent that the two numbers are well calibrated, i.e., that the data points are aligned with the chart’s diagonal. In our case the measure of calibration is .001, meaning that the average difference between the percentage of events actually coming true and the forecast at each level of probability is only about 3.3%. If we did not know better, we might conclude that reality aligns itself with Hypermind’s predictions.
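The exact formula behind the calibration measure is not spelled out here. One common choice, assumed in the sketch below, is the mean squared gap between forecast and observed frequency across bins, weighted by the number of observations in each bin. Its square root then gives a typical gap in probability terms. The function name `calibration_index` and the two sample bins are illustrative only.

```python
from math import sqrt


def calibration_index(table):
    """Count-weighted mean squared gap between forecast and observed frequency.

    table: rows of (mean forecast, observed frequency, count), with both
    frequencies expressed as probabilities between 0 and 1.
    """
    total = sum(count for _, _, count in table)
    return sum(
        count * (forecast - observed) ** 2
        for forecast, observed, count in table
    ) / total


# Two invented bins of 100 observations each.
sample = [(0.25, 0.20, 100), (0.90, 0.93, 100)]
index = calibration_index(sample)
print(index, sqrt(index))
```

Under this assumed definition, an index of .001 corresponds to a typical gap of roughly three percentage points, which is in line with the figure quoted above.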

calibration 151x5 171215

 

Polls are dead, long live markets


The polling fiasco in the 2015 UK general election is just the latest in a string of high-profile failures over the last few months. This contrasts with the good performance of prediction markets, and Hypermind in particular.

Let’s start with the referendum on Scottish independence in September 2014. In the final weeks before the referendum, the polls consistently announced a cliffhanger with Yes and No tied within the margin of error. Yet the actual results gave “No” a large majority of 55%, 10 points ahead of “Yes” (45%).

The betting markets on the other hand clearly favored the “No” vote throughout. Witness for instance how the “Yes” vote on Hypermind always stayed below the 50% likelihood threshold, and was given a low probability just before the referendum took place on September 18th.

SCOTLAND

Then came the midterm congressional elections in the U.S., in November 2014. The big question then was whether the Republicans would recapture control of the Senate, which they did. The polls mostly saw this coming, but were much more timid in their forecasts than the betting markets.

In fact, as discussed earlier in this blog, Hypermind out-predicted all the poll-aggregation models operated by the biggest U.S. media, as well as Nate Silver’s FiveThirtyEight. (Only the Washington Post model ended up out-predicting Hypermind at the very end, but its prediction was all over the place beforehand, as can be seen in the chart below.)

midterms2014Senate

The Israeli elections in March 2015 again stumped the polls and the pundits. The closer we got to election day, the more Benjamin Netanyahu was given up for dead, politically. The final polls even predicted his Likud party would be 4 seats behind his leftist rival, and considered how difficult it would be for him to assemble a 61-seat majority coalition in the Knesset. Instead, Likud scored 6 more seats than its closest rival, and Bibi was able to remain prime minister for a 4th term.

What about the betting markets? On the day before the election, while noting that the election was a rare instance of an “actual tossup”, the New York Times also noted that Hypermind was giving Netanyahu a 55% chance of staying prime minister. In fact, Hypermind had clearly kept Netanyahu in the favorite seat all through the campaign.

BIBISTAY

Which now brings us to the 2015 UK general election. It concluded yesterday with a big win for David Cameron’s Conservative Party, a hair’s breadth away from an absolute majority in parliament. This was in contrast to all the polling data, which had Labour tied with the Conservatives, both very far from a majority. Based on the poll projections of a hung parliament, the pundits could not see how Cameron could gather a governing coalition, even when adding up Ukip and the LibDems. Everyone gave Labour’s Miliband a much better chance of forming a government, with tacit support from the Scottish National Party. In fact, the polls gave Labour plus the SNP a clear majority in the House…

The story was different in the betting markets. At worst, Cameron’s chances of forming the next government remained close to 50%, tied with Labour’s Miliband’s, a far cry from the large Labour advantage everyone assumed from the parliamentary arithmetic based on poll projections. On Hypermind, a Cameron rebound even occurred just before election day.

UKPRIME-full-en

It will take some time to understand why election polls, which had served the media so well for so long, seem to be suddenly experiencing a global meltdown. Perhaps the simple, powerful idea of the “representative panel” just no longer works well when individualism is pushed to the extreme in modern societies…

What is encouraging, though, is that betting markets – an approach that preexisted polls by decades – are proving more reliable, especially when the going gets tough. This is probably related to the idea, explored earlier in this blog, that predicting human affairs is in general better left to human brains than to algorithms and statistics.

Why you need collective intelligence in the age of big data

(c) Philippe Andrieu

There’s an old joke about someone who has lost his car keys and keeps looking for them under a street light, but with no success. After a while, a policeman finally asks why he doesn’t extend his search elsewhere. “Because that’s where the light is,” answers the man.

The current obsession with Big Data is somewhat reminiscent of this so-called “street light effect” – the tendency to look for answers where they are easiest to look for, not most likely to be found.

In fact, whether or not a big-data search party is likely to discover something useful really depends on the kinds of data that are at hand. Computers are really good at processing data that are well structured: digital, clean, explicit and unambiguous. But when the data are unstructured – analog, noisy, implicit or ambiguous – human brains are better at making sense of them.

Whereas a single human brain, or a modest personal computer, may deal with small data sets of the preferred kind, the “bigger” the data is, the more computing power has to be brought to bear. In the case of structured data, bigger computers will come in handy, but in the case of unstructured data – the kind computers can’t properly deal with – there’s also a hard limit on how much computing power a single human brain can deliver. So the best way to make sense of big unstructured data sets is to tap into the collective intelligence of a multitude of brains.

Big Data vs Collective Intelligence

The best kind of computing power to bring to bear on big data depends on the kind of data that has to be processed. Collective intelligence delivers the best performance when dealing with big unstructured data sets.

When the goal is to peer into the future, statistical big-data approaches are especially brittle, because the data at hand are necessarily rooted in the past. That’s ok when what you are trying to forecast is extremely similar to what has already happened – like a mature product line in a stable market – but it breaks down disgracefully when you are dealing with brand new products or disrupted markets.

Here are just a few examples of situations we have encountered where collective forecasting proves superior to data-driven projections:

Disrupted market: When in the mid-2000s the worldwide demand for dairy products suddenly increased three-fold in the space of a few months, after a decade of stability, dairy producers could no longer rely on their data-driven forecasting models. Instead, they tapped into the collective forecasting insights of their people on the ground, closest to the customers, to better understand and model the new demand drivers.

New products: A few years ago Lumenogic collaborated with a team of marketing researchers to run a prediction market within a Fortune 100 consumer packaged goods firm, focusing on new products. When compared to the forecasts issued from the classic data-driven methods, the researchers found that the collective forecasts provided superior results in 67% of the cases, reduced average error by approximately 15 percentage points, and reduced the error range by over 40%.

Political elections: In the past 20 years, prediction markets have become famous for their ability to outperform polls as a means to forecast electoral outcomes. So much so that a slew of distinguished economists eventually petitioned the U.S. government to legalize political betting for the benefit of society – which it did recently, to some extent, as evidenced by the recent launch of PredictIt. The big-data camp fought back in the form of poll aggregators, as popularized by statistical wizard Nate Silver, and further enriched by other non-poll data sets such as campaign contributions, ad spend, etc. To no avail. In last November’s U.S. midterm elections, the collective intelligence of Hypermind’s few hundred (elite) traders outperformed all the big data-driven statistical prediction models put forth by major media organizations. That’s because the wisdom of crowds is able to aggregate a lot of information – unstructured data – about what makes each election unique, whereas this data lies out of the reach of statistical algorithms, however sophisticated.

Despite the current and growing flood of digital data – the kind computers and algorithms can deal with – we should not lose sight of the fact that the world offers orders of magnitude more unstructured data – the kind only human brains can collectively make sense of. So if you ever find yourself searching fruitlessly under that big-data street light, remember that collective intelligence may provide just the night-vision goggles you need to extend your search.