Les leçons du Brexit

L’avenir nous dira si le Brexit est une catastrophe ou non pour les Anglais, mais en tout cas cela n’en est pas une pour les marchés prédictifs.

Certes, le Bremain était le grand favori des bookmakers, alors que les sondages étaient contradictoires avec encore au moins 10% d’indécis. La veille du scrutin, Hypermind prévoyait 75% de probabilité pour Bremain et 25% de probabilité pour le Brexit. Au vu du résultat, certains s’interrogent sur la fiabilité et la pertinence des prévisions issues des marchés prédictifs. C’est légitime.

Je vais donc profiter de ce que les américains appellent un “teachable moment” pour expliquer à nouveau ce que sont les prévisions d’un marché prédictif, ce qu’elles ne sont pas, et pourquoi celles d’Hypermind sont fiables.

Des probabilités, pas des certitudes

On ne peut pas dire qu’Hypermind ait eu “raison” sur le Brexit. Mais il faut ne rien entendre aux probabilités pour assurer à l’inverse qu’Hypermind s’est “trompé”. En fait, l’idée même qu’une prévision probabiliste – 25% de chances – puisse être validée ou invalidée par une seule observation est absurde.  A la fin de l’interview dans Le Point, je répondais justement à la question “Si le Brexit l’emporte, quelles conclusions en tirerez-vous ?” de la façon suivante :

Les prévisions d’Hypermind sont des probabilités fiables, pas des certitudes. De tous les événements que nous estimons avoir « seulement » 25 % de chances de se réaliser, comme le Brexit aujourd’hui, nous pouvons garantir qu’environ un sur quatre se réalisera, même s’il n’était pas « favori ». Peut-être que le Brexit sera celui-là…, mais il y a trois chances sur quatre que non.

Le Brexit fut donc celui là… il y avait une chance sur quatre. Si erreur il y a, elle n’est donc pas tant du coté d’Hypermind, que du coté de ceux qui ne font pas la différence entre 25% (“peu probable”) et 0% (“aucune chance”).

Malédiction

De fait, c’est la malédiction particulière du prévisionniste probabiliste que de savoir pertinemment qu’à chaque fois qu’un évènement peu probable se réalisera – et il en faut, car sinon la probabilité n’aurait aucun sens – il se le verra bruyamment reproché. A tort.

Comment évaluer la fiabilité des prévisions

D’accord, me direz vous, mais alors comment savoir si la probabilité de 25% était correctement estimée ? Idéalement, il faudrait pouvoir observer le résultat non pas sur un seul référendum mais sur plusieurs dizaines : Est-ce qu’environ un sur quatre donnerait la victoire au Brexit, et trois sur quatre au Bremain ? Si oui, la prévision de 25% serait vérifiée. Mais si les résultats déviaient trop de ces proportions, alors on pourrait dire que la prévision était mauvaise. L’adéquation entre le pourcentage d’évènements prévus et le pourcentage d’événements réalisés est ce que l’on appelle “l’étalonnage” des prévisions. Mieux le prévisionniste est étalonné, plus ses probabilités sont fiables.

Malheureusement, il n’y aura pas d’autres référendums identiques, et chaque évènement traité par Hypermind est unique. Alors comment évaluer l’étalonnage et la fiabilité des prévisions ? Le mieux que l’on puisse faire c’est de considérer l’ensemble des questions traitées par Hypermind depuis deux ans, Brexit compris: il y en a eu 181, sur des sujets politiques, géopolitiques, et macroéconomiques, avec 472 réponses possibles. Les niveaux de difficultés variaient, naturellement, mais aucune n’était triviale, car chacune était commanditée par au moins un sponsor (gouvernent, banque, média, etc.) faisant face à quelque incertitude stratégique.

Les résultats sont illustrés par le graphe ci-dessous. Moins les data dévient de la diagonale, plus les prévisions sont bien étalonnées. On voit que les probabilités générées par Hypermind sont globalement fiables : les évènements auxquels on accorde 25% de chances se réalisent environ une fois sur quatre ou cinq. Les évènements auxquels on accorde 50% de chances se réalisent une fois sur deux. Les évènements estimés à 90% de chances se réalisent neuf fois sur dix, et ne se réalisent pas une fois sur dix, etc. La corrélation n’est pas parfaite, mais elle est très remarquable. Il est difficile de faire beaucoup mieux.

Calib 181 Brexit FR

Étalonnage des prévisions d’Hypermind sur 181 questions avec 472 réponses (évènements) possibles sur une période de deux ans. Chaque jour à midi, les probabilités estimées sur l’ensemble des réponses sont enregistrées. Quand les résultats sont connus, on peut comparer, à chaque niveau de probabilité, l’adéquation des pourcentages d’événements prévus et d’évènements observés. La taille des points indique le nombre de prévisions relevées à chaque niveau de probabilité.

Étalonnage des prévisions d’Hypermind sur 181 questions avec 472 réponses (événements) possibles sur une période de deux ans. Chaque jour à midi, les probabilités estimées sur l’ensemble des réponses sont enregistrées. Quand les résultats sont connus, on peut comparer, à chaque niveau de probabilité, l’adéquation des pourcentages d’évènements prévus et d’évènements réalisés. La taille des points indique le nombre de prévisions relevées à chaque niveau de probabilité.

Il est peut-être un peu ironique, et certainement contre-intuitif, de réaliser que les résultats de la question Brexit ont légèrement amélioré, plutôt que dégradé, l’étalonnage global d’Hypermind. C’est comme si le système attendait depuis longtemps qu’un évènement improbable se réalise afin de mieux étalonner ses probabilités !

Ce qui ne tue pas rend plus fort

Une dernière leçon à tirer est que chaque confrontation à la réalité rend le système plus fiable, quelque soit le résultat, car il apprend. Pour chaque parieur qui s’est positionné contre le Brexit, il y en a au moins un autre qui a parié dessus. La voix de chaque perdant s’en trouve diminuée, car il ou elle aura moins d’argent pour parier sur les questions suivantes, donc moins d’influence sur les cotes. Inversement, les opinions de ceux qui ont vu juste gagnent en influence, car ils ont désormais plus d’argent à miser sur leurs prévisions (a priori plus avisées que celles des autres). La qualité des prévisions collectives à venir est ainsi affinée.

Hypermind wins the 2016 Republican nomination race

trumpwin

Last week the Associated Press reported that Donald Trump had finally acquired enough delegates to lock in the GOP nomination. But he is not the only winner of this extraordinary primary season: of all the leading prediction markets, Hypermind was the most accurate by far. It outperformed Betfair, the Iowa Electronic Markets (IEM), and PredictIt, respectively the largest prediction market in the world (based in the UK), the longest-running and the newest US-based political markets.

Figure 1 below details the forecasts of each prediction market starting from January 25, a week before the Iowa primary, and ending on May 3, 2016, on the eve of the Indiana primary which proved fatal to Trump’s last two rivals. (No data is available for the IEM before January 25, so the this is also the longest period over which we can compare the performance of all four markets.)

panel-markets

Figure 1 – Probability of winning the GOP presidential nomination for Trump, Cruz, Rubio, or somebody else (Other), according to the four prediction markets, from January 25 to May 3, 2016.

On his way to victory, Trump crushed the hopes of 16 other candidates, and defied the expert forecasts of countless political pundits. However, as Figure 1 shows, even before the first ballot was cast in Iowa, the markets had already anointed Trump the favorite. Then, except for a short week between his Iowa stumble and his New Hampshire comeback in early February, he remained the favorite throughout the campaign until his last rivals finally quit.

Figure 1 also shows that Hypermind was systematically more bullish on Trump than the other markets were, and much less likely to lose confidence and overreact when he stumbled. The contrast is especially vivid in April, when the establishment-fueled fantasy of denying Trump the nomination at a contested convention got a lot of traction in all the markets, but much less so in Hypermind.

For a quantitative measure of accuracy it is customary to use the brier score, which sum the squared errors between the predictions and the true outcomes. The smaller the brier score, the better the prediction: in a 4-way prediction like this one, a perfect prediction has a brier score of 0, a chance prediction (i.e., 25% for each option) scores 0.75, while a totally wrong prediction scores 2.

To get a sense of how accurate the markets were throughout the comparison period, we compute each market’s brier score on a daily basis. Then we average those daily brier scores into a mean daily brier score for each market. The results are plotted in Figure 2 : Hypermind was 35% more accurate than Betfair, and 40% more accurate than IEM and PredictIt.

PIBH-brier

Figure 2 – Mean daily brier score for each prediction market from January 25 to May3, 2016. Lower scores mean better accuracy.

It is remarkable that a play-money market like Hypermind could significantly outperform the leading real-money markets on a question that made daily front-page news all over the world for many months. But it is not overly surprising. Consider this:

  1. It isn’t the first time that Hypermind more accurately forecasted U.S. elections than more often-quoted outfits. It did as well in the 2014 midterm elections (Servan-Schreiber & Atanasov, 2015).
  2. The idea that prediction markets work better when traders must “put their money where their mouth is” is a  hard-to-kill cliché that has no basis in fact, as Servan-Schreiber et al. (2004) proved more than a decade ago. Hard currency need not be involved as long as traders risk something that is valuable to them: reputation, status and self-satisfaction will do just fine for many, especially among the smartest. One particular advantage of play-money markets over their real-money counterparts is that they can better match influence with past success: everyone starts at the same level of wealth, and the only way to amass more play money than others, and thus weigh more on the market prices, is to bet successfully. There is less dumb money than in real-money markets.
  3. Hypermind is much more than just a play-money version of Betfair, IEM or PredictIt. Spawned from Lumenogic‘s multi-year collaboration with the Good Judgment Project, winner of the IARPA ACE forecasting competition, Hypermind’s sole purpose is to make the best possible predictions, rather than enriching a bookmaker, conducting academic research, or providing entertainment.  Its few thousand traders are carefully selected and rewarded (with cash prizes) solely based on actual performance. Good forecasters thrive, while poor forecasters whittle and drop out. In this competitive environment, there are no second chances, which makes the Hypermind community an elite bunch, not just any crowd.

References:

 

Live Webinar with Dr. Emile Servan-Schreiber

I was recently invited by our friends at PredictIt to discuss the accuracy and significance of prediction markets and collective intelligence.

During this live 30 min webinar, I dive into why markets, like PredictIt and Hypermind, have the ability to forecast the future by pooling the speculation of many. I go into the types of individuals who thrive in predictions markets, and why diversity and independent thinking is required for accuracy. I also discuss why the possibility of reward and loss promotes more objective and less passionate thinking, enhancing the quality of the opinions that can be aggregated. And more !

 

 

Hypermind accuracy over its first 18 months

Hypermind was launched in May 2014. The chart below plots the accuracy of its predictions over the 151 questions and 389 outcomes that have expired at of this writing. All the predictions so far have been about politics, geopolitics, macroeconomics, business issues, and some current events. No sports.

To generate this chart, we proceeded as follows. The data was collected daily: every day at Noon we recorded the latest transaction price on each traded outcome and treated it as a probability for this outcome. These observations were then grouped in 20 probability bins: 1-5%, 6-10%, 11-15%, …, 96-99%. Then, we just plotted the average of the probabilities in each bin against the percentage of the outcomes represented in the bin that actually occurred.

The market is accurate to the extent that the two numbers are well calibrated, ie., that the data points are aligned with the chart’s diagonal. In our case the measure of calibration is .001, meaning that the average difference between the percentage of events actually coming true and the forecast at each level of probability is only about 3.3%.  If we did not know better, we might conclude that reality aligns itself with Hypermind’s predictions.

calibration 151x5 171215

 

Polls are dead, long live markets

NFCROWD

The polling fiasco in the 2015 UK general election is just the latest in a string of high-profile failures over the last few months. This contrasts with the good performance of prediction markets, and Hypermind in particular.

Let’s start with the referendum on Scottish independence in september 2014. In the final weeks before the referendum, the polls consistently announced a cliffhanger with Yes and No tied within the margin of error. Yet the actual results gave “No” a large majority of 55%, 10 points ahead of “Yes” (45%).

The betting markets on the other hand clearly favored the “No” vote throughout. Witness for instance how the “Yes” vote on Hypermind always stayed below the 50% likelihood threshold, and was given a low probability just before the referendum took place on September 18th.

SCOTLAND

Then came the midterm congressional elections in the U.S., in november 2014. The big question then was whether the Republicans would recapture control of the Senate, which they did. The polls mostly saw this coming, but were much more timid in their forecasts than the betting markets.

In fact, as discussed earlier in this blog, Hypermind out-predicted all the poll-aggregation models operated by the biggest U.S. media, as well as Nate Silver’s FiveThirtyEight. (Only the Washington Post model ended-up out-predicting Hypermind at the very end, but its prediction was all over the place beforehand, as can be seen in the chart below.)

midterms2014Senate

The Israeli elections in March 2015 again stumped the polls and the pundits. The closer we got to election day, the more Benyamin Netanyahu was given up for dead, politically. The latest polls even predicted his Likud party would be 4 seats behind his leftist rival, and considered how difficult it would be for him to assemble a 61 seat majority coalition in the Knesset. Instead, Likud scored 6 more seats than its closest rival, and Bibi was able to remain prime minister for a 4th term.

What about the betting markets ? On the day before the election, while noting that the election was a rare instance of an “actual tossup“, the New York Times also noted that Hypermind was giving Netanyahu 55% chances of staying prime minister. In fact, Hypermind had clearly kept Netanyahu in the favorite seat all through the campaign.

BIBISTAY

Which now brings us to the UK general election 2015. It concluded yesterday with a big win for David Cameron’s Conservative Party, a hair-breadth away from an absolute majority in parliament. This was in contrast to all the polling data which had Labour tied with the Conservatives, both very far from a majority. Based on the poll projections of a hung parliament, the pundits could not see how Cameron could gather a governing coalition, even when adding up Ukip and the LibDems. Everyone gave Labour’s Miliband a much better chance of forming a government, with tacit support from the Scottish Nationalist Party. In fact, the polls gave the Labour+SNP a clear majority in the House…

The story was different in the betting markets. At worst, Cameron’s chances of forming the next government remained close to 50%, tied with Labour’s Miliband’s, a far cry from the large Labour advantage everyone assumed from the parliamentary arithmetic based on poll projections. On Hypermind, a Cameron rebound even occurred just before election day.

UKPRIME-full-en

It will take some time to understand why election polls, which had served the media so well for so long, seem to be suddenly experiencing a global meltdown. Perhaps the simple, powerful idea of the “representative panel” just no longer works well when individualism is pushed to the extreme in modern societies…

What is encouraging, though, is that betting markets – an approach that preexisted polls by decades – are proving more reliable, especially when the going gets tough. This is probably related to the idea, explored earlier in this blog, that predicting human affairs is in general best left to human brains than to algorithms and statistics.

Why you need collective intelligence in the age of big data

(c) Philippe Andrieu - click to visit the artist's website

There’s an old joke about a someone who has lost his car keys and keeps looking for them under a street light, but with no success. After a while, a policeman finally asks why he doesn’t extend his search elsewhere. “Because that’s where the light is,” answers the man.

The current obsession with Big Data is somewhat reminiscent of this so-called “street light effect” – the tendency to look for answers where they are easiest to look for, not most likely to be found.

In fact, whether or not a big-data search party is likely to discover something useful really depends on the kinds of data that are at hand. Computers are really good at processing data that are well structured: digital, clean, explicit and unambiguous. But when the data are unstructured – analog, noisy, implicit or ambiguous – human brains are better at making sense of them.

Whereas a single human brain, or a modest personal computer, may deal with small data sets of the preferred kind, the “bigger” the data is, the more computing power has to be brought to bear. In the case of structured data, bigger computers will come in handy, but in the case of unstructured data – the kind computers can’t properly deal with – there’s also a hard limit on how much computing power a single human brain can deliver. So the best way to make sense of big unstructured data sets is to tap into the collective intelligence of a multitude of brains.

Big Data vs Collective Intelligence

The best kind of computing power to bring to bear on big data depends on the kind of data that has to be processed. Collective intelligence delivers the best performance when dealing with big unstructured data sets.

When the goal is to peer into the future, statistical big-data approaches are especially brittle, because the data at hand are necessarily rooted in the past. That’s ok when what you are trying to forecast is extremely similar to what has already happened – like a mature product line in a stable market – but it breaks down disgracefully when you are dealing with brand new products or disrupted markets.

Here are just a few examples of situations we have encountered where collective forecasting proves superior to data-driven projections:

Disrupted market: When in the mid-2000 the world-wide demand for dairy products suddenly increased three-fold in the space of a few months, after a decade of stability, dairy product producers could not rely any more on their data-driven forecasting models. Instead, they tapped into the collective forecasting insights of their people on the ground, closest to the customers, to better understand and model the new demand drivers.

New products: A few years ago Lumenogic collaborated with a team of marketing researchers to run a prediction market within a Fortune 100 consumer packaged firm, focusing on new products. When compared to the forecasts issued from the classic data-driven methods, the researchers found that the collective forecasts provide superior results in 67% of the cases, reduce average error by approximately 15 percentage points, and reduce the error range by over 40%.

Political elections: In the past 20 years, prediction markets have become famous for their ability to outperform polls as a means to forecast electoral outcomes.  So much so that a skewer of distinguished economists eventually petitioned the U.S. government to legalize political betting for the benefit of society – which it did recently, to some extent, as evidenced by the recent launch of PredictIt. The big-data camp fought back in the form of poll aggregators, as popularized by statistical wizard Nate Silver, and further enriched by other non-poll data sets such as campaign contributions, ad spend, etc. To no avail. In last november’s U.S. Midterm elections, the collective intelligence of Hypermind’s few hundred (elite) traders outperformed all the big data-driven statistical prediction models put forth by major media organizations. That’s because the wisdom of crowds is able to aggregate a lot of information – unstructured data – about what makes each election unique, whereas this data lies out of the reach of statistical algorithm, however sophisticated.

Despite the current and growing flood digital data – the kind computers and algorithms can deal with – we should not lose sight that the world offers magnitudes more unstructured data – the kind only human brains can collectively make sense of.  So if you ever find yourself searching fruitlessly under that big-data street light, remember that collective intelligence may provide just the night goggles you need to extend your search.

How accurate is the Hypermind prediction market?

cristalballHypermind sells predictions, so the first question that comes up is usually: “how accurate are they?”. We have now accumulated enough data to be able to take a deep look, and the results are very good.

But before we dive in, let’s be clear about what we mean by “accuracy”. Market predictions are typically expressed as probabilities : it won’t say “Event E will occur”, it will say instead: “There is a 70% chance that event E will occur”. Implicit in that statement is that there also is a 30% chance that event E won’t occur… Which means that any single prediction like this cannot be considered right or wrong, whatever happens.

However, over many predictions, accuracy can be measured as a product of both calibration and discrimination:

Calibration – Predictions are said to be well calibrated when the events deemed more probable do occur more often, and those deemed less probable in fact occur less often. For example, if we consider all the events to which the market ascribed 30% probability, we should observe that 30% of them actually do occur. Similarly, if we consider all the events to which the market ascribed 80% probability, we should observe that 80% of them actually do occur. And so on.

Discrimination – This is a measure of how extreme the predictions are. The closer they are to 0% (absolutely unlikely) or 100% (absolutely likely), the more discriminating they are said to be. Decision makers like predictions that are discriminating because they are more actionable.

accuracy-en-sqr

Only God’s predictions could be both perfectly calibrated and perfectly discriminating: events would always be predicted to be 0% likely or 100% likely, and the prediction would always be correct. Baring such perfection, calibration is preferable to discrimination: a fuzzy but generally correct forecast is better than a categorical but misleading forecast.

HYPERMIND DATA

Let us now turn to Hypermind’s data. The prediction market has been operating since May 16th, 2014 with a panel of a few hundred traders recruited and rewarded based on performance.(1)

At this point, 75 questions of political, geopolitical, economic and business nature have been settled: questions about elections in Europe, the U.S., Brazil, Afghanistan and elsewhere, the P5+1 negotiations with Iran over its nuclear program, the war in Ukraine, the GE takeover of Alstom, the ECB stress test, the price of oil, and a whole lot more. The time horizon for the predictions in this data set was in the range of a few days to a few months. All in all, 41,442 trades have been conducted on 196 possible outcomes.

As the chart below illustrates, Hypermind’s predictions are well calibrated. The chart plots the percentage of events that occur at each price level between 1 and 99H (the market’s virtual money). It shows that the prices at which various outcomes are traded on the market can readily be interpreted as realistic probabilities for those outcomes, give or take a few percentage points.

caption goes here

To generate this chart, we recorded the price of each traded outcome every day at 12 Noon, grouped all outcomes traded at the same price and computed the percentage of them that actually occurred. The closer the data points are to the diagonal in the chart, the more the market’s prices predict true probabilities in the real world. Some data points are larger than others to indicate the relative number of outcomes traded at each price level. (The colors, however, are just for show!)

To assess discrimination, it is visually useful to plot the same data at a coarser level by clustering prices in ten intervals of 10H each. As the larger data points include more observations, we can see that most trades occur at price points closer to the extremes, where predictions are more certain, than towards the middle, around 50H, where uncertainty is at its peak.

calibration081214-10

By this measure, Hypermind is also usefully discriminating: For instance, on a daily basis, two thirds of its predictions (64%) indicate outcome probabilities below 20% (very unlikely) or above 80% (very likely). Similarly, 80% of its predictions are either unlikely (below 30%) or likely (above 70%).

COMPARISON POINTS

This analysis shows that Hypermind’s predictions are both accurate and actionable, but it tells us little about the intrinsic difficulty of the questions, nor about how well other forecasting methods might have done in comparison on those same questions.

Unfortunately for this purpose, only a few of the questions addressed by Hypermind so far have also been systematically forecasted by other methods or venues. That is partly by design, because the value of Hypermind predictions depends as much on their exclusivity as on their accuracy. We would rather focus on important questions that only few – but the right few – care about, than on entertaining issues that everybody else is already forecasting.

A particularly interesting point of comparison is with the Good Judgment Project, a multi-million dollar research project sponsored by the U.S. government’s Intelligence Advanced Research projects Activity.(2) Since August 2014, Hypermind has been allowed to forecast several dozens of the same geopolitical questions submitted to the Good Judgment forecasters. Based on the score of questions that have closed so far, Hypermind seems to be performing very well. However, there isn’t enough data yet to draw firm conclusions, so this is an issue we will revisit at a later date when more questions have closed.

In the meantime, events like political elections are both important and entertaining, and are widely forecasted. In an earlier post, we documented how Hypermind outperformed all the big-data statistical poll-aggregation models (aka Nate Silver and friends) when predicting the results of the 2014 U.S. midterm elections.

Although the comparative data is still sparse, it clearly suggests that Hypermind exhibits excellent accuracy not so much because the predictions are easy, but because it performs at a best-in-class level.


NOTES

(1) The first few hundred Hypermind traders were recruited based on remarkable performance in various prediction markets operated by NewsFutures and Lumenogic between 2000 and 2014.

(2) Full disclosure: Lumenogic, one of the firms backing Hypermind, has also been a member of the Good Judgment Project research team since 2012. Indeed, some of the prediction market technology used by Hypermind was originally developed by Lumenogic for this purpose.