The lessons of Brexit

Time will tell whether Brexit is a catastrophe for the British, but it is certainly not one for prediction markets.

Admittedly, Bremain was the bookmakers' clear favorite, while the polls were contradictory, with at least 10% of voters still undecided. On the eve of the vote, Hypermind put the probability of Bremain at 75% and that of Brexit at 25%. Given the result, some are questioning the reliability and relevance of the forecasts produced by prediction markets. That is legitimate.

So I will take advantage of what Americans call a "teachable moment" to explain once again what prediction-market forecasts are, what they are not, and why Hypermind's are reliable.

Probabilities, not certainties

One cannot say that Hypermind was "right" about Brexit. But one would have to understand nothing about probabilities to claim, conversely, that Hypermind was "wrong". In fact, the very idea that a probabilistic forecast – a 25% chance – could be validated or invalidated by a single observation is absurd. At the end of my interview in Le Point, I answered precisely the question "If Brexit wins, what conclusions will you draw?" as follows:

Hypermind's forecasts are reliable probabilities, not certainties. Of all the events we estimate to have "only" a 25% chance of occurring, like Brexit today, we can guarantee that roughly one in four will occur, even though it was not the "favorite". Perhaps Brexit will be that one… but there are three chances in four that it won't.

Brexit turned out to be that one… there was one chance in four. If there is an error here, it lies not so much with Hypermind as with those who fail to distinguish between 25% ("unlikely") and 0% ("no chance").

A forecaster's curse

Indeed, it is the peculiar curse of the probabilistic forecaster to know full well that every time an unlikely event comes true – and some must, otherwise the probabilities would be meaningless – he will be loudly blamed for it. Wrongly.

How to assess the reliability of the forecasts

Fine, you may say, but then how do we know whether that 25% probability was correctly estimated? Ideally, we would observe the outcome not of a single referendum but of several dozen: would roughly one in four go to Brexit, and three in four to Bremain? If so, the 25% forecast would be vindicated. But if the results deviated too much from those proportions, we could say the forecast was poor. The match between the forecast probabilities and the percentage of events that actually occur is what is called the "calibration" of the forecasts. The better calibrated the forecaster, the more reliable its probabilities.
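To make the notion concrete, here is a minimal sketch (in Python, using simulated events rather than any actual Hypermind data) of what it means for forecasts of 25% to be well calibrated:

```python
import random

random.seed(42)

# Simulate 200 hypothetical events, each forecast at a 25% probability.
# If the forecaster is well calibrated, roughly one in four should occur.
forecast = 0.25
outcomes = [random.random() < forecast for _ in range(200)]

observed = sum(outcomes) / len(outcomes)
print(f"forecast: {forecast:.0%}, observed frequency: {observed:.0%}")
```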

Unfortunately, there will be no other identical referendums, and every event Hypermind forecasts is unique. So how can we assess the calibration and reliability of the forecasts? The best we can do is to consider all the questions Hypermind has addressed over the past two years, Brexit included: there have been 181 of them, on political, geopolitical, and macroeconomic topics, with 472 possible outcomes. The difficulty varied, of course, but none was trivial, since each was commissioned by at least one sponsor (government, bank, media, etc.) facing some strategic uncertainty.

The results are shown in the chart below. The less the data deviates from the diagonal, the better calibrated the forecasts. It shows that the probabilities generated by Hypermind are reliable overall: events given a 25% chance occur roughly once in four or five. Events given a 50% chance occur half the time. Events estimated at a 90% chance occur nine times out of ten, and fail to occur once in ten, and so on. The correlation is not perfect, but it is quite remarkable. It is hard to do much better.


Calibration of Hypermind's forecasts on 181 questions with 472 possible outcomes (events) over a two-year period. Every day at noon, the estimated probabilities for all outcomes are recorded. Once the results are known, we can compare, at each probability level, the percentage of events forecast with the percentage of events that actually occurred. The size of each data point indicates the number of forecasts recorded at that probability level.


It is perhaps a bit ironic, and certainly counter-intuitive, to realize that the outcome of the Brexit question slightly improved, rather than degraded, Hypermind's overall calibration. It is as if the system had long been waiting for an unlikely event to come true in order to better calibrate its probabilities!

What doesn't kill you makes you stronger

One last lesson to draw is that every confrontation with reality makes the system more reliable, whatever the outcome, because it learns. For every trader who bet against Brexit, at least one other bet on it. Each loser's voice is thereby diminished, because he or she will have less money to wager on the next questions, and therefore less influence on the odds. Conversely, the opinions of those who called it right gain influence, because they now have more money to stake on their forecasts (presumably more astute than everyone else's). The quality of future collective forecasts is thus refined.
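As a rough illustration of this self-correcting mechanism, here is a toy sketch (simplified numbers, not Hypermind's actual market rules) showing how one losing and one winning bet on Brexit shift the balance of play money, and hence of influence, between two traders:

```python
# Toy illustration of wealth-based reweighting (not Hypermind's actual rules).
# Two traders start with equal play-money bankrolls and take opposite sides.
traders = {"bet_on_brexit": 1000.0, "bet_against_brexit": 1000.0}
price = 0.25   # market probability of Brexit when the bets are placed
stake = 400.0  # amount each trader risks

# Brexit occurs: "yes" shares bought at 0.25 pay out 1 each; "no" shares pay 0.
traders["bet_on_brexit"] += stake / price - stake  # net gain on the winning side
traders["bet_against_brexit"] -= stake             # the losing stake is gone

total = sum(traders.values())
for name, wealth in traders.items():
    print(f"{name}: {wealth:.0f} play money, {wealth / total:.0%} of future influence")
```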

Hypermind wins the 2016 Republican nomination race


Last week the Associated Press reported that Donald Trump had finally acquired enough delegates to lock in the GOP nomination. But he is not the only winner of this extraordinary primary season: of all the leading prediction markets, Hypermind was the most accurate by far. It outperformed Betfair, the Iowa Electronic Markets (IEM), and PredictIt, respectively the largest prediction market in the world (based in the UK), the longest-running and the newest US-based political markets.

Figure 1 below details the forecasts of each prediction market starting from January 25, a week before the Iowa primary, and ending on May 3, 2016, on the eve of the Indiana primary which proved fatal to Trump’s last two rivals. (No data is available for the IEM before January 25, so this is also the longest period over which we can compare the performance of all four markets.)


Figure 1 – Probability of winning the GOP presidential nomination for Trump, Cruz, Rubio, or somebody else (Other), according to the four prediction markets, from January 25 to May 3, 2016.

On his way to victory, Trump crushed the hopes of 16 other candidates, and defied the expert forecasts of countless political pundits. However, as Figure 1 shows, even before the first ballot was cast in Iowa, the markets had already anointed Trump the favorite. Then, except for a short week between his Iowa stumble and his New Hampshire comeback in early February, he remained the favorite throughout the campaign until his last rivals finally quit.

Figure 1 also shows that Hypermind was systematically more bullish on Trump than the other markets were, and much less likely to lose confidence and overreact when he stumbled. The contrast is especially vivid in April, when the establishment-fueled fantasy of denying Trump the nomination at a contested convention gained a lot of traction in the other markets, but much less so in Hypermind.

For a quantitative measure of accuracy it is customary to use the Brier score, which sums the squared errors between the predictions and the true outcomes. The smaller the Brier score, the better the prediction: in a 4-way prediction like this one, a perfect prediction has a Brier score of 0, a chance prediction (i.e., 25% for each option) scores 0.75, while a totally wrong prediction scores 2.
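For the record, here is a minimal sketch of that computation (the probabilities below are made up for illustration, not the markets' actual prices):

```python
def brier_score(prediction, winner):
    """Sum of squared errors between predicted probabilities and the 0/1 outcomes."""
    return sum((p - (1.0 if option == winner else 0.0)) ** 2
               for option, p in prediction.items())

# Hypothetical 4-way forecast of the GOP nomination on some given day.
prediction = {"Trump": 0.70, "Cruz": 0.20, "Rubio": 0.05, "Other": 0.05}
print(brier_score(prediction, "Trump"))                     # 0.135, close to perfect
print(brier_score({k: 0.25 for k in prediction}, "Trump"))  # 0.75, the chance score
```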

To get a sense of how accurate the markets were throughout the comparison period, we compute each market’s Brier score on a daily basis. Then we average those daily Brier scores into a mean daily Brier score for each market. The results are plotted in Figure 2: Hypermind was 35% more accurate than Betfair, and 40% more accurate than IEM and PredictIt.


Figure 2 – Mean daily Brier score for each prediction market from January 25 to May 3, 2016. Lower scores mean better accuracy.

It is remarkable that a play-money market like Hypermind could significantly outperform the leading real-money markets on a question that made daily front-page news all over the world for many months. But it is not overly surprising. Consider this:

  1. It isn’t the first time that Hypermind has forecast U.S. elections more accurately than more often-quoted outfits. It did so in the 2014 midterm elections as well (Servan-Schreiber & Atanasov, 2015).
  2. The idea that prediction markets work better when traders must “put their money where their mouth is” is a hard-to-kill cliché that has no basis in fact, as Servan-Schreiber et al. (2004) proved more than a decade ago. Hard currency need not be involved as long as traders risk something that is valuable to them: reputation, status and self-satisfaction will do just fine for many, especially among the smartest. One particular advantage of play-money markets over their real-money counterparts is that they can better match influence with past success: everyone starts at the same level of wealth, and the only way to amass more play money than others, and thus weigh more on the market prices, is to bet successfully. There is less dumb money than in real-money markets.
  3. Hypermind is much more than just a play-money version of Betfair, IEM or PredictIt. Spawned from Lumenogic’s multi-year collaboration with the Good Judgment Project, winner of the IARPA ACE forecasting competition, Hypermind’s sole purpose is to make the best possible predictions, rather than enriching a bookmaker, conducting academic research, or providing entertainment. Its few thousand traders are carefully selected and rewarded (with cash prizes) solely based on actual performance. Good forecasters thrive, while poor forecasters wither and drop out. In this competitive environment, there are no second chances, which makes the Hypermind community an elite bunch, not just any crowd.

References:

 

Hypermind accuracy over its first 18 months

Hypermind was launched in May 2014. The chart below plots the accuracy of its predictions over the 151 questions and 389 outcomes that have expired as of this writing. All the predictions so far have been about politics, geopolitics, macroeconomics, business issues, and some current events. No sports.

To generate this chart, we proceeded as follows. The data was collected daily: every day at Noon we recorded the latest transaction price on each traded outcome and treated it as a probability for this outcome. These observations were then grouped in 20 probability bins: 1-5%, 6-10%, 11-15%, …, 96-99%. Then, we just plotted the average of the probabilities in each bin against the percentage of the outcomes represented in the bin that actually occurred.
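In code, the procedure boils down to something like the following sketch (a simplified reconstruction; the file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical input: one row per (day, outcome), with the noon price expressed
# as a probability and a 0/1 flag recording whether the outcome eventually occurred.
df = pd.read_csv("daily_noon_prices.csv")  # columns: probability, occurred

# Twenty bins: 1-5%, 6-10%, ..., 96-99%.
edges = [0.0] + [i / 100 for i in range(5, 100, 5)] + [1.0]
df["bin"] = pd.cut(df["probability"], bins=edges)

calibration = df.groupby("bin", observed=True).agg(
    mean_forecast=("probability", "mean"),  # x-axis of the chart
    observed_rate=("occurred", "mean"),     # y-axis: share that actually occurred
    n=("probability", "size"),              # point size: number of observations
)
print(calibration)
```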

The market is accurate to the extent that the two numbers are well calibrated, i.e., that the data points are aligned with the chart’s diagonal. In our case the calibration measure (the mean squared deviation between forecasts and observed frequencies) comes out at .001, which corresponds to an average difference of only about 3.3% between the percentage of events actually coming true and the forecast at each level of probability. If we did not know better, we might conclude that reality aligns itself with Hypermind’s predictions.

[Chart: calibration of Hypermind’s forecasts over 151 questions and 389 outcomes]

 

Polls are dead, long live markets


The polling fiasco in the 2015 UK general election is just the latest in a string of high-profile failures over the last few months. This contrasts with the good performance of prediction markets, and Hypermind in particular.

Let’s start with the referendum on Scottish independence in September 2014. In the final weeks before the referendum, the polls consistently announced a cliffhanger, with Yes and No tied within the margin of error. Yet the actual results gave “No” a large majority of 55%, 10 points ahead of “Yes” (45%).

The betting markets on the other hand clearly favored the “No” vote throughout. Witness for instance how the “Yes” vote on Hypermind always stayed below the 50% likelihood threshold, and was given a low probability just before the referendum took place on September 18th.

[Chart: Hypermind’s probability of a “Yes” vote in the Scottish independence referendum]

Then came the midterm congressional elections in the U.S., in November 2014. The big question then was whether the Republicans would recapture control of the Senate, which they did. The polls mostly saw this coming, but were much more timid in their forecasts than the betting markets.

In fact, as discussed earlier in this blog, Hypermind out-predicted all the poll-aggregation models operated by the biggest U.S. media, as well as Nate Silver’s FiveThirtyEight. (Only the Washington Post model ended up out-predicting Hypermind at the very end, but its prediction was all over the place beforehand, as can be seen in the chart below.)

[Chart: probability of Republican control of the Senate, Hypermind vs. the poll-aggregation models]

The Israeli elections in March 2015 again stumped the polls and the pundits. The closer we got to election day, the more Benyamin Netanyahu was given up for dead, politically. The latest polls even predicted that his Likud party would finish 4 seats behind its leftist rival, and pundits pondered how difficult it would be for him to assemble a 61-seat majority coalition in the Knesset. Instead, Likud scored 6 more seats than its closest rival, and Bibi was able to remain prime minister for a 4th term.

What about the betting markets? On the day before the election, while noting that the election was a rare instance of an “actual tossup”, the New York Times also noted that Hypermind was giving Netanyahu a 55% chance of staying prime minister. In fact, Hypermind had clearly kept Netanyahu in the favorite’s seat all through the campaign.

[Chart: Hypermind’s probability that Netanyahu remains prime minister]

Which now brings us to the 2015 UK general election. It concluded yesterday with a big win for David Cameron’s Conservative Party, a hair’s breadth away from an absolute majority in parliament. This was in contrast to all the polling data, which had Labour tied with the Conservatives, both very far from a majority. Based on the poll projections of a hung parliament, the pundits could not see how Cameron could gather a governing coalition, even when adding up UKIP and the LibDems. Everyone gave Labour’s Miliband a much better chance of forming a government, with tacit support from the Scottish National Party. In fact, the polls gave Labour+SNP a clear majority in the House…

The story was different in the betting markets. At worst, Cameron’s chances of forming the next government remained close to 50%, tied with those of Labour’s Miliband, a far cry from the large Labour advantage everyone assumed from the parliamentary arithmetic based on poll projections. On Hypermind, a Cameron rebound even occurred just before election day.

[Chart: Hypermind’s probabilities for who forms the next UK government]

It will take some time to understand why election polls, which had served the media so well for so long, seem to be suddenly experiencing a global meltdown. Perhaps the simple, powerful idea of the “representative panel” just no longer works well when individualism is pushed to the extreme in modern societies…

What is encouraging, though, is that betting markets – an approach that predates polls by decades – are proving more reliable, especially when the going gets tough. This is probably related to the idea, explored earlier in this blog, that predicting human affairs is in general better left to human brains than to algorithms and statistics.

How accurate is the Hypermind prediction market?

Hypermind sells predictions, so the first question that comes up is usually: “how accurate are they?”. We have now accumulated enough data to be able to take a deep look, and the results are very good.

But before we dive in, let’s be clear about what we mean by “accuracy”. Market predictions are typically expressed as probabilities: the market won’t say “Event E will occur”; it will say instead: “There is a 70% chance that event E will occur”. Implicit in that statement is that there is also a 30% chance that event E won’t occur… Which means that any single prediction like this cannot be considered right or wrong, whatever happens.

However, over many predictions, accuracy can be measured as a product of both calibration and discrimination:

Calibration – Predictions are said to be well calibrated when the events deemed more probable do occur more often, and those deemed less probable in fact occur less often. For example, if we consider all the events to which the market ascribed 30% probability, we should observe that 30% of them actually do occur. Similarly, if we consider all the events to which the market ascribed 80% probability, we should observe that 80% of them actually do occur. And so on.

Discrimination – This is a measure of how extreme the predictions are. The closer they are to 0% (absolutely unlikely) or 100% (absolutely likely), the more discriminating they are said to be. Decision makers like predictions that are discriminating because they are more actionable.
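A minimal sketch of how both properties can be quantified over a set of resolved forecasts (illustrative data and thresholds, not Hypermind’s internal code) is shown below:

```python
from collections import defaultdict

# Illustrative resolved forecasts: (predicted probability, did the event occur?).
forecasts = [(0.10, False), (0.30, True), (0.30, False), (0.70, True),
             (0.80, True), (0.90, True), (0.95, True), (0.05, False)]

# Calibration: within each probability bucket, compare forecast with observed frequency.
buckets = defaultdict(list)
for p, occurred in forecasts:
    buckets[round(p, 1)].append(occurred)
for p in sorted(buckets):
    hits = buckets[p]
    print(f"forecast {p:.0%} -> observed {sum(hits) / len(hits):.0%} over {len(hits)} events")

# Discrimination: share of forecasts near the extremes (here, below 20% or above 80%).
extreme = sum(1 for p, _ in forecasts if p < 0.20 or p > 0.80)
print(f"discriminating forecasts: {extreme / len(forecasts):.0%}")
```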


Only God’s predictions could be both perfectly calibrated and perfectly discriminating: events would always be predicted to be 0% likely or 100% likely, and the prediction would always be correct. Barring such perfection, calibration is preferable to discrimination: a fuzzy but generally correct forecast is better than a categorical but misleading one.

HYPERMIND DATA

Let us now turn to Hypermind’s data. The prediction market has been operating since May 16th, 2014 with a panel of a few hundred traders recruited and rewarded based on performance.(1)

At this point, 75 questions of a political, geopolitical, economic, and business nature have been settled: questions about elections in Europe, the U.S., Brazil, Afghanistan and elsewhere, the P5+1 negotiations with Iran over its nuclear program, the war in Ukraine, the GE takeover of Alstom, the ECB stress test, the price of oil, and a whole lot more. The time horizon for the predictions in this data set was in the range of a few days to a few months. All in all, 41,442 trades have been conducted on 196 possible outcomes.

As the chart below illustrates, Hypermind’s predictions are well calibrated. The chart plots the percentage of events that occur at each price level between 1 and 99H (the market’s virtual money). It shows that the prices at which various outcomes are traded on the market can readily be interpreted as realistic probabilities for those outcomes, give or take a few percentage points.

[Calibration chart: percentage of outcomes occurring at each price level from 1 to 99H]

To generate this chart, we recorded the price of each traded outcome every day at 12 Noon, grouped all outcomes traded at the same price and computed the percentage of them that actually occurred. The closer the data points are to the diagonal in the chart, the more the market’s prices predict true probabilities in the real world. Some data points are larger than others to indicate the relative number of outcomes traded at each price level. (The colors, however, are just for show!)

To assess discrimination, it is visually useful to plot the same data at a coarser level by clustering prices in ten intervals of 10H each. As the larger data points include more observations, we can see that most trades occur at price points closer to the extremes, where predictions are more certain, than towards the middle, around 50H, where uncertainty is at its peak.

[Chart: the same calibration data grouped into ten 10H price intervals]

By this measure, Hypermind is also usefully discriminating: For instance, on a daily basis, two thirds of its predictions (64%) indicate outcome probabilities below 20% (very unlikely) or above 80% (very likely). Similarly, 80% of its predictions are either unlikely (below 30%) or likely (above 70%).

COMPARISON POINTS

This analysis shows that Hypermind’s predictions are both accurate and actionable, but it tells us little about the intrinsic difficulty of the questions, or about how well other forecasting methods might have done on those same questions.

Unfortunately for this purpose, only a few of the questions addressed by Hypermind so far have also been systematically forecasted by other methods or venues. That is partly by design, because the value of Hypermind predictions depends as much on their exclusivity as on their accuracy. We would rather focus on important questions that only few – but the right few – care about, than on entertaining issues that everybody else is already forecasting.

A particularly interesting point of comparison is with the Good Judgment Project, a multi-million dollar research project sponsored by the U.S. government’s Intelligence Advanced Research Projects Activity.(2) Since August 2014, Hypermind has been allowed to forecast several dozen of the same geopolitical questions submitted to the Good Judgment forecasters. Based on the score of questions that have closed so far, Hypermind seems to be performing very well. However, there isn’t enough data yet to draw firm conclusions, so this is an issue we will revisit at a later date when more questions have closed.

In the meantime, events like political elections are both important and entertaining, and are widely forecasted. In an earlier post, we documented how Hypermind outperformed all the big-data statistical poll-aggregation models (aka Nate Silver and friends) when predicting the results of the 2014 U.S. midterm elections.

Although the comparative data is still sparse, it clearly suggests that Hypermind exhibits excellent accuracy not so much because the predictions are easy, but because it performs at a best-in-class level.


NOTES

(1) The first few hundred Hypermind traders were recruited based on remarkable performance in various prediction markets operated by NewsFutures and Lumenogic between 2000 and 2014.

(2) Full disclosure: Lumenogic, one of the firms backing Hypermind, has also been a member of the Good Judgment Project research team since 2012. Indeed, some of the prediction market technology used by Hypermind was originally developed by Lumenogic for this purpose.

Hypermind correctly predicted no deal with Iran on nuclear centrifuges

The year-long negotiations with Iran over its nuclear program have failed to reach an agreement by the November 24 deadline. At issue, in particular, was the number of centrifuges that Iran would be allowed to operate to enrich uranium that could eventually be turned into weapons-grade material. It currently operates about 10,000, while the P5+1 countries initially aimed to bring that number below 4,000.

Starting in mid-September, Hypermind ran a prediction market on this question, as part of a geopolitical contest featuring questions formulated by the Intelligence Advanced Research Projects Activity (IARPA) ACE project.

[Chart: Hypermind’s probability of no agreement on Iran’s centrifuges]

As the chart shows, Hypermind’s forecast was correctly dire throughout the negotiations, predicting that no deal would be reached on that critical issue. Only briefly did it dip from around an 80% probability of “no agreement” to 50/50 uncertainty. That dip was caused by reports that the P5+1 countries, growing desperate for a deal, might allow Iran to operate 5,000 or 6,000 centrifuges… But the Hypermind traders quickly resolved that this wouldn’t save the negotiations.

Hypermind out-predicts big-data models in the 2014 U.S. midterm elections

With Intrade gone and the rise of sophisticated statistical models à la FiveThirtyEight operated by various U.S. media, we haven’t heard much about prediction markets during the 2014 U.S. midterm election cycle. It was as if the allure of big data and statistical rock stars like Nate Silver had eclipsed the robust and well-documented success of collective human intelligence. Are prediction markets doomed to be road kill on the big-data superhighway?

Not so fast.

In head-to-head comparisons, the Hypermind prediction market offers evidence that the aggregated brain power of a prediction market can still outpredict the much-hyped statistical machines.

Hypermind listed several stocks on the 2014 U.S. midterm elections, focusing on control of the Senate and the 5 most undecided individual races, in Kansas, Iowa, North Carolina, Colorado, and Georgia. This allows comparisons between Hypermind’s predictions and those of the 7 major statistical models: FiveThirtyEight (Nate Silver), Washington Post, New York Times, Huffington Post, Princeton Election Consortium, PredictWise, and Daily Kos.

In the analysis below we are comparing the predictions of each model against Hypermind, against each other, and against the average prediction of the 7 models. Importantly, we are not just comparing predictions made on election day, but throughout the weeks or months – depending on the question – during which the market and all models were simultaneously spewing predictions.*

Accuracy is measured using Brier scores, which compute the squared error between the predictions and the true outcomes. The smaller the Brier score, the better the prediction: a perfect prediction has a Brier score of 0, while a chance prediction – think 50/50 – has a Brier score of 0.5, and a totally wrong prediction scores 2.

To get a sense of how the methods compared overall, we computed for each question the Brier score of each method every day throughout the comparison period. Then we averaged those daily Brier scores into a mean daily Brier score for each method and each question. Then we averaged those across the 6 questions to get an overall mean daily Brier score for each method.**
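Schematically, the aggregation looks like the sketch below (hypothetical numbers and data layout; the per-day scores would come from the Brier computation just described):

```python
# daily_brier[method][question] holds that method's daily Brier scores over the
# comparison period for that question (hypothetical data, for illustration only).
daily_brier = {
    "Hypermind": {"Senate": [0.08, 0.10, 0.07], "IA": [0.20, 0.15, 0.12]},
    "Model X":   {"Senate": [0.25, 0.22, 0.20], "IA": [0.30, 0.28, 0.26]},
}

def overall_mean_daily_brier(per_question):
    # Average over days within each question, then average across questions.
    question_means = [sum(days) / len(days) for days in per_question.values()]
    return sum(question_means) / len(question_means)

for method, per_question in daily_brier.items():
    print(method, round(overall_mean_daily_brier(per_question), 3))
```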

The chart below plots the results. By this measure, all models except Princeton’s did slightly better than chance, but Hypermind out-predicted all of them, including the average prediction of all the models (“Models Mean”).

[Chart: overall mean daily Brier score for Hypermind and the seven statistical models]

We then took a closer look at these elections’ most important question: would Republicans win control of the Senate? In this case, Hypermind again out-performed all the models, as can be seen in the chart below. Except for the Washington Post’s, all the models remained, throughout the comparison period, much less confident than Hypermind in the Republicans’ ultimate control of the Senate.

[Chart: probability of Republican Senate control over time, Hypermind vs. the models]

The Washington Post model, although more unstable than any other – notice the large dip around 50% from late August to mid-September – did particularly well at the end of the campaign, so Hypermind’s advantage isn’t as visually obvious as it is against the other models. However, if we compute the average daily Brier scores over the entire period during which the Washington Post and Hypermind operated in parallel – from early July to election day – we find a 36% accuracy advantage for Hypermind (.096) over the Washington Post (.150).

There is an important lesson to be learned here: even in this age of big data and super computers, human collective intelligence is still our best means of predicting the future. Isn’t that reassuring?

Notes
(*) The periods of comparison for each question were as follows: Senate Control [Sept. 3 to Nov. 4]; IA, KS, CO, NC [Oct. 9 to Nov. 4]; GA [Oct. 20 to Nov. 4].
(**) Computing mean daily Brier scores over entire forecasting periods, like we do here, is also how the geopolitical predictions of the IARPA-sponsored Good Judgment Project are being scored by the U.S. Government.
Data Sources
Hypermind’s daily closing prices for each contract are available for download in Excel format.
Models’ data were recorded by the New York Times here and here. Available for download courtesy of the Upshot’s Josh Katz.