We live in an era in which human judgment is rapidly being replaced by artificial intelligence. But forecasting geopolitical and economic events is far more difficult than winning at Jeopardy or Go.
In a US-government-sponsored competition, Superforecasters took on three competing research teams, each with millions of dollars of funding, who built hybrid forecasting systems to combine statistical models, automated tools, and judgments from over 1,000 human forecasters. 187 forecasting questions later, the results were clear.
The Superforecasters were 20% more accurate than the closest competitor and 21% more accurate than the control group.
They achieved this impressive victory by being “justifiably confident but appropriately humble,” as we explain further below.
It’s one thing to be “on the right side of maybe.” It’s much more challenging, and much more useful, to provide a confident signal about which outcomes will and will not occur.
The Superforecasters showed that form of “justifiable confidence” in the Hybrid Forecasting Competition (HFC). One measure that decision scientists often use is called d’ (d-prime), which can be defined as the difference between the mean forecast when events occur and the mean forecast when events do not occur. For Good Judgment’s Superforecasters, d’ is 0.575 (their mean forecast was 70.9% when events occurred and only 13.5% when events did not occur). For the best-performing HFC competitor, d’ is 0.429 (61.1% when events occurred vs. 18.2% when events did not occur). For the baseline, d’ is 0.54 (68.7% when events occurred and 14.6% when events did not occur).
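To make the d’ definition above concrete, here is a minimal Python sketch of the calculation. The function name `mean_forecast_gap` and the sample forecasts are hypothetical illustrations, not the HFC data or Good Judgment’s own code; the last lines simply redo the subtraction for the best competitor using the two means quoted above.

```python
from statistics import mean


def mean_forecast_gap(forecasts, outcomes):
    """Return d' as defined in the text: the mean forecast for events that
    occurred minus the mean forecast for events that did not occur.

    forecasts: probabilities between 0 and 1
    outcomes:  1 (or True) if the event occurred, 0 (or False) otherwise
    """
    occurred = [f for f, o in zip(forecasts, outcomes) if o]
    did_not_occur = [f for f, o in zip(forecasts, outcomes) if not o]
    if not occurred or not did_not_occur:
        raise ValueError("need at least one event of each kind")
    return mean(occurred) - mean(did_not_occur)


# Hypothetical forecasts, invented for illustration only.
forecasts = [0.9, 0.7, 0.6, 0.2, 0.1, 0.05]
outcomes = [1, 1, 1, 0, 0, 0]
gap = mean_forecast_gap(forecasts, outcomes)
print(f"d' for the sample data: {gap:.3f}")

# Arithmetic check using the means reported for the best competitor.
best_competitor_gap = 0.611 - 0.182
print(f"d' for the best competitor: {best_competitor_gap:.3f}")
```

A larger gap means the forecaster gave clearly higher probabilities to events that happened than to events that did not, which is the “confident signal” described above.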
The graph here shows calibration curves for the Superforecasters vs. the best competitor and the baseline group. Forecasters who are “appropriately humble” place their forecasts as close to the diagonal line as possible. For example, across all questions, when they forecast a 40% probability, those possible outcomes occur 40% of the time.
The blue line for the HFC Superforecasters is almost indistinguishable from the diagonal line, showing near-perfect calibration. The best competitor and the baseline group, in contrast, show overconfidence in some cases and underconfidence in others.
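The points on a calibration curve like the one described above can be computed roughly as follows. This is a sketch under stated assumptions: the equal-width binning, the function name `calibration_table`, and the sample data are hypothetical choices made for illustration and are not necessarily how the published graph was produced.

```python
def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and, for each
    non-empty bin, return (mean forecast, observed frequency, count).

    A well-calibrated forecaster has observed frequency close to the mean
    forecast in every bin, so the points fall near the diagonal.
    """
    bins = [[] for _ in range(n_bins)]
    for f, o in zip(forecasts, outcomes):
        # A forecast of exactly 1.0 is placed in the last bin.
        idx = min(int(f * n_bins), n_bins - 1)
        bins[idx].append((f, o))

    rows = []
    for items in bins:
        if not items:
            continue
        mean_forecast = sum(f for f, _ in items) / len(items)
        observed = sum(o for _, o in items) / len(items)
        rows.append((mean_forecast, observed, len(items)))
    return rows


# Hypothetical data: five 40% forecasts (two events occurred) and
# five 80% forecasts (four events occurred).
forecasts = [0.4] * 5 + [0.8] * 5
outcomes = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
rows = calibration_table(forecasts, outcomes)
for mean_forecast, observed, count in rows:
    print(f"forecast {mean_forecast:.2f} -> observed {observed:.2f} (n={count})")
```

In this invented sample the observed frequency equals the mean forecast in both bins, so both points would sit on the diagonal. Overconfidence shows up as high forecasts whose observed frequency is lower than forecast (or low forecasts whose observed frequency is higher).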
Our calculations use the public HFC competition data that IARPA, the US-government sponsor of this tournament, has published. The forecasts for the 187 questions in this competition are not included in the summary statistics presented elsewhere for Good Judgment’s commercial forecasts.
Schedule a consultation to learn how our FutureFirst monitoring tool, custom Superforecasts, and training services can help your organization make better decisions.