What Superforecasters Actually Said About ForecastBench

Every few months, a new AI benchmark result gets journalists excited. Claims spike, the headlines write themselves, and nuance gets left behind.

Good Judgment welcomes AI progress on forecasting. We’ve argued consistently that the answer is Superforecasters plus AI, not one or the other. But the latest round of claims deserves a closer look, because the full report tells a very different story from the summary.

The Substack post

On 23 February 2026, the Forecasting Research Institute published Wave 5 of the Longitudinal Expert AI Panel (LEAP), a recurring survey tracking forecasts from AI scientists, industry leaders, economists, policy researchers, and Superforecasters. Alongside the report, FRI shared a Substack blog post summarizing the key findings.

Here’s what it says on ForecastBench*, in full:

AI systems are expected to surpass top human forecasters within the next few years, but the significance of that achievement is debated. Superforecasters themselves are the most bullish group on automated forecasting progress, with the median superforecaster predicting AI systems will beat their ForecastBench benchmark by 2028, which is earlier than both the median expert (2030) and the median public (2033) forecast.

*ForecastBench is a benchmark measuring AI systems’ forecasting accuracy against a 2024 Superforecaster baseline.

The report

The report’s topline summary of the same finding includes an additional passage that did not make it into the blog post:

However, forecasters qualitatively disagree on what this milestone would signify. Many note that AI excels at data-rich, quantitative questions (weather, sports, financial data) but struggles with geopolitical judgment where data is sparse and context-dependent. Others caution that because ForecastBench is structured as a frozen 2024 human baseline with many data-heavy questions and multiple AI attempts, this advantages AI systems in ways that may overstate genuine forecasting superiority.

The rationales

The report’s most interesting feature, the rationale analysis, goes further. Superforecasters were among those who most explicitly flagged that ForecastBench’s design makes an early AI “win” more likely for reasons that have little to do with genuine forecasting ability.

Their concerns:

  • The Superforecaster baseline comes from a single engagement in 2024. AI systems are now scored on entirely different questions. ForecastBench uses difficulty-adjusted Brier scores to bridge that gap, but each layer of statistical bridging adds uncertainty to the comparison.
  • Many questions focus on weather, sports, and financial data where AI has a structural advantage from data access rather than judgment.
  • Multiple AI systems are tested every two weeks, meaning that, as one respondent put it, “given enough evaluations, eventually one will fall under this mark by chance.”
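That last concern is a standard multiple-comparisons point, and it compounds quickly. A back-of-the-envelope sketch (the per-trial probability, model count, and round count below are hypothetical, chosen only to illustrate the mechanics):

```python
def chance_of_lucky_win(p_trial, n_models, n_rounds):
    """Probability that at least one model-round clears the bar by chance,
    assuming independent trials (a simplification: real scores correlate
    across rounds and across models)."""
    return 1 - (1 - p_trial) ** (n_models * n_rounds)

# Hypothetical: 20 models evaluated every two weeks for a year (26 rounds),
# each evaluation carrying a 1% chance of a lucky sub-threshold score.
print(round(chance_of_lucky_win(0.01, 20, 26), 2))  # ≈ 0.99
```

Even a tiny per-trial fluke probability becomes near-certainty over enough attempts, which is exactly the respondent's point.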

One participating Superforecaster stated directly that the score measures “how well an LLM can make good predictions in general, in comparison to the public and to generalists, rather than it being intended as a specific comparison to Superforecasters.”

In other words, many respondents who predicted a relatively early date for LLMs hitting the benchmark were simultaneously arguing that, in practice, it wouldn’t mean much. “Bullish” is not the word here.

In addition, as another participating Superforecaster wrote:

Beating the superforecaster median on ForecastBench after difficulty adjustment is a much higher bar than ‘be competitive.’ The last bit of improvement is going to be brutally hard for AI: excellent calibration, restraint (knowing when not to be confident), and robustness across lots of weird question types.

What this means

None of this is a criticism of ForecastBench as a research project. The benchmark is a serious attempt to measure something that matters, and the FRI team has been transparent about its methodology in the technical documents. But there is a gap between what the benchmark can show and what the headlines claim it shows.

Superforecasters still lead the overall ForecastBench leaderboard, and on the “market questions” their Brier score of 0.40 beats the nearest AI entrant’s 0.59, an edge of almost 50% (lower is better). The questions resolved so far skew toward short-horizon, data-rich topics where AI has structural advantages. The longer-range, judgment-heavy questions are still pending. And as we’ve written before, the benchmark doesn’t capture teaming, advanced aggregation, updated forecasts, or the upstream work of formulating the right questions in the first place.

As Dr. Warren Hatch told the New York Times earlier this month: “When the data is sparse and the environment is in flux, machines are backward looking by definition. And that’s where I think the space for humans will remain.”

Good Judgment provides forecasts and analysis from our team of professional Superforecasters to government, NGO, and corporate decision-makers. Learn more about FutureFirst.

Good Judgment’s 2025 in Review

A Record Year and What We Learned About AI, Markets, and the Future of Forecasting

It’s been a challenging year. Public imagination has been captured by prediction markets and AI alike as potential oracles for, well, everything. And yet, here we are at Good Judgment Inc, not just standing but setting records.

This year, Good Judgment launched an unprecedented 1,140 forecasting questions across our public and private platforms, with a void rate of exactly zero. That’s a benchmark other forecasting platforms cannot claim.

Our top-line developments in 2025:

  • Our Superforecasters have continued to outperform the markets, as featured in the Financial Times, and to provide precise probabilities in our 11th annual collaboration with The Economist.
  • Good Judgment won an Honourable Mention in the 2025 IF Awards from the Association of Professional Futurists (APF) together with our UK partners ForgeFront for our joint Future.Ctrl methodology. This is a much-coveted professional award in the foresight industry.
  • Good Judgment’s CEO Dr. Warren Hatch delivered a keynote address at UN OCHA’s Global Humanitarian Policy Forum. We find it especially heartwarming that global leaders at the top level are paying attention to Superforecasting as a way to improve decision-making.
  • We have added an executive education program to our Superforecasting Workshops menu. It’s designed for decision-makers who want to incorporate probability forecasts into their process. So far, our client list includes a major technology company, an oil multinational, and investment funds, among others.
  • We now offer in-person workshops as part of a leadership development program with our Canadian partner, Kingbridge Centre.

But beyond the big names and numbers, we’ve learned something important about where human forecasting fits in an increasingly automated world.

The Two-Front Challenge

On one side, prediction markets like Polymarket have drawn enormous attention. On the other, large language models (LLMs) have shown remarkable ability to synthesize information and generate plausible-sounding forecasts. So are Superforecasters still the best in the field? Are we still needed?

Our answer, backed by data, is yes.

Outperforming the Markets

For the third year in a row, our US Federal Reserve forecasts beat the CME’s FedWatch tool, a result we’ve been documenting throughout the year and that was featured in the Financial Times. Three years is a pattern.

What about Polymarket? On questions like Fed rate decisions, we find it essentially duplicates FedWatch, volatility included. In other words, the prediction market hype hasn’t translated into better forecasts on questions like these, and these are the type of questions that matter most to our clients.

The AI Question

Forecasting Research Institute, our sibling organization, runs the only forecasting competition we know of that directly pits humans against AI models. According to their latest results, the best-performing LLM still lags the best human forecasters by 40%.

Why the gap? It comes down to a fundamental difference in what forecasters do best versus what AI does best.

AI synthesizes existing information. If the answer to a question is somewhere on the internet, a well-trained model will surface it quickly. But for questions marked by greater volatility (who wins the next election, where markets are heading, what happens next in a geopolitical crisis), the answer isn’t sitting in a database. It’s contingent on human behavior, which is much harder to predict than anything mere extrapolation from data can capture.

The best human forecasters go a step or two beyond the retrieval and synthesis of information. They weigh evidence, model uncertainty, update their thinking as conditions change, and produce nuanced judgments. That’s a capability AI could tap into only by accessing Superforecasters’ aggregated forecasts and their detailed reasoning. As we’ve written elsewhere, “What our clients value are not only the numbers but also the rationales that Superforecasters provide with their numerical forecasts. By examining their chain of reasoning—something that black-box systems cannot provide reliably—leaders are able to scrutinize assumptions, trace causal links, and stress-test scenarios by noting hidden risks.” For the types of questions we see in our client work, Superforecasters are still the best.

Looking Ahead

None of this means we’re ignoring AI developments. Quite the opposite. We’ve been actively experimenting with how to integrate AI into the Superforecasting process. It is Good Judgment’s opinion that a hybrid approach is the path forward. Not AI replacing Superforecasters, but AI amplifying what Superforecasters already do well.

As we head into the new year, we are seeing momentum picking up once again on the business side. FutureFirst™, our subscription-based forecast monitoring tool, has seen all Q4 renewals go through. Once organizations experience what our structured forecasting provides and build it into their workflows, they tend to stay.

On the training side, we are now offering Advanced Judgment & Modeling, the next-level Superforecasting training for graduates of our two-day workshop. As a Texas National Security Review study found, decision-makers tend to be vastly overconfident but can improve calibration even with brief training. Our analysis supports these findings.

Our public workshop continued to receive stellar ratings from participants in 2025. Here’s an excerpt from one of our favorite reviews:

“The content was excellent and incredibly practical, diving deep into the art and science of forecasting. The unexpected highlight was the group itself. It was one of the most uniquely thoughtful, globally diverse rooms I’ve been part of in a long time. … Grateful for the experience and the brilliant people I met. Highly recommend it to anyone serious about sharpening their judgment or improving decision quality.”
— Jeff Trueman, Eisengard AI, November 2025

Although 2025 marked ten years since the publication of Tetlock and Gardner’s Superforecasting: The Art and Science of Prediction, forecasting as a discipline is still a novel way of thinking for many organizations. It feels risky to put a number on a prediction, because with numbers comes accountability. But accountability leads to better forecasting and hence better decisions. This case is getting easier to make, especially when we can point to years of Superforecasters’ documented outperformance of the competition, the markets and, now, the machines.

To our clients, staff, and forecasters: thank you. We wouldn’t be here without your energy, rigor, and recognition. Here’s to another year of proving what human judgment can do with the ever-evolving tools that we have.

Human vs AI Forecasts: What Leaders Need to Know

In October 2025, our colleagues at the Forecasting Research Institute released new ForecastBench results comparing large language models (LLMs) and human forecasters on real-world questions. Superforecasters still lead with a difficulty-adjusted Brier score of 0.081, while the best LLM to date, GPT-4.5, scores 0.101.

In other words, Superforecasters have a roughly 20% edge over the best model (lower scores are better).
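For readers who want to check the arithmetic, the 20% figure is simply the relative reduction in Brier score (our own restatement of the comparison, not FRI’s formula):

```python
def brier_edge(better, worse):
    # Relative Brier-score advantage of the better (lower) scorer.
    return (worse - better) / worse

print(round(brier_edge(0.081, 0.101), 3))  # 0.198, i.e. roughly 20%
```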

Meanwhile, LLMs have surpassed the median public forecaster and continue to improve. FRI notes that a simple linear extrapolation would suggest LLM-Superforecaster parity by November 2026.
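The mechanics of that extrapolation are straightforward to reproduce. The sketch below uses a hypothetical monthly improvement rate picked only for illustration; the actual fitted slope is FRI’s, not ours:

```python
def months_to_parity(gap, monthly_reduction):
    # Months until a Brier-score gap closes under a straight-line trend.
    return gap / monthly_reduction

# Gap as of October 2025: 0.101 - 0.081 = 0.020 Brier points. A hypothetical
# slope of 0.0015 points per month would close it in roughly 13 months,
# i.e. around late 2026.
print(round(months_to_parity(0.020, 0.0015), 1))  # 13.3
```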

While Good Judgment Inc helped recruit Superforecasters for this study, we were not involved in its design or execution.

Good Judgment’s Take

We track AI benchmark results closely. Our client work points to a slower timeline than the one FRI’s extrapolation suggests. For the kinds of problems leaders bring to us, we doubt the Superforecaster-AI gap will close within the next year. For questions with limited or fuzzy data, which describes most high-impact real-world questions, we doubt it will close in the next several years, if ever. The reasons for our take are threefold.

First, ForecastBench tested binary (yes/no) questions. In our client work, more than two thirds of questions are multinomial or continuous. Leaders often need a probability distribution and point estimates, not just a yes/no threshold. Asking whether US GDP, for instance, will exceed 3% next year is less informative than estimating the full distribution so organisations can plan for a range of outcomes.

Second, as the original GJP research has shown, teaming and aggregation raise accuracy by 10% to 25%. Structured teaming and advanced aggregation (beyond simple medians) reduce noise, ensure a broader spectrum of viewpoints and data points, and improve calibration, especially on complex questions that require subjective judgment. The factors of teaming, advanced aggregation, and question types merit further study, in our judgment.
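To make “advanced aggregation (beyond simple medians)” concrete: one well-documented trick from the forecasting-aggregation literature is extremizing the pooled probability. A minimal sketch of a log-odds variant, with a hypothetical extremizing exponent (in practice the exponent is fit to past data):

```python
import math
import statistics

def aggregate(probs, a=1.5):
    """Return the simple median and an extremized log-odds mean of
    binary-event probabilities. The exponent a is a hypothetical
    tuning value, not a fitted one."""
    median = statistics.median(probs)
    log_odds = [math.log(p / (1 - p)) for p in probs]
    extremized = 1 / (1 + math.exp(-a * sum(log_odds) / len(log_odds)))
    return median, extremized

# Three forecasters mildly favoring "yes": extremizing pushes the pooled
# probability further from 0.5 than the median (0.65 vs. roughly 0.72).
print(aggregate([0.6, 0.7, 0.65]))
```

The intuition is that individual forecasts share only part of the available evidence, so the pool as a whole warrants more confidence than any one member expresses.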

Third, the AI research tournaments typically collect forecasts at a single point in time. However, new information is always coming in, which makes updated forecasts more accurate and more useful for decision makers.

It is also important to note that the 0.081 vs. 0.101 result translates to roughly a 20% edge in accuracy. For decision makers, a 20% improvement can change both the choice and the outcome.

What This Means for Decision Makers

For organisations that need timely, reliable, high-stakes decisions, this is the key takeaway: AI progress is real, but disciplined human judgment still sets the bar. For the best results today, our view is human + AI. Use Superforecasters and Superforecasting methods with capable, secure models for faster, better forecasts.

What our clients value are not only the numbers but also the rationales that Superforecasters provide with their numerical forecasts. By examining their chain of reasoning—something that black-box systems cannot provide reliably—leaders are able to scrutinize assumptions, trace causal links, and stress-test scenarios by noting hidden risks. This transparency makes the decision process more deliberate, accountable after the fact, and explainable to stakeholders.

As CEO Dr. Warren Hatch noted in a recent Guardian interview, “We expect AI will excel in certain categories of questions, like monthly inflation rates. For categories with sparse data that require more judgment, humans retain the edge. The main point for us is that the answer is not human or AI, but human and AI to get the best forecast possible as quickly as possible.”

Learn more about FutureFirst and see ahead of the crowd.