Testing Polymarket’s “Most Accurate” Claim


In the case of central bank forecasts, the claim does not hold up when compared to our panel of professional Superforecasters, writes Chris Karvetski, PhD, GJ Senior Data and Decision Scientist.

Figure 1. Meeting-averaged Brier scores.

In a recent 60 Minutes segment, Polymarket CEO Shayne Coplan described his platform in sweeping terms: “It’s the most accurate thing we have as mankind right now. Until someone else creates sort of a super crystal ball.”

It’s a memorable line and an ambitious claim that, at least in the case of central bank forecasts, does not hold up when compared to our panel of professional Superforecasters.

Superforecasting emerged from more than a decade of empirical research, systematic evaluation, and the cultivation of best practices. Polymarket emerged from a very different ecosystem of venture capital, market design, and financial incentives. But origins, pedigrees, and resources ultimately do not decide accuracy. Head-to-head testing on matched forecasting questions does. Central bank rate decisions provide an ideal setting for such an evaluation, which is why we compared Polymarket and Superforecaster forecasts across the full set of 25 recent monetary policy meetings of the Federal Reserve, European Central Bank, Bank of England, and Bank of Japan for which forecasts from both platforms were available.[1]

For each meeting, we aligned forecasts on three mutually exclusive outcomes—raise, hold, or cut—and evaluated probabilistic accuracy using the Brier score, the standard scoring rule for such forecasts. Lower scores indicate better performance, yielding a clean, apples-to-apples basis for objective comparison across platforms.
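For readers who want to see the mechanics, the sketch below shows one common way to compute a multi-outcome Brier score for a single forecast day. The probabilities are illustrative only, and the exact scoring conventions used in our analysis may differ in minor details.

```python
def brier_score(probs: dict[str, float], outcome: str) -> float:
    """Multi-category Brier score for one forecast on one day.

    probs   -- forecast probabilities over the three outcomes, e.g.
               {"raise": 0.05, "hold": 0.80, "cut": 0.15}
    outcome -- the outcome that actually occurred, e.g. "hold"

    The score sums the squared differences between each forecast
    probability and the realized outcome (1 if it occurred, 0 if not).
    It ranges from 0 (perfect) to 2 (maximally wrong); lower is better.
    """
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2 for k, p in probs.items())


# A confident "hold" forecast on a day the central bank did hold:
# (0.05 - 0)^2 + (0.80 - 1)^2 + (0.15 - 0)^2 = 0.065
print(brier_score({"raise": 0.05, "hold": 0.80, "cut": 0.15}, "hold"))
```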

We used two complementary approaches, both pointing in the same direction. First, across all 1,756 daily forecasts, Superforecasters achieved lower (i.e., better) scores on 76 percent of days, with an average daily score of 0.135 compared to 0.159 for Polymarket. In other words, the prediction market’s performance was about 18 percent worse. Second, to account for unequal forecast horizons across meetings, we averaged daily scores within each meeting and then averaged those scores across the 25 meetings. On this basis, Superforecasters achieved an average score of 0.102, compared to 0.126 for Polymarket, making Polymarket roughly 24 percent worse.
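The two aggregation schemes differ only in how the daily scores are pooled. Here is a minimal sketch, using a hypothetical data structure that maps each meeting to its list of daily Brier scores (not the code or data behind our analysis):

```python
from statistics import mean

# daily_scores: {meeting_id: [daily Brier scores for that meeting]}
# (hypothetical structure; the underlying data are not published here)

def pooled_daily_average(daily_scores: dict[str, list[float]]) -> float:
    """Approach 1: average over every forecast day, pooled across meetings.
    Meetings with longer forecast horizons contribute more days."""
    all_days = [s for scores in daily_scores.values() for s in scores]
    return mean(all_days)

def meeting_level_average(daily_scores: dict[str, list[float]]) -> float:
    """Approach 2: average within each meeting first, then across meetings,
    so every meeting counts equally regardless of horizon length."""
    return mean(mean(scores) for scores in daily_scores.values())
```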

This pattern is consistent with prior evidence. Superforecasters have a documented[2] history of strong performance in central bank forecasting, including comparisons against futures markets and other financial benchmarks, with coverage in The New York Times[3] and the Financial Times[4]. Taken together, the evidence shows that when forecasting systems are evaluated head-to-head on the same questions using standard accuracy metrics, the Superforecasters’ aggregate forecast performs better in this domain than prediction markets, undercutting claims of universal predictive supremacy.

* Chris Karvetski, PhD, is the Senior Data and Decision Scientist at Good Judgment Inc


 

[1] Polymarket coverage was not uniform across all central bank meetings. For example, forecasts were available for meetings in March 2024 and June 2024, but not for the 30 April/1 May meeting. Our analysis includes all meetings and all forecast days for which both platforms provided data, without selectively excluding any overlapping observations.

[2] See Good Judgment Inc, “Superforecasters Beat Futures Markets for a Third Year in a Row,” 12 December 2025.

[3] See Peter Coy, “A Better Forecast of Interest Rates,” New York Times, 21 June 2023 (may require subscription).

[4] “Looking at the data since January [2025], it is clear that the superforecasters continue to beat the market.” Joel Suss, “Monetary Policy Radar: ‘Superforecasters’ tend to beat the market,” Financial Times, October 2025 (requires subscription to FT’s Monetary Policy Radar).

Keep up with the latest Superforecasts with a FutureFirst subscription.

The Devil Is in the Verb


Superforecaster Ryan Adler on what Polymarket’s Venezuela kerfuffle reveals about “simple” forecasting questions.

Writing questions is harder than it looks. For simple things, like run-of-the-mill economic data (when the government isn’t shut down), it is rather easy. However, we do not live in a simple world, and framing forecasting questions for highly relevant things can require extremely methodical work. Having written well north of 4,000 forecasting questions and forecast on hundreds more, I’ve suffered through every mistake imaginable (and probably more to come).

Bill Buckner’s 1986 World Series error

After President Trump arranged for Nicolas Maduro to receive free housing in New York (Mayor Mamdani should be proud), Polymarket found itself in a bit of a pickle of its own making.

Will the U.S. invade Venezuela by [date]?

It’s a simple question, a short sentence, and something that has been on a whole lot of minds in recent months.

Straightforward enough? At first glance, it might seem black-and-white. Folks at Polymarket apparently thought so, but they were quite mistaken: many market participants assumed that the US operation constituted an invasion, while Polymarket concluded otherwise.

As is often the case for a forecasting question, which is functionally identical to an event contract, the devil is in the operative verb: invade.

In what we at Good Judgment would call the resolution criteria, Polymarket attempted to expound a bit on what would constitute the US invading Venezuela:

This market will resolve to “Yes” if the United States commences a military offensive intended to establish control over any portion of Venezuela between November 3, 2025, and January 31, 2026, 11:59 PM ET. Otherwise, this market will resolve to “No”.

Left only with this language, I have to say the traders who think they should get paid out based on the events surrounding Operation Absolute Resolve have a very strong case to make. Was this a military offensive? Dozens of dead Cuban intelligence officers would say so, and I don’t think anyone would claim otherwise. Did that offensive “intend to establish control over any portion of Venezuela” (emphasis added)? Absolutely. Maduro was in a building, which is on land. According to all reporting, US forces landed on Venezuelan territory and clearly prevented Maduro from escaping the structure he and his wife were in. A perimeter was established. What’s in that perimeter? A portion of Venezuela.

Sure, this would have been a very small chunk of Venezuela, and the control established (the intent to do so is implicit) was of very short duration. However, Polymarket didn’t take the time to add modifiers that would have excluded the operation we saw.

As an undocumented attorney (got my JD back in the day, but never practiced), one of my first classes was on contracts. While there are many nuances in interpretation, enforceability, and public policy from jurisdiction to jurisdiction, there is one thing that controls a contract: its text. Here, Polymarket may have intended for its language to mean something different than the seizure of Maduro, but that’s not what they wrote in the contract they offered to the market.

Mistakes will be made, but this, in my opinion and experience, was an unforced error.

* Ryan Adler is a Superforecaster, GJ managing director, and leader of Good Judgment’s question team

Good Judgment’s 2025 in Review

A Record Year and What We Learned About AI, Markets, and the Future of Forecasting

It’s been a challenging year. Public imagination has been captured by prediction markets and AI alike as potential oracles for, well, everything. And yet, here we are at Good Judgment Inc, not just still standing but setting records.

This year, Good Judgment launched an unprecedented 1,140 forecasting questions across our public and private platforms, with a void rate of exactly zero. That’s a benchmark other forecasting platforms cannot claim.

Our top-line developments in 2025:

  • Our Superforecasters have continued to outperform the markets, as featured in the Financial Times, and provide precise probabilities in our 11th annual collaboration with The Economist.
  • Good Judgment won an Honourable Mention in the 2025 IF Awards from the Association of Professional Futurists (APF) together with our UK partners ForgeFront for our joint Future.Ctrl methodology. This is a much-coveted professional award in the foresight industry.
  • Good Judgment’s CEO Dr. Warren Hatch delivered a keynote address at UN OCHA’s Global Humanitarian Policy Forum. We find it especially heartwarming that leaders at the highest levels are paying attention to Superforecasting as a way to improve decision-making.
  • We have added an executive education program to our Superforecasting Workshops menu. It’s designed for decision-makers who want to incorporate probability forecasts into their process. So far, our client list includes a major technology company, an oil multinational, and investment funds, among others.
  • We now offer in-person workshops as part of a leadership development program with our Canadian partner, Kingbridge Centre.

But beyond the big names and numbers, we’ve learned something important about where human forecasting fits in an increasingly automated world.

The Two-Front Challenge

On one side, prediction markets like Polymarket have drawn enormous attention. On the other, large language models (LLMs) have shown remarkable ability to synthesize information and generate plausible-sounding forecasts. So are Superforecasters still the best in the field? Are we still needed?

Our answer, backed by data, is yes.

Outperforming the Markets

For the third year in a row, our US Federal Reserve forecasts beat the CME’s FedWatch tool, a result we’ve been documenting throughout the year and that was featured in the Financial Times. Three years is a pattern.

What about Polymarket? On questions like Fed rate decisions, we find it essentially duplicates FedWatch, volatility included. In other words, the prediction market hype hasn’t translated into better forecasts on the questions that matter most to our clients.
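One simple way to check whether two probability streams are essentially duplicates is to compare their daily tracks for the same outcome ahead of a meeting. The sketch below is purely illustrative, with hypothetical inputs rather than the comparison we ran:

```python
from statistics import mean, correlation  # statistics.correlation needs Python 3.10+

def track_similarity(market_a: list[float], market_b: list[float]) -> tuple[float, float]:
    """Compare two daily probability tracks for the same outcome
    (e.g., P(hold) in the run-up to a single Fed meeting).

    Returns (mean absolute difference, Pearson correlation). A small
    difference and a high correlation suggest the two sources move
    together rather than adding independent information.
    """
    diffs = [abs(a - b) for a, b in zip(market_a, market_b)]
    return mean(diffs), correlation(market_a, market_b)
```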

The AI Question

Forecasting Research Institute, our sibling organization, runs the only forecasting competition we know of that directly pits humans against AI models. According to their latest results, the best-performing LLM still lags the best human forecasters by 40%.

Why the gap? It comes down to a fundamental difference in what forecasters do best versus what AI does best.

AI synthesizes existing information. If the answer to a question is somewhere on the internet, a well-trained model will surface it quickly. But for questions marked by greater volatility (who wins the next election, where markets are heading, what happens next in a geopolitical crisis), the answer isn’t sitting in a database. It’s contingent on human behavior, which is much harder to predict than anything mere extrapolation from data can capture.

The best human forecasters go a step or two beyond the retrieval and synthesis of information. They weigh evidence, model uncertainty, update their thinking as conditions change, and produce nuanced judgments. That’s a capability AI could tap into only by accessing Superforecasters’ aggregated forecasts and their detailed reasoning. As we’ve written elsewhere, “What our clients value are not only the numbers but also the rationales that Superforecasters provide with their numerical forecasts. By examining their chain of reasoning—something that black-box systems cannot provide reliably—leaders are able to scrutinize assumptions, trace causal links, and stress-test scenarios by noting hidden risks.” For the types of questions we see in our client work, Superforecasters are still the best.

Looking Ahead

None of this means we’re ignoring AI developments. Quite the opposite. We’ve been actively experimenting with how to integrate AI into the Superforecasting process. It is Good Judgment’s opinion that a hybrid approach is the path forward. Not AI replacing Superforecasters, but AI amplifying what Superforecasters already do well.

As we head into the new year, we are seeing momentum picking up once again on the business side. FutureFirst™, our subscription-based forecast monitoring tool, has seen all Q4 renewals go through. Once organizations experience what our structured forecasting provides and build it into their workflows, they tend to stay.

On the training side, we are now offering Advanced Judgment & Modeling, the next-level Superforecasting training for graduates of our two-day workshop. As a Texas National Security Review study found, decision-makers tend to be vastly overconfident but can improve their calibration even with brief training. Our analysis supports these findings.

Our public workshop continued to receive stellar ratings from participants in 2025. Here’s an excerpt from one of our favorite reviews:

“The content was excellent and incredibly practical, diving deep into the art and science of forecasting. The unexpected highlight was the group itself. It was one of the most uniquely thoughtful, globally diverse rooms I’ve been part of in a long time. … Grateful for the experience and the brilliant people I met. Highly recommend it to anyone serious about sharpening their judgment or improving decision quality.”
— Jeff Trueman, Eisengard AI, November 2025

Although 2025 marked ten years since the publication of Tetlock and Gardner’s Superforecasting: The Art and Science of Prediction, forecasting as a discipline is still a novel way of thinking for many organizations. It feels risky to put a number on a prediction, because with numbers comes accountability. But accountability leads to better forecasting and hence better decisions. This case is getting easier to make, especially when we can point to years of Superforecasters’ documented outperformance of the competition, the markets and, now, the machines.

To our clients, staff, and forecasters: thank you. We wouldn’t be here without your energy, rigor, and recognition. Here’s to another year of proving what human judgment can do with the ever-evolving tools that we have.