One Question, Two Tariffs: How Polymarket’s Contract Design Dodged a Bullet

The prediction market bundled two distinct legal questions into one contract on Trump’s tariffs. The Court happened to treat them the same way. That was luck, not design, writes Ryan Adler, GJ Question Team Lead.


Years ago, I was in a city council meeting for a budget presentation. I had put together a number of tables and figures. In what is probably not a unique experience, I saw during the presentation that one of my tables had an underlying error in a key figure: a sum formula excluded a line item that had been added later. To my relief, nobody noticed. Later, I updated the file so that the table would be correct for future use, and went on with the knowledge that I had dodged a bullet.

When you make a mistake in Excel, whether due to a typo in a range or outright negligence, you don’t have to worry too much about the error jumping off the screen or page for a large number of people (especially when the money involved is someone else’s). Event contract markets have no such luxury, because people are putting their own money forward and expecting a return.

I got lucky with my spreadsheet. A similar scenario, for anyone careful enough to notice, played out with Polymarket’s contract on whether the Supreme Court would rule in favor of Trump’s tariffs under the International Emergency Economic Powers Act (IEEPA). In the end, the oversight was immaterial to the outcome, but it could have been another unforced error (e.g., see my last post).

Splitting the Baby

At Good Judgment, I made a deliberate choice to split the issue into two questions. The first asked: “Before 1 August 2026, will the Supreme Court rule in Trump v. V.O.S. Selections Inc. that all of President Trump’s ‘reciprocal tariffs’ are either not authorized by the IEEPA or unconstitutional?” The second asked the same thing, but about the “trafficking tariffs.”

That distinction wasn’t cosmetic. The reciprocal tariffs and the trafficking tariffs arose from different executive orders, rested on different factual justifications, and differed sharply in scope. Yes, the cases were consolidated. Yes, the Federal Circuit addressed them in the same opinion. But it was entirely plausible that the Supreme Court could have treated them differently. Courts are perfectly capable of splitting the baby.

Polymarket, by contrast, offered a single, bundled contract: “Will the Supreme Court rule in favor of Trump’s tariffs?” Their resolution criteria tied the outcome to whether the Court reversed or vacated the Federal Circuit’s holding that the tariffs exceeded IEEPA authority. Clean enough on paper, but it assumed that “the tariffs” were one thing. They were not.

If the Court had upheld the trafficking tariffs but struck down the reciprocal tariffs, or vice versa, my decision to split the question into two would have saved us, with one question resolving as “Yes” and the other “No.” Polymarket would have had to interpret whether the government had “prevailed.” That’s the kind of ambiguity you don’t want when people have money on the line.
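To make the structural point concrete, here is an illustrative enumeration, my framing for this post rather than either platform’s actual resolution language, of the four ways the Court could have ruled on the two categories and how each contract design would have resolved:

```python
# Illustrative only: not Good Judgment's or Polymarket's actual resolution
# logic. Enumerate the four possible rulings on the two tariff categories.
from itertools import product

for reciprocal_struck, trafficking_struck in product([True, False], repeat=2):
    # Split design: each question resolves on its own category.
    q1 = "Yes" if reciprocal_struck else "No"   # reciprocal tariffs struck?
    q2 = "Yes" if trafficking_struck else "No"  # trafficking tariffs struck?

    # Bundled design: clean only when the Court treats both categories alike.
    if reciprocal_struck == trafficking_struck:
        bundled = "No" if reciprocal_struck else "Yes"  # "in favor of the tariffs?"
    else:
        bundled = "ambiguous: did the government 'prevail'?"

    print(f"reciprocal struck: {reciprocal_struck!s:<5}  "
          f"trafficking struck: {trafficking_struck!s:<5}  "
          f"split: Q1={q1}, Q2={q2}  bundled: {bundled}")
```

Two of the four branches resolve the split questions cleanly while leaving the bundled contract arguing over what “in favor of” means.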

(A note on “prevailed”: This is not a great term to use for a question/market. Anyone who reads a lot of federal appellate decisions, whether professionally or for fun, knows that litigation over what is meant by “prevailing party” has led to the deaths of many trees.)

Polymarket ducked a mess because the Court ultimately treated the categories the same way, but that was luck layered on top of legal reality.

The figure above shows how this played out in forecasting. Early on, our Superforecasters assigned different probabilities to the two questions. The reciprocal tariffs consistently carried a higher probability of being struck down than the trafficking tariffs. That divergence reflected a genuine legal distinction, one that Polymarket’s framing failed to account for.

After oral arguments in early November offered insight into where the Justices’ attention was focused, the two Good Judgment lines converged. By early February, Superforecasters were around 80 percent that the administration would lose on the reciprocal tariffs. Polymarket’s pricing implied something closer to the mid-70s. On 20 February, the Supreme Court ruled against the administration.

You can argue about who was a few percentage points closer. That’s not the interesting part. The interesting part is structural. Markets are only as clean as their definitions. When you bundle distinct legal questions into one contract, you are implicitly betting that the Court won’t differentiate. In this case, Polymarket dodged the bullet. But if the Court had drawn finer lines, as courts often do, the market could have faced a messy resolution.

I got to fix my spreadsheet after the meeting. A forecasting question or a prediction market needs to nail it the first time.

Keep up with the latest Superforecasts with a FutureFirst subscription.

What Superforecasters Actually Said About ForecastBench

Every few months, a new AI benchmark result gets journalists excited. Claims spike, the headlines write themselves, and nuance gets left behind.

Good Judgment welcomes AI progress on forecasting. We’ve argued consistently that the answer is Superforecasters plus AI, not one or the other. But the latest round of claims deserves a closer look, because the full report tells a very different story from the summary.

The Substack post

On 23 February 2026, the Forecasting Research Institute published Wave 5 of the Longitudinal Expert AI Panel (LEAP), a recurring survey tracking forecasts from AI scientists, industry leaders, economists, policy researchers, and Superforecasters. Alongside the report, FRI shared a Substack blog post summarizing the key findings.

Here’s what it says on ForecastBench*, in full:

AI systems are expected to surpass top human forecasters within the next few years, but the significance of that achievement is debated. Superforecasters themselves are the most bullish group on automated forecasting progress, with the median superforecaster predicting AI systems will beat their ForecastBench benchmark by 2028, which is earlier than both the median expert (2030) and the median public (2033) forecast.

*ForecastBench is a benchmark measuring AI systems’ forecasting accuracy against a 2024 Superforecaster baseline.

The report

The report’s topline summary of the same finding includes an additional passage that did not make it into the blog post:

However, forecasters qualitatively disagree on what this milestone would signify. Many note that AI excels at data-rich, quantitative questions (weather, sports, financial data) but struggles with geopolitical judgment where data is sparse and context-dependent. Others caution that because ForecastBench is structured as a frozen 2024 human baseline with many data-heavy questions and multiple AI attempts, this advantages AI systems in ways that may overstate genuine forecasting superiority.

The rationales

The report’s most interesting feature, the rationale analysis, goes further. Superforecasters were among those who most explicitly flagged that ForecastBench’s design makes an early AI “win” more likely for reasons that have little to do with genuine forecasting ability.

Their concerns:

  • The Superforecaster baseline comes from a single engagement in 2024. AI systems are now scored on entirely different questions. ForecastBench uses difficulty-adjusted Brier scores to bridge that gap, but each layer of statistical bridging adds uncertainty to the comparison.
  • Many questions focus on weather, sports, and financial data where AI has a structural advantage from data access rather than judgment.
  • Multiple AI systems are tested every two weeks, meaning that, as one respondent put it, “given enough evaluations, eventually one will fall under this mark by chance.” (See the toy simulation after this list.)
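To see why the repeated-evaluation point bites, here is a toy Monte Carlo sketch. Every number in it is an assumption for illustration, not ForecastBench’s actual setup: even if each AI system is genuinely a bit worse than the frozen human baseline on average, testing many systems every two weeks makes a one-off “win” by noise alone nearly inevitable.

```python
# Toy multiple-comparisons simulation. All parameters are illustrative
# assumptions, not ForecastBench's actual configuration.
import random

random.seed(0)
N_SYSTEMS, N_EVALS = 20, 26   # hypothetical: 20 systems, biweekly for a year
BASELINE = 0.30               # frozen human Brier baseline (made-up value)
MEAN, SD = 0.32, 0.02         # each system is genuinely a bit WORSE on average

trials = 10_000
hits = sum(
    any(random.gauss(MEAN, SD) < BASELINE   # one "win" by chance is enough
        for _ in range(N_SYSTEMS * N_EVALS))
    for _ in range(trials)
)
print(f"Chance at least one system beats the baseline at some point: "
      f"{hits / trials:.1%}")
```

With these assumed numbers the probability is effectively 100 percent, which is the respondent’s point: a fixed bar plus many attempts all but guarantees that someone clears it eventually.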

One participating Superforecaster stated directly that the score measures “how well an LLM can make good predictions in general, in comparison to the public and to generalists, rather than it being intended as a specific comparison to Superforecasters.”

In other words, many respondents who predicted a relatively early date for LLMs hitting the benchmark were simultaneously arguing that, in practice, it wouldn’t mean much. “Bullish” is not the word here.

In addition, as another participating Superforecaster wrote:

Beating the superforecaster median on ForecastBench after difficulty adjustment is a much higher bar than ‘be competitive.’ The last bit of improvement is going to be brutally hard for AI: excellent calibration, restraint (knowing when not to be confident), and robustness across lots of weird question types.

What this means

None of this is a criticism of ForecastBench as a research project. The benchmark is a serious attempt to measure something that matters, and the FRI team has been transparent about its methodology in the technical documents. But there is a gap between what the benchmark can show and what the headlines claim it shows.

Superforecasters still lead on the overall ForecastBench leaderboard, and on the “market questions,” the nearest AI entrant’s Brier score is nearly 50 percent worse than theirs (0.59 vs. 0.40; lower is better). The questions resolved so far skew toward short-horizon, data-rich topics where AI has structural advantages. The longer-range, judgment-heavy questions are still pending. And as we’ve written before, the benchmark doesn’t capture teaming, advanced aggregation, updated forecasts, or the upstream work of formulating the right questions in the first place.

As Dr. Warren Hatch told the New York Times earlier this month: “When the data is sparse and the environment is in flux, machines are backward looking by definition. And that’s where I think the space for humans will remain.”

Good Judgment provides forecasts and analysis from our team of professional Superforecasters to government, NGO, and corporate decision-makers. Learn more about FutureFirst.

Testing Polymarket’s “Most Accurate” Claim

In the case of central bank forecasts, the claim does not hold up when compared to our panel of professional Superforecasters, writes Chris Karvetski, PhD, GJ Senior Data and Decision Scientist.

Figure 1. Meeting-averaged Brier scores for Superforecasters and Polymarket (lower is better).

In a recent 60 Minutes segment, Polymarket CEO Shayne Coplan described his platform in sweeping terms: “It’s the most accurate thing we have as mankind right now. Until someone else creates sort of a super crystal ball.”

It’s a memorable line and an ambitious claim that, at least in the case of central bank forecasts, does not hold up when compared to our panel of professional Superforecasters.

Superforecasting emerged from more than a decade of empirical research, systematic evaluation, and the cultivation of best practices. Polymarket emerged from a very different ecosystem of venture capital, market design, and financial incentives. But origins, pedigrees, and resources ultimately do not decide accuracy. Head-to-head testing on matched forecasting questions does. Central bank rate decisions provide an ideal setting for such an evaluation, which is why we compared Polymarket and Superforecaster forecasts across the full set of 25 recent monetary policy meetings of the Federal Reserve, European Central Bank, Bank of England, and Bank of Japan for which forecasts from both platforms were available.[1]

For each meeting, we aligned forecasts on three mutually exclusive outcomes—raise, hold, or cut—and evaluated probabilistic accuracy using the Brier score, the standard scoring rule for such forecasts. Lower scores indicate better performance, yielding a clean, apples-to-apples basis for objective comparison across platforms.
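As a minimal sketch of that scoring rule, assuming the standard multi-outcome Brier definition (the sum of squared differences between the forecast probability vector and the 0/1 outcome vector, ranging from 0 for a perfect forecast to 2 for a maximally wrong one):

```python
# Multi-outcome Brier score: sum of squared differences between forecast
# probabilities and the realized 0/1 outcome vector. Lower is better.
def brier_score(forecast: dict[str, float], outcome: str) -> float:
    return sum((p - (1.0 if label == outcome else 0.0)) ** 2
               for label, p in forecast.items())

# Hypothetical example: 5% raise, 80% hold, 15% cut, and the bank holds.
print(brier_score({"raise": 0.05, "hold": 0.80, "cut": 0.15}, "hold"))
# 0.0025 + 0.04 + 0.0225 = 0.065
```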

We used two complementary approaches, both pointing in the same direction. First, across all 1,756 daily forecasts, Superforecasters achieved lower (i.e., better) scores on 76 percent of days, with an average daily score of 0.135 compared to 0.159 for Polymarket. In other words, the prediction market’s performance was about 18 percent worse. Second, to account for unequal forecast horizons across meetings, we averaged daily scores within each meeting and then averaged those scores across the 25 meetings. On this basis, Superforecasters achieved an average score of 0.102, compared to 0.126 for Polymarket, making Polymarket roughly 24 percent worse.
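The two aggregations, sketched with made-up numbers in place of the 1,756 actual forecast days:

```python
# Two ways to aggregate daily Brier scores across meetings. The data here
# are invented for illustration; the real analysis covered 25 meetings.
import statistics

daily_scores = {          # meeting -> that meeting's daily Brier scores
    "meeting_1": [0.10, 0.08, 0.05],   # three forecast days
    "meeting_2": [0.30, 0.22],         # horizons differ across meetings
}

# Approach 1: pool every daily score; longer horizons carry more weight.
pooled = statistics.mean(s for days in daily_scores.values() for s in days)

# Approach 2: average within each meeting first, then across meetings,
# so each meeting counts equally regardless of horizon length.
by_meeting = statistics.mean(statistics.mean(days)
                             for days in daily_scores.values())

print(pooled, by_meeting)
```

The second, meeting-level averaging is what the 0.102 vs. 0.126 comparison above reflects.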

This pattern is consistent with prior evidence. Superforecasters have a documented[2] history of strong performance in central bank forecasting, including comparisons against futures markets and other financial benchmarks, with coverage in The New York Times[3] and the Financial Times[4]. Taken together, the evidence shows that when forecasting systems are evaluated head-to-head on the same questions using standard accuracy metrics, the Superforecasters’ aggregate forecast performs better in this domain than prediction markets, undercutting claims of universal predictive supremacy.


[1] Polymarket coverage was not uniform across all central bank meetings. For example, forecasts were available for meetings in March 2024 and June 2024, but not for the 30 April/1 May meeting. Our analysis includes all meetings and all forecast days for which both platforms provided data, without selectively excluding any overlapping observations.

[2] See Good Judgment Inc, “Superforecasters Beat Futures Markets for a Third Year in a Row,” 12 December 2025.

[3] See Peter Coy, “A Better Forecast of Interest Rates,” New York Times, 21 June 2023 (may require subscription).

[4] “Looking at the data since January [2025], it is clear that the superforecasters continue to beat the market.” Joel Suss, “Monetary Policy Radar: ‘Superforecasters’ tend to beat the market,” Financial Times, October 2025 (requires subscription to FT’s Monetary Policy Radar).

Keep up with the latest Superforecasts with a FutureFirst subscription.