$1 Trillion into AI, and It Still Can’t Ask the Right Questions

Superforecaster Ryan Adler on getting frontier LLMs to write solid forecasting questions.

Anyone hoping that artificial intelligence’s dominance of the headlines might wane in 2025 has been very disappointed. With over $1 trillion in announced investment, and December still to go, AI is permeating every aspect of society at an impressive pace. Perhaps most startling for many is the prospect that advances in these technologies will come for our jobs. While this fear is nothing new, it’s certainly accelerating.


As someone who, among other things, writes forecasting questions for a living, I’m not indifferent to the prospect of an AI model making my knowledge, skills, and abilities obsolete. So I recently undertook an exercise to evaluate the current state of the threat. Pleasantly enough, I found that frontier AI systems have a very long way to go.

What Frontier Models Get Wrong about Forecasting Questions

My colleague Chris Karvetski created a detailed prompt we used to ask ChatGPT, Gemini, Claude, and Grok to draft solid forecasting questions related to Russia’s war in Ukraine. This task requires at least a modicum of qualitative assessment, unlike simpler questions about, say, asset prices or interest rates. If that’s what you want, FRED already does most of the work. But war is different. It’s complicated; political and historical fictions abound; and the current Russian government is nothing if not creative on paper. Clear and salient forecasting questions are an absolute must for this and many other topics.

Here are a few quick observations:

  • ChatGPT had an affinity for UN action, which should raise a red flag (no pun intended) for anyone familiar with the UN Security Council’s structure. Russia’s veto effectively blocks any action that isn’t blessed by the Kremlin.
  • Gemini seemed to presuppose perfect knowledge: that casualties down to the man and locations down to the kilometer were readily available data. They aren’t. Surplusage is also an issue, and its attempts to define controlling sources created potentially fatal inconsistencies.
  • Claude tried to cover US and EU sanctions. On its face, it might look like a good framing. However, it tried to set a threshold (50%) without providing any notion of how the current spectrum of sanctions and restrictive measures would be quantified. Without clear metrics, that’s a complete nonstarter.
  • Grok shared some of Gemini’s overconfidence in information availability and often framed questions that could only be resolved long after the fighting had concluded. That’s hardly useful for policymakers.

Those who have seen 1776 may recall the scene where Thomas Jefferson first shares his draft declaration of independence. There’s a pause, then just about everybody in Independence Hall starts clamoring with questions. If the LLM-generated questions above were presented to Superforecasters, I suspect a similar scene of questions and demands for clarification. Great forecasters know that the quality of a forecast, as well as the information derived from it, depends on every element of a question being as clear as possible. Otherwise, you may end up with a probability that means next to nothing.

Will one or more of these model lines get better? Certainly. Will a domain-specific program give us a run for our money before Halley’s Comet makes its return? Probably. But if Lewis and Clark’s path to the Pacific Ocean were an analogy for AI’s journey to writing iron-clad forecasting questions, today’s frontier models have just made it to Kansas City.

* Ryan Adler is a Superforecaster, GJ managing director, and leader of Good Judgment’s question team

Human vs AI Forecasts: What Leaders Need to Know

In October 2025, our colleagues at the Forecasting Research Institute released new ForecastBench results comparing large language models (LLMs) and human forecasters on real-world questions. Superforecasters still lead with a difficulty-adjusted Brier score of 0.081, while the best LLM to date, GPT-4.5, scores 0.101.

In other words, Superforecasters have a roughly 20% edge over the best model (lower scores are better).
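For readers unfamiliar with the metric, the Brier score is simply the mean squared error between probability forecasts and binary outcomes, and the reported edge is the relative difference between the two headline scores. A minimal sketch in Python (the three toy forecasts are invented for illustration; only the 0.081 and 0.101 figures come from ForecastBench):

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between probability forecasts and 0/1 outcomes.

    Lower is better; 0 is a perfect score.
    """
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Toy data: three binary questions, forecast probabilities vs. what happened.
print(brier_score([0.8, 0.3, 0.9], [1, 0, 1]))

# Relative edge implied by the reported difficulty-adjusted scores.
superforecaster, best_llm = 0.081, 0.101
edge = (best_llm - superforecaster) / best_llm
print(f"{edge:.0%}")  # 20%
```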

Meanwhile, LLMs have surpassed the median public forecaster and continue to improve. FRI notes that a simple linear extrapolation would suggest LLM-Superforecaster parity by November 2026.

While Good Judgment Inc helped recruit Superforecasters for this study, we were not involved in its design or execution.

Good Judgment’s Take

We track AI benchmark results closely. Our client work points to a slower timeline than the one suggested by FRI’s extrapolation. For the kinds of problems leaders bring to us, we doubt the Superforecaster-AI gap will close within the next year. For questions with limited or fuzzy data (the case with most high-impact real-world questions), we doubt it will close even in the next several years, if ever. The reasons for our take are threefold.

First, ForecastBench tested binary (yes/no) questions. In our client work, more than two thirds of questions are multinomial or continuous. Leaders often need a probability distribution and point estimates, not just a yes/no threshold. Asking whether US GDP growth, for instance, will exceed 3% next year is less informative than estimating the full distribution so organisations can plan for a range of outcomes.

Second, as the original GJP research has shown, teaming and aggregation raise accuracy by 10% to 25%. Structured teaming and advanced aggregation (beyond simple medians) reduce noise, ensure a broader spectrum of viewpoints and data points, and improve calibration, especially on complex questions that require subjective judgment. The factors of teaming, advanced aggregation, and question types merit further study, in our judgment.
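One way to see what going “beyond simple medians” can mean: average the team’s forecasts in log-odds space, then extremize the result, in the spirit of the published GJP aggregation research. This is an illustrative sketch, not Good Judgment’s production algorithm, and the extremizing exponent here is an assumed value, not a published parameter:

```python
import math

def aggregate(probs, extremize=2.5):
    """Average forecasts in log-odds space, then push the mean away from 50%.

    Averaging in log-odds space weights confident forecasts more naturally
    than a raw mean of probabilities; extremizing counteracts the tendency
    of aggregates to regress toward 50%. The exponent 2.5 is illustrative.
    """
    log_odds = [math.log(p / (1 - p)) for p in probs]
    mean = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-mean * extremize))

team = [0.65, 0.70, 0.60, 0.75]
# The extremized aggregate lands above even the most confident team member.
print(round(aggregate(team), 2))
```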

Third, the AI research tournaments typically collect forecasts at a single point in time. However, new information is always coming in, which makes updated forecasts more accurate and more useful for decision makers.

It is also important to note that a 0.081 vs 0.101 result translates roughly to a 20% edge in accuracy. For decision makers, a 20% improvement can change both the choice and the outcome.

What This Means for Decision Makers

For organisations that need timely, reliable, high-stakes decisions, this is the key takeaway: AI progress is real, but disciplined human judgment still sets the bar. For the best results today, our view is human + AI. Use Superforecasters and Superforecasting methods with capable, secure models for faster, better forecasts.

What our clients value are not only the numbers but also the rationales that Superforecasters provide with their numerical forecasts. By examining their chain of reasoning, something black-box systems cannot reliably provide, leaders can scrutinize assumptions, trace causal links, and stress-test scenarios for hidden risks. This transparency makes the decision process more deliberate, accountable after the fact, and explainable to stakeholders.

As CEO Dr. Warren Hatch noted in a recent Guardian interview, “We expect AI will excel in certain categories of questions, like monthly inflation rates. For categories with sparse data that require more judgment, humans retain the edge. The main point for us is that the answer is not human or AI, but human and AI to get the best forecast possible as quickly as possible.”

Learn more about FutureFirst and see ahead of the crowd.

Four Steps to Integrate Probabilities into Decisions

[Image: picnic scene and wedding scene side by side with a weather icon]
The true yes/no boundary for action is rarely 50%.

Decision makers often want a simple “yes” or “no.” This creates a challenge when you’re in the business of providing probabilities. If you tell them there’s a 76% chance of Event X happening by Date Y, you might get this response: “So that’s a yes!” When you point out the remaining 24% chance it won’t happen, you might get: “So that’s a no?”

Busy leaders need to get straight to action. They often simplify the process by treating anything above 50% as a “yes” and anything below as a “no.” However, it’s rarely the case that the yes/no boundary is 50% in the real world. The true yes/no boundary, which we call the decision threshold, depends on the nature of the decision, the cost, and the stakes involved.

Say there’s a 28% chance of rain. For a casual picnic, you might accept the risk and go anyway. But for an outdoor wedding reception, your threshold for action is probably much lower. You may set up tents just in case.

The decision threshold is fundamentally about cost-benefit analysis: How many false positives are you willing to accept in order to avoid missing a real threat or opportunity?

For example, if you risk losing $100 and the cost of mitigation is $35, it may be worth taking action (mitigating the risk) if the probability of loss exceeds 35%.
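That rule of thumb follows directly from expected value: mitigate when the probability of loss times the potential loss exceeds the cost of mitigation, so the threshold is simply cost divided by loss. A small sketch using the article’s numbers:

```python
def decision_threshold(mitigation_cost, potential_loss):
    """Probability above which mitigation is worth its cost.

    Mitigate when p * potential_loss > mitigation_cost,
    i.e. when p > mitigation_cost / potential_loss.
    """
    return mitigation_cost / potential_loss

def should_mitigate(p, mitigation_cost, potential_loss):
    """True if the forecast probability clears the cost-benefit threshold."""
    return p > decision_threshold(mitigation_cost, potential_loss)

# The article's example: $100 at risk, $35 to mitigate -> act above 35%.
print(decision_threshold(35, 100))     # 0.35
print(should_mitigate(0.28, 35, 100))  # False: 28% chance, hold off
print(should_mitigate(0.40, 35, 100))  # True: 40% chance, mitigate
```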

The Four-Step Framework

Instead of simply delivering a forecast for the decision maker to interpret, we reverse the process by defining the yes/no boundary first. Here’s our four-step process for decision makers to use forecasts with clear thresholds, leading to better, faster judgments.

1. Identify the core decision. Begin by clearly stating the decision the organization faces. Here are a few examples:

  • Committing to an innovation cycle to mitigate the risk of a future regulatory ban on a key product.
  • Preparing for a possible increase in US tariffs on a critical base metal import.
  • Validating the underlying assumptions of a new investment thesis before allocating capital.

2. Set the decision threshold. The yes/no boundary can only be set by the decision maker. Will they act at 20%? 40%? 60%? By focusing on the cost and stakes, they avoid defaulting to the simple (and often misguided) 50% threshold.

3. Pose the question to the forecasters. Ideally, they should be unaware of the specific decision or the decision maker’s identity. This firewall ensures the probability estimate remains independent and unbiased. If anonymity can’t be maintained, treat the forecasting as a separate process as much as possible. Remember: Forecasts focus on how the world will be, while decisions often reflect what we want the world to be.

4. Deliver the actionable forecast. With the yes/no boundary already in place, the decision maker can use the probability estimate immediately to make a call. With this robust framework, decision makers are able to use probabilities effectively to arrive at better decisions faster.

Learn more about FutureFirst and see ahead of the crowd.