$1 Trillion into AI, and It Still Can’t Ask the Right Questions

Superforecaster Ryan Adler on getting frontier LLMs to write solid forecasting questions.

Anyone hoping that artificial intelligence’s dominance of the headlines might wane in 2025 has been very disappointed. With over $1 trillion in announced investment, and December still to go, AI is permeating every aspect of society at an impressive pace. Perhaps most startling for many is the prospect that advances in these technologies will come for our jobs. While this fear is nothing new, it’s certainly accelerating.


As someone who, among other things, writes forecasting questions for a living, I’m not indifferent to the prospect of an AI model making my knowledge, skills, and abilities obsolete. That said, I undertook a recent exercise to evaluate the current state of the threat. Pleasantly enough, I found that frontier AI systems have a very long way to go.

What Frontier Models Get Wrong about Forecasting Questions

My colleague Chris Karvetski created a detailed prompt we used to ask ChatGPT, Gemini, Claude, and Grok to draft solid forecasting questions related to Russia’s war in Ukraine. This task requires at least a modicum of qualitative assessment, unlike simpler questions about, say, asset prices or interest rates. If that’s what you want, FRED already does most of the work. But war is different. It’s complicated; political and historical fictions abound; and the current Russian government is nothing if not creative on paper. Clear and salient forecasting questions are an absolute must for this and many other topics.

Here are a few quick observations:

  • ChatGPT had an affinity for UN action, which should raise a red flag (no pun intended) for anyone familiar with the structure of the UN Security Council. Russia’s veto effectively blocks any action that isn’t blessed by the Kremlin.
  • Gemini seemed to presuppose perfect knowledge: that casualties down to the man and locations down to the kilometer were readily available data. They aren’t. Surplusage is also an issue, and its attempts to define controlling sources created potentially fatal inconsistencies.
  • Claude tried to cover US and EU sanctions. On its face, it might look like a good framing. However, it tried to set a threshold (50%) without providing any notion of how the current spectrum of sanctions and restrictive measures would be quantified. Without clear metrics, that’s a complete nonstarter.
  • Grok shared some of Gemini’s overconfidence in information availability and often framed questions that could only be resolved long after the fighting had concluded. That’s hardly useful for policymakers.

If you have seen 1776, you may recall the scene where Thomas Jefferson first shares his draft declaration of independence. There’s a pause, then just about everybody in Independence Hall starts clamoring with questions. If the LLM-generated questions above were presented to Superforecasters, I suspect a similar scene would unfold, full of questions and demands for clarification. Great forecasters know that the quality of a forecast, as well as the information derived from it, depends on every element of a question being as clear as possible. Otherwise, you may end up with a probability that means next to nothing.

Will one or more of these model lines get better? Certainly. Will a domain-specific program give us a run for our money before Halley’s Comet makes its return? Probably. But if Lewis and Clark’s path to the Pacific Ocean were an analogy for AI’s journey to writing iron-clad forecasting questions, today’s frontier models have just made it to Kansas City.

* Ryan Adler is a Superforecaster, GJ managing director, and leader of Good Judgment’s question team

What’s a month?

Why question wording must be exact in forecasting

Superforecaster Ryan Adler turns a live CNBC disagreement about Tesla shares into a quick guide on clarity. Good forecasting starts with shared definitions.

On Monday morning (4 August 2025), I was pounding away on my keyboard with CNBC playing in the background. Living in the Mountain time zone, morning meant the Halftime Report, hosted by Scott “The Judge” Wapner. I was loosely listening in when it became clear that Wapner and “Investment Committee” member Joe Terranova were having a disagreement over whether Tesla shares were up or down over the past month. The exchange was cordial but awkward, as Wapner insisted that Tesla shares were down in the past month based on where the stock was trading that morning, but Terranova was very confident that it was up in the past month. They eventually went to commercial and came back having discovered the source of discrepancy. The problem wasn’t that one was right and the other wrong. The problem was that they were each defining “month” differently.

A month before 4 August 2025 would have been 4 July 2025, a market holiday. The chart CNBC showed was anchored to Tesla’s closing price on 3 July (about $315). Terranova, on the other hand, was using the price at the opening bell on 7 July 2025, four weeks earlier, when shares traded a bit under $300. The two talked past each other for a bit until the reason for the difference was identified.
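For the curious, the two readings of “month” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are mine, the holiday set contains just the one holiday mentioned above, the baseline prices are the rounded figures from the broadcast as described here, and the current price is a made-up number chosen to fall between them.

```python
from datetime import date, timedelta
import calendar


def one_calendar_month_before(d: date) -> date:
    """Same day-of-month in the previous month, clamped to that month's length."""
    year, month = (d.year, d.month - 1) if d.month > 1 else (d.year - 1, 12)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def previous_trading_day(d: date, holidays: set) -> date:
    """Step back past weekends and listed holidays to the last trading day."""
    while d.weekday() >= 5 or d in holidays:
        d -= timedelta(days=1)
    return d


today = date(2025, 8, 4)
holidays = {date(2025, 7, 4)}  # assumption: only the one holiday the article mentions

calendar_baseline = previous_trading_day(one_calendar_month_before(today), holidays)
four_week_baseline = today - timedelta(weeks=4)

print(calendar_baseline)   # 2025-07-03
print(four_week_baseline)  # 2025-07-07

# Rounded baselines from the article; "a bit under $300" is approximated as 300.
baselines = {calendar_baseline: 315.0, four_week_baseline: 300.0}
hypothetical_price = 305.0  # NOT a real quote, chosen to sit between the baselines
for day, base in baselines.items():
    direction = "up" if hypothetical_price > base else "down"
    print(f"vs {day}: {direction}")
```

With any price between the two baselines, the same stock is “down” against the 3 July close and “up” against the 7 July open, which is the whole disagreement in miniature.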

Ambiguity Kills Forecasts

What does this have to do with forecasting? Everything!

Among the many lessons that came out of the Good Judgment Project, one was clear: the fight against ambiguity is essential and never-ending. While others may give this fight a lower priority, it is front and center on our minds at Good Judgment with every question drafted and reviewed.

If a term or clause could be interpreted reasonably in different ways, we define that term and include examples as needed. And even if someone interprets something in an arguably unreasonable way, such as asserting that the death of a country’s president doesn’t mean that the person stops being that country’s president (it’s happened repeatedly, for some reason), we clarify.

We aren’t perfect, and the world sometimes creates situations that weren’t on anyone’s radar when a germane question was launched. That said, we know that everybody must be contemplating the same elements of an event they are asked to forecast. Leaning on Potter Stewart’s concurrence in Jacobellis v. Ohio, where he said, “I know it when I see it,” may work when deciding that a movie is not obscene, but it is no way to set a threshold for a forecasting question. Otherwise, we would invite static from the crowd instead of signal.

Bottom line: The CNBC confusion shows how ambiguity kills forecasts. Define upfront what counts, when it counts, and who decides, and leave as little as possible to interpretation. Good forecasting starts with good question writing.
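One way to picture that checklist is as a record whose blanks are easy to spot. The sketch below is hypothetical: `QuestionSpec` and its field names are invented for illustration and do not describe any Good Judgment tool or process.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class QuestionSpec:
    """Hypothetical record of a draft question and the elements that pin it down."""
    text: str
    what_counts: Optional[str] = None        # the event or threshold being forecast
    window_start: Optional[date] = None      # when it counts: start of the window
    window_end: Optional[date] = None        # when it counts: end of the window
    resolution_source: Optional[str] = None  # who decides


def missing_elements(spec: QuestionSpec) -> list:
    """Return the names of any elements still left to interpretation."""
    return [f.name for f in fields(spec) if getattr(spec, f.name) is None]


draft = QuestionSpec(text="Are Tesla shares up over the past month?")
print(missing_elements(draft))
# ['what_counts', 'window_start', 'window_end', 'resolution_source']
```

A check like this only catches what is absent. Whether a filled-in definition is actually unambiguous still takes a human reviewer.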

Do you have what it takes to be a Superforecaster? Find out on GJ Open!

* Ryan Adler is a Superforecaster, GJ managing director, and leader of Good Judgment’s question team

When AI Becomes a False Prophet: A Cautionary Tale for Forecasters

With a nod to Taylor Swift and Travis Kelce, Superforecaster Ryan Adler discusses the gospel according to AI and why forecasters should always verify their sources.

Google’s AI Overview references an AI-generated video to support a false claim.

The promises of artificial intelligence have set up camp in media headlines over the past few years. ChatGPT has become a household name, billions are being spent just to power the equipment to run these programs and models, and the cutting-edge technology is front and center in ongoing tensions between the US and China. It hasn’t left any aspect of human activity untouched, including forecasting.

To be sure, the impacts already felt cannot be overstated. We are looking at the front end of what I’m confident will be a seismic shift in society, with large swaths of labor markets around the globe being shaken to their core. That said, we aren’t there yet.

Here’s a recent example of how AI took itself out at the knees on a forecasting question on Good Judgment Open. In late April 2025, the time came to close a question regarding potential nuptials between Kansas City Chiefs star Travis Kelce and pop superstar Taylor Swift: “Before 19 April 2025, will Travis Kelce and Taylor Swift announce or acknowledge that they are engaged to be married?” (It’s not my favorite subject matter, but we try to maintain a diverse pool of questions.)

As a moderately rabid Chiefs fan myself, I was confident the answer was no, because that would have made headlines across media outlets. However, a key part of the job of running a forecasting platform is being in the habit of double- and triple-checking. So, I checked with Google. I entered “Are Travis Kelce and…” into the search field, which immediately autofilled to “are travis and taylor engaged?” (The first-name thing with pop culture stars annoys me to no end, but I digress.) To my surprise, Google’s AI Overview popped up immediately.

“Yes, according to reports, Travis Kelce and Taylor Swift are engaged.”

“Trust, but verify”

Skeptical, I looked at what the experimental generative AI response was using as a reference to return such a statement. That’s when things got fun.

The first link of the cited material was a YouTube video. Keep in mind that Google, the search engine I used to start my research, owns YouTube. The account that posted the video? DangerousAI. That alone raises more red flags than a May Day parade in Moscow circa 1974. The brief video, dated 24 February 2025, purported to show Travis Kelce announcing that Swift and he “got engaged last week.” However, as the video progressed, the absurdity of Kelce’s putative announcement became perfectly clear.

To sum up, Google’s AI system linked to search was fooled by an AI product posted on another Google platform to give a patently false response.

I don’t highlight this incident as a criticism of Google. However, it should serve as a warning. I’ve seen some GJ Open forecasters take AI responses as gospel. I’m here to tell you that in matters of fact vs. fiction, AI is very capable of being a false prophet. This is not to say that AI isn’t an incredibly valuable tool. It certainly is! We are finding more and more uses for it at Good Judgment, but we put it through its paces long before we deem it reliable for a particular role. As the Russian proverb instructs, “Trust, but verify.” (No, President Reagan didn’t say it first.) When it comes to AI and everything else you see online, my suggestion is that you just verify.

Do you have what it takes to be a Superforecaster? Find out on GJ Open!

* Ryan Adler is a Superforecaster, GJ managing director, and leader of Good Judgment’s question team