AI Hallucination Evaluation: IMF Precautionary Balances Review 2026


Executive Summary

The short version

The IMF's March 2026 Executive Board review set a medium-term target of SDR 25 billion, a minimum floor of SDR 20 billion, and named the Middle East as the specific geopolitical theatre driving downside risk. The Board recorded the number of countries subject to surcharges in FY2026 falling from 20 to 13, a figure anchored in IMF Press Release 24/376. And a Q2FY26 Quarterly Financial Report placed precautionary balances at SDR 26,782 million at 31 October 2025.

Claude Opus 4.7 got none of these right on any of the six questions the RLB Specialist Panel put to it. The failures weren't random noise. They fell into a clear pattern: numbers that lag one biennial cycle behind, baselines inflated under deliverable pressure, Board lexicon altered in ways that change what a practitioner reads from it, and geographic attributions that expanded beyond what the regulator actually said.

Each row: what Claude Opus 4.7 committed to (red) vs. what the IMF's primary text records (green). All figures drawn from verbatim regulator-issued substrate.

Finding 1 — Most Material

The SDR 5 billion floor problem

The IMF Board's March 2026 review is unambiguous: Directors generally agreed to retain the current floor at SDR 20 billion. The model committed to SDR 15 billion, a full biennial cycle behind, including in a board briefing memo drafted for an emerging-market finance ministry client.

That SDR 5 billion gap is exactly what supervisors, counterparties, and internal QC reviewers check against the source first. A model that embeds a confident wrong number into a board deliverable doesn't create a minor inaccuracy. It creates a falsifiable error in a document that may reach a minister's desk.

Model output: SDR 15bn (Claude Opus 4.7, board briefing memo register)

Regulator text (IMF, Mar 2026): SDR 20bn ("Directors generally agreed to retain the current floor")

Findings 2 & 3 — Reform Baseline

The surcharge count inflation

IMF Press Release 24/376 records the FY2024 surcharge-payer baseline as 20 countries, with FY2026 expected to fall to 13. The model's policy-brief draft inflated the starting number to 22, producing a 22-to-13 trajectory. That looks like a steeper reform win than the regulator recorded, which matters for any sovereign-debt practitioner using the trajectory to plan multi-year debt-service scenarios.
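
How much steeper the inflated trajectory looks can be worked out directly. The sketch below uses only the counts quoted above (20 or 22 payers falling to 13); the helper name `percent_decline` is illustrative and not part of any RLB or IMF tooling.

```python
def percent_decline(start: int, end: int) -> float:
    """Return the percentage fall from start to end."""
    return (start - end) / start * 100


regulator = percent_decline(20, 13)  # baseline recorded in Press Release 24/376
model = percent_decline(22, 13)      # baseline used in the model's draft

print(f"Regulator trajectory: {regulator:.1f}% decline")
print(f"Model trajectory:     {model:.1f}% decline")
print(f"Overstatement:        {model - regulator:.1f} percentage points")
```

On these figures the recorded decline is 35.0% and the model's version is about 40.9%, so the draft overstates the reduction by roughly 5.9 percentage points.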

A separate finding on the same reform: the model upgraded the Board's characterisation of the early-surcharge-review signal from "a few Directors", the regulator's precise term, to "a number of Directors." In IMF Board lexicon, these are not interchangeable. "A few" signals a minority view. "A number" can imply broader traction. A practitioner reading the upgraded phrasing in an advisory would draw a different signal on review timing.
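
A quantifier substitution of this kind can be caught mechanically. The following is a minimal sketch, assuming the draft and the source are available as plain text: it extracts the quantifier that directly precedes "Directors" in each and flags any that the draft uses but the source does not. The quantifier list is limited to the three terms this audit mentions and is not a complete account of Board lexicon; the draft sentence is a hypothetical illustration, not the model's actual output.

```python
import re

# Assumed, non-exhaustive quantifier list: only the terms named in this audit.
QUANTIFIERS = ["a few", "a number of", "some"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(q) for q in QUANTIFIERS) + r")\s+Directors\b",
    re.IGNORECASE,
)


def board_quantifiers(text: str) -> list[str]:
    """Return each quantifier that directly precedes 'Directors', lowercased."""
    return [m.group(1).lower() for m in _PATTERN.finditer(text)]


def lexicon_drift(source: str, draft: str) -> list[str]:
    """Return quantifiers used in the draft that never appear in the source."""
    allowed = set(board_quantifiers(source))
    return [q for q in board_quantifiers(draft) if q not in allowed]


# Source wording as quoted from the March 2026 review.
source = ("in the event that precautionary balances rise well above the target, "
          "a few Directors saw merit in considering an early review of charges "
          "and the surcharge policy in due course.")
# Hypothetical draft sentence, for illustration only.
draft = "A number of Directors saw merit in considering an early review."

print(lexicon_drift(source, draft))
```

A check like this only detects a changed term. Deciding what the change signals still needs a reader who knows the Board's conventions.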

"Recognizing the uncertain environment, in the event that precautionary balances rise well above the target, a few Directors saw merit in considering an early review of charges and the surcharge policy in due course." — IMF Board, March 2026 Review

Finding 4 — Named-Theatre Attribution

Adding Ukraine to a sentence the Board didn't

The Board's March 2026 review names one specific geopolitical theatre in its downside-risk language: the Middle East. The model added Ukraine. This isn't a policy disagreement; it's a factual attribution error. A central-bank backgrounder or legal-and-policy advisory that attributes both named theatres to the Board record is contradicted by the face of the regulator's text.

For AI lab teams: this failure class, where a model expands a specific named list under generation pressure, is distinct from pure hallucination. The model has plausible reasons to associate Ukraine with IMF downside risk language in 2024–2026. The problem is that it substituted its own contextual knowledge for the regulator's specific textual record.
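
One way to test for this failure class is to compare the set of entities a draft attributes to the source against the set the source actually names. The sketch below assumes entity extraction has already happened upstream and that both lists are available as sets of strings; the function name `compare_named_lists` is hypothetical.

```python
def compare_named_lists(source: set[str], draft: set[str]) -> dict[str, set[str]]:
    """Compare named entities case-insensitively.

    Returns the entries the draft added beyond the source and the entries
    it dropped from the source.
    """
    src = {name.casefold(): name for name in source}
    drf = {name.casefold(): name for name in draft}
    return {
        "added": {drf[k] for k in drf.keys() - src.keys()},
        "dropped": {src[k] for k in src.keys() - drf.keys()},
    }


# Theatres as described in this audit.
board_record = {"Middle East"}
model_draft = {"Middle East", "Ukraine"}

result = compare_named_lists(board_record, model_draft)
print(result["added"])    # entries with no support in the Board record
print(result["dropped"])  # entries the draft omitted
```

Any non-empty "added" set marks an attribution the source text does not support, however plausible the entry is in context.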

Failure Pattern Analysis

Four failure shapes, one audit

Failure Type A

Cycle-trajectory drift

Numeric parameters (floor, target, half-year level) pulled back one biennial cycle. The model's training data from prior reviews appears to anchor more strongly than the current review text.

Failure Type B

Baseline inflation under pressure

Single-value counts (surcharge-payer baseline) inflated when the model is in deliverable-pressure mode. The 22 vs 20 divergence is reproducible across prompt registers.

Failure Type C

Lexicon characterisation drift

Regulator-specific Board lexicon terms (few / number / some) upgraded without warrant. These distinctions carry real signal in IMF Board discourse and practitioner interpretation.

Failure Type D

Named-entity expansion

Specific named lists (geopolitical theatres, countries) expanded with contextually plausible but textually unrecorded entries. The model substitutes contextual knowledge for source fidelity.

Operational Signal for AI Lab Teams

When these failures surface in the wild

These aren't edge-case prompts. The RLB Specialist Panel designed questions to mirror how practitioners actually use AI on this regulation: drafting board memoranda for EM finance ministry clients on Fund near-term lending capacity, drafting policy briefs on the October 2024 surcharge reform, preparing backgrounders for central-bank bilateral meetings with the IMF Managing Director, and drafting desk notes for sovereign-credit research teams on the half-year PB trajectory.

Any of these deliverables hitting the wrong floor figure, the wrong surcharge baseline, or the wrong Board lexicon is a falsifiable error in a client-facing document. Retrieval-anchored verification against the current biennial review text is the minimum mitigation. General-purpose prompting does not resolve this: the failures survive web-search-enabled configurations.

Q2FY26 precautionary balances trajectory. The model committed to ~SDR 26.5bn for Oct 2025; the QFR records SDR 26,782m. That is a small divergence of roughly SDR 282 million, but one with precision consequences in sovereign-credit desk notes.
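
The comparison step of such verification can be sketched as a table of recorded versus claimed values. This is a minimal illustration, not the Panel's tooling: the `Check` class is hypothetical, retrieval of the current review text is assumed to have happened already, and the model's approximate "~SDR 26.5bn" is read as 26,500 million for the purpose of the subtraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    recorded: int  # value in the regulator's text
    claimed: int   # value the model committed to
    unit: str

    @property
    def gap(self) -> int:
        """Claimed minus recorded; zero means the draft matches the source."""
        return self.claimed - self.recorded

    @property
    def matches(self) -> bool:
        return self.gap == 0


# Figures as reported in this audit (SDR amounts in millions).
checks = [
    Check("minimum floor", 20_000, 15_000, "SDR m"),
    Check("FY2024 surcharge-payer baseline", 20, 22, "countries"),
    Check("precautionary balances, 31 Oct 2025", 26_782, 26_500, "SDR m"),
]

for c in checks:
    status = "OK" if c.matches else "MISMATCH"
    print(f"{status:8} {c.name}: claimed {c.claimed:,} vs recorded "
          f"{c.recorded:,} {c.unit} (gap {c.gap:+,})")
```

Exact matching is a deliberately strict choice here. A team might instead allow a tolerance for figures the draft marks as approximate, which is a policy decision this sketch leaves open.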

Methodology

How the audit was run

The RLB Specialist Panel authenticated the primary sources (the IMF's March 2026 Review of the Adequacy of the Fund's Precautionary Balances, IMF Press Release 24/376, and the Q2FY26 Quarterly Financial Report) as substrate v1 prior to testing. Each question was designed to mirror a real practitioner workflow: board memos, policy briefs, campaign reports, bilateral meeting backgrounders, and desk notes. Claude Opus 4.7 was tested with web search active.

Findings are immutably recorded under RLB Citation IDs. The IMF and any named entity hold a permanent right of reply on every finding. Full detail is in the Hallucination Register.

Right of Reply

The RLB Panel's standing offer

The RLB Specialist Panel offers the International Monetary Fund and any other named entity a permanent right of reply on every finding published here. Corrections are applied with the same immutable citation discipline as the original findings. Contact: Right of Reply form.

When the Fund's own numbers don't survive the model