AI Labs · Last updated 7 Jun 2026 · methodology vv2.3 · Hallucination Register

IMF Surcharge Reform 2024: Numeric Baseline Failures Across Frontier AI Model Configurations

Both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search produced the same wrong pre-reform baseline when asked about the IMF's October 2024 surcharge reform, citing 19 surcharge-paying countries where the IMF's own published record establishes 20. The regulation in question is the IMF Charges and Surcharge Reform (2024), a Board-approved policy change effective 1 November 2024 that restructured the Fund's surcharge framework with explicit before/after country counts.

The error is not a paraphrase or approximation: both models committed to a specific integer that diverges from the regulator's figure, with each model arriving at the same wrong number via different failure paths, one reconstructing from training, the other deferring to a third-party source that had already introduced the error. When two models tested under different configurations converge on the same specific wrong number through different mechanisms, it signals that the correct figure is systematically under-indexed relative to the widely-circulated wrong figure in the content the models draw from.

When this affects AI Labs

Compliance teams, sovereign debt analysts, fintech operators, and multilateral finance practitioners routinely query frontier models about IMF policy, surcharge thresholds, quota-based eligibility criteria, effective dates for rule changes. The IMF's surcharge framework is directly material to member-country borrowing costs and to the advisory work of any institution operating in emerging-market finance. When a model confidently states the wrong country count in response to a factual policy question, downstream users working on debt-restructuring analysis, country-risk reporting, or regulatory filings may incorporate that error into their own work without re-checking the primary source.

The lab-side exposure has two layers. First, the model's output is delivered with no uncertainty signal, both tested configurations stated the pre-reform baseline as if it were settled fact. A user with no independent access to the IMF's source document has no reason to doubt it. If that user acts on the figure in a client report or regulatory submission, the model is the proximate cause of the error.

Second, this class of failure, confidently wrong on a specific, verifiable integer in a policy document, is precisely the failure type that surfaces in misuse-attribution and regulatory-liability discussions when AI-assisted compliance advice goes wrong. Labs that cannot demonstrate their models handle IMF policy numerics correctly face heightened scrutiny in regulated-domain deployment.

The IMF's surcharge reform documentation sits at the intersection of several structural properties that make it a reliable failure surface: it contains precise numeric thresholds that changed at a specific date (creating a pre/post discontinuity), it is covered predominantly by secondary commentary rather than by the Fund's own primary text in high-traffic indexed content, and the reform was finalised in late 2024, meaning the correct figures were published close to or after the training-data window for models now in deployment. Any regulation sharing these properties, recent numeric amendments, thin primary-source indexing, dense secondary-source paraphrase, will reproduce this failure class.

Aggregate impact

Model	Configuration	Failure count	Dominant error pattern
Claude Opus 4.7	Web search	1	Wrong pre-reform baseline integer reconstructed from training
Claude Sonnet 4.6	Web search	1	Wrong pre-reform baseline integer sourced from third-party secondary content

Claude Opus 4.7 with web search stated that 19 countries were paying surcharges before the reform took effect. The IMF's published record establishes 20. The model's phrasing, "the number of countries paying surcharges falls from 19 to 11", suggests it reconstructed the pre-reform figure from training rather than retrieving it from the primary document, producing a confident specific claim with no qualifier. The post-reform figure (11) is correct; the failure is localised to the pre-reform baseline, which the model appears to have held in training at the wrong value.

Claude Sonnet 4.6 with web search reproduced the same wrong integer (19) but via a different path: its response explicitly cited retrieved content, and the cited source appears to be a third-party account that had already introduced the error. The model's reasoning treated that secondary source as authoritative without flagging the discrepancy against the IMF's own published numbers. Both the Opus 4.7 and Sonnet 4.6 failures surface at the same numeric boundary, the pre-reform count of surcharge-paying countries, through different failure mechanisms: one internal, one external-source-dependent.

Failures cluster at a single numeric datum that exists in the IMF's authoritative text but is predominantly represented in indexed secondary content at the wrong value. The convergence across two models tested under web-search configurations, one relying on training, one deferring to retrieved third-party content, indicates the correct figure is consistently lower-ranked or absent in the content both training pipelines and live retrieval pipelines draw from.

This is not a retrieval gap alone and not a training-data gap alone; it is a signal that the secondary-source paraphrase ecosystem for this reform systematically propagated the wrong baseline, and neither model's architecture caught it.

Findings

1 finding in this case study. Click any to see its full evidence card.

Finding on 'Q004 Probe' for Claude Opus 4.7 with web search ON see this finding →

What your team should do

Implications for your training data

Both tested configurations failed on the same integer, the pre-reform count of surcharge-paying countries, despite having access to different content sources. The most parsimonious explanation is that the wrong figure (19) was more heavily represented in the training corpus than the IMF's published figure (20), likely because secondary commentary on the reform (news coverage, policy briefs, law-firm summaries) circulated the incorrect baseline before the IMF's authoritative document was widely indexed.

Training-side, the corpus ingestion pipeline for multilateral financial institutions such as the IMF should weight primary Board documents, official press releases, and published staff reports over secondary commentary on the same topic. Where a numeric claim in a secondary source conflicts with the primary document, the training signal should penalise secondary-source figures.

The reform was finalised in late 2024, placing the correct figures close to or past the effective training window for models currently in deployment. This creates a class of failures where recent policy amendments with specific numeric thresholds are represented in training predominantly by early drafts, commentary, or pre-finalisation estimates rather than the authoritative final text. Corpus refresh cadence for international financial institutions, IMF, World Bank, BIS, FSB, should be treated as a priority class, not handled on the same schedule as general web content.

Implications for your post-training logic

When a model is operating with web search enabled and the retrieved content contains a specific numeric claim about a regulator's published rule, the retrieval-augmented layer should apply source-authority weighting: the regulator's own domain (imf.org, bis.org, etc.) should rank above third-party commentary when both are available. Claude Sonnet 4.6's failure path shows that the current ranker does not reliably enforce this hierarchy, a third-party account carrying the wrong baseline was treated as sufficient without cross-checking against the primary source.

For numeric thresholds and country-count figures in policy documents, a self-verification pass before finalising the response would catch the class of error observed here. The mechanism does not need to be elaborate: when the model commits to a specific count or threshold tied to a named policy change, a secondary retrieval step targeting the regulator's primary document and checking for numeric consistency would surface the discrepancy. The fact that both the correct and incorrect figures were likely reachable via web search makes this a retrieval-routing and verification problem, not a knowledge problem.

Specific eval / red-team probes RegLeg suggests

Pre/post numeric discontinuities in recent policy amendments: probe models on the specific before/after figures in regulations finalised in the 12–18 months before the model's training cutoff, where secondary commentary was published before the authoritative text was widely indexed.
Single-integer baseline claims with coherent downstream arithmetic: test cases where the wrong baseline produces internally consistent downstream calculations, the model's response passes a basic coherence check even while the foundational figure is wrong.
Retrieval authority ranking under conflicting sources: design probes where the regulator's primary document and a third-party summary disagree on a specific number; verify whether the model flags the conflict or silently defers to the secondary source.
IMF quota-based threshold mechanics: the surcharge framework involves quota percentages, country-specific borrowing levels, and fiscal-year projections, probe the full numeric landscape around threshold changes, not just the headline country count.
Calibration on specific integers vs. ranges: check whether models are more likely to express uncertainty on approximate figures than on specific integers, and whether that calibration difference holds when the specific integer is wrong.

How RLB can help

We document nuanced failure patterns across model versions and configurations on regulatory content, failure modes including subcategory-numeric conflation where a specific integer is wrong while surrounding arithmetic holds; multi-body institutional attribution drift where the cited lead body is the lower-frequency co-author; schema over-specification on technical formats such as ISIN and LEI structures; secondary-source deference on recently amended numeric thresholds; and false-negative evasion on retrievable official-speech content that falls outside a retrieval pipeline's indexing cadence.

The IMF surcharge reform failure observed across both Claude Opus 4.7 and Claude Sonnet 4.6 is a clear instance of the secondary-source deference pattern, and we have documented it across enough regulatory domains to characterise where models are systematically exposed.

We can help your team address these gaps in three concrete ways. First, targeted correction-pair generation: for each documented failure mode, we can produce regulator-authoritative correction pairs formatted for direct ingestion into your training-data pipeline, the correct figure alongside the wrong figure your model currently holds, paired with the primary-source sentence that establishes ground truth. Second, embedded eval partnership: quarterly comparative reports across model versions on a defined regulatory portfolio, with regression monitoring on previously-documented failure modes so you can see whether a new release has resolved or shifted the failure surface.

Third, pre-release evaluation cycles for capability launches touching regulated domains, financial services, payments infrastructure, sovereign-debt and multilateral-finance content, flagging failure shapes before they reach customers operating in compliance-sensitive contexts.

To scope a technical partnership for refining your models against these failure modes, contact us at reglegbrief.com.

← Back to summary Other AI Labs white papers →

Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.