
SINGAPORE, June 10, 2026. Two frontier AI models running with web search enabled produced confidently wrong reconstructions of two operationally consequential mechanics in the International Monetary Fund's 2024 Guidance Note on Financing Assurances and Sovereign Arrears, according to findings released today by the RegLeg Brief (RLB) Specialist Panel, which tested both models. Asked when the Fund's Lending Into Official Arrears Strand 4 pathway is activated, and what creditor coverage satisfies financing assurances in pre-emptive cases, the models substituted invented tests and, in one case, an invented threshold for the conditions the Guidance Note sets out.
Claude Sonnet 4.6, asked whether a bilateral creditor's silence within four weeks satisfies Strand 4 entry, answered that Strand 4 "is not available simply because one creditor is slow or silent" and that "there must be an affirmative signal of unwillingness to engage." The Guidance Note states the Fund shall seek Strand 4 safeguards where "an adequately representative agreement has not been reached through a representative standing forum" and "consent is not forthcoming." The published text treats absence of consent within four weeks as a structural trigger; the model elevated it into a refusal test the regulator does not impose.
Claude Opus 4.7, on the same question, described a good-faith engagement obligation, a holdout-as-binding-obstacle test, and an orderly-resolution advancement criterion. None of those three appears in the Strand 4 entry conditions, which specify a three-part structural gate: no representative standing forum agreement, no consent within four weeks, and the Strand 3 criteria unmet.
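The three-part gate can be written as a simple predicate. The sketch below is illustrative only: it encodes the gate as this report summarizes it, not the Guidance Note's own wording, and the class name, field names, and function name are hypothetical. The record deliberately has no field for an affirmative refusal, because under this reading silence past the four-week window is enough.

```python
from dataclasses import dataclass

# Four-week consent window, as cited in this report (assumption: counted in whole weeks).
CONSENT_WINDOW_WEEKS = 4


@dataclass(frozen=True)
class CreditorStatus:
    """Hypothetical record; field names are illustrative, not IMF terminology."""
    forum_agreement_reached: bool  # agreement reached through a representative standing forum
    consent_received: bool         # the creditor has given consent
    weeks_elapsed: int             # whole weeks since consent was sought
    strand3_criteria_met: bool     # whether the Strand 3 criteria are satisfied


def strand4_gate_open(status: CreditorStatus) -> bool:
    """Return True when all three structural conditions described above hold."""
    no_forum_agreement = not status.forum_agreement_reached
    # Silence counts: no consent once the window has run, with no refusal required.
    no_timely_consent = (
        not status.consent_received
        and status.weeks_elapsed >= CONSENT_WINDOW_WEEKS
    )
    strand3_unmet = not status.strand3_criteria_met
    return no_forum_agreement and no_timely_consent and strand3_unmet
```

Under these assumptions, a creditor that has stayed silent for four weeks opens the gate, while the same creditor at three weeks does not. Neither model's added tests (an affirmative refusal, good-faith engagement, a holdout-as-obstacle finding) appear as inputs.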
Opus 4.7, asked what creditor coverage satisfies financing assurances in a pre-emptive restructuring, answered that the "sufficient set" must account for "more than 50 percent of the total financing contributions required from official bilateral creditors." No numerical threshold for "sufficient set" appears in the source for pre-emptive cases. The model transposed the majority-of-financing test from the Strand 1 representative-Paris-Club-agreement context into the pre-emptive sufficient-set test, where it does not appear.
A sovereign debt team or finance ministry desk officer relying on either output would advise activation or coverage off conditions the Guidance Note does not contain.
Cross-provision conflation is the dominant failure shape Claude Opus 4.7 produces on the IMF's 2024 Guidance Note on Financing Assurances and Sovereign Arrears. The model correctly understands the general architecture of the IMF sovereign debt framework but systematically applies numerical thresholds and procedural conditions from one sub-track to another, producing outputs that are structurally coherent and contextually plausible but substantively wrong in exactly the ways that matter for operational use. Across the findings documented here, Claude Opus 4.7 with web search substituted general program-level preconditions for the specific sequential procedural triggers required to invoke Strand 4. It also applied the majority-financing-contributions threshold from the Strand 1 adequately-representative Paris Club agreement context to the pre-emptive case "sufficient set" definition, a provision where no such numerical threshold appears in the regulator's text. The error is not retrieval failure in the conventional sense: the model retrieves the right framework, the right vocabulary, and the right domain. The failure occurs at the level of sub-track specificity, where condition sets from distinct procedural branches are merged or substituted. For a regulation whose entire operational purpose is to specify exactly which conditions apply in which procedural sequence, this failure shape produces outputs that would misdirect sovereign debt advisory work at the decision points that matter most.
This failure implicates the training data's representation of sub-track-specific procedural logic versus general program conditionality: the model's corpus almost certainly contains far more material describing IMF program conditions at a general level than the specific three-part sequential gate that defines Strand 4 eligibility, causing the model to select the higher-frequency framing when answering a sub-track-specific procedural question.
The retrieval stack is not obviously at fault here: the model retrieved the correct framework domain, but the ranking or selection logic did not surface or weight the Strand 4-specific procedural text over the general-program framing to which it appears to have defaulted.
This Sonnet failure mirrors the Opus result on the same question and reinforces the cross-model signal: the Strand 4 entry gate turns on the absence of consent within four weeks, but both subjects demanded more than creditor silence before treating the gate as open. This points to a generation-layer disposition rather than a retrieval gap. The model surfaces the correct framework family and even cites adjacent procedural language, then collapses a two-state regulatory predicate (silence versus affirmative refusal) into a single condition that silence alone cannot satisfy during answer composition.
For the lab's team, two probes are worth running. The first is a calibration sweep across sovereign-debt sub-tracks, measuring whether the model preserves binary procedural predicates when one branch is the lower-frequency outcome in training material. The second is a comparison of web-search-augmented versus base-mode answers on the same question, to isolate whether retrieved context corrects the drift or whether the inference-time selection logic overrides retrieved specifics with framework-level priors.
Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.