
Two frontier AI models, each running with web search, generated authoritative-sounding answers about the October 2024 CPMI report on API harmonisation for cross-border payments that either denied or contradicted information available in the CPMI's own public record. The Specialist Panel tested both models on six operational questions a payments-research analyst, a compliance team, and a market-briefing desk would actually pose. Every answer that should have surfaced a CPMI-published number, named partner, or scope assignment instead refused the data, substituted a fabrication, or reshaped the regulator's text.
On the payment pre-validation API recommendation, both models declined to identify the South African Reserve Bank as the CPMI's named collaboration partner, even though CPMI Brief No. 9 (November 2025) states the partnership in one sentence. On the global fast-payment-systems landscape, both models denied that CPMI's own published figures break out central-bank versus private operation, even though the Tara Rice speech of November 2023 gives the proportions verbatim. On the February 2026 update to the ISO 20022 data requirements, Sonnet 4.6 manufactured a November 2026 phase-out deadline for unstructured addresses that does not appear in the d230 update text.
On the stakeholder breakdown across the ten harmonisation recommendations, Opus 4.7 reshaped the scope of Recommendation 1, omitting the regulation's explicit framing that the recommendation targets jurisdictional authorities alongside standards bodies.
For BIS-CPMI as the standards body, and for the central banks, payment-system operators, and correspondent banks acting on its guidance, the operational issue is concrete. AI-assisted research briefings, regulatory summaries, and market-intelligence work product configured around these outputs would carry denials of CPMI source material that is in fact public, alongside at least one fabricated regulatory deadline. The mistake is not recoverable at runtime: each output reads internally consistent and policy-fluent, and validation against the CPMI primary text only happens if the user already knows what to look for.
RegLeg Brief documents all six findings with the immutable RLB Citation IDs below, linked to the per-finding pages where the verbatim model output, the matched CPMI excerpt, and the Specialist Panel's diagnosis sit alongside one another. The findings are referenced as: RLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q007-Opus47, RLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q007-Sonnet46, RLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q008-Opus47, RLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q009-Sonnet46, RLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q010-Opus47, and RLB-H-INT-BIS-CPMI-API-HARMONISATION-CROSS-BORDER-2024-Q010-Sonnet46.
Across six question-level AI runs on this regulation, Claude Opus 4.7 and Claude Sonnet 4.6, both with web search enabled, failed in structurally similar ways. The October 2024 CPMI recommendations document and the technical annex of the ISO 20022 update sit behind a content boundary that web retrieval does not reliably cross. When the models reach into that boundary, they do not refuse: they fabricate. Opus produced a per-recommendation stakeholder taxonomy keyed to ISO, BIAN, and SWIFT, with no support in the regulator's own recommendation text. Sonnet asserted a specific November 2026 cutover commitment for ISO 20022 structured addresses that the regulator's document does not contain. Both models hedged or denied a SARB-CPMI pre-validation collaboration that CPMI Brief No. 9 (November 2025) names explicitly. Both substituted a 2025 monitoring-survey sample of 57 fast payment systems for the regulator's stated universe of 70+. The failure shape is not model-specific: it is a property of how this regulation's content is exposed publicly, and it reveals retrieval-coverage, source-attribution, and confidence-calibration gaps that an AI lab can address with targeted evaluation.
This is the consolidated view of the findings. Click 'see details →' on any item to open the full finding.
The model retrieved the SARB-CPMI pre-validation partnership context — it had enough signal to know the question was about a named central-bank collaboration on a specific recommendation — but downgraded a regulator-confirmed fact to speculative hedge phrasing ('plausible but unverified'). This is calibration drift in a direction that looks like safety: the model hedged where it should have committed. The training and retrieval pipeline appears to penalise commitment without verifying retrieval, producing under-confidence on facts the regulator has itself published.
For evaluators, the high-value probe is the asymmetry: when the AI has retrieved enough to identify the entity at issue (SARB, payment pre-validation, CPMI Brief No. 9), commitment behaviour should match retrieval depth rather than defaulting to hedge. A confidence-calibration eval that scores against regulator-published source text would surface this drift before it ships to production users.
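The asymmetry described above can be sketched as a simple check. This is a minimal illustration, not the Specialist Panel's method: the `CalibrationCase` fields, the `HEDGE_MARKERS` list, and the subset test used as a stand-in for "retrieval depth" are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical hedge phrases; a real eval would use a broader, validated list.
HEDGE_MARKERS = ("plausible but unverified", "cannot be confirmed", "unverified")


@dataclass
class CalibrationCase:
    answer: str               # the model's response text
    entities_retrieved: set   # entities present in the retrieved context
    entities_required: set    # entities the regulator's text names
    fact_published: bool      # True if the regulator has published the fact


def is_hedged(answer: str) -> bool:
    """Return True if the answer contains any hedge marker."""
    text = answer.lower()
    return any(marker in text for marker in HEDGE_MARKERS)


def under_confident(case: CalibrationCase) -> bool:
    """Flag a hedge on a published fact when retrieval covered the required entities."""
    retrieved_enough = case.entities_required <= case.entities_retrieved
    return case.fact_published and retrieved_enough and is_hedged(case.answer)
```

A case is flagged only when all three conditions hold, so a hedge on thin retrieval is not penalised; that mirrors the point that commitment should track retrieval depth.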
see details →
Sonnet 4.6 with web search returned a confident negative ('available sources do not identify SARB as a named pilot partner') when CPMI Brief No. 9 (November 2025) explicitly does name SARB. The failure mode is a false-negative retrieval gap presented as a positive knowledge claim. The model's web-search loop either did not surface CPMI Brief No. 9 or surfaced it and did not extract the SARB identification from it; in either case the model treated the absence of retrieval as evidence of absence rather than as a retrieval-coverage limitation.
For an AI lab, this is a high-value alignment probe: confident negatives on entity-level regulatory questions — 'no named partner exists', 'no specific date is published' — should be evaluated against a corpus of regulator-published material to determine the false-negative rate. The same retrieval pattern almost certainly produces similar false-negatives across other regulator briefs.
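One way to measure the false-negative rate described above is sketched here. The marker phrases and the `(answer, published)` input shape are assumptions for illustration; the ground-truth flag would come from a hand-checked corpus of regulator-published material.

```python
# Hypothetical phrases that signal a confident negative; not an exhaustive list.
NEGATIVE_MARKERS = (
    "do not identify",
    "no named partner",
    "no specific date is published",
    "is not enumerated",
)


def is_confident_negative(answer: str) -> bool:
    """Return True if the answer asserts that the information does not exist."""
    text = answer.lower()
    return any(marker in text for marker in NEGATIVE_MARKERS)


def false_negative_rate(cases: list) -> float:
    """Share of confident negatives that the regulator corpus contradicts.

    `cases` holds (answer, published) pairs, where `published` is True when
    the regulator's own material contains the fact the answer denies.
    """
    contradicted = [published for answer, published in cases
                    if is_confident_negative(answer)]
    if not contradicted:
        return 0.0
    return sum(contradicted) / len(contradicted)
```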
see details →
Domain inference used as a stakeholder-assignment mechanism (assigning ISO, BIAN, and SWIFT to a harmonisation-processes category by structural reasoning) is not retrieval. The model's training data appears to lack the per-recommendation stakeholder content of the CPMI October 2024 recommendations PDF, and the model's self-check did not flag that its output was constructed rather than retrieved. The RAG glue layer is not enforcing a 'content was found' gate before allowing domain-inference fill. Worse, the structured presentation format, a roman-numeral taxonomy with named bodies attached, gives the inference output the visual register of a retrieved factual breakdown.
For an AI lab, this is a generation-calibration probe: when the model produces structured taxonomic output on a regulatory document whose primary text was not in retrieval, the structure itself should signal inference and the response should explicitly say so.
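A 'content was found' gate of the kind described above might look like the following sketch. The function names, the term-matching rule, and the wording of `INFERENCE_NOTICE` are hypothetical; a production gate would match on document identity and passage provenance rather than on substrings.

```python
# Hypothetical notice text prepended when the primary text was not retrieved.
INFERENCE_NOTICE = (
    "Note: the breakdown below is inferred, not retrieved from the primary text.\n"
)


def content_found(passages: list, required_terms: list) -> bool:
    """True if at least one retrieved passage contains every required term."""
    return any(
        all(term.lower() in passage.lower() for term in required_terms)
        for passage in passages
    )


def gate_structured_answer(draft: str, passages: list, required_terms: list) -> str:
    """Pass the draft through unchanged only when supporting content was retrieved."""
    if content_found(passages, required_terms):
        return draft
    return INFERENCE_NOTICE + draft
```

The gate does not block the answer; it only forces the response to say that the structure is inferred, which is the behaviour the probe asks for.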
see details →
The model returned a specific cutover commitment ('from November 2026 onwards, only structured and hybrid addresses will be permitted in ISO 20022 cross-border payment messages') and attributed it to the CPMI document. The regulator's own text describes only generalised 'standardisation and regulatory developments since 2023' and a separate technical annex; the specific date-and-format commitment is not present there. The fabrication appears to draw from SWIFT/CBPR+ community discussion material, which has circulated dates and structured-address mandates, and the model has cross-attributed that content to CPMI without distinguishing the source.
For an AI lab, this is a source-attribution eval candidate: when the model returns a specific commitment with a specific date, the regulator-source attribution should be verifiable against the cited document, and where it is not, the model should distinguish industry-community material from regulator-published material in its response.
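A minimal sketch of the attribution check described above, assuming the cited document's text is available as a plain string. It only looks for month-and-year dates; `unsupported_dates` and the pattern are illustrative names, not part of any existing tool.

```python
import re

# Matches "Month YYYY" dates such as "November 2026".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{4}\b"
)


def unsupported_dates(answer: str, cited_text: str) -> list:
    """Return dates stated in the answer that do not appear in the cited text."""
    cited = cited_text.lower()
    return [date for date in DATE_PATTERN.findall(answer)
            if date.lower() not in cited]
```

For example, checking the quoted cutover claim against a cited text that mentions only 'developments since 2023' returns `['November 2026']`, which is the signal that the date needs a different source or a retraction.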
see details →
The model substituted a survey-sample count (57 systems from the 2025 CPMI monitoring survey) for the regulator's stated universe figure (70+ from the November 2023 Tara Rice speech). The error is statistical-substrate confusion: the model has retrieved a sample from one CPMI publication and presented it as the universe figure that a different CPMI publication actually states.
The implication for the retrieval-and-generation pipeline is that when multiple CPMI sources publish related-but-different numbers (universe versus sample, current-state versus monitoring-snapshot), the model is not disambiguating between them; it surfaces the most recently retrieved figure as if it were the answer to the question. For an AI lab, this is a high-yield eval probe: regulatory benchmark questions that have universe-versus-sample distinctions in the primary source corpus should surface this confusion pattern reliably.
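The disambiguation step this implies can be sketched by tagging each retrieved figure with what it counts. The `Figure` type and its `kind` labels are assumptions for the example; the two values are the ones cited in this finding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Figure:
    value: str    # the number as published, e.g. "70+"
    kind: str     # "universe" or "sample" (hypothetical labels)
    source: str   # the publication the figure comes from


def select_figure(figures: list, wanted_kind: str) -> Optional[Figure]:
    """Return the single figure of the requested kind, or None if ambiguous or absent."""
    matches = [f for f in figures if f.kind == wanted_kind]
    return matches[0] if len(matches) == 1 else None


figures = [
    Figure("57", "sample", "2025 CPMI monitoring survey"),
    Figure("70+", "universe", "November 2023 Tara Rice speech"),
]
```

Selecting by kind rather than by retrieval recency means a question about the total number of fast payment systems resolves to the universe figure even though the survey sample was published later.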
see details →
Sonnet 4.6 with web search returned a confident non-availability claim ('a precise percentage breakdown of central bank vs. privately operated FPS is not enumerated in the public Brief 10 summaries available') when the November 2023 Tara Rice CPMI speech explicitly publishes the 40%/35% breakdown. This is the same false-negative pattern as the SARB partnership question: absence in the retrieved set is reported as absence from the regulator's record.
The model can cite the November 2023 speech accurately in other contexts (the 70+ universe figure traces to the same source), so the retrieval coverage is intermittent rather than missing entirely. For an AI lab, this is an evaluation probe for the consistency dimension of retrieval: facts published in a single regulator source should be retrieved reliably across questions that touch that source, and intermittent retrieval is itself a failure mode that user-facing responses do not signal.
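The consistency dimension described above could be scored along these lines. This is a sketch under the assumption that each question's run logs the identifiers of the sources it retrieved; the `runs` mapping and the identifier strings are hypothetical.

```python
def retrieval_consistency(runs: dict, source_id: str) -> float:
    """Fraction of questions whose retrieved set includes the given source.

    `runs` maps a question ID to the list of source identifiers retrieved
    for that question. Only questions that should touch the source belong here.
    """
    if not runs:
        return 0.0
    hits = sum(source_id in sources for sources in runs.values())
    return hits / len(runs)
```

A score below 1.0 for a source that every listed question depends on indicates the intermittent retrieval this finding describes, independent of whether any single answer happened to be correct.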
see details →
Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.