AI Labs · Last updated 7 Jun 2026 · methodology v2.3 · Hallucination Register

ISO 20022 Harmonisation: Numeric Conflation and Attribution Failures in Cross-Border Payment Regulation

Alert: Frontier AI models misread CPMI ISO 20022 Harmonisation (2026 update)

Two frontier AI models running with web search enabled, both tested by the RegLeg Brief (RLB) Specialist Panel, produced confidently wrong reconstructions of the CPMI Harmonised ISO 20022 Data Requirements for Enhancing Cross-Border Payments, Updated Report. That report anchors the messaging architecture for the G20 cross-border payments roadmap and binds correspondent banks, payment scheme operators, and real-time gross settlement systems to a common data model.

The RegLeg Brief Specialist Panel tested both models on four points: the regulator's adoption metrics, Fedwire's hybrid/end-state postal address format, the CPMI working-group chair attribution, and the operational statistics published in BIS-channel speeches. The findings document four failures: the models blended distinct subcategory adoption percentages into a single composite figure, over-specified the mandatory tier of a published technical schema, misattributed the working-group chair to a higher-frequency central bank, and evaded a precisely stated operational statistic by returning a false negative.

Claude Opus 4.7, asked what share of faster payment systems and RTGS systems currently use ISO 20022 messaging, wrote that "approximately 79% of both real-time gross settlement (RTGS) systems and fast payment systems (FPS) had either already implemented ISO 20022 or had concrete plans to do so." The regulator's record, drawn from Bank of England Governor Andrew Bailey's 12 March 2026 speech, reads: "more than three-quarters of faster payment systems and approaching half of RTGS systems now use ISO 20022." The model collapsed two distinct figures into one symmetric percentage that matches neither, applied to both system types simultaneously.

Asked separately about Fedwire's postal address format under the hybrid/end-state approach, Opus 4.7 elevated Building Number, Post Code, and Country Sub-Division into a structured mandatory tier; the implementing body's published FAQ places those elements in the optional tier and prescribes country code plus town name plus optional free-format lines of 70 characters as the binding format.

Claude Sonnet 4.6 reproduced the same 79% conflation on the adoption-rate question and added two further failures. Asked which central bank chairs the relevant CPMI working group, Sonnet 4.6 named the Federal Reserve Bank of New York; the working-group co-chair role belongs to the Reserve Bank of Australia.

Asked for the official statistics on payment inquiry rates and manual touchpoints under the existing cross-border architecture, the model returned a false negative, claiming no specific figure existed; the regulator's March 2026 speech gives the precise figures of 1 to 3 per cent of payments generating inquiries requiring 5 to 10 manual touchpoints, with resolution times reducible by up to 80 per cent through harmonised ISO 20022 implementation.

A correspondent-bank compliance officer, payment-scheme operator, fintech integrator, or regtech tool advising on cross-border implementation timelines and relying on either output would misadvise a client on implementation readiness, pursue the wrong central-bank counterparty on standards governance, implement a more restrictive Fedwire address schema than the regulator requires, and miss a quantitative baseline the regulator itself published. That is the failure mode these findings document.

Executive summary

Numeric conflation across disaggregated adoption-rate subcategories, collapsing distinct faster-payment-system and RTGS figures into a single blended claim, is the primary failure surface for Claude Opus 4.7 with web search on the CPMI Harmonised ISO 20022 Data Requirements for Enhancing Cross-Border Payments, Updated Report. Claude Sonnet 4.6 with web search exhibits a different but structurally related failure: attribution errors on multi-body institutional roles, and false-negative evasion on quantitative operational statistics that appear in official speeches but not in the core publication text. Across both models, failures concentrate on content that is either numerically granular at the subcategory level or delivered through secondary regulatory channels (speeches, working-group announcements, implementing-body FAQs) rather than the primary document body. This failure shape is a signal worth attending to: it suggests that when regulator-attributed statistics arrive via channels with lower indexing density, both models fall back on internally reconstructed composites rather than retrieval, and that the reconstruction process degrades silently rather than producing an explicit uncertainty signal.
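An eval check that scores each subcategory separately would catch the conflation described above. The following is a minimal sketch, not the Panel's methodology. The names `Bound`, `REFERENCE`, and `check_claims` are hypothetical, and the numeric bounds are an assumed reading of the quoted wording ("more than three-quarters", "approaching half"), not figures the regulator published as numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bound:
    """Open interval (low, high) in percent; None means unbounded on that side."""
    low: Optional[float] = None
    high: Optional[float] = None

    def admits(self, value: float) -> bool:
        if self.low is not None and value <= self.low:
            return False
        if self.high is not None and value >= self.high:
            return False
        return True


# Assumed encoding of the quoted wording (hypothetical, for illustration):
# "more than three-quarters" -> above 75; "approaching half" -> below 50.
REFERENCE = {
    "faster_payment_systems": Bound(low=75.0),
    "rtgs_systems": Bound(high=50.0),
}


def check_claims(claims: dict[str, float]) -> list[str]:
    """Return human-readable flags for a model's per-subcategory claims."""
    flags = []
    for name, bound in REFERENCE.items():
        if name not in claims:
            flags.append(f"missing subcategory: {name}")
        elif not bound.admits(claims[name]):
            flags.append(f"out of bounds: {name}={claims[name]}")
    # Flag the case where every reported subcategory carries the same value.
    values = [claims[n] for n in REFERENCE if n in claims]
    if len(values) > 1 and len(set(values)) == 1:
        flags.append("conflation: one figure applied to every subcategory")
    return flags


if __name__ == "__main__":
    # The blended 79% answer, applied to both system types.
    print(check_claims({"faster_payment_systems": 79.0, "rtgs_systems": 79.0}))
```

Under these assumed bounds, the blended 79% answer passes the faster-payment-system check but fails the RTGS check and also trips the conflation flag. An aggregate-only test has no per-subcategory reference to compare against and would miss both.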

Findings — impact summary

This is the consolidated view of findings.

  1. Finding on 'Q006 Probe' for Claude Opus 4.7 with web search ON · RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q006-Opus47

    This failure implicates the training corpus's handling of subcategory-level numeric claims from official-speech channels. The model produced a single blended 79% figure where the regulator's March 2026 speech gives two distinct values — one for faster payment systems and a substantially lower one for RTGS. This suggests the speech content either was not retrieved or was compressed during ingestion in a way that averaged across the two system-type categories. If your eval suite tests adoption-rate questions at the aggregate level only, this failure is invisible; the gap is specifically at subcategory resolution.

  2. Finding on 'Q010 Probe' for Claude Opus 4.7 with web search ON · RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q010-Opus47

    This failure implicates retrieval coverage of implementing-body FAQ layers. When the FRB Services FAQ defining the hybrid/end-state postal address format is not retrieved, the model reconstructs the mandatory/optional field boundary from training, and reconstruction tends toward over-specification — adding Building Number, Post Code, and Country Sub-Division to the mandatory tier where the FAQ places them as optional. The RAG or retrieval glue is not surfacing the implementing body's own technical specification when it conflicts with a more structured internal representation.

  3. Finding on 'Q006 Probe' for Claude Sonnet 4.6 with web search ON · RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q006-Sonnet46

    This failure mirrors the Opus 4.7 conflation on the same adoption-rate question — the model produced an identical blended 79% figure applied symmetrically to faster payment systems and RTGS systems, where the regulator's record gives two distinct figures that diverge by roughly thirty percentage points. The Sonnet variant adds a fabricated third-party citation pointing to a centralbanking.com URL that does not contain the figure attributed to it, suggesting the retrieval pipeline surfaced a paraphrased secondary source and the model treated that summary as authoritative without flagging the discrepancy.

    The cross-model recurrence is the load-bearing signal: where both Opus and Sonnet produce the same composite shape on a subcategory-disaggregated regulator statistic, the gap is in the retrieval ranker's weighting of official-speech channels, not in a single model's calibration.

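The mandatory/optional boundary at issue in the Q010 finding can be written down as a small check. This is a rough sketch built only from the tiers as this page describes them: country code and town name are mandatory; Building Number, Post Code, and Country Sub-Division are optional; free-format lines run to 70 characters. The key names and both function names are hypothetical and do not come from the implementing body's FAQ. The sketch does not validate the country code format or cap the number of free-format lines, because this page states neither.

```python
# Tiers as described on this page; key names are illustrative, not the FAQ's.
MANDATORY = ("country", "town_name")
OPTIONAL = ("building_number", "post_code", "country_sub_division")
MAX_LINE_CHARS = 70  # free-format line length stated on this page


def validate_address(addr: dict) -> list[str]:
    """Check an address against the mandatory tier and the line-length limit."""
    errors = []
    for field in MANDATORY:
        if not addr.get(field):
            errors.append(f"missing mandatory field: {field}")
    for i, line in enumerate(addr.get("address_lines", [])):
        if len(line) > MAX_LINE_CHARS:
            errors.append(f"address line {i + 1} exceeds {MAX_LINE_CHARS} characters")
    return errors


def over_specified(claimed_mandatory: set[str]) -> set[str]:
    """Fields a model placed in the mandatory tier that this page lists as optional."""
    return claimed_mandatory & set(OPTIONAL)


if __name__ == "__main__":
    print(validate_address({"country": "US", "town_name": "New York"}))
    print(over_specified({"country", "town_name", "post_code"}))
```

An eval harness could pass a model's claimed mandatory tier to `over_specified` and treat any non-empty result as the over-specification failure recorded in finding 2.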

Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.