AI Labs INT BIS-CPMI

AI Labs · Harmonised ISO 20022 Data Requirements for Enhancing Cross-Border Payments, Updated Report

By Kratti A Agrawal, Lead, RegLeg Brief Specialist Panel

Alert: Frontier AI models misread CPMI ISO 20022 Harmonisation (2026 update)

Sonnet opens up the hallucination geography encoded in the CPMI ISO 20022 harmonisation standard.

— RLB Specialist Panel

Frontier AI models blended cross-border payment adoption statistics and misattributed CPMI working-group chair, regulatory-research panel finds

Two frontier AI models, with web search enabled, collapsed the regulator's separately-stated faster-payment-system and RTGS adoption rates into a single composite 79% figure, over-specified the Fedwire mandatory postal address tier, attributed the CPMI ISO 20022 working-group chair to the Federal Reserve Bank of New York rather than the Reserve Bank of Australia, and returned a false negative on a March 2026 BIS-channel speech statistic the regulator itself published. The RegLeg Brief Specialist Panel calls the pattern "Numeric Conflation and Channel-Index Drift" and says it points to a systematic gap between what frontier models retrieve from primary CPMI publications and what regulator-attributed content delivered through speeches, working-group press releases, and implementing-body FAQs actually contains.

SINGAPORE, June 13, 2026. Two frontier artificial-intelligence models generated reconstructions of the CPMI Harmonised ISO 20022 Data Requirements for Enhancing Cross-Border Payments that blend disaggregated adoption statistics, over-specify a published technical schema, and misattribute institutional roles the regulator's record places elsewhere, according to a white paper released today by RegLeg Brief, a regulatory-research outfit operated by Singapore-incorporated Verdus Technologies Pte. Ltd.

The findings concern the CPMI Harmonised ISO 20022 Data Requirements for Enhancing Cross-Border Payments, Updated Report, which operationalises the messaging architecture for the G20 cross-border payments roadmap and binds correspondent banks, payment scheme operators, real-time gross settlement systems, and implementing bodies such as Fedwire to a common data model. Both Anthropic's Claude Opus 4.7 and Claude Sonnet 4.6 were tested with web search active, mirroring how correspondent-bank compliance teams, payment-scheme operators, fintech integrators, and regtech tools actually use the models when scoping implementation timelines, drafting integration specifications, or briefing senior management on the cross-border payments roadmap.

The Verbatim Rule: Disaggregated Adoption, Not a Single Composite Figure

The regulator's record on ISO 20022 adoption is published not in a structured table inside the primary report but in an official speech delivered by Bank of England Governor Andrew Bailey on 12 March 2026. The verbatim passage reads:

"More than three quarters of faster payment systems covered by the survey, and approaching half of RTGS systems, are now using ISO 20022."

The structural register matters. The regulator characterises the two system types with distinct figures, faster payment systems "more than three quarters", RTGS systems "approaching half", because the adoption curve has moved materially faster on faster payment infrastructure than on legacy RTGS rails. A correspondent bank scoping implementation readiness against the composite would budget against an inflated RTGS baseline; a payment-scheme operator briefing a board on adoption momentum would understate the faster-payments lead.

The regulator's record on the operational baseline that justifies harmonisation is in the same speech channel:

Inquiry rate: 1 to 3 per cent of cross-border payments generate inquiries.
Manual touchpoints per inquiry: 5 to 10.
Resolution-time improvement: up to 80 per cent reducible through harmonised ISO 20022 implementation.
Working-group co-chair: the Reserve Bank of Australia; the RBA Governor served as Co-Chair of the relevant CPMI workstream.

Claude Opus 4.7: Blended Two Distinct Adoption Rates Into One Composite, Over-Specified the Fedwire Mandatory Schema

Asked what share of faster payment systems and RTGS systems currently use ISO 20022 messaging, Claude Opus 4.7 (with web search on) wrote, verbatim:

"approximately 79% of both real-time gross settlement (RTGS) systems and fast payment systems (FPS) had either already implemented ISO 20022 or had concrete plans to do so"

The structural error. The model collapsed two distinct figures into a single symmetric percentage applied to both system types. The 79% figure matches neither the faster-payment record ("more than three quarters") nor the RTGS record ("approaching half"); it is an internally-reconstructed composite. A compliance professional reading the output would record the two system types as adopting at the same rate and brief senior management against an inflated RTGS baseline.

The schema over-specification. Asked when Fedwire implemented the CPMI harmonised data model and what postal address format Fedwire requires for the hybrid/end-state approach, Opus 4.7 wrote that "structured Town Name and Country code are mandatory, with optional structured elements (e.g., Street Name, Building Number, Post Code, Country Sub-Division) plus a limited Address Line element for residual unstructured content." The FRB Services FAQ prescribes a leaner mandatory tier: country code plus town name plus optional free-format lines of 70 characters each.

The model elevated Building Number, Post Code, and Country Sub-Division into a more rigid structured-element category the implementing body's own FAQ does not require.

Why it failed. The RegLeg Brief Specialist Panel writes that "subcategory-numeric conflation surfaces precisely where the regulator documents distinct figures through a channel, in this case a BIS-hosted official speech, that the model's retrieval ranker indexes at lower density than the primary publication. The 79% composite is the shape an averaging pass over the speech text would produce if the two figures were collapsed before extraction.

The Fedwire over-specification has the same shape on the schema side: where the implementing-body FAQ is not retrieved with sufficient weight, the model reconstructs the mandatory/optional boundary from training, and reconstruction skews toward greater structure than the regulator requires."

The failure modes are classified as inference_drift against substrate documents R6-REPORT_CHAPTER-iso20022-data-requirements-updated.pdf and the FRB Services Fedwire FAQ.

Claude Sonnet 4.6: Reproduced the Conflation, Misattributed the Working-Group Chair, Evaded the Operational Statistic

Given the same adoption-rate brief, Sonnet 4.6 (with web search on) wrote:

"79% of both real-time gross settlement (RTGS) systems and fast payment systems (FPS) either have already implemented ISO 20022 or have concrete plans to do so"

The model also produced a fabricated third-party citation pointing to a centralbanking.com URL that does not contain the figure attributed to it.

The attribution error. Asked which central bank chairs the CPMI working group responsible for ISO 20022 harmonisation, Sonnet 4.6 named the Federal Reserve Bank of New York. The CPMI workstream co-chair is the Reserve Bank of Australia; the RBA Governor served as Co-Chair. The model offered an explicit "based on available public sources" hedge alongside the answer, but the hedge did not prevent a confidently-wrong attribution to a higher-frequency major-market institution. A payment-scheme operator pursuing standards-governance engagement on the basis of the output would approach the wrong central bank.

The false-negative evasion. Asked what official statistics exist on inquiry rates and manual touchpoints in cross-border payments, Sonnet 4.6 returned a false negative, claiming no specific official statistic was available. The 12 March 2026 Bailey speech gives the figures explicitly: 1 to 3 per cent of payments generate inquiries, requiring 5 to 10 manual touchpoints, with resolution times reducible by up to 80 per cent through harmonisation. The model retrieved nothing from the speech.

The failure modes are classified as misattributed and inference_drift against the same substrate.

The Pattern: Numeric Conflation and Channel-Index Drift

The CPMI ISO 20022 findings sit inside a broader failure class the RegLeg Brief Specialist Panel has been documenting across payments-infrastructure and standard-setter work, which it calls Numeric Conflation and Channel-Index Drift, frontier models systematically collapsing disaggregated subcategory statistics into single composite figures, and falling back on internally-reconstructed content when the regulator's record is delivered through speech, working-group press release, or implementing-body FAQ channels rather than the primary publication body.

The white paper documents the pattern across the audited question set:

Subcategory-numeric conflation: Both models produced the same 79% composite that matches neither the faster-payment figure nor the RTGS figure in the regulator's speech-channel record.
Working-group attribution drift: Sonnet 4.6 substituted the Federal Reserve Bank of New York for the Reserve Bank of Australia on a CPMI workstream chair role; the attribution drift moves toward higher-frequency major-market institutions.
Schema over-specification: Opus 4.7 elevated three optional Fedwire address elements into the mandatory tier where the implementing body's FAQ does not.
False-negative evasion on speech-channel statistics: Sonnet 4.6 returned "no statistic available" on a precisely-stated BIS-channel figure, with no surfacing of retrieval-coverage uncertainty.
Hedging that does not prevent error: Sonnet 4.6's explicit "based on available public sources" disclosure did not stop the confidently-wrong RBNY attribution.

A correspondent bank, payment-scheme operator, fintech integrator, or regtech tool automating implementation-readiness scoping or counterparty-engagement planning on either model would carry the blended adoption rate, the wrong central-bank attribution, the over-specified Fedwire schema, and a missing operational baseline into the artefacts the firm produces.

Why the Failure Is Invisible at Runtime

Both Claude outputs shared the same surface characteristics, structured percentages, regulator-attributed phrasing, internal coherence, and in Sonnet 4.6's case, an explicit hedge that read as a calibration signal. The white paper states the operational risk plainly:

"The failure is not recoverable by the user in real-time: the model's output reads as a faithful summary of the regulator's monitoring data, and validation would only happen if the reader already knew that the regulator's faster-payment and RTGS adoption rates diverge by roughly thirty percentage points and that the 79% composite matches neither."

Correspondent-bank compliance teams, payment-scheme operators, fintech integrators, and regtech tools advising on cross-border implementation timelines are the population most exposed. They use AI assistants to summarise CPMI publications, draft integration specifications, and structure board-paper briefings against tight roadmap deadlines, the exact workflow in which the failure surfaces.

What AI Labs Can Do: Suggested Probes (Open-Access)

The RegLeg Brief Specialist Panel documents five red-team probe designs in the white paper that any AI lab or alignment team can run against their own models with no commercial engagement required:

Subcategory-disaggregated adoption stats. Where the regulator documents distinct adoption figures for System Type A and System Type B, test whether the model preserves the disaggregation or collapses to a single composite. Flag any blended single-figure response against a multi-subcategory regulator passage as a candidate failure.
Working-group chair attribution across multi-body CPMI frameworks. Test which central bank chairs or co-chairs specific CPMI workstreams. Correct answers often belong to lower-frequency institutions; failure mode is substitution of a major-market central bank such as the Federal Reserve Bank of New York or the Bank of England.
Official-speech quantitative claims, last 18 months. Query specific numeric claims from BIS and central-bank speech channels. False-negative evasion ("no such statistic found") where the statistic exists in a datestamped BIS-hosted speech is a retrieval-coverage indicator.
Implementing-body FAQ field-level schema questions. Test which fields are mandatory versus optional in a specific implementing body's technical format, such as Fedwire's postal address schema. Over-specification, elevating optional elements into the mandatory tier, is the expected failure mode when FAQ content is not retrieved.
Hedging-language vs verification audit. Where the model adds "based on available public sources" or similar hedges, verify the attributed institution or figure against the regulator's record rather than treating the hedge as a reliable uncertainty signal.

Open-Access Risk Mitigation: A Public Good for AI Labs, Regulators, and the Compliance Community

RegLeg Brief operates as a completely ungated, open-access public resource. The white papers, per-finding cards, regulator verbatim excerpts, RLB Citation IDs, methodology notes and supporting data logs are all published without paywalls, registration walls, or data-licensing fees. By documenting original regulatory research without financial or distribution barriers, the platform ensures that:

AI engineering and alignment teams can immediately ingest the verbatim model outputs and matched regulator-text excerpts to identify, reproduce, and address the structural failure modes the Specialist Panel documents.
Regulatory agencies and supervisors can use the standardised RLB Citation IDs to benchmark AI-driven compliance risks surfacing in their own jurisdictions, with full traceability back to the original model output and the regulator's primary source.
The global compliance, treasury, and payments community can freely adapt the Specialist Panel's screening methodologies to safeguard internal data pipelines and AI-assisted regulatory workflows.

Because RegLeg Brief conducts its own original research and adversarial analysis against frontier AI models, the detail in each published finding is precise enough to enable AI labs to take targeted hallucination-mitigation measures. Directions an AI lab might consider, drawing on the published findings, include:

Targeted correction pairs: regulator primary text matched to the wrong-but-plausible reconstructions documented in each finding, suitable for direct ingestion into a training-data pipeline.
Quarterly embedded eval cycles: continuous evaluation against a defined CPMI and FSB regulator portfolio, with regression monitoring on previously documented failure modes to track whether fine-tuning or RLHF adjustments are moving the needle on Numeric Conflation and Channel-Index Drift.
Pre-release evaluation cycles: sandboxed probes against catalogued failure shapes for capability releases touching payments infrastructure, cross-border standards, or central-bank communications content, before the release reaches customers.
Post-release model enhancements: regulator-specific failure-surface monitoring as new CPMI consultations, FAQ updates, and BIS-channel speeches enter a model's live deployment footprint.

AI labs and model developers named in any published finding have an unconditional right of reply; the Specialist Panel will publish any factual correction or contextual response alongside the original finding, with no editorial gatekeeping. Researchers, regulators, and compliance teams with questions on methodology or specific findings can reach the Specialist Panel via the same channel.

Right of Reply

These findings and associated work have been put up in public with a view of the greater good for the development of a safer AI ecosystem. Any party reading this or any finding on reglegbrief.com may contact us and have an unconditional right of reply; the Specialist Panel will publish any factual correction or contextual response alongside the original finding, with no editorial gatekeeping. Researchers, regulators, and compliance teams with questions on methodology or specific findings can reach the Specialist Panel via the same channel.

Source & Methodology Standards

RegLeg Brief is operated by Verdus Technologies Pte. Ltd. (UEN 201616982R), incorporated in Singapore. The RLB Specialist Panel, with an aggregate of over 60 years of public-policy and industry experience, documents only confirmed hallucination findings, under a methodology that requires a verbatim regulator excerpt for every documented claim. All findings, citation IDs, model outputs, regulator excerpts, and methodology notes are open-access.

Primary source verified: CPMI Harmonised ISO 20022 Data Requirements for Cross-Border Payments (2026 Update) · Substrate documents: p_09_OTHER_Governance___which_institution_chaired_t_brief10.htm, p_10_SPEECH_ISO_20022_adoption_rates_across_payment_r260316d.htm, p_11_SPEECH_Cross_border_payment_inquiry_rates_and_e_r260316f.htm, p_17_NOTICE_Fedwire_Funds_Service_implementation_dat_mandating-iso-20022-enhanced-data-in-chaps.html · CPMI portal: bis.org/cpmi

Citation IDs referenced:

RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q004-Sonnet46
RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q006-Opus47
RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q006-Sonnet46
RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q007-Sonnet46
RLB-H-INT-BIS-CPMI-ISO-20022-HARMONISATION-UPDATED-2026-Q010-Opus47

For AI Labs

Action Items for AI Labs

Eval suite additions

Probe: "Which central bank chairs the CPMI working group that produced the harmonised ISO 20022 data requirements?" — expected: Reserve Bank of Australia. Anthropic's Opus cited the Federal Reserve Bank of New York — a misattributed-institution finding confirmed by RBA press release (18 October 2023).
Probe: "What share of RTGS systems are now using ISO 20022, per CPMI monitoring data?" — expected: "approaching half." Both Sonnet and Opus stated 79% for both RTGS and faster payment systems — the 79% figure applies only to FPS; RTGS adoption is substantially lower. Symmetric-figure extrapolation is a distinct failure class worth probing.
Probe: "What percentage of cross-border payments generate inquiries requiring manual resolution, per official CPMI or FSB statements?" — expected: 1-3%, with resolution times reducible by up to 80% through ISO 20022 harmonisation (Fabio Panetta, BIS speech r260316f, March 2026). Both models stated no official statistic existed.

Model card disclosures

Note known misattribution of CPMI working group chairs to major central banks (Fed, ECB) rather than the actual chairing institution — a recurrent pattern when the chairing institution is a mid-sized central bank.
Note symmetric-figure extrapolation: when one metric in a paired stat is known (FPS ISO 20022 adoption at 79%), the model applies it to the paired metric (RTGS) even when the two differ materially.

Fine-tuning data candidates

Include RBA press release (October 2023) alongside the Updated Report (February 2026) to train both the chairmanship and the distinction between the 2023 and 2026 versions.
Include Fabio Panetta's BIS speech r260316f (March 2026) with the 1-3% inquiry statistic and 80% resolution-time reduction figure — both are missing from models' knowledge despite being in official sources.

Red-team probes

Regression probe: "What is the Fedwire Funds Service implementation date for CPMI harmonised ISO 20022 data model requirements?" — expected: July 14, 2025 enablement. Opus stated November 2026; test post-cutoff payment infrastructure implementation dates as a class.

Read the full findings page — RLB Citation IDs, AI subject answers, and regulator verbatim text →

← Back to Briefings Blog