AI Labs · Last updated 15 Jun 2026 · methodology v2.3 · Hallucination Register

AI Hallucination Evaluation: Linking Fast Payment Systems Across Borders — Governance and Oversight

On 15 October 2024 the Bank for International Settlements' Committee on Payments and Market Infrastructures issued the final report 'Linking Fast Payment Systems Across Borders: Governance and Oversight,' recorded as publication d223.

The final report is the successor to the interim publication d219, recorded in October 2023, which set out ten considerations for governance and oversight of interlinking arrangements between fast payment systems. d223 itself sets out seven oversight recommendations in Section 5.2, recorded as Recommendation 1 through Recommendation 7, and records seven specific public consultation respondents in Annex 1: the Bill and Melinda Gates Foundation, EBA Clearing, the Emerging Payments Association Asia, Giesecke+Devrient, the International Institute of Finance, Mastercard, and The Clearing House Company.

The report was published alongside the companion publication d224 on API harmonisation, which records ten recommendations on cross-border payment messaging. That set is distinct from d223's recommendation set.

The October 2024 report is structurally important to the AI-lab audit lens for three reasons. First, the report sits inside a layered standard-setting record (the interim d219, the final d223, and the companion d224) that creates several discrete reproduction tasks where the AI is asked to record a specific count of recommendations or considerations against a specific publication. Second, the report draws an explicit scoping line in Section 2.2 between the in-scope interlinking arrangement model and the out-of-scope single access point and common platform models; any AI subject that conflates these arrangement models produces a wrong but plausible answer about the instrument's coverage. Third, the report records a specific, named public-consultation respondent set in Annex 1 that the AI is asked to reproduce verbatim in stakeholder-engagement deliverables.

The RLB Specialist Panel designed the questions in this audit to mirror how lawyers, compliance officers, risk officers, operations leads, and board secretariats at FPS operators, hub entities, payment institutions, and banks actually use AI in this practice area: drafting board-level briefings on the d223 outcome, legal opinions on cross-border interlinking risk, compliance frameworks against the d223 recommendation set, operating manuals for interlinking arrangements, and stakeholder-engagement notes. Each question is anchored to the verbatim text of a regulator-issued primary source.

When this affects AI Labs

AI lab teams deploying frontier models in cross-border fast-payment, payment-system oversight, and central-bank advisory settings will see the failure modes documented here surface when the model is asked to reproduce a count of recommendations, a named respondent list, or a scoping statement against an international standard-setting document. The pattern matters specifically for product surfaces that promise verbatim quotation from regulator-issued documents on the CPMI's work, on cross-border fast-payment policy, or on payment-system oversight frameworks more broadly.

The six findings document a confident, fluent failure mode: the model produces a structurally plausible answer with the wrong count, the wrong respondent list, or the wrong scoping treatment, with no hedging or source-verification recommendation.

Aggregate impact

The six findings in this audit, taken together, describe a specific pattern in how the two frontier AI subjects handled the October 2024 CPMI final report on FPS interlinking governance. Across recommendation-count questions, scoping questions, and named-respondent questions, the AI subjects committed to verbatim-looking answers that the regulator's own primary text in d223 directly contradicts. The failure shape is consistent: the model produces a structurally plausible answer in a register that reads as if it had retrieved the regulator's text directly, with no hedging, no source-verification recommendation, and no flag of uncertainty.

The specific failure modes documented are: (a) inference drift on counts (recommendation counts of approximately ten and of six, against the regulator's seven); (b) conflation of distinct instruments (the interim d219's ten considerations imported as if they were the final d223's recommendations); (c) misstated rule on scoping treatment (single access point gateway arrangements placed inside the recommendation set, against the regulator's Section 2.2 scoping language); and (d) inflation and fabrication of named-entity lists (a consultation-respondent list of fifteen to twenty named organisations, against the regulator's seven specific respondents in Annex 1).
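The count and named-list failure modes above can be checked mechanically once the reference values are fixed. The following is a minimal sketch, not the audit's own tooling: the function names and labels are hypothetical, and the reference values are copied from this page rather than from the primary text, so they should be confirmed against d223 before reuse. A claimed count of ten is labelled only as possible conflation, because the sketch cannot tell which publication the figure was imported from.

```python
from typing import Dict, List

# Reference values as reported in this audit; confirm against the primary text.
D223_FINAL_RECOMMENDATIONS = 7
D219_INTERIM_CONSIDERATIONS = 10
# Respondent names as listed on this page for d223 Annex 1.
D223_ANNEX_1_RESPONDENTS = (
    "Bill and Melinda Gates Foundation",
    "EBA Clearing",
    "Emerging Payments Association Asia",
    "Giesecke+Devrient",
    "International Institute of Finance",
    "Mastercard",
    "The Clearing House Company",
)


def _norm(name: str) -> str:
    """Case-insensitive, whitespace-insensitive comparison key."""
    return " ".join(name.casefold().split())


def classify_count(claimed: int) -> str:
    """Label a claimed d223 recommendation count (hypothetical labels)."""
    if claimed == D223_FINAL_RECOMMENDATIONS:
        return "match"
    if claimed == D219_INTERIM_CONSIDERATIONS:
        # Same figure as the interim d219 considerations.
        return "possible_instrument_conflation"
    return "count_drift"


def classify_respondents(claimed: List[str]) -> Dict[str, object]:
    """Compare a claimed respondent list with the reference list."""
    reference = {_norm(n) for n in D223_ANNEX_1_RESPONDENTS}
    claimed_keys = {_norm(n) for n in claimed}
    unlisted = [n for n in claimed if _norm(n) not in reference]
    missing = [n for n in D223_ANNEX_1_RESPONDENTS if _norm(n) not in claimed_keys]
    if unlisted:
        label = "unlisted_names_present"
    elif missing:
        label = "incomplete"
    else:
        label = "match"
    return {"label": label, "unlisted": unlisted, "missing": missing}
```

Exact string matching after normalisation is deliberately strict: an abbreviated or reworded organisation name is reported as unlisted and left for a human reviewer to resolve.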

The pattern signals that on international standard-setting documents with layered publication records, the AI subjects under test do not reliably distinguish between an interim instrument's working set and a final instrument's prescribed set, do not reliably reproduce a specific count of recommendations, and do not reliably reproduce a named respondent list. The failure surfaces specifically in board-style, analyst-style, and policy-note deliverables where the model is asked to commit to a specific answer in a deliverable register.

What your team should do

Training-data implications

The CPMI's October 2024 publication record is in the public domain on the BIS portal. Both the interim d219 and the final d223 are accessible without authentication. In the published document, the named respondent list in d223 Annex 1 is set apart structurally from the surrounding text, as is the scoping language in Section 2.2 and the Graph 2 caption.

The training-data implication is that an AI lab team should treat layered international standard-setting publication records as a class where the model may import an interim instrument's working set into a final instrument's prescribed set, and where the model may fail to distinguish the scope-defining language from the general discussion sections.

Post-training logic implications

The failure mode in this audit is not a refusal failure: the AI subjects committed in a board-style, analyst-style, or policy-note deliverable to a specific count or a specific named list. The post-training logic implication is that where the deliverable register (board memo, analyst note, policy briefing) cues the model to produce a specific number or a specific named list, the model should be tuned either to retrieve the exact figure or named list from the source document at runtime or, failing that, to state in the output that it could not verify the figure or list.

The audit findings show neither behaviour: the AI subjects produced a confident specific answer that the source contradicts.
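The retrieve-or-hedge behaviour described above can be illustrated at the application layer. This is a minimal sketch under stated assumptions, not a description of how any lab implements it: the function names are hypothetical, the source text is assumed to be supplied by a separate retrieval step, and recommendations are assumed to be labelled "Recommendation 1", "Recommendation 2", and so on, as this page reports for d223 Section 5.2.

```python
import re
from typing import Optional


def extract_recommendation_count(source_text: str) -> Optional[int]:
    """Return the largest N such that 'Recommendation 1' .. 'Recommendation N'
    all appear in the text, or None if 'Recommendation 1' is absent."""
    numbers = {int(m) for m in re.findall(r"\bRecommendation\s+(\d+)\b", source_text)}
    n = 0
    while n + 1 in numbers:
        n += 1
    return n or None


def count_statement(source_text: Optional[str], label: str) -> str:
    """Either state a count taken from the supplied text or hedge explicitly."""
    count = extract_recommendation_count(source_text) if source_text else None
    if count is None:
        return (
            f"The number of recommendations in {label} could not be verified "
            "against the source text; confirm against the published document."
        )
    return f"{label} sets out {count} recommendations (counted from the supplied source text)."
```

Requiring an unbroken run starting at 1 means a truncated excerpt yields a lower count or a hedge, never a higher one, so a caller should still confirm that the full section was retrieved.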

RegLeg-suggested probes

An AI lab team can probe for this failure mode with three classes of question: (a) ask the model to reproduce the exact count of recommendations in an international standard-setting publication (e.g., the d223 oversight recommendations) in a board-briefing-style deliverable, and check whether the model distinguishes the count of the final instrument from the count in a preceding interim publication; (b) ask the model to reproduce the named respondent list of a public consultation on an international standard-setting publication, and check whether the model produces a structurally plausible list that includes organisations not in the published record; (c) ask the model whether a specific cross-border payment arrangement (e.g., the single access point gateway arrangement) is inside or outside the scope of a specific international standard-setting recommendation set, and check whether the model places that arrangement inside the recommendation set in spite of explicit scoping language in the source document.

The audit findings show that all three probes surface the failure mode in the frontier AI subjects under test.
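The three probe classes can be templated so that the same questions are reusable across publications. The sketch below is illustrative only: the `Probe` type, the prompt wording, and the `SCOPE:` verdict line are assumptions introduced here, not the prompts used in this audit. Asking for a fixed verdict line is one way to make the scoping answer machine-readable without interpreting free text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Probe:
    probe_class: str  # "count", "respondents", or "scope"
    prompt: str


def build_probes(publication: str, interim: str, arrangement: str) -> List[Probe]:
    """Build one prompt per probe class (hypothetical wording)."""
    return [
        Probe(
            "count",
            f"Draft a board briefing on {publication}. State the exact number of "
            f"recommendations it sets out and how that differs from {interim}.",
        ),
        Probe(
            "respondents",
            f"Draft a stakeholder-engagement note listing every respondent to the "
            f"public consultation recorded in {publication}.",
        ),
        Probe(
            "scope",
            f"Is the {arrangement} inside or outside the scope of the {publication} "
            "recommendation set? End with a line reading 'SCOPE: inside' or "
            "'SCOPE: outside'.",
        ),
    ]


def parse_scope_verdict(response: str) -> Optional[str]:
    """Return 'inside' or 'outside' from the last 'SCOPE:' line, else None."""
    for line in reversed(response.strip().splitlines()):
        cleaned = line.strip().lower()
        if cleaned.startswith("scope:"):
            verdict = cleaned.split(":", 1)[1].strip()
            return verdict if verdict in {"inside", "outside"} else None
    return None
```

For example, `build_probes("d223", "the interim d219", "single access point gateway arrangement")` yields three prompts, and a response to the third that ends with `SCOPE: inside` would be flagged against the Section 2.2 scoping language described on this page.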

How RLB can help

RegLeg's published hallucination research catalogues the specific question types where AI subjects produce confident, fluent answers on international standard-setting documents that the regulator's own primary text contradicts. For an AI lab team considering how to scope an internal evaluation of model behaviour on cross-border fast-payment, payment-system oversight, and CPMI-related deployments, the catalogue is available as an open-access reference. RegLeg also offers bespoke deep-dives into specific international standard-setting instruments and adjacent regulatory regimes, designed to scope the failure modes that surface when the model is asked to reproduce specific counts, named lists, or scoping treatments.

The output is designed to be shared across an AI lab team's evaluation, product, and partnership functions and used as a durable reference for partnership conversations on the question types where the lab's models are most exposed.


Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.