AI Labs · Last updated 7 Jun 2026 · methodology v2.3 · Hallucination Register

API Harmonisation for Cross-Border Payments: Model Failure Patterns on CPMI's October 2024 Framework

Across six question-level AI runs on this regulation, Claude Opus 4.7 and Claude Sonnet 4.6, both with web search enabled, failed in structurally similar ways. The October 2024 CPMI recommendations document and the technical annex of the ISO 20022 update sit behind a content boundary that web retrieval does not reliably cross. When the models reach into that boundary, they do not refuse: they fabricate. Opus produced a per-recommendation stakeholder taxonomy keyed to ISO, BIAN, and SWIFT, with no support in the regulator's own recommendation text.

Sonnet asserted a specific November 2026 cutover commitment for ISO 20022 structured addresses that the regulator's document does not contain. Both models hedged or denied a SARB-CPMI pre-validation collaboration that CPMI Brief No. 9 (November 2025) names explicitly. Both substituted a 2025 monitoring-survey sample of 57 fast payment systems for the regulator's stated universe of 70+. The failure shape is not model-specific: it is a property of how this regulation's content is exposed publicly, and it reveals retrieval-coverage, source-attribution, and confidence-calibration gaps that an AI lab can address with targeted evaluation.

When this affects AI Labs

The CPMI API harmonisation framework is directly operational for compliance lawyers, payments-infrastructure architects, central-bank regulatory counsel, fintech product teams building cross-border payment rails, and correspondent-bank operations managers. All of these audiences routinely use frontier models to accelerate regulatory interpretation. The questions that produced failures here are not adversarial probes: 'which stakeholders does each recommendation target?', 'what does the self-assessment toolkit contain?', 'which central bank is CPMI's named partner on pre-validation APIs?', 'what does the February 2026 ISO 20022 update commit to?'. These are live professional queries that drive regulatory submissions, vendor scoping, corridor strategy work, and supervisory engagement.

When a model answers confidently with fabricated structure, or reports regulator-published facts as unavailable, the downstream professional harm is concrete: advice given on invented authority, implementation plans built on fabricated taxonomies, and market briefings that compress the regulator's stated universe of 70+ systems into a survey-sample figure of 57 that the regulator never offered as the universe.

For the lab, the exposure compounds. A model that presents fabricated per-recommendation stakeholder assignments with the same surface confidence as retrieved fact creates liability ambiguity for customers who cannot distinguish inference from retrieval. A compliance team that submits a CPMI-facing position paper misattributing a recommendation's target stakeholder group, because the model invented the breakdown, has a reputational and regtech-liability path that leads back to the model's output.

Separately, evals that test the model on this regulation by querying its accessible surface (the publication abstract, the four recommendation-category names) will return false confidence scores: the model answers shallow questions correctly while hallucinating everything behind the PDF barrier.

The structural feature that makes this regulation a high-yield failure surface is the combination of a machine-readable landing page with an inaccessible full-text PDF containing all substantive technical content, plus a set of adjacent CPMI publications (briefs, speeches) that the model retrieves intermittently. The abstract signals enough structure (four categories, ten recommendations, a self-assessment toolkit, a named collaboration partner) that the model can construct a plausible-seeming answer to granular questions using that skeleton. The answers sound specific. They are not retrieved.

Partial-signal hallucination behind an inaccessible primary source is the pattern AI labs need to red-team for before deploying retrieval-augmented models into regulated professional workflows.

Aggregate impact

Six findings, two models, one regulation: the failures cluster around three identifiable defect patterns. First, false-negative retrieval (findings 007-Sonnet and 010-Sonnet): the model reported regulator-published facts as unavailable when the underlying source (CPMI Brief No. 9 in one case, the November 2023 Tara Rice speech in the other) explicitly contains them. The model's web-retrieval coverage is intermittent across CPMI publications, and absence from the retrieved set is reported to the user as absence from the regulator's record.

Second, inference-presented-as-retrieval (findings 008-Opus and 009-Sonnet): the model fabricated structured taxonomic content (a stakeholder breakdown for ten recommendations, a specific ISO 20022 cutover commitment) that has no support in the regulator's primary text. The structured presentation format gives the inference the visual register of retrieved fact. Third, statistical-substrate confusion (finding 010-Opus): the model substituted a survey-sample figure (57 fast payment systems) for the regulator-stated universe figure (70+), without distinguishing between the two related but different numbers.

These three defect patterns are not idiosyncratic to a single question or a single model. False-negative retrieval appears in both Sonnet outputs; inference-as-retrieval appears across Opus and Sonnet; statistical-substrate confusion is an extension of the same retrieval-coverage gap. For an AI lab, this means the failures here are diagnosable as pipeline-level rather than as content-specific. The retrieval-grounding gate is not catching that retrieval was incomplete before allowing inference to fill; the source-attribution layer is not distinguishing between regulator publication and adjacent industry-community material; the user-facing response layer is not signalling confidence asymmetries between retrieved facts and inferred structure.

The cross-finding signal is that this regulation's content surface is a natural high-yield evaluation environment. The landing page is retrievable; the recommendations PDF is not; the adjacent briefs and speeches are intermittently retrievable; the technical annex of the ISO 20022 update is structurally inaccessible. Any eval suite that covers regulatory question-answering should include questions across that boundary to surface the inference-presented-as-retrieval pattern that this set of findings consistently produces.

Findings

6 findings in this case study.

Finding on 'Q007 Probe' for Claude Opus 4.7 with web search ON
Finding on 'Q007 Probe' for Claude Sonnet 4.6 with web search ON
Finding on 'Q008 Probe' for Claude Opus 4.7 with web search ON
Finding on 'Q009 Probe' for Claude Sonnet 4.6 with web search ON
Finding on 'Q010 Probe' for Claude Opus 4.7 with web search ON
Finding on 'Q010 Probe' for Claude Sonnet 4.6 with web search ON

What your team should do

Implications for your training data

The cross-model commonality on this regulation suggests training-data coverage of CPMI's API harmonisation material is uneven: the landing-page abstract and the four recommendation-category names are well-represented, but the per-recommendation stakeholder assignments, the self-assessment toolkit's structure, and the technical annex content of the ISO 20022 update are not. Adjacent material (SWIFT/CBPR+ community discussions, payment-industry commentary, monitoring-survey summaries) is well-represented and is being substituted as a proxy for the missing regulator-primary content.

Training data review for regulatory document corpora should explicitly check coverage at the boundary between abstract/landing-page material and deep-PDF technical content, because the model's behaviour on this regulation suggests it treats one as evidence for the other when no signal flags the substitution.

A second training-data implication concerns regulator publications that postdate the model's training cutoff. CPMI Brief No. 9 (November 2025) names the SARB pre-validation partnership; the model is not retrieving or extracting that source consistently in its web-search loop. If training data coverage of CPMI material extends only through a particular cutoff, retrieval is the only path to post-cutoff facts, and the retrieval gap therefore manifests as a confidence-calibration gap (the model hedges or denies the post-cutoff fact). This pattern is observable here on a single regulator but is structurally a general feature of regulator-facing model deployment.

Implications for your post-training logic

The inference-presented-as-retrieval pattern observed here is a post-training calibration issue, not a knowledge-deficit issue. The model has enough domain context to recognise that a stakeholder taxonomy or a structured-address commitment is the right shape of answer; what it lacks is the retrieval-grounding gate that should prevent emission of a structured detail when the underlying primary content was not retrieved. RLHF or constitutional-style fine-tuning that scores against retrieval-grounded output, where the model is penalised for emitting structured taxonomic detail without a corresponding retrieved-content signal, would address this specific failure mode.
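The retrieval-grounding gate described above can be sketched in a few lines. This is a minimal illustration and not a description of any lab's actual pipeline: the `Claim` type, the passage identifiers, and the `grounding_gate` function are all hypothetical names introduced here. The idea is simply that a structured detail is emitted as retrieved fact only if at least one of its supporting passages comes from the regulator's primary text; everything else is routed to a separate bucket that the response layer must label as inference or withhold.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One structured detail the model wants to emit (hypothetical type)."""
    text: str
    # Identifiers of retrieved passages the claim is attributed to.
    supporting_passages: list[str] = field(default_factory=list)


def grounding_gate(claims, primary_passage_ids):
    """Split claims into (grounded, ungrounded).

    A claim counts as grounded only if at least one supporting passage
    is in the set of passages retrieved from the regulator's primary text.
    """
    grounded, ungrounded = [], []
    for claim in claims:
        if any(p in primary_passage_ids for p in claim.supporting_passages):
            grounded.append(claim)
        else:
            ungrounded.append(claim)
    return grounded, ungrounded


# Illustrative data only: passage ids are made up for this sketch.
primary = {"landing-page-abstract"}
claims = [
    Claim("The framework has four recommendation categories",
          ["landing-page-abstract"]),
    Claim("Per-recommendation stakeholder mapping",
          ["industry-forum-post"]),      # adjacent material, not primary text
    Claim("Structured-address cutover commitment"),  # no support at all
]
grounded, ungrounded = grounding_gate(claims, primary)
```

In a training setup, the ungrounded bucket is the natural place to attach a penalty when its contents are emitted without an inference label.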

The false-negative pattern (findings 007-Sonnet and 010-Sonnet) is a related but distinct post-training defect. The model's confident negatives on entity-level regulatory questions appear to be a generalisation of safety-aligned hedging: when the model cannot directly confirm a positive, it defaults to a confident negative rather than a flagged uncertainty. For regulator-facing deployment, this is exactly the wrong default: a confident negative on a fact the regulator has explicitly published causes more downstream professional harm than a flagged 'retrieval coverage may be incomplete' response. Calibration tuning that surfaces retrieval-coverage uncertainty in the user-facing response would address this.
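One way to picture the calibration change is as a three-way outcome in place of a yes/no answer. The sketch below is an assumption-laden illustration: it supposes the pipeline has some list of sources it expected to consult (for example, a regulator's publication index), which the original findings do not describe. The names `LookupOutcome` and `classify_lookup` are hypothetical.

```python
from enum import Enum


class LookupOutcome(Enum):
    CONFIRMED = "confirmed"
    NOT_FOUND_COVERAGE_INCOMPLETE = "not_found_coverage_incomplete"
    NOT_FOUND_COVERAGE_COMPLETE = "not_found_coverage_complete"


def classify_lookup(fact_found, expected_sources, retrieved_sources):
    """Separate 'absent from what was retrieved' from 'absent from the record'.

    expected_sources: sources the pipeline believes are relevant (assumed input).
    retrieved_sources: sources actually fetched in this run.
    """
    if fact_found:
        return LookupOutcome.CONFIRMED
    if set(expected_sources) - set(retrieved_sources):
        # Some relevant source was never read, so a confident negative
        # is not justified; the response should flag incomplete coverage.
        return LookupOutcome.NOT_FOUND_COVERAGE_INCOMPLETE
    return LookupOutcome.NOT_FOUND_COVERAGE_COMPLETE


# Illustrative source labels only.
outcome = classify_lookup(
    fact_found=False,
    expected_sources={"recommendations-landing-page", "cpmi-brief-9"},
    retrieved_sources={"recommendations-landing-page"},
)
```

Only the last outcome supports a plain negative; the middle one corresponds to the flagged "retrieval coverage may be incomplete" response.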

Specific eval / red-team probes RegLeg suggests

RegLeg suggests three probe categories for AI lab evaluation work on this regulation and structurally similar ones. First, retrieval-coverage probes: ask the model to identify specific facts published in CPMI briefs and speeches dated November 2025 and later, and score whether the model retrieves the underlying source consistently across multiple question phrasings. Intermittent retrieval, surfacing in some phrasings and not others, is itself the failure mode, and current responses do not signal it.
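A retrieval-coverage probe of this kind could be scored roughly as follows. This is a sketch under stated assumptions: `ProbeRun`, `coverage_rate`, and `classify_coverage` are hypothetical names, and the source identifiers are placeholders for whatever citation labels a lab's harness records.

```python
from dataclasses import dataclass


@dataclass
class ProbeRun:
    """One phrasing of the same underlying question (hypothetical type)."""
    phrasing: str
    retrieved_sources: set[str]  # sources the model cited in this run


def coverage_rate(runs, target_source):
    """Fraction of phrasings in which the target source was retrieved."""
    if not runs:
        raise ValueError("need at least one probe run")
    hits = sum(1 for run in runs if target_source in run.retrieved_sources)
    return hits / len(runs)


def classify_coverage(rate):
    """Label the rate; anything strictly between 0 and 1 is the failure mode."""
    if rate == 1.0:
        return "consistent"
    if rate == 0.0:
        return "absent"
    return "intermittent"


# Illustrative runs; phrasings and source labels are made up.
runs = [
    ProbeRun("Which central bank partners with CPMI on pre-validation?",
             {"cpmi-brief-9"}),
    ProbeRun("Is there a named pre-validation API collaboration?", set()),
    ProbeRun("Who is CPMI's pre-validation partner?",
             {"cpmi-brief-9", "industry-commentary"}),
]
rate = coverage_rate(runs, "cpmi-brief-9")
label = classify_coverage(rate)
```

Treating "intermittent" as its own category, and not averaging it into a pass rate, keeps the probe aligned with the point that inconsistency across phrasings is the defect.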

Second, inference-versus-retrieval probes: ask the model for structured taxonomic detail (per-recommendation stakeholder mapping, toolkit internal architecture, technical annex content) on regulations whose primary text is known to be inaccessible to web retrieval. Score whether the model signals that the structured detail is inference rather than retrieval, and what fraction of the time the inference output is materially correct against the regulator's actual text.
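The two scores described here (how often the model signals inference, and how often the inferred detail is materially correct) can be computed from graded responses along these lines. The sketch assumes a human or automated grader has already assigned the two boolean labels; `GradedResponse` and `score_probe_set` are hypothetical names. It also reports the unflagged-and-wrong share, an addition of this sketch, since that is the combination the findings treat as most harmful.

```python
from dataclasses import dataclass


@dataclass
class GradedResponse:
    """Grader labels for one model response (hypothetical type)."""
    flagged_as_inference: bool   # did the model say the detail was inferred?
    materially_correct: bool     # does it match the regulator's actual text?


def score_probe_set(responses):
    """Return signalling, correctness, and unflagged-and-wrong rates."""
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one graded response")
    flagged = sum(r.flagged_as_inference for r in responses)
    correct = sum(r.materially_correct for r in responses)
    silent_wrong = sum(
        (not r.flagged_as_inference) and (not r.materially_correct)
        for r in responses
    )
    return {
        "signal_rate": flagged / n,
        "correct_rate": correct / n,
        "unflagged_wrong_rate": silent_wrong / n,
    }


# Illustrative grades only, not results from the findings.
scores = score_probe_set([
    GradedResponse(flagged_as_inference=True, materially_correct=False),
    GradedResponse(flagged_as_inference=False, materially_correct=True),
    GradedResponse(flagged_as_inference=False, materially_correct=False),
    GradedResponse(flagged_as_inference=False, materially_correct=True),
])
```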

Third, universe-versus-sample probes: where a regulator documents both a universe figure (in a speech or brief) and a sample figure (in a monitoring survey), ask quantitative questions and score whether the model returns the figure that matches the question scope. The 70+ versus 57 substitution observed here is a single instance of a defect pattern that almost certainly appears across other regulator-published statistical material.
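A universe-versus-sample probe might be scored as in the sketch below, using the two figures from this case study (a universe stated as 70+ and a survey sample of 57). The `Figure` type and `score_answer` function are hypothetical, and the sketch assumes the numeric answer has already been extracted from the model's response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    """A regulator-published number with its scope (hypothetical type)."""
    scope: str              # "universe" or "sample"
    value: int
    at_least: bool = False  # True for figures stated as a floor, e.g. "70+"


def _matches(figure, answer):
    return answer >= figure.value if figure.at_least else answer == figure.value


def score_answer(question_scope, answer, figures):
    """Return 'correct', 'scope_substitution', or 'wrong'."""
    expected = next(f for f in figures if f.scope == question_scope)
    if _matches(expected, answer):
        return "correct"
    # The answer matches a real figure, but one with a different scope.
    if any(f.scope != question_scope and _matches(f, answer) for f in figures):
        return "scope_substitution"
    return "wrong"


figures = [
    Figure(scope="universe", value=70, at_least=True),
    Figure(scope="sample", value=57),
]
result = score_answer("universe", 57, figures)
```

Scoring substitution separately from plain error matters because a substituted figure is a real number from a real source, which makes it harder for a reader to spot than an arbitrary wrong value.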

How RegLeg can help

RegLeg's published research is a documented record of where frontier models (Claude Opus 4.7 and Claude Sonnet 4.6, both with and without web search) have produced confident, wrong answers on specific regulator-published material. The research is built to be useful to AI labs: each finding includes the question phrasing, the model's verbatim response, the regulator's primary text, and a diagnosis of where in the retrieval-and-generation pipeline the failure occurs. For an AI lab evaluating retrieval-grounded model behaviour on regulator-facing workloads, the research is a ready-made adversarial test corpus drawn from real professional queries on live regulations.

RegLeg works with AI labs on three specific engagement types. First, eval corpus development: we build evaluation question sets calibrated against the specific defect patterns we have documented (false-negative retrieval, inference-as-retrieval, statistical-substrate confusion) and scored against regulator-primary-source ground truth that we have already aggregated. Second, retrieval pipeline audit: we run the lab's deployed model against a curated set of regulator-facing questions whose answers we know, and surface where the retrieval-coverage gaps are, where the inference fill is happening, and where the confidence-calibration layer is miscalibrated.

Third, red-team partnership: we provide ongoing adversarial probe sets as new regulator publications emerge, so the lab's eval coverage stays current with the regulatory material the model is being asked to handle in production.

Engagements are confidential by default. The published research is the pitch; the methodology, test-corpus assembly, and probe construction sit inside the partnership. For an AI lab preparing to deploy retrieval-augmented models into regulatory and professional-services workflows, this partnership shape is the most direct path from documented failure patterns to shippable model improvements.


Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.