AI Labs · published 2026-05-29 · methodology v2.1

AI Model Hallucination Patterns on CPMI-IOSCO PFMI: A RegLeg Research Report

RegLeg tested two frontier AI models against the Principles for Financial Market Infrastructures (PFMI), the global standard for payment systems, central counterparties, and securities settlement systems published jointly by the Committee on Payments and Market Infrastructures (CPMI), which is hosted by the Bank for International Settlements, and the International Organization of Securities Commissions (IOSCO). The models tested were Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search, evaluated across 22 findings spanning the core PFMI text, its associated assessment methodology, and a series of post-2022 consultative updates. The dominant pattern across both models is not outright fabrication of regulatory text but a combination of two failures: confident inference presented as verified fact, particularly on recent documents, and an inability to retrieve verbatim content from binary PDF sources. Where both models produce specific claims about documents published after their training windows, those claims are frequently fabricated with high surface plausibility and attached to real-looking URLs that resolve to unrelated or non-existent pages. The pattern matters because PFMI is a living standard: CPMI-IOSCO continue to issue consultative updates on CCP resilience, general business risk, and stablecoin applicability, which makes knowledge-cutoff failures especially consequential for users in compliance, legal, and risk functions who rely on the model for current regulatory guidance.

Why this affects AI Labs

The PFMI is the operational backbone of global financial markets. It governs every clearing house, central securities depository, and systemically important payment system across major jurisdictions. Users in compliance, legal, risk management, and regulatory affairs at banks, clearing members, and financial market infrastructure operators routinely turn to AI assistants to navigate its 24 principles, the associated assessment methodology, and the stream of CPMI-IOSCO consultative updates that supplement it. When a model produces a plausible but wrong answer about, for example, the liquid net assets floor under Principle 15 or the current status of CCP resilience guidance, a compliance officer acting on that output faces real regulatory exposure: misreporting to a supervisor, misconfiguring a capital buffer, or failing to identify that a 2025 consultation has proposed material changes to a standard the institution already thought it understood.

For an AI lab, the downstream harms extend in two directions. First, users who rely on confident model outputs to navigate PFMI compliance and act on fabricated or outdated claims face direct regulatory risk — potential enforcement actions, failed supervisory assessments, or misaligned risk frameworks. Second, when those errors are traceable to the model, they create reputational and litigation exposure for the lab in a domain where the regulatory record is precisely documented and auditable. This is not a hypothetical harm: the PFMI assessment cycle includes jurisdiction-level reviews published by national regulators and the FSB, which creates an evidentiary chain that can establish exactly what a regulator's published position was at any given date.

What makes PFMI a particularly acute failure surface is its structural profile. The core document is a lengthy, technically dense PDF that is not consistently machine-readable in web form; key quantitative thresholds (the six-month LNAFE floor, cover-2 default fund sizing, intraday credit standards) are buried in annexes and key consideration sub-paragraphs that require deep document access, not surface-level recall. The framework is also a living standard: CPMI-IOSCO have published significant consultative updates on general business risk management, CCP resilience, initial margin, and the application of PFMI to stablecoin arrangements — all of which fall at or near current model training cutoffs. That temporal boundary, combined with the document's PDF-heavy citation structure, creates a compounding failure risk that is distinct from domain ignorance and more difficult to detect through standard capability evaluations.

Aggregate impact

Claude Opus 4.7 with web search — 12 findings. The dominant pattern is confident reconstruction from training rather than live retrieval: on questions probing recent consultative documents (published late 2025 and May 2026), Opus 4.7 produced specific, plausible-sounding claims about document contents, compliance findings, and proposed changes — accompanied by real-looking BIS publication URLs — but was unable to verify those claims against the actual documents. On direct verbatim-access questions, the model correctly declined to fabricate, producing five well-calibrated refusals. One finding (Question 5) shows an outright fabricated fact where a specific document is cited with a publication label that does not correspond to its actual content.

Claude Sonnet 4.6 with web search — 10 findings. Sonnet 4.6 shows a broadly similar pattern but with a notably higher rate of explicit source-access admissions: five findings record the model acknowledging it could not retrieve or reproduce content from specific PDFs, and declining to proceed. On recent-document questions, Sonnet 4.6 produced substantively similar fabricated claims to Opus 4.7 — same documents, same types of invented compliance statistics — suggesting the underlying error is in training-data inference rather than in model-specific generation behaviour. One finding shows a subtle content drift where the model accurately identified the document in question but mischaracterised the normative weight of a governance requirement (rendering a conditional recommendation as though it were a fixed threshold).

The joint pattern tells an alignment team two things. First, both models with web search are producing fabricated citation URLs at the point of source generation, independently of whether the underlying content claim is correct or evasive; the retrieval step is not catching or correcting the hallucination. Second, the models' behaviour diverges cleanly at the verbatim-access boundary: when asked for exact text from binary PDFs they cannot read, the models decline rather than fabricate, but they do not successfully retrieve the text either. This points to a gap not in general PFMI knowledge but in the interface between web-search tooling and binary document content, a failure mode that would not surface in standard text-completion evaluations but is consistently visible in live regulatory-research tasks.

Findings

22 findings in this case study: 12 for Claude Opus 4.7 with web search and 10 for Claude Sonnet 4.6 with web search.

What your team should do

The PFMI findings point to three concrete areas for your evals and alignment teams. First, expand coverage of living-standard regulations in your eval suites — not just the core document text but the consultative update cycle. The failures here are concentrated at the boundary between what is in training data and what is in documents published close to or after the training cutoff. PFMI-specific probes should include questions about Principle 15's LNAFE floor, the CCP resilience guidance lineage (the 2016 consultative report, the 2017 final guidance, and the 2026 initial margin consultation), and the status of the stablecoin applicability guidance. These are the exact questions where both models produced substantively wrong, highly confident answers.
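One way to make the training-cutoff boundary explicit in an eval suite is to tag each probe by where its source document falls relative to the model's cutoff. The sketch below is illustrative only: the function name, the bucket labels, and the 180-day "near cutoff" margin are assumptions, not part of the RegLeg methodology, and the dates in the usage note are placeholders rather than real publication or cutoff dates.

```python
from datetime import date, timedelta


def cutoff_bucket(published: date, cutoff: date, margin_days: int = 180) -> str:
    """Bucket a source document by its distance from an assumed training cutoff.

    margin_days is a hypothetical tuning knob: documents published within
    this many days before the cutoff count as "near_cutoff".
    """
    if published > cutoff:
        return "post_cutoff"
    if published >= cutoff - timedelta(days=margin_days):
        return "near_cutoff"
    return "pre_cutoff"
```

For example, with a placeholder cutoff of `date(2025, 6, 30)`, a document dated `date(2017, 7, 1)` lands in `pre_cutoff` and one dated `date(2026, 5, 1)` lands in `post_cutoff`. Reporting failure rates per bucket would show whether errors cluster at the boundary, as the findings here suggest.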

Second, both models with web search are producing pretextual citations (real-looking BIS and IOSCO URLs that resolve to unrelated or non-existent pages) independently of whether the underlying content claim is correct. This is a retrieval-layer failure that should be isolatable: the model generates a plausible URL pattern based on BIS publication naming conventions, attaches it to a content claim, and does not verify that the URL resolves to a page containing the claim. Your retrieval and citation-generation stack warrants specific testing against the BIS publication series (the cpmi/publ/d###.htm naming convention), where the model's URL-synthesis behaviour is clearly operating on a template rather than on retrieved evidence.
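The missing verification step can be sketched as a small check run over each citation a model emits. This is a minimal illustration under stated assumptions: `fetch` is a hypothetical callable that returns page text or `None` when the URL does not resolve, the status labels are invented for this example, and the regular expression simply encodes the cpmi/publ/d###.htm convention mentioned above rather than any official BIS URL scheme.

```python
import re
from typing import Callable, Optional

# Assumed template, derived only from the cpmi/publ/d###.htm convention.
CPMI_TEMPLATE = re.compile(r"/cpmi/publ/d\d+\.htm$")


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not defeat matching."""
    return " ".join(text.split()).lower()


def looks_templated(url: str) -> bool:
    """True if the URL fits the naming template, whether or not it resolves."""
    return CPMI_TEMPLATE.search(url) is not None


def check_citation(url: str, claim_snippet: str,
                   fetch: Callable[[str], Optional[str]]) -> str:
    """Classify one citation as unresolved, supported, or unsupported.

    `fetch` is a hypothetical retrieval hook supplied by the caller.
    """
    page = fetch(url)
    if page is None:
        return "unresolved"      # URL does not resolve to any page
    if normalise(claim_snippet) in normalise(page):
        return "supported"       # page contains the cited text
    return "unsupported"         # page exists but does not contain the claim
```

A citation that is `looks_templated` but comes back `unresolved` or `unsupported` is the signature described above: a URL synthesised from the naming pattern rather than from retrieved evidence. Exact substring matching is a deliberately strict choice here; a production check would likely need fuzzier matching for paraphrased claims.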

Third, the binary-PDF access boundary is a consistent and measurable failure surface. Both models correctly declined to fabricate verbatim content from PDFs they could not read — but neither model was able to retrieve content from those PDFs via web search either, even where the documents are publicly accessible on the BIS and IOSCO portals. Synthetic training pairs derived from the actual verbatim text of the PFMI and its associated methodology documents — particularly the key consideration sub-paragraphs, the annexes, and the assessment methodology rating scale — would directly address the recall gap that drives the content-inference failures in this finding set.

How RegLeg can help

RegLeg has built a structured question bank across PFMI and the broader CPMI-IOSCO publication corpus — questions designed specifically to probe the failure surfaces that matter in regulatory-research tasks: technical numerics buried in annexes, cross-references between documents in a guidance lineage, recent amendments close to training cutoffs, and verbatim-access questions against binary PDFs. We make this available to AI labs under NDA as a licensed eval resource, with full paraphrase-to-original mappings so your team can understand exactly what each probe is testing without the IP being exposed publicly.

Where a deeper engagement makes sense, we offer per-regulation specialist panels — structured sessions with our regulatory domain team on a defined PFMI topic area (CCP resilience, general business risk management, payment system oversight, stablecoin applicability), covering the document lineage, the current operative standards, the open consultations, and the assessment methodology in sufficient depth to support targeted fine-tune data construction. These sessions are designed to be actionable for post-training and evals teams, not just informational briefings.

On the training-data side, RegLeg can generate synthetic correction-pair datasets derived from the regulator's authoritative text — question/wrong-answer/correct-answer triples drawn from real observed failure patterns, with sourcing back to the primary document. For PFMI specifically, the highest-value targets are the key consideration sub-paragraphs (which drive most of the content-recall failures), the assessment methodology rating scale, and the post-2022 consultative update series. We also offer embedded eval coverage on a defined regulator portfolio, refreshed quarterly as new consultations and final guidance are published — so your evals stay current with the regulation rather than lagging it.
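As a rough illustration of what a question/wrong-answer/correct-answer triple with primary-source attribution might look like as data, the sketch below defines one record type and a JSONL serialiser. The field names, the validation rules, and the JSONL layout are assumptions made for this example, not RegLeg's actual dataset schema, and the record contents are placeholders rather than real regulatory text.

```python
import json
from dataclasses import dataclass, asdict
from typing import Iterable


@dataclass(frozen=True)
class CorrectionTriple:
    """One correction pair with its source attribution (hypothetical schema)."""
    question: str
    wrong_answer: str
    correct_answer: str
    source_document: str   # title of the primary document
    source_locator: str    # e.g. principle and key consideration reference

    def __post_init__(self) -> None:
        # Reject records with blank fields or no actual correction.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")
        if self.wrong_answer.strip() == self.correct_answer.strip():
            raise ValueError("wrong_answer and correct_answer must differ")


def to_jsonl(triples: Iterable[CorrectionTriple]) -> str:
    """Serialise triples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(t), ensure_ascii=False) for t in triples)


# Placeholder record; every value is illustrative, not observed data.
example = CorrectionTriple(
    question="<probe question>",
    wrong_answer="<observed model answer>",
    correct_answer="<answer taken from the primary text>",
    source_document="<primary document title>",
    source_locator="<principle / key consideration>",
)
```

Requiring a non-empty source locator on every record keeps each correction traceable to the primary document, which is the property the correction-pair approach depends on.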
