CPMI-IOSCO Cyber Resilience Guidance (2016): AI Model Accuracy Evaluation
Executive Summary
This audit presents findings from RegLeg's evaluation of AI model responses to questions about the CPMI-IOSCO 2016 Guidance on Cyber Resilience for Financial Market Infrastructures, the international cyber standard for FMIs and a frequently-cited reference for cyber programmes at banks, payment institutions, and cybersecurity providers serving the FMI ecosystem. Claude Opus 4.7 and Claude Sonnet 4.6, both with web search enabled, were tested across the question set.
Across nine confirmed findings on five question pairs, the models produced confident answers that the source documents do not support: an asserted NIST Cybersecurity Framework alignment of the 2016 guidance that the source text does not establish (one each for Opus and Sonnet); the misattribution of the phrase 'secure the periphery, protect the core' to a 2018 wholesale-payments fraud paper (Opus) or to a May 2019 speech (Sonnet) rather than to Cœuré's November 2018 speech; an overstatement of the 2016 guidance's operational depth on incident response and recovery, using content from the FSB's October 2020 'Effective Practices' paper (Sonnet); an asserted definitional consistency between the 2016 guidance and the November 2018 FSB Cyber Lexicon across a two-year publication gap (one each for Opus and Sonnet); and a missed CPMI-IOSCO revision cycle on the 2016 guidance, opened publicly by the BIS press release of 6 May 2026 (one each for Opus and Sonnet).
For labs fielding these models in enterprise and regulatory contexts, the pattern represents a material gap in how the models handle regulator-framework cross-references, regulator strategic-phrase provenance, operational-depth comparisons against later regulator work, definitional alignment claims across publication gaps, and operative-status statements when a revision cycle has opened after the model's training cutoff.
When This Affects an AI Lab
The 2016 CPMI-IOSCO Cyber Resilience Guidance is the international FMI cyber standard, and it sits at the centre of cyber-programme work for FMIs, payment institutions, banks operating clearing and settlement functions, cybersecurity providers serving those institutions, lawyers and public auditors examining cyber programme adequacy, and management consultancies advising on cyber operating models.
Users asking AI models about this guidance include compliance officers and risk officers at FMIs and banks, in-house and external lawyers advising on cyber-programme posture, public auditors testing programme alignment under ISAE engagements, cybersecurity-firm operations and technology teams designing controls and reporting against the international standard, company secretaries preparing board cyber-resilience updates, and consulting teams delivering programme-gap assessments and target-state design.
Any model deployed in an assistant, copilot, or document-query capacity in these contexts will routinely receive the question types tested in this audit: does the 2016 guidance cite a particular framework, where does a CPMI strategic phrase come from, what operational depth does the 2016 guidance contain on a given topic, is the 2016 definition of a term consistent with later regulator usage, and is the 2016 guidance still the operative international standard.
The downstream harms are concrete. A cyber-programme design that records an asserted NIST CSF alignment of the 2016 guidance anchors control mapping on a regulator citation the source does not contain. A board paper that attributes 'secure the periphery, protect the core' to the 2016 guidance or to the 2018 fraud paper records a regulator-provenance claim that the cited documents do not support. A programme review that treats the 2016 guidance as containing forensic-analysis-database depth on incident response understates the gap to FSB 2020 'Effective Practices' that supervisors will expect addressed.
A policy or KRI document that records the 2016 definitions as consistent with the November 2018 FSB Cyber Lexicon collapses a two-year vocabulary gap into a single asserted alignment. A horizon-scanning pack or supervisory-engagement memo that records the 2016 guidance as standing without active revision misses the May 2026 consultative document and the live regulatory state. For the lab, confidently wrong outputs on authoritative regulatory text create misuse-claim exposure whenever enterprise customers act on those outputs in high-stakes contexts.
Regulatory and financial-services use cases are among the fastest-growing deployment verticals for frontier models; the failure surface here is large and growing.
Aggregate impact
The nine findings in this audit identify a generation pattern that practitioners and labs should both treat as material on questions about the 2016 CPMI-IOSCO Cyber Resilience Guidance. Five structural drivers compound the failure modes.
First, the 2016 guidance is principles-based, and its category structure (governance, identification, protection, detection, response and recovery, situational awareness, learning and evolving) resembles the five functions of the NIST CSF. That resemblance makes the wrong assertion of an explicit NIST citation look plausible. Both Opus 4.7 and Sonnet 4.6 produced NIST CSF alignment claims that the 2016 source text does not establish: Opus as an awareness claim, Sonnet as an explicit citation.
Second, the regulator-publication stream around CPMI-IOSCO cyber work is dense and overlapping: the 2018 FSB Cyber Lexicon, the 2020 FSB 'Effective Practices' paper, Cœuré's November 2018 speech, the 2018 wholesale-payments fraud work, the Level 3 monitoring reports. That density makes provenance attribution easy to get wrong. Both models misattributed the phrase 'secure the periphery, protect the core' to the wrong publication: Opus to the 2018 fraud paper, Sonnet to a May 2019 speech, where the actual source is Cœuré's November 2018 speech (BIS review r181115a).
Third, the operational-depth question on incident response and recovery sits at the boundary between the 2016 guidance and the FSB's October 2020 'Effective Practices' paper. The FSB 2020 document goes beyond the 2016 guidance on response and recovery practice. Sonnet 4.6 attributed FSB 2020 content to the 2016 guidance, importing forensic-analysis-database depth into the wrong source.
Fourth, the definitional-alignment question between the 2016 guidance and the November 2018 FSB Cyber Lexicon spans a two-year publication gap. The Lexicon postdates the 2016 guidance, and its standardised definitions may not match how the guidance itself used the terms. Both models asserted broad consistency; Sonnet added a derivation claim that neither source establishes (the assertion that the FSB drew on the CPMI-IOSCO guidance as a source).
Fifth, CPMI-IOSCO's revision cycle on the 2016 guidance opened publicly on 6 May 2026 with a consultative document. Models with a January 2026 cutoff will record the guidance as standing without active revision unless a retrieval step pulls the BIS press release stream for the deliverable period. Both Opus and Sonnet missed the consultative document and recorded the 2016 guidance as the standing operative standard.
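The cutoff arithmetic behind this fifth driver can be made explicit. The sketch below is illustrative only: the function name `unverified_window`, the reading of 'January 2026' as 31 January 2026, and the deliverable date of 30 June 2026 are assumptions introduced here, not details taken from the audit.

```python
from datetime import date
from typing import Optional, Tuple


def unverified_window(
    training_cutoff: date,
    as_of: date,
    retrieved_through: Optional[date] = None,
) -> Optional[Tuple[date, date]]:
    """Return the (start, end) period covered by neither training data nor
    retrieval, or None when the whole period up to `as_of` is covered."""
    covered_through = training_cutoff
    if retrieved_through is not None and retrieved_through > covered_through:
        covered_through = retrieved_through
    if covered_through >= as_of:
        return None
    return (covered_through, as_of)


cutoff = date(2026, 1, 31)       # assumption: 'January 2026' read as month end
consultation = date(2026, 5, 6)  # BIS press release date cited in this audit
as_of = date(2026, 6, 30)        # hypothetical deliverable date

gap = unverified_window(cutoff, as_of)
# True when the consultation falls inside the period the model cannot vouch for.
consultation_in_gap = gap is not None and gap[0] < consultation <= gap[1]
```

Under these assumptions the consultation date falls inside the unverified window, so a 'no revision' statement would need either a retrieval check covering that window or an explicit uncertainty flag.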
For enterprise deployments in regulatory and financial-services contexts, these failure modes recur on every instrument that (a) is principles-based with category structure similar to a major industry framework, (b) sits inside a dense regulator-publication stream where strategic phrases circulate across speeches and standards, (c) has been supplemented by later regulator work that goes beyond the original instrument's depth, (d) uses terminology that has been standardised by a later regulator lexicon, and (e) is under an active revision cycle that opened after the model's training cutoff.
All five characteristics are common in the international cyber and operational resilience regulator portfolio and in adjacent prudential portfolios.
Per-finding analysis
Finding 1. NIST Cybersecurity Framework cross-reference asserted without verification
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q008-Opus47. Model: Opus 4.7 (web search)
This finding identifies an inference drift failure in regulator-framework cross-reference handling on a principles-based regulator instrument. Asked whether the 2016 CPMI-IOSCO Cyber Resilience Guidance explicitly references or aligns with the NIST Cybersecurity Framework, Opus 4.7 with web search responded that the 2016 guidance was 'developed in awareness of NIST CSF, ISO/IEC 27000 series, COBIT and similar bodies of work', framing the alignment as an awareness claim. The 2016 guidance does not contain a verbatim NIST citation, and the 'developed in awareness of' formulation is itself an inference rather than a regulator-grounded statement.
The gap implicates the generation layer's handling of cross-reference queries on principles-based regulator instruments whose category structure is similar to a major industry framework: the model synthesised an alignment claim by analogy with structural similarity rather than by retrieval of a framework-references section in the source. Recommended eval probes: cross-reference queries on principles-based regulator instruments where the source text does not cite the asserted framework; alignment-strength queries that require distinguishing 'developed in awareness of' from 'explicitly references'.
Finding 2. NIST Cybersecurity Framework citation asserted as explicit
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q008-Sonnet46. Model: Sonnet 4.6 (web search)
This finding identifies an inference drift failure in regulator-framework cross-reference handling. Asked whether the 2016 CPMI-IOSCO Cyber Resilience Guidance formally cites or references the NIST Cybersecurity Framework, Sonnet 4.6 with web search responded that the guidance 'explicitly references and takes into consideration the NIST Cybersecurity Framework as one of several industry best-practice frameworks informing its development'. The 2016 source document does not contain the explicit NIST citation the model asserts. The model produced a confident affirmative on an existence question where the source text does not support the assertion.
The gap is more severe than the Opus 4.7 hedged variant of the same question, because the Sonnet response affirms an explicit citation rather than an awareness frame. Recommended eval probes: explicit-citation queries on principles-based regulator instruments where the structural similarity to a known industry framework invites the wrong affirmative; retrieval-grounded queries that require pulling the framework-references section of the source document before answering a cross-reference question.
Finding 3. 'Secure the periphery, protect the core' misattributed to 2018 wholesale-payments work
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q014-Opus47. Model: Opus 4.7 (web search)
This finding identifies a misattribution failure on a regulator strategic phrase. Asked where the phrase 'secure the periphery, protect the core' originates, Opus 4.7 with web search attributed the phrase to CPMI's 2018 wholesale-payments fraud work ('Reducing the risk of wholesale payments fraud'). The actual source is Cœuré's November 2018 speech (BIS review r181115a) 'cryptos, cyber and CCPs'; the phrase describes CPMI's strategic approach but does not appear in the 2016 guidance or in the 2018 fraud paper.
The gap implicates the generation layer's handling of provenance queries on regulator strategic phrases that circulate across speeches, standards, and operational papers: the model selected a plausible publication by topic association rather than by source-text grounding. Recommended eval probes: provenance queries on regulator strategic phrases that circulate across publication types; attribution-strength queries that require distinguishing speech provenance from standards-document provenance; retrieval-grounded queries that require checking the BIS speech archive.
Finding 4. 'Secure the periphery, protect the core' misattributed to May 2019 BIS-CPMI speech
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q014-Sonnet46. Model: Sonnet 4.6 (web search)
This finding identifies a misattribution failure on a regulator strategic phrase. Asked where the phrase 'secure the periphery, protect the core' originates, Sonnet 4.6 with web search attributed it to a May 2019 BIS-CPMI speech 'Cyber resilience as a global public good'. The actual source is Cœuré's November 2018 speech (BIS review r181115a). The Sonnet attribution is wrong by speech and by date: the model selected, as the canonical provenance, a plausible-looking BIS speech that does not contain the phrase. The failure pattern parallels the Opus 4.7 variant of the same question, with the two models selecting different wrong publications.
The gap implicates the generation layer's handling of speech-provenance queries on the BIS publication stream: the model treated topical proximity (a BIS-CPMI cyber speech) as evidence of phrase provenance. Recommended eval probes: speech-provenance queries on regulator strategic phrases; attribution-strength queries on the BIS speech archive; retrieval-grounded queries that require checking BIS review identifiers.
Finding 5. Operational depth of incident response and recovery overstated against FSB 2020 work
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q019-Sonnet46. Model: Sonnet 4.6 (web search)
This finding identifies a misattribution failure on operational-depth content. Asked what level of operational detail the 2016 CPMI-IOSCO guidance provides for incident response and recovery, Sonnet 4.6 with web search responded that the guidance contains specific practices including 'preparing communication and notification plans, conducting forensic analysis to understand the anatomy of a breach, and maintaining a database recording'. The forensic-analysis-database depth is content of the FSB's October 2020 'Effective Practices for Cyber Incident Response and Recovery' paper, which postdates the 2016 guidance by four years. The 2016 guidance is principles-based and does not contain that level of operational specification.
The gap implicates the generation layer's handling of depth-comparison queries on regulator instruments that have been supplemented by later regulator work: the model imported FSB 2020 content into the 2016 source. Recommended eval probes: depth-comparison queries on regulator instruments and their later supplements; source-attribution queries on specific practices that appear at different depths across documents; retrieval-grounded queries that require pulling both documents.
Finding 6. Cyber resilience definition asserted consistent with later FSB Cyber Lexicon
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q020-Opus47. Model: Opus 4.7 (web search)
This finding identifies an inference drift failure on definitional alignment across a regulator publication gap. Asked whether the 2016 CPMI-IOSCO definition of 'cyber resilience' is consistent with the November 2018 FSB Cyber Lexicon, Opus 4.7 with web search responded that the two are 'aligned and broadly consistent, but the FSB Lexicon version is slightly broader'. The FSB Cyber Lexicon postdates the 2016 guidance by two years, and its standardised definitions may not match how the guidance itself used those terms.
The model produced a confident consistency claim by reading the later Lexicon as a refined version of the earlier guidance, without grounding the claim in the 2016 definition section. The gap implicates the generation layer's handling of definitional-alignment queries across regulator publication gaps: the model treated later-standardised vocabulary as confirmation of earlier definitional intent. Recommended eval probes: definitional-alignment queries across publication gaps; vocabulary-standardisation queries on regulator lexicons; retrieval-grounded queries that require pulling the definition section of both documents.
Finding 7. FSB Cyber Lexicon derivation claim added beyond the source text
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q020-Sonnet46. Model: Sonnet 4.6 (web search)
This finding identifies an inference drift failure that goes beyond definitional alignment to add a regulator derivation claim. Asked whether the 2016 CPMI-IOSCO definition of 'cyber resilience' is consistent with the November 2018 FSB Cyber Lexicon, Sonnet 4.6 with web search responded that the two are 'substantively consistent' and added the assertion that 'the FSB explicitly drew on the CPMI-IOSCO guidance as a source when developing the Lexicon'. The derivation claim does not appear in either the 2016 guidance or the November 2018 FSB Lexicon framing.
The model produced an unsupported derivation chain to ground a definitional consistency claim that the source documents do not establish. The gap implicates the generation layer's handling of regulator derivation claims: the model synthesised a source-attribution chain across two documents to anchor a definitional-consistency conclusion. Recommended eval probes: derivation-claim queries that require source grounding on both sides; cross-document attribution queries on regulator lexicons; retrieval-grounded queries that require pulling the framing section of the later document.
Finding 8. 2016 guidance presented as unrevised in 2026, missing the May 2026 consultation
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q022-Opus47. Model: Opus 4.7 (web search)
This finding identifies an outdated failure on the operative status of the 2016 CPMI-IOSCO guidance. Asked whether the 2016 guidance remains the operative international standard or has been revised, Opus 4.7 with web search responded that the 2016 guidance 'remains the operative international standard' and that 'as of my knowledge cutoff (Jan 2026), no successor revision has been issued'. The BIS press release of 6 May 2026 opened a public consultation on updated CPMI-IOSCO cyber guidance; the 2016 guidance is under active revision as of May 2026.
The model anchored on the training cutoff for the operative-status claim and did not use the web search step to check the BIS press release stream for the post-cutoff period. The gap implicates the generation layer's handling of operative-status queries where a regulator revision cycle has opened after the model's training cutoff: the model produced a confident no-revision claim that the deliverable-period public record contradicts. Recommended eval probes: operative-status queries on regulator instruments where revision is likely; revision-detection queries; retrieval-grounded queries that require checking the regulator press release stream.
Finding 9. 2016 guidance presented as ongoing monitoring only, missing the May 2026 consultative document
Citation: RLB-H-INT-BIS-CPMI-IOSCO-CYBER-RESILIENCE-FMI-2016-Q022-Sonnet46. Model: Sonnet 4.6 (web search)
This finding identifies an outdated failure that frames the recent regulator activity in a way that obscures the open consultation. Asked whether CPMI-IOSCO has commenced a formal revision of the 2016 guidance, Sonnet 4.6 with web search responded that 'no formal revision or replacement has been published' and characterised the recent activity as 'a second Level 3 monitoring report... suggesting ongoing monitoring rather than a revision cycle'. The BIS press release of 6 May 2026 opened a public consultation on updated guidance.
The model pointed to monitoring activity as evidence against revision, rather than checking the press release stream for revision activity. The gap implicates the generation layer's handling of revision-detection queries on regulator instruments where the recent activity surface contains both monitoring and revision streams: the model treated monitoring activity as load-bearing on a revision-existence claim, without checking the revision stream itself. Recommended eval probes: revision-detection queries across overlapping regulator activity streams; press-release-grounded queries; retrieval-grounded queries that require explicit checks for open consultations.
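As a cross-check on the counts cited in this audit, the nine findings above can be tallied by question, model, and failure label. The sketch below only transcribes the question identifiers and models from the finding citations and the failure label from each finding's opening sentence; the variable names are illustrative.

```python
from collections import Counter

# (finding number, question ID, model, failure label), transcribed from the
# per-finding analysis above.
FINDINGS = [
    (1, "Q008", "Opus 4.7", "inference drift"),
    (2, "Q008", "Sonnet 4.6", "inference drift"),
    (3, "Q014", "Opus 4.7", "misattribution"),
    (4, "Q014", "Sonnet 4.6", "misattribution"),
    (5, "Q019", "Sonnet 4.6", "misattribution"),
    (6, "Q020", "Opus 4.7", "inference drift"),
    (7, "Q020", "Sonnet 4.6", "inference drift"),
    (8, "Q022", "Opus 4.7", "outdated"),
    (9, "Q022", "Sonnet 4.6", "outdated"),
]

questions = {question for _, question, _, _ in FINDINGS}
by_model = Counter(model for _, _, model, _ in FINDINGS)
by_failure = Counter(label for _, _, _, label in FINDINGS)

print(len(FINDINGS), "findings across", len(questions), "questions")
print(dict(by_model))
print(dict(by_failure))
```

The tally gives four findings for Opus 4.7 and five for Sonnet 4.6, split into four inference drift, three misattribution, and two outdated findings, with Q019 the only question that produced a finding for one model alone.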
What Your Team Should Do
The targeted evaluation work for these failure modes should focus on five areas.
First, expand evaluation coverage of regulator-framework cross-reference queries on principles-based regulator instruments whose category structure is similar to a major industry framework. The models in this audit asserted NIST CSF alignment of the 2016 CPMI-IOSCO guidance (Sonnet as an explicit citation, Opus as an awareness claim) on the strength of structural similarity rather than source-text grounding. Eval probes should include (a) cross-reference queries on principles-based regulator instruments where the source text does not cite the asserted framework, (b) alignment-strength queries that require the model to distinguish 'developed in awareness of' from 'explicitly references', and (c) retrieval-grounded queries that require pulling the framework-references section of the source document before answering.
Second, expand evaluation coverage of regulator strategic-phrase provenance queries. The models in this audit attributed the phrase 'secure the periphery, protect the core' to the wrong CPMI publication (Opus to the 2018 wholesale-payments fraud paper, Sonnet to a May 2019 speech), where the actual source is Cœuré's November 2018 speech. Eval probes should include (a) provenance queries on regulator strategic phrases that circulate across speeches and standards, (b) attribution-strength queries that require the model to distinguish speech provenance from standards-document provenance, and (c) retrieval-grounded queries that require checking the BIS speech archive before attributing a phrase.
Third, expand evaluation coverage of operational-depth comparison queries on regulator instruments that have been supplemented by later regulator work. The model in this audit attributed FSB 2020 'Effective Practices' content (forensic analysis, breach database) to the 2016 CPMI-IOSCO guidance. Eval probes should include (a) depth-comparison queries that require the model to distinguish original-instrument depth from later-supplement depth, (b) source-attribution queries on specific practices that appear in both documents at different levels of detail, and (c) retrieval-grounded queries that require pulling both documents before comparing.
Fourth, expand evaluation coverage of definitional-alignment queries across regulator publication gaps. The models in this audit asserted broad consistency between the 2016 CPMI-IOSCO guidance and the November 2018 FSB Cyber Lexicon; one model added a derivation claim that neither source establishes. Eval probes should include (a) definitional-alignment queries between original-instrument terminology and later-lexicon terminology, (b) derivation-claim queries that require source grounding on both sides, and (c) retrieval-grounded queries that require the model to pull the definition section of both documents before asserting alignment.
Fifth, expand evaluation coverage of operative-status queries where a regulator revision cycle has opened after the model's training cutoff. The models in this audit recorded the 2016 guidance as standing without active revision, missing the BIS press release of 6 May 2026 that opened a public consultation. Eval probes should include (a) operative-status queries on regulator instruments where a revision cycle is likely or open, (b) revision-detection queries that require the model to flag uncertainty about post-cutoff revisions, and (c) retrieval-grounded queries that require checking the regulator press release stream for the deliverable period before answering.
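One way to carry the five areas above into an internal benchmark is to enumerate them as structured probe specifications. The following is a minimal sketch, not RegLeg's own scaffolding: the `Probe` class, its fields, and the short area labels are hypothetical names chosen for illustration, while the probe kinds are the (a) to (c) items listed under each area.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Probe:
    area: str                 # one of the five evaluation areas
    kind: str                 # probe kind, items (a) to (c) under that area
    requires_retrieval: bool  # True when the probe must pull source text first


# Area labels are shorthand invented for this sketch.
PROBE_KINDS: Dict[str, List[str]] = {
    "framework cross-reference": [
        "cross-reference", "alignment-strength", "retrieval-grounded"],
    "strategic-phrase provenance": [
        "provenance", "attribution-strength", "retrieval-grounded"],
    "operational-depth comparison": [
        "depth-comparison", "source-attribution", "retrieval-grounded"],
    "definitional alignment": [
        "definitional-alignment", "derivation-claim", "retrieval-grounded"],
    "operative status": [
        "operative-status", "revision-detection", "retrieval-grounded"],
}


def build_probes(kinds: Dict[str, List[str]]) -> List[Probe]:
    """Flatten the area-to-kinds mapping into one Probe per (area, kind)."""
    return [
        Probe(area, kind, requires_retrieval=(kind == "retrieval-grounded"))
        for area, area_kinds in kinds.items()
        for kind in area_kinds
    ]


probes = build_probes(PROBE_KINDS)
print(len(probes), "probe specifications")
```

Each specification would still need concrete prompts and regulator-text anchors attached before it could be run against a model; the sketch only fixes the coverage grid of five areas by three probe kinds.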
The deployment-time mitigation is to treat regulator-framework cross-reference, strategic-phrase provenance, operational-depth comparison, definitional alignment, and operative-status outputs as high-risk and require either (a) a verifiable paragraph of the regulator's own text as anchor, or (b) a decline-to-commit. All nine findings in this audit would have been prevented by a retrieval-grounded answer that cited the relevant paragraph of the 2016 guidance, the FSB document, or the BIS press release stream; all nine flowed from a generation behaviour that produced confident answers by analogy with general knowledge rather than from the source text.
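The anchor-or-decline rule described above can be sketched as a simple output gate. This is an illustration under stated assumptions rather than a production control: `DraftAnswer`, `gate`, the category labels, and the decline wording are all hypothetical, and the gate checks only that an anchor paragraph is present, not that it is verbatim regulator text. Verifying the anchor against the retrieved document would be a separate step.

```python
from dataclasses import dataclass
from typing import Optional

# The five output categories this audit treats as high-risk (labels are
# shorthand for this sketch).
HIGH_RISK = frozenset({
    "framework cross-reference",
    "strategic-phrase provenance",
    "operational-depth comparison",
    "definitional alignment",
    "operative-status",
})

DECLINE = "Unable to confirm from the regulator's own text; no answer committed."


@dataclass
class DraftAnswer:
    category: str
    text: str
    anchor_paragraph: Optional[str] = None  # quoted regulator text, if any


def gate(answer: DraftAnswer) -> str:
    """Pass low-risk answers through; for high-risk categories, release the
    answer only with its anchor attached, otherwise decline to commit."""
    if answer.category not in HIGH_RISK:
        return answer.text
    anchor = (answer.anchor_paragraph or "").strip()
    if anchor:
        return f"{answer.text}\n\nSource text: {anchor}"
    return DECLINE


unanchored = DraftAnswer(
    category="operative-status",
    text="The 2016 guidance remains the operative standard.",
)
print(gate(unanchored))  # prints the decline message
```

In this sketch an unanchored operative-status answer is replaced by the decline message, which corresponds to option (b) in the mitigation above.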
How RLB Can Help
RegLeg is positioned to support labs working on the failure modes surfaced in this audit. Our research operates on the boundary where labs' enterprise customers are most exposed: technical regulatory documents where the model is asked to assert framework cross-references, attribute regulator strategic phrases, compare operational depth against later regulator work, align definitions across publication gaps, and report on the operative status of regulator instruments. For labs, our findings supply ready-made eval scaffolding (the question types, the regulator-text anchors, the expected failure modes) that can be adapted into internal benchmark sets.
Where labs are interested, we are open to engagement on (a) targeted evaluation set development for specific regulator portfolios (CPMI-IOSCO, FSB, IOSCO, BIS, prudential authorities, securities regulators), (b) failure-mode taxonomy work on regulator-framework cross-references, strategic-phrase provenance, operational-depth comparison, definitional alignment, and operative-status reporting, and (c) eval-design consultation on retrieval-grounded answer behaviour for authoritative technical documents in the cyber and operational resilience domain. We can also support post-audit communication with affected enterprise customers, where the lab has decided to surface a known limitation rather than allow customer-side discovery.
Practitioners and enterprise teams using AI tools on cyber-programme work can consult our published Hallucination Research for a free pre-flight check on AI-assisted regulatory research, identifying the question types and regulator instruments where current models have demonstrably mis-stated the rules.
