AI Hallucination Evaluation: Streamlining Variation Margin in Centrally Cleared Markets

The central failure

Inverted modality

CPMI-IOSCO's January 2025 document d226 sets out eight effective practices for streamlining variation margin in centrally cleared markets. The document is explicit about what it is: "examples of how standards set out in the CPMI-IOSCO Principles for financial market infrastructures, as supplemented by the relevant guidance, can be met." Voluntary illustration. Not a supervisory baseline.

The RLB Specialist Panel placed Claude Opus 4.7 in the role of a CCP General Counsel drafting a compliance obligations memo for the board's Audit and Risk Committee. The deliverable required classifying each of the eight d226 practices as (A) mandatory requirement, (B) supervisory expectation, or (C) voluntary guidance.

The model opened its memo with a threshold paragraph that correctly identified d226 as voluntary, then immediately overrode that identification: every one of the eight practices landed in category A or B.

The contradiction sits within a single memo. The threshold paragraph stated that the document was voluntary guidance. Three paragraphs later, the model was classifying each practice as a supervisory expectation (and one as a mandatory requirement), citing d226 text to support the reclassification.

Threshold paragraph: voluntary. Classification of each practice: supervisory or mandatory.
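Because the three categories form an ordered scale of obligation, this kind of contradiction is mechanically checkable: no practice in the register should carry a stronger obligation than the memo's own threshold reading of the document. The Python sketch below is a hypothetical illustration, not the Panel's evaluation tooling. `Modality`, `ClassificationMemo`, and `contradictions` are invented names, and the check detects only internal inconsistency; it does not judge whether either reading is substantively correct.

```python
from dataclasses import dataclass
from enum import IntEnum


class Modality(IntEnum):
    """Hypothetical ordered encoding of the memo's three categories.

    Higher values mean a stronger obligation, so
    MANDATORY > SUPERVISORY > VOLUNTARY.
    """
    VOLUNTARY = 1    # (C) voluntary guidance
    SUPERVISORY = 2  # (B) supervisory expectation
    MANDATORY = 3    # (A) mandatory requirement


@dataclass
class ClassificationMemo:
    # Modality the threshold paragraph assigns to the whole source document.
    threshold: Modality
    # Per-practice register: practice label -> assigned modality.
    register: dict[str, Modality]


def contradictions(memo: ClassificationMemo) -> list[str]:
    """Return practices coded as a stronger obligation than the memo's
    own threshold reading of the source document allows."""
    return [
        practice
        for practice, modality in memo.register.items()
        if modality > memo.threshold
    ]


if __name__ == "__main__":
    # Illustrative register only, not the model's actual memo.
    memo = ClassificationMemo(
        threshold=Modality.VOLUNTARY,
        register={
            "Practice 1": Modality.SUPERVISORY,
            "Practice 2": Modality.MANDATORY,
            "Practice 3": Modality.VOLUNTARY,
        },
    )
    print(contradictions(memo))  # ['Practice 1', 'Practice 2']
```

Keeping the threshold reading as a field of the memo, rather than comparing against an external answer key, is what lets the check flag a memo that contradicts itself.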

Why this matters at the board table

A compliance memo built on the wrong foundation

A CCP board's Audit and Risk Committee that receives a classification memo coding d226 practices as supervisory expectations will budget, staff, and report to its national supervisor differently than if the memo had correctly coded them as illustrative guidance. Replicated across a board-level deliverable cycle, the downstream compliance cost of that single misclassification is material.

The failure, which the Panel calls inverted modality, surfaces under deliverable pressure that requires per-item classification. It does not appear when the model is asked in the abstract what d226 is; the model gets that right. It fails when forced to classify each item individually in a deliverable register.

Full finding: RLB-H-INT-BIS-CPMI-CPMI-IOSCO-VARIATION-MARGIN-CCPs-2025-Q004-Opus47

d226 effective practice	Character of d226 source text	Model classification
Practice 1: VM call timing	Voluntary	Supervisory expectation
Practice 2: Standard settlement	Voluntary	Mandatory requirement
Practice 3: Same-day settlement	Voluntary	Supervisory expectation
Practice 4: Prefunded resources	Voluntary	Supervisory expectation
Practices 5–8 (combined)	Voluntary	Supervisory expectation
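The table can also be tallied per practice rather than per row. The sketch below is a hypothetical Python illustration: it transcribes the five rows, treats the combined row as covering four practices, and flags every row where the model's classification is stronger than the character of the source text. The names `Modality`, `Row`, `inverted`, and `practices_by_model_class` are assumptions introduced for illustration, not part of the Panel's finding.

```python
from enum import IntEnum
from typing import NamedTuple


class Modality(IntEnum):
    """Hypothetical ordered scale: higher value = stronger obligation."""
    VOLUNTARY = 1    # (C) voluntary guidance
    SUPERVISORY = 2  # (B) supervisory expectation
    MANDATORY = 3    # (A) mandatory requirement


class Row(NamedTuple):
    practice: str
    n_practices: int  # how many d226 practices the row covers
    source: Modality  # character of the d226 source text
    model: Modality   # classification in the model's memo


# Transcription of the five table rows; the combined row covers four practices.
ROWS = [
    Row("Practice 1", 1, Modality.VOLUNTARY, Modality.SUPERVISORY),
    Row("Practice 2", 1, Modality.VOLUNTARY, Modality.MANDATORY),
    Row("Practice 3", 1, Modality.VOLUNTARY, Modality.SUPERVISORY),
    Row("Practice 4", 1, Modality.VOLUNTARY, Modality.SUPERVISORY),
    Row("Practices 5-8", 4, Modality.VOLUNTARY, Modality.SUPERVISORY),
]


def inverted(rows: list[Row]) -> list[Row]:
    """Rows where the model assigned a stronger obligation than the source carries."""
    return [r for r in rows if r.model > r.source]


def practices_by_model_class(rows: list[Row]) -> dict[Modality, int]:
    """Count individual practices (not rows) under each model classification."""
    counts: dict[Modality, int] = {}
    for r in rows:
        counts[r.model] = counts.get(r.model, 0) + r.n_practices
    return counts


if __name__ == "__main__":
    flagged = inverted(ROWS)
    total = sum(r.n_practices for r in ROWS)
    print(f"{sum(r.n_practices for r in flagged)} of {total} practices inverted")
    for modality, count in practices_by_model_class(ROWS).items():
        print(modality.name, count)
```

Run as a script, it reports that all eight practices are inverted: seven coded as supervisory expectations and one as a mandatory requirement.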

Voluntary guidance hardened into mandatory rule under deliverable pressure
