Both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search fabricated regulatory structure when responding to CFTC Regulation 1.44 (FCM Margin Adequacy and Separate Accounts) — reconstructing rule taxonomies from inferred analogues rather than the regulation's enumerated provisions. The dominant failure shape is structural compression: multi-tier, explicitly enumerated lists in the regulation (distinct deadlines by currency tier; distinct cessation events by category) were collapsed into simpler, intuitively plausible substitutes.
Claude Sonnet 4.6 flattened a three-tier currency deadline structure into a two-tier model and disaggregated a six-item cessation checklist into four items that reassembled margin failure into three checkboxes — leaving operations teams without monitoring coverage for the majority of regulatory triggers. Claude Opus 4.7 substituted a plausible operational checklist for the FCM-level cessation events specified in the regulation.
The pattern is consistent with models reconstructing compliance schema from surface-level intuition about how margin rules typically work, rather than from the specific enumeration the CFTC published — a failure mode that is structurally invisible to evals built on paraphrased or summary-level test items.
Regulation 1.44 governs how Futures Commission Merchants segregate and margin customer assets in separate accounts — a technical regime that sits at the intersection of derivatives clearing, treasury operations, and FCM compliance. Models are routinely deployed in these contexts: compliance teams at FCMs and their technology vendors use AI assistants to draft operational procedures, configure monitoring thresholds, and interpret regulatory text on tight timelines. When a model reconstructs rule structure from pattern-matching rather than the actual enumeration, the output looks authoritative — structured, numbered, checklist-formatted — and treasury or operations staff act on it directly.
The downstream harms are asymmetric and high-stakes. An FCM treasury team that misconfigures currency settlement deadlines based on a compressed two-tier read of a three-tier rule faces real regulatory exposure with the CFTC — and the model's confident, well-formatted output provides no signal that a tier was dropped. An operations team that builds automated monitoring against a four-item cessation checklist instead of the regulation's six-item list has a surveillance gap that a regulatory examination would surface.
The failure is not recoverable by the user in real-time: the model's output is internally consistent and plausible enough that validation against the primary text would only happen if the user already knew what to look for. For labs deploying in financial-services and regtech contexts, this is the misuse-claim exposure that matters: confidently wrong procedural outputs acted on by professionals who had reasonable grounds to trust them.
What makes Regulation 1.44 a structurally difficult surface for models is the combination of recent enactment (final rule 2025), cross-reference dependency across appendices, and technical enumeration where count and exact membership of a list is legally material. Currency deadlines are defined by explicit Appendix A membership, not by intuitive currency groupings. Cessation triggers are enumerated with no catch-all — each item is a distinct legal event. Models that reconstruct these lists from prior-generation regulatory patterns or from third-party summary text will produce outputs that are wrong in exactly the ways that matter most operationally while appearing structurally correct.
| Model | Configuration | Failure count | Dominant error pattern |
|---|---|---|---|
| Claude Opus 4.7 | Web search | 1 | Substituted plausible operational checklist for enumerated FCM-level cessation events |
| Claude Sonnet 4.6 | Web search | 2 | Structural compression of enumerated regulatory lists into intuitively constructed analogues |
Claude Sonnet 4.6 with web search produced two confirmed fabrications on this regulation, both involving enumeration collapse. On the currency deadline structure, the model produced a two-tier read of a three-tier rule — grouping CAD with non-USD currencies and ignoring the Appendix A designation that creates a second-business-day tier for ten specific currencies. The model cited a third-party law-firm URL (Fabricated) to support the incorrect structure. When re-probed, it self-retracted — indicating the correct answer was accessible but not the one it initially produced.
On the cessation triggers question, the model disaggregated margin failure into three separate checklist items (initial, maintenance, variation margin) while dropping four of the six customer-specific cessation events the regulation enumerates, reducing regulatory coverage to two customer-level items from the required six.
Claude Opus 4.7 with web search failed on the cessation trigger question, substituting a customer-centric margin-call checklist for the FCM-level events the regulation specifies — distress notification, internal distress determination, and insolvency. The seven-item checklist it produced contained no entry for any of the three FCM-level events.
The cross-model pattern is specific and worth naming directly: failures concentrate on recently enacted provisions with explicit, membership-defined enumerations — particularly where the correct answer requires counting list items and respecting membership criteria the model has no strong prior for. Both models defaulted to structurally plausible reconstructions of what compliance checklists of this type typically look like, rather than the regulation's actual content. The self-retraction behaviour on the Sonnet 4.6 currency finding is diagnostic: the model's retrieval path accessed the correct answer but something in the generation pathway selected the intuitively constructed version first.
That gap between retrieved content and generated output is a calibration signal the lab's internal evals may not be designed to catch.
3 findings in this case study. Click any to see its full evidence card.
The currency deadline finding points to a training-data gap specific to Appendix A membership. The regulation's three-tier structure is not derivable from the names of the currencies — CAD's assignment to the USD Fedwire tier and the ten-currency Appendix A list are both explicit regulatory decisions with no intuitive basis in the currencies' properties. Training corpora that rely on third-party summaries of this regulation — law-firm client alerts, compliance-platform writeups — are likely to reproduce the simpler two-tier intuition, because that's how summary authors paraphrase the rule.
The correct training signal requires the regulation's primary text paired with explicit annotation of the Appendix A membership mapping. Without that structured pairing, the model's generative prior for "how FCM currency deadlines work" will crowd out the specific enumeration.
The cessation-trigger findings point to a related problem: the model's prior on what a margin-operations checklist looks like (margin-type sub-categories, event-of-default) is strong enough to override a regulatory enumeration it has access to. Both models self-retracted on re-probe, which means the primary-text information was reachable. The training signal that builds a reliable prior for "regulatory enumeration is closed — list items are not inferrable from category logic" is probably sparse for recently enacted CFTC rules. Pairing the enumerated cessation events from the regulation's primary text with negative examples showing what plausible-but-wrong reconstructions look like would directly address this failure shape.
When web search is enabled and returns third-party law-firm or compliance-platform content, the ranker's effective weight on that content relative to the regulator's primary text appears to be too high for queries about recently finalized rules. On the currency deadline finding, the model cited a Fabricated third-party URL and produced the two-tier structure that third-party summaries commonly use. A retrieval-routing layer that boosts regulator-domain primary text for regulator-named queries — and degrades third-party summary content when it conflicts with primary text on numeric thresholds — would close this exposure without retraining.
The self-retraction pattern is diagnostic of a calibration gap between retrieval confidence and generation commitment. The model accessed the correct information but generated the incorrect structure with enough confidence to present it as procedural guidance. A post-training calibration pass that penalizes high-confidence generation of enumerated lists where the generation disagrees with retrieved primary text would surface this class of failure at generation time.
Additionally, a structured self-check pass on numeric tiers and explicit list counts — where the model commits to a specific count of enumerated items, it verifies that count against retrieved text before finalizing — would catch the compression failures documented here.
We document nuanced failure modes on regulatory content that internal evals built against summary-level or paraphrased test items systematically miss. The failure shapes we've documented across CFTC and adjacent regulatory work include: enumeration-compression on recently enacted provisions with explicit Appendix-defined membership; multi-tier numeric structure collapse where the correct tier assignment is specified in primary text but contradicted by widely-cited third-party summaries; cessation-trigger dropout where a model's operational prior for a checklist type overrides the regulation's closed enumeration; and self-retraction gaps where the correct answer is retrievable but the generation path selects a plausible-prior reconstruction first.
These failure shapes are consistent across model versions and configurations in ways that reflect something structural about how the models represent recently enacted regulatory content — not incidental noise.
The partnership we can offer is scoped to model improvement. We can generate targeted correction pairs per documented failure mode — regulator primary text matched to wrong-but-plausible model reconstructions, formatted for direct ingestion into your training pipeline. We can run embedded eval cycles against a defined CFTC and multi-regulator portfolio on a quarterly basis, with comparative reports across model versions and regression monitoring on previously documented failure modes so you can track whether fine-tuning or RLHF adjustments are moving the needle.
For capability releases touching financial services, payments infrastructure, or cross-border regulatory content, we can run pre-release evaluation cycles to flag failure shapes before they reach customers who will act on them as operational guidance. And we can provide red-team consultation on regulator-specific failure surfaces as new regulatory domains enter your deployment footprint.
To scope a technical partnership for refining your models against these failure modes, contact us at reglegbrief.com.
Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.