
Two frontier AI models, Two frontier AI models, each running with web search, produced structurally correct-looking operational guidance on CFTC Regulation 1.44 that contradicts the regulation’s text. The Specialist Panel tested each model on the same operational question: map the currency deadline tiers for an FCM margin operations team. Both failed on the regulatory detail that determines how early or late an FCM must collect margin.
Claude Opus 4.7 collapsed the regulation’s three-tier structure into two tiers. Regulation 1.44(f) distinguishes: USD and Canadian dollars (same-day); Appendix A currencies, AUD, CNY, HKD, HUF, ILS, JPY, NZD, SGD, TRY, ZAR, (end of the second business day, T+2); and all other fiat currencies (end of the next business day, T+1). The model assigned Appendix A currencies a T+1 deadline and all other non-USD currencies same-day, compressing an intermediate tier out of existence.
An FCM margin team following this guidance would call Appendix A margin one full business day earlier than required, and apply same-day urgency to currencies the regulation permits T+1 collection on.
Claude Sonnet 4.6 introduced a deadline that does not exist in the regulation: “12:00 p.m. ET on the FIRST U.S. business day.” Regulation 1.44(f)(3) sets only “end of the business day after the day on which the margin call is issued”, no intraday time. A system parameter or operations procedure built on this output would implement a self-imposed noon cutoff with no regulatory basis, and the documentation trail would cite a specific time the regulation does not contain.
Both outputs were formatted, internally consistent, and produced without any caveat that source verification was needed. That is the failure mode these findings document.
Both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search fabricated regulatory structure when responding to CFTC Regulation 1.44 (FCM Margin Adequacy and Separate Accounts), reconstructing rule taxonomies from inferred analogues rather than the regulation's enumerated provisions. The dominant failure shape is structural compression: multi-tier, explicitly enumerated lists in the regulation were collapsed into simpler, intuitively plausible substitutes. Claude Sonnet 4.6 flattened a three-tier currency deadline structure into a two-tier model and disaggregated a six-item cessation checklist into four items, leaving operations teams without monitoring coverage for the majority of regulatory triggers. Claude Opus 4.7 substituted a plausible operational checklist for the FCM-level cessation events specified in the regulation. The pattern is consistent with models reconstructing compliance schema from surface-level intuition about how margin rules typically work, rather than from the specific enumeration the CFTC published, a failure mode structurally invisible to evals built on paraphrased or summary-level test items.
This is the consolidated view of findings. Click 'see details →' on any item for the full details for each finding.
The model's collapse of Regulation 1.44(f)'s three-tier currency deadline structure into two tiers — Appendix A currencies at T+1, all other non-USD fiat at same-day — implicates the training-data representation of the regulation: the two-tier reconstruction matches the format of third-party law-firm summary content that pre-dates or simplifies the rule's actual T+2 Appendix A tier, while the correct end-of-second-business-day deadline is the specific detail that distinguishes the regulation from the model's prior. The retrieval-augmented generation layer either did not surface §1.44(f)(2) verbatim, or did surface it but failed to override the model's prior toward the simpler two-tier schema.
On re-probe, the model corrected to the three-tier structure — confirming the correct mapping was reachable. This is a calibration failure in how primary regulatory text is weighted against summary-content priors at generation time.
see details →The model's two-tier reconstruction and the Fabricated third-party citation together implicate the retrieval-ranking layer: when web search returns third-party law-firm summary content that uses a simpler two-tier schema, that content appears to be weighted comparably to primary regulatory text. The self-retraction on re-probe confirms the correct three-tier structure was accessible — the generation pathway selected the wrong output despite having the right information available. This is a calibration failure in the RAG-to-generation handoff: retrieved primary text was not given sufficient authority to override the model's prior or the third-party summary's framing.
see details →Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.