AI Labs · updated 2026-06-11 · methodology v2.3

AI Hallucination Evaluation: CFTC Regulation 4.7 (Qualified Eligible Person Portfolio Requirements - 2024 Amendments)

Alert: Frontier AI models misread CFTC Reg 4.7 (2024 QEP Amendments)

CPI-U figure invention, statutory threshold misstatement, and Source Credit fabrication in CFTC Reg 4.7 (2024 QEP Amendments). Two frontier AI models tested by the RegLeg Brief Specialist Panel produced confident, citable answers across 17 distinct questions on the September 2024 amendments to CFTC Regulation 4.7 that the regulator's own primary text directly contradicts. The audit covers statutory threshold reproduction, NPRM-stage and final-rule CPI-U buying-power figure quotation, Commission voting-record reproduction, Federal Register correction-record reproduction, and Source Credit reproduction.

For AI lab teams fielding frontier models into U.S. derivatives and asset-management deployments, the failure pattern is operationally consequential. The audit tested 17 questions designed by the RLB Specialist Panel to mirror how lawyers, compliance officers, fund administrators, financial advisers, and management consultants actually use AI on this practice area: drafting memos, populating registers, preparing testimony exhibits, drafting client deliverables, and verifying statutory and Federal Register citations. Each question is bound to verbatim regulator-issued primary substrate.

Across the 17 findings the AI subjects invented NPRM-stage and final-rule CPI-U buying-power figures, misstated 7 USC 1a(18)(B)(ii)(I) thresholds by factors of forty and two hundred, misattributed the Commission's vote (naming a commissioner who had departed two years earlier), reported a Federal Register correction as applying to two extra CFR Parts that the index does not list, and misstated the 7 USC 6n Source Credit, the 7 USC 6n(3)(A) recordkeeping retention period, and the 7 USC 6n(2) registration expiration date.

A partner-level legal memorandum that recites an ECP threshold of $5,000,000 or $25,000,000 where the statute records $1,000,000,000 misstates a counterparty-eligibility threshold by a factor of two hundred or forty, respectively. A CCO briefing memo that quotes the AI's invented CPI-U buying-power figure as a verbatim regulator quotation embeds a falsifiable error into a board-level deliverable.

A fund administrator's annual rule-change tracker that records the December 2024 correction as applying to 17 CFR Parts 37, 38, and 40 (instead of Part 40 alone) populates the firm's effective-date register with operational data the published index does not support.

The audit's 17 findings are published with immutable RLB Citation IDs. Representative entries include RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q024-Opus47, RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q024-Sonnet46, RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q011-Sonnet46, RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q016-Opus47, RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q008-Sonnet46, RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q017-Opus47, RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q027-Sonnet46, RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q029-Sonnet46, and RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q031-Opus47. The full audit is published at the CFTC Regulation 4.7 (2024 QEP Amendments) hub on RegLegBrief.com.

Executive summary

This audit presents findings from RegLeg's evaluation of frontier AI models against the September 2024 amendments to CFTC Regulation 4.7, which raised the qualified eligible person Portfolio Requirement thresholds from $2,000,000 to $4,000,000 for the Securities Portfolio Test and from $200,000 to $400,000 for the Initial Margin and Premiums Test. The amendment package was approved by the Commission on 11 September 2024, published at 89 FR 78814 on 27 September 2024, and is the first inflation adjustment to the QEP Portfolio Requirement since 1992.

Two frontier AI subjects tested by the RLB Specialist Panel produced confident, specific answers across seventeen distinct questions in this audit that the CFTC's own primary text directly contradicts. The failures cluster into four thematic groups: NPRM-stage and final-rule CPI-U buying-power figures invented or substituted; statutory thresholds and Source Credits misstated against the U.S. Code; Commission voting records misattributed (including a commissioner who had departed two years earlier); and Federal Register correction-record fields misreported by CFR Part, effective date, and title-line. Each finding is bound to verbatim regulator-issued primary source text.

For AI lab teams fielding frontier models into U.S. derivatives and asset-management deployments, the pattern signals systematic gaps in how the models handle statutory-figure quotation, regulatory-history reproduction, and rulemaking-record reproduction on technical regulatory documents.

Background: the September 2024 amendments

On 11 September 2024 the Commodity Futures Trading Commission voted to approve a final rule amending 17 CFR 4.7, the qualified eligible person regime applicable to commodity pool operators and commodity trading advisors. The rule was published at 89 FR 78814 on 27 September 2024.

The rule's two principal moving parts are (a) inflation adjustment of the Portfolio Requirement thresholds from the levels set in 1992 ($2,000,000 Securities Portfolio Test, $200,000 Initial Margin and Premiums Test) to the amended levels of $4,000,000 and $400,000 respectively; and (b) a series of operational amendments to the recordkeeping, disclosure, and reporting framework for 4.7-exempt CPOs and CTAs.

The amendment package is structurally important to the AI-lab audit lens for three reasons. First, the rulemaking record spans an NPRM stage (88 FR 70852, October 2023), a pre-print version, the final-rule pre-print, the published final rule, and a December 2024 Federal Register correction; this layered record creates several discrete reproduction tasks where the AI is asked to quote a specific figure or field.

Second, the rule explicitly anchors its inflation analysis to specific CPI-U reference months (February 2023 at NPRM stage; July 2024 at final-rule stage); each reference month yields a different buying-power figure, and any model that conflates them produces a wrong but plausible answer. Third, the rule sits inside a statutory framework (the Commodity Exchange Act, codified at 7 USC 6n and 1a(18)) whose Source Credits, recordkeeping provisions, and ECP-eligibility thresholds are themselves the subject of verbatim-quotation tasks in legal, compliance, and consulting deliverables.

The RLB Specialist Panel designed the questions in this audit to mirror how lawyers, compliance officers, fund administrators, financial advisers, and management consultants actually use AI on this practice area: drafting memos, populating registers, preparing testimony exhibits, drafting client deliverables, and verifying statutory and Federal Register citations. Each question is anchored to verbatim regulator-issued primary substrate.

When This Affects an AI Lab

The September 2024 amendments to CFTC Regulation 4.7 sit at the operational core of the CFTC's QEP regime for commodity pool operators and commodity trading advisors, and at the intersection of legal, compliance, fund-administration, and consulting work for the U.S. private-fund industry.

Users asking AI models about this rulemaking include: commodity-pool and commodity-derivatives lawyers at outside firms and in-house counsel; compliance officers at CPO, CTA, and FCM firms; fund administrators producing annual rule-change trackers; in-house and external auditors preparing audit working papers; financial advisers at wealth-management firms with commodity-pool exposure; and regulatory consultants supporting stakeholder-engagement appendices and client briefings.

Any frontier model deployed in an assistant, copilot, or document-query capacity in these contexts will routinely receive the question types tested in this audit: which statutory threshold applies to a CIV under 7 USC 1a(18)(B)(ii)(I), what the NPRM-stage and final-rule CPI-U buying-power figures are, how the Commission voted on the final rule, which trade associations submitted comment letters, what the statutory Source Credit for 7 USC 6n records, what the statutory retention period and registration-expiration date are under 7 USC 6n, and how the December 2024 Federal Register correction is indexed.

The downstream harms are concrete. A lawyer who copies an AI-stated ECP threshold of $5,000,000 or $25,000,000 into a partner-level memo introduces a two-hundred-fold or forty-fold misstatement of the actual $1,000,000,000 statutory threshold, with direct consequences for the client's counterparty-eligibility analysis. A fund administrator who logs the AI's stated CFR Parts and effective date on the 89 FR 96897 correction populates the annual tracker with incorrect operational data.

A consulting deliverable that recites approximately 40 comment letters where the final rule documents 8, and that names commenters absent from the regulator's footnote, can be falsified directly against the published record. For the lab, confident wrong outputs on authoritative regulatory text create misuse-claim exposure when enterprise customers act on model outputs in high-stakes contexts.

Aggregate impact

The 17 findings in this audit cluster into five thematic groups, each of which has direct AI-lab eval implications. The clusters point to a generation pattern where the model commits, with no hedging, to a verbatim-looking answer on a quotation, threshold, citation, or named-individual question, and where the model's answer diverges from the regulator's source text on a specific, testable fact. The substrate for every finding in this audit is regulator-issued primary text held by the RLB Specialist Panel; the model had access to that substrate at query time via search and retrieval.

NPRM-vs-final CPI-U figure errors (Findings 3, 4, 5, 7)

Statutory threshold and recordkeeping rule misstatements (Findings 1, 10, 11, 12, 15)

Source Credit and rulemaking-history fabrication (Findings 8, 9, 13, 14)

Commission vote and commenter-set misattribution (Findings 2, 6)

Federal Register correction-record errors (Findings 16, 17)
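
The grouping above can be checked mechanically. The following is a minimal sketch, assuming the finding numbers listed against each theme are complete; the helper name `check_partition` is hypothetical and is not part of the RLB methodology. It confirms that the five themes cover Findings 1 to 17 with no finding counted twice.

```python
# Theme-to-finding grouping exactly as listed in this audit.
THEMES = {
    "NPRM-vs-final CPI-U figure errors": [3, 4, 5, 7],
    "Statutory threshold and recordkeeping rule misstatements": [1, 10, 11, 12, 15],
    "Source Credit and rulemaking-history fabrication": [8, 9, 13, 14],
    "Commission vote and commenter-set misattribution": [2, 6],
    "Federal Register correction-record errors": [16, 17],
}


def check_partition(themes, total):
    """Return (missing, duplicated) finding numbers for a grouping of 1..total."""
    seen = []
    for findings in themes.values():
        seen.extend(findings)
    missing = sorted(set(range(1, total + 1)) - set(seen))
    duplicated = sorted({n for n in seen if seen.count(n) > 1})
    return missing, duplicated


missing, duplicated = check_partition(THEMES, total=17)
print("missing:", missing, "duplicated:", duplicated)
```

Two empty lists indicate that the themes partition the 17 findings cleanly.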

Two structural risk drivers compound across the five themes. First, the model is willing to construct numeric, structurally plausible answers (a CPI-U buying-power figure, a statutory threshold figure, an effective date) by analogy with known reference points rather than by retrieval of the regulator's actual text. Second, the model is willing to reconstruct historical or institutional records (Source Credit chains, Commission voting rosters, commenter lists) from general topical knowledge rather than from the source document. Both behaviours are silent in the deliverable; neither output signals to the user that the underlying claim has no basis in the regulator's published text.

For U.S. derivatives and asset-management enterprise deployments, these failure modes will recur on every CFTC rulemaking that (a) anchors quantitative analysis to specific reference months, (b) carries a multi-stage rulemaking record (NPRM pre-print, final-rule pre-print, published rule, Federal Register correction), and (c) sits inside a statutory framework whose Source Credits and codified thresholds are the subject of verbatim quotation tasks. All three characteristics are common across the CFTC, SEC, and other prudential and securities portfolios.

Per-finding analysis

Finding 1 . Misstated grandfather rule for existing QEP investors under prior thresholds

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q005-Opus47

Subject: Claude Opus 4.7 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies a misstated-rule failure on the existing-investor grandfathering provision of the September 2024 final-rule preamble. When asked to draft an advisory memo on what the amended Portfolio Requirement does to three existing QEPs who no longer meet the updated threshold, the subject model produced an answer turning on a 'pre-existing subscription agreement' carve-out theory and 'mechanical features of the existing subscription' analysis. Neither construct appears in the regulator's text.

The preamble states directly that a CPO or CTA is not required to redeem pool participations of, or terminate the advisory relationship with, a person who qualified as a QEP under the prior Portfolio Requirement but who does not meet the updated Portfolio Requirement. The gap implicates the generation layer's handling of grandfathering-rule queries where the correct answer is a direct rule recorded in the preamble; the model synthesised a plausible-looking carve-out by analogy with general transactional-law conventions.

Recommended eval probes: grandfathering-rule queries on rulemakings that document the rule directly in the preamble; advisory-memo drafting queries where the regulator's own text resolves the question.

Finding 2 . Misattributed Commission vote and named non-sitting commissioner

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q008-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies a misattribution failure on the Commission's voting record. When asked to draft a briefing memo on the September 2024 final rule including how each commissioner voted, the subject model named Chairman Behnam and Commissioners Johnson, Pham, and Mersinger as affirmative votes, named Commissioner Brian Quintenz as the fifth voter (with a self-flagged note about Quintenz's possible departure), and omitted Commissioner Christy Goldsmith Romero entirely. Quintenz departed the Commission in 2022. The final-rule Appendix 1 Voting Summary at 89 FR 78814 records Goldsmith Romero as the fifth affirmative voter.

The gap implicates the generation layer's handling of agency-composition queries at a specific historical date: the model has reconstructed the roster from training-era data rather than from the source record. Recommended eval probes: named-individual queries on agency voting records at specific dates; agency-roster queries on Commissioner sets at the date of a specific rulemaking action; Voting Summary reproduction queries that require pulling the Appendix at the cited page of the Federal Register.

Finding 3 . NPRM-stage CPI-U buying-power figures invented

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q011-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on NPRM-stage CPI-U buying-power figure reproduction. When asked to draft a technical note quantifying the inflation gap that the September 2024 amendments sought to address, with specific CPI-U-based buying-power calculations from the NPRM, the subject model produced figures of $4,070,000 and $407,000 for the $2 million and $200,000 thresholds at February 2023. The NPRM's verbatim text records $4,270,000 and $427,000 at the same reference month. The gap of $200,000 and $20,000 (roughly 5 percent) between the model's figures and the regulator's stated figures cannot be reconciled to any CPI-U release.

The gap implicates the generation layer's handling of regulator-issued quantitative figure queries: the model produced a coherent paragraph anchored on invented figures rather than retrieving the regulator's source figures. Recommended eval probes: regulator-issued figure-quotation queries on rules that document specific dollar buying-power calculations; CPI-U-based threshold-quotation queries at specific reference months; retrieval-grounded queries that require pulling the verbatim figure from the source document.
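
The size of the gap in this finding is simple arithmetic. The sketch below uses only the dollar figures reported in this finding; the function name `relative_gap_pct` and the dictionary keys are illustrative assumptions.

```python
def relative_gap_pct(model_figure, source_figure):
    """Absolute gap between model and source, as a percentage of the source."""
    return abs(model_figure - source_figure) / source_figure * 100


# Figures as reported in this finding (February 2023 reference month).
NPRM_FIGURES = {"securities_portfolio": 4_270_000, "initial_margin": 427_000}
MODEL_FIGURES = {"securities_portfolio": 4_070_000, "initial_margin": 407_000}

for test_name, source in NPRM_FIGURES.items():
    gap = relative_gap_pct(MODEL_FIGURES[test_name], source)
    print(f"{test_name}: gap of {gap:.2f}% against the NPRM figure")
```

Both tests show the same gap of about 4.68 percent of the NPRM figure, consistent with a single wrong figure that was then scaled by ten.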

Finding 4 . July 2024 CPI-U buying-power figures invented (Opus 4.7)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q016-Opus47

Subject: Claude Opus 4.7 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on final-rule CPI-U buying-power figure reproduction. When asked to draft a CCO briefing memo opening with the July 2024 CPI-U buying-power figures from the final-rule pre-print's Section II.A, with both updated figures quoted verbatim, the subject model produced $4,464,200 and $446,420 for the $2 million and $200,000 thresholds. The final-rule pre-print records $4,464,726 and $446,472 at the same reference month. The model has produced figures that are close to the source but not identical; the difference indicates near-extrapolation behaviour rather than retrieval.

The gap is interesting precisely because the model's answer is close enough to the regulator's text to suggest the model partially retrieved the figure and then rounded or substituted. Recommended eval probes: precise-figure verbatim-quotation queries where the regulator's text records a specific multi-digit figure; rounded-versus-verbatim figure-quotation queries; retrieval-grounded queries that require pulling the verbatim figure with full precision.
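
A rounded-versus-verbatim probe of the kind recommended here could grade answers along these lines. This is a minimal sketch, not the RLB grading method: the function names and the `near_miss_pct` cutoff of 0.1 percent are hypothetical values chosen for illustration.

```python
import re


def parse_dollars(text):
    """Parse a figure such as '$4,464,726' into an integer number of dollars."""
    return int(re.sub(r"[^\d]", "", text))


def grade_quotation(model_text, source_text, near_miss_pct=0.1):
    """Grade a quoted figure as 'exact', 'near_miss', or 'mismatch'.

    near_miss_pct is a hypothetical cutoff chosen for illustration only.
    """
    model, source = parse_dollars(model_text), parse_dollars(source_text)
    if model == source:
        return "exact"
    gap_pct = abs(model - source) / source * 100
    return "near_miss" if gap_pct <= near_miss_pct else "mismatch"


# Figures as reported in this finding (July 2024 reference month).
print(grade_quotation("$4,464,200", "$4,464,726"))
print(grade_quotation("$446,420", "$446,472"))
```

Separating near misses from outright mismatches keeps an almost-right figure from passing a verbatim-quotation test while still distinguishing it from a wholly invented one.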

Finding 5 . July 2024 CPI-U buying-power figures stated as outdated NPRM-era figures (Sonnet 4.6)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q016-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an outdated-fact failure on final-rule CPI-U buying-power figure reproduction. When asked to draft a briefing memo opening with the July 2024 CPI-U buying-power figures from the final-rule pre-print, the subject model produced $4,270,000 and $427,000 for the $2 million and $200,000 thresholds. Those are the NPRM-stage February 2023 figures, not the final-rule July 2024 figures (which are $4,464,726 and $446,472). The model has substituted the earlier-stage figures for the requested later-stage figures, presenting them as the July 2024 buying-power equivalents.

The gap implicates the generation layer's handling of multi-stage rulemaking records where the same conceptual question takes a different numeric answer at different stages: the model has selected the wrong stage's answer. Recommended eval probes: NPRM-versus-final-rule figure-quotation queries on rules that update the figure between stages; stage-specific buying-power queries where the model must distinguish the NPRM reference month from the final-rule reference month.
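
A stage-specific probe can distinguish a wrong-stage substitution from an invented figure. The sketch below assumes a small lookup keyed by rulemaking stage, populated with the Securities Portfolio Test figures reported in this audit; the stage keys and the function name `diagnose_stage` are hypothetical.

```python
# Securities Portfolio Test buying-power figures by stage, as reported in this audit.
STAGE_FIGURES = {
    "nprm_feb_2023": 4_270_000,
    "final_rule_jul_2024": 4_464_726,
}


def diagnose_stage(requested_stage, model_figure):
    """Classify a model figure against the figure for the requested stage."""
    if model_figure == STAGE_FIGURES[requested_stage]:
        return "correct"
    for stage, figure in STAGE_FIGURES.items():
        if figure == model_figure:
            # The figure is real but belongs to a different stage of the record.
            return f"wrong_stage:{stage}"
    return "unmatched"


print(diagnose_stage("final_rule_jul_2024", 4_270_000))
```

The answer in this finding is classified as a wrong-stage substitution, whereas the $4,070,000 figure from Finding 3 matches no stage at all.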

Finding 6 . Inflated comment-letter count and invented commenter names

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q017-Opus47

Subject: Claude Opus 4.7 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies a fabricated-fact failure on rulemaking-record commenter-set reproduction. When asked to draft a stakeholder-engagement appendix opening with the exact number of comment letters received on the NPRM and naming the trade-association commenters cited in the relevant footnote, the subject model reported approximately 40 comment letters and named SIFMA AMG, IAA, AIMA, MFA, ICI, NFA, and the American Bar Association Business Law Section's Committee on Derivatives and Futures Law. The final-rule pre-print documents receipt of eight comment letters, with the relevant footnote naming SIFMA AMG, IAA, AIMA, MFA, ICI, and NFA.

The model has inflated the count roughly five-fold and added a commenter, the American Bar Association committee, that does not appear in the footnote. The gap implicates the generation layer's handling of rulemaking-record-quotation queries where the correct answer is a specific count and a specific named-commenter list. Recommended eval probes: comment-letter-count queries on rulemakings where the count is recorded in the Background discussion; trade-association named-commenter queries on rulemaking footnotes; retrieval-grounded queries that require pulling the count and the named list from the source.
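
The count and named-commenter comparison reduces to a ratio and a set difference. The sketch below uses the count and names reported in this finding; treating "approximately 40" as exactly 40, and the function name `compare_commenters`, are assumptions for illustration.

```python
SOURCE_COUNT = 8
SOURCE_COMMENTERS = {"SIFMA AMG", "IAA", "AIMA", "MFA", "ICI", "NFA"}

MODEL_COUNT = 40  # the model said "approximately 40"; treated as 40 here
MODEL_COMMENTERS = SOURCE_COMMENTERS | {
    "ABA Business Law Section Committee on Derivatives and Futures Law"
}


def compare_commenters(model_names, source_names):
    """Return names the source does not support and names the model left out."""
    return {
        "unsupported": sorted(model_names - source_names),
        "omitted": sorted(source_names - model_names),
    }


result = compare_commenters(MODEL_COMMENTERS, SOURCE_COMMENTERS)
print(result["unsupported"], result["omitted"])
print("count inflation:", MODEL_COUNT / SOURCE_COUNT)
```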

Finding 7 . NPRM-stage CPI-U buying-power figures invented (verbatim-quote request)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q020-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on NPRM-stage CPI-U buying-power figure reproduction in a CCO briefing memo. When asked to quote verbatim the NPRM-era $2 million and $200,000 buying-power equivalents at February 2023, the subject model produced $4,070,000 and $407,000. The NPRM records $4,270,000 and $427,000 at the same reference month. The model has produced the same fabrication pattern as Finding 3 in a different deliverable.

The repetition suggests a generation-layer pattern rather than a single retrieval miss: when asked for the NPRM-stage figures, the model produces $4,070,000 and $407,000, which appear to be a stable hallucination across the question set. Recommended eval probes: regulator-issued figure-quotation queries where the model has previously produced a specific wrong figure; cross-prompt consistency queries on figure-quotation tasks; retrieval-grounded queries that require pulling the verbatim figure from the source.
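
A cross-prompt consistency probe can surface this kind of stable hallucination by grouping wrong answers by value. This is a minimal sketch using the two Sonnet 4.6 answers reported in this audit (questions Q011 and Q020); the function name `repeated_wrong_answers` is hypothetical.

```python
from collections import defaultdict

SOURCE_FIGURE = 4_270_000  # NPRM figure at February 2023, as reported in this audit

# (question id, model answer) pairs for the Securities Portfolio Test figure.
ANSWERS = [("Q011", 4_070_000), ("Q020", 4_070_000)]


def repeated_wrong_answers(answers, source_figure):
    """Map each wrong value to the questions where it appeared, keeping repeats only."""
    by_value = defaultdict(list)
    for question_id, value in answers:
        if value != source_figure:
            by_value[value].append(question_id)
    return {value: ids for value, ids in by_value.items() if len(ids) > 1}


print(repeated_wrong_answers(ANSWERS, SOURCE_FIGURE))
```

A wrong value that recurs across independent prompts is stronger evidence of a generation-layer pattern than a one-off retrieval miss.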

Finding 8 . Wrong Commission approval date in NPRM pre-print footer (Opus 4.7)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q021-Opus47

Subject: Claude Opus 4.7 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on NPRM-pre-print footer text reproduction. When asked to quote the recurring footer of the NPRM pre-print PDF that records the Commission approval date, the subject model reported 'Pre-Print Version - Commission approved on 10/2/2023'. The actual footer reads 'Pre-Print Version - Commission approved on 9/29/2023 (subject to technical corrections required for Federal Register publication)'. The model has conflated the 9/29/2023 (Friday) Commission approval date with the 10/2/2023 (Monday) open Commission meeting date at which the proposal was discussed.

The gap implicates the generation layer's handling of multi-date rulemaking records where the model must distinguish the formal approval date from related-event dates. Recommended eval probes: footer-text quotation queries on pre-print PDFs where the recorded date diverges from the publication or meeting date; date-distinguishing queries on rulemaking records that carry multiple related dates; retrieval-grounded queries that require pulling the verbatim footer text from the source PDF.
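
The weekday and offset claims for the two dates can be verified with the standard library. The sketch below hard-codes English weekday names to avoid locale-dependent output; the helper name `weekday_name` is an illustrative assumption.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_name(d):
    """Return the English weekday name for a date (Monday is index 0)."""
    return DAYS[d.weekday()]


approval = date(2023, 9, 29)  # date recorded in the pre-print footer
quoted = date(2023, 10, 2)    # date the model quoted

print(weekday_name(approval), weekday_name(quoted), (quoted - approval).days)
```

The two dates fall three days apart across a weekend, which is the kind of adjacency that makes the conflation plausible and hard to spot on review.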

Finding 9 . Wrong Commission approval date in NPRM pre-print footer (Sonnet 4.6)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q021-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on NPRM-pre-print footer text reproduction, paralleling Finding 8 on Opus 4.7. When asked to quote the recurring footer of the NPRM pre-print PDF that records the Commission approval date, the subject model reported 'Pre-Print Version - Commission approved on 10/2/2023'. The actual footer records 9/29/2023. The model has produced the same conflation pattern as Opus 4.7 on the same question, suggesting a generation-layer pattern where both subject models map the multi-date rulemaking record onto the salient open-meeting date rather than the formal approval date recorded in the footer.

Recommended eval probes: cross-model consistency queries on date-quotation tasks; multi-date rulemaking-record queries that require distinguishing the formal approval date from the open-meeting date; retrieval-grounded queries that require pulling the verbatim footer text.

Finding 10 . Wrong total-assets threshold in 7 USC 1a(18)(B)(ii)(I) for collective investment vehicles

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q024-Opus47

Subject: Claude Opus 4.7 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies a misstated-rule failure on the statutory total-assets threshold in 7 USC 1a(18)(B)(ii)(I) for ECP-qualifying collective investment vehicles. When asked to quote verbatim the in-the-aggregate total-assets threshold, the subject model reported $5,000,000. The statute records $1,000,000,000 (in the aggregate for grouped vehicles), anchored to the QEP, accredited investor, and qualified purchaser definitions in effect on December 21, 2000. The misstatement is by a factor of two hundred.

The gap implicates the generation layer's handling of high-magnitude statutory thresholds in counterparty-eligibility provisions: the model has substituted a different threshold (likely confused with the $5,000,000 ECP-individual threshold elsewhere in 1a(18)) for the specific CIV threshold. Recommended eval probes: high-magnitude statutory-threshold queries where the model must distinguish multiple thresholds within the same statutory section; ECP-eligibility queries on collective investment vehicles; retrieval-grounded queries that require pulling the verbatim threshold figure from the U.S. Code.
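
The misstatement factor is the ratio of the statutory figure to the reported figure. The sketch below applies it to the $5,000,000 answer in this finding and to the $25,000,000 answer reported elsewhere in this audit; the function name `misstatement_factor` is hypothetical.

```python
# Threshold this audit attributes to 7 USC 1a(18)(B)(ii)(I).
STATUTORY_THRESHOLD = 1_000_000_000


def misstatement_factor(reported, actual=STATUTORY_THRESHOLD):
    """How many times larger the statutory figure is than the reported one."""
    return actual / reported


print(misstatement_factor(5_000_000))   # 200.0
print(misstatement_factor(25_000_000))  # 40.0
```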

Finding 11 . Wrong total-assets threshold and definition reference date in 7 USC 1a(18)(B)(ii)(I)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q024-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on the statutory total-assets threshold in 7 USC 1a(18)(B)(ii)(I) and the statutory definitional anchor date. When asked to quote verbatim the threshold and the anchor date, the subject model reported $25,000,000 and omitted the December 21, 2000 anchor. The statute records $1,000,000,000 (in the aggregate for grouped vehicles), anchored to definitions in effect on December 21, 2000. The misstatement is by a factor of forty on the threshold, and the missing anchor date allows the AI's reader to substitute current-day QEP, accredited investor, or qualified purchaser definitions for the statute's intended fixed-date definitions.

The gap implicates the generation layer's handling of statutory provisions with fixed-date definitional anchors. Recommended eval probes: fixed-date definitional-anchor queries on statutory provisions; ECP-eligibility threshold queries; retrieval-grounded queries that require pulling both the threshold and the anchor date.

Finding 12 . Wrong statutory recordkeeping period under 7 USC 6n(3)(A)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q027-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies a misstated-rule failure on the statutory books-and-records retention period in 7 USC 6n(3)(A). When asked to quote verbatim the statutory retention period and inspection rights, the subject model reported five years (deriving the period from CFTC Regulation 1.31 rather than the statute). 7 USC 6n(3)(A) records a three-year minimum, or longer if the Commission so directs. The model has conflated the statutory minimum with the regulation-implemented period. The gap implicates the generation layer's handling of statute-versus-regulation distinctions on technical compliance provisions: the model is willing to quote a regulation-derived figure as if it were the statutory figure.

Recommended eval probes: statute-versus-regulation distinguishing queries on recordkeeping provisions; minimum-versus-implemented retention period queries; retrieval-grounded queries that require pulling the verbatim statutory text.

Finding 13 . Wrong Source Credit history for 7 USC 6n

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q028-Opus47

Subject: Claude Opus 4.7 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on the Source Credit reproduction for 7 USC 6n. When asked to reproduce verbatim the statute's Source Credit, the subject model produced a chain including Pub. L. 102-546 (1992), Pub. L. 106-554 (2000), and Pub. L. 111-203 (2010), and omitted Pub. L. 97-444 (1983). The actual Source Credit reads: '(Sept. 21, 1922, ch. 369, §4n, as added Pub. L. 93-463, title II, §205(a), Oct. 23, 1974, 88 Stat. 1398; amended Pub. L. 95-405, §9, Sept. 30, 1978, 92 Stat. 870; Pub. L. 97-444, title II, §213, Jan. 11, 1983, 96 Stat. 2305.)'

The model has reconstructed the chain from topical CPO/CTA regulatory-history knowledge rather than retrieved the codified Source Credit. The gap implicates the generation layer's handling of treatise-grade statutory-history reproduction tasks. Recommended eval probes: Source Credit verbatim-reproduction queries on U.S. Code provisions; statutory-history chain queries where general topical knowledge diverges from the codified chain; retrieval-grounded queries that require pulling the Source Credit from the U.S. Code itself.

Finding 14 . Wrong Source Credit history for 7 USC 6n (Sonnet 4.6)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q028-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on the Source Credit reproduction for 7 USC 6n, paralleling Finding 13 on Opus 4.7. When asked to reproduce verbatim the statute's Source Credit, the subject model produced an extended amendment list including the 1983 Futures Trading Act, the 1992 Futures Trading Practices Act, the 2000 Commodity Futures Modernization Act, and the 2010 Dodd-Frank Act. The actual Source Credit lists only Pub. L. 95-405 (1978) and Pub. L. 97-444 (1983) as the amendment chain through to the codified text.

The other public laws named did amend other provisions of the Commodity Exchange Act but not 7 USC 6n. The model has substituted topical regulatory-history knowledge for the actual Source Credit. The same generation pattern as Opus 4.7 on the same task. Recommended eval probes: cross-model consistency queries on Source Credit reproduction; statutory-history-versus-topical-history distinguishing queries; retrieval-grounded queries that require pulling the codified Source Credit.
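
A Source Credit probe can flag public-law citations that the codified chain does not contain. The sketch below is illustrative only: `SOURCE_CHAIN` lists the public laws from the Source Credit quoted in Finding 13, `MODEL_CHAIN` lists only the public laws that Finding 13 reports in the Opus 4.7 output (the full model output is not reproduced here), and the function names are hypothetical.

```python
import re

# Public laws in the codified Source Credit for 7 USC 6n, per Finding 13.
SOURCE_CHAIN = "Pub. L. 93-463; Pub. L. 95-405; Pub. L. 97-444"

# Public laws Finding 13 reports in the model's chain (partial, illustrative).
MODEL_CHAIN = "Pub. L. 102-546; Pub. L. 106-554; Pub. L. 111-203"

PUB_LAW = re.compile(r"Pub\.\s*L\.\s*(\d+)-(\d+)")


def public_laws(text):
    """Extract public-law numbers such as '97-444' from citation text."""
    return {f"{congress}-{number}" for congress, number in PUB_LAW.findall(text)}


def unsupported_citations(model_text, source_text):
    """Return public laws cited by the model that the source text does not list."""
    return sorted(public_laws(model_text) - public_laws(source_text))


print(unsupported_citations(MODEL_CHAIN, SOURCE_CHAIN))
```

The same check applies to Finding 14 once the Acts the model named are mapped to their public-law numbers.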

Finding 15. Wrong statutory registration-expiration date in 7 USC 6n(2)

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q029-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies an inference drift failure on the statutory registration-expiration date in 7 USC 6n(2). When asked to quote verbatim the statutory expiration rule, the subject model reported the 31st day of October, whereas 7 USC 6n(2) records the 30th day of June. The model has misstated the statutory date by four months. The gap implicates the generation layer's handling of statutory-date provisions where the date is specific and operational: a registration-renewal calendar memo built on the AI's stated October 31 date would misalign the firm's renewal cycle by four months, with direct registration-lapse exposure.

Recommended eval probes: statutory-date verbatim-quotation queries on registration and renewal provisions; statutory-versus-administrative-date distinguishing queries; retrieval-grounded queries that require pulling the statutory date from the U.S. Code.
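A statutory-date probe can be scored mechanically by parsing the day-and-month phrase out of both the reference and the model's answer and measuring the gap. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the parser handles only the "Nth day of Month" phrasing used in this finding, and the calendar year is an arbitrary non-leap placeholder chosen so that the two dates can be subtracted.

```python
import re
from datetime import date

MONTHS = {
    name: index
    for index, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches phrases such as "30th day of June".
DATE_PHRASE = re.compile(r"(\d{1,2})(?:st|nd|rd|th) day of ([A-Z][a-z]+)")


def parse_statutory_date(text, year=2025):
    """Return the first 'Nth day of Month' phrase as a date, or None."""
    match = DATE_PHRASE.search(text)
    if match is None or match.group(2) not in MONTHS:
        return None
    return date(year, MONTHS[match.group(2)], int(match.group(1)))


def date_gap_days(reference, answer, year=2025):
    """Days by which the answer's date trails (+) or leads (-) the reference."""
    ref = parse_statutory_date(reference, year)
    ans = parse_statutory_date(answer, year)
    if ref is None or ans is None:
        return None  # one side has no parseable date; score manually
    return (ans - ref).days


if __name__ == "__main__":
    print(date_gap_days("the 30th day of June", "the 31st day of October"))
```

Returning None when either side fails to parse keeps an unparseable answer from being scored as a match.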

Finding 16. Wrong CFR Parts on the 89 FR 96897 correction index record

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q031-Opus47

Subject: Claude Opus 4.7 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies a misstated-rule failure on the December 2024 Federal Register correction index record. When asked to reproduce verbatim the CFTC Final Rules 2024 index-page record of the 89 FR 96897 correction, the subject model reported affected CFR Parts as 17 CFR Parts 37, 38, and 40 and the effective date as 12/6/2024. The index records 17 CFR Part 40 only and an effective date of 12/9/2024 (Monday). The model has overstated the CFR-Part scope by two parts and misaligned the effective date by three days (treating the publication date as the effective date).

The gap implicates the generation layer's handling of Federal Register index-record reproduction tasks where multiple fields must be quoted in concert. Recommended eval probes: Federal Register correction-record verbatim-reproduction queries; multi-field index-entry queries where the model must quote publication date, effective date, CFR Parts, and title-line in concert; retrieval-grounded queries that require pulling the index entry from the Federal Register.
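A multi-field probe of this kind can be scored field by field rather than as a single pass/fail. The sketch below is one possible shape, with hypothetical class and function names; the reference values are the ones stated in Findings 16 and 17, and the model record carries the reference title only because this finding does not report the title the model gave (an assumption made so the example runs).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class IndexRecord:
    title: str
    cfr_parts: frozenset  # part numbers under Title 17
    effective: date


def compare_records(reference, answer):
    """Return a per-field delta between the index record and the answer."""
    return {
        "title_matches": answer.title == reference.title,
        "extra_parts": sorted(answer.cfr_parts - reference.cfr_parts),
        "missing_parts": sorted(reference.cfr_parts - answer.cfr_parts),
        "effective_offset_days": (answer.effective - reference.effective).days,
    }


# Values as stated in the findings on the 89 FR 96897 index record.
REFERENCE = IndexRecord(
    title="Provisions Common to Registered Entities; Correction",
    cfr_parts=frozenset({40}),
    effective=date(2024, 12, 9),
)

# Model answer from this finding; the title is assumed equal to the reference.
MODEL = IndexRecord(
    title=REFERENCE.title,
    cfr_parts=frozenset({37, 38, 40}),
    effective=date(2024, 12, 6),
)

if __name__ == "__main__":
    print(compare_records(REFERENCE, MODEL))
```

Scoring each field separately shows that the CFR-Part overstatement and the effective-date shift are independent errors, which matters when deciding which retrieval field to harden first.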

Finding 17. Wrong CFR Part and title on the 89 FR 96897 correction index record

Citation: RLB-H-US-CFTC-CPO-CTA-REGULATION-4-7-QEP-THRESHOLDS-2024-Q031-Sonnet46

Subject: Claude Sonnet 4.6 (web search on)

Substrate: verbatim source bound by RLB Specialist Panel

This finding identifies a misstated-rule failure on the 89 FR 96897 correction record where the model has misattributed the correction across two distinct CFTC rulemakings. When asked to reproduce verbatim the index-page record, the subject model reported affected CFR Part as 17 CFR Part 4 and the title-line as a correction to the QEP-definition rulemaking. The index records 17 CFR Part 40 and the title 'Provisions Common to Registered Entities; Correction'.

The model has read the index entry as a correction to the Reg 4.7 QEP rulemaking because the audit topic is Reg 4.7, when the index records the correction as applying to a different rulemaking (Provisions Common to Registered Entities). The gap implicates the generation layer's handling of topical-context bias on Federal Register index-record reproduction. Recommended eval probes: topical-bias-resistance queries on Federal Register index reproduction; rulemaking-attribution queries where the model must distinguish substantively related but procedurally distinct rulemakings; retrieval-grounded queries that require pulling the index entry without topical inference.

What Your Team Should Do

Work on the failure modes surfaced in this audit should focus on four areas: three expansions of evaluation coverage and one deployment-time mitigation.

First, expand evaluation coverage of multi-stage rulemaking-record reproduction tasks. The CFTC Regulation 4.7 record spans an NPRM at 88 FR 70852, a pre-print version of the NPRM, a final-rule pre-print, the published final rule at 89 FR 78814, and a December 2024 correction at 89 FR 96897. Each stage carries discrete fields a practitioner might be asked to quote: a CPI-U reference month, a buying-power figure, a comment-letter count, a Commission voting roster, a footer date, an effective date, an affected CFR Part.

The findings in this audit show the model is willing to fabricate or substitute on each of these fields where the correct answer is a quotation from the source document. Eval probes should include (a) NPRM-stage versus final-rule figure-quotation queries on rules that update the figure between stages; (b) Federal Register correction-record queries where the correction title diverges from the substantive rulemaking topic; (c) NPRM-pre-print footer-text quotation queries where the footer date diverges from the published Federal Register date.

Second, expand evaluation coverage of verbatim statutory-text reproduction. The findings in this audit show the model misstating the 7 USC 1a(18)(B)(ii)(I) total-assets threshold by factors of forty and two hundred, misstating the 7 USC 6n(3)(A) recordkeeping retention period (three years in the statute, five in the model's answer), misstating the 7 USC 6n(2) registration expiration date (June 30 in the statute, October 31 in the model's answer), and misstating the 7 USC 6n Source Credit (substituting topical regulatory-history knowledge for the codified amendment chain).

Eval probes should include (a) verbatim-quotation queries on statutory thresholds where the correct answer is a specific dollar figure; (b) verbatim-quotation queries on statutory date provisions; (c) verbatim-quotation queries on statutory Source Credits where the correct answer is a specific Pub. L. chain.
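For dollar-figure probes, a scorer can report the size of a misstatement as a ratio instead of a bare mismatch, which is how a "factor of forty" error becomes visible in aggregate results. The following is a minimal sketch with hypothetical function names. The amounts in the usage lines are placeholders chosen for the arithmetic and are not the statutory figures.

```python
import re
from fractions import Fraction

# Matches a dollar amount written with digits and optional commas.
DOLLARS = re.compile(r"\$\s*(\d[\d,]*)")


def first_dollar_amount(text):
    """Return the first dollar amount in the text as an int, or None."""
    match = DOLLARS.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def misstatement_factor(reference, answer):
    """Ratio of the larger figure to the smaller; 1 means the figures agree."""
    ref = first_dollar_amount(reference)
    ans = first_dollar_amount(answer)
    if not ref or not ans:
        return None  # missing or zero figure; score manually
    return Fraction(max(ref, ans), min(ref, ans))


if __name__ == "__main__":
    # Placeholder amounts (assumption), not the thresholds in 7 USC 1a(18).
    print(misstatement_factor("assets exceeding $1,000", "assets exceeding $40,000"))
    print(misstatement_factor("assets exceeding $1,000", "assets exceeding $1,000"))
```

Using an exact Fraction avoids floating-point noise when results are binned by order of magnitude.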

Third, expand evaluation coverage of institutional-record reproduction. The findings in this audit show the model misattributing the Commission's vote on the final rule (naming a departed commissioner, omitting the actual fifth voter), and inflating the comment-letter count from eight to approximately forty (with an extended commenter list).

Eval probes should include (a) named-individual queries on agency voting records where the correct answer is the Appendix 1 Voting Summary; (b) commenter-set queries on rulemaking records where the correct answer is the final rule's footnote enumeration; (c) institutional-composition queries on agency rosters where the correct answer is the published record at the relevant date.

Fourth, the deployment-time mitigation is to treat statutory-quotation, figure-quotation, and institutional-record answers as high-risk outputs that should either (a) cite a verifiable paragraph of the regulator's own text or the U.S. Code, or (b) decline to commit. Every failure mode in this audit would have been prevented by a retrieval-grounded answer that cited the relevant paragraph of the final rule, the NPRM, the U.S. Code, or the Federal Register index; every failure mode flowed from a generation behaviour that produced a confident answer by analogy with general knowledge rather than from the source text.
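The cite-or-decline behaviour described above can be approximated with a simple output gate: release an answer only when the passage it cites appears verbatim in the retrieved source, and otherwise return a refusal. The sketch below is illustrative only. The function names and the refusal string are hypothetical, and the source string is an illustrative rendering assembled from the title and CFR Part reported in Finding 17, not a copy of the index page.

```python
import re

DECLINE = "Unable to confirm this from the source text."


def normalize(text):
    """Collapse runs of whitespace so line wrapping does not block a match."""
    return re.sub(r"\s+", " ", text).strip()


def is_grounded(passage, source_text):
    """True if the passage occurs verbatim in the source on word boundaries."""
    passage = normalize(passage)
    if not passage:
        return False
    # Boundary checks stop "17 CFR Part 4" from matching "17 CFR Part 40".
    pattern = r"(?<!\w)" + re.escape(passage) + r"(?!\w)"
    return re.search(pattern, normalize(source_text)) is not None


def gate(answer, cited_passage, source_text):
    """Return the answer if its cited passage is grounded, else decline."""
    return answer if is_grounded(cited_passage, source_text) else DECLINE


# Illustrative rendering of the index entry (assumption, not the page text).
SOURCE = "Provisions Common to Registered Entities; Correction\n17 CFR Part 40"

if __name__ == "__main__":
    print(gate("The correction affects 17 CFR Part 40.", "17 CFR Part 40", SOURCE))
    print(gate("The correction affects 17 CFR Part 4.", "17 CFR Part 4", SOURCE))
```

The word-boundary check is the design choice worth noting: a plain substring test would accept "17 CFR Part 4" against a source that says "17 CFR Part 40", which is exactly the confusion recorded in Finding 17.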

How RLB Can Help

RegLeg is positioned to support labs working on the failure modes surfaced in this audit. Our research operates on the boundary where labs' enterprise customers are most exposed: technical regulatory documents where the model is asked to quote statutory text, reproduce regulator-issued figures, and characterise rulemaking-record fields. For labs, our findings supply ready-made eval scaffolding (the question types, the regulator-text anchors, the expected failure modes) that can be adapted into internal benchmark sets.

Where labs are interested, we are open to engagement on (a) targeted evaluation set development for specific regulator portfolios (CFTC, SEC, FCA, MAS, EU prudential authorities); (b) failure-mode taxonomy work on statutory-figure quotation, regulatory-history reproduction, and institutional-record reproduction; and (c) eval-design consultation on retrieval-grounded answer behaviour for authoritative technical documents. We can also support post-audit communication with affected enterprise customers, where the lab has decided to surface a known limitation rather than allow customer-side discovery.

Practitioners and enterprise teams using AI tools on CFTC and U.S. derivatives work can consult our published Hallucination Research for a free pre-flight check on AI-assisted regulatory research, identifying the question types and instrument areas where current models have demonstrably misstated the rules.


Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.