AI Labs · Last updated 7 Jun 2026 · methodology v2.3 · Hallucination Register

Consumer Duty Hallucination Report: frontier AI models

Alert: Frontier AI models misread FCA Consumer Duty (PS22/9)

Two frontier AI models running with web search enabled produced confidently wrong reconstructions of the UK Financial Conduct Authority's Consumer Duty (PS22/9 and PRIN 2A), the conduct framework governing how authorised firms must act to deliver good outcomes for retail customers. The RegLeg Brief (RLB) Specialist Panel tested both models across the Duty's foreseeable-harm provision, fair-value assessment expectations, and scope exclusions, and documents eleven findings in which the models added requirements the FCA's text does not contain or restated settled rules with new conditions attached.

Claude Opus 4.7, asked whether the Consumer Duty applies to group insurance distribution, asserted that "the FCA addressed this in further consultation (CP23/something on group insurance practices) and confirmed that firms manufacturing/distributing policies where individual retail beneficiaries are protected fall within scope." PRIN 2A says the opposite: the Consumer Duty "does not apply to ... activities connected to the distribution of group insurance policies or the extension of these policies to new members." The cited consultation paper number is itself a placeholder, "CP23/something", that the regulator never issued.

Claude Sonnet 4.6, asked about fair-value methodology, wrote that the FCA "does expect firms to go beyond qualitative description and provide substantiated comparisons" for non-monetary costs and benefits. FG22/5 says the precise opposite: "The FCA does not expect firms to quantify non-monetary costs and benefits as part of its fair value assessment process, but firms should undertake some form of qualitative assessment." The model inverted the regulator's stated position on what kind of analysis the Duty requires.

A compliance officer at a UK bank, insurer, or investment platform relying on either output would build a Consumer Duty programme around requirements the FCA never imposed, and would route those requirements through internal governance papers, board minutes, and supervisor-facing disclosures. That is the failure mode these findings document.

Executive summary

This paper presents findings from a structured evaluation of two frontier AI models, Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search, against the Financial Conduct Authority's Consumer Duty rules, as set out in PS22/9 and PRIN 2A. Across fifteen findings, both models produced responses containing factual errors, omissions, and unsupported additions relative to the FCA's published regulatory text. The dominant error pattern is confident elaboration beyond what the regulator's text permits: the models added conditions, caveats, and procedural requirements that are not present in the FCA's rules, and in other cases declined to provide information that the regulator has published clearly. Every cited source across both models was independently assessed and found to be either non-authoritative or presented in a context that did not support the claim it was used to justify. These results indicate that current models, even with live web retrieval, show systematic weaknesses on regulatory content characterised by precise defined terms, recent amendments, and technical cross-references.

Findings — impact summary

This is the consolidated view of findings.

  1. Finding on 'Q003 Probe' for Claude Opus 4.7 with web search (Citation ID: RLB-H-GB-FCA-CONSUMER-DUTY-PS22-9-Q003-Opus47)

    This finding implicates the model's handling of qualified regulatory rules: the FCA's foreseeable harm provision is a single-condition safe harbour ('reasonably believes'), but the model reconstructed it as a multi-factor compliance test ('good faith', 'supported understanding', 'avoided firm-caused harm', 'otherwise complied with the Duty'). The model dropped the qualifier 'reasonably believes' and inflated the test in the direction of a more demanding standard. The retrieval layer did not correct this; the cited FCA Handbook URL did not anchor the response to the actual rule text.

    For a compliance-eval probe, this is a high-priority candidate: pair a rule that turns on a single test with a question that invites a multi-factor answer, and watch whether the model preserves the regulator's structure or reconstructs it.

  2. Claude Opus 4.7 with web search
  3. Finding on 'Q013 Probe' for Claude Opus 4.7 with web search (Citation ID: RLB-H-GB-FCA-CONSUMER-DUTY-PS22-9-Q013-Opus47)

    This finding implicates the model's temporal reasoning on regulatory events: it split a single March 2025 announcement (FS25/2) into two events across April and August 2025. This suggests the model had partial training coverage of the FS25/2 publication and filled the gaps with invented dates. The fabricated August 2025 tranche is particularly notable because it post-dates the model's likely training window: this may be the model generating a plausible continuation of a partial knowledge record rather than retrieving a real event.

    The same fabricated timeline appears under a differently framed question (Finding 9), which indicates a persistent internal model representation rather than a random generation error. Correction pairs targeting FS25/2 directly would be the most efficient remediation.

  4. Claude Opus 4.7 with web search
  5. Claude Sonnet 4.6 with web search
  6. Finding on 'Q007 Probe' for Claude Sonnet 4.6 with web search (Citation ID: RLB-H-GB-FCA-CONSUMER-DUTY-PS22-9-Q007-Sonnet46)

    This finding implicates the model's cross-referencing between binding rules and non-binding guidance within the FCA Handbook. The model cited a specific rule reference (PRIN 2A.5.10R) as the basis for a testing requirement that actually appears in FG22/5 guidance. This is a rule/guidance conflation error: attributing the normative force of a binding rule to a provision that is guidance-level. This class of error is particularly impactful for compliance users who need to distinguish what they must do from what the FCA recommends.

    A targeted eval checking whether the model correctly attributes 'R', 'G', and 'E' provisions to the right normative level would surface this pattern systematically across the FCA Handbook and analogous structured rulebooks at other regulators.

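The targeted eval suggested in the Q007 finding, checking that 'R', 'G', and 'E' provisions are attributed to the right normative level, could start from something like the sketch below. It is an illustration only, not the Panel's method. The regular expression, the function names, and the reviewer-built answer key are all assumptions, and the pattern covers only references shaped like "PRIN 2A.5.10R".

```python
import re

# FCA Handbook suffix letters and the normative level they signal.
# Only R, G and E are handled, matching the three named in the finding.
LEVELS = {"R": "rule", "G": "guidance", "E": "evidential provision"}

# Assumed reference shape: sourcebook code, dotted paragraph number, suffix.
# Example: "PRIN 2A.5.10R" -> ("PRIN", "2A.5.10", "R").
PROVISION_RE = re.compile(
    r"\b([A-Z]{2,6})\s+(\d+[A-Z]?(?:\.\d+[A-Z]?)+)([RGE])\b"
)


def extract_provisions(text):
    """Return (reference, level) pairs for each Handbook-style citation."""
    return [
        (f"{m.group(1)} {m.group(2)}{m.group(3)}", LEVELS[m.group(3)])
        for m in PROVISION_RE.finditer(text)
    ]


def misattributed(text, expected_levels):
    """List citations whose suffix disagrees with a reviewer's answer key.

    expected_levels is a hypothetical mapping from a provision stem
    (no suffix) to the level a reviewer recorded from the regulator's
    text. Stems missing from the key are skipped, not guessed.
    """
    errors = []
    for m in PROVISION_RE.finditer(text):
        stem = f"{m.group(1)} {m.group(2)}"
        cited = LEVELS[m.group(3)]
        expected = expected_levels.get(stem)
        if expected is not None and expected != cited:
            errors.append((stem + m.group(3), cited, expected))
    return errors
```

Running `extract_provisions` on a model response that cites PRIN 2A.5.10R reports that the response claims rule-level force for that reference. Whether the claim is right still depends on an answer key built by hand from the regulator's own text, and on a separate check that the cited provision actually contains the requirement attributed to it.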

Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.
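Part of that comparison can be mechanised for rules that turn on a single test, such as the foreseeable-harm provision in the Q003 finding. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the added-condition phrases must be supplied by a reviewer, and plain substring matching will miss paraphrases.

```python
def check_single_condition(answer, required_qualifier, added_conditions):
    """Compare a model answer against a single-condition rule.

    required_qualifier: the phrase the rule turns on.
    added_conditions: phrases a reviewer has identified as absent
        from the regulator's text.
    Matching is a case-insensitive substring test only.
    """
    lowered = answer.lower()
    return {
        "qualifier_preserved": required_qualifier.lower() in lowered,
        "added_conditions": [
            c for c in added_conditions if c.lower() in lowered
        ],
    }


# Phrases taken from the Q003 finding; the answer string is a made-up
# example for illustration, not the model's recorded output.
result = check_single_condition(
    answer="The firm is protected if it acted in good faith and "
           "otherwise complied with the Duty.",
    required_qualifier="reasonably believes",
    added_conditions=[
        "good faith",
        "supported understanding",
        "avoided firm-caused harm",
        "otherwise complied with the Duty",
    ],
)
```

A response that drops the qualifier and picks up reviewer-listed conditions, as in this made-up example, is a candidate for manual review rather than an automatic fail.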