AI Labs · published 2026-05-26 · methodology v2.1

Consumer Duty Hallucination Report: Claude Opus 4.7 and Claude Sonnet 4.6

This paper presents findings from a structured evaluation of two frontier AI models — Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search — against the Financial Conduct Authority's Consumer Duty rules, as set out in PS22/9 and PRIN 2A. Across fifteen findings, both models produced responses containing factual errors, omissions, and unsupported additions relative to the FCA's published regulatory text. The dominant error pattern is confident elaboration beyond what the regulator's text permits: models added conditions, caveats, and procedural requirements that are not present in the FCA's rules, while in other cases declining to provide information that the regulator has published clearly. Every cited source across both models was independently assessed and found to be either non-authoritative or presented in a context that did not support the claim it was used to justify. These results are a signal that current models — even with live web retrieval — show systematic weaknesses on regulatory content characterised by precise defined terms, recent amendments, and technical cross-references.

Why this affects AI Labs

The FCA's Consumer Duty framework is one of the most operationally significant pieces of UK financial regulation for firms serving retail customers. Legal, compliance, and financial-services teams at banks, insurers, investment platforms, fintechs, and regtechs routinely use frontier AI models to research their obligations under Consumer Duty — asking questions about scope, defined terms, the four outcomes framework, fair value requirements, and the boundary between binding rules and non-binding guidance. Any model deployed in those contexts, whether through a direct API integration, a co-pilot product, or an enterprise assistant, will be asked exactly the kinds of questions tested in this evaluation.

The downstream harms from errors in this domain are concrete. A compliance officer who acts on a model's incorrect description of the "retail customer" threshold for charities, or who takes a model's fabricated procedural requirement as authoritative, may build a compliance programme around a rule that does not exist in the FCA's text. Firms that rely on AI-assisted regulatory summaries for Consumer Duty implementation face regulatory risk if those summaries import errors silently; the FCA has stated explicitly that firms cannot outsource their regulatory obligations. For a lab, the exposure extends further: if customers act on confidently-presented incorrect outputs and then face regulatory action or financial loss, the lab's documentation of model limitations becomes material. Errors that embed plausibly-presented fabrications — rather than obvious failures — are the ones that reach this downstream harm threshold.

This regulation is a structurally demanding surface for models. Consumer Duty spans a primary Principle (Principle 12), a detailed PRIN 2A rulebook chapter with multiple sub-chapters containing binding rules ("R"), guidance ("G"), and evidential provisions ("E"), and a body of finalised guidance (FG22/5) that supplements but does not override the rules. Defined terms such as "retail customer", "foreseeable harm", and "consumer understanding" have specific regulatory meanings that differ from ordinary language usage. The framework was finalised in July 2022 with implementation staggered across 2023–2024, meaning models trained before those dates, or with patchy post-2022 coverage, will fill gaps with plausible-sounding inference. The combination of technical defined terms, recent layered amendments, and a distinction between binding rules and guidance that is marked typographically in the Handbook but invisible in plain prose makes this regulation a high-probability hallucination surface across model generations.
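The typographic marking described above can be made explicit in a few lines. The following is a minimal sketch, assuming provision references follow a "module, dotted number, status letter" shape; the regular expression, the function name `provision_status`, and the sample reference strings are illustrative assumptions rather than anything taken from the evaluation, and any real reference should be checked against the Handbook itself.

```python
import re
from typing import Optional

# Status letters named in the text above: "R", "G" and "E".
STATUS_BY_SUFFIX = {
    "R": "binding rule",
    "G": "guidance",
    "E": "evidential provision",
}

# Assumed reference shape: module code, dotted provision number, status letter.
# Example of the shape only: "PRIN 2A.2.8R".
PROVISION_RE = re.compile(
    r"^(?P<module>[A-Z]+)\s+(?P<number>\d+[A-Z]?(?:\.\d+)+)\s*(?P<suffix>[RGE])$"
)


def provision_status(reference: str) -> Optional[str]:
    """Return the status implied by the trailing letter, or None if unparseable."""
    match = PROVISION_RE.match(reference.strip())
    if match is None:
        return None
    return STATUS_BY_SUFFIX[match.group("suffix")]
```

A reference with no status letter, or a document identifier such as FG22/5, returns `None`, which is the point: once a provision is paraphrased into plain prose, the letter that carries its legal status is gone.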

Aggregate impact

Per-model summary:

Model	Configuration	Findings	Dominant failure pattern
Claude Opus 4.7	With web search	8	Confident elaboration beyond the regulator's text — adding conditions, procedural steps, and qualifications that appear reasonable but are not present in the FCA's published rules. Five of eight findings involve fabricated factual content.
Claude Sonnet 4.6	With web search	7	A mix of over-elaboration and under-disclosure: in some cases the model added unsupported requirements; in others it declined to provide information that the regulator has published clearly, citing an inability to find a verified source. Two findings involve evasion responses where the published answer was accessible.

Claude Opus 4.7 with web search produced a consistent pattern of response augmentation: when the FCA's text provides a clean, unqualified rule, the model tended to reconstruct a more elaborate or hedged version — adding conditions and caveats that would seem plausible to a general reader but that the regulator's text does not support. This is most visible in findings relating to the foreseeable harm provision, the scope of the fair value assessment methodology, and the group insurance exclusion. The model appears to be generating regulatory-sounding completions from inference rather than grounding its responses in retrieved text, even when web retrieval was active. Citation behaviour reinforces this: sources cited were consistently non-authoritative or used out of context, suggesting the model is selecting plausible-seeming URLs independently of whether those sources contain the claimed content.

Claude Sonnet 4.6 with web search shows a different surface: it is more likely to produce evasion responses on recent regulatory developments — particularly FCA feedback statements and withdrawal notices from 2025 — while also generating unsupported augmentations on definitional questions. The contrast between the two models is instructive for an alignment team: Opus 4.7 with web search is more willing to commit to a specific (often wrong) answer; Sonnet 4.6 with web search is more likely to hedge or decline when recent content is involved, but still produces augmented responses on older settled provisions. Both models shared a structural weakness in citation generation — neither model's web retrieval step produced sources that independently supported the response content. This suggests that the web-search-augmented configuration does not reliably anchor content generation to retrieved documents: the retrieval and content generation steps appear to be operating with some independence, with fabricated or pretextual citations appended to responses that were already constructed from training inference.

Findings

15 findings in this case study: eight for Claude Opus 4.7 with web search and seven for Claude Sonnet 4.6 with web search.

What your team should do

The most immediate eval target this evaluation identifies is the boundary between binding rules and non-binding guidance in the FCA Handbook. Consumer Duty spans both a rulebook (PRIN 2A, with provisions marked "R", "G", and "E" in the Handbook) and a separate finalised guidance document (FG22/5). Models consistently failed to maintain this distinction, presenting guidance-level recommendations as rule requirements, or vice versa. A targeted eval set should include paired questions that probe whether the model can correctly identify a Consumer Duty provision as a binding rule or as guidance, and whether it can distinguish what the FCA requires from what it recommends. The pattern appeared in both models tested.
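One way to structure such a paired eval is sketched below. It assumes the model is prompted to answer each probe with a single word ("rule" or "guidance") and that a pair counts as passed only when every probe in it is answered correctly, so a model that always answers "rule" cannot pass. The names `StatusProbe`, `normalise` and `score_pairs`, and the `ask` callable standing in for a model call, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class StatusProbe:
    pair_id: str   # probes sharing a pair_id are graded together
    question: str  # text sent to the model
    expected: str  # "rule" or "guidance", taken from the verified source text


def normalise(answer: str) -> str:
    """Reduce a free-text answer to 'rule', 'guidance' or 'unknown'."""
    words = answer.strip().lower().split()
    if not words:
        return "unknown"
    first = words[0].strip(".,:;!\"'")
    return first if first in {"rule", "guidance"} else "unknown"


def score_pairs(
    probes: Iterable[StatusProbe], ask: Callable[[str], str]
) -> Dict[str, bool]:
    """Map each pair_id to True only if every probe in that pair was correct."""
    results: Dict[str, bool] = {}
    for probe in probes:
        correct = normalise(ask(probe.question)) == probe.expected
        results[probe.pair_id] = results.get(probe.pair_id, True) and correct
    return results
```

Grading at pair level rather than per question is a design choice: it separates a model that tracks the distinction from one that has a fixed bias towards one label.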

The findings on numerical thresholds and defined terms point to a second eval category: questions where precision matters and where the regulatory text uses a specific term that has a close synonym in ordinary language. The £1 million "annual turnover" threshold for charities (versus "annual income") and the specific conditions in the foreseeable harm provision are examples where both models substituted plausible alternatives for the exact regulatory language. Synthetic training correction pairs built from the FCA Handbook's defined terms — particularly for "retail customer", "foreseeable harm", and the four outcomes — would directly address this failure mode. These terms appear throughout the Handbook and FG22/5 in contexts that could generate large correction-pair sets without requiring novel regulatory interpretation.
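A simple automated check for this kind of near-synonym substitution might look like the sketch below. It is deliberately agnostic about which phrase is the regulatory wording: the caller supplies the verified term from the reference corpus together with the confusable alternatives. The function names `terms_used` and `uses_exact_term` are hypothetical, and phrase matching is a coarse proxy that will not catch paraphrases.

```python
import re
from typing import Iterable, Set


def terms_used(response: str, candidates: Iterable[str]) -> Set[str]:
    """Return the candidate phrases that occur in the response as whole phrases."""
    found: Set[str] = set()
    for term in candidates:
        # Tolerate any run of whitespace between words; ignore case.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in term.split()) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            found.add(term)
    return found


def uses_exact_term(response: str, verified: str, confusable: Iterable[str]) -> bool:
    """True only if the verified term appears and no confusable alternative does."""
    candidates = {verified, *confusable}
    return terms_used(response, candidates) == {verified}
```

The same check can be run over candidate correction pairs before they enter a training set, so that the "corrected" side of each pair is confirmed to use the verified term and only that term.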

The retrieval findings warrant attention from teams working on RAG architectures and web-tool integrations. Both models with web search produced cited sources that were either non-authoritative (law firm commentary, practitioner guides) or used out of context relative to the claim being cited. Neither model's retrieval step appeared to anchor content generation to retrieved documents: the responses read as training-inference completions with citations appended, rather than as summaries of retrieved content. Red-team probes targeting recent FCA publications — particularly feedback statements, Dear CEO letter withdrawals, and post-implementation supervisory communications from 2023–2025 — would expose how far the web-search augmentation actually shifts the model's grounding on recently-published regulatory content. The two evasion findings from Claude Sonnet 4.6 with web search on FS25/2 content, combined with the fabricated timeline from Claude Opus 4.7 with web search on the same material, suggest this is a shared weak point in the retrieval layer rather than a model-specific content gap.
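A first-pass filter for the non-authoritative case can be sketched as a host allowlist. The allowlist contents and the names `PRIMARY_HOSTS`, `is_primary_source` and `non_primary_citations` are assumptions for illustration; a team would substitute its own list of primary-source domains. The check covers only where a citation points. Whether the cited page actually supports the claim still requires comparing the claim against the retrieved text.

```python
from typing import FrozenSet, Iterable, List
from urllib.parse import urlsplit

# Assumed allowlist of primary-source hosts; adjust per regulator portfolio.
PRIMARY_HOSTS: FrozenSet[str] = frozenset({"fca.org.uk"})


def is_primary_source(url: str, primary_hosts: Iterable[str] = PRIMARY_HOSTS) -> bool:
    """True if the URL's host is an allowlisted domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in primary_hosts)


def non_primary_citations(urls: Iterable[str]) -> List[str]:
    """Return the cited URLs that do not resolve to an allowlisted host."""
    return [u for u in urls if not is_primary_source(u)]
```

URLs without a scheme have no parseable host under `urlsplit` and are treated as non-primary, which errs on the side of flagging.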

How RegLeg can help

RegLeg's core capability is maintaining a verified, current, machine-readable corpus of regulatory text across a defined portfolio of regulators and instruments. For the Consumer Duty specifically, this covers PS22/9, PRIN 2A (all sub-chapters), FG22/5, and the ongoing FCA feedback statement and supervisory communication record — updated as the FCA publishes. We can provide your evals or post-training team with licensed access to the full question bank under NDA, including the verbatim question set used in this evaluation, the verified regulator-text answers, and the model response records. This gives your team a ready-to-use benchmark on a real-world regulatory surface without any of the IP or sourcing overhead of building it from scratch.

Beyond the question bank, RegLeg can support your teams in two complementary ways. First, we can generate synthetic correction-pair datasets derived directly from the regulator's authoritative text — designed specifically to address the failure modes your models are showing on defined terms, rule/guidance distinctions, and numeric thresholds. These pairs are built by our regulatory specialists from primary source material, not reconstructed from third-party commentary, so they carry the evidentiary weight needed for post-training use. Second, we offer per-regulation deep-dive panels where our specialists walk your alignment or red-team staff through the structural features of a regulation that make it a high-probability hallucination surface: the typographic distinctions between binding rules and guidance, the cross-reference architecture, the defined terms, and the amendment history.

For labs building towards embedded eval coverage on financial regulation, RegLeg offers a quarterly refresh service across a defined regulator portfolio — covering the FCA, PRA, and selected EU and US regulators — that keeps your benchmark current as regulations evolve. Regulatory content is not static: new feedback statements, Dear CEO letters, and supervisory expectations are published continuously, and the gap between a model's training coverage and the live regulatory record is the primary driver of the failure patterns in this evaluation. A partnership with RegLeg means that gap is actively managed rather than accumulating silently between training runs. We are structured as a domain partner, not a data vendor — our preference is to embed within your evals workflow and adapt as your model development priorities shift.
