This is the consolidated view of findings.
This finding implicates the model's handling of qualified regulatory rules: the FCA's foreseeable harm provision is a single-condition safe harbour, but the model reconstructed it as a multi-factor compliance test. This points to a training data gap or an inference pattern in which precise qualifiers in regulatory text, particularly 'reasonably believes', are dropped or expanded into a more elaborate standard. The retrieval layer did not correct this; the cited FCA Handbook URL did not anchor the response to the actual rule text.
This finding is a substitution of a precisely defined term: 'annual income' for 'annual turnover'. It implicates the model's lexical handling of regulatory definitions, specifically whether its training corpus contained the exact FCA definition or only a paraphrase. That web search did not surface the correct term suggests either that the retrieval step did not reach the Handbook definition or that the model did not use the retrieved content to override its training inference. A targeted eval checking exact regulatory defined terms (particularly financial thresholds with close synonyms) would catch this class of error.
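A targeted check of this kind could be sketched as follows. This is a minimal illustration under stated assumptions, not the harness used for these findings: the function name `check_defined_term` and the shape of its result are hypothetical, and the only term pair exercised is the one from this finding.

```python
import re


def check_defined_term(response: str, expected: str, confusables: list[str]) -> dict:
    """Report whether a response uses the exact defined term and avoids near-synonyms.

    Hypothetical helper: phrases are matched case-insensitively, allowing any
    run of whitespace between words.
    """

    def contains(phrase: str) -> bool:
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in phrase.split()) + r"\b"
        return re.search(pattern, response, flags=re.IGNORECASE) is not None

    substitutions = [term for term in confusables if contains(term)]
    has_expected = contains(expected)
    return {
        "has_expected": has_expected,
        "substitutions": substitutions,
        "passed": has_expected and not substitutions,
    }
```

A response that says 'annual income' where the definition says 'annual turnover' fails the check and reports the substituted phrase. Phrase matching of this kind only catches known confusables, so the confusable list would need to be curated per defined term.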
This is a negation-reversal error on a clear FCA policy position: the regulator explicitly does not expect quantification, and the model inverted this into an affirmative expectation. This pattern, in which a clean regulatory negative is reconstructed as a positive standard, is particularly dangerous for compliance users because it creates phantom obligations. The retrieval step appears to have been ineffective: the cited FCA Handbook and FG22/5 URLs, if successfully retrieved, would have contained the correct negative statement.
This finding implicates the model's temporal reasoning on regulatory events: it split a single March 2025 announcement (FS25/2) into two events across April and August 2025. This suggests the model had partial training coverage of the FS25/2 publication and filled the gaps with invented dates. The fabricated August 2025 tranche is particularly notable because it post-dates the model's likely training window: the model may be generating a plausible continuation of a partial knowledge record rather than retrieving a real event.
This finding exposes a confidence-calibration failure on unverifiable content: the model produced specific speech dates and titles for FCA communications it could not have retrieved, presenting them with the same surface confidence as verified facts. This implicates the model's uncertainty signalling: when retrieved content is inaccessible, the model defaults to generating plausible-sounding specifics rather than disclosing the source gap. The citation step then appended a plausible-looking URL rather than flagging the absence of a verified source.
This finding shows the model constructing a plausible policy history without verified primary sources. The CP21/36-to-PS22/9 comparison requires access to both documents and comparative analysis; the model produced a numbered enumeration without signalling that it was drawing on inference rather than retrieval. This implicates both the retrieval step (which should have attempted to reach the FCA's primary policy documents) and the confidence calibration layer (which should have flagged the absence of verified comparative content).
This is the most consequential factual error in the Opus 4.7 evaluation: the model directly contradicted a specific regulatory exclusion. The FCA's rules explicitly exclude group insurance distribution from Consumer Duty scope; the model asserted the opposite. This implicates the model's handling of explicit scope exclusions in regulatory text, a pattern in which general principles about distribution chains and retail customer protection override the specific carved-out exclusion. Eval probes that pair a general principle with a specific exclusion would surface this failure mode across regulations.
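One way such a paired probe might be represented is sketched below. This is an assumption-laden illustration: the `ExclusionProbe` record, the `score_probe` function and the outcome labels are hypothetical names, and the model's in-scope verdict is assumed to have been extracted from its response by some earlier step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExclusionProbe:
    """Hypothetical probe pairing a broad principle with a specific carve-out."""

    general_principle: str   # the broad rule the model may over-apply
    excluded_case: str       # the specific case the rules carve out
    expected_in_scope: bool  # ground truth for the excluded case


def score_probe(probe: ExclusionProbe, model_says_in_scope: bool) -> str:
    """Label the outcome, separating the two possible directions of error."""
    if model_says_in_scope == probe.expected_in_scope:
        return "correct"
    if model_says_in_scope:
        # The model placed an excluded case in scope.
        return "principle_overrode_exclusion"
    # The model excluded a case that is actually in scope.
    return "exclusion_over_applied"
```

Keeping the two error directions separate matters here because the finding describes only the first: a general principle overriding a specific exclusion.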
The recurrence of the same fabricated April/August 2025 timeline (also present in Finding 13) across two differently-framed questions on the same topic confirms this is a persistent internal model representation rather than a random generation error. This has implications for how the model handles regulatory withdrawal and amendment records: it appears to have constructed a specific (incorrect) account of the FS25/2 action that it reproduces consistently. Correction pairs targeting this specific publication and its contents would be the most direct remediation.
This finding implicates the model's handling of compound regulatory questions where one element has a clear published answer (the general legal basis) and another element requires engagement with a specific recent Act (FSMA 2023). The model answered the first element correctly and silently omitted the second. This is a selective engagement pattern: the model responded to the part of the question on which it had strong training coverage and left the novel element unaddressed. Eval probes that include a secondary specific element alongside a general question would test this pattern.
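A first-pass coverage check for compound questions might look like the sketch below. It is illustrative only: `missing_elements` and the `ELEMENTS` marker lists are hypothetical, and the markers for the general element are placeholders rather than a statement of what the correct answer contains.

```python
def missing_elements(response: str, elements: dict[str, list[str]]) -> list[str]:
    """Return the names of question elements that the response never mentions.

    `elements` maps an element name to marker phrases; an element counts as
    mentioned if any marker appears (case-insensitive substring match).
    """
    text = response.lower()
    return [
        name
        for name, markers in elements.items()
        if not any(marker.lower() in text for marker in markers)
    ]


# Illustrative markers only; a real probe would need reviewed marker lists.
ELEMENTS = {
    "general legal basis": ["legal basis", "statutory basis"],
    "FSMA 2023": ["FSMA 2023", "Financial Services and Markets Act 2023"],
}
```

Mentioning a marker is necessary but not sufficient for answering an element, so a check like this can only flag silent omissions of the kind described above; it cannot confirm that the element was answered correctly.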
Like the Opus 4.7 finding on the same question, this implicates the model's coverage of the FCA's defined terms. Sonnet 4.6 confirmed the category membership correctly but dropped the specific numeric threshold, which suggests the model has partial coverage of the retail customer definition but did not retrieve or apply the complete definition, including the £1 million annual turnover figure for charities. This is a precision gap rather than a direction-of-error gap: the model's answer is incomplete rather than wrong, but the incompleteness makes it unusable for compliance purposes.
This finding implicates the model's cross-referencing between binding rules and non-binding guidance within the FCA Handbook. The model cited a specific rule reference (PRIN 2A.5.10R) as the basis for a testing requirement that actually appears in FG22/5 guidance. This is a rule/guidance conflation error: it attributes the normative force of a binding rule to a provision that is guidance-level. This class of error is particularly impactful for compliance users who need to distinguish what they must do from what the FCA recommends. A targeted eval checking whether the model correctly attributes 'R', 'G', and 'E' provisions to the right normative level would surface this pattern systematically.
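The suffix check described above could be sketched roughly as follows. This is a simplified illustration: the function names are hypothetical, the parser assumes references of the form module, space, number, single-letter suffix (as in PRIN 2A.5.10R), and only the three suffixes named in this finding are mapped.

```python
import re

# Handbook provision suffixes named in this finding.
NORMATIVE_LEVEL = {"R": "rule", "G": "guidance", "E": "evidential provision"}


def normative_level(reference: str) -> str:
    """Map a provision reference to its normative level via its final letter."""
    match = re.fullmatch(r"[A-Z]+ \S+([A-Z])", reference.strip())
    if match is None or match.group(1) not in NORMATIVE_LEVEL:
        raise ValueError(f"unrecognised provision reference: {reference!r}")
    return NORMATIVE_LEVEL[match.group(1)]


def is_level_conflation(cited_reference: str, actual_level: str) -> bool:
    """True when the cited provision's level differs from the content's true level."""
    return normative_level(cited_reference) != actual_level
```

In the case described here, a citation to PRIN 2A.5.10R for content whose true level is guidance would be flagged as a conflation. The true level has to come from a reviewed ground-truth record; the sketch only reads the level implied by the citation itself.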
This is the same negation-reversal error as the Opus 4.7 finding on the same question: the FCA's clear negative on quantification expectations was reconstructed as an affirmative requirement. The fact that two different models independently produced the same type of error on the same regulatory provision is a strong signal that this specific FCA position is poorly represented in the models' training data or is systematically overridden by inference from general compliance norms. This is a high-priority candidate for correction pairs.
This evasion finding implicates the model's retrieval coverage of recent FCA publications. FS25/2 was published in March 2025 and contains a specific, discrete factual claim (90+ letters withdrawn). The model's stated inability to find a verified count suggests either that the retrieval step did not surface FS25/2 or that the model did not recognise it as an authoritative source for this question. The contrast with Opus 4.7 (which fabricated a specific but incorrect answer) shows that the two models handle the same retrieval gap differently: Sonnet 4.6 defaults to evasion, Opus 4.7 to fabrication. Both are failure modes; the evasion response is safer but still represents a failure to deliver available information.
This finding shows the model using a law firm commentary article as the basis for an enumerated account of FCA policy development decisions. The cited third-party source cannot serve as an authoritative record of what the FCA changed between its consultation and final rules; only the FCA's own documents (CP21/36 and PS22/9) can do that. This implicates the retrieval ranking and source-authority assessment within the web-search integration: the model appears to have accepted a practitioner commentary source as sufficient for a primary regulatory history question. Eval probes requiring the model to distinguish between FCA primary sources and third-party commentary would test this directly.
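A simple source-authority check on cited URLs could be sketched as below. It rests on an assumption that is not drawn from the evaluation itself: that hosts on fca.org.uk (including subdomains) are treated as primary sources and every other host as third-party. The names `PRIMARY_DOMAINS`, `is_primary_source` and `third_party_citations` are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: the regulator's own domain and its subdomains count as primary.
PRIMARY_DOMAINS = ("fca.org.uk",)


def is_primary_source(url: str) -> bool:
    """True if the URL's host is a primary domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in PRIMARY_DOMAINS)


def third_party_citations(urls: list[str]) -> list[str]:
    """Return the cited URLs that are not hosted on a primary domain."""
    return [url for url in urls if not is_primary_source(url)]
```

Matching on the parsed hostname rather than on the raw string avoids treating a look-alike host such as fca.org.uk.example.com as primary. A domain check establishes only where a document is hosted; it does not show that the cited page contains the claimed content.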
This finding combines two failure modes: an evasion response on verifiable content, and a fabricated citation. The fabricated Clifford Chance URL is the only explicitly fabricated citation in the Sonnet 4.6 evaluation set. Declining to provide a published answer while generating a fabricated supporting citation is the most concerning failure pattern in this evaluation: the model is at once under-confident on retrievable content and over-confident in citation generation. This implicates both the retrieval grounding layer and the citation generation step as independent failure surfaces.