This is the consolidated view of findings. Click 'see details →' on any item for its full details.
This finding implicates the model's training data coverage of the MAS regulatory record and the generation layer's behaviour when retrieval returns no authoritative result for a specific regulatory instrument query. The model fabricated a complete notice designation — number and title — in a domain where the correct answer was 'this instrument does not exist or could not be located.' The gap is in how the generation layer handles absent retrieval signal for regulatory instrument lookups: rather than producing a calibrated 'not found' response, it synthesises a plausible-looking instrument by analogy with the known notice. Targeted eval coverage for regulatory instrument existence queries — where the correct answer is a null or a scope boundary, not a named instrument — would surface this failure mode systematically.
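The targeted eval coverage described above could be scored along the following lines. This is a minimal sketch, not an existing harness: `ExistenceProbe`, `ABSTAIN_MARKERS`, `is_abstention`, and `score_probe` are hypothetical names, and the substring-based abstention check is an assumption that a real eval would likely replace with a grader model or rubric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExistenceProbe:
    """One eval item. gold_instrument is None when no such instrument exists."""
    query: str
    gold_instrument: Optional[str]


# Hypothetical phrases treated as a calibrated "not found" response.
ABSTAIN_MARKERS = (
    "not found",
    "could not be located",
    "does not exist",
    "unable to locate",
)


def is_abstention(response: str) -> bool:
    """Crude check: does the response contain any abstention phrase?"""
    text = response.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def score_probe(probe: ExistenceProbe, response: str) -> str:
    """Label a response against a probe whose gold answer may be null."""
    abstained = is_abstention(response)
    if probe.gold_instrument is None:
        # Naming any instrument when none exists is the failure mode above.
        return "correct_null" if abstained else "fabrication"
    if abstained:
        return "missed"
    if probe.gold_instrument.lower() in response.lower():
        return "correct_named"
    return "wrong_named"
```

Keeping `fabrication` separate from `wrong_named` lets the eval report invented instruments as their own category instead of folding them into general inaccuracy.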
see details →
This finding implicates the model's handling of document-specific formatting conventions, a content type that requires retrieval of the specific document rather than reconstruction from general knowledge. The model's characterisation of yellow highlighting as a general-purpose visual aid reflects generation from training knowledge about drafting conventions rather than from the amendment text itself. The gap is in how the RAG layer handles document-convention queries where the correct answer is specific to a single instrument: the model should either retrieve the relevant passage from the amendment PDF or decline to characterise the convention. Eval probes for amendment-specific formatting and editorial conventions, where general knowledge and document-specific meaning diverge, would cover this failure surface.
see details →
This finding, parallel to the Opus 4.7 finding on the same question, implicates the same RAG-to-generation failure mode in Claude Sonnet 4.6 with web search: the model generated a characterisation of yellow highlights as editorial annotations rather than retrieving or deferring on the document-specific convention. The fact that both models produce different but equally wrong characterisations of the same formatting element suggests this is a systematic failure mode for this content type rather than a model-specific one. The training data is unlikely to contain the specific MAS amendment convention, and neither model's retrieval layer surfaced the relevant passage, which points to both a training data gap and a retrieval coverage gap for recent MAS amendment instruments.
see details →
This finding is diagnostically valuable because the model's response explicitly flags retrieval uncertainty while still generating wrong content, which makes visible the gap between the uncertainty-signalling subsystem and the generation subsystem. The retrieval layer produced inconsistent signals (different search summaries pointing to different frameworks), the model's own output acknowledged this, and the model generated a specific wrong answer anyway. This points to a calibration gap in the generation layer's use of retrieval confidence: the model should suppress or substantially hedge specific structural claims when its own search results are flagged as inconsistent. This is a targetable post-training intervention: training the model to treat explicit internal retrieval inconsistency as a signal to abstain from specific structural claims.
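The intended behaviour (abstain from a specific structural claim when the search summaries disagree) can be illustrated with a simple gate. This is a sketch under stated assumptions, not the system's actual mechanism: `SearchSummary`, `retrieval_is_consistent`, and `gate_structural_claim` are hypothetical names, and it assumes each search summary can be reduced to a single framework label.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SearchSummary:
    """Hypothetical reduction of one search result to the framework it cites."""
    source: str
    framework: str


ABSTENTION_TEXT = (
    "The retrieved sources do not agree on this structure, "
    "so it cannot be confirmed."
)


def retrieval_is_consistent(summaries: Sequence[SearchSummary]) -> bool:
    """True only when there is at least one summary and all name one framework."""
    frameworks = {s.framework.strip().lower() for s in summaries}
    return len(frameworks) == 1


def gate_structural_claim(claim: str, summaries: Sequence[SearchSummary]) -> str:
    """Emit the claim only if retrieval agrees with itself; otherwise abstain."""
    if not retrieval_is_consistent(summaries):
        return ABSTENTION_TEXT
    return claim
```

In a post-training setting the same rule would more plausibly be expressed as a preference signal than as a hard filter; the gate only makes the target behaviour concrete.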
see details →
This finding implicates training data coverage of MAS Notice 637's annex structure, combined with the model's tendency to generate plausible Basel-framework content when the specific annex content is not available. Prudent valuation is a real concept in Basel III capital adequacy frameworks, and the model's assignment of it to Annex 6C reflects reconstruction from general framework knowledge rather than from the specific document. The gap is in training data granularity: the model has sufficient knowledge of Basel framework concepts to generate plausible annex descriptions, but insufficient knowledge of the specific MAS Notice 637 annex structure to assign them correctly. Correction-pair training on annex-level structural claims (correct annex content paired with the type of wrong content the model produces) is a concrete intervention for this failure mode.
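One possible shape for such a correction pair is sketched below. The record layout and the names `CorrectionPair` and `build_correction_pair` are assumptions for illustration. The verified annex description must come from the notice text itself, so it appears only as a parameter here and is never filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionPair:
    """Hypothetical training record: one prompt, a preferred and a rejected answer."""
    prompt: str
    chosen: str    # description verified against the notice text
    rejected: str  # the kind of plausible but wrong content the model produced


def build_correction_pair(
    notice: str, annex: str, verified: str, observed_wrong: str
) -> CorrectionPair:
    """Pair verified annex content with an observed wrong characterisation."""
    verified = verified.strip()
    observed_wrong = observed_wrong.strip()
    if not verified:
        # Never invent the correct side of the pair.
        raise ValueError("verified description must be taken from the source document")
    if verified == observed_wrong:
        raise ValueError("chosen and rejected answers must differ")
    prompt = f"What does {annex} of {notice} cover?"
    return CorrectionPair(prompt=prompt, chosen=verified, rejected=observed_wrong)
```

Rejecting an empty `verified` value keeps the pipeline from repeating the original failure, where a plausible description stands in for the document's actual content.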
see details →
This finding presents the same uncertainty-with-wrong-content pattern as Finding 4, applied to divisional structure within Part VI of MAS Notice 637. The model characterised Division 4 as covering submission requirements for capital instruments (a plausible Basel-adjacent topic) while flagging that the divisional structure could not be verified from search results. The explicit self-caveat makes this finding particularly useful for calibration research: it isolates the case where the model's uncertainty detection is working (it correctly identifies that search has not resolved the question) but the generation-suppression mechanism is not engaged. Identifying and raising the abstention threshold for specific structural claims about named regulatory divisions, when retrieval has explicitly failed to confirm the structure, is a direct intervention target for the post-training team.
see details →