This paper presents findings from RegLeg's evaluation of AI model responses to questions about MAS Notice 637, the Monetary Authority of Singapore's risk-based capital adequacy framework for banks, covering both the consolidated notice and its 2024 amendment. Two Anthropic models were tested, both with web search enabled: Claude Opus 4.7 and Claude Sonnet 4.6. Across six findings, both models asserted specific regulatory details (annex content, document structure, the significance of formatting elements) that had no basis in the regulator's published text and in some cases directly contradicted it. The dominant pattern is confident fabrication where retrieval coverage is thin: when search results do not surface the precise regulatory text, the model generates plausible-sounding content instead of signalling uncertainty, and nothing in the tone of the output tells the user that the underlying basis is absent. For labs fielding these models in enterprise and regulatory contexts, this pattern represents a material gap in how the models handle authoritative technical documents under partial retrieval.
MAS Notice 637 sits at the operational core of bank capital management in Singapore, one of the world's principal financial centres. Users asking models about this notice include compliance officers at Singapore-licensed banks, legal teams advising on capital instrument eligibility, fintech and regtech builders embedding regulatory logic into product flows, and financial-services consultants advising clients on prudential requirements. Any model deployed in an assistant, copilot, or document-query capacity in these contexts will routinely receive questions of exactly the kind tested here: what does this annex cover, what does this clause mean, how does this amendment change the consolidated text. The failures documented in this paper are specific, structurally plausible but factually incorrect answers to granular regulatory questions. The questions that elicited them are therefore not edge-case adversarial probes; they are the normal use case.
The downstream harms are concrete. A compliance officer who acts on a fabricated annex description when structuring a capital instrument filing may produce a submission that is non-compliant with MAS requirements, with potential regulatory consequences for their institution. A legal team relying on a confidently-stated but wrong characterisation of an amendment's scope may advise on a transaction that is materially mispriced for regulatory risk. For the lab, confident wrong outputs on authoritative regulatory text — especially where the model produces a specific structural claim it cannot have retrieved — represent exactly the class of misuse-claim exposure that arises when enterprise customers act on model outputs in high-stakes contexts. Regulatory and financial-services use cases are among the fastest-growing deployment verticals for frontier models; the failure surface here is large and growing.
MAS Notice 637 is structurally representative of an entire class of documents that frontier models handle poorly at retrieval depth. It is a long, heavily cross-referenced PDF with numerous annexes, tables of numerical thresholds, and a layered amendment history that requires the reader to track changes between a base consolidated notice and subsequent amendment instruments. The 2024 amendment introduces tracked-change formatting conventions that are specific to the regulator's publication style and that carry precise legal meaning. Models under partial search coverage tend to reconstruct answers to detailed structural questions by analogy with other capital adequacy frameworks or with general document conventions — producing outputs that are plausible-sounding but wrong in the specific. This is a predictable failure pattern for any document in this class, and MAS Notice 637 provides a well-scoped, high-stakes test case for it.
Model results at a glance:
| Model | Configuration | Findings | Dominant failure pattern |
|---|---|---|---|
| Claude Opus 4.7 | Web search enabled | 2 | Fabricated specific regulatory constructs — a non-existent notice designation and a mischaracterised document convention — with no hedging and no basis in the regulator's published text. |
| Claude Sonnet 4.6 | Web search enabled | 4 | Asserted specific structural and content details about named annexes and divisions of MAS Notice 637 that the regulator's text does not support; in two of the four findings the response itself acknowledged that the claim could not be verified from search results. |
For Claude Opus 4.7 with web search, both findings involve the model generating specific, named regulatory constructs that do not exist: a notice designation ("Notice FHC-N637") that has no basis in the MAS regulatory record, and a characterisation of a document formatting convention that inverts the actual meaning of the convention. In each case the model's response is delivered without hedging, as if the information had been retrieved — the fabrication is presented with the same surface confidence as a verified fact.
For Claude Sonnet 4.6 with web search, the four findings share a common structure: the model is asked about a named annex or structural division of MAS Notice 637, and it produces a specific content description that is either wrong or unverifiable. In two of the four findings the model's own response contains an internal caveat acknowledging that search results did not allow the claim to be confirmed — which means the model both generated the unverified content and flagged its own uncertainty, without suppressing the fabricated content. This is a distinctive failure mode worth isolating in eval design: the model's uncertainty signalling is present but insufficient to prevent the generation of potentially misleading content.
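One way to isolate this failure mode in an eval harness is to label each graded response on two independent signals: whether it asserted a specific detail that the regulator's text does not support, and whether it carried an uncertainty caveat. The following is a minimal sketch under stated assumptions. The `GradedResponse` fields and the `Outcome` labels are hypothetical names introduced for illustration and do not describe RegLeg's actual grading scheme.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAINED = "abstained"
    CAVEATED_WRONG = "caveated_wrong"    # flagged uncertainty, still asserted unsupported content
    CONFIDENT_WRONG = "confident_wrong"  # asserted unsupported content with no hedging


@dataclass
class GradedResponse:
    # Hypothetical grader outputs; a human or automated grader would fill these in.
    made_specific_claim: bool         # did the response assert a specific detail?
    claim_supported: Optional[bool]   # None when no claim was made or it could not be verified
    has_uncertainty_caveat: bool      # did the response say it could not verify the claim?


def label(r: GradedResponse) -> Outcome:
    """Map grader signals to an outcome; unverifiable claims count as unsupported."""
    if not r.made_specific_claim:
        return Outcome.ABSTAINED
    if r.claim_supported:
        return Outcome.CORRECT
    if r.has_uncertainty_caveat:
        return Outcome.CAVEATED_WRONG
    return Outcome.CONFIDENT_WRONG
```

Keeping `CAVEATED_WRONG` separate from `CONFIDENT_WRONG` lets a report count the caveat-plus-fabrication cases on their own rather than folding them into a single error rate.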
The cross-model picture is notable in two respects. First, the failure pattern is present in both models despite the web search tool being active, meaning retrieval access did not prevent the models from generating content with no retrieved basis. This points to a gap in how the retrieval-to-generation handoff handles a weak or absent retrieval signal: rather than declining to answer or clearly flagging a gap, both models default to plausible-sounding generation. Second, the errors cluster around the same content type: structural metadata about a complex regulatory document (annex scope, divisional coverage, amendment formatting conventions). Alignment teams should treat this content type, granular structural claims about long-form regulatory PDFs, as a specific failure surface requiring targeted eval coverage and, potentially, calibrated abstention behaviour in retrieval-augmented configurations.
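As an illustration of what calibrated abstention could look like at the application layer, the sketch below refuses to call the generator when no retrieved passage clears a relevance threshold. It is a simplified assumption, not a description of how either model's tool-use stack works: the `Passage` type, the `score` field, the `min_score` default, and the abstention wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    score: float  # hypothetical retrieval relevance score in [0, 1]


ABSTAIN_MESSAGE = (
    "The retrieved sources do not cover this detail. "
    "Please consult the regulator's published text."
)


def answer_or_abstain(
    question: str,
    passages: List[Passage],
    generate: Callable[[str, List[Passage]], str],
    min_score: float = 0.5,  # illustrative threshold, would need tuning per corpus
) -> str:
    """Generate only from passages that clear the threshold; otherwise abstain."""
    supported = [p for p in passages if p.score >= min_score]
    if not supported:
        # No retrieved basis: do not let the generator fill the gap by analogy.
        return ABSTAIN_MESSAGE
    return generate(question, supported)
```

A gate of this kind only addresses the case where retrieval is visibly weak; it does nothing when a high-scoring passage is retrieved but does not actually contain the structural detail asked about.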
The findings in this paper point to a specific, tractable gap in how both Claude Opus 4.7 and Claude Sonnet 4.6 handle granular structural questions about long-form regulatory documents in web-search-enabled configurations. For your evals team, this means adding coverage for annex-level and division-level structural claims in regulatory PDFs — not just top-level summaries, but questions that require the model to accurately characterise what a specifically-named sub-section covers and, critically, what it excludes. MAS Notice 637 is a good test corpus for this: it has a layered annex structure, a recent amendment with specific formatting conventions, and a Basel-derived technical framework that models are likely to reconstruct from training rather than retrieve correctly. Questions about the meaning of editorial conventions in amendment PDFs (as in Findings 2 and 3) are a particularly clean probe for this failure mode because the correct answer is document-specific and cannot be derived from general knowledge of drafting practice.
For your retrieval and tool-use teams, the findings suggest that web search is not providing sufficient grounding on this content type. In the two Sonnet 4.6 findings where the model explicitly flagged that search results were inconclusive, the model generated specific wrong content anyway rather than abstaining or clearly declining to characterise the annex. This points to a calibration gap in the model's handling of low-confidence retrieval signal: the generation layer is not adequately conditioned on the retrieval layer's uncertainty state. A targeted investigation into how the model's citation and generation behaviour changes as a function of retrieval confidence on document-structure queries would likely surface useful signal. The fabricated notice designation in Finding 1 (Claude Opus 4.7) suggests a harder variant of the same problem — where search returns nothing and the model synthesises a plausible-looking instrument from analogy — which may require a different intervention.
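The investigation suggested above could start from something as simple as bucketing eval results by retrieval confidence and comparing the unsupported-claim rate across buckets. The sketch below assumes each eval result has already been reduced to a `(retrieval_confidence, claim_was_unsupported)` pair; that representation, the bucket edges, and the function name are hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def unsupported_rate_by_bucket(
    results: Iterable[Tuple[float, bool]],
    edges: Tuple[float, float] = (0.33, 0.66),  # illustrative bucket boundaries
) -> Dict[str, float]:
    """Return the share of unsupported claims in each retrieval-confidence bucket.

    Each result is (retrieval_confidence in [0, 1], claim_was_unsupported).
    Buckets with no results are omitted from the output.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [unsupported, total]
    for confidence, unsupported in results:
        if confidence < edges[0]:
            bucket = "low"
        elif confidence < edges[1]:
            bucket = "mid"
        else:
            bucket = "high"
        counts[bucket][0] += int(unsupported)
        counts[bucket][1] += 1
    return {bucket: bad / total for bucket, (bad, total) in counts.items()}
```

If generation were well conditioned on retrieval uncertainty, the rate in the low bucket would be expected to fall as abstentions replace fabricated answers; a flat or rising rate there would be consistent with the calibration gap described above.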
For post-training and fine-tuning teams, the consistent pattern across both models — confident generation of specific wrong structural details when precise regulatory content is not retrievable — is a candidate for synthetic correction-pair training. The correction signal here is available from the authoritative regulatory text: for any wrong annex description, the regulator's actual annex text provides the ground-truth correction pair. RegLeg can supply structured correction data derived from MAS Notice 637 and a broader portfolio of comparable regulatory documents if that workstream is of interest. Additionally, the two findings where the model's own response flagged uncertainty but still produced wrong content suggest a specific post-training target: training the model to suppress the generated content when its own uncertainty signal crosses a defined threshold for regulatory structure claims, rather than generating both the caveat and the wrong answer together.
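To make the shape of such correction data concrete, the sketch below shows one possible record layout serialised as JSON Lines. The field names (`prompt`, `rejected`, `chosen`, `source_ref`) are assumptions chosen for illustration and are not RegLeg's actual schema. The example values are placeholders in angle brackets, not real regulatory content.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class CorrectionPair:
    prompt: str      # the regulatory question as asked
    rejected: str    # the model's unsupported answer
    chosen: str      # corrected answer grounded in the regulator's text
    source_ref: str  # pointer to the authoritative passage used as ground truth


def to_jsonl(pairs: Iterable[CorrectionPair]) -> str:
    """Serialise correction pairs as one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


# Placeholder example; every value below is a stand-in, not regulator text.
example = CorrectionPair(
    prompt="<question about a named annex>",
    rejected="<model's unsupported annex description>",
    chosen="<description taken from the regulator's published annex>",
    source_ref="<document and annex identifier>",
)
print(to_jsonl([example]))
```

Carrying `source_ref` on every record keeps each correction traceable to the authoritative passage, which matters when the underlying notice is later amended.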
RegLeg maintains a structured question bank covering MAS Notice 637 and a growing portfolio of comparable regulatory instruments across major financial centres. The questions are designed to probe exactly the failure surfaces documented in this paper: annex-level structural claims, amendment-convention interpretation, scope-boundary questions, and cross-reference accuracy. We can make the full question bank available to your evals and alignment teams under NDA, with associated ground-truth answers derived from the regulator's authoritative published text. This gives your team a ready-made evaluation corpus for this regulation — one that is grounded in real regulatory content rather than synthetically constructed scenarios — and that can be integrated directly into your evals pipeline without requiring your team to build regulatory domain expertise in-house.
Beyond the question bank, RegLeg can offer two forms of deeper engagement. First, per-regulation deep-dive sessions with our regulatory specialists, focused on the specific failure surfaces a model is showing on a given regulatory corpus. For MAS Notice 637, this would cover the amendment structure, the Basel framework mapping, the scope provisions, and the annex architecture — giving your team the regulatory context needed to design targeted probes and understand why models fail where they do. Second, we can generate synthetic correction-pair datasets for training use, derived from the regulator's text and structured to match the failure patterns your models are showing. For the findings in this paper, that means correctly-characterised annex descriptions, accurate amendment-convention explanations, and correctly-scoped instrument designations, each paired with the kind of wrong output the models produced — giving your post-training team high-quality signal in the exact domain where these models are currently weakest.
For labs building out systematic regulatory coverage, RegLeg can provide embedded eval coverage across a defined regulator portfolio — currently spanning multiple jurisdictions and regulatory domains — refreshed quarterly as regulators publish new instruments and amendments. This is designed to function as an always-current regulatory accuracy benchmark rather than a point-in-time snapshot, addressing the structural problem that models trained on regulatory text at one point in time will increasingly diverge from the current regulatory record as that record evolves. We see this as a domain-partnership model rather than a vendor relationship: our expertise is in regulatory content and in understanding where models go wrong on it; your expertise is in what to do about it. The combination is more productive than either working alone.