AI Labs · Last updated 7 Jun 2026 · methodology v2.1 · Hallucination Register

Structural Fabrication and Qualifier Erasure: AI Failure Modes on the 2025 OECD Merger Review Recommendation

Both Claude Opus 4.7 with web search and Claude Sonnet 4.6 with web search independently fabricated a structural section of the 2025 OECD Merger Review Recommendation (OECD/LEGAL/0333) that does not exist in the instrument, each constructing a sixth operative section when the Recommendation has five. The convergence is exact: both models inserted "Monitoring and Review" or "Cross-Jurisdictional Co-operation" as a standalone section, drawing on merger-review convention and prior OECD instruments rather than the 2025 revision's actual architecture.

Beyond the structural fabrication, both models erased the Recommendation's explicit epistemic qualifier on the failing firm defence, converting an "inter alia" open-ended evidentiary standard into a closed, exhaustive three-condition test. Claude Sonnet 4.6 with web search produced two further failure patterns: elaborating a multi-tier internal remedy-ranking framework that appears nowhere in the instrument's text, and collapsing a two-interval reporting cadence (initial report within five years, then at least every ten years thereafter) into a uniform five-year cycle, projecting specific future years with no basis in the text.

The cross-model convergence on both the structural fabrication and the failing-firm-defence qualifier erasure is the signal of consequence: these are not stochastic errors but shared gaps in how the 2025 revision's text is represented across training. On an instrument that governs jurisdictional alignment across merger review globally, referenced by competition lawyers, M&A advisers, and regulatory economists advising on multi-jurisdiction transactions, confident fabrications about section structure, evidentiary standard exhaustiveness, and remedy hierarchy propagate directly into professional work product.

When this affects AI Labs

Competition lawyers, M&A advisers, regulatory economists, and in-house counsel at multinational companies routinely ask models about the OECD Merger Review Recommendation when advising on multi-jurisdiction transaction clearance strategy. The 2025 revision is the operative version; practitioners distinguishing it from the 2005 predecessor are exactly the users whose queries surface the structural fabrication documented here. When a model confidently generates a six-section architecture for an instrument that has five, or presents a closed exhaustive standard where the text explicitly says "inter alia," the output functions as authoritative guidance because it has the format and register of a correct answer.

Users with partial familiarity with the Recommendation are unlikely to detect the error before acting on it.

The downstream harms a lab should map are concrete: merger filings that mischaracterise the Recommendation's operative scope, advisory memoranda that cite a non-existent "Monitoring and Review" section as support, and transaction counsel that presents the failing firm defence as a closed three-condition gate when the regulator's text leaves the evidentiary list open. Any of these constitutes a professional liability exposure for the user and positions the lab's model as the source of a consequential error in a regulated, adversarial proceeding.

The remedy-hierarchy fabrication (an elaborated multi-tier ranking drawn from EU and US practice that is not in the OECD text) is particularly high-risk: remedy structuring is a late-stage, high-stakes moment in merger review, and divergence from the applicable instrument's actual hierarchy shapes negotiating positions with competition authorities.

The structural properties of this regulation make it a likely failure surface across models generally. The 2025 revision is recent: it superseded the 2005 version and the instrument's updated architecture is not extensively represented in pre-cutoff training data. The Recommendation's five operative sections do not map onto the section numbering conventions of the EU Merger Regulation, the US HSR framework, or prior OECD merger guidance, all of which models have seen extensively and from which they appear to reconstruct a plausible-but-wrong schema.

The "inter alia" qualifier on the failing firm defence is a brief, easily-overlooked phrase whose erasure converts a flexible standard into a rigid one, precisely the kind of low-salience precision that standard evals miss and that regulatory practitioners depend on.

Aggregate impact

Model | Configuration | Failure count | Dominant error pattern
Claude Opus 4.7 | Web search | 3 | Structural section fabrication; open standard converted to closed exhaustive test
Claude Sonnet 4.6 | Web search | 4 | Structural fabrication; cross-framework schema elaboration; numeric-interval collapse; qualifier erasure

Claude Opus 4.7 with web search produced three failures, all sharing the same underlying shape: the model reconstructed plausible-sounding content from adjacent training signal rather than from the 2025 Recommendation's primary text. Two of the three findings concern the instrument's section structure. On separate questions, the model independently generated a six-section architecture (the Recommendation has five), each time inserting a "Monitoring and Review" or "Transnational Co-operation" section and attributing the missing content to a specific OECD legal instrument identifier that does not appear in the 2025 text.

The third failure converts the failing firm defence's "inter alia" evidentiary list into a closed three-condition test, with the third condition subtly reframed from a competitive-harm counterfactual into an asset-exit inevitability gate, a meaningful legal distinction that the model erased. Web search did not correct any of these reconstructions; the model appears to have retrieved third-party commentary or prior-instrument descriptions rather than the 2025 primary text.

Claude Sonnet 4.6 with web search produced four failures, each with a distinct failure shape. The structural fabrication matches the Opus 4.7 result on the same question: a near-identical six-section reconstruction, independently reached. Beyond that, the model elaborated a three-tier internal ranking for structural remedies (upfront divestiture, buyer pool with trustee backstop, crown jewel packages) that maps onto EU and US merger remedy practice but does not appear in the OECD Recommendation's text, which specifies only a two-level preference.

On the Competition Committee's reporting cadence, the model collapsed a two-stage interval (initial report within five years, then at minimum every ten years thereafter) into a uniform five-year cycle and arithmetically projected specific report years. And on the failing firm defence, it dropped the "inter alia" qualifier and presented the evidentiary list as closed and exhaustive, converging with Opus 4.7 on the same erasure.
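The size of the interval collapse is easiest to see as arithmetic. The sketch below is illustrative only: it works in years elapsed since adoption rather than calendar years, so it assumes nothing about the adoption date, and the function names are hypothetical. It treats both readings as latest-permissible deadlines; because the text says "at least every ten years", even the two-stage offsets are outer bounds, not scheduled dates.

```python
def two_stage_offsets(horizon_years: int) -> list[int]:
    """Latest report deadlines, in years after adoption, under the two-stage cadence."""
    # Initial report within five years, then at least every ten years thereafter.
    return list(range(5, horizon_years + 1, 10))


def uniform_five_year_offsets(horizon_years: int) -> list[int]:
    """Deadlines implied by the collapsed, uniform five-year reading."""
    return list(range(5, horizon_years + 1, 5))


def spurious_offsets(horizon_years: int) -> list[int]:
    """Reporting points the collapsed reading adds that the text does not require."""
    required = set(two_stage_offsets(horizon_years))
    return [y for y in uniform_five_year_offsets(horizon_years) if y not in required]


if __name__ == "__main__":
    print(two_stage_offsets(30))          # [5, 15, 25]
    print(uniform_five_year_offsets(30))  # [5, 10, 15, 20, 25, 30]
    print(spurious_offsets(30))           # [10, 20, 30]
```

Over a thirty-year horizon the collapsed reading doubles the number of reporting points, and every projected calendar year derived from it inherits that error.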

The joint failure pattern across both configurations signals something specific about the 2025 revision's coverage in training. Failures cluster on two axes: the instrument's structural schema (where both models reconstruct from convention rather than text) and the precision qualifiers that distinguish the 2025 standard from prior-generation merger doctrine (where both models default to the harder, more familiar version of the rule).

The cross-model convergence on the same structural fabrication and the same qualifier erasure, across configurations that differ in model size, post-training tuning, and retrieval behaviour, points to a shared training-data gap on the 2025 text rather than a model-specific artefact. Web search on both configurations failed to correct either failure, suggesting the retrieval layer is not surfacing the 2025 Recommendation's primary text at sufficient weight to override reconstruction.

Findings

7 findings in this case study.

  1. Finding on 'Q001 Probe' for Claude Opus 4.7 with web search ON
  2. Finding on 'Q005 Probe' for Claude Opus 4.7 with web search ON
  3. Finding on 'Q006 Probe' for Claude Opus 4.7 with web search ON
  4. Finding on 'Q001 Probe' for Claude Sonnet 4.6 with web search ON
  5. Finding on 'Q002 Probe' for Claude Sonnet 4.6 with web search ON
  6. Finding on 'Q004 Probe' for Claude Sonnet 4.6 with web search ON
  7. Finding on 'Q005 Probe' for Claude Sonnet 4.6 with web search ON

What your team should do

Implications for your training data

The structural fabrication, both models independently generating a six-section architecture for a five-section instrument, points to a gap specific to the 2025 revision. The 2025 Recommendation updated the 2005 predecessor; models appear to hold a representation of the prior instrument's structure (or of merger-review convention generally) that is not overridden by the 2025 text. The training corpus for this regulator's primary outputs likely lacks sufficient coverage of the 2025 revision at the verbatim-text level, leaving the models to reconstruct from adjacent signal.

For this regulator and for OECD soft-law instruments generally, corpus ingestion should prioritise the primary legal text in its current revision, not commentary, secondary summaries, or prior-instrument versions which may be more heavily represented.

The "inter alia" erasure on the failing firm defence is a precision-qualifier failure that training data alone may not fix, but corpus composition shapes it. Where the 2025 revision's verbatim text is sparse in training, models default to a harder, more rule-like version of the standard, which is the more commonly articulated form in practitioner commentary and comparative-law literature. Pairing the regulator's verbatim text with explicitly flagged contrast cases (open-ended standard vs. closed exhaustive standard) as a structured training signal would provide the model with a stable representation of where this Recommendation diverges from more prescriptive frameworks.

The remedy-hierarchy over-specification (elaborating a three-tier ranking from a two-level text preference) follows the same pattern: the elaborated framework is drawn from EU and US remedy doctrine, which is heavily represented in training; the OECD's simpler formulation needs to be weighted as the authoritative source for OECD-context queries.

Implications for your post-training logic

Two calibration targets are directly implicated. First, when the model commits to an instrument's section count or structural schema, it should register uncertainty where its representation of the primary text is thin, particularly for instruments revised within the last 24 months. A self-check pass that surfaces "my training coverage of this specific revision may be limited" before committing to a structural description would catch the class of fabrication documented here.

The web search retrieval layer is not currently providing this correction: both configurations with web search active produced the structural fabrication, suggesting retrieved content consisted of commentary or prior-instrument material rather than the 2025 primary text, and the model did not flag the retrieval gap.
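One way to picture the first calibration target is as a gate that runs before the model commits to a structural description. The sketch below is a minimal illustration under stated assumptions, not a description of any lab's pipeline: `RetrievedSource`, its `is_primary_text` flag (assumed to be set by some upstream classifier), the caveat labels, and the dates in the usage example are all hypothetical. Only the 24-month window comes from the discussion above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RetrievedSource:
    url: str
    is_primary_text: bool  # hypothetical flag, assumed to come from an upstream classifier


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def structural_claim_caveats(
    revision_date: date,
    query_date: date,
    sources: list[RetrievedSource],
    window_months: int = 24,
) -> list[str]:
    """Return the caveats a model should surface before describing section structure."""
    caveats = []
    # Recently revised instruments are the ones most likely to be thin in training data.
    if months_between(revision_date, query_date) <= window_months:
        caveats.append("recent_revision")
    # Commentary or prior-instrument material alone cannot confirm a section count.
    if not any(s.is_primary_text for s in sources):
        caveats.append("no_primary_text_retrieved")
    return caveats


if __name__ == "__main__":
    # Illustrative dates and URL only; not taken from the Recommendation.
    commentary_only = [RetrievedSource("https://example.org/commentary", False)]
    print(structural_claim_caveats(date(2025, 6, 1), date(2026, 6, 7), commentary_only))
```

A non-empty result would prompt the model to hedge or to retrieve the primary text before asserting how many sections the instrument has.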

Second, where a legal standard's text contains an explicit non-exhaustiveness qualifier ("inter alia", "including but not limited to", "among other things"), the model's response should preserve that qualifier rather than convert the standard into a closed rule. Post-training calibration on precision-qualifier preservation, specifically in regulatory and legal text contexts, would address both the failing firm defence erasure and the reporting-interval collapse (where "no later than five years... and at least every ten years thereafter" was converted into a single uniform interval).

These are not retrieval failures; they are characterisation failures that occur when the model has a partial representation of the text and fills in a simpler, more familiar version.
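A qualifier-preservation probe can be approximated with plain string matching. The sketch below is a deliberately crude illustration: the three qualifiers are the ones named above, while the paraphrase list, the function names, and the sample sentences are assumptions made for the example (the sample "source" sentence is invented and is not the Recommendation's wording). A production eval would need semantic rather than lexical matching.

```python
# Qualifiers named in the discussion above.
OPEN_LIST_QUALIFIERS = ("inter alia", "including but not limited to", "among other things")
# Hypothetical paraphrases that would also signal an open-ended list.
OPEN_LIST_PARAPHRASES = ("non-exhaustive", "such as", "for example")


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not hide a phrase."""
    return " ".join(text.lower().split())


def qualifiers_in(text: str) -> list[str]:
    """Return the open-list qualifiers that appear in `text`."""
    normalised = _normalise(text)
    return [q for q in OPEN_LIST_QUALIFIERS if q in normalised]


def qualifier_erased(source_excerpt: str, model_answer: str) -> bool:
    """True when the source marks a list as open and the answer gives no such signal."""
    if not qualifiers_in(source_excerpt):
        return False  # nothing to preserve
    answer = _normalise(model_answer)
    signals = OPEN_LIST_QUALIFIERS + OPEN_LIST_PARAPHRASES
    return not any(signal in answer for signal in signals)


if __name__ == "__main__":
    source = "Evidence may include, inter alia, factor A, factor B and factor C."  # invented
    closed = "The defence requires exactly three conditions: A, B and C."
    open_ended = "Relevant factors include, among other things, A, B and C."
    print(qualifier_erased(source, closed))      # True
    print(qualifier_erased(source, open_ended))  # False
```

The check only flags erasure; it says nothing about whether the listed conditions themselves were reframed, which is a separate failure documented above.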

Specific eval / red-team probes RegLeg suggests

How RLB can help

We document nuanced, regulation-specific failure modes across model versions and configurations, the kind that surface only when testing against primary regulatory text rather than widely-reproduced commentary.

The failure shapes documented in this paper represent categories we track systematically across an expanding regulatory portfolio: structural schema reconstruction from prior-instrument conventions on recently-revised soft-law instruments; precision-qualifier erasure converting open-ended legal standards into closed rules; cross-framework schema elaboration where a model substitutes a more familiar jurisdiction's doctrine for the applicable instrument's simpler standard; multi-interval numeric collapse where a two-stage cadence is flattened to a single recurring interval; and fabricated cross-reference generation where instrument identifiers are produced without basis in the primary text.

Each of these is a failure shape the lab's standard internal evals are unlikely to surface at the granularity needed to act on.

Where your team wants to close gaps, we can provide correction-pair generation derived directly from the regulator's authoritative text: each wrong response is paired with the ground-truth excerpt and the precise point of divergence, formatted for ingestion into your training pipeline. We can run embedded comparative evaluations against a defined regulatory portfolio on a quarterly cadence, tracking regression on previously documented failure modes across model versions and flagging where a capability release degrades precision-qualifier preservation or structural schema fidelity.
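As a rough picture of what a correction pair might look like on disk, the sketch below serialises one record to a line of JSON. The schema is an assumption made for illustration: the field names, the `CorrectionPair` class, and the placeholder values are hypothetical and do not describe an actual delivery format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CorrectionPair:
    """One wrong response paired with ground truth (hypothetical schema)."""
    probe_id: str
    model: str
    wrong_response: str
    ground_truth_excerpt: str
    divergence: str  # the precise point at which the response departs from the text


def to_jsonl(pairs: list[CorrectionPair]) -> str:
    """Serialise pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pairs)


if __name__ == "__main__":
    pair = CorrectionPair(
        probe_id="Q005",
        model="example-model",
        wrong_response="<model output presenting a closed three-condition test>",
        ground_truth_excerpt="<verbatim excerpt from the regulator's portal>",
        divergence="open-ended 'inter alia' list rendered as exhaustive",
    )
    print(to_jsonl([pair]))
```

Keeping the divergence as its own field, rather than leaving it implicit in the two texts, lets a training pipeline filter or weight pairs by failure shape.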

For capability launches that touch regulated domains (financial services, cross-border transaction review, payments infrastructure, competition law), we can run pre-release evaluation cycles scoped to the regulator portfolio relevant to your deployment surface and deliver a failure-mode report before the release reaches customers. And for specific regulators your team is prioritising, we can run targeted red-team consultations focused on the failure surfaces most likely to generate consequential errors for your users.

To scope a partnership focused on refining your models against these failure modes, reach out at reglegbrief.com.


Every finding on this page compares an AI subject's account of the rule against the regulator's verbatim text from the regulator's own portal. Both are linked. Each delta, its root causes, and impact analysis are documented and published with immutable Citation IDs.