RLB Panel Speak
And what it means for every regulatory output AI produces
Each generation of AI loses a little more of what made the previous one accurate. The tails disappear first — and the tails are exactly where regulatory precision lives.
— RLB Specialist Panel
A peer-reviewed paper in Nature showed that when AI models are trained on AI-generated data, they begin to lose the rarest, most precise, most unusual parts of their knowledge. The tails disappear first. What remains becomes flatter, safer, more repetitive — and increasingly wrong in the specific, verifiable ways that matter most to professionals who rely on regulatory accuracy.
In July 2024, researchers from Oxford, Cambridge, Imperial College London, the University of Toronto, and the University of Edinburgh published a paper in Nature that named something the AI industry had been quietly aware of but unwilling to discuss directly. They called it model collapse.
The mechanism is straightforward (Shumailov et al. · Nature 631 · 2024). When a generative AI model is trained on data produced by an earlier AI model, rather than on original human-generated content, it begins to lose the extremities of its knowledge. The rare cases (the unusual regulatory instrument, the minority interpretation, the precise technical distinction) disappear first. What remains is a narrower, blander, more averaged version of the knowledge the model once held. Train the next model on the output of the first, and the process compounds. Each generation loses a little more of the edge.
Indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear.
— Shumailov, Shumaylov, Zhao, Papernot, Anderson & Gal · Nature 631:755–759 · July 2024
The paper's authors distinguish two phases. In early model collapse, the tails of the distribution begin to disappear: the rare events, the minority cases, the precise technical distinctions. In late model collapse, the model converges to a shrunken distribution with very low variance, essentially a confident average that has forgotten it was ever uncertain about anything at the edges. The mechanism compounds across three error types: statistical approximation (finite sampling loses rare cases), functional expressivity (the model class cannot represent the true distribution), and functional approximation (learning errors accumulate).
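The first of those error types, statistical approximation, can be illustrated with a toy simulation. What follows is a minimal sketch and not the paper's experimental setup: it assumes a made-up categorical distribution with one common case and ten rare ones, and it stands in for "training" with a simple re-estimation of frequencies from a finite sample. The names refit, simulate and support_size, and all the numbers, are illustrative assumptions.

```python
import random
from collections import Counter

def refit(dist, n_samples, rng):
    """Draw n_samples from dist and return the empirical frequencies.

    This stands in for "train the next model on the previous model's output".
    """
    categories = list(dist)
    weights = [dist[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n_samples)
    counts = Counter(draws)
    return {c: counts[c] / n_samples for c in categories}

def simulate(dist, generations, n_samples, seed=0):
    """Return the list of distributions, generation 0 first."""
    rng = random.Random(seed)
    history = [dict(dist)]
    for _ in range(generations):
        history.append(refit(history[-1], n_samples, rng))
    return history

def support_size(dist):
    """Number of categories that still have non-zero probability."""
    return sum(1 for p in dist.values() if p > 0)

# Toy "true" distribution (an assumption): one common case plus a tail
# of ten rare cases at 1% each.
true_dist = {"common": 0.90}
true_dist.update({f"rare_{i}": 0.01 for i in range(10)})

history = simulate(true_dist, generations=30, n_samples=200, seed=0)
sizes = [support_size(d) for d in history]

for generation in (0, 5, 10, 20, 30):
    print(f"generation {generation:2d}: {sizes[generation]} categories survive")
```

The property worth noticing is structural rather than numerical: a category that draws zero samples in one generation has zero probability in every later generation, so the number of surviving categories can only stay flat or fall. That is the toy counterpart of the irreversibility the paper describes.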
A follow-up arXiv paper by Ali Borji concluded that the outcomes reported by Shumailov et al. are a statistical phenomenon and may be unavoidable (Borji · arXiv 2410.12954 · Oct 2024). This is not a bug to be patched. It is a structural property of recursive training.
Model collapse would be a theoretical concern if the internet were still primarily human-generated. It is not.
AI-written pages in Google's top-20 search results climbed from 11.11% to 19.56% between May 2024 and July 2025, roughly 0.6 percentage points per month (Ahrefs / WinsSolutions · 2025). The web, the primary training corpus for most large language models, is rapidly filling with AI-generated content at exactly the moment that AI labs need more diverse, high-quality human data to improve their models.
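The per-month figure can be checked from the two shares quoted above. This is only an arithmetic sketch: it assumes the measurement points fall at the start of each named month and that the change is averaged linearly over the interval.

```python
from datetime import date

# Shares of AI-written pages in the top-20 results, as quoted in the text.
start_share, end_share = 11.11, 19.56            # percent
start, end = date(2024, 5, 1), date(2025, 7, 1)  # day of month is an assumption

# Whole months between the two measurement points.
months = (end.year - start.year) * 12 + (end.month - start.month)

# Average change in percentage points per month over the interval.
points_per_month = (end_share - start_share) / months

print(f"{months} months, {points_per_month:.2f} percentage points per month")
```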
The Communications of the ACM put it directly in April 2026: "The recursion has started. We're just not talking about it honestly. Large portions of the open web now contain text produced by LLMs, and the volume grows every month. According to some estimates, over 50% of all content is now AI-generated. It's an ouroboros with an increasing appetite but a shrinking portion size." (CACM · Apr 2026)
The web is not just an information environment for human readers. It is also a training corpus for AI systems. The models that generate tomorrow's content are being trained on today's content, which is already substantially synthetic.
— Medium / Prince Saaluon · March 2026
Researchers at Epoch AI have predicted that the world may run out of new human-generated text suitable for AI training sometime between 2026 and 2032 (Epoch AI · 2024). The global pool of high-quality human-authored text is estimated at approximately 17 trillion tokens, growing at only 4–5% annually, while AI consumption of training data grows at a rate that dwarfs that figure. A 2021 report predicted that by 2025, 90% of internet content would be AI-generated. Set against the estimates quoted above, that projection no longer looks far-fetched.
The degradation can be visualised as a photocopy of a photocopy: each generation loses resolution, each copy of a copy flattens what was once sharp. But the regulatory content version of this story has a specific quality that makes it more dangerous than a general degradation of output diversity.
Model collapse affects all AI output — creative writing, code, general knowledge. But it hits regulatory content with particular precision, because regulatory content has a property that creative output does not: it is either correct or it is wrong, and the wrong version can be verified by opening the primary source document.
Shumailov et al. note that the rarest cases disappear first — the tails of the distribution, the minority events, the unusual data points that do not appear often enough in training to survive the compression. In regulatory content, those rare cases are the precise technical details.
These are exactly the details that secondary sources — the compliance blogs, the law firm bulletins, the aggregator summaries — routinely omit or simplify. They are also exactly the details that distinguish compliant from non-compliant in a regulatory examination. And they are precisely what disappears first as model collapse compounds across training generations.
Query → What is the minimum liquid asset ratio under MAS Notice 649?
G1 model → [Retrieved from primary source signal in training data] The minimum liquid asset ratio is 16% of qualifying liabilities. Specific carve-outs apply to certain categories.
G3 model → [Training data now dominated by AI-generated summaries] The MAS requires banks to maintain adequate liquid assets. The standard ratio requirement is in the range of 16–18%. Specific requirements vary by institution type.
G4 model → [Training data almost entirely synthetic; primary signal lost] The minimum liquid asset ratio under MAS Notice 649 is 18%.
[Confident. Specific. Wrong. And the model has no mechanism to know this.]
This is not hypothetical. The RegLegBrief Hallucination Register has documented exactly this pattern: AI systems producing confident, specific, wrong figures for regulatory requirements. These figures exist nowhere in the primary source document, yet they appear consistently across multiple systems because they have been absorbed from the secondary source environment and then amplified through recursive training (RLB-HAL-0002 · Apr 2026).
Model collapse is the structural explanation for why this happens — and why it will get worse as AI-generated regulatory content continues to flood the secondary sources that training pipelines harvest.
The Nature paper's authors make this point directly: data collected about genuine human interactions with systems will become increasingly valuable as LLM-generated content spreads through data crawled from the Internet (Shumailov et al. · Nature · 2024).
The economic logic is straightforward. As AI-generated content floods the secondary source ecosystem, the scarcity premium on verified primary source material increases. The regulatory document that has not been summarised, paraphrased, or redistributed through an AI layer — the original gazette, the original circular, the original notice, as published by the regulator — becomes progressively more valuable as everything else becomes progressively less trustworthy.
Pebblous.ai's June 2026 analysis of model collapse economics puts it plainly: "The more recursive training collapses models, the more provenance becomes a market price rather than a technical checkbox." (Pebblous · Jun 2026) This is not a prediction. The pricing of human-verified primary source data is already a live market dynamic. AI labs are paying premiums for curated, provenance-verified datasets precisely because the open web is no longer a reliable source of clean training signal.
The more AI generates, the scarcer verified, human-made original data becomes. And that scarcity is already being priced.
— Shumailov, Shumaylov, Zhao, Papernot, Anderson & Gal · Nature 631:755–759 · July 2024
For regulatory intelligence specifically, this irony has a concrete operational expression. As AI-generated regulatory summaries multiply across the secondary source ecosystem, the ability to produce a regulatory output that is demonstrably verified against the primary source document, with a permanent citation ID that proves the access chain, becomes not just a quality differentiator but a liability defence.
The question for every organisation using AI in regulated work is the same question it has always been, now made more urgent by the recursion: was this verified against the primary source, or against a version of the primary source that has been through an AI layer — possibly many AI layers — and has lost the precise details that matter?
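One way to picture that check is a comparison between the figure an AI system states and a record transcribed from the primary instrument. The sketch below is hypothetical throughout: PrimaryRecord, stated_percentages, check and the citation ID format are invented for illustration, and the 16% value is taken from the worked illustration earlier in this piece, not re-verified here against the notice itself.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class PrimaryRecord:
    citation_id: str      # hypothetical ID format, not a real register entry
    instrument: str
    field_name: str
    value_percent: float  # transcribed from the primary document by a human

# A number directly followed by "%", or the first number of a range
# such as "16–18%" (en dash or hyphen).
PERCENT = re.compile(r"\d+(?:\.\d+)?(?=\s*(?:[–-]\s*\d+(?:\.\d+)?\s*)?%)")

def stated_percentages(text: str) -> list[float]:
    """Return every percentage figure stated in the text, in order."""
    return [float(m) for m in PERCENT.findall(text)]

def check(ai_output: str, record: PrimaryRecord) -> str:
    """Classify an AI output against the primary-source value.

    Deliberately strict: a range, or any extra percentage in the text,
    counts as a mismatch and should go to a human reviewer.
    """
    figures = stated_percentages(ai_output)
    if not figures:
        return "no_figure"
    if all(f == record.value_percent for f in figures):
        return "match"
    return "mismatch"

# Value taken from this article's own illustration (an assumption here).
record = PrimaryRecord(
    citation_id="EXAMPLE-0001",
    instrument="MAS Notice 649",
    field_name="minimum liquid asset ratio",
    value_percent=16.0,
)

g1 = "The minimum liquid asset ratio is 16% of qualifying liabilities."
g3 = "The standard ratio requirement is in the range of 16–18%."
g4 = "The minimum liquid asset ratio under MAS Notice 649 is 18%."

for label, text in (("G1", g1), ("G3", g3), ("G4", g4)):
    print(label, check(text, record), record.citation_id)
```

The design choice that matters is where the reference value comes from: the record is filled in from the instrument itself, so the comparison never passes through another summary or another model.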
Every signal in the RegLegBrief Hallucination Register is a documented case where AI output was measured directly against the primary source document. Not against a secondary summary. Not against another AI system. The actual instrument.
reglegbrief.com/hallucination-register →