RLB Panel Speak

The Curse of Recursion: AI Is Eating Itself

And what it means for every regulatory output AI produces

By RLB Specialist Panel · 14 Jun 2026

Each generation of AI loses a little more of what made the previous one accurate. The tails disappear first — and the tails are exactly where regulatory precision lives.

— RLB Specialist Panel

RegLegBrief · Panel Speak · Verdus Technologies Pte. Ltd.

A peer-reviewed paper in Nature showed that when AI models are trained on AI-generated data, they begin to lose the rarest, most precise, most unusual parts of their knowledge. The tails disappear first. What remains becomes flatter, safer, more repetitive — and increasingly wrong in the specific, verifiable ways that matter most to professionals who rely on regulatory accuracy.

74.2% of newly created web pages in 2025 contained AI-generated content (Ahrefs · 900K page study · Apr 2025)
~90% of all online content projected to be AI-generated by 2026 (Europol / Graphite projections · 2022–2025)
2026–32: Epoch AI prediction that the world runs out of new human-generated text for AI training (Epoch AI · 2024)

RegLegBrief · reglegbrief.com · Published: June 2026 · Register: reglegbrief.com/hallucination-register

The research

A Nature paper named it. The numbers have since confirmed it is already running.

In July 2024, researchers from Oxford, Cambridge, Imperial College London, the University of Toronto, and the University of Edinburgh published a paper in Nature that named something the AI industry had been quietly aware of but unwilling to discuss directly. They called it model collapse.

The mechanism is straightforward (Shumailov et al. · Nature 631 · 2024). When a generative AI model is trained on data produced by an earlier AI model, rather than on original human-generated content, it begins to lose the extremities of its knowledge. The rare cases (the unusual regulatory instrument, the minority interpretation, the precise technical distinction) disappear first. What remains is a narrower, blander, more averaged version of the knowledge the model once held. Train the next model on the output of the first, and the process compounds. Each generation loses a little more of the edge.

Indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear.

— Shumailov, Shumaylov, Zhao, Papernot, Anderson & Gal · Nature 631:755–759 · July 2024

The paper's authors distinguish two phases. Early model collapse: the tails of the distribution begin to disappear — the rare events, the minority cases, the precise technical distinctions. Late model collapse: the model converges to a shrunken distribution with very low variance — essentially a confident average that has forgotten it was ever uncertain about anything at the edges. The mechanism compounds across three error types: statistical approximation (finite sampling loses rare cases), functional expressivity (the model class cannot represent the true distribution), and functional approximation (learning errors accumulate).
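
The statistical approximation error described above can be illustrated with a toy simulation. This is a minimal sketch, not the experiment from Shumailov et al.: the four-category distribution, the sample size, and the function names are all hypothetical. Each generation is refit from a finite sample drawn from the previous generation, so a rare category that happens to draw zero samples gets probability zero and can never come back.

```python
import random
from collections import Counter


def next_generation(probs, sample_size, rng):
    """Draw a finite sample from the current model, then refit by frequency."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=sample_size)
    counts = Counter(draws)
    # A category with zero draws gets probability 0.0 in the refit model.
    return {c: counts[c] / sample_size for c in categories}


def simulate(generations=10, sample_size=200, seed=0):
    """Return the list of distributions, starting with the original one."""
    rng = random.Random(seed)
    # Hypothetical starting distribution: one common answer and a thin tail.
    probs = {"common": 0.90, "uncommon": 0.07, "rare_a": 0.02, "rare_b": 0.01}
    history = [probs]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        history.append(probs)
    return history


if __name__ == "__main__":
    for generation, probs in enumerate(simulate()):
        surviving = [c for c, p in probs.items() if p > 0]
        print(generation, surviving)
```

Running it with different seeds and smaller sample sizes shows the same pattern in this toy setting: the tail categories are the ones at risk, and a loss is permanent once it happens.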

A follow-up arXiv paper by Ali Borji concluded that the outcomes reported by Shumailov et al. are a statistical phenomenon and may be unavoidable (Borji · arXiv 2410.12954 · Oct 2024). This is not a bug to be patched. It is a structural property of recursive training.

The scale

The recursion has already started. We are just not talking about it honestly.

Model collapse would be a theoretical concern if the internet were still primarily human-generated. It is not.

74.2% of newly published web pages in April 2025 contained AI-generated text. Only 2.5% were purely AI; 71.7% were mixed human and AI. (Ahrefs study · 900,000 pages · April 2025)
1,271 AI-generated "news" sites tracked by NewsGuard by May 2025, up from 49 in May 2023: a roughly 26-fold increase in 24 months. (NewsGuard tracker · May 2023–May 2025)

AI-written pages in Google's top-20 search results climbed from 11.11% to 19.56% between May 2024 and July 2025, roughly 0.6 percentage points per month (Ahrefs / WinsSolutions · 2025). The web, the primary training corpus for most large language models, is rapidly filling with AI-generated content at exactly the moment that AI labs need more diverse, high-quality human data to improve their models.
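
The growth rates quoted in this section can be re-derived from the endpoint figures alone. The short check below uses only the numbers cited above (the search-result share and the NewsGuard site counts); the helper name `months_between` is an illustrative assumption, and the calculation assumes simple linear growth between the two endpoints.

```python
def months_between(start, end):
    """Whole months between two (year, month) pairs."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


# AI-written share of Google's top-20 results: 11.11% (May 2024) to 19.56% (Jul 2025)
serp_months = months_between((2024, 5), (2025, 7))
serp_pp_per_month = (19.56 - 11.11) / serp_months

# NewsGuard count of AI-generated news sites: 49 (May 2023) to 1,271 (May 2025)
news_months = months_between((2023, 5), (2025, 5))
news_multiple = 1271 / 49
news_sites_per_month = (1271 - 49) / news_months

if __name__ == "__main__":
    print(f"Search share: {serp_pp_per_month:.2f} percentage points per month")
    print(f"News sites: {news_multiple:.1f}x over {news_months} months")
    print(f"News sites: {news_sites_per_month:.0f} new sites per month")
```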

The Communications of the ACM put it directly in April 2026: "The recursion has started. We're just not talking about it honestly. Large portions of the open web now contain text produced by LLMs, and the volume grows every month. According to some estimates, over 50% of all content is now AI-generated. It's an ouroboros with an increasing appetite but a shrinking portion size." (CACM · Apr 2026)

The web is not just an information environment for human readers. It is also a training corpus for AI systems. The models that generate tomorrow's content are being trained on today's content, which is already substantially synthetic.

— Medium / Prince Saaluon · March 2026

Researchers at Epoch AI have predicted that the world may run out of new human-generated text suitable for AI training sometime between 2026 and 2032 (Epoch AI · 2024). The global pool of high-quality human-authored text is estimated at approximately 17 trillion tokens, growing at only 4–5% annually, while AI consumption of training data grows at a rate that dwarfs that figure. A 2021 report predicted that by 2025, 90% of internet content would be AI-generated. That projection is now reading as conservative.

· · ·
The mechanism

How each generation loses a little more of what made the previous one accurate

The degradation can be visualised as a photocopy of a photocopy: each generation loses resolution, each copy of a copy flattens what was once sharp. But the regulatory content version of this story has a specific quality that makes it more dangerous than a general degradation of output diversity.

G1
Generation 1 — trained on human internet (pre-2020)
Primary regulatory documents exist in training data. Secondary summaries also exist but are a small fraction. The model has some grounding in the actual instrument text, even if imperfect.
G2
Generation 2 — trained on human + AI-assisted internet (2021–2023)
AI-generated summaries, AI-drafted legal blogs, AI-assisted compliance content begins entering the training corpus. These secondary descriptions of regulatory instruments are now competing with primary sources at scale. The precise details begin to blur.
G3
Generation 3 — trained on predominantly synthetic internet (2024–2025)
74.2% of new web pages contain AI-generated content. The model is now predominantly training on AI descriptions of AI descriptions of regulatory instruments. The rare technical qualifications (the carve-outs, the defined terms, the instrument-specific thresholds) disappear from the training distribution. What remains is the confident average.
G4
Generation 4 — the confident wrong answer (2026 and beyond)
The model does not know it has lost the edge cases. It produces confident, well-structured, professionally formatted regulatory content — based on a training distribution that has systematically erased the precise details that make that content accurate. And it deprioritises live search results because its internal representation feels like deep knowledge.
The regulatory dimension

Why model collapse hits regulatory content harder than any other domain

Model collapse affects all AI output — creative writing, code, general knowledge. But it hits regulatory content with particular precision, because regulatory content has a property that creative output does not: it is either correct or it is wrong, and the wrong version can be verified by opening the primary source document.

Shumailov et al. note that the rarest cases disappear first — the tails of the distribution, the minority events, the unusual data points that do not appear often enough in training to survive the compression. In regulatory content, those rare cases are the precise technical details.

These are exactly the details that secondary sources — the compliance blogs, the law firm bulletins, the aggregator summaries — routinely omit or simplify. They are also exactly the details that distinguish compliant from non-compliant in a regulatory examination. And they are precisely what disappears first as model collapse compounds across training generations.

What this looks like in practice — regulatory specificity loss
Query →  What is the minimum liquid asset ratio under MAS Notice 649?

G1 model →  [Retrieved from primary source signal in training data]
           The minimum liquid asset ratio is 16% of qualifying liabilities.
           Specific carve-outs apply to certain categories.

G3 model →  [Training data now dominated by AI-generated summaries]
           The MAS requires banks to maintain adequate liquid assets.
           The standard ratio requirement is in the range of 16–18%.
           Specific requirements vary by institution type.

G4 model →  [Training data almost entirely synthetic; primary signal lost]
           The minimum liquid asset ratio under MAS Notice 649 is 18%.
           [Confident. Specific. Wrong. And the model has no mechanism to know this.]

This is not hypothetical. The RegLegBrief Hallucination Register has documented exactly this pattern: AI systems producing confident, specific, wrong figures for regulatory requirements. These figures exist nowhere in the primary source document, but they appear consistently across multiple systems because they have been absorbed from the secondary source environment and then amplified through recursive training (RLB-HAL-0002 · Apr 2026).

Model collapse is the structural explanation for why this happens — and why it will get worse as AI-generated regulatory content continues to flood the secondary sources that training pipelines harvest.

Implications

What model collapse means for professionals who rely on AI for regulatory content

📉
The errors are not random — they are systematically biased toward the confident average
Model collapse does not produce random errors. It produces errors that reflect the most common version of a piece of information across the secondary source ecosystem. This means AI systems across different vendors will tend to produce the same wrong answer — because they are all training on the same contaminated internet. A compliance officer who cross-checks one AI system with another is not performing a meaningful verification. They are checking two systems with correlated training data contamination against each other.
🔁
AI-generated regulatory content is now feeding the next model's training data
When an AI system produces regulatory content (a compliance summary, a legal brief, a regulatory update newsletter) and that content is published online, it enters the training corpus for the next generation of models. The 74.2% figure for new pages containing AI-generated content suggests that regulatory content is already being generated, published, indexed, and harvested back into training pipelines at scale. Each cycle amplifies the deviation from the primary source (Ahrefs · Apr 2025).
⚖️
The liability exposure compounds with each training generation
Courts have established that professionals bear personal liability for AI-generated content they rely on without verification. As model collapse progresses, the gap between what AI states and what the primary source says widens, but the AI's confidence level does not decrease. The professional relying on a Generation 4 model faces the same personal liability as the professional relying on Generation 1, but is receiving output with a greater systematic deviation from the primary source (Wadsworth v. Walmart · 2025 · Couvrette v. Wisnovsky · 2026).
🔍
Live search does not solve this — the model deprioritises it
Shumailov et al. describe how outputs become progressively safer, flatter, and more repetitive as collapse progresses — and how the model eventually begins confusing a copy of reality with reality itself. This is why AI systems deprioritise live web search results in favour of their internal representation. The training data is so densely reinforced that it feels — from inside the model's architecture — like deep, confident knowledge. A model that believes it already knows the answer does not reach for the search result that contradicts it.
📄
Primary source access becomes the only durable defence
If the contamination is in the training data, and the training data is the internet, then the only verification that sits outside the contamination loop is the primary source document — the instrument as published by the regulator, before any AI or human intermediary summarised, paraphrased, or redistributed it. This is what makes the primary source the only reliable verification reference. Not a different AI system. Not a legal aggregator. The actual document.
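
The first point above, that cross-checking one AI system against another is weak verification when their errors are correlated, can be made concrete with a small probability sketch. Everything here is an assumption for illustration: the two error rates are made-up values, and the model simplifies by treating a shared (contamination) error as producing the same wrong answer in both systems, while independent errors are assumed never to coincide.

```python
def single_system_accuracy(shared_error, independent_error):
    """Chance one system is right: no shared error and no independent error."""
    return (1 - shared_error) * (1 - independent_error)


def p_correct_given_agreement(shared_error, independent_error):
    """Chance that two systems which agree with each other are actually right.

    shared_error: probability both systems absorbed the same wrong figure.
    independent_error: per-system probability of an unrelated mistake.
    """
    agree_and_correct = (1 - shared_error) * (1 - independent_error) ** 2
    agree_and_wrong = shared_error  # shared contamination yields matching errors
    total_agreement = agree_and_correct + agree_and_wrong
    if total_agreement == 0:
        raise ValueError("the two systems never agree under these inputs")
    return agree_and_correct / total_agreement


if __name__ == "__main__":
    # Hypothetical rates: 20% shared contamination, 5% independent error.
    print(single_system_accuracy(0.2, 0.05))
    print(p_correct_given_agreement(0.2, 0.05))
    # With no shared contamination, agreement is strong evidence in this model.
    print(p_correct_given_agreement(0.0, 0.05))
```

With the made-up values of 0.2 and 0.05, a single system is right 76% of the time and two agreeing systems are right about 78% of the time. Under these assumptions, agreement adds almost nothing once the dominant error source is shared.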
The deeper irony

The more AI generates, the more valuable the primary source becomes

The Nature paper's authors make this point directly: the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet (Shumailov et al. · Nature · 2024).

The economic logic is straightforward. As AI-generated content floods the secondary source ecosystem, the scarcity premium on verified primary source material increases. The regulatory document that has not been summarised, paraphrased, or redistributed through an AI layer — the original gazette, the original circular, the original notice, as published by the regulator — becomes progressively more valuable as everything else becomes progressively less trustworthy.

Pebblous.ai's June 2026 analysis of model collapse economics puts it plainly: "The more recursive training collapses models, the more provenance becomes a market price rather than a technical checkbox." (Pebblous · Jun 2026) This is not a prediction. The pricing of human-verified primary source data is already a live market dynamic. AI labs are paying premiums for curated, provenance-verified datasets precisely because the open web is no longer a reliable source of clean training signal.

The more AI generates, the scarcer verified, human-made original data becomes. And that scarcity is already being priced.

— Shumailov, Shumaylov, Zhao, Papernot, Anderson & Gal · Nature 631:755–759 · July 2024

For regulatory intelligence specifically, this irony has a concrete operational expression. As AI-generated regulatory summaries multiply across the secondary source ecosystem, the ability to produce a regulatory output that is demonstrably verified against the primary source document — with a permanent citation ID that proves the access chain — becomes not just a quality differentiator but a liability defence.
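
One way to picture a check against the primary source with a traceable reference is sketched below. This is a hypothetical illustration, not a description of how RegLegBrief or any regulator works: the record fields, the function names, and the use of a SHA-256 digest as a stand-in for a citation ID are all assumptions, and the sample text is a placeholder rather than the wording of any real instrument.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationRecord:
    source_fingerprint: str  # digest of the exact primary text that was checked
    claimed_figure: str      # percentage figure found in the AI output
    found_in_source: bool    # whether that figure appears in the primary text


def fingerprint(text):
    """SHA-256 hex digest of the text, used here as a stand-in citation ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def percentages_in(text):
    """All percentage figures such as '16%' or '2.5%' that appear in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?%", text))


def verify_claim(primary_text, ai_output):
    """Check each percentage in the AI output against the primary text."""
    source_figures = percentages_in(primary_text)
    digest = fingerprint(primary_text)
    return [
        VerificationRecord(digest, figure, figure in source_figures)
        for figure in sorted(percentages_in(ai_output))
    ]


if __name__ == "__main__":
    # Placeholder text only; not quoted from any real notice.
    primary = "Placeholder instrument: the minimum ratio is 16% of qualifying liabilities."
    for record in verify_claim(primary, "The minimum ratio is 18%."):
        print(record)
```

A figure that appears in the source is not proof that the AI output used it correctly, so a check like this can only flag figures that are absent from the primary text. A human reading of the instrument is still needed to confirm the ones that match.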

The question for every organisation using AI in regulated work is the same question it has always been, now made more urgent by the recursion: was this verified against the primary source, or against a version of the primary source that has been through an AI layer — possibly many AI layers — and has lost the precise details that matter?

Sources & References
01 · Shumailov et al.: AI models collapse when trained on recursively generated data
Nature 631:755–759 · July 2024 · doi:10.1038/s41586-024-07566-y · Oxford, Cambridge, Imperial, Toronto, Edinburgh
02 · Borji: A Note on Shumailov et al. (2024)
arXiv 2410.12954 · October 2024 · Statistical phenomenon may be unavoidable
03 · Ahrefs: AI content study · 900,000 web pages
Ryan Law · April 2025 · 74.2% of new pages contain AI-generated content
04 · NewsGuard: AI news site tracker
May 2023–May 2025 · 49 → 1,271 AI news sites · ~51 new sites/month
05 · WinsSolutions: The AI Model Collapse Risk is Not Solved in 2025
October 2025 · AI pages in top-20 Google results: 11.11% → 19.56% (May 2024–Jul 2025)
06 · Epoch AI: human text data projections
2024 · World runs out of new human-generated training text 2026–2032
07 · CACM: Model Collapse Is Already Happening, We Just Pretend It Isn't
Communications of the ACM · April 3, 2026
08 · Pebblous.ai: Model Collapse and Human Data Provenance Pricing
June 2026 · Provenance as market price, not technical checkbox
09 · RegLegBrief Hallucination Register: RLB-HAL-0002
April 2026 · MAS Notice 649 · Wrong figure confirmed across 5 AI systems
10 · Europol projection: synthetic content online by 2026
2022 · ~90% synthetic content projection now reading as conservative
11 · Spennemann 2025: large-scale web corpus analysis
30–40% of active web text now originates from AI-generated or AI-edited sources
12 · Liang et al. 2025: AI in professional text
18% of financial consumer complaint records · 24% of corporate press releases · AI-assisted

RegLegBrief verifies regulatory content against primary sources — before the AI layers get to it.

Every signal in the RegLegBrief Hallucination Register is a documented case where AI output was measured directly against the primary source document. Not against a secondary summary. Not against another AI system. The actual instrument.

reglegbrief.com/hallucination-register →
RegLegBrief · Verdus Technologies Pte. Ltd. · UEN 201616982R · Singapore reglegbrief.com · Published June 2026