Model Collapse & Mode Collapse on a $0 Budget

1) Why these three research papers belong together

You have three papers that look different on the surface (LLMs, diffusion/image models, RLHF/post‑training). They fit together because they describe the same core problem from three angles:

Paper A — “Learning by Surprise: Surplexity for Mitigating Model Collapse in Generative AI”

This paper focuses on model collapse caused by “autophagy” (models being trained on their own generated outputs). It argues that collapse relates to repeatedly training on data that does not “surprise” the model and proposes a mitigation: keep (or prioritize) the synthetic examples that the model finds most surprising (high surplexity).

Why this matters for a student project:

  • It provides a simple experiment structure (train → generate synthetic data → retrain → repeat).
  • It suggests a simple mitigation you can implement with a few lines: filter generated items by “surprise score”.

Paper B — “A Closer Look at Model Collapse: From a Generalization‑to‑Memorization Perspective”

This paper studies diffusion models (image generators) and reframes collapse as a shift from generalizing (creating new samples) to memorizing (reproducing training data). It connects this shift to declining entropy (variety) of the synthetic data and proposes entropy-based selection to mitigate collapse.

Why this matters for a student project:

  • It gives a very “visual” story: models stop inventing and start copying.
  • It reinforces the same high‑level idea as Paper A: the synthetic dataset itself gets less diverse each generation, and that accelerates collapse.

Paper C — “Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity”

This paper is adjacent but extremely useful for a high‑school talk: it explains mode collapse in aligned LLMs (often after RLHF) as driven by typicality bias in preference data (people often prefer “familiar” outputs). It proposes Verbalized Sampling (VS): a training‑free prompting method that increases diversity by asking the model to produce multiple answers with probabilities and then sampling from the low‑probability “tail.”

Why this matters for a student project:

  • It creates “wow/aha” moments in front of an AP audience without training any model at all.
  • It complements the training‑loop collapse story: collapse can show up from the data you train on or from how you post-train / prompt / decode.

2) Key ideas explained like an AP textbook (no assumed ML background)

2.1 “Model collapse” in plain language

A generative AI model learns a pattern from a dataset (text, images, etc.). Now imagine this happens:

  1. You train a model on a dataset.
  2. You use that model to generate a new dataset (synthetic data).
  3. You train the next model mostly (or only) on that synthetic data.
  4. Repeat.

Over generations, the training data becomes less like the original world and more like a copy of a copy of a copy. The model can:

  • lose rare patterns (“tail” behavior),
  • become more repetitive,
  • get worse at matching real test data,
  • sometimes start effectively “memorizing” artifacts of its own generation process.

That gradual degradation is what many authors call model collapse. Paper A explicitly describes this “train on your own outputs” loop as autophagy and connects collapse to training on data that does not surprise the model.

2.2 “Mode collapse” vs “model collapse”

These sound similar but are different:

  • Model collapse (training-time / data recursion):
    “As you retrain on more synthetic data, performance and diversity degrade across generations.”
  • Mode collapse (output diversity / alignment or decoding):
    “The model keeps producing the same ‘safe/typical’ style of answer even when many valid answers exist.”

Paper C uses “mode collapse” to describe loss of diversity after post-training alignment, driven in part by typicality bias in preference data, and shows a prompting workaround (Verbalized Sampling).

2.3 “Surprise score” (what Paper A calls “surplexity”)

You do not need the word “surplexity” to understand the project.

Instead, think of a model reading a sentence and saying:

  • “I expected that.” (low surprise)
  • “That’s unusual.” (high surprise)

A standard way to measure that is perplexity (or equivalently average “surprisal”). You can treat it as:

Surprise score: how “shocked” the model is by the text.

Paper A’s idea (translated):
If you must train on synthetic data, prioritize the synthetic samples that the model finds most surprising, because those samples carry more information and help preserve diversity.
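To make the "surprise score" concrete, here is a tiny self-contained sketch (my own toy example, not code from Paper A) showing how average surprisal and perplexity relate: each token the model reads has a predicted probability, surprisal is the negative log of that probability, and perplexity is the exponential of the average surprisal.

```python
import math

def avg_surprisal(probs):
    """Average surprisal (negative log probability) per token, in nats."""
    return sum(-math.log(p) for p in probs) / len(probs)

def perplexity_from_probs(probs):
    """Perplexity = exp(average surprisal): the 'effective number of choices'."""
    return math.exp(avg_surprisal(probs))

expected   = [0.9, 0.8, 0.95]   # the model predicted these tokens well (low surprise)
surprising = [0.05, 0.1, 0.02]  # the model found these tokens unusual (high surprise)

print(perplexity_from_probs(expected))    # close to 1: "I expected that"
print(perplexity_from_probs(surprising))  # much larger: "that's unusual"
```

Paper A's filter simply prefers synthetic samples whose scores look like the second list rather than the first.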

2.4 “Entropy” (Paper B’s key word) in one sentence

In this context:

Entropy ≈ variety.
High entropy means lots of different outcomes; low entropy means the dataset is repetitive / predictable.

Paper B argues that in diffusion model recursion, the synthetic dataset’s entropy can decline, driving the model from generating novel outputs toward memorizing training data.
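A minimal illustration of "entropy ≈ variety" (an illustrative toy calculation, not taken from Paper B): Shannon entropy of the token frequency distribution is high when many different tokens appear and low when a few tokens dominate.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

varied     = "the bear found a strange glowing vending machine".split()
repetitive = "the the the the bear bear the the".split()

print(token_entropy(varied))      # higher: every token is different
print(token_entropy(repetitive))  # lower: a couple of tokens dominate
```

In the recursive-training loop, Paper B's claim is that this number drifts downward generation after generation.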


3) Can we create a toy experiment?

Yes — and we can do it in a way that:

  • is understandable at a senior high-school level,
  • runs on a laptop or free cloud notebooks,
  • produces graphs and “before/after” examples that are easy to present.

The most reliable $0 demo is not a giant LLM. It’s a “toy language model” (a Markov/bigram model). Why?

Because it lets you actually run multiple generations (train → generate → retrain → repeat) quickly, and the collapse signal is very clear: vocabulary shrinks, repetitions grow, and “surprise on real text” worsens.

Then, if a free GPU is available, the student can optionally repeat the same logic with a small transformer + LoRA.


4) Can we run it on an ultra‑low budget?

  • Local laptop (CPU) — guaranteed, no accounts needed.
  • Or free Google Colab — Colab is free-of-charge, but GPU access is not guaranteed and usage limits fluctuate; free notebooks can run “at most 12 hours” depending on availability and usage patterns.
  • Or free Kaggle notebooks — Kaggle has historically enforced GPU quotas (commonly cited as ~30 GPU hours/week) with session limits (often referenced as ~9 hours/session).

Low-cost “rent an RTX for a day” route (optional)

If a teacher/mentor can sponsor a few dollars, services like RunPod offer on-demand GPU pricing pages including RTX-class GPUs (e.g., RTX 4090 listings).
(Exact prices change often; use the live pricing page the day you rent.)

API / SaaS route (optional, not $0)

Thinking Machines Lab’s Tinker is a training API that handles infrastructure and uses LoRA; it lists per‑million‑token pricing for training and sampling for several open models.
This is low engineering effort but not “no money.”


5) Can it be presented to a high‑school AP audience?

Yes. The “audience-friendly” version is:

  1. Show recursive self-training visually (a loop diagram).
  2. Show three generations of outputs (Gen0 vs Gen1 vs Gen3).
  3. Show two simple graphs:

    • “Surprise on real test set” (goes up = worse)
    • “Diversity” (goes down = more repetitive)
  4. Then do a “wow demo” from Paper C:

    • direct prompt vs Verbalized Sampling prompt
    • show that diversity jumps without any training.

PART I — The hands‑on project (guaranteed $0 version)

6) Project overview (what you will build)

You will build a tiny text generator that learns how words follow each other (a “bigram language model”). Then you will run an autophagy loop:

  • Train model on real text → generate synthetic text → retrain on synthetic text → repeat.

You’ll measure:

  • Quality proxy: how “surprised” the model is by real text it did not train on (lower surprise is better).
  • Diversity proxy: how many different words and word-pairs appear (higher diversity is better).

You will also test a mitigation inspired by Paper A:

  • Instead of training on all generated samples, keep only the most surprising ones.

This directly maps to Paper A’s main claim: collapse happens when training on data that doesn’t surprise the model, so choose synthetic training items that maximize surprise.


7) Lab 1 — Model collapse with a Bigram (Markov) language model

7.1 Materials

  • Any computer that can run Python 3.
  • A notebook environment:

    • local Jupyter, or
    • Google Colab, or
    • Kaggle notebook.

7.2 Choose a “real text” dataset

You need a starting dataset that is not generated by your model.

A classic small option: “Tiny Shakespeare” (public-domain Shakespeare excerpts often used in ML tutorials). It’s available as a raw text file.

You can also use your own writing, but avoid personal data (names, addresses, private messages). Use public domain or self-written fiction.


7.3 The complete notebook code (copy/paste)

This is written to be readable and modifiable by a high-school student.
It’s not “industry-grade”; it’s a teaching tool.

# =========================
# Lab 1: Model collapse demo (Bigram / Markov text model)
# =========================

import re, math, random, urllib.request
from collections import Counter, defaultdict
import matplotlib.pyplot as plt

random.seed(42)

# ---------- 1) Load text ----------
DATA_URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

def download_text(url: str) -> str:
    with urllib.request.urlopen(url) as f:
        return f.read().decode("utf-8", errors="ignore")

raw_text = download_text(DATA_URL)

# Optional: shrink dataset so it runs fast everywhere
raw_text = raw_text[:300_000]  # keep first 300k characters

# ---------- 2) Split into "documents" (lines) ----------
lines = [ln.strip() for ln in raw_text.splitlines()]
lines = [ln for ln in lines if len(ln) > 20]   # drop tiny lines

# Simple tokenization: keep letters/numbers/apostrophes, lowercase
def tokenize(s: str):
    s = re.sub(r"[^a-zA-Z0-9'\s]+", " ", s)
    return [t for t in s.lower().split() if t]

docs = [tokenize(ln) for ln in lines]
docs = [d for d in docs if len(d) >= 5]

# Train/test split
random.shuffle(docs)
split = int(0.9 * len(docs))
train_docs = docs[:split]
test_docs  = docs[split:]

BOS = "<BOS>"
EOS = "<EOS>"

def add_markers(doc):
    return [BOS] + doc + [EOS]

train_docs = [add_markers(d) for d in train_docs]
test_docs  = [add_markers(d) for d in test_docs]

# ---------- 3) Bigram model ----------
# P(next_word | current_word) estimated from counts (with add-alpha smoothing)

def train_bigram(docs, alpha=1.0):
    bigram = defaultdict(Counter)
    unigram = Counter()
    vocab = set()

    for doc in docs:
        for a, b in zip(doc[:-1], doc[1:]):
            bigram[a][b] += 1
            unigram[a] += 1
            vocab.add(a); vocab.add(b)

    vocab = sorted(vocab)
    V = len(vocab)

    def prob(b, a):
        # add-alpha smoothing
        return (bigram[a][b] + alpha) / (unigram[a] + alpha * V)

    model = {
        "bigram": bigram,
        "unigram": unigram,
        "vocab": vocab,
        "V": V,
        "alpha": alpha,
        "prob": prob
    }
    return model

def avg_surprise_nll(model, docs):
    # Average negative log likelihood per token (lower is better)
    total = 0.0
    count = 0
    for doc in docs:
        for a, b in zip(doc[:-1], doc[1:]):
            p = model["prob"](b, a)
            total += -math.log(p)
            count += 1
    return total / max(1, count)

def perplexity(model, docs):
    # Perplexity is exp(NLL); interpret as "effective number of choices"
    return math.exp(avg_surprise_nll(model, docs))

# ---------- 4) Generation ----------
def sample_next(model, a):
    vocab = model["vocab"]
    V = model["V"]
    alpha = model["alpha"]
    bigram = model["bigram"]
    unigram = model["unigram"]

    # Build a simple distribution over vocab. (Fine for toy scale.)
    weights = []
    denom = unigram[a] + alpha * V
    for b in vocab:
        weights.append((bigram[a][b] + alpha) / denom)

    return random.choices(vocab, weights=weights, k=1)[0]

def generate_doc(model, max_len=60):
    out = [BOS]
    while len(out) < max_len:
        nxt = sample_next(model, out[-1])
        out.append(nxt)
        if nxt == EOS:
            break
    return out

def strip_markers(doc):
    return [t for t in doc if t not in (BOS, EOS)]

# ---------- 5) Diversity metrics ----------
def distinct_1_and_2(docs):
    # distinct-1: unique unigrams / total unigrams
    # distinct-2: unique bigrams  / total bigrams
    unigrams = []
    bigrams = []
    for d in docs:
        tokens = strip_markers(d)
        unigrams += tokens
        bigrams += list(zip(tokens[:-1], tokens[1:]))

    total_u = len(unigrams) if unigrams else 1
    total_b = len(bigrams) if bigrams else 1
    return (len(set(unigrams))/total_u, len(set(bigrams))/total_b)

def top_token_share(docs, top_k=10):
    tokens = []
    for d in docs:
        tokens += strip_markers(d)
    c = Counter(tokens)
    total = sum(c.values()) if c else 1
    return sum(v for _, v in c.most_common(top_k)) / total

# ---------- 6) "Surprise filtering" mitigation ----------
def doc_surprise(model, doc):
    # Average surprise (NLL) for a single doc
    return avg_surprise_nll(model, [doc])

def filter_most_surprising(model, docs, keep_fraction=0.3):
    scored = [(doc_surprise(model, d), d) for d in docs]
    scored.sort(key=lambda x: x[0], reverse=True)  # sort by surprise score, high first
    k = max(1, int(keep_fraction * len(scored)))
    return [d for _, d in scored[:k]]

# ---------- 7) Recursive self-training experiment ----------
def run_recursive_experiment(
    train_docs_real,
    test_docs_real,
    generations=6,
    synth_docs_per_gen=2000,
    alpha=1.0,
    mitigation=False,
    keep_fraction=0.3
):
    metrics = {
        "gen": [],
        "test_perplexity": [],
        "distinct1": [],
        "distinct2": [],
        "top10_share": [],
    }

    current_train = train_docs_real

    for g in range(generations):
        model = train_bigram(current_train, alpha=alpha)

        # Evaluate on REAL held-out text (important!)
        ppl = perplexity(model, test_docs_real)

        # Generate synthetic docs
        synth = [generate_doc(model) for _ in range(synth_docs_per_gen)]

        # Diversity on synthetic outputs
        d1, d2 = distinct_1_and_2(synth)
        t10 = top_token_share(synth, top_k=10)

        metrics["gen"].append(g)
        metrics["test_perplexity"].append(ppl)
        metrics["distinct1"].append(d1)
        metrics["distinct2"].append(d2)
        metrics["top10_share"].append(t10)

        print(f"\n=== Generation {g} ===")
        print(f"Test perplexity (on real text): {ppl:.2f}")
        print(f"Diversity distinct-1: {d1:.3f} | distinct-2: {d2:.3f}")
        print(f"Top-10 token share: {t10:.3f}")
        print("Sample output:", " ".join(strip_markers(generate_doc(model))[:35]), "...")

        # Prepare training data for next generation:
        # Baseline: train ONLY on synthetic data (max collapse pressure)
        next_train = synth

        # Optional mitigation: keep only the most surprising synthetic docs
        if mitigation:
            next_train = filter_most_surprising(model, next_train, keep_fraction=keep_fraction)

        current_train = next_train

    return metrics

# Run two experiments:
# A) Pure synthetic recursion (expect strong collapse)
metrics_collapse = run_recursive_experiment(
    train_docs_real=train_docs,
    test_docs_real=test_docs,
    generations=6,
    synth_docs_per_gen=2000,
    mitigation=False
)

# B) Surprise-filtered recursion (expect slower collapse)
metrics_mitigate = run_recursive_experiment(
    train_docs_real=train_docs,
    test_docs_real=test_docs,
    generations=6,
    synth_docs_per_gen=2000,
    mitigation=True,
    keep_fraction=0.3
)

# ---------- 8) Plot results ----------
def plot_metrics(m1, m2, title_suffix=""):
    gens = m1["gen"]

    plt.figure()
    plt.plot(gens, m1["test_perplexity"], label="Pure synthetic")
    plt.plot(gens, m2["test_perplexity"], label="Surprise-filtered")
    plt.xlabel("Generation")
    plt.ylabel("Perplexity on REAL test set (lower is better)")
    plt.title("Model collapse signal: perplexity vs generation" + title_suffix)
    plt.legend()
    plt.show()

    plt.figure()
    plt.plot(gens, m1["distinct2"], label="Pure synthetic")
    plt.plot(gens, m2["distinct2"], label="Surprise-filtered")
    plt.xlabel("Generation")
    plt.ylabel("Distinct-2 (unique bigrams / total bigrams; higher is better)")
    plt.title("Diversity signal: distinct-2 vs generation" + title_suffix)
    plt.legend()
    plt.show()

    plt.figure()
    plt.plot(gens, m1["top10_share"], label="Pure synthetic")
    plt.plot(gens, m2["top10_share"], label="Surprise-filtered")
    plt.xlabel("Generation")
    plt.ylabel("Share of tokens from top-10 words (lower is better)")
    plt.title("Collapse symptom: top-token concentration vs generation" + title_suffix)
    plt.legend()
    plt.show()

plot_metrics(metrics_collapse, metrics_mitigate)

7.4 How to explain what you just measured (AP-friendly)

Metric 1: “Perplexity on real test text” (quality proxy)

  • You trained models on training data that becomes more synthetic each generation.
  • You evaluate on real held‑out text that never changes.
  • If perplexity goes up across generations, the model is becoming worse at predicting real text.

Interpretation for an AP audience:

“Perplexity is like the model’s ‘confusion score.’ Higher means it’s more surprised by real text.”

Metric 2: “Distinct‑2” (diversity proxy)

Distinct‑2 is:

unique word‑pairs ÷ total word‑pairs

If it goes down, the model is repeating the same phrasing more often.
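A quick worked example (illustrative only, mirroring the `distinct_1_and_2` function from the Lab 1 code) makes the formula tangible:

```python
def distinct_2(tokens):
    """Unique adjacent word-pairs divided by total adjacent word-pairs."""
    pairs = list(zip(tokens[:-1], tokens[1:]))
    return len(set(pairs)) / len(pairs)

varied     = "a bear found a shiny vending machine today".split()
repetitive = "the bear the bear the bear the bear".split()

print(distinct_2(varied))      # 1.0  -- every word-pair is new
print(distinct_2(repetitive))  # ~0.29 -- only two distinct pairs, repeated
```

A collapsing model drifts from the first kind of output toward the second.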

Metric 3: “Top‑10 token share” (concentration proxy)

If “top‑10 share” goes up, a few very common words dominate the outputs.


7.5 What results should look like (qualitatively)

You’ll typically see:

  • Pure synthetic recursion:
    perplexity worsens on real test text, diversity drops, concentration increases.
  • Surprise-filtered recursion:
    collapse is often slower or smaller, because you forced the training set to include items the model didn’t already “expect.”
    That mirrors Paper A’s mitigation concept.

Important honesty note for the student:

  • This is a toy model.
  • The point is not “we proved the whole field is doomed.”
  • The point is “we demonstrated a simple recursive self-training failure mode.”

PART II — Optional upgrades (closer to LLMs, still low cost)

8) Lab 2 — “Mini‑LLM” recursive fine‑tuning with Unsloth (free GPU if available)

If the student can access a free GPU (Kaggle/Colab) they can run a similar experiment using LoRA (Low-Rank Adaptation):

  • Instead of training a model from scratch, you keep most weights fixed and train a small adapter.
  • This is how many low-budget fine-tuning workflows work.

Why Unsloth is relevant

Unsloth provides beginner-friendly notebooks and aims to reduce memory usage, letting you fine-tune models on limited GPUs; their GitHub lists “start for free” notebooks and a quickstart install.

What you do (high-level)

You repeat the same experimental logic:

  1. Fine-tune a small open model on real text (short run).
  2. Generate synthetic text.
  3. Fine-tune again on synthetic text (or a mixture).
  4. Measure:

    • perplexity on real held-out text,
    • diversity of outputs,
    • repetition.

Practical advice (so a student doesn’t get stuck)

  • Pick a small model (≈ 0.3B–1B parameters) so it fits free GPUs.
  • Use a small dataset (few thousand short lines).
  • Keep training short (few steps) to make multiple generations feasible.
  • Save results after each generation.

Where to start (the “lowest threshold” path)

  • Use Unsloth’s notebook approach:

    • Their main repo links to “Start for free” notebooks and quickstart.
    • Unsloth Studio explicitly advertises “Finetune for Free” notebooks.
    • Their Colab install guide shows the typical workflow and example imports.

(You can still do a fully custom script, but for a student project the notebook route is less fragile.)


9) If you’re allowed to spend a few dollars: “one-stop” options

9.1 Tinker (Thinking Machines Lab)

Tinker is a “training API for researchers” that exposes functions like forward/backward, optimizer step, sampling, saving state, and states it uses LoRA.
It lists a per‑million‑token pricing table for various models (including Llama‑3.2‑1B).

This is not $0, but it is low engineering effort: the “project work” becomes about designing the synthetic-data loop and evaluation, not fighting CUDA.

9.2 GPU renting (RunPod-style)

RunPod publishes a public pricing page listing many GPUs including RTX-class GPUs.
This route is useful if:

  • the student can’t get consistent free GPU access,
  • but a sponsor can cover a very small budget.

PART III — The “wow demo” for an AP audience (no training required)

10) Lab 3 — Mode collapse & Verbalized Sampling (VS)

Paper C’s core claim: post‑training alignment can reduce diversity (mode collapse) and a driver is typicality bias (humans prefer familiar responses). It proposes Verbalized Sampling, a prompting trick that asks the model to generate multiple responses with probabilities and sample from the low‑probability tail.

10.1 The 60-second classroom demonstration

Pick a creative prompt:

“Tell me a short story about a bear who discovers a vending machine.”

Direct prompt: ask once; you’ll usually get one “most typical” story.

Verbalized Sampling prompt (copy/paste):
(From the project’s repo quickstart idea.)

<instructions>
Generate 5 responses to the user query, each within a separate <response> tag.
Each <response> must include a <text> and a numeric <probability>.
Please sample at random from the tails of the distribution, such that the probability of each response is less than 0.10.
</instructions>

Tell me a short story about a bear who discovers a vending machine.

Then:

  • show the five candidates,
  • roll a die (or use a random number generator) to select one,
  • repeat once and show the style shifts.
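If you want to automate the "roll a die" step, a small script can parse the tagged responses and pick one at random. The `sample_output` string below is a hypothetical model reply in the requested format (not real model output), and the parsing assumes the model followed the tag structure from the prompt:

```python
import random
import re

# Hypothetical model reply in the format the VS prompt requests.
sample_output = """
<response><text>A bear nosed the glowing box and traded honey for cola.</text><probability>0.08</probability></response>
<response><text>The machine hummed; the bear mistook it for a rival.</text><probability>0.05</probability></response>
<response><text>Ranger cams caught a bear buying chips with a found coin.</text><probability>0.09</probability></response>
"""

# Non-greedy match so each <text>/<probability> pair stays inside its own response.
pattern = re.compile(r"<text>(.*?)</text>\s*<probability>([\d.]+)</probability>", re.S)
candidates = [(text.strip(), float(prob)) for text, prob in pattern.findall(sample_output)]

# Pick one candidate uniformly -- the classroom "roll a die" step.
story, prob = random.choice(candidates)
print(f"Chosen (p={prob}): {story}")
```

For the live demo, a physical die is more theatrical; the script is just a backup.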

This is compelling because it feels like:

“The model had more creativity inside it — we just needed to ask for the distribution.”

The paper reports diversity improvements in creative writing tasks and positions VS as training-free.


PART IV — How to present the whole project (10–12 minute AP talk)

11) Slide-by-slide outline (simple and strong)

  1. Title: “When AI trains on AI: copies of copies”
  2. One diagram: the recursive loop (Train → Generate → Retrain → Repeat)
  3. Define two collapses:

    • model collapse (training on synthetic data loop)
    • mode collapse (loss of output diversity due to alignment / preferences)
  4. Show outputs: Generation 0 vs Generation 3 text samples (from Lab 1)
  5. Graph 1: perplexity on real test set vs generation
  6. Graph 2: diversity (distinct‑2) vs generation
  7. Mitigation idea: “keep surprising synthetic data” (Paper A concept)
  8. Wow demo: Verbalized Sampling live (Paper C)
  9. Conclusion: “Toy demo ≠ final word, but it shows a real risk mechanism”
  10. Ethics / data hygiene: don’t train on private/copyrighted data; label synthetic data.

PART V — Feasibility & value (honest evaluation)

12) What this project can show well

  • A clean, repeatable demonstration that recursive self‑training can reduce diversity and worsen “match to real text” on a toy model.
  • A direct demonstration of Paper A’s intuition: if you choose synthetic training items that maximize “surprise,” collapse pressure can be reduced.
  • A compelling, no-training-required demo of mode collapse and an inference-time workaround (VS).

13) What this project cannot honestly claim

  • It cannot settle the full debate of “how serious model collapse is for frontier LLMs.”
    The field is active; results vary by model size, data mixing, filtering, retrieval, and training recipes.
  • A toy model cannot reproduce every mechanism present in large-scale training runs (and that’s okay for a class project).

14) Why it still has real educational value

  • It teaches scientific thinking: control group vs treatment, holding out a test set, measuring change over generations.
  • It turns a complex topic into something you can see (graphs + examples) and explain to non-experts.

PART VI — Full URLs (papers + tools + datasets)

The user asked for “references as full URLs,” so here they are in copy/paste form.

Papers

https://arxiv.org/abs/2410.12341
https://arxiv.org/abs/2509.16499
https://arxiv.org/abs/2510.01171

Verbalized Sampling (code + site)

https://github.com/CHATS-lab/verbalized-sampling
https://www.verbalized-sampling.com/

Unsloth (fine-tuning tooling)

https://github.com/unslothai/unsloth
https://github.com/unslothai/unsloth-studio
https://unsloth.ai/docs/get-started/unsloth-notebooks
https://unsloth.ai/docs/get-started/install/google-colab
https://unsloth.ai/docs

Free notebook compute

https://research.google.com/colaboratory/faq.html
https://www.kaggle.com/general/108481
https://www.kaggle.com/page/GPU-tips-and-tricks

GPU rental (optional)

https://www.runpod.io/pricing

Tinker (optional training API)

https://thinkingmachines.ai/tinker/
https://tinker-docs.thinkingmachines.ai/

Dataset used in Lab 1 (Tiny Shakespeare raw text)

https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Pub: 10 Jan 2026 23:34 UTC
