Context window stress test

A small experiment to see where longer context starts to degrade quality.

Key takeaways

  • Longer context is not free; quality bends before it breaks.
  • A fixed dataset makes the drift visible.
  • Measure the threshold where summaries go vague.

If context windows are new to you, start with "Context windows as working memory."

Architecture map

The experiment is a loop: fixed documents in, fixed prompt, outputs out. The only variable is context length.

[Figure: Context test loop. Documents feed a prompt, the model output is scored, and the loop repeats: Docs → Prompt → Model → Score. The only variable is context length; everything else stays fixed.]

What happened

Quality held up through the middle range, then bent. Past a threshold, summaries became vague and citations started drifting. The model still responded, but the answers softened.

The two gates

The first gate is retrieval fit. If the context is already noisy, adding more text does not help.

The second gate is attention budget. Past a point, the model spends more effort tracking tokens than answering the question.

Experiment walkthrough

  1. Fix a small, representative document set.
  2. Keep the prompt constant across runs.
  3. Increase context in steps and score the output.
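
The three steps above can be sketched as a small bash sweep. This is an illustration under assumptions, not the exact script used here: the `docs/*.txt` and `prompt.txt` layout and the function names are hypothetical, context length is assumed to be varied through Ollama's `num_ctx` option on the local `/api/generate` endpoint, and `curl` and `jq` are assumed to be installed.

```shell
#!/usr/bin/env bash
# Hypothetical layout: docs/*.txt is the fixed document set and prompt.txt
# is the fixed prompt. Neither path comes from the original write-up.

# Build the input once so documents and prompt are identical for every run.
build_input() {
  local docs_dir="$1" prompt_file="$2"
  cat "$docs_dir"/*.txt
  printf '\n---\n'
  cat "$prompt_file"
}

# Send one request with the given context size (num_ctx) and print the reply.
# Assumes a local Ollama server on its default port.
call_model() {
  local num_ctx="$1" input="$2"
  jq -n --arg model "${OLLAMA_MODEL:-llama3.1:8b}" --arg prompt "$input" \
        --argjson ctx "$num_ctx" \
        '{model: $model, prompt: $prompt, stream: false, options: {num_ctx: $ctx}}' |
    curl -s http://localhost:11434/api/generate -d @- |
    jq -r '.response'
}

# Run the sweep: one output file per context size, ready to be scored.
run_sweep() {
  local docs_dir="$1" prompt_file="$2" out_dir="$3"; shift 3
  local input num_ctx
  input="$(build_input "$docs_dir" "$prompt_file")"
  mkdir -p "$out_dir"
  for num_ctx in "$@"; do
    call_model "$num_ctx" "$input" > "$out_dir/ctx_${num_ctx}.txt"
  done
}
```

A call such as `run_sweep docs prompt.txt runs 4096 8192 16384 32768` would leave one file per step in `runs/`. Scoring is kept out of the loop on purpose, so the same outputs can be re-scored later without calling the model again.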

First-time config

export OLLAMA_MODEL="llama3.1:8b"
export CONTEXT_STEPS="4k,8k,16k,32k"
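
The step labels need to become token counts before they can be passed to a model. A minimal bash sketch, assuming "k" means 1024 tokens (the original does not say whether it means 1000 or 1024); the function names are made up for illustration.

```shell
#!/usr/bin/env bash

# Convert a label such as "8k" into a token count, assuming k = 1024 tokens.
step_to_tokens() {
  local step="$1"
  case "$step" in
    *k) echo $(( ${step%k} * 1024 )) ;;
    *)  echo "$step" ;;   # already a plain number
  esac
}

# Split a comma-separated list of labels and print one token count per line.
expand_steps() {
  local steps step
  IFS=',' read -r -a steps <<< "$1"
  for step in "${steps[@]}"; do
    step_to_tokens "$step"
  done
}

expand_steps "${CONTEXT_STEPS:-4k,8k,16k,32k}"
```

With the values exported above, this prints 4096, 8192, 16384, and 32768 on separate lines.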

Quick checks

ollama run "$OLLAMA_MODEL" "summarize the document set"

Failure modes

  • Changing the document set between runs hides the drift.
  • Adding context without scoring quality gives false confidence.
  • Overfitting the prompt to one test case means the result may not carry over to other inputs.
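
The first failure mode can be guarded against mechanically. The sketch below records a fingerprint of the document set on the first run and fails if it changes later. The function names and the stamp file are hypothetical, and it assumes GNU coreutils (`sha256sum`, `sort -z`); on macOS, `shasum -a 256` would be the nearest substitute.

```shell
#!/usr/bin/env bash

# Print one digest covering the names and contents of every file in the set.
# Sorting the file list keeps the digest independent of directory order.
fingerprint_docs() {
  local docs_dir="$1"
  ( cd "$docs_dir" &&
    find . -type f -print0 | LC_ALL=C sort -z | xargs -0 sha256sum |
    sha256sum | cut -d' ' -f1 )
}

# Record the fingerprint on the first run; afterwards, succeed only if it
# still matches. Keep the stamp file outside the document directory.
check_docs_unchanged() {
  local docs_dir="$1" stamp_file="$2" current
  current="$(fingerprint_docs "$docs_dir")"
  if [ ! -f "$stamp_file" ]; then
    echo "$current" > "$stamp_file"   # first run: store the baseline
    return 0
  fi
  [ "$current" = "$(cat "$stamp_file")" ]
}
```

Calling `check_docs_unchanged docs docs.sha256 || exit 1` at the top of each run stops the experiment before a modified dataset can blur the curve.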

What made the difference

I fixed the dataset and prompt, then focused on the slope where quality started to bend. That gave me a usable boundary.
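
One simple way to locate that bend, given a score per step, is to look for the largest drop between consecutive steps. This is a rough proxy for the slope, not necessarily the method used here; the input format ("tokens score" per line, ordered by tokens) and the function name are assumptions.

```shell
#!/usr/bin/env bash

# Read "tokens score" pairs on stdin, ordered by tokens, and print the step
# at which the score fell the most compared with the previous step.
# Prints nothing if the score never drops.
find_bend() {
  awk '
    NR > 1 {
      drop = prev_score - $2
      if (drop > max_drop) { max_drop = drop; bend = $1 }
    }
    { prev_score = $2 }
    END { if (bend != "") print bend }
  '
}
```

Usage would look like `find_bend < scores.txt`, where `scores.txt` is a hypothetical file with one line per context step. The step it prints is the first size past the usable boundary, so the boundary itself is the step before it.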

What I would do next time

I would add a small human rubric and track a single quality score per step to make the curve explicit.
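
A sketch of how such a rubric could collapse into one number per step, assuming each step gets a hypothetical file named `rubric_<tokens>.txt` with one "criterion rating" pair per line (for example, a rating for specificity and one for citation accuracy). The file naming, the criteria, and the unweighted mean are all illustrative choices.

```shell
#!/usr/bin/env bash

# Average the ratings in one rubric file ("criterion rating" per line).
rubric_score() {
  awk '{ total += $2; n++ } END { if (n > 0) printf "%.2f\n", total / n }' "$1"
}

# Print "tokens score" for every rubric_<tokens>.txt file, sorted by tokens.
score_table() {
  local dir="$1" file tokens
  for file in "$dir"/rubric_*.txt; do
    [ -e "$file" ] || continue          # no rubric files yet
    tokens="${file##*/rubric_}"
    tokens="${tokens%.txt}"
    printf '%s %s\n' "$tokens" "$(rubric_score "$file")"
  done | sort -n
}
```

Running `score_table rubrics` prints one line per step, which is the curve in its simplest form and can be plotted or compared across runs.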