Context window stress test
A small experiment to see where longer context starts to degrade quality.
Key takeaways
- Longer context is not free; quality bends before it breaks.
- A fixed dataset makes the drift visible.
- Measure the threshold where summaries go vague.
If context windows are new to you, start with "Context windows as working memory".
Architecture map
The experiment is a loop: fixed documents in, fixed prompt, outputs out. The only variable is context length.
What happened
Quality held up through the middle range, then bent. Past a threshold, summaries became vague and citations started drifting. The model still responded, but the answers softened.
The two gates
The first gate is retrieval fit. If the context is already noisy, adding more text does not help.
The second gate is attention budget. Past a point, the model spends more effort tracking tokens than answering the question.
Experiment walkthrough
- Fix a small, representative document set.
- Keep the prompt constant across runs.
- Increase context in steps and score the output.
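The three steps above can be sketched in shell. This is a minimal sketch under stated assumptions, not the exact harness used here: `slice_context` and `build_prompt` are hypothetical helper names, the document set is assumed to be one concatenated text file, and context size is approximated at 4 characters per token instead of using a real tokenizer.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the 4-chars-per-token ratio are assumptions.

CHARS_PER_TOKEN=4  # rough heuristic; a real tokenizer would be more accurate

# Print roughly the first <tokens> tokens of a corpus file.
# The cut is made in bytes, so it may land in the middle of a word.
slice_context() {
  local corpus_file="$1" tokens="$2"
  head -c "$(( tokens * CHARS_PER_TOKEN ))" "$corpus_file"
}

# Print the constant prompt followed by the sliced context for one step.
build_prompt() {
  local prompt="$1" corpus_file="$2" tokens="$3"
  printf '%s\n\n' "$prompt"
  slice_context "$corpus_file" "$tokens"
}
```

One step would then look like `ollama run "$OLLAMA_MODEL" "$(build_prompt "$PROMPT" corpus.txt 8192)"`, where `PROMPT` and `corpus.txt` are placeholders. The model's configured context window has to be at least as large as the step, otherwise the extra text is cut off before the model sees it.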
First-time config
export OLLAMA_MODEL="llama3.1:8b"
export CONTEXT_STEPS="4k,8k,16k,32k"
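The comma-separated `CONTEXT_STEPS` value has to become numbers before a script can loop over it. A minimal sketch, assuming every entry is an integer followed by `k` and that `k` means 1024 tokens (the config does not say whether 4k is 4000 or 4096); `steps_to_tokens` is a hypothetical helper name.

```shell
# Convert a list such as "4k,8k,16k,32k" into token counts, one per line.
# Assumption: "k" means 1024 tokens.
steps_to_tokens() {
  local steps="$1" step
  local IFS=','                      # split the list on commas
  for step in $steps; do
    echo "$(( ${step%k} * 1024 ))"   # strip the trailing "k", then scale
  done
}
```

A sweep could then iterate with `for tokens in $(steps_to_tokens "$CONTEXT_STEPS"); do ...; done`.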
Quick checks
ollama run "$OLLAMA_MODEL" "summarize the document set"
Failure modes
- Changing the document set between runs hides the drift.
- Adding context without scoring quality gives false confidence.
- Overfitting the prompt to one test case makes the result hard to generalize.
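The first failure mode can be guarded against mechanically by fingerprinting the document set before each run. A minimal sketch, assuming the documents live in one directory and that GNU `sha256sum` is available (on macOS, `shasum -a 256` is the usual substitute); `fingerprint_docs` is a hypothetical helper name.

```shell
# Print a single SHA-256 fingerprint covering every file under a directory.
# Any change to a file's content or name changes the fingerprint.
fingerprint_docs() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    # Hash each file, sort for a stable order, then hash the combined list.
    find . -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d' ' -f1
  )
}
```

Recording the fingerprint with the first run and comparing it before each later run makes an accidental change to the dataset visible before it contaminates the curve.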
What made the difference
I fixed the dataset and prompt, then focused on the slope where quality started to bend. That gave me a usable boundary.
What I would do next time
I would add a small human rubric and track a single quality score per step to make the curve explicit.
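With one score per step, the boundary can be read off mechanically instead of by eye. A minimal sketch, assuming scores arrive as `tokens score` lines in ascending step order, and that a "bend" means a drop of more than a chosen margin below the best score seen so far. `find_bend`, the input format, and the margin are assumptions for illustration, not part of the original experiment.

```shell
# Read "tokens score" lines on stdin and print the first step whose score
# falls more than <max_drop> below the best score seen so far.
# Prints "none" if no step drops that far.
find_bend() {
  local max_drop="$1"
  awk -v max_drop="$max_drop" '
    NR == 1 || $2 > best { best = $2 }                  # track the best score
    best - $2 > max_drop { print $1; found = 1; exit }  # first step past the margin
    END { if (!found) print "none" }
  '
}
```

For example, `find_bend 1.0 < scores.txt` would report the first context size that scores more than one rubric point below the best step, where `scores.txt` is a hypothetical file of per-step scores.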