Context window stress test

A small experiment to see where longer context starts to degrade quality.

Key takeaways

Longer context is not free; quality bends before it breaks.

A fixed dataset makes the drift visible.

Measure the threshold where summaries go vague.

If context windows are new to you, start with "Context windows as working memory".

Architecture map

The experiment is a loop: fixed documents and a fixed prompt go in, outputs come out. The only variable is context length; everything else stays fixed.
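The loop described above can be sketched in a few lines. This is a minimal sketch, not the code used for the experiment: `run_sweep`, `generate`, and `score` are hypothetical names, and how a context size is applied (a model setting, or how much text is packed into the prompt) is left to the `generate` callable.

```python
from typing import Callable, Dict, List


def run_sweep(
    documents: List[str],
    prompt: str,
    context_steps: List[int],
    generate: Callable[[str, int], str],
    score: Callable[[str], float],
) -> Dict[int, float]:
    """Run the same documents and prompt at each context size and score the output."""
    # The corpus and prompt are built once, so they cannot drift between runs.
    full_prompt = "\n\n".join(documents) + "\n\n" + prompt
    results: Dict[int, float] = {}
    for num_ctx in context_steps:
        # Context size is the only argument that changes per iteration.
        output = generate(full_prompt, num_ctx)
        results[num_ctx] = score(output)
    return results
```

Passing `generate` and `score` in as arguments keeps the loop testable with stubs before any model is involved.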

What happened

Quality held up through the middle range, then bent. Past a threshold, summaries became vague and citations started drifting. The model still responded, but the answers softened.
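One way to turn "held up, then bent" into a number is to look for the first context size whose score drops noticeably below the best score seen at smaller sizes. The sketch below assumes one quality score per context size; `find_bend` and its `tolerance` default are illustrative choices, not values from the experiment.

```python
from typing import Dict, Optional


def find_bend(scores: Dict[int, float], tolerance: float = 0.1) -> Optional[int]:
    """Return the first context size scoring more than `tolerance` below the
    best score seen at any smaller size, or None if quality never bends."""
    best: Optional[float] = None
    for num_ctx in sorted(scores):
        current = scores[num_ctx]
        if best is not None and current < best - tolerance:
            return num_ctx
        # Track the running best so a slow decline is still measured from the peak.
        best = current if best is None else max(best, current)
    return None
```

Comparing against the running best rather than the previous step means a gradual slide still trips the threshold once the total drop exceeds the tolerance.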

The two gates

The first gate is retrieval fit. If the context is already noisy, adding more text does not help.

The second gate is attention budget. Past a point, the model spends more effort tracking tokens than answering the question.

Experiment walkthrough

Fix a small, representative document set.
Keep the prompt constant across runs.
Increase context in steps and score the output.
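If "increase context in steps" means packing more of the fixed document set into the prompt at each step, one way to do it is to truncate the joined documents to an approximate token budget. This sketch assumes roughly four characters per token, which is a crude heuristic rather than a real tokenizer count; `build_context` is a hypothetical helper.

```python
from typing import List


def build_context(documents: List[str], token_budget: int, chars_per_token: int = 4) -> str:
    """Join documents in their fixed order and cut the result to an approximate token budget."""
    if token_budget < 0 or chars_per_token <= 0:
        raise ValueError("token_budget must be >= 0 and chars_per_token must be > 0")
    joined = "\n\n".join(documents)
    # Character-based approximation; swap in a tokenizer for exact budgets.
    return joined[: token_budget * chars_per_token]
```

Because the documents are always joined in the same order, each larger step is a strict extension of the previous one, so any quality change comes from the added text.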

First-time config

export OLLAMA_MODEL="llama3.1:8b"
export CONTEXT_STEPS="4k,8k,16k,32k"
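A script driving the experiment would need to turn these two variables into usable values. The sketch below assumes that a `k` suffix means 1024 tokens; `parse_context_steps` and `load_config` are hypothetical helpers, and the fallbacks simply mirror the exports above.

```python
import os
from typing import List, Tuple


def parse_context_steps(raw: str) -> List[int]:
    """Parse a string like "4k,8k,16k,32k" into token counts."""
    steps: List[int] = []
    for part in raw.split(","):
        part = part.strip().lower()
        if not part:
            continue
        if part.endswith("k"):
            steps.append(int(part[:-1]) * 1024)  # assumption: "k" means 1024 tokens
        else:
            steps.append(int(part))
    return steps


def load_config() -> Tuple[str, List[int]]:
    """Read the model name and context steps from the environment."""
    model = os.environ.get("OLLAMA_MODEL", "llama3.1:8b")
    steps = parse_context_steps(os.environ.get("CONTEXT_STEPS", "4k,8k,16k,32k"))
    return model, steps
```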

Quick checks

ollama run llama3.1:8b "summarize the document set"
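The one-liner above runs with whatever context size the model is configured for, so it does not vary the context on its own. To set the size per run, one option is Ollama's HTTP API, where `/api/generate` accepts a `num_ctx` value under `options`. This is a sketch assuming a local server on the default port 11434; setting `temperature` to 0 is an added assumption to keep runs comparable, and `build_payload` and `generate` are hypothetical names.

```python
import json
import urllib.request
from typing import Any, Dict

# Default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, num_ctx: int) -> Dict[str, Any]:
    """Build a non-streaming generate request with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": 0},
    }


def generate(model: str, prompt: str, num_ctx: int, url: str = OLLAMA_URL) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    data = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```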

Failure modes

Changing the document set between runs hides the drift.
Adding context without scoring quality gives false confidence.
Overfitting the prompt to one test case makes the result look better than it is.
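The first failure mode is cheap to guard against: fingerprint the documents and prompt once, store the hash with each run, and refuse to compare runs whose fingerprints differ. A minimal sketch, with `dataset_fingerprint` as a hypothetical helper:

```python
import hashlib
from typing import List


def dataset_fingerprint(documents: List[str], prompt: str) -> str:
    """Return a SHA-256 hex digest covering the documents (in order) and the prompt."""
    digest = hashlib.sha256()
    for text in documents + [prompt]:
        encoded = text.encode("utf-8")
        # A length prefix keeps item boundaries unambiguous, so ["ab", "c"]
        # and ["a", "bc"] produce different fingerprints.
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()
```

Document order is part of the fingerprint on purpose, since reordering the set also changes what the model sees.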

What made the difference

I fixed the dataset and prompt, then focused on the slope where quality started to bend. That gave me a usable boundary.

What I would do next time

I would add a small human rubric and track a single quality score per step to make the curve explicit.
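A rubric like that could collapse into one number per step as follows. The three criteria and the 1 to 5 scale are placeholders chosen for illustration, not a rubric taken from the experiment; `rubric_score` is a hypothetical helper.

```python
from typing import Dict

# Hypothetical criteria; replace with whatever the human rubric actually rates.
RUBRIC = ("faithfulness", "specificity", "citation_accuracy")


def rubric_score(ratings: Dict[str, int], max_rating: int = 5) -> float:
    """Collapse per-criterion ratings (1..max_rating) into a single score in (0, 1]."""
    missing = [name for name in RUBRIC if name not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {', '.join(missing)}")
    for name in RUBRIC:
        if not 1 <= ratings[name] <= max_rating:
            raise ValueError(f"{name} must be between 1 and {max_rating}")
    total = sum(ratings[name] for name in RUBRIC)
    return total / (len(RUBRIC) * max_rating)
```

Plotting this score against context size gives the curve directly, one point per step.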

#experiments #local-llm #evaluation #context
