Ollama on M1: thermal baseline
A baseline run to learn where M1 throttles and why it matters.
Key takeaways
- Local performance is a temperature story, not just a model story.
- Baselines should be short, repeatable, and recorded.
- Throttling hides inside long runs and ruins comparisons.
I wanted a fair baseline for local models on an M1 machine. The surprise was not raw speed. It was how quickly a clean run drifted once the system warmed up.
Architecture map
The system has three parts: the model runtime, the host machine, and the environment around it. The host is the real boundary: once it heats up, it limits throughput no matter which model is loaded.
What happened
Short runs were fast. Long runs slowed down. Once I logged temperatures, the pattern was obvious: the system throttled after a few minutes, and timings taken before and after that point were no longer comparable.
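One way to make that pattern visible is to compare every run against the first, coolest one. The sketch below assumes a hypothetical CSV log with one row per run in the form run,tokens_per_sec,temp_c. That layout, the function name flag_drift, and the default 10 percent threshold are illustrative choices, not something Ollama produces.

```shell
# flag_drift FILE [MAX_DROP_PCT]
# Reads a headerless CSV of "run,tokens_per_sec,temp_c" rows (hypothetical layout)
# and prints each run whose throughput fell more than MAX_DROP_PCT percent
# below the throughput of the first row.
flag_drift() {
  awk -F, -v max_drop="${2:-10}" '
    NR == 1 { base = $2; next }          # first row is the cool reference run
    {
      drop = (base - $2) / base * 100    # percent slower than the reference
      if (drop > max_drop)
        printf "run %s: %.1f%% slower at %s C\n", $1, drop, $3
    }
  ' "$1"
}
```

Calling flag_drift baseline.csv 15 would list only the runs that lost more than 15 percent against the first row.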
The two gates
The first gate is time. If you test too long, you stop testing the model and start testing heat dissipation.
The second gate is consistency. Without a fixed prompt and a fixed run length, the baseline drifts even when nothing else changes.
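One way to pin both gates is to send the same request body every time through Ollama's HTTP API. The sketch below builds a body for the /api/generate endpoint with num_predict capping the output length and a fixed seed with temperature 0 keeping the output repeatable. The function name baseline_payload and the values 128 and 42 are placeholders. Note that num_predict is an upper bound, so a model can still stop earlier.

```shell
# baseline_payload MODEL PROMPT
# Prints a JSON body for Ollama's /api/generate endpoint with the run pinned down:
# capped output length (num_predict), fixed seed, and temperature 0.
# Assumes PROMPT contains no double quotes or backslashes (no JSON escaping here).
baseline_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false, "options": {"num_predict": 128, "seed": 42, "temperature": 0}}\n' "$1" "$2"
}

# Example call, assuming OLLAMA_HOST and OLLAMA_MODEL are exported:
#   curl -s "$OLLAMA_HOST/api/generate" -d "$(baseline_payload "$OLLAMA_MODEL" "ping")"
```

The response includes eval_count and eval_duration (in nanoseconds), so tokens per second can be derived as eval_count / eval_duration * 1e9 instead of relying on wall-clock time.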
Setup walkthrough
- Pick one model and one prompt.
- Run the prompt in short bursts to avoid thermal drift.
- Log run time and system temperature together.
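The last step can be sketched as a small wrapper that writes timing and temperature to the same row. Everything named here is hypothetical: log_run is an illustrative helper, and read_temp is a placeholder you would replace with a temperature source you trust. On Apple Silicon, sudo powermetrics --samplers thermal appears to report a thermal pressure level rather than degrees, so an actual temperature may need a third-party sensor tool.

```shell
# read_temp is a placeholder: replace its body with your own temperature source.
# It must print a single number.
read_temp() {
  echo "0"
}

# log_run LOGFILE COMMAND [ARGS...]
# Runs COMMAND, then appends "epoch_start,elapsed_seconds,temp" to LOGFILE.
# Timing uses date +%s, so it is whole-second resolution: coarse but portable.
log_run() {
  log_file="$1"; shift
  log_start="$(date +%s)"
  "$@" > /dev/null
  log_end="$(date +%s)"
  printf '%s,%s,%s\n' "$log_start" "$((log_end - log_start))" "$(read_temp)" >> "$log_file"
}
```

A call such as log_run baseline.csv ollama run "$OLLAMA_MODEL" "ping" would append one row per run, which keeps the timing and the temperature reading from drifting apart in separate logs.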
First-time config
export OLLAMA_HOST="http://127.0.0.1:11434"
export OLLAMA_MODEL="llama3.1:8b"
Quick checks
ollama run "$OLLAMA_MODEL" "ping"
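A lighter check is also possible without generating any tokens, which avoids warming the machine before the baseline starts. The sketch below asks the /api/tags endpoint, which lists locally available models, whether the model name appears in the response. The helper name model_available is hypothetical, and the plain grep match is a simplification rather than real JSON parsing.

```shell
# model_available HOST MODEL
# Succeeds if the Ollama server at HOST answers and lists MODEL in /api/tags.
# Uses a plain text match on the quoted model name instead of parsing JSON.
model_available() {
  curl -sf --max-time 5 "$1/api/tags" | grep -q "\"$2\""
}

# Example call, assuming OLLAMA_HOST and OLLAMA_MODEL are exported:
#   model_available "$OLLAMA_HOST" "$OLLAMA_MODEL" && echo "ready"
```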
Failure modes
- Long runs hide throttling until you compare them with a later test.
- Background apps push the host into a hotter state.
- Changing the prompt changes the performance profile.
What made the difference
I treated the baseline as a short, repeatable ritual and logged the host temperature alongside the timing. That made later comparisons honest.
What I would do next time
I would automate a short benchmark loop, record system temperature per run, and stop the test before the host warms up.
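A minimal sketch of such a loop, under stated assumptions: run_burst and read_temp are placeholders to be replaced with a real short generation and a real temperature source, and bench_loop, its arguments, and its output format are illustrative. The loop checks the temperature before each burst and stops once the reading passes the chosen limit.

```shell
# Placeholders: replace both bodies with your own commands.
run_burst() { :; }            # one short, fixed-length generation
read_temp() { echo "0"; }     # prints the current temperature as a number

# bench_loop MAX_RUNS MAX_TEMP COOLDOWN_SECONDS
# Runs up to MAX_RUNS bursts and prints "run,elapsed_seconds,temp" for each.
# Stops early as soon as the temperature reading exceeds MAX_TEMP.
bench_loop() {
  run=1
  while [ "$run" -le "$1" ]; do
    temp="$(read_temp)"
    # awk does the comparison because the shell has no float arithmetic
    if awk -v t="$temp" -v max="$2" 'BEGIN { exit !(t > max) }'; then
      echo "stopping before run $run: $temp exceeds $2" >&2
      return 0
    fi
    start="$(date +%s)"
    run_burst
    end="$(date +%s)"
    echo "$run,$((end - start)),$temp"
    sleep "$3"                # pause between bursts
    run=$((run + 1))
  done
}
```

For example, bench_loop 10 70 30 >> baseline.csv would attempt ten bursts with a 30 second pause between them and stop at the first reading above 70. Whole-second timing is coarse for short bursts, so a finer duration source is worth swapping in if one is available.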