Ollama setup: local baseline

A first clean install that taught me where local stacks really fail.

Layout

Key takeaways

  • Local stacks fail at the edges: model names, memory, and heat.
  • A repeatable baseline beats a fast demo.
  • Verify the model before you wire the app.

I wanted a stable local baseline before changing models, prompts, or context size. The surprise was how many failure modes live outside the code. For the RAG side of this stack, see Retrieval-Augmented Generation in plain terms.

Architecture map

Local Ollama is a short stack: a model runtime, a local API, and your app. The boundaries are simple, which makes the bottlenecks obvious.

Local Ollama architecture App calls a local API which routes to a model runtime and files on disk. App Ollama API Model runtime Model files
The shortest useful path: app → local API → runtime + model files.

What happened

The install looked easy and it was. The failure showed up later: a model name mismatch, a stalled pull, and thermal throttling after a long run. Each issue felt minor, but they compounded into unstable baselines.

The two gates

The first gate is model identity. If the model name in your prompt runner does not match the local registry, the request fails or pulls the wrong thing.

The second gate is host capacity. Local runs are bounded by RAM, VRAM, and heat. Even on a good machine, long runs will drift if the system throttles.

Setup walkthrough

  1. Install Ollama and verify the CLI is present.
  2. Pull one model and run a single prompt to confirm the runtime.
  3. Keep the first run small. Your goal is a stable baseline, not speed.

First-time config

I pinned the model and forced explicit names in config so I did not guess later.

export OLLAMA_HOST="http://127.0.0.1:11434"
export OLLAMA_MODEL="llama3.1:8b"

Quick checks

ollama list
ollama run llama3.1:8b "ping"

Failure modes

  • Model name in code does not match the local registry.
  • The first run pulls a large model and stalls without feedback.
  • Thermal throttling changes latency after a few minutes.

What made the difference

I locked the model name, measured the first five runs only, and recorded machine state while the fan curve was stable. That turned the baseline into something I could compare later.

What I would do next time

I would script the baseline run from the start, capture system temperature in the log, and freeze the model version before testing anything else.