
What I Learned Debugging a Multi-Agent System

The debugging session that taught me why observability is not optional in orchestration, and what I now look for first when a multi-agent system misbehaves.

The system was producing wrong answers. Nobody knew why.

Two agents were working together. One retrieved context, the other generated output. Both seemed to be working fine in isolation. Together, they were producing outputs that were confidently incorrect.

The problem was invisible because the system had no observability. I could see the final output. I could not see what happened in between.

I spent two days adding tracing before I could debug. Once I could see the handoff between agents, the problem was obvious: the context retrieval agent was including stale data that the generation agent did not know was stale. The generation agent trusted the context without knowing its age or source.
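A minimal sketch of the fix in spirit, not the code from that system: attach source and last-updated time to each piece of context, so the generation side can tell fresh from stale instead of trusting everything. The names `ContextItem` and `split_by_freshness`, the seven-day threshold, and the sample data are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ContextItem:
    """One piece of retrieved context plus the provenance the generator needs."""
    text: str
    source: str             # hypothetical label for where the data came from
    last_updated: datetime  # when the underlying data was last known to be current

    def age(self, now: datetime) -> timedelta:
        """How old the data is relative to `now`."""
        return now - self.last_updated


def split_by_freshness(items, max_age, now):
    """Return (fresh, stale) so stale items are surfaced, not silently passed on."""
    fresh = [item for item in items if item.age(now) <= max_age]
    stale = [item for item in items if item.age(now) > max_age]
    return fresh, stale


# Illustrative data only; a fixed `now` keeps the example deterministic.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
items = [
    ContextItem("current pricing", "pricing-db", datetime(2024, 1, 9, tzinfo=timezone.utc)),
    ContextItem("old pricing", "cache", datetime(2023, 11, 1, tzinfo=timezone.utc)),
]
fresh, stale = split_by_freshness(items, max_age=timedelta(days=7), now=now)
```

Whether stale items get dropped, flagged in the prompt, or only logged is a design choice. The point is that the receiving agent, and anyone reading the trace, can see the age and source at all.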

[Figure: a multi-agent trace showing context flow between agents. Agent A passes context to Agent B; the trace records timestamp, source, age, and confidence.]
Without trace data, you cannot see where the failure happened.

Now I add observability before debugging. Not after. The first question I ask when a multi-agent system misbehaves is not “what is the output?” It is “can I see inside the orchestration?”

What this changes in practice: add tracing and observability to multi-agent systems before they fail, not after. You cannot debug what you cannot see.
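As a rough sketch of what "tracing the handoff" can mean at its simplest, the function below appends one structured event per handoff, carrying a timestamp, source, age, and confidence. `record_handoff`, its parameters, and the in-memory list are hypothetical, chosen for illustration. A real system would more likely send these events to a tracing backend.

```python
import json
from datetime import datetime, timezone


def record_handoff(trace, sender, receiver, payload, *,
                   source, data_timestamp, confidence, now=None):
    """Append one structured event describing a handoff between two agents."""
    now = now or datetime.now(timezone.utc)
    event = {
        "timestamp": now.isoformat(),
        "from": sender,
        "to": receiver,
        "source": source,
        # Age of the data at handoff time, so staleness is visible in the trace.
        "age_seconds": (now - data_timestamp).total_seconds(),
        "confidence": confidence,
        # A short preview keeps the trace readable without storing the full payload.
        "payload_preview": payload[:80],
    }
    trace.append(event)
    return event


# Illustrative usage with fixed times so the result is deterministic.
trace = []
event = record_handoff(
    trace,
    sender="retrieval-agent",
    receiver="generation-agent",
    payload="Plan A costs 10 per month.",
    source="cache",
    data_timestamp=datetime(2024, 1, 10, 11, 0, tzinfo=timezone.utc),
    confidence=0.6,
    now=datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc),
)
trace_line = json.dumps(event)  # one JSON line per handoff, easy to grep later
```

With events like this in place, a stale handoff shows up as a large `age_seconds` next to a `source` such as a cache, which is the detail I could not see in the original system.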