AI Engineering

From Spans to Trajectories: Observability for Long-Running Agents

Agents have evolved. We've moved from orchestration frameworks, where agents follow fixed steps and turns defined upfront, to harnesses, where the LLM uses skills and tools to chart its own trajectory. Modern agents run for hours or days and produce hundreds to thousands of steps in a single session, which calls for a fundamentally different approach to monitoring and evaluating them in production.

This talk shares what we've learned building observability infrastructure for agent harnesses at HoneyHive. We'll start with why the harness, not the model, has become the hardest engineering problem in production AI, and why traditional APM breaks down when a trace holds 10,000 spans and the failure sits four tool calls deep. We'll walk through a live trajectory view to see what long-running agent traces actually look like at scale, and the specific challenges they create: context rot, semantic failure modes, and the needle-in-a-haystack problem of finding the moment that mattered. Then we'll dig into skills as the new unit of behavior and the dual role of clustering in agent development: unsupervised clustering for discovering emergent patterns and identifying where guardrails are needed, and supervised classifiers for production evaluation at scale. We'll close on what comes next: swarm observability for multi-agent systems.

Speakers