Agent Infrastructure

Building Durable, Long-Running Autonomous Agents

Most AI agents work in demos. Few survive in production. LLMs are stateless. Infrastructure fails. Context windows reset. Real-world objectives span hours or days. Building long-running autonomous agents requires durability engineered across the entire system. This talk compares the dominant approaches to agent durability and presents three pillars of durable agentic systems.

1. Durable Execution

Agents must survive crashes, retries, and partial task completion. Durable execution engines like Temporal persist workflow state and enable deterministic replay. Graph-based orchestrators such as LangGraph model control flow as explicit state machines. These approaches reflect different assumptions about recovery, replayability, and operational resilience, and they directly shape how agents behave under failure.

2. Durable Autonomy

Autonomous systems inevitably encounter ambiguity and incomplete information. Durable autonomy means designing agents that recognize uncertainty, escalate intelligently to humans when necessary, and resume coherently without losing progress. We'll examine architectural patterns for human-in-the-loop integration that preserve control while maintaining forward momentum.

3. Durable Statefulness

Long-running agents cannot rely on ever-growing prompts. Some systems serialize state into resumable bursts using patterns like Anthropic's Git-Commit approach. Others externalize cognition into layered memory architectures, separating working, episodic, semantic, and procedural memory through memory virtualization. Different workloads and time horizons demand different state strategies.

Attendees will leave with a deeper understanding of agent durability and a practical architectural framework for building resilient agents: systems designed not just to respond but to endure.
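As a rough illustration of the durable execution idea (persist each completed step, then replay recorded results after a crash), here is a minimal Python sketch. It is not Temporal's or LangGraph's API; the `StepJournal` class, the step names, and the JSON journal file are all hypothetical choices made for this example.

```python
import json
import os
import tempfile


class StepJournal:
    """Hypothetical helper: records each completed step's result on disk."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def run(self, name, fn):
        # Replay: a step that already completed returns its recorded result.
        if name in self.results:
            return self.results[name]
        result = fn()
        self.results[name] = result
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a half-written journal behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.results, f)
        os.replace(tmp, self.path)
        return result


def demo():
    path = os.path.join(tempfile.mkdtemp(), "journal.json")
    calls = []  # tracks which steps really executed

    def plan():
        calls.append("plan")
        return ["fetch", "summarize"]

    def flaky_fetch():
        calls.append("fetch")
        if calls.count("fetch") == 1:
            raise RuntimeError("simulated crash")
        return "raw data"

    def agent(journal):
        steps = journal.run("plan", plan)
        data = journal.run("fetch", flaky_fetch)
        return steps, data

    try:
        agent(StepJournal(path))  # first attempt crashes during "fetch"
    except RuntimeError:
        pass
    # A new StepJournal instance stands in for a restarted process.
    result = agent(StepJournal(path))
    return result, calls
```

In this sketch the restarted run does not execute `plan` again; it reads the recorded result and retries only the step that failed. Production engines add much more (timers, versioning, determinism checks), which this example deliberately leaves out.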

Speakers