Agent Infrastructure

Inference for Async Agents in Production

As models improve, we are starting to build long-running, asynchronous agents, such as deep research agents and browser agents, that can execute multi-step workflows autonomously. These systems unlock new use cases, but they consume orders of magnitude more tokens and compute, which creates scaling bottlenecks. This talk discusses practical strategies builders can use to maximize async agent performance while keeping inference costs under control. Topics include context engineering, compaction, cache maintenance, model routing, and batch inference. The talk is aimed at use-case developers, with secondary relevance to platform engineers.

Speakers