Benchmarking AI Agents Against Realistic Analytical Tasks with ADE-bench
Many benchmarks attempt to measure how well LLMs and AI agents can write SQL queries or perform complicated statistical analysis. But as most practitioners know, this is only a small part of our job. Before we can write a query, we have to figure out the business context behind the question. We have to figure out which tables to use in a messy database. We have to make subjective decisions about vaguely defined problems. All of this makes benchmarking analytical agents difficult. We built a new benchmark, ADE-bench, that aspires to do exactly that. It gives agents complex analytical environments to work in and ambiguous tasks to solve, and measures how well they perform. In this talk, we'll share how we built the benchmark, the results of our tests, a bunch of things we learned along the way, and what we think is coming next. The benchmark harness is open source and can be found here: https://github.com/dbt-labs/ade-bench
Speakers