AGI is coming. Or is it? How good are agents, really, at doing things in production systems? And how well can they actually replace humans at the mundane tasks inside those systems?
This is the first version of the Agent-API Benchmark. In it, we're exploring how well agents can "do things" in production systems.
Current benchmarks tell you whether a model can write Shakespeare or solve math problems. We don't care about that - we want to know how reliably models work in the real world, in the day-to-day processes we're claiming they'll automate: accessing your CRM, updating your billing system, or handling requests between the two.
We built this benchmark to explore how well agents can execute against APIs:
- Which LLMs can reliably build working integrations into your tech stack?
- Which APIs are actually usable by agents?
- Where do agents fail, and why?
- What makes an API "agent-ready"?
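To make "execute against APIs" concrete, here's a minimal sketch of what a single benchmark task could look like: the agent gets a natural-language instruction plus API documentation, and a harness checks its work by querying the system's state afterwards. All names, endpoints, and the `verify` helper below are hypothetical illustrations, not our actual harness.

```python
import requests

# Hypothetical task definition: the agent is handed the instruction and
# the API docs, then left to make the calls on its own.
TASK = {
    "instruction": "Create a contact named 'Ada Lovelace' in the CRM "
                   "and add her to the 'Newsletter' list.",
    "docs_url": "https://example-crm.dev/api/docs",  # illustrative URL
}

def verify(base_url: str, api_key: str) -> bool:
    """Check the CRM's state after the agent runs (endpoints are made up)."""
    headers = {"Authorization": f"Bearer {api_key}"}

    # Did the agent actually create the contact?
    resp = requests.get(f"{base_url}/contacts",
                        params={"q": "Ada Lovelace"}, headers=headers)
    contacts = resp.json().get("results", [])
    if not contacts:
        return False

    # And did it finish the second step, or stop halfway?
    contact_id = contacts[0]["id"]
    resp = requests.get(f"{base_url}/lists/newsletter/members/{contact_id}",
                        headers=headers)
    return resp.status_code == 200
```

Grading against the system's final state, rather than against the agent's transcript, is what lets a task like this separate "sounded right" from "actually worked".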