The Agent-API Benchmark

Which LLMs handle APIs best? Which APIs can agents actually work with?

Why This Benchmark Exists

AGI is coming. Or is it? How good are agents, really, at doing things in production systems? And how well can they actually replace humans at the mundane tasks those systems involve?

This is the first version of the Agent-API Benchmark. In it, we're exploring how well agents can "do things" in production systems.

Current benchmarks tell you if a model can write Shakespeare or solve math problems. We don't care about that; we want to know how reliably models work IRL, in the day-to-day work processes we're claiming they'll automate, whether that's accessing your CRM, your billing system, or handling requests between those systems.

We built this benchmark to explore how well agents can execute against APIs:

  • Which LLMs can reliably build working integrations into your tech stack?
  • Which APIs are actually usable by agents?
  • Where do agents fail, and why?
  • What makes an API "agent-ready"?

Best LLMs for Building Integrations

Average success rate across all tested API integration tasks:

Rank | LLM | Success Rate

[1] superglue is an integration layer designed specifically for agent-API integrations, not a general-purpose LLM

Best Agent-Ready APIs

Which APIs can agents figure out and use without human help?

Prompts we tested:

Slack: "Find user ID by email, then send direct message"
JIRA: "Get sprint issues, calculate completion %, identify blocked/high-priority items"
Notion: "Query database, find duplicate emails, return count and list"
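The Slack prompt above is a typical two-step chain: one call to resolve an ID, a second call that depends on its result. A minimal sketch of a successful run, assuming Slack's Web API (`users.lookupByEmail` and `chat.postMessage` are real Slack endpoints; the `http_post` callable is injected here so the flow can be exercised without network access):

```python
# Sketch of the two-step Slack task: resolve a user ID by email,
# then send that user a direct message. `http_post` is a stand-in
# for an HTTP client that POSTs and returns the parsed JSON body.
def send_dm_by_email(http_post, token, email, text):
    headers = {"Authorization": f"Bearer {token}"}

    # Step 1: resolve the email address to a Slack user ID.
    user = http_post(
        "https://slack.com/api/users.lookupByEmail",
        headers=headers,
        data={"email": email},
    )
    if not user.get("ok"):
        raise RuntimeError(f"lookup failed: {user.get('error')}")
    user_id = user["user"]["id"]

    # Step 2: post the message. chat.postMessage accepts a user ID
    # as the channel, which opens a direct message with that user.
    result = http_post(
        "https://slack.com/api/chat.postMessage",
        headers=headers,
        data={"channel": user_id, "text": text},
    )
    if not result.get("ok"):
        raise RuntimeError(f"send failed: {result.get('error')}")
    return user_id
```

The failure mode we saw most often is exactly the hand-off between the two steps: an agent that mishandles the first response has nothing valid to pass to the second call.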
Rank | API | Score | superglue | claude-4-sonnet | claude-4-opus | gpt-4.1 | o4-mini | gemini-2.5-flash

Key Findings

  • 84% vs 50-62%: Specialized agent platforms outperform general-purpose LLMs by 22-34 points
  • Only 6 APIs work reliably across LLMs: all of them are well-documented and use open standards
  • Multi-step workflows expose weaknesses: Most LLMs can't chain API calls reliably

What Makes APIs Agent-Ready

  • Clear endpoints: /users/123 not /v2/entities?type=user&id=123
  • Standard auth: OAuth, Bearer tokens, API keys in headers
  • Real error messages: "User not found" not "Error 1047"
  • Consistent responses: Same structure every time
  • No custom query languages or weird filters
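The error-message point matters even more for agents than for humans: an agent can only plan a recovery step if the error names the problem. A sketch with hypothetical payloads (neither is from a specific API in the benchmark) showing the difference an agent sees:

```python
# Hypothetical error payloads illustrating agent-readable vs opaque errors.
opaque = {"error": 1047}  # a bare code: the agent can only guess or give up
descriptive = {           # a named error: the agent can self-correct
    "error": "user_not_found",
    "message": "No user with email dana@example.com",
}

def agent_can_recover(response: dict) -> bool:
    """An agent can plan a fix only when the error names the problem."""
    err = response.get("error")
    return isinstance(err, str) and err != ""
```

In our runs, APIs that returned named errors let agents retry with corrected parameters; numeric codes usually ended the attempt.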

Methodology

TL;DR: We tested 21 APIs across 6 platforms (5 general-purpose LLMs plus superglue).

Out of 630 integration attempts (21 APIs × 6 platforms × 5 attempts each):

  • 23% failed - the agent couldn't complete even basic tasks
  • Only 6 APIs worked 100% of the time across all platforms
  • Custom query and request schemes are the biggest struggle; they usually require careful planning and prompt engineering
  • superglue beats general-purpose LLMs by 22-34 points - purpose-built wins
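The headline counts follow directly from the setup; a quick sanity check of the arithmetic (all per-run figures are the ones stated above):

```python
# Sanity-check the benchmark arithmetic reported in this post.
apis, platforms, attempts_each = 21, 6, 5
total_runs = apis * platforms * attempts_each
print(total_runs)   # 630 integration attempts

failed_share = 0.23  # reported failure rate (rounded)
failed_runs = round(total_runs * failed_share)
print(failed_runs)  # roughly 145 failed runs
```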

Note: superglue is an integration layer designed specifically for agent-API integrations, not a general-purpose LLM. We included it to show the performance gap between specialized agent systems and general language models.

All evaluation code is open source. Check out the full benchmark implementation on GitHub to run your own tests or contribute new APIs.

See you in the comments

We hope you found this helpful and would love to hear from you on LinkedIn, Twitter and GitHub.

Connect with us via these channels for any inquiries:

Contact: founders@superglue.ai
