AGI is coming. Or is it? How good are agents, really, at doing things in production systems? And how well can they actually replace humans at the mundane tasks inside those systems?
This is the first version of the Agent-API Benchmark. In it, we're exploring how well agents can "do things" in production systems.
Current benchmarks tell you whether a model can write Shakespeare or solve math problems. We don't care about that - we want to know how reliably models work in the real world, in the day-to-day processes we're claiming they'll automate: accessing your CRM, updating your billing system, or handling requests between the two.
We built this benchmark to explore how well agents can execute against APIs:
- Which LLMs can reliably build working integrations into your tech stack?
- Which APIs are actually usable by agents?
- Where do agents fail, and why?
- What makes an API "agent-ready"?
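To make "execute against APIs" concrete, here's a minimal sketch of what a single benchmark task could look like: the agent gets a natural-language instruction plus API documentation, and a harness checks its work by querying the system's state afterwards. All names, endpoints, and the `verify` helper below are hypothetical illustrations, not our actual harness.

```python
import requests

# Hypothetical task definition: the agent is handed the instruction and
# the API docs, then left to make the calls on its own.
TASK = {
    "instruction": "Create a contact named 'Ada Lovelace' in the CRM "
                   "and add her to the 'Newsletter' list.",
    "docs_url": "https://example-crm.dev/api/docs",  # illustrative URL
}

def verify(base_url: str, api_key: str) -> bool:
    """Check the CRM's state after the agent runs (endpoints are made up)."""
    headers = {"Authorization": f"Bearer {api_key}"}

    # Did the agent actually create the contact?
    resp = requests.get(f"{base_url}/contacts",
                        params={"q": "Ada Lovelace"}, headers=headers)
    contacts = resp.json().get("results", [])
    if not contacts:
        return False

    # And did it finish the second step, or stop halfway?
    contact_id = contacts[0]["id"]
    resp = requests.get(f"{base_url}/lists/newsletter/members/{contact_id}",
                        headers=headers)
    return resp.status_code == 200
```

Grading against the system's final state, rather than against the agent's transcript, is what lets a task like this separate "sounded right" from "actually worked".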