What Can AI Agents Actually Do Today?

A data-driven capability map, updated weekly. Automated evaluation of model × framework combinations across 15 real-world agent scenarios. Open benchmarks, real numbers, no vibes.

Use Case | Status | Completion | Assists | 7d
Simple API Integration
Status: Production Ready (CRR ≥ 95: reliably handles this capability in production)
Assists: 0.2
7d: +0.3
Best Configuration: Claude Sonnet 4.6 + LangGraph
Completion: 97.0%, First-try: 92.0%
Key Gaps: None
Scenarios (2):
S01 (easy) Look Up Order Status, 3 sub-tasks: 97.2%
S02 (easy) Send a Weather Briefing, 4 sub-tasks: 94.6%

Research & Analysis
Status: Production Ready (CRR ≥ 95: reliably handles this capability in production)
Assists: 0.4
7d: +1.2
Best Configuration: Claude Sonnet 4.6 + LangGraph
Completion: 96.0%, First-try: 88.0%
Key Gaps: None
Scenarios (1):
S06 (medium) Compile a Competitive Research Brief, 8 sub-tasks: 88.7%

Multi-Step Data Pipeline
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Assists: 1.4
7d: +2.8
Best Configuration: GPT-5.4 Pro + OpenAI Agents SDK
Completion: 89.0%, First-try: 72.0%
Key Gaps: Timeout Handling, Partial Failures
Scenarios (1):
S12 (hard) Multi-Step Data Pipeline, 8 sub-tasks: 70.6%

Customer Support Triage
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Assists: 1.2
7d: +1.5
Best Configuration: Claude Opus 4.6 + CrewAI
Completion: 87.0%, First-try: 74.0%
Key Gaps: Edge Cases, Policy Interpretation
Scenarios (3):
S03 (medium) Process a Customer Refund, 7 sub-tasks: 87.4%
S05 (medium) Translate and Route a Support Ticket, 5 sub-tasks: 85.9%
S07 (medium) Investigate a Billing Discrepancy, 6 sub-tasks: 84.3%

Operations & Admin
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Assists: 1.5
7d: +0.2
Best Configuration: Claude Sonnet 4.6 + LangGraph
Completion: 85.0%, First-try: 70.0%
Key Gaps: Edge Cases
Scenarios (2):
S08 (medium) Onboard a New Employee, 7 sub-tasks: 86.2%
S09 (medium) Generate and Deliver a Daily Digest, 6 sub-tasks: 85.1%

Scheduling & Coordination
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Assists: 1.8
7d: +1.0
Best Configuration: Claude Sonnet 4.6 + LangGraph
Completion: 83.0%, First-try: 68.0%
Key Gaps: Timezone Handling, Conflict Resolution
Scenarios (2):
S04 (medium) Schedule a Cross-Timezone Meeting, 6 sub-tasks: 82.1%
S14 (hard) Resolve Conflicting Calendar Chaos, 8 sub-tasks: 65.2%

Autonomous Procurement
Status: Experimental (CRR 60–79: works sometimes, needs human oversight)
Assists: 2.8
7d: +3.2
Best Configuration: Claude Sonnet 4.6 + LangGraph
Completion: 71.0%, First-try: 48.0%
Key Gaps: Budget Tracking, Vendor Comparison, Approval Handling
Scenarios (3):
S11 (hard) Plan Event Logistics Under Budget, 9 sub-tasks: 72.4%
S13 (hard) Coordinate a Product Recall, 10 sub-tasks: 62.5%
S15 (hard) Autonomous Procurement Workflow, 11 sub-tasks: 58.9%

Infrastructure Mgmt
Status: Experimental (CRR 60–79: works sometimes, needs human oversight)
Assists: 3.2
7d: +0.4
Best Configuration: GPT-5.4 Pro + LangGraph
Completion: 67.0%, First-try: 42.0%
Key Gaps: Root Cause Analysis, Code Execution Timeout
Scenarios (1):
S10 (hard) Debug a Failing Script, 7 sub-tasks: 67.8%
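
The status labels above follow three CRR bands (≥ 95, 80–94, 60–79). A minimal sketch of that mapping follows, under two assumptions the page does not confirm: that the band is derived from the best configuration's completion rate (which is consistent with the figures listed), and that fractional scores between 94 and 95 fall into the lower band. The function name `status_tier` and the "Below Experimental" label are hypothetical; the page shows no tier under 60.

```python
def status_tier(crr: float) -> str:
    """Map a CRR score (0-100) to the status label used on this page.

    Assumption: scores are compared against the lower bound of each band,
    so 94.6 counts as "Nearly There" rather than "Production Ready".
    """
    if crr >= 95:
        return "Production Ready"
    if crr >= 80:
        return "Nearly There"
    if crr >= 60:
        return "Experimental"
    # Hypothetical label: the page does not define a tier below 60.
    return "Below Experimental"


# Best-configuration completion rates as listed for each use case.
best_completion = {
    "Simple API Integration": 97.0,
    "Research & Analysis": 96.0,
    "Multi-Step Data Pipeline": 89.0,
    "Customer Support Triage": 87.0,
    "Operations & Admin": 85.0,
    "Scheduling & Coordination": 83.0,
    "Autonomous Procurement": 71.0,
    "Infrastructure Mgmt": 67.0,
}

if __name__ == "__main__":
    for use_case, crr in best_completion.items():
        print(f"{use_case}: {status_tier(crr)}")
```

Under these assumptions the sketch reproduces the eight status labels shown above; individual scenario scores (for example S06 at 88.7%) can sit in a lower band than their use case.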

Leaderboard

#  Configuration               CRR       Elo   Cost
1  Claude Sonnet 4.6 + Raw FC  73.3±4.2  1720  $0.510
2  GPT-5.4 + Raw FC            67.7±5.1  1680  $0.090
3  Grok 4.20 Beta + Raw FC     65.5±5.5  1650  $0.120
4  DeepSeek V3.2 + Raw FC      65.4±6.0  1645  $0.190
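
Each leaderboard CRR carries a ± margin, which matters when reading the ranking. The sketch below checks which pairs of configurations have overlapping ranges. It assumes the ± value is a symmetric half-width around the CRR (the page does not say what kind of interval it is), and the helper names are hypothetical. Overlapping ranges are only a rough signal, not a formal significance test.

```python
from itertools import combinations

# (model, CRR, margin) as listed on the leaderboard.
leaderboard = [
    ("Claude Sonnet 4.6", 73.3, 4.2),
    ("GPT-5.4", 67.7, 5.1),
    ("Grok 4.20 Beta", 65.5, 5.5),
    ("DeepSeek V3.2", 65.4, 6.0),
]


def interval(crr: float, margin: float) -> tuple[float, float]:
    """Return (low, high), assuming the margin is a symmetric half-width."""
    return (crr - margin, crr + margin)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_pairs(rows):
    """List every pair of configurations whose CRR ranges overlap."""
    pairs = []
    for (name_a, crr_a, m_a), (name_b, crr_b, m_b) in combinations(rows, 2):
        if overlaps(interval(crr_a, m_a), interval(crr_b, m_b)):
            pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    for name_a, name_b in overlapping_pairs(leaderboard):
        print(f"{name_a} and {name_b}: ranges overlap")
```

With the figures above, every pair overlaps under this assumption, so small CRR differences between adjacent ranks should be read with caution.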

Think your agent can do better?

Benchmark against 15 real-world scenarios. See where your agent excels — and where it breaks.

2,047 developers already testing


Last updated: March 15, 2026
CRTF