What Can AI Agents Actually Do Today?

A data-driven capability map, updated weekly. Automated evaluation of model × framework combinations across 14 real-world agent scenarios. Open benchmarks, real numbers, no vibes.


Use Case | Status | Completion | Assists | 7d

Simple API Integration
Status: Production Ready (CRR ≥ 95: reliably handles this capability in production)
Completion: 0.0% | Assists: 0.2 | 7d: ▲0.3

Best Configuration

Claude Sonnet 4.6 + LangGraph

Completion: 97.0% | First-try: 92.0%

Key Gaps

None

Scenarios (2)

S01 | easy | Look Up Order Status | 3 sub-tasks | 97.2%
S02 | easy | Send a Weather Briefing | 4 sub-tasks | 94.6%

Research & Analysis
Status: Production Ready (CRR ≥ 95: reliably handles this capability in production)
Completion: 0.0% | Assists: 0.4 | 7d: ▲1.2

Best Configuration

Claude Sonnet 4.6 + LangGraph

Completion: 96.0% | First-try: 88.0%

Key Gaps

None

Scenarios (1)

S06 | medium | Compile a Competitive Research Brief | 8 sub-tasks | 88.7%

Multi-Step Data Pipeline
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Completion: 0.0% | Assists: 1.4 | 7d: ▲2.8

Best Configuration

GPT-5.4 Pro + OpenAI Agents SDK

Completion: 89.0% | First-try: 72.0%

Key Gaps

Timeout Handling, Partial Failures

Scenarios (1)

S12 | hard | Multi-Step Data Pipeline | 8 sub-tasks | 70.6%

Customer Support Triage
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Completion: 0.0% | Assists: 1.2 | 7d: ▲1.5

Best Configuration

Claude Opus 4.6 + CrewAI

Completion: 87.0% | First-try: 74.0%

Key Gaps

Edge Cases, Policy Interpretation

Scenarios (3)

S03 | medium | Process a Customer Refund | 7 sub-tasks | 87.4%
S05 | medium | Translate and Route a Support Ticket | 5 sub-tasks | 85.9%
S07 | medium | Investigate a Billing Discrepancy | 6 sub-tasks | 84.3%

Operations & Admin
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Completion: 0.0% | Assists: 1.5 | 7d: ▲0.2

Best Configuration

Claude Sonnet 4.6 + LangGraph

Completion: 85.0% | First-try: 70.0%

Key Gaps

Edge Cases

Scenarios (2)

S08 | medium | Onboard a New Employee | 7 sub-tasks | 86.2%
S09 | medium | Generate and Deliver a Daily Digest | 6 sub-tasks | 85.1%

Scheduling & Coordination
Status: Nearly There (CRR 80–94: close to production quality, occasional failures)
Completion: 0.0% | Assists: 1.8 | 7d: ▲1.0

Best Configuration

Claude Sonnet 4.6 + LangGraph

Completion: 83.0% | First-try: 68.0%

Key Gaps

Timezone Handling, Conflict Resolution

Scenarios (2)

S04 | medium | Schedule a Cross-Timezone Meeting | 6 sub-tasks | 82.1%
S14 | hard | Resolve Conflicting Calendar Chaos | 8 sub-tasks | 65.2%

Autonomous Procurement
Status: Experimental (CRR 60–79: works sometimes, needs human oversight)
Completion: 0.0% | Assists: 2.8 | 7d: ▲3.2

Best Configuration

Claude Sonnet 4.6 + LangGraph

Completion: 71.0% | First-try: 48.0%

Key Gaps

Budget Tracking, Vendor Comparison, Approval Handling

Scenarios (3)

S11 | hard | Plan Event Logistics Under Budget | 9 sub-tasks | 72.4%
S13 | hard | Coordinate a Product Recall | 10 sub-tasks | 62.5%
S15 | hard | Autonomous Procurement Workflow | 11 sub-tasks | 58.9%

Infrastructure Mgmt
Status: Experimental (CRR 60–79: works sometimes, needs human oversight)
Completion: 0.0% | Assists: 3.2 | 7d: ▼0.4

Best Configuration

GPT-5.4 Pro + LangGraph

Completion: 67.0% | First-try: 42.0%

Key Gaps

Root Cause Analysis, Code Execution Timeout

Scenarios (1)

S10 | hard | Debug a Failing Script | 7 sub-tasks | 67.8%
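The status labels in the capability map follow fixed CRR bands: 95 and above, 80–94, and 60–79. A minimal Python sketch of that mapping is below. It makes several assumptions. The function name `status_tier` is hypothetical. Scores between two published bands (for example 94.5) are assumed to fall into the lower band. The page does not name a tier below 60, so the fallback label is a placeholder. The sample inputs are the best-configuration completion figures listed above, which appear to line up with the published labels.

```python
def status_tier(crr: float) -> str:
    """Map a CRR score (0-100) to a capability-map status label.

    Thresholds come from the band descriptions on the page; treating
    each band as "at least its lower bound" is an assumption.
    """
    if crr >= 95:
        return "Production Ready"
    if crr >= 80:
        return "Nearly There"
    if crr >= 60:
        return "Experimental"
    # Hypothetical placeholder: the page lists no tier under 60.
    return "Unrated"


# Best-configuration completion figures copied from the capability map.
best_completion = {
    "Simple API Integration": 97.0,
    "Research & Analysis": 96.0,
    "Multi-Step Data Pipeline": 89.0,
    "Customer Support Triage": 87.0,
    "Operations & Admin": 85.0,
    "Scheduling & Coordination": 83.0,
    "Autonomous Procurement": 71.0,
    "Infrastructure Mgmt": 67.0,
}

tiers = {name: status_tier(score) for name, score in best_completion.items()}

for name, tier in tiers.items():
    print(f"{name}: {tier}")
```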

Leaderboard


# | Configuration | CRR | Elo | Cost | 7d
1 | Claude Sonnet 4.6 (Raw FC) | 73.3 ± 4.2 | 1720 | $0.510 | -
2 | GPT-5.4 (Raw FC) | 67.7 ± 5.1 | 1680 | $0.090 | -
3 | Grok 4.20 Beta (Raw FC) | 65.5 ± 5.5 | 1650 | $0.120 | -
4 | DeepSeek V3.2 (Raw FC) | 65.4 ± 6.0 | 1645 | $0.190 | -
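The leaderboard reports each CRR with a ± margin, so small gaps between configurations may not be meaningful. The sketch below checks which pairs of entries have overlapping ranges. It assumes the ± value is a symmetric margin around the CRR; the page does not say whether it is a confidence interval or a standard deviation. The helper names `interval` and `overlaps` are hypothetical.

```python
from itertools import combinations

# (CRR, margin) pairs copied from the leaderboard; all entries are Raw FC.
leaderboard = {
    "Claude Sonnet 4.6": (73.3, 4.2),
    "GPT-5.4": (67.7, 5.1),
    "Grok 4.20 Beta": (65.5, 5.5),
    "DeepSeek V3.2": (65.4, 6.0),
}


def interval(crr: float, margin: float) -> tuple[float, float]:
    """Return (low, high), assuming a symmetric margin around the CRR."""
    return (crr - margin, crr + margin)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the ranges of two (CRR, margin) entries intersect."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a


overlapping_pairs = [
    (x, y)
    for x, y in combinations(leaderboard, 2)
    if overlaps(leaderboard[x], leaderboard[y])
]

for x, y in overlapping_pairs:
    print(f"{x} and {y}: ranges overlap")
```

Under this reading, overlapping ranges suggest the ordering between those two entries could change with more evaluation runs.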

Think your agent can do better?

Benchmark your agent against real scenarios. Any framework — Anthropic, OpenAI, LangGraph, or raw HTTP.

pip install crtf · 30-line quickstart · Free during beta


Run your own evaluations

Last updated: April 9, 2026
