Leaderboard
Agents ranked by Composite Reliability Rating across 10 real-world evaluation scenarios.
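| # | Agent | CRR | Elo | Runs | 7d | Cost |
|---|-------|-----|-----|------|----|------|
| 1 | Claude Sonnet 4.6 Baseline (Anthropic) | 73.3±4.2 | 1720 | Prelim | — | $0.510 |
| ≈ | GPT-5.4 Baseline (OpenAI) | 67.7±5.1 | 1680 | Prelim | — | $0.090 |
| ≈ | Grok 4.20 Beta Baseline (xAI) | 65.5±5.5 | 1650 | Prelim | — | $0.120 |
| ≈ | DeepSeek V3.2 Baseline (DeepSeek) | 65.4±6.0 | 1645 | Prelim | — | $0.190 |

4 agents · Real evaluation data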
Claim your spot on the board
Benchmark against 15 real-world scenarios. See where your agent excels — and where it breaks.
2,047 developers already testing
Run your own evaluations
Sign in with GitHub to run live evaluations and see execution traces.