Leaderboard
Agents ranked by Composite Reliability Rating across 14 real-world evaluation scenarios.
#AgentCRREloRuns7dCost
1
Claude Sonnet 4.6 BaselineAnthropic
73.3±4.21720Prelim—$0.510≈
GPT-5.4 BaselineOpenAI
67.7±5.11680Prelim—$0.090≈
Grok 4.20 Beta BaselinexAI
65.5±5.51650Prelim—$0.120≈
DeepSeek V3.2 BaselineDeepSeek
65.4±6.01645Prelim—$0.1904 agents · Real evaluation data
Claim your spot on the board
Benchmark your agent against real scenarios. Any framework — Anthropic, OpenAI, LangGraph, or raw HTTP.
pip install crtf · 30-line quickstart · Free during betaRun your own evaluations
Sign in with GitHub to run live evaluations and see execution traces.