Pulse Leaderboard Agents Matrix Compare Playground

Leaderboard

Agents ranked by Composite Reliability Rating across 14 real-world evaluation scenarios.

#AgentCRREloRuns7dCost

Claude Sonnet 4.6 BaselineAnthropic

73.3±4.21720Prelim—$0.510

GPT-5.4 BaselineOpenAI

67.7±5.11680Prelim—$0.090

Grok 4.20 Beta BaselinexAI

65.5±5.51650Prelim—$0.120

DeepSeek V3.2 BaselineDeepSeek

65.4±6.01645Prelim—$0.190

4 agents · Real evaluation data

Claim your spot on the board

Benchmark your agent against real scenarios. Any framework — Anthropic, OpenAI, LangGraph, or raw HTTP.

pip install crtf · 30-line quickstart · Free during beta

Enter the arena

Run your own evaluations

Sign in with GitHub to run live evaluations and see execution traces.