Model × Framework Matrix

CRR scores for every evaluated model-framework pairing, derived from agent evaluations.

48 of 66 baselines evaluated

| Model | AutoGen | CrewAI | LangGraph | OpenAI Agents SDK | Raw FC | Smolagents |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 87.1 | 89.1 | 92.1 | 91.1 | 87.9 | — |
| Claude Sonnet 4.6 | 89.2 | 91.3 | 94.2 | 93.1 | 90.2 | 88.2 |
| DeepSeek V3.2 | 83.4 | 86.2 | 82.2 | — | — | — |
| Gemini 3 Flash | 76.8 | 79.8 | 75.8 | — | — | — |
| Gemini 3.1 Pro | 85.6 | 87.6 | 90.6 | 89.6 | 86.6 | 84.6 |
| GPT-5.4 | 86.1 | 88.1 | 91.2 | 93.4 | 87.2 | — |
| GPT-5.4 Pro | 86.8 | 88.8 | 91.8 | 93.8 | 87.8 | — |
| Grok 4.20 Beta | 84.4 | 87.4 | 86.4 | 83.2 | — | — |
| Llama 4 Maverick | 77.1 | 79.1 | 82.1 | 78.1 | — | — |
| Mistral Large | 76.4 | 78.4 | 81.4 | 77.4 | — | — |
| Qwen 3.5 | 81.8 | 85.1 | 81.1 | — | — | — |

Dashes mark the 18 model-framework baselines not yet evaluated.
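One way to read the matrix is to aggregate CRR by framework. A minimal sketch in Python, using only the two models evaluated on all six frameworks so that no missing cells need to be guessed; the names and scores are copied from the table above, and the helper `framework_means` is illustrative, not part of any published API:

```python
# Frameworks in the matrix's column order.
FRAMEWORKS = ["AutoGen", "CrewAI", "LangGraph",
              "OpenAI Agents SDK", "Raw FC", "Smolagents"]

# CRR scores for the two models with all six frameworks evaluated
# (values copied from the matrix above).
SCORES = {
    "Claude Sonnet 4.6": [89.2, 91.3, 94.2, 93.1, 90.2, 88.2],
    "Gemini 3.1 Pro":    [85.6, 87.6, 90.6, 89.6, 86.6, 84.6],
}

def framework_means(scores: dict[str, list[float]]) -> dict[str, float]:
    """Average CRR per framework across the fully evaluated models."""
    return {
        fw: sum(row[i] for row in scores.values()) / len(scores)
        for i, fw in enumerate(FRAMEWORKS)
    }

means = framework_means(SCORES)
best = max(means, key=means.get)
print(best, round(means[best], 1))  # LangGraph leads both rows
```

On these two rows the ranking is the same for both models (LangGraph highest, Smolagents lowest), but averages over partially evaluated rows would be skewed by which frameworks happen to be missing, which is why the sketch restricts itself to complete rows.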

See a gap? Fill it.

Benchmark against 15 real-world scenarios. See where your agent excels — and where it breaks.
