Model × Framework Matrix
CRR scores at every model-framework intersection, derived from agent evaluations.
48 of the 66 baselines (11 models × 6 frameworks) evaluated; a dash marks a combination not yet run.
| Model | AutoGen | CrewAI | LangGraph | OpenAI Agents SDK | Raw FC | Smolagents |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 87.1 | 89.1 | 92.1 | 91.1 | 87.9 | — |
| Claude Sonnet 4.6 | 89.2 | 91.3 | 94.2 | 93.1 | 90.2 | 88.2 |
| DeepSeek V3.2 | — | 83.4 | 86.2 | — | 82.2 | — |
| Gemini 3 Flash | — | 76.8 | 79.8 | — | 75.8 | — |
| Gemini 3.1 Pro | 85.6 | 87.6 | 90.6 | 89.6 | 86.6 | 84.6 |
| GPT-5.4 | 86.1 | 88.1 | 91.2 | 93.4 | 87.2 | — |
| GPT-5.4 Pro | 86.8 | 88.8 | 91.8 | 93.8 | 87.8 | — |
| Grok 4.20 Beta | — | 84.4 | 87.4 | 86.4 | 83.2 | — |
| Llama 4 Maverick | 77.1 | 79.1 | 82.1 | — | 78.1 | — |
| Mistral Large | 76.4 | 78.4 | 81.4 | — | 77.4 | — |
| Qwen 3.5 | — | 81.8 | 85.1 | — | 81.1 | — |
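For readers who want to slice the matrix themselves, here is a minimal Python sketch that mirrors the table as a dict and reports the best-scoring framework per model. The scores are copied verbatim from the table above; the `SCORES` layout, the `None` convention for unevaluated cells, and the `best_framework` helper are illustrative assumptions, not part of the benchmark's published tooling.

```python
# Mirror of the Model x Framework matrix above. None stands in for
# unevaluated (dash) cells; all scores are copied from the table.
FRAMEWORKS = ["AutoGen", "CrewAI", "LangGraph", "OpenAI Agents SDK",
              "Raw FC", "Smolagents"]

SCORES = {
    "Claude Opus 4.6":   [87.1, 89.1, 92.1, 91.1, 87.9, None],
    "Claude Sonnet 4.6": [89.2, 91.3, 94.2, 93.1, 90.2, 88.2],
    "DeepSeek V3.2":     [None, 83.4, 86.2, None, 82.2, None],
    "Gemini 3 Flash":    [None, 76.8, 79.8, None, 75.8, None],
    "Gemini 3.1 Pro":    [85.6, 87.6, 90.6, 89.6, 86.6, 84.6],
    "GPT-5.4":           [86.1, 88.1, 91.2, 93.4, 87.2, None],
    "GPT-5.4 Pro":       [86.8, 88.8, 91.8, 93.8, 87.8, None],
    "Grok 4.20 Beta":    [None, 84.4, 87.4, 86.4, 83.2, None],
    "Llama 4 Maverick":  [77.1, 79.1, 82.1, None, 78.1, None],
    "Mistral Large":     [76.4, 78.4, 81.4, None, 77.4, None],
    "Qwen 3.5":          [None, 81.8, 85.1, None, 81.1, None],
}

def best_framework(model: str) -> tuple[str, float]:
    """Return the highest-scoring evaluated framework for a model."""
    pairs = [(fw, s) for fw, s in zip(FRAMEWORKS, SCORES[model]) if s is not None]
    return max(pairs, key=lambda p: p[1])

# Coverage check: counts filled cells out of 11 models x 6 frameworks.
evaluated = sum(s is not None for row in SCORES.values() for s in row)
print(f"coverage: {evaluated}/{len(SCORES) * len(FRAMEWORKS)}")  # 48/66

for model in SCORES:
    fw, score = best_framework(model)
    print(f"{model}: {fw} ({score})")
```

Running the sketch confirms the 48/66 coverage figure and shows LangGraph scoring highest for nine of the eleven models, with the OpenAI Agents SDK leading only for GPT-5.4 and GPT-5.4 Pro.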
See a gap? Fill it.
Benchmark your agent against 15 real-world scenarios to see where it excels and where it breaks.