Model × Framework Matrix
CRR scores at every model-framework intersection, derived from agent evaluations.
48 of the 66 baselines (11 models × 6 frameworks) evaluated; a dash marks a combination not yet run.
| Model | AutoGen | CrewAI | LangGraph | OpenAI Agents SDK | Raw FC | Smolagents |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 87.1 | 89.1 | 92.1 | 91.1 | 87.9 | — |
| Claude Sonnet 4.6 | 89.2 | 91.3 | 94.2 | 93.1 | 90.2 | 88.2 |
| DeepSeek V3.2 | — | 83.4 | 86.2 | — | 82.2 | — |
| Gemini 3 Flash | — | 76.8 | 79.8 | — | 75.8 | — |
| Gemini 3.1 Pro | 85.6 | 87.6 | 90.6 | 89.6 | 86.6 | 84.6 |
| GPT-5.4 | 86.1 | 88.1 | 91.2 | 93.4 | 87.2 | — |
| GPT-5.4 Pro | 86.8 | 88.8 | 91.8 | 93.8 | 87.8 | — |
| Grok 4.20 Beta | — | 84.4 | 87.4 | 86.4 | 83.2 | — |
| Llama 4 Maverick | 77.1 | 79.1 | 82.1 | — | 78.1 | — |
| Mistral Large | 76.4 | 78.4 | 81.4 | — | 77.4 | — |
| Qwen 3.5 | — | 81.8 | 85.1 | — | 81.1 | — |
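For readers who want to slice the matrix themselves, here is a minimal Python sketch that mirrors the table as a dict and reports the best-scoring framework per model. The scores are copied verbatim from the table above; the `SCORES` layout, the `None` convention for unevaluated cells, and the `best_framework` helper are illustrative assumptions, not part of the benchmark's published tooling.

```python
# Mirror of the Model x Framework matrix above. None stands in for
# unevaluated (dash) cells; all scores are copied from the table.
FRAMEWORKS = ["AutoGen", "CrewAI", "LangGraph", "OpenAI Agents SDK",
              "Raw FC", "Smolagents"]

SCORES = {
    "Claude Opus 4.6":   [87.1, 89.1, 92.1, 91.1, 87.9, None],
    "Claude Sonnet 4.6": [89.2, 91.3, 94.2, 93.1, 90.2, 88.2],
    "DeepSeek V3.2":     [None, 83.4, 86.2, None, 82.2, None],
    "Gemini 3 Flash":    [None, 76.8, 79.8, None, 75.8, None],
    "Gemini 3.1 Pro":    [85.6, 87.6, 90.6, 89.6, 86.6, 84.6],
    "GPT-5.4":           [86.1, 88.1, 91.2, 93.4, 87.2, None],
    "GPT-5.4 Pro":       [86.8, 88.8, 91.8, 93.8, 87.8, None],
    "Grok 4.20 Beta":    [None, 84.4, 87.4, 86.4, 83.2, None],
    "Llama 4 Maverick":  [77.1, 79.1, 82.1, None, 78.1, None],
    "Mistral Large":     [76.4, 78.4, 81.4, None, 77.4, None],
    "Qwen 3.5":          [None, 81.8, 85.1, None, 81.1, None],
}

def best_framework(model: str) -> tuple[str, float]:
    """Return the highest-scoring evaluated framework for a model."""
    pairs = [(fw, s) for fw, s in zip(FRAMEWORKS, SCORES[model]) if s is not None]
    return max(pairs, key=lambda p: p[1])

# Coverage check: counts filled cells out of 11 models x 6 frameworks.
evaluated = sum(s is not None for row in SCORES.values() for s in row)
print(f"coverage: {evaluated}/{len(SCORES) * len(FRAMEWORKS)}")  # 48/66

for model in SCORES:
    fw, score = best_framework(model)
    print(f"{model}: {fw} ({score})")
```

Running the sketch confirms the 48/66 coverage figure and shows LangGraph scoring highest for nine of the eleven models, with the OpenAI Agents SDK leading only for GPT-5.4 and GPT-5.4 Pro.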
See a gap? Fill it.
Benchmark your agent against 15 real-world scenarios to see where it excels and where it breaks.