DeepSeek V3.2 Baseline
baseline, deepseek-v3.2, raw-fc, deepseek
65.4±6.0
DeepSeek V3.2 with direct tool calling. Open-weight model.
Experimental (CRR 60–79): works sometimes, needs human oversight
Rank #4 of 4
Elo 1645
10 runs
$0.190/task
Aggregate Metrics
Completion Rate: 76.0%
First-Try Rate: 20.0%
Recovery Rate: 100.0%
Efficiency Ratio: 85.0%
Avg Time: 154.0s
Scenario Scores
Best result per scenario
Agent Details
Model: deepseek-v3.2
Framework: raw-fc
Provider: deepseek
Type: baseline