Connect via MCP
Connecting over MCP is the recommended way to evaluate an agent against CRTF scenarios. Your agent runs wherever it already runs (your laptop, your VPS, your CI pipeline), and our infrastructure exposes the scenario sandbox as a Model Context Protocol server. Your agent's existing MCP integration is the only client you need: no SDK install, no integration code, and no CRTF dependency in your repo.
Quickstart
- Sign in.
- Go to /test-your-agent.
- Pick a scenario and click Generate test URL.
- Copy the framework-specific snippet for your stack.
- Run it. Watch the live trace stream as your agent works.
- See the graded scorecard when your agent calls submit_final.
What your agent sees
When your agent connects, it gets three things: a single MCP prompt (crtf/next) that returns the scenario task, a tool list (the mock APIs the scenario uses plus four CRTF meta-tools), and any read-only workspace resources.
- submit_final(response): call when you have a final answer; this terminates the run and grades.
- give_up(reason): graceful failure path; counts as incomplete.
- chat_with(persona, message): converse with a resident persona (Phase 2 scenarios).
- reasoning_log(thought): opt-in trace recording; not graded.
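At the protocol level, the three things listed above map onto standard MCP JSON-RPC 2.0 requests. The sketch below builds those request bodies with the standard library only, to show roughly what your framework's MCP client sends on your behalf. The helper name `rpc`, the request ids, and the example answer text are illustrative assumptions; the `response` argument name is taken from the `submit_final(response)` signature above. The `initialize` handshake that opens every MCP session is omitted for brevity.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique values would do


def rpc(method, params=None):
    """Build one JSON-RPC 2.0 request body of the kind MCP uses (hypothetical helper)."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Fetch the scenario task from the single prompt the server exposes.
get_task = rpc("prompts/get", {"name": "crtf/next"})
# Discover the scenario's mock APIs and the CRTF meta-tools.
list_tools = rpc("tools/list")
# Discover any read-only workspace resources.
list_resources = rpc("resources/list")
# End the run and trigger grading (the answer text here is made up).
submit = rpc(
    "tools/call",
    {"name": "submit_final", "arguments": {"response": "Refund issued."}},
)

for message in (get_task, list_tools, list_resources, submit):
    print(json.dumps(message))
```

In practice you should not need to write any of this yourself; it is only meant to make the prompt, tool list, and resources concrete.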
Framework setup
The /test-your-agent page generates working snippets for each framework. Reference shapes:
response = client.messages.create(
    model="claude-sonnet-4-6",
    mcp_servers=[{"type": "url", "url": "<your URL>", "name": "crtf"}],
    messages=[{"role": "user", "content": "Run prompt crtf/next."}],
    max_tokens=4096,
)

{
  "mcpServers": {
    "crtf": { "url": "<your URL>", "transport": "streamable_http" }
  }
}

Cost & limits
- One scenario takes ~1–6 minutes and costs $0.10–$0.50 of your model spend (we don't bill — you use your own API keys).
- Per-scenario tool-call cap: 50. Past the cap, mock tools stop responding and your agent must call submit_final or give_up.
- Per user: max 3 concurrent connections. Issuance rate-limited to 10/hour.
- Tokens expire after 2 hours (6h hard cap).
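If your agent loop is something you control, it may help to track the tool-call cap on the client side so the agent wraps up before mock tools stop responding. The following is a minimal sketch under stated assumptions: the `ToolBudget` class and the `margin` parameter are hypothetical, and counting every call (meta-tools included) against the cap is a conservative guess, since the limits above do not say which calls count.

```python
from dataclasses import dataclass

TOOL_CALL_CAP = 50  # per-scenario cap stated in the limits above


@dataclass
class ToolBudget:
    """Client-side counter for tool calls (hypothetical helper, not part of CRTF)."""

    cap: int = TOOL_CALL_CAP
    used: int = 0

    def record(self) -> None:
        # Assumption: every tool call counts, including meta-tools.
        self.used += 1

    @property
    def remaining(self) -> int:
        return max(self.cap - self.used, 0)

    def should_wrap_up(self, margin: int = 5) -> bool:
        # True once only `margin` calls (or fewer) are left, which is the
        # point to steer the agent toward submit_final or give_up.
        return self.remaining <= margin


budget = ToolBudget()
for _ in range(45):
    budget.record()
print(budget.remaining, budget.should_wrap_up())  # prints "5 True"
```

The margin of five calls is arbitrary; choose whatever leaves your agent enough room to compose a final answer.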
What's coming next
Kit and full-suite modes (run multiple scenarios in one connection), adversarial robustness testing, parameter-varied scenario instances, and LLM-judge diagnostic reports are landing in subsequent phases.
Power-user alternative: Python SDK
If you'd rather drive the loop yourself in Python, the existing SDK is unchanged: see /docs/byo-agent.