Connect via MCP

Connecting via MCP is the recommended way to evaluate an agent against CRTF scenarios. Your agent runs wherever it already runs (your laptop, your VPS, your CI pipeline), and our infrastructure exposes the scenario sandbox as a Model Context Protocol (MCP) server. Your agent's existing MCP integration is the only client you need: no SDK install, no integration code, and no CRTF dependency in your repo.

Quickstart

1. Sign in.
2. Go to /test-your-agent.
3. Pick a scenario and click Generate test URL.
4. Copy the framework-specific snippet for your stack.
5. Run it. Watch the live trace stream as your agent works.
6. See the graded scorecard when your agent calls submit_final.

What your agent sees

When your agent connects, it receives three things: a single MCP prompt (crtf/next) that returns the scenario task, a tool list (the mock APIs the scenario uses plus four CRTF meta-tools), and any read-only workspace resources. The four meta-tools are:

submit_final(response): call when you have a final answer; this ends the run and triggers grading.
give_up(reason): graceful failure path; the run counts as incomplete.
chat_with(persona, message): converse with a resident persona (Phase 2 scenarios).
reasoning_log(thought): opt-in trace recording; not graded.
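
On the wire, these interactions are ordinary MCP JSON-RPC 2.0 requests, which your MCP client normally builds for you. Purely as an illustration, the sketch below constructs the prompts/get and tools/call payloads by hand. The helper names (get_prompt_request, call_tool_request, is_terminal) are hypothetical, and the argument keys are assumed to match the parameter names in the signatures above.

```python
import itertools
import json

# Meta-tools that end the run, per the list above.
TERMINAL_TOOLS = {"submit_final", "give_up"}

_ids = itertools.count(1)  # simple generator of JSON-RPC request ids


def get_prompt_request(name="crtf/next"):
    """Build an MCP prompts/get request for the scenario task."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "prompts/get",
        "params": {"name": name},
    }


def call_tool_request(tool, **arguments):
    """Build an MCP tools/call request.

    The argument keys (response, reason, persona, message, thought) are
    assumed from the signatures shown in this document.
    """
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def is_terminal(request):
    """Return True if this request is a tools/call that would end the run."""
    return (
        request.get("method") == "tools/call"
        and request["params"]["name"] in TERMINAL_TOOLS
    )


if __name__ == "__main__":
    log = call_tool_request("reasoning_log", thought="Check the invoice first.")
    final = call_tool_request("submit_final", response="Refund approved.")
    print(json.dumps(final, indent=2))
    print(is_terminal(log), is_terminal(final))
```

In practice you would not send these yourself; the point is that a call to submit_final or give_up is just another tool call from the agent's side, so your loop should stop issuing requests once one has been made.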

Framework setup

The /test-your-agent page generates working snippets for each framework. Reference shapes:

Anthropic SDK ≥ 1.5

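import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# If your SDK version rejects mcp_servers here, the MCP connector may only be
# exposed on the beta surface (client.beta.messages.create); check Anthropic's docs.
response = client.messages.create(
    model="claude-sonnet-4-6",
    mcp_servers=[{"type": "url", "url": "<your URL>", "name": "crtf"}],
    messages=[{"role": "user", "content": "Run prompt crtf/next."}],
    max_tokens=4096,
)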

Claude Desktop / Cursor

{
  "mcpServers": {
    "crtf": { "url": "<your URL>", "transport": "streamable_http" }
  }
}
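
If you maintain this config by script rather than by hand, a merge along the following lines avoids overwriting other entries under mcpServers. This is only a sketch: the helper name add_crtf_server is hypothetical, and the config file path is left as a parameter because its location varies by client and operating system.

```python
import json
import tempfile
from pathlib import Path


def add_crtf_server(config_path, test_url):
    """Add or update the "crtf" entry, leaving other servers untouched."""
    path = Path(config_path)
    # Treat a missing or empty file as an empty config.
    text = path.read_text().strip() if path.exists() else ""
    config = json.loads(text) if text else {}
    servers = config.setdefault("mcpServers", {})
    # Entry shape copied from the reference snippet above.
    servers["crtf"] = {"url": test_url, "transport": "streamable_http"}
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Demo against a throwaway file; point this at your real config instead.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "mcp_config.json"
        demo.write_text('{"mcpServers": {"other": {"url": "https://example.invalid/mcp"}}}')
        add_crtf_server(demo, "<your URL>")
        print(demo.read_text())
```

Because test URLs are short-lived, rerunning the helper with a freshly generated URL simply replaces the previous crtf entry.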

Cost & limits

One scenario typically takes 1–6 minutes and costs roughly $0.10–$0.50 in model spend. We don't bill you; you use your own API keys.
Per-scenario tool-call cap: 50. Past the cap, mock tools stop responding and your agent must call submit_final or give_up.
Per user: max 3 concurrent connections. Issuance is rate-limited to 10 per hour.
Tokens expire after 2 hours (6h hard cap).
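
Since mock tools stop responding past the cap, it can help to track the budget on the agent side and wrap up before hitting it. The sketch below is one possible way to do that, not part of CRTF: the RunBudget class and its reserve thresholds are hypothetical, it assumes every tool call (including meta-tools) counts toward the 50-call cap, and it assumes the 2-hour clock starts when the budget object is created.

```python
import time

TOOL_CALL_CAP = 50               # per-scenario cap from the limits above
TOKEN_TTL_SECONDS = 2 * 60 * 60  # tokens expire after 2 hours


class RunBudget:
    """Client-side tracker for tool calls and token lifetime (hypothetical)."""

    def __init__(self, cap=TOOL_CALL_CAP, ttl=TOKEN_TTL_SECONDS, now=time.monotonic):
        self._now = now
        self.cap = cap
        self.calls = 0
        # Assumption: the token was issued at roughly this moment.
        self.deadline = now() + ttl

    def record_call(self):
        """Call once per tool invocation the agent makes."""
        self.calls += 1

    @property
    def remaining_calls(self):
        return max(self.cap - self.calls, 0)

    def seconds_left(self):
        return max(self.deadline - self._now(), 0.0)

    def should_wrap_up(self, reserve_calls=2, reserve_seconds=60):
        """True when the agent should stop exploring and submit or give up."""
        return (
            self.remaining_calls <= reserve_calls
            or self.seconds_left() <= reserve_seconds
        )


if __name__ == "__main__":
    budget = RunBudget()
    for _ in range(48):
        budget.record_call()
    print(budget.remaining_calls, budget.should_wrap_up())
```

Keeping a small reserve leaves room for a final submit_final or give_up call even if that call turns out to count against the cap.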

What's coming next

Kit and full-suite modes (run multiple scenarios in one connection), adversarial robustness testing, parameter-varied scenario instances, and LLM-judge diagnostic reports are landing in subsequent phases.

Power-user alternative: Python SDK

If you'd rather drive the loop yourself in Python, the existing SDK is unchanged: see /docs/byo-agent.