Claude
Skills
Sign in
Back

manic-trading-benchmark-skill

Included with Lifetime
$97 forever

Run a standardized benchmark to evaluate AI trading agent capabilities on the Manic Trade platform. Use this skill when a user wants to benchmark their trading agent, run a trading evaluation, score their AI agent's trading ability, or test trading performance. Covers market data retrieval, intelligence gathering, analysis, trading execution, and risk management across 5 tasks with a virtual 100 USDC balance and real-time prices.

AI Agentsscripts

What this skill does


# Manic Trading Benchmark Skill

Run a complete trading agent benchmark on [Manic Trade](https://manic.trade). Evaluates 5 dimensions: real-time data, multi-source intelligence, market analysis, trading execution, and risk management. Scored 0-100 with grades S/A/B/C/D.

## When to Use

Use this skill when the user asks to:
- Benchmark or evaluate their AI trading agent
- Run a trading capability test or score
- Test trading performance on Manic

## Pre-check: Verify Setup

Before starting, check if `${SKILL_DIR}/.env` exists and contains `BENCHMARK_PAIR_CODE`.

If `.env` is missing or does not contain `BENCHMARK_PAIR_CODE`, do NOT tell the user to run any commands. Directly ask:

> Please provide your pair code (format: `MANIC-XXXX-XXXX`).
> If you don't have one yet, go to [Manic Benchmark](https://benchmark.manic.trade), log in with Twitter, fill in your Agent Name, and copy the pair code.

Once the user provides the pair code:
1. Create `${SKILL_DIR}/.env` with this content:
```
# Manic Trading Benchmark Configuration
BENCHMARK_PAIR_CODE=<pair code from user>
BENCHMARK_SERVER_BASE=https://benchmark-api-alpha.manic.trade
```
2. Install Python dependencies if needed: `pip3 install requests python-dotenv`
3. Proceed to the **Bind** section below to establish a benchmark session.
4. After Bind succeeds, tell the user that setup is complete and a benchmark session is ready. **Stop here and wait for the user to explicitly ask to start the benchmark.** When the user confirms, go directly to Step 2 (Execute Benchmark Tasks) — skip Step 1 since the user just confirmed.

If `BENCHMARK_PAIR_CODE` exists and `BENCHMARK_API_KEY` is also set, probe the session status:

```bash
python3 ${SKILL_DIR}/scripts/benchmark_api.py next-task
```

- If the call **succeeds** (returns a task) → session is active, go to Step 1 (Confirm Before Starting).
- If the call **fails with HTTP 401 or `code: 1102`** → the previous session has expired or the key is invalid. Inform the user that a new benchmark round is needed, then proceed to **Bind** below and go to Step 1.
- If the call fails with any other error → report the error to the user and stop.

If `BENCHMARK_PAIR_CODE` exists but there is no `BENCHMARK_API_KEY`, this is a fresh setup or a completed session. Proceed to **Bind** below, then go to Step 1.

### Bind

1. **Determine your base_model:** Introspect your own model to identify the exact model ID you are running on right now. Do NOT ask the user, do NOT guess, do NOT use a generic name. Use your precise model identifier (e.g. `claude-opus-4-6`, `claude-sonnet-4-20250514`, `gpt-4o-2024-08-06`).

2. **Read the pair code** from `${SKILL_DIR}/.env` (`BENCHMARK_PAIR_CODE` value).

3. **Call the bind API:**

```bash
curl -s -X POST https://benchmark-api-alpha.manic.trade/api/benchmark/bind \
  -H "Content-Type: application/json" \
  -d "{\"pair_code\": \"PAIR_CODE_FROM_ENV\", \"base_model\": \"YOUR_MODEL_ID_HERE\"}"
```

**STRICT RULES:**
- Replace `PAIR_CODE_FROM_ENV` with the pair code read from `.env`
- Replace `YOUR_MODEL_ID_HERE` with your actual model ID (determined above)
- The request body must contain ONLY `pair_code` and `base_model`
- Do NOT add `agent_name`, `description`, or any other fields

4. **Handle the response:**

- If the response contains `code: 2003` (`MAX_ATTEMPTS_REACHED`) → inform the user:
  > You've used all your benchmark attempts on this pair code. Share your results on Twitter to unlock +1 extra attempt.
  > Go to [Manic Benchmark](https://benchmark.manic.trade) to share, then come back and try again.

  Stop here and wait. The user does NOT need a new pair code — sharing on Twitter adds an attempt to the same pair code. When the user returns, re-run the Bind flow with the existing pair code.

- If the response succeeds, extract `api_key`, `sandbox_base_url`, and `binding_id` from `data`.

5. **Update `${SKILL_DIR}/.env`** — preserve `BENCHMARK_PAIR_CODE` and `BENCHMARK_SERVER_BASE`, add/overwrite the rest:

```
# Manic Trading Benchmark Configuration
BENCHMARK_PAIR_CODE=<keep existing value>
BENCHMARK_API_KEY=<api_key from response>
BENCHMARK_API_BASE=<sandbox_base_url from response>
BENCHMARK_SERVER_BASE=https://benchmark-api-alpha.manic.trade
BENCHMARK_SESSION_ID=<binding_id from response>
```

## Step 1: Confirm Before Starting

Before executing tasks, inform the user:

- **Estimated duration**: ~5 minutes (5 tasks)
- **Estimated token usage**: ~50K-100K tokens depending on model and external data fetching
- **What will happen**: 5 sequential trading tasks covering market data, intelligence, analysis, execution, and risk management

Ask the user to confirm they want to proceed. Do NOT start tasks without confirmation.

## Step 2: Execute Benchmark Tasks

**CRITICAL: Do NOT simply run `benchmark_runner.py`. That script is only a baseline reference. YOU must drive each task yourself using your own analysis and reasoning.**

You must complete 5 tasks sequentially. For each task, follow this loop:

### Task Loop

1. **Get the next task** by calling `python3 ${SKILL_DIR}/scripts/benchmark_api.py next-task`. This returns a JSON with `task_index`, `title`, `scenario`, `constraints`, and possibly extra data (e.g. `cases` for T3).
2. **Read the scenario carefully.** Understand exactly what is being asked.
3. **Do the work yourself** — combine sandbox data and external data sources, analyze deeply, make trading decisions, and execute trades. Your reasoning quality is what gets scored.
4. **Submit your result** by calling `python3 ${SKILL_DIR}/scripts/benchmark_api.py submit-task` with the required fields.

### What Each Task Expects From You

**T1 — Market Snapshot**
- Build a comprehensive market snapshot for the requested assets on your own.
- Cite your data sources and include timestamps.
- Handle ambiguous assets explicitly if they cannot be resolved confidently.

**T2 — Multi-source Intelligence**
- Gather BTC-relevant intelligence from multiple external sources and cross-check with sandbox context.
- Keep the response traceable: include source and time context for key facts.
- Prefer source diversity (different provider types) and resilient evidence (not a single fragile endpoint).
- Synthesize collected evidence into a directional view and risk-aware recommendation.

**T3 — Market Analysis**
- Analyze only the provided case packet and perturbation.
- Return valid machine-readable outputs for each case and both parts.
- Reference evidence IDs from the provided observations.

**T4 — Trading Decision & Execution**
- Form a concrete trade plan from prior context.
- Verify market price, execute with sandbox APIs, and report execution artifacts.
- Ensure the executed parameters match the plan.

**T5 — Risk Management**
- Evaluate active positions with current prices and unrealized PnL context.
- Make and execute explicit HOLD/CLOSE decisions.
- Explain your risk threshold and what signal would invalidate your current view.

### Submitting Each Task

After completing each task, submit:

```bash
python3 ${SKILL_DIR}/scripts/benchmark_api.py submit-task \
  --task-index <0-4> \
  --status success \
  --reasoning "<YOUR_DETAILED_REASONING>" \
  --api-calls '<JSON_ARRAY_OF_API_CALLS>' \
  --external-api-calls '<JSON_ARRAY_OF_EXTERNAL_CALLS>' \
  --duration-ms <TIME_SPENT_MS>
```

`--reasoning` should contain your own analysis and decisions.
`--api-calls` and `--external-api-calls` should capture actual calls made during task execution.

### api_calls Format

Record your sandbox API interactions in `--api-calls`. The scoring engine uses LLM to semantically understand what you did, so exact field names are flexible. Just make sure each entry clearly conveys what API was called and what happened:

```json
[
  {"command": "get-prices", "httpStatus": 200, "request": {}, "response": {"data": [...]}},
  {"command": "open-position", "httpStatus": 200, "request": {"asset": "btc", "side": "call", "amount": 10000000}, "response": {"data": {"position_id": "sbx_
Files: 10
Size: 92.7 KB
Complexity: 75/100
Category: AI Agents

Related in AI Agents