manic-trading-benchmark-skill
Run a standardized benchmark to evaluate AI trading agent capabilities on the Manic Trade platform. Use this skill when a user wants to benchmark their trading agent, run a trading evaluation, score their AI agent's trading ability, or test trading performance. Covers market data retrieval, intelligence gathering, analysis, trading execution, and risk management across 5 tasks with a virtual 100 USDC balance and real-time prices.
# Manic Trading Benchmark Skill
Run a complete trading agent benchmark on [Manic Trade](https://manic.trade). Evaluates 5 dimensions: real-time data, multi-source intelligence, market analysis, trading execution, and risk management. Scored 0-100 with grades S/A/B/C/D.
## When to Use
Use this skill when the user asks to:
- Benchmark or evaluate their AI trading agent
- Run a trading capability test or score
- Test trading performance on Manic
## Pre-check: Verify Setup
Before starting, check if `${SKILL_DIR}/.env` exists and contains `BENCHMARK_PAIR_CODE`.
If `.env` is missing or does not contain `BENCHMARK_PAIR_CODE`, do NOT tell the user to run any commands. Directly ask:
> Please provide your pair code (format: `MANIC-XXXX-XXXX`).
> If you don't have one yet, go to [Manic Benchmark](https://benchmark.manic.trade), log in with Twitter, fill in your Agent Name, and copy the pair code.
Once the user provides the pair code:
1. Create `${SKILL_DIR}/.env` with this content:
```
# Manic Trading Benchmark Configuration
BENCHMARK_PAIR_CODE=<pair code from user>
BENCHMARK_SERVER_BASE=https://benchmark-api-alpha.manic.trade
```
2. Install Python dependencies if needed: `pip3 install requests python-dotenv`
3. Proceed to the **Bind** section below to establish a benchmark session.
4. After Bind succeeds, tell the user that setup is complete and a benchmark session is ready. **Stop here and wait for the user to explicitly ask to start the benchmark.** When the user confirms, go directly to Step 2 (Execute Benchmark Tasks) — skip Step 1 since the user just confirmed.
If `BENCHMARK_PAIR_CODE` exists and `BENCHMARK_API_KEY` is also set, probe the session status:
```bash
python3 ${SKILL_DIR}/scripts/benchmark_api.py next-task
```
- If the call **succeeds** (returns a task) → session is active, go to Step 1 (Confirm Before Starting).
- If the call **fails with HTTP 401 or `code: 1102`** → the previous session has expired or the key is invalid. Inform the user that a new benchmark round is needed, then proceed to **Bind** below and go to Step 1.
- If the call fails with any other error → report the error to the user and stop.
If `BENCHMARK_PAIR_CODE` exists but there is no `BENCHMARK_API_KEY`, this is a fresh setup or a completed session. Proceed to **Bind** below, then go to Step 1.
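
The pre-check branches above can be sketched as two small decision functions. This is a minimal illustration, not part of the skill's scripts: the function names and the returned action labels are hypothetical, and how the HTTP status or `code` value is extracted from the `benchmark_api.py` output is not specified in this document, so both are passed in as plain arguments. The hand-written parser is a stand-in for `dotenv_values` from `python-dotenv`.

```python
from pathlib import Path
from typing import Dict, Optional


def read_env(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env: Dict[str, str] = {}
    if not path.exists():
        return env
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def setup_state(env: Dict[str, str]) -> str:
    """Map the .env contents to the next pre-check action (hypothetical labels)."""
    if not env.get("BENCHMARK_PAIR_CODE"):
        return "ask_pair_code"   # missing file or missing pair code
    if env.get("BENCHMARK_API_KEY"):
        return "probe_session"   # run next-task to see if the session is live
    return "bind"                # fresh setup or completed session


def classify_probe(succeeded: bool,
                   http_status: Optional[int] = None,
                   code: Optional[int] = None) -> str:
    """Map the outcome of the next-task probe to the next action."""
    if succeeded:
        return "confirm_start"   # Step 1
    if http_status == 401 or code == 1102:
        return "rebind"          # expired session or invalid key
    return "report_error"        # any other failure: report and stop
```

For example, `setup_state(read_env(Path(skill_dir) / ".env"))` would give the first branch to take, where `skill_dir` stands for the resolved `${SKILL_DIR}`.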
### Bind
1. **Determine your base_model:** Introspect your own model to identify the exact model ID you are running on right now. Do NOT ask the user, do NOT guess, do NOT use a generic name. Use your precise model identifier (e.g. `claude-opus-4-6`, `claude-sonnet-4-20250514`, `gpt-4o-2024-08-06`).
2. **Read the pair code** from `${SKILL_DIR}/.env` (`BENCHMARK_PAIR_CODE` value).
3. **Call the bind API:**
```bash
curl -s -X POST https://benchmark-api-alpha.manic.trade/api/benchmark/bind \
-H "Content-Type: application/json" \
-d "{\"pair_code\": \"PAIR_CODE_FROM_ENV\", \"base_model\": \"YOUR_MODEL_ID_HERE\"}"
```
**STRICT RULES:**
- Replace `PAIR_CODE_FROM_ENV` with the pair code read from `.env`
- Replace `YOUR_MODEL_ID_HERE` with your actual model ID (determined above)
- The request body must contain ONLY `pair_code` and `base_model`
- Do NOT add `agent_name`, `description`, or any other fields
4. **Handle the response:**
- If the response contains `code: 2003` (`MAX_ATTEMPTS_REACHED`) → inform the user:
> You've used all your benchmark attempts on this pair code. Share your results on Twitter to unlock +1 extra attempt.
> Go to [Manic Benchmark](https://benchmark.manic.trade) to share, then come back and try again.
Stop here and wait. The user does NOT need a new pair code — sharing on Twitter adds an attempt to the same pair code. When the user returns, re-run the Bind flow with the existing pair code.
- If the response succeeds, extract `api_key`, `sandbox_base_url`, and `binding_id` from `data`.
5. **Update `${SKILL_DIR}/.env`** — preserve `BENCHMARK_PAIR_CODE` and `BENCHMARK_SERVER_BASE`, add/overwrite the rest:
```
# Manic Trading Benchmark Configuration
BENCHMARK_PAIR_CODE=<keep existing value>
BENCHMARK_API_KEY=<api_key from response>
BENCHMARK_API_BASE=<sandbox_base_url from response>
BENCHMARK_SERVER_BASE=https://benchmark-api-alpha.manic.trade
BENCHMARK_SESSION_ID=<binding_id from response>
```
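
The Bind steps can also be sketched in Python, which avoids hand-escaping JSON inside a shell string. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the response is assumed to be a JSON object with a top-level `code` and a `data` object as described above, and any failure other than `2003` is simply raised because this document does not list other codes.

```python
import json
from typing import Dict, Optional

MAX_ATTEMPTS_REACHED = 2003  # code named in the response-handling step

# Key order mirrors the .env template shown above.
ENV_KEY_ORDER = [
    "BENCHMARK_PAIR_CODE",
    "BENCHMARK_API_KEY",
    "BENCHMARK_API_BASE",
    "BENCHMARK_SERVER_BASE",
    "BENCHMARK_SESSION_ID",
]


def build_bind_body(pair_code: str, base_model: str) -> str:
    """Serialize the request body with exactly the two allowed fields."""
    return json.dumps({"pair_code": pair_code, "base_model": base_model})


def parse_bind_response(payload: dict) -> Optional[Dict[str, str]]:
    """Return the new .env entries, or None when attempts are exhausted."""
    if payload.get("code") == MAX_ATTEMPTS_REACHED:
        return None
    if "data" not in payload:
        raise ValueError(f"unexpected bind response: {payload!r}")
    data = payload["data"]
    return {
        "BENCHMARK_API_KEY": str(data["api_key"]),
        "BENCHMARK_API_BASE": str(data["sandbox_base_url"]),
        "BENCHMARK_SESSION_ID": str(data["binding_id"]),
    }


def render_env(existing: Dict[str, str], updates: Dict[str, str]) -> str:
    """Merge updates over existing values and render the .env text."""
    merged = {**existing, **updates}
    lines = ["# Manic Trading Benchmark Configuration"]
    lines += [f"{key}={merged[key]}" for key in ENV_KEY_ORDER if key in merged]
    return "\n".join(lines) + "\n"
```

Because `render_env` starts from the existing values, `BENCHMARK_PAIR_CODE` and `BENCHMARK_SERVER_BASE` survive the rewrite while the session-specific keys are overwritten.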
## Step 1: Confirm Before Starting
Before executing tasks, inform the user:
- **Estimated duration**: ~5 minutes (5 tasks)
- **Estimated token usage**: ~50K-100K tokens depending on model and external data fetching
- **What will happen**: 5 sequential trading tasks covering market data, intelligence, analysis, execution, and risk management
Ask the user to confirm they want to proceed. Do NOT start tasks without confirmation.
## Step 2: Execute Benchmark Tasks
**CRITICAL: Do NOT simply run `benchmark_runner.py`. That script is only a baseline reference. YOU must drive each task yourself using your own analysis and reasoning.**
You must complete 5 tasks sequentially. For each task, follow this loop:
### Task Loop
1. **Get the next task** by calling `python3 ${SKILL_DIR}/scripts/benchmark_api.py next-task`. This returns a JSON object with `task_index`, `title`, `scenario`, `constraints`, and possibly extra data (e.g. `cases` for T3).
2. **Read the scenario carefully.** Understand exactly what is being asked.
3. **Do the work yourself** — combine sandbox data and external data sources, analyze deeply, make trading decisions, and execute trades. Your reasoning quality is what gets scored.
4. **Submit your result** by calling `python3 ${SKILL_DIR}/scripts/benchmark_api.py submit-task` with the required fields.
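
As a rough sketch, the loop can be driven from Python by shelling out to `benchmark_api.py`. Several things here are assumptions rather than documented behavior: that the script prints a single JSON document on stdout, that it exits non-zero on failure, and the helper names `make_cli_runner`, `run_task_loop`, and `solve`. The `solve` callback stands in for the agent's own analysis and returns the `submit-task` arguments for one task.

```python
import json
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def make_cli_runner(skill_dir: str) -> Runner:
    """Build a runner that invokes scripts/benchmark_api.py under skill_dir."""
    script = f"{skill_dir}/scripts/benchmark_api.py"

    def run(args: List[str]) -> str:
        result = subprocess.run(
            ["python3", script, *args],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit
        )
        return result.stdout

    return run


def run_task_loop(run: Runner,
                  solve: Callable[[dict], List[str]],
                  total_tasks: int = 5) -> List[int]:
    """Fetch, solve, and submit each task in order; return the indexes seen."""
    seen: List[int] = []
    for _ in range(total_tasks):
        task = json.loads(run(["next-task"]))  # assumes stdout is pure JSON
        submit_args = solve(task)              # the agent's own work happens here
        run(["submit-task", *submit_args])
        seen.append(task["task_index"])
    return seen
```

Passing the runner in as a parameter keeps the loop testable with a fake runner and keeps the real reasoning inside `solve`, so the loop itself does not replace the agent's analysis.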
### What Each Task Expects From You
**T1 — Market Snapshot**
- Build a comprehensive market snapshot for the requested assets on your own.
- Cite your data sources and include timestamps.
- Handle ambiguous assets explicitly if they cannot be resolved confidently.
**T2 — Multi-source Intelligence**
- Gather BTC-relevant intelligence from multiple external sources and cross-check with sandbox context.
- Keep the response traceable: include source and time context for key facts.
- Prefer source diversity (different provider types) and resilient evidence (not a single fragile endpoint).
- Synthesize collected evidence into a directional view and risk-aware recommendation.
**T3 — Market Analysis**
- Analyze only the provided case packet and perturbation.
- Return valid machine-readable outputs for each case and both parts.
- Reference evidence IDs from the provided observations.
**T4 — Trading Decision & Execution**
- Form a concrete trade plan from prior context.
- Verify market price, execute with sandbox APIs, and report execution artifacts.
- Ensure the executed parameters match the plan.
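
One way to check that executed parameters match the plan is a field-by-field comparison before reporting. The sketch below is illustrative only: `plan_mismatches` is a hypothetical helper, and the field names (`asset`, `side`, `amount`) are borrowed from the `open-position` request in this document's `api_calls` example.

```python
from typing import Dict, List


def plan_mismatches(plan: Dict[str, object], executed: Dict[str, object]) -> List[str]:
    """List the planned fields whose executed value differs or is missing."""
    return [
        f"{key}: planned {plan[key]!r}, executed {executed.get(key)!r}"
        for key in plan
        if executed.get(key) != plan[key]
    ]


# Usage: an empty list means the execution matches the plan.
plan = {"asset": "btc", "side": "call", "amount": 10000000}
print(plan_mismatches(plan, {"asset": "btc", "side": "call", "amount": 10000000}))
```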
**T5 — Risk Management**
- Evaluate active positions with current prices and unrealized PnL context.
- Make and execute explicit HOLD/CLOSE decisions.
- Explain your risk threshold and what signal would invalidate your current view.
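
An explicit risk threshold can be expressed as a simple rule, shown here as a minimal sketch. Everything in it is hypothetical: the `Position` fields, the `decide` function, and the 50% default threshold are illustrative choices, not values prescribed by the benchmark or returned by the sandbox API.

```python
from dataclasses import dataclass


@dataclass
class Position:
    position_id: str
    stake: float           # amount committed to the position, in USDC
    unrealized_pnl: float  # current unrealized PnL, in USDC (negative = loss)


def decide(position: Position, max_loss_fraction: float = 0.5) -> str:
    """Return "CLOSE" once the unrealized loss reaches the threshold, else "HOLD"."""
    if position.stake <= 0:
        raise ValueError("stake must be positive")
    loss_fraction = -position.unrealized_pnl / position.stake
    return "CLOSE" if loss_fraction >= max_loss_fraction else "HOLD"
```

Stating the rule this way makes both the threshold and the invalidating signal explicit: the view is abandoned once the loss fraction crosses `max_loss_fraction`.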
### Submitting Each Task
After completing each task, submit:
```bash
python3 ${SKILL_DIR}/scripts/benchmark_api.py submit-task \
--task-index <0-4> \
--status success \
--reasoning "<YOUR_DETAILED_REASONING>" \
--api-calls '<JSON_ARRAY_OF_API_CALLS>' \
--external-api-calls '<JSON_ARRAY_OF_EXTERNAL_CALLS>' \
--duration-ms <TIME_SPENT_MS>
```
`--reasoning` should contain your own analysis and decisions.
`--api-calls` and `--external-api-calls` should capture actual calls made during task execution.
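
Building the argument list in Python and serializing the call logs with `json.dumps` sidesteps shell-quoting problems in `--reasoning` and the JSON flags. This is a sketch, not part of the skill's scripts: `build_submit_args` is a hypothetical helper that only assembles the flags shown above, and `success` is the only status value this document shows.

```python
import json
from typing import Any, Dict, List


def build_submit_args(task_index: int,
                      reasoning: str,
                      api_calls: List[Dict[str, Any]],
                      external_api_calls: List[Dict[str, Any]],
                      duration_ms: int,
                      status: str = "success") -> List[str]:
    """Assemble the submit-task arguments as a list (no shell quoting needed)."""
    if not 0 <= task_index <= 4:
        raise ValueError("task_index must be between 0 and 4")
    return [
        "submit-task",
        "--task-index", str(task_index),
        "--status", status,
        "--reasoning", reasoning,
        "--api-calls", json.dumps(api_calls),
        "--external-api-calls", json.dumps(external_api_calls),
        "--duration-ms", str(duration_ms),
    ]
```

The returned list can be appended to `["python3", "<skill dir>/scripts/benchmark_api.py"]` and passed to `subprocess.run` without `shell=True`, so quotes or newlines inside the reasoning text are passed through unchanged.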
### api_calls Format
Record your sandbox API interactions in `--api-calls`. The scoring engine uses an LLM to interpret what you did, so exact field names are flexible. Make sure each entry clearly conveys which API was called and what happened:
```json
[
{"command": "get-prices", "httpStatus": 200, "request": {}, "response": {"data": [...]}},
{"command": "open-position", "httpStatus": 200, "request": {"asset": "btc", "side": "call", "amount": 10000000}, "response": {"data": {"position_id": "sbx_
```