gepa-demo
Guide users who want to optimize their LLM prompts. We will interact with them, understanding their datasets and grader requirements, and finally writing DSPy code to optimize their prompt (using a custom implementation of the GEPA algorithm).
What this skill does
# What is prompt optimization? Prompt optimization is the process of improving the quality of prompts used in language models. It is often done manually, but increasingly their are frameworks (such as DSPy) being used to use LLMs to do this. In essence, the process involves the user providing a dataset and a grader or reward model to judge an LLM's output. A prompt's performance on the dataset is measured, the gaps in in its performance identified, and a new prompt is then proposed and tested. This runs in a loop until an end state is reached. # What is GEPA? GEPA (which stands for GEnetic PAreto) is a prompt optimization algorithm that follows the process above. It is increasingly a popular approach to prompt optimization, and utilizes two key strategic choices compared to other algorithms: 1. **Text Feedback:** most prompt optimization algorithms and RL simply use a scalar reward. GEPA however also uses textual feedback, including log traces, which can be used to identify the root cause of issues in a prompt. This makes it particularly well suitable for LLM as Judges, which may not be well calibrated from a score perspective but can provide high quality text feedback identifying what needs to be improved. 2. **Pareto Frontiers:** GEPA does not simply select the best overall prompt based on score, but checks for which prompt dominates other candidate prompts on most test cases in the validation dataset. Through this, you are more likely to find a prompt that performs reliably on a wider variety of cases, rather than one that might simply spike on some scenarios but underperform on others. More can be read about GEPA on [this Github repo](https://github.com/gepa-ai/gepa) if needed. # The goal for this skill Help users improve their prompts using our [custom implementation of GEPA](https://github.com/raveeshbhalla/dspy-gepa-logger). Ask them for a dataset and create a grader for them (if they don't already have one), then optimize the prompt. Our custom GEPA implementation forks the original library to provide observability into its run. The Github repo also packages a webserver which is used to visualize the process. ## Workflow (follow in order) 0) Validate prerequisites - Check that required tools are installed: - Python 3.8+: `python --version` - Node.js 20.19+: `node --version` (Prisma requirement) - npm: `npm --version` - git: `git --version` - If any are missing, provide installation instructions for the user's OS. 1) Ensure repo is cloned and installed - Confirm if it's ok to clone `dspy-gepa-logger` repo to the current folder (if not, ask where to clone it). - Clone it from https://github.com/raveeshbhalla/dspy-gepa-logger.git - Install in editable mode so `gepa_observable` imports work from any folder. - See `references/repo-setup.md`. 2) Confirm data location - Look in the current folder for csv, json or jsonl files. - Confirm with the user that if that's the file they want to use for optimization. - If not, ask where the data file lives and move it to the current folder. 3) Inspect data and define the task schema - Load a small sample (first ~5–10 rows). - Ask the user to map which fields are **inputs** and which are **expected outputs**. - Show them the fields and some sample values so they have an idea of what data they're working with. - Identify optional **metadata** fields (IDs, timestamps, categories, etc.). - Decide the output type (single field, multi-field, structured JSON, etc.). - For messy columns, propose a normalized mapping and confirm. - See `references/data-mapping.md` for question prompts and mapping patterns. 4) Collect model + budget preferences - **IMPORTANT**: Use AskUserQuestion tool if available for better interactivity - Ask for: - **Task LM** (what they'll use in production - can be smaller/cheaper like gpt-5-mini, claude-4-5-haiku) - **Reflective/Judge LM** (recommend stronger model like gpt-5.2, claude-4-5-sonnet) - **Budget**: `auto=light|medium|heavy` OR `max_full_evals` OR `max_metric_calls`. Default to `auto=light` - Whether to attach the web dashboard (default: yes) - Project/run name - Seed prompt (user can provide some text, rubrics or point to a file to extract from) - If unclear, propose sensible defaults based on your web search and confirm. 5) Create `.env.local` and verify API key - Create `<user-folder>/optimizer-script/.env.local` with template: ``` OPENAI_API_KEY=sk-... # User must replace with actual key TASK_LM=openai/gpt-5-mini REFLECTIVE_LM=openai/gpt-5.2 JUDGE_LM=openai/gpt-5.2 ``` - **⚠️ CRITICAL**: Inform the user they MUST manually edit this file and add their actual API key - The placeholder `your_openai_api_key_here` will cause authentication errors - If using other providers (Anthropic, etc.), add those keys too (see `references/env.md`) - Keep the reflective LM and judge LM the same unless the user explicitly splits them - **WAIT** for user confirmation that they've added their API key before proceeding to verify that it can be found 6) Build DSPy objects - Create `dspy.Example` list from the dataset and call `.with_inputs(...)`. - Define a `dspy.Signature` from input/output fields. - Choose a simple `dspy.Module` (e.g., `ChainOfThought` or `Predict`). - See `references/data-mapping.md` for patterns. 7) Create a grader that can be used as the "metric with feedback" for the GEPA run - Check if the dataset provided has any "expected" columns. - If not, ask the user to provide some context for the rubric to be used- Prefer rule-based metrics when ground truth is clean. - Otherwise use an LLM judge with strong feedback text. - Metric must return `dspy.Prediction(score=float, feedback=str)`. - For LLM judges, define a DSPy `Signature` (e.g., `JudgeSig`) and call `dspy.ChainOfThought(JudgeSig)` to get structured outputs - See `references/metrics.md` for templates. 8) Split dataset (train/val/test) - If dataset <100, skip test set. - Ensure valset is representative but not too large for budget. - Use a fixed seed for reproducibility. 9) Generate demo scripts (split files) - Create `<user-folder>/optimizer-script/` and copy **all files** from `assets/gepa-demo/` into it. - Edit `gepa_demo_data.py`, `gepa_demo_program.py`, `gepa_demo_metric.py` to match the dataset. - Ensure the runner `gepa-demo-script.py` keeps CLI args for: - `--seed-prompt` - `--auto` or `--max-full-evals` or `--max-metric-calls` - `--server-url` - `--project-name` - `--log-dir` - `--train-ratio`/`--val-ratio` - Keep all paths relative to `optimizer-script/`. 10) Run a pre-optimization smoke test - Run `python <user-folder>/optimizer-script/gepa-demo-eval.py --data-file <path>`. - Confirm the program runs, metric returns `score` + `feedback`, and outputs look sane. - This validates data mapping, program wiring, and metric behavior before full optimization. - Also run with `--use-llm-judge`. If scores are all zero, the judge output is likely malformed or not being parsed. 11) Start the web dashboard - Follow `references/web-dashboard.md` to start the server from the **cloned repo** `web/` directory. - Ensure `npx prisma generate` is run before `npx prisma migrate deploy` - Keep the terminal window open - the server must stay running - Verify server is running at http://localhost:3000 - If server stops, restart with: `cd web && npm run dev` 12) Run the optimizer - **BEFORE starting**: Ask user to open http://localhost:3000 in their browser to watch live progress - Run `python <user-folder>/optimizer-script/gepa-demo-script.py ...` with the server URL - Inform user this will take 2-10 minutes depending on budget and dataset size - Progress will appear in both: - Terminal output (text logs) - Web dashboard (visual charts and tables) - Don't close terminal while optimization is running - After completion, verify runs appear in the dashboard - If the CLI times out, the run may still be active. Check the dashboard before restarting. 13) Sanity
Related in AI Agents
skill-development
IncludedComprehensive meta-skill for creating, managing, validating, auditing, and distributing Claude Code skills and slash commands (unified in v2.1.3+). Provides skill templates, creation workflows, validation patterns, audit checklists, naming conventions, YAML frontmatter guidance, progressive disclosure examples, and best practices lookup. Use when creating new skills, validating existing skills, auditing skill quality, understanding skill architecture, needing skill templates, learning about YAML frontmatter requirements, progressive disclosure patterns, tool restrictions (allowed-tools), skill composition, skill naming conventions, troubleshooting skill activation issues, creating custom slash commands, configuring command frontmatter, using command arguments ($ARGUMENTS, $1, $2), bash execution in commands, file references in commands, command namespacing, plugin commands, MCP slash commands, Skill tool configuration, or deciding between skills vs slash commands. Delegates to docs-management skill for official documentation.
reprompter
IncludedTransform messy prompts into well-structured, effective prompts — single or multi-agent. Use when: "reprompt", "reprompt this", "clean up this prompt", "structure my prompt", rough text needing XML tags and best practices, "reprompter teams", "repromptception", "run with quality", "smart run", "smart agents", multi-agent tasks, audits, parallel work, anything going to agent teams. Don't use when: simple Q&A, pure chat, immediate execution-only tasks. See "Don't Use When" section for details. Outputs: Structured XML/Markdown prompt, quality score (before/after), optional team brief + per-agent sub-prompts, agent team output files. Success criteria: Single mode quality score ≥ 7/10; Repromptception per-agent prompt quality score 8+/10; all required sections present, actionable and specific.
adaptive-compaction
IncludedAdaptive add-on policy and recovery layer that decides WHEN to compact, prune, snapshot, or fork -- replacing fixed-percent auto-compaction across Claude Code, Codex, and MCP-capable hosts. Trigger on auto-compact timing or damage: "when should I compact", "is it safe to compact now or start a fresh session", "auto-compact fires too early/mid-task", "switching to an unrelated task but the window still has space", "context rot", "answers get worse the longer the session runs", "the agent forgot the plan or my decisions after it summarized", "add a layer on top that manages context without changing the agent", raising autoCompactWindow to give the policy room, or installing/tuning a cross-tool compaction policy or PreCompact hook -- even when "compaction" is never said but the problem is context-window pressure or post-summarization memory loss. Do NOT use to summarize a conversation, build RAG, write a summarization prompt (decides WHEN not HOW), or answer max-context-length trivia.
agent-skill-creator
IncludedCreate cross-platform agent skills from workflow descriptions. Activates when users ask to create an agent, automate a repetitive workflow, create a custom skill, or need advanced agent creation. Triggers on phrases like create agent for, automate workflow, create skill for, every day I have to, daily I need to, turn process into agent, need to automate, create a cross-platform skill, validate this skill, export this skill, migrate this skill. Supports single skills, multi-agent suites, transcript processing, template-based creation, interactive configuration, cross-platform export, and spec validation.
llm-wiki
IncludedUse when building or maintaining a persistent personal knowledge base (second brain) in Obsidian where an LLM incrementally ingests sources, updates entity/concept pages, maintains cross-references, and keeps a synthesis current. Triggers include "second brain", "Obsidian wiki", "personal knowledge management", "ingest this paper/article/book", "build a research wiki", "compound knowledge", "Memex", or whenever the user wants knowledge to accumulate across sessions instead of being re-derived by RAG on every query.
skill-master
IncludedAgent Skills authoring, evaluation, and optimization. Create, edit, validate, benchmark, and improve skills following the agentskills.io specification. Use when designing SKILL.md files, structuring skill folders (references, scripts, assets), ingesting external documentation into skills, running trigger evals, benchmarking skill quality, optimizing descriptions, or performing blind A/B comparisons. Keywords: agentskills.io, SKILL.md, skill authoring, eval, benchmark, trigger optimization.