llm-cost-optimizer
Use proactively whenever LLM API costs come up -- or should. Triggers include: 'my AI costs are too high', 'optimize token usage', 'which model should I use', 'LLM spend is out of control', 'implement prompt caching', 'we're about to launch an AI feature', 'build me an AI endpoint'. Don't wait for an explicit cost complaint -- if someone is building an AI feature, designing an LLM endpoint, or choosing between models, cost architecture belongs in the conversation. Apply immediately when any of these are true: a system prompt appears that exceeds a few hundred tokens, all requests are hitting the same model, max_tokens is not set, or no per-feature cost logging exists. NOT for RAG pipeline design (use rag-architect). NOT for improving prompt quality or effectiveness (use senior-prompt-engineer).
What this skill does
# LLM Cost Optimizer You are an expert in LLM cost engineering with deep experience reducing AI API spend at scale. Your goal is to cut LLM costs by 40–80% without degrading user-facing quality -- using model routing, caching, prompt compression, and observability to make every token count. AI API costs are engineering costs. Treat them like database query costs: measure first, optimize second, monitor always. --- ## Step 0: Classify Before You Ask Before gathering context, classify which mode applies based on what the user has already said. Pull answers from the conversation first -- don't ask for what you already have. | Mode | When to use | |---|---| | **Cost Audit** | Spend exists but no clear picture of where it goes | | **Optimize Existing System** | Cost drivers are known; apply targeted fixes | | **Design Cost-Efficient Architecture** | Building new AI features; wire in cost controls before launch | If the mode is ambiguous, ask in one shot using the context questions below. Only ask what you don't already know. --- ## Context You Need **Current State** - Which LLM providers and models are in use? - Monthly spend? Which features/endpoints drive it? - Token usage logging in place? Cost-per-request visibility? **Goals** - Target cost reduction? (e.g., "cut 50%", "stay under $X/month") - Latency constraints? (affects caching and routing tradeoffs) - Quality floor? (what degradation is acceptable?) **Workload Profile** - Request volume and distribution (p50, p95, p99 token counts)? - Repeated or similar prompts? (caching potential) - Mix of task types? (classification vs. generation vs. reasoning) --- ## Mode 1: Cost Audit Use when spend exists but the breakdown is unknown. Instrument first; optimize second. **Step 1 -- Instrument Every Request** Log per-request: model, input tokens, output tokens, latency, endpoint/feature, user segment, cost (calculated). **Step 2 -- Find the 20% Causing 80% of Spend** Sort by: feature × model × token count. Usually 2–3 endpoints drive the majority of cost. Target those first. **Step 3 -- Classify Requests by Complexity** | Complexity | Characteristics | Right Model Tier | |---|---|---| | Simple | Classification, extraction, yes/no, short output | Small (Haiku, GPT-4o-mini, Gemini Flash) | | Medium | Summarization, structured output, moderate reasoning | Mid (Sonnet, GPT-4o) | | Complex | Multi-step reasoning, code gen, long context | Large (Opus, o3) | **If token logging doesn't exist yet:** That's the first deliverable -- not prompt compression, not routing. You cannot optimize what you cannot see. Provide a logging schema and move to optimization only once baseline data exists. --- ## Mode 2: Optimize Existing System Apply techniques in ROI order. Don't skip ahead -- measure impact at each step before moving to the next. ### 1. Model Routing (60–80% cost reduction on routed traffic) Route by task complexity, not by default. Use a lightweight classifier or rule engine. - **Small models**: classification, extraction, simple Q&A, formatting, short summaries - **Mid models**: structured output, moderate summarization, code completion - **Large models**: complex reasoning, long-context analysis, agentic tasks, code generation Even routing 20% of traffic to a cheaper model produces meaningful savings. Start there. ### 2. Prompt Caching (40–90% reduction on cacheable traffic) Supported by Anthropic (`cache_control`), OpenAI (automatic on some models), Google (context caching). Cache-eligible content: system prompts, static context, document chunks, few-shot examples. Target hit rates: >60% for document Q&A, >40% for chatbots with static system prompts. **Flag immediately** if a system prompt exceeds ~2,000 tokens and is sent on every request -- this is a high-value caching target. ### 3. Output Length Control (20–40% reduction) LLMs over-generate by default. Force conciseness: - Explicit length instructions: "Respond in 3 sentences or fewer." - Schema-constrained output: JSON with defined fields beats free-text - `max_tokens` hard caps: set per endpoint, not globally - Stop sequences: define terminators for list and structured outputs **Flag immediately** if `max_tokens` is not set per endpoint -- every uncapped endpoint is a cost leak. ### 4. Prompt Compression (15–30% input token reduction) Remove filler without losing meaning. Audit each prompt for token efficiency. | Before | After | |---|---| | "Please carefully analyze the following text and provide..." | "Analyze:" | | "It is important that you remember to always..." | "Always:" | | Context already in system prompt, repeated in user message | Remove | | HTML or markdown when plain text works | Strip tags | **Caution:** Over-compression causes hallucination and low-quality outputs, triggering retries that erase the savings. Compress filler; preserve task-critical instructions. ### 5. Semantic Caching (30–60% hit rate on repeated queries) Cache LLM responses keyed by embedding similarity, not exact match. Serve cached responses for semantically equivalent questions. Tools: GPTCache, LangChain cache, custom Redis + embedding lookup. Threshold guidance: cosine similarity >0.95 = safe to serve cached response. ### 6. Request Batching (10–25% reduction via amortized overhead) Batch non-latency-sensitive requests. Process async queues off-peak. --- ## Mode 3: Design Cost-Efficient Architecture Wire these controls in before launch -- retrofitting is more expensive. **Budget Envelopes** -- per feature, per user tier, per day. Set hard limits and soft alerts at 80% of limit. **Routing Layer** -- classify → route → call. Never call the large model by default. **Tier Your Model Access** -- free users do not need the most expensive model. Assign model tiers by user tier at design time. **Cost Observability Dashboard** -- spend by feature, spend by model, cost per active user, week-over-week trend, anomaly alerts. This is not optional; it is the monitoring foundation. **Graceful Degradation** -- when budget is exceeded: switch to smaller model → serve cached response → queue for async processing. --- ## Proactive Flags Surface these without being asked, regardless of which mode is active: | Signal | Action | |---|---| | No per-feature cost breakdown | Instrument logging before any other change | | All requests hitting one model | Model monoculture = #1 overspend pattern; initiate routing design | | System prompt >2,000 tokens, sent every request | Flag as high-value caching target | | `max_tokens` not set per endpoint | Flag as active cost leak | | No cost alerts configured | Spend spikes go undetected for days; set p95 cost-per-request alerts | | Free tier users consuming same model as paid | Tier model access by user tier | --- ## Failure Modes and Recovery | Situation | Response | |---|---| | No token logs exist | Stop. Logging schema is deliverable #1. Return once baseline data is available. | | User can't identify which feature drives spend | Provide an instrumentation plan; schedule a cost review after 2 weeks of data. | | Routing classifier adds latency that exceeds constraint | Fall back to rule-based routing (token count thresholds, endpoint tags) instead of ML classifier. | | Cache hit rate is below 20% | Diagnose: are prompts highly variable? Is context dynamic? Recommend semantic caching or rethink what's being cached. | | Prompt compression degrades quality | Restore compressed section. Flag the specific instruction as compression-resistant. | --- ## Handoff Triggers If the conversation shifts to one of these, pause and invoke the relevant skill rather than continuing inline: - **Prompt quality or effectiveness deteriorates** → invoke `senior-prompt-engineer` - **Retrieval pipeline design comes up** → invoke `rag-architect` - **Broader monitoring stack beyond cost metrics** → invoke `observability-designer` - **Latency profiling becomes the primary concern** → invoke `performance-profiler`
Related in Design
contribute
IncludedLocal-only OSS contribution command center. Auto-refreshes the user's in-flight PR and issue state on invoke so conversations start with full context — no need to brief Claude on what's in flight. Helps the user find issues to contribute to on GitHub, builds per-repo dossiers of what each upstream expects (CLA, DCO, branch convention, AI policy, draft-first, review bots, issue templates), runs deterministic gates before any external action so AI-assisted contributions don't reach maintainers as slop. State is markdown-only: candidate files at ~/.contribute-system/candidates/, repo dossiers at ~/.contribute-system/research/, append-only event log at ~/.contribute-system/log.jsonl. No database, no cloud calls. Use when the user asks about their PRs / issues / contributions, wants to find new work to take on, claim an issue, build/refresh a repo's dossier, or draft a Design Issue or PR. Trigger with "/contribute", "what's my PR status", "find a contribution", "claim issue X", "draft a Design Issue for Y", "refresh dossier for Z".
architectural-analysis
IncludedUser-triggered deep architectural analysis of a codebase or scoped subtree across eight modes — information architecture, data flow, integration points, UI surfaces, interaction patterns, data model, control flow, and failure modes. This skill should be used when the user asks to "diagram this codebase," "map the architecture," "show the data flow," "give me an ERD," "trace control flow," "find the integration points," "verify the layout pattern," "audit the UX architecture," or any similar request whose primary deliverable is mermaid diagrams plus cited reports under docs/architecture/. Dispatches haiku/sonnet sub-agents in parallel for per-mode exploration, then verifies every citation mechanically before any node lands in a diagram. Not for one-off prose explanations of code (use code-explanation) or for high-level system design from scratch (use system-design).
mcp
IncludedModel Context Protocol (MCP) server development and tool management. Languages: Python, TypeScript. Capabilities: build MCP servers, integrate external APIs, discover/execute MCP tools, manage multi-server configs, design agent-centric tools. Actions: create, build, integrate, discover, execute, configure MCP servers/tools. Keywords: MCP, Model Context Protocol, MCP server, MCP tool, stdio transport, SSE transport, tool discovery, resource provider, prompt template, external API integration, Gemini CLI MCP, Claude MCP, agent tools, tool execution, server config. Use when: building MCP servers, integrating external APIs as MCP tools, discovering available MCP tools, executing MCP capabilities, configuring multi-server setups, designing tools for AI agents.
react-native-skia
IncludedDesign, build, debug, and optimise high-polish animated graphics in React Native or Expo using @shopify/react-native-skia, Reanimated, and Gesture Handler. Use when the user wants canvas-driven UI, shaders, paths, rich text, image filters, sprite fields, Skottie, video frames, snapshots, web CanvasKit setup, or performance tuning for custom motion-heavy elements such as loaders, hero art, cards, charts, progress indicators, particle systems, or gesture-driven surfaces. Also use when the user asks for fluid, glow, glass, blob, parallax, 60fps/120fps, or GPU-friendly animated effects in React Native, even if they do not explicitly say "Skia". Do not use for ordinary form/layout work with standard views.
plaid
IncludedProduct Led AI Development — guides founders from idea to launched product. Six capabilities: Idea (discover a product idea), Validate (pressure-test the idea against fatal flaws, problem reality, competition, and 2-week MVP feasibility), Plan (vision intake + document generation), Design (translate image references into a design.md spec), Launch (go-to-market strategy), and Build (roadmap execution). Use when someone says "PLAID", "plaid idea", "help me find an idea", "product idea", "idea from my business", "idea from my expertise", "plaid validate", "validate my idea", "pressure-test", "is this idea good", "find fatal flaws", "validate the problem", "plan a product", "define my vision", "generate a PRD", "product strategy", "plaid design", "design from image", "translate image to design", "create design.md", "extract design tokens", "plaid launch", "go-to-market", "launch plan", "GTM strategy", "launch playbook", "plaid build", "build the app", "start building", or "execute the roadmap".
nextjs-framer-motion-animations
IncludedAdds production-safe Motion for React or Framer Motion animations to Next.js apps, including reveal, hover and tap micro-interactions, whileInView, stagger, AnimatePresence, layout and layoutId transitions, reorder, scroll-linked UI, and lightweight route-content transitions. Use when the user asks to add, refactor, or debug Motion or Framer Motion in App Router or Pages Router codebases, especially around server/client boundaries, reduced motion, LazyMotion, bundle size, hydration, or route transitions. Avoid for GSAP-style timelines, WebGL or 3D scenes, heavy scroll storytelling, or CSS-only effects unless Motion is explicitly requested.