evaluate-skills

Evaluate Claude Code skills against best practices for size, structure, examples, and prompt engineering. Use when reviewing skills for deployment, optimization, or standards compliance.

What this skill does


# Claude Code Skill Evaluator

Systematically evaluate Claude Code skills for quality, compliance with best practices, and optimization opportunities. Provides detailed assessment with actionable suggestions for improvement.

## Table of Contents

- [Instructions](#instructions)
  - [1. Find Skill](#1-find-skill)
  - [2. Read the Skill File](#2-read-the-skill-file)
  - [3. Analyze Against Best Practices](#3-analyze-against-best-practices)
    - [Dimension 1: Size & Length](#dimension-1-size--length)
    - [Dimension 2: Token Economy](#dimension-2-token-economy)
    - [Dimension 3: Degrees of Freedom](#dimension-3-degrees-of-freedom)
    - [Dimension 4: Scope Definition](#dimension-4-scope-definition)
    - [Dimension 5: Description Quality](#dimension-5-description-quality)
    - [Dimension 6: Structure & Organization](#dimension-6-structure--organization)
    - [Dimension 7: Examples](#dimension-7-examples)
    - [Dimension 8: Anti-Pattern Detection](#dimension-8-anti-pattern-detection)
    - [Dimension 9: Prompt Engineering Quality](#dimension-9-prompt-engineering-quality)
    - [Dimension 10: Completeness](#dimension-10-completeness)
  - [4. Generate Comprehensive Evaluation Report](#4-generate-comprehensive-evaluation-report)
  - [5. Deliver Report to User](#5-deliver-report-to-user)
- [Important Guidelines](#important-guidelines)
- [Requirements](#requirements)
- [Context & Standards](#context--standards)

## Instructions

### 1. Find Skill

Identify the skill in the directory passed to you, or find all skills in the user's `~/.claude/skills/` directory. For each subdirectory (excluding hidden ones), verify that it contains a `SKILL.md` file.

Then:
- List the available skills for the user
- Ask which skill to evaluate (or accept a skill name as input)
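
The discovery step could be sketched as follows. This is a minimal illustration, not part of the skill itself: the function name `find_skills` is hypothetical, and it assumes "hidden" means a directory name starting with a dot.

```python
from pathlib import Path


def find_skills(skills_root: Path) -> list[str]:
    """Return names of non-hidden subdirectories that contain a SKILL.md file."""
    if not skills_root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in skills_root.iterdir()
        if entry.is_dir()
        and not entry.name.startswith(".")  # assumption: hidden = dot-prefixed
        and (entry / "SKILL.md").is_file()
    )


if __name__ == "__main__":
    # Default location named in the step above.
    for name in find_skills(Path.home() / ".claude" / "skills"):
        print(name)
```

Directories without a `SKILL.md` are silently skipped here; an evaluator might prefer to report them instead.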

### 2. Read the Skill File

Once a skill is selected, read its `SKILL.md` file and extract:
- Frontmatter metadata (name, description)
- Total line count
- Word count
- Character count
- Structure and sections
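
As a rough sketch, the counts and the section outline can be gathered in one pass over the file text. The helper names `skill_metrics` and `list_headings` are hypothetical, and the heading detection is a simple heuristic that only skips fenced code blocks.

```python
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built this way to keep the example readable
HEADING = re.compile(r"^#{1,6}\s")


def skill_metrics(text: str) -> dict[str, int]:
    """Count lines, words, and characters in the raw SKILL.md text."""
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "characters": len(text),
    }


def list_headings(text: str) -> list[str]:
    """Return Markdown headings, ignoring '#' lines inside fenced code blocks."""
    headings = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if not in_fence and HEADING.match(line):
            headings.append(line.strip())
    return headings
```

Note that these counts cover the whole file, frontmatter included; Dimension 1 counts only the body, so the frontmatter lines would need to be subtracted first.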

#### Error Handling

If SKILL.md is malformed, missing frontmatter, or unreadable:
- Report the specific error to the user (e.g., "SKILL.md missing required frontmatter field: name")
- Skip the full evaluation
- Suggest corrective action if possible
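
One way to surface these errors is to validate the frontmatter before any scoring. The sketch below is a simplification under stated assumptions: `parse_frontmatter` is a hypothetical helper, and it only understands flat `key: value` lines rather than full YAML, so a real evaluator would likely use a proper YAML parser.

```python
REQUIRED_FIELDS = ("name", "description")


def parse_frontmatter(text: str) -> tuple[dict[str, str], list[str]]:
    """Return (fields, errors) for a SKILL.md string.

    Assumption: frontmatter is a block delimited by '---' lines that holds
    flat `key: value` pairs. This is not a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, ["SKILL.md missing frontmatter block"]
    try:
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        return {}, ["SKILL.md frontmatter is not closed with '---'"]
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    errors = [
        f"SKILL.md missing required frontmatter field: {field}"
        for field in REQUIRED_FIELDS
        if not fields.get(field)
    ]
    return fields, errors
```

If the returned error list is non-empty, the evaluation would stop and report those messages rather than continue to scoring.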

#### Review Example Report Format

Before analyzing, consult the example evaluation reports:
- **`examples/EXAMPLE.md`** - Demonstrates evaluation of a production-ready skill with passing scores
- **`examples/EXAMPLE-WITH-WARNINGS.md`** - Demonstrates evaluation of a near-production skill with warnings and improvement suggestions

These examples show proper report structure, formatting, status indicators (✓ Pass / ⚠ Warning / ❌ Fail), and how to deliver actionable feedback across the quality spectrum.

### 3. Analyze Against Best Practices

Evaluate the skill across **10 dimensions**:

#### Dimension 1: Size & Length
**Guidelines:**
- Body: Under 500 lines (hard maximum)
- Name: Maximum 64 characters
- Description: Maximum 1024 characters (200-character summary preferred)
- Table of Contents: Include if over 100 lines

**Assessment:**
- Count total lines in SKILL.md body
- Flag if over 500 lines
- Compliment if well-sized (ideal: 100-300 lines for medium skills)
- Check if TOC exists (expected for 100+ line skills)
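
The size limits above are mechanical enough to check in code. The following is a hedged sketch: `check_size` is a hypothetical helper, the thresholds are copied from the guidelines, and the table-of-contents test (a heading containing "table of contents") is an assumption about how a TOC would be labeled.

```python
MAX_BODY_LINES = 500
MAX_NAME_CHARS = 64
MAX_DESCRIPTION_CHARS = 1024
TOC_THRESHOLD_LINES = 100


def check_size(name: str, description: str, body: str) -> list[str]:
    """Return size findings for a skill; an empty list means all limits are met."""
    findings = []
    body_lines = body.splitlines()
    line_count = len(body_lines)
    if line_count > MAX_BODY_LINES:
        findings.append(f"FAIL: body has {line_count} lines (limit {MAX_BODY_LINES})")
    if len(name) > MAX_NAME_CHARS:
        findings.append(f"FAIL: name has {len(name)} characters (limit {MAX_NAME_CHARS})")
    if len(description) > MAX_DESCRIPTION_CHARS:
        findings.append(
            f"FAIL: description has {len(description)} characters "
            f"(limit {MAX_DESCRIPTION_CHARS})"
        )
    # Assumption: a TOC is introduced by a heading that mentions it by name.
    has_toc = any(
        line.lstrip().startswith("#") and "table of contents" in line.lower()
        for line in body_lines
    )
    if line_count > TOC_THRESHOLD_LINES and not has_toc:
        findings.append(f"WARN: body has {line_count} lines but no table of contents")
    return findings
```

The FAIL/WARN prefixes loosely mirror the report's Pass / Warning / Fail indicators; the exact wording of each finding is illustrative.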

#### Dimension 2: Token Economy
**Guidelines:**
- Default assumption: Claude is already very smart
- Challenge each piece of information: "Does Claude really need this explanation?"
- Avoid over-explaining concepts Claude already knows (e.g., what PDFs are, how libraries work)
- Concise examples preferred over verbose explanations

**Assessment:**
- Are there paragraphs explaining concepts Claude inherently knows?
- Could explanations be shortened without losing meaning?
- Is the skill concise within its size limits, or padded with unnecessary context?
- Does each section justify its token cost?

#### Dimension 3: Degrees of Freedom
**Guidelines:**
- High freedom (text-based instructions): Use when multiple approaches are valid or decisions depend on context
- Medium freedom (pseudocode/scripts with parameters): Use when a preferred pattern exists but variation is acceptable
- Low freedom (specific scripts, few parameters): Use when operations are fragile, consistency is critical, or an exact sequence is required

**Assessment:**
- Does the skill match instruction specificity to task fragility?
- Are fragile/destructive operations given explicit, low-freedom instructions?
- Are context-dependent tasks given appropriate flexibility?
- Does the skill avoid over-constraining where multiple valid approaches exist?

#### Dimension 4: Scope Definition
**Guidelines:**
- Narrow focus (one skill = one capability)
- Clear boundary of what the skill does and doesn't do
- No scope creep (e.g., narrow "document processing" down to "PDF form filling")

**Assessment:**
- Does the description clearly state what the skill does?
- Are there multiple conflicting capabilities within one skill?
- Is the boundary clear to a new user?

#### Dimension 5: Description Quality
**Guidelines:**
- Third-person voice (avoid "I can" or "you can")
- Include both WHAT and WHEN TO USE
- Specific, searchable terminology
- 200-character summary ideal

**Assessment:**
- Voice and tone appropriate?
- Discovery terms clear? (Would users search for these terms?)
- Is "when to use" explained?
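
Parts of this dimension can be pre-screened automatically before a human-style judgment is made. The sketch below is only a heuristic: `check_description` is a hypothetical helper, the voice check looks for the two phrases named in the guidelines, and treating the phrase "use when" as evidence of a "when to use" statement is an assumption that will miss other valid wordings.

```python
import re

# The two phrasings the guidelines call out as first- or second-person voice.
FIRST_OR_SECOND_PERSON = re.compile(r"\b(I can|you can)\b", re.IGNORECASE)
IDEAL_SUMMARY_CHARS = 200


def check_description(description: str) -> list[str]:
    """Return heuristic findings about a skill description."""
    findings = []
    if FIRST_OR_SECOND_PERSON.search(description):
        findings.append("uses first- or second-person voice")
    # Assumption: "use when" is the usual marker of a WHEN TO USE clause.
    if "use when" not in description.lower():
        findings.append('no "when to use" phrasing found')
    if len(description) > IDEAL_SUMMARY_CHARS:
        findings.append(f"longer than the ideal {IDEAL_SUMMARY_CHARS} characters")
    return findings
```

Findings from this screen are prompts for closer reading, not verdicts; a description can satisfy the guideline with different wording.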

#### Dimension 6: Structure & Organization
**Guidelines:**
- Clear section hierarchy (headings, subsections)
- Logical flow (progressive disclosure): start with a small, stable entry point and point to deeper sections/references rather than front-loading everything
- Step-by-step instructions preferred for workflows
- Rules/constraints clearly stated

**Assessment:**
- Is structure logical?
- Can a user easily navigate?
- Are instructions sequential or scattered?

#### Dimension 7: Examples
**Guidelines:**
- Quality over quantity
- Typical: 2-3 examples for basic skills, more for format-heavy
- Concrete (not abstract)
- Show patterns and edge cases

**Assessment:**
- How many examples? (count them)
- Are examples concrete and realistic?
- Do they demonstrate key patterns?
- Are there enough to show variations?

#### Dimension 8: Anti-Pattern Detection
**Red flags (check for these):**
- ❌ Windows-style paths (should use forward slashes)
- ❌ Magic numbers without justification
- ❌ Vague terminology (inconsistent synonyms)
- ❌ Time-sensitive instructions (date-dependent)
- ❌ Nested file references (more than one level deep from SKILL.md; all reference files should link directly from SKILL.md)
- ❌ Vague descriptions (missing WHAT or WHEN)
- ❌ Scope creep (trying to do too much)
- ❌ No error handling or validation steps
- ❌ No user feedback loops (for complex workflows)
- ❌ Multiple conflicting approaches for same task
- ❌ MCP tool references without server prefix (should use `ServerName:tool_name` format)
- ❌ Assumed package availability (missing explicit installation instructions)
- ❌ Vague/generic naming (`helper`, `utils`, `tools` instead of imperative verb form like `process-pdfs`)

**Assessment:**
- Count violations
- Severity of each violation
- Impact on usability
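
A few of these red flags lend themselves to pattern matching, while most still need judgment. The sketch below covers only two of them (Windows-style paths and generic names) and is an assumption-laden illustration: `detect_anti_patterns` is a hypothetical helper, and the backslash pattern can produce false positives on text such as regex escapes.

```python
import re

# Drive letters (C:\) or two path-like segments joined by a backslash.
WINDOWS_PATH = re.compile(r"[A-Za-z]:\\|\b[\w.-]+\\[\w.-]+")
# The generic names listed in the red flags above.
GENERIC_NAMES = {"helper", "utils", "tools"}


def detect_anti_patterns(name: str, body: str) -> list[str]:
    """Return a list of violations found by simple pattern checks."""
    violations = []
    if name.lower() in GENERIC_NAMES:
        violations.append(f"generic skill name: {name}")
    for number, line in enumerate(body.splitlines(), start=1):
        if WINDOWS_PATH.search(line):
            violations.append(f"line {number}: Windows-style path")
    return violations
```

The length of the returned list gives the violation count; severity and usability impact still have to be assessed per finding.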

#### Dimension 9: Prompt Engineering Quality
**Guidelines:**
- Imperative language (verb-first instructions)
- Explicit rules with clear boundaries
- Validation loops where appropriate (especially for destructive ops)
- Clear error handling
- Assume the user is intelligent (don't over-explain)

**Assessment:**
- Is language imperative?
- Are there validation steps?
- How clear are the rules?
- Is error handling explicit?

#### Dimension 10: Completeness
**Guidelines:**
- Requirements listed (what's needed to use the skill)
- Edge cases acknowledged
- Limitations stated where relevant

**Assessment:**
- Are prerequisites clear?
- Are limitations or edge cases mentioned?
- Is scope of responsibility clear?

### 4. Generate Comprehensive Evaluation Report

Create a detailed evaluation report with these components:

1. **Executive Summary**: 1-2 paragraphs covering

Files: 3

Size: 33.0 KB

Complexity: 41/100

Category: AI Agents

Source: https://github.com/lhohan/agent-chisels/tree/main/agentfiles/shared/skills/evaluate-skills
