Claude
Skills
Sign in
Back

fine-tuning-data-generator

Included with Lifetime
$97 forever

Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.

Generalscripts

What this skill does


# Fine-Tuning Data Generator

This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models using frameworks like Unsloth, Axolotl, or similar tools.

## What Do I Need?

| Need | Resource |
|------|----------|
| **Planning my dataset** - requirements, strategy, quality checklist | [`resources/dataset-strategy.md`](resources/dataset-strategy.md) |
| **How to create diverse examples** - variation techniques, multi-turn patterns, format-specific guidance | [`resources/generation-techniques.md`](resources/generation-techniques.md) |
| **ChatML format details** - structure, specification, common issues, framework compatibility | [`resources/chatml-format.md`](resources/chatml-format.md) |
| **Example datasets** - inspiration across domains, multi-turn samples, edge cases | [`resources/examples.md`](resources/examples.md) |
| **Validating quality** - validation workflow, analyzing datasets, troubleshooting | [`resources/quality-validation.md`](resources/quality-validation.md) |
| **Training & deployment** - framework setup, hyperparameters, optimization, deployment | [`resources/framework-integration.md`](resources/framework-integration.md) |

## Workflow

### Phase 1: Gather Requirements

Start with these essential clarifying questions:

**Task Definition:**
- What is the model being trained to do? (e.g., customer support, code generation, creative writing)
- What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
- How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)

**Quality & Diversity:**
- Complexity range: simple to complex mix, or focus on specific difficulty level?
- Diversity: edge cases, error handling, unusual scenarios?
- Tone/style: professional, friendly, technical, concise, detailed?
- Response length preferences?
- Any specific formats: code blocks, lists, tables, JSON?

**Dataset Composition:**
- Distribution across subtopics: evenly distributed or weighted?
- Include negative examples (what NOT to do)?
- Need validation split? (Recommend 10-20% of total)

See [`resources/dataset-strategy.md`](resources/dataset-strategy.md) for detailed question templates.

### Phase 2: Create Generation Plan

Present a plan covering:
- Number and distribution of examples across categories
- Key topics/scenarios to cover
- Diversity strategies (phrasing variations, complexity levels, edge cases)
- System prompt approach (consistent vs. varied)
- Quality assurance approach

**Get user approval before generating.**

### Phase 3: Generate Synthetic Data

Create examples following these quality standards:

**Key Principles:**
- Realistic scenarios reflecting real-world use cases
- Natural language with varied phrasing and formality levels
- Accurate, helpful responses aligned with desired behavior
- Consistent ChatML formatting throughout
- Balanced difficulty (unless specified)
- Meaningful variety (no repetition)
- Include edge cases and error scenarios

**Diversity Techniques:**
- Vary query phrasing (questions, commands, statements)
- Include different expertise levels (beginner, intermediate, expert)
- Cover both positive and negative examples
- Mix short and long-form responses
- Include multi-step reasoning when appropriate
- Add context variations

See [`resources/generation-techniques.md`](resources/generation-techniques.md) for detailed techniques, domain-specific guidance, and batch generation workflow.

### Phase 4: Validate & Document

Run validation tools and checks:

```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

**Quality Checklist:**
- [ ] JSON validation passed (no errors)
- [ ] Analysis shows good diversity metrics
- [ ] Manual sample review passed
- [ ] No duplicate or near-duplicate examples
- [ ] All required fields present
- [ ] Realistic user queries
- [ ] Accurate, helpful responses
- [ ] Balanced category distribution
- [ ] Dataset metadata documented

See [`resources/quality-validation.md`](resources/quality-validation.md) for validation details, troubleshooting, and documentation templates.

### Phase 5: Integration & Training

Prepare for training with your framework of choice:

**Output Files:**
- `training_data.jsonl` - Main training set
- `validation_data.jsonl` - Optional validation set
- `dataset_info.txt` - Metadata and statistics

**Framework Setup:**
- Unsloth: Automatic ChatML detection, efficient 4-bit training
- Axolotl: Specify `type: chat_template` and `chat_template: chatml`
- Hugging Face: Use tokenizer's `apply_chat_template()` method
- Custom: Load from JSONL, handle ChatML formatting

See [`resources/framework-integration.md`](resources/framework-integration.md) for setup code, hyperparameters, deployment options, and best practices.

## ChatML Format Overview

Each training example is a JSON object with a `messages` array:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
```

**Roles:**
- `system`: Sets assistant behavior (optional but recommended)
- `user`: User's input/query
- `assistant`: Model's expected response

**Multi-turn:** Add additional user/assistant message pairs for conversations.

See [`resources/chatml-format.md`](resources/chatml-format.md) for detailed specification, validation, common issues, and framework-specific notes.

## Tool Reference

### Scripts in `scripts/`

#### validate_chatml.py
Validates ChatML format JSONL files:
```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```

**Checks:**
- Valid JSON formatting
- Required fields (messages, role, content)
- Valid role values (system, user, assistant)
- Proper message order
- Duplicate detection
- Diversity metrics

#### analyze_dataset.py
Provides comprehensive statistics and analysis:
```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

**Provides:**
- Dataset overview (total examples, message counts)
- Message length statistics
- System prompt variations
- User query patterns (questions, commands, code-related, length categories)
- Assistant response patterns (code blocks, lists, headers, length categories)
- Quality indicators (diversity score, balance ratio)
- Token estimates and cost projection

## Common Workflows

### Small Dataset (100-200 examples)
1. Gather requirements
2. Create generation plan for 1-2 categories
3. Generate in single batch, review quality
4. Validate and document
5. Ready for training

### Medium Dataset (500-1000 examples)
1. Gather requirements
2. Create detailed plan with multiple categories
3. Generate in 2-3 batches, reviewing after each
4. Analyze diversity and adjust approach
5. Fill any gaps
6. Final validation and documentation

### Large Dataset (2000+ examples)
1. Gather comprehensive requirements
2. Create multi-batch generation plan
3. Batch 1 (50-100): Foundation examples
4. Batch 2 (100-200): Complexity expansion
5. Batch 3 (100-200): Coverage filling
6. Batch 4 (50-100): Polish and validation
7. Run full validation suite
8. Generate comprehensive documentation

## Best Practices

### Start Small, Iterate
1. Generate 10-20 examples first
2. Review and get feedback
3. Refine approach based on feedback
4. Scale up to full dataset

### Quality Over Quantity
- Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
- Each example should teach something new
- Maintain consistent response quality throughout

### Diversify Systematically
- Vary query phrasing (questions, 

Related in General