fine-tuning-data-generator

Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.


# Fine-Tuning Data Generator

This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models using frameworks like Unsloth, Axolotl, or similar tools.

## What Do I Need?

| Need | Resource |
|------|----------|
| **Planning my dataset** - requirements, strategy, quality checklist | [`resources/dataset-strategy.md`](resources/dataset-strategy.md) |
| **How to create diverse examples** - variation techniques, multi-turn patterns, format-specific guidance | [`resources/generation-techniques.md`](resources/generation-techniques.md) |
| **ChatML format details** - structure, specification, common issues, framework compatibility | [`resources/chatml-format.md`](resources/chatml-format.md) |
| **Example datasets** - inspiration across domains, multi-turn samples, edge cases | [`resources/examples.md`](resources/examples.md) |
| **Validating quality** - validation workflow, analyzing datasets, troubleshooting | [`resources/quality-validation.md`](resources/quality-validation.md) |
| **Training & deployment** - framework setup, hyperparameters, optimization, deployment | [`resources/framework-integration.md`](resources/framework-integration.md) |

## Workflow

### Phase 1: Gather Requirements

Start with these essential clarifying questions:

**Task Definition:**
- What is the model being trained to do? (e.g., customer support, code generation, creative writing)
- What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
- How many training examples are needed? (Recommended: 100+ for simple tasks, 500-1000+ for complex ones)

**Quality & Diversity:**
- Complexity range: simple to complex mix, or focus on specific difficulty level?
- Diversity: edge cases, error handling, unusual scenarios?
- Tone/style: professional, friendly, technical, concise, detailed?
- Response length preferences?
- Any specific formats: code blocks, lists, tables, JSON?

**Dataset Composition:**
- Distribution across subtopics: evenly distributed or weighted?
- Include negative examples (what NOT to do)?
- Is a validation split needed? (Recommended: 10-20% of the total)

See [`resources/dataset-strategy.md`](resources/dataset-strategy.md) for detailed question templates.
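
The recommended 10-20% validation split can be produced with a seeded shuffle. The following is a minimal sketch, not part of the skill's scripts; `split_dataset` and the placeholder records are hypothetical names used only for illustration.

```python
import random

def split_dataset(examples, validation_fraction=0.1, seed=42):
    """Shuffle a copy of `examples` and split off a validation set."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Placeholder records standing in for real ChatML examples.
examples = [
    {"messages": [{"role": "user", "content": f"question {i}"}]}
    for i in range(200)
]
train, val = split_dataset(examples, validation_fraction=0.15)
print(len(train), len(val))
```

Shuffling before the split avoids a validation set that only contains the last generated category.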

### Phase 2: Create Generation Plan

Present a plan covering:
- Number and distribution of examples across categories
- Key topics/scenarios to cover
- Diversity strategies (phrasing variations, complexity levels, edge cases)
- System prompt approach (consistent vs. varied)
- Quality assurance approach

**Get user approval before generating.**
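
When the plan weights some categories more heavily than others, the per-category counts can be derived from the weights. This is a sketch under assumed inputs; the function name `allocate_examples` and the category names are hypothetical.

```python
def allocate_examples(total, weights):
    """Split `total` examples across categories in proportion to `weights`.

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand leftover slots to the categories with the largest fractional parts.
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

# Hypothetical categories weighted 4:3:2:1.
plan = allocate_examples(
    500, {"billing": 4, "shipping": 3, "returns": 2, "edge_cases": 1}
)
print(plan)
```

Showing the resulting table to the user makes the approval step concrete: they can adjust weights before any examples are generated.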

### Phase 3: Generate Synthetic Data

Create examples following these quality standards:

**Key Principles:**
- Realistic scenarios reflecting real-world use cases
- Natural language with varied phrasing and formality levels
- Accurate, helpful responses aligned with desired behavior
- Consistent ChatML formatting throughout
- Balanced difficulty (unless specified)
- Meaningful variety (no repetition)
- Include edge cases and error scenarios

**Diversity Techniques:**
- Vary query phrasing (questions, commands, statements)
- Include different expertise levels (beginner, intermediate, expert)
- Cover both positive and negative examples
- Mix short and long-form responses
- Include multi-step reasoning when appropriate
- Add context variations

See [`resources/generation-techniques.md`](resources/generation-techniques.md) for detailed techniques, domain-specific guidance, and batch generation workflow.
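
One way to apply the diversity techniques systematically is to enumerate combinations of the variation axes and draw generation specs from them. The sketch below assumes three axes taken from the list above; `build_variation_plan` and the spec field names are hypothetical.

```python
import itertools
import random

PHRASINGS = ["question", "command", "statement"]
EXPERTISE = ["beginner", "intermediate", "expert"]
LENGTHS = ["short", "long-form"]

def build_variation_plan(topics, n_examples, seed=0):
    """Return `n_examples` generation specs spread across all combinations."""
    combos = list(itertools.product(topics, PHRASINGS, EXPERTISE, LENGTHS))
    random.Random(seed).shuffle(combos)
    # Cycle through the shuffled combinations so each is used before any repeats.
    cycled = itertools.islice(itertools.cycle(combos), n_examples)
    return [
        {"topic": t, "phrasing": p, "expertise": e, "response_length": length}
        for t, p, e, length in cycled
    ]

# Hypothetical topics; each spec describes one example to write.
specs = build_variation_plan(["refunds", "shipping"], 36)
print(specs[0])
```

Each spec is only a prompt for what to write; the example text itself still has to be authored and reviewed.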

### Phase 4: Validate & Document

Run validation tools and checks:

```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

**Quality Checklist:**
- [ ] JSON validation passed (no errors)
- [ ] Analysis shows good diversity metrics
- [ ] Manual sample review passed
- [ ] No duplicate or near-duplicate examples
- [ ] All required fields present
- [ ] Realistic user queries
- [ ] Accurate, helpful responses
- [ ] Balanced category distribution
- [ ] Dataset metadata documented

See [`resources/quality-validation.md`](resources/quality-validation.md) for validation details, troubleshooting, and documentation templates.
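
The near-duplicate item on the checklist can be spot-checked with the standard library. The sketch below is an illustration only and is not how the bundled scripts are known to work; the 0.9 threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher

def first_user_message(example):
    """Return the content of the first user message, or an empty string."""
    for message in example["messages"]:
        if message["role"] == "user":
            return message["content"]
    return ""

def find_near_duplicates(examples, threshold=0.9):
    """Return (i, j, ratio) for pairs whose first user messages nearly match."""
    # Normalize case and whitespace so trivial differences do not hide duplicates.
    texts = [" ".join(first_user_message(ex).lower().split()) for ex in examples]
    pairs = []
    # Pairwise comparison is O(n^2); fine for a few thousand examples.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            ratio = SequenceMatcher(None, texts[i], texts[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 3)))
    return pairs

sample = [
    {"messages": [{"role": "user", "content": "How do I reset my password?"}]},
    {"messages": [{"role": "user", "content": "How do I reset my  password??"}]},
    {"messages": [{"role": "user", "content": "What is your refund policy?"}]},
]
print(find_near_duplicates(sample))
```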

### Phase 5: Integration & Training

Prepare for training with your framework of choice:

**Output Files:**
- `training_data.jsonl` - Main training set
- `validation_data.jsonl` - Optional validation set
- `dataset_info.txt` - Metadata and statistics

**Framework Setup:**
- Unsloth: Automatic ChatML detection, efficient 4-bit training
- Axolotl: Specify `type: chat_template` and `chat_template: chatml`
- Hugging Face: Use the tokenizer's `apply_chat_template()` method
- Custom: Load from JSONL, handle ChatML formatting

See [`resources/framework-integration.md`](resources/framework-integration.md) for setup code, hyperparameters, deployment options, and best practices.
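
For the custom route, the sketch below loads a JSONL file and renders each conversation into the commonly used ChatML text layout with `<|im_start|>` and `<|im_end|>` markers. It is an assumption-laden illustration: `load_jsonl` and `render_chatml` are hypothetical helpers, and a real training run should prefer the chat template shipped with the model's tokenizer.

```python
import json

def load_jsonl(path):
    """Read one JSON record per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

def render_chatml(messages, add_generation_prompt=False):
    """Render a `messages` list as ChatML text."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leaves the assistant turn open, as used at inference time.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I reverse a string in Python?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```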

## ChatML Format Overview

Each training example is a JSON object with a `messages` array:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
```

**Roles:**
- `system`: Sets assistant behavior (optional but recommended)
- `user`: User's input/query
- `assistant`: Model's expected response

**Multi-turn:** Append further user/assistant message pairs, in alternating order, to represent a conversation.

See [`resources/chatml-format.md`](resources/chatml-format.md) for detailed specification, validation, common issues, and framework-specific notes.
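
A multi-turn record can be assembled from user/assistant pairs and written as a single JSONL line. The helper names below (`make_example`, `to_jsonl_line`) are hypothetical; this is a sketch of the structure described above, not the skill's own generator.

```python
import json

def make_example(turns, system_prompt=None):
    """Build one record from (user_text, assistant_text) pairs."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    return {"messages": messages}

def to_jsonl_line(example):
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps(example, ensure_ascii=False)

example = make_example(
    [
        ("How do I reverse a string in Python?", "Use slicing: `text[::-1]`"),
        ("And a list?", "Slicing works too:\n`items[::-1]`"),
    ],
    system_prompt="You are a helpful assistant.",
)
print(to_jsonl_line(example))
```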

## Tool Reference

### Scripts in `scripts/`

#### validate_chatml.py
Validates ChatML format JSONL files:
```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```

**Checks:**
- Valid JSON formatting
- Required fields (messages, role, content)
- Valid role values (system, user, assistant)
- Proper message order
- Duplicate detection
- Diversity metrics
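
To make the first four checks concrete, here is a minimal per-line validator. It is not the code of `validate_chatml.py`, whose implementation is not shown here; the ordering rule (optional system message first, then alternating user/assistant turns ending on assistant) is an assumption.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_line(line):
    """Return a list of problems found in one JSONL line (empty means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    problems, roles = [], []
    for index, message in enumerate(messages):
        if not isinstance(message, dict) or "role" not in message or "content" not in message:
            problems.append(f"message {index}: missing 'role' or 'content'")
            continue
        if message["role"] not in VALID_ROLES:
            problems.append(f"message {index}: invalid role {message['role']!r}")
        roles.append(message["role"])
    if problems:
        return problems
    # Assumed ordering rule: optional system first, then user/assistant alternating.
    turns = roles[1:] if roles[0] == "system" else roles
    expected = ["user", "assistant"] * (len(turns) // 2 + 1)
    if not turns or turns != expected[: len(turns)] or turns[-1] != "assistant":
        problems.append("messages are not in system?/user/assistant order")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
print(validate_line(good))
print(validate_line('{"messages": [{"role": "bot", "content": "Hi"}]}'))
```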

#### analyze_dataset.py
Provides comprehensive statistics and analysis:
```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```

**Provides:**
- Dataset overview (total examples, message counts)
- Message length statistics
- System prompt variations
- User query patterns (questions, commands, code-related, length categories)
- Assistant response patterns (code blocks, lists, headers, length categories)
- Quality indicators (diversity score, balance ratio)
- Token estimates and cost projection
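
A few of these statistics can be approximated in plain Python. The sketch below is illustrative and does not reproduce `analyze_dataset.py`; the `summarize` function, its output keys, and the rough four-characters-per-token heuristic are all assumptions.

```python
from collections import Counter

def summarize(examples, chars_per_token=4):
    """Compute basic counts, average lengths, and a rough token estimate."""
    lengths = Counter()
    counts = Counter()
    system_prompts = set()
    for example in examples:
        for message in example["messages"]:
            role = message["role"]
            lengths[role] += len(message["content"])
            counts[role] += 1
            if role == "system":
                system_prompts.add(message["content"])
    total_chars = sum(lengths.values())
    return {
        "examples": len(examples),
        "messages": sum(counts.values()),
        "avg_chars_by_role": {r: lengths[r] / counts[r] for r in counts},
        "unique_system_prompts": len(system_prompts),
        # Heuristic only; use the target model's tokenizer for real numbers.
        "estimated_tokens": total_chars // chars_per_token,
    }

data = [
    {"messages": [
        {"role": "system", "content": "abcd"},
        {"role": "user", "content": "12345678"},
        {"role": "assistant", "content": "xy"},
    ]}
]
print(summarize(data))
```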

## Common Workflows

### Small Dataset (100-200 examples)
1. Gather requirements
2. Create generation plan for 1-2 categories
3. Generate in single batch, review quality
4. Validate and document
5. Hand off the dataset for training

### Medium Dataset (500-1000 examples)
1. Gather requirements
2. Create detailed plan with multiple categories
3. Generate in 2-3 batches, reviewing after each
4. Analyze diversity and adjust approach
5. Fill any gaps
6. Final validation and documentation

### Large Dataset (2000+ examples)
1. Gather comprehensive requirements
2. Create multi-batch generation plan
3. Batch 1 (50-100): Foundation examples
4. Batch 2 (100-200): Complexity expansion
5. Batch 3 (100-200): Coverage filling
6. Batch 4 (50-100): Polish and validation
7. Run full validation suite
8. Generate comprehensive documentation

## Best Practices

### Start Small, Iterate
1. Generate 10-20 examples first
2. Review and get feedback
3. Refine approach based on feedback
4. Scale up to full dataset

### Quality Over Quantity
- Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
- Each example should teach something new
- Maintain consistent response quality throughout

### Diversify Systematically
- Vary query phrasing (questions,

Files: 11

Size: 100.5 KB

Complexity: 67/100

Category: General

Source: https://github.com/markpitt/claude-skills/tree/main/skills/fine-tuning-data-generator
