when-debugging-ml-training-use-ml-training-debugger

Included with Lifetime

$97 forever

Debug ML training issues and optimize performance including loss divergence, overfitting, and slow convergence

machine-learningdebuggingmltrainingoptimizationtroubleshooting

What this skill does


# ML Training Debugger - Diagnose and Fix Training Issues

## Overview

Systematic debugging workflow for ML training issues including loss divergence, overfitting, slow convergence, gradient problems, and performance optimization.

## When to Use

- Training loss becomes NaN or infinite
- Severe overfitting (train >> val performance)
- Training not converging
- Gradient vanishing/exploding
- Poor validation accuracy
- Training too slow

## Phase 1: Diagnose Issue (8 min)

### Objective
Identify the specific training problem

### Agent: ML-Developer

**Step 1.1: Analyze Training Curves**
```python
import json
import numpy as np

# Load training history
with open('training_history.json', 'r') as f:
    history = json.load(f)

# Diagnose issues
diagnosis = {
    'loss_divergence': check_loss_divergence(history['loss']),
    'overfitting': check_overfitting(history['loss'], history['val_loss']),
    'slow_convergence': check_convergence_rate(history['loss']),
    'gradient_issues': check_gradient_health(history),
    'nan_values': any(np.isnan(history['loss']))
}

def check_loss_divergence(losses):
    # Loss increasing over time
    if len(losses) > 10:
        recent_trend = np.mean(losses[-5:]) > np.mean(losses[-10:-5])
        return recent_trend

def check_overfitting(train_loss, val_loss):
    # Val loss diverging from train loss
    if len(train_loss) > 10:
        gap = np.mean(val_loss[-5:]) - np.mean(train_loss[-5:])
        return gap > 0.5  # Significant gap

def check_convergence_rate(losses):
    # Loss barely changing
    if len(losses) > 20:
        recent_change = abs(losses[-1] - losses[-10])
        return recent_change < 0.01  # Plateau

await memory.store('ml-debugger/diagnosis', diagnosis)
```

**Step 1.2: Identify Root Cause**
```python
root_causes = []

if diagnosis['loss_divergence']:
    root_causes.append({
        'issue': 'Loss Divergence',
        'likely_cause': 'Learning rate too high',
        'severity': 'HIGH',
        'fix': 'Reduce learning rate by 10x'
    })

if diagnosis['nan_values']:
    root_causes.append({
        'issue': 'NaN Loss',
        'likely_cause': 'Numerical instability',
        'severity': 'CRITICAL',
        'fix': 'Add gradient clipping, reduce LR, check data for extreme values'
    })

if diagnosis['overfitting']:
    root_causes.append({
        'issue': 'Overfitting',
        'likely_cause': 'Model too complex or insufficient regularization',
        'severity': 'MEDIUM',
        'fix': 'Add dropout, L2 regularization, or more training data'
    })

if diagnosis['slow_convergence']:
    root_causes.append({
        'issue': 'Slow Convergence',
        'likely_cause': 'Learning rate too low or poor initialization',
        'severity': 'LOW',
        'fix': 'Increase learning rate, use better initialization'
    })

await memory.store('ml-debugger/root-causes', root_causes)
```

**Step 1.3: Generate Diagnostic Report**
```python
report = f"""
# ML Training Diagnostic Report

## Issues Detected
{chr(10).join([f"- **{rc['issue']}** (Severity: {rc['severity']})" for rc in root_causes])}

## Root Cause Analysis
{chr(10).join([f"""
### {rc['issue']}
- **Likely Cause**: {rc['likely_cause']}
- **Recommended Fix**: {rc['fix']}
""" for rc in root_causes])}

## Training History Summary
- Final Train Loss: {history['loss'][-1]:.4f}
- Final Val Loss: {history['val_loss'][-1]:.4f}
- Epochs Completed: {len(history['loss'])}
"""

with open('diagnostic_report.md', 'w') as f:
    f.write(report)
```

### Validation Criteria
- [ ] Issues identified
- [ ] Root causes determined
- [ ] Severity assessed
- [ ] Report generated

## Phase 2: Analyze Root Cause (10 min)

### Objective
Deep dive into the specific problem

### Agent: Performance-Analyzer

**Step 2.1: Gradient Analysis**
```python
import tensorflow as tf

# Monitor gradients during training
def gradient_analysis(model, X_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = model(X_batch, training=True)
        loss = loss_fn(y_batch, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)

    analysis = {
        'gradient_norms': [tf.norm(g).numpy() for g in gradients if g is not None],
        'has_nan': any(tf.reduce_any(tf.math.is_nan(g)) for g in gradients if g is not None),
        'has_inf': any(tf.reduce_any(tf.math.is_inf(g)) for g in gradients if g is not None)
    }

    # Check for vanishing/exploding gradients
    gradient_norms = np.array(analysis['gradient_norms'])
    analysis['vanishing'] = np.mean(gradient_norms) < 1e-7
    analysis['exploding'] = np.mean(gradient_norms) > 100

    return analysis

grad_analysis = gradient_analysis(model, X_train[:32], y_train[:32])
await memory.store('ml-debugger/gradient-analysis', grad_analysis)
```

**Step 2.2: Data Analysis**
```python
# Check for data issues
data_issues = {
    'class_imbalance': check_class_balance(y_train),
    'outliers': detect_outliers(X_train),
    'missing_normalization': check_normalization(X_train),
    'label_noise': estimate_label_noise(X_train, y_train, model)
}

def check_class_balance(labels):
    unique, counts = np.unique(labels, return_counts=True)
    imbalance_ratio = max(counts) / min(counts)
    return imbalance_ratio > 10  # Significant imbalance

def check_normalization(data):
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    # Data should be roughly normalized
    return np.mean(np.abs(mean)) > 1 or np.mean(std) > 10

await memory.store('ml-debugger/data-issues', data_issues)
```

**Step 2.3: Model Architecture Review**
```python
# Analyze model complexity
architecture_analysis = {
    'total_params': model.count_params(),
    'trainable_params': sum([tf.size(v).numpy() for v in model.trainable_variables]),
    'depth': len(model.layers),
    'has_batch_norm': any('batch_norm' in layer.name for layer in model.layers),
    'has_dropout': any('dropout' in layer.name for layer in model.layers),
    'activation_functions': [layer.activation.__name__ for layer in model.layers if hasattr(layer, 'activation')]
}

# Check for common issues
architecture_issues = []

if architecture_analysis['total_params'] / len(X_train) > 10:
    architecture_issues.append('Model too complex relative to data size')

if not architecture_analysis['has_batch_norm'] and architecture_analysis['depth'] > 10:
    architecture_issues.append('Deep model without batch normalization')

await memory.store('ml-debugger/architecture-issues', architecture_issues)
```

### Validation Criteria
- [ ] Gradients analyzed
- [ ] Data issues identified
- [ ] Architecture reviewed
- [ ] Problems documented

## Phase 3: Apply Fix (15 min)

### Objective
Implement corrections based on diagnosis

### Agent: Coder

**Step 3.1: Fix Learning Rate**
```python
if 'Loss Divergence' in [rc['issue'] for rc in root_causes]:
    # Reduce learning rate
    old_lr = model.optimizer.learning_rate.numpy()
    new_lr = old_lr / 10
    model.optimizer.learning_rate.assign(new_lr)
    print(f"✅ Reduced learning rate: {old_lr} → {new_lr}")

if 'Slow Convergence' in [rc['issue'] for rc in root_causes]:
    # Increase learning rate with warmup
    new_lr = old_lr * 5
    model.optimizer.learning_rate.assign(new_lr)

    # Add LR scheduler
    lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-7
    )
```

**Step 3.2: Fix Overfitting**
```python
if 'Overfitting' in [rc['issue'] for rc in root_causes]:
    # Add regularization
    from tensorflow.keras import regularizers

    # Clone model with regularization
    new_layers = []
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.Dense):
            new_layer = tf.keras.layers.Dense(
                layer.units,
                activation=layer.activation,
                kernel_regularizer=regularizers.l2(0.01),  # Add L2
                name=layer.name + '_reg'
            )
            new_layers.append

Files: 5

Size: 30.7 KB

Complexity: 42/100

Category: machine-learning

Source: https://github.com/aiskillstore/marketplace/tree/main/skills/dnyoussef/when-debugging-ml-training-use-ml-training-debugger

Related in machine-learning

when-developing-ml-models-use-ml-expert

Included

Specialized ML model development, training, and deployment workflow

machine-learning

when-developing-ml-models-use-ml-expert

Included

Specialized ML model development, training, and deployment workflow

machine-learning