
mlops


End-to-end MLOps guidance on AWS — platform selection, training, inference, pipelines, monitoring, and cost optimization. This skill should be used when the user asks to "build an ML pipeline", "deploy a model on SageMaker", "set up MLOps", "configure SageMaker Pipelines", "choose between SageMaker and Bedrock", "deploy ML models to production", "set up model monitoring", "use MLflow on AWS", "train a model with Spot instances", "configure inference endpoints", "set up distributed training", or mentions SageMaker, MLflow, Kubeflow, ML pipelines, model registry, model monitoring, hyperparameter tuning, inference endpoints, or MLOps on AWS.


What this skill does


Specialist guidance for MLOps on AWS. Covers platform selection, training job configuration, inference deployment patterns, CI/CD for ML, experiment tracking, model monitoring, and cost optimization.

## Process

1. Identify the ML workload characteristics: model type (classical ML, deep learning, foundation model), training data volume, inference latency requirements, traffic pattern, and team expertise
2. Use the `awsknowledge` MCP tools (`mcp__plugin_aws-dev-toolkit_awsknowledge__aws___search_documentation`, `mcp__plugin_aws-dev-toolkit_awsknowledge__aws___read_documentation`, `mcp__plugin_aws-dev-toolkit_awsknowledge__aws___recommend`) to verify current SageMaker instance types, limits, pricing, and feature availability
3. Select the appropriate MLOps platform using the decision matrix below
4. Design the training infrastructure (instance selection, distributed strategy, Spot configuration)
5. Design the inference topology (real-time, serverless, batch, async)
6. Configure the ML pipeline (SageMaker Pipelines, Step Functions, or CI/CD integration)
7. Set up experiment tracking (MLflow on SageMaker or SageMaker Experiments)
8. Configure model monitoring (data quality, model quality, bias drift, feature attribution drift)
9. Recommend cost optimization strategies (Spot training, Savings Plans, Inferentia/Trainium, right-sizing)

## Platform Selection Decision Matrix

| Requirement | Recommendation | Why |
|---|---|---|
| End-to-end ML platform, team wants managed infrastructure | SageMaker (full) | Integrated training, tuning, deployment, monitoring, and model registry in one service; eliminates infrastructure management |
| CI/CD for ML with automated retraining and approval workflows | SageMaker Pipelines | Native step types for Processing, Training, Tuning, Transform, Model, Condition, and Callback; integrates with Model Registry for approval gates |
| Team already uses MLflow, needs portability across clouds | MLflow on SageMaker (managed) | Zero-infrastructure MLflow tracking server with automatic SageMaker Model Registry sync; preserves existing MLflow workflows |
| Customizing a foundation model without managing training infra | Bedrock fine-tuning / continued pre-training | No instance selection, no distributed training config, no checkpointing — AWS manages all training infrastructure; pay per training token |
| Kubernetes-native teams with existing EKS clusters | Kubeflow on EKS | Leverages existing K8s expertise and cluster; full control over scheduling, GPU sharing, and custom operators, at the cost of significant operational overhead |
| Simple orchestration for inference-only or lightweight training | Step Functions + Lambda | Event-driven, serverless, pay-per-execution; appropriate when training is infrequent and models fit within Lambda's limits (up to 10 GB memory, 15-minute timeout) |
| Large-scale foundation model training (billions of parameters) | SageMaker HyperPod | Persistent managed clusters with automatic fault detection and repair; checkpointless recovery; supports Slurm and EKS orchestration |

## Instance Selection

### Training Instances

| Instance Family | Accelerator | Use Case | Price-Performance Notes |
|---|---|---|---|
| **ml.trn1 / ml.trn1n** | AWS Trainium | Large model training (LLMs, diffusion) | Up to 50% cheaper than comparable GPU instances for supported architectures; requires Neuron SDK |
| **ml.p5.48xlarge** | 8x NVIDIA H100 | Largest models, highest performance | Most powerful GPU option; use when Trainium does not support the model architecture |
| **ml.p4d.24xlarge** | 8x NVIDIA A100 | Large model training | Previous-gen flagship; still strong for most distributed training |
| **ml.g5.xlarge-48xlarge** | NVIDIA A10G | Medium models, fine-tuning | Good balance of cost and capability for fine-tuning and smaller training jobs |
| **ml.m5.large-24xlarge** | CPU only | Classical ML (XGBoost, sklearn) | No GPU overhead; appropriate for tree-based models and tabular data |

### Inference Instances

| Instance Family | Accelerator | Use Case | Price-Performance Notes |
|---|---|---|---|
| **ml.inf2** | AWS Inferentia2 | LLM and generative AI inference | Up to 4x higher throughput and 10x lower latency vs Inf1; 50%+ cheaper than GPU for supported models |
| **ml.g5** | NVIDIA A10G | General-purpose GPU inference | Broad framework support; use when Inferentia does not support the model |
| **ml.g4dn** | NVIDIA T4 | Cost-effective GPU inference | Previous-gen, but typically the lowest-cost GPU option for small-to-medium models |
| **ml.c7g / ml.c6g** | Graviton (CPU) | CPU inference for classical ML | Best price-performance for models that do not need GPU (XGBoost, sklearn, small NLP) |
| **Serverless** | Auto-managed | Sporadic or unpredictable traffic | No idle cost; 1-6 GB memory; cold start latency of seconds; max 60s processing time |

### Default to Trainium/Inferentia When Possible

Always evaluate ml.trn1 for training and ml.inf2 for inference before selecting GPU instances. Trainium offers up to 50% cost savings for training and Inferentia2 offers 50%+ cost savings for inference on supported model architectures. The AWS Neuron SDK supports PyTorch and TensorFlow natively. Only fall back to GPU instances when the model architecture is not supported by the Neuron compiler (check the Neuron model support matrix) or when the team needs CUDA-specific libraries.
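The fallback rule above can be written down as a small helper. This is only a sketch: the `Workload` fields and the `pick_accelerator_family` function are hypothetical names introduced for illustration, and the returned strings simply echo the instance families listed in the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description; field names are illustrative only."""
    phase: str                  # "training" or "inference"
    neuron_supported: bool      # architecture is listed in the Neuron support matrix
    needs_cuda_libraries: bool  # depends on CUDA-specific libraries


def pick_accelerator_family(workload: Workload) -> str:
    """Prefer Trainium/Inferentia2; fall back to GPU only when required."""
    if workload.phase not in ("training", "inference"):
        raise ValueError(f"unknown phase: {workload.phase!r}")

    # Neuron is usable only if the compiler supports the architecture
    # and the team does not depend on CUDA-specific libraries.
    use_neuron = workload.neuron_supported and not workload.needs_cuda_libraries

    if workload.phase == "training":
        return "ml.trn1" if use_neuron else "GPU (ml.p5, ml.p4d, ml.g5)"
    return "ml.inf2" if use_neuron else "GPU (ml.g5, ml.g4dn)"
```

For example, `pick_accelerator_family(Workload("inference", True, False))` returns `"ml.inf2"`, while the same workload with `needs_cuda_libraries=True` falls back to the GPU families.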

## Inference Deployment Decision Matrix

| Pattern | Latency | Max Payload | Timeout | Cost Model | When to Use |
|---|---|---|---|---|---|
| **Real-time endpoint** | Low (ms) | 25 MB | 60s (8 min streaming) | Per-instance-hour (always running) | Consistent traffic with latency SLAs; use auto-scaling to match demand |
| **Serverless inference** | Medium (cold start) | 4 MB | 60s | Per-request + per-ms compute | Sporadic traffic with idle periods; eliminates idle instance cost entirely |
| **Batch transform** | High (minutes-hours) | 100 MB/record | Days | Per-instance-hour (job duration) | Offline scoring of large datasets; no persistent endpoint needed |
| **Async inference** | Medium-high | 1 GB | 1 hour | Per-instance-hour (scale to 0) | Large payloads or long processing; queue-based with SNS notifications |
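As a rough illustration, the matrix above can be read as a sequence of checks on payload size, processing time, and traffic shape. The function name and parameters below are hypothetical, the limits are copied from the table (verify them against current AWS documentation), and the 8-minute streaming timeout for real-time endpoints is deliberately ignored to keep the sketch small.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def choose_inference_pattern(
    payload_bytes: int,
    processing_seconds: float,
    per_request: bool = True,
    sporadic_traffic: bool = False,
) -> str:
    """Pick a deployment pattern using the limits from the decision matrix."""
    # Offline scoring of a whole dataset needs no persistent endpoint.
    if not per_request:
        return "batch-transform"

    # Serverless: sporadic traffic, payload <= 4 MB, processing <= 60 s.
    if sporadic_traffic and payload_bytes <= 4 * MB and processing_seconds <= 60:
        return "serverless"

    # Real-time: payload <= 25 MB, processing <= 60 s (streaming not modeled).
    if payload_bytes <= 25 * MB and processing_seconds <= 60:
        return "real-time"

    # Async: payload <= 1 GB, processing <= 1 hour, queue-based.
    if payload_bytes <= 1 * GB and processing_seconds <= 3600:
        return "async"

    raise ValueError("request exceeds the limits of every per-request pattern")
```

A 200 MB payload that takes five minutes to process, `choose_inference_pattern(200 * MB, 300)`, lands on `"async"` because it exceeds both the real-time payload and timeout limits.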

### Real-time Endpoint Patterns

- **Single-model endpoint**: One model per endpoint. Simplest. Use for most production deployments.
- **Multi-model endpoint (MME)**: Thousands of models behind one endpoint, loaded on demand from S3. Use when you have many similar models (per-customer, per-region) and cannot justify an endpoint per model. Trade-off: first-request latency while loading a model.
- **Multi-container endpoint**: Up to 15 containers per endpoint, invoked individually or as a serial pipeline. Use for A/B testing different model versions or combining pre/post-processing with inference.
- **Shadow testing**: Route production traffic to both current and candidate models simultaneously. Compare metrics before promoting. Always use shadow testing before replacing a production model because it reveals performance differences under real traffic that offline evaluation cannot capture.
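For the shadow testing pattern, a minimal sketch of an endpoint configuration request might look like the following. It assumes the SageMaker `CreateEndpointConfig` API with its `ProductionVariants` and `ShadowProductionVariants` fields; the helper name, variant names, default instance type, and weights are placeholders chosen for illustration, not recommendations.

```python
def shadow_endpoint_config(
    config_name: str,
    production_model: str,
    shadow_model: str,
    instance_type: str = "ml.g5.xlarge",  # placeholder, benchmark before choosing
    instance_count: int = 1,
) -> dict:
    """Build a request body with one production and one shadow variant."""

    def variant(name: str, model: str) -> dict:
        return {
            "VariantName": name,
            "ModelName": model,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
            "InitialVariantWeight": 1.0,  # placeholder weight
        }

    return {
        "EndpointConfigName": config_name,
        # Current model: serves responses to callers.
        "ProductionVariants": [variant("production", production_model)],
        # Candidate model: receives replicated traffic for comparison only.
        "ShadowProductionVariants": [variant("shadow", shadow_model)],
    }
```

The resulting dictionary is intended to be passed as keyword arguments to `boto3.client("sagemaker").create_endpoint_config(**request)`; the model names must refer to SageMaker models that already exist in the account.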

### Auto-Scaling

Default to target-tracking scaling on `SageMakerVariantInvocationsPerInstance` because it automatically adjusts instance count based on actual request load without requiring manual threshold tuning.

```
Target value: start at 70% of the benchmarked max RPS per instance, times 60 (the metric counts invocations per minute)
Scale-in cooldown: 300 seconds (prevent flapping)
Scale-out cooldown: 60 seconds (respond quickly to load spikes)
```
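The settings above can be turned into a target-tracking configuration with a short helper. This is a sketch under stated assumptions: the function names are hypothetical, the 0.7 safety factor and cooldowns are the starting values from this section, and the target is multiplied by 60 because `SageMakerVariantInvocationsPerInstance` is measured in invocations per minute rather than per second.

```python
def target_tracking_policy(
    max_rps_per_instance: float,
    safety_factor: float = 0.7,
    scale_in_cooldown: int = 300,
    scale_out_cooldown: int = 60,
) -> dict:
    """Build a TargetTrackingScalingPolicyConfiguration dictionary."""
    if max_rps_per_instance <= 0:
        raise ValueError("max_rps_per_instance must be positive (benchmark first)")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")

    # Convert benchmarked requests/second into invocations/minute.
    target = max_rps_per_instance * safety_factor * 60

    return {
        "TargetValue": target,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": scale_in_cooldown,    # seconds, prevents flapping
        "ScaleOutCooldown": scale_out_cooldown,  # seconds, reacts to spikes
    }


def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Resource ID format Application Auto Scaling uses for endpoint variants."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"
```

With the Application Auto Scaling client (`boto3.client("application-autoscaling")`), the variant is first registered via `register_scalable_target` using `ServiceNamespace="sagemaker"` and `ScalableDimension="sagemaker:variant:DesiredInstanceCount"`, and the dictionary is then passed to `put_scaling_policy` as `TargetTrackingScalingPolicyConfiguration` with `PolicyType="TargetTrackingScaling"`.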

Use Inference Recommender before production deployment to benchmark instance types and find the optimal instance/model combination. It runs load tests and reports latency, throughput, and cost per inference, replacing guesswork with data.
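To show how such benchmark output might be compared, here is a small sketch that ranks candidates by cost per 1,000 inferences under a latency ceiling. The `Benchmark` fields and both functions are hypothetical and do not mirror the Inference Recommender response schema; any numbers fed in should come from real load tests.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical summary of one load-test result."""
    instance_type: str
    hourly_price_usd: float
    max_invocations_per_minute: float
    p95_latency_ms: float


def cost_per_1k_inferences(b: Benchmark) -> float:
    """Hourly price divided by hourly throughput, scaled to 1,000 requests."""
    invocations_per_hour = b.max_invocations_per_minute * 60
    return b.hourly_price_usd / invocations_per_hour * 1000


def cheapest_within_latency(
    benchmarks: Iterable[Benchmark], max_p95_latency_ms: float
) -> Benchmark:
    """Return the lowest-cost candidate that meets the latency requirement."""
    eligible = [b for b in benchmarks if b.p95_latency_ms <= max_p95_latency_ms]
    if not eligible:
        raise ValueError("no candidate meets the latency requirement")
    return min(eligible, key=cost_per_1k_inferences)
```

Ranking by cost per inference rather than hourly price matters because a more expensive instance with much higher throughput can still be the cheaper choice per request.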

## SageMaker Pipelines

### Pipeline Step Types

| Step | Purpose | Notes |
|---|---|---|
| **Processing** | Data prep, feature engineering, evaluation | Runs a processing container (sklearn, Spark, custom) |
| **Training** | Model training |
