
Workers AI


This skill should be used when the user asks about "Workers AI", "AI models", "text generation", "embeddings", "semantic search", "RAG", "Retrieval Augmented Generation", "AI inference", "LLaMA", "Llama", "bge embeddings", "@cf/ models", "AI Gateway", or discusses implementing AI features, choosing AI models, generating embeddings, or building RAG systems on Cloudflare Workers.



# Workers AI

## Purpose

This skill provides comprehensive guidance for using Workers AI, Cloudflare's AI inference platform. It covers available models, inference patterns, embedding generation, RAG (Retrieval Augmented Generation) architectures, AI Gateway integration, and best practices for AI workloads. Use this skill when implementing AI features, selecting models, building RAG systems, or optimizing AI inference on Workers.

## Workers AI Overview

Workers AI provides serverless AI inference at the edge with:
- **Text Generation**: LLMs for chat, completion, summarization
- **Embeddings**: Vector representations for semantic search
- **Image Generation**: Text-to-image models
- **Vision**: Image classification and object detection
- **Speech**: Text-to-speech and automatic speech recognition
- **Translation**: Language translation models

### Key Benefits

- **Edge deployment**: Low latency inference globally
- **No infrastructure**: Serverless, auto-scaling
- **Integrated**: Native integration with Workers, Vectorize, D1
- **Cost-effective**: Pay per inference, no minimum
- **Open-model catalog**: Llama 3.1, Mistral, BAAI (BGE) embeddings

## Project-Specific Model Decisions

Before recommending a model:
1. Check `.claude/cloudflare-expert.local.md` for existing decisions in the "AI Model Decisions" section
2. If found, use the saved decision and mention: "Based on your project's saved configuration..."
3. If not found, describe options with trade-offs and let the user decide
4. After user decides, offer to save the decision to memory with rationale

## Model Information Freshness

**Fetch fresh info via Docs MCP when**:
- User asks for "latest" or "current" models
- Memory decision is older than 90 days
- Starting a new project
- User mentions an unknown model

**Use skill knowledge when**:
- Explaining patterns (RAG workflow, chunking)
- Showing code patterns (API usage)
- Teaching concepts (temperature, top-k)
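The 90-day rule above can be checked mechanically. The following is a minimal sketch, assuming the saved decision records the date it was made as an ISO string; the `isDecisionStale` helper and that date format are hypothetical and not part of any Cloudflare API.

```javascript
// Hypothetical helper: decides whether a saved model decision is old enough
// (more than 90 days) that fresh documentation should be fetched.
const MAX_AGE_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function isDecisionStale(savedOn, now = new Date()) {
  const saved = new Date(savedOn);
  if (Number.isNaN(saved.getTime())) {
    // Unparseable date: err on the side of re-fetching
    return true;
  }
  const ageDays = (now.getTime() - saved.getTime()) / MS_PER_DAY;
  return ageDays > MAX_AGE_DAYS;
}
```

Treating an unreadable date as stale is a deliberate choice here: re-fetching costs one lookup, while trusting an undated decision risks recommending a retired model.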

## Model Categories

### Text Generation Models

**Llama 3.1** (Long context, multilingual):
- `@cf/meta/llama-3.1-8b-instruct` - Chat and instruction following
- Best for: Conversational AI, Q&A, summarization, general text generation
- Context window: 128K tokens
- Multilingual support

**Mistral** (Fast, efficient):
- `@hf/mistral/mistral-7b-instruct-v0.2` - Fast instruction following
- Best for: Quick responses, simpler tasks
- Context window: 32K tokens

**Qwen** (Balanced efficiency):
- `@cf/qwen/qwen1.5-14b-chat-awq` - Quantized for efficiency
- Best for: Balance between speed and quality

See `references/model-selection-framework.md` for decision criteria and `references/workers-ai-models.md` for complete model catalog.

### Embedding Models

**BGE Base** (English, balanced):
- `@cf/baai/bge-base-en-v1.5` - High-quality English embeddings
- Dimensions: 768
- Best for: RAG, semantic search, English content

**BGE Large** (Higher quality, slower):
- `@cf/baai/bge-large-en-v1.5` - Higher quality, more compute
- Dimensions: 1024
- Best for: When quality is critical

**BGE Small** (Fast, compact):
- `@cf/baai/bge-small-en-v1.5` - Faster, smaller model
- Dimensions: 384
- Best for: When speed is critical, large volumes

**BGE M3** (Multilingual):
- `@cf/baai/bge-m3` - Multilingual support
- Best for: Multi-language content
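The dimension counts above matter in practice because a vector index is created with a fixed dimension count, so the embedding model and the index have to agree. Below is a small sketch of a guard for that, using only the figures listed above. The `EMBEDDING_DIMENSIONS` table and `checkIndexDimensions` function are hypothetical names, and BGE M3 is left out because this document does not state its dimensions.

```javascript
// Dimensions as listed in this document; extend the table for other models.
const EMBEDDING_DIMENSIONS = new Map([
  ['@cf/baai/bge-small-en-v1.5', 384],
  ['@cf/baai/bge-base-en-v1.5', 768],
  ['@cf/baai/bge-large-en-v1.5', 1024],
]);

// Returns true when the model's output size matches the index's dimensions.
function checkIndexDimensions(model, indexDimensions) {
  if (!EMBEDDING_DIMENSIONS.has(model)) {
    throw new Error(`Unknown embedding model: ${model}`);
  }
  return EMBEDDING_DIMENSIONS.get(model) === indexDimensions;
}
```

Running a check like this at startup or in a deploy script catches a model swap (for example, base to large) before any insert fails against a mismatched index.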

### Image Generation

**Stable Diffusion**:
- `@cf/stabilityai/stable-diffusion-xl-base-1.0` - Text-to-image
- `@cf/bytedance/stable-diffusion-xl-lightning` - Faster generation
- Best for: Creating images from text descriptions

### Vision Models

**Image Classification**:
- `@cf/microsoft/resnet-50` - Object recognition
- Best for: Classifying image content

### Speech Models

**Automatic Speech Recognition**:
- `@cf/openai/whisper` - Speech-to-text
- Best for: Transcribing audio

### Translation Models

**M2M100**:
- `@cf/meta/m2m100-1.2b` - Multilingual text-to-text translation (not a speech model)
- Best for: Translating text between languages

## Text Generation

### Basic Inference

```javascript
export default {
  async fetch(request, env, ctx) {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'What is Cloudflare Workers?' }
      ]
    });

    // Response.json serializes the body and sets Content-Type: application/json
    return Response.json(response);
  }
};
```

### Streaming Responses

```javascript
const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'user', content: 'Write a story about...' }
  ],
  stream: true
});

return new Response(stream, {
  headers: { 'Content-Type': 'text/event-stream' }
});
```
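On the consuming side, the streamed body arrives as server-sent events. The sketch below shows one way to pull the generated text out of that stream, assuming each event is a `data:` line carrying JSON with a `response` field and that the stream ends with `data: [DONE]`. The `extractTokens` name is hypothetical, and for brevity it expects text containing whole lines; a real client would need to buffer partial lines across network chunks.

```javascript
// Extracts generated text fragments from a block of SSE text.
// Assumed event shape: data: {"response":"..."}  terminated by  data: [DONE]
function extractTokens(sseText) {
  const tokens = [];
  for (const line of sseText.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue; // Skip blank lines and other fields

    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') break; // End-of-stream marker

    try {
      const parsed = JSON.parse(payload);
      if (typeof parsed.response === 'string') {
        tokens.push(parsed.response);
      }
    } catch {
      // Ignore lines that are not valid JSON
    }
  }
  return tokens;
}
```

Joining the returned fragments in order (`extractTokens(text).join('')`) reconstructs the text generated so far.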

### Model Parameters

```javascript
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [/* messages */],
  max_tokens: 512,        // Max tokens to generate
  temperature: 0.7,       // Sampling randomness (higher = more random)
  top_p: 0.9,            // Nucleus sampling
  top_k: 40,             // Top-k sampling
  repetition_penalty: 1.2 // Penalize repetition
});
```

**Parameter guidelines**:
- **temperature**: 0.1-0.3 for factual, 0.7-0.9 for creative
- **max_tokens**: Set based on expected response length
- **top_p/top_k**: Usually leave at defaults unless you need to tune sampling behavior
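One way to apply these guidelines consistently is to keep them in a small preset table rather than repeating numbers at every call site. This is a sketch only: the `SAMPLING_PRESETS` table and `buildInferenceOptions` function are hypothetical, the temperatures are picked from inside the ranges above, and the `max_tokens` values are illustrative placeholders to adjust per use case.

```javascript
// Illustrative presets; temperatures fall inside the guideline ranges above.
const SAMPLING_PRESETS = new Map([
  ['factual', { temperature: 0.2, max_tokens: 256 }],
  ['creative', { temperature: 0.8, max_tokens: 1024 }],
]);

// Builds the options object passed as the second argument to env.AI.run.
// Explicit overrides win over the preset values.
function buildInferenceOptions(messages, task = 'factual', overrides = {}) {
  const preset = SAMPLING_PRESETS.get(task);
  if (!preset) {
    throw new Error(`Unknown task type: ${task}`);
  }
  return { messages, ...preset, ...overrides };
}
```

A call such as `env.AI.run(model, buildInferenceOptions(messages, 'creative', { max_tokens: 64 }))` would then use the creative temperature with a shorter output cap.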

## Embeddings

### Generating Embeddings

```typescript
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Hello world', 'Another sentence']
}) as { data: number[][] };

const vector1 = embeddings.data[0]; // [0.123, -0.456, ...]
const vector2 = embeddings.data[1];
```

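**Important TypeScript note**: Add the `as { data: number[][] }` type assertion when calling the embeddings API so that the result's `data` field is typed. The assertion is TypeScript syntax; omit it in plain JavaScript.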
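Once two vectors are available, semantic closeness is usually measured with cosine similarity. The helper below is a minimal sketch of that calculation; the `cosineSimilarity` name is an illustration, and a vector database would normally compute this for you during a query.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // Similarity is undefined for a zero vector; report 0
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

With the vectors from the example above, `cosineSimilarity(vector1, vector2)` gives a score where values closer to 1 indicate more similar meaning.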

### Batch Processing

```typescript
// Batch multiple texts for efficiency
const texts = documents.map(d => d.content);

// Process in batches of 100 (recommended batch size)
const batchSize = 100;
const allEmbeddings = [];

for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const result = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
    text: batch
  }) as { data: number[][] };

  allEmbeddings.push(...result.data);
}
```
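The same loop can be factored so that it is testable without a deployed Worker. In this sketch, `toBatches` and `embedAll` are hypothetical helper names, and `ai` is assumed to be any object with a `run(model, input)` method that resolves to `{ data }`, such as `env.AI` or a test stub.

```javascript
// Splits an array into consecutive slices of at most `size` items.
function toBatches(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embeds all texts batch by batch, preserving input order.
async function embedAll(ai, model, texts, batchSize = 100) {
  const vectors = [];
  for (const batch of toBatches(texts, batchSize)) {
    const result = await ai.run(model, { text: batch });
    vectors.push(...result.data);
  }
  return vectors;
}
```

Passing the binding in as a parameter keeps the batching logic independent of the Workers runtime, so it can be exercised with a stub in unit tests.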

### Text Chunking for Embeddings

For long documents, split into chunks before embedding:

```typescript
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,      // Characters per chunk
  chunkOverlap: 50     // Overlap between chunks
});

const chunks = await splitter.splitText(longDocument);

// Generate an embedding for each chunk
// (for more than 100 chunks, batch the calls as in Batch Processing)
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: chunks
}) as { data: number[][] };

// Build one vector record per chunk, keeping the text as metadata
const vectors = chunks.map((text, i) => ({
  id: `${docId}-chunk-${i}`,
  values: embeddings.data[i],
  metadata: { text, docId, chunkIndex: i }
}));

// Store all chunks in a single call instead of one insert per chunk
await env.VECTOR_INDEX.insert(vectors);
```

See `references/rag-architecture-patterns.md` for complete RAG implementation patterns.
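Where pulling in a splitter library is not desirable, the chunk size and overlap idea can be approximated with a plain sliding window. Note that this is a simpler technique than recursive splitting: it cuts at fixed character offsets and does not try to respect paragraph or sentence boundaries. The `chunkText` function is a hypothetical illustration.

```javascript
// Fixed-size sliding-window chunker. Each chunk starts
// (chunkSize - chunkOverlap) characters after the previous one.
function chunkText(text, chunkSize = 500, chunkOverlap = 50) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be in the range [0, chunkSize)');
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once this chunk has reached the end of the text
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, which reduces the chance that a sentence relevant to a query is split across two embeddings.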

## RAG (Retrieval Augmented Generation)

### Basic RAG Pattern

```typescript
async function answerQuestion(question, env) {
  // 1. Generate question embedding
  const questionEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
    text: [question]
  }) as { data: number[][] };

  // 2. Find similar documents
  const similar = await env.VECTOR_INDEX.query(questionEmbedding.data[0], {
    topK: 3,
    returnMetadata: true
  });

  // 3. Build context from retrieved documents
  const context = similar.matches
    .map(match => match.metadata.text)
    .join('\n\n');

  // 4. Generate answer with context
  const answer = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [
      {
        role: 'system',
        content: 'Answer the question using only the provided context. If the answer is not in the context, say "I don\'t have enough information."'
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`
      }
    ]
  });

  return {
    answer: answer.response,
    // (remaining fields are truncated in the source)
  };
}
```
