elevenlabs-performance-tuning

Optimize ElevenLabs TTS latency with model selection, streaming, caching, and audio format tuning. Use when experiencing slow TTS responses, implementing real-time voice features, or optimizing audio generation throughput. Trigger: "elevenlabs performance", "optimize elevenlabs", "elevenlabs latency", "elevenlabs slow", "fast TTS", "reduce elevenlabs latency", "TTS streaming".

# ElevenLabs Performance Tuning

## Overview

Optimize ElevenLabs TTS latency and throughput through model selection, streaming strategies, audio format tuning, and caching. Latency ranges from ~75ms (Flash) to ~500ms (v3) depending on configuration.

## Prerequisites

- ElevenLabs SDK installed (the examples below use `@elevenlabs/elevenlabs-js`)
- Understanding of your latency requirements
- Audio playback infrastructure (browser, mobile, server-side)

## Instructions

### Step 1: Model Selection for Latency

The single biggest performance lever is model choice:

| Model | Avg Latency | Quality | Languages | Use Case |
|-------|-------------|---------|-----------|----------|
| `eleven_flash_v2_5` | ~75ms | Good | 32 | Real-time chat, IVR, gaming |
| `eleven_turbo_v2_5` | ~150ms | Good | 32 | Balanced speed/quality |
| `eleven_multilingual_v2` | ~300ms | High | 29 | Narration, content creation |
| `eleven_v3` | ~500ms | Highest | 70+ | Maximum expressiveness |

```typescript
// Select model based on use case
function selectModel(useCase: "realtime" | "balanced" | "quality" | "max_quality"): string {
  const models = {
    realtime:    "eleven_flash_v2_5",
    balanced:    "eleven_turbo_v2_5",
    quality:     "eleven_multilingual_v2",
    max_quality: "eleven_v3",
  };
  return models[useCase];
}
```
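
To put the table in relative terms, the following sketch computes the percentage latency reduction from switching models. The latency figures are the approximate averages copied from the table above; `avgLatencyMs` and `latencyReductionPct` are illustrative names, not part of any SDK.

```typescript
// Approximate average latencies (ms), copied from the model table above
const avgLatencyMs: Record<string, number> = {
  eleven_flash_v2_5: 75,
  eleven_turbo_v2_5: 150,
  eleven_multilingual_v2: 300,
  eleven_v3: 500,
};

// Percentage latency reduction when switching from one model to another
function latencyReductionPct(from: string, to: string): number {
  const fromMs = avgLatencyMs[from];
  const toMs = avgLatencyMs[to];
  if (fromMs === undefined || toMs === undefined) {
    throw new Error("Unknown model id");
  }
  return Math.round((1 - toMs / fromMs) * 100);
}

console.log(latencyReductionPct("eleven_multilingual_v2", "eleven_flash_v2_5")); // 75
console.log(latencyReductionPct("eleven_v3", "eleven_flash_v2_5"));              // 85
```

These numbers only reflect the table's averages; measured latency in your own region and network may differ.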

### Step 2: Output Format Optimization

Lower-bitrate formats produce fewer bytes per second of audio, so they transfer faster:

| Format | Size/Second | Quality | Best For |
|--------|-------------|---------|----------|
| `mp3_44100_128` | ~16 KB/s | High | Downloads, archival |
| `mp3_22050_32` | ~4 KB/s | Medium | Streaming, mobile |
| `pcm_16000` | ~32 KB/s | Raw | Server-side processing |
| `pcm_44100` | ~88 KB/s | Raw | High-quality processing |
| `ulaw_8000` | ~8 KB/s | Phone | Telephony/IVR |

```typescript
// Use smaller format for streaming, higher quality for downloads
const streamingConfig = {
  output_format: "mp3_22050_32",  // 4 KB/s — fast streaming
  model_id: "eleven_flash_v2_5",   // ~75ms first byte
};

const downloadConfig = {
  output_format: "mp3_44100_128", // 16 KB/s — high quality
  model_id: "eleven_multilingual_v2",
};
```
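
As a rough illustration of what the size difference means in practice, this sketch estimates payload size and transfer time for a clip. The bytes-per-second values are the approximate figures from the table above; `estimateAudioBytes` and `estimateTransferMs` are hypothetical helpers, and the calculation ignores HTTP overhead and generation time.

```typescript
// Approximate bytes per second of audio, taken from the format table above
const bytesPerSecond: Record<string, number> = {
  mp3_44100_128: 16_000,
  mp3_22050_32: 4_000,
  pcm_16000: 32_000,
  pcm_44100: 88_000,
  ulaw_8000: 8_000,
};

// Estimated payload size for a clip of the given duration
function estimateAudioBytes(format: string, seconds: number): number {
  const rate = bytesPerSecond[format];
  if (rate === undefined) throw new Error(`Unknown format: ${format}`);
  return rate * seconds;
}

// Transfer time in ms over a link of the given bandwidth (kilobits per second)
function estimateTransferMs(bytes: number, kilobitsPerSecond: number): number {
  return (bytes * 8) / kilobitsPerSecond;
}

const small = estimateAudioBytes("mp3_22050_32", 10);  // 40000 bytes
const large = estimateAudioBytes("mp3_44100_128", 10); // 160000 bytes
console.log(estimateTransferMs(small, 1000)); // 320
console.log(estimateTransferMs(large, 1000)); // 1280
```

On a 1 Mbps link, a 10-second clip in the smaller format arrives roughly one second sooner under these assumptions.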

### Step 3: HTTP Streaming for Time-to-First-Byte

Use the streaming endpoint to start playback before full generation completes:

```typescript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const client = new ElevenLabsClient();

async function streamToResponse(
  text: string,
  voiceId: string,
  res: import("express").Response
) {
  const startTime = performance.now();

  // The @elevenlabs/elevenlabs-js client takes camelCase request fields
  const stream = await client.textToSpeech.stream(voiceId, {
    text,
    modelId: "eleven_flash_v2_5",
    outputFormat: "mp3_22050_32",
    voiceSettings: {
      stability: 0.5,
      similarityBoost: 0.75,
      style: 0.0,        // style=0 reduces latency
    },
  });

  // Tell the client it is receiving MP3 audio before the first chunk is written
  res.setHeader("Content-Type", "audio/mpeg");

  let firstChunk = true;
  for await (const chunk of stream) {
    if (firstChunk) {
      const ttfb = performance.now() - startTime;
      console.log(`Time to first byte: ${ttfb.toFixed(0)}ms`);
      firstChunk = false;
    }
    // Forward each chunk as soon as it arrives
    res.write(chunk);
  }
  res.end();
}
```
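
To compare configurations, it can help to separate the timing logic from the HTTP handler. The sketch below measures time-to-first-byte and total time for any async iterable of byte chunks. `measureStream`, `StreamTiming`, and `fakeAudio` are illustrative names, and `fakeAudio` is only a stand-in for a real TTS stream.

```typescript
interface StreamTiming {
  ttfbMs: number;  // time until the first chunk arrived (-1 if the stream was empty)
  totalMs: number; // time until the stream ended
  bytes: number;   // total bytes received
}

// Consumes a stream of byte chunks and records when the first chunk arrives.
// The clock is injectable so the function can be tested deterministically.
async function measureStream(
  source: AsyncIterable<Uint8Array>,
  now: () => number = () => performance.now(),
): Promise<StreamTiming> {
  const start = now();
  let ttfbMs = -1;
  let bytes = 0;
  for await (const chunk of source) {
    if (ttfbMs < 0) ttfbMs = now() - start;
    bytes += chunk.byteLength;
  }
  return { ttfbMs, totalMs: now() - start, bytes };
}

// Stand-in for a TTS audio stream (hypothetical, for demonstration only)
async function* fakeAudio(): AsyncGenerator<Uint8Array> {
  yield new Uint8Array(1024);
  yield new Uint8Array(2048);
}

measureStream(fakeAudio()).then((timing) => {
  console.log(`TTFB ${timing.ttfbMs.toFixed(1)}ms, ${timing.bytes} bytes`);
});
```

Passing the stream returned by the SDK instead of `fakeAudio()` should work as long as it is async-iterable in your runtime, which is the same assumption the `for await` loop in the example above makes.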

### Step 4: WebSocket Streaming for Lowest Latency

For interactive applications where text arrives in chunks (e.g., from an LLM):

```typescript
import WebSocket from "ws";

interface WSStreamConfig {
  voiceId: string;
  modelId?: string;
  chunkLengthSchedule?: number[];
}

async function createTTSStream(config: WSStreamConfig) {
  const model = config.modelId || "eleven_flash_v2_5";
  const url = `wss://api.elevenlabs.io/v1/text-to-speech/${config.voiceId}/stream-input?model_id=${model}`;

  const ws = new WebSocket(url);
  const audioChunks: Buffer[] = [];
  let firstTextTime = 0; // when the first real text chunk was sent
  let ttfb = 0;          // first text sent -> first audio received

  await new Promise<void>((resolve, reject) => {
    ws.on("open", resolve);
    ws.on("error", reject);
  });

  // Attach the message handler before sending any text. Audio can arrive
  // while text is still being sent, and messages with no listener are lost.
  const finished = new Promise<Buffer>((resolve) => {
    ws.on("message", (data: Buffer) => {
      const msg = JSON.parse(data.toString());
      if (msg.audio) {
        if (!ttfb && firstTextTime) {
          ttfb = Date.now() - firstTextTime;
        }
        audioChunks.push(Buffer.from(msg.audio, "base64"));
      }
      if (msg.isFinal) {
        console.log(`WebSocket TTFB: ${ttfb}ms`);
        ws.close();
        resolve(Buffer.concat(audioChunks));
      }
    });
  });

  // Initialize stream
  ws.send(JSON.stringify({
    text: " ",
    xi_api_key: process.env.ELEVENLABS_API_KEY,
    voice_settings: { stability: 0.5, similarity_boost: 0.75 },
    // Control buffering: fewer chars = lower latency, more = better prosody
    chunk_length_schedule: config.chunkLengthSchedule || [50, 120, 200],
  }));

  return {
    // Send text chunks as they arrive (e.g., from LLM stream)
    sendText(text: string) {
      if (!firstTextTime) firstTextTime = Date.now();
      ws.send(JSON.stringify({ text }));
    },

    // Signal end of input and wait for the remaining audio
    finish(): Promise<Buffer> {
      ws.send(JSON.stringify({ text: "" })); // EOS signal
      return finished;
    },
  };
}

// Usage with LLM streaming
const stream = await createTTSStream({
  voiceId: "21m00Tcm4TlvDq8ikWAM",
  chunkLengthSchedule: [50, 100, 150],  // Lower thresholds: audio starts sooner, prosody may suffer
});

// As LLM tokens arrive:
stream.sendText("Hello, ");
stream.sendText("how are ");
stream.sendText("you today?");

const audio = await stream.finish();
```
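
LLM tokens often arrive as fragments of words. One optional client-side complement to `chunk_length_schedule` is to buffer tokens and forward only complete sentences. The sketch below is a plain string utility, not part of the ElevenLabs API; `SentenceBuffer` is a hypothetical name, and the boundary rule (sentence punctuation followed by whitespace) is a simplifying assumption that will split after abbreviations such as "e.g. ".

```typescript
// Buffers streamed tokens and releases text one complete sentence at a time.
class SentenceBuffer {
  private pending = "";

  // Adds a token and returns any complete sentences now available.
  // Each returned sentence keeps its trailing whitespace so that
  // consecutive chunks do not run together.
  push(token: string): string[] {
    this.pending += token;
    const sentences: string[] = [];
    const boundary = /[.!?]+\s+/; // non-global: always searches from the start
    let match = boundary.exec(this.pending);
    while (match !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(0, end));
      this.pending = this.pending.slice(end);
      match = boundary.exec(this.pending);
    }
    return sentences;
  }

  // Returns whatever is left once the token stream has ended.
  flush(): string {
    const rest = this.pending;
    this.pending = "";
    return rest;
  }
}

const buffer = new SentenceBuffer();
console.log(buffer.push("Hello, "));       // []
console.log(buffer.push("how are "));      // []
console.log(buffer.push("you today? I ")); // [ 'Hello, how are you today? ' ]
console.log(buffer.push("am fine."));      // []
console.log(buffer.flush());               // I am fine.
```

Each sentence returned by `push` could be passed to `sendText`, with the result of `flush` sent just before `finish()`. Sentence-level batching trades a little latency for steadier prosody, so it is worth measuring against sending raw tokens.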

### Step 5: Audio Caching

Cache generated audio for repeated content (greetings, prompts, errors):

```typescript
import { LRUCache } from "lru-cache";
import crypto from "crypto";

const audioCache = new LRUCache<string, Buffer>({
  max: 500,                    // Max cached audio files
  maxSize: 100 * 1024 * 1024,  // 100MB total
  sizeCalculation: (value) => value.length,
  ttl: 24 * 60 * 60 * 1000,    // 24 hours
});

function cacheKey(text: string, voiceId: string, modelId: string): string {
  return crypto.createHash("sha256")
    .update(`${voiceId}:${modelId}:${text}`)
    .digest("hex");
}

async function cachedTTS(
  text: string,
  voiceId: string,
  modelId = "eleven_multilingual_v2"
): Promise<Buffer> {
  const key = cacheKey(text, voiceId, modelId);

  const cached = audioCache.get(key);
  if (cached) {
    console.log("[Cache HIT]", key.substring(0, 8));
    return cached;
  }

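  // `client` is the ElevenLabsClient instance created in Step 3;
  // the @elevenlabs/elevenlabs-js client takes camelCase request fields
  const stream = await client.textToSpeech.convert(voiceId, {
    text,
    modelId,
  });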

  const chunks: Buffer[] = [];
  for await (const chunk of stream as any) {
    chunks.push(Buffer.from(chunk));
  }
  const audio = Buffer.concat(chunks);

  audioCache.set(key, audio);
  console.log("[Cache MISS]", key.substring(0, 8), `${audio.length} bytes`);
  return audio;
}
```
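
The cache key above covers voice, model, and text. If the same text may also be requested with different output formats or voice settings, those presumably need to be part of the key too, otherwise a cache hit could return audio generated with the wrong configuration. A minimal sketch of such a key follows; `TTSRequest` and `requestCacheKey` are hypothetical names, and the field set is an assumption about what varies in your application.

```typescript
import { createHash } from "node:crypto";

// Everything that affects the generated audio (assumed field set)
interface TTSRequest {
  text: string;
  voiceId: string;
  modelId: string;
  outputFormat?: string;
  voiceSettings?: Record<string, number | boolean>;
}

// Builds a stable SHA-256 key from all request fields.
// Voice settings are sorted by name so property order does not change the key,
// and JSON encoding avoids collisions from delimiter characters inside the text.
function requestCacheKey(req: TTSRequest): string {
  const settings = Object.entries(req.voiceSettings ?? {})
    .sort(([a], [b]) => a.localeCompare(b));
  const canonical = JSON.stringify([
    req.voiceId,
    req.modelId,
    req.outputFormat ?? "default",
    settings,
    req.text,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

const base = { text: "Welcome back!", voiceId: "21m00Tcm4TlvDq8ikWAM", modelId: "eleven_flash_v2_5" };
const keyA = requestCacheKey({ ...base, outputFormat: "mp3_22050_32" });
const keyB = requestCacheKey({ ...base, outputFormat: "mp3_44100_128" });
console.log(keyA === keyB); // false
```

The resulting string can be used with the `LRUCache` instance in place of `cacheKey`.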

### Step 6: Parallel Generation

Generate multiple audio segments concurrently:

```typescript
import PQueue from "p-queue";

const queue = new PQueue({ concurrency: 5 }); // Set to your plan's concurrent-request limit

async function generateChapters(
  chapters: { title: string; text: string }[],
  voiceId: string
): Promise<Buffer[]> {
  const results = await Promise.all(
    chapters.map(chapter =>
      queue.add(async () => {
        const start = performance.now();
        const audio = await cachedTTS(chapter.text, voiceId);
        const duration = performance.now() - start;
        console.log(`${chapter.title}: ${duration.toFixed(0)}ms`);
        return audio;
      })
    )
  );

  return results as Buffer[];
}
```
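
For a rough sense of what concurrency buys, the following back-of-the-envelope sketch estimates total wall-clock time for a batch. It assumes every request takes about the same time and that nothing is served from cache, which real workloads will not match exactly; `estimateBatchMs` is an illustrative helper.

```typescript
// Estimated wall-clock time for a batch, assuming uniform request duration:
// requests run in "waves" of at most `concurrency` at a time.
function estimateBatchMs(
  segments: number,
  concurrency: number,
  perRequestMs: number,
): number {
  return Math.ceil(segments / concurrency) * perRequestMs;
}

console.log(estimateBatchMs(12, 1, 300)); // 3600 (sequential)
console.log(estimateBatchMs(12, 5, 300)); // 900  (three waves of up to 5)
```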

## Performance Optimization Checklist

| Optimization | Latency Impact | Implementation |
|-------------|----------------|----------------|
| Flash model | About -75% vs `eleven_multilingual_v2`, -85% vs `eleven_v3` (per the Step 1 averages) | Change `model_id` |
| Streaming endpoint | -50% time-to-first-byte | Use `.stream()` instead of `.convert()` |
| WebSocket streaming | Best for LLM integration | See Step 4 |
| Smaller output format | -30% transfer time | `mp3_22050_32` vs `mp3_44100_128` |
| Audio caching | -99% for repeated content | LRU cache with SHA-256 keys |
| `style: 0` | -10-20% latency | Remove style exaggeration |

Source: https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/main/plugins/saas-packs/elevenlabs-pack/skills/elevenlabs-performance-tuning
