elevenlabs-core-workflow-a
Implement ElevenLabs text-to-speech and voice cloning workflows. Use when building TTS features, cloning voices from audio samples, or implementing the primary ElevenLabs money-path: voice generation. Trigger: "elevenlabs TTS", "text to speech", "voice cloning elevenlabs", "clone a voice", "generate speech", "elevenlabs voice".
# ElevenLabs Core Workflow A — TTS & Voice Cloning
## Overview
This skill covers the primary ElevenLabs workflows: (1) Text-to-Speech with voice settings, (2) Instant Voice Cloning from audio samples, (3) streaming TTS via WebSocket for real-time applications, and (4) basic voice management (list, inspect, update, delete).
## Prerequisites
- Completed `elevenlabs-install-auth` setup
- Valid API key with sufficient character quota
- For voice cloning: audio recording(s) of the target voice (min 30 seconds, clean audio)
## Instructions
### Step 1: Advanced Text-to-Speech
```typescript
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createWriteStream } from "fs";
import { Readable } from "stream";
import { pipeline } from "stream/promises";

// Picks up ELEVENLABS_API_KEY from the environment when no apiKey is passed
const client = new ElevenLabsClient();

async function generateSpeech(
  text: string,
  voiceId: string,
  outputPath: string
) {
  // @elevenlabs/elevenlabs-js takes camelCase request fields
  // (modelId, voiceSettings); the REST API itself uses snake_case.
  const audio = await client.textToSpeech.convert(voiceId, {
    text,
    modelId: "eleven_multilingual_v2",
    voiceSettings: {
      stability: 0.5,        // Lower = more expressive, higher = more consistent
      similarityBoost: 0.75, // How closely to match the original voice
      style: 0.3,            // Amplify the speaker's style (adds latency if > 0)
      speed: 1.0,            // 0.7 to 1.2 range
    },
    // Optional: enforce a language on models that support language enforcement
    // languageCode: "en", // ISO 639-1
  });

  // Bridge the returned web stream to a Node stream and write it to disk
  await pipeline(Readable.fromWeb(audio as any), createWriteStream(outputPath));
  console.log(`Generated: ${outputPath}`);
}

// Generate a baseline clip; rerun with other voice settings to compare
await generateSpeech("Welcome to our platform.", "21m00Tcm4TlvDq8ikWAM", "stable.mp3");
```
### Step 2: Instant Voice Cloning (IVC)
Clone a voice from audio samples using `POST /v1/voices/add`:
```typescript
import { createReadStream } from "fs";

// Reuses the `client` created in Step 1
async function cloneVoice(
  name: string,
  description: string,
  audioFiles: string[] // Paths to audio samples
) {
  // Current SDK releases expose POST /v1/voices/add as voices.ivc.create
  const voice = await client.voices.ivc.create({
    name,
    description,
    files: audioFiles.map((f) => createReadStream(f)),
    // Optional: label the voice for organization
    labels: JSON.stringify({ accent: "american", age: "young" }),
  });
  console.log(`Cloned voice created: ${voice.voiceId}`);
  console.log(`Name: ${name}`);

  // Use the cloned voice immediately
  const audio = await client.textToSpeech.convert(voice.voiceId, {
    text: "This is my cloned voice speaking!",
    modelId: "eleven_multilingual_v2",
    voiceSettings: {
      stability: 0.5,
      similarityBoost: 0.85, // Higher for cloned voices to stay close to original
    },
  });
  return { voiceId: voice.voiceId, audio };
}

// Clone from 1-25 audio samples (more = better quality)
await cloneVoice(
  "My Custom Voice",
  "Professional narrator voice",
  ["sample1.mp3", "sample2.mp3"]
);
```
### Step 3: WebSocket Streaming TTS
For real-time applications (chatbots, live narration), use the WebSocket endpoint:
```typescript
import WebSocket from "ws";

async function streamTTSWebSocket(
  voiceId: string,
  textChunks: string[]
) {
  const modelId = "eleven_flash_v2_5"; // Best for real-time streaming
  const wsUrl = `wss://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream-input?model_id=${modelId}`;
  const ws = new WebSocket(wsUrl);
  const audioChunks: Buffer[] = [];

  return new Promise<Buffer>((resolve, reject) => {
    ws.on("open", () => {
      // Send initial config (BOS - Beginning of Stream)
      ws.send(JSON.stringify({
        text: " ", // Space signals BOS
        voice_settings: {
          stability: 0.5,
          similarity_boost: 0.75,
        },
        xi_api_key: process.env.ELEVENLABS_API_KEY,
        // How many chars to buffer before generating audio
        chunk_length_schedule: [120, 160, 250, 290],
      }));

      // Stream text chunks
      for (const chunk of textChunks) {
        ws.send(JSON.stringify({ text: chunk }));
      }

      // Send EOS (End of Stream)
      ws.send(JSON.stringify({ text: "" }));
    });

    ws.on("message", (data: Buffer) => {
      const msg = JSON.parse(data.toString());
      if (msg.audio) {
        // Base64-encoded audio chunk
        audioChunks.push(Buffer.from(msg.audio, "base64"));
      }
      if (msg.isFinal) {
        ws.close();
      }
    });

    // Resolve with the concatenated audio once the socket closes
    ws.on("close", () => resolve(Buffer.concat(audioChunks)));
    ws.on("error", reject);
  });
}

// Stream from an LLM response in chunks (each chunk ends with a single space)
const chunks = ["Hello, ", "this is ", "streamed ", "speech! "];
const audio = await streamTTSWebSocket("21m00Tcm4TlvDq8ikWAM", chunks);
```
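The message sequence used in Step 3 (one BOS message, one message per text chunk, one EOS message) can be pulled out of the socket handler so it is easier to inspect and test. The following is a minimal sketch under stated assumptions: `buildStreamMessages` and `StreamMessage` are hypothetical names, not part of any ElevenLabs SDK, and the field names and values simply mirror the ones used in Step 3. The trailing-space normalization is an assumption about what the endpoint prefers; verify it against the WebSocket reference.

```typescript
// Hypothetical helper: builds the JSON payloads for the stream-input socket.
interface StreamMessage {
  text: string;
  voice_settings?: { stability: number; similarity_boost: number };
  xi_api_key?: string;
  chunk_length_schedule?: number[];
}

function buildStreamMessages(textChunks: string[], apiKey: string): string[] {
  // BOS: a single space plus the one-time configuration
  const bos: StreamMessage = {
    text: " ",
    voice_settings: { stability: 0.5, similarity_boost: 0.75 },
    xi_api_key: apiKey,
    chunk_length_schedule: [120, 160, 250, 290],
  };

  const body: StreamMessage[] = textChunks
    // An empty string is the EOS signal, so drop empty chunks
    .filter((chunk) => chunk.length > 0)
    // Assumption: each text chunk should end with a single space
    .map((chunk) => ({
      text: chunk.charAt(chunk.length - 1) === " " ? chunk : chunk + " ",
    }));

  // EOS: an empty string closes the stream
  const eos: StreamMessage = { text: "" };

  return [bos].concat(body, [eos]).map((m) => JSON.stringify(m));
}
```

Each returned string can be passed straight to `ws.send(...)` inside the `open` handler.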
### Step 4: Voice Management
```typescript
// List all available voices
async function listVoices() {
  const { voices } = await client.voices.getAll();
  for (const v of voices) {
    console.log(`${v.name} (${v.voiceId}) - ${v.category}`);
    // category: e.g. "premade" | "cloned" | "generated" | "professional"
  }
}

// Get voice settings defaults
async function getVoiceSettings(voiceId: string) {
  const settings = await client.voices.settings.get(voiceId);
  console.log(`Stability: ${settings.stability}`);
  console.log(`Similarity: ${settings.similarityBoost}`);
}

// Update default voice settings
async function updateVoiceSettings(voiceId: string) {
  await client.voices.settings.update(voiceId, {
    stability: 0.6,
    similarityBoost: 0.8,
  });
}

// Delete a cloned voice
async function deleteVoice(voiceId: string) {
  await client.voices.delete(voiceId);
  console.log(`Voice ${voiceId} deleted.`);
}
```
## Voice Cloning Requirements
| Aspect | Requirement |
|--------|-------------|
| Audio length | Minimum 30 seconds total (1+ minute recommended) |
| Audio quality | Clean, no background noise, no music |
| Format | MP3, WAV, M4A, FLAC, OGG |
| Samples | 1-25 files per voice |
| Languages | Works across all supported languages |
| Plan | Available on all paid plans |
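Some of the requirements in the table can be checked locally before uploading anything. The sketch below is a hypothetical pre-flight check (`validateSamplePaths` is an illustrative name, not an SDK function) that covers only the file count and file extension rows. It does not inspect audio duration or background noise, which would need an audio library.

```typescript
// Hypothetical pre-flight check based on the requirements table.
const SUPPORTED_EXTENSIONS = ["mp3", "wav", "m4a", "flac", "ogg"];
const MAX_SAMPLES = 25;

// Returns a list of human-readable problems; an empty list means the paths look OK.
function validateSamplePaths(paths: string[]): string[] {
  const problems: string[] = [];

  if (paths.length < 1 || paths.length > MAX_SAMPLES) {
    problems.push(`expected 1-${MAX_SAMPLES} files, got ${paths.length}`);
  }

  for (const p of paths) {
    const dot = p.lastIndexOf(".");
    const ext = dot >= 0 ? p.slice(dot + 1).toLowerCase() : "";
    if (SUPPORTED_EXTENSIONS.indexOf(ext) === -1) {
      problems.push(`unsupported format: ${p}`);
    }
  }

  return problems;
}
```

Running this before the upload call in Step 2 lets a caller fail fast on an obviously bad sample set instead of waiting for an `invalid_voice_sample` response.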
## Voice Settings Guide
| Setting | Range | Low Value Effect | High Value Effect |
|---------|-------|-----------------|-------------------|
| `stability` | 0-1 | More expressive, varied | Consistent, monotone |
| `similarity_boost` | 0-1 | More creative deviation | Strictly matches voice |
| `style` | 0-1 | Neutral delivery | Exaggerated emotion |
| `speed` | 0.7-1.2 | Slower speech | Faster speech |
**Recommended starting points:**
- Narration: stability=0.5, similarity=0.75, style=0.0
- Conversational: stability=0.4, similarity=0.6, style=0.3
- Cloned voice: stability=0.5, similarity=0.85, style=0.0
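
The starting points above can be kept in one place and clamped to the documented ranges, so a caller cannot send out-of-range values by accident. This is a minimal sketch, not part of the ElevenLabs SDK: `PRESETS`, `settingsFor`, and `clamp` are hypothetical names, and the keys follow the setting names in the table (map them to whatever casing your client expects).

```typescript
interface VoiceSettings {
  stability: number;
  similarity_boost: number;
  style: number;
  speed: number;
}

// Values copied from the recommended starting points
const PRESETS = {
  narration: { stability: 0.5, similarity_boost: 0.75, style: 0.0 },
  conversational: { stability: 0.4, similarity_boost: 0.6, style: 0.3 },
  cloned: { stability: 0.5, similarity_boost: 0.85, style: 0.0 },
};

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Merge a preset with optional overrides, then clamp to the table's ranges
function settingsFor(
  preset: keyof typeof PRESETS,
  overrides: Partial<VoiceSettings> = {}
): VoiceSettings {
  const merged = { ...PRESETS[preset], ...overrides };
  return {
    stability: clamp(merged.stability, 0, 1),
    similarity_boost: clamp(merged.similarity_boost, 0, 1),
    style: clamp(merged.style, 0, 1),
    // Speed defaults to 1.0 when neither preset nor override sets it
    speed: clamp(merged.speed === undefined ? 1.0 : merged.speed, 0.7, 1.2),
  };
}
```

For example, `settingsFor("conversational", { speed: 1.1 })` yields the conversational values with a slightly faster delivery.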
## Error Handling
| Error | HTTP | Cause | Solution |
|-------|------|-------|----------|
| `voice_not_found` | 404 | Invalid voice_id | List voices first: `GET /v1/voices` |
| `text_too_long` | 400 | Over 5,000 chars per request | Split text and use `previous_text`/`next_text` for prosody |
| `quota_exceeded` | 401 | Character limit reached | Check usage, upgrade plan |
| `too_many_concurrent_requests` | 429 | Exceeds plan concurrency | Queue requests; see concurrency limits |
| `invalid_voice_sample` | 400 | Bad audio file for cloning | Use clean audio, supported format, 30s+ |
| WebSocket `model_not_supported` | N/A | eleven_v3 not available for WS | Use `eleven_flash_v2_5` or `eleven_multilingual_v2` |
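
The `text_too_long` row suggests splitting long input and passing `previous_text`/`next_text` so prosody stays continuous across requests. A minimal sketch of that splitting step follows. `splitForTts` and `TextPart` are hypothetical names, the 5,000-character default is taken from the table above, and the 200-character context window is an arbitrary assumption. The helper only prepares request parts; it makes no API calls.

```typescript
interface TextPart {
  text: string;
  previous_text: string; // tail of the preceding part, "" for the first part
  next_text: string;     // head of the following part, "" for the last part
}

function splitForTts(text: string, maxChars = 5000, contextChars = 200): TextPart[] {
  // Sentence-like units: text up to terminal punctuation plus trailing
  // whitespace, or a final run with no terminal punctuation.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [];

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Hard-split any single sentence that is itself longer than the limit
    for (let i = 0; i < sentence.length; i += maxChars) {
      const piece = sentence.slice(i, i + maxChars);
      if (current.length + piece.length > maxChars) {
        chunks.push(current);
        current = "";
      }
      current += piece;
    }
  }
  if (current.length > 0) {
    chunks.push(current);
  }

  // Attach neighbouring context to each chunk
  return chunks.map((chunk, i) => ({
    text: chunk,
    previous_text: i > 0 ? chunks[i - 1].slice(-contextChars) : "",
    next_text: i < chunks.length - 1 ? chunks[i + 1].slice(0, contextChars) : "",
  }));
}
```

Each part can then be sent as its own request, and the resulting audio segments concatenated in order. Joining the `text` fields of all parts reproduces the original input.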
## Resources
- [TTS API Reference](https://elevenlabs.io/docs/api-reference/text-to-speech/convert)
- [Voice Cloning Guide](https://elevenlabs.io/docs/eleven-api/guides/cookbooks/voices/instant-voice-cloning)
- [WebSocket Streaming](https://elevenlabs.io/docs/api-reference/text-to-speech/v-1-text-to-speech-voice-id-stream-input)
- [Voice Settings](https://elevenlabs.io/docs/api-reference/voices/settings/get)
## Next Steps
For speech-to-speech, sound effects, and audio isolation, see `elevenlabs-core-workflow-b`.