videoagent-audio-studio

Tired of juggling multiple audio APIs? This skill gives you one-command access to TTS, music generation, sound effects, and voice cloning. Use when you want to generate any audio without managing multiple API keys.

# 🎙️ VideoAgent Audio Studio

**Use when:** The user asks to generate speech, narrate text, create a voice-over, compose music, produce a sound effect, or clone a voice from an audio sample.

VideoAgent Audio Studio is a smart audio dispatcher. It analyzes your request, routes it to the best available model (ElevenLabs for speech, sound effects, and voice cloning; `cassetteai-music` on fal.ai for music), and returns a ready-to-use audio URL.

---

## Quick Reference

| Request Type | Best Model | Latency |
|---|---|---|
| Narrate text / Voice-over | `elevenlabs-tts-v3` | ~3s |
| Low-latency TTS (real-time) | `elevenlabs-tts-turbo` | <1s |
| Background music | `cassetteai-music` | ~15s |
| Sound effect | `elevenlabs-sfx` | ~5s |
| Clone a voice from audio | `elevenlabs-voice-clone` | ~10s |
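
The table above can be read as a lookup from request type to model. Here is a minimal sketch of that dispatch in Python, assuming simple keyword matching. The `ROUTES` list, the keyword choices, and the `route_request` helper are hypothetical illustrations, not the skill's actual routing logic; only the model names come from the table.

```python
from typing import Optional

# Model names come from the Quick Reference table; the keyword lists are
# illustrative assumptions, not the skill's real matching rules.
ROUTES = [
    # (model, trigger keywords); checked in order, first match wins
    ("elevenlabs-voice-clone", ("clone",)),
    ("cassetteai-music", ("music", "soundtrack", "compose")),
    ("elevenlabs-sfx", ("sound effect", "sfx")),
    ("elevenlabs-tts-v3", ("narrate", "read aloud", "voice-over", "voiceover")),
]


def route_request(request: str, low_latency: bool = False) -> Optional[str]:
    """Return the model for a request, or None if no keyword matches."""
    text = request.lower()
    for model, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            # Swap in the turbo model when the caller needs real-time speech.
            if model == "elevenlabs-tts-v3" and low_latency:
                return "elevenlabs-tts-turbo"
            return model
    return None


print(route_request("Narrate this paragraph for my video"))
print(route_request("Compose a soundtrack for the intro"))
```

Checking the more specific intents first (cloning, music, SFX) keeps a request such as "voice-over with background music" from silently falling through to TTS; a real agent would likely rely on the model's own judgment rather than keywords.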

---

## How to Use

### 1. Start the AudioMind server (once per session)

```bash
bash {baseDir}/tools/start_server.sh
```

This starts the ElevenLabs MCP server on port 8124. The skill uses it for all audio generation.
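
Before routing a request, it can help to confirm that something is actually listening on port 8124. The check below is a sketch that assumes the server binds to `127.0.0.1` (the host is not stated above); `server_is_up` is a hypothetical helper, not part of the skill.

```python
import socket


def server_is_up(host: str = "127.0.0.1", port: int = 8124,
                 timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    The host is an assumption; only the port (8124) is given in this README.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

If `server_is_up()` returns `False`, re-run `start_server.sh` before calling any tool. A successful TCP connection only shows that the port is open, not that the MCP server behind it is healthy.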

### 2. Route the request

Analyze the user's request and call the appropriate tool via the MCP server:

**Text-to-Speech (TTS)**

When user asks to "narrate", "read aloud", "say", or "create a voice-over":

```
Use MCP tool: text_to_speech
  text: "<the text to narrate>"
  voice_id: "JBFqnCBsd6RMkjVDRZzb"   # Default: "George" (professional, neutral)
  model_id: "eleven_multilingual_v2"   # Use "eleven_turbo_v2_5" for low latency
```
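
The model choice in the comment above can be made explicit. This sketch builds the argument set for `text_to_speech`, picking the model from a latency flag; `build_tts_args` is a hypothetical helper, and the parameter names simply mirror the block above.

```python
DEFAULT_VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"  # "George", the default voice above


def build_tts_args(text: str, low_latency: bool = False,
                   voice_id: str = DEFAULT_VOICE_ID) -> dict:
    """Assemble text_to_speech arguments (hypothetical helper)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # Turbo trades some quality for speed; multilingual v2 is the default.
    model_id = "eleven_turbo_v2_5" if low_latency else "eleven_multilingual_v2"
    return {"text": text, "voice_id": voice_id, "model_id": model_id}


print(build_tts_args("Welcome to our product launch"))
```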

**Music Generation**

When user asks to "compose", "create background music", or "make a soundtrack":

```
Use model: cassetteai-music  (fal.ai; requires FAL_KEY)
  prompt: "<music description, e.g. 'upbeat lo-fi hip hop, 90 seconds'>"
  duration_seconds: <duration>
```

**Sound Effect (SFX)**

When user asks for a specific sound (e.g., "a door creaking", "rain on a window"):

```
Use MCP tool: text_to_sound_effects
  text: "<sound description>"
  duration_seconds: <1-22>
```
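
Because `duration_seconds` is limited to the 1-22 range shown above, it is worth clamping user-supplied values before making the call. A minimal sketch, assuming out-of-range values should be clamped rather than rejected; `build_sfx_args` is a hypothetical helper.

```python
SFX_MIN_SECONDS = 1.0
SFX_MAX_SECONDS = 22.0


def build_sfx_args(description: str, duration_seconds: float) -> dict:
    """Assemble text_to_sound_effects arguments with a clamped duration."""
    # Clamping (instead of raising) is a design assumption of this sketch.
    clamped = max(SFX_MIN_SECONDS, min(SFX_MAX_SECONDS, float(duration_seconds)))
    return {"text": description, "duration_seconds": clamped}


print(build_sfx_args("rain on a window", 30))
```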

**Voice Cloning**

When user provides an audio sample and wants to clone the voice:

```
Use MCP tool: voice_add
  name: "<voice name>"
  files: ["<audio_file_url>"]
```

---

## Example Conversations

**User:** "Voice this text for me: Welcome to our product launch"

```
→ Route to: text_to_speech
  text: "Welcome to our product launch"
  voice_id: "JBFqnCBsd6RMkjVDRZzb"
  model_id: "eleven_multilingual_v2"
```

> 🎙️ Voiceover done! [Listen here](audio_url)

---

**User:** "Generate 60 seconds of relaxing background music for a podcast"

```
→ Route to: cassetteai-music (fal.ai)
  prompt: "relaxing lo-fi background music for a podcast, gentle piano and soft beats, 60 seconds"
  duration_seconds: 60
```

> 🎵 Background music ready! [Listen here](audio_url)

---

**User:** "Generate a sci-fi style door opening sound effect"

```
→ Route to: text_to_sound_effects
  text: "a futuristic sci-fi door sliding open with a hydraulic hiss"
  duration_seconds: 3
```

> 🔊 Sound effect ready! [Listen here](audio_url)

---

## Setup

### Required

Set `ELEVENLABS_API_KEY` in `~/.openclaw/openclaw.json`:

```json
{
  "skills": {
    "entries": {
      "videoagent-audio-studio": {
        "enabled": true,
        "env": {
          "ELEVENLABS_API_KEY": "your_elevenlabs_key_here"
        }
      }
    }
  }
}
```

Get your key at [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys).
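
To confirm the key landed in the right place, you can read the config back and look up the skill's `env` block. This is a sketch: `load_skill_env` and `missing_keys` are hypothetical helpers, and the lookup assumes the JSON nesting shown above.

```python
import json
from pathlib import Path

SKILL_NAME = "videoagent-audio-studio"


def load_skill_env(config_path: str = "~/.openclaw/openclaw.json") -> dict:
    """Return the skill's `env` mapping, or {} if the entry is missing."""
    raw = Path(config_path).expanduser().read_text(encoding="utf-8")
    config = json.loads(raw)
    entry = config.get("skills", {}).get("entries", {}).get(SKILL_NAME, {})
    return entry.get("env", {})


def missing_keys(env: dict, required=("ELEVENLABS_API_KEY",)) -> list:
    """List required variables that are absent or empty."""
    return [name for name in required if not env.get(name)]
```

Calling `missing_keys(load_skill_env())` returns an empty list when the required key is set.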

### Optional (for fal.ai music generation)

Add `FAL_KEY` to the same `env` object, next to `ELEVENLABS_API_KEY`:

```json
"FAL_KEY": "your_fal_key_here"
```

Get your key at [fal.ai/dashboard/keys](https://fal.ai/dashboard/keys).
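
Which features work depends on which keys are set. The sketch below derives the available features from an `env` mapping, using the key-to-feature pairing from the Environment Variables table in this README; the feature labels and the `available_features` helper are hypothetical.

```python
# Key requirements follow the Environment Variables table in this README;
# the feature labels themselves are illustrative.
REQUIRED_KEY = {
    "tts": "ELEVENLABS_API_KEY",
    "sfx": "ELEVENLABS_API_KEY",
    "voice_clone": "ELEVENLABS_API_KEY",
    "music": "FAL_KEY",
}


def available_features(env: dict) -> list:
    """Return the sorted features whose required key is present and non-empty."""
    return sorted(name for name, key in REQUIRED_KEY.items() if env.get(key))


print(available_features({"ELEVENLABS_API_KEY": "your_elevenlabs_key_here"}))
```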

---

## Self-Hosting the Proxy

By default, `cli.js` connects to a hosted proxy. To get full control, or to serve users in regions where `vercel.app` is blocked, deploy your own instance from the `proxy/` directory.

### Quick Deploy (Vercel)

```bash
cd proxy
npm install
vercel --prod
```

### Environment Variables

Set these in your Vercel project (Dashboard → Settings → Environment Variables):

| Variable | Required For | Where to Get |
|---|---|---|
| `ELEVENLABS_API_KEY` | TTS, SFX, Voice Clone | [elevenlabs.io/app/settings/api-keys](https://elevenlabs.io/app/settings/api-keys) |
| `FAL_KEY` | Music generation | [fal.ai/dashboard/keys](https://fal.ai/dashboard/keys) |
| `VALID_PRO_KEYS` | (Optional) Restrict access | Comma-separated list of allowed client keys |
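
`VALID_PRO_KEYS` is described only as a comma-separated allow-list. The sketch below shows one plausible way such a value could be parsed and checked. It assumes that an unset or empty value means access is unrestricted, which this README does not confirm; `parse_valid_keys` and `is_allowed` are hypothetical names, not the proxy's actual code.

```python
from typing import Optional, Set


def parse_valid_keys(raw: Optional[str]) -> Set[str]:
    """Split a comma-separated value into a set, ignoring blanks and spaces."""
    return {part.strip() for part in (raw or "").split(",") if part.strip()}


def is_allowed(client_key: str, raw_valid_keys: Optional[str]) -> bool:
    """Check a client key against the allow-list."""
    valid = parse_valid_keys(raw_valid_keys)
    if not valid:
        # Assumption: no allow-list configured means access is unrestricted.
        return True
    return client_key in valid


print(is_allowed("key-a", "key-a, key-b"))
```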

### Point cli.js to Your Proxy

```bash
export AUDIOMIND_PROXY_URL="https://your-domain.com/api/audio"
```

Or set it in `~/.openclaw/openclaw.json`:

```json
{
  "skills": {
    "entries": {
      "videoagent-audio-studio": {
        "env": {
          "AUDIOMIND_PROXY_URL": "https://your-domain.com/api/audio"
        }
      }
    }
  }
}
```
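
With two places to set `AUDIOMIND_PROXY_URL`, a client needs a precedence rule. The sketch below assumes the shell variable wins over the config file and that an unset value means the hosted default; neither is confirmed here, and `resolve_proxy_url` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_proxy_url(skill_env: Mapping[str, str],
                      environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Pick AUDIOMIND_PROXY_URL from the shell first, then the skill config.

    Returns None when neither is set, meaning "use the hosted default".
    The precedence order is an assumption, not documented behavior.
    """
    environ = os.environ if environ is None else environ
    return (environ.get("AUDIOMIND_PROXY_URL")
            or skill_env.get("AUDIOMIND_PROXY_URL")
            or None)
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.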

### Custom Domain (Recommended)

If your users are in mainland China, bind a custom domain in Vercel Dashboard → Settings → Domains to avoid DNS issues with `vercel.app`.

---

## Model Reference

| Model ID | Type | Provider | Notes |
|---|---|---|---|
| `eleven_multilingual_v2` | TTS | ElevenLabs | Best quality, supports 29 languages |
| `eleven_turbo_v2_5` | TTS | ElevenLabs | Ultra-low latency, ideal for real-time |
| `eleven_monolingual_v1` | TTS | ElevenLabs | English only, legacy model |
| `cassetteai-music` | Music | fal.ai | Reliable, fast music generation |
| `elevenlabs-sfx` | SFX | ElevenLabs | High-quality sound effects (up to 22s) |
| `elevenlabs-voice-clone` | Clone | ElevenLabs | Clone any voice from a short audio sample |

---

## Changelog

### v3.0.0
- **Simplified routing table**: Removed unstable/offline models from the main reference. The skill now only surfaces models that reliably work.
- **Clearer use-case triggers**: Added "Use when" section so the agent activates this skill at the right moment.
- **Unified setup**: Single `ELEVENLABS_API_KEY` is all you need to get started. `FAL_KEY` is now optional.
- **Removed polling complexity**: Music generation now uses `cassetteai-music` by default, which completes synchronously.

### v2.1.0
- Added async workflow for long-running music generation tasks.
- Added `cassetteai-music` as a stable alternative for music generation.

### v2.0.0
- Migrated to ElevenLabs MCP server architecture.
- Added voice cloning support.

### v1.0.0
- Initial release with TTS, music, and SFX routing.

Files: 10

Size: 32.5 KB

Complexity: 50/100

Category: Image & Video

Source: https://github.com/pexoai/pexo-skills/tree/main/skills/videoagent-audio-studio
