gemini-tts

Generate speech from text using Google Gemini TTS models via scripts/. Use for text-to-speech, audio generation, voice synthesis, multi-speaker conversations, and creating audio content. Supports multiple voices and streaming. Triggers on "text to speech", "TTS", "generate audio", "voice synthesis", "speak this text".

# Gemini Text-to-Speech

Generate natural-sounding speech from text using Gemini's TTS models through executable scripts with support for multiple voices and multi-speaker conversations.

## When to Use This Skill

Use this skill when you need to:
- Convert text to natural speech
- Create audio for podcasts, audiobooks, or videos
- Generate multi-speaker conversations
- Stream audio for long content
- Choose from multiple voice options
- Create accessible audio content
- Generate voiceovers for presentations
- Batch convert text to audio files

## Available Scripts

### scripts/tts.js
**Purpose**: Convert text to speech using Gemini TTS models

**When to use**:
- Any text-to-speech conversion
- Multi-speaker conversation generation
- Streaming audio for long texts
- Voiceovers for content creation
- Accessible audio generation

**Key parameters**:
| Parameter | Description | Example |
|-----------|-------------|---------|
| `text` | Text to convert (required) | `"Hello, world!"` |
| `--voice`, `-v` | Voice name | `Kore` |
| `--output`, `-o` | Base name for output file | `welcome` |
| `--output-dir` | Output directory for audio | `audio/` |
| `--no-timestamp` | Disable auto timestamp | Flag |
| `--model`, `-m` | TTS model | `gemini-2.5-flash-preview-tts` |
| `--stream`, `-s` | Enable streaming | Flag |
| `--speakers` | Multi-speaker mapping | `"Joe:Kore,Jane:Puck"` |

**Output**: WAV audio file path

## Workflows

### Workflow 1: Basic Text-to-Speech
```bash
node scripts/tts.js "Hello, world! Have a wonderful day."
```
- Best for: Quick audio generation, simple messages
- Voice: `Kore` (default, clear and professional)
- Output: `audio/tts_output_YYYYMMDD_HHMMSS.wav` (auto timestamp)

### Workflow 2: Choose Different Voice
```bash
node scripts/tts.js "Welcome to our podcast about technology trends" --voice Puck --output welcome
```
- Best for: Friendly, conversational content
- Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Output: `audio/welcome_YYYYMMDD_HHMMSS.wav`

### Workflow 3: Multi-Speaker Conversation
```bash
node scripts/tts.js "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
```
- Best for: Dialogues, interviews, role-playing content
- Format: Prefix each line with the speaker's name and a colon (`Joe: ...`); the names must match those in `--speakers`
- Script automatically routes text to appropriate voices
- Output: `audio/conversation_YYYYMMDD_HHMMSS.wav`
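
How `scripts/tts.js` turns the `--speakers` mapping into an API request is not shown in this document. As a rough sketch, assuming the script uses the `@google/genai` SDK named under Common Issues, the mapping could be converted into a `speechConfig` object like the one below. `buildMultiSpeakerConfig` is a hypothetical helper for illustration, not a function taken from the script.

```javascript
// Hypothetical helper: turn { Joe: 'Kore', Jane: 'Puck' } into the
// speechConfig shape that the Gemini API documents for multi-speaker TTS.
function buildMultiSpeakerConfig(speakerToVoice) {
  const speakerVoiceConfigs = Object.entries(speakerToVoice).map(
    ([speaker, voiceName]) => ({
      speaker,
      voiceConfig: { prebuiltVoiceConfig: { voiceName } },
    })
  );
  return {
    responseModalities: ['AUDIO'],
    speechConfig: { multiSpeakerVoiceConfig: { speakerVoiceConfigs } },
  };
}

const config = buildMultiSpeakerConfig({ Joe: 'Kore', Jane: 'Puck' });
console.log(JSON.stringify(config, null, 2));
```

Under the same assumption, the resulting object would be passed as the `config` field of a `generateContent` call, alongside the model name and the conversation text.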

### Workflow 4: Long Content with Streaming
```bash
node scripts/tts.js "This is a very long text that would benefit from streaming..." --stream --output long-form
```
- Best for: Podcasts, audiobooks, long articles
- Streaming: Processes audio in chunks for long texts
- Output: `audio/long-form_YYYYMMDD_HHMMSS.wav`

### Workflow 5: Professional Voiceover
```bash
node scripts/tts.js "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover
```
- Best for: Corporate content, presentations, formal announcements
- Voice: `Charon` (deep, authoritative)
- Use when: Professional, serious tone required

### Workflow 6: Custom Output Directory
```bash
node scripts/tts.js "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1
```
- Best for: Organized project structures
- Directory created automatically if it doesn't exist
- Output: `./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav`

### Workflow 7: Content Creation Pipeline (Text → Audio)
```bash
# 1. Generate script (gemini-text skill)
node skills/gemini-text/scripts/generate.js "Write a 2-minute podcast intro about sustainable energy"

# 2. Generate audio (this skill)
node scripts/tts.js "[Paste generated script]" --voice Fenrir --output podcast-intro

# 3. Use in video or podcast
```
- Best for: Podcasts, audiobooks, video narration
- Combines with: gemini-text for script generation

### Workflow 8: Accessible Content
```bash
node scripts/tts.js "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility
```
- Best for: Web accessibility, screen reader alternatives
- Voice: `Aoede` (melodic, pleasant)
- Use when: Making content accessible to visually impaired users

### Workflow 9: Educational Content
```bash
node scripts/tts.js "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1
```
- Best for: Educational materials, tutorials, e-learning
- Voice: `Zephyr` (light, airy)
- Combines well with: gemini-text for content generation

### Workflow 10: Disable Timestamp
```bash
node scripts/tts.js "Fixed filename." --output my-audio --no-timestamp
```
- Best for: When you want complete control over filename
- Output: `audio/my-audio.wav` (no timestamp)
- Use when: Generating files for specific naming schemes
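
The naming rules used across these workflows (output directory, base name, optional `_YYYYMMDD_HHMMSS` suffix, `.wav` extension) can be sketched as a small function. `buildOutputPath` is a hypothetical helper, and using local time for the timestamp is an assumption; the script may format it differently.

```javascript
// Hypothetical helper mirroring the naming scheme described in the workflows.
// ASSUMPTION: the timestamp uses local time; the real script may use UTC.
function buildOutputPath({
  outputDir = 'audio',
  base = 'tts_output',
  timestamp = true,
  now = new Date(),
} = {}) {
  const pad = (n) => String(n).padStart(2, '0');
  const stamp =
    `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}` +
    `_${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
  const name = timestamp ? `${base}_${stamp}.wav` : `${base}.wav`;
  // Trim trailing slashes so 'audio/' and 'audio' give the same result.
  return `${outputDir.replace(/\/+$/, '')}/${name}`;
}

const fixed = new Date(2025, 0, 15, 9, 5, 30); // 15 Jan 2025, 09:05:30 local time
console.log(buildOutputPath({ base: 'welcome', now: fixed }));
// audio/welcome_20250115_090530.wav
console.log(buildOutputPath({ base: 'my-audio', timestamp: false }));
// audio/my-audio.wav
```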

## Parameters Reference

### Model Selection

| Model | Quality | Speed | Best For |
|-------|---------|-------|----------|
| `gemini-2.5-flash-preview-tts` | Good | Fast | General use, high volume |
| `gemini-2.5-pro-preview-tts` | Higher | Slower | Premium content, voiceovers |

### Voice Selection

| Voice | Characteristics | Best For |
|-------|----------------|----------|
| **Kore** | Clear, professional | Announcements, general purpose (default) |
| **Puck** | Friendly, conversational | Casual content, interviews |
| **Charon** | Deep, authoritative | Corporate, serious content |
| **Fenrir** | Warm, expressive | Storytelling, narratives |
| **Aoede** | Melodic, pleasant | Educational, accessibility |
| **Zephyr** | Light, airy | Gentle content, tutorials |
| **Sulafat** | Neutral, balanced | Documentaries, factual content |

### Audio Format

| Specification | Value |
|--------------|-------|
| Format | WAV (PCM) |
| Sample rate | 24000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
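
These four values are exactly what a WAV header records. As an illustration, and not necessarily how `scripts/tts.js` writes its files, raw PCM samples can be wrapped in a standard 44-byte RIFF/WAVE header as follows. `pcmToWav` is a hypothetical name.

```javascript
// Build a 44-byte RIFF/WAVE header for raw PCM and prepend it to the samples.
function pcmToWav(pcm, { sampleRate = 24000, channels = 1, bitDepth = 16 } = {}) {
  const blockAlign = channels * (bitDepth / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;     // bytes per second of audio
  const header = Buffer.alloc(44);
  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + pcm.length, 4);     // file size minus the first 8 bytes
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16);                 // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);                  // audio format 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

// One second of silence at the documented format: 24000 samples * 2 bytes each.
const wav = pcmToWav(Buffer.alloc(48000));
console.log(wav.length); // 48044
```

At this format the byte rate is 48,000 bytes per second, so a file's duration in seconds is roughly (file size - 44) / 48000.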

### Token Limits

| Limit | Type | Description |
|-------|------|-------------|
| 8,192 | Input | Maximum input text tokens |
| 16,384 | Output | Maximum output audio tokens |

## Output Interpretation

### Audio File
- Format: WAV (compatible with most players)
- Mono channel (single audio track)
- Sample rate: 24000 Hz (well suited to speech; lower than the 44.1/48 kHz commonly used for music and broadcast)
- Can be converted to MP3/AAC if needed

### Multi-Speaker Files
- Single WAV file with multiple voices
- Speakers are heard one after another on the same mono track, not on separate channels
- Use `--speakers` parameter to map speakers to voices

### Streaming Output
- Audio processed in chunks during generation
- Script shows "Streaming audio..." message
- Useful for very long texts or real-time applications
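
The script's streaming code is not reproduced here. The general idea of chunked processing can be sketched as below, assuming each streamed chunk carries a base64-encoded slice of PCM audio (as the `@google/genai` SDK returns inline audio data). `collectPcm` and `fakeStream` are hypothetical names; `fakeStream` stands in for a real API stream so the sketch runs offline.

```javascript
// Hypothetical collector: gather base64-encoded PCM chunks from any async
// iterable (for example, a streamed API response) into one Buffer.
async function collectPcm(chunks) {
  const parts = [];
  for await (const base64 of chunks) {
    // Skip chunks that carry no audio payload.
    if (base64) parts.push(Buffer.from(base64, 'base64'));
  }
  return Buffer.concat(parts);
}

// Stand-in for a real stream: two chunks of two bytes each.
async function* fakeStream() {
  yield Buffer.from([1, 2]).toString('base64');
  yield Buffer.from([3, 4]).toString('base64');
}

collectPcm(fakeStream()).then((pcm) => console.log(pcm.length)); // 4
```

Collecting everything before writing keeps the sketch short; a real-time application would more likely hand each decoded chunk to a player or file stream as it arrives.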

## Common Issues

### "google-genai not installed"
```bash
npm install @google/genai@latest dotenv@latest
```

### "Voice name not found"
- Check voice name spelling
- Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Voice names are case-sensitive
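
Because names are case-sensitive, it can help to normalize user input before calling the script. The following is a minimal sketch, assuming only the voices listed in this document; `resolveVoice` is a hypothetical helper and the API may offer voices not covered here.

```javascript
// Voices listed in this document; the API may offer others.
const VOICES = ['Kore', 'Puck', 'Charon', 'Fenrir', 'Aoede', 'Zephyr', 'Sulafat'];

// Hypothetical helper: map any casing ('kore', 'KORE') to the canonical name,
// or return null when the name is not in the list.
function resolveVoice(name) {
  const wanted = String(name).trim().toLowerCase();
  return VOICES.find((v) => v.toLowerCase() === wanted) ?? null;
}

console.log(resolveVoice('kore'));   // Kore
console.log(resolveVoice('Zepher')); // null
```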

### "No audio generated"
- Check text is not empty
- Verify text doesn't exceed token limit (8,192)
- Try shorter text segments
- Check API quota limits
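
One way to produce shorter segments is to split at sentence boundaries under a size budget. The sketch below is an illustration only: `splitForTts` is a hypothetical helper, and it budgets in characters because the real tokenizer is not available here. The default assumes roughly 4 characters per token for English and keeps 50% headroom under the 8,192-token input limit.

```javascript
// Split text into segments under a character budget, breaking at sentence ends.
// ASSUMPTION: ~4 characters per token is only a rough estimate for English,
// so the default budget (8192 tokens * 4 chars * 0.5) leaves generous headroom.
function splitForTts(text, maxChars = 16384) {
  // A sentence ends with ., ! or ?; the second alternative catches a trailing
  // fragment that has no closing punctuation.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const segments = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      segments.push(current.trim());
      current = '';
    }
    current += sentence; // a single over-long sentence is kept whole
  }
  if (current.trim()) segments.push(current.trim());
  return segments;
}

console.log(splitForTts('One. Two! Three?', 10));
// [ 'One. Two!', 'Three?' ]
```

Each segment can then be passed to `scripts/tts.js` separately, with a distinct `--output` name per segment.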

### "Multi-speaker format error"
- Format: `SpeakerName:VoiceName,Speaker2:Voice2`
- Separate speakers with commas
- Use colon between speaker and voice
- Example: `"Joe:Kore,Jane:Puck"`
- The Gemini API documents multi-speaker TTS for up to two speakers, so keep the mapping to two entries
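
The format rules above can be checked before the script is invoked. This is a minimal sketch of such a check; `parseSpeakers` is a hypothetical function, and the script's own parser may behave differently (for instance, in how it treats surrounding whitespace).

```javascript
// Hypothetical parser for the "Speaker:Voice,Speaker2:Voice2" format.
function parseSpeakers(spec) {
  return spec.split(',').map((pair) => {
    const [speaker, voice, ...rest] = pair.split(':').map((s) => s.trim());
    // Reject entries with a missing name, a missing voice, or extra colons.
    if (!speaker || !voice || rest.length > 0) {
      throw new Error(`Invalid speaker mapping "${pair}": expected Speaker:Voice`);
    }
    return { speaker, voice };
  });
}

console.log(parseSpeakers('Joe:Kore, Jane:Puck'));
// [ { speaker: 'Joe', voice: 'Kore' }, { speaker: 'Jane', voice: 'Puck' } ]
```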

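### "Output file already exists"
- The script overwrites an existing file of the same name
- With the default timestamp suffix, names rarely collide; collisions mostly arise with `--no-timestamp`
- Change the `--output` filename to avoid conflicts
- Use unique names for batch generation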

### Audio quality issues
- Check input text for unusual characters
- Try different voice for better pronunciation
- Consider splitting long text into smaller segments
- Verify audio playback software compatibility

## Best Practices

### Voice Selection
- **Kore**: General purpose, clear articulation
- **Puck**: Conversational, engaging tone
- **Charon**: Professional, authoritative
- **Fenrir**: Warm and expressive, for storytelling
- **Aoede**: Melodic and pleasant, for accessibility and educational content
- **Zephyr**: Light and airy, for tutorials and gentle content
- **Sulafat**: Neutral and balanced, for documentaries and factual content


Source: https://github.com/akrindev/google-studio-skills/tree/main/skills/gemini-tts
