voice

Starts a voice conversation with the user via the agent-voice CLI. Use when the user invokes /voice. The user is not looking at the screen; they are listening and speaking. All agent output and input go through voice until the conversation ends.

What this skill does

# Voice Mode

The user wants to have a voice conversation. They are **not looking at the screen**. They are listening to you speak and replying verbally. Treat this like a phone call.

Voice mode is a **session**. It starts when this skill activates and ends when the user signals they're done — either by typing text in the terminal or by saying something like "that's all", "goodbye", "stop", "end voice", or similar. When the conversation ends, say goodbye and stop using voice commands. Resume normal text interaction.
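
As a rough illustration of the end-of-session check, the sketch below matches a transcribed reply against the example phrases listed above. The helper name `is_end_signal` and the exact phrase list are assumptions made for this example; they are not part of the agent-voice CLI, and a real agent would also use judgment for the "or similar" cases.

```bash
# Hypothetical helper: succeeds (exit 0) when a transcribed reply looks like a sign-off.
# The phrase list mirrors the examples in the text; it is an assumption, not CLI behavior.
is_end_signal() {
  # Lowercase the reply so matching is case-insensitive.
  reply=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$reply" in
    # "stop" only counts when it is the whole reply, so that
    # "stop the dev server" is not mistaken for a sign-off.
    *"that's all"*|*goodbye*|*"end voice"*|stop|stop.) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller could run `is_end_signal "$answer"` after each `ask` and, on success, say goodbye and return to text interaction.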

## Activation

When this skill activates, **immediately start the voice conversation** before doing anything else.

- **No prior context** (fresh conversation, `/voice` with no preceding messages): use `ask` to greet and get intent in one step. E.g. `agent-voice ask -m "Hey, what are we working on?"`
- **Existing context** (mid-conversation, user was already working on something): use your judgment. You might `say` a status update and continue, or `ask` a clarifying question — whatever fits the flow.

## Setup

If `agent-voice` fails with "command not found", install it and retry:

```bash
npm install -g agent-voice
```

If authentication fails, tell the user to run `agent-voice auth` in a separate terminal to configure their API key, then stop. Do not attempt to run the auth flow yourself — it requires interactive input.
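
The install-and-retry step can be sketched as a small guard that only installs when the command is missing from `PATH`. The function name `ensure_agent_voice` and the override variables `AGENT_VOICE_BIN` and `AGENT_VOICE_INSTALL` are hypothetical, introduced here so the check can be tried without touching a real installation. Authentication failures are deliberately not handled, since that flow needs the user.

```bash
# Sketch: install agent-voice only when it is not already on PATH.
# AGENT_VOICE_BIN and AGENT_VOICE_INSTALL are hypothetical overrides for illustration.
ensure_agent_voice() {
  bin="${AGENT_VOICE_BIN:-agent-voice}"
  # command -v succeeds when the name resolves to something runnable.
  if command -v "$bin" >/dev/null 2>&1; then
    return 0
  fi
  echo "$bin not found, installing..." >&2
  # Left unquoted on purpose so the default splits into a command plus arguments.
  ${AGENT_VOICE_INSTALL:-npm install -g agent-voice}
}
```

Calling `ensure_agent_voice` once before the first `say` or `ask` is enough; it does nothing when the CLI is already installed.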

## Commands

### Say — inform the user

Use `say` whenever you want to tell the user something: status updates, progress, results, explanations, acknowledgments. This is one-way — the user hears you but does not respond.

```bash
agent-voice say -m "I'm setting up the project now."
```

### Ask — get input from the user

Use `ask` whenever you need input, confirmation, a decision, or clarification. The user hears your question, then speaks their answer. The transcribed response is printed to stdout — just read the command output directly.

Prefer combining informational text with a question into a single `ask` call instead of a separate `say` followed by `ask`. This reduces latency and feels more natural.

```bash
# Instead of:
# agent-voice say -m "I've finished the database schema."
# agent-voice ask -m "Should I move on to the API routes?"
# Do:
agent-voice ask -m "I've finished the database schema. Should I move on to the API routes?"
```

Options:
- `--timeout <seconds>` — how long to wait for the user to speak (default: 120)
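
Because the transcribed reply arrives on stdout, it can be captured with ordinary command substitution. The sketch below wraps that pattern; `ask_user` and `AGENT_VOICE_BIN` are hypothetical names, and only `ask`, `-m`, and `--timeout` come from the text above. What the CLI prints or returns when the timeout expires is not covered here, so the sketch does not assume it.

```bash
# Sketch: ask a question and pass the transcribed reply through on stdout.
# ask_user and AGENT_VOICE_BIN are hypothetical names used for illustration.
ask_user() {
  question="$1"
  timeout="${2:-120}"   # seconds to wait for speech; 120 is the documented default
  "${AGENT_VOICE_BIN:-agent-voice}" ask -m "$question" --timeout "$timeout"
}

# Usage (commented out so that sourcing this snippet does not start a conversation):
# answer=$(ask_user "I've finished the database schema. Should I move on to the API routes?" 60)
# printf 'User said: %s\n' "$answer"
```

A shorter timeout suits quick yes-or-no questions; the default gives the user more time for open-ended ones.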

## Latency

This is a real-time conversation. The user is waiting in silence between each voice interaction. **Minimize the time between hearing the user and responding.** Every second of silence feels long.

- Respond to the user **immediately** after an `ask`. If the next step will take time, acknowledge first and think later.
- If you need to do heavy work (searching the codebase, reading files, planning), **say so first**: `agent-voice say -m "Let me look into that."` Then do the work. Then follow up with results.
- Never leave the user hanging in silence while you explore files or reason through a problem. A quick acknowledgment buys you time.
- Keep `say` messages short. Fewer words mean less text-to-speech (TTS) latency.
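
The "say so first, then do the work" pattern from the list above can be sketched as a wrapper that speaks a short acknowledgment and then runs the slow command. `with_ack` and `AGENT_VOICE_BIN` are hypothetical names for this example; only `say -m` is taken from the text.

```bash
# Sketch: speak a short acknowledgment, then run a slow command.
# with_ack and AGENT_VOICE_BIN are hypothetical names used for illustration.
with_ack() {
  ack="$1"
  shift
  # Fill the silence first so the user knows work has started.
  "${AGENT_VOICE_BIN:-agent-voice}" say -m "$ack"
  # Run the remaining arguments as the actual work; its exit status is returned.
  "$@"
}
```

For example, `with_ack "Let me look into that." grep -rn "TODO" src/` speaks first and then searches, leaving the search output on stdout for the follow-up summary.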

## Rules

1. **Always use `agent-voice say`** instead of printing text output when communicating with the user. The user cannot see your text responses.
2. **Always use `agent-voice ask`** instead of the AskUserQuestion tool. The user is not at the keyboard.
3. **Never use the AskUserQuestion tool.** All user interaction goes through voice.
4. **Keep messages concise and conversational.** Speak like a human on a phone call. No markdown, no bullet lists, no code blocks in speech. Summarize; don't recite.
5. **Say before you do.** Before starting a task, tell the user what you're about to do. When you finish, tell them what you did.
6. **Acknowledge when it helps.** After an `ask`, acknowledge if the next step takes time. Skip the ack if you're acting immediately — just do it.
7. **Ask, don't assume.** When you need a decision, ask. Don't guess and don't skip the question.
8. **Batch your updates.** Don't `say` after every single file edit. Group progress into meaningful checkpoints.
9. **Speak errors plainly.** If something fails, explain what went wrong in plain language. Don't read stack traces aloud.
10. **Confirm before one-way doors.** Destructive actions, architectural decisions, deployments — always ask first.
11. **End gracefully.** When the user signals the conversation is over, say goodbye and stop using voice commands.
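
The "confirm before one-way doors" rule can be sketched as a guard that asks first and only runs the command on a clear yes. The helper `confirm_then_run`, the variable `AGENT_VOICE_BIN`, and the list of accepted affirmatives are all assumptions for illustration. Anything that does not clearly start with an affirmative is treated as a no, which is the safer default for destructive actions.

```bash
# Sketch: ask for spoken confirmation before running a destructive command.
# confirm_then_run, AGENT_VOICE_BIN, and the accepted phrases are hypothetical.
confirm_then_run() {
  question="$1"
  shift
  # Capture the transcribed reply and lowercase it for matching.
  answer=$("${AGENT_VOICE_BIN:-agent-voice}" ask -m "$question" | tr '[:upper:]' '[:lower:]')
  case "$answer" in
    # "yes" must be a whole word, so "yesterday..." does not count as consent.
    yes|yes[!a-z]*|"go ahead"*) "$@" ;;
    *)
      "${AGENT_VOICE_BIN:-agent-voice}" say -m "Okay, I won't do that."
      return 1
      ;;
  esac
}
```

A call such as `confirm_then_run "This will delete the old branch. Should I go ahead?" git branch -D old-feature` keeps the destructive step behind an explicit spoken answer; the branch name here is only a placeholder.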

## Example Flow

```bash
# Greet and get intent
agent-voice ask -m "Hey, what are we working on?"

# Combine status + question — no separate ack needed
agent-voice ask -m "Got it. I've looked at the codebase and there are two approaches. Do you want a simple REST API or a GraphQL layer?"

# ... do work ...

# Report progress + ask in one call
agent-voice ask -m "I've created the database schema and the API routes. Want me to move on to the frontend?"

# ... more work ...

# Finish up
agent-voice ask -m "All done. I've committed everything to a new branch called feat/settings-page. Anything else?"

# User says "no, that's all"
agent-voice say -m "Alright, talk to you later."
# Voice mode ends — resume normal text interaction
```

Files: 1

Size: 5.4 KB

Complexity: 15/100

Category: Image & Video

Source: https://github.com/adriancooney/agent-voice/tree/main/skills/voice
