azure-speech-to-text-rest-py
Azure Speech to Text REST API for short audio (Python). Use for simple speech recognition of audio files up to 60 seconds without the Speech SDK. Triggers: "speech to text REST", "short audio transcription", "speech recognition REST API", "STT REST", "recognize speech REST". DO NOT USE FOR: Long audio (>60 seconds), real-time streaming, batch transcription, custom speech models, speech translation. Use Speech SDK or Batch Transcription API instead.
What this skill does
# Azure Speech to Text REST API for Short Audio
Simple REST API for speech-to-text transcription of short audio files (up to 60 seconds). No SDK required - just HTTP requests.
## Prerequisites
1. **Azure subscription** - [Create one free](https://azure.microsoft.com/free/)
2. **Speech resource** - Create in [Azure Portal](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices)
3. **Get credentials** - After deployment, go to resource > Keys and Endpoint
## Environment Variables
```bash
# Required
AZURE_SPEECH_KEY=<your-speech-resource-key>
AZURE_SPEECH_REGION=<region> # e.g., eastus, westus2, westeurope
# Alternative: Use endpoint directly
AZURE_SPEECH_ENDPOINT=https://<region>.stt.speech.microsoft.com
```
## Installation
```bash
pip install requests
```
## Quick Start
```python
import os
import requests
def transcribe_audio(audio_file_path: str, language: str = "en-US") -> dict:
"""Transcribe short audio file (max 60 seconds) using REST API."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
params = {
"language": language,
"format": "detailed" # or "simple"
}
with open(audio_file_path, "rb") as audio_file:
response = requests.post(url, headers=headers, params=params, data=audio_file)
response.raise_for_status()
return response.json()
# Usage
result = transcribe_audio("audio.wav", "en-US")
print(result["DisplayText"])
```
## Audio Requirements
| Format | Codec | Sample Rate | Notes |
|--------|-------|-------------|-------|
| WAV | PCM | 16 kHz, mono | **Recommended** |
| OGG | OPUS | 16 kHz, mono | Smaller file size |
**Limitations:**
- Maximum 60 seconds of audio
- For pronunciation assessment: maximum 30 seconds
- No partial/interim results (final only)
## Content-Type Headers
```python
# WAV PCM 16kHz
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000"
# OGG OPUS
"Content-Type": "audio/ogg; codecs=opus"
```
## Response Formats
### Simple Format (default)
```python
params = {"language": "en-US", "format": "simple"}
```
```json
{
"RecognitionStatus": "Success",
"DisplayText": "Remind me to buy 5 pencils.",
"Offset": "1236645672289",
"Duration": "1236645672289"
}
```
### Detailed Format
```python
params = {"language": "en-US", "format": "detailed"}
```
```json
{
"RecognitionStatus": "Success",
"Offset": "1236645672289",
"Duration": "1236645672289",
"NBest": [
{
"Confidence": 0.9052885,
"Display": "What's the weather like?",
"ITN": "what's the weather like",
"Lexical": "what's the weather like",
"MaskedITN": "what's the weather like"
}
]
}
```
## Chunked Transfer (Recommended)
For lower latency, stream audio in chunks:
```python
import os
import requests
def transcribe_chunked(audio_file_path: str, language: str = "en-US") -> dict:
"""Stream audio in chunks for lower latency."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
headers = {
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json",
"Transfer-Encoding": "chunked",
"Expect": "100-continue"
}
params = {"language": language, "format": "detailed"}
def generate_chunks(file_path: str, chunk_size: int = 1024):
with open(file_path, "rb") as f:
while chunk := f.read(chunk_size):
yield chunk
response = requests.post(
url,
headers=headers,
params=params,
data=generate_chunks(audio_file_path)
)
response.raise_for_status()
return response.json()
```
## Authentication Options
### Option 1: Subscription Key (Simple)
```python
headers = {
"Ocp-Apim-Subscription-Key": os.environ["AZURE_SPEECH_KEY"]
}
```
### Option 2: Bearer Token
```python
import requests
import os
def get_access_token() -> str:
"""Get access token from the token endpoint."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
token_url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
response = requests.post(
token_url,
headers={
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "application/x-www-form-urlencoded",
"Content-Length": "0"
}
)
response.raise_for_status()
return response.text
# Use token in requests (valid for 10 minutes)
token = get_access_token()
headers = {
"Authorization": f"Bearer {token}",
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
}
```
## Query Parameters
| Parameter | Required | Values | Description |
|-----------|----------|--------|-------------|
| `language` | **Yes** | `en-US`, `de-DE`, etc. | Language of speech |
| `format` | No | `simple`, `detailed` | Result format (default: simple) |
| `profanity` | No | `masked`, `removed`, `raw` | Profanity handling (default: masked) |
## Recognition Status Values
| Status | Description |
|--------|-------------|
| `Success` | Recognition succeeded |
| `NoMatch` | Speech detected but no words matched |
| `InitialSilenceTimeout` | Only silence detected |
| `BabbleTimeout` | Only noise detected |
| `Error` | Internal service error |
## Profanity Handling
```python
# Mask profanity with asterisks (default)
params = {"language": "en-US", "profanity": "masked"}
# Remove profanity entirely
params = {"language": "en-US", "profanity": "removed"}
# Include profanity as-is
params = {"language": "en-US", "profanity": "raw"}
```
## Error Handling
```python
import requests
def transcribe_with_error_handling(audio_path: str, language: str = "en-US") -> dict | None:
"""Transcribe with proper error handling."""
region = os.environ["AZURE_SPEECH_REGION"]
api_key = os.environ["AZURE_SPEECH_KEY"]
url = f"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1"
try:
with open(audio_path, "rb") as audio_file:
response = requests.post(
url,
headers={
"Ocp-Apim-Subscription-Key": api_key,
"Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
"Accept": "application/json"
},
params={"language": language, "format": "detailed"},
data=audio_file
)
if response.status_code == 200:
result = response.json()
if result.get("RecognitionStatus") == "Success":
return result
else:
print(f"Recognition failed: {result.get('RecognitionStatus')}")
return None
elif response.status_code == 400:
print(f"Bad request: Check language code or audio format")
elif response.status_code == 401:
print(f"Unauthorized: Check API key or token")
elif response.status_code == 403:
print(f"Forbidden: Missing authorization header")
else:
print(f"Error {response.status_code}: {response.text}")
return None
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
return None
```
## Async Version
```python
import os
import aiohttp
import asyncio
async def transcribe_async(audio_file_path: sRelated in Image & Video
watch
IncludedWatch a video (URL or local path). Downloads with yt-dlp, extracts auto-scaled frames with ffmpeg, pulls the transcript from captions (or Whisper API fallback), and hands the result to Claude so it can answer questions about what's in the video.
physical-ai-defect-image-generation
IncludedUse when the user wants to orchestrate defect image generation, run associated setup, or handle outputs on OSMO. The Day 0 path handles cold-start with USD-to-ROI, image-edit augmentation, and AnomalyGen to create initial PCBA datasets. The Day 1 path performs inference and labeling on real images. This skill helps with first-time asset setup, creation of finetuning checkpoints, and configuring deployment. Trigger keywords: defect image generation, dig workflow, dig pipeline, defect image detection workflow, aoi pipeline, aoi anomalygen, usd2roi anomalygen, day 0 pcba, day 1 pcba, day 1 real-photo alignment, day 1 manual roi, metal surface anomaly, glass defect, anomalygen finetune, setup_pcb, setup_metal, setup_glass, setup_pretrained, dig setup, dig datasets, dig pretrained checkpoint, dig image-edit endpoint.
accelint-react-best-practices
IncludedReact performance optimization and best practices. ALWAYS use this skill when working with any React code - writing components, hooks, JSX; refactoring; optimizing re-renders, memoization, state management; reviewing for performance; fixing hydration mismatches; debugging infinite re-renders, stale closures, input focus loss, animations restarting; preventing remounting; implementing transitions, lazy initialization, effect dependencies. Even simple React tasks benefit from these patterns. Covers React 19+ (useEffectEvent, Activity, ref props). Triggers - useEffect, useState, useMemo, useCallback, memo, inline components, nested components, components inside components, re-render, performance, hydration, SSR, Next.js, useDeferredValue, combined hooks.
elevenlabs-agents
IncludedBuild conversational AI voice agents with ElevenLabs Platform using React, JavaScript, React Native, or Swift SDKs. Configure agents, tools (client/server/MCP), RAG knowledge bases, multi-voice, and Scribe real-time STT. Use when: building voice chat interfaces, implementing AI phone agents with Twilio, configuring agent workflows or tools, adding RAG knowledge bases, testing with CLI "agents as code", or troubleshooting deprecated @11labs packages, Android audio cutoff, CSP violations, dynamic variables, or WebRTC config. Keywords: ElevenLabs Agents, ElevenLabs voice agents, AI voice agents, conversational AI, @elevenlabs/react, @elevenlabs/client, @elevenlabs/react-native, @elevenlabs/elevenlabs-js, @elevenlabs/agents-cli, elevenlabs SDK, voice AI, TTS, text-to-speech, ASR, speech recognition, turn-taking model, WebRTC voice, WebSocket voice, ElevenLabs conversation, agent system prompt, agent tools, agent knowledge base, RAG voice agents, multi-voice agents, pronunciation dictionary, voice speed control, elevenlabs scribe, @11labs deprecated, Android audio cutoff, CSP violation elevenlabs, dynamic variables elevenlabs, case-sensitive tool names, webhook authentication
humanizer
IncludedHumanize AI-generated text by detecting and removing patterns typical of LLM output. Rewrites text to sound natural, specific, and human. Uses 28 pattern detectors, 560+ AI vocabulary terms across 3 tiers, and statistical analysis (burstiness, type-token ratio, readability) for comprehensive detection. Use when asked to humanize text, de-AI writing, make content sound more natural/human, review writing for AI patterns, score text for AI detection, or improve AI-generated drafts. Covers content, language, style, communication, and filler categories.
generating-mermaid-diagrams
IncludedSalesforce architecture diagrams using Mermaid with ASCII fallback. Use this skill when generating text-based diagrams for Salesforce architecture, OAuth flows, ERDs, integration sequences, or Agentforce structure. TRIGGER when: user says "diagram", "visualize", "ERD", or asks for sequence diagrams, flowcharts, class diagrams, or architecture visualizations in Mermaid. DO NOT TRIGGER when: user wants PNG/SVG image output (use generating-visual-diagrams), or asks about non-Salesforce systems.