voice-note-to-midi
Convert voice notes, humming, and melodic audio recordings to quantized MIDI files using ML-based pitch detection and intelligent post-processing
# 🎵 Voice Note to MIDI
Transform your voice memos, humming, and melodic recordings into clean, quantized MIDI files ready for your DAW.
## What It Does
This skill provides a complete audio-to-MIDI conversion pipeline that:
1. **Stem Separation** - Uses HPSS (Harmonic-Percussive Source Separation) to isolate melodic content from drums, noise, and background sounds
2. **ML-Powered Pitch Detection** - Leverages Spotify's Basic Pitch model for accurate fundamental frequency extraction
3. **Key Detection** - Automatically detects the musical key of your recording using Krumhansl-Kessler key profiles
4. **Intelligent Quantization** - Snaps notes to a configurable timing grid with optional key-aware pitch correction
5. **Post-Processing** - Applies octave pruning, overlap-based harmonic removal, and legato note merging for clean output
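
The key detection step can be illustrated with a minimal sketch. It correlates a pitch-class histogram against the published Krumhansl-Kessler major and minor profiles for all 24 keys and keeps the best match. The function names are hypothetical, and weighting the histogram by note count rather than note duration is an assumption; the actual hum2midi code may differ.

```python
from math import sqrt

# Krumhansl-Kessler probe-tone profiles, indexed by semitones above the tonic.
KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
KK_MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def detect_key(midi_pitches):
    """Return (key name, correlation) for the best-matching major or minor key."""
    # Count notes per pitch class (assumption: counts, not durations).
    histogram = [0] * 12
    for pitch in midi_pitches:
        histogram[pitch % 12] += 1

    best = ("", -2.0)
    for tonic in range(12):
        # Rotate so that index 0 is the candidate tonic.
        rotated = histogram[tonic:] + histogram[:tonic]
        for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
            score = pearson(rotated, profile)
            if score > best[1]:
                best = (f"{NOTE_NAMES[tonic]} {mode}", score)
    return best


melody = [60, 62, 64, 65, 67, 69, 71, 72, 67, 64, 60]
print(detect_key(melody)[0])  # C major
```

Because the profiles are rotated through all twelve tonics, transposing the melody transposes the detected key by the same interval.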
### Pipeline Architecture
```
Audio Input (WAV/M4A/MP3)
                    ↓
┌───────────────────────────────────────┐
│  Step 1: Stem Separation (HPSS)       │
│  - Isolate harmonic content           │
│  - Remove drums/percussion            │
│  - Noise gating                       │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│  Step 2: Pitch Detection              │
│  - Basic Pitch ML model (Spotify)     │
│  - Polyphonic note detection          │
│  - Onset/offset estimation            │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│  Step 3: Analysis                     │
│  - Pitch class distribution           │
│  - Key detection                      │
│  - Dominant note identification       │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│  Step 4: Quantization & Cleanup       │
│  - Timing grid snap                   │
│  - Key-aware pitch correction         │
│  - Octave pruning (harmonic removal)  │
│  - Overlap-based pruning              │
│  - Note merging (legato)              │
│  - Velocity normalization             │
└───────────────────────────────────────┘
                    ↓
MIDI Output (Standard MIDI File)
```
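
To give a feel for Step 1, here is a toy sketch of median-filtering HPSS on a tiny magnitude spectrogram held in nested lists. Sustained tones form horizontal ridges (stable across time) and percussive hits form vertical ridges (spread across frequency), so a median filter along each axis separates them. The function names, the window width, and the hard binary mask are assumptions made for illustration; librosa, one of the installed dependencies, ships a production-grade HPSS, and whether hum2midi uses it in this exact form is not confirmed here.

```python
from statistics import median


def median_filter_1d(values, width):
    """Median-filter a list with a centered window that shrinks at the edges."""
    half = width // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def hpss_harmonic(spectrogram, width=3):
    """Keep only the harmonic part of a magnitude spectrogram.

    spectrogram[f][t] is the magnitude of frequency bin f at frame t.
    """
    n_freq, n_time = len(spectrogram), len(spectrogram[0])

    # Harmonic estimate: smooth each frequency bin along time.
    harmonic = [median_filter_1d(row, width) for row in spectrogram]

    # Percussive estimate: smooth each frame along frequency.
    percussive = [
        median_filter_1d([spectrogram[f][t] for f in range(n_freq)], width)
        for t in range(n_time)
    ]

    # Binary mask: keep a cell where the harmonic estimate is at least
    # as large as the percussive estimate, otherwise zero it.
    return [
        [
            spectrogram[f][t] if harmonic[f][t] >= percussive[t][f] else 0.0
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]


# A sustained tone in bin 2 plus a click at frame 2.
spec = [[1.0 if (f == 2 or t == 2) else 0.0 for t in range(5)] for f in range(5)]
for row in hpss_harmonic(spec):
    print(row)
```

In this toy input the click column is zeroed everywhere except where it crosses the sustained tone, which survives intact.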
## Setup
### Prerequisites
- Python 3.11+ (Python 3.14+ recommended)
- FFmpeg (for audio format support)
- pip
### Installation
**Quick Install (Recommended):**
```bash
cd /path/to/voice-note-to-midi
./setup.sh
```
This automated script will:
- Check Python 3.11+ is installed
- Create the `~/melody-pipeline` directory
- Set up the virtual environment
- Install all dependencies (basic-pitch, librosa, music21, etc.)
- Download and configure the hum2midi script
- Add melody-pipeline to your PATH
**Manual Install:**
If you prefer manual setup:
```bash
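# Create the pipeline directory and a dedicated virtual environment
mkdir -p ~/melody-pipeline
cd ~/melody-pipeline
python3 -m venv venv-bp
source venv-bp/bin/activate
# Install the Python dependencies
pip install basic-pitch librosa soundfile mido music21
# Make the script executable (assumes hum2midi is already in ~/melody-pipeline)
chmod +x ~/melody-pipeline/hum2midi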
```
**Add to your PATH (optional):**
```bash
echo 'export PATH="$HOME/melody-pipeline:$PATH"' >> ~/.bashrc
source ~/.bashrc
```
### Verify Installation
```bash
cd ~/melody-pipeline
./hum2midi --help
```
## Usage
### Basic Usage
Convert a voice memo to MIDI:
```bash
./hum2midi my_humming.wav
```
This creates `my_humming.mid` with 16th-note quantization.
### Specify Output File
```bash
./hum2midi input.wav output.mid
```
### Command-Line Options
| Option | Description | Default |
|--------|-------------|---------|
| `--grid <value>` | Quantization grid: `1/4`, `1/8`, `1/16`, `1/32` | `1/16` |
| `--min-note <ms>` | Minimum note duration in milliseconds | `50` |
| `--no-quantize` | Skip quantization (output raw Basic Pitch MIDI) | disabled |
| `--key-aware` | Enable key-aware pitch correction | disabled |
| `--no-analysis` | Skip pitch analysis and key detection | disabled |
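
As a rough illustration of what `--grid` controls, the sketch below converts a grid string into a tick spacing and snaps note boundaries to it. It assumes 960 ticks per quarter note, which is consistent with the `Grid: 240 ticks (1/16)` line in the sample output; the function names and the tie-breaking rule are hypothetical.

```python
from fractions import Fraction

TICKS_PER_QUARTER = 960  # assumption: matches 1/16 -> 240 ticks


def grid_ticks(grid, ticks_per_quarter=TICKS_PER_QUARTER):
    """Convert a grid string such as '1/16' into a tick count."""
    # The grid is a fraction of a whole note, and a whole note is 4 quarters.
    return int(Fraction(grid) * 4 * ticks_per_quarter)


def snap(tick, step):
    """Round a tick position to the nearest multiple of step (ties round up)."""
    return ((tick + step // 2) // step) * step


def quantize(notes, grid="1/16"):
    """Snap (start, end, pitch) notes to the grid, keeping at least one step."""
    step = grid_ticks(grid)
    quantized = []
    for start, end, pitch in notes:
        q_start = snap(start, step)
        # Never let a note collapse to zero length.
        q_end = max(snap(end, step), q_start + step)
        quantized.append((q_start, q_end, pitch))
    return quantized


print(grid_ticks("1/8"))  # 480
print(quantize([(250, 470, 60), (10, 30, 62)]))
# [(240, 480, 60), (0, 240, 62)]
```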
### Usage Examples
#### Quantize to eighth notes
```bash
./hum2midi melody.wav --grid 1/8
```
#### Key-aware quantization (recommended for tonal music)
```bash
./hum2midi song.wav --key-aware
```
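
One plausible reading of key-aware correction is "move each out-of-scale note to the nearest scale tone". The sketch below does that for major and natural minor scales. The function name, the choice of natural minor, and resolving ties downward are all assumptions, not confirmed behavior of `--key-aware`.

```python
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)
MINOR_STEPS = (0, 2, 3, 5, 7, 8, 10)  # natural minor (assumption)


def snap_to_scale(pitch, tonic_pc, mode="major"):
    """Move a MIDI pitch to the nearest scale tone.

    tonic_pc is the tonic's pitch class (0 = C, 7 = G, ...).
    In these scales every chromatic note is at most one semitone
    from a scale tone, so offsets of 0, -1 and +1 are enough.
    """
    steps = MAJOR_STEPS if mode == "major" else MINOR_STEPS
    scale = {(tonic_pc + s) % 12 for s in steps}
    for offset in (0, -1, 1):  # ties resolve downward (assumption)
        if (pitch + offset) % 12 in scale:
            return pitch + offset
    return pitch


# F4 (65) is outside G major and moves down to E4 (64); F#4 (66) stays.
print(snap_to_scale(65, 7), snap_to_scale(66, 7))  # 64 66
```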
#### Require longer minimum notes
```bash
./hum2midi humming.wav --min-note 100
```
#### Skip analysis for faster processing
```bash
./hum2midi quick.wav --no-analysis
```
#### Combine options
```bash
./hum2midi recording.wav output.mid --grid 1/8 --key-aware --min-note 80
```
### Processing MIDI Input
You can also process existing MIDI files through the quantization pipeline:
```bash
./hum2midi input.mid output.mid --grid 1/16 --key-aware
```
This skips the audio processing steps and goes directly to analysis and quantization.
## Sample Output
```
═══════════════════════════════════════════════════════════════
hum2midi - Melody-to-MIDI Pipeline (Basic Pitch Edition)
[Key-Aware Mode Enabled]
═══════════════════════════════════════════════════════════════
Input: my_humming.wav
Output: my_humming.mid
→ Step 1: Stem Separation (HPSS)
Isolating melodic content...
Loaded: 5.23s @ 44100Hz
✓ Melody stem extracted → 5.23s
→ Step 2: Audio-to-MIDI Conversion (Basic Pitch)
Running Spotify's Basic Pitch ML model on melody stem...
✓ Raw MIDI generated (Basic Pitch)
→ Step 3: Pitch Analysis & Key Detection
Notes detected: 42 total, 7 unique
Note range: C3 - G4
Pitch classes: C3, E3, G3, A3, C4, D4, G4
Dominant note: G3 (23.8% of notes)
Detected key: G major
→ Step 4: Quantization & Cleanup
Octave pruning: removed 3 harmonic notes above 67 (median+12)
Overlap pruning: removed 2 harmonic notes at overlapping positions
Note merging: merged 5 staccato chunks into legato notes (gap<=60 ticks)
Grid: 240 ticks (1/16)
Notes: 38 notes
Key: G major
Key-aware: 2 notes corrected to scale
Tempo: 120 BPM
✓ Quantized MIDI saved
═══════════════════════════════════════════════════════════════
✓ Done! Output: my_humming.mid
═══════════════════════════════════════════════════════════════
ANALYSIS SUMMARY
─────────────────────────────────────────────────────────────
Detected Notes: C3, E3, G3, A3, C4, D4, G4
Detected Key: G major
Quantization: Key-aware mode (notes snapped to scale)
MIDI Info: 38 notes, 7 unique pitches, 120 BPM
Pitches: C3, E3, G3, A3, C4, D4, G4
```
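
Two of the cleanup lines in this output, `median+12` and `gap<=60 ticks`, suggest simple rules that can be sketched directly. The code below is an illustrative guess at those rules, with hypothetical function names: octave pruning drops notes more than an octave above the median pitch, and legato merging joins consecutive same-pitch notes separated by a small gap. Comparing each note only with the previous one assumes a mostly monophonic melody.

```python
from statistics import median


def octave_prune(notes, margin=12):
    """Drop (start, end, pitch) notes more than `margin` semitones above the median pitch."""
    if not notes:
        return []
    limit = median(pitch for _, _, pitch in notes) + margin
    return [note for note in notes if note[2] <= limit]


def merge_legato(notes, max_gap=60):
    """Join consecutive same-pitch notes whose gap is at most `max_gap` ticks."""
    merged = []
    for start, end, pitch in sorted(notes):
        last = merged[-1] if merged else None
        if last and last[2] == pitch and start - last[1] <= max_gap:
            # Extend the previous note instead of starting a new one.
            merged[-1] = (last[0], max(last[1], end), pitch)
        else:
            merged.append((start, end, pitch))
    return merged


notes = [(0, 200, 55), (240, 440, 55), (480, 700, 57), (480, 700, 79)]
print(merge_legato(octave_prune(notes)))
# [(0, 440, 55), (480, 700, 57)]
```

With a median pitch of 55 (G3), the cutoff is 67, which matches the threshold printed in the sample output.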
## Notes & Limitations
### Audio Quality Matters
- **Clear, loud melody** produces the best results
- **Background noise** can cause false note detection
- **Reverb and effects** may confuse pitch detection
- **Close-mic'd vocals** work significantly better than room recordings
### Musical Considerations
- **Monophonic sources** work best (single melody line)
- **Polyphonic audio** (chords, multiple instruments) will produce messy results
- **Vibrato and pitch bends** may be quantized to stepped pitches
- **Rapid note passages** may be missed or merged
### Technical Limitations
- **Tempo is fixed** at 120 BPM in output (time positions are preserved, but tempo may need adjustment in your DAW)
- **Note velocities** are normalized but may need manual adjustment
- **Very short notes** (<50ms) may be filtered out by default
- **Extreme pitch ranges** may cause octave detection issues
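
Because the output tempo is fixed at 120 BPM, millisecond thresholds such as `--min-note` map to a fixed number of ticks. The sketch below shows that arithmetic and a duration filter built on it. It assumes 960 ticks per quarter note (consistent with the 240-tick 1/16 grid in the sample output); the function names are hypothetical.

```python
def ms_to_ticks(ms, bpm=120, ticks_per_quarter=960):
    """Convert milliseconds to MIDI ticks at a constant tempo."""
    ms_per_quarter = 60_000 / bpm  # 500 ms per quarter note at 120 BPM
    return round(ms * ticks_per_quarter / ms_per_quarter)


def drop_short_notes(notes, min_ms=50, bpm=120, ticks_per_quarter=960):
    """Remove (start, end, pitch) notes shorter than min_ms."""
    min_ticks = ms_to_ticks(min_ms, bpm, ticks_per_quarter)
    return [note for note in notes if note[1] - note[0] >= min_ticks]


print(ms_to_ticks(50))  # 96
print(drop_short_notes([(0, 95, 60), (100, 196, 62), (300, 780, 64)]))
# [(100, 196, 62), (300, 780, 64)]
```

Under these assumptions the default 50 ms threshold is 96 ticks, a little under half of a 1/16 grid step.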
### Post-Processing Recommendations
After generating MIDI, you may want to:
1. **Import into your DAW** and adjust tempo to match your original recording
2. **Quantize further** if stricter timing is needed
3. **Adjust note velocities** for dynamics
4. **Apply swing/groove** templates if the rigid grid sounds too mechanical
5. **Edit individual notes** that were misdetected (common with fast runs)
### Supported Audio Formats
Input formats supported via FFmpeg:
- WAV, AIFF, FLAC (lossless, best quality)
- MP3, M4A, AAC (lossy compression, acceptable)
- OGG, OPUS (open formats)
- Most other formats FFmpeg supports
## Troubleshooting
### No notes detected
- Check that input file isn't silent or corrupted
- Try lowering the `--min-note` threshold so that short notes are not filtered out
- Verify audio has clear melodic content (not just noise)
### Too many notes / messy output
- Octave pruning and overlap pruning are on by default, so no extra flag is needed for them
- Use `--key-aware` to constrain to musical scale
- Check for background noise in source audio
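
For intuition about what overlap pruning removes, here is a minimal sketch under one assumed rule: a note counts as a spurious harmonic when it overlaps in time with a lower note and sits an exact number of octaves above it. The function names and the octave-only rule are hypothetical; the actual criterion in hum2midi may be broader.

```python
def overlaps(a, b):
    """True when two (start, end, pitch) notes share any time span."""
    return a[0] < b[1] and b[0] < a[1]


def prune_overlapping_harmonics(notes):
    """Drop notes that sit whole octaves above an overlapping lower note."""
    kept = []
    for note in notes:
        is_harmonic = any(
            overlaps(note, other)
            and note[2] > other[2]
            and (note[2] - other[2]) % 12 == 0
            for other in notes
        )
        if not is_harmonic:
            kept.append(note)
    return kept


notes = [(0, 240, 55), (0, 240, 67), (240, 480, 57), (240, 480, 64), (480, 720, 67)]
print(prune_overlapping_harmonics(notes))
# [(0, 240, 55), (240, 480, 57), (240, 480, 64), (480, 720, 67)]
```

Only the G4 that sounds together with G3 is dropped; the later G4 stands alone and is kept.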
### Wrong key detected
- Key det