voice-note-to-midi

Convert voice notes, humming, and melodic audio recordings to quantized MIDI files using ML-based pitch detection and intelligent post-processing

# 🎵 Voice Note to MIDI

Transform your voice memos, humming, and melodic recordings into clean, quantized MIDI files ready for your DAW.

## What It Does

This skill provides a complete audio-to-MIDI conversion pipeline with five stages:

1. **Stem Separation** - Uses HPSS (Harmonic-Percussive Source Separation) to isolate melodic content from drums, noise, and background sounds
2. **ML-Powered Pitch Detection** - Leverages Spotify's Basic Pitch model for accurate fundamental frequency extraction
3. **Key Detection** - Automatically detects the musical key of your recording using Krumhansl-Kessler key profiles
4. **Intelligent Quantization** - Snaps notes to a configurable timing grid with optional key-aware pitch correction
5. **Post-Processing** - Applies octave pruning, overlap-based harmonic removal, and legato note merging for clean output

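To make the key detection stage concrete, here is a minimal sketch of matching against Krumhansl-Kessler profiles. It is not taken from hum2midi, whose implementation is not shown here. The function names are hypothetical, and the sketch weights each note by count rather than duration, which is an assumption.

```python
# Krumhansl-Kessler key profiles, indexed by semitones above the tonic.
KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
KK_MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def detect_key(midi_pitches):
    """Return the best-matching key name, e.g. 'G major', or None for empty input."""
    if not midi_pitches:
        return None
    histogram = [0] * 12
    for pitch in midi_pitches:
        histogram[pitch % 12] += 1  # note counts per pitch class (assumption: no duration weighting)

    best_score, best_key = float("-inf"), None
    for tonic in range(12):
        # Rotate the histogram so index 0 is the candidate tonic.
        rotated = histogram[tonic:] + histogram[:tonic]
        for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
            score = pearson(rotated, profile)
            if score > best_score:
                best_score, best_key = score, f"{NOTE_NAMES[tonic]} {mode}"
    return best_key
```

Short or ambiguous melodies can correlate almost equally well with a key and its relative minor or major, so a detected key is best treated as a suggestion.
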
### Pipeline Architecture

```
Audio Input (WAV/M4A/MP3)
    ↓
┌─────────────────────────────────────┐
│ Step 1: Stem Separation (HPSS)     │
│ - Isolate harmonic content          │
│ - Remove drums/percussion           │
│ - Noise gating                      │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Step 2: Pitch Detection             │
│ - Basic Pitch ML model (Spotify)    │
│ - Polyphonic note detection         │
│ - Onset/offset estimation           │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Step 3: Analysis                    │
│ - Pitch class distribution          │
│ - Key detection                     │
│ - Dominant note identification      │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│ Step 4: Quantization & Cleanup      │
│ - Timing grid snap                  │
│ - Key-aware pitch correction        │
│ - Octave pruning (harmonic removal) │
│ - Overlap-based pruning             │
│ - Note merging (legato)             │
│ - Velocity normalization            │
└─────────────────────────────────────┘
    ↓
MIDI Output (Standard MIDI File)
```

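The Step 3 analysis in the diagram above comes down to counting pitches. The sketch below shows one way to name notes and find the dominant note. It assumes the C4 = 60 naming convention, and `note_name` and `summarize` are hypothetical helpers, not part of hum2midi.

```python
from collections import Counter

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_name(midi_pitch):
    """Name a MIDI note number using the C4 = 60 convention (an assumption)."""
    octave = midi_pitch // 12 - 1
    return f"{NOTE_NAMES[midi_pitch % 12]}{octave}"


def summarize(midi_pitches):
    """Return unique pitch names (low to high) and the dominant note with its share."""
    counts = Counter(midi_pitches)
    dominant, hits = counts.most_common(1)[0]
    share = 100.0 * hits / len(midi_pitches)
    return {
        "unique": [note_name(p) for p in sorted(counts)],
        "dominant": note_name(dominant),
        "share_percent": round(share, 1),
    }
```

For example, `note_name(67)` returns `"G4"` under this convention.
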
## Setup

### Prerequisites

- Python 3.11+ (Python 3.14+ recommended)
- FFmpeg (for audio format support)
- pip

### Installation

**Quick Install (Recommended):**

```bash
cd /path/to/voice-note-to-midi
./setup.sh
```

This automated script will:
- Check Python 3.11+ is installed
- Create the `~/melody-pipeline` directory
- Set up the virtual environment
- Install all dependencies (basic-pitch, librosa, music21, etc.)
- Download and configure the hum2midi script
- Add melody-pipeline to your PATH

**Manual Install:**

If you prefer manual setup:

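```bash
# Create the pipeline directory and an isolated virtual environment
mkdir -p ~/melody-pipeline
cd ~/melody-pipeline
python3 -m venv venv-bp
source venv-bp/bin/activate
# Install the Python dependencies
pip install basic-pitch librosa soundfile mido music21
# Make the script executable (assumes hum2midi is already in this directory)
chmod +x ~/melody-pipeline/hum2midi
```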

**Add to your PATH (optional):**

```bash
echo 'export PATH="$HOME/melody-pipeline:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

### Verify Installation

```bash
cd ~/melody-pipeline
./hum2midi --help
```

## Usage

### Basic Usage

Convert a voice memo to MIDI:

```bash
./hum2midi my_humming.wav
```

This creates `my_humming.mid` with 16th-note quantization.

### Specify Output File

```bash
./hum2midi input.wav output.mid
```

### Command-Line Options

| Option | Description | Default |
|--------|-------------|---------|
| `--grid <value>` | Quantization grid: `1/4`, `1/8`, `1/16`, `1/32` | `1/16` |
| `--min-note <ms>` | Minimum note duration in milliseconds | `50` |
| `--no-quantize` | Skip quantization (output raw Basic Pitch MIDI) | disabled |
| `--key-aware` | Enable key-aware pitch correction | disabled |
| `--no-analysis` | Skip pitch analysis and key detection | disabled |

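The `--grid` values are fractions of a whole note. The sketch below shows how such a value could map to MIDI ticks and how a note could be snapped to that grid. It assumes 960 ticks per quarter note, which is what the "240 ticks (1/16)" line in the sample output implies. The helper names, the round-up rule at midpoints, and the one-step minimum length are assumptions, not documented hum2midi behavior.

```python
from fractions import Fraction

PPQ = 960  # assumed ticks per quarter note, implied by "240 ticks (1/16)" in the sample log


def grid_ticks(grid, ppq=PPQ):
    """Convert a grid string such as '1/16' (a fraction of a whole note) to ticks."""
    ticks = Fraction(grid) * 4 * ppq
    if ticks.denominator != 1:
        raise ValueError(f"grid {grid!r} does not divide evenly at {ppq} PPQ")
    return int(ticks)


def snap(tick, grid):
    """Snap a tick position to the nearest grid line; exact midpoints round up."""
    return ((tick + grid // 2) // grid) * grid


def quantize_note(start, end, grid):
    """Snap both ends of a note and keep it at least one grid step long (assumption)."""
    q_start = snap(start, grid)
    q_end = max(snap(end, grid), q_start + grid)
    return q_start, q_end
```

With these assumptions, `grid_ticks("1/8")` is 480 ticks and a note starting at tick 370 on a 240-tick grid moves to tick 480.
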
### Usage Examples

#### Quantize to eighth notes
```bash
./hum2midi melody.wav --grid 1/8
```

#### Key-aware quantization (recommended for tonal music)
```bash
./hum2midi song.wav --key-aware
```

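Key-aware correction can be pictured as moving each out-of-scale pitch to a neighboring scale tone. The sketch below is one possible reading, not hum2midi's actual logic. `snap_to_scale` is a hypothetical helper, it assumes major and natural minor scales only, and its tie rule (prefer the lower neighbor) is an arbitrary choice.

```python
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)   # semitones above the tonic
MINOR_STEPS = (0, 2, 3, 5, 7, 8, 10)   # natural minor (assumption)


def snap_to_scale(midi_pitch, tonic_pc, mode="major"):
    """Move a MIDI pitch to the nearest scale tone, preferring the lower one on ties."""
    steps = MAJOR_STEPS if mode == "major" else MINOR_STEPS
    scale = {(tonic_pc + step) % 12 for step in steps}
    # In these seven-note scales every pitch is at most one semitone from a scale tone.
    for offset in (0, -1, 1):
        if (midi_pitch + offset) % 12 in scale:
            return midi_pitch + offset
    return midi_pitch  # not reached for the two scales above
```

In G major (`tonic_pc=7`), F#4 (66) is left alone while F4 (65) moves down to E4 (64).
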
#### Require longer minimum notes
```bash
./hum2midi humming.wav --min-note 100
```

#### Skip analysis for faster processing
```bash
./hum2midi quick.wav --no-analysis
```

#### Combine options
```bash
./hum2midi recording.wav output.mid --grid 1/8 --key-aware --min-note 80
```

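To convert many recordings with the same options, a small Python wrapper can assemble the command line. This is a sketch that uses only the flags documented in the table above. `build_command` and `convert_folder` are hypothetical names, and the default `./hum2midi` path assumes you run it from `~/melody-pipeline`.

```python
import subprocess
from pathlib import Path


def build_command(src, dst=None, grid="1/16", key_aware=False, min_note_ms=None,
                  hum2midi="./hum2midi"):
    """Build an argv list for one hum2midi run using only the documented flags."""
    cmd = [hum2midi, str(src)]
    if dst is not None:
        cmd.append(str(dst))
    cmd += ["--grid", grid]
    if key_aware:
        cmd.append("--key-aware")
    if min_note_ms is not None:
        cmd += ["--min-note", str(min_note_ms)]
    return cmd


def convert_folder(folder, **options):
    """Run hum2midi on every .wav file in a folder, stopping at the first failure."""
    for src in sorted(Path(folder).glob("*.wav")):
        dst = src.with_suffix(".mid")
        subprocess.run(build_command(src, dst, **options), check=True)
```

Calling `convert_folder("memos", grid="1/8", key_aware=True)` would write a `.mid` file next to each `.wav` file in a `memos` folder.
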
### Processing MIDI Input

You can also process existing MIDI files through the quantization pipeline:

```bash
./hum2midi input.mid output.mid --grid 1/16 --key-aware
```

This skips the audio processing steps and goes directly to analysis and quantization.

## Sample Output

```
═══════════════════════════════════════════════════════════════
  hum2midi - Melody-to-MIDI Pipeline (Basic Pitch Edition)
  [Key-Aware Mode Enabled]
═══════════════════════════════════════════════════════════════

Input:  my_humming.wav
Output: my_humming.mid

→ Step 1: Stem Separation (HPSS)
  Isolating melodic content...
  Loaded: 5.23s @ 44100Hz
  ✓ Melody stem extracted → 5.23s

→ Step 2: Audio-to-MIDI Conversion (Basic Pitch)
  Running Spotify's Basic Pitch ML model on melody stem...
  ✓ Raw MIDI generated (Basic Pitch)

→ Step 3: Pitch Analysis & Key Detection
  Notes detected: 42 total, 7 unique
  Note range: C3 - G4
  Pitch classes: C3, E3, G3, A3, C4, D4, G4
  Dominant note: G3 (23.8% of notes)
  Detected key: G major

→ Step 4: Quantization & Cleanup
  Octave pruning: removed 3 harmonic notes above 67 (median+12)
  Overlap pruning: removed 2 harmonic notes at overlapping positions
  Note merging: merged 5 staccato chunks into legato notes (gap<=60 ticks)
  Grid:   240 ticks (1/16)
  Notes:  38 notes
  Key:    G major
  Key-aware: 2 notes corrected to scale
  Tempo:  120 BPM
  ✓ Quantized MIDI saved

═══════════════════════════════════════════════════════════════
  ✓ Done! Output: my_humming.mid
═══════════════════════════════════════════════════════════════

📊 ANALYSIS SUMMARY
─────────────────────────────────────────────────────────────
  Detected Notes: C3, E3, G3, A3, C4, D4, G4
  Detected Key:   G major
  Quantization:   Key-aware mode (notes snapped to scale)

MIDI Info: 38 notes, 7 unique pitches, 120 BPM
Pitches: C3, E3, G3, A3, C4, D4, G4
```

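The Step 4 lines in the log hint at how two of the cleanup passes behave: octave pruning drops notes above the median pitch plus 12 semitones, and note merging joins chunks separated by at most 60 ticks. The sketch below follows that reading of the log. The tuple layout, the function names, and the restriction to same-pitch neighbors are assumptions.

```python
from statistics import median


def prune_octaves(notes):
    """Drop notes more than an octave above the median pitch.

    Each note is a (start_tick, end_tick, pitch) tuple.
    """
    if not notes:
        return []
    ceiling = median(pitch for _, _, pitch in notes) + 12
    return [note for note in notes if note[2] <= ceiling]


def merge_legato(notes, max_gap=60):
    """Join consecutive same-pitch notes whose gap is at most max_gap ticks."""
    merged = []
    for start, end, pitch in sorted(notes):
        if merged:
            prev_start, prev_end, prev_pitch = merged[-1]
            if pitch == prev_pitch and start - prev_end <= max_gap:
                # Extend the previous note instead of starting a new one.
                merged[-1] = (prev_start, max(prev_end, end), pitch)
                continue
        merged.append((start, end, pitch))
    return merged
```

Running the pruning pass before the merge pass mirrors the order in which the log reports them.
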
## Notes & Limitations

### Audio Quality Matters

- **Clear, loud melody** produces the best results
- **Background noise** can cause false note detection
- **Reverb and effects** may confuse pitch detection
- **Close-mic'd vocals** work significantly better than room recordings

### Musical Considerations

- **Monophonic sources** work best (single melody line)
- **Polyphonic audio** (chords, multiple instruments) will produce messy results
- **Vibrato and pitch bends** may be quantized to stepped pitches
- **Rapid note passages** may be missed or merged

### Technical Limitations

- **Tempo is fixed** at 120 BPM in output (time positions are preserved, but tempo may need adjustment in your DAW)
- **Note velocities** are normalized but may need manual adjustment
- **Very short notes** (<50ms) are filtered out by default; lower `--min-note` to keep them
- **Extreme pitch ranges** may cause octave detection issues

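Because the output tempo is fixed, converting between ticks and clock time is simple arithmetic. The sketch below assumes 960 ticks per quarter note (implied by "240 ticks (1/16)" in the sample output) and a constant 120 BPM. The helper names are hypothetical.

```python
PPQ = 960   # assumed ticks per quarter note, implied by "240 ticks (1/16)" in the sample log
BPM = 120   # the fixed output tempo


def ticks_to_seconds(ticks, bpm=BPM, ppq=PPQ):
    """Convert a tick count to seconds at a constant tempo."""
    return ticks * 60.0 / (bpm * ppq)


def ms_to_ticks(ms, bpm=BPM, ppq=PPQ):
    """Convert milliseconds to the nearest whole tick at a constant tempo."""
    return round(ms / 1000.0 * bpm * ppq / 60.0)


def retempo_ticks(ticks, new_bpm, old_bpm=BPM):
    """Rescale a tick position so it falls at the same clock time under a new tempo."""
    return round(ticks * new_bpm / old_bpm)
```

Under these assumptions one 16th-note grid step lasts 0.125 seconds and the default 50 ms minimum note length corresponds to 96 ticks.
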
### Post-Processing Recommendations

After generating MIDI, you may want to:

1. **Import into your DAW** and adjust tempo to match your original recording
2. **Quantize further** if stricter timing is needed
3. **Adjust note velocities** for dynamics
4. **Apply swing/groove** templates if the rigid grid sounds too mechanical
5. **Edit individual notes** that were misdetected (common with fast runs)

### Supported Audio Formats

Input formats supported via FFmpeg:
- WAV, AIFF, FLAC (lossless, best quality)
- MP3, M4A, AAC (lossy compression, acceptable)
- OGG, OPUS (open formats)
- Most other formats FFmpeg supports

## Troubleshooting

### No notes detected
- Check that the input file isn't silent or corrupted
- Try lowering the `--min-note` threshold so short notes are not filtered out
- Verify audio has clear melodic content (not just noise)

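A quick way to rule out a silent input is to check its peak level before running the pipeline. The sketch below uses only the Python standard library and handles 16-bit PCM WAV files only. `peak_level`, `looks_silent`, and the 0.01 threshold (about -40 dBFS) are illustrative assumptions.

```python
import struct
import wave


def peak_level(path):
    """Return the peak sample level of a 16-bit PCM WAV file, scaled to 0.0-1.0."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        frames = wav.readframes(wav.getnframes())
    if not frames:
        return 0.0
    # WAV samples are little-endian signed 16-bit integers.
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    return max(abs(s) for s in samples) / 32768.0


def looks_silent(path, threshold=0.01):
    """Flag files whose peak never reaches the threshold (about -40 dBFS)."""
    return peak_level(path) < threshold
```

For other formats, converting to WAV with FFmpeg first would let the same check apply.
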
### Too many notes / messy output
- Leave octave pruning and overlap pruning enabled (both are on by default)
- Use `--key-aware` to constrain to musical scale
- Check for background noise in source audio

### Wrong key detected
- Key det

Files: 1

Size: 10.3 KB

Complexity: 18/100

Category: Image & Video

Source: https://github.com/thinkfleetai/thinkfleet-engine/tree/main/skills/voice-note-to-midi
