# Archive Acquisition Skill
Specialized patterns and techniques for acquiring media content from Internet Archive (archive.org) and other archival sources. Focuses on bulk downloads, quality selection, format filtering, and collection management.
## Internet Archive Overview
Internet Archive hosts massive collections of audio, video, text, and software. Key characteristics:
- **Open access**: Most content freely downloadable
- **Multiple formats**: Same item often available in FLAC, MP3, OGG, etc.
- **Collections**: Items grouped by uploader, topic, or curator
- **Metadata**: Rich JSON metadata for every item
- **API access**: Full programmatic access via REST API
## Discovery Patterns
### Search API
Find items matching specific criteria:
```bash
# Search for items by keyword and mediatype
search_archive() {
    local query="$1"
    local mediatype="$2"    # audio, movies, texts, etc.
    local rows="${3:-100}"  # Results per page

    # -G sends the parameters as a GET query string; --data-urlencode
    # escapes the spaces and quotes in the search expression
    curl -s -G "https://archive.org/advancedsearch.php" \
        --data-urlencode "q=$query AND mediatype:$mediatype" \
        -d "fl[]=identifier,title,creator,date,format" \
        -d "rows=$rows" \
        -d "output=json" \
        | jq '.response.docs[]'
}

# Example: Search for Grateful Dead concerts
search_archive "grateful dead" "audio" 50
```
### Response Format
```json
{
  "identifier": "gd1977-05-08.sbd.miller.97065.flac16",
  "title": "Grateful Dead Live at Barton Hall on 1977-05-08",
  "creator": "Grateful Dead",
  "date": "1977-05-08",
  "format": ["Flac", "VBR MP3", "Ogg Vorbis", "Metadata"]
}
```
### Collection Browsing
List all items in a collection:
```bash
# Get items from a specific collection
get_collection_items() {
    local collection="$1"
    local rows="${2:-1000}"

    curl -s -G "https://archive.org/advancedsearch.php" \
        --data-urlencode "q=collection:$collection" \
        -d "fl[]=identifier,title,format" \
        -d "rows=$rows" \
        -d "output=json" \
        | jq -r '.response.docs[] | .identifier'
}

# Example: Get up to 500 items from the GratefulDead collection
get_collection_items "GratefulDead" 500
```
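A collection can hold more items than one request returns. The following is a minimal sketch of page-by-page listing, assuming `advancedsearch.php` accepts a 1-based `page` parameter alongside `rows` and reports the total hit count in `.response.numFound`. The helper names `page_count` and `get_collection_items_paged` are hypothetical and not part of this skill.

```bash
# Number of pages needed to cover `total` hits at `rows` hits per page
# (ceiling division in integer arithmetic).
page_count() {
    local total="$1"
    local rows="$2"
    echo $(( (total + rows - 1) / rows ))
}

# Hypothetical helper: list every identifier in a collection, one page at a time.
# Assumes `page` is 1-based and `.response.numFound` holds the total hit count.
get_collection_items_paged() {
    local collection="$1"
    local rows="${2:-500}"

    # A one-row request is enough to learn the total
    local total
    total=$(curl -s -G "https://archive.org/advancedsearch.php" \
        --data-urlencode "q=collection:$collection" \
        -d "fl[]=identifier" -d "rows=1" -d "output=json" \
        | jq -r '.response.numFound')

    local pages page
    pages=$(page_count "$total" "$rows")
    for (( page = 1; page <= pages; page++ )); do
        curl -s -G "https://archive.org/advancedsearch.php" \
            --data-urlencode "q=collection:$collection" \
            -d "fl[]=identifier" \
            -d "rows=$rows" -d "page=$page" -d "output=json" \
            | jq -r '.response.docs[].identifier'
        sleep 2  # Same courtesy delay used by the download helpers
    done
}
```

For example, `page_count 1001 500` prints `3`, so a 1001-item collection needs three requests at 500 rows each.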
### Advanced Query Patterns
```bash
# Find items by date range
search_by_date() {
    local creator="$1"
    local start_date="$2"  # YYYY-MM-DD
    local end_date="$3"    # YYYY-MM-DD

    curl -s -G "https://archive.org/advancedsearch.php" \
        --data-urlencode "q=creator:\"$creator\" AND date:[$start_date TO $end_date]" \
        -d "fl[]=identifier,title,date" \
        -d "rows=100" \
        -d "output=json" \
        | jq '.response.docs[]'
}

# Find items with specific format available
search_by_format() {
    local query="$1"
    local format="$2"  # Format label as it appears in metadata: Flac, "VBR MP3", etc.

    # Quote the label so multi-word formats such as "VBR MP3" match as a phrase
    curl -s -G "https://archive.org/advancedsearch.php" \
        --data-urlencode "q=$query AND format:\"$format\"" \
        -d "fl[]=identifier,title,format" \
        -d "rows=100" \
        -d "output=json" \
        | jq '.response.docs[]'
}

# Example: Find all FLAC recordings of specific artist
search_by_format "pink floyd" "Flac"
```
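The search helpers differ mainly in how they assemble the `q` expression. As a sketch of one way to factor that out, the hypothetical `build_query` below joins `field:value` clauses with `AND`. The quoting rule (wrap values that contain spaces, leave bracketed ranges alone) is an assumption about what the queries in this section need, not a complete implementation of the search syntax.

```bash
# Hypothetical helper: join "field:value" clauses with AND.
# Values containing spaces are wrapped in double quotes so they match as a
# phrase; range expressions such as [1977-01-01 TO 1977-12-31] and values
# that are already quoted are left untouched.
build_query() {
    local query="" clause field value
    for clause in "$@"; do
        field="${clause%%:*}"   # Text before the first colon
        value="${clause#*:}"    # Text after the first colon
        if [[ "$value" == *" "* && "$value" != \[* && "$value" != \"* ]]; then
            value="\"$value\""
        fi
        query+="${query:+ AND }$field:$value"
    done
    echo "$query"
}
```

Calling `build_query "creator:Grateful Dead" "mediatype:audio"` prints `creator:"Grateful Dead" AND mediatype:audio`, which can be passed to curl as `--data-urlencode "q=$(build_query ...)"`.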
## Item Metadata Retrieval
### Get Item Details
```bash
# Retrieve full metadata for an item
get_item_metadata() {
    local identifier="$1"

    curl -s "https://archive.org/metadata/$identifier" \
        | jq '.'
}

# Extract specific metadata fields
get_item_files() {
    local identifier="$1"

    curl -s "https://archive.org/metadata/$identifier" \
        | jq -r '.files[] | select(.format != "Metadata") | "\(.name)\t\(.format)\t\(.size)"'
}

# Example output:
# gd77-05-08d1t01.flac Flac 45123456
# gd77-05-08d1t01.mp3 VBR MP3 12345678
```
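The tab-separated listing can be used to estimate how much disk space each format needs before anything is downloaded. This is a sketch only: `summarize_sizes` is a hypothetical helper, and it assumes its input is the three-column `name<TAB>format<TAB>size` output of `get_item_files`, with sizes in bytes.

```bash
# Hypothetical helper: file count and total size per format.
# Reads "name<TAB>format<TAB>size" lines on stdin, as printed by get_item_files.
# Rows whose size is not a plain integer (for example "null") are skipped.
summarize_sizes() {
    awk -F '\t' '
        $3 ~ /^[0-9]+$/ { bytes[$2] += $3; count[$2]++ }
        END {
            for (fmt in bytes)
                printf "%s\t%d\t%.1f MiB\n", fmt, count[fmt], bytes[fmt] / 1048576
        }
    ' | sort
}
```

Typical use would be `get_item_files "$identifier" | summarize_sizes`, which prints one line per format with the number of files and their combined size.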
### Filter by Format
```bash
# Get only FLAC files from an item
get_flac_files() {
    local identifier="$1"

    curl -s "https://archive.org/metadata/$identifier" \
        | jq -r '.files[] | select(.format == "Flac") | .name'
}

# Get best quality audio files (prefer FLAC, fall back to MP3)
get_best_audio_files() {
    local identifier="$1"

    # Fetch the metadata once and reuse it for both filters
    local metadata
    metadata=$(curl -s "https://archive.org/metadata/$identifier")

    # Try FLAC first
    local flac_files
    flac_files=$(jq -r '.files[] | select(.format == "Flac") | .name' <<< "$metadata")

    if [[ -n "$flac_files" ]]; then
        echo "$flac_files"
        return 0
    fi

    # Fall back to VBR or 320 kbps MP3
    jq -r '.files[] | select(.format == "VBR MP3" or .format == "320Kbps MP3") | .name' <<< "$metadata"
}
```
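The FLAC-then-MP3 fallback generalizes to an ordered preference list. A minimal sketch, assuming the available format labels arrive one per line on stdin: `pick_best_format` is a hypothetical helper, and the order of preferences is whatever the caller passes, not a ranking defined by Internet Archive.

```bash
# Hypothetical helper: print the first preferred format that is actually available.
# Usage: <available formats, one per line> | pick_best_format PREF1 PREF2 ...
# Returns 1 (and prints nothing) when none of the preferences is available.
pick_best_format() {
    local available pref
    available=$(cat)
    for pref in "$@"; do
        # -x matches whole lines only, -F treats the label as a literal string
        if grep -qxF -- "$pref" <<< "$available"; then
            echo "$pref"
            return 0
        fi
    done
    return 1
}
```

The labels for one item can be produced with `curl -s "https://archive.org/metadata/$identifier" | jq -r '.files[].format' | sort -u` and piped into `pick_best_format "Flac" "VBR MP3" "Ogg Vorbis"`.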
## Download Patterns
### Single Item Download
```bash
# Download all files from a single item
download_item() {
    local identifier="$1"
    local output_dir="$2"
    local format_filter="$3"  # Optional: exact format label, e.g. "Flac" or "VBR MP3"

    mkdir -p "$output_dir"

    # Get file list
    local files
    if [[ -n "$format_filter" ]]; then
        # Pass the filter via --arg so quotes in it cannot break the jq program
        files=$(curl -s "https://archive.org/metadata/$identifier" \
            | jq -r --arg fmt "$format_filter" '.files[] | select(.format == $fmt) | .name')
    else
        files=$(curl -s "https://archive.org/metadata/$identifier" \
            | jq -r '.files[] | select(.format != "Metadata") | .name')
    fi

    # Download each file
    while IFS= read -r file; do
        [[ -n "$file" ]] || continue  # Skip the empty line left when nothing matched
        echo "Downloading: $file"
        wget -q --show-progress \
            "https://archive.org/download/$identifier/$file" \
            -P "$output_dir"
    done <<< "$files"
}

# Example: Download all FLAC files from an item
download_item "gd1977-05-08.sbd.miller.97065.flac16" "/mnt/archive/grateful-dead/1977-05-08" "Flac"
```
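Re-running a download should not fetch files that are already complete. One cheap way to decide, sketched here, is to compare the local file size with the `size` field from the item metadata. `needs_download` is a hypothetical helper, and a matching size is only a heuristic: it does not prove the content is intact.

```bash
# Hypothetical helper: succeed (exit 0) when a file still has to be downloaded.
# $1 = local path, $2 = expected size in bytes (the `size` field from item metadata).
needs_download() {
    local path="$1"
    local expected_size="$2"

    # Missing file: download it
    [[ -f "$path" ]] || return 0

    # wc -c is portable across GNU and BSD systems, unlike stat's format flags
    local actual_size
    actual_size=$(wc -c < "$path")
    actual_size=$(( actual_size ))  # Strip the padding some wc versions add

    # Size mismatch means a partial or stale file
    [[ "$actual_size" -ne "$expected_size" ]]
}
```

Inside a download loop this could guard the `wget` call, for example `needs_download "$output_dir/$file" "$size" && wget ...`. For partially downloaded files, wget's own `--continue` option is an alternative.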
### Recursive Collection Download
```bash
# Download entire collection with format filtering
download_collection() {
    local collection="$1"
    local output_base="$2"
    local format="$3"  # Exact format label, e.g. "Flac" or "VBR MP3"
    local max_items="${4:-100}"

    # Get collection items
    local items
    items=$(get_collection_items "$collection" "$max_items")

    local count=0
    while IFS= read -r identifier; do
        [[ -n "$identifier" ]] || continue
        count=$((count + 1))
        echo "[$count/$max_items] Downloading: $identifier"

        # Create item directory, including the .curator metadata directory
        local item_dir="$output_base/$identifier"
        mkdir -p "$item_dir/.curator"

        # Download with format filter
        download_item "$identifier" "$item_dir" "$format"

        # Download metadata
        curl -s "https://archive.org/metadata/$identifier" \
            > "$item_dir/.curator/metadata.json"

        # Rate limiting (be nice to archive.org)
        sleep 2
    done <<< "$items"
}

# Example: Download first 50 FLAC items from collection
download_collection "GratefulDead" "/mnt/archive/grateful-dead" "Flac" 50
```
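Long sequential runs will eventually hit a transient network failure. As a sketch of one way to handle that, the hypothetical `retry_with_backoff` wrapper below re-runs a command with exponentially growing pauses. The attempt count and base delay are illustrative parameters, not values recommended by Internet Archive.

```bash
# Hypothetical helper: run a command, retrying with exponential backoff on failure.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
    local max_attempts="$1"
    local delay="$2"
    shift 2

    local attempt=1
    while true; do
        # Success on any attempt ends the loop
        "$@" && return 0

        if (( attempt >= max_attempts )); then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi

        echo "Attempt $attempt failed, retrying in ${delay}s: $*" >&2
        sleep "$delay"
        delay=$(( delay * 2 ))  # Double the wait before the next attempt
        attempt=$(( attempt + 1 ))
    done
}
```

A per-item call could then be wrapped as `retry_with_backoff 4 5 download_item "$identifier" "$item_dir" "$format"`, which waits 5, 10, and 20 seconds between the four attempts. Note that a wrapper like this only helps if the wrapped command returns a non-zero status on failure.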
### Parallel Collection Download
```bash
# Download collection items in parallel (with concurrency limit)
download_collection_parallel() {
    local collection="$1"
    local output_base="$2"
    local format="$3"
    local max_concurrent="${4:-3}"

    # Get collection items
    local items
    items=$(get_collection_items "$collection" 1000)

    # Create job queue
    local queue_file
    queue_file=$(mktemp "/tmp/curator-archive-queue-XXXXXX")
    echo "$items" > "$queue_file"

    # Process queue with concurrency limit. The worker script is single-quoted
    # and receives its inputs as positional arguments, so identifiers are never
    # spliced into the shell code.
    xargs -P "$max_concurrent" -I {} bash -c '
        output_base="$1"; format="$2"; identifier="$3"
        item_dir="$output_base/$identifier"
        mkdir -p "$item_dir/.curator"

        echo "Downloading: $identifier"

        # Save metadata once, then select files in the requested format from it
        curl -s "https://archive.org/metadata/$identifier" \
            > "$item_dir/.curator/metadata.json"
        files=$(jq -r --arg fmt "$format" \
            ".files[] | select(.format == \$fmt) | .name" \
            "$item_dir/.curator/metadata.json")

        while IFS= read -r file; do
            [[ -n "$file" ]] || continue
            wget -q --show-progress \
                "https://archive.org/download/$identifier/$file" \
                -P "$item_dir"
        done <<< "$files"

        sleep 2  # Rate limiting
    ' _ "$output_base" "$format" {} < "$queue_file"

    rm -f "$queue_file"
}
```
### wget Recursive Download
```bash
# Use wget's recursive mode for efficient bulk download
download_with_wget_recursive() {
    local identifier="$1"
    local output_dir="$2"
    local format_pattern="$3"  # e.g., "*.flac" or "*.mp3"

    mkdir -p "$output_dir"

    wget --recursive \
        --no-parent \
        --no-directories \
        --accept "$format_pattern" \
        --directory-prefix="$output_dir" \
        --wait=2 \
        --random-wait \
        "https://archive.org/download/$identifier/"
}

# Example: Download all FLAC files from item
download_with_wget_recursive "gd1977-05-08.sbd.miller.97065.flac16" \
    "/mnt/archive/gd-1977-05-08" \
    "*.flac"
```
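The metadata-based helpers identify formats by label ("Flac", "VBR MP3"), while wget's `--accept` option works on filename patterns. A small sketch of a bridge between the two follows. `format_to_accept_pattern` is a hypothetical helper, and the label-to-extension pairs are assumptions based on the formats mentioned in this document, so extend the list as needed.

```bash
# Hypothetical helper: map a metadata format label to a wget --accept pattern.
# The pairs below are assumptions covering only the labels used in this document.
format_to_accept_pattern() {
    case "$1" in
        "Flac")                   echo "*.flac" ;;
        "VBR MP3"|"320Kbps MP3")  echo "*.mp3" ;;
        "Ogg Vorbis")             echo "*.ogg" ;;
        *)
            echo "Unknown format label: $1" >&2
            return 1
            ;;
    esac
}
```

This lets both download styles share one format argument, for example `download_with_wget_recursive "$identifier" "$output_dir" "$(format_to_accept_pattern "Flac")"`.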