This repository is a transport index dataset for Retreivr.
It stores mappings from canonical MusicBrainz recording MBIDs to known-good transport identifiers.
Canonical mapping model:
recording_mbid -> transport sources
Examples of transport identifiers:
- YouTube video IDs
- SoundCloud track IDs (future)
- Other supported transport IDs (future)
MusicBrainz remains the authoritative source of metadata. This repository does not replicate MusicBrainz entity metadata.
Current dataset namespace:
- youtube/recording/<prefix>/<recording_mbid>.json
- youtube/video/<prefix>/<video_id>.json (generated reverse index)
Where:
- prefix is the first two characters of recording_mbid
- filename stem equals recording_mbid
- reverse-index prefix is the first two characters of video_id
- reverse-index filename stem equals video_id
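The sharding rules above can be sketched in Python as follows. This is only an illustration: the helper names shard_path, recording_path, and video_path are hypothetical and are not part of the repository's tooling, the identifiers in the usage note are placeholders, and the sketch assumes the prefix is taken from the identifier exactly as written, with no case normalization.

```python
from pathlib import PurePosixPath


def shard_path(kind: str, identifier: str) -> PurePosixPath:
    """Build youtube/<kind>/<prefix>/<identifier>.json.

    The prefix is the first two characters of the identifier, and the
    filename stem is the identifier itself.
    """
    if len(identifier) < 2:
        raise ValueError("identifier must have at least two characters")
    prefix = identifier[:2]
    return PurePosixPath("youtube") / kind / prefix / f"{identifier}.json"


def recording_path(recording_mbid: str) -> PurePosixPath:
    """Path of the forward record for a MusicBrainz recording MBID."""
    return shard_path("recording", recording_mbid)


def video_path(video_id: str) -> PurePosixPath:
    """Path of the generated reverse-index record for a video ID."""
    return shard_path("video", video_id)
```

For a placeholder MBID such as f1a2b3c4-0000-4000-8000-000000000000, recording_path yields youtube/recording/f1/f1a2b3c4-0000-4000-8000-000000000000.json.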
Each record contains:
- recording_mbid
- sources[] with transport candidate identifiers and minimal validation fields
- schema_version
See schema/schema.json for the strict record contract.
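As a rough illustration of the record shape, the sketch below checks only the three top-level fields named in this README. It is not a substitute for schema/schema.json, which remains the strict contract. The function name missing_top_level_fields is hypothetical, and the contents of each sources[] entry are deliberately left unspecified here.

```python
# Top-level fields named in this README; the full contract lives in schema/schema.json.
REQUIRED_TOP_LEVEL_FIELDS = ("recording_mbid", "sources", "schema_version")


def missing_top_level_fields(record: dict) -> list[str]:
    """Return the required top-level fields absent from a record, in declaration order."""
    return [name for name in REQUIRED_TOP_LEVEL_FIELDS if name not in record]
```

A record that carries all three keys produces an empty list; a record with only recording_mbid produces ["sources", "schema_version"].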
Reverse index records contain minimal lookup metadata:
- video_id
- recording_mbid
- confidence
- verified_at
Reverse index files are generated by promotion tooling and must not be edited manually.
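To make the relationship between forward and reverse records concrete, here is a minimal sketch of how a reverse-index record could be derived from a forward record. It is not the promotion tooling itself. It assumes, purely for illustration, that each sources[] entry exposes video_id, confidence, and verified_at under those exact keys; the actual entry layout is defined by schema/schema.json and may differ. The function name reverse_records is hypothetical.

```python
def reverse_records(record: dict) -> list[dict]:
    """Derive one reverse-index record per source entry of a forward record.

    Assumption: every entry in record["sources"] has the keys video_id,
    confidence, and verified_at. The real schema may name or nest them
    differently.
    """
    return [
        {
            "video_id": source["video_id"],
            "recording_mbid": record["recording_mbid"],
            "confidence": source["confidence"],
            "verified_at": source["verified_at"],
        }
        for source in record["sources"]
    ]
```

Deriving reverse records mechanically from the forward records is what makes manual edits unnecessary: any hand-made change would be overwritten, or would drift from the forward record, the next time the index is regenerated.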
This repository must not contain:
- scraped metadata dumps
- platform search result dumps
- thumbnails
- ranking heuristics
- MusicBrainz entity metadata copies
- media files or download URLs
Validation in .github/workflows/validate.yml enforces:
- JSON parse validity for dataset files
- JSON Schema compliance
- shard-path and filename/MBID consistency
- duplicate MBID prevention within the namespace
- stats integrity via scripts/generate_stats.py --check
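Two of the checks listed above, shard-path/filename consistency and duplicate MBID detection, can be sketched as pure functions. The actual workflow in .github/workflows/validate.yml is not reproduced here, so treat this as an illustration of the rules rather than the repository's validator; the function names check_recording_path and duplicate_mbids are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def check_recording_path(path: str, record: dict) -> list[str]:
    """Return consistency problems for one forward record file.

    path is the repository-relative file path; record is the parsed JSON.
    """
    p = PurePosixPath(path)
    problems = []
    # The filename stem must equal the MBID stored inside the record.
    if p.stem != record.get("recording_mbid"):
        problems.append("filename stem does not match recording_mbid")
    # The shard directory must be the first two characters of the stem.
    if p.parent.name != p.stem[:2]:
        problems.append("shard directory does not match filename prefix")
    return problems


def duplicate_mbids(paths: list[str]) -> list[str]:
    """Return filename stems that occur more than once, sorted."""
    counts = Counter(PurePosixPath(p).stem for p in paths)
    return sorted(stem for stem, n in counts.items() if n > 1)
```

Keeping the checks free of file I/O makes them easy to test in isolation; a caller would walk the dataset directory, parse each file, and pass the path and parsed record in.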
The dataset accelerates transport resolution for Retreivr clients while keeping output deterministic, lightweight, and Git-native.