Skip to content

Latest commit

 

History

History
280 lines (192 loc) · 17.6 KB

File metadata and controls

280 lines (192 loc) · 17.6 KB

programmers_notes.md — music_ingest.py

Notes for a future programmer (or future Claude session) picking up this code cold. This file assumes you've read README.md and design.md — it covers the implementation gotchas, schema details, and places the code is deliberately or accidentally fragile.


File layout

Everything lives in a single file: music_ingest.py (~1000 lines). The structure, top to bottom:

imports + optional-dependency flags (REQUESTS_AVAILABLE, OLLAMA_AVAILABLE, tqdm fallback)
SCHEMA (multi-statement SQL string)
helpers
  open_db, normalise, ole_to_iso, parse_duration_mmss, compute_significance,
  parse_xml_forgiving, elem_text, extract_year, log_run, normalise_path_for_match
cmd_init
phase 1: Jellyfin
  parse_artist_nfo, parse_album_nfo, match_audio_to_track, upsert_artist, cmd_jellyfin
phase 2: LRC lyrics
  LRC_TIMESTAMP + LRC_METADATA regexes, clean_lrc_text, cmd_lyrics
phase 3: MediaMonkey
  MM_QUERY (SQL constant), cmd_mediamonkey
phase 4a: genre themes
  GENRE_THEMES (data), cmd_themes_genre
phase 4b: MusicBrainz
  cmd_themes_musicbrainz
phase 4c: LLM
  LLM_PROMPT template, build_llm_prompt, cmd_themes_llm
cmd_report
main (argparse + dispatch)

The split between "phase 4a/b/c" is conceptual only — all three write to the song_themes table with different source values ('genre', 'musicbrainz', 'llm'). They can run in any order and any subset.


Database schema — things to know

Table relationships

artists (1) ─< (many) songs (1) ─< (many) song_themes

songs.artist_id is a soft FK to artists.artist_id. It's populated during the Jellyfin phase and is nullable because MediaMonkey-only songs may not have a matching artists row. The FK is enforced via PRAGMA foreign_keys=ON but never cascades — artist deletion would fail if songs reference it.

Unique constraints and their consequences

  • artists.name is UNIQUE. This is why upsert_artist checks first and updates or inserts accordingly. Artist name collisions across albums (e.g. "The Beatles" vs "Beatles") are preserved as separate rows — the UNIQUE is on the literal name, not a normalised form.
  • songs.audio_path is UNIQUE. This is the primary dedup key across phases. If the same song is described in two different album.nfo files (e.g. a compilation and the original album), whichever is seen first wins, and the second insert hits an IntegrityError caught as a "skipped" or "errors" stat.
  • song_themes has a composite PK of (song_id, theme, source). The same theme can come from multiple sources (genre and LLM both tag a song "melancholy") and both rows exist, distinguishable by source.

Derived columns to not trust blindly

  • significance_score is only populated by the mediamonkey phase. Songs without MM data have it as 0.0 (the DEFAULT), which looks the same as "MM data said zero plays and no rating". If you need to tell them apart, check whether play_count IS NOT NULL.
  • era_start/era_end are simply release_year ± 2. They don't account for late adoption (when a song entered the listener's life versus when it was released). A future refinement could use date_added or last_played to anchor era differently. See design.md for why this matters for the reminiscence-bump hypothesis.
  • source column: 'jellyfin', 'mediamonkey', or 'both'. The transition to 'both' happens inside the mediamonkey phase's UPDATE via a CASE WHEN source='jellyfin' THEN 'both' ELSE source END clause.

Phase-by-phase implementation notes

cmd_init — trivially minimal

open_db applies the schema idempotently on every call (via CREATE TABLE IF NOT EXISTS). So cmd_init is really just a "touch the file and tell me if it worked" wrapper. All other subcommands will create the schema automatically on first use; init exists only as a clarity-of-intent command for users who expect it.

cmd_jellyfin — the most complex phase

Two passes, both over the same rglob:

  1. Pass 1 (artists): Parse every artist.nfo, insert into artists table. Deliberately runs first so upsert_artist calls in pass 2 can find existing rows to link via artist_id.
  2. Pass 2 (albums): Parse every album.nfo, iterate its inline <track> children, match each to an audio file, and insert/update a songs row. If an album's <albumartist> doesn't have a pre-existing artists row (no artist.nfo was present), upsert_artist is called with just a name to create a minimal row.

Audio file matching (match_audio_to_track) tries three strategies in order:

  1. Position-prefix match with title substring verification: stem starts with 01/02/etc. AND first 15 normalised chars of title appear in normalised stem. This is the most reliable match.
  2. Pure title substring match: first 20 normalised chars of title appear in normalised stem. This catches unnumbered filenames.
  3. Positional index into sorted audio file list. This is a last resort and will give wrong results for multi-disc albums or folders containing bonus-track files.

The regex re.match(rf"^{pat}[\s._-]", stem) in strategy 1 accepts 01 - Foo, 01_Foo, 01.Foo, 01-Foo, but deliberately not 011Foo or 010 - Track (the boundary character requirement prevents the position "01" from matching the start of "010 - Foo"). Track 1 matches 1 - Foo but not 10 - Foo because the patterns are tried in order [f"{pos:02d}", f"{pos}"] — the two-digit form goes first.

Update vs insert logic: If the audio path already exists, UPDATE preserves MediaMonkey-sourced fields (play_count, rating, etc.) and only refreshes the bibliographic data. If it doesn't exist, INSERT creates a fresh row with source='jellyfin'. Re-running the phase is therefore safe and refreshes Jellyfin-sourced data without clobbering MediaMonkey data.

cmd_lyrics — simpler than it looks

Walks for *.lrc. For each, tries:

  1. Same stem, any audio extension (07 - Wanted.lrc07 - Wanted.mp3).
  2. Normalised-stem match against all audio files in the same folder (handles case differences, spacing differences).

If it finds audio, it looks up the song by audio_path and UPDATEs lyrics and lyrics_path. If the song isn't in the database (e.g. LRC for a track not yet ingested via Jellyfin), the UPDATE affects 0 rows and increments unmatched.

LRC cleaning strips two things:

  • Timestamps: [00:10.89]
  • Metadata tags: [ar: ...], [ti: ...], [al: ...], [by: ...], [length: ...], [offset: ...], [re: ...], [ve: ...]

Empty lines after stripping are removed. The resulting text is prose — good for LLM consumption.

cmd_mediamonkey — the hardest phase to debug when it goes wrong

The MM_QUERY joins Songs, Artists (twice — once for track artist, once for album artist), Albums, and Genres. This is the documented MM4/MM5 schema but has minor variations across versions:

  • MM4 sometimes has Songs.AlbumArtist directly on the Songs table instead of needing to join via Albums.IDAlbumArtist.
  • MM5 introduced a more flexible tag model; the core tables remain but new columns may appear.
  • MM6+ is an unknown — if encountered, inspect schema with sqlite3 MM5.DB .schema or SELECT sql FROM sqlite_master WHERE type='table'.

If the query fails, the error handling is explicit: print the failure, suggest sqlite3 MM.DB .schema, exit. The script does not attempt to adapt to arbitrary schema drift.

Matching strategy builds two in-memory indices once, then does lookups in a loop:

path_idx = {normalise_path_for_match(audio_path): song_id for ...}
name_idx = {(normalise(artist), normalise(title)): song_id for ...}

Try path first (exact, case-insensitive, slash-normalised). Fall back to fuzzy (normalised artist+title). If both fail, insert as MM-only.

ole_to_iso is the single most fragile conversion in the whole script. MediaMonkey stores dates as OLE Automation dates (float days since 1899-12-30). Zero and negative values mean "never played" and must return None. Very small positive values (< 1) would convert to 30 Dec 1899 and are filtered. The helper has been seen in the wild returning plausible-looking but wrong dates when MM has corrupted timestamps; if you see songs with date_added like "1900-01-02", investigate the source rather than the conversion.

Rating normalisation: MM stores rating as 0-100 in steps of 10. int(round(rating_0_100 / 20.0)) gives 0-5 stars. MM uses -1 for "unrated"; the rating_0_100 > 0 guard treats this as None.

cmd_themes_genre — dumb but fast

GENRE_THEMES is a list of (substrings, themes) tuples. For each song, check if any substring appears in the lowercased genre string. Accumulate all matching themes (set, deduplicated) and insert. The INSERT OR IGNORE lets the phase be re-run idempotently.

The before = conn.total_changes / if conn.total_changes > before pattern is how we count actual inserts (vs "already existed" ignores) without a second SELECT. Don't simplify this away without replacing with something equivalent.

cmd_themes_musicbrainz — rate-limited and fragile

Queries one URL per artist: https://musicbrainz.org/ws/2/artist/{mbid}?inc=tags&fmt=json. Sleeps 1.1 seconds between calls (MusicBrainz etiquette is max 1 req/sec; the extra 0.1 is insurance).

Things that will break it:

  • MusicBrainz returns HTTP 503 under load; the current code increments errors and moves on. A future improvement would be exponential backoff.
  • The User-Agent header is required by MusicBrainz's terms of use. Don't remove it.
  • Some artists have hundreds of tags; the length filter (len(name) > 40) weeds out junk like long descriptive phrases that some users submit as tags.

The script applies each artist tag to every song by that artist via an inner loop. This is N×M writes (N tags, M songs per artist) which is fine for modest libraries but could get slow for artists with hundreds of songs and hundreds of tags. Not currently optimised.

cmd_themes_llm — slowest, most configurable

The filter logic is dynamic WHERE-clause construction:

where = ["NOT EXISTS (...)"]  # never re-process songs already LLM-tagged
if min_significance: where.append(...)
if year_from and year_to: where.append(...)
if require_lyrics: where.append(...)
sql = f"... WHERE {' AND '.join(where)} ORDER BY significance_score DESC"

The NOT EXISTS is the resumability hook — you can interrupt and restart and it'll skip what it's already done.

Prompt construction: LLM_PROMPT has placeholders for the bibliographic fields plus {bio_block} and {lyrics_block} that expand to the full "Artist background: ..." or "Lyrics excerpt: ..." sections or to empty string. This makes the prompt adapt to what's actually available per song without awkward empty labels.

Response parsing strips markdown code fences (Ollama models sometimes wrap JSON in ```json ... ```) and then json.loads. Malformed JSON counts as an error and the song is skipped. A future improvement could retry with a tighter prompt.

The temperature=0.3 setting is low enough to make outputs fairly deterministic for the same inputs, but not so low that different songs get similar themes.


Testing and debugging

Minimum test fixture

The simplest way to validate the script against changes is to use the examples checked into the repository (or re-create them):

test/
  music/
    White Town/
      artist.nfo                            ← from sample
      Women in Technology (...)/
        album.nfo                           ← from sample
        07 - Wanted.lrc                     ← from sample
        01 - Track01.mp3                    ← empty, just needs to exist
        ...
        07 - Wanted.mp3                     ← empty
        ...

Then:

python music_ingest.py init      --music-db /tmp/test_music.db
python music_ingest.py jellyfin  --music-db /tmp/test_music.db --nfo-root test/music
python music_ingest.py lyrics    --music-db /tmp/test_music.db --music-root test/music
python music_ingest.py themes-genre --music-db /tmp/test_music.db
python music_ingest.py report    --music-db /tmp/test_music.db

Expected: 1 artist, 27 songs, 1 lyric attached, 135 genre-themes (27 songs × 5 themes from "Derby Indie" → "indie" match).

Inspecting the database directly

import sqlite3
c = sqlite3.connect('music_index.db')

# Schema
for row in c.execute("SELECT sql FROM sqlite_master WHERE type='table'"):
    print(row[0])

# Songs with missing audio
for t, a in c.execute("""
    SELECT title, album FROM songs WHERE audio_path IS NULL
    ORDER BY album, track_position
"""):
    print(f"{a}: {t}")

# Theme distribution per source
for src, n in c.execute("""
    SELECT source, COUNT(*) FROM song_themes GROUP BY source
"""):
    print(f"{src}: {n}")

# Songs that would be candidates for the LLM phase
for t, a, y, s in c.execute("""
    SELECT title, artist, release_year, significance_score FROM songs
    WHERE significance_score >= 0.5
      AND release_year BETWEEN 1971 AND 1991
    ORDER BY significance_score DESC LIMIT 20
"""):
    print(f"{s:.2f}  {y}  {a}{t}")

Common failure modes

"IntegrityError: UNIQUE constraint failed: songs.audio_path" — Two NFOs claim the same audio file. This has happened when an artist has both an album.nfo and a "Greatest Hits" compilation that points to the same underlying MP3s. Currently counted as an error; a future fix might be to prefer the chronologically earlier album.

"no such table: Songs" from MM phase — MM.DB path is wrong, or the file is MM4 not MM5 (the table names are the same but there are some file-level differences). Check with sqlite3 MM.DB .tables.

Phase 4b (MusicBrainz) returning few themes — Check SELECT COUNT(*) FROM artists WHERE musicbrainz_artist_id IS NOT NULL. If it's low, it's because the user's Jellyfin library doesn't have MusicBrainz enrichment enabled. Fix is upstream (enable the MusicBrainz metadata provider in Jellyfin, rescan), not in this script.

Phase 4c (LLM) hanging — Ollama server unreachable or model not pulled. Check with curl http://192.168.1.20:11434/api/tags before blaming the script.


Extension points — things a future maintainer might want to add

Easy additions

  • themes-manual subcommand — accept a CSV of song_id, theme and insert with source='manual'. This would let the user hand-curate the top 50 most-significant songs after seeing what the automatic phases produced.
  • Embedded ID3 tag reading — Jellyfin NFOs cover most metadata but ID3 frames (mutagen package) would fill in gaps for albums with missing NFOs.
  • Track-level MusicBrainz lookup — the current code queries artist tags only. Adding a themes-musicbrainz-tracks phase that queries /recording/{mbid}?inc=tags for tracks that have MBIDs would add mood tags at track level. Rate-limited same as artist queries.

Harder additions

  • Schema migrations — the script currently expects a clean database or one with exactly its schema. A future version may need ALTER TABLE scripts to handle database upgrades. The music_ingest_log table could hold schema version metadata.
  • Fuzzy dedup across sources — the current dedup is strict equality. A Levenshtein-based fuzzy match might merge "The Beatles" with "Beatles" or "Pink Floyd" with "Pink Floyd Feat. Dick Parry" as a post-processing pass.
  • Audio fingerprintingchromaprint/acoustid could match songs to MusicBrainz by audio content, bypassing the ID-provenance problem entirely. Big dependency, but solves path-drift between Jellyfin and MM conclusively.

Things to resist adding

  • Web UI or live reporting. The script is a CLI data pipeline; UI belongs in the jukebox player.
  • "Smart" schema evolution that tries to preserve data across arbitrary changes. The user-data here is all re-derivable from the source files. Deleting music_index.db and re-running is always the correct remediation for schema confusion. Don't over-engineer this.
  • Parallel processing of phases. The bottleneck in Phase 4b is MusicBrainz's rate limit; in Phase 4c it's the LLM's forward pass. Parallelism across phases risks DB lock contention for modest gains. Keep it simple.

Coding conventions used

  • Type hints on public functions but not aggressively on every local variable. dict | None over Optional[dict] (Python 3.10+).
  • Early return over deep nesting. The parse_*_nfo functions return None on any unrecoverable issue rather than raising.
  • Stats dictionaries are mutated in place inside the per-file loop and then logged via log_run at the end. This is the resumability hook — if the loop crashes, the partial work is committed and the user can re-run.
  • SQLite transactions are committed after each logical unit of work (one file, one artist). This makes the script interrupt-safe at the cost of some insert performance. Don't batch commits without reconsidering resumability.
  • Error handling at phase level is deliberately coarse: try/except Exception with a counter increment. Individual file-level failures should never stop the phase; phase-level failures (missing DB, missing source directory) exit with a clear message.
  • No logging module. print to stdout for progress, print(..., file=sys.stderr) for errors. This is a short-lived CLI tool, not a long-running daemon; structured logging would be overkill.

Adjacent files

  • README.md — user-facing docs
  • design.md — architectural "why" (read this if you're wondering why a decision was made)
  • requirements.txt — pip dependencies
  • Sample input files used during development:
    • examples/artist.nfo (White Town artist with biography)
    • examples/album.nfo (Women in Technology, 27 tracks including remixes)
    • examples/07 - Wanted.lrc (unsynced lyrics)