[content-hash 2/5] feat: content-hash passage IDs via --id-scheme content-hash #331

Open
raoabinav wants to merge 2 commits into StarTrail-org:main from raoabinav:feat/passage-id-scheme-builder
Contributor

@raoabinav raoabinav commented May 20, 2026

Sub-PR 2 of 5 from #329. Stacks on #330 (meta.json field).

`LeannBuilder(..., passage_id_scheme="content-hash")` makes `add_text()` key passages by `sha256(text)[:16]` instead of insertion index. The resulting IDs are stable across file moves, reorderings, and re-runs of the same corpus. Exposed at the CLI as `leann build --id-scheme content-hash`.
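A minimal sketch of why the hash-based key survives reordering while the insertion index does not. This is not the code in this PR: `content_hash_id` and `assign_ids` are hypothetical helpers, and it assumes UTF-8 encoding, a slice of the first 16 hex characters of the digest, and stringified sequential IDs.

```python
import hashlib


def content_hash_id(text: str) -> str:
    # Assumption: UTF-8 bytes in, first 16 hex characters of the SHA-256 digest out.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def assign_ids(chunks, scheme="sequential"):
    # Hypothetical helper: one ID per chunk under the chosen scheme.
    if scheme == "content-hash":
        return [content_hash_id(chunk) for chunk in chunks]
    return [str(i) for i in range(len(chunks))]


chunks = ["alpha passage", "beta passage", "gamma passage"]
reordered = list(reversed(chunks))

# Map each chunk's text to the ID it receives in each ordering.
seq_a = dict(zip(chunks, assign_ids(chunks)))
seq_b = dict(zip(reordered, assign_ids(reordered)))
hash_a = dict(zip(chunks, assign_ids(chunks, "content-hash")))
hash_b = dict(zip(reordered, assign_ids(reordered, "content-hash")))

print(seq_a == seq_b)    # False: sequential IDs follow insertion order
print(hash_a == hash_b)  # True: content-hash IDs depend only on the text
```

The ID depends only on the chunk text, so anything that changes the text (including chunking parameters) would change the ID under this sketch.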

Default unchanged ("sequential"). Existing indexes continue to work identically.

Identical-text chunks collide on the same hash. For this sub-PR, the second occurrence overwrites the first in the offset map, which is the dedup behavior I'd want by default. A `--preserve-duplicates` escape hatch can land later if needed (open question in #329).
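To illustrate the overwrite behavior described above, here is a rough sketch using a plain dict as the offset map. The byte-offset layout (one passage per line, newline-separated) and the `build_offset_map` helper are assumptions for illustration, not the layout the builder actually writes.

```python
import hashlib


def content_hash_id(text: str) -> str:
    # Assumption: first 16 hex characters of the SHA-256 digest of the UTF-8 text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_offset_map(chunks):
    """Map passage ID -> byte offset in a hypothetical newline-separated passages file."""
    offsets = {}
    position = 0
    for text in chunks:
        # A repeated text produces the same key, so the later offset replaces the earlier one.
        offsets[content_hash_id(text)] = position
        position += len(text.encode("utf-8")) + 1  # +1 for the assumed newline separator
    return offsets


chunks = ["same text", "other text", "same text"]
offsets = build_offset_map(chunks)

print(len(offsets))                           # 2 entries for 3 chunks
print(offsets[content_hash_id("same text")])  # 21: offset of the second occurrence
```

In this sketch the first occurrence's bytes would still be in the file but unreachable by ID, which is the trade-off a `--preserve-duplicates` option would have to address.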

Content-hash passage IDs train (#329)

raoabinav and others added 2 commits May 20, 2026 11:07
Sub-PR 1 of 5 from the plan in StarTrail-org#329. Purely additive: no behavior
change for any caller, and existing index loaders ignore the field.

Writes a new `passage_id_scheme: "sequential"` field into the .meta.json
produced by both build_index and build_index_from_arrays. Bumps version
to "1.1" for human-inspectable schema tracking (no code reads version today,
so the bump is safe).

Module-level constants PASSAGE_ID_SCHEME_SEQUENTIAL / _CONTENT_HASH document
the value space; the content-hash scheme itself ships in sub-PR 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sub-PR 2 of 5 from StarTrail-org#329. Builds on StarTrail-org#330 (which added the meta.json field).

New behavior:
- `LeannBuilder(..., passage_id_scheme="content-hash")` makes add_text() key
  passages by sha256(text)[:16] instead of insertion index. Stable across file
  moves, reorderings, and re-runs of the same corpus.
- `leann build --id-scheme content-hash` exposes it at the CLI.
- Default unchanged ("sequential"). Existing indexes continue to work
  identically; no migration triggered.

Identical-text chunks collide (same hash). For this sub-PR, the second
occurrence overwrites the first in the offset map, which is the dedup
behavior I'd want by default. A `--preserve-duplicates` escape hatch can
land later if needed (see the open question in StarTrail-org#329).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>