feat(chunking): repeat table headers on continuation chunks #4298

Draft
cragwolfe wants to merge 15 commits into main from crag/codex-draft-preview-headers-copy

@cragwolfe (Contributor)

Behavior summary

Before

  • Oversized table chunks only preserved headers in the first chunk; continuation chunks could lose column context.
  • Table header semantics (<thead> / <th>) were not retained as explicit row-level metadata after compactification.

After

  • Added repeat_table_headers (default True) to chunking APIs and strategy plumbing:
    • chunk_elements(..., repeat_table_headers=...)
    • chunk_by_title(..., repeat_table_headers=...)
    • add_chunking_strategy(...) forwards the argument and documents it
  • _TableChunker now detects contiguous leading header rows and repeats them on every continuation chunk.
  • Repeated header rows are prepended to both continuation chunk text and text_as_html.
  • First chunk behavior remains unchanged relative to legacy output.
  • Added a guardrail: if the repeated header rows would consume more than half the chunk window, the splitter falls back to legacy non-repeating behavior.
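
The splitting behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `split_rows_with_headers` and its row-as-string model are hypothetical, rows are assumed to be joined with a single newline, and mid-row splitting of oversized rows is left out.

```python
from typing import List


def split_rows_with_headers(
    header_rows: List[str],
    body_rows: List[str],
    max_characters: int,
    repeat_table_headers: bool = True,
) -> List[List[str]]:
    """Greedily pack rows into chunks, repeating headers on continuation chunks.

    Hypothetical helper for illustration only; not the _TableChunker implementation.
    """

    def size(rows: List[str]) -> int:
        # Character count of the rows once joined with "\n".
        return sum(len(r) for r in rows) + max(len(rows) - 1, 0)

    # Guardrail: repeat only when the headers take at most half the window.
    repeat = (
        repeat_table_headers
        and bool(header_rows)
        and size(header_rows) * 2 <= max_characters
    )

    chunks: List[List[str]] = []
    current = list(header_rows)  # the first chunk always keeps its headers
    has_body = False
    for row in body_rows:
        candidate = current + [row]
        if has_body and size(candidate) > max_characters:
            # Close the current chunk and start a continuation chunk.
            chunks.append(current)
            current = (list(header_rows) if repeat else []) + [row]
        else:
            current = candidate
        has_body = True
    if current:
        chunks.append(current)
    return chunks
```

With `repeat_table_headers=False`, or when the guardrail trips, continuation chunks start directly with body rows, which is the legacy shape this sketch assumes.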

Invariants

  • No body-row drop, duplication, or reordering across emitted continuation chunks.
  • Opt-out behavior (repeat_table_headers=False) matches legacy table splitting behavior.
  • Chunk windows still respect max-size constraints, including near-boundary continuation windows.
  • Only contiguous leading header rows are repeated; later non-leading header-like rows are not promoted.
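
One way to express the first three invariants as a property check is sketched below. The function name `continuation_invariants_hold` and the list-of-row-strings chunk representation are assumptions made for illustration; the tests in this PR may check these properties differently.

```python
from typing import List


def continuation_invariants_hold(
    header_rows: List[str],
    body_rows: List[str],
    chunks: List[List[str]],
    max_characters: int,
) -> bool:
    """Return True when chunks keep every body row once, in order, within size.

    Hypothetical checker; assumes rows are joined with "\n" when measured.
    """
    n = len(header_rows)
    recovered: List[str] = []
    for chunk in chunks:
        # Size invariant: every emitted chunk fits in the window.
        if len("\n".join(chunk)) > max_characters:
            return False
        # Strip a leading copy of the headers, if this chunk carries one.
        rows = chunk[n:] if n and chunk[:n] == header_rows else chunk
        recovered.extend(rows)
    # No drop, duplication, or reordering of body rows.
    return recovered == body_rows
```

Because a leading header copy is stripped only when it is present, the same check accepts both the repeating output and the opt-out (legacy) output.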

Edge cases covered

  • No headers, single leading header row, multiple leading header rows.
  • Header detection from both <thead> and <th> rows.
  • Exact-fit and near-boundary continuation sizing.
  • Cascading repetition across 3+ continuation chunks.
  • Pathologically large header rows trigger safe fallback to non-repeating behavior.
  • Strategy-path forwarding validated through partition_html(..., chunking_strategy="by_title").
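
The header-detection cases can be illustrated with a standard-library sketch. It assumes a row counts as a header when it sits inside `<thead>` or when all of its cells are `<th>`; that rule, the class `_RowClassifier`, and the function `count_leading_header_rows` are hypothetical and may not match how this PR classifies rows. Nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _RowClassifier(HTMLParser):
    """Record, for each <tr>, whether it looks like a header row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[bool] = []
        self._in_thead = False
        self._row_in_thead = False
        self._cells: Optional[List[str]] = None  # cell tags of the open <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr":
            self._cells = []
            self._row_in_thead = self._in_thead
        elif tag in ("th", "td") and self._cells is not None:
            self._cells.append(tag)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "tr" and self._cells is not None:
            all_th = bool(self._cells) and all(c == "th" for c in self._cells)
            self.rows.append(self._row_in_thead or all_th)
            self._cells = None


def count_leading_header_rows(table_html: str) -> int:
    """Count the contiguous header rows at the top of the table."""
    parser = _RowClassifier()
    parser.feed(table_html)
    parser.close()
    count = 0
    for is_header in parser.rows:
        if not is_header:
            break  # later header-like rows are not promoted
        count += 1
    return count
```

Stopping at the first non-header row is what keeps a later all-`<th>` row, such as a subtotal banner in the middle of the table, from being repeated.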

Test evidence

  • uv run --no-sync pytest -q test_unstructured/chunking/test_dispatch.py (6 passed)
  • uv run --no-sync pytest -q test_unstructured/chunking/test_base.py -k "Describe_TableChunker" (26 passed)
  • uv run --no-sync pytest -q test_unstructured/chunking/test_title.py::test_add_chunking_strategy_forwards_repeat_table_headers (1 passed)
  • uv run --no-sync pytest -q test_unstructured/chunking/test_title.py -k "repeat_table_headers" (5 passed)
  • uv run --with python-docx pytest -q test_unstructured/chunking/test_basic.py -k "repeat_table_headers" (4 passed)
  • uv run --no-sync pytest -q test_unstructured/common/test_html_table.py (26 passed)

authored by codex

@cragwolfe cragwolfe marked this pull request as ready for review March 25, 2026 04:13
@cragwolfe cragwolfe enabled auto-merge March 26, 2026 04:13
@cragwolfe cragwolfe marked this pull request as draft March 26, 2026 23:09
auto-merge was automatically disabled March 26, 2026 23:09
