fix: restore double-newline row boundaries in Table.text (#4235) #4299

Open
alvinttang wants to merge 1 commit into Unstructured-IO:main from alvinttang:fix/4235-table-text-row-boundaries

@alvinttang

Summary

Fixes #4235: str(Table) no longer preserves row boundaries after 0.16.0, which breaks downstream parsing that relied on str(table).split("\n\n") to reconstruct header→value mappings from XLSX sheets.

Root cause: HtmlTable.text in unstructured/common/html_table.py used " ".join(self._table.itertext()), which walks all text nodes in document order and flattens them into a single space-separated string, discarding all row structure.

Fix: Replace the flat itertext() join with iter_rows() / iter_cell_texts() — cells within a row are joined with " ", rows are joined with "\n\n", and all-blank rows are suppressed. The metadata.text_as_html field (compact HTML) is unchanged.
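A minimal sketch of the row-aware join described above, for reviewers who want to try the behavior in isolation. It is not the code in this PR: it uses the standard library's xml.etree.ElementTree instead of the HtmlTable helpers, table_text is a hypothetical name, and dropping blank cells inside an otherwise non-empty row is an assumption about how iter_cell_texts() behaves.

```python
import xml.etree.ElementTree as ET


def table_text(html: str) -> str:
    """Hypothetical stand-in for HtmlTable.text (not the library's code).

    Cells within a row are joined with a single space, rows are joined
    with a double newline, and rows whose cells are all blank are skipped.
    """
    table = ET.fromstring(html)
    rows = []
    for tr in table.iter("tr"):
        cells = []
        for cell in tr:
            if cell.tag not in ("td", "th"):
                continue
            # Collapse internal whitespace so each cell becomes one clean token run.
            text = " ".join("".join(cell.itertext()).split())
            # Assumption: blank cells contribute nothing to the row text.
            if text:
                cells.append(text)
        if cells:  # suppress all-blank rows
            rows.append(" ".join(cells))
    return "\n\n".join(rows)


html = (
    "<table>"
    "<tr><th>Name</th><th>Score</th><th>URL</th></tr>"
    "<tr><td>Alice</td><td>1</td><td>www.example.com</td></tr>"
    "<tr><td></td><td> </td><td></td></tr>"
    "<tr><td>Bob</td><td>2</td><td>www.example2.com</td></tr>"
    "</table>"
)

result = table_text(html)
print(repr(result))
```

With this input the all-blank third row is skipped, so splitting the result on "\n\n" yields three entries: the header row and the two data rows.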

# Before (broken — all cells on one line)
"1 1 0 www.example.com 2 2 0 www.example2.com ..."

# After (row boundaries restored)
"Name Score URL\n\nAlice 1 www.example.com\n\nBob 2 www.example2.com"

Changes

  • unstructured/common/html_table.py: HtmlTable.text now iterates rows and joins them with "\n\n"
  • test_unstructured/common/test_html_table.py: updated the existing text assertion; added regression test it_separates_rows_with_double_newlines_for_boundary_reconstruction

Test plan

  • All 25 existing test_html_table.py tests pass
  • New regression test confirms str(table).split("\n\n") returns one entry per row
  • Empty rows (all cells blank) are correctly suppressed
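For context, the kind of downstream parsing this restores might look like the sketch below. rows_to_records is a hypothetical helper, not part of unstructured, and splitting each row on whitespace assumes no cell value contains a space.

```python
def rows_to_records(table_text: str) -> list[dict[str, str]]:
    """Hypothetical downstream helper: map header names to row values.

    Assumes rows are separated by a double newline and that no cell
    contains whitespace, so a plain split() recovers the cells.
    """
    rows = [row for row in table_text.split("\n\n") if row.strip()]
    if not rows:
        return []
    header = rows[0].split()
    return [dict(zip(header, row.split())) for row in rows[1:]]


restored = "Name Score URL\n\nAlice 1 www.example.com\n\nBob 2 www.example2.com"
records = rows_to_records(restored)

# With the flattened output there is only one "row", so no records come back.
flattened = restored.replace("\n\n", " ")
broken_records = rows_to_records(flattened)
```

The second call illustrates why the regression was silent: the flattened string still parses, it just produces a header and zero records instead of raising an error.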

🤖 Generated with Claude Code

HtmlTable.text was joining all cell text with a flat space, losing row
structure. Downstream callers that relied on `str(table).split("\n\n")`
to reconstruct rows (e.g. header→value mapping for XLSX sheets) broke
silently after 0.16.0.

Fix: iterate rows via iter_rows()/iter_cell_texts(), join cells within a
row with a single space, and separate rows with double newlines. Empty
rows are suppressed. The html/text_as_html metadata fields are unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

bug/XLSX Table string representation loses row boundaries after 0.16.0 causing flattened output from partition_xlsx