
fix: improve multi-column layout sorting for academic papers (#4104) #4283

Open
Gopesh111 wants to merge 2 commits into Unstructured-IO:main from Gopesh111:fix-academic-sorting


@Gopesh111

This PR addresses the reading order issues in multi-column documents (specifically academic papers) as reported in #4104.

Key Changes:

Hybrid Sorting Logic: Introduced sort_page_elements_columns in sorting.py to bin elements into Top (Header/Title), Bottom (Footer), and Middle (Body) zones.

Column-Aware Binning: Body elements are now split into Left and Right columns based on the page midpoint, preventing the 'Z-pattern' reading order in which lines are read straight across both columns.

Noise-Resistant XY-Cut: Updated xycut.py with increased min_gap (10px for X, 2px for Y) and min_value thresholds. This allows the parser to ignore scanning noise and correctly identify narrow gutters between columns in research papers.
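
To illustrate how the two thresholds interact, here is a minimal sketch of gap detection on a projection profile. It is not the code in xycut.py: `find_gaps` is a hypothetical helper, and the assumption is that `min_value` is the largest per-position count still treated as empty, while `min_gap` is the shortest run of empty positions accepted as a cut.

```python
import numpy as np


def find_gaps(profile, min_value=0, min_gap=1):
    """Return (start, end) index pairs (end exclusive) of runs where
    profile <= min_value and the run is at least min_gap positions long.

    Hypothetical helper for illustration; the real xycut.py may differ.
    """
    empty = np.asarray(profile) <= min_value
    gaps = []
    start = None
    for i, is_empty in enumerate(empty):
        if is_empty and start is None:
            start = i  # a run of empty positions begins
        elif not is_empty and start is not None:
            if i - start >= min_gap:
                gaps.append((start, i))
            start = None
    # Close a run that extends to the end of the profile.
    if start is not None and len(empty) - start >= min_gap:
        gaps.append((start, len(empty)))
    return gaps


# A gutter at positions 2..4 containing one noisy position (value 1),
# plus a single empty position at index 7.
profile = [5, 5, 0, 1, 0, 5, 5, 0, 5]

strict = find_gaps(profile, min_value=0, min_gap=1)
tolerant = find_gaps(profile, min_value=1, min_gap=3)
```

With `min_value=0` the noisy position splits the gutter into fragments, and the stray empty position at index 7 also counts as a gap. Raising `min_value` to 1 lets the gutter survive as one run, and `min_gap=3` discards the stray one-position gap.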

Verification:
Tested with the NAACL 2025 findings paper. Verified that the sequence now correctly follows: Title -> Abstract -> Intro (Left Col) -> Intro (Right Col) -> Footer.
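
The expected sequence can be reproduced on synthetic boxes with a minimal sketch of the zone-and-column binning described under Key Changes. This is not the implementation of sort_page_elements_columns: the `Box` type, the function name `sort_columns_sketch`, and the `top_frac` / `bottom_frac` zone boundaries are assumptions made for illustration, since the PR text does not state how the zones are delimited.

```python
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def sort_columns_sketch(boxes, page_width, page_height,
                        top_frac=0.15, bottom_frac=0.9):
    """Bin boxes into top / left / right / bottom and concatenate.

    top_frac and bottom_frac are assumed zone boundaries, expressed as
    fractions of the page height (y grows downward).
    """
    top, left, right, bottom = [], [], [], []
    mid_x = page_width / 2
    for b in boxes:
        cx = (b.x1 + b.x2) / 2  # horizontal centre of the box
        cy = (b.y1 + b.y2) / 2  # vertical centre of the box
        if cy < top_frac * page_height:
            top.append(b)
        elif cy > bottom_frac * page_height:
            bottom.append(b)
        elif cx < mid_x:
            left.append(b)
        else:
            right.append(b)

    def by_position(b):
        return (b.y1, b.x1)

    ordered = []
    for zone in (top, left, right, bottom):
        ordered.extend(sorted(zone, key=by_position))
    return ordered


# Synthetic 600 x 800 page, elements deliberately given out of order.
boxes = [
    Box("Footer", 50, 760, 550, 780),
    Box("Intro (Right Col)", 310, 120, 550, 700),
    Box("Title", 50, 40, 550, 80),
    Box("Intro (Left Col)", 50, 300, 290, 700),
    Box("Abstract", 50, 130, 290, 280),
]

column_order = [b.label for b in sort_columns_sketch(boxes, 600, 800)]
naive_order = [b.label for b in sorted(boxes, key=lambda b: b.y1)]
```

In this sketch `column_order` comes out as Title, Abstract, Intro (Left Col), Intro (Right Col), Footer, whereas the plain top-to-bottom `naive_order` places Intro (Right Col) before Abstract because the right column starts higher on the page.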

@Gopesh111
Author

Hi @Unstructured-IO-team,

I've submitted this PR to address the multi-column sorting issue (#4104) specifically for academic layouts. I've implemented a hybrid sorting logic and tuned the XY-cut thresholds to improve robustness against scan noise.

I noticed the CI workflows are awaiting approval from a maintainer. Could you please trigger the tests so I can verify everything is green on your end?

Thanks!
