refactor: extract shared text walker and add encoding chain#62

Open
ajroetker wants to merge 13 commits into ledongthuc:master from ajroetker:refactor/text-walker-extraction

Conversation


@ajroetker ajroetker commented Jan 3, 2026

This PR builds on #61 and adds:

  • FontEncodingChain: Multi-layer font decoding that prioritizes:

    1. ToUnicode CMap (authoritative per PDF spec)
    2. CIDFont chain for Type0 fonts
    3. BaseEncoding + Differences
    4. Glyph name resolution (Adobe Glyph List patterns)
    5. Fallback heuristics
  • walkTextContent(): Shared method to consolidate PDF content stream parsing, reducing duplication between Content() and the new ContentWithMarks()
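The layered fallback can be sketched as an ordered chain of decoders. This is a minimal illustration of the dispatch order described above; the names (`decoder`, `encodingChain`, `mapDecoder`) are illustrative, not the PR's actual types.

```go
package main

import "fmt"

// decoder is one layer of a hypothetical font-decoding chain.
type decoder interface {
	// Decode reports the runes for a character code; ok is false
	// when this layer has no mapping and the next layer should run.
	Decode(code uint32) (rs []rune, ok bool)
}

// mapDecoder is a trivial table-backed layer for illustration.
type mapDecoder map[uint32][]rune

func (m mapDecoder) Decode(code uint32) ([]rune, bool) {
	rs, ok := m[code]
	return rs, ok
}

// encodingChain tries layers in priority order: ToUnicode first,
// then CIDFont, BaseEncoding+Differences, glyph names, heuristics.
type encodingChain struct{ layers []decoder }

func (c *encodingChain) Decode(code uint32) []rune {
	for _, l := range c.layers {
		if rs, ok := l.Decode(code); ok {
			return rs
		}
	}
	// No layer matched: park the low byte in the Private Use Area
	// (U+E000..U+E0FF) so post-processing can recover it.
	return []rune{rune(0xE000 + code&0xFF)}
}

func main() {
	toUnicode := mapDecoder{0x41: []rune("A")}
	chain := &encodingChain{layers: []decoder{toUnicode}}
	fmt.Printf("%q %q\n", string(chain.Decode(0x41)), string(chain.Decode(0x99)))
}
```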

ajroetker and others added 13 commits January 2, 2026 22:36

The PDF specification states that ToUnicode CMap is the authoritative
source for character-to-Unicode mapping. Previously, the library only
checked ToUnicode for fonts with "Identity-H" encoding or null encoding,
causing incorrect text extraction for many PDFs.

This change:
- Checks ToUnicode CMap first before falling back to Encoding
- Falls back to pdfDocEncoding instead of nopEncoder for better
  compatibility with unknown encodings
- Removes the now-redundant charmapEncoding() method

This fixes text extraction issues where characters were being
incorrectly decoded (e.g., '0' appearing as 'M') due to ToUnicode
being ignored when an Encoding entry was present.
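As a sketch of the corrected lookup order (with `decodeCode`, `toUnicode`, and `encTable` as illustrative stand-ins for the parsed PDF structures, not the library's API):

```go
package main

import "fmt"

// decodeCode consults the font's ToUnicode CMap first and only falls
// back to the Encoding-derived table when the CMap has no entry.
func decodeCode(code byte, toUnicode map[byte]rune, encTable [256]rune) rune {
	if r, ok := toUnicode[code]; ok {
		return r // authoritative per the PDF spec
	}
	return encTable[code] // BaseEncoding / PDFDocEncoding fallback
}

func main() {
	var enc [256]rune
	enc['0'] = 'M' // a stale Encoding entry that previously won
	toUnicode := map[byte]rune{'0': '0'}
	fmt.Printf("%c\n", decodeCode('0', toUnicode, enc)) // ToUnicode wins: '0'
}
```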
The PDF spec (section 9.6.6) requires that when an Encoding dictionary
is present, the BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding)
should be applied first, then the Differences array overlays specific
character code mappings on top.

Previously, dictEncoder only looked at the Differences array and matched
character codes one by one, which was both slow and incorrect for fonts
that rely on BaseEncoding for most characters.

This fix:
- Builds a complete 256-entry lookup table at initialization time
- Copies the BaseEncoding table first (defaulting to PDFDocEncoding)
- Applies Differences array entries on top
- Uses O(1) lookup instead of O(n) scanning during decoding

Fixes font encoding corruption in PDFs where fonts use custom Encoding
dictionaries with BaseEncoding + Differences (common in legal documents).
Builds on fix/tounicode-priority branch.

Changes:
- Add FontEncodingChain for multi-layer font decoding:
  1. ToUnicode CMap (authoritative per PDF spec)
  2. CIDFont chain for Type0 fonts
  3. BaseEncoding + Differences
  4. Glyph name resolution (Adobe Glyph List patterns)
  5. Fallback heuristics
- Add walkTextContent() to consolidate PDF content stream parsing
- Refactor Content() and ContentWithMarks() to use shared walker
- Add CharInfo and ContentWalkOptions types
- Remove dead dictEncoder code (now handled by FontEncodingChain)
- Use math package instead of custom sqrt/atan implementations
- Add comprehensive tests for encoding chain functionality

Instead of replacing unmapped character codes with U+FFFD (which loses
information), encode them in the Private Use Area (U+E000-U+E0FF). This
allows post-processing to recover the original byte value and apply
custom decodings (e.g., shifted encodings).

Also raises the replacement threshold from 20% to 50% to allow more
text through for post-processing.
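The PUA round trip can be sketched as follows (function names are hypothetical):

```go
package main

import "fmt"

// puaForByte parks an unmapped character code in the Private Use Area
// block U+E000..U+E0FF instead of discarding it as U+FFFD.
func puaForByte(b byte) rune { return 0xE000 + rune(b) }

// byteFromPUA recovers the original code during post-processing;
// ok is false for runes outside the reserved block.
func byteFromPUA(r rune) (byte, bool) {
	if r >= 0xE000 && r <= 0xE0FF {
		return byte(r - 0xE000), true
	}
	return 0, false
}

func main() {
	r := puaForByte(0x41)
	b, ok := byteFromPUA(r)
	fmt.Printf("U+%04X -> 0x%02X %v\n", r, b, ok)
}
```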
The previous PUA preservation fix only addressed encoding_chain.go
(layers 3-5). Pages with partial ToUnicode coverage (e.g., 70% valid,
30% unmapped) would pass validation and never reach the PUA-preserving
layers.

This fix adds PUA preservation directly to cmap.Decode() in page.go,
converting the 3 locations that used noRune (U+FFFD) to use PUA instead:
- Line 311: bfrange with unknown destination type
- Line 315: codespace match but no bfchar/bfrange mapping
- Line 323: no codespace range matches

Rewrite GetPlainText to use walkTextContent with position-based word
boundary detection, similar to MuPDF's approach:

- Use character width × 0.2 as gap threshold for detecting word breaks
- Use font size × 0.5 as threshold for detecting line breaks
- Insert spaces when gap between characters exceeds threshold

This fixes PDFs where text runs together without explicit space
characters in the content stream. The previous implementation just
concatenated all text without considering character positions.
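The gap heuristic above can be sketched like this, assuming a hypothetical `charBox` type carrying the positions the walker emits:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// charBox is a hypothetical positioned character from the text walker.
type charBox struct {
	R        rune
	X, Y     float64 // baseline origin
	Width    float64 // advance width
	FontSize float64
}

// joinWithBoundaries inserts a space when the horizontal gap exceeds
// 20% of the previous character's width, and a newline when the
// baseline moves by more than half the font size.
func joinWithBoundaries(chars []charBox) string {
	var b strings.Builder
	for i, c := range chars {
		if i > 0 {
			prev := chars[i-1]
			if math.Abs(prev.Y-c.Y) > prev.FontSize*0.5 {
				b.WriteByte('\n')
			} else if c.X-(prev.X+prev.Width) > prev.Width*0.2 {
				b.WriteByte(' ')
			}
		}
		b.WriteRune(c.R)
	}
	return b.String()
}

func main() {
	chars := []charBox{
		{R: 'H', X: 0, Y: 700, Width: 10, FontSize: 12},
		{R: 'i', X: 10.5, Y: 700, Width: 4, FontSize: 12}, // tiny gap: same word
		{R: 't', X: 20, Y: 700, Width: 8, FontSize: 12},   // gap > threshold: space
	}
	fmt.Println(joinWithBoundaries(chars)) // "Hi t"
}
```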
For CID fonts with 2-byte character codes where the high byte is 0x00,
only convert the low byte to PUA. This avoids interleaved null PUA chars
(\ue000) that break shift detection in post-processing.

Matches PyMuPDF's handling of CID fonts.
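A sketch of the refinement (`puaForCID` is a hypothetical helper, not the library's actual function):

```go
package main

import "fmt"

// puaForCID maps an unmapped 2-byte CID code into the PUA. When the
// high byte is zero, only the low byte is converted, so no spurious
// U+E000 is emitted that would break shift detection downstream.
func puaForCID(code uint16) []rune {
	if code>>8 == 0 {
		return []rune{0xE000 + rune(code&0xFF)}
	}
	return []rune{0xE000 + rune(code>>8), 0xE000 + rune(code&0xFF)}
}

func main() {
	fmt.Println(len(puaForCID(0x0041)), len(puaForCID(0x1234)))
}
```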
Return an empty reader for zero-length streams before applying filters.
This prevents "unexpected EOF" errors from zlib when processing empty
FlateDecode streams in PDFs.
- Change module path from ledongthuc/pdf to ajroetker/pdf
- Add render/ submodule for PDF page rasterization (from antflydb/antfly)
- render/ depends on ajroetker/go-jpeg2000 for embedded JPEG2000 images

The encoding chain's resolveGlyphName() did not handle ligature glyph
names following the Adobe Glyph List naming convention (e.g., "t_t.liga",
"f_i", "f_f_l.liga"). When a font's ToUnicode CMap was incomplete and
didn't map certain character codes, the Differences array was the only
fallback — but ligature names like "t_t.liga" resolved to 0, causing
PUA characters to leak into the extracted text. This produced garbled
output, with invisible characters appearing where "Lettuce" should,
in PDFs using design-tool fonts with ligature substitutions.

Changes:
- Add resolveGlyphNameMulti() that decomposes ligature names by
  stripping OpenType suffixes (.liga, .alt) and splitting on '_'
- Add multiDifferences field to FontEncodingChain for multi-rune
  Differences entries
- Update Decode() to fall back from CMap PUA results to Differences
  when the CMap has gaps (resolvePUAWithDifferences)
- Fix isValidDecode test expectations to match the 50% threshold
- Add unit tests for ligature decomposition, PUA detection, CMap-to-
  Differences fallback (including full Decode dispatch end-to-end)
- Add integration tests against a real-world recipe PDF
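The decomposition step can be sketched as follows; `resolve` stands in for real Adobe Glyph List lookup, and the toy resolver in `main` only handles single characters for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveGlyphNameMulti decomposes AGL-style ligature names: strip an
// OpenType feature suffix ("t_t.liga" -> "t_t"), split on '_', and
// resolve each component with a single-name resolver. It returns nil
// when any component is unknown, so later layers can take over.
func resolveGlyphNameMulti(name string, resolve func(string) rune) []rune {
	if i := strings.IndexByte(name, '.'); i > 0 {
		name = name[:i] // drop ".liga", ".alt", and similar suffixes
	}
	out := []rune{}
	for _, part := range strings.Split(name, "_") {
		r := resolve(part)
		if r == 0 {
			return nil
		}
		out = append(out, r)
	}
	return out
}

func main() {
	resolve := func(s string) rune { // toy single-glyph resolver
		if len(s) == 1 {
			return rune(s[0])
		}
		return 0
	}
	fmt.Println(string(resolveGlyphNameMulti("t_t.liga", resolve))) // tt
	fmt.Println(string(resolveGlyphNameMulti("f_i", resolve)))      // fi
}
```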
Demonstrates extracting styled text from a PDF that uses ligature
glyphs (t_t.liga, onehalf, onequarter) in font Differences arrays.