refactor: extract shared text walker and add encoding chain#62

Open
ajroetker wants to merge 13 commits into ledongthuc:master from ajroetker:refactor/text-walker-extraction

Conversation


@ajroetker ajroetker commented Jan 3, 2026

This PR builds on #61 and adds:

  • FontEncodingChain: Multi-layer font decoding that prioritizes:

    1. ToUnicode CMap (authoritative per PDF spec)
    2. CIDFont chain for Type0 fonts
    3. BaseEncoding + Differences
    4. Glyph name resolution (Adobe Glyph List patterns)
    5. Fallback heuristics
  • walkTextContent(): Shared method to consolidate PDF content stream parsing, reducing duplication between Content() and the new ContentWithMarks()
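The layered fallback can be sketched as an ordered chain of decoders. This is a minimal illustration of the dispatch order described above; the names (`decoder`, `encodingChain`, `mapDecoder`) are illustrative, not the PR's actual types.

```go
package main

import "fmt"

// decoder is one layer of a hypothetical font-decoding chain.
type decoder interface {
	// Decode reports the runes for a character code; ok is false
	// when this layer has no mapping and the next layer should run.
	Decode(code uint32) (rs []rune, ok bool)
}

// mapDecoder is a trivial table-backed layer for illustration.
type mapDecoder map[uint32][]rune

func (m mapDecoder) Decode(code uint32) ([]rune, bool) {
	rs, ok := m[code]
	return rs, ok
}

// encodingChain tries layers in priority order: ToUnicode first,
// then CIDFont, BaseEncoding+Differences, glyph names, heuristics.
type encodingChain struct{ layers []decoder }

func (c *encodingChain) Decode(code uint32) []rune {
	for _, l := range c.layers {
		if rs, ok := l.Decode(code); ok {
			return rs
		}
	}
	// No layer matched: park the low byte in the Private Use Area
	// (U+E000..U+E0FF) so post-processing can recover it.
	return []rune{rune(0xE000 + code&0xFF)}
}

func main() {
	toUnicode := mapDecoder{0x41: []rune("A")}
	chain := &encodingChain{layers: []decoder{toUnicode}}
	fmt.Printf("%q %q\n", string(chain.Decode(0x41)), string(chain.Decode(0x99)))
}
```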

ajroetker and others added 13 commits January 2, 2026 22:36

The PDF specification states that ToUnicode CMap is the authoritative
source for character-to-Unicode mapping. Previously, the library only
checked ToUnicode for fonts with "Identity-H" encoding or null encoding,
causing incorrect text extraction for many PDFs.

This change:
- Checks ToUnicode CMap first before falling back to Encoding
- Falls back to pdfDocEncoding instead of nopEncoder for better
  compatibility with unknown encodings
- Removes the now-redundant charmapEncoding() method

This fixes text extraction issues where characters were being
incorrectly decoded (e.g., '0' appearing as 'M') due to ToUnicode
being ignored when an Encoding entry was present.
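As a sketch of the corrected lookup order (with `decodeCode`, `toUnicode`, and `encTable` as illustrative stand-ins for the parsed PDF structures, not the library's API):

```go
package main

import "fmt"

// decodeCode consults the font's ToUnicode CMap first and only falls
// back to the Encoding-derived table when the CMap has no entry.
func decodeCode(code byte, toUnicode map[byte]rune, encTable [256]rune) rune {
	if r, ok := toUnicode[code]; ok {
		return r // authoritative per the PDF spec
	}
	return encTable[code] // BaseEncoding / PDFDocEncoding fallback
}

func main() {
	var enc [256]rune
	enc['0'] = 'M' // a stale Encoding entry that previously won
	toUnicode := map[byte]rune{'0': '0'}
	fmt.Printf("%c\n", decodeCode('0', toUnicode, enc)) // ToUnicode wins: '0'
}
```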
The PDF spec (section 9.6.6) requires that when an Encoding dictionary
is present, the BaseEncoding (e.g., WinAnsiEncoding, MacRomanEncoding)
should be applied first, then the Differences array overlays specific
character code mappings on top.

Previously, dictEncoder only looked at the Differences array and matched
character codes one by one, which was both slow and incorrect for fonts
that rely on BaseEncoding for most characters.

This fix:
- Builds a complete 256-entry lookup table at initialization time
- Copies the BaseEncoding table first (defaulting to PDFDocEncoding)
- Applies Differences array entries on top
- Uses O(1) lookup instead of O(n) scanning during decoding

Fixes font encoding corruption in PDFs where fonts use custom Encoding
dictionaries with BaseEncoding + Differences (common in legal documents).
Builds on fix/tounicode-priority branch.

Changes:
- Add FontEncodingChain for multi-layer font decoding:
  1. ToUnicode CMap (authoritative per PDF spec)
  2. CIDFont chain for Type0 fonts
  3. BaseEncoding + Differences
  4. Glyph name resolution (Adobe Glyph List patterns)
  5. Fallback heuristics
- Add walkTextContent() to consolidate PDF content stream parsing
- Refactor Content() and ContentWithMarks() to use shared walker
- Add CharInfo and ContentWalkOptions types
- Remove dead dictEncoder code (now handled by FontEncodingChain)
- Use math package instead of custom sqrt/atan implementations
- Add comprehensive tests for encoding chain functionality

Instead of replacing unmapped character codes with U+FFFD (which loses
information), encode them in the Private Use Area (U+E000-U+E0FF). This
allows post-processing to recover the original byte value and apply
custom decodings (e.g., shifted encodings).

Also raises the replacement threshold from 20% to 50% to allow more
text through for post-processing.
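The PUA round trip can be sketched as follows (function names are hypothetical):

```go
package main

import "fmt"

// puaForByte parks an unmapped character code in the Private Use Area
// block U+E000..U+E0FF instead of discarding it as U+FFFD.
func puaForByte(b byte) rune { return 0xE000 + rune(b) }

// byteFromPUA recovers the original code during post-processing;
// ok is false for runes outside the reserved block.
func byteFromPUA(r rune) (byte, bool) {
	if r >= 0xE000 && r <= 0xE0FF {
		return byte(r - 0xE000), true
	}
	return 0, false
}

func main() {
	r := puaForByte(0x41)
	b, ok := byteFromPUA(r)
	fmt.Printf("U+%04X -> 0x%02X %v\n", r, b, ok)
}
```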
The previous PUA preservation fix only addressed encoding_chain.go
(layers 3-5). Pages with partial ToUnicode coverage (e.g., 70% valid,
30% unmapped) would pass validation and never reach the PUA-preserving
layers.

This fix adds PUA preservation directly to cmap.Decode() in page.go,
converting the 3 locations that used noRune (U+FFFD) to use PUA instead:
- Line 311: bfrange with unknown destination type
- Line 315: codespace match but no bfchar/bfrange mapping
- Line 323: no codespace range matches

Rewrite GetPlainText to use walkTextContent with position-based word
boundary detection, similar to MuPDF's approach:

- Use character width × 0.2 as gap threshold for detecting word breaks
- Use font size × 0.5 as threshold for detecting line breaks
- Insert spaces when gap between characters exceeds threshold

This fixes PDFs where text runs together without explicit space
characters in the content stream. The previous implementation just
concatenated all text without considering character positions.
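The gap heuristic above can be sketched like this, assuming a hypothetical `charBox` type carrying the positions the walker emits:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// charBox is a hypothetical positioned character from the text walker.
type charBox struct {
	R        rune
	X, Y     float64 // baseline origin
	Width    float64 // advance width
	FontSize float64
}

// joinWithBoundaries inserts a space when the horizontal gap exceeds
// 20% of the previous character's width, and a newline when the
// baseline moves by more than half the font size.
func joinWithBoundaries(chars []charBox) string {
	var b strings.Builder
	for i, c := range chars {
		if i > 0 {
			prev := chars[i-1]
			if math.Abs(prev.Y-c.Y) > prev.FontSize*0.5 {
				b.WriteByte('\n')
			} else if c.X-(prev.X+prev.Width) > prev.Width*0.2 {
				b.WriteByte(' ')
			}
		}
		b.WriteRune(c.R)
	}
	return b.String()
}

func main() {
	chars := []charBox{
		{R: 'H', X: 0, Y: 700, Width: 10, FontSize: 12},
		{R: 'i', X: 10.5, Y: 700, Width: 4, FontSize: 12}, // tiny gap: same word
		{R: 't', X: 20, Y: 700, Width: 8, FontSize: 12},   // gap > threshold: space
	}
	fmt.Println(joinWithBoundaries(chars)) // "Hi t"
}
```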
For CID fonts with 2-byte character codes where the high byte is 0x00,
only convert the low byte to PUA. This avoids interleaved null PUA chars
(\ue000) that break shift detection in post-processing.

Matches PyMuPDF's handling of CID fonts.
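A sketch of the refinement (`puaForCID` is a hypothetical helper, not the library's actual function):

```go
package main

import "fmt"

// puaForCID maps an unmapped 2-byte CID code into the PUA. When the
// high byte is zero, only the low byte is converted, so no spurious
// U+E000 is emitted that would break shift detection downstream.
func puaForCID(code uint16) []rune {
	if code>>8 == 0 {
		return []rune{0xE000 + rune(code&0xFF)}
	}
	return []rune{0xE000 + rune(code>>8), 0xE000 + rune(code&0xFF)}
}

func main() {
	fmt.Println(len(puaForCID(0x0041)), len(puaForCID(0x1234)))
}
```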
Return an empty reader for zero-length streams before applying filters.
This prevents "unexpected EOF" errors from zlib when processing empty
FlateDecode streams in PDFs.
- Change module path from ledongthuc/pdf to ajroetker/pdf
- Add render/ submodule for PDF page rasterization (from antflydb/antfly)
- render/ depends on ajroetker/go-jpeg2000 for embedded JPEG2000 images

The encoding chain's resolveGlyphName() did not handle ligature glyph
names following the Adobe Glyph List naming convention (e.g., "t_t.liga",
"f_i", "f_f_l.liga"). When a font's ToUnicode CMap was incomplete and
didn't map certain character codes, the Differences array was the only
fallback — but ligature names like "t_t.liga" resolved to 0, causing
PUA characters to leak into the extracted text. This produced garbled
output, with invisible characters appearing where "Lettuce" should,
in PDFs using design-tool fonts with ligature substitutions.

Changes:
- Add resolveGlyphNameMulti() that decomposes ligature names by
  stripping OpenType suffixes (.liga, .alt) and splitting on '_'
- Add multiDifferences field to FontEncodingChain for multi-rune
  Differences entries
- Update Decode() to fall back from CMap PUA results to Differences
  when the CMap has gaps (resolvePUAWithDifferences)
- Fix isValidDecode test expectations to match the 50% threshold
- Add unit tests for ligature decomposition, PUA detection, CMap-to-
  Differences fallback (including full Decode dispatch end-to-end)
- Add integration tests against a real-world recipe PDF
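The decomposition step can be sketched as follows; `resolve` stands in for real Adobe Glyph List lookup, and the toy resolver in `main` only handles single characters for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveGlyphNameMulti decomposes AGL-style ligature names: strip an
// OpenType feature suffix ("t_t.liga" -> "t_t"), split on '_', and
// resolve each component with a single-name resolver. It returns nil
// when any component is unknown, so later layers can take over.
func resolveGlyphNameMulti(name string, resolve func(string) rune) []rune {
	if i := strings.IndexByte(name, '.'); i > 0 {
		name = name[:i] // drop ".liga", ".alt", and similar suffixes
	}
	out := []rune{}
	for _, part := range strings.Split(name, "_") {
		r := resolve(part)
		if r == 0 {
			return nil
		}
		out = append(out, r)
	}
	return out
}

func main() {
	resolve := func(s string) rune { // toy single-glyph resolver
		if len(s) == 1 {
			return rune(s[0])
		}
		return 0
	}
	fmt.Println(string(resolveGlyphNameMulti("t_t.liga", resolve))) // tt
	fmt.Println(string(resolveGlyphNameMulti("f_i", resolve)))      // fi
}
```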
Demonstrates extracting styled text from a PDF that uses ligature
glyphs (t_t.liga, onehalf, onequarter) in font Differences arrays.