test_courier_td_bug.pdf
Summary
docling-parse incorrectly calculates character widths when parsing PDFs that use Td (text displacement) commands with long text strings. The resulting word bounding boxes are consistently about 16.7% too narrow: 5pt per character instead of the correct 6pt for Courier at 10pt, so the correct width is 1.2× the reported one.
Environment
- docling-parse version: Latest (tested with current main branch)
- Python version: 3.12
- Operating System: Linux
Problem Description
When parsing PDFs that use Td commands followed by text strings containing spaces, docling-parse produces bounding boxes that are systematically too narrow. The issue appears to be caused by using a hardcoded character width assumption (5pt) rather than reading actual font metrics from the PDF.
Expected Behavior
For a Courier 10pt font, each character should be 6.00 points wide (Courier is a monospace font with 600 units per character in standard PDF metrics, which equals 6pt at 10pt font size).
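The arithmetic behind the expected value can be sketched as follows. This is only an illustration of the standard glyph-space conversion (1000 units per em for Type1 fonts); the function name `glyph_width_pt` is made up for this example and is not part of docling-parse.

```python
def glyph_width_pt(glyph_units: float, font_size: float) -> float:
    """Convert a glyph width given in 1/1000 em units to points."""
    return glyph_units * font_size / 1000.0


COURIER_UNITS = 600  # every glyph in the standard Courier metrics is 600 units wide

expected = glyph_width_pt(COURIER_UNITS, 10)  # width the parser should use
observed = glyph_width_pt(500, 10)            # width that matches the reported boxes

print(expected, observed, expected / observed)
```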
Actual Behavior
docling-parse calculates character widths as 5.00 points, resulting in:
- Bounding boxes that are 5/6 (≈83.3%) of their correct width
- Word positions that drift progressively left as they appear later in the text string
- A consistent 1.2× ratio between correct width and docling-parse width
Reproduction
Minimal Test PDF
I've attached test_courier_td_bug.pdf which reproduces this issue. The content stream is:
BT
/F001 10 Tf
72 720 Td
( ABC12345 COMPANY NAME INC. DOCUMENT TYPE) Tj
0 -11 Td
(ID=99999 REF1234567890123) Tj
0 -11 Td
(CUSTOMER : GENERIC STORE NAME LOCATION FROM : WESTERN) Tj
ET
With a minimal font dictionary (no /Widths array):
<<
/Type /Font
/Subtype /Type1
/BaseFont /Courier
>>
Key characteristics that trigger the bug:
- Standard Type1 Courier font with no explicit width table (relies on base font metrics)
- Uses a Td command to set the initial text position (e.g., 72 720 Td)
- Text is shown with a Tj command containing a single long string (90+ characters) with many embedded spaces
- Each Td command is followed by a Tj with text
The critical element is the long string with embedded spaces - docling-parse must walk through character-by-character to determine where each word begins and ends, using character widths to calculate positions.
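To make the character-by-character walk concrete, here is a minimal sketch of how word boxes follow from a per-character advance. It assumes a monospace font and ignores character spacing (Tc), word spacing (Tw) and horizontal scaling (Tz); `word_boxes` is a hypothetical helper written for this report, not docling-parse code.

```python
def word_boxes(text: str, x0: float, char_width: float):
    """Return (word, x_start, x_end) for each space-delimited word.

    Assumes every character, including the space, advances by char_width.
    """
    boxes = []
    start = None  # index of the first character of the current word
    for i, ch in enumerate(text):
        if ch != " " and start is None:
            start = i
        elif ch == " " and start is not None:
            boxes.append((text[start:i], x0 + start * char_width, x0 + i * char_width))
            start = None
    if start is not None:  # word running to the end of the string
        boxes.append((text[start:], x0 + start * char_width, x0 + len(text) * char_width))
    return boxes


line = "ID=99999 REF1234567890123"  # second Tj string from the test PDF
correct = word_boxes(line, 72.0, 6.0)  # Courier 10pt metrics
shrunk = word_boxes(line, 72.0, 5.0)   # width that matches the observed output

for (word, a0, a1), (_, b0, b1) in zip(correct, shrunk):
    print(f"{word}: correct x=[{a0}, {a1}]  with 5pt advance x=[{b0}, {b1}]")
```

With a 5pt advance the second word starts 9pt too far left and is 16pt too narrow, which is the pattern described above.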
Impact
This bug affects:
- PDFs using Type1 standard fonts (Courier, Times, Helvetica) without explicit width tables
- PDFs generated by tools that use Td commands with text strings containing spaces
- Use cases requiring accurate word-level bounding boxes (text extraction, OCR correction, layout analysis)
The error compounds with string length:
- Word at character position 10: 10pt error
- Word at character position 50: 50pt error
- Can be 100+ points off for words late in long strings
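The numbers above follow from a 1pt shortfall per character (6pt correct minus 5pt reported). A small sketch, with the hypothetical helper `position_error_pt` and the two widths taken from this report:

```python
def position_error_pt(char_index: int, correct_width: float = 6.0,
                      reported_width: float = 5.0) -> float:
    """Horizontal drift of the character at char_index (0-based) within one Tj string."""
    return char_index * (correct_width - reported_width)


for n in (10, 50, 100):
    print(f"character {n}: {position_error_pt(n)}pt too far left")
```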
Additional Test Cases
I've tested multiple fonts, all showing the same 5pt character width issue:
| Font | Expected Width (10pt) | Actual Width | Ratio |
| --- | --- | --- | --- |
| Courier | 6.00pt | 5.00pt | 1.20× |
| Courier-Bold | 6.00pt | 5.00pt | 1.20× |
| Courier-Oblique | 6.00pt | 5.00pt | 1.20× |
| Helvetica | Varies (non-monospace) | 5.00pt | Wrong |
| Times-Roman | Varies (non-monospace) | 5.00pt | Wrong |
This suggests that a 500-unit fallback width (5pt at a 10pt font size) is being applied universally instead of the proper base font metrics.
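For illustration, the lookup order I would expect looks roughly like the sketch below. I have not read the docling-parse source, so this is an assumption about the intended behavior, not a description of the actual code. The font dictionary is modeled as a plain Python dict, `resolve_width_units` is a hypothetical name, and only the Courier family is covered because its metrics are a single constant; proportional fonts such as Helvetica and Times-Roman would need their full per-glyph metric tables.

```python
# Standard Courier variants: every glyph is 600 units wide.
STANDARD_MONOSPACE_UNITS = {
    "Courier": 600,
    "Courier-Bold": 600,
    "Courier-Oblique": 600,
    "Courier-BoldOblique": 600,
}


def resolve_width_units(font: dict, char_code: int, default_units: int = 500) -> int:
    """Pick a glyph width: explicit /Widths first, then base font metrics, then a default."""
    widths = font.get("Widths")
    if widths is not None:
        index = char_code - font.get("FirstChar", 0)
        if 0 <= index < len(widths):
            return widths[index]
    base_font = font.get("BaseFont")
    if base_font in STANDARD_MONOSPACE_UNITS:
        return STANDARD_MONOSPACE_UNITS[base_font]
    # Last resort only; 500 mirrors the value that seems to be used today.
    return default_units


test_font = {"Type": "Font", "Subtype": "Type1", "BaseFont": "Courier"}
print(resolve_width_units(test_font, ord("A")))  # font from the attached PDF
```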