Benchmark Optimization Session — 2026-03-08

Summary

Iterative optimization session across the DOCX and XLSX converters, each benchmarked on 180 cases (the DOCX suite was expanded from 120 → 180 cases during this session). The session fixed the one remaining XLSX page-count mismatch, improved text width estimation, added header/footer and paragraph border support, and refined table border rendering.

Before / After

Metric	DOCX Before	DOCX After	XLSX Before	XLSX After
Average score	97.0% (120 cases)	97.62% (180 cases)	96.67%	96.82%
Below 99%	—	136 / 180	78	77
Page mismatches	0	0	1	0
Unit tests	103 pass	103 pass	—	—

XLSX Key Improvement

Case	Issue	Before	After	Delta
classic09_long_text	Page count 7 vs 12 (reference)	71.38%	98.41%	+27.03 pp

DOCX Benchmark Expansion

Added 60 new DOCX test cases (classic121–180) covering:

Thin/thick borders, striped/patterned tables, heatmap tables
Styled invoices, status badge tables, kitchen-sink styles
Header/footer content, paragraph borders, nested list variants

Code Changes

1. Header/Footer Support (`DocxReader.cs` + `DocxToPdfConverter.cs`)

Problem: DOCX documents with headers/footers had missing content in the PDF output, reducing text similarity scores.

Solution:

Added ReadHeaderFooter() method to DocxReader.cs that reads headerReference / footerReference from sectPr, resolves the relationship target, parses the XML, and extracts paragraph text
Extended DocxDocument record with HeaderText and FooterText properties
Renderer places header text centered at MarginTop / 2 and footer text at MarginBottom / 2 on every page, using 9pt gray font

Files changed: DocxReader.cs (+42 lines), DocxToPdfConverter.cs (+25 lines)
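
The reading steps above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the C# ReadHeaderFooter() implementation. The function name read_header_text is hypothetical, and the sketch assumes the relationship target is a path relative to the word/ folder.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def read_header_text(docx_bytes: bytes, kind: str = "header") -> str:
    """Return the paragraph text of the first header (or footer) part, or ''."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        doc = ET.fromstring(z.read("word/document.xml"))
        # 1. Find <w:headerReference> / <w:footerReference> inside <w:sectPr>.
        ref = doc.find(f".//{{{W}}}sectPr/{{{W}}}{kind}Reference")
        if ref is None:
            return ""
        rel_id = ref.get(f"{{{R}}}id")
        # 2. Resolve the relationship id to a part name (assumed relative).
        rels = ET.fromstring(z.read("word/_rels/document.xml.rels"))
        target = next(
            (r.get("Target") for r in rels.findall(f"{{{PKG_REL}}}Relationship")
             if r.get("Id") == rel_id),
            None,
        )
        if target is None:
            return ""
        part = ET.fromstring(z.read(posixpath.join("word", target)))
        # 3. Concatenate the <w:t> runs of each paragraph.
        paragraphs = [
            "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
            for p in part.iter(f"{{{W}}}p")
        ]
        return "\n".join(p for p in paragraphs if p)
```

Calling read_header_text(data, "footer") follows the same path for footerReference. A document without the reference yields an empty string, so the renderer can skip drawing.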

2. Paragraph Border Support (`DocxReader.cs` + `DocxToPdfConverter.cs`)

Problem: Paragraphs with decorative borders (top, bottom, left, right) rendered without any visible border, causing visual score penalties.

Solution:

Added DocxBorders and DocxBorderEdge record types to represent individual border edges with width (from sz attribute, in eighths of a point) and color
ReadBorderEdge() parses <w:pBdr> child elements (<w:top>, <w:bottom>, <w:left>, <w:right>)
Renderer draws lines at paragraph boundaries after text rendering, using tracked paragraphStartY and current Y position

Files changed: DocxReader.cs (+30 lines), DocxToPdfConverter.cs (+18 lines)
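
As a rough illustration of the parsing step, the sketch below reads the four edges of a w:pBdr element and converts sz from eighths of a point to points. It is written in Python rather than C#. BorderEdge and read_paragraph_borders are hypothetical names that only mirror the record and method described above, and skipping edges whose val is "nil" or "none" is an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


@dataclass(frozen=True)
class BorderEdge:
    width_pt: float  # line width in points
    color: str       # hex RGB string or "auto"


def read_paragraph_borders(ppr_xml: str) -> Dict[str, BorderEdge]:
    """Parse the <w:pBdr> children of a <w:pPr> element into border edges."""
    ppr = ET.fromstring(ppr_xml)
    pbdr = ppr.find(f"{{{W}}}pBdr")
    edges: Dict[str, BorderEdge] = {}
    if pbdr is None:
        return edges
    for side in ("top", "bottom", "left", "right"):
        el = pbdr.find(f"{{{W}}}{side}")
        # Assumption: "nil"/"none" edges are treated as absent.
        if el is None or el.get(f"{{{W}}}val") in ("nil", "none"):
            continue
        sz = int(el.get(f"{{{W}}}sz", "0"))  # eighths of a point
        color = el.get(f"{{{W}}}color", "auto")
        edges[side] = BorderEdge(width_pt=sz / 8.0, color=color)
    return edges
```

With this conversion, sz="12" becomes a 1.5pt line.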

3. Font-Aware Text Width Estimation (`DocxToPdfConverter.cs`)

Problem: Text width was estimated using a fixed average character width (fontSize * 0.47f), causing inaccurate word wrapping and alignment — especially for tab leaders and mixed-case text.

Solution:

Replaced all avgCharWidth calculations with EstimateTextWidth() using per-character Helvetica glyph widths via GetHelveticaCharWidth()
Updated WordWrap(), ExpandTabs(), RenderRunSegments(), cell text rendering, and paragraph text rendering
Tab leader fill calculation now uses Calibri-equivalent scale factor (0.725×) with Tz operator compression for accurate dot leader rendering

Impact: Improved alignment accuracy for tab stops, centered/right-aligned text, and table cell content across all 180 DOCX cases.
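
To show why per-character widths matter, here is a small Python sketch comparing the old fixed estimate with a glyph-width lookup. The table holds only a handful of standard Helvetica advance widths (in 1/1000 em), and the fallback of 556 is an assumption for the example. The real GetHelveticaCharWidth() presumably covers the full character set.

```python
# Partial Helvetica advance widths in 1/1000 em (subset for illustration).
HELVETICA_WIDTHS = {
    " ": 278, ".": 278, "i": 222, "l": 222, "t": 278, "r": 333,
    "a": 556, "e": 556, "n": 556, "o": 556, "m": 833, "w": 722,
    "M": 833, "W": 944,
}
DEFAULT_WIDTH = 556  # assumed fallback for characters missing from the subset


def estimate_text_width(text: str, font_size: float) -> float:
    """Sum per-character advance widths and scale to the font size (points)."""
    units = sum(HELVETICA_WIDTHS.get(ch, DEFAULT_WIDTH) for ch in text)
    return units * font_size / 1000.0


def fixed_width_estimate(text: str, font_size: float) -> float:
    """The previous approach: one average width for every character."""
    return len(text) * font_size * 0.47


if __name__ == "__main__":
    for sample in ("ill", "WWW"):
        print(sample, estimate_text_width(sample, 10.0),
              fixed_width_estimate(sample, 10.0))
```

At 10pt the fixed estimate gives both three-character strings the same width, while the lookup puts "ill" well below it and "WWW" at roughly double. That spread is what drives wrap and alignment errors.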

4. Ascent-Aware Top-of-Page Positioning (`DocxToPdfConverter.cs`)

Problem: Text at the top of a new page started at the margin boundary, but the visual top of text is baseline + fontSize × AscentRatio, causing a downward offset relative to LibreOffice reference output.

Solution:

Introduced AscentRatio = 1.075f constant (Helvetica ascent ratio)
At every top-of-page position (paragraphs, run segments, table rows), advance Y by fontSize * AscentRatio before rendering
Added IsTopOfPage tracking to the rendering state
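
The positioning rule can be illustrated with a short Python sketch. It assumes a layout coordinate where Y grows downward from the top edge of the page, which may differ from the converter's internal convention. The function name first_baseline_y is hypothetical.

```python
ASCENT_RATIO = 1.075  # constant quoted in the change description


def first_baseline_y(current_y: float, font_size: float, is_top_of_page: bool) -> float:
    """Return the Y at which to render the next line of text.

    Assumption: Y increases downward. At the top of a page the cursor sits on
    the margin boundary, so it is advanced by the ascent before drawing.
    """
    if is_top_of_page:
        return current_y + font_size * ASCENT_RATIO
    return current_y
```

For 12pt text starting at a 72pt top margin, the cursor moves down by about 12.9pt before the first line is rendered. Positions elsewhere on the page are left unchanged.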

5. Table Border Rendering Refactor (`DocxToPdfConverter.cs`)

Problem: Cell-by-cell border drawing produced doubled lines at shared boundaries, making borders visually thicker than reference output.

Solution:

Moved border drawing from per-cell to per-row: draws horizontal lines at row top (first row only) and bottom, plus vertical lines at each column boundary
Tracks isFirstRow flag, resets on page breaks so top border is redrawn on new pages
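
The doubling problem and the per-row fix can be seen by counting line segments. The Python sketch below is a simplified model and not the converter's code: it assumes a regular grid given by column X positions and row Y positions, and it ignores page breaks. Both function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # x1, y1, x2, y2


def per_cell_segments(col_x: List[float], row_y: List[float]) -> List[Segment]:
    """Old approach: four edges per cell, so shared edges are emitted twice."""
    segs: List[Segment] = []
    for r in range(len(row_y) - 1):
        for c in range(len(col_x) - 1):
            x0, x1 = col_x[c], col_x[c + 1]
            y0, y1 = row_y[r], row_y[r + 1]
            segs += [(x0, y0, x1, y0), (x0, y1, x1, y1),
                     (x0, y0, x0, y1), (x1, y0, x1, y1)]
    return segs


def per_row_segments(col_x: List[float], row_y: List[float]) -> List[Segment]:
    """New approach: top line on the first row only, bottom line on every
    row, and one vertical line per column boundary."""
    segs: List[Segment] = []
    is_first_row = True  # the real renderer also resets this on page breaks
    for r in range(len(row_y) - 1):
        y0, y1 = row_y[r], row_y[r + 1]
        if is_first_row:
            segs.append((col_x[0], y0, col_x[-1], y0))
            is_first_row = False
        segs.append((col_x[0], y1, col_x[-1], y1))
        for x in col_x:
            segs.append((x, y0, x, y1))
    return segs
```

For a 2×2 table the per-cell version emits 16 segments, 4 of which are duplicates drawn on top of each other. The per-row version emits 9 segments with no duplicates.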

6. Nested Table Text Flattening (`DocxReader.cs`)

Problem: Nested tables emitted one paragraph per cell, causing excessive line breaks in extracted text.

Solution:

Changed nested table handling to join each row's cell text into a single paragraph with space separators, reducing visual and text divergence.
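
A minimal Python sketch of the flattening rule follows. The nested table is modeled as a list of rows, each a list of cell strings. Dropping empty cells and empty rows is an assumption made for the example, and flatten_nested_table is a hypothetical name.

```python
from typing import List


def flatten_nested_table(rows: List[List[str]]) -> List[str]:
    """Emit one paragraph per row by joining its cell text with spaces."""
    paragraphs: List[str] = []
    for row in rows:
        # Assumption: whitespace-only cells contribute nothing.
        cells = [cell.strip() for cell in row if cell and cell.strip()]
        if cells:
            paragraphs.append(" ".join(cells))
    return paragraphs
```

A two-column nested table with rows ["Name", "Qty"] and ["Bolt", "4"] then yields two paragraphs instead of four.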

7. XLSX Virtual Wrap Fix (`ExcelToPdfConverter.cs`)

Problem: classic09_long_text (single column, 1000+ character cells) produced 7 pages versus 12 in the reference. The virtual row height calculation didn't account for cell content margins, so the assumed wrap width was too wide and each cell wrapped into too few lines.

Solution:

Changed wrap width calculation from ExcelSheet.CharUnitsToPoints(8.43f) to (defaultColPts - 11f) * CalibriFittingScale
The 11pt deduction accounts for cell content margins (left + right padding); CalibriFittingScale corrects for Helvetica vs Calibri width differences
Uses sheet.DefaultColumnWidth when available instead of hardcoded 8.43

Impact: classic09 page count: 7 → 12 (matches reference). Score: 71.38% → 98.41%.
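
The effect of the narrower wrap width can be sketched in Python. The 11pt deduction comes from the description above. The column width, scale factor, and text width passed in the usage note are hypothetical placeholders, not the converter's actual values, and both function names are assumptions.

```python
import math

CELL_MARGIN_PTS = 11.0  # left + right content margin quoted above


def wrap_width_pts(default_col_pts: float, calibri_fitting_scale: float) -> float:
    """Usable text width of a default-width column, in points."""
    return (default_col_pts - CELL_MARGIN_PTS) * calibri_fitting_scale


def wrapped_line_count(text_width_pts: float, wrap_width: float) -> int:
    """Number of lines needed when text of the given width wraps at wrap_width."""
    return max(1, math.ceil(text_width_pts / wrap_width))
```

With placeholder inputs of a 48pt column and a scale of 0.5, text that is 100pt wide needs 6 lines once the margin is deducted, compared with 5 lines if the full column width is scaled. Over a sheet of very long cells, this kind of per-cell increase adds up to additional pages.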

8. XLSX Deferred Overflow Emission (`ExcelToPdfConverter.cs`)

Problem: Virtual overflow pages were emitted immediately after each row, but LibreOffice renders all row content first, then appends empty overflow pages at the end of the sheet.

Solution:

Overflow height is now summed into accumulatedOverflowHeight across all rows instead of emitting pages per row
After all rows are processed, emit empty overflow pages in a single pass at the end
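
The deferred ordering can be illustrated with a small Python model. It is a sketch under stated assumptions: per-row overflow is given as a height in points, and the number of empty pages is taken as the accumulated height divided by the page height, rounded up. The actual rounding rule in ExcelToPdfConverter.cs is not stated above, and both function names are hypothetical.

```python
import math
from typing import List


def deferred_overflow_pages(row_overflows: List[float], page_height: float) -> int:
    """Accumulate overflow across all rows, then convert it to a page count."""
    accumulated = 0.0
    for overflow in row_overflows:
        accumulated += max(0.0, overflow)
    if accumulated <= 0.0:
        return 0
    return math.ceil(accumulated / page_height)  # assumed rounding rule


def build_page_sequence(content_pages: int, row_overflows: List[float],
                        page_height: float) -> List[str]:
    """All content pages first, then the empty overflow pages in one pass."""
    extra = deferred_overflow_pages(row_overflows, page_height)
    return ["content"] * content_pages + ["overflow"] * extra
```

In this model, overflow pages always trail the content pages, matching the ordering attributed to LibreOffice above.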

9. XLSX Pie Chart Legend Text (`ExcelToPdfConverter.cs`)

Problem: Pie chart legends rendered color swatches but no category name text, causing text extraction differences.

Solution:

Added page.AddText(legendName, legendTextX, legendY, labelFontSize) after each legend swatch rectangle to display category names as extractable text.

10. Default Spacing After Change (`DocxToPdfConverter.cs`)

Problem: Default spacing after paragraphs used fontSize * 0.35f which varied with font size, producing inconsistent gaps compared to Word's fixed 8pt default.

Solution:

Changed default spacing after from fontSize * 0.35f to a fixed 8f points (matching Word's default of 160 twips = 8pt, at 20 twips per point).
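
The arithmetic behind the change is shown below as a short Python sketch. The function names are hypothetical. The 0.35 factor and the 160 twips figure come from the description above.

```python
TWIPS_PER_POINT = 20  # a twip is one twentieth of a point


def twips_to_points(twips: float) -> float:
    """Convert a WordprocessingML spacing value from twips to points."""
    return twips / TWIPS_PER_POINT


def old_spacing_after(font_size: float) -> float:
    """Previous font-size-dependent default."""
    return font_size * 0.35


if __name__ == "__main__":
    print("fixed default:", twips_to_points(160))
    for size in (11.0, 24.0):
        print(size, "pt font, old default:", old_spacing_after(size))
```

The old rule gives about 3.85pt after an 11pt paragraph and 8.4pt after a 24pt heading. Body text therefore sat much closer together than in the reference, which uses 8pt regardless of font size.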

Dead Ends & Failed Experiments

Several approaches were attempted and reverted because of regressions. One further idea was ruled out without a full attempt:

Experiment	Result	Reason
Bullet as text "•" (U+2022)	REVERTED	Visual regression outweighed text extraction improvement (avg 97.62% → 97.61%)
Bullet as rectangle + tiny 1pt text overlay	REVERTED	Visual artifacts from dual rendering (avg dropped to 97.55%)
FontMetricsFactor 1.17 → 1.15	REVERTED	Caused page mismatch regressions
cellPaddingV 1 → 2	REVERTED	Tables overflow to new pages
Column padding changes (XLSX)	NOT ATTEMPTED	Any value change breaks text extraction

Root Causes of Remaining Below-99% Cases

DOCX visual floor (~97%): Helvetica vs Calibri font mismatch creates ~2–3% visual difference; unfixable without embedding Calibri
DOCX shading height: FontMetricsFactor × lineSpacing makes shading blocks ~29% taller than reference; tied to line height calculation
XLSX text extraction: Unicode/CJK/emoji characters limited by WinAnsiEncoding
XLSX chart rendering: 29/31 chart cases below 99%; structural differences from LibreOffice's vector-graphic chart exports

Score Distribution

DOCX (180 cases)

Bucket	Count
≥ 0.99 (Excellent)	44
0.95 – 0.98	98
0.90 – 0.94	33
0.80 – 0.89	3
< 0.80	2

XLSX (180 cases)

Bucket	Count
≥ 0.99 (Excellent)	103
0.95 – 0.98	43
0.90 – 0.94	22
0.80 – 0.89	9
< 0.80	3

Files Changed

File	Lines Changed	Summary
`src/MiniPdf/DocxReader.cs`	+103	Header/footer reader, paragraph borders, nested table flattening, border edge records
`src/MiniPdf/DocxToPdfConverter.cs`	+299/−62	Font-aware widths, ascent positioning, header/footer rendering, paragraph borders, table border refactor, tab leader improvements
`src/MiniPdf/ExcelToPdfConverter.cs`	+62/−31	Virtual wrap fix, deferred overflow, pie chart legend text

Benchmark Commands Reference

# Clear build cache (required before each benchmark cycle)
Remove-Item -Recurse -Force "$env:TEMP\dotnet\runfile" -ErrorAction SilentlyContinue

# The steps below assume the shell starts in the repository root;
# Push-Location / Pop-Location return there after each step.

# XLSX: Convert → Compare → Stats
Push-Location tests\MiniPdf.Scripts; dotnet run convert_xlsx_to_pdf.cs; Pop-Location
Push-Location tests\MiniPdf.Benchmark; python compare_pdfs.py --report-dir reports
python _xlsx_stats.py; Pop-Location

# DOCX: Convert → Compare → Stats
Push-Location tests\MiniPdf.Scripts; dotnet run convert_docx_to_pdf.cs; Pop-Location
Push-Location tests\MiniPdf.Benchmark; python compare_pdfs.py --minipdf-dir ../MiniPdf.Scripts/pdf_output_docx --reference-dir reference_pdfs_docx --report-dir reports_docx
python _final_docx_stats.py; Pop-Location

# Unit tests
dotnet test tests/MiniPdf.Tests --no-restore --verbosity minimal
