Wei Lin edited this page Mar 8, 2026 · 1 revision

Benchmark Optimization Session — 2026-03-08

Summary

Iterative optimization session across both the DOCX and XLSX converters, run on the expanded 180-case benchmark for each format. The session fixed the single XLSX page mismatch, improved text width estimation, added header/footer and paragraph border support, and refined table border rendering. It also expanded the DOCX test cases from 120 to 180.


Before / After

| Metric | DOCX Before | DOCX After | XLSX Before | XLSX After |
|---|---|---|---|---|
| Average score | 97.0% (120 cases) | 97.62% (180 cases) | 96.67% | 96.82% |
| Below 99% | | 136 / 180 | 78 | 77 |
| Page mismatches | 0 | 0 | 1 | 0 |
| Unit tests | 103 pass | 103 pass | | |

XLSX Key Improvement

| Case | Issue | Before | After | Delta |
|---|---|---|---|---|
| classic09_long_text | Page count 7 vs 12 | 71.38% | 98.41% | +27.03 pp |

DOCX Benchmark Expansion

Added 60 new DOCX test cases (classic121–180) covering:

  • Thin/thick borders, striped/patterned tables, heatmap tables
  • Styled invoices, status badge tables, kitchen-sink styles
  • Header/footer content, paragraph borders, nested list variants

Code Changes

1. Header/Footer Support (DocxReader.cs + DocxToPdfConverter.cs)

Problem: DOCX documents with headers/footers had missing content in the PDF output, reducing text similarity scores.

Solution:

  • Added ReadHeaderFooter() method to DocxReader.cs that reads headerReference / footerReference from sectPr, resolves the relationship target, parses the XML, and extracts paragraph text
  • Extended DocxDocument record with HeaderText and FooterText properties
  • Renderer places header text centered at MarginTop / 2 and footer text at MarginBottom / 2 on every page, using 9pt gray font

Files changed: DocxReader.cs (+42 lines), DocxToPdfConverter.cs (+25 lines)
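
The reader steps listed above (find the reference in sectPr, resolve the relationship, extract paragraph text) can be sketched in Python as follows. This is an illustrative sketch and not the C# code in DocxReader.cs: the function names `resolve_reference` and `extract_paragraph_text` and the sample XML are hypothetical, and joining paragraphs with a single space is an assumption.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"


def resolve_reference(document_xml, rels_xml, kind):
    """Return the part target of the first header/footer reference in sectPr.

    kind is "header" or "footer"; returns None when no reference exists.
    """
    doc = ET.fromstring(document_xml)
    ref = doc.find(f".//{{{W}}}sectPr/{{{W}}}{kind}Reference")
    if ref is None:
        return None
    rel_id = ref.get(f"{{{R}}}id")
    # Look the relationship id up in word/_rels/document.xml.rels.
    for rel in ET.fromstring(rels_xml).iter(f"{{{PKG}}}Relationship"):
        if rel.get("Id") == rel_id:
            return rel.get("Target")
    return None


def extract_paragraph_text(part_xml):
    """Join the text of every non-empty <w:p> in a header or footer part."""
    paragraphs = []
    for p in ET.fromstring(part_xml).iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text.strip():
            paragraphs.append(text)
    return " ".join(paragraphs)  # assumption: paragraphs joined by one space


# Hypothetical sample parts, trimmed to the elements the sketch reads.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}" xmlns:r="{R}">
  <w:body>
    <w:sectPr><w:headerReference w:type="default" r:id="rId7"/></w:sectPr>
  </w:body>
</w:document>"""
RELS_XML = f"""<Relationships xmlns="{PKG}">
  <Relationship Id="rId7" Type="{R}/header" Target="header1.xml"/>
</Relationships>"""
HEADER_XML = f"""<w:hdr xmlns:w="{W}">
  <w:p><w:r><w:t>Quarterly </w:t></w:r><w:r><w:t>Report</w:t></w:r></w:p>
</w:hdr>"""

target = resolve_reference(DOCUMENT_XML, RELS_XML, "header")
header_text = extract_paragraph_text(HEADER_XML)
```

In a real package the target would then be read from the `word/` folder of the DOCX archive before its text is extracted.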

2. Paragraph Border Support (DocxReader.cs + DocxToPdfConverter.cs)

Problem: Paragraphs with decorative borders (top, bottom, left, right) rendered without any visible border, causing visual score penalties.

Solution:

  • Added DocxBorders and DocxBorderEdge record types to represent individual border edges with width (from sz attribute, in eighths of a point) and color
  • ReadBorderEdge() parses <w:pBdr> child elements (<w:top>, <w:bottom>, <w:left>, <w:right>)
  • Renderer draws lines at paragraph boundaries after text rendering, using tracked paragraphStartY and current Y position

Files changed: DocxReader.cs (+30 lines), DocxToPdfConverter.cs (+18 lines)
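
To make the unit conversion concrete, here is a minimal Python sketch of reading the four border edges from a `<w:pBdr>` element. It is not the ReadBorderEdge() implementation: the function name `read_border_edges`, the tuple layout, skipping edges whose `w:val` is `nil` or `none`, and mapping the `auto` color to black are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_border_edges(ppr_xml):
    """Return {side: (width_in_points, hex_color)} for a <w:pPr> element."""
    ppr = ET.fromstring(ppr_xml)
    pbdr = ppr.find(f"{{{W}}}pBdr")
    edges = {}
    if pbdr is None:
        return edges
    for side in ("top", "bottom", "left", "right"):
        edge = pbdr.find(f"{{{W}}}{side}")
        # Assumption: edges explicitly turned off are not drawn.
        if edge is None or edge.get(f"{{{W}}}val") in ("nil", "none"):
            continue
        # w:sz is expressed in eighths of a point.
        width_pt = int(edge.get(f"{{{W}}}sz", "0")) / 8.0
        color = edge.get(f"{{{W}}}color", "auto")
        # Assumption: "auto" is rendered as black.
        edges[side] = (width_pt, "000000" if color == "auto" else color)
    return edges


# Hypothetical sample paragraph properties.
PPR_XML = f"""<w:pPr xmlns:w="{W}">
  <w:pBdr>
    <w:top w:val="single" w:sz="12" w:color="FF0000"/>
    <w:bottom w:val="single" w:sz="4" w:color="auto"/>
    <w:left w:val="nil"/>
  </w:pBdr>
</w:pPr>"""

borders = read_border_edges(PPR_XML)
```

With this sample, `sz="12"` becomes a 1.5pt line and `sz="4"` becomes a 0.5pt line.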

3. Font-Aware Text Width Estimation (DocxToPdfConverter.cs)

Problem: Text width was estimated using a fixed average character width (fontSize * 0.47f), causing inaccurate word wrapping and alignment — especially for tab leaders and mixed-case text.

Solution:

  • Replaced all avgCharWidth calculations with EstimateTextWidth() using per-character Helvetica glyph widths via GetHelveticaCharWidth()
  • Updated WordWrap(), ExpandTabs(), RenderRunSegments(), cell text rendering, and paragraph text rendering
  • Tab leader fill calculation now uses Calibri-equivalent scale factor (0.725×) with Tz operator compression for accurate dot leader rendering

Impact: Improved alignment accuracy for tab stops, centered/right-aligned text, and table cell content across all 180 DOCX cases.
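
The difference between the two estimators is easiest to see on narrow and wide glyphs. The Python sketch below is illustrative only: the width table is a small subset of the standard Helvetica metrics (in 1/1000 em), the 556 fallback for unlisted characters is an assumption, and the function names are hypothetical stand-ins for EstimateTextWidth() and the old fixed-average calculation.

```python
# Subset of standard Helvetica advance widths, in 1/1000 of the font size.
HELVETICA_WIDTHS = {
    " ": 278, ".": 278, "i": 222, "l": 222, "j": 222, "t": 278, "f": 278,
    "r": 333, "a": 556, "e": 556, "m": 833, "M": 833, "W": 944, "I": 278,
}
FALLBACK_WIDTH = 556  # assumption: used for characters not in the subset


def estimate_text_width(text, font_size):
    """Sum per-character glyph widths and scale to points."""
    units = sum(HELVETICA_WIDTHS.get(ch, FALLBACK_WIDTH) for ch in text)
    return units * font_size / 1000.0


def fixed_average_width(text, font_size):
    """The previous approach: every character is fontSize * 0.47 wide."""
    return len(text) * font_size * 0.47


narrow = estimate_text_width("ill", 10)  # three 222-unit glyphs
wide = estimate_text_width("WWW", 10)    # three 944-unit glyphs
flat = fixed_average_width("ill", 10)    # same value for "WWW"
```

At 10pt the fixed average gives about 14.1pt for both strings, while the per-glyph sum gives about 6.66pt for "ill" and 28.32pt for "WWW", which is the kind of error that shifts wrap points and tab leaders.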

4. Ascent-Aware Top-of-Page Positioning (DocxToPdfConverter.cs)

Problem: Text at the top of a new page started at the margin boundary, but the visual top of text is baseline + fontSize × AscentRatio, causing a downward offset relative to LibreOffice reference output.

Solution:

  • Introduced AscentRatio = 1.075f constant (Helvetica ascent ratio)
  • At every top-of-page position (paragraphs, run segments, table rows), advance Y by fontSize * AscentRatio before rendering
  • Added IsTopOfPage tracking to the rendering state

5. Table Border Rendering Refactor (DocxToPdfConverter.cs)

Problem: Cell-by-cell border drawing produced doubled lines at shared boundaries, making borders visually thicker than reference output.

Solution:

  • Moved border drawing from per-cell to per-row: draws horizontal lines at row top (first row only) and bottom, plus vertical lines at each column boundary
  • Tracks isFirstRow flag, resets on page breaks so top border is redrawn on new pages
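
A small Python sketch shows why per-row drawing removes the doubled lines. It assumes a simple grid with shared column boundaries and no merged cells, uses hypothetical function names, and leaves out the page-break reset of the first-row flag.

```python
def per_cell_segments(col_xs, row_ys):
    """Old approach: every cell draws all four of its edges."""
    segments = []
    for top, bottom in zip(row_ys, row_ys[1:]):
        for left, right in zip(col_xs, col_xs[1:]):
            segments += [
                (left, top, right, top),        # cell top
                (left, bottom, right, bottom),  # cell bottom
                (left, top, left, bottom),      # cell left
                (right, top, right, bottom),    # cell right
            ]
    return segments


def per_row_segments(col_xs, row_ys):
    """New approach: one top line (first row only), one bottom line per row,
    and one vertical line per column boundary."""
    segments = []
    is_first_row = True
    left, right = col_xs[0], col_xs[-1]
    for top, bottom in zip(row_ys, row_ys[1:]):
        if is_first_row:
            segments.append((left, top, right, top))
            is_first_row = False
        segments.append((left, bottom, right, bottom))
        segments.extend((x, top, x, bottom) for x in col_xs)
    return segments


# A 2 x 2 table: three column boundaries and three row boundaries.
cols, rows = [0, 50, 100], [0, 20, 40]
cell_based = per_cell_segments(cols, rows)
row_based = per_row_segments(cols, rows)
```

For this 2 x 2 grid the per-cell version emits 16 segments of which 4 are exact duplicates on shared edges, while the per-row version emits 9 segments with no duplicates.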

6. Nested Table Text Flattening (DocxReader.cs)

Problem: Nested tables emitted one paragraph per cell, causing excessive line breaks in extracted text.

Solution:

  • Changed nested table handling to join each row's cell text into a single paragraph with space separators, reducing visual and text divergence.
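
As a rough illustration of the flattening, the Python sketch below turns each nested-table row into one paragraph string. The function name is hypothetical, and trimming cells and skipping empty ones are assumptions that the source does not state.

```python
def flatten_nested_table(rows):
    """Join each row's cell text with single spaces, one paragraph per row."""
    paragraphs = []
    for row in rows:
        # Assumption: whitespace-only cells contribute nothing to the row.
        cells = [cell.strip() for cell in row if cell.strip()]
        paragraphs.append(" ".join(cells))
    return paragraphs


nested = [["Name", "Qty"], ["Widget", " 3 "], ["Total", "", "3"]]
flattened = flatten_nested_table(nested)
```

Here seven cells that previously produced one paragraph each collapse into three paragraphs, one per row.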

7. XLSX Virtual Wrap Fix (ExcelToPdfConverter.cs)

Problem: classic09_long_text (single column, 1000+ character cells) produced 7 pages versus 12 in the reference. The virtual row height calculation didn't account for cell content margins.

Solution:

  • Changed wrap width calculation from ExcelSheet.CharUnitsToPoints(8.43f) to (defaultColPts - 11f) * CalibriFittingScale
  • The 11pt deduction accounts for cell content margins (left + right padding); CalibriFittingScale corrects for Helvetica vs Calibri width differences
  • Uses sheet.DefaultColumnWidth when available instead of hardcoded 8.43

Impact: classic09 page count: 7 → 12 (matches reference). Score: 71.38% → 98.41%.
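
The effect of the narrower wrap width on the virtual line count can be sketched in Python. The numbers are hypothetical: the source does not give the value of CalibriFittingScale or the point width of the default column, so 0.9 and 48.0 are placeholders, and the line-count formula is a simplified assumption, not the converter's wrapping logic.

```python
import math


def wrap_width(default_col_pts, fitting_scale, margin_pts=11.0):
    """(defaultColPts - 11) * CalibriFittingScale, as described above."""
    return (default_col_pts - margin_pts) * fitting_scale


def virtual_line_count(text_width_pts, wrap_width_pts):
    """Simplified assumption: lines = text width / wrap width, rounded up."""
    return max(1, math.ceil(text_width_pts / wrap_width_pts))


DEFAULT_COL_PTS = 48.0  # hypothetical column width in points
FITTING_SCALE = 0.9     # hypothetical stand-in for CalibriFittingScale
TEXT_WIDTH_PTS = 5000.0  # hypothetical width of one long cell's text

old_lines = virtual_line_count(TEXT_WIDTH_PTS, DEFAULT_COL_PTS)
new_lines = virtual_line_count(
    TEXT_WIDTH_PTS, wrap_width(DEFAULT_COL_PTS, FITTING_SCALE)
)
```

With these placeholder values the same text needs 105 lines at the full column width and 151 lines at the corrected width, which shows how an over-wide wrap estimate can leave the output several pages short.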

8. XLSX Deferred Overflow Emission (ExcelToPdfConverter.cs)

Problem: Virtual overflow pages were emitted immediately after each row, but LibreOffice renders all row content first, then appends empty overflow pages at the end of the sheet.

Solution:

  • Accumulates overflow height in accumulatedOverflowHeight across all rows instead of emitting pages per row
  • After all rows are processed, emit empty overflow pages in a single pass at the end
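
The ordering change can be illustrated with a short Python sketch. It models each row as a label plus an overflow height and assumes that the number of empty pages is the overflow divided by the page height, rounded up; both the model and the function names are hypothetical simplifications.

```python
import math


def emit_immediately(rows, page_height):
    """Old behaviour: empty overflow pages follow the row that caused them."""
    output = []
    for name, overflow in rows:
        output.append(name)
        output.extend(["empty"] * math.ceil(overflow / page_height))
    return output


def emit_deferred(rows, page_height):
    """New behaviour: accumulate overflow, then emit empty pages once."""
    output = []
    accumulated_overflow = 0.0
    for name, overflow in rows:
        output.append(name)
        accumulated_overflow += overflow
    output.extend(["empty"] * math.ceil(accumulated_overflow / page_height))
    return output


# Hypothetical rows: (label, overflow height in points).
rows = [("r1", 900), ("r2", 0), ("r3", 500)]
before = emit_immediately(rows, page_height=700)
after = emit_deferred(rows, page_height=700)
```

In this example the immediate version interleaves empty pages between rows, while the deferred version places all row content first and appends the empty pages at the end. The total number of empty pages can also differ, because rounding happens once on the sum.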

9. XLSX Pie Chart Legend Text (ExcelToPdfConverter.cs)

Problem: Pie chart legends rendered color swatches but no category name text, causing text extraction differences.

Solution:

  • Added page.AddText(legendName, legendTextX, legendY, labelFontSize) after each legend swatch rectangle to display category names as extractable text.

10. Default Spacing After Change (DocxToPdfConverter.cs)

Problem: Default spacing after paragraphs used fontSize * 0.35f which varied with font size, producing inconsistent gaps compared to Word's fixed 8pt default.

Solution:

  • Changed default spacing after from fontSize * 0.35f to a fixed 8f points (matching Word's default of 160 twips = 8pt, at 20 twips per point).
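
The arithmetic behind this change is small enough to check directly. The Python sketch below uses hypothetical helper names and compares the old font-size-dependent gap with the fixed value.

```python
TWIPS_PER_POINT = 20  # one point is 20 twips


def twips_to_points(twips):
    return twips / TWIPS_PER_POINT


def old_spacing_after(font_size):
    """Previous default: gap scales with the font size."""
    return font_size * 0.35


new_spacing = twips_to_points(160)    # fixed default
small_text = old_spacing_after(11)    # old gap under 11pt body text
large_text = old_spacing_after(24)    # old gap under a 24pt heading
```

The old rule gave roughly 3.85pt after 11pt text and 8.4pt after 24pt text, whereas the new rule gives 8pt in both cases.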

Dead Ends & Failed Experiments

Several approaches were tried and reverted due to regressions, and one was ruled out without being attempted:

| Experiment | Result | Reason |
|---|---|---|
| Bullet as text "•" (U+2022) | REVERTED | Visual regression outweighed text extraction improvement (avg 97.62% → 97.61%) |
| Bullet as rectangle + tiny 1pt text overlay | REVERTED | Visual artifacts from dual rendering (avg dropped to 97.55%) |
| FontMetricsFactor 1.17 → 1.15 | REVERTED | Caused page mismatch regressions |
| cellPaddingV 1 → 2 | REVERTED | Tables overflow to new pages |
| Column padding changes (XLSX) | NOT ATTEMPTED | Any value change breaks text extraction |

Root Causes of Remaining Below-99% Cases

  • DOCX visual floor (~97%): Helvetica vs Calibri font mismatch creates ~2–3% visual difference; unfixable without embedding Calibri
  • DOCX shading height: FontMetricsFactor × lineSpacing makes shading blocks ~29% taller than reference; tied to line height calculation
  • XLSX text extraction: Unicode/CJK/emoji characters limited by WinAnsiEncoding
  • XLSX chart rendering: 29/31 chart cases below 99%; structural differences from LibreOffice's vector-graphic chart exports

Score Distribution

DOCX (180 cases)

| Bucket | Count |
|---|---|
| ≥ 0.99 (Excellent) | 44 |
| 0.95 – 0.98 | 98 |
| 0.90 – 0.94 | 33 |
| 0.80 – 0.89 | 3 |
| < 0.80 | 2 |

XLSX (180 cases)

| Bucket | Count |
|---|---|
| ≥ 0.99 (Excellent) | 103 |
| 0.95 – 0.98 | 43 |
| 0.90 – 0.94 | 22 |
| 0.80 – 0.89 | 9 |
| < 0.80 | 3 |

Files Changed

| File | Lines Changed | Summary |
|---|---|---|
| src/MiniPdf/DocxReader.cs | +103 | Header/footer reader, paragraph borders, nested table flattening, border edge records |
| src/MiniPdf/DocxToPdfConverter.cs | +299/−62 | Font-aware widths, ascent positioning, header/footer rendering, paragraph borders, table border refactor, tab leader improvements |
| src/MiniPdf/ExcelToPdfConverter.cs | +62/−31 | Virtual wrap fix, deferred overflow, pie chart legend text |

Benchmark Commands Reference

# Start from the repository root; Push-Location / Pop-Location return there after each step.

# Clear build cache (required before each benchmark cycle)
Remove-Item -Recurse -Force "$env:TEMP\dotnet\runfile" -ErrorAction SilentlyContinue

# XLSX: Convert → Compare → Stats
Push-Location tests\MiniPdf.Scripts; dotnet run convert_xlsx_to_pdf.cs; Pop-Location
Push-Location tests\MiniPdf.Benchmark
python compare_pdfs.py --report-dir reports
python _xlsx_stats.py
Pop-Location

# DOCX: Convert → Compare → Stats
Push-Location tests\MiniPdf.Scripts; dotnet run convert_docx_to_pdf.cs; Pop-Location
Push-Location tests\MiniPdf.Benchmark
python compare_pdfs.py --minipdf-dir ../MiniPdf.Scripts/pdf_output_docx --reference-dir reference_pdfs_docx --report-dir reports_docx
python _final_docx_stats.py
Pop-Location

# Unit tests
dotnet test tests/MiniPdf.Tests --no-restore --verbosity minimal