Benchmark Optimization Session 2026 03 08
Iterative optimization session across the DOCX and XLSX converters, each measured on its expanded 180-case benchmark. The session fixed the one remaining XLSX page mismatch, improved text width estimation, added header/footer and paragraph border support, and refined table border rendering. It also expanded the DOCX test cases from 120 to 180.
| Metric | DOCX Before | DOCX After | XLSX Before | XLSX After |
|---|---|---|---|---|
| Average score | 97.0% (120 cases) | 97.62% (180 cases) | 96.67% | 96.82% |
| Below 99% | — | 136 / 180 | 78 | 77 |
| Page mismatches | 0 | 0 | 1 | 0 |
| Unit tests | 103 pass | 103 pass | — | — |
| Case | Issue | Before | After | Delta |
|---|---|---|---|---|
| classic09_long_text | Page count 7 vs 12 | 71.38% | 98.41% | +27.0% |
Added 60 new DOCX test cases (classic121–180) covering:
- Thin/thick borders, striped/patterned tables, heatmap tables
- Styled invoices, status badge tables, kitchen-sink styles
- Header/footer content, paragraph borders, nested list variants
Problem: DOCX documents with headers/footers had missing content in the PDF output, reducing text similarity scores.
Solution:
- Added `ReadHeaderFooter()` method to `DocxReader.cs` that reads `headerReference`/`footerReference` from `sectPr`, resolves the relationship target, parses the XML, and extracts paragraph text
- Extended `DocxDocument` record with `HeaderText` and `FooterText` properties
- Renderer places header text centered at `MarginTop / 2` and footer text at `MarginBottom / 2` on every page, using 9pt gray font
Files changed: DocxReader.cs (+42 lines), DocxToPdfConverter.cs (+25 lines)
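The lookup chain described above (section properties, then relationship ID, then header part, then paragraph text) can be sketched as follows. This is a minimal Python illustration, not the project's C# `ReadHeaderFooter()`; the function name `read_header_text`, the `parts` dictionary, and the sample XML are assumptions made for the example, and only the default header is handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def read_header_text(document_xml, rels_xml, parts):
    """Return the header text referenced from sectPr, or "" if none is found.

    `parts` is a hypothetical mapping of part name -> XML string standing in
    for the entries of the .docx zip package.
    """
    doc = ET.fromstring(document_xml)
    ref = doc.find(f".//{{{W}}}sectPr/{{{W}}}headerReference")
    if ref is None:
        return ""
    rel_id = ref.get(f"{{{R}}}id")

    # Resolve the relationship ID to the target part name.
    rels = ET.fromstring(rels_xml)
    target = next(
        (rel.get("Target")
         for rel in rels.findall(f"{{{PKG_REL}}}Relationship")
         if rel.get("Id") == rel_id),
        None,
    )
    if target is None or target not in parts:
        return ""

    # Concatenate the runs of each paragraph, then join paragraphs.
    header = ET.fromstring(parts[target])
    paragraphs = []
    for p in header.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text:
            paragraphs.append(text)
    return " ".join(paragraphs)


DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}" xmlns:r="{R}"><w:body>'
    '<w:sectPr><w:headerReference w:type="default" r:id="rId7"/></w:sectPr>'
    "</w:body></w:document>"
)
RELS_XML = (
    f'<Relationships xmlns="{PKG_REL}">'
    '<Relationship Id="rId7" Type="header" Target="header1.xml"/>'
    "</Relationships>"
)
HEADER_XML = (
    f'<w:hdr xmlns:w="{W}"><w:p>'
    "<w:r><w:t>Quarterly </w:t></w:r><w:r><w:t>Report</w:t></w:r>"
    "</w:p></w:hdr>"
)

print(read_header_text(DOCUMENT_XML, RELS_XML, {"header1.xml": HEADER_XML}))
```

The footer case would follow the same path with `footerReference` in place of `headerReference`.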
Problem: Paragraphs with decorative borders (top, bottom, left, right) rendered without any visible border, causing visual score penalties.
Solution:
- Added `DocxBorders` and `DocxBorderEdge` record types to represent individual border edges with width (from `sz` attribute, in eighths of a point) and color
- `ReadBorderEdge()` parses `<w:pBdr>` child elements (`<w:top>`, `<w:bottom>`, `<w:left>`, `<w:right>`)
- Renderer draws lines at paragraph boundaries after text rendering, using tracked `paragraphStartY` and current Y position
Files changed: DocxReader.cs (+30 lines), DocxToPdfConverter.cs (+18 lines)
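As an illustration of the `sz` conversion (eighths of a point to points), here is a small Python sketch of reading border edges from a paragraph's properties. It is not the project's `ReadBorderEdge()`; the `BorderEdge` class, the fallback `sz` of 4, and the mapping of `auto` to black are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


@dataclass(frozen=True)
class BorderEdge:
    width_pt: float  # line width in points
    color: str       # RRGGBB hex string


def read_paragraph_borders(ppr_xml):
    """Parse <w:pBdr> children into a dict keyed by side name."""
    ppr = ET.fromstring(ppr_xml)
    pbdr = ppr.find(f"{{{W}}}pBdr")
    edges = {}
    if pbdr is None:
        return edges
    for side in ("top", "bottom", "left", "right"):
        el = pbdr.find(f"{{{W}}}{side}")
        if el is None or el.get(f"{{{W}}}val") in ("nil", "none"):
            continue  # edge absent or explicitly disabled
        # sz is stored in eighths of a point; 4 is an assumed fallback.
        sz = int(el.get(f"{{{W}}}sz", "4"))
        color = el.get(f"{{{W}}}color", "auto")
        edges[side] = BorderEdge(sz / 8.0, "000000" if color == "auto" else color)
    return edges


PPR_XML = (
    f'<w:pPr xmlns:w="{W}"><w:pBdr>'
    '<w:top w:val="single" w:sz="12" w:color="FF0000"/>'
    '<w:bottom w:val="single" w:sz="4" w:color="auto"/>'
    '<w:left w:val="nil"/>'
    "</w:pBdr></w:pPr>"
)

borders = read_paragraph_borders(PPR_XML)
print(borders["top"].width_pt)  # sz=12 is 12/8 = 1.5pt
```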
Problem: Text width was estimated using a fixed average character width (fontSize * 0.47f), causing inaccurate word wrapping and alignment — especially for tab leaders and mixed-case text.
Solution:
- Replaced all `avgCharWidth` calculations with `EstimateTextWidth()` using per-character Helvetica glyph widths via `GetHelveticaCharWidth()`
- Updated `WordWrap()`, `ExpandTabs()`, `RenderRunSegments()`, cell text rendering, and paragraph text rendering
- Tab leader fill calculation now uses Calibri-equivalent scale factor (0.725×) with `Tz` operator compression for accurate dot leader rendering
Impact: Improved alignment accuracy for tab stops, centered/right-aligned text, and table cell content across all 180 DOCX cases.
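To see why the fixed average misleads, the Python sketch below compares the old `fontSize * 0.47` estimate with a per-glyph sum. The width table is a small excerpt of the standard Helvetica metrics (units of 1/1000 em); the function names and the 556 fallback for unlisted characters are assumptions for this example, not the project's `EstimateTextWidth()`.

```python
# Excerpt of Helvetica advance widths in 1/1000 em units.
HELVETICA_WIDTHS = {
    " ": 278, ".": 278, "i": 222, "l": 222, "t": 278,
    "a": 556, "e": 556, "n": 556, "o": 556,
    "m": 833, "w": 722, "M": 833, "W": 944,
}
DEFAULT_WIDTH = 556  # assumed fallback for characters not in the excerpt


def estimate_text_width(text, font_size):
    """Sum per-character glyph widths and scale to points."""
    units = sum(HELVETICA_WIDTHS.get(ch, DEFAULT_WIDTH) for ch in text)
    return units * font_size / 1000.0


def average_text_width(text, font_size):
    """The earlier fixed-average estimate: every character is 0.47 em."""
    return len(text) * font_size * 0.47


for sample in ("illit", "WMWMW"):
    print(sample,
          round(average_text_width(sample, 10), 2),
          round(estimate_text_width(sample, 10), 2))
```

Both samples have five characters, so the fixed average gives 23.5pt for each at 10pt, while the per-glyph sums are about 11.66pt and 44.98pt. That spread is what skews wrapping and tab alignment for narrow or wide runs of text.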
Problem: Text at the top of a new page started at the margin boundary, but the visual top of text is baseline + fontSize × AscentRatio, causing a downward offset relative to LibreOffice reference output.
Solution:
- Introduced `AscentRatio = 1.075f` constant (Helvetica ascent ratio)
- At every top-of-page position (paragraphs, run segments, table rows), advance Y by `fontSize * AscentRatio` before rendering
- Added `IsTopOfPage` tracking to the rendering state
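The top-of-page adjustment amounts to one conditional offset. The Python sketch below assumes a layout cursor whose Y grows downward from the top margin; the function name `place_line` is hypothetical, and the converter's real coordinate convention may differ.

```python
ASCENT_RATIO = 1.075  # constant quoted above


def place_line(cursor_y, font_size, is_top_of_page):
    """Return (y, is_top_of_page) for the next line of text.

    Assumes Y is measured downward from the top of the page, so
    "advance" means adding to the cursor.
    """
    if is_top_of_page:
        cursor_y += font_size * ASCENT_RATIO
    # Anything rendered after this line is no longer at the top of the page.
    return cursor_y, False


y, at_top = place_line(72.0, 12.0, True)
print(round(y, 2), at_top)
```

For a 12pt font and a 72pt top margin, the first line moves down by 12 × 1.075 = 12.9pt; subsequent lines on the same page are unaffected.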
Problem: Cell-by-cell border drawing produced doubled lines at shared boundaries, making borders visually thicker than reference output.
Solution:
- Moved border drawing from per-cell to per-row: draws horizontal lines at row top (first row only) and bottom, plus vertical lines at each column boundary
- Tracks `isFirstRow` flag, resets on page breaks so top border is redrawn on new pages
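The per-row scheme can be sketched in Python as a function that returns the line segments for each row. The names and the segment representation are assumptions for illustration; the point is that every shared edge is produced once.

```python
def row_border_segments(col_x, row_top, row_bottom, is_first_row):
    """Segments ((x1, y1), (x2, y2)) for one table row."""
    left, right = col_x[0], col_x[-1]
    segments = []
    if is_first_row:
        # Only the first row on a page draws its own top edge.
        segments.append(((left, row_top), (right, row_top)))
    segments.append(((left, row_bottom), (right, row_bottom)))
    for x in col_x:
        segments.append(((x, row_top), (x, row_bottom)))
    return segments


def table_border_segments(col_x, row_edges, page_break_before=()):
    """Collect segments for all rows; row_edges holds consecutive Y edges.

    page_break_before lists row indexes that start a new page. A real
    renderer would also give those rows fresh coordinates on the new page.
    """
    segments = []
    is_first_row = True
    for index in range(len(row_edges) - 1):
        if index in page_break_before:
            is_first_row = True
        segments += row_border_segments(
            col_x, row_edges[index], row_edges[index + 1], is_first_row
        )
        is_first_row = False
    return segments


segs = table_border_segments([0, 50, 100], [0, 20, 40])
print(len(segs), len(set(segs)))
```

For a 2×2 table this yields 9 distinct segments, whereas drawing four edges per cell would issue 16 line operations with the interior edges drawn twice.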
Problem: Nested tables emitted one paragraph per cell, causing excessive line breaks in extracted text.
Solution:
- Changed nested table handling to join each row's cell text into a single paragraph with space separators, reducing visual and text divergence.
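A minimal Python sketch of the flattening step follows, assuming each nested row arrives as a list of cell strings. The function name is hypothetical, and skipping empty cells is an assumption the text above does not state.

```python
def flatten_nested_table(rows):
    """Join each row's cell text into one space-separated paragraph."""
    paragraphs = []
    for cells in rows:
        text = " ".join(cell.strip() for cell in cells if cell.strip())
        if text:
            paragraphs.append(text)
    return paragraphs


nested = [["Item", "Qty"], ["Widget", "3"], ["", ""]]
print(flatten_nested_table(nested))
```

Two populated rows produce two paragraphs instead of four, which is closer to how row text appears when extracted from the reference PDF.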
Problem: classic09_long_text (single column, 1000+ character cells) produced 7 pages vs 12 reference. The virtual row height calculation didn't account for cell content margins.
Solution:
- Changed wrap width calculation from `ExcelSheet.CharUnitsToPoints(8.43f)` to `(defaultColPts - 11f) * CalibriFittingScale`
- The 11pt deduction accounts for cell content margins (left + right padding); `CalibriFittingScale` corrects for Helvetica vs Calibri width differences
- Uses `sheet.DefaultColumnWidth` when available instead of hardcoded 8.43
Impact: classic09 page count: 7 → 12 (matches reference). Score: 71.38% → 98.41%.
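The effect of the narrower wrap width on line count can be worked through in a few lines of Python. The column width of 48pt and the scale of 0.9 used below are made-up inputs, since the actual values of `defaultColPts` and `CalibriFittingScale` are not given here; only the 11pt deduction and the shape of the formula come from the text above.

```python
import math

CELL_MARGIN_PTS = 11.0  # left + right cell padding, per the text above


def wrap_width_pts(default_col_pts, calibri_fitting_scale):
    """Usable text width inside a default-width column."""
    return (default_col_pts - CELL_MARGIN_PTS) * calibri_fitting_scale


def wrapped_lines(text_width_pts, width_pts):
    """Number of lines needed for text of the given unwrapped width."""
    return max(1, math.ceil(text_width_pts / width_pts))


# Hypothetical inputs, chosen only to show the direction of the change.
col_pts, scale = 48.0, 0.9
narrow = wrap_width_pts(col_pts, scale)
print(wrapped_lines(1000.0, col_pts), wrapped_lines(1000.0, narrow))
```

With these inputs the same text wraps to 21 lines at the full column width but 31 lines at the reduced width. An effect of this kind, accumulated over long cells, is consistent with the page count rising from 7 to 12.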
Problem: Virtual overflow pages were emitted immediately after each row, but LibreOffice renders all row content first, then appends empty overflow pages at the end of the sheet.
Solution:
- Accumulated overflow in `accumulatedOverflowHeight` across all rows instead of emitting pages per-row
- After all rows are processed, emit empty overflow pages in a single pass at the end
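The ordering difference between the two strategies can be shown with a short Python sketch. The row model (a name plus an overflow height) and the rule of one empty page per full page height of overflow are assumptions for illustration; the converter's actual page arithmetic may differ.

```python
import math


def emit_per_row(rows, page_height):
    """Old behaviour: empty overflow pages follow each row immediately."""
    out = []
    for name, overflow in rows:
        out.append(name)
        out += ["<empty>"] * math.ceil(overflow / page_height)
    return out


def emit_deferred(rows, page_height):
    """New behaviour: accumulate overflow, emit empty pages once at the end."""
    out = []
    accumulated_overflow_height = 0.0
    for name, overflow in rows:
        out.append(name)
        accumulated_overflow_height += overflow
    out += ["<empty>"] * math.ceil(accumulated_overflow_height / page_height)
    return out


rows = [("A", 400.0), ("B", 0.0), ("C", 800.0)]
print(emit_per_row(rows, 400.0))
print(emit_deferred(rows, 400.0))
```

Both variants emit three empty pages for this input, but the deferred one keeps rows A, B, and C contiguous, matching the order described for LibreOffice.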
Problem: Pie chart legends rendered color swatches but no category name text, causing text extraction differences.
Solution:
- Added `page.AddText(legendName, legendTextX, legendY, labelFontSize)` after each legend swatch rectangle to display category names as extractable text.
Problem: Default spacing after paragraphs used fontSize * 0.35f which varied with font size, producing inconsistent gaps compared to Word's fixed 8pt default.
Solution:
- Changed default spacing after from `fontSize * 0.35f` to a fixed `8f` points (matching Word's default of 160 twips = 8pt).
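The unit conversion behind the fixed value is shown below in Python: a twip is 1/20 of a point, so 160 twips is exactly 8pt. The function names are hypothetical, and treating a missing value as the 8pt default mirrors the change described above.

```python
TWIPS_PER_POINT = 20
DEFAULT_SPACING_AFTER_PT = 8.0  # Word's default of 160 twips


def spacing_after_pt(spacing_after_twips=None):
    """Spacing after a paragraph in points; falls back to the fixed default."""
    if spacing_after_twips is None:
        return DEFAULT_SPACING_AFTER_PT
    return spacing_after_twips / TWIPS_PER_POINT


def old_spacing_after_pt(font_size):
    """The earlier font-size-dependent default."""
    return font_size * 0.35


print(spacing_after_pt(), spacing_after_pt(240))
print(round(old_spacing_after_pt(11), 2), round(old_spacing_after_pt(24), 2))
```

Under the old rule the default gap ranged from about 3.85pt at 11pt text to 8.4pt at 24pt text; the fixed default gives 8pt regardless of font size.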
Several approaches were attempted but reverted due to regressions; one further change was ruled out without being attempted:
| Experiment | Result | Reason |
|---|---|---|
| Bullet as text "•" (U+2022) | REVERTED | Visual regression outweighed text extraction improvement (avg 97.62% → 97.61%) |
| Bullet as rectangle + tiny 1pt text overlay | REVERTED | Visual artifacts from dual rendering (avg dropped to 97.55%) |
| FontMetricsFactor 1.17 → 1.15 | REVERTED | Caused page mismatch regressions |
| cellPaddingV 1 → 2 | REVERTED | Tables overflow to new pages |
| Column padding changes (XLSX) | NOT ATTEMPTED | Any value change breaks text extraction |
- DOCX visual floor (~97%): Helvetica vs Calibri font mismatch creates ~2–3% visual difference; unfixable without embedding Calibri
- DOCX shading height: `FontMetricsFactor × lineSpacing` makes shading blocks ~29% taller than reference; tied to line height calculation
- XLSX text extraction: Unicode/CJK/emoji characters limited by WinAnsiEncoding
- XLSX chart rendering: 29/31 chart cases below 99%; structural differences from LibreOffice's vector-graphic chart exports
DOCX score distribution (180 cases):
| Bucket | Count |
|---|---|
| ≥ 0.99 (Excellent) | 44 |
| 0.95 – 0.98 | 98 |
| 0.90 – 0.94 | 33 |
| 0.80 – 0.89 | 3 |
| < 0.80 | 2 |
XLSX score distribution (180 cases):
| Bucket | Count |
|---|---|
| ≥ 0.99 (Excellent) | 103 |
| 0.95 – 0.98 | 43 |
| 0.90 – 0.94 | 22 |
| 0.80 – 0.89 | 9 |
| < 0.80 | 3 |
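A distribution like the ones tabulated here can be computed with a small bucketing function. The Python sketch below is not the project's stats script; it assumes half-open buckets (for example, 0.95 up to but not including 0.99) and uses a few made-up scores alongside the two classic09 values quoted earlier.

```python
from collections import Counter

# (lower bound, label), checked from highest to lowest.
BUCKETS = [
    (0.99, "≥ 0.99 (Excellent)"),
    (0.95, "0.95 – 0.98"),
    (0.90, "0.90 – 0.94"),
    (0.80, "0.80 – 0.89"),
]


def bucket_for(score):
    """Return the label of the first bucket whose lower bound is met."""
    for lower, label in BUCKETS:
        if score >= lower:
            return label
    return "< 0.80"


def distribution(scores):
    """Count scores per bucket."""
    return Counter(bucket_for(s) for s in scores)


def below_99(scores):
    """Number of cases under the 'Excellent' threshold."""
    return sum(1 for s in scores if s < 0.99)


sample_scores = [0.9841, 0.995, 0.7138, 0.93]  # illustrative values only
print(distribution(sample_scores))
print(below_99(sample_scores))
```

The "Below 99%" figure is the sum of every bucket except the first, which gives a quick consistency check between a distribution table and a summary row.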
| File | Lines Changed | Summary |
|---|---|---|
| `src/MiniPdf/DocxReader.cs` | +103 | Header/footer reader, paragraph borders, nested table flattening, border edge records |
| `src/MiniPdf/DocxToPdfConverter.cs` | +299/−62 | Font-aware widths, ascent positioning, header/footer rendering, paragraph borders, table border refactor, tab leader improvements |
| `src/MiniPdf/ExcelToPdfConverter.cs` | +62/−31 | Virtual wrap fix, deferred overflow, pie chart legend text |
# Clear build cache (required before each benchmark cycle)
Remove-Item -Recurse -Force "$env:TEMP\dotnet\runfile" -ErrorAction SilentlyContinue
# XLSX: Convert → Compare → Stats
cd tests\MiniPdf.Scripts; dotnet run convert_xlsx_to_pdf.cs
cd tests\MiniPdf.Benchmark; python compare_pdfs.py --report-dir reports
python _xlsx_stats.py
# DOCX: Convert → Compare → Stats
cd tests\MiniPdf.Scripts; dotnet run convert_docx_to_pdf.cs
cd tests\MiniPdf.Benchmark; python compare_pdfs.py --minipdf-dir ../MiniPdf.Scripts/pdf_output_docx --reference-dir reference_pdfs_docx --report-dir reports_docx
python _final_docx_stats.py
# Unit tests
dotnet test tests/MiniPdf.Tests --no-restore --verbosity minimal