DOCX to PDF Implementation Session 2026 03 05

Wei Lin edited this page Mar 5, 2026 · 1 revision

DOCX-to-PDF Implementation Session — 2026-03-05

Summary

Implemented full DOCX-to-PDF conversion for MiniPdf, following the existing XLSX-to-PDF architecture. The implementation parses Open XML (.docx) documents and renders them as paginated PDF with paragraphs, headings, tables, images, and lists. Achieved 98.7% average benchmark score across 30 test cases with all cases rated Excellent (≥ 90%).


Results

| Metric | Value |
| --- | --- |
| Average score | 98.7% |
| Cases ≥ 90% (Excellent) | 30 / 30 |
| Lowest score | 96.7% (classic08) |
| Highest score | 99.9% (classic04, classic17) |
| Unit tests | 9 / 9 passing |
| Scoring formula | 0.4 × Text + 0.4 × Visual + 0.2 × PageCount |
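
The scoring formula listed above is a plain weighted sum. A minimal sketch, assuming each component is already normalized to the range 0 to 1; the `BenchmarkScore` class and `Composite` method are hypothetical names for illustration, not part of MiniPdf or the benchmark script:

```csharp
using System;

// Hypothetical helper illustrating the weighted scoring formula.
public static class BenchmarkScore
{
    // Assumes text, visual, and pageCount are each already normalized to [0, 1].
    public static double Composite(double text, double visual, double pageCount)
    {
        return 0.4 * text + 0.4 * visual + 0.2 * pageCount;
    }
}
```

For example, a case with perfect text and page-count scores and a visual score of 0.95 works out to 0.4 + 0.38 + 0.2, or about 0.98.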

Score Distribution (30 cases)

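Counts below are tallied from the per-case scores listed under All Benchmark Scores.

| Bucket | Count |
| --- | --- |
| ≥ 0.99 | 15 |
| 0.98 – 0.989 | 8 |
| 0.97 – 0.979 | 3 |
| 0.96 – 0.969 | 4 |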

Architecture

Core Files

| File | Purpose |
| --- | --- |
| src/MiniPdf/DocxReader.cs | Parses DOCX (ZIP / Open XML) into DocxDocument model |
| src/MiniPdf/DocxToPdfConverter.cs | Renders DocxDocument → PdfDocument with page layout |
| src/MiniPdf/MiniPdf.cs | Added ConvertToPdf overload detecting .docx extension |

Data Model

DocxDocument
├── Paragraphs: List<DocxParagraph>
│   ├── Text, FontSize, Bold, Italic
│   ├── StyleId (Heading1–4, Normal, ListBullet, ListNumber)
│   ├── LineSpacing, SpacingBefore, SpacingAfter
│   ├── IsBulletList, IsNumberedList, ListText
│   ├── Images: List<DocxImage>
│   └── PageBreakBefore
├── Tables: List<DocxTable>
│   └── Rows → Cells → Paragraphs (with images)
├── Margins (Top, Bottom, Left, Right)
└── Elements: List<object> (interleaved paragraphs + tables)
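
The tree above could map onto C# classes roughly as follows. This is an illustrative sketch only: the property names follow the tree, but the types, the `DocxImage.Data` member, the row, cell, and margin classes, and the `DocxWalker` helper are all assumptions rather than MiniPdf's actual source.

```csharp
using System.Collections.Generic;

// Sketch of the data model; types and helper classes are assumed, not taken from MiniPdf.
public class DocxImage
{
    public byte[] Data { get; set; } = new byte[0]; // raw JPEG/PNG bytes (assumed member)
}

public class DocxParagraph
{
    public string Text { get; set; } = "";
    public float FontSize { get; set; } = 11f;
    public bool Bold { get; set; }
    public bool Italic { get; set; }
    public string StyleId { get; set; } = "Normal";
    public float LineSpacing { get; set; } = 1.15f;
    public float SpacingBefore { get; set; }
    public float SpacingAfter { get; set; }
    public bool IsBulletList { get; set; }
    public bool IsNumberedList { get; set; }
    public string ListText { get; set; } = "";
    public List<DocxImage> Images { get; } = new List<DocxImage>();
    public bool PageBreakBefore { get; set; }
}

public class DocxTableCell
{
    public List<DocxParagraph> Paragraphs { get; } = new List<DocxParagraph>();
}

public class DocxTableRow
{
    public List<DocxTableCell> Cells { get; } = new List<DocxTableCell>();
}

public class DocxTable
{
    public List<DocxTableRow> Rows { get; } = new List<DocxTableRow>();
}

public class DocxMargins
{
    public float Top { get; set; }
    public float Bottom { get; set; }
    public float Left { get; set; }
    public float Right { get; set; }
}

public class DocxDocument
{
    public List<DocxParagraph> Paragraphs { get; } = new List<DocxParagraph>();
    public List<DocxTable> Tables { get; } = new List<DocxTable>();
    public DocxMargins Margins { get; } = new DocxMargins();
    // Paragraphs and tables interleaved in document order.
    public List<object> Elements { get; } = new List<object>();
}

public static class DocxWalker
{
    // Walks Elements in document order, which the separate
    // Paragraphs and Tables lists cannot express on their own.
    public static List<string> Describe(DocxDocument doc)
    {
        var result = new List<string>();
        foreach (var element in doc.Elements)
        {
            switch (element)
            {
                case DocxParagraph p: result.Add("paragraph: " + p.Text); break;
                case DocxTable t: result.Add("table: " + t.Rows.Count + " rows"); break;
            }
        }
        return result;
    }
}
```

The interleaved `Elements` list is what lets a renderer emit a table between two paragraphs in the right place; the typed lists alone lose that ordering.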

Features Supported

  • Paragraphs — Normal text with font size, bold, italic formatting
  • Headings — H1 (20pt), H2 (16pt), H3 (13pt), H4 (11pt) with bold styling
  • Tables — Borders, cell padding, proportional column widths, images inside cells
  • Inline images — JPEG/PNG, scaled to fit page width or cell width
  • Bullet lists — Detected via <w:numPr> or ListBullet style name; rendered with prefix
  • Numbered lists — Detected via <w:numPr> or ListNumber style name; sequential counter per group
  • Page breaks — Explicit <w:br w:type="page"> and paragraph-level pageBreakBefore
  • Line spacing — Single, 1.15×, 1.5×, double (OOXML w:line / lineRule parsing)
  • Document margins — Read from <w:pgMar> in section properties
  • SpacingBefore / SpacingAfter — Per-paragraph spacing from styles and direct formatting
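
As one illustration of the margin handling listed above, `<w:pgMar>` attributes are expressed in twentieths of a point, so reading them amounts to a divide by 20. A minimal sketch using `System.Xml.Linq`; the `PageMarginReader` class is a hypothetical name, the 72pt fallback is an assumption, and only plain integer attribute values are handled:

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical reader for <w:pgMar>; not MiniPdf's actual implementation.
public static class PageMarginReader
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns margins in points, given a <w:sectPr> element.
    public static (float Top, float Bottom, float Left, float Right) Read(XElement sectPr)
    {
        var pgMar = sectPr.Element(W + "pgMar");
        return (Twips(pgMar, "top"), Twips(pgMar, "bottom"),
                Twips(pgMar, "left"), Twips(pgMar, "right"));
    }

    // Converts one attribute from twentieths of a point to points.
    // Falls back to 72pt (1 inch) when the attribute is missing or
    // not a plain integer; that fallback is an assumption of this sketch.
    static float Twips(XElement pgMar, string name)
    {
        var raw = pgMar?.Attribute(W + name)?.Value;
        return int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out var twips)
            ? twips / 20f
            : 72f;
    }
}
```

With `w:top="1440"` and `w:left="1800"`, this yields a 72pt top margin and a 90pt left margin.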

Key Technical Decisions

1. Line Height Formula

lineHeight = fontSize × FontMetricsFactor × lineSpacingMultiplier
  • FontMetricsFactor = 1.18 — Represents ascent + descent ratio for Calibri → Helvetica mapping
  • Derived from reference measurement: LibreOffice renders 11pt text at ~14.9pt line height → 14.9 / (11 × 1.15) = 1.178 ≈ 1.18
  • Default line spacing multiplier: 1.15 (from OOXML docDefaults w:line="276", i.e. 276/240)
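
The formula and constants above can be checked with a few lines of arithmetic. A minimal sketch; `LineMetrics` and `LineHeight` are hypothetical names, and only the two constants come from the text:

```csharp
using System;

// Hypothetical helper reproducing the line height formula.
public static class LineMetrics
{
    // Ascent + descent ratio for the Calibri to Helvetica mapping (from the text above).
    public const float FontMetricsFactor = 1.18f;

    // docDefaults w:line="276" in auto mode, i.e. 276 / 240 = 1.15.
    public const float DefaultLineSpacing = 276f / 240f;

    public static float LineHeight(float fontSize, float lineSpacingMultiplier = DefaultLineSpacing)
    {
        return fontSize * FontMetricsFactor * lineSpacingMultiplier;
    }
}
```

For 11pt text at the default spacing this gives 11 × 1.18 × 1.15 ≈ 14.93pt, which lines up with the ~14.9pt reference measurement.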

2. OOXML w:line Parsing

The w:line attribute has different semantics depending on w:lineRule:

| lineRule | Interpretation | Formula |
| --- | --- | --- |
| auto (default) | Multiplier | value / 240 |
| exact | Fixed points | value / 20 |
| atLeast | Minimum points | value / 20 |

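Root cause of classic13 regression: all w:line values were initially parsed as twips (÷20), but auto mode uses 240ths of a line. The two divisors differ by a factor of 12, so auto-mode spacing was misread and the resulting line height came out too small, causing a 4-page document to render as 3 pages.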
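
The lineRule table above translates into a small branch on the rule name. A minimal sketch, assuming the attribute values have already been read from the XML; `LineSpacingKind` and `LineSpacingParser` are hypothetical names, not MiniPdf's actual types:

```csharp
using System;

// How the parsed w:line value should be interpreted by the layout code.
public enum LineSpacingKind { Multiplier, ExactPoints, AtLeastPoints }

// Hypothetical parser for the w:line / w:lineRule pair.
public static class LineSpacingParser
{
    public static (LineSpacingKind Kind, float Value) Parse(int line, string lineRule = "auto")
    {
        switch (lineRule)
        {
            case "exact":
                return (LineSpacingKind.ExactPoints, line / 20f);   // twentieths of a point
            case "atLeast":
                return (LineSpacingKind.AtLeastPoints, line / 20f); // twentieths of a point
            default:
                // "auto", or a missing/unknown rule: 240ths of a line.
                return (LineSpacingKind.Multiplier, line / 240f);
        }
    }
}
```

Under this sketch, `Parse(276)` yields a multiplier of 1.15, `Parse(360)` yields 1.5, and `Parse(240, "exact")` yields a fixed 12pt. Dividing the auto-mode value 276 by 20 instead would give 13.8, which illustrates the misparse described above.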

3. Table Compact Spacing

Tables use a different line height formula without the font metrics factor:

// Paragraph (normal)
var lineHeight = fontSize * FontMetricsFactor * lineSpacingMul;

// Table cell (compact)
var lineHeight = effectiveFontSize * options.LineSpacing;

Cell padding = 1pt (Word uses tight cell margins). This prevents table rows from being too tall compared to the reference.

4. Style-Based List Detection

Two detection paths for lists:

  1. <w:numPr> element — Standard OOXML numbering reference (checked first)
  2. Style name fallback — If styleId starts with ListBullet or ListNumber, treat as list even without <w:numPr>

Style-based numbered lists use a sequential counter in the Read() loop, resetting when a non-numbered paragraph is encountered.
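
The style-name fallback and the resetting counter can be sketched as below. This covers only the second detection path (style names); `ListDetector`, `FromStyle`, and `Prefixes` are hypothetical names, and the "1." prefix format is an assumption:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of style-based list detection with a resetting counter.
public static class ListDetector
{
    public enum ListKind { None, Bullet, Numbered }

    // Style-name fallback: matches "ListBullet", "ListBullet2", "ListNumber", and so on.
    public static ListKind FromStyle(string styleId)
    {
        if (styleId == null) return ListKind.None;
        if (styleId.StartsWith("ListBullet", StringComparison.Ordinal)) return ListKind.Bullet;
        if (styleId.StartsWith("ListNumber", StringComparison.Ordinal)) return ListKind.Numbered;
        return ListKind.None;
    }

    // Produces one prefix per paragraph. The counter resets whenever
    // a non-numbered paragraph interrupts a numbered run.
    public static List<string> Prefixes(IEnumerable<string> styleIds)
    {
        var result = new List<string>();
        int counter = 0;
        foreach (var styleId in styleIds)
        {
            switch (FromStyle(styleId))
            {
                case ListKind.Numbered:
                    counter++;
                    result.Add(counter + ".");
                    break;
                case ListKind.Bullet:
                    counter = 0;
                    result.Add("\u2022");
                    break;
                default:
                    counter = 0;
                    result.Add("");
                    break;
            }
        }
        return result;
    }
}
```

Given the styles ListNumber, ListNumber, Normal, ListNumber, the prefixes come out as "1.", "2.", "", "1.": the Normal paragraph ends the first group, so the next numbered paragraph starts again at 1.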

5. SpacingBefore Suppression at Top of Page

SpacingBefore is suppressed for the first paragraph on each page, matching Word's behavior (LibreOffice, by contrast, always applies it). Removing the suppression caused classic30 to overflow from 3 to 4 pages.
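
To see why this rule can change the page count, consider a deliberately simplified layout loop. This is a toy model under stated assumptions (fixed paragraph heights, no splitting across pages); `PageCounter` and `CountPages` are hypothetical and do not reflect MiniPdf's actual layout code:

```csharp
using System;
using System.Collections.Generic;

// Toy pagination model showing the effect of top-of-page SpacingBefore suppression.
public static class PageCounter
{
    public static int CountPages(
        IReadOnlyList<(float SpacingBefore, float Height)> paragraphs,
        float pageHeight,
        bool suppressAtTop)
    {
        int pages = 1;
        float y = 0f; // vertical space already used on the current page

        foreach (var p in paragraphs)
        {
            bool atTop = y == 0f;
            float before = (atTop && suppressAtTop) ? 0f : p.SpacingBefore;

            // Start a new page when the paragraph does not fit on a non-empty page.
            if (!atTop && y + before + p.Height > pageHeight)
            {
                pages++;
                y = 0f;
                before = suppressAtTop ? 0f : p.SpacingBefore;
            }

            y += before + p.Height;
        }
        return pages;
    }
}
```

With two paragraphs of height 45 and SpacingBefore 10 on a 105pt page, suppression lets both fit on one page (45 + 10 + 45 = 100), while always applying the spacing pushes the second paragraph to a new page (10 + 45 + 10 + 45 = 110).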


Iteration History

Phase 1 — Initial Implementation (score: 0.953)

Basic DOCX parsing and rendering: paragraphs, headings, tables, images, basic lists. 9 unit tests created.

Phase 2 — Line Height Fix (score: 0.963 → 0.977)

  • Fixed w:line parsing: auto = ÷240, exact/atLeast = ÷20
  • Added FontMetricsFactor = 1.18f
  • classic13 fixed: 3/4 → 4/4 pages

Phase 3 — Table & List Improvements (score: 0.977 → 0.985)

  • Tables use compact spacing (no FontMetricsFactor)
  • Cell padding reduced from 2pt to 1pt
  • Style-based list detection (ListBullet/ListNumber)
  • Sequential counter for style-based numbered lists
  • classic30 fixed: 4/3 → 3/3 pages

Phase 4 — Table Cell Images (score: 0.985 → 0.987)

  • Added image rendering inside table cell paragraphs
  • Height calculation accounts for images in CalculateRowHeight()
  • classic29 improved: 0.896 → 0.987

Phase 5 — Final Tuning (score: 0.987)

  • Tested cellPadding=0.5pt → slightly worse, reverted to 1pt
  • Tested removing spacingBefore suppression → classic30 regression, reverted
  • Final score stabilized at 0.9868

Approaches Tried but Reverted

| Approach | Result | Reason |
| --- | --- | --- |
| cellPadding = 0.5pt | Score 0.9866 (−0.0002) | Marginal degradation across multiple cases |
| Remove spacingBefore suppression | classic30 regression | Page overflow (3 → 4 pages) |
| FontMetricsFactor = 1.2 | classic30 overflow | Too aggressive: 4/3 pages for classic30 |

Remaining Limitations

| Category | Examples | Root Cause |
| --- | --- | --- |
| Bullet encoding | classic08 (96.7%) | Helvetica uses U+2022 (•), Word uses Wingdings U+F0B7, so the glyphs render differently |
| Table shading | classic11 (96.9%) | No cell background color support yet |
| Word wrap differences | classic27 (96.8%) | Helvetica is wider than Calibri, which causes different line breaks |
| Font metrics mismatch | classic19 (97.4%), classic20 (97.9%) | Helvetica character widths differ from Calibri |

Apart from the missing table shading, the remaining gaps stem primarily from the fundamental font difference (Helvetica vs Calibri). True typographic equivalence would require embedding Calibri metrics or a matching font.


All Benchmark Scores

| Test Case | Score | Category |
| --- | --- | --- |
| classic01 | 0.997 | Basic paragraph |
| classic02 | 0.985 | Multiple paragraphs |
| classic03 | 0.995 | Heading styles |
| classic04 | 0.999 | Bold / italic |
| classic05 | 0.993 | Bullet list |
| classic06 | 0.999 | Numbered list |
| classic07 | 0.991 | Mixed content |
| classic08 | 0.967 | Nested bullets |
| classic09 | 0.996 | Indented paragraphs |
| classic10 | 0.990 | Table basic |
| classic11 | 0.969 | Table with shading |
| classic12 | 0.992 | Multi-page |
| classic13 | 0.968 | Long document (4 pages) |
| classic14 | 0.983 | Header levels |
| classic15 | 0.989 | Paragraph spacing |
| classic16 | 0.986 | Line spacing variants |
| classic17 | 0.999 | Simple table |
| classic18 | 0.981 | Table with merged cells |
| classic19 | 0.974 | Complex formatting |
| classic20 | 0.979 | Table with many rows |
| classic21 | 0.972 | Mixed lists |
| classic22 | 0.996 | Images |
| classic23 | 0.998 | Single image |
| classic24 | 0.981 | Table + images |
| classic25 | 0.996 | Formatted paragraphs |
| classic26 | 0.992 | Multi-section |
| classic27 | 0.968 | Long paragraphs |
| classic28 | 0.994 | List continuation |
| classic29 | 0.987 | Table with cell images |
| classic30 | 0.986 | Comprehensive report (3 pages) |

Build & Test Commands

# Build
cd d:\git\MiniPdf
dotnet build src/MiniPdf/MiniPdf.csproj

# Convert all DOCX test files to PDF
dotnet run --project tests/DocxConverter -- <input_dir> <output_dir>

# Run unit tests
dotnet test tests/MiniPdf.Tests --filter "DocxToPdfConverterTests"

# Run benchmark comparison
cd tests/MiniPdf.Benchmark
python compare_pdfs.py --minipdf-dir "..\MiniPdf.Scripts\pdf_output_docx" --reference-dir "reference_pdfs_docx" --report-dir "reports_docx"

# Update README with DOCX benchmark images
python scripts/update_readme_docx_images.py

Usage

MiniPdf.ConvertToPdf("report.docx", "output.pdf");
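
To convert a whole folder, the single call above can be wrapped in a loop. A minimal sketch; `BatchConverter` is a hypothetical helper, and the converter is passed in as a delegate so the example compiles without the library:

```csharp
using System;
using System.IO;

// Hypothetical batch wrapper around a (inputPath, outputPath) converter.
public static class BatchConverter
{
    // Converts every .docx file in inputDir and returns the number of files processed.
    public static int ConvertAll(string inputDir, string outputDir, Action<string, string> convert)
    {
        Directory.CreateDirectory(outputDir);
        int count = 0;
        foreach (var docxPath in Directory.EnumerateFiles(inputDir, "*.docx"))
        {
            var pdfName = Path.GetFileNameWithoutExtension(docxPath) + ".pdf";
            convert(docxPath, Path.Combine(outputDir, pdfName));
            count++;
        }
        return count;
    }
}
```

With the library referenced, a call might look like `BatchConverter.ConvertAll("docs", "out", (input, output) => MiniPdf.ConvertToPdf(input, output));`.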
