DOCX to PDF Implementation Session 2026-03-05
Implemented full DOCX-to-PDF conversion for MiniPdf, following the existing XLSX-to-PDF architecture. The implementation parses Open XML (.docx) documents and renders them as paginated PDF with paragraphs, headings, tables, images, and lists. Achieved 98.7% average benchmark score across 30 test cases with all cases rated Excellent (≥ 90%).
| Metric | Value |
|---|---|
| Average score | 98.7% |
| Cases ≥ 90% (Excellent) | 30 / 30 |
| Lowest score | 96.7% (classic08) |
| Highest score | 99.9% (classic04, classic17) |
| Unit tests | 9 / 9 passing |
| Scoring formula | 0.4 × Text + 0.4 × Visual + 0.2 × PageCount |
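
The scoring formula in the table above can be illustrated with a short Python sketch. The names `overall_score` and `is_excellent` are hypothetical, and the three component scores are assumed to be normalized to the range 0 to 1; the actual logic in `compare_pdfs.py` may differ.

```python
# Weights taken from the scoring formula in the metrics table.
TEXT_WEIGHT = 0.4
VISUAL_WEIGHT = 0.4
PAGE_COUNT_WEIGHT = 0.2

# Threshold for the "Excellent" rating, as stated in the summary.
EXCELLENT_THRESHOLD = 0.90


def overall_score(text: float, visual: float, page_count: float) -> float:
    """Combine the three component scores (each assumed to be in [0, 1])."""
    return (TEXT_WEIGHT * text
            + VISUAL_WEIGHT * visual
            + PAGE_COUNT_WEIGHT * page_count)


def is_excellent(score: float) -> bool:
    """True when a case reaches the Excellent band (>= 90%)."""
    return score >= EXCELLENT_THRESHOLD
```

Because page count carries 20% of the weight, a case whose page-count component drops loses a large share of its score at once, which is why the page-overflow regressions described later mattered so much.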

| Bucket | Count |
|---|---|
| ≥ 0.99 | 15 |
| 0.98 – 0.989 | 8 |
| 0.97 – 0.979 | 3 |
| 0.96 – 0.969 | 4 |

| File | Purpose |
|---|---|
| `src/MiniPdf/DocxReader.cs` | Parses DOCX (ZIP / Open XML) into `DocxDocument` model |
| `src/MiniPdf/DocxToPdfConverter.cs` | Renders `DocxDocument` → `PdfDocument` with page layout |
| `src/MiniPdf/MiniPdf.cs` | Added `ConvertToPdf` overload detecting `.docx` extension |

```
DocxDocument
├── Paragraphs: List<DocxParagraph>
│   ├── Text, FontSize, Bold, Italic
│   ├── StyleId (Heading1–4, Normal, ListBullet, ListNumber)
│   ├── LineSpacing, SpacingBefore, SpacingAfter
│   ├── IsBulletList, IsNumberedList, ListText
│   ├── Images: List<DocxImage>
│   └── PageBreakBefore
├── Tables: List<DocxTable>
│   └── Rows → Cells → Paragraphs (with images)
├── Margins (Top, Bottom, Left, Right)
└── Elements: List<object> (interleaved paragraphs + tables)
```

- Paragraphs: normal text with font size, bold, and italic formatting
- Headings: H1 (20pt), H2 (16pt), H3 (13pt), H4 (11pt) with bold styling
- Tables: borders, cell padding, proportional column widths, images inside cells
- Inline images: JPEG/PNG, scaled to fit page width or cell width
- Bullet lists: detected via `<w:numPr>` or the `ListBullet` style name; rendered with a `•` prefix
- Numbered lists: detected via `<w:numPr>` or the `ListNumber` style name; sequential counter per group
- Page breaks: explicit `<w:br w:type="page">` and paragraph-level `pageBreakBefore`
- Line spacing: single, 1.15×, 1.5×, double (OOXML `w:line`/`lineRule` parsing)
- Document margins: read from `<w:pgMar>` in section properties
- SpacingBefore / SpacingAfter: per-paragraph spacing from styles and direct formatting
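
To make the margin handling concrete, here is a minimal Python sketch (not the MiniPdf C# code) that reads `<w:pgMar>` from a `document.xml` string. OOXML stores these values in twentieths of a point, so they are divided by 20. The function name `read_margins`, the `SAMPLE` document, and the 72pt fallback are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TWIPS_PER_POINT = 20  # w:pgMar values are twentieths of a point

# Assumed fallback when a margin is missing: 1 inch (72pt) on every side.
DEFAULT_MARGINS = {"top": 72.0, "bottom": 72.0, "left": 72.0, "right": 72.0}


def read_margins(document_xml: str) -> dict:
    """Return page margins in points from the first sectPr/pgMar found."""
    root = ET.fromstring(document_xml)
    pg_mar = root.find(f".//{{{W_NS}}}sectPr/{{{W_NS}}}pgMar")
    if pg_mar is None:
        return dict(DEFAULT_MARGINS)
    margins = {}
    for side, fallback in DEFAULT_MARGINS.items():
        raw = pg_mar.get(f"{{{W_NS}}}{side}")
        margins[side] = int(raw) / TWIPS_PER_POINT if raw is not None else fallback
    return margins


SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:sectPr>'
    '<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800"/>'
    '</w:sectPr></w:body></w:document>'
)

if __name__ == "__main__":
    print(read_margins(SAMPLE))
```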
lineHeight = fontSize × FontMetricsFactor × lineSpacingMultiplier
- FontMetricsFactor = 1.18 — Represents ascent + descent ratio for Calibri → Helvetica mapping
- Derived from reference measurement: LibreOffice renders 11pt text at ~14.9pt line height → 14.9 / (11 × 1.15) = 1.178 ≈ 1.18
- Default line spacing multiplier: 1.15 (from OOXML docDefaults `w:line="276"`, i.e. 276/240)
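
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula as stated in these notes; the constant and function names are chosen for illustration and do not come from the C# source.

```python
FONT_METRICS_FACTOR = 1.18         # ascent + descent ratio used for Calibri -> Helvetica
DEFAULT_LINE_SPACING = 276 / 240   # docDefaults w:line="276" in auto mode, i.e. 1.15


def line_height(font_size: float,
                line_spacing_multiplier: float = DEFAULT_LINE_SPACING) -> float:
    """Paragraph line height in points."""
    return font_size * FONT_METRICS_FACTOR * line_spacing_multiplier


if __name__ == "__main__":
    # 11pt body text at the default multiplier lands close to the
    # ~14.9pt line height measured from the LibreOffice reference.
    print(round(line_height(11), 3))
```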
The `w:line` attribute has different semantics depending on `w:lineRule`:

| `lineRule` | Interpretation | Formula |
|---|---|---|
| `auto` (default) | Multiplier | value / 240 |
| `exact` | Fixed points | value / 20 |
| `atLeast` | Minimum points | value / 20 |

Root cause of classic13 regression: initially, all `w:line` values were parsed as twips (÷20), but `auto` mode uses 240ths of a line, a 12× difference in divisor. The resulting line spacing was too small, causing a 4-page document to render as 3 pages.
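
A minimal Python sketch of the rule-dependent parsing follows. It mirrors the table of `lineRule` semantics rather than the actual `DocxReader.cs` code, and the function name and return shape are assumptions.

```python
def parse_line_spacing(value: int, line_rule: str = "auto") -> tuple:
    """Interpret a w:line value according to its w:lineRule.

    Returns (kind, amount): for "auto" the amount is a unitless multiplier,
    for "exact" and "atLeast" it is a height in points.
    """
    if line_rule == "auto":
        return ("multiplier", value / 240)   # 240ths of a line
    if line_rule in ("exact", "atLeast"):
        return (line_rule, value / 20)       # twips to points
    raise ValueError(f"unknown lineRule: {line_rule!r}")


if __name__ == "__main__":
    print(parse_line_spacing(276))            # default spacing in auto mode
    print(parse_line_spacing(240, "exact"))   # fixed 12pt line
```

Treating the default `276` as twips would yield a fixed 13.8pt line, which is shorter than the roughly 14.9pt that the multiplier path produces for 11pt text.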
Tables use a different line height formula without the font metrics factor:

```csharp
// Paragraph (normal)
var lineHeight = fontSize * FontMetricsFactor * lineSpacingMul;
// Table cell (compact)
var lineHeight = effectiveFontSize * options.LineSpacing;
```

Cell padding = 1pt (Word uses tight cell margins). This prevents table rows from being too tall compared to the reference.
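
As a rough Python illustration of how compact cell spacing, padding, and in-cell images might combine into a row height: this is a sketch under assumptions, not the actual `CalculateRowHeight()` logic. In particular, applying the padding once at the top and once at the bottom, stacking images under the text, and taking the tallest cell as the row height are all assumed here.

```python
CELL_PADDING = 1.0  # points, per the notes above


def cell_height(line_count: int, font_size: float, line_spacing: float,
                image_heights=(), padding: float = CELL_PADDING) -> float:
    """Height of one cell: compact text lines, stacked images, top and bottom padding."""
    text_height = line_count * font_size * line_spacing  # no FontMetricsFactor in tables
    return text_height + sum(image_heights) + 2 * padding


def row_height(cells) -> float:
    """A row is assumed to be as tall as its tallest cell."""
    return max(cell_height(**cell) for cell in cells)


if __name__ == "__main__":
    row = [
        {"line_count": 2, "font_size": 10, "line_spacing": 1.2},
        {"line_count": 1, "font_size": 10, "line_spacing": 1.2, "image_heights": [50]},
    ]
    print(row_height(row))
```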
Two detection paths for lists:
- `<w:numPr>` element: standard OOXML numbering reference (checked first)
- Style name fallback: if `styleId` starts with `ListBullet` or `ListNumber`, treat as list even without `<w:numPr>`

Style-based numbered lists use a sequential counter in the Read() loop, resetting when a non-numbered paragraph is encountered.
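
The style-name fallback and its counter reset can be sketched in Python as below. The input shape (a list of `(style_id, text)` pairs) and the function name are hypothetical, and the `<w:numPr>` path is left out because it depends on numbering definitions not covered in these notes.

```python
def render_style_based_lists(paragraphs):
    """Prefix list paragraphs detected by style name.

    paragraphs: iterable of (style_id, text) pairs.
    The numbered-list counter resets whenever a non-numbered paragraph appears.
    """
    counter = 0
    rendered = []
    for style_id, text in paragraphs:
        if style_id.startswith("ListNumber"):
            counter += 1
            rendered.append(f"{counter}. {text}")
            continue
        counter = 0  # any other paragraph ends the current numbered group
        if style_id.startswith("ListBullet"):
            rendered.append(f"\u2022 {text}")  # U+2022 bullet
        else:
            rendered.append(text)
    return rendered


if __name__ == "__main__":
    doc = [
        ("ListNumber", "first"),
        ("ListNumber", "second"),
        ("Normal", "interruption"),
        ("ListNumber", "restarts"),
        ("ListBullet", "bullet"),
    ]
    for line in render_style_based_lists(doc):
        print(line)
```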
SpacingBefore is suppressed for the first paragraph on each page, matching Word behavior (not LibreOffice, which always applies it). Removing this suppression caused classic30 to overflow from 3 to 4 pages.
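
The effect on page count can be reproduced with a toy paginator in Python. This is a simplified sketch with made-up paragraph sizes, not the converter's layout code; `count_pages` and its parameters are hypothetical.

```python
def count_pages(paragraphs, usable_height: float, suppress_first: bool = True) -> int:
    """Count pages for a sequence of (spacing_before, height) paragraphs, in points.

    When suppress_first is True, spacing_before is dropped for the first
    paragraph on each page.
    """
    pages = 1
    y = 0.0               # vertical space already used on the current page
    first_on_page = True
    for spacing_before, height in paragraphs:
        gap = 0.0 if (first_on_page and suppress_first) else spacing_before
        if not first_on_page and y + gap + height > usable_height:
            # Paragraph does not fit: start a new page and place it at the top.
            pages += 1
            y = 0.0
            gap = 0.0 if suppress_first else spacing_before
        y += gap + height
        first_on_page = False
    return pages


if __name__ == "__main__":
    paras = [(10, 45)] * 4
    print(count_pages(paras, 100))                        # suppression on
    print(count_pages(paras, 100, suppress_first=False))  # suppression off
```

In this toy input, two paragraphs fit per page only when the leading gap is dropped, so turning suppression off doubles the page count. The classic30 overflow is the same kind of effect on a smaller scale.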
Basic DOCX parsing and rendering: paragraphs, headings, tables, images, basic lists. 9 unit tests created.
- Fixed `w:line` parsing: `auto` = ÷240, `exact`/`atLeast` = ÷20
- Added `FontMetricsFactor = 1.18f`
- classic13 fixed: 3/4 → 4/4 pages
- Tables use compact spacing (no FontMetricsFactor)
- Cell padding reduced from 2pt to 1pt
- Style-based list detection (`ListBullet`/`ListNumber`)
- Sequential counter for style-based numbered lists
- classic30 fixed: 4/3 → 3/3 pages
- Added image rendering inside table cell paragraphs
- Height calculation accounts for images in `CalculateRowHeight()`
- classic29 improved: 0.896 → 0.987
- Tested cellPadding=0.5pt → slightly worse, reverted to 1pt
- Tested removing spacingBefore suppression → classic30 regression, reverted
- Final score stabilized at 0.9868

| Approach | Result | Reason |
|---|---|---|
| cellPadding = 0.5pt | Score 0.9866 (−0.0002) | Marginal degradation across multiple cases |
| Remove spacingBefore suppression | classic30 regression | Page overflow (3 → 4 pages) |
| FontMetricsFactor = 1.2 | classic30 overflow | Too aggressive — 4/3 pages for classic30 |

| Category | Examples | Root Cause |
|---|---|---|
| Bullet encoding | classic08 (96.7%) | Helvetica uses U+2022 (•), Word uses Wingdings U+F0B7 — different glyph rendering |
| Table shading | classic11 (96.9%) | No cell background color support yet |
| Word wrap differences | classic27 (96.8%) | Helvetica wider than Calibri causes different line breaks |
| Font metrics mismatch | classic19 (97.4%), classic20 (97.9%) | Helvetica character widths differ from Calibri |
All remaining gaps are primarily due to the fundamental font difference (Helvetica vs Calibri). True typographic equivalence would require embedding Calibri metrics or a matching font.
| Test Case | Score | Category |
|---|---|---|
| classic01 | 0.997 | Basic paragraph |
| classic02 | 0.985 | Multiple paragraphs |
| classic03 | 0.995 | Heading styles |
| classic04 | 0.999 | Bold / italic |
| classic05 | 0.993 | Bullet list |
| classic06 | 0.999 | Numbered list |
| classic07 | 0.991 | Mixed content |
| classic08 | 0.967 | Nested bullets |
| classic09 | 0.996 | Indented paragraphs |
| classic10 | 0.990 | Table basic |
| classic11 | 0.969 | Table with shading |
| classic12 | 0.992 | Multi-page |
| classic13 | 0.968 | Long document (4 pages) |
| classic14 | 0.983 | Header levels |
| classic15 | 0.989 | Paragraph spacing |
| classic16 | 0.986 | Line spacing variants |
| classic17 | 0.999 | Simple table |
| classic18 | 0.981 | Table with merged cells |
| classic19 | 0.974 | Complex formatting |
| classic20 | 0.979 | Table with many rows |
| classic21 | 0.972 | Mixed lists |
| classic22 | 0.996 | Images |
| classic23 | 0.998 | Single image |
| classic24 | 0.981 | Table + images |
| classic25 | 0.996 | Formatted paragraphs |
| classic26 | 0.992 | Multi-section |
| classic27 | 0.968 | Long paragraphs |
| classic28 | 0.994 | List continuation |
| classic29 | 0.987 | Table with cell images |
| classic30 | 0.986 | Comprehensive report (3 pages) |

```powershell
# Build
cd d:\git\MiniPdf
dotnet build src/MiniPdf/MiniPdf.csproj

# Convert all DOCX test files to PDF
dotnet run --project tests/DocxConverter -- <input_dir> <output_dir>

# Run unit tests
dotnet test tests/MiniPdf.Tests --filter "DocxToPdfConverterTests"

# Run benchmark comparison
cd tests/MiniPdf.Benchmark
python compare_pdfs.py --minipdf-dir "..\MiniPdf.Scripts\pdf_output_docx" --reference-dir "reference_pdfs_docx" --report-dir "reports_docx"

# Update README with DOCX benchmark images
python scripts/update_readme_docx_images.py
```

```csharp
MiniPdf.ConvertToPdf("report.docx", "output.pdf");
```