Description
DoclingParseDocumentBackend.page_count() can return -1, causing downstream failures.
This happens when docling-parse's C++ backend fails to parse a PDF's page tree (QPDF exception) but pypdfium2 succeeds.
The method logs "Inconsistent number of pages: N!=-1" but returns the -1 value from self.dp_doc.number_of_pages() to the caller.
Root cause
The docling-parse C++ layer (pdf_decoder<DOCUMENT>) initializes number_of_pages to -1:
// docling-parse: src/parse/pdf_decoders/document.h
pdf_decoder<DOCUMENT>::pdf_decoder():
  // ...
  number_of_pages(-1),
  // ...
It's only set to the real value if qpdf_pages.getKey("/Count").getIntValue() succeeds inside process_document_from_file / process_document_from_bytesio.
If QPDF throws, the exception is caught and logged, but number_of_pages remains -1:
catch(const std::exception& exc)
  {
    LOG_S(ERROR) << "filename: " << filename
                 << " can not be processed by qpdf: " << exc.what();
    return false;
  }
The Python wrapper (docling_parse.pdf_parser.PdfDocument.number_of_pages()) propagates this -1 without validation. Then in docling:
# docling/backend/docling_parse_backend.py
def page_count(self) -> int:
    len_1 = len(self._pdoc)  # pypdfium2: returns correct count
    len_2 = self.dp_doc.number_of_pages()  # docling-parse: returns -1
    if len_1 != len_2:
        _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")
    return len_2  # returns -1 to caller
Impact
In our production docling-serve deployment (15 RQ worker pods), we see ~505 of these errors in 12 hours, spread evenly across workers. The errors are document-specific (certain PDFs always trigger it) and not load-dependent.
Example log lines:
ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 1!=-1
ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 5!=-1
ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 4!=-1
ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 2!=-1
A page count of -1 makes is_valid() return False (since -1 > 0 is false), which likely causes the document to be rejected entirely, even though pypdfium2 can parse it fine.
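The failure mode can be reproduced without any PDF by substituting stand-ins for the two parsers. The sketch below is not docling code: `FakePdfium`, `FakeDoclingParse` and `Backend` are hypothetical stubs, `page_count()` mirrors the logic quoted above, and `is_valid()` is assumed to be a plain `page_count() > 0` check.

```python
import logging

_log = logging.getLogger("sketch")


class FakePdfium:
    """Hypothetical stand-in for the pypdfium2 document; len() is the page count."""

    def __init__(self, n_pages: int) -> None:
        self._n = n_pages

    def __len__(self) -> int:
        return self._n


class FakeDoclingParse:
    """Hypothetical stand-in for a docling-parse document whose QPDF load failed."""

    def number_of_pages(self) -> int:
        return -1  # sentinel that was never overwritten


class Backend:
    """Mirrors the current page_count() logic from the issue."""

    def __init__(self, pdoc, dp_doc) -> None:
        self._pdoc = pdoc
        self.dp_doc = dp_doc

    def page_count(self) -> int:
        len_1 = len(self._pdoc)
        len_2 = self.dp_doc.number_of_pages()
        if len_1 != len_2:
            _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")
        return len_2

    def is_valid(self) -> bool:
        # Assumption: validity means "has at least one page".
        return self.page_count() > 0


backend = Backend(FakePdfium(5), FakeDoclingParse())
print(backend.page_count(), backend.is_valid())
```

With these stubs the backend reports -1 pages and fails the validity check, although the pypdfium2 stand-in knows the document has 5 pages.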
Suggested fix
In page_count(), handle the disagreement defensively:
def page_count(self) -> int:
    len_1 = len(self._pdoc)
    len_2 = self.dp_doc.number_of_pages()
    if len_2 < 0:
        _log.warning(
            f"docling-parse returned invalid page count ({len_2}) "
            f"for document {self.document_hash}, "
            f"falling back to pypdfium2 count ({len_1})"
        )
        return len_1
    if len_1 != len_2:
        _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")
    return len_2
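To check the three branches of this logic in isolation, the same decision can be written as a free function and exercised directly. This is only a sketch: `resolve_page_count` and its parameters are hypothetical names that do not exist in docling.

```python
import logging

_log = logging.getLogger("sketch")


def resolve_page_count(pdfium_count: int, dp_count: int, document_hash: str = "<unknown>") -> int:
    """Hypothetical free-function version of the suggested page_count() logic."""
    if dp_count < 0:
        # docling-parse never got past its -1 sentinel: trust pypdfium2 instead.
        _log.warning(
            f"docling-parse returned invalid page count ({dp_count}) "
            f"for document {document_hash}, "
            f"falling back to pypdfium2 count ({pdfium_count})"
        )
        return pdfium_count
    if pdfium_count != dp_count:
        # Both counts are plausible but disagree: keep the existing behaviour.
        _log.error(f"Inconsistent number of pages: {pdfium_count}!={dp_count}")
    return dp_count


# (pypdfium2 count, docling-parse count) pairs covering each branch
for pdfium_count, dp_count in [(5, -1), (4, 4), (3, 2)]:
    print(pdfium_count, dp_count, "->", resolve_page_count(pdfium_count, dp_count))
```

Only the negative case changes behaviour; a genuine disagreement between two non-negative counts is still logged as an error and resolved in favour of docling-parse, as it is today.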
Additionally, docling-parse should either raise on -1 or validate number_of_pages >= 0 after document loading (separate issue for docling-project/docling-parse).
Environment
- docling-serve v1.14.0 (RQ mode, 15 GPU worker pods)
- docling 2.x (latest)
- docling-parse (latest)
- Observed on multiple distinct PDFs