DoclingParseDocumentBackend.page_count() returns -1 when docling-parse fails to parse PDF page tree #3031

@prein

Description

DoclingParseDocumentBackend.page_count() can return -1, causing downstream failures.
This happens when docling-parse's C++ backend fails to parse a PDF's page tree (QPDF exception) but pypdfium2 succeeds.

The method logs "Inconsistent number of pages: N!=-1" but returns the -1 value from self.dp_doc.number_of_pages() to the caller.

Root cause

The docling-parse C++ layer (pdf_decoder<DOCUMENT>) initializes number_of_pages to -1:

// docling-parse: src/parse/pdf_decoders/document.h
pdf_decoder<DOCUMENT>::pdf_decoder():
    // ...
    number_of_pages(-1),
    // ...

It's only set to the real value if qpdf_pages.getKey("/Count").getIntValue() succeeds inside process_document_from_file / process_document_from_bytesio.
If QPDF throws, the exception is caught and logged, but number_of_pages remains -1:

catch(const std::exception& exc)
{
    LOG_S(ERROR) << "filename: " << filename
                 << " can not be processed by qpdf: " << exc.what();
    return false;
}
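The sentinel pattern described above can be modelled in a few lines of Python. This is only a toy sketch: SentinelDecoder and its load() method are hypothetical stand-ins for pdf_decoder<DOCUMENT> and are not part of the docling-parse API. It shows how a swallowed exception leaves the initial -1 visible to callers.

```python
import logging

_log = logging.getLogger("sentinel_sketch")


class SentinelDecoder:
    """Toy stand-in for pdf_decoder<DOCUMENT> (hypothetical, not the real API)."""

    def __init__(self) -> None:
        # Mirrors the C++ initializer: number_of_pages(-1)
        self._number_of_pages = -1

    def load(self, page_tree: dict) -> bool:
        try:
            # Stands in for qpdf_pages.getKey("/Count").getIntValue()
            self._number_of_pages = int(page_tree["/Count"])
        except Exception as exc:
            # Same shape as the C++ catch block: log, return False,
            # and leave the sentinel untouched.
            _log.error("can not be processed: %r", exc)
            return False
        return True

    def number_of_pages(self) -> int:
        # No validation here, so the sentinel leaks out after a failed load.
        return self._number_of_pages


broken = SentinelDecoder()
loaded = broken.load({})  # no "/Count" key, so the lookup raises and is swallowed
leaked = broken.number_of_pages()
```

After the failed load, the only signal of failure is the boolean return value; any caller that ignores it and asks for the page count receives the sentinel.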

The Python wrapper (docling_parse.pdf_parser.PdfDocument.number_of_pages()) propagates this -1 without validation. Then in docling:

# docling/backend/docling_parse_backend.py
def page_count(self) -> int:
    len_1 = len(self._pdoc)       # pypdfium2 — returns correct count
    len_2 = self.dp_doc.number_of_pages()  # docling-parse — returns -1

    if len_1 != len_2:
        _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")

    return len_2  # returns -1 to caller

Impact

In our production docling-serve deployment (15 RQ worker pods), we see ~505 of these errors in 12 hours, spread evenly across workers. The errors are document-specific (certain PDFs always trigger it) and not load-dependent.

Example log lines:

ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 1!=-1
ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 5!=-1
ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 4!=-1
ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 2!=-1
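For triage, log lines of this shape can be parsed to separate sentinel failures (right-hand side negative) from genuine count disagreements. The following is a minimal sketch; the function name parse_mismatches and the regular expression are illustrative assumptions based only on the message format shown above.

```python
import re

# Matches the message emitted by page_count(): "<pypdfium2 count>!=<docling-parse count>"
MISMATCH_RE = re.compile(r"Inconsistent number of pages: (-?\d+)!=(-?\d+)")


def parse_mismatches(lines):
    """Return (pypdfium2_count, docling_parse_count) pairs found in log lines."""
    pairs = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


sample = [
    "ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 1!=-1",
    "ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 5!=-1",
    "ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 4!=-1",
    "ERROR:docling.backend.docling_parse_backend:Inconsistent number of pages: 2!=-1",
]

pairs = parse_mismatches(sample)
# A negative right-hand side means docling-parse never set a real page count.
sentinel_hits = [pair for pair in pairs if pair[1] < 0]
```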

A page count of -1 makes is_valid() return False (since -1 > 0 is false), which likely causes the whole document to be rejected, even though pypdfium2 parses it fine.

Suggested fix

In page_count(), handle the disagreement defensively:

def page_count(self) -> int:
    len_1 = len(self._pdoc)
    len_2 = self.dp_doc.number_of_pages()

    if len_2 < 0:
        _log.warning(
            f"docling-parse returned invalid page count ({len_2}) "
            f"for document {self.document_hash}, "
            f"falling back to pypdfium2 count ({len_1})"
        )
        return len_1

    if len_1 != len_2:
        _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")

    return len_2
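The suggested fix can be exercised without docling installed by using small stand-ins. In the sketch below, FakeParseDoc and SketchBackend are hypothetical classes that carry only the attributes page_count() touches; the is_valid() check is assumed to be page_count() > 0, as described under Impact.

```python
import logging

_log = logging.getLogger("page_count_sketch")


class FakeParseDoc:
    """Hypothetical stand-in for the docling-parse document handle."""

    def __init__(self, pages: int) -> None:
        self._pages = pages

    def number_of_pages(self) -> int:
        return self._pages


class SketchBackend:
    """Hypothetical backend holding only what page_count() needs."""

    def __init__(self, pdfium_pages: int, parse_pages: int) -> None:
        self._pdoc = [None] * pdfium_pages  # len() plays the pypdfium2 role
        self.dp_doc = FakeParseDoc(parse_pages)
        self.document_hash = "example-hash"  # placeholder value

    def page_count(self) -> int:
        len_1 = len(self._pdoc)
        len_2 = self.dp_doc.number_of_pages()

        if len_2 < 0:
            # Sentinel from docling-parse: trust pypdfium2 instead.
            _log.warning(
                f"docling-parse returned invalid page count ({len_2}) "
                f"for document {self.document_hash}, "
                f"falling back to pypdfium2 count ({len_1})"
            )
            return len_1

        if len_1 != len_2:
            _log.error(f"Inconsistent number of pages: {len_1}!={len_2}")

        return len_2

    def is_valid(self) -> bool:
        # Assumed validity check: page_count() > 0
        return self.page_count() > 0


fallback = SketchBackend(pdfium_pages=5, parse_pages=-1)
agreeing = SketchBackend(pdfium_pages=3, parse_pages=3)
disagreeing = SketchBackend(pdfium_pages=4, parse_pages=2)
```

Only the sentinel case changes behaviour: a genuine disagreement between two non-negative counts is still logged as an error and still returns the docling-parse value.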

Additionally, docling-parse should either raise on -1 or validate number_of_pages >= 0 after document loading (separate issue for docling-project/docling-parse).

Environment

  • docling-serve v1.14.0 (RQ mode, 15 GPU worker pods)
  • docling 2.x (latest)
  • docling-parse (latest)
  • Observed on multiple distinct PDFs
