feat(markdown): add footnote serialization support by ShrillHarrier · Pull Request #569 · docling-project/docling-core

ShrillHarrier · 2026-03-26T20:12:41Z

This PR implements the feature request "Improved Footnote Serialization in MarkdownDocSerializer", submitted by simonschoe in docling-project.

Footnotes are now serialized in the form [^{Identifier}]: {Description}. This applies to Table and Picture items, since footnotes are linked to those items.

In general, footnotes in .md files should look like:

[^5]: https://github.com/tesseract-ocr/tesseract
[^6]: https://github.com/VikParuchuri/surya
[^7]: https://github.com/lukas-blecher/LaTeX-OCR
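As a rough illustration of the target format (not the PR's actual implementation), the sketch below builds footnote definition lines from identifier/description pairs. The helper name `footnote_definition` is hypothetical and does not exist in docling-core.

```python
def footnote_definition(identifier: str, description: str) -> str:
    """Return a Markdown footnote definition line: [^id]: description."""
    return f"[^{identifier}]: {description.strip()}"


# Identifier/description pairs taken from the example above.
footnotes = [
    ("5", "https://github.com/tesseract-ocr/tesseract"),
    ("6", "https://github.com/VikParuchuri/surya"),
    ("7", "https://github.com/lukas-blecher/LaTeX-OCR"),
]

# One definition per line, as it would appear at the end of a .md file.
markdown = "\n".join(footnote_definition(i, d) for i, d in footnotes)
print(markdown)
```

For a footnote to render, the body text also needs a matching reference such as `[^5]` at the point where the footnote is cited.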

Resolves docling-project/docling#3128

Tests Added:

test_table_with_footnotes_markdown()
test_picture_with_footnotes_markdown()
test_table_export_to_markdown_with_footnotes()

github-actions · 2026-03-26T20:12:53Z

✅ DCO Check Passed

Thanks @ShrillHarrier, all your commits are properly signed off. 🎉

mergify · 2026-03-26T20:13:16Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for:

#approved-reviews-by >= 2

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
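To see what this rule accepts, the pattern can be tried locally against the titles this PR has carried. This is a sketch that assumes the `~=` operator behaves like a Python regex search; the helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


titles = [
    "feat(markdown): add footnote serialization support",
    "feat(markdown-serializer): add footnote serialization support for markdown",
    "Improved Footnote Serialization in MarkdownDocSerializer #3128",
]
for title in titles:
    print(is_conventional(title), title)
```

The first two titles match because they start with a known type, an optional scope in parentheses, and a colon; the third has no type prefix and is rejected.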

dosubot · 2026-03-26T20:16:46Z

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the differences between `vlm_pipeline_model_local` and `picture_description_local` in Docling, and how do image descriptions, OCR, and table extraction work together? Also, how do the `include_annotations` and `mark_annotations` properties affect exported output?

View Suggested Changes

@@ -67,6 +67,9 @@
 - `compact_tables` (bool): Whether to use compact table format without column padding (default: `False`, Markdown only)
 - `traverse_pictures` (bool): Whether to traverse into picture items and serialize their text children (default: `False`)
 
+**Footnote Serialization in Markdown:**
+When tables or pictures have associated footnotes in the document, these footnotes are automatically serialized in the markdown output using standard markdown footnote syntax: `[^{Identifier}]: {Description}`. The identifier is extracted from the first part of the footnote text, and the remaining text becomes the footnote description. This formatting ensures that footnotes attached to Table and Picture items appear correctly in the exported markdown.
+
 **Handling OCR Text in Scanned/Image-Based PDFs:**
 When processing scanned or image-based PDFs with `force_full_page_ocr=True`, the layout model classifies full-page scans as `PictureItem` nodes. OCR text items are added as children of that picture node in the document tree.


I, Matthew Panizza <shrillharrier1@gmail.com>, hereby add my Signed-off-by to this commit: 73a9a40
I, Matthew Panizza <shrillharrier1@gmail.com>, hereby add my Signed-off-by to this commit: 929c11f

Signed-off-by: Matthew Panizza <shrillharrier1@gmail.com>

ShrillHarrier · 2026-03-30T19:44:00Z

Signing off with personal email.

Signed-off-by: Matthew Panizza <username@users.noreply.github.com>

codecov · 2026-04-11T07:08:36Z

Codecov Report

❌ Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling_core/transforms/serializer/markdown.py	95.45%	1 Missing ⚠️


ceberam

Thanks @ShrillHarrier for suggesting this PR. Please see my comments.
In general:

- Adding new tests with a programmatic example is fine, but it is much more illustrative to show the impact of the new feature through a serialization checked against a ground truth data file. This lets us check how the output markdown file renders in applications like GitHub or VS Code. Please check how this is done in other test modules.
- To keep the repository consistent, I suggest adding the tests to the module test/test_serialization.py, together with the other markdown serialization tests, instead of creating a separate module.
- Some DoclingDocument files (.json) do not serialize as expected. Please check some of them and ensure that the markdown serialization generates the right footnote hook and text. For instance, test/data/doc/2408.09869v3_enriched.json has a footnote. If you regenerate the ground truth files (by running the tests with the env variable DOCLING_GEN_TEST_DATA=1), I would expect the markdown serialization to be updated with the new footnote serialization.
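The ground-truth workflow described in the review can be sketched roughly as follows. This is a generic illustration, not the repository's own test helper: the function `verify_or_generate`, the file name, and the way the `DOCLING_GEN_TEST_DATA` variable is interpreted here are all assumptions.

```python
import os
import tempfile
from pathlib import Path

# Assumption: treat "1" as "regenerate"; the repository's own helpers may
# interpret the variable differently.
GEN_TEST_DATA = os.getenv("DOCLING_GEN_TEST_DATA") == "1"


def verify_or_generate(actual: str, gt_path: Path, generate: bool) -> None:
    """Write the ground truth file when generating, else compare against it."""
    if generate:
        gt_path.write_text(actual, encoding="utf-8")
        return
    expected = gt_path.read_text(encoding="utf-8")
    assert actual == expected, f"Serialization differs from {gt_path}"


# Demo in a temporary directory, with a made-up serialization result.
with tempfile.TemporaryDirectory() as tmp:
    gt_file = Path(tmp) / "table_with_footnotes.gt.md"  # hypothetical name
    sample = "| a | b |\n|---|---|\n| 1 | 2 |\n\n[^1]: see huggingface.co/ds4sd/docling-models/\n"
    verify_or_generate(sample, gt_file, generate=True)   # first run writes the file
    verify_or_generate(sample, gt_file, generate=False)  # later runs compare
```

In a real test, `generate` would be driven by `GEN_TEST_DATA`, and the committed ground truth file could be opened in GitHub or an editor to confirm that the footnotes render as intended.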

ceberam · 2026-04-13T12:27:36Z

+        params: MarkdownParams = self.params.merge_with_patch(patch=kwargs)
+        results: list[SerializationResult] = []
+        if DocItemLabel.FOOTNOTE in params.labels:
+            results = []


This line is redundant

Suggested change

results = []

ceberam · 2026-04-13T13:06:54Z

+            results = []
+            for footnote in item.footnotes:
+                if isinstance(ftn := footnote.resolve(self.doc), TextItem):
+                    parts = ftn.text.split(" ", 1)


The footnote parsing logic assumes a specific format (an identifier followed by the footnote text), but this format is neither represented nor documented anywhere. The format should be explicit, validated, and clearly documented. Keep in mind that, for a markdown footnote to work, identifiers can be numbers or words, but they can't contain spaces or tabs.
In addition, I don't think we keep the footnote references in the correct format (a caret and an identifier inside brackets, e.g., [^1]).

If you try to serialize a DoclingDocument from the test dataset, you'll see that the footnote 1 see huggingface.co/ds4sd/docling-models/ (doc item with reference #/texts/29) is not serialized according to the markdown specification.

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.base import ImageRefMode

with open("test/data/doc/2408.09869v3_enriched.json") as handler:
    content = handler.read()

doc = DoclingDocument.model_validate_json(content)
doc.export_to_markdown(image_mode=ImageRefMode.PLACEHOLDER)
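One way to make the assumed format explicit and validated, as requested in this comment, is sketched below. It is standalone Python that does not use docling-core; the function names, the regular expression, and the fallback behaviour are illustrative assumptions, not the PR's code.

```python
import re
from typing import Optional

# Assumed format: "<identifier><whitespace><description>", for example
# "1 see huggingface.co/ds4sd/docling-models/". The identifier may not
# contain whitespace, brackets, or a caret.
_FOOTNOTE_RE = re.compile(
    r"^(?P<identifier>[^\s\[\]^]+)\s+(?P<description>\S.*)$", re.DOTALL
)


def parse_footnote(text: str) -> Optional[tuple[str, str]]:
    """Split footnote text into (identifier, description), or None if invalid."""
    match = _FOOTNOTE_RE.match(text.strip())
    if match is None:
        return None
    return match["identifier"], match["description"].strip()


def serialize_footnote(text: str) -> str:
    """Return a Markdown footnote definition, or the raw text as a fallback."""
    parsed = parse_footnote(text)
    if parsed is None:
        return text  # no identifier/description split possible
    identifier, description = parsed
    return f"[^{identifier}]: {description}"


print(serialize_footnote("1 see huggingface.co/ds4sd/docling-models/"))
```

A check like this cannot tell a real identifier from the first word of an unnumbered footnote (for example "See the appendix"), so a stricter identifier rule may be needed in practice.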

ceberam · 2026-04-13T13:08:09Z

+    CodeItem,
+    FieldHeadingItem,
+    FieldValueItem,
+    FormulaItem,
+    ListItem,
+    SectionHeaderItem,
+    TextItem,
+    TitleItem,


Many of these imports are never used. Please, remove unused imports.

ShrillHarrier changed the title ~~Improved Footnote Serialization in MarkdownDocSerializer #3128~~ feat(markdown-serializer): add footnote serialization support for markdown Mar 26, 2026

ShrillHarrier changed the title ~~feat(markdown-serializer): add footnote serialization support for markdown~~ feat(markdown): add footnote serialization support Mar 26, 2026

ShrillHarrier force-pushed the dev/md-footnote-serializer branch from b849797 to 27e8b28 on March 30, 2026 19:41

ShrillHarrier mentioned this pull request Mar 30, 2026

Improved Footnote Serialization in MarkdownDocSerializer docling-project/docling#3128

Open

update tests

f3f0876

Signed-off-by: Matthew Panizza <username@users.noreply.github.com>

ShrillHarrier force-pushed the dev/md-footnote-serializer branch from 483d902 to f3f0876 on April 5, 2026 19:36

ceberam self-requested a review April 11, 2026 07:08

ceberam requested changes Apr 13, 2026

ShrillHarrier wants to merge 2 commits into docling-project:main from ShrillHarrier:dev/md-footnote-serializer
