feat(markdown): add footnote serialization support#569
feat(markdown): add footnote serialization support#569ShrillHarrier wants to merge 2 commits intodocling-project:mainfrom
Conversation
|
✅ DCO Check Passed Thanks @ShrillHarrier, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesWaiting for:
This rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
Related Documentation 1 document(s) may need updating based on files changed in this PR: Docling What are the differences between
|
b849797 to
27e8b28
Compare
|
Signing off with personal email. |
Signed-off-by: Matthew Panizza <username@users.noreply.github.com>
483d902 to
f3f0876
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
ceberam
left a comment
There was a problem hiding this comment.
Thanks @ShrillHarrier for suggesting this PR. Please, see my comments.
In general:
- Adding new tests with a programmatic example is fine but it is much more illustrative to show the impact of the new feature in a serialization with a ground truth data file. This allows us to check how the output markdown file gets rendered in applications like Github or VSC. Please, check how this is done in other test modules.
- To keep the repository consistent, I would suggest that you add the tests in the module
test/test_serialization.py, together with the other tests of the markdown serialization, instead of creating a separate module. - There are some DoclingDocument files (
.json) that do not serialize as expected. Please, check some of them and ensure that the markdown serialization generates the right footnotes hook and text. For instance,test/data/doc/2408.09869v3_enriched.jsonhas a footnote. If you regenerate the ground truth files (by running the tests with env variableDOCLING_GEN_TEST_DATA=1), I would expect that the markdown serialization gets updated with the new footnote serialization.
| params: MarkdownParams = self.params.merge_with_patch(patch=kwargs) | ||
| results: list[SerializationResult] = [] | ||
| if DocItemLabel.FOOTNOTE in params.labels: | ||
| results = [] |
There was a problem hiding this comment.
This line is redundant
| results = [] |
| results = [] | ||
| for footnote in item.footnotes: | ||
| if isinstance(ftn := footnote.resolve(self.doc), TextItem): | ||
| parts = ftn.text.split(" ", 1) |
There was a problem hiding this comment.
The footnote parsing logic assumes a specific format (the identifier and the footnote text). This format is not clearly represented or documented. It would be good that the format is explicit, validated, and clearly documented. We should keep in mind that for the markdown footnote to work, identifiers can be numbers or words, but they can’t contain spaces or tabs..
In addition, I don't think that we keep the footnote references with the correct formatting (a caret and an identifier inside brackets, e.g., [^1]).
If I try to serialize a DoclingDocument from the test dataset, you'll see that the footnote 1 see huggingface.co/ds4sd/docling-models/ (doc item with reference #/texts/29) is not properly serialized as in the markdown specification.
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.base import ImageRefMode
with open("test/data/doc/2408.09869v3_enriched.json") as handler:
content = handler.read()
doc = DoclingDocument.model_validate_json(content)
doc.export_to_markdown(image_mode=ImageRefMode.PLACEHOLDER)| CodeItem, | ||
| FieldHeadingItem, | ||
| FieldValueItem, | ||
| FormulaItem, | ||
| ListItem, | ||
| SectionHeaderItem, | ||
| TextItem, | ||
| TitleItem, |
There was a problem hiding this comment.
Many of these imports are never used. Please, remove unused imports.
This PR is related to the Improved Footnote Serialization in MarkdownDocSerializer.
It is a Feature Request submitted by simonschoe in docling-project.
The features added include serializing a footnote in the form
[^{Identifier}]: {Description}. This is done for Table and Picture items, as footnotes are linked to those.In general, footnotes in
.mdfiles should look like:Resolves docling-project/docling#3128
Tests Added: