feat(markdown): add picture index to image placeholder #555
nuri-yoo wants to merge 5 commits into docling-project:main
✅ DCO Check Passed
Thanks @nuri-yoo, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Related Documentation
1 document(s) may need updating based on files changed in this PR:
Docling: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -7,7 +7,7 @@
- `do_ocr` (default True): Use OCR
- `force_ocr`: Replace existing text with OCR-generated text
- `ocr_engine`, `ocr_lang`: OCR engine and language options
- - `image_export_mode`: `placeholder`, `embedded`, `referenced`
+ - `image_export_mode`: `placeholder`, `embedded`, `referenced`. When using `placeholder` mode with Markdown export, the default placeholder format is `"<!-- image_{index} -->"`, which renders as sequential placeholders like `<!-- image_0 -->`, `<!-- image_1 -->`, etc. The index corresponds to the picture reference in the JSON export (e.g., `item.self_ref` like `"#/pictures/6"` → `6`). This is backward compatible—custom placeholders without the `{index}` token are unaffected.
- `do_table_structure`, `table_mode`, `table_cell_matching`: Table extraction options (see Table Structure Models section below for details on TableFormer V1 and V2)
- `do_code_enrichment`, `do_formula_enrichment`: Code/formula recognition
- `vlm_pipeline_preset`, `vlm_pipeline_custom_config`, `picture_description_preset`, `picture_description_custom_config`, `code_formula_preset`, `code_formula_custom_config`: New model inference engine and preset options for VLM, picture description, and code/formula extraction
Codecov Report
✅ All modified and coverable lines are covered by tests.
Use `{index}` token in `image_placeholder` to include the picture index
from `item.self_ref`. Default placeholder changes from `<!-- image -->`
to `<!-- image_{index} -->`, producing `<!-- image_0 -->`, etc.
Backward compatible: custom placeholders without `{index}` are unaffected.
Related: docling-project/docling#3078
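The token resolution described in this commit message can be sketched in plain Python. This is a minimal illustration, not the code from the PR: the helper name `resolve_placeholder` is hypothetical, and it assumes `self_ref` always ends in the picture index (as in `"#/pictures/6"`).

```python
def resolve_placeholder(placeholder: str, self_ref: str) -> str:
    """Substitute the {index} token with the last path segment of self_ref.

    Hypothetical helper for illustration only; it is not part of docling-core.
    """
    # "#/pictures/6" -> "6" (assumes the reference ends in the picture index).
    index = self_ref.rsplit("/", 1)[-1]
    # str.replace leaves the string untouched when the token is absent,
    # so custom placeholders without {index} pass through unchanged.
    return placeholder.replace("{index}", index)


print(resolve_placeholder("<!-- image_{index} -->", "#/pictures/6"))  # <!-- image_6 -->
print(resolve_placeholder("<!-- image -->", "#/pictures/6"))  # <!-- image -->
```

The second call shows the backward-compatibility claim: a placeholder without the token is returned as-is.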
I, nryoo <nryoo@nryooui-MacBookPro.local>, hereby add my Signed-off-by to this commit: 5de57a5 Signed-off-by: nryoo <nryoo@nryooui-MacBookPro.local>
4d418c3 to 8ee7731
I, nryoo <nryoo@nryooui-MacBookPro.local>, hereby add my Signed-off-by to this commit: c157073 Signed-off-by: nryoo <nryoo@nryooui-MacBookPro.local>
@nuri-yoo I've seen you've been adding new commits lately. Please let us know when it's ready for review.
Ready for review. All CI checks are passing now.
@ceberam Gentle ping on this PR. I'd appreciate a review when you get a chance.
Thanks a lot @nuri-yoo for your contribution. However, I don’t think indexed image placeholders should become the library default. Markdown export here is a lossy text representation of DoclingDocument. Embedding self_ref-derived picture identity into the default output makes it application-specific and sets a precedent for exposing internal node indices for other item types as well. The DoclingDocument should be the reference object for document hierarchy.
The same workflow can be achieved with a custom serializer extension (e.g., custom MarkdownPictureSerializer / BasePictureSerializer), as shown in Creating a custom serializer.
For instance:
from pathlib import Path

from docling_core.transforms.serializer.base import SerializationResult
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import (
    MarkdownDocSerializer,
    MarkdownParams,
    MarkdownPictureSerializer,
)
from docling_core.types.doc.base import ImageRefMode
from docling_core.types.doc.document import DoclingDocument, PictureItem


class IndexedMarkdownPictureSerializer(MarkdownPictureSerializer):
    """Custom picture serializer that supports {index} in the placeholder."""

    def _serialize_image_part(
        self,
        item: PictureItem,
        doc: DoclingDocument,
        image_mode: ImageRefMode,
        image_placeholder: str,
        **kwargs,
    ) -> SerializationResult:
        # "#/pictures/6" -> "6"
        pic_idx = item.self_ref.rsplit("/", 1)[-1]
        resolved_placeholder = image_placeholder.replace("{index}", pic_idx)

        # Reuse the parent implementation for non-placeholder modes if desired.
        if image_mode != ImageRefMode.PLACEHOLDER:
            return super()._serialize_image_part(
                item=item,
                doc=doc,
                image_mode=image_mode,
                image_placeholder=resolved_placeholder,
                **kwargs,
            )

        return create_ser_result(text=resolved_placeholder, span_source=item)


# Example usage with an existing document
src = Path("test/data/doc/2408.09869v3_enriched.json")
doc: DoclingDocument = DoclingDocument.load_from_json(src)

serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=IndexedMarkdownPictureSerializer(),
    params=MarkdownParams(
        image_mode=ImageRefMode.PLACEHOLDER,
        image_placeholder="<!-- image_{index} -->",
    ),
)
markdown = serializer.serialize().text
print(markdown)

My suggestion would be:
- close this PR
- show this use case in Docling documentation (section Serialization) by extending the notebook serialization.ipynb
Thanks for the thorough review and the alternative approach. I'll close this and open a docs PR extending |
Sounds good! You can link the new PR to the same issue docling-project/docling#3078 when it's ready.
Show how to subclass MarkdownPictureSerializer to resolve {index}
tokens in image placeholders using self_ref, as an alternative to
modifying the library default.
Ref: docling-project/docling-core#555
…ok (#3293)
* docs: add indexed picture placeholder example to serialization notebook
  Show how to subclass MarkdownPictureSerializer to resolve {index} tokens in image placeholders using self_ref, as an alternative to modifying the library default.
  Ref: docling-project/docling-core#555
* DCO Remediation Commit for nuri-yoo <nuri-yoo@users.noreply.github.com>
  I, nuri-yoo <nuri-yoo@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 82cb733
  Signed-off-by: nuri-yoo <nuri-yoo@users.noreply.github.com>
---------
Signed-off-by: nuri-yoo <nuri-yoo@users.noreply.github.com>
Co-authored-by: nuri-yoo <nuri-yoo@users.noreply.github.com>
Summary
Add sequential picture indexing to the markdown image placeholder by introducing a `{index}` format token in `image_placeholder`.
- Default placeholder changes from `"<!-- image -->"` to `"<!-- image_{index} -->"`, which renders as `<!-- image_0 -->`, `<!-- image_1 -->`, ...
- The index is taken from `item.self_ref` (e.g. `"#/pictures/6"` → `6`), matching JSON export references
- Custom placeholders without `{index}` are unaffected (`.replace()` is a no-op)
Changes
- `MarkdownParams.image_placeholder` default updated
- `MarkdownPictureSerializer._serialize_image_part()`: resolve `{index}` token before emitting placeholder
Testing
- `image_placeholder="<!-- image -->"` (no `{index}` token → no change)
Resolves docling-project/docling#3078
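Because each placeholder index matches a picture reference in the JSON export, a consumer of the Markdown could map placeholders back to `#/pictures/N` references. The following is a rough sketch under assumptions: the function name `picture_refs_in_markdown` and the sample text are made up for illustration, and the regular expression only covers the `<!-- image_{index} -->` format discussed in this PR.

```python
import re

# Matches the placeholder format discussed in this PR, e.g. "<!-- image_3 -->".
PLACEHOLDER_RE = re.compile(r"<!-- image_(\d+) -->")


def picture_refs_in_markdown(markdown: str) -> list[str]:
    """Return "#/pictures/N" references in the order the placeholders appear.

    Hypothetical helper for illustration only; it is not part of docling-core.
    """
    return [f"#/pictures/{m.group(1)}" for m in PLACEHOLDER_RE.finditer(markdown)]


# Made-up sample text; real output would come from the Markdown export.
sample = "Intro\n\n<!-- image_0 -->\n\nMore text\n\n<!-- image_6 -->\n"
print(picture_refs_in_markdown(sample))  # ['#/pictures/0', '#/pictures/6']
```

Placeholders without an index, such as `<!-- image -->`, are simply not matched, so they produce no references.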