HTML backend upgrade (docling v2.92, cherry-picking HTML-related code) #193
HeechanKim-Genon merged 3 commits into develop
Conversation
⚠️ Rate limit exceeded; the review was re-run after the wait time elapsed.
📝 Walkthrough

The PR introduces a backend options configuration hierarchy for document processors, expands HTML regression testing with sample files and test baselines, hardens chunk metadata computation against missing provenance data, and bumps docling-core dependencies to version 2.62.

Changes:
- Backend Options Infrastructure
- HTML Regression Testing
- Robust Chunk Metadata
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant HTML as HTML File
    participant Processor as Document Processor
    participant Backend as HTML Backend
    participant Options as Backend Options
    participant Output as Vector Output
    HTML->>Processor: Feed HTML sample
    Processor->>Options: Request backend options (kind: "html")
    Options->>Backend: Provide HTMLBackendOptions<br/>(rendering, image fetch config)
    Backend->>Backend: Parse & render HTML<br/>with configured options
    Backend->>Processor: Return DoclingDocument
    Processor->>Processor: Extract vectors &<br/>compute metadata
    alt Provenance exists
        Processor->>Output: Set page from provenance
    else No provenance
        Processor->>Output: Default page to 0
    end
    Processor->>Output: Emit vectors with<br/>robust metadata
```
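The provenance fallback shown in the diagram's `alt` branch can be sketched as follows (a minimal illustration with hypothetical types; `prov` and `page_no` names are taken from the diagram and review comments, the rest is assumed):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Provenance:
    page_no: int

@dataclass
class DocItem:
    # Provenance may be absent entirely on some items.
    prov: Optional[list] = None

def resolve_page(item: DocItem) -> int:
    """Return the page from provenance, defaulting to 0 when absent."""
    if item.prov:
        return item.prov[0].page_no
    return 0  # "No provenance" branch of the diagram
```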
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)

⚠️ Review ran into problems. Git: failed to clone the repository.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
genon/preprocessor/facade/intelligent_processor.py (2)
1290-1321: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Upload flow can still fail when `PictureItem.image` is `None`.

Line 1319 calls `get_media_files()`, but that helper reads `item.image.uri` without the null check added in `set_media_files()`. This reintroduces the same crash path when `upload_files` is enabled.

Suggested fix:

```diff
 def get_media_files(self, doc_items: list):
     temp_list = []
     for item in doc_items:
-        if isinstance(item, PictureItem):
+        if isinstance(item, PictureItem) and item.image:
             path = str(item.image.uri)
             name = path.rsplit("/", 1)[-1]
             temp_list.append({'path': path, 'name': name})
     return temp_list
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@genon/preprocessor/facade/intelligent_processor.py` around lines 1290 - 1321, The upload flow can still crash because get_media_files() accesses item.image.uri without the null check used in set_media_files(); update get_media_files() to mirror set_media_files() by skipping items where item is a PictureItem with image is None (or where item.image.uri is missing) and only include media entries with a valid uri, so the loop in intelligent_processor.py that calls get_media_files(chunk.meta.doc_items) before creating upload tasks will not pass None URIs to upload_files; ensure the function returns an empty list when no valid media found and preserves existing behavior for other item types.
944-960: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`set_chunk_bboxes()` still crashes when `item.prov` is missing.

Line 947 iterates `item.prov` directly. If any `DocItem` has `prov=None`, this still raises at runtime despite the new provenance-hardening changes.

Suggested fix:

```diff
 def set_chunk_bboxes(self, doc_items: list, document: DoclingDocument) -> "GenOSVectorMetaBuilder":
     chunk_bboxes = []
     for item in doc_items:
-        for prov in item.prov:
+        if not getattr(item, "prov", None):
+            continue
+        for prov in item.prov:
             label = item.self_ref
             type_ = item.label
             size = document.pages.get(prov.page_no).size
             page_no = prov.page_no
             bbox = prov.bbox
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@genon/preprocessor/facade/intelligent_processor.py` around lines 944 - 960, The set_chunk_bboxes method currently iterates item.prov directly and will crash if item.prov is None; update set_chunk_bboxes (in GenOSVectorMetaBuilder) to guard against missing provenance by treating item.prov as an empty iterable (e.g., skip the item when item.prov is falsy or iterate over (item.prov or [])). Ensure you reference item.self_ref, item.label and prov.page_no/prov.bbox only when prov exists, so chunk_bboxes remains correct and the subsequent e_page and self.chunk_bboxes assignments are still valid when no provenance is present.
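The guard recommended above, treating a missing or `None` `prov` as an empty iterable, is a small reusable idiom; a standalone sketch (class names are illustrative, not from the PR):

```python
def iter_provenances(item):
    """Yield provenance entries, treating a missing or None `prov` as empty."""
    yield from (getattr(item, "prov", None) or [])

class WithProv:
    prov = ["p1", "p2"]

class NoneProv:
    prov = None
```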
🧹 Nitpick comments (2)

genon/preprocessor/tests/regression/test_html_regression.py (1)

28-47: ⚡ Quick win

Extract shared vector-processing logic into a helper to eliminate duplication.

The vector-collection loop (lines 28–47 and 88–107) is copy-pasted verbatim between `run_html_test` and `create_html_baseline`. A future change to the serialization path (e.g., the `n_char` fallback or `chunk_bboxes` parsing) will need to be applied in two places.

♻️ Proposed refactor: extract `_collect_results`

```diff
+def _collect_results(vectors) -> dict:
+    """Process a list of vectors into a result dict."""
+    result = {
+        "num_vectors": len(vectors),
+        "vectors": [],
+        "label_distribution": {},
+        "total_characters": 0,
+    }
+    label_counts = Counter()
+    for vector in vectors:
+        if hasattr(vector, "model_dump"):
+            vector_data = vector.model_dump()
+        else:
+            vector_data = vector if isinstance(vector, dict) else vars(vector)
+
+        result["vectors"].append(vector_data)
+        result["total_characters"] += vector_data.get("n_char", len(vector_data.get("text", "")))
+
+        if "chunk_bboxes" in vector_data:
+            try:
+                bboxes = json.loads(vector_data["chunk_bboxes"])
+                for bbox in bboxes:
+                    if "type" in bbox:
+                        label_counts[bbox["type"]] += 1
+            except (json.JSONDecodeError, TypeError):
+                pass
+
+    result["label_distribution"] = dict(label_counts)
+    return result
+
 async def run_html_test(html_path, baseline_path, basic_processor):
     dp = basic_processor()
     if not baseline_path.exists():
         pytest.fail(...)
     vectors = await dp(None, str(html_path))
-    current_result = {...}
-    label_counts = Counter()
-    for vector in vectors:
-        ...
-    current_result["label_distribution"] = dict(label_counts)
+    current_result = _collect_results(vectors)
     ...

 async def create_html_baseline(html_path, baseline_path, basic_processor):
     dp = basic_processor()
     vectors = await dp(None, str(html_path))
-    result = {...}
-    label_counts = Counter()
-    for vector in vectors:
-        ...
-    result["label_distribution"] = dict(label_counts)
+    result = _collect_results(vectors)
     ...
```

Also applies to: 88-107
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@genon/preprocessor/tests/regression/test_html_regression.py` around lines 28 - 47, Extract the duplicated vector-processing loop into a single helper function (e.g., _collect_results) and update run_html_test and create_html_baseline to call it; the helper should accept the vectors iterable and the current_result dict, iterate vectors using the same logic currently in both places (use model_dump() when present else dict/vars(), append vector_data to current_result["vectors"], add n_char if present else len(text) to current_result["total_characters"], parse "chunk_bboxes" via json.loads and count bbox["type"] into a Counter while ignoring JSONDecodeError/TypeError), and finally set current_result["label_distribution"] from the Counter—this centralizes the n_char fallback, chunk_bboxes parsing and label counting so both call sites simply call _collect_results(vectors, current_result).

genon/preprocessor/facade/gitbook_doc/attachment_processor.md (1)
1165-1165: ⚡ Quick win

Consider adding a note distinguishing simple vs. enhanced HTML handling.

Since this PR introduces an enhanced HTML backend with advanced features (Playwright rendering, SSRF mitigation, enhanced table/list handling), readers might be confused about why `attachment_processor.py` uses the basic `UnstructuredFileLoader` for HTML instead of the new backend.

Consider adding a brief note (perhaps in section 11 or as a footnote) clarifying that:

- The attachment processor intentionally uses simple text extraction for HTML to maintain speed
- The enhanced HTML backend (with layout analysis, Playwright, etc.) is used in `intelligent_processor.py` for scenarios requiring higher-quality structure preservation

This would help readers understand the architectural distinction and when to use each processor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@genon/preprocessor/facade/gitbook_doc/attachment_processor.md` at line 1165, Add a short clarifying note in the docs near the HTML table row (or as a footnote/section 11) explaining that attachment_processor.py intentionally uses the basic UnstructuredFileLoader for fast, simple text extraction, while the enhanced HTML backend (referenced in intelligent_processor.py) provides Playwright rendering, SSRF mitigation, layout analysis and improved table/list handling for use-cases needing higher-quality structure preservation; mention when to prefer each processor and link to intelligent_processor.py/back-end docs for details.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@genon/preprocessor/sample_files/html_sample.html`:
- Line 16: Swap the incorrect descriptions for sub and sup in the sample so they
match the actual markup: change the explanatory text so that H<sub>2</sub>O is
labeled as a subscript example and E=mc<sup>2</sup> is labeled as a superscript
example; locate the snippet containing the strings "H<sub>2</sub>O" and
"E=mc<sup>2</sup>" and correct the wording accordingly.
In `@genon/preprocessor/tests/regression/README.md`:
- Around line 111-113: The README has inconsistent threshold wording: Line 63
says "85% 이상" but the example uses assert similarity > 0.85; update the example
to match the document by making the assertion inclusive (use assert similarity
>= 0.85) or else change the phrasing to "보다 큼" so both match; modify the usage
around similarity and difflib.SequenceMatcher in the example so the comparison
operator and the written threshold are consistent.
In `@genon/preprocessor/tests/regression/test_html_regression.py`:
- Line 24: The test currently collects label_distribution from chunk_bboxes but
never checks it; update run_html_test and/or create_html_baseline to either
assert equality of the computed label_distribution against the baseline or
remove the field entirely: if asserting, after the existing character-count
check in run_html_test compare the computed label_distribution to the baseline's
label_distribution and fail the test on mismatch; if removing, delete the
Counter construction and the label_distribution entries from
create_html_baseline, run_html_test, and the persisted baseline/result schema so
the field is not produced or read anywhere (search for symbols chunk_bboxes,
label_distribution, run_html_test, create_html_baseline to locate all usages).
- Line 9: The module-level computation of HTML_FILES can be empty and causes
pytest to collect zero tests; update
genon/preprocessor/tests/regression/test_html_regression.py so that after
computing HTML_FILES from SAMPLE_DIR you assert or explicitly fail when no files
are found (e.g., assert HTML_FILES, raise RuntimeError or call pytest.fail with
a clear message) so `@pytest.mark.parametrize` receives a non-empty list and the
test suite fails fast; alternatively compute the list inside the test or a
fixture (referencing HTML_FILES, SAMPLE_DIR, and the parametrize usage) and add
a guard that fails when no *.html files are present.
- Around line 49-55: The test currently accesses baseline["num_vectors"],
baseline["total_characters"], and baseline["vectors"] directly which raises a
raw KeyError for malformed baselines; after loading baseline in
test_html_regression.py, validate that baseline is a dict and contains the
required keys ("num_vectors", "total_characters", "vectors") and if any are
missing raise a clear assertion/pytest.fail with a descriptive message including
the html_path.name and which key(s) are missing; then continue to compare
current_result["num_vectors"], current_result["total_characters"], and
current_result["vectors"] against baseline values as before.
---
Outside diff comments:
In `@genon/preprocessor/facade/intelligent_processor.py`:
- Around line 1290-1321: The upload flow can still crash because
get_media_files() accesses item.image.uri without the null check used in
set_media_files(); update get_media_files() to mirror set_media_files() by
skipping items where item is a PictureItem with image is None (or where
item.image.uri is missing) and only include media entries with a valid uri, so
the loop in intelligent_processor.py that calls
get_media_files(chunk.meta.doc_items) before creating upload tasks will not pass
None URIs to upload_files; ensure the function returns an empty list when no
valid media found and preserves existing behavior for other item types.
- Around line 944-960: The set_chunk_bboxes method currently iterates item.prov
directly and will crash if item.prov is None; update set_chunk_bboxes (in
GenOSVectorMetaBuilder) to guard against missing provenance by treating
item.prov as an empty iterable (e.g., skip the item when item.prov is falsy or
iterate over (item.prov or [])). Ensure you reference item.self_ref, item.label
and prov.page_no/prov.bbox only when prov exists, so chunk_bboxes remains
correct and the subsequent e_page and self.chunk_bboxes assignments are still
valid when no provenance is present.
---
Nitpick comments:
In `@genon/preprocessor/facade/gitbook_doc/attachment_processor.md`:
- Line 1165: Add a short clarifying note in the docs near the HTML table row (or
as a footnote/section 11) explaining that attachment_processor.py intentionally
uses the basic UnstructuredFileLoader for fast, simple text extraction, while
the enhanced HTML backend (referenced in intelligent_processor.py) provides
Playwright rendering, SSRF mitigation, layout analysis and improved table/list
handling for use-cases needing higher-quality structure preservation; mention
when to prefer each processor and link to intelligent_processor.py/back-end docs
for details.
In `@genon/preprocessor/tests/regression/test_html_regression.py`:
- Around line 28-47: Extract the duplicated vector-processing loop into a single
helper function (e.g., _collect_results) and update run_html_test and
create_html_baseline to call it; the helper should accept the vectors iterable
and the current_result dict, iterate vectors using the same logic currently in
both places (use model_dump() when present else dict/vars(), append vector_data
to current_result["vectors"], add n_char if present else len(text) to
current_result["total_characters"], parse "chunk_bboxes" via json.loads and
count bbox["type"] into a Counter while ignoring JSONDecodeError/TypeError), and
finally set current_result["label_distribution"] from the Counter—this
centralizes the n_char fallback, chunk_bboxes parsing and label counting so both
call sites simply call _collect_results(vectors, current_result).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e3afa361-9a8e-40bd-8168-4dda97d0c5dd
⛔ Files ignored due to path filters (1)

- `genon/preprocessor/uv.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (15)

- docling/backend/abstract_backend.py
- docling/backend/html_backend.py
- docling/datamodel/backend_options.py
- genon/preprocessor/facade/gitbook_doc/attachment_processor.md
- genon/preprocessor/facade/intelligent_processor.py
- genon/preprocessor/pyproject.toml
- genon/preprocessor/sample_files/html_lists.html
- genon/preprocessor/sample_files/html_sample.html
- genon/preprocessor/sample_files/html_tables.html
- genon/preprocessor/tests/regression/README.md
- genon/preprocessor/tests/regression/baselines/html_html_lists.json
- genon/preprocessor/tests/regression/baselines/html_html_sample.json
- genon/preprocessor/tests/regression/baselines/html_html_tables.json
- genon/preprocessor/tests/regression/test_html_regression.py
- pyproject.toml
```python
similarity = difflib.SequenceMatcher(None, current_text, baseline_text).ratio()
assert similarity > 0.85
```
Please make the similarity-threshold wording consistent throughout the document.

Line 63 says "85% or higher" (85% 이상), but the example on Line 112 uses `assert similarity > 0.85`, so the handling of the boundary value (0.85) is inconsistent. Align them by switching to `>=` or by adjusting the wording.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/tests/regression/README.md` around lines 111 - 113, The
README has inconsistent threshold wording: Line 63 says "85% 이상" but the example
uses assert similarity > 0.85; update the example to match the document by
making the assertion inclusive (use assert similarity >= 0.85) or else change
the phrasing to "보다 큼" so both match; modify the usage around similarity and
difflib.SequenceMatcher in the example so the comparison operator and the
written threshold are consistent.
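To make the boundary behavior explicit, an inclusive check could be factored out like this (a sketch; the helper name and sample strings are illustrative, only `difflib.SequenceMatcher` and the 0.85 threshold come from the document):

```python
import difflib

def texts_match(current_text: str, baseline_text: str, threshold: float = 0.85) -> bool:
    """Inclusive comparison: a ratio exactly at the threshold passes."""
    similarity = difflib.SequenceMatcher(None, current_text, baseline_text).ratio()
    return similarity >= threshold
```

With `>=`, a pair of texts whose similarity is exactly 0.85 satisfies the "85% or higher" wording, which `>` would not.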
```python
# Auto-discover all HTML files in sample_files
SAMPLE_DIR = Path(__file__).resolve().parents[2] / "sample_files"
HTML_FILES = sorted([f for f in SAMPLE_DIR.glob("*.html") if f.is_file()])
```
HTML_FILES is computed at module import time — an empty sample_files/ directory silently skips all tests.
If SAMPLE_DIR is missing or contains no *.html files (e.g., in a fresh CI clone where sample files are gitignored), @pytest.mark.parametrize receives an empty list and pytest silently collects zero tests instead of failing. This can mask a broken environment.
🛡️ Proposed fix: assert at least one HTML file is found

```diff
 HTML_FILES = sorted([f for f in SAMPLE_DIR.glob("*.html") if f.is_file()])
+
+if not HTML_FILES:
+    raise FileNotFoundError(
+        f"No HTML sample files found in {SAMPLE_DIR}. "
+        "Ensure sample_files/*.html are committed to the repository."
+    )
```

Alternatively, add a dedicated test:

```diff
+def test_sample_files_exist():
+    assert HTML_FILES, f"No HTML files found in {SAMPLE_DIR}"
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
HTML_FILES = sorted([f for f in SAMPLE_DIR.glob("*.html") if f.is_file()])

if not HTML_FILES:
    raise FileNotFoundError(
        f"No HTML sample files found in {SAMPLE_DIR}. "
        "Ensure sample_files/*.html are committed to the repository."
    )
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/tests/regression/test_html_regression.py` at line 9, The
module-level computation of HTML_FILES can be empty and causes pytest to collect
zero tests; update genon/preprocessor/tests/regression/test_html_regression.py
so that after computing HTML_FILES from SAMPLE_DIR you assert or explicitly fail
when no files are found (e.g., assert HTML_FILES, raise RuntimeError or call
pytest.fail with a clear message) so `@pytest.mark.parametrize` receives a
non-empty list and the test suite fails fast; alternatively compute the list
inside the test or a fixture (referencing HTML_FILES, SAMPLE_DIR, and the
parametrize usage) and add a guard that fails when no *.html files are present.
```python
current_result = {
    "num_vectors": len(vectors),
    "vectors": [],
    "label_distribution": {},
```
label_distribution is collected but never asserted — either add an assertion or remove it.
Both run_html_test and create_html_baseline build label_distribution from chunk_bboxes, it is persisted in the baseline JSON, but the regression test never compares it. This means silent regressions in label distributions will go undetected, and if comparison is intentionally deferred, the dead data still costs parse overhead on every run.
🛡️ Proposed fix: assert `label_distribution` or drop it

If you want to assert it, add after the existing character-count check in `run_html_test`:

```diff
+    assert current_result["label_distribution"] == baseline.get("label_distribution", {}), (
+        f"[{html_path.name}] Label distribution mismatch: "
+        f"{current_result['label_distribution']} != {baseline.get('label_distribution', {})}"
+    )
```

Or, if label distribution tracking is intentionally deferred, remove the Counter and `label_distribution` fields entirely from both functions and the result schema to avoid confusion.
Also applies to: 47-47
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/tests/regression/test_html_regression.py` at line 24, The
test currently collects label_distribution from chunk_bboxes but never checks
it; update run_html_test and/or create_html_baseline to either assert equality
of the computed label_distribution against the baseline or remove the field
entirely: if asserting, after the existing character-count check in
run_html_test compare the computed label_distribution to the baseline's
label_distribution and fail the test on mismatch; if removing, delete the
Counter construction and the label_distribution entries from
create_html_baseline, run_html_test, and the persisted baseline/result schema so
the field is not produced or read anywhere (search for symbols chunk_bboxes,
label_distribution, run_html_test, create_html_baseline to locate all usages).
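For reference, the label counting being discussed reduces to a small pure function over the serialized `chunk_bboxes` field (field names are taken from the test file; the helper name is illustrative):

```python
import json
from collections import Counter

def count_bbox_labels(chunk_bboxes_json) -> dict:
    """Count bbox `type` labels from a serialized chunk_bboxes field.

    Malformed or non-string payloads contribute nothing, mirroring the
    test's broad except (json.JSONDecodeError, TypeError) handling.
    """
    label_counts = Counter()
    try:
        for bbox in json.loads(chunk_bboxes_json):
            if "type" in bbox:
                label_counts[bbox["type"]] += 1
    except (json.JSONDecodeError, TypeError):
        pass
    return dict(label_counts)
```

A helper like this would also make it trivial to assert the distribution against a baseline, since both sides become plain dicts.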
```python
with open(baseline_path, "r", encoding="utf-8") as f:
    baseline = json.load(f)

assert current_result["num_vectors"] == baseline["num_vectors"], (
    f"[{html_path.name}] Vector count mismatch: "
    f"{current_result['num_vectors']} != {baseline['num_vectors']}"
)
```
Unguarded key access on baseline dict will produce confusing KeyError on malformed baselines.
Lines 52, 58, and 65 access baseline["num_vectors"], baseline["total_characters"], and baseline["vectors"] directly. If a baseline file is corrupted, was generated with an older schema, or is unexpectedly empty, the test crashes with a KeyError rather than a clear diagnostic message.
🛡️ Proposed fix: validate baseline schema after loading

```diff
     with open(baseline_path, "r", encoding="utf-8") as f:
         baseline = json.load(f)
+    required_keys = {"num_vectors", "total_characters", "vectors"}
+    missing = required_keys - baseline.keys()
+    if missing:
+        pytest.fail(f"[{html_path.name}] Baseline is missing keys: {missing}. Re-run with -m 'update_baseline'.")
+
     assert current_result["num_vectors"] == baseline["num_vectors"], ...
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
with open(baseline_path, "r", encoding="utf-8") as f:
    baseline = json.load(f)

required_keys = {"num_vectors", "total_characters", "vectors"}
missing = required_keys - baseline.keys()
if missing:
    pytest.fail(f"[{html_path.name}] Baseline is missing keys: {missing}. Re-run with -m 'update_baseline'.")

assert current_result["num_vectors"] == baseline["num_vectors"], (
    f"[{html_path.name}] Vector count mismatch: "
    f"{current_result['num_vectors']} != {baseline['num_vectors']}"
)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@genon/preprocessor/tests/regression/test_html_regression.py` around lines 49
- 55, The test currently accesses baseline["num_vectors"],
baseline["total_characters"], and baseline["vectors"] directly which raises a
raw KeyError for malformed baselines; after loading baseline in
test_html_regression.py, validate that baseline is a dict and contains the
required keys ("num_vectors", "total_characters", "vectors") and if any are
missing raise a clear assertion/pytest.fail with a descriptive message including
the html_path.name and which key(s) are missing; then continue to compare
current_result["num_vectors"], current_result["total_characters"], and
current_result["vectors"] against baseline values as before.
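The schema guard being proposed is a plain set difference over the dict's keys; isolated, it looks like this (helper name is illustrative, the required keys come from the review comment):

```python
def missing_baseline_keys(baseline: dict) -> set:
    """Return the required baseline keys that are absent."""
    required_keys = {"num_vectors", "total_characters", "vectors"}
    return required_keys - baseline.keys()
```

Because `dict.keys()` supports set operations directly, no intermediate `set(...)` conversion is needed.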
Code Review
This pull request introduces a structured options system for document backends using Pydantic models and adds comprehensive support for HTML processing, including new regression tests and sample files. It also upgrades docling-core to version 2.62 and includes several stability fixes in the intelligent_processor to handle missing image data and provenance information. Feedback focuses on fixing a Pydantic inheritance issue where kind literals conflict, strengthening validation for HTML backend options, and refactoring duplicate logic in the new regression tests. Additionally, it is recommended to use a mock request object in tests to avoid potential attribute errors.
```python
class PdfBackendOptions(BaseBackendOptions):
    """Backend options for pdf document backends."""

    kind: Literal["pdf"] = Field("pdf", exclude=True, repr=False)
    password: Optional[SecretStr] = None


class MetsGbsBackendOptions(PdfBackendOptions):
    """Options specific to the METS-GBS document backend."""

    kind: Annotated[Literal["mets-gbs"], Field(exclude=True, repr=False)] = "mets-gbs"  # type: ignore[assignment]
```
`MetsGbsBackendOptions` inherits from `PdfBackendOptions`, but the parent class pins the `kind` field to `Literal["pdf"]`, which causes an error during Pydantic validation: in Pydantic v2, an inherited field's `Literal` type cannot be overridden with an incompatible value.

Extract the shared fields (e.g., `password`) into a separate base class (e.g., `PdfBaseOptions`) and have `PdfBackendOptions` and `MetsGbsBackendOptions` each inherit from it, resolving the `kind` literal conflict.

```python
class PdfBaseOptions(BaseBackendOptions):
    """Base options for PDF-based backends."""

    password: Optional[SecretStr] = None


class PdfBackendOptions(PdfBaseOptions):
    """Backend options for pdf document backends."""

    kind: Literal["pdf"] = Field("pdf", exclude=True, repr=False)


class MetsGbsBackendOptions(PdfBaseOptions):
    """Options specific to the METS-GBS document backend."""

    kind: Annotated[Literal["mets-gbs"], Field(exclude=True, repr=False)] = "mets-gbs"
```

```python
render_page_width: int = Field(
    794, description="Render page width in CSS pixels (A4 @ 96 DPI)."
)
render_page_height: int = Field(
    1123, description="Render page height in CSS pixels (A4 @ 96 DPI)."
)
render_page_orientation: Literal["portrait", "landscape"] = Field(
    "portrait", description="Render page orientation."
)
render_print_media: bool = Field(
    True, description="Use print media emulation when rendering."
)
render_wait_until: Literal["load", "domcontentloaded", "networkidle"] = Field(
    "networkidle",
    description="Playwright wait condition before extracting the DOM.",
)
render_wait_ms: int = Field(
    0, description="Extra delay in milliseconds after load."
)
render_device_scale: float = Field(
    1.0, description="Device scale factor for rendering."
)
page_padding: int = Field(
    0,
    description=(
        "Padding in CSS pixels applied to the HTML body before rendering."
    ),
)
render_full_page: bool = Field(
    False,
    description=("Capture a single full-height page image instead of paginating."),
)
render_dpi: int = Field(
    96, description="DPI used for page images created from rendering."
)
```
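The review summary above mentions strengthening validation for these HTML backend options. As a hedged sketch of what such constraints might check (plain Python for illustration only; in the PR's Pydantic models this would more naturally be expressed with `Field` constraints such as `gt=0` or `ge=0`):

```python
def validate_render_options(width: int, height: int, dpi: int,
                            wait_ms: int, device_scale: float) -> None:
    """Reject obviously invalid rendering parameters up front.

    Illustrative sketch: parameter names mirror the option fields above
    (render_page_width/height, render_dpi, render_wait_ms, render_device_scale).
    """
    if width <= 0 or height <= 0:
        raise ValueError("render page dimensions must be positive")
    if dpi <= 0:
        raise ValueError("render_dpi must be positive")
    if wait_ms < 0:
        raise ValueError("render_wait_ms cannot be negative")
    if device_scale <= 0:
        raise ValueError("render_device_scale must be positive")
```

The defaults shown in the options (794x1123 px, 96 DPI, 0 ms wait, scale 1.0) all pass such a check.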
```python
import difflib
import json
from collections import Counter

import pytest


async def run_html_test(html_path, baseline_path, basic_processor):
    """Run a regression test against one HTML file."""
    dp = basic_processor()

    if not baseline_path.exists():
        pytest.fail(
            f"Baseline not found: {baseline_path}. "
            "Run with -m 'update_baseline' to create."
        )

    vectors = await dp(None, str(html_path))

    current_result = {
        "num_vectors": len(vectors),
        "vectors": [],
        "label_distribution": {},
        "total_characters": 0,
    }

    label_counts = Counter()
    for vector in vectors:
        if hasattr(vector, "model_dump"):
            vector_data = vector.model_dump()
        else:
            vector_data = vector if isinstance(vector, dict) else vars(vector)

        current_result["vectors"].append(vector_data)
        current_result["total_characters"] += vector_data.get(
            "n_char", len(vector_data.get("text", ""))
        )

        if "chunk_bboxes" in vector_data:
            try:
                bboxes = json.loads(vector_data["chunk_bboxes"])
                for bbox in bboxes:
                    if "type" in bbox:
                        label_counts[bbox["type"]] += 1
            except (json.JSONDecodeError, TypeError):
                pass

    current_result["label_distribution"] = dict(label_counts)

    with open(baseline_path, "r", encoding="utf-8") as f:
        baseline = json.load(f)

    assert current_result["num_vectors"] == baseline["num_vectors"], (
        f"[{html_path.name}] Vector count mismatch: "
        f"{current_result['num_vectors']} != {baseline['num_vectors']}"
    )

    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, (
        f"[{html_path.name}] Character count difference too large: "
        f"{char_diff} chars ({char_ratio:.1%} change)"
    )

    for i, (current_vector, baseline_vector) in enumerate(
        zip(current_result["vectors"], baseline["vectors"])
    ):
        current_text = current_vector.get("text", "")
        baseline_text = baseline_vector.get("text", "")
        similarity = difflib.SequenceMatcher(None, current_text, baseline_text).ratio()
        assert similarity > 0.85, (
            f"[{html_path.name}] Vector {i} text similarity too low: {similarity:.2%}"
        )


async def create_html_baseline(html_path, baseline_path, basic_processor):
    """Create (or update) the baseline for one HTML file."""
    dp = basic_processor()

    vectors = await dp(None, str(html_path))

    result = {
        "num_vectors": len(vectors),
        "vectors": [],
        "label_distribution": {},
        "total_characters": 0,
    }

    label_counts = Counter()
    for vector in vectors:
        if hasattr(vector, "model_dump"):
            vector_data = vector.model_dump()
        else:
            vector_data = vector if isinstance(vector, dict) else vars(vector)

        result["vectors"].append(vector_data)
        result["total_characters"] += vector_data.get(
            "n_char", len(vector_data.get("text", ""))
        )

        if "chunk_bboxes" in vector_data:
            try:
                bboxes = json.loads(vector_data["chunk_bboxes"])
                for bbox in bboxes:
                    if "type" in bbox:
                        label_counts[bbox["type"]] += 1
            except (json.JSONDecodeError, TypeError):
                pass

    result["label_distribution"] = dict(label_counts)

    baseline_path.parent.mkdir(parents=True, exist_ok=True)
    with open(baseline_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)

    print(f"✓ Updated baseline: {baseline_path}")
```
PR #191: HTML Parsing Backend Upgrade

Summary

Based on docling v2.92, this PR substantially upgrades the HTML parsing backend (`html_backend.py`) and introduces a new backend options system (`backend_options.py`). It also fixes four `None`-access bugs in `intelligent_processor.py` and adds regression tests for HTML files.

Changes
HTML backend upgrade (`docling/backend/html_backend.py`)

Cherry-picked the HTML-related code from docling v2.92, substantially improving parsing quality.

Key additions and improvements:

- Text formatting: `<b>`, `<i>`, `<s>`, `<u>`, `<sub>`, `<sup>`, `<code>`, etc. are preserved as `Formatting` objects
- Lists: nested `<ol>`/`<ul>` and the `start` attribute are supported, with ordered-list numbering tracked accurately
- Tables: full `colspan`/`rowspan` support; `<caption>`/`<thead>`/`<tfoot>` parsing
- Images: `<figure>` + `<figcaption>` structure recognition, handling of standalone `<img>` tags, Base64 and remote image fetching
- Page rendering: with `HTMLBackendOptions.render_page=True`, the page is rendered in a headless browser to capture accurate bounding boxes and page images
- URL safety check (`_validate_url_safety`): blocks private IPs and localhost to prevent SSRF
- `_BLOCK_TAGS`/`_PARA_BREAKERS` restructuring: block vs. inline element classification redefined as explicit constant sets
Backend options system (`docling/datamodel/backend_options.py`, new file)

Introduces a Pydantic-based hierarchy of option classes shared by all backends.

Key `HTMLBackendOptions` fields:

- `render_page` (default `False`)
- `render_page_width`/`height`
- `render_wait_until` (default `networkidle`)
- `fetch_images` (default `False`)
- `source_uri` (default `None`)

`abstract_backend.py` change: an `options: Optional[BackendOptions] = None` parameter was added to the `AbstractDocumentBackend`/`DeclarativeDocumentBackend` constructors to wire them into the options system.
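Roughly, the hierarchy can be pictured like this; field names and defaults follow the list above, but the sketch uses stdlib dataclasses for brevity, whereas the PR's classes are Pydantic models:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackendOptions:
    """Base options shared by all document backends (Pydantic in the actual PR)."""

    kind: str = "base"


@dataclass
class HTMLBackendOptions(BackendOptions):
    kind: str = "html"
    render_page: bool = False          # render in a headless browser for bboxes/images
    render_wait_until: str = "networkidle"
    fetch_images: bool = False         # fetch remote images referenced by the page
    source_uri: Optional[str] = None   # base URI for resolving relative links


opts = HTMLBackendOptions(render_page=True)
assert opts.kind == "html" and opts.fetch_images is False
```

A backend constructed with `options=opts` can then branch on these flags without each backend re-declaring its own configuration plumbing.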
prov미포함 청크, 이미지 없는PictureItem)를 처리할 때 발생하던AttributeError/TypeError4건 수정:set_chunk_bboxes()e_page기본값None→0set_media_files()PictureItem.imageNone 체크 추가split_documents()chunk.meta.doc_items[0].prov비어있는 경우 건너뜀compose_vectors()prov비어있을 때chunk_page기본값0처리의존성 업데이트
docling-core:2.58.0→2.62.0(GraphData, RichTableCell 등 신규 타입 지원)HTML Regression Test 추가
tests/regression/test_html_regression.pysample_files/html_sample.htmlsample_files/html_tables.htmlsample_files/html_lists.html기존 PDF/DOCX 테스트와 달리 HTML 테스트는 assertion을 처음부터 활성화했습니다 (벡터 수 정확 일치, 문자 수 ±5%, 유사도 ≥85%).
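The similarity gate relies on `difflib.SequenceMatcher`, the same mechanism used in the test code; as an illustration (with made-up sample strings), a one-word drift stays comfortably above the 85% threshold:

```python
import difflib

baseline_text = "The quick brown fox jumps over the lazy dog."
current_text = "The quick brown fox jumped over the lazy dog."

# Per-vector text gate: similarity ratio must exceed 0.85.
similarity = difflib.SequenceMatcher(None, current_text, baseline_text).ratio()
assert similarity > 0.85

# Character-count gate: total drift must stay within ±5% of the baseline.
char_ratio = abs(len(current_text) - len(baseline_text)) / max(len(baseline_text), 1)
assert char_ratio < 0.05
```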
Test Plan

- `pytest tests/regression/test_html_regression.py -m update_baseline -v`: create baselines
- `pytest tests/regression/test_html_regression.py -m regression -v`: confirm the HTML regression tests pass
- `pytest tests/regression/ -m regression -v`: confirm no regressions in the existing HWP/HWPX tests

Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores
`docling-core` dependency updated to version 2.62.