
HTML backend upgrade (docling v2.92, cherry-picking HTML-related code) #193

Merged
HeechanKim-Genon merged 3 commits into develop from feature/191-html-parsing-backend on May 7, 2026

Conversation

@inoray
Collaborator

@inoray inoray commented May 7, 2026

PR #191: HTML Parsing Backend Upgrade

Summary

This PR substantially upgrades the HTML parsing backend (html_backend.py) to the docling v2.92 baseline and introduces a new backend options system (backend_options.py). It also fixes four None-access bugs in intelligent_processor.py and adds regression tests for HTML files.


Changes

HTML Backend upgrade (docling/backend/html_backend.py)

HTML-related code was cherry-picked from docling v2.92, substantially improving parsing quality.

Key additions and improvements:

  • Inline formatting preserved: text formatting such as <b>, <i>, <s>, <u>, <sub>, <sup>, and <code> is kept as Formatting objects
  • Stronger list handling: nested <ol> / <ul> and the start attribute are supported, and ordered-list numbering is tracked accurately
  • Stronger table handling: full colspan / rowspan support plus <caption> / <thead> / <tfoot> parsing
  • Image/figure handling: recognizes <figure> + <figcaption> structures, handles standalone <img> tags, and supports Base64 and remote image fetching
  • Playwright rendering support: with HTMLBackendOptions.render_page=True, pages are rendered in a headless browser to capture accurate bounding boxes and page images
  • URL safety validation added (_validate_url_safety): blocks private IPs / localhost to prevent SSRF (see the sketch after this list)
  • _BLOCK_TAGS / _PARA_BREAKERS restructured: block-level and inline elements are reclassified into explicit constant sets
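As a rough illustration of the SSRF guard mentioned in the _validate_url_safety bullet, the check could look something like this (a minimal sketch; the actual implementation cherry-picked from docling may differ):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_url_safety(url: str) -> None:
    """Illustrative only: reject URLs that resolve to localhost or private ranges."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {parsed.scheme!r}")
    host = parsed.hostname or ""
    if host == "localhost":
        raise ValueError("Refusing to fetch from localhost")
    # Resolve the host and block loopback, private, and link-local addresses.
    for info in socket.getaddrinfo(host, None):
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_loopback or ip.is_private or ip.is_link_local:
            raise ValueError(f"Refusing to fetch from non-public address {ip}")
```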

Backend options system introduced (docling/datamodel/backend_options.py, new file)

A new Pydantic-based option class hierarchy is added that all backends can share.

```
BaseBackendOptions
├── DeclarativeBackendOptions
├── HTMLBackendOptions      ← render_page, fetch_images, source_uri, etc.
├── MarkdownBackendOptions
├── PdfBackendOptions
├── MsExcelBackendOptions
└── ...
```
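The review walkthrough below notes that each subclass carries a kind discriminator and that a BackendOptions discriminated union ties them together. Declared roughly, that alias could look like this (a sketch, not the exact code; the member list and import path are assumptions):

```python
from typing import Annotated, Union
from pydantic import Field

# Assumes these classes are defined in docling/datamodel/backend_options.py as listed above.
BackendOptions = Annotated[
    Union[
        DeclarativeBackendOptions,
        HTMLBackendOptions,
        MarkdownBackendOptions,
        PdfBackendOptions,
        MsExcelBackendOptions,
    ],
    Field(discriminator="kind"),
]
```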

Key HTMLBackendOptions fields:

| Option | Default | Description |
| --- | --- | --- |
| render_page | False | Enable Playwright headless rendering |
| render_page_width / render_page_height | 794 / 1123 px | A4 at 96 DPI |
| render_wait_until | networkidle | Wait condition before extracting the DOM |
| fetch_images | False | Allow access to remote/local images |
| source_uri | None | Source URI used to resolve relative paths |
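A minimal usage sketch for these options, assuming HTMLBackendOptions is importable from docling.datamodel.backend_options and that unspecified fields keep the defaults listed above:

```python
from docling.datamodel.backend_options import HTMLBackendOptions

# Enable Playwright rendering and remote image fetching; everything else stays at its default.
options = HTMLBackendOptions(
    render_page=True,
    fetch_images=True,
    source_uri="https://example.com/docs/page.html",  # hypothetical source URL
)
print(options.render_page_width, options.render_page_height)  # 794 1123 (A4 @ 96 DPI defaults)
```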

abstract_backend.py changes

An options: Optional[BackendOptions] = None parameter was added to the AbstractDocumentBackend / DeclarativeDocumentBackend constructors to wire them into the options system.
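In rough outline, the constructor change could look like this (a sketch based on this summary and the review walkthrough; anything beyond the options parameter and the self.options default is an assumption):

```python
from abc import ABC
from typing import Optional

# Assumed import path, matching the new module introduced in this PR.
from docling.datamodel.backend_options import BackendOptions, BaseBackendOptions

class AbstractDocumentBackend(ABC):
    def __init__(self, in_doc, path_or_stream, options: Optional[BackendOptions] = None):
        self.document = in_doc
        self.path_or_stream = path_or_stream
        # New: store backend-specific options, falling back to the shared defaults.
        self.options = options or BaseBackendOptions()
```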

intelligent_processor.py safety patches

Fixed four AttributeError / TypeError cases that occurred when processing documents without page information, including HTML (chunks without prov, PictureItems without an image):

| Location | Fix |
| --- | --- |
| set_chunk_bboxes() | e_page defaults to 0 instead of None |
| set_media_files() | Added a None check for PictureItem.image |
| split_documents() | Skips chunks whose chunk.meta.doc_items[0].prov is empty |
| compose_vectors() | chunk_page defaults to 0 when prov is empty |
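The split_documents() and compose_vectors() fixes follow the same defensive pattern; roughly (a sketch using the names from the table, not the exact code):

```python
# Only derive a page number when provenance actually exists; otherwise fall back to 0.
doc_items = chunk.meta.doc_items
prov = doc_items[0].prov if doc_items else None
chunk_page = prov[0].page_no if prov else 0
```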

Dependency update

docling-core: 2.58.0 → 2.62.0 (adds support for new types such as GraphData and RichTableCell)

HTML regression tests added

| File | Description |
| --- | --- |
| tests/regression/test_html_regression.py | Auto-discovers HTML files and compares results against baselines |
| sample_files/html_sample.html | Heading hierarchy, text formatting, code blocks, block quotes |
| sample_files/html_tables.html | Complex tables with colspan / rowspan |
| sample_files/html_lists.html | Nested lists, start attribute, dl definition lists |

Unlike the existing PDF/DOCX tests, the HTML tests enable assertions from the start (exact vector count match, character count within ±5%, similarity ≥ 85%).


Test Plan

  • pytest tests/regression/test_html_regression.py -m update_baseline -v — generate baselines
  • pytest tests/regression/test_html_regression.py -m regression -v — confirm the HTML regression tests pass
  • pytest tests/regression/ -m regression -v — confirm no regressions in the existing HWP/HWPX tests
  • Convert HTML files directly with docling and visually inspect heading, table, list, and image parsing results

Summary by CodeRabbit

  • New Features

    • Added configurable options system for document backends to enable backend-specific settings.
    • Introduced HTML regression testing with automatic baseline validation.
  • Bug Fixes

    • Improved handling of missing metadata in chunk processing for better robustness.
  • Tests

    • Added HTML sample files and regression test suite with baseline fixtures.
  • Chores

    • Updated docling-core dependency to version 2.62.

@inoray inoray linked an issue May 7, 2026 that may be closed by this pull request
@coderabbitai

coderabbitai Bot commented May 7, 2026

Warning

Rate limit exceeded

@inoray has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 27 minutes and 38 seconds before requesting another review.

To continue reviewing without waiting, purchase usage credits in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e672c603-ffa8-48f1-8e9b-2f4c716433ce

📥 Commits

Reviewing files that changed from the base of the PR and between 93557b6 and c622f8b.

📒 Files selected for processing (5)
  • genon/preprocessor/facade/attachment_processor.py
  • genon/preprocessor/facade/convert_processor.py
  • genon/preprocessor/facade/gitbook_doc/convert_processor.md
  • genon/preprocessor/facade/gitbook_doc/intelligent_processor.md
  • genon/preprocessor/sample_files/html_sample.html
📝 Walkthrough

The PR introduces a backend options configuration hierarchy for document processors, expands HTML regression testing with sample files and test baselines, hardens chunk metadata computation against missing provenance data, and bumps docling-core dependencies to version 2.62.

Changes

Backend Options Infrastructure

| Layer | File(s) | Summary |
| --- | --- | --- |
| Options Data Model | docling/datamodel/backend_options.py | New Pydantic options hierarchy: BaseBackendOptions with shared fetch toggles, eight backend-specific option classes (DeclarativeBackendOptions, HTMLBackendOptions, MarkdownBackendOptions, PdfBackendOptions, MetsGbsBackendOptions, MsExcelBackendOptions, LatexBackendOptions, XBRLBackendOptions), each with a kind discriminator and backend-specific config, and a BackendOptions discriminated union type alias. |
| Backend Integration | docling/backend/abstract_backend.py | AbstractDocumentBackend.__init__ now accepts an optional BaseBackendOptions parameter and stores self.options (defaults to BaseBackendOptions()); return type of supported_formats changed from typing.Set to set; DeclarativeDocumentBackend.__init__ added with a BackendOptions parameter defaulting to DeclarativeBackendOptions(). |
| Dependency Updates | pyproject.toml, genon/preprocessor/pyproject.toml | docling-core dependency bumped from 2.58 to 2.62 in both root and preprocessor projects. |

HTML Regression Testing

| Layer | File(s) | Summary |
| --- | --- | --- |
| Test Fixtures | genon/preprocessor/sample_files/html_sample.html, html_lists.html, html_tables.html | Three new HTML sample documents covering headings, formatting, lists (unordered, ordered, definition), nested structures, tables (with colspan/rowspan), links, code blocks, blockquotes, and special characters. |
| Test Suite | genon/preprocessor/tests/regression/test_html_regression.py | Auto-discovers HTML samples and validates processing output: asserts vector count match, enforces ±5% character count tolerance, verifies per-vector text similarity > 0.85 using difflib.SequenceMatcher; includes an update mode to regenerate baselines. |
| Test Baselines | genon/preprocessor/tests/regression/baselines/html_html_sample.json, html_html_lists.json, html_html_tables.json | Three JSON baseline fixtures containing expected vector outputs, character/word/line counts, page indices, and aggregated totals for regression assertions. |
| Documentation | genon/preprocessor/tests/regression/README.md, genon/preprocessor/facade/gitbook_doc/attachment_processor.md | README expanded with an "HTML 테스트 상세" section detailing HTML sample coverage, enabled assertions, and the baseline workflow; attachment processor doc updated to include .html routing via UnstructuredFileLoader. |

Robust Chunk Metadata

| Layer | File(s) | Summary |
| --- | --- | --- |
| Defensive Provenance Handling | genon/preprocessor/facade/intelligent_processor.py | GenOSVectorMetaBuilder.set_chunk_bboxes() defaults e_page to 0 when no bounding boxes exist; DocumentProcessor.split_documents() increments page counts only when a chunk has provenance; DocumentProcessor.compose_vectors() safely derives chunk_page by checking for provenance and defaulting to 0 when absent. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant HTML as HTML File
    participant Processor as Document Processor
    participant Backend as HTML Backend
    participant Options as Backend Options
    participant Output as Vector Output

    HTML->>Processor: Feed HTML sample
    Processor->>Options: Request backend options (kind: "html")
    Options->>Backend: Provide HTMLBackendOptions<br/>(rendering, image fetch config)
    Backend->>Backend: Parse & render HTML<br/>with configured options
    Backend->>Processor: Return DoclingDocument
    Processor->>Processor: Extract vectors &<br/>compute metadata
    alt Provenance exists
        Processor->>Output: Set page from provenance
    else No provenance
        Processor->>Output: Default page to 0
    end
    Processor->>Output: Emit vectors with<br/>robust metadata
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through HTML fields,
With options and defensive shields,
Each sample tested, baselines blessed,
The regression suite has passed the test—
Robustness blooms where gaps once grew. 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly references the main change: HTML backend upgrade with docling v2.92 cherry-picking, which matches the PR's primary objective of upgrading HTML parsing capabilities. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
genon/preprocessor/facade/intelligent_processor.py (2)

1290-1321: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Upload flow can still fail on PictureItem.image is None.

Line 1319 calls get_media_files(), but that helper reads item.image.uri without the null check added in set_media_files(). This reintroduces the same crash path when upload_files is enabled.

Suggested fix
```diff
 def get_media_files(self, doc_items: list):
     temp_list = []
     for item in doc_items:
-        if isinstance(item, PictureItem):
+        if isinstance(item, PictureItem) and item.image:
             path = str(item.image.uri)
             name = path.rsplit("/", 1)[-1]
             temp_list.append({'path': path, 'name': name})
     return temp_list
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/intelligent_processor.py` around lines 1290 - 1321,
The upload flow can still crash because get_media_files() accesses
item.image.uri without the null check used in set_media_files(); update
get_media_files() to mirror set_media_files() by skipping items where item is a
PictureItem with image is None (or where item.image.uri is missing) and only
include media entries with a valid uri, so the loop in intelligent_processor.py
that calls get_media_files(chunk.meta.doc_items) before creating upload tasks
will not pass None URIs to upload_files; ensure the function returns an empty
list when no valid media found and preserves existing behavior for other item
types.

944-960: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

set_chunk_bboxes() still crashes when item.prov is missing.

Line 947 iterates item.prov directly. If any DocItem has prov=None, this still raises at runtime despite the new provenance-hardening changes.

Suggested fix
```diff
 def set_chunk_bboxes(self, doc_items: list, document: DoclingDocument) -> "GenOSVectorMetaBuilder":
     chunk_bboxes = []
     for item in doc_items:
-        for prov in item.prov:
+        if not getattr(item, "prov", None):
+            continue
+        for prov in item.prov:
             label = item.self_ref
             type_ = item.label
             size = document.pages.get(prov.page_no).size
             page_no = prov.page_no
             bbox = prov.bbox
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/intelligent_processor.py` around lines 944 - 960,
The set_chunk_bboxes method currently iterates item.prov directly and will crash
if item.prov is None; update set_chunk_bboxes (in GenOSVectorMetaBuilder) to
guard against missing provenance by treating item.prov as an empty iterable
(e.g., skip the item when item.prov is falsy or iterate over (item.prov or [])).
Ensure you reference item.self_ref, item.label and prov.page_no/prov.bbox only
when prov exists, so chunk_bboxes remains correct and the subsequent e_page and
self.chunk_bboxes assignments are still valid when no provenance is present.
🧹 Nitpick comments (2)
genon/preprocessor/tests/regression/test_html_regression.py (1)

28-47: ⚡ Quick win

Extract shared vector-processing logic into a helper to eliminate duplication.

The vector-collection loop (lines 28–47 and 88–107) is copy-pasted verbatim between run_html_test and create_html_baseline. A future change to the serialization path (e.g., the n_char fallback or chunk_bboxes parsing) will need to be applied in two places.

♻️ Proposed refactor: extract `_collect_results`
```diff
+def _collect_results(vectors) -> dict:
+    """Process a list of vectors into a result dict."""
+    result = {
+        "num_vectors": len(vectors),
+        "vectors": [],
+        "label_distribution": {},
+        "total_characters": 0,
+    }
+    label_counts = Counter()
+    for vector in vectors:
+        if hasattr(vector, "model_dump"):
+            vector_data = vector.model_dump()
+        else:
+            vector_data = vector if isinstance(vector, dict) else vars(vector)
+
+        result["vectors"].append(vector_data)
+        result["total_characters"] += vector_data.get("n_char", len(vector_data.get("text", "")))
+
+        if "chunk_bboxes" in vector_data:
+            try:
+                bboxes = json.loads(vector_data["chunk_bboxes"])
+                for bbox in bboxes:
+                    if "type" in bbox:
+                        label_counts[bbox["type"]] += 1
+            except (json.JSONDecodeError, TypeError):
+                pass
+
+    result["label_distribution"] = dict(label_counts)
+    return result
+
 async def run_html_test(html_path, baseline_path, basic_processor):
     dp = basic_processor()
     if not baseline_path.exists():
         pytest.fail(...)
     vectors = await dp(None, str(html_path))
-    current_result = {
-        "num_vectors": len(vectors),
-        "vectors": [],
-        "label_distribution": {},
-        "total_characters": 0,
-    }
-    label_counts = Counter()
-    for vector in vectors:
-        ...
-    current_result["label_distribution"] = dict(label_counts)
+    current_result = _collect_results(vectors)
     ...

 async def create_html_baseline(html_path, baseline_path, basic_processor):
     dp = basic_processor()
     vectors = await dp(None, str(html_path))
-    result = {
-        "num_vectors": len(vectors),
-        "vectors": [],
-        "label_distribution": {},
-        "total_characters": 0,
-    }
-    label_counts = Counter()
-    for vector in vectors:
-        ...
-    result["label_distribution"] = dict(label_counts)
+    result = _collect_results(vectors)
     ...
```

Also applies to: 88-107

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/tests/regression/test_html_regression.py` around lines 28
- 47, Extract the duplicated vector-processing loop into a single helper
function (e.g., _collect_results) and update run_html_test and
create_html_baseline to call it; the helper should accept the vectors iterable
and the current_result dict, iterate vectors using the same logic currently in
both places (use model_dump() when present else dict/vars(), append vector_data
to current_result["vectors"], add n_char if present else len(text) to
current_result["total_characters"], parse "chunk_bboxes" via json.loads and
count bbox["type"] into a Counter while ignoring JSONDecodeError/TypeError), and
finally set current_result["label_distribution"] from the Counter—this
centralizes the n_char fallback, chunk_bboxes parsing and label counting so both
call sites simply call _collect_results(vectors, current_result).
genon/preprocessor/facade/gitbook_doc/attachment_processor.md (1)

1165-1165: ⚡ Quick win

Consider adding a note distinguishing simple vs. enhanced HTML handling.

Since this PR introduces an enhanced HTML backend with advanced features (Playwright rendering, SSRF mitigation, enhanced table/list handling), readers might be confused about why attachment_processor.py uses the basic UnstructuredFileLoader for HTML instead of the new backend.

Consider adding a brief note (perhaps in section 11 or as a footnote) clarifying that:

  • The attachment processor intentionally uses simple text extraction for HTML to maintain speed
  • The enhanced HTML backend (with layout analysis, Playwright, etc.) is used in intelligent_processor.py for scenarios requiring higher-quality structure preservation

This would help readers understand the architectural distinction and when to use each processor.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/gitbook_doc/attachment_processor.md` at line 1165,
Add a short clarifying note in the docs near the HTML table row (or as a
footnote/section 11) explaining that attachment_processor.py intentionally uses
the basic UnstructuredFileLoader for fast, simple text extraction, while the
enhanced HTML backend (referenced in intelligent_processor.py) provides
Playwright rendering, SSRF mitigation, layout analysis and improved table/list
handling for use-cases needing higher-quality structure preservation; mention
when to prefer each processor and link to intelligent_processor.py/back-end docs
for details.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@genon/preprocessor/sample_files/html_sample.html`:
- Line 16: Swap the incorrect descriptions for sub and sup in the sample so they
match the actual markup: change the explanatory text so that H<sub>2</sub>O is
labeled as a subscript example and E=mc<sup>2</sup> is labeled as a superscript
example; locate the snippet containing the strings "H<sub>2</sub>O" and
"E=mc<sup>2</sup>" and correct the wording accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e3afa361-9a8e-40bd-8168-4dda97d0c5dd

📥 Commits

Reviewing files that changed from the base of the PR and between 847e1c4 and 93557b6.

⛔ Files ignored due to path filters (1)
  • genon/preprocessor/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • docling/backend/abstract_backend.py
  • docling/backend/html_backend.py
  • docling/datamodel/backend_options.py
  • genon/preprocessor/facade/gitbook_doc/attachment_processor.md
  • genon/preprocessor/facade/intelligent_processor.py
  • genon/preprocessor/pyproject.toml
  • genon/preprocessor/sample_files/html_lists.html
  • genon/preprocessor/sample_files/html_sample.html
  • genon/preprocessor/sample_files/html_tables.html
  • genon/preprocessor/tests/regression/README.md
  • genon/preprocessor/tests/regression/baselines/html_html_lists.json
  • genon/preprocessor/tests/regression/baselines/html_html_sample.json
  • genon/preprocessor/tests/regression/baselines/html_html_tables.json
  • genon/preprocessor/tests/regression/test_html_regression.py
  • pyproject.toml

Comment thread genon/preprocessor/tests/regression/README.md (Outdated)
Comment on lines +111 to +113

    similarity = difflib.SequenceMatcher(None, current_text, baseline_text).ratio()
    assert similarity > 0.85
    ```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Please make the similarity-threshold wording consistent across the document.

Line 63 says "85% 이상" (at least 85%), but the example on Line 112 uses assert similarity > 0.85, so the handling of the boundary value (0.85) does not match the description. Align them by switching to >= or by adjusting the wording.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/tests/regression/README.md` around lines 111 - 113, The
README has inconsistent threshold wording: Line 63 says "85% 이상" but the example
uses assert similarity > 0.85; update the example to match the document by
making the assertion inclusive (use assert similarity >= 0.85) or else change
the phrasing to "보다 큼" so both match; modify the usage around similarity and
difflib.SequenceMatcher in the example so the comparison operator and the
written threshold are consistent.


```python
# Auto-discover all HTML files in sample_files
SAMPLE_DIR = Path(__file__).resolve().parents[2] / "sample_files"
HTML_FILES = sorted([f for f in SAMPLE_DIR.glob("*.html") if f.is_file()])
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

HTML_FILES is computed at module import time — an empty sample_files/ directory silently skips all tests.

If SAMPLE_DIR is missing or contains no *.html files (e.g., in a fresh CI clone where sample files are gitignored), @pytest.mark.parametrize receives an empty list and pytest silently collects zero tests instead of failing. This can mask a broken environment.

🛡️ Proposed fix: assert at least one HTML file is found
```diff
 HTML_FILES = sorted([f for f in SAMPLE_DIR.glob("*.html") if f.is_file()])
+
+if not HTML_FILES:
+    raise FileNotFoundError(
+        f"No HTML sample files found in {SAMPLE_DIR}. "
+        "Ensure sample_files/*.html are committed to the repository."
+    )
```

Alternatively, add a dedicated test:

```diff
+def test_sample_files_exist():
+    assert HTML_FILES, f"No HTML files found in {SAMPLE_DIR}"
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-HTML_FILES = sorted([f for f in SAMPLE_DIR.glob("*.html") if f.is_file()])
+HTML_FILES = sorted([f for f in SAMPLE_DIR.glob("*.html") if f.is_file()])
+
+if not HTML_FILES:
+    raise FileNotFoundError(
+        f"No HTML sample files found in {SAMPLE_DIR}. "
+        "Ensure sample_files/*.html are committed to the repository."
+    )
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/tests/regression/test_html_regression.py` at line 9, The
module-level computation of HTML_FILES can be empty and causes pytest to collect
zero tests; update genon/preprocessor/tests/regression/test_html_regression.py
so that after computing HTML_FILES from SAMPLE_DIR you assert or explicitly fail
when no files are found (e.g., assert HTML_FILES, raise RuntimeError or call
pytest.fail with a clear message) so `@pytest.mark.parametrize` receives a
non-empty list and the test suite fails fast; alternatively compute the list
inside the test or a fixture (referencing HTML_FILES, SAMPLE_DIR, and the
parametrize usage) and add a guard that fails when no *.html files are present.

```python
    current_result = {
        "num_vectors": len(vectors),
        "vectors": [],
        "label_distribution": {},
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

label_distribution is collected but never asserted — either add an assertion or remove it.

Both run_html_test and create_html_baseline build label_distribution from chunk_bboxes, it is persisted in the baseline JSON, but the regression test never compares it. This means silent regressions in label distributions will go undetected, and if comparison is intentionally deferred, the dead data still costs parse overhead on every run.

🛡️ Proposed fix: assert label_distribution or drop it

If you want to assert it, add after the existing character-count check in run_html_test:

```diff
+    assert current_result["label_distribution"] == baseline.get("label_distribution", {}), (
+        f"[{html_path.name}] Label distribution mismatch: "
+        f"{current_result['label_distribution']} != {baseline.get('label_distribution', {})}"
+    )
```

Or, if label distribution tracking is intentionally deferred, remove the Counter and label_distribution fields entirely from both functions and the result schema to avoid confusion.

Also applies to: 47-47

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/tests/regression/test_html_regression.py` at line 24, The
test currently collects label_distribution from chunk_bboxes but never checks
it; update run_html_test and/or create_html_baseline to either assert equality
of the computed label_distribution against the baseline or remove the field
entirely: if asserting, after the existing character-count check in
run_html_test compare the computed label_distribution to the baseline's
label_distribution and fail the test on mismatch; if removing, delete the
Counter construction and the label_distribution entries from
create_html_baseline, run_html_test, and the persisted baseline/result schema so
the field is not produced or read anywhere (search for symbols chunk_bboxes,
label_distribution, run_html_test, create_html_baseline to locate all usages).

Comment on lines +49 to +55
```python
    with open(baseline_path, "r", encoding="utf-8") as f:
        baseline = json.load(f)

    assert current_result["num_vectors"] == baseline["num_vectors"], (
        f"[{html_path.name}] Vector count mismatch: "
        f"{current_result['num_vectors']} != {baseline['num_vectors']}"
    )
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unguarded key access on baseline dict will produce confusing KeyError on malformed baselines.

Lines 52, 58, and 65 access baseline["num_vectors"], baseline["total_characters"], and baseline["vectors"] directly. If a baseline file is corrupted, was generated with an older schema, or is unexpectedly empty, the test crashes with a KeyError rather than a clear diagnostic message.

🛡️ Proposed fix: validate baseline schema after loading
```diff
     with open(baseline_path, "r", encoding="utf-8") as f:
         baseline = json.load(f)
 
+    required_keys = {"num_vectors", "total_characters", "vectors"}
+    missing = required_keys - baseline.keys()
+    if missing:
+        pytest.fail(f"[{html_path.name}] Baseline is missing keys: {missing}. Re-run with -m 'update_baseline'.")
+
     assert current_result["num_vectors"] == baseline["num_vectors"], ...
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-    with open(baseline_path, "r", encoding="utf-8") as f:
-        baseline = json.load(f)
-
-    assert current_result["num_vectors"] == baseline["num_vectors"], (
-        f"[{html_path.name}] Vector count mismatch: "
-        f"{current_result['num_vectors']} != {baseline['num_vectors']}"
-    )
+    with open(baseline_path, "r", encoding="utf-8") as f:
+        baseline = json.load(f)
+
+    required_keys = {"num_vectors", "total_characters", "vectors"}
+    missing = required_keys - baseline.keys()
+    if missing:
+        pytest.fail(f"[{html_path.name}] Baseline is missing keys: {missing}. Re-run with -m 'update_baseline'.")
+
+    assert current_result["num_vectors"] == baseline["num_vectors"], (
+        f"[{html_path.name}] Vector count mismatch: "
+        f"{current_result['num_vectors']} != {baseline['num_vectors']}"
+    )
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/tests/regression/test_html_regression.py` around lines 49
- 55, The test currently accesses baseline["num_vectors"],
baseline["total_characters"], and baseline["vectors"] directly which raises a
raw KeyError for malformed baselines; after loading baseline in
test_html_regression.py, validate that baseline is a dict and contains the
required keys ("num_vectors", "total_characters", "vectors") and if any are
missing raise a clear assertion/pytest.fail with a descriptive message including
the html_path.name and which key(s) are missing; then continue to compare
current_result["num_vectors"], current_result["total_characters"], and
current_result["vectors"] against baseline values as before.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a structured options system for document backends using Pydantic models and adds comprehensive support for HTML processing, including new regression tests and sample files. It also upgrades docling-core to version 2.62 and includes several stability fixes in the intelligent_processor to handle missing image data and provenance information. Feedback focuses on fixing a Pydantic inheritance issue where kind literals conflict, strengthening validation for HTML backend options, and refactoring duplicate logic in the new regression tests. Additionally, it is recommended to use a mock request object in tests to avoid potential attribute errors.

Comment on lines +123 to +133
```python
class PdfBackendOptions(BaseBackendOptions):
    """Backend options for pdf document backends."""

    kind: Literal["pdf"] = Field("pdf", exclude=True, repr=False)
    password: Optional[SecretStr] = None


class MetsGbsBackendOptions(PdfBackendOptions):
    """Options specific to the METS-GBS document backend."""

    kind: Annotated[Literal["mets-gbs"], Field(exclude=True, repr=False)] = "mets-gbs"  # type: ignore[assignment]
```

high

MetsGbsBackendOptions inherits from PdfBackendOptions, but the parent class pins the kind field to Literal["pdf"], which causes a Pydantic validation error: in Pydantic v2 an inherited field's Literal type cannot be redefined with an incompatible value.

Extract the shared fields (e.g., password) into a separate base class (e.g., PdfBaseOptions) and have PdfBackendOptions and MetsGbsBackendOptions each inherit from it, resolving the kind literal conflict.

```python
class PdfBaseOptions(BaseBackendOptions):
    """Base options for PDF-based backends."""
    password: Optional[SecretStr] = None


class PdfBackendOptions(PdfBaseOptions):
    """Backend options for pdf document backends."""

    kind: Literal["pdf"] = Field("pdf", exclude=True, repr=False)


class MetsGbsBackendOptions(PdfBaseOptions):
    """Options specific to the METS-GBS document backend."""

    kind: Annotated[Literal["mets-gbs"], Field(exclude=True, repr=False)] = "mets-gbs"
```

Comment on lines +38 to +72
```python
    render_page_width: int = Field(
        794, description="Render page width in CSS pixels (A4 @ 96 DPI)."
    )
    render_page_height: int = Field(
        1123, description="Render page height in CSS pixels (A4 @ 96 DPI)."
    )
    render_page_orientation: Literal["portrait", "landscape"] = Field(
        "portrait", description="Render page orientation."
    )
    render_print_media: bool = Field(
        True, description="Use print media emulation when rendering."
    )
    render_wait_until: Literal["load", "domcontentloaded", "networkidle"] = Field(
        "networkidle",
        description="Playwright wait condition before extracting the DOM.",
    )
    render_wait_ms: int = Field(
        0, description="Extra delay in milliseconds after load."
    )
    render_device_scale: float = Field(
        1.0, description="Device scale factor for rendering."
    )
    page_padding: int = Field(
        0,
        description=(
            "Padding in CSS pixels applied to the HTML body before rendering."
        ),
    )
    render_full_page: bool = Field(
        False,
        description=("Capture a single full-height page image instead of paginating."),
    )
    render_dpi: int = Field(
        96, description="DPI used for page images created from rendering."
    )
```

medium

Validation could be tightened for some HTMLBackendOptions fields:

  • render_page_width, render_page_height, render_dpi: these must always be positive, so using PositiveInt is recommended.
  • render_wait_ms, page_padding: these must be >= 0, so constraining them with something like Annotated[int, Field(ge=0)] is safer.
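For illustration, those constraints could be expressed roughly as follows (a minimal sketch using Pydantic v2 types; the class name is hypothetical and only the fields mentioned above are shown):

```python
from typing import Annotated
from pydantic import BaseModel, Field, PositiveInt

class HTMLRenderOptionsSketch(BaseModel):
    # Dimensions and DPI must be strictly positive.
    render_page_width: PositiveInt = 794
    render_page_height: PositiveInt = 1123
    render_dpi: PositiveInt = 96
    # Delays and padding may be zero but never negative.
    render_wait_ms: Annotated[int, Field(ge=0)] = 0
    page_padding: Annotated[int, Field(ge=0)] = 0
```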

Comment on lines +12 to +113
```python
async def run_html_test(html_path, baseline_path, basic_processor):
    """Run the regression test for an HTML file."""
    dp = basic_processor()

    if not baseline_path.exists():
        pytest.fail(f"Baseline not found: {baseline_path}. Run with -m 'update_baseline' to create.")

    vectors = await dp(None, str(html_path))

    current_result = {
        "num_vectors": len(vectors),
        "vectors": [],
        "label_distribution": {},
        "total_characters": 0,
    }

    label_counts = Counter()
    for vector in vectors:
        if hasattr(vector, "model_dump"):
            vector_data = vector.model_dump()
        else:
            vector_data = vector if isinstance(vector, dict) else vars(vector)

        current_result["vectors"].append(vector_data)
        current_result["total_characters"] += vector_data.get("n_char", len(vector_data.get("text", "")))

        if "chunk_bboxes" in vector_data:
            try:
                bboxes = json.loads(vector_data["chunk_bboxes"])
                for bbox in bboxes:
                    if "type" in bbox:
                        label_counts[bbox["type"]] += 1
            except (json.JSONDecodeError, TypeError):
                pass

    current_result["label_distribution"] = dict(label_counts)

    with open(baseline_path, "r", encoding="utf-8") as f:
        baseline = json.load(f)

    assert current_result["num_vectors"] == baseline["num_vectors"], (
        f"[{html_path.name}] Vector count mismatch: "
        f"{current_result['num_vectors']} != {baseline['num_vectors']}"
    )

    char_diff = abs(current_result["total_characters"] - baseline["total_characters"])
    char_ratio = char_diff / max(baseline["total_characters"], 1)
    assert char_ratio < 0.05, (
        f"[{html_path.name}] Character count difference too large: "
        f"{char_diff} chars ({char_ratio:.1%} change)"
    )

    for i, (current_vector, baseline_vector) in enumerate(
        zip(current_result["vectors"], baseline["vectors"])
    ):
        current_text = current_vector.get("text", "")
        baseline_text = baseline_vector.get("text", "")
        similarity = difflib.SequenceMatcher(None, current_text, baseline_text).ratio()
        assert similarity > 0.85, (
            f"[{html_path.name}] Vector {i} text similarity too low: {similarity:.2%}"
        )


async def create_html_baseline(html_path, baseline_path, basic_processor):
    """Create the baseline for an HTML file."""
    dp = basic_processor()

    vectors = await dp(None, str(html_path))

    result = {
        "num_vectors": len(vectors),
        "vectors": [],
        "label_distribution": {},
        "total_characters": 0,
    }

    label_counts = Counter()
    for vector in vectors:
        if hasattr(vector, "model_dump"):
            vector_data = vector.model_dump()
        else:
            vector_data = vector if isinstance(vector, dict) else vars(vector)

        result["vectors"].append(vector_data)
        result["total_characters"] += vector_data.get("n_char", len(vector_data.get("text", "")))

        if "chunk_bboxes" in vector_data:
            try:
                bboxes = json.loads(vector_data["chunk_bboxes"])
                for bbox in bboxes:
                    if "type" in bbox:
                        label_counts[bbox["type"]] += 1
            except (json.JSONDecodeError, TypeError):
                pass

    result["label_distribution"] = dict(label_counts)

    baseline_path.parent.mkdir(parents=True, exist_ok=True)
    with open(baseline_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)

    print(f"✓ Updated baseline: {baseline_path}")
```

medium

The logic that extracts result data and computes label_distribution is nearly identical in run_html_test and create_html_baseline. For readability and maintainability, extract it into a shared helper function (e.g., _extract_result_data) used by both.

```python
    if not baseline_path.exists():
        pytest.fail(f"Baseline not found: {baseline_path}. Run with -m 'update_baseline' to create.")

    vectors = await dp(None, str(html_path))
```

medium

DocumentProcessor.__call__ expects a fastapi.Request object as its first argument. If the tests pass None, an AttributeError can occur inside intelligent_processor.py (for example at the request.is_disconnected() call) when upload_files is enabled. It is safer to create and pass a mock Request object so the tests also behave safely.
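As a rough illustration of that suggestion, a stand-in request could be built with unittest.mock (a sketch; make_mock_request is a hypothetical helper, and is_disconnected() is assumed to be the only request method the processor calls on this path):

```python
from unittest.mock import AsyncMock, MagicMock

def make_mock_request():
    # Minimal stand-in for fastapi.Request used only by the regression tests.
    request = MagicMock()
    request.is_disconnected = AsyncMock(return_value=False)
    return request

# In the tests, instead of passing None:
# vectors = await dp(make_mock_request(), str(html_path))
```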

@inoray inoray requested a review from HeechanKim-Genon May 7, 2026 05:19
@HeechanKim-Genon HeechanKim-Genon merged commit 5b28049 into develop May 7, 2026
3 checks passed
@inoray inoray deleted the feature/191-html-parsing-backend branch May 7, 2026 05:35

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTML Parsing Backend enhancement

2 participants