Skip to content

[PDF변환] 사이냅소프트 PDF 컨버터 SDK 도입 #187#197

Merged
inoray merged 3 commits intodevelopfrom
187-pdf-사이냅소프트-pdf-컨버터-이식하기
May 8, 2026

Hidden character warning

The head ref may contain hidden characters: "187-pdf-\uc0ac\uc774\ub0c5\uc18c\ud504\ud2b8-pdf-\ucee8\ubc84\ud130-\uc774\uc2dd\ud558\uae30"
Merged

[PDF변환] 사이냅소프트 PDF 컨버터 SDK 도입 #187#197
inoray merged 3 commits intodevelopfrom
187-pdf-사이냅소프트-pdf-컨버터-이식하기

Conversation

@HeechanKim-Genon
Copy link
Copy Markdown

@HeechanKim-Genon HeechanKim-Genon commented May 7, 2026

Summary

  • Closes [PDF} 사이냅소프트 PDF 컨버터 이식하기 #187
  • HWP/DOCX/PPT/이미지 등 비-PDF 입력을 PDF 로 변환하는 엔진을 LibreOffice 단독에서 PDF SDK(default) + LibreOffice fallback 으로 확장
  • kwargs.get('use_pdf_sdk', ...) 패턴 (HwpProcessor 의 use_hwp_sdk 와 동일 컨셉) 으로 사용자가 엔진 선택 가능, 기본값 True
  • 도커 빌드 시 HF private dataset(HeechanKim-Genon/pdf_sdk) 에서 SDK 자동 다운로드. HWP SDK 와 동일 토큰(HF_TOKEN) 사용
  • intelligent_processor 는 비-PDF 입력에 대해 PDF SDK 로 자동 변환 후 docling pipeline 진입 (auto_convert_to_pdf, default True)

변경 내용 (커밋 분리)

1. [PDF변환] 사이냅소프트 PDF 컨버터 SDK 도입 #187 (메인)

  • attachment_processor.py / convert_processor.py / intelligent_processor.pyconvert_to_pdf 시그니처 확장: (file_path, use_pdf_sdk=True) → 내부 분기로 SDK / LibreOffice 호출
  • intelligent_processor.py 진입부에 _is_pdf 매직 헤더 검사 + auto_convert_to_pdf 분기 추가
  • SDK 경로 결정: PDF_SDK_HOME 환경변수 → fallback <repo_root>/pdf_sdk
  • fontconfig conf 파일 런타임 동적 패치 (현재 SDK 경로 기준 <dir>/<cachedir> 치환)
  • Dockerfile: pdf_sdk_artifacts stage 신설 (HF dataset 다운로드 + chmod + conf 보정), runtime stage 에 PDF_SDK_HOME=/app/pdf_sdk ENV 추가
  • tests/smoke/test_pdf_convert_smoke.py 추가 — attachment_processor × {ppt, pptx} × {use_pdf_sdk=True, False} parametrize, vectors 정상 반환 + schema 검증
  • genon/README.md + gitbook_doc/*.md 문서 갱신
  • .gitignorepdf_sdk/ 추가

2. [FIX] MinioConfig 모듈 임포트 시 환경변수 강제 검증 제거 (별도 fix)

본 PR 의 PDF SDK 작업과는 무관한 별개 버그. 187 브랜치 빌드/배포 검증 중 발견되어 같이 처리.

  • [BE] 첨부형 전처리기 배포 시 리소스 파일 자동 다운로드 (GenOS #9078) #175 PR (692a8637) 에서 settings.py 모듈 레벨에 minio_config = MinioConfig() 가 추가되면서, MinIO 를 사용하지 않는 환경에서도 import 시점에 MINIO_* 환경변수 누락만으로 워커 boot 가 실패하는 문제 발생
  • main.py:49if settings.PREPROCESSOR_ID: 분기에서만 MinIO 가 실제로 호출되므로, 검증도 그 흐름 안에서 일어나야 함
  • common/settings.py: 모듈 레벨 인스턴스 제거 (클래스 정의는 유지)
  • util/minio_resource.py: download_resource_files 함수 안에서 MinioConfig() 인스턴스화로 이동

3. [FIX] facade 간 import 제거 — Genos 단일 파일 환경 호환성 회복

  • 코드 중복 제거 시도로 convert_processor / intelligent_processorattachment_processor.convert_to_pdf 를 import 하게 통합했었으나, Genos 웹 UI 가 facade 코드를 단일 파일(preprocessor.py)로 다루기 때문에 GitHub 원본 그대로 복붙 시 import 가 깨지는 회귀 발생
  • 세 facade 가 독립적으로 동작하도록 convert_to_pdf + 헬퍼 4종을 자체 정의로 복원 (코드 중복 부활하지만 단일 파일 제약이 우선)

알려진 운영 환경 요구사항 (인프라 셋업 필요)

본 PR 의 코드 자체는 빌드/임포트 통과하나, PREPROCESSOR_ID 가 truthy 인 deployment 에서 정상 부팅하려면 다음 운영 셋업이 필요함 (#175 의 의존성, 본 PR 무관):

  • MINIO_ENDPOINT / MINIO_ACCESS_KEY / MINIO_SECRET_KEY 환경변수 주입 — 기존 cluster 의 minio Secret 활용 가능
  • MinIO 서버에 preprocessor bucket 생성

후속 이슈로 분리 권장

  • download_resource_files 를 fail-safe 하게 (bucket / 리소스 부재 시 warning + 진행 — 현재는 hard-fail)
  • 운영 deployment 의 web UI 재생성이 직접 patch 한 spec 을 날리는 워크플로우 점검

Test plan

  • 3개 facade 파일 syntax check (ast.parse) 통과
  • 도커 빌드 성공 (mncregistry:30500/mnc/doc-parser-preprocessor:187-branch-test)
  • PDF SDK 단독 호출 (도커 컨테이너 내) 한글 hwp/docx 변환 정상 동작 확인
  • HF private dataset 업로드 + 도커 빌드 시 자동 다운로드 (저작권 우려 폰트 자동 제외)
  • 운영 배포 후 attachment/intelligent processor 통합 흐름 e2e (위 인프라 셋업 후 가능)
  • LibreOffice fallback (use_pdf_sdk=False) 회귀 검증

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added PDF SDK-based document conversion engine as an alternative to LibreOffice
    • Enabled automatic PDF conversion for non-PDF inputs in intelligent processor
  • Bug Fixes

    • Fixed MinIO configuration causing worker startup failures when environment variables are absent
  • Documentation

    • Expanded README with setup instructions for PDF SDK and HWP SDK
    • Updated processor documentation to reflect dual conversion engine support
  • Tests

    • Added smoke tests validating PDF conversion pipeline with both SDK and LibreOffice backends
  • Chores

    • Updated .gitignore to exclude pdf_sdk directory
    • Enhanced Docker configuration for PDF SDK artifact preparation

HeechanKim-Genon and others added 3 commits May 7, 2026 16:19
- attachment_processor / convert_processor / intelligent_processor 의
  PDF 변환 로직을 LibreOffice 단일 → PDF SDK(default) + LibreOffice fallback 으로 전환
- convert_to_pdf(file_path, use_pdf_sdk=True) 시그니처: kwargs 로 엔진 선택
- intelligent_processor: kwargs 'auto_convert_to_pdf' (default True) 추가.
  비-PDF 입력에 대해 자동 PDF 변환 여부 제어 (False=변경 전 동작)
- convert_processor 의 자체 convert_to_pdf 제거 → attachment_processor 에서 import 재사용
- PDF SDK 경로: PDF_SDK_HOME 환경변수 → fallback <repo_root>/pdf_sdk
- fontconfig conf 런타임 동적 패치 (현재 SDK 경로 기준 <dir>/<cachedir> 치환)

[빌드/배포]
- Dockerfile: 'pdf_sdk_artifacts' stage 신설 (HF private dataset
  HeechanKim-Genon/pdf_sdk 다운로드 + chmod + fontconfig 보정)
- runtime stage: PDF_SDK_HOME=/app/pdf_sdk ENV + COPY 추가
- HWP SDK 와 동일한 HF_TOKEN 으로 두 SDK 모두 다운로드

[테스트]
- tests/smoke/test_pdf_convert_smoke.py 추가
  attachment_processor × {ppt, pptx} × {use_pdf_sdk=True, False} parametrize
  → vectors 정상 반환 + schema 검증 (LibreOffice fallback 회귀 방지)

[문서]
- genon/README.md: HF 다운로드 가이드 (도커 빌드 + 로컬 실행),
  SDK 바이너리 단독 실행 가이드 (Linux 전용 명시)
- gitbook_doc/intro.md, attachment_processor.md, convert_processor.md,
  intelligent_processor.md: PDF 변환 엔진 분기 / use_pdf_sdk·auto_convert_to_pdf
  config / SDK 경로 결정 로직 반영
- .gitignore: pdf_sdk/ 추가

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#175 PR(692a863)에서 추가된 settings.py 의 모듈 레벨 minio_config = MinioConfig()
가 PROD/DEV 환경 무관하게 import 시점에 MINIO_ENDPOINT/MINIO_ACCESS_KEY/MINIO_SECRET_KEY
세 환경변수의 존재를 강제 검증하는 문제 수정.

main.py:49 의 로직상 MinIO 다운로드는 settings.PREPROCESSOR_ID 가 있을 때만 호출되므로,
PREPROCESSOR_ID 가 없는 환경(MinIO 를 쓰지 않는 환경)에서도 워커 boot 가 실패하는 회귀가
발생함.

수정:
- common/settings.py: 모듈 레벨 minio_config 인스턴스 제거 (MinioConfig 클래스 정의는 유지)
- util/minio_resource.py: download_resource_files 함수 안에서 MinioConfig() 인스턴스화로 이동
  → MinIO 사용 시점에만 환경변수 검증, 사용 안 하는 환경은 영향 없음

(이 수정은 187 브랜치의 PDF SDK 도입과 무관한 별개 fix 이지만, 187 브랜치 빌드/배포 시
관측되어 같은 브랜치에 합쳐 처리)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
이전 커밋(79edf45)에서 코드 중복 제거 목적으로 convert_processor 와
intelligent_processor 가 attachment_processor 의 convert_to_pdf 를
import 하도록 통합했으나, Genos 웹 UI 는 facade 코드를 단일 파일
(preprocessor.py)로 다루기 때문에 사용자가 GitHub 의 원본을 그대로
복붙하면 attachment_processor 모듈을 찾을 수 없어 import 가 깨짐.

각 facade 가 독립적으로 동작하도록 convert_to_pdf / _convert_to_pdf_sdk /
_convert_to_pdf_libreoffice / _patch_fontconfig 를 자체 정의로 복원.
세 파일 모두 동일한 구현(코드 중복) 을 보유하지만, 운영 환경의 단일
파일 제약을 만족하는 게 우선.

수정:
- convert_processor.py: import 한 줄(convert_to_pdf) 제거 + 4개 함수
  자체 정의 + shutil/subprocess/tempfile/unicodedata import 복원
- intelligent_processor.py: 동일
- gitbook_doc/convert_processor.md, intelligent_processor.md:
  "재사용/import" 표기를 "자체 정의" 로 정정 + 단일 파일 제약 사유 명시

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HeechanKim-Genon HeechanKim-Genon linked an issue May 7, 2026 that may be closed by this pull request
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR integrates a PDF SDK as an alternative to LibreOffice for file-to-PDF conversion. It adds dual-backend routing with environment configuration, updates all preprocessor pipelines to support engine selection via a use_pdf_sdk parameter, and enables intelligent_processor to auto-convert non-PDF inputs.

Changes

PDF SDK Integration & Dual-Backend Conversion

Layer / File(s) Summary
Environment Configuration
.gitignore, genon/preprocessor/src/common/settings.py
pdf_sdk/ added to ignore list; MinioConfig instantiation deferred from import-time to call-time to prevent boot failures when environment variables are absent.
SDK Setup & Documentation
genon/README.md
New sections document HF_TOKEN usage for downloading PDF SDK, local testing prerequisites (SDK placement, path fallback behavior), and Linux-only standalone binary execution for PDF conversion.
Docker Build Infrastructure
genon/preprocessor/docker/Dockerfile
New pdf_sdk_artifacts multi-stage build target downloads private Hugging Face dataset, patches font configuration paths, and sets PDF_SDK_HOME in runtime environment.
Utility Layer Updates
genon/preprocessor/src/util/minio_resource.py
MinioConfig instantiated at download_resource_files() call-time instead of module import.
Dual-Backend Conversion Core
genon/preprocessor/facade/convert_processor.py, genon/preprocessor/facade/attachment_processor.py
convert_to_pdf(file_path, use_pdf_sdk=True) routes to SDK path (fontconfig patching, LD_LIBRARY_PATH setup, pdfConverter binary) or LibreOffice path (headless soffice, non-ASCII filename handling); both return None on failure.
Intelligent Processor Auto-Conversion
genon/preprocessor/facade/intelligent_processor.py
New convert_to_pdf() and _is_pdf() helpers added; __call__() auto-converts non-PDF inputs to PDF when auto_convert_to_pdf=True (default) before Docling processing, raising GenosServiceException on conversion failure.
Integration & Parameter Wiring
genon/preprocessor/facade/convert_processor.py, genon/preprocessor/facade/attachment_processor.py
DocumentProcessor.get_loader(file_path, use_pdf_sdk=True) and get_loader_langchain(file_path, use_pdf_sdk=True) updated to accept and forward backend selection; main request handlers now pass use_pdf_sdk from kwargs into conversion calls.
Documentation
genon/preprocessor/facade/gitbook_doc/*, genon/README.md
All processor documentation updated to reflect dual-backend API: convert_to_pdf() signatures, SDK/LibreOffice routing logic, environment variable selection, and intelligent_processor's auto-conversion capability.
Tests & Validation
genon/preprocessor/tests/smoke/test_pdf_convert_smoke.py
New parameterized smoke test validates PPT/PPTX-to-PDF conversion across both use_pdf_sdk=True and False, asserting vector output contains text field.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • genonai/doc_parser#186: Parallel Docker multi-stage build improvements for font artifacts and PDF SDK setup.
  • genonai/doc_parser#193: Related intelligent_processor changes for provenance and bbox handling alongside PDF conversion.
  • genonai/doc_parser#167: Prior SDK integration pattern (hwp_sdk) that this PR extends with pdf_sdk alongside existing infrastructure.

Suggested labels

enhancement

Suggested reviewers

  • inoray
  • yspaik

Poem

🐰 A fuzzy pal skips through the code,
Two paths now flow where one once rode,
SDK swift or LibreOffice slow,
Both roads to PDF now brightly glow!
Non-PDFs transform with grace,
Before they reach the Docling place. 🎉

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check ❓ Inconclusive The PR includes two unrelated changes: MinioConfig instantiation deferral and independent convert_to_pdf definitions for single-file compatibility, which fall outside issue #187's scope. Clarify whether MinioConfig deferral and convert_to_pdf duplication are intentional supplementary fixes or should be separated into distinct PRs for cleaner scoping.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[PDF변환] 사이냅소프트 PDF 컨버터 SDK 도입 #187' clearly summarizes the main change: integrating the SynapSoft PDF Converter SDK for PDF conversion, which aligns with the PR's primary objective.
Linked Issues check ✅ Passed The PR successfully addresses issue #187 requirements by implementing selectable PDF SDK engine alongside LibreOffice, allowing engine selection via use_pdf_sdk parameter, and supporting SDK provisioning in Docker builds.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 187-pdf-사이냅소프트-pdf-컨버터-이식하기

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new PDF SDK as the primary engine for document conversion, complementing the existing LibreOffice fallback across the attachment, convert, and intelligent processors. Key changes include Dockerfile updates for SDK acquisition, comprehensive documentation updates, and a refactor of the conversion logic to support engine selection via the use_pdf_sdk parameter. Additionally, the PR delays MinioConfig instantiation to prevent worker startup failures in environments where MinIO is not utilized. Review feedback identifies a potential issue in the font configuration patching logic, suggesting the removal of the count=1 constraint in re.sub calls to ensure all relevant XML tags are correctly updated across all processor implementations.

dst = os.path.join(tmp_dir, "fonts.conf")
with open(src, "r", encoding="utf-8") as f:
conf = f.read()
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using count=1 in re.sub might not patch all <dir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.

Suggested change
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf)

with open(src, "r", encoding="utf-8") as f:
conf = f.read()
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using count=1 in re.sub might not patch all <cachedir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.

Suggested change
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1)
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf)

dst = os.path.join(tmp_dir, "fonts.conf")
with open(src, "r", encoding="utf-8") as f:
conf = f.read()
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using count=1 in re.sub might not patch all <dir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.

Suggested change
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf)

with open(src, "r", encoding="utf-8") as f:
conf = f.read()
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using count=1 in re.sub might not patch all <cachedir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.

Suggested change
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1)
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf)

dst = os.path.join(tmp_dir, "fonts.conf")
with open(src, "r", encoding="utf-8") as f:
conf = f.read()
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using count=1 in re.sub might not patch all <dir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.

Suggested change
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf)

with open(src, "r", encoding="utf-8") as f:
conf = f.read()
conf = re.sub(r"<dir>[^<]*</dir>", f"<dir>{fonts_dir}</dir>", conf, count=1)
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using count=1 in re.sub might not patch all <cachedir> tags if there are multiple. Consider removing count=1 to ensure all occurrences are replaced, consistent with the Dockerfile sed command.

Suggested change
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf, count=1)
conf = re.sub(r"<cachedir>[^<]*</cachedir>", f"<cachedir>{font_cache}</cachedir>", conf)

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
genon/preprocessor/facade/attachment_processor.py (2)

1314-1325: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the converted PDF path instead of discarding it.

For .doc, .ppt/.pptx, and image inputs, convert_to_pdf() is called and then ignored; the loader still parses the original source file. That means use_pdf_sdk does not change the extraction path, and broken converters can stay hidden behind a successful Unstructured*Loader parse.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/attachment_processor.py` around lines 1314 - 1325,
The code calls convert_to_pdf(file_path, use_pdf_sdk=use_pdf_sdk) but ignores
its returned PDF path, continuing to pass the original file_path into
UnstructuredWordDocumentLoader, UnstructuredPowerPointLoader, and
UnstructuredImageLoader; change those branches to capture the returned path
(e.g., pdf_path = convert_to_pdf(...)) and pass pdf_path to the respective
loaders (UnstructuredWordDocumentLoader, UnstructuredPowerPointLoader,
UnstructuredImageLoader) so the loaders operate on the converted PDF rather than
the original source when use_pdf_sdk is used.

194-217: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The non-ASCII LibreOffice fallback checks the wrong PDF name.

If the ASCII temp copy succeeds, LibreOffice writes <ascii_name>.pdf into out_dir, not the original <in_path.stem>.pdf. The current existence check therefore misses successful conversions for non-ASCII filenames and returns None anyway.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/attachment_processor.py` around lines 194 - 217,
The LibreOffice fallback uses the original in_path.stem when checking for the
produced PDF, which fails for non-ASCII temp copies; update the check in
convert_to_pdf (attachment_processor.py) to derive pdf_path from the actual
candidate filename (use cand.stem or the ascii_copy name) — e.g., build
expected_pdf = Path(out_dir) / f"{cand.stem}.pdf" for each candidate, test its
existence and return its string, and ensure tmp_dir is cleaned up on success or
failure.
genon/preprocessor/facade/convert_processor.py (1)

228-251: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The LibreOffice temp-copy fallback misses successful conversions for non-ASCII filenames.

soffice will emit a PDF named after cand.stem, but the success check only looks for in_path.with_suffix(".pdf"). When the ASCII temp copy is the one that succeeds, this path still returns None.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/convert_processor.py` around lines 228 - 251, The
LibreOffice fallback currently checks only in_path.with_suffix(".pdf"), missing
cases where the ASCII temp copy (ascii_copy) produced the PDF; update the loop
that iterates candidates in convert_processor.py to derive the expected output
PDF per candidate (e.g., pdf_path = out_dir / f"{cand.stem}.pdf") and check that
path after subprocess.run, returning its string when found; also ensure tmp_dir
is removed when the temp copy produced the PDF (and still cleaned up on all
exits). Use the existing symbols cand, pdf_path, tmp_dir, convert_arg, out_dir,
and in_path to implement this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@genon/preprocessor/facade/attachment_processor.py`:
- Around line 141-153: The subprocess calls that run the PDF conversion (the
subprocess.run call creating proc with cmd, checking proc.returncode and
pdf_path) must be bounded by a timeout and handle subprocess.TimeoutExpired
similarly to other failures: add a timeout argument to subprocess.run (for both
the pdfConverter branch shown around the cmd/proc/pdf_path logic and the soffice
branch referenced around lines 205-217), wrap the call in try/except
subprocess.TimeoutExpired, and in the except block log the timeout via
_log.warning (include proc-like context and the command) and return/raise the
same failure response path used for non-zero returncodes so the request isn't
blocked indefinitely. Ensure you reference the same variables (cmd, env, proc,
pdf_path) and reuse the existing failure handling where returncode != 0 is
currently handled.

In `@genon/preprocessor/facade/convert_processor.py`:
- Around line 1313-1318: The PPT and DOCX/PPTX branches in get_loader_langchain
(and the similar branch at lines ~1945-1946) call convert_to_pdf(file_path,
use_pdf_sdk=use_pdf_sdk) for side effects but continue using the original file,
so the use_pdf_sdk choice is never applied and conversion failures are ignored;
change these branches to capture the returned PDF path from convert_to_pdf and
pass that path into the LangChain loader (e.g., replace
UnstructuredPowerPointLoader(file_path) with
UnstructuredPowerPointLoader(pdf_path)) and do likewise for the DOCX/PPTX loader
paths, propagate the use_pdf_sdk argument into those calls, and if
convert_to_pdf returns None or raises, fail fast by raising an exception or
returning an error instead of proceeding with the original file.
- Around line 175-187: Add a bounded timeout to the external subprocess calls:
pass a sensible timeout (e.g. timeout=30) to subprocess.run when building and
running cmd in convert_processor.py, wrap the call in a try/except that catches
subprocess.TimeoutExpired, logs a clear warning including the command and that
it timed out (similar to the existing _log.warning usage for proc.stderr), and
treat the timeout as a conversion failure (return None or the current failure
path) instead of letting the request hang; apply the same change to the other
subprocess.run block around lines 240-251 so both conversions are protected by
timeouts and timeout handling.

In `@genon/preprocessor/facade/intelligent_processor.py`:
- Around line 63-75: The converter subprocess calls currently use subprocess.run
without timeouts, so add a timeout to both runs (the PDF SDK call shown around
the cmd variable and the other soffice call around lines 128-139) by passing a
sensible timeout value (e.g., convert_timeout) and wrap each call in try/except
subprocess.TimeoutExpired; on timeout log a warning (include the context like
"[convert_to_pdf:sdk]" or the soffice context) and treat it as a conversion
failure (return the same failure value/path handling used for non-zero return
codes). Ensure you reference and update the subprocess.run invocations in the
convert_to_pdf-related code paths and the soffice conversion path to include
timeout and TimeoutExpired handling.
- Around line 116-143: The check for successful LibreOffice conversion uses a
fixed pdf_path based on in_path, so when converting the ASCII temp copy
(ascii_copy) the produced file (ascii_name.pdf) isn't recognized; update the
loop in convert_to_pdf (function in intelligent_processor.py) to compute the
expected PDF path for each candidate (e.g., derive pdf_path from the current
cand's name/stem and out_dir) and check that path.exists() after subprocess.run;
keep the same tmp_dir cleanup behavior (remove tmp_dir on success or after the
loop).

In `@genon/preprocessor/tests/smoke/test_pdf_convert_smoke.py`:
- Around line 34-43: The current test_pdf_convert_smoke is insufficient because
basic_processor() still proceeds to parse the original PPT/PPTX instead of
verifying the PDF output; update the test to explicitly exercise convert_to_pdf
(or call a processor that consumes the returned PDF path) and assert on the
converted artifact itself: invoke the convert_to_pdf path used by
basic_processor() (or call convert_to_pdf directly), verify the returned PDF
path exists and is a non-empty file (and/or has a .pdf extension), and then pass
that PDF path into the downstream processor to ensure the PDF backend actually
produced a usable PDF rather than only validating PowerPoint parsing.

---

Outside diff comments:
In `@genon/preprocessor/facade/attachment_processor.py`:
- Around line 1314-1325: The code calls convert_to_pdf(file_path,
use_pdf_sdk=use_pdf_sdk) but ignores its returned PDF path, continuing to pass
the original file_path into UnstructuredWordDocumentLoader,
UnstructuredPowerPointLoader, and UnstructuredImageLoader; change those branches
to capture the returned path (e.g., pdf_path = convert_to_pdf(...)) and pass
pdf_path to the respective loaders (UnstructuredWordDocumentLoader,
UnstructuredPowerPointLoader, UnstructuredImageLoader) so the loaders operate on
the converted PDF rather than the original source when use_pdf_sdk is used.
- Around line 194-217: The LibreOffice fallback uses the original in_path.stem
when checking for the produced PDF, which fails for non-ASCII temp copies;
update the check in convert_to_pdf (attachment_processor.py) to derive pdf_path
from the actual candidate filename (use cand.stem or the ascii_copy name) —
e.g., build expected_pdf = Path(out_dir) / f"{cand.stem}.pdf" for each
candidate, test its existence and return its string, and ensure tmp_dir is
cleaned up on success or failure.

In `@genon/preprocessor/facade/convert_processor.py`:
- Around line 228-251: The LibreOffice fallback currently checks only
in_path.with_suffix(".pdf"), missing cases where the ASCII temp copy
(ascii_copy) produced the PDF; update the loop that iterates candidates in
convert_processor.py to derive the expected output PDF per candidate (e.g.,
pdf_path = out_dir / f"{cand.stem}.pdf") and check that path after
subprocess.run, returning its string when found; also ensure tmp_dir is removed
when the temp copy produced the PDF (and still cleaned up on all exits). Use the
existing symbols cand, pdf_path, tmp_dir, convert_arg, out_dir, and in_path to
implement this change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 931c1aec-6945-42a9-a970-af948770595d

📥 Commits

Reviewing files that changed from the base of the PR and between 5b28049 and 5a57c80.

📒 Files selected for processing (13)
  • .gitignore
  • genon/README.md
  • genon/preprocessor/docker/Dockerfile
  • genon/preprocessor/facade/attachment_processor.py
  • genon/preprocessor/facade/convert_processor.py
  • genon/preprocessor/facade/gitbook_doc/attachment_processor.md
  • genon/preprocessor/facade/gitbook_doc/convert_processor.md
  • genon/preprocessor/facade/gitbook_doc/intelligent_processor.md
  • genon/preprocessor/facade/gitbook_doc/intro.md
  • genon/preprocessor/facade/intelligent_processor.py
  • genon/preprocessor/src/common/settings.py
  • genon/preprocessor/src/util/minio_resource.py
  • genon/preprocessor/tests/smoke/test_pdf_convert_smoke.py

Comment on lines +141 to +153
cmd = [
binary,
"-i", str(in_path),
"-o", str(out_dir),
"-t", tmp,
"-f", fonts_dir,
"-e", "-1",
"-p", "1",
]
proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
if proc.returncode == 0 and pdf_path.exists():
return str(pdf_path)
_log.warning(f"[convert_to_pdf:sdk] stderr: {proc.stderr.strip()}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound the SDK and LibreOffice subprocesses with timeouts.

These conversions execute inline during request handling. A hung pdfConverter or soffice process will block the request indefinitely. Please add a timeout and handle subprocess.TimeoutExpired the same way you handle other conversion failures.

Also applies to: 205-217

🧰 Tools
🪛 Ruff (0.15.12)

[error] 150-150: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/attachment_processor.py` around lines 141 - 153,
The subprocess calls that run the PDF conversion (the subprocess.run call
creating proc with cmd, checking proc.returncode and pdf_path) must be bounded
by a timeout and handle subprocess.TimeoutExpired similarly to other failures:
add a timeout argument to subprocess.run (for both the pdfConverter branch shown
around the cmd/proc/pdf_path logic and the soffice branch referenced around
lines 205-217), wrap the call in try/except subprocess.TimeoutExpired, and in
the except block log the timeout via _log.warning (include proc-like context and
the command) and return/raise the same failure response path used for non-zero
returncodes so the request isn't blocked indefinitely. Ensure you reference the
same variables (cmd, env, proc, pdf_path) and reuse the existing failure
handling where returncode != 0 is currently handled.

Comment on lines +175 to +187
cmd = [
binary,
"-i", str(in_path),
"-o", str(out_dir),
"-t", tmp,
"-f", fonts_dir,
"-e", "-1",
"-p", "1",
]
proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
if proc.returncode == 0 and pdf_path.exists():
return str(pdf_path)
_log.warning(f"[convert_to_pdf:sdk] stderr: {proc.stderr.strip()}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add timeouts to the conversion subprocesses.

These external binaries run on the request path. If either converter hangs, the request hangs with it. A bounded timeout here would turn that into a normal conversion failure instead of a stuck worker.

Also applies to: 240-251

🧰 Tools
🪛 Ruff (0.15.12)

[error] 184-184: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/convert_processor.py` around lines 175 - 187, Add a
bounded timeout to the external subprocess calls: pass a sensible timeout (e.g.
timeout=30) to subprocess.run when building and running cmd in
convert_processor.py, wrap the call in a try/except that catches
subprocess.TimeoutExpired, logs a clear warning including the command and that
it timed out (similar to the existing _log.warning usage for proc.stderr), and
treat the timeout as a conversion failure (return None or the current failure
path) instead of letting the request hang; apply the same change to the other
subprocess.run block around lines 240-251 so both conversions are protected by
timeouts and timeout handling.

Comment on lines +1313 to 1318
def get_loader_langchain(self, file_path: str, use_pdf_sdk: bool = True):
"""PPT 파일용 langchain 로더"""
ext = os.path.splitext(file_path)[-1].lower()
if ext == '.ppt':
convert_to_pdf(file_path)
convert_to_pdf(file_path, use_pdf_sdk=use_pdf_sdk)
return UnstructuredPowerPointLoader(file_path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The new use_pdf_sdk flag never feeds the parser in these paths.

In both the PPT LangChain branch and the DOCX/PPTX branch, convert_to_pdf() is called only for side effects and the returned PDF path is ignored. Parsing continues from the original file, so SDK vs. LibreOffice cannot change extraction results, and conversion failures degrade silently instead of failing fast.

Also applies to: 1945-1946

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/convert_processor.py` around lines 1313 - 1318, The
PPT and DOCX/PPTX branches in get_loader_langchain (and the similar branch at
lines ~1945-1946) call convert_to_pdf(file_path, use_pdf_sdk=use_pdf_sdk) for
side effects but continue using the original file, so the use_pdf_sdk choice is
never applied and conversion failures are ignored; change these branches to
capture the returned PDF path from convert_to_pdf and pass that path into the
LangChain loader (e.g., replace UnstructuredPowerPointLoader(file_path) with
UnstructuredPowerPointLoader(pdf_path)) and do likewise for the DOCX/PPTX loader
paths, propagate the use_pdf_sdk argument into those calls, and if
convert_to_pdf returns None or raises, fail fast by raising an exception or
returning an error instead of proceeding with the original file.

Comment on lines +63 to +75
cmd = [
binary,
"-i", str(in_path),
"-o", str(out_dir),
"-t", tmp,
"-f", fonts_dir,
"-e", "-1",
"-p", "1",
]
proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
if proc.returncode == 0 and pdf_path.exists():
return str(pdf_path)
_log.warning(f"[convert_to_pdf:sdk] stderr: {proc.stderr.strip()}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add timeouts to both converter subprocesses.

Both binaries run directly on the request path. If pdfConverter or soffice hangs on a bad document, this coroutine blocks indefinitely and can pin a worker. Bound both calls with a timeout and treat TimeoutExpired as a conversion failure.

Also applies to: 128-139

🧰 Tools
🪛 Ruff (0.15.12)

[error] 72-72: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/intelligent_processor.py` around lines 63 - 75, The
converter subprocess calls currently use subprocess.run without timeouts, so add
a timeout to both runs (the PDF SDK call shown around the cmd variable and the
other soffice call around lines 128-139) by passing a sensible timeout value
(e.g., convert_timeout) and wrap each call in try/except
subprocess.TimeoutExpired; on timeout log a warning (include the context like
"[convert_to_pdf:sdk]" or the soffice context) and treat it as a conversion
failure (return the same failure value/path handling used for non-zero return
codes). Ensure you reference and update the subprocess.run invocations in the
convert_to_pdf-related code paths and the soffice conversion path to include
timeout and TimeoutExpired handling.

Comment on lines +116 to +143
try:
in_path.name.encode('ascii')
candidates = [in_path]
tmp_dir = None
except UnicodeEncodeError:
tmp_dir = Path(tempfile.mkdtemp())
ascii_name = unicodedata.normalize('NFKD', in_path.stem).encode('ascii', 'ignore').decode('ascii') or "file"
ascii_copy = tmp_dir / f"{ascii_name}{in_path.suffix}"
shutil.copy2(in_path, ascii_copy)
candidates = [ascii_copy, in_path]

for cand in candidates:
cmd = [
"soffice", "--headless",
"--convert-to", convert_arg,
"--outdir", str(out_dir),
str(cand)
]
proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
if proc.returncode == 0 and pdf_path.exists():
if tmp_dir:
shutil.rmtree(tmp_dir, ignore_errors=True)
return str(pdf_path)
_log.warning(f"[convert_to_pdf:libreoffice] stderr: {proc.stderr.strip()}")

if tmp_dir:
shutil.rmtree(tmp_dir, ignore_errors=True)
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The LibreOffice non-ASCII fallback never recognizes its own output.

When cand is the ASCII temp copy, LibreOffice writes <ascii_name>.pdf into out_dir, but success is checked only against in_path.with_suffix(".pdf"). For Korean filenames, a successful temp-copy conversion is therefore reported as a failure.

Suggested fix
         for cand in candidates:
+            cand_pdf_path = out_dir / f"{cand.stem}.pdf"
             cmd = [
                 "soffice", "--headless",
                 "--convert-to", convert_arg,
                 "--outdir", str(out_dir),
                 str(cand)
             ]
             proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
-            if proc.returncode == 0 and pdf_path.exists():
+            if proc.returncode == 0 and cand_pdf_path.exists():
+                if cand_pdf_path != pdf_path:
+                    cand_pdf_path.replace(pdf_path)
                 if tmp_dir:
                     shutil.rmtree(tmp_dir, ignore_errors=True)
                 return str(pdf_path)
🧰 Tools
🪛 Ruff (0.15.12)

[error] 134-134: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/intelligent_processor.py` around lines 116 - 143,
The check for successful LibreOffice conversion uses a fixed pdf_path based on
in_path, so when converting the ASCII temp copy (ascii_copy) the produced file
(ascii_name.pdf) isn't recognized; update the loop in convert_to_pdf (function
in intelligent_processor.py) to compute the expected PDF path for each candidate
(e.g., derive pdf_path from the current cand's name/stem and out_dir) and check
that path.exists() after subprocess.run; keep the same tmp_dir cleanup behavior
(remove tmp_dir on success or after the loop).

Comment on lines +34 to +43
async def test_pdf_convert_smoke(basic_processor, sample: Path, use_pdf_sdk: bool):
"""입력 ext × {PDF SDK, LibreOffice} 모두 vectors 반환 확인."""
dp = basic_processor()
vectors = await dp(None, str(sample), use_pdf_sdk=use_pdf_sdk)

assert isinstance(vectors, list) and len(vectors) >= 1
v = vectors[0]
if hasattr(v, "model_dump"):
v = v.model_dump()
assert "text" in v and isinstance(v["text"], str)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This smoke test can pass even when both PDF backends are broken.

basic_processor() currently goes through a path that still parses the original .ppt/.pptx after calling convert_to_pdf(), so these assertions only prove that PowerPoint loading still works. They do not prove that either backend actually produced a PDF. Please assert on the converted artifact or exercise a processor that consumes the returned PDF path directly.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 35-35: Docstring contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/tests/smoke/test_pdf_convert_smoke.py` around lines 34 -
43, The current test_pdf_convert_smoke is insufficient because basic_processor()
still proceeds to parse the original PPT/PPTX instead of verifying the PDF
output; update the test to explicitly exercise convert_to_pdf (or call a
processor that consumes the returned PDF path) and assert on the converted
artifact itself: invoke the convert_to_pdf path used by basic_processor() (or
call convert_to_pdf directly), verify the returned PDF path exists and is a
non-empty file (and/or has a .pdf extension), and then pass that PDF path into
the downstream processor to ensure the PDF backend actually produced a usable
PDF rather than only validating PowerPoint parsing.

@HeechanKim-Genon HeechanKim-Genon requested a review from inoray May 7, 2026 09:51
@inoray inoray merged commit 79c6c7a into develop May 8, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[PDF} 사이냅소프트 PDF 컨버터 이식하기

2 participants