[Hotfix] Fix HWP→PDF conversion failure (Synap SDK output filename mismatch) #198

Merged
inoray merged 1 commit into develop from hotfix/sdk-pdf-output-filename
May 8, 2026

HeechanKim-Genon commented May 8, 2026

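Summary

Hotfix for the 100% failure rate in HWP file processing that began immediately after the 2.0.0.3 deployment.

Cause

_convert_to_pdf_sdk guessed the Synap PDF SDK's output filename incorrectly: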

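| Input | Actual SDK output | Path the code expected |
| --- | --- | --- |
| Document-104918.hwp | Document-104918.hwp.pdf | Document-104918.pdf |

Code: pdf_path = in_path.with_suffix('.pdf') assumes the .hwp suffix is replaced with .pdf.
Actual SDK: appends .pdf to the full input filename.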

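pdf_path.exists() was always False, so every HWP file was forced to fail with GenosServiceException(1, "PDF 변환 실패") (the message means "PDF conversion failed").
Running pdfConverter directly in the production container confirmed that the SDK itself works correctly (returncode=0, 5 MB PDF generated), so the bug was purely on the calling side.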
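
The mismatch can be reproduced with pathlib alone. The following is a minimal sketch using the filename from the example above; the variable names expected and actual are illustrative and are not taken from the PR's code.

```python
from pathlib import Path

in_path = Path("Document-104918.hwp")

# What the old code expected: with_suffix() REPLACES the last suffix.
expected = in_path.with_suffix(".pdf")

# What the SDK is reported to write: ".pdf" APPENDED to the full input name.
actual = in_path.with_name(in_path.name + ".pdf")

print(expected.name)  # Document-104918.pdf
print(actual.name)    # Document-104918.hwp.pdf
```

Because the two names differ, an exists() check on the expected path fails even when the conversion itself succeeded.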

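Fix

_convert_to_pdf_sdk is modified identically in all three facades:

  1. Write output to a temporary directory (/tmp/pdfsdk_out_*/), so the code no longer guesses the output filename and no longer depends on NFS write permission
  2. Collect the PDF with glob, so it is found whatever name the SDK gives it
  3. shutil.copy2 it to the intended location (<basename>.pdf) to stay compatible with downstream code
  4. Greatly expanded diagnostic logging: preflight checks, cmd, returncode, stdout, stderr, and produced_files are all recorded (previously only part of stderr)
  5. Added a 600-second timeout
  6. Added a LibreOffice fallback in intelligent_processor.__call__ for when the SDK fails (an extra line of defense)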
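
Steps 1 to 5 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's actual _convert_to_pdf_sdk: the function name convert_via_temp_dir and the build_cmd callback are hypothetical, the real pdfConverter arguments are not shown in this PR and are left to the caller, and the sketch always removes the temporary directory rather than preserving it when the copy fails.

```python
import logging
import shutil
import subprocess
import tempfile
from pathlib import Path

_log = logging.getLogger(__name__)


def convert_via_temp_dir(in_path, build_cmd, timeout=600):
    """Run a converter into a temp dir, then copy the PDF to <basename>.pdf.

    build_cmd(in_path, out_dir) is a hypothetical callback that returns the
    argument list for the converter; it stands in for the real SDK command.
    """
    in_path = Path(in_path)
    target = in_path.with_suffix(".pdf")              # path downstream code expects
    out_dir = tempfile.mkdtemp(prefix="pdfsdk_out_")  # step 1: isolated output dir
    try:
        cmd = build_cmd(in_path, out_dir)
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout  # step 5
            )
        except subprocess.TimeoutExpired:
            _log.error("converter timed out after %ss: %s", timeout, cmd)
            return None
        produced = sorted(Path(out_dir).glob("*.pdf"))  # step 2: any PDF name
        _log.info(                                       # step 4: diagnostics
            "returncode=%s produced_files=%s stderr=%r",
            proc.returncode, [p.name for p in produced], proc.stderr,
        )
        if proc.returncode != 0 or not produced:
            return None
        shutil.copy2(produced[0], target)               # step 3: canonical location
        return target
    finally:
        shutil.rmtree(out_dir, ignore_errors=True)
```

Passing the command builder in as a parameter keeps the sketch independent of the SDK's command line and makes the recovery logic testable with a stand-in converter.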

Verification

Applied to the production container via web UI hot reload, then reprocessed Document-104920.hwp → success:

[convert_to_pdf:sdk] returncode=0 produced_files=['Document-104920.hwp.pdf']
[convert_to_pdf:sdk] success → /nfs-root/docs/Document-104920.pdf
[intelligent] Converted PDF: /nfs-root/docs/Document-104920.pdf
...
Success: "/nfs-root/docs/Document-104920.hwp" (86.84 seconds)

Changed files

  • genon/preprocessor/facade/intelligent_processor.py
  • genon/preprocessor/facade/attachment_processor.py
  • genon/preprocessor/facade/convert_processor.py

Test plan

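  • After merge, confirm that develop matches the production hotfix code
  • Confirm that the previously failing HWP files (Document-104918, 104920) reprocess successfully
  • Regression: confirm that one each of direct PDF input / DOC / DOCX / PPT / PPTX / image reprocesses successfully
  • Confirm that returncode and produced_files are recorded correctly in the logs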

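Urgency

P0: with 2.0.0.3 merged, all HWP file processing in production is blocked. Production is running the fix temporarily via web UI hot reload (it is lost on container restart). Merging this PR syncs the GitHub source of truth with the production code.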

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Enhanced PDF conversion reliability with automatic fallback mechanism—if the primary conversion method fails, the system now automatically retries using an alternative approach
    • Improved file handling and cleanup for PDF conversion processes

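The Synap PDF SDK writes its output by appending .pdf to the input filename
(Document.hwp → Document.hwp.pdf), but the Python code used with_suffix('.pdf')
and expected a path with .hwp replaced (Document.pdf). This regression caused
every HWP file to be misclassified as 'PDF 변환 실패' (PDF conversion failed).

Scope of the fix, applied identically to 3 facades:
- intelligent_processor.py
- attachment_processor.py
- convert_processor.py

Changes:
- Receive SDK output in a temporary directory, collect the PDF with glob, then copy it to the intended location
- Expanded diagnostic logging (preflight, returncode, stdout, stderr, produced_files)
- Added a 600-second timeout
- intelligent_processor: added a LibreOffice fallback for when the SDK fails

Production verification:
After hot reload in the container, reprocessing Document-104920.hwp succeeded in 86.84 seconds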
(returncode=0 produced_files=['Document-104920.hwp.pdf']
 → /nfs-root/docs/Document-104920.pdf)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 8, 2026

Walkthrough

Three processor modules (attachment_processor, convert_processor, intelligent_processor) are updated to isolate PDF-SDK conversion outputs to temporary directories, enhance logging for debugging, implement copy-to-target fallback logic, and add LibreOffice retry when SDK conversion fails.

Changes

PDF-SDK Conversion with Temporary Output and Fallback

| Layer / File(s) | Summary |
| --- | --- |
| State and Variables: genon/preprocessor/facade/attachment_processor.py, genon/preprocessor/facade/convert_processor.py, genon/preprocessor/facade/intelligent_processor.py | Introduce optional sdk_out_dir temporary directory variable and keep_out_dir control flag across all three processors to manage output isolation and cleanup. |
| Preflight Setup and Logging: genon/preprocessor/facade/attachment_processor.py, genon/preprocessor/facade/convert_processor.py, genon/preprocessor/facade/intelligent_processor.py | Create temporary output directories with unique prefixes and add detailed preflight logging capturing SDK binary executability, font/module resource paths, input file existence/size, and chosen temp directory location. |
| SDK Execution and Logging: genon/preprocessor/facade/attachment_processor.py, genon/preprocessor/facade/convert_processor.py, genon/preprocessor/facade/intelligent_processor.py | Update subprocess invocation to write outputs to sdk_out_dir, capture stdout/stderr with timeout handling, enumerate and log produced files, and log executed command with return code. |
| Output Handling and Cleanup: genon/preprocessor/facade/attachment_processor.py, genon/preprocessor/facade/convert_processor.py, genon/preprocessor/facade/intelligent_processor.py | Discover first produced PDF from temp directory, attempt copy to canonical target path (same base filename with .pdf extension), return temp PDF path if copy fails while preserving temp directory, and conditionally remove temp directory in finally block. |
| Fallback Retry Flow: genon/preprocessor/facade/intelligent_processor.py | Non-PDF auto-conversion now retries with LibreOffice when initial SDK-based conversion fails or produces no usable output file. |
| Minor Changes: genon/preprocessor/facade/attachment_processor.py, genon/preprocessor/facade/convert_processor.py | Anchor return statements and string formatting with no observable logic changes. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • genonai/doc_parser#197: Modifies the same PDF-SDK conversion paths and directly relates to the SDK integration improvements in this PR.

Suggested reviewers

  • inoray

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly identifies the main change: fixing HWP-to-PDF conversion failures caused by SDK output filename mismatches. It directly addresses the root cause and the specific issue resolved. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@HeechanKim-Genon HeechanKim-Genon requested a review from inoray May 8, 2026 04:56

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
genon/preprocessor/facade/intelligent_processor.py (1)

1760-1770: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

LGTM — SDK→LibreOffice fallback is correctly gated.

The retry only triggers when the original attempt used the SDK (use_sdk is True), which is the right call: if the caller explicitly passed use_pdf_sdk=False, silently re-trying with another engine would violate that intent. The not converted or not os.path.exists(converted) check also catches the "SDK returned a path that vanished" failure mode in addition to outright None returns.
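
The gating described here can be sketched as below. This is an illustration only: the wrapper name convert_with_fallback and the convert_to_pdf(path, use_sdk=...) call shape are assumptions made for the example, since the actual code around lines 1760-1770 is not shown in this conversation.

```python
import os


def convert_with_fallback(path, convert_to_pdf, use_pdf_sdk=True):
    """Try the requested engine first; retry without the SDK only if the SDK was used.

    convert_to_pdf is passed in as a callable (hypothetical signature:
    convert_to_pdf(path, use_sdk=bool) -> output path or None).
    """
    converted = convert_to_pdf(path, use_sdk=use_pdf_sdk)
    # Treat both "returned None" and "returned a path that no longer exists" as failure.
    missing = not converted or not os.path.exists(converted)
    if use_pdf_sdk and missing:
        # Fallback: retry with the non-SDK engine (LibreOffice in the PR).
        converted = convert_to_pdf(path, use_sdk=False)
    return converted
```

When the caller passes use_pdf_sdk=False, the function makes exactly one attempt, which preserves the caller's explicit engine choice.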

Minor nit on line 1765 only: the warning string has no interpolation, so the f prefix is unnecessary (Ruff F541).

🧹 Suggested cleanup
-                _log.warning(f"[intelligent] SDK conversion failed → fallback to LibreOffice")
+                _log.warning("[intelligent] SDK conversion failed → fallback to LibreOffice")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/intelligent_processor.py` around lines 1760 - 1770,
The warning message in the intelligent preprocessing flow uses an unnecessary
f-string: in the block inside intelligent_processor where use_sdk is determined
and convert_to_pdf is called (symbols: convert_to_pdf, use_sdk, _log.warning),
remove the f-prefix from the literal "_log.warning(f\"[intelligent] SDK
conversion failed → fallback to LibreOffice\")" (or alternatively add
interpolation fields if you intend to format), so change it to a plain string
`_log.warning("[intelligent] SDK conversion failed → fallback to LibreOffice")`.
🧹 Nitpick comments (1)
genon/preprocessor/facade/attachment_processor.py (1)

113-203: 💤 Low value

LGTM — same SDK fix as convert_processor.py; behavior is consistent across facades.

Same correctness assessment applies: temp-dir output + copy-to-target cleanly fixes the Document.hwp.pdf vs Document.pdf mismatch, and cleanup/logging are sound.

One maintainability note: this _convert_to_pdf_sdk is now byte-for-byte identical in three facades. The reason (Genos web UI loading each facade as a single file) is documented only in intelligent_processor.py:17-19. Consider duplicating that short rationale comment here and in convert_processor.py so future readers don't try to extract a shared helper without realizing it would break the runtime loader.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@genon/preprocessor/facade/attachment_processor.py` around lines 113 - 203,
The function _convert_to_pdf_sdk is byte-for-byte duplicated across facades; add
the same short rationale comment that exists in intelligent_processor.py (lines
17-19) to the top of this function in attachment_processor.py and also add that
same comment to convert_processor.py so future maintainers know the duplication
is intentional due to the Genos web UI loading each facade as a single file;
place the comment immediately above the _convert_to_pdf_sdk definition to make
the reason obvious when someone considers extracting a shared helper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b59a5a61-93d5-4f1b-8400-11abc61572cd

📥 Commits

Reviewing files that changed from the base of the PR and between 79c6c7a and 4d44b66.

📒 Files selected for processing (3)
  • genon/preprocessor/facade/attachment_processor.py
  • genon/preprocessor/facade/convert_processor.py
  • genon/preprocessor/facade/intelligent_processor.py

@inoray inoray merged commit a768591 into develop May 8, 2026
4 checks passed
@yspaik yspaik added the bug Something isn't working label May 8, 2026
@yspaik yspaik added this to the Parser v2.1.0 05/07 Release milestone May 8, 2026