Site/komipo 20251211 #166
…_parser into site/komipo-20251211

Conflicts:
- build-script/doc-parser-build.config
- genon/preprocessor/facade/attachment_processor.py
- genon/preprocessor/facade/intelligent_processor.py
- genon/preprocessor/facade/intelligent_processor_ocr.py
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
📝 Walkthrough

This PR introduces comprehensive document preprocessing infrastructure by adding multiple specialized processor modules (OCR, JSON, OneAgent, attachment handling) with guardrail safety checks, and updates build configurations and Docker image versions.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 15
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
genon/preprocessor/scripts/register_image.sh (2)
61-73: ⚠️ Potential issue | 🔴 Critical

Critical: `HAS_LOCAL_IMAGE` is undefined, breaking push logic.

The local image check (lines 61-68) is commented out, but line 73 still references `HAS_LOCAL_IMAGE`. Since this variable is never set, the condition `[[ "${HAS_LOCAL_IMAGE}" != "yes" ]]` will always evaluate to true (unset variables expand to the empty string), causing the script to always skip to the registry check or user prompt path, never proceeding with a normal push. Either uncomment the local image check or remove the `HAS_LOCAL_IMAGE` condition entirely.

🐛 Option 1: Uncomment the local image check
```diff
-# # ── 로컬 이미지 확인 ────────────────────────────────────────
-# step "로컬 Docker 이미지 확인"
-# if docker images | awk '{print $1":"$2}' | grep -qx "${FULL_IMAGE_NAME}"; then
-#   ok "로컬 이미지 존재"
-# else
-#   fail "로컬에 ${FULL_IMAGE_NAME} 없음. 먼저 build/push 하세요."
-#   exit 1
-# fi
+# ── 로컬 이미지 확인 ────────────────────────────────────────
+step "로컬 Docker 이미지 확인"
+HAS_LOCAL_IMAGE="no"
+if docker images | awk '{print $1":"$2}' | grep -qx "${FULL_IMAGE_NAME}"; then
+  ok "로컬 이미지 존재"
+  HAS_LOCAL_IMAGE="yes"
+else
+  echo "⚠️ 로컬에 ${FULL_IMAGE_NAME} 없음"
+fi
```

🐛 Option 2: Remove the HAS_LOCAL_IMAGE condition
```diff
 # ── docker push (포그라운드 / 재시도) ───────────────────────
 step "docker push"
 SKIP_PUSH="no"
-if [[ "${HAS_LOCAL_IMAGE}" != "yes" ]]; then
+# Check if image exists in registry before attempting push
+if ! docker image inspect "${FULL_IMAGE_NAME}" >/dev/null 2>&1; then
   if [[ -n "${REGISTRY_API_URL:-}" ]]; then
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/scripts/register_image.sh` around lines 61 - 73, The script references HAS_LOCAL_IMAGE in the "docker push" block but the local image check that sets it (which inspects FULL_IMAGE_NAME) is commented out, so HAS_LOCAL_IMAGE is never defined; restore the local-image check or explicitly set HAS_LOCAL_IMAGE before the push logic (e.g., run the docker images | awk ... | grep -qx "${FULL_IMAGE_NAME}" test and set HAS_LOCAL_IMAGE="yes"/"no") so the condition in the step "docker push" behaves correctly, or alternatively remove the HAS_LOCAL_IMAGE conditional entirely and rely on the registry/user prompt flow.
154-171: ⚠️ Potential issue | 🔴 Critical

Critical: SQL INSERT is never executed.

`SQL_INSERT` is assigned but `mysql_query` is never called to execute it. The subsequent `SELECT` (lines 165-169) queries for a row that was never inserted, returning empty. The script will incorrectly report "DB 등록 완료" (DB registration complete) with an empty `IMAGE_ID`.

🐛 Proposed fix: Execute the INSERT before querying
```diff
     SQL_INSERT="
       INSERT INTO llmops.system_docker_image_tb
         (name, tag, description, type, status, is_active, reg_date, mod_date, reg_user_id, mod_user_id)
       VALUES
         ('${IMAGE_NAME}', '${IMAGE_TAG}', '${DESCRIPTION}', '${TYPE_LIST_JSON}', 'COMPLETED', 1, NOW(), NOW(), 1, 1);
       INSERT INTO llmops.resource_meta_tb
         (resource_id, resource_type, resource_group_id, is_active, reg_date, mod_date, reg_user_id, mod_user_id)
       VALUES
         (LAST_INSERT_ID(), 'DOCKER_IMAGE', 1, 1, NOW(), NOW(), 1, 1);
-    " 2>/dev/null
+    "
+
+    if ! mysql_query "${SQL_INSERT}"; then
+      fail "DB INSERT 실패"
+      exit 1
+    fi

     IMAGE_ID=$(
       kubectl exec -it "${MARIADB_POD}" -n "${K8S_NAMESPACE}" -- \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/scripts/register_image.sh` around lines 154 - 171, The SQL_INSERT variable is never executed, so the SELECT that populates IMAGE_ID finds nothing; run the SQL_INSERT against the MariaDB before performing the SELECT (e.g. execute SQL_INSERT via the same kubectl exec mysql invocation or pipe it into mysql) so the INSERT into system_docker_image_tb and resource_meta_tb actually occur; keep using the same kubectl exec command pattern (referencing SQL_INSERT, kubectl exec -it "${MARIADB_POD}" -n "${K8S_NAMESPACE}" -- mysql -u "${MYSQL_USER}" -p"${MYSQL_PASS}" llmops -se) and then run the SELECT to set IMAGE_ID, and preserve/handle any errors from the mysql invocation.
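The execute-before-select pattern the prompt describes can be sketched with a stubbed `mysql_query` (the real script wraps `kubectl exec` + `mysql`; the stub below only records the statement, purely for illustration):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stub standing in for the script's kubectl-exec-backed mysql helper.
mysql_query() {
  printf '%s\n' "$1" > ./executed.sql
}

SQL_INSERT="INSERT INTO llmops.system_docker_image_tb (name) VALUES ('demo');"

# Run the INSERT first and abort on failure, instead of only assigning the
# string and then SELECTing a row that was never written.
if ! mysql_query "${SQL_INSERT}"; then
  echo "DB INSERT failed" >&2
  exit 1
fi

grep -q "system_docker_image_tb" ./executed.sql && echo "insert executed"
```

With the stub, the script prints `insert executed`; in the real script the same `if ! mysql_query …` guard surfaces a failed INSERT before the `SELECT` runs.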
🟠 Major comments (20)
genon/preprocessor/scripts/register.config (1)

4-4: ⚠️ Potential issue | 🟠 Major

Version mismatch with build configuration.
The registration config specifies `1.3.3-komipo`, but `build-script/doc-parser-build.config` builds version `1.3.7-komipo`. This inconsistency will cause registration to reference a non-existent or outdated image version. Align these versions to prevent deployment failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/scripts/register.config` at line 4, The IMAGE_TAG value in register.config ("IMAGE_TAG=\"1.3.3-komipo\"") is out of sync with the build-script/doc-parser-build.config which builds "1.3.7-komipo"; update the IMAGE_TAG constant in genon/preprocessor/scripts/register.config to "1.3.7-komipo" (or alternatively update the build config so both use the same canonical version) so registration references the exact image version produced by doc-parser-build.config.

genon/preprocessor/module/utils/util.py (1)

143-146: ⚠️ Potential issue | 🟠 Major

Don't return a PDF path when no PDF was created.
If `WeasyPrint` is unavailable, this function still returns `pdf_path` even though nothing was written. Callers will treat that as a successful conversion and fail later on missing-file access.

Proposed fix
```diff
     html_content = markdown(md_content)
-    if HTML:
-        HTML(string=html_content).write_pdf(pdf_path)
-    return pdf_path
+    if HTML is None:
+        return None
+    HTML(string=html_content).write_pdf(pdf_path)
+    return pdf_path
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/utils/util.py` around lines 143 - 146, The function currently returns pdf_path even when WeasyPrint/HTML isn't available; update the logic around HTML(string=html_content).write_pdf(pdf_path) so you only return pdf_path after confirming a PDF was actually created: if HTML is falsy or write_pdf wasn't called/failed, raise a clear exception (e.g., RuntimeError) or return None, and after calling HTML(...).write_pdf(pdf_path) verify the file exists (os.path.exists(pdf_path)) before returning. Refer to the HTML symbol and the pdf_path variable in util.py to locate and change the post-conversion return behavior.

genon/preprocessor/module/base_processor.py (1)

125-126: ⚠️ Potential issue | 🟠 Major

Honor the configured chunker instead of hardcoding `"simple"`.
config["chunker"], so this new config surface is already ineffective and any future non-simpleprocessor will silently use the wrong chunker.Proposed fix
- self.chunker = CHUNKERS["simple"]() + chunker_name = self.config.get("chunker", "simple") + self.chunker = CHUNKERS[chunker_name]() self.genos_meta_builder = GenOSVectorMetaBuilder()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/base_processor.py` around lines 125 - 126, The constructor currently hardcodes self.chunker = CHUNKERS["simple"](), ignoring the provided config["chunker"]; change it to read the chunker name from the config (e.g., chunker_name = config.get("chunker", "simple")), validate that chunker_name exists in CHUNKERS, and then instantiate with self.chunker = CHUNKERS[chunker_name](); if the name is invalid, either fall back to "simple" and log a warning or raise a clear error so the configured chunker option in the BaseProcessor constructor is honored.

genon/preprocessor/module/utils/util.py (1)

56-73: ⚠️ Potential issue | 🟠 Major

Handle the PDF generated from the ASCII temp name.
When `cand` is the ASCII copy, LibreOffice writes `<ascii_name>.pdf` into `out_dir`, but this code only checks for `pdf_path` based on the original filename. Non-ASCII inputs can therefore convert successfully and still return `None`.

Proposed fix
```diff
     for cand in candidates:
         cmd = [
             "soffice",
@@
         ]
         proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
-        if proc.returncode == 0 and pdf_path.exists():
+        generated_pdf = out_dir / f"{cand.stem}.pdf"
+        if proc.returncode == 0 and generated_pdf.exists():
+            if generated_pdf != pdf_path:
+                generated_pdf.replace(pdf_path)
             # 성공
             if tmp_dir:
                 shutil.rmtree(tmp_dir, ignore_errors=True)
             return str(pdf_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/utils/util.py` around lines 56 - 73, The conversion loop currently only checks pdf_path (based on the original filename) so when LibreOffice writes an ASCII-named PDF for the candidate `cand` the code misses it; after running subprocess.run(cmd, ...) inside the for cand in candidates loop, check for the actual generated PDF in out_dir using the candidate's output name (e.g. generated = out_dir / f"{cand.stem}.pdf") and/or glob for any PDF that matches the candidate stem, and if that file exists and proc.returncode == 0 then remove tmp_dir (if present) and return its string path; keep the existing stderr logging on failure and still cleanup tmp_dir on success.

genon/preprocessor/module/utils/util.py (1)

94-97: ⚠️ Potential issue | 🟠 Major

Stop rewriting the entire path string here.
`str.replace()` will also rewrite matching text in parent directories and basename substrings, not just the suffix. That can produce invalid output paths for files under directories like `.../md.archive/...`.

Proposed fix
```diff
 def _get_pdf_path(file_path: str, CONVERTIBLE_EXTENSIONS: list) -> str:
@@
-    pdf_path = file_path
-    for ext in CONVERTIBLE_EXTENSIONS:
-        pdf_path = pdf_path.replace(ext, ".pdf")
-    return pdf_path
+    path = Path(file_path)
+    if path.suffix.lower() in CONVERTIBLE_EXTENSIONS:
+        return str(path.with_suffix(".pdf"))
+    return str(path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/utils/util.py` around lines 94 - 97, The current loop in util.py uses pdf_path = pdf_path.replace(ext, ".pdf") which replaces every occurrence of ext in the whole path; instead detect and replace only the file extension/suffix. Update the logic that computes pdf_path (using file_path, CONVERTIBLE_EXTENSIONS and pdf_path) to check the path's suffix via os.path.splitext or pathlib.Path.suffix and only swap the final extension to ".pdf" (or use Path.with_suffix(".pdf")) when the file's extension matches a member of CONVERTIBLE_EXTENSIONS, leaving directory names untouched.

genon/preprocessor/module/utils/metadata.py (1)

295-298: ⚠️ Potential issue | 🟠 Major

Guard `appendix` string parsing against non-JSON input.

This branch assumes every non-empty string is a JSON array. A plain value like `"appendix-1.pdf"` will raise `JSONDecodeError` and abort vector composition.

Proposed fix
```diff
     appendix_list = []
     if isinstance(appendix_info, str):
-        appendix_list = (
-            [item.strip() for item in json.loads(appendix_info) if item.strip()] if appendix_info else []
-        )
+        if appendix_info:
+            try:
+                parsed = json.loads(appendix_info)
+            except json.JSONDecodeError:
+                parsed = [appendix_info]
+            appendix_list = [item.strip() for item in parsed if isinstance(item, str) and item.strip()]
     elif isinstance(appendix_info, list):
         appendix_list = appendix_info
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/utils/metadata.py` around lines 295 - 298, The code assumes appendix_info (string) is JSON and calls json.loads directly; wrap the parsing for appendix_info in a try/except to guard against non-JSON strings: attempt json.loads(appendix_info) and if it raises JSONDecodeError (or ValueError), fall back to treating the trimmed appendix_info as a single entry (appendix_list = [appendix_info.strip()] if appendix_info.strip() else []); preserve the existing empty-string behavior and keep the variable name appendix_list and the branch for isinstance(appendix_info, str).

genon/preprocessor/module/test_processor.py (1)

6-6: ⚠️ Potential issue | 🟠 Major

Use package-relative imports and add missing `__init__.py` files.

The absolute import `from base_processor import BaseProcessor` works only when this directory is manually added to `sys.path` (as shown in `test.py` line 10). Importing this module through the package path will raise `ModuleNotFoundError`. Convert to relative imports and ensure the package is properly structured with `__init__.py` files in `genon/preprocessor/` and `genon/preprocessor/module/`.

Proposed fix
```diff
-from base_processor import BaseProcessor
+from .base_processor import BaseProcessor
```

Also add `__init__.py` (can be empty) to:

- genon/preprocessor/__init__.py
- genon/preprocessor/module/__init__.py
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/test_processor.py` at line 6, Change the absolute import in test_processor.py to a package-relative import (e.g. in test_processor.py use "from ..base_processor import BaseProcessor") so the module can be imported via the package; also add empty __init__.py files to the genon.preprocessor package and the genon.preprocessor.module subpackage (create __init__.py in genon/preprocessor/ and genon/preprocessor/module/) to make the package importable.

genon/preprocessor/facade/json_processor.py (1)

89-105: ⚠️ Potential issue | 🟠 Major

Keep the JSON vector schema aligned with the shared metadata contract.
This builder emits `bboxes`/`url`, while the shared implementation in `genon/preprocessor/module/utils/metadata.py` and the other preprocessors emit `chunk_bboxes`, `media_files`, `title`, `created_date`, and `appendix`. Returning a different shape from one preprocessor is a downstream compatibility break waiting to happen.

Also applies to: 172-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/json_processor.py` around lines 89 - 105, The JSON processor currently defines attributes reg_date, bboxes, url and populates self.data with keys reg_date, bboxes, url which diverges from the shared metadata contract; update the attribute names and the self.data keys to match the shared schema used in genon/preprocessor/module/utils/metadata.py and other preprocessors: replace reg_date with created_date, bboxes with chunk_bboxes, url with media_files, and add missing keys title and appendix in self.data (and initialize corresponding attributes if needed, e.g., self.created_date, self.chunk_bboxes, self.media_files, self.title, self.appendix) so the emitted JSON shape matches the shared contract (also apply the same renames/keys for the other occurrence around the later block referenced).

genon/preprocessor/module/utils/chunkers.py (1)

130-143: ⚠️ Potential issue | 🟠 Major

Preserve document order for tables missing from `iterate_items()`.

Prepending every missing table at index `0` reorders the document whenever an omitted table belongs to a later page. That corrupts the chunk text and the header/page associations carried downstream.

genon/preprocessor/module/intelligent_processor.py (1)
1188-1193: ⚠️ Potential issue | 🟠 Major

`enrichment()` exits before `enrich_document()` ever runs.

The first `return document` makes the actual enrichment call unreachable, so the TOC/metadata enrichment configured in `self.enrichment_options` is currently disabled.

🐛 Proposed fix
```diff
 def enrichment(self, document: DoclingDocument, **kwargs: dict) -> DoclingDocument:
-    return document
-
-    # 새로운 enriched result 받기
     document = enrich_document(document, self.enrichment_options, **kwargs)
     return document
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/intelligent_processor.py` around lines 1188 - 1193, The enrichment method currently returns immediately, making the enrich_document call unreachable; remove the premature "return document" so that enrichment(self, document: DoclingDocument, **kwargs) calls enrich_document(document, self.enrichment_options, **kwargs) and returns that enriched DoclingDocument; ensure the function signature and final return still return a DoclingDocument and that self.enrichment_options is passed through to enrich_document.

genon/preprocessor/facade/attachment_processor_guardrail.py (1)

1469-1476: ⚠️ Potential issue | 🟠 Major

Filter request kwargs before constructing the text splitter.
`__call__()` forwards all request kwargs, but `RecursiveCharacterTextSplitter(**kwargs)` only accepts a small subset. Flags like `save_images`, `include_wmf`, or future API parameters will raise `TypeError` here and break attachment preprocessing.

🐛 Proposed fix
```diff
-        text_splitter = RecursiveCharacterTextSplitter(**kwargs)
+        splitter_kwargs = {
+            k: v for k, v in kwargs.items()
+            if k in ["chunk_size", "chunk_overlap", "separators", "length_function"]
+        }
+        text_splitter = RecursiveCharacterTextSplitter(**splitter_kwargs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines 1469 - 1476, The split_documents method passes the entire request kwargs into RecursiveCharacterTextSplitter causing TypeError for unknown params; before instantiating RecursiveCharacterTextSplitter in split_documents, filter kwargs down to the supported keys (e.g., allowed = {"chunk_size","chunk_overlap","separators",...} or derive via RecursiveCharacterTextSplitter signature) then set defaults like chunk_size=20000 on that filtered dict and call RecursiveCharacterTextSplitter(**filtered_kwargs) so only accepted params are forwarded.

genon/preprocessor/facade/json_processor.py (1)

311-329: ⚠️ Potential issue | 🟠 Major

Use real chunk counts and indices when composing vectors.
`split_documents()` can return multiple chunks, but every vector here is stamped with `n_chunk_of_doc=1`, `i_chunk_on_doc=1`, and page info `(1, 1, 1)`. Multi-chunk JSON inputs will therefore carry incorrect ordering/count metadata.

🐛 Proposed fix
```diff
         global_metadata = dict(
-            n_chunk_of_doc=int(1),
-            n_page=int(1),
+            n_chunk_of_doc=len(chunks),
+            n_page=1,
             reg_date=datetime.now().isoformat(timespec='seconds') + 'Z',
             **first_chunk['metadata'],
         )
@@
-        for chunk in chunks:
+        for chunk_idx, chunk in enumerate(chunks):
             vector = (GenOSVectorMetaBuilder()
                       .set_text(chunk["text"])
-                      .set_page_info(1, 1, 1)
-                      .set_chunk_index(1)
+                      .set_page_info(1, chunk_idx, len(chunks))
+                      .set_chunk_index(chunk_idx)
                       .set_global_metadata(**global_metadata)
                       .set_bboxes(None)
                       ).build()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/json_processor.py` around lines 311 - 329, The code currently hardcodes chunk/page metadata (n_chunk_of_doc=1, page info (1,1,1), chunk index 1) when building vectors; update json_processor.py so global_metadata uses the real chunk count (e.g., n_chunk_of_doc = len(chunks) or the actual total for that document), increment chunk_index_on_page for each chunk and pass the real chunk index into GenOSVectorMetaBuilder.set_chunk_index, and compute and pass correct page info into set_page_info (use current_page, total_pages, and per-page index variables rather than fixed 1s) before calling set_global_metadata and build so each vector carries correct ordering/count metadata.

genon/preprocessor/module/utils/chunkers.py (1)

614-658: ⚠️ Potential issue | 🟠 Major

List mutation here can skip later oversized sections.
`range(len(sections_with_text))` is computed before you `pop()` and `insert()` into `sections_with_text`. If an early section splits into multiple pieces, some original tail sections are never revisited, so they can remain above `max_tokens`.

🐛 Proposed fix
```diff
-        if self.max_tokens > 0:
-            for i in range(len(sections_with_text)):
+        if self.max_tokens > 0:
+            i = 0
+            while i < len(sections_with_text):
                 text, items, h_infos, h_short = sections_with_text[i]
                 token_count = self._count_tokens(text)
                 if token_count < self.max_tokens:
-                    continue
+                    i += 1
+                    continue
@@
-                sections_with_text.pop(i)
-                for new_section in reversed(new_sections):
-                    sections_with_text.insert(i, new_section)
+                sections_with_text[i:i + 1] = new_sections
+                i += len(new_sections)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/utils/chunkers.py` around lines 614 - 658, The for-loop over range(len(sections_with_text)) mutates sections_with_text (pop/insert) and skips later entries; change it to a while-loop that re-evaluates length at each iteration: initialize i = 0 and loop while i < len(sections_with_text), load (text, items, h_infos, h_short) = sections_with_text[i], compute token_count via self._count_tokens(text); if token_count < self.max_tokens then i += 1; otherwise perform the existing grouping (adjust_captions, adjust_pictures_in_tables), compute item_token_counts, call split_items_evenly_by_tokens, build new_sections with self._generate_section_text_with_heading, replace the original entry by popping and inserting new_sections at index i, and then advance i by len(new_sections) to continue processing the remaining sections without skipping any.

genon/preprocessor/facade/intelligent_processor_ocr.py (1)

1341-1341: ⚠️ Potential issue | 🟠 Major

Cancellation guard exists but is not executed in the hot path.

`assert_cancelled` is defined, but the call at line 1341 is commented out. Long OCR/chunking work will continue even after the client disconnects.
assert_cancelledis defined, but the call in Line 1341 is commented out. Long OCR/chunking work will continue even after client disconnect.Also applies to: 1382-1384
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/intelligent_processor_ocr.py` at line 1341, The cancellation guard assert_cancelled is defined but its awaited call was commented out; re-enable and await assert_cancelled(request) at the hot-path spots where long OCR/chunking work runs (the commented call near the current assert_cancelled(request) and the nearby region around the other commented calls) so work stops promptly on client disconnect; ensure these awaits occur inside the async processing/loop functions handling OCR/chunking (where request is in scope) and that the functions remain async so the await is effective.

genon/preprocessor/facade/intelligent_processor_ocr.py (1)

848-849: ⚠️ Potential issue | 🟠 Major

`page_chunk_counts` should be request-scoped, not processor-scoped.

`self.page_chunk_counts` is mutable instance state and is incremented during `split_documents` but never reset per request. This can leak counts across requests and produce wrong `n_chunk_of_page` metadata (and race under concurrent use).

🧭 Suggested fix
```diff
 async def __call__(self, request: Request, file_path: str, **kwargs: dict):
+    self.page_chunk_counts.clear()
     document: DoclingDocument = self.load_documents(file_path, **kwargs)
```

Also applies to: 987-997, 1103-1106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 848 - 849, self.page_chunk_counts is currently stored on the processor instance and accumulates across requests; make it request-scoped by removing self.page_chunk_counts and creating a local page_chunk_counts (e.g., a defaultdict(int)) inside split_documents (or the top-level request handler that calls split_documents), pass that local map into any helper methods that compute or set n_chunk_of_page metadata, and ensure you use the local variable when incrementing counts so it is reset on each request (this also avoids races under concurrent requests).genon/preprocessor/facade/intelligent_processor_ocr.py-713-733 (1)
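The request-scoped alternative the prompt describes can be sketched as follows; the function name matches the review, but the chunk shape and return value here are illustrative, not the module's real API:

```python
from collections import defaultdict

def split_documents(pages: list[dict]) -> tuple[list[dict], dict[int, int]]:
    # Request-scoped counter: created fresh inside the call instead of living
    # on the processor instance, so counts never leak across requests and
    # concurrent requests cannot race on shared state.
    page_chunk_counts: dict[int, int] = defaultdict(int)
    out = []
    for chunk in pages:
        page = chunk["page"]
        page_chunk_counts[page] += 1
        out.append({**chunk, "i_chunk_on_page": page_chunk_counts[page]})
    return out, dict(page_chunk_counts)

chunks, counts = split_documents([{"page": 1}, {"page": 1}, {"page": 2}])
print(counts)  # → {1: 2, 2: 1}
```

Calling the function twice in a row yields the same counts both times, which is exactly the property the instance-level counter loses.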
713-733: ⚠️ Potential issue | 🟠 Major

Local metadata model duplicates shared code and already drifted (`appendix` missing).

This file redefines `GenOSVectorMeta` / builder instead of reusing `genon/preprocessor/module/utils/metadata.py` (where `appendix` exists at lines 22-48). This creates schema inconsistency across preprocessors and risks metadata loss.

Also applies to: 735-833
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 713 - 733, The file redefines GenOSVectorMeta locally causing schema drift (missing appendix) instead of reusing the shared model; replace the local GenOSVectorMeta definition with an import of the canonical model from genon.preprocessor.module.utils.metadata (or subclass it) so the shared fields including appendix are present, update any builder/consumer code in intelligent_processor_ocr.py that referenced the local class to use the imported/shared GenOSVectorMeta (or a thin subclass) to ensure identical schema and avoid duplicated definitions.genon/preprocessor/facade/intelligent_processor_ocr.py-1276-1278 (1)
1276-1278: ⚠️ Potential issue | 🟠 Major

Don't suppress OCR failures with `print(...); pass`.

Lines 1276-1278 swallow all OCR errors, which makes partial failures invisible and hard to diagnose. At minimum, log structured context and raise a domain exception (or return an explicit failure signal).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 1276 - 1278, Replace the bare "except Exception as e: print(...); pass" in the OCR handling block in genon/preprocessor/facade/intelligent_processor_ocr.py with structured logging and an explicit failure signal: use the module logger (e.g., logging.getLogger(__name__)) to log an error including contextual fields (document id, page number, image metadata) and the exception details, then either raise a domain-specific exception (e.g., OCRProcessingError) with the original exception chained (raise OCRProcessingError("OCR failed for document X", ...) from e) or return a clear failure result/object instead of swallowing the error; update or add the OCRProcessingError class if needed and ensure callers handle the new exception/return value.genon/preprocessor/facade/intelligent_processor_ocr.py-1311-1311 (1)
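The log-and-raise pattern the prompt suggests can be sketched like this; `OCRProcessingError` and `ocr_cell` are hypothetical names from the review prompt, not existing symbols in the module:

```python
import logging

logger = logging.getLogger(__name__)

class OCRProcessingError(Exception):
    """Domain exception for OCR failures (hypothetical name from the review)."""

def failing_ocr() -> str:
    # Stand-in for a cell OCR call that blows up.
    raise RuntimeError("engine down")

def ocr_cell(cell_id: str, page: int, run_ocr) -> str:
    # Instead of `print(e); pass`, log structured context and re-raise a
    # domain error with the original exception chained, so the caller can
    # see and handle partial failures.
    try:
        return run_ocr()
    except Exception as e:
        logger.error("OCR failed", extra={"cell_id": cell_id, "page": page})
        raise OCRProcessingError(f"OCR failed for cell {cell_id} on page {page}") from e

try:
    ocr_cell("c1", 3, failing_ocr)
except OCRProcessingError as err:
    print(type(err.__cause__).__name__)  # → RuntimeError
```

Chaining with `raise ... from e` preserves the original traceback, so nothing the old `print` showed is lost.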
1311-1311: ⚠️ Potential issue | 🟠 Major

`has_text_items` table condition is inverted.

Line 1311 marks table content as present only when `len(item.data.table_cells) == 0`. For normal tables (cells > 0), this evaluates false and can incorrectly trigger the dummy `"."` fallback path.

✅ Suggested fix
```diff
-if (isinstance(item, (TextItem, ListItem, CodeItem, SectionHeaderItem)) and item.text and item.text.strip()) or (isinstance(item, TableItem) and item.data and len(item.data.table_cells) == 0):
+if (isinstance(item, (TextItem, ListItem, CodeItem, SectionHeaderItem)) and item.text and item.text.strip()) or (isinstance(item, TableItem) and item.data and len(item.data.table_cells) > 0):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/intelligent_processor_ocr.py` at line 1311, The condition that detects tables is inverted: in the compound if that checks item types (TextItem, ListItem, CodeItem, SectionHeaderItem, TableItem) change the TableItem branch to test for table_cells present (len(item.data.table_cells) > 0) instead of == 0 so real tables count as having text; update the expression that references item.data.table_cells accordingly to restore correct has_text_items behavior and avoid the dummy "." fallback.genon/preprocessor/facade/intelligent_processor_ocr.py-483-488 (1)
483-488: ⚠️ Potential issue | 🟠 Major

Oversized table splitting is effectively disabled.

In lines 483-488, the code detects token overflow but then forces `split_tables = [table_only_text]`, so no split actually happens. This can still emit chunks above `max_tokens` and break downstream embedding/model limits.

✂️ Suggested fix
```diff
-# split_tables = self._split_table_text(table_only_text, 4096)
-split_tables = [table_only_text]
+split_tables = self._split_table_text(table_only_text, self.max_tokens)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 483 - 488, When table_tokens > self.max_tokens replace the forced no-op split with a real split: call the existing helper _split_table_text(table_only_text, chunk_token_limit) instead of assigning split_tables = [table_only_text]; compute chunk_token_limit from self.max_tokens (or self.max_tokens minus any per-chunk overhead) and assign its result to split_tables, and fall back to [table_only_text] only if the splitter returns an empty list. Update the block that currently references table_only_text, _split_table_text and split_tables so oversized tables are actually chunked to respect self.max_tokens.genon/preprocessor/facade/intelligent_processor_ocr.py-1174-1178 (1)
1174-1178: ⚠️ Potential issue | 🟠 Major

Avoid blocking HTTP calls (`requests.post`) in the async request path.

`__call__` is async, but OCR uses synchronous `requests.post` per cell. This can block the event loop and degrade latency/throughput under concurrent traffic.

```bash
#!/bin/bash
# Verify blocking network calls in async flow.
set -euo pipefail
rg -n -C3 'async def __call__|def ocr_all_table_cells|requests\.post\(' genon/preprocessor/facade/intelligent_processor_ocr.py
```

Also applies to: 1282-1296
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/intelligent_processor_ocr.py` around lines 1174 - 1178, The code uses synchronous requests.post inside post_ocr_bytes (used from async __call__ / ocr_all_table_cells), which blocks the event loop; replace the sync call with an async HTTP client (e.g., aiohttp.ClientSession or httpx.AsyncClient): create an async function post_ocr_bytes_async(img_bytes, timeout=60) that performs the POST with the same HEADERS/payload and timeout, returns the parsed dict, and update callers (ocr_all_table_cells and __call__) to await post_ocr_bytes_async; ensure proper session lifecycle (reuse a session if many calls) and propagate/handle HTTP errors the same way as the current r.ok logic.
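A quick illustration of the offloading idea: the OCR endpoint is stubbed here with a blocking sleep, and `asyncio.to_thread` keeps the event loop free while each call runs. A dedicated async client (e.g. `httpx.AsyncClient` or `aiohttp`, as the prompt suggests) is the longer-term fix; the names below are a sketch, not the module's real API.

```python
import asyncio
import time

def post_ocr_bytes_blocking(img_bytes: bytes) -> dict:
    # Stand-in for requests.post(...): a synchronous, blocking call.
    time.sleep(0.05)
    return {"text": "ok", "n_bytes": len(img_bytes)}

async def post_ocr_bytes(img_bytes: bytes) -> dict:
    # Offload the blocking call to a worker thread so the event loop
    # stays responsive while the HTTP round-trip is in flight.
    return await asyncio.to_thread(post_ocr_bytes_blocking, img_bytes)

async def ocr_all_cells(cells: list[bytes]) -> list[dict]:
    # Cells now run concurrently instead of serially blocking the loop.
    return await asyncio.gather(*(post_ocr_bytes(c) for c in cells))

results = asyncio.run(ocr_all_cells([b"a", b"bb", b"ccc"]))
print([r["n_bytes"] for r in results])  # → [1, 2, 3]
```

With three cells the total wall time is roughly one sleep, not three, which is the latency win the finding is after.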
🟡 Minor comments (2)
genon/preprocessor/module/test.py (1)
55-59: ⚠️ Potential issue | 🟡 Minor

Remove the committed debugger stop.

`breakpoint()` makes this runner hang in non-interactive environments right after writing `result.json`.

Proposed fix
```diff
 end = time.time()
 print(f"Processing time: {end - begin:.2f} seconds")
-
-
-breakpoint()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/module/test.py` around lines 55 - 59, The committed debugger call breakpoint() at the end of genon/preprocessor/module/test.py causes the runner to hang; remove the breakpoint() call (or replace it with a non-blocking log statement) after the timing print (the end/ begin timing block) so the test runner exits normally; ensure no other interactive debugger calls remain in the module.genon/preprocessor/facade/attachment_processor.py-130-131 (1)
130-131: ⚠️ Potential issue | 🟡 Minor — Keep the metadata in sync when you rewrite `r.text`.

After this replacement, `n_char`, `n_word`, and `n_line` still describe the original chunk. That leaves the vector payload internally inconsistent.

Proposed fix

```diff
 if answer.startswith("[UNSAFE]"):
-    r.text = "부적절한 텍스트가 포함되어 있으므로 해당 청크를 제거합니다."
+    replacement = "부적절한 텍스트가 포함되어 있으므로 해당 청크를 제거합니다."
+    r.text = replacement
+    if hasattr(r, "n_char"):
+        r.n_char = len(replacement)
+    if hasattr(r, "n_word"):
+        r.n_word = len(replacement.split())
+    if hasattr(r, "n_line"):
+        r.n_line = len(replacement.splitlines())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/facade/attachment_processor.py` around lines 130 - 131, The replacement of r.text when answer.startswith("[UNSAFE]") leaves the metadata (n_char, n_word, n_line) inconsistent; update those metadata fields to reflect the new text after setting r.text (e.g., compute new character count, word count, and line count from the replacement string) and store them back into r.metadata (or the same metadata container used for n_char, n_word, n_line) so the vector payload remains consistent with r.text.
🧹 Nitpick comments (3)
genon/preprocessor/scripts/register_image.sh (2)
163-163: Syntax anomaly: `2>/dev/null` after string assignment.

The `2>/dev/null` redirect after the closing quote of `SQL_INSERT` has no effect since this is a variable assignment, not a command execution. This appears to be leftover from previous code.

🧹 Remove the ineffective redirect

```diff
 VALUES
-    (LAST_INSERT_ID(), 'DOCKER_IMAGE', 1, 1, NOW(), NOW(), 1, 1);
-    " 2>/dev/null
+    (LAST_INSERT_ID(), 'DOCKER_IMAGE', 1, 1, NOW(), NOW(), 1, 1);
+    "
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@genon/preprocessor/scripts/register_image.sh` at line 163, Remove the stray redirection following the SQL_INSERT assignment: locate the SQL_INSERT variable assignment in register_image.sh and delete the trailing `2>/dev/null` that appears after the closing quote (it's ineffective on assignments); ensure only the intended string is assigned to SQL_INSERT and that no redirection tokens remain adjacent to the quoted value.
166-166: Avoid `-it` flags in non-interactive `kubectl exec`.

Using `-it` (interactive TTY) in a script context can cause issues with output parsing and may fail in CI/CD environments. Use `-i` only or omit both flags when capturing output.

♻️ Proposed fix

```diff
 IMAGE_ID=$(
-  kubectl exec -it "${MARIADB_POD}" -n "${K8S_NAMESPACE}" -- \
+  kubectl exec -i "${MARIADB_POD}" -n "${K8S_NAMESPACE}" -- \
     mysql -u "${MYSQL_USER}" -p"${MYSQL_PASS}" llmops -se \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `genon/preprocessor/scripts/register_image.sh` at line 166, in the `kubectl exec` invocation that uses the variables `MARIADB_POD` and `K8S_NAMESPACE`, remove the interactive TTY flag (`-t`) so the script doesn't use `-it` in non-interactive contexts; replace `-it` with either `-i` (if stdin must be kept open) or omit both flags to capture output reliably in CI/CD. Locate the `kubectl exec` line in register_image.sh and update the flags accordingly while keeping the pod/namespace variables and remaining command arguments unchanged.

genon/README.md (1)
81-84: Add language specifier to fenced code block.

The code block is missing a language identifier, which affects syntax highlighting and linting compliance.

📝 Proposed fix

````diff
-```
+```shell
 docker save mncregistry:30500/doc-parser-ocr:1.3.3-komipo | gzip > doc-parser-ocr.tar.gz
 gunzip -c doc-parser-ocr.tar.gz | docker load
 ```
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `genon/README.md` around lines 81-84, the fenced code block containing the two commands starting with `docker save mncregistry:30500/doc-parser-ocr:1.3.3-komipo | gzip > doc-parser-ocr.tar.gz` and `gunzip -c doc-parser-ocr.tar.gz | docker load` needs a language specifier; change the opening triple backticks to include a language (e.g., `shell`) to enable correct syntax highlighting and satisfy linting rules.

---

ℹ️ Review info

⚙️ Run configuration

**Configuration used**: defaults
**Review profile**: CHILL
**Plan**: Pro
**Run ID**: `d14a1a4c-0e58-4938-92e4-4aabcc22ca7a`

📥 Commits

Reviewing files that changed from the base of the PR and between bdcd8184ca01c43395d775e6d0828702d0e6e7e6 and b5877fe12fa2066a9a7fec8441575ad1e0b4ffbb.

📒 Files selected for processing (19)

* `build-script/doc-parser-build.config`
* `build-script/paddle-ocr-build.config`
* `genon/README.md`
* `genon/preprocessor/facade/attachment_processor.py`
* `genon/preprocessor/facade/attachment_processor_guardrail.py`
* `genon/preprocessor/facade/intelligent_processor_ocr.py`
* `genon/preprocessor/facade/json_processor.py`
* `genon/preprocessor/facade/legacy/적재용(외부)_ocr.py`
* `genon/preprocessor/facade/oneagent_processor.py`
* `genon/preprocessor/module/base_processor.py`
* `genon/preprocessor/module/intelligent_processor.py`
* `genon/preprocessor/module/test.py`
* `genon/preprocessor/module/test_processor.py`
* `genon/preprocessor/module/utils/chunkers.py`
* `genon/preprocessor/module/utils/genos_util.py`
* `genon/preprocessor/module/utils/metadata.py`
* `genon/preprocessor/module/utils/util.py`
* `genon/preprocessor/scripts/register.config`
* `genon/preprocessor/scripts/register_image.sh`
```python
GUARDRAIL_WORKFLOW_ID = 694
GUARDRAIL_BEARER_TOKEN = '23c3898fe3264fd597961af23a68fe7c'
```
Remove and rotate the committed guardrail token.
This bearer token is now part of the repository history, so anyone with source access can call the workflow service. Load it from environment/secret storage instead and rotate the exposed token before merge.
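A minimal sketch of the environment-based alternative the comment asks for — the variable name is kept, but the loader function and error message are illustrative, not the project's actual config layer:

```python
import os

def load_guardrail_token() -> str:
    """Read the guardrail bearer token from the environment instead of source."""
    token = os.environ.get("GUARDRAIL_BEARER_TOKEN")
    if not token:
        # Fail fast so a misconfigured deployment is caught at startup,
        # not on the first guardrail call.
        raise RuntimeError("GUARDRAIL_BEARER_TOKEN is not set")
    return token
```

The secret then lives only in the deployment environment (Kubernetes Secret, CI variable), and the leaked value can be rotated without another code change.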
🧰 Tools
🪛 Betterleaks (1.1.2)
[high] 101-101: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 Ruff (0.15.10)
[error] 101-101: Possible hardcoded password assigned to: "GUARDRAIL_BEARER_TOKEN"
(S105)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines 100
- 101, The hardcoded secret GUARDRAIL_BEARER_TOKEN (and any other sensitive
value like GUARDRAIL_WORKFLOW_ID if it's sensitive) must be removed from source;
change code to read the bearer token from secure configuration/environment
(e.g., env var name GUARDRAIL_BEARER_TOKEN) and fail fast with a clear error if
it is missing, and ensure callers still use the GUARDRAIL_WORKFLOW_ID constant
only if non-sensitive; rotate the exposed token before merging so the leaked
value is invalidated and update documentation/CI to inject the new secret into
environments instead of committing it.
```python
def guardrail(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        result = await func(*args, **kwargs)
        for r in result:
            url = f"{GENOS_URL}/workflow/{GUARDRAIL_WORKFLOW_ID}"
            headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")

            if hasattr(r, "text"):
                body = {'question': r.text}

            res = requests.post(f'{url}/run/v2', json=body, headers=headers)

            answer = res.json()['data']['text']

            if answer.startswith("[UNSAFE]"):
                r.text = "부적절한 텍스트가 포함되어 있으므로 해당 청크를 제거합니다."

        return result
```
🧩 Analysis chain
🏁 Script executed:

```shell
fd -type f -name "*attachment_processor_guardrail*" | head -20
```

Repository: genonai/doc_parser
Length of output: 232

🏁 Script executed:

```shell
cat -n genon/preprocessor/facade/attachment_processor_guardrail.py
```

Repository: genonai/doc_parser
Length of output: 50374

🏁 Script executed:

```shell
rg -n "guardrail" genon/preprocessor/facade/attachment_processor_guardrail.py | head -20
```

Repository: genonai/doc_parser
Length of output: 107

🏁 Script executed:

```shell
rg -A 5 "@guardrail" --max-count 10
```

Repository: genonai/doc_parser
Length of output: 1894

🏁 Script executed:

```shell
rg -B 5 -A 15 "requests.post" genon/preprocessor/facade/attachment_processor_guardrail.py | head -50
```

Repository: genonai/doc_parser
Length of output: 1584

🏁 Script executed:

```shell
rg "timeout" genon/preprocessor/facade/attachment_processor_guardrail.py
```

Repository: genonai/doc_parser
Length of output: 172
Replace synchronous HTTP request with async implementation or use executor to prevent event loop blocking.
The guardrail decorator wraps async functions but makes synchronous requests.post() calls serially inside a for loop without timeout or status validation. This blocks the entire event loop while waiting for the workflow service, stalling all concurrent requests. Use an async HTTP client like aiohttp or httpx, or wrap the blocking call with asyncio.to_thread(). Additionally, add a timeout (e.g., timeout=30) and validate HTTP status before parsing JSON to handle service failures gracefully.
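The `asyncio.to_thread` route the comment recommends can be sketched generically — here the blocking callable stands in for `requests.post`, and the endpoint, payload, and 30-second timeout are placeholders:

```python
import asyncio

async def call_guardrail(blocking_post, url: str, body: dict, headers: dict) -> dict:
    """Run a blocking HTTP callable (e.g. requests.post) off the event loop."""
    # to_thread hands the call to a worker thread, so other coroutines
    # keep running while we wait; the timeout bounds the blocking call.
    res = await asyncio.to_thread(
        blocking_post, url, json=body, headers=headers, timeout=30
    )
    res.raise_for_status()  # surface 4xx/5xx before touching the JSON body
    return res.json()
```

An async client such as `httpx.AsyncClient` removes the thread hop entirely, but the thread-offload version is the smallest change to the existing `requests`-based code.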
🧰 Tools
🪛 Ruff (0.15.10)
[error] 120-120: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines 109
- 128, The guardrail decorator's wrapper currently makes blocking synchronous
requests.post calls inside an async loop (using GENOS_URL,
GUARDRAIL_WORKFLOW_ID, GUARDRAIL_BEARER_TOKEN and reading/modifying r.text),
which blocks the event loop; change the implementation to perform the HTTP call
asynchronously (use an async client like aiohttp or httpx async) or offload to a
thread via asyncio.to_thread, include a timeout (e.g., 30s) and check HTTP
status before calling res.json(), and handle network/errors by skipping or
marking the chunk instead of raising so wrapper keeps processing other items.
```python
async def __call__(self, request: Request, file_path: str, **kwargs: dict):
    document: DoclingDocument = self.load_documents(file_path, **kwargs)
    artifacts_dir, reference_path = self.get_paths(file_path)
    document = document._with_pictures_refs(image_dir=artifacts_dir, reference_path=reference_path)

    chunks: list[DocChunk] = self.split_documents(document, **kwargs)

    vectors = []
    if len(chunks) >= 1:
        vectors: list[dict] = await self.compose_vectors(document, chunks, file_path, request, **kwargs)
    else:
        raise GenosServiceException(1, f"chunk length is 0")

    text = ""
    for vector in vectors:
        if len(text) + len(vector.text) > 8192:
            break
        text += vector.text

    return [vectors[0]]
```
Don't discard all but the first HWPX vector.
After building vectors, this branch always returns [vectors[0]]. Any additional chunks/pages from an HWPX document are silently lost.
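If the 8192-character loop is meant to cap the returned payload rather than drop pages, one fix is to return the consumed prefix instead of `[vectors[0]]` — a sketch under that assumption (the helper name is hypothetical):

```python
def take_vectors_within_budget(vectors: list, limit: int = 8192) -> list:
    """Return the leading vectors whose combined text fits within limit."""
    taken, total = [], 0
    for v in vectors:
        if total + len(v.text) > limit:
            break
        total += len(v.text)
        taken.append(v)
    # Keep at least the first vector so a single oversized chunk
    # does not make the result empty.
    return taken or vectors[:1]
```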
🧰 Tools
🪛 Ruff (0.15.10)
[error] 1227-1227: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines
1216 - 1235, The __call__ method builds a list named vectors via compose_vectors
but then always returns only [vectors[0]], dropping remaining HWPX pages; change
the return logic to preserve all composed vectors (or the subset actually
included under the 8192-char accumulation) instead of returning only the first:
locate __call__, the vectors variable and the for loop that accumulates text and
either return the full vectors list or return the slice of vectors consumed by
the text-length limit so all chunk/page vectors are preserved (use
compose_vectors, vectors, and the accumulation loop to determine the correct
subset).
```python
try:
    conv_result: ConversionResult = self.ocr_converter.convert(file_path, raises_on_error=True)
except Exception as e:
    print("@@@@", e)
    # conv_result: ConversionResult = self.ocr_second_converter.convert(file_path, raises_on_error=True)

return conv_result.document
```
Handle OCR conversion failures before returning conv_result.document.
If self.ocr_converter.convert() raises, the except block only logs and execution still falls through to return conv_result.document. That turns the original OCR error into UnboundLocalError.
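One shape for the fix, assuming the commented-out `ocr_second_converter` is the intended fallback: attempt it inside the handler so `conv_result` is always bound, and re-raise the original error only when both converters fail.

```python
def convert_with_fallback(self, file_path: str):
    try:
        conv_result = self.ocr_converter.convert(file_path, raises_on_error=True)
    except Exception as primary_err:
        try:
            # Fall back to the secondary converter; conv_result is now
            # guaranteed to be bound before the return below.
            conv_result = self.ocr_second_converter.convert(file_path, raises_on_error=True)
        except Exception:
            # Both converters failed: propagate the original OCR error
            # instead of an UnboundLocalError.
            raise primary_err
    return conv_result.document
```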
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 1442-1442: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/facade/attachment_processor_guardrail.py` around lines
1440 - 1446, The try/except around self.ocr_converter.convert() can leave
conv_result undefined on error and cause UnboundLocalError; update the except
block to handle failures before returning conv_result.document by either
re-raising the caught exception or attempting the fallback converter (call
self.ocr_second_converter.convert(file_path, raises_on_error=True) and assign to
conv_result) and only returning conv_result.document after conv_result is
guaranteed set; ensure any logged error includes the exception and that a proper
exception is raised or a sensible fallback return is returned if both converters
fail.
```python
def guardrail(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        result = await func(*args, **kwargs)

        for r in result:
            url = f"{GENOS_URL}/api/gateway/workflow/{GUARDRAIL_WORKFLOW_ID}"
            headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")

            if hasattr(r, "text"):
                body = {"question": r.text}

            res = requests.post(f"{url}/run/v2", json=body, headers=headers)

            answer = res.json()["data"]["text"]
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, check if the file exists and get its structure
file_path="genon/preprocessor/facade/attachment_processor.py"
if [ -f "$file_path" ]; then
  echo "=== File exists ==="
  wc -l "$file_path"
  echo ""
  echo "=== Lines 100-140 (context around 114-128) ==="
  sed -n '100,140p' "$file_path"
else
  echo "File not found. Searching for similar files..."
  find . -name "*attachment_processor*" -type f 2>/dev/null
fi
```

Repository: genonai/doc_parser
Length of output: 1215

🏁 Script executed:

```shell
# Check imports at the top of the file
head -50 "genon/preprocessor/facade/attachment_processor.py"
```

Repository: genonai/doc_parser
Length of output: 1444

🏁 Script executed:

```shell
# Search for guardrail decorator usage and related async patterns
rg "def guardrail|asyncio\.|to_thread" "genon/preprocessor/facade/attachment_processor.py" -A 3 -B 1
```

Repository: genonai/doc_parser
Length of output: 963

🏁 Script executed:

```shell
# Check if requests is imported and how it's used elsewhere in the file
rg "import requests|requests\." "genon/preprocessor/facade/attachment_processor.py" -B 1 -A 1
```

Repository: genonai/doc_parser
Length of output: 361
Move the synchronous requests.post() call out of the async event loop.
The decorator makes a blocking network call for every chunk with no timeout or error handling. If the guardrail service is slow or fails, it will stall the entire async preprocessing pipeline.
Use asyncio.to_thread() to run the POST in a thread pool, add a timeout, and safely handle the response:
Suggested implementation
```diff
 def guardrail(func):
     @wraps(func)
     async def wrapper(*args, **kwargs):
         result = await func(*args, **kwargs)
+        if not GENOS_URL or not GUARDRAIL_BEARER_TOKEN:
+            return result
         for r in result:
             url = f"{GENOS_URL}/api/gateway/workflow/{GUARDRAIL_WORKFLOW_ID}"
             headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")
             if hasattr(r, "text"):
                 body = {"question": r.text}
-                res = requests.post(f"{url}/run/v2", json=body, headers=headers)
-
-                answer = res.json()["data"]["text"]
+                res = await asyncio.to_thread(
+                    requests.post,
+                    f"{url}/run/v2",
+                    json=body,
+                    headers=headers,
+                    timeout=5,
+                )
+                res.raise_for_status()
+                payload = res.json()
+                answer = ((payload.get("data") or {}).get("text") or "")
```
🪛 Ruff (0.15.10)
[error] 126-126: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/facade/attachment_processor.py` around lines 114 - 128,
The guardrail decorator's wrapper currently performs synchronous requests.post
calls inside the async loop; change it to run the network call in a thread via
asyncio.to_thread (or run_in_executor) and wrap that call with asyncio.wait_for
to enforce a timeout, plus add try/except to catch RequestException, timeouts,
and JSON/key errors; specifically update the loop in wrapper (where it builds
url/headers/body and calls requests.post then res.json()) to call a helper that
performs requests.post in a thread, await it with a timeout, check
res.status_code before accessing res.json(), and handle/log exceptions and
missing "data"/"text" keys so a failed guardrail call does not block or crash
the async preprocessing pipeline.
```python
# @@@@ 성민: OneAgent 연동용
if "uploads" in kwargs.keys():
    import base64
    uploads = kwargs.get("uploads", None)[0]

    # @@@@ 전처리기 파일 저장 경로
    folder = "/nfs-root/tmp/uploads"

    decoded = base64.b64decode(uploads['data'].split(",", 1)[1])
    file_path = os.path.join(folder, uploads['name'])

    with open(file_path, 'wb') as f:
        f.write(decoded)
```
Sanitize uploaded filenames before writing them.
uploads["name"] is joined directly into /nfs-root/tmp/uploads. A filename containing ../ can escape that directory and overwrite arbitrary files on the worker.
🐛 Proposed fix
if "uploads" in kwargs.keys():
import base64
uploads = kwargs.get("uploads", None)[0]
@@
# @@@@ 전처리기 파일 저장 경로
folder = "/nfs-root/tmp/uploads"
+ os.makedirs(folder, exist_ok=True)
decoded = base64.b64decode(uploads['data'].split(",", 1)[1])
- file_path = os.path.join(folder, uploads['name'])
+ safe_name = f"{uuid.uuid4()}_{os.path.basename(uploads['name'])}"
+ file_path = os.path.join(folder, safe_name)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/facade/oneagent_processor.py` around lines 1566 - 1579,
The code writes an uploaded file using uploads['name'] directly into folder and
is vulnerable to path traversal; update the block handling uploads (variables:
uploads, folder, file_path) to sanitize and validate the filename before
joining: obtain a safe base name (e.g., use os.path.basename or a
secure_filename routine), reject or normalize names that contain path separators
or resolve outside the target folder, and/or replace the name with a generated
safe token (UUID) plus the original extension; then construct file_path with
os.path.join(folder, safe_name) and ensure folder exists and file_path resides
under folder (e.g., by comparing resolved absolute paths) before writing the
decoded data.
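Complementing the UUID rename, a containment check on the resolved path gives defense in depth — a hedged sketch (the helper name is illustrative, not part of the processor):

```python
import os

def safe_upload_path(folder: str, name: str) -> str:
    """Join a client-supplied name under folder, rejecting path traversal."""
    # Drop any directory components the client sent
    # ("../../etc/passwd" -> "passwd").
    base = os.path.basename(name)
    candidate = os.path.realpath(os.path.join(folder, base))
    root = os.path.realpath(folder)
    # Require the resolved path to stay inside the upload root.
    if os.path.commonpath([candidate, root]) != root:
        raise ValueError(f"unsafe upload name: {name!r}")
    return candidate
```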
```python
# # Mistral-Small-3.1-24B-Instruct-2503, 운영망
# toc_api_base_url="https://genos.genon.ai:3443/api/gateway/rep/serving/502/v1/chat/completions",
# metadata_api_base_url="https://genos.genon.ai:3443/api/gateway/rep/serving/502/v1/chat/completions",
# toc_api_key="022653a3743849e299f19f19d323490b",
# metadata_api_key="022653a3743849e299f19f19d323490b",
# # Mistral-Small-3.1-24B-Instruct-2503, 한국은행 클러스터
# # toc_api_base_url="http://llmops-gateway-api-service:8080/serving/13/31/v1/chat/completions",
# # metadata_api_base_url="http://llmops-gateway-api-service:8080/serving/13/31/v1/chat/completions",
# # toc_api_key="9e32423947fd4a5da07a28962fe88487",
# # metadata_api_key="9e32423947fd4a5da07a28962fe88487",
```
Remove and rotate the API keys in comments.
Commented credentials are still committed secrets and remain readable in repository history and logs. These values need to be removed from source and rotated before merge.
🧰 Tools
🪛 Betterleaks (1.1.2)
[high] 1034-1034: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
[high] 1035-1035: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
[high] 1039-1039: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
[high] 1040-1040: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/module/intelligent_processor.py` around lines 1031 - 1040,
Remove the hard-coded and commented API credentials and keys (the commented
toc_api_key and metadata_api_key and any commented
toc_api_base_url/metadata_api_base_url entries) from the source in
intelligent_processor.py, replacing them with non-secret placeholders or
references to environment/config variables used by your app (e.g., use
TOC_API_KEY/METADATA_API_KEY env names) and update any documentation comments
accordingly; delete the historical secret values from comments, ensure no other
commented credentials remain in the file (search for toc_api_key,
metadata_api_key, toc_api_base_url, metadata_api_base_url), and confirm that the
real keys are rotated/invalidated before merging.
```python
def get_media_files(self, doc_items: list):
    temp_list = []
    for item in doc_items:
        if isinstance(item, PictureItem):
            name = path.rsplit("/", 1)[-1]
            temp_list.append({"path": path, "name": name})
    return temp_list
```
get_media_files() will crash on the first picture.
path is never defined before path.rsplit(...). When upload_files is enabled and a chunk contains an image, this raises NameError from compose_vectors().
🐛 Proposed fix
```diff
 def get_media_files(self, doc_items: list):
     temp_list = []
     for item in doc_items:
         if isinstance(item, PictureItem):
+            path = str(item.image.uri)
             name = path.rsplit("/", 1)[-1]
             temp_list.append({"path": path, "name": name})
     return temp_list
```
🪛 Ruff (0.15.10)
[error] 1286-1286: Undefined name path
(F821)
[error] 1287-1287: Undefined name path
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/module/intelligent_processor.py` around lines 1282 - 1288,
get_media_files currently uses an undefined variable path which causes a
NameError; update the function (get_media_files) to read the path from the
PictureItem instance (e.g., item.path or item.src depending on the PictureItem
API), skip items with no path, and compute the filename with os.path.basename
(or path.rsplit("/",1)[-1]) before appending {"path": path, "name": name};
ensure the code references the correct attribute on PictureItem and adds a
defensive check for missing/empty paths so compose_vectors doesn't crash when
images are present.
```python
if upload_files:
    file_list = self.get_media_files(chunk.meta.doc_items)
    upload_tasks.append(asyncio.create_task(upload_files(file_list, request=request)))
```
This upload path calls a method that does not exist.
GenOSVectorMetaBuilder defines set_media_files(...), but not get_media_files(...). As soon as upload_files is available, this branch raises AttributeError and breaks the whole request.
Proposed fix
```diff
+    def get_media_files(self, doc_items: list) -> list[dict]:
+        files = []
+        for item in doc_items:
+            if isinstance(item, PictureItem) and item.image is not None:
+                path = str(item.image.uri)
+                name = path.rsplit("/", 1)[-1]
+                files.append({"path": path, "name": name})
+        return files
+
@@
             chunk_index_on_page += 1
             if upload_files:
                 file_list = self.get_media_files(chunk.meta.doc_items)
                 upload_tasks.append(asyncio.create_task(upload_files(file_list, request=request)))
```
upload_tasks.append(asyncio.create_task(upload_files(file_list, request=request)))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/module/utils/metadata.py` around lines 345 - 347, The code
calls a non-existent method get_media_files which raises AttributeError; either
implement get_media_files(doc_items) on GenOSVectorMetaBuilder to return the
list of media file paths extracted from chunk.meta.doc_items, or change this
branch to use the existing set_media_files API (e.g., call
self.set_media_files(chunk.meta.doc_items) and then read the list from the
attribute that set_media_files populates) so that upload_files(upload_files,
request=request) receives a proper list; update the upload_tasks creation
(asyncio.create_task(upload_files(file_list, request=request))) to use the valid
file_list source and ensure file_list is an actual list of file paths/objects
acceptable to upload_files.
```python
def convert_md_to_pdf(md_path):
    """Markdown 파일을 PDF로 변환"""
    install_packages(["chardet"])
    import chardet
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Find the util.py file
fd -t f "util.py" | head -20
```

Repository: genonai/doc_parser
Length of output: 147

🏁 Script executed:

```shell
# Read the relevant file and lines
if [ -f "genon/preprocessor/module/utils/util.py" ]; then
  cat -n "genon/preprocessor/module/utils/util.py" | sed -n '110,125p'
fi
```

Repository: genonai/doc_parser
Length of output: 601

🏁 Script executed:

```shell
# Check for install_packages definition or import in the file
rg "install_packages" genon/preprocessor/module/utils/util.py -B 2 -A 2
```

Repository: genonai/doc_parser
Length of output: 178

🏁 Script executed:

```shell
# Also check for any imports at the top of the file
head -50 "genon/preprocessor/module/utils/util.py" | cat -n
```

Repository: genonai/doc_parser
Length of output: 1782

🏁 Script executed:

```shell
# Search entire file for install_packages
rg "install_packages" genon/preprocessor/module/utils/util.py
```

Repository: genonai/doc_parser
Length of output: 95

🏁 Script executed:

```shell
# Check if install_packages might be defined in genos_util.py or imported
rg "install_packages" genon/preprocessor/module/utils/
```

Repository: genonai/doc_parser
Length of output: 135

🏁 Script executed:

```shell
# Check broader search for install_packages in the codebase
rg "def install_packages|import.*install_packages" --type py | head -20
```

Repository: genonai/doc_parser
Length of output: 393

🏁 Script executed:

```shell
# Check which install_packages definitions exist and their context
rg "def install_packages" -A 5 genon/preprocessor/facade/ | head -40
```

Repository: genonai/doc_parser
Length of output: 2568

🏁 Script executed:

```shell
# Check entire convert_md_to_pdf function for context
sed -n '115,160p' genon/preprocessor/module/utils/util.py | cat -n
```

Repository: genonai/doc_parser
Length of output: 1345

🏁 Script executed:

```shell
# Check if util.py imports anything from facade modules
rg "from.*facade|import.*facade" genon/preprocessor/module/utils/util.py
```

Repository: genonai/doc_parser
Length of output: 44
convert_md_to_pdf() crashes immediately with NameError due to undefined install_packages.
Line 117 calls install_packages(["chardet"]), but this function is neither defined nor imported in this module. The function will fail before any conversion work occurs.
The install_packages() function exists in other facade modules (e.g., genon/preprocessor/facade/attachment_processor.py), but is not available in this utility module. Either import it from the appropriate facade module or remove the call if dynamic package installation is not needed.
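If dynamic installation is genuinely wanted in the utility module, the idea behind the facade modules' helper can be mirrored locally — a hedged sketch (the facade's actual implementation may differ):

```python
import importlib
import subprocess
import sys

def install_packages(packages: list[str]) -> None:
    """Install any listed packages that are not already importable."""
    for pkg in packages:
        try:
            importlib.import_module(pkg)
        except ImportError:
            # Use the running interpreter's pip so the install lands in
            # the same environment the caller imports from.
            subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])
```

Declaring `chardet` as a normal dependency is the simpler route when the runtime image can be rebuilt.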
🧰 Tools
🪛 Ruff (0.15.10)
[error] 117-117: Undefined name install_packages
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@genon/preprocessor/module/utils/util.py` around lines 115 - 118, The
convert_md_to_pdf function calls install_packages(["chardet"]) but
install_packages is not defined or imported, causing a NameError; fix by either
importing install_packages from its provider (e.g., the facade module where it's
defined) and using that import in convert_md_to_pdf, or remove the dynamic
install call if not needed and rely on a normal dependency declaration; update
the top of genon/preprocessor/module/utils/util.py to import the
install_packages symbol (or delete the call inside convert_md_to_pdf) and ensure
any side-effects (like error handling) are preserved.
<Additional content>
Checklist:
Summary by CodeRabbit
New Features
Bug Fixes
Chores
`1.3.7-komipo` and `1.3.3-komipo`