13 changes: 2 additions & 11 deletions build-script/doc-parser-build.config
@@ -6,19 +6,10 @@ DOCKER_REGISTRY=mncregistry:30500
IMAGE_NAME=doc-parser-preprocessor

# Version (can be replaced with a git tag, branch name, date, etc.)
IMAGE_VERSION=1.3.6.2
IMAGE_VERSION=1.3.7-komipo

# Actual Dockerfile location (relative to the repository root)
DOCKERFILE_PATH=genon/preprocessor/docker/Dockerfile

# Whether to push the image after building
PUSH_IMAGE=true

# USER, GROUP
APP_UID=3000
APP_GID=3000
APP_UNAME=genos
APP_GNAME=genos

# NLTK packages (comma-separated). Use "all" to download everything.
APP_NLTK_PACKAGES=punkt,stopwords,averaged_perceptron_tagger,averaged_perceptron_tagger_eng,wordnet,omw-1.4
PUSH_IMAGE=false
2 changes: 1 addition & 1 deletion build-script/paddle-ocr-build.config
@@ -6,7 +6,7 @@ DOCKERFILE=genon/serving/paddle/docker/Dockerfile

# Image name/tag
IMAGE_NAME=doc-parser-ocr
IMAGE_TAG=0.0.0
IMAGE_TAG=1.3.7-komipo

# Registry to push to (leave empty if none)
REGISTRY=mncregistry:30500
10 changes: 8 additions & 2 deletions genon/README.md
@@ -60,7 +60,7 @@
6. When deploying to a site
```shell
1. Save the image
docker save mncregistry:30500/mnc/doc-parser-preprocessor:latest | gzip > doc-parser-preprocessor.tar.gz
docker save mncregistry:30500/mnc/doc-parser-preprocessor:1.3.3-komipo | gzip > doc-parser-preprocessor.tar.gz
2. Restore the image at the site
gunzip -c doc-parser-preprocessor.tar.gz | docker load
3. Run the register_image.sh file
@@ -75,4 +75,10 @@ gunzip -c doc-parser-preprocessor.tar.gz | docker load
```shell
kubectl apply -f doc-parser-ocr-deployment.yaml
```
5. To deploy via NodePort, use [doc-parser-ocr-deployment-node-port.yaml](serving/paddle/k8s-manifest/doc-parser-ocr-deployment-node-port.yaml)
5. To deploy via NodePort, use [doc-parser-ocr-deployment-node-port.yaml](serving/paddle/k8s-manifest/doc-parser-ocr-deployment-node-port.yaml)

At the site
```
docker save mncregistry:30500/doc-parser-ocr:1.3.3-komipo | gzip > doc-parser-ocr.tar.gz
gunzip -c doc-parser-ocr.tar.gz | docker load
```
37 changes: 37 additions & 0 deletions genon/preprocessor/facade/attachment_processor.py
@@ -99,6 +99,41 @@
# Extensions eligible for PDF conversion
CONVERTIBLE_EXTENSIONS = ['.hwp', '.txt', '.json', '.md', '.ppt', '.pptx', '.docx']

## Guardrail added as a remediation from the security consulting review

import requests
import re
import json

GUARDRAIL_WORKFLOW_ID = 694
GUARDRAIL_BEARER_TOKEN = ""
GENOS_URL = ""

from functools import wraps

def guardrail(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        result = await func(*args, **kwargs)

        for r in result:
            url = f"{GENOS_URL}/api/gateway/workflow/{GUARDRAIL_WORKFLOW_ID}"
            headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")

            if hasattr(r, "text"):
                body = {"question": r.text}

                res = requests.post(f"{url}/run/v2", json=body, headers=headers)

                answer = res.json()["data"]["text"]
Comment on lines +114 to +128
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and get its structure
file_path="genon/preprocessor/facade/attachment_processor.py"
if [ -f "$file_path" ]; then
    echo "=== File exists ==="
    wc -l "$file_path"
    echo ""
    echo "=== Lines 100-140 (context around 114-128) ==="
    sed -n '100,140p' "$file_path"
else
    echo "File not found. Searching for similar files..."
    find . -name "*attachment_processor*" -type f 2>/dev/null
fi

Repository: genonai/doc_parser

Length of output: 1215


🏁 Script executed:

# Check imports at the top of the file
head -50 "genon/preprocessor/facade/attachment_processor.py"

Repository: genonai/doc_parser

Length of output: 1444


🏁 Script executed:

# Search for guardrail decorator usage and related async patterns
rg "def guardrail|asyncio\.|to_thread" "genon/preprocessor/facade/attachment_processor.py" -A 3 -B 1

Repository: genonai/doc_parser

Length of output: 963


🏁 Script executed:

# Check if requests is imported and how it's used elsewhere in the file
rg "import requests|requests\." "genon/preprocessor/facade/attachment_processor.py" -B 1 -A 1

Repository: genonai/doc_parser

Length of output: 361


Move the synchronous requests.post() call out of the async event loop.

The decorator makes a blocking network call for every chunk with no timeout or error handling. If the guardrail service is slow or fails, it will stall the entire async preprocessing pipeline.

Use asyncio.to_thread() to run the POST in a thread pool, add a timeout, and safely handle the response:

Suggested implementation
def guardrail(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        result = await func(*args, **kwargs)
+        if not GENOS_URL or not GUARDRAIL_BEARER_TOKEN:
+            return result
 
         for r in result:
             url = f"{GENOS_URL}/api/gateway/workflow/{GUARDRAIL_WORKFLOW_ID}"
             headers = dict(Authorization=f"Bearer {GUARDRAIL_BEARER_TOKEN}")
 
             if hasattr(r, "text"):
                 body = {"question": r.text}
 
-                res = requests.post(f"{url}/run/v2", json=body, headers=headers)
-
-                answer = res.json()["data"]["text"]
+                res = await asyncio.to_thread(
+                    requests.post,
+                    f"{url}/run/v2",
+                    json=body,
+                    headers=headers,
+                    timeout=5,
+                )
+                res.raise_for_status()
+                payload = res.json()
+                answer = ((payload.get("data") or {}).get("text") or "")
🧰 Tools
🪛 Ruff (0.15.10)

[error] 126-126: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@genon/preprocessor/facade/attachment_processor.py` around lines 114 - 128,
The guardrail decorator's wrapper currently performs synchronous requests.post
calls inside the async loop; change it to run the network call in a thread via
asyncio.to_thread (or run_in_executor) and wrap that call with asyncio.wait_for
to enforce a timeout, plus add try/except to catch RequestException, timeouts,
and JSON/key errors; specifically update the loop in wrapper (where it builds
url/headers/body and calls requests.post then res.json()) to call a helper that
performs requests.post in a thread, await it with a timeout, check
res.status_code before accessing res.json(), and handle/log exceptions and
missing "data"/"text" keys so a failed guardrail call does not block or crash
the async preprocessing pipeline.
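
The prompt above can be sketched as a small, transport-agnostic decorator. This is a minimal sketch under stated assumptions, not the PR's implementation: `check` is a hypothetical blocking callable (in this PR it would wrap `requests.post(..., timeout=...)` and return the workflow's answer text), `GUARDRAIL_TIMEOUT_S` and `BLOCKED_TEXT` are illustrative values, and keeping a chunk when the check fails (fail open) is an assumed policy that a security review may want to invert.

```python
import asyncio
import logging
from functools import wraps

logger = logging.getLogger(__name__)

# Assumed values for illustration; neither appears in the PR.
GUARDRAIL_TIMEOUT_S = 5.0
BLOCKED_TEXT = "[chunk removed by guardrail]"


def guardrail(check, timeout=GUARDRAIL_TIMEOUT_S):
    """Build a decorator that screens each chunk's text with a blocking `check`.

    `check(text) -> str` is a hypothetical blocking callable that returns the
    guardrail verdict text, e.g. a thin wrapper around requests.post(...).
    """
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            result = await func(*args, **kwargs)
            for chunk in result:
                text = getattr(chunk, "text", None)
                if not isinstance(text, str):
                    continue  # nothing to screen on this item
                try:
                    # Run the blocking call in a worker thread so the event
                    # loop stays responsive, and bound how long we wait for it.
                    answer = await asyncio.wait_for(
                        asyncio.to_thread(check, text), timeout
                    )
                except Exception as exc:
                    # Fail open (assumption): keep the chunk, log the failure.
                    logger.warning("guardrail check failed: %r", exc)
                    continue
                if isinstance(answer, str) and answer.startswith("[UNSAFE]"):
                    chunk.text = BLOCKED_TEXT
            return result
        return wrapper
    return decorator
```

Note that `asyncio.wait_for` only stops the coroutine from waiting; the worker thread keeps running until the blocking call returns, so the underlying HTTP call should still carry its own `timeout`. Applied to the PR, the decorator would be used as `@guardrail(some_blocking_check)` on `__call__`, where `some_blocking_check` is a placeholder name.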


                if answer.startswith("[UNSAFE]"):
                    r.text = "부적절한 텍스트가 포함되어 있으므로 해당 청크를 제거합니다."

        return result

    return wrapper


def convert_to_pdf(file_path: str) -> str | None:
"""
@@ -179,6 +214,7 @@ def _get_pdf_path(file_path: str) -> str:
return pdf_path



def install_packages(packages):
for package in packages:
try:
@@ -1432,6 +1468,7 @@ def get_level_name(level_num: int) -> str:
# Apply the root logger level
logging.getLogger().setLevel(level)

@guardrail
async def __call__(self, request: Request, file_path: str, **kwargs: dict):
self.setup_logging(kwargs.get('log_level', 4))
