
reorg: split parser layer to 2-layer: accessor and parser, so that we can reuse more code#1428

Open
MaojiaSheng wants to merge 17 commits into volcengine:main from MaojiaSheng:larkdoc

Conversation

@MaojiaSheng
Collaborator

This commit completes the two-layer (accessor + parser) architecture implementation:

Two-layer architecture now fully functional:

  • Phase 1: AccessorRegistry → DataAccessor → LocalResource
  • Phase 2: ParserRegistry → DataParser → ParseResult

Additional:

  1. ZipParser refactoring:

    • Rewrite ZipParser to extract zip archives and delegate to DirectoryParser
    • Remove _process_zip_file() special branch from UnifiedResourceProcessor
    • Support nested zip files via DirectoryParser recursion
    • ZipParser now follows the same pattern as other parsers
  2. Upload filename preservation:

    • temp_upload now saves .ov_upload.meta file with original_filename
    • resolve_uploaded_temp_file_id returns (path, original_filename) tuple
    • add_resource passes original_filename as source_name to parsers
    • PDFParser and MarkdownParser both support source_name/resource_name
    • Cleanup also removes .ov_upload.meta files

All sources (local files, URLs, Git repos, zip files, uploaded files) now
go through the same unified pipeline with proper filename/directory preservation.
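The upload-metadata flow described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the PR's actual implementation: the helper names save_upload/resolve_upload and the bodies are assumptions; only the .ov_upload.meta sidecar convention and the (path, original_filename) return shape come from the PR description.

```python
import json
import time
from pathlib import Path
from typing import Optional, Tuple

META_SUFFIX = ".ov_upload.meta"  # sidecar file carrying the original filename

def save_upload(temp_dir: Path, temp_filename: str, data: bytes,
                original_filename: Optional[str]) -> Path:
    """Write uploaded bytes plus a sidecar meta file with the original name."""
    temp_path = temp_dir / temp_filename
    temp_path.write_bytes(data)
    if original_filename:
        meta = {"original_filename": original_filename, "upload_time": time.time()}
        (temp_dir / f"{temp_filename}{META_SUFFIX}").write_text(json.dumps(meta))
    return temp_path

def resolve_upload(temp_path: Path) -> Tuple[Path, Optional[str]]:
    """Return (path, original_filename); the name is None if no meta exists."""
    meta_path = temp_path.with_name(temp_path.name + META_SUFFIX)
    if meta_path.exists():
        return temp_path, json.loads(meta_path.read_text()).get("original_filename")
    return temp_path, None
```

Cleanup would then delete both the payload and its .ov_upload.meta sidecar together, as the description notes.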

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ MaojiaSheng
❌ openviking


openviking seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Blocking I/O in Async Functions

Zip file extraction using zipfile (blocking I/O) is performed directly in async functions, which can starve the event loop. This affects _github_zip_download, _gitlab_zip_download, and _extract_zip methods.

    # Safe extraction with Zip Slip validation
    target = Path(extract_dir).resolve()
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            mode = info.external_attr >> 16
            if info.is_dir() or stat.S_ISDIR(mode):
                continue
            if stat.S_ISLNK(mode):
                logger.warning(
                    f"[GitAccessor] Skipping symlink entry in GitHub ZIP: {info.filename}"
                )
                continue
            raw = info.filename.replace("\\", "/")
            raw_parts = [p for p in raw.split("/") if p]
            if ".." in raw_parts:
                raise ValueError(f"Zip Slip detected in GitHub archive: {info.filename!r}")
            if PurePosixPath(raw).is_absolute():
                raise ValueError(f"Zip Slip detected in GitHub archive: {info.filename!r}")
            extracted = Path(zf.extract(info, extract_dir)).resolve()
            if not extracted.is_relative_to(target):
                extracted.unlink(missing_ok=True)
                raise ValueError(f"Zip Slip detected in GitHub archive: {info.filename!r}")

    # Remove downloaded archive to free disk space.
    try:
        os.unlink(zip_path)
    except OSError:
        pass

    # GitHub ZIPs have a single top-level directory
    top_level = [d for d in Path(extract_dir).iterdir() if d.is_dir()]
    content_dir = top_level[0] if len(top_level) == 1 else Path(extract_dir)

    logger.info(f"[GitAccessor] GitHub ZIP extracted to {content_dir} ({repo_name})")
    return content_dir, repo_name

async def _gitlab_zip_download(
    self,
    repo_url: str,
    branch: Optional[str],
    target_dir: str,
) -> Tuple[Path, str]:
    """Download a GitLab repo as a ZIP archive and extract it."""
    repo_name = self._get_repo_name(repo_url)

    # Build archive URL from owner/repo path components.
    # GitLab archive URL format: https://gitlab.com/{owner}/{repo}/-/archive/{ref}/{repo}-{ref}.zip
    parsed = urlparse(repo_url)
    path_parts = [p for p in parsed.path.split("/") if p]
    owner = path_parts[0]
    repo_raw = path_parts[1]
    # Strip .git suffix for the archive URL
    repo_slug = repo_raw[:-4] if repo_raw.endswith(".git") else repo_raw

    ref = branch or "HEAD"
    # GitLab uses the format: /{owner}/{repo}/-/archive/{ref}/{repo}-{ref}.zip
    zip_url = f"{parsed.scheme}://{parsed.netloc}/{owner}/{repo_slug}/-/archive/{ref}/{repo_slug}-{ref}.zip"

    logger.info(f"[GitAccessor] Downloading GitLab ZIP: {zip_url}")

    zip_path = os.path.join(target_dir, "_archive.zip")
    extract_dir = os.path.join(target_dir, "_extracted")
    os.makedirs(extract_dir, exist_ok=True)

    # Download (blocking HTTP; run in thread pool)
    def _download() -> None:
        headers = {"User-Agent": "OpenViking"}

        req = urllib.request.Request(zip_url, headers=headers)
        with urllib.request.urlopen(req, timeout=1800) as resp, open(zip_path, "wb") as f:
            shutil.copyfileobj(resp, f)

    try:
        await asyncio.to_thread(_download)
    except Exception as exc:
        raise RuntimeError(f"Failed to download GitLab ZIP {zip_url}: {exc}")

    # Safe extraction with Zip Slip validation
    target = Path(extract_dir).resolve()
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            mode = info.external_attr >> 16
            if info.is_dir() or stat.S_ISDIR(mode):
                continue
            if stat.S_ISLNK(mode):
                logger.warning(
                    f"[GitAccessor] Skipping symlink entry in GitLab ZIP: {info.filename}"
                )
                continue
            raw = info.filename.replace("\\", "/")
            raw_parts = [p for p in raw.split("/") if p]
            if ".." in raw_parts:
                raise ValueError(f"Zip Slip detected in GitLab archive: {info.filename!r}")
            if PurePosixPath(raw).is_absolute():
                raise ValueError(f"Zip Slip detected in GitLab archive: {info.filename!r}")
            extracted = Path(zf.extract(info, extract_dir)).resolve()
            if not extracted.is_relative_to(target):
                extracted.unlink(missing_ok=True)
                raise ValueError(f"Zip Slip detected in GitLab archive: {info.filename!r}")

    # Remove downloaded archive to free disk space.
    try:
        os.unlink(zip_path)
    except OSError:
        pass

    # GitLab ZIPs have a single top-level directory: {repo}-{ref}/
    top_level = [d for d in Path(extract_dir).iterdir() if d.is_dir()]
    content_dir = top_level[0] if len(top_level) == 1 else Path(extract_dir)

    logger.info(f"[GitAccessor] GitLab ZIP extracted to {content_dir} ({repo_name})")
    return content_dir, repo_name

async def _extract_zip(self, zip_path: str, target_dir: str) -> str:
    """Extract a local zip file into target_dir."""
    if zip_path.startswith(("http://", "https://")):
        raise NotImplementedError("Zip URL download not yet implemented in GitAccessor")

    path = Path(zip_path)
    name = path.stem

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        target = Path(target_dir).resolve()
        for info in zip_ref.infolist():
            mode = info.external_attr >> 16
            # Skip directory entries
            if info.is_dir() or stat.S_ISDIR(mode):
                continue
            # Skip symlink entries
            if stat.S_ISLNK(mode):
                logger.warning(f"[GitAccessor] Skipping symlink entry in zip: {info.filename}")
                continue
            # Reject entries with suspicious raw path components
            raw = info.filename.replace("\\", "/")
            raw_parts = [p for p in raw.split("/") if p]
            if ".." in raw_parts:
                raise ValueError(f"Zip Slip detected: entry {info.filename!r} contains '..'")
            if PurePosixPath(raw).is_absolute() or (len(raw) >= 2 and raw[1] == ":"):
                raise ValueError(
                    f"Zip Slip detected: entry {info.filename!r} is an absolute path"
                )
            # Normalize and verify
            arcname = info.filename.replace("/", os.sep)
            if os.path.altsep:
                arcname = arcname.replace(os.path.altsep, os.sep)
            arcname = os.path.splitdrive(arcname)[1]
            arcname = os.sep.join(p for p in arcname.split(os.sep) if p not in ("", ".", ".."))
            if not arcname:
                continue
            member_path = (Path(target_dir) / arcname).resolve()
            if not member_path.is_relative_to(target):
                raise ValueError(
                    f"Zip Slip detected: entry {info.filename!r} escapes target directory"
                )
            # Extract and verify
            extracted = Path(zip_ref.extract(info, target_dir)).resolve()
            if not extracted.is_relative_to(target):
                # Best-effort cleanup
                try:
                    extracted.unlink(missing_ok=True)
                except OSError as cleanup_err:
                    logger.warning(
                        f"[GitAccessor] Failed to clean up escaped file {extracted}: {cleanup_err}"
                    )
                raise ValueError(
                    f"Zip Slip detected: entry {info.filename!r} escapes target directory"
                )

    return name
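One common remedy for the blocking-extraction concern above is to keep the synchronous extraction loop intact and offload the whole call to a worker thread with asyncio.to_thread. This is a sketch, not the PR's code; extractall below stands in for the validated per-entry loop shown in the excerpt.

```python
import asyncio
import zipfile
from pathlib import Path

def _extract_sync(zip_path: str, target_dir: str) -> Path:
    """Blocking extraction; runs in a worker thread so the event loop stays free.
    The Zip Slip validation from the snippet above would live here too."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(target_dir)  # placeholder for the validated per-entry loop
    return Path(target_dir)

async def extract_zip(zip_path: str, target_dir: str) -> Path:
    # asyncio.to_thread moves the blocking work off the event loop thread,
    # the same pattern the excerpt already uses for the _download() step.
    return await asyncio.to_thread(_extract_sync, zip_path, target_dir)
```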
Blocking I/O in Async Functions

Zip file extraction and temp directory creation (blocking I/O) are performed directly in the async parse method, which can starve the event loop.

async def parse(self, source: Union[str, Path], instruction: str = "", **kwargs) -> ParseResult:
    """Parse a ZIP file by extracting it and delegating to DirectoryParser.

    Args:
        source: Path to the .zip file
        instruction: Processing instruction (forwarded to DirectoryParser)
        **kwargs: Extra options forwarded to DirectoryParser

    Returns:
        ParseResult from DirectoryParser, with temp_dir_path preserved
        for TreeBuilder to use later.
    """

    path = Path(source)
    if not path.exists():
        raise FileNotFoundError(f"ZIP file not found: {path}")

    if not zipfile.is_zipfile(path):
        raise ValueError(f"Not a valid ZIP file: {path}")

    # Extract ZIP to temporary directory
    temp_dir = Path(tempfile.mkdtemp(prefix="ov_zip_"))
    try:
        with zipfile.ZipFile(path, "r") as zipf:
            safe_extract_zip(zipf, temp_dir)

        # Check if extracted content has a single root directory
        extracted_entries = [p for p in temp_dir.iterdir() if p.name not in {".", ".."}]

        # Prepare kwargs for DirectoryParser
        dir_kwargs = dict(kwargs)
        dir_kwargs["instruction"] = instruction

        # Use DirectoryParser to process the extracted content
        from openviking.parse.parsers.directory import DirectoryParser

        parser = DirectoryParser()

        if len(extracted_entries) == 1 and extracted_entries[0].is_dir():
            # Single root directory - parse that directly
            dir_kwargs.pop("source_name", None)
            result = await parser.parse(str(extracted_entries[0]), **dir_kwargs)
        else:
            # Multiple entries at root - parse the temp dir itself
            # Set source_name from zip filename if not provided
            if "source_name" not in dir_kwargs or not dir_kwargs["source_name"]:
                dir_kwargs["source_name"] = path.stem
            result = await parser.parse(str(temp_dir), **dir_kwargs)

        # Ensure the temporary directory is preserved for TreeBuilder
        if not result.temp_dir_path:
            result.temp_dir_path = str(temp_dir)
        else:
            # If DirectoryParser created its own temp, clean up our extraction dir
            try:
                shutil.rmtree(temp_dir, ignore_errors=True)
            except Exception:
                pass

        result.source_format = "zip"
        result.parser_name = "ZipParser"
        return result

    except Exception:
        # Clean up on error
        try:
            shutil.rmtree(temp_dir, ignore_errors=True)
        except Exception:
            pass
        raise

async def parse_content(
    self, content: str, source_path: Optional[str] = None, instruction: str = "", **kwargs
) -> ParseResult:
    """Parse content - not applicable for ZIP files (needs a file path)."""
    raise NotImplementedError(
        "ZipParser does not support parse_content. Please provide a file path to parse()."
    )
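For the parse path above, the blocking setup (tempfile.mkdtemp plus extraction) could similarly be batched into one synchronous helper and dispatched with a single to_thread hop. A sketch under assumptions: _prepare_extraction is an illustrative name, and extractall stands in for the project's safe_extract_zip helper.

```python
import asyncio
import tempfile
import zipfile
from pathlib import Path

def _prepare_extraction(zip_path: Path) -> Path:
    """Blocking work batched into one helper: mkdtemp plus extraction."""
    temp_dir = Path(tempfile.mkdtemp(prefix="ov_zip_"))
    with zipfile.ZipFile(zip_path, "r") as zipf:
        # The real code would call safe_extract_zip(zipf, temp_dir) here.
        zipf.extractall(temp_dir)
    return temp_dir

async def extract_for_parse(zip_path: Path) -> Path:
    # One to_thread hop keeps the async parse() responsive during extraction.
    return await asyncio.to_thread(_prepare_extraction, zip_path)
```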

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Ensure proper file context managers

The synchronous file operations (open(), write(), json.dump()) are missing context
managers for proper resource cleanup in the improved version. When using
asyncio.to_thread(), ensure files are properly closed by wrapping in context
managers inside the thread function.

openviking/server/routers/resources.py [162-173]

-with open(temp_file_path, "wb") as f:
-    f.write(await file.read())
+# Write file content in thread pool with proper context manager
+file_content = await file.read()
+def _write_file():
+    with open(temp_file_path, "wb") as f:
+        f.write(file_content)
+await asyncio.to_thread(_write_file)
 
-# Save metadata with original filename
+# Save metadata with original filename with proper context manager
 if file.filename:
     meta_path = temp_dir / f"{temp_filename}.ov_upload.meta"
     meta = {
         "original_filename": file.filename,
         "upload_time": time.time(),
     }
-    with open(meta_path, "w") as f:
-        json.dump(meta, f)
+    def _write_meta():
+        with open(meta_path, "w") as f:
+            json.dump(meta, f)
+    await asyncio.to_thread(_write_meta)
Suggestion importance (1-10): 6


Why: This suggestion correctly fixes the file resource management issue by using proper context managers inside thread-pooled functions, addressing the bug in the previous suggestion while retaining the async I/O offload benefit.

Impact: Low

@MaojiaSheng changed the title from "reort: split parser layer to 2-layer: accessor and parser, so that we can reuse more code" to "reorg: split parser layer to 2-layer: accessor and parser, so that we can reuse more code" on Apr 14, 2026
Collaborator

@qin-ctx qin-ctx left a comment


Thanks for pushing the accessor/parser split forward. The overall direction makes sense, but I found three regressions on the public add_resource() URL-ingestion path: content-type-based routing for extensionless downloads was removed, GitHub/GitLab blob URLs no longer normalize to raw-file URLs, and remote filename preservation now falls back to temp filenames. Requesting changes so we do not ship those behavior regressions.
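For reference, the blob-to-raw normalization the review refers to is typically along these lines. This is a sketch: the function name is illustrative and not the project's actual helper, though the github.com blob-URL to raw.githubusercontent.com mapping itself is the standard one.

```python
from urllib.parse import urlparse

def normalize_github_blob_url(url: str) -> str:
    """Rewrite a GitHub blob URL to the corresponding raw-file URL.

    https://github.com/{owner}/{repo}/blob/{ref}/{path}
      -> https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}
    Returns the URL unchanged if it does not match the blob pattern.
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc == "github.com" and len(parts) >= 5 and parts[2] == "blob":
        owner, repo = parts[0], parts[1]
        ref = parts[3]
        file_path = "/".join(parts[4:])
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}"
    return url
```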

openviking added 5 commits April 14, 2026 14:37
- Fix config lookup: github_domains/gitlab_domains are in CodeConfig now
- Add debug logging instead of swallowing exceptions
- Add comment about config location

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- HTTPAccessor now stores original_filename in LocalResource.meta
  - Priority: Content-Disposition filename > URL path filename
- UnifiedResourceProcessor now prefers original_filename from meta
  - Uses original_filename for resource_name and source_name
  - Falls back to temp path stem only if no original filename available

🤖 Generated with [Claude Code](https://claude.com/claude-code)
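The filename priority this commit describes (Content-Disposition first, URL path second) can be sketched as below. Illustrative only, not the PR's actual HTTPAccessor code; the regex handles the plain filename="..." parameter and omits RFC 6266 filename* encoding.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlparse

def pick_original_filename(url: str, content_disposition: Optional[str]) -> Optional[str]:
    """Prefer the Content-Disposition filename, fall back to the URL path basename."""
    if content_disposition:
        # Plain filename="..." parameter (RFC 6266 filename* handling omitted here).
        m = re.search(r'filename="?([^";]+)"?', content_disposition)
        if m:
            return m.group(1)
    basename = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    return basename or None
```

Only when both sources are empty would the processor fall back to the temp path stem, as the commit message describes.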
Collaborator

@qin-ctx qin-ctx left a comment


In this review round I focused on whether the accessor/parser refactor changes the original entry-point behavior. The layering direction itself is right, and the URL-download regressions raised in the previous review have mostly been restored, but four blocking behavior/compatibility issues remain: a non-existent local path is treated as raw content and parsed anyway; the changed return value of resolve_uploaded_temp_file_id() breaks the pack/import path; local .zip files are now intercepted early by GitAccessor, changing the naming behavior for single-root-directory ZIPs; and the declared compatibility layer for HTMLParser/URLTypeDetector does not match the actual API. Requesting changes for now; please fix these regressions before merging.

Collaborator

@qin-ctx qin-ctx left a comment


On re-review, 3 of the 4 blocking issues from the previous round are fixed: a missing local path raises FileNotFoundError again, the tuple argument passing on the pack/import path is corrected, and local ZIPs go through ZipParser again. Two blocking issues remain: 1) URLTypeDetector swallows PermissionDeniedError, regressing the SSRF validation semantics; 2) HTMLParser's backward compatibility is still incomplete: the old _fetch_html and parse(url) behaviors are both broken.

    if url_type != URLType.UNKNOWN:
        return url_type, meta

except Exception as e:
Collaborator


[Bug] (blocking)
This also swallows the PermissionDeniedError raised by request_validator, so an SSRF target that should be rejected outright falls back to the default WEBPAGE. Running tests/server/test_api_local_input_security.py::test_url_detector_request_validator_blocks_loopback_head locally, a loopback URL no longer raises PermissionDeniedError but silently returns (URLType.WEBPAGE, meta). This downgrades the security check from a hard failure to continuing down the fetch path, which is a behavior regression. At minimum, PermissionDeniedError should be re-raised separately here, as the old implementation on main does.
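The requested fix amounts to re-raising the security exception before the broad handler. A sketch: PermissionDeniedError and the detector shape below are stand-ins for the project's actual classes.

```python
class PermissionDeniedError(Exception):
    """Stand-in for the project's SSRF-guard exception."""

def detect_url_type(url: str, validate, fallback: str = "WEBPAGE") -> str:
    try:
        return validate(url)
    except PermissionDeniedError:
        # A blocked target must stay a hard failure, never degrade to WEBPAGE.
        raise
    except Exception:
        # Other probe failures may fall back to the default type.
        return fallback
```

The ordering matters: the narrow except must precede the broad one, or Python never reaches it.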

        parse_time=time.time() - start_time,
        warnings=[f"Failed to parse code repository: {e}"],
    )
return await self._parse_local_file(path, start_time, **kwargs)
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking)
The compatibility layer here is still incomplete. HTMLParser(timeout=...) no longer raises TypeError, but the old public API is still broken: HTMLParser().parse("http://example.com") no longer fetches the URL; it is treated as a local path and returns File not found: http:/example.com. HTMLParser()._fetch_html(...) also no longer exists, so tests/server/test_api_local_input_security.py::test_html_parser_request_validator_blocks_loopback_fetch still fails with AttributeError. The re-exported URLType is also missing the old CODE_REPOSITORY member. If the goal is to keep backward compatibility, the old URL parse / _fetch_html paths need to be restored; otherwise this should be treated explicitly as a breaking change, with the existing call sites and tests cleaned up to match.
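If the team opts to keep backward compatibility, a thin shim along these lines could restore the old entry points. This is a sketch under assumed names: HTMLParser, _fetch_html, and the _fetch_via_accessor hook are stand-ins; only the URL-vs-path dispatch behavior comes from the review.

```python
from pathlib import Path

class HTMLParser:
    """Compatibility shim: old callers pass URLs to parse() and call _fetch_html()."""

    def _fetch_html(self, url: str) -> str:
        # Old private API some tests still exercise; delegate to the new fetch path.
        return self._fetch_via_accessor(url)

    def parse(self, source: str) -> str:
        if source.startswith(("http://", "https://")):
            # Old behavior: URL strings are fetched, not treated as local paths.
            return self._fetch_html(source)
        path = Path(source)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {source}")
        return path.read_text()

    def _fetch_via_accessor(self, url: str) -> str:
        # Assumed hook into the new accessor layer; wired up in the real code.
        raise NotImplementedError
```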


Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

3 participants