
reorg: split parser layer to 2-layer: accessor and parser, so that we can reuse more code#1428

Open
MaojiaSheng wants to merge 17 commits into volcengine:main from MaojiaSheng:larkdoc

Conversation

@MaojiaSheng
Collaborator

This commit completes the two-layer (accessor + parser) architecture implementation:

Two-layer architecture now fully functional:

  • Phase 1: AccessorRegistry → DataAccessor → LocalResource
  • Phase 2: ParserRegistry → DataParser → ParseResult

Additional:

  1. ZipParser refactoring:

    • Rewrite ZipParser to extract zip archives and delegate to DirectoryParser
    • Remove _process_zip_file() special branch from UnifiedResourceProcessor
    • Support nested zip files via DirectoryParser recursion
    • ZipParser now follows the same pattern as other parsers
  2. Upload filename preservation:

    • temp_upload now saves .ov_upload.meta file with original_filename
    • resolve_uploaded_temp_file_id returns (path, original_filename) tuple
    • add_resource passes original_filename as source_name to parsers
    • PDFParser and MarkdownParser both support source_name/resource_name
    • Cleanup also removes .ov_upload.meta files

All sources (local files, URLs, Git repos, zip files, uploaded files) now
go through the same unified pipeline with proper filename/directory preservation.
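The upload-metadata flow described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the PR's actual implementation: the helper names save_upload/resolve_upload and the bodies are assumptions; only the .ov_upload.meta sidecar convention and the (path, original_filename) return shape come from the PR description.

```python
import json
import time
from pathlib import Path
from typing import Optional, Tuple

META_SUFFIX = ".ov_upload.meta"  # sidecar file carrying the original filename

def save_upload(temp_dir: Path, temp_filename: str, data: bytes,
                original_filename: Optional[str]) -> Path:
    """Write uploaded bytes plus a sidecar meta file with the original name."""
    temp_path = temp_dir / temp_filename
    temp_path.write_bytes(data)
    if original_filename:
        meta = {"original_filename": original_filename, "upload_time": time.time()}
        (temp_dir / f"{temp_filename}{META_SUFFIX}").write_text(json.dumps(meta))
    return temp_path

def resolve_upload(temp_path: Path) -> Tuple[Path, Optional[str]]:
    """Return (path, original_filename); the name is None if no meta exists."""
    meta_path = temp_path.with_name(temp_path.name + META_SUFFIX)
    if meta_path.exists():
        return temp_path, json.loads(meta_path.read_text()).get("original_filename")
    return temp_path, None
```

Cleanup would then delete both the payload and its .ov_upload.meta sidecar together, as the description notes.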

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ MaojiaSheng
❌ openviking


openviking seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Blocking I/O in Async Functions

Zip file extraction using zipfile (blocking I/O) is performed directly in async functions, which can starve the event loop. This affects _github_zip_download, _gitlab_zip_download, and _extract_zip methods.

    # Safe extraction with Zip Slip validation
    target = Path(extract_dir).resolve()
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            mode = info.external_attr >> 16
            if info.is_dir() or stat.S_ISDIR(mode):
                continue
            if stat.S_ISLNK(mode):
                logger.warning(
                    f"[GitAccessor] Skipping symlink entry in GitHub ZIP: {info.filename}"
                )
                continue
            raw = info.filename.replace("\\", "/")
            raw_parts = [p for p in raw.split("/") if p]
            if ".." in raw_parts:
                raise ValueError(f"Zip Slip detected in GitHub archive: {info.filename!r}")
            if PurePosixPath(raw).is_absolute():
                raise ValueError(f"Zip Slip detected in GitHub archive: {info.filename!r}")
            extracted = Path(zf.extract(info, extract_dir)).resolve()
            if not extracted.is_relative_to(target):
                extracted.unlink(missing_ok=True)
                raise ValueError(f"Zip Slip detected in GitHub archive: {info.filename!r}")

    # Remove downloaded archive to free disk space.
    try:
        os.unlink(zip_path)
    except OSError:
        pass

    # GitHub ZIPs have a single top-level directory
    top_level = [d for d in Path(extract_dir).iterdir() if d.is_dir()]
    content_dir = top_level[0] if len(top_level) == 1 else Path(extract_dir)

    logger.info(f"[GitAccessor] GitHub ZIP extracted to {content_dir} ({repo_name})")
    return content_dir, repo_name

async def _gitlab_zip_download(
    self,
    repo_url: str,
    branch: Optional[str],
    target_dir: str,
) -> Tuple[Path, str]:
    """Download a GitLab repo as a ZIP archive and extract it."""
    repo_name = self._get_repo_name(repo_url)

    # Build archive URL from owner/repo path components.
    # GitLab archive URL format: https://gitlab.com/{owner}/{repo}/-/archive/{ref}/{repo}-{ref}.zip
    parsed = urlparse(repo_url)
    path_parts = [p for p in parsed.path.split("/") if p]
    owner = path_parts[0]
    repo_raw = path_parts[1]
    # Strip .git suffix for the archive URL
    repo_slug = repo_raw[:-4] if repo_raw.endswith(".git") else repo_raw

    ref = branch or "HEAD"
    # GitLab uses the format: /{owner}/{repo}/-/archive/{ref}/{repo}-{ref}.zip
    zip_url = f"{parsed.scheme}://{parsed.netloc}/{owner}/{repo_slug}/-/archive/{ref}/{repo_slug}-{ref}.zip"

    logger.info(f"[GitAccessor] Downloading GitLab ZIP: {zip_url}")

    zip_path = os.path.join(target_dir, "_archive.zip")
    extract_dir = os.path.join(target_dir, "_extracted")
    os.makedirs(extract_dir, exist_ok=True)

    # Download (blocking HTTP; run in thread pool)
    def _download() -> None:
        headers = {"User-Agent": "OpenViking"}

        req = urllib.request.Request(zip_url, headers=headers)
        with urllib.request.urlopen(req, timeout=1800) as resp, open(zip_path, "wb") as f:
            shutil.copyfileobj(resp, f)

    try:
        await asyncio.to_thread(_download)
    except Exception as exc:
        raise RuntimeError(f"Failed to download GitLab ZIP {zip_url}: {exc}")

    # Safe extraction with Zip Slip validation
    target = Path(extract_dir).resolve()
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            mode = info.external_attr >> 16
            if info.is_dir() or stat.S_ISDIR(mode):
                continue
            if stat.S_ISLNK(mode):
                logger.warning(
                    f"[GitAccessor] Skipping symlink entry in GitLab ZIP: {info.filename}"
                )
                continue
            raw = info.filename.replace("\\", "/")
            raw_parts = [p for p in raw.split("/") if p]
            if ".." in raw_parts:
                raise ValueError(f"Zip Slip detected in GitLab archive: {info.filename!r}")
            if PurePosixPath(raw).is_absolute():
                raise ValueError(f"Zip Slip detected in GitLab archive: {info.filename!r}")
            extracted = Path(zf.extract(info, extract_dir)).resolve()
            if not extracted.is_relative_to(target):
                extracted.unlink(missing_ok=True)
                raise ValueError(f"Zip Slip detected in GitLab archive: {info.filename!r}")

    # Remove downloaded archive to free disk space.
    try:
        os.unlink(zip_path)
    except OSError:
        pass

    # GitLab ZIPs have a single top-level directory: {repo}-{ref}/
    top_level = [d for d in Path(extract_dir).iterdir() if d.is_dir()]
    content_dir = top_level[0] if len(top_level) == 1 else Path(extract_dir)

    logger.info(f"[GitAccessor] GitLab ZIP extracted to {content_dir} ({repo_name})")
    return content_dir, repo_name

async def _extract_zip(self, zip_path: str, target_dir: str) -> str:
    """Extract a local zip file into target_dir."""
    if zip_path.startswith(("http://", "https://")):
        raise NotImplementedError("Zip URL download not yet implemented in GitAccessor")

    path = Path(zip_path)
    name = path.stem

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        target = Path(target_dir).resolve()
        for info in zip_ref.infolist():
            mode = info.external_attr >> 16
            # Skip directory entries
            if info.is_dir() or stat.S_ISDIR(mode):
                continue
            # Skip symlink entries
            if stat.S_ISLNK(mode):
                logger.warning(f"[GitAccessor] Skipping symlink entry in zip: {info.filename}")
                continue
            # Reject entries with suspicious raw path components
            raw = info.filename.replace("\\", "/")
            raw_parts = [p for p in raw.split("/") if p]
            if ".." in raw_parts:
                raise ValueError(f"Zip Slip detected: entry {info.filename!r} contains '..'")
            if PurePosixPath(raw).is_absolute() or (len(raw) >= 2 and raw[1] == ":"):
                raise ValueError(
                    f"Zip Slip detected: entry {info.filename!r} is an absolute path"
                )
            # Normalize and verify
            arcname = info.filename.replace("/", os.sep)
            if os.path.altsep:
                arcname = arcname.replace(os.path.altsep, os.sep)
            arcname = os.path.splitdrive(arcname)[1]
            arcname = os.sep.join(p for p in arcname.split(os.sep) if p not in ("", ".", ".."))
            if not arcname:
                continue
            member_path = (Path(target_dir) / arcname).resolve()
            if not member_path.is_relative_to(target):
                raise ValueError(
                    f"Zip Slip detected: entry {info.filename!r} escapes target directory"
                )
            # Extract and verify
            extracted = Path(zip_ref.extract(info, target_dir)).resolve()
            if not extracted.is_relative_to(target):
                # Best-effort cleanup
                try:
                    extracted.unlink(missing_ok=True)
                except OSError as cleanup_err:
                    logger.warning(
                        f"[GitAccessor] Failed to clean up escaped file {extracted}: {cleanup_err}"
                    )
                raise ValueError(
                    f"Zip Slip detected: entry {info.filename!r} escapes target directory"
                )

    return name
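One common remedy for the blocking-extraction concern above is to keep the synchronous extraction loop intact and offload the whole call to a worker thread with asyncio.to_thread. This is a sketch, not the PR's code; extractall below stands in for the validated per-entry loop shown in the excerpt.

```python
import asyncio
import zipfile
from pathlib import Path

def _extract_sync(zip_path: str, target_dir: str) -> Path:
    """Blocking extraction; runs in a worker thread so the event loop stays free.
    The Zip Slip validation from the snippet above would live here too."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(target_dir)  # placeholder for the validated per-entry loop
    return Path(target_dir)

async def extract_zip(zip_path: str, target_dir: str) -> Path:
    # asyncio.to_thread moves the blocking work off the event loop thread,
    # the same pattern the excerpt already uses for the _download() step.
    return await asyncio.to_thread(_extract_sync, zip_path, target_dir)
```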
Blocking I/O in Async Functions

Zip file extraction and temp directory creation (blocking I/O) are performed directly in the async parse method, which can starve the event loop.

async def parse(self, source: Union[str, Path], instruction: str = "", **kwargs) -> ParseResult:
    """Parse a ZIP file by extracting it and delegating to DirectoryParser.

    Args:
        source: Path to the .zip file
        instruction: Processing instruction (forwarded to DirectoryParser)
        **kwargs: Extra options forwarded to DirectoryParser

    Returns:
        ParseResult from DirectoryParser, with temp_dir_path preserved
        for TreeBuilder to use later.
    """

    path = Path(source)
    if not path.exists():
        raise FileNotFoundError(f"ZIP file not found: {path}")

    if not zipfile.is_zipfile(path):
        raise ValueError(f"Not a valid ZIP file: {path}")

    # Extract ZIP to temporary directory
    temp_dir = Path(tempfile.mkdtemp(prefix="ov_zip_"))
    try:
        with zipfile.ZipFile(path, "r") as zipf:
            safe_extract_zip(zipf, temp_dir)

        # Check if extracted content has a single root directory
        extracted_entries = [p for p in temp_dir.iterdir() if p.name not in {".", ".."}]

        # Prepare kwargs for DirectoryParser
        dir_kwargs = dict(kwargs)
        dir_kwargs["instruction"] = instruction

        # Use DirectoryParser to process the extracted content
        from openviking.parse.parsers.directory import DirectoryParser

        parser = DirectoryParser()

        if len(extracted_entries) == 1 and extracted_entries[0].is_dir():
            # Single root directory - parse that directly
            dir_kwargs.pop("source_name", None)
            result = await parser.parse(str(extracted_entries[0]), **dir_kwargs)
        else:
            # Multiple entries at root - parse the temp dir itself
            # Set source_name from zip filename if not provided
            if "source_name" not in dir_kwargs or not dir_kwargs["source_name"]:
                dir_kwargs["source_name"] = path.stem
            result = await parser.parse(str(temp_dir), **dir_kwargs)

        # Ensure the temporary directory is preserved for TreeBuilder
        if not result.temp_dir_path:
            result.temp_dir_path = str(temp_dir)
        else:
            # If DirectoryParser created its own temp, clean up our extraction dir
            try:
                shutil.rmtree(temp_dir, ignore_errors=True)
            except Exception:
                pass

        result.source_format = "zip"
        result.parser_name = "ZipParser"
        return result

    except Exception:
        # Clean up on error
        try:
            shutil.rmtree(temp_dir, ignore_errors=True)
        except Exception:
            pass
        raise

async def parse_content(
    self, content: str, source_path: Optional[str] = None, instruction: str = "", **kwargs
) -> ParseResult:
    """Parse content - not applicable for ZIP files (needs a file path)."""
    raise NotImplementedError(
        "ZipParser does not support parse_content. Please provide a file path to parse()."
    )
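For the parse path above, the blocking setup (tempfile.mkdtemp plus extraction) could similarly be batched into one synchronous helper and dispatched with a single to_thread hop. A sketch under assumptions: _prepare_extraction is an illustrative name, and extractall stands in for the project's safe_extract_zip helper.

```python
import asyncio
import tempfile
import zipfile
from pathlib import Path

def _prepare_extraction(zip_path: Path) -> Path:
    """Blocking work batched into one helper: mkdtemp plus extraction."""
    temp_dir = Path(tempfile.mkdtemp(prefix="ov_zip_"))
    with zipfile.ZipFile(zip_path, "r") as zipf:
        # The real code would call safe_extract_zip(zipf, temp_dir) here.
        zipf.extractall(temp_dir)
    return temp_dir

async def extract_for_parse(zip_path: Path) -> Path:
    # One to_thread hop keeps the async parse() responsive during extraction.
    return await asyncio.to_thread(_prepare_extraction, zip_path)
```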

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Ensure proper file context managers

The synchronous file operations (open(), write(), json.dump()) are missing context
managers for proper resource cleanup in the improved version. When using
asyncio.to_thread(), ensure files are properly closed by wrapping in context
managers inside the thread function.

openviking/server/routers/resources.py [162-173]

-with open(temp_file_path, "wb") as f:
-    f.write(await file.read())
+# Write file content in thread pool with proper context manager
+file_content = await file.read()
+def _write_file():
+    with open(temp_file_path, "wb") as f:
+        f.write(file_content)
+await asyncio.to_thread(_write_file)
 
-# Save metadata with original filename
+# Save metadata with original filename with proper context manager
 if file.filename:
     meta_path = temp_dir / f"{temp_filename}.ov_upload.meta"
     meta = {
         "original_filename": file.filename,
         "upload_time": time.time(),
     }
-    with open(meta_path, "w") as f:
-        json.dump(meta, f)
+    def _write_meta():
+        with open(meta_path, "w") as f:
+            json.dump(meta, f)
+    await asyncio.to_thread(_write_meta)
Suggestion importance (1-10): 6


Why: This suggestion correctly fixes the file resource management issue by using proper context managers inside thread-pooled functions, addressing the bug in the previous suggestion while retaining the async I/O offload benefit.

Impact: Low

@MaojiaSheng changed the title from "reort: split parser layer to 2-layer: accessor and parser, so that we can reuse more code" to "reorg: split parser layer to 2-layer: accessor and parser, so that we can reuse more code" on Apr 14, 2026
Collaborator

@qin-ctx qin-ctx left a comment


Thanks for pushing the accessor/parser split forward. The overall direction makes sense, but I found three regressions on the public add_resource() URL-ingestion path: content-type-based routing for extensionless downloads was removed, GitHub/GitLab blob URLs no longer normalize to raw-file URLs, and remote filename preservation now falls back to temp filenames. Requesting changes so we do not ship those behavior regressions.
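For reference, the blob-to-raw normalization the review refers to is typically along these lines. This is a sketch: the function name is illustrative and not the project's actual helper, though the github.com blob-URL to raw.githubusercontent.com mapping itself is the standard one.

```python
from urllib.parse import urlparse

def normalize_github_blob_url(url: str) -> str:
    """Rewrite a GitHub blob URL to the corresponding raw-file URL.

    https://github.com/{owner}/{repo}/blob/{ref}/{path}
      -> https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}
    Returns the URL unchanged if it does not match the blob pattern.
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc == "github.com" and len(parts) >= 5 and parts[2] == "blob":
        owner, repo = parts[0], parts[1]
        ref = parts[3]
        file_path = "/".join(parts[4:])
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{file_path}"
    return url
```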

openviking added 5 commits April 14, 2026 14:37
- Fix config lookup: github_domains/gitlab_domains are in CodeConfig now
- Add debug logging instead of swallowing exceptions
- Add comment about config location

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- HTTPAccessor now stores original_filename in LocalResource.meta
  - Priority: Content-Disposition filename > URL path filename
- UnifiedResourceProcessor now prefers original_filename from meta
  - Uses original_filename for resource_name and source_name
  - Falls back to temp path stem only if no original filename available

🤖 Generated with [Claude Code](https://claude.com/claude-code)
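The filename priority this commit describes (Content-Disposition first, URL path second) can be sketched as below. Illustrative only, not the PR's actual HTTPAccessor code; the regex handles the plain filename="..." parameter and omits RFC 6266 filename* encoding.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlparse

def pick_original_filename(url: str, content_disposition: Optional[str]) -> Optional[str]:
    """Prefer the Content-Disposition filename, fall back to the URL path basename."""
    if content_disposition:
        # Plain filename="..." parameter (RFC 6266 filename* handling omitted here).
        m = re.search(r'filename="?([^";]+)"?', content_disposition)
        if m:
            return m.group(1)
    basename = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    return basename or None
```

Only when both sources are empty would the processor fall back to the temp path stem, as the commit message describes.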
Collaborator

@qin-ctx qin-ctx left a comment


In this review round I focused on whether the accessor/parser refactor changes the original entry-point behavior. The layering direction itself is right, and the URL-download regressions raised in the previous review have mostly been restored, but four blocking behavior/compatibility issues remain: a non-existent local path is treated as raw content and parsed anyway; the changed return value of resolve_uploaded_temp_file_id() breaks the pack/import path; local .zip files are now intercepted early by GitAccessor, changing the naming behavior for single-root-directory ZIPs; and the declared compatibility layer for HTMLParser/URLTypeDetector does not match the actual API. Requesting changes for now; please fix these regressions before merging.

Collaborator

@qin-ctx qin-ctx left a comment


On re-review, 3 of the 4 blocking issues from the previous round are fixed: a missing local path raises FileNotFoundError again, the tuple argument passing on the pack/import path is corrected, and local ZIPs go through ZipParser again. Two blocking issues remain: 1) URLTypeDetector swallows PermissionDeniedError, regressing the SSRF validation semantics; 2) HTMLParser's backward compatibility is still incomplete: the old _fetch_html and parse(url) behaviors are both broken.

    if url_type != URLType.UNKNOWN:
        return url_type, meta

except Exception as e:
Collaborator


[Bug] (blocking)
This also swallows the PermissionDeniedError raised by request_validator, so an SSRF target that should be rejected outright falls back to the default WEBPAGE. Running tests/server/test_api_local_input_security.py::test_url_detector_request_validator_blocks_loopback_head locally, a loopback URL no longer raises PermissionDeniedError but silently returns (URLType.WEBPAGE, meta). This downgrades the security check from a hard failure to continuing down the fetch path, which is a behavior regression. At minimum, PermissionDeniedError should be re-raised separately here, as the old implementation on main does.
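The requested fix amounts to re-raising the security exception before the broad handler. A sketch: PermissionDeniedError and the detector shape below are stand-ins for the project's actual classes.

```python
class PermissionDeniedError(Exception):
    """Stand-in for the project's SSRF-guard exception."""

def detect_url_type(url: str, validate, fallback: str = "WEBPAGE") -> str:
    try:
        return validate(url)
    except PermissionDeniedError:
        # A blocked target must stay a hard failure, never degrade to WEBPAGE.
        raise
    except Exception:
        # Other probe failures may fall back to the default type.
        return fallback
```

The ordering matters: the narrow except must precede the broad one, or Python never reaches it.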

        parse_time=time.time() - start_time,
        warnings=[f"Failed to parse code repository: {e}"],
    )
return await self._parse_local_file(path, start_time, **kwargs)
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking)
The compatibility layer here is still incomplete. HTMLParser(timeout=...) no longer raises TypeError, but the old public API is still broken: HTMLParser().parse("http://example.com") no longer fetches the URL; it is treated as a local path and returns File not found: http:/example.com. HTMLParser()._fetch_html(...) also no longer exists, so tests/server/test_api_local_input_security.py::test_html_parser_request_validator_blocks_loopback_fetch still fails with AttributeError. The re-exported URLType is also missing the old CODE_REPOSITORY member. If the goal is to keep backward compatibility, the old URL parse / _fetch_html paths need to be restored; otherwise this should be treated explicitly as a breaking change, with the existing call sites and tests cleaned up to match.
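If the team opts to keep backward compatibility, a thin shim along these lines could restore the old entry points. This is a sketch under assumed names: HTMLParser, _fetch_html, and the _fetch_via_accessor hook are stand-ins; only the URL-vs-path dispatch behavior comes from the review.

```python
from pathlib import Path

class HTMLParser:
    """Compatibility shim: old callers pass URLs to parse() and call _fetch_html()."""

    def _fetch_html(self, url: str) -> str:
        # Old private API some tests still exercise; delegate to the new fetch path.
        return self._fetch_via_accessor(url)

    def parse(self, source: str) -> str:
        if source.startswith(("http://", "https://")):
            # Old behavior: URL strings are fetched, not treated as local paths.
            return self._fetch_html(source)
        path = Path(source)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {source}")
        return path.read_text()

    def _fetch_via_accessor(self, url: str) -> str:
        # Assumed hook into the new accessor layer; wired up in the real code.
        raise NotImplementedError
```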


Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

3 participants