reorg: split parser layer to 2-layer: accessor and parser, so that we can reuse more code #1428
MaojiaSheng wants to merge 17 commits into volcengine:main
Conversation
… can reuse more code
openviking seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
qin-ctx left a comment
Thanks for pushing the accessor/parser split forward. The overall direction makes sense, but I found three regressions on the public add_resource() URL-ingestion path: content-type-based routing for extensionless downloads was removed, GitHub/GitLab blob URLs no longer normalize to raw-file URLs, and remote filename preservation now falls back to temp filenames. Requesting changes so we do not ship those behavior regressions.
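One of the flagged regressions, blob-URL normalization, can be restored with a small URL rewrite step before download. A minimal sketch, assuming a hypothetical helper name and the standard github.com/gitlab hosting layouts (the project's actual domain lists live in its config):

```python
from urllib.parse import urlparse, urlunparse

def normalize_blob_url(url: str) -> str:
    """Rewrite a GitHub/GitLab web 'blob' URL to its raw-file URL.

    Hypothetical helper; function name and hard-coded domains are
    illustrative only.
    """
    parsed = urlparse(url)
    parts = parsed.path.split("/")
    if parsed.netloc == "github.com" and "blob" in parts:
        # github.com/owner/repo/blob/ref/path
        #   -> raw.githubusercontent.com/owner/repo/ref/path
        parts.remove("blob")
        return urlunparse(parsed._replace(
            netloc="raw.githubusercontent.com", path="/".join(parts)))
    if "gitlab" in parsed.netloc and "blob" in parts:
        # gitlab host: /owner/repo/-/blob/ref/path -> /owner/repo/-/raw/ref/path
        parts[parts.index("blob")] = "raw"
        return urlunparse(parsed._replace(path="/".join(parts)))
    return url  # not a recognized blob URL; pass through unchanged
```

Non-blob URLs pass through unchanged, so the accessor can apply this unconditionally before fetching.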
- Fix config lookup: github_domains/gitlab_domains are in CodeConfig now
- Add debug logging instead of swallowing exceptions
- Add comment about config location

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- HTTPAccessor now stores original_filename in LocalResource.meta
- Priority: Content-Disposition filename > URL path filename
- UnifiedResourceProcessor now prefers original_filename from meta
- Uses original_filename for resource_name and source_name
- Falls back to temp path stem only if no original filename available

🤖 Generated with [Claude Code](https://claude.com/claude-code)
qin-ctx left a comment
In this round of review I focused on whether the accessor/parser refactor changes the original entry-point behaviour. The layering direction itself is right, and the URL-download regressions flagged in the previous review are mostly restored, but there are still 4 blocking behaviour/compatibility issues: a nonexistent local path is treated as raw content and parsed anyway; the changed return value of resolve_uploaded_temp_file_id() breaks the pack/import path; local .zip files are now claimed early by GitAccessor, changing the naming behaviour for single-root-directory ZIPs; and the declared compatibility layer for HTMLParser/URLTypeDetector does not match the actual API. Requesting changes; please fix these regressions before merging.
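The first and third blocking points are both about local-path routing order. A minimal sketch of the intended behaviour, with a hypothetical dispatcher name and string tags standing in for the real accessor classes:

```python
from pathlib import Path

def select_accessor(source: str) -> str:
    """Route a local source to an accessor (names hypothetical).

    Two invariants from the review: a nonexistent path must fail fast
    instead of being parsed as raw content, and a local .zip must reach
    ZipParser before GitAccessor can claim it.
    """
    path = Path(source)
    if not path.exists():
        # never fall through to treating a missing path as raw content
        raise FileNotFoundError(source)
    if path.suffix == ".zip":
        return "zip"   # ZipParser keeps the single-root-dir naming behaviour
    if (path / ".git").is_dir():
        return "git"   # only real checkouts go to GitAccessor
    return "local"
```

Checking `.zip` before any Git detection is the ordering fix; the FileNotFoundError restores the pre-refactor hard failure.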
qin-ctx left a comment
On re-review, 3 of the previous round's 4 blocking points are fixed: a missing local path raises FileNotFoundError again, the tuple argument on the pack/import path is corrected, and local ZIPs go through ZipParser again. Two blocking issues remain: 1) URLTypeDetector swallows PermissionDeniedError, regressing the SSRF-validation semantics; 2) HTMLParser's backward compatibility is still incomplete: both the old _fetch_html and the parse(url) behaviour are broken.
if url_type != URLType.UNKNOWN:
    return url_type, meta
…
except Exception as e:
[Bug] (blocking)
This also swallows the PermissionDeniedError raised by request_validator, so SSRF targets that should be rejected outright fall back to the default WEBPAGE. Running tests/server/test_api_local_input_security.py::test_url_detector_request_validator_blocks_loopback_head locally, a loopback URL no longer raises PermissionDeniedError and instead silently returns (URLType.WEBPAGE, meta). That downgrades the security check from a hard failure to continuing down the fetch path, which is a behaviour regression. At minimum, re-raise PermissionDeniedError separately here, like the old implementation on main did.
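The fix asked for above is a narrower except clause that lets the security error propagate. A self-contained sketch, with a stand-in exception class and callables injected in place of the real validator and HEAD-probe:

```python
class PermissionDeniedError(Exception):
    """Stand-in for the request validator's SSRF rejection error."""

def detect_url_type(url: str, validate, probe, fallback="WEBPAGE"):
    """Sketch of the fix: SSRF rejections stay hard failures, while only
    transient probe errors fall back to the default type.

    validate/probe are injected here for illustration; the real detector
    calls its request_validator and a HEAD-based classifier.
    """
    try:
        validate(url)          # may raise PermissionDeniedError
        return probe(url)      # e.g. content-type based classification
    except PermissionDeniedError:
        raise                  # security check must not be downgraded
    except Exception:
        return fallback        # network hiccups degrade gracefully
```

The ordering matters: the specific `except PermissionDeniedError: raise` must come before the broad `except Exception`, or the broad clause shadows it.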
    parse_time=time.time() - start_time,
    warnings=[f"Failed to parse code repository: {e}"],
)
return await self._parse_local_file(path, start_time, **kwargs)
[Bug] (blocking)
The compatibility layer here is still incomplete. HTMLParser(timeout=...) no longer raises TypeError, but the old public API is still broken: HTMLParser().parse("http://example.com") no longer fetches the URL and instead treats it as a local path, returning File not found: http:/example.com; HTMLParser()._fetch_html(...) no longer exists either, so tests/server/test_api_local_input_security.py::test_html_parser_request_validator_blocks_loopback_fetch still fails with AttributeError. The re-exported URLType is also missing the old CODE_REPOSITORY member. If backward compatibility is the goal here, the old URL parse / _fetch_html paths need to come back; otherwise this should be declared an explicit breaking change, with the existing callers and tests cleaned up accordingly.
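A compatibility shim restoring the old contract would dispatch on the scheme inside parse() and keep _fetch_html as a patchable hook. A minimal sketch under those assumptions; the delegate bodies are placeholders for the new accessor layer:

```python
from urllib.parse import urlparse

class HTMLParser:
    """Sketch of the compatibility shim discussed above (hypothetical)."""

    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout

    def parse(self, target: str):
        # Old public API accepted both URLs and local paths; keep that contract.
        if urlparse(target).scheme in ("http", "https"):
            return self._fetch_html(target)
        return self._parse_local_file(target)

    def _fetch_html(self, url: str) -> str:
        # Legacy hook that tests patch/call directly; the shim should
        # delegate to the new HTTP accessor (plus request validation).
        raise NotImplementedError("delegate to the new HTTP accessor layer")

    def _parse_local_file(self, path: str) -> str:
        raise NotImplementedError("delegate to the new local-file parser")
```

Keeping _fetch_html as a named method, even as a thin delegate, is what lets the existing security test monkeypatch and assert on it.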
This commit completes the two-layer (accessor + parser) architecture implementation:
Two-layer architecture now fully functional:
Additional:
ZipParser refactoring:
Upload filename preservation:
All sources (local files, URLs, Git repos, zip files, uploaded files) now
go through the same unified pipeline with proper filename/directory preservation.