Description
PR file fetching stores metadata for deleted files, but skips content fetching for every file with status = "removed". Deleted files still have meaningful base-side content for scoring: the file existed before the PR and is removed at the PR head.
Because deletion-only PRs return successfully from the fetcher, the worker can set scoring_data_stored = true while the public files API returns the deleted file with base_content = null, head_content = null, and no byte size. Validators then see the PR as fully fetched even though the token input for the deleted file is missing.
Steps to Reproduce
- Mirror or backfill a PR that deletes a file, for example a file with status = "removed" in the GitHub PR files API.
- Let the fetch-pr-files worker complete for that PR.
- Query GET /api/v1/pulls/:owner/:repo/:number/files.
Expected Behavior
Deleted files should get a pr_file_contents row with the base blob populated and the head blob left null.
For a removed file such as src/old.ts, the response should include:
{
  "filename": "src/old.ts",
  "status": "removed",
  "base_content": "<content from the merge-base/base side>",
  "head_content": null,
  "byte_size": 123
}
Actual Behavior
Removed files are filtered out before GraphQL content fetching:
const scored = files.filter((f) => f.status !== "removed");
if (scored.length === 0) return;
For deletion-only PRs, no content request is made, no pr_file_contents row is written, and the worker still marks file scoring data complete. The files API left-joins missing content and returns null content fields for the deleted file.
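One possible direction, sketched under assumptions: instead of dropping removed files, decide per file which sides to fetch. The PrFile and ContentPlan shapes and the planContentFetch helper below are hypothetical and are not taken from github-fetcher.service.ts; only the status values "added" and "removed" come from this report.

```typescript
// Hypothetical shapes; the real types in github-fetcher.service.ts may differ.
interface PrFile {
  filename: string;
  status: string; // e.g. "added", "modified", "removed"
}

interface ContentPlan {
  filename: string;
  fetchBase: boolean; // the file exists on the base side
  fetchHead: boolean; // the file exists at the PR head
}

// Keep every file in the plan and choose sides by status,
// rather than filtering out status === "removed".
function planContentFetch(files: PrFile[]): ContentPlan[] {
  return files.map((f) => ({
    filename: f.filename,
    fetchBase: f.status !== "added",
    fetchHead: f.status !== "removed",
  }));
}

const plan = planContentFetch([
  { filename: "src/old.ts", status: "removed" },
  { filename: "src/new.ts", status: "added" },
]);
console.log(plan);
```

With a plan like this, a deletion-only PR would still produce a non-empty batch, so the early return on an empty list would no longer skip it.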
Environment
- OS: N/A
- Runtime/Node version: Node 20
- Browser (if applicable): N/A
Additional Context
Relevant paths:
packages/das/src/webhook/github-fetcher.service.ts
- fetchAndStorePrFiles stores removed file metadata in pr_files.
- fetchAndStoreBatchedContents filters out removed files before content fetching.
- fetchContentBatch already knows how to fetch base content for non-added files.
packages/das/src/queue/fetch.processor.ts
- marks scoring_data_stored = true after the fetcher returns.
packages/das/src/api/pulls/pulls.service.ts
- exposes file metadata and content through a left join to pr_file_contents.
packages/db/09_pr_file_contents.sql
- documents this table as PR file contents for token scoring: base and head versions.
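
As a complementary safeguard, the worker could verify content completeness before setting scoring_data_stored. The sketch below is an assumption, not the existing behavior of fetch.processor.ts: FileRow, hasExpectedContent, and canMarkScoringDataStored are hypothetical names, and the check assumes every file is expected to have text content on each side that exists (binary or oversized files, if handled differently, would need an exemption).

```typescript
// Hypothetical row shape, resembling pr_files left-joined to pr_file_contents.
interface FileRow {
  filename: string;
  status: string;
  base_content: string | null;
  head_content: string | null;
}

// A row is complete when each side that should exist has non-null content.
function hasExpectedContent(row: FileRow): boolean {
  const needsBase = row.status !== "added";
  const needsHead = row.status !== "removed";
  const baseOk = !needsBase || row.base_content !== null;
  const headOk = !needsHead || row.head_content !== null;
  return baseOk && headOk;
}

// Only report the PR as fully fetched when every file row is complete.
function canMarkScoringDataStored(rows: FileRow[]): boolean {
  return rows.every(hasExpectedContent);
}

const buggyRow: FileRow = {
  filename: "src/old.ts",
  status: "removed",
  base_content: null,
  head_content: null,
};
console.log(canMarkScoringDataStored([buggyRow]));
```

For the row described in this report (a removed file with both content fields null), the guard returns false, so the PR would not be reported as fully fetched.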