[Bug] Deleted PR files are marked complete without base content #20

@bitloi

Description

PR file fetching stores metadata for deleted files, but skips content fetching for every file with status = "removed". Deleted files still have meaningful base-side content for scoring: the file existed before the PR and is removed at the PR head.

Because the fetcher returns successfully for deletion-only PRs, the worker can set scoring_data_stored = true while the public files API returns the deleted file with base_content = null, head_content = null, and no byte size. Validators then treat the PR as fully fetched even though the token input for the deleted file is missing.

Steps to Reproduce

  1. Mirror or backfill a PR that deletes a file, for example a file with status = "removed" in the GitHub PR files API.
  2. Let the fetch-pr-files worker complete for that PR.
  3. Query GET /api/v1/pulls/:owner/:repo/:number/files.
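To spot the problem in the response from step 3, a small check like the one below can help. It is a minimal sketch, not code from the repository: the PrFileEntry shape is assumed from the fields named in this issue, the response is assumed to be (or to contain) an array of such entries, and findRemovedWithoutBase is a hypothetical helper name.

```typescript
// Hypothetical shape of one file entry in the files API response,
// inferred from the fields mentioned in this issue (an assumption).
interface PrFileEntry {
  filename: string;
  status: string;
  base_content: string | null;
  head_content: string | null;
  byte_size?: number | null;
}

// Returns the removed files whose base-side content is missing,
// which is the symptom described in this report.
function findRemovedWithoutBase(files: PrFileEntry[]): PrFileEntry[] {
  return files.filter(
    (f) => f.status === "removed" && f.base_content === null,
  );
}

// Example with hand-written data that mimics the buggy response.
const exampleResponse: PrFileEntry[] = [
  { filename: "src/old.ts", status: "removed", base_content: null, head_content: null },
  { filename: "src/kept.ts", status: "modified", base_content: "a", head_content: "b", byte_size: 2 },
];

// Logs the filenames of removed files lacking base content: src/old.ts
console.log(findRemovedWithoutBase(exampleResponse).map((f) => f.filename));
```

An empty result for a PR that deletes files would indicate the base content was stored as expected.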

Expected Behavior

Deleted files should get a pr_file_contents row with the base blob populated and the head blob left null.

For a removed file such as src/old.ts, the response should include:

{
  "filename": "src/old.ts",
  "status": "removed",
  "base_content": "<content from the merge-base/base side>",
  "head_content": null,
  "byte_size": 123
}

Actual Behavior

Removed files are filtered out before GraphQL content fetching:

const scored = files.filter((f) => f.status !== "removed");
if (scored.length === 0) return;

For deletion-only PRs, no content request is made, no pr_file_contents row is written, and the worker still marks file scoring data complete. The files API left-joins missing content and returns null content fields for the deleted file.
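One possible direction, sketched under assumptions, is to stop dropping removed files and instead decide per file which sides to fetch. The names PrFileMeta, ContentPlan, and planContentFetch below are hypothetical and do not come from github-fetcher.service.ts; the only rule taken from this issue is that base content applies to non-added files and that removed files have no head content.

```typescript
// Hypothetical minimal view of a pr_files row (assumption).
interface PrFileMeta {
  filename: string;
  status: string;
}

// Which sides of the diff should be fetched for a file.
interface ContentPlan {
  filename: string;
  fetchBase: boolean;
  fetchHead: boolean;
}

// Keeps every file in the plan, including removed ones:
// - added files have no base side, so only head is fetched
// - removed files have no head side, so only base is fetched
// - every other status is fetched on both sides
function planContentFetch(files: PrFileMeta[]): ContentPlan[] {
  return files.map((f) => ({
    filename: f.filename,
    fetchBase: f.status !== "added",
    fetchHead: f.status !== "removed",
  }));
}
```

With this shape, a deletion-only PR still produces a non-empty plan, so the early return on an empty list would no longer skip it. Renamed files, whose base path may differ from the head path, are not handled in this sketch.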

Environment

  • OS: N/A
  • Runtime/Node version: Node 20
  • Browser (if applicable): N/A

Additional Context

Relevant paths:

  • packages/das/src/webhook/github-fetcher.service.ts
    • fetchAndStorePrFiles stores removed file metadata in pr_files.
    • fetchAndStoreBatchedContents filters out removed files before content fetching.
    • fetchContentBatch already knows how to fetch base content for non-added files.
  • packages/das/src/queue/fetch.processor.ts
    • marks scoring_data_stored = true after the fetcher returns.
  • packages/das/src/api/pulls/pulls.service.ts
    • exposes file metadata and content through a left join to pr_file_contents.
  • packages/db/09_pr_file_contents.sql
    • documents this table as PR file contents for token scoring: base and head versions.
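Since fetch.processor.ts marks scoring_data_stored = true whenever the fetcher returns, a completeness check before setting the flag might also be worth considering. The following is only a sketch: StoredFileState and isScoringDataComplete are hypothetical names, and it assumes every file is expected to have content on each side that exists for its status, which may not hold for binary or oversized files.

```typescript
// Hypothetical summary of what was stored for one PR file (assumption).
interface StoredFileState {
  filename: string;
  status: string;
  hasBase: boolean; // a base blob was written to pr_file_contents
  hasHead: boolean; // a head blob was written to pr_file_contents
}

// True only when every file has the sides its status calls for:
// base for anything not added, head for anything not removed.
function isScoringDataComplete(files: StoredFileState[]): boolean {
  return files.every(
    (f) =>
      (f.status === "added" || f.hasBase) &&
      (f.status === "removed" || f.hasHead),
  );
}
```

Under this check, a deletion-only PR with no stored base content would not be reported as complete.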
