Skip to content

[Bug] Active file-fetch jobs can mark stale PR contents complete after a newer push #11

@bitloi

Description

@bitloi

Description

pull_request.synchronize updates the canonical PR row to the new head_sha/base_sha and resets scoring_data_stored=false, but fetch-pr-files jobs are keyed only by repository and PR number. If an older file-fetch job is already active, BullMQ can dedupe the newer enqueue against that active job while the old worker continues fetching files for the PR row it read before the push.

When that old worker finishes, it can mark the current PR row as complete even though the stored file rows were fetched for an older head. The public files endpoint can then expose the newer head_sha with stale file contents and scoring_data_stored=true.

Steps to Reproduce

  1. Start a fetch-pr-files job for owner/repo#7 while the PR row has head_sha='H1' and base_sha='B1'.
  2. Before that job finishes, process a pull_request.synchronize webhook for the same PR with head_sha='H2' and base_sha='B2'.
  3. Observe that the webhook updates the PR row and resets scoring_data_stored=false, then enqueues another fetch-pr-files job with the same stable job id.
  4. Because the older job is still active, BullMQ can treat the new enqueue as a duplicate of the active job.
  5. Let the old job finish after storing files fetched for H1.
  6. Query GET /api/v1/pulls/owner/repo/7/files.

Expected Behavior

Each PR file-fetch job should be tied to the PR head/base generation it is meant to fetch. A stale job should not be able to mark scoring_data_stored=true for a newer PR row, and a newer push should result in a current-generation file fetch.

Actual Behavior

The active old job can store stale file rows and then mark the current PR row as complete. Validators may see head_sha='H2' and scoring_data_stored=true while the file rows still correspond to H1.

Environment

  • OS: Any server environment
  • Runtime/Node version: Node 20, as used by CI
  • Browser (if applicable): N/A

Additional Context

Relevant code paths:

  • packages/das/src/webhook/handlers/pull-request.handler.ts updates the PR row on synchronize, resets scoring_data_stored=false, and enqueues fetch-pr-files.
  • packages/das/src/webhook/handlers/pull-request.handler.ts uses a stable file job id based only on repoFullName and prNumber.
  • packages/das/src/webhook/github-fetcher.service.ts reads the PR row once and fetches files/content using that snapshot.
  • packages/das/src/queue/fetch.processor.ts marks scoring_data_stored=true after fetchAndStorePrFiles returns.
  • packages/das/src/api/pulls/pulls.service.ts exposes head_sha, base_sha, scoring_data_stored, and stored file rows together.

This breaks the scoring-data contract: scoring_data_stored=true should mean the persisted file rows match the PR head/base currently exposed by the files API.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions