Description
saveLabelTimelineEvents in packages/das/src/webhook/github-fetcher.service.ts:986 calls labelEventRepo.save({...}) without an id. Because LabelEvent uses @PrimaryGeneratedColumn() (packages/das/src/entities/LabelEvent.entity.ts:5) and label_events has no unique constraint (packages/db/07_label_events.sql:6-18), every backfill INSERTs a fresh row for every label event instead of upserting. Every backfill run therefore appends a full copy of the label history of every PR and issue.
BullMQ is configured with attempts: 2 on backfill jobs (packages/das/src/api/admin.controller.ts:95), so any partial failure mid-backfill (e.g. GitHub rate limit) retries from the beginning and doubles every label event already written during that run. After N backfills, label_events holds N× the real rows; the consumer views (pr_labels_by_actor, issue_labels_by_actor) silently collapse duplicates via DISTINCT ON, so API output stays correct long enough that nobody notices the table is ballooning. Eventually full-table scans on the views slow the miners API to a crawl and timeouts begin.
Failure mode: silent-bad-data + DoS through unbounded resource exhaustion.
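For concreteness, here is a minimal, runnable sketch of why save() without an id can never upsert. The entity below is a simplified stand-in; the field names, table name, and connection setup are assumptions, not the real LabelEvent schema:

```ts
import "reflect-metadata";
import { Column, DataSource, Entity, PrimaryGeneratedColumn } from "typeorm";

// Simplified stand-in for LabelEvent (assumed field names, fewer columns).
@Entity({ name: "label_events_demo" })
class LabelEventDemo {
  @PrimaryGeneratedColumn() id!: number;
  @Column() repoFullName!: string;
  @Column() labelName!: string;
  @Column() action!: string;
  @Column() timestamp!: Date;
}

async function main(): Promise<void> {
  const ds = await new DataSource({
    type: "postgres",
    url: process.env.DATABASE_URL, // assumption: connection string via env
    entities: [LabelEventDemo],
    synchronize: true, // demo only; creates the table
  }).initialize();

  const repo = ds.getRepository(LabelEventDemo);
  const event = {
    repoFullName: "acme/widgets",
    labelName: "bug",
    action: "labeled",
    timestamp: new Date("2024-01-01T00:00:00Z"),
  };

  // With a generated primary key and no id on the payload, save() has
  // nothing to match an existing row on, so each call issues an INSERT.
  await repo.save({ ...event });
  await repo.save({ ...event });

  console.log(await repo.count()); // 2 rows for one logical event

  await ds.destroy();
}

main().catch(console.error);
```

The same mechanics apply per timeline node in saveLabelTimelineEvents, which is why each backfill pass appends rather than converges.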
Steps to Reproduce
- Register a repo with at least one labeled PR or issue via POST /api/v1/admin/repos/register (registration auto-enqueues a backfill), or call POST /api/v1/admin/backfill directly.
- Wait for the backfill job to complete, then record the count from SELECT COUNT(*) FROM label_events;
- Trigger another backfill on the same repo with POST /api/v1/admin/backfill.
- Re-run SELECT COUNT(*) FROM label_events; the count has grown by exactly the number of label events on GitHub, even though no new labeling activity occurred. Query pr_labels_by_actor / issue_labels_by_actor and observe that API output is unchanged, because DISTINCT ON masks the duplicates (the audit sketch below makes them visible).
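To separate duplication from genuine new activity, a grouped count over the natural key surfaces the duplicates directly. A sketch, assuming the column names listed under Expected Behavior:

```ts
import { DataSource } from "typeorm";

// Audit sketch: every group with cnt > 1 is a duplicated label event.
// Column names are assumptions taken from the natural key below.
async function auditLabelEventDuplicates(ds: DataSource): Promise<void> {
  const rows = await ds.query(`
    SELECT repo_full_name, target_number, target_type,
           label_name, action, timestamp, COUNT(*) AS cnt
    FROM label_events
    GROUP BY repo_full_name, target_number, target_type,
             label_name, action, timestamp
    HAVING COUNT(*) > 1
    ORDER BY cnt DESC
    LIMIT 20
  `);
  console.table(rows); // cnt should track the number of backfills run
}
```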
Expected Behavior
Re-processing the same label timeline event should be idempotent. label_events should hold one row per (repo_full_name, target_number, target_type, label_name, action, timestamp) regardless of how many backfills or retries fire.
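One way to get that idempotency, sketched below under two assumptions: that the natural key above really is unique in GitHub's timeline data, and that the field names match the real entity. Add a UNIQUE index over the key and switch the write from save() to upsert():

```ts
import "reflect-metadata";
import { Column, DataSource, Entity, Index, PrimaryGeneratedColumn } from "typeorm";

// Fix sketch (assumed field and index names): enforce the natural key
// with a UNIQUE index so the database, not application code, guarantees
// one row per logical event.
@Entity({ name: "label_events" })
@Index(
  "ux_label_events_natural_key",
  ["repoFullName", "targetNumber", "targetType", "labelName", "action", "timestamp"],
  { unique: true },
)
class LabelEvent {
  @PrimaryGeneratedColumn() id!: number;
  @Column() repoFullName!: string;
  @Column() targetNumber!: number;
  @Column() targetType!: string;
  @Column() labelName!: string;
  @Column() action!: string;
  @Column() timestamp!: Date;
}

// In saveLabelTimelineEvents, each node would then be written with
// upsert() instead of save(). TypeORM emits INSERT ... ON CONFLICT
// (natural key) DO UPDATE, so a retried or repeated backfill rewrites
// the same row instead of appending a new one.
async function saveLabelEvent(ds: DataSource, event: Omit<LabelEvent, "id">): Promise<void> {
  await ds.getRepository(LabelEvent).upsert(event, {
    conflictPaths: [
      "repoFullName", "targetNumber", "targetType",
      "labelName", "action", "timestamp",
    ],
    skipUpdateIfNoValuesChanged: true,
  });
}
```

Note that the unique index cannot be created while duplicates already exist; a one-time cleanup is sketched under Additional Context.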
Actual Behavior
Every backfill (and every retry of a backfill, since attempts: 2) appends a full duplicate copy of every label event for every PR and issue in the repo. The table grows without bound, no error is thrown, and the API stays correct until view scans degrade and the miners endpoint starts timing out.
Environment
- OS: Linux 6.17.0-23-generic
- Runtime/Node version: Node.js v20.20.2 (container)
- Browser (if applicable): n/a
Additional Context
Affected code paths:
packages/das/src/webhook/github-fetcher.service.ts:975-999 — saveLabelTimelineEvents performs a non-idempotent save per node
packages/das/src/entities/LabelEvent.entity.ts:5 — id is @PrimaryGeneratedColumn(); no natural-key unique index
packages/db/07_label_events.sql:6-18 — table has only a (repo_full_name, target_number, timestamp) index, no UNIQUE constraint
packages/das/src/api/admin.controller.ts:95 — attempts: 2 causes any partial failure to double the rows written before the failure point
packages/db/24_view_pr_labels_by_actor.sql, packages/db/25_view_issue_labels_by_actor.sql — both use DISTINCT ON (...) ORDER BY timestamp DESC, hiding the duplication from API consumers
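If a unique-index fix along the lines sketched under Expected Behavior is adopted, existing duplicates have to be removed before the index can be created. A one-time cleanup sketch (column names assumed), keeping the lowest id in each natural-key group:

```ts
import { DataSource } from "typeorm";

// Cleanup sketch: delete every row that has a lower-id twin with the
// same natural key. Column names are assumptions; verify row counts
// (ideally inside a transaction) before creating the unique index.
async function dedupeLabelEvents(ds: DataSource): Promise<void> {
  await ds.query(`
    DELETE FROM label_events a
    USING label_events b
    WHERE a.id > b.id
      AND a.repo_full_name = b.repo_full_name
      AND a.target_number  = b.target_number
      AND a.target_type    = b.target_type
      AND a.label_name     = b.label_name
      AND a.action         = b.action
      AND a.timestamp      = b.timestamp
  `);
}
```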