Skip to content

feat(auto-review): add deduplication detection subagent#70

Draft
bkrem wants to merge 2 commits intomasterfrom
feat/deduplication-subagent
Draft

feat(auto-review): add deduplication detection subagent#70
bkrem wants to merge 2 commits intomasterfrom
feat/deduplication-subagent

Conversation

@bkrem
Copy link
Member

@bkrem bkrem commented Feb 24, 2026

Summary

  • Add deduplication subagent that detects near-duplicate files introduced in PRs using character-level n-gram Jaccard similarity (>70% threshold)
  • Heuristic script (should-spawn-deduplication.js) compares newly added files against existing repo files and other added files, with binary/small file exclusion, docs/test-only skip, label and force overrides
  • Agent spec (review-deduplication.md) with severity scale (CRITICAL >90%, HIGH >70%), false-positive guardrails, and DRY refactoring suggestions
  • Integrated into action.yml pipeline with force_deduplication_agent input and findings extraction (dup- prefix)
  • Fixed multi-extension repo file loading bug where only the first file extension's repo files were loaded

Test plan

  • All 194 tests pass (pnpm test), including 35 new tests for deduplication script
  • Multi-extension similarity detection covered by test
  • Verify action.yml syntax valid in CI
  • End-to-end: open a PR with a copy-pasted file and confirm subagent spawns
  • Verify force_deduplication_agent: true input triggers spawn regardless of heuristic
  • Verify deduplication label triggers spawn with empty pairs
  • Verify skip-review label suppresses spawn

Add conditional subagent that detects near-duplicate files in PRs using
n-gram Jaccard similarity. Spawns when newly added files share >70%
similarity with existing repo files or other added files.

- Heuristic script with configurable threshold, binary/small file
  exclusion, docs/test-only skip, label and force overrides
- Agent spec with severity scale, false-positive guardrails, DRY
  refactoring suggestions
- 34 tests covering n-gram computation, similarity, decision logic,
  filesystem helpers, and GitHub API calls
- Integrated into action.yml pipeline and findings extraction
@claude
Copy link

claude bot commented Feb 24, 2026

Claude finished @bkrem's task in 45s —— View job


Found 2 issue(s)

Issue 1: Multi-extension repo file loading is broken

ID: should-spawn-dedup-multi-ext-repo-load-b3f1
File: claude/auto-review/scripts/should-spawn-deduplication.js:191
Severity: HIGH
Category: code_quality

Context:

  • Pattern: if (!repoContents.size) guards the entire repo-file-loading block, but repoContents is a shared Map across all extensions. After the first extension loads its files, repoContents.size > 0, so subsequent extensions never load their own repo files.
  • Risk: When a PR adds files of multiple types (e.g. both .js and .ts), only the first extension encountered gets compared against existing repo files. The inner for loop then filters extname(repoPath) !== ext, which excludes all first-extension files, leaving zero comparisons for the second extension.
  • Impact: Near-duplicate detection silently misses cross-extension duplication candidates; no error is emitted. The checkedExtensions Set that was intended to guard this block is rendered useless.
  • Trigger: Any PR that adds files with more than one distinct file extension (very common).

Recommendation: Remove the inner if (!repoContents.size) guard — the outer checkedExtensions already prevents redundant loads per extension:

if (!checkedExtensions.has(ext)) {
  checkedExtensions.add(ext);
  // removed: if (!repoContents.size) {
  const repoPaths = listRepoFilesByExtension(ext);
  for (const rp of repoPaths) {
    const normalized = rp.startsWith('./') ? rp.slice(2) : rp;
    if (addedNGrams.has(normalized)) continue;
    try {
      repoContents.set(normalized, fs.readFileSync(rp, 'utf8'));
    } catch { /* skip unreadable */ }
  }
  // removed: }
}

Fix this →


Issue 2: Missing test coverage for multi-extension scenario

ID: should-spawn-dedup-test-multi-ext-coverage-c7a2
File: claude/auto-review/scripts/tests/should-spawn-deduplication.test.js:257
Severity: MEDIUM
Category: code_quality

Context:

  • Pattern: No test exercises a PR with added files of two different extensions (e.g. src/a.js + src/b.ts) where repo files of each type exist. All similarity tests use a single extension.
  • Risk: The multi-extension bug (Issue 1) would have been caught by such a test. Without it, future regressions in extension-bucketing logic are undetectable.
  • Impact: Silent false negatives in production — the heuristic reports "no similar pairs" for multi-extension PRs even when duplicates exist.
  • Trigger: Any PR adding files of mixed types.

Recommendation: Add a test:

it('should detect similarity across different extensions in same PR', () => {
  const content = 'export function handler(req, res) {\n  validate(req);\n  process(req);\n  res.json({ ok: true });\n  return;\n}\n';
  const files = [
    { filename: 'src/new.js', status: 'added' },
    { filename: 'src/new.ts', status: 'added' },
  ];
  const result = shouldSpawnDeduplication(files, {}, {
    addedFileContents: new Map([['src/new.js', content], ['src/new.ts', content]]),
    repoFileContents: new Map([['src/old.js', content], ['src/old.ts', content]]),
  });
  expect(result.spawn).toBe(true);
  expect(result.similarPairs.length).toBeGreaterThanOrEqual(2);
});

Breaking changes: ✅ No issues (new force_deduplication_agent input is optional with safe default; all additions are additive)
Data classification: ✅ No issues (file contents used only for in-memory n-gram computation, never logged or persisted)

The `if (!repoContents.size)` guard prevented repo file loading for
2nd+ extensions since the shared Map was already populated. The outer
`checkedExtensions` Set already deduplicates per extension.

Also add multi-extension test to cover this scenario.
@bkrem
Copy link
Member Author

bkrem commented Feb 24, 2026

@claude review

@claude
Copy link

claude bot commented Feb 24, 2026

Claude finished @bkrem's task in 4m 17s —— View job


🔄 Incremental Review:

Previously flagged issues:

  • should-spawn-dedup-multi-ext-repo-load-b3f1 ✅ Resolved — inner if (!repoContents.size) guard removed (commit a98ece2)
  • should-spawn-dedup-test-multi-ext-coverage-c7a2 ✅ Resolved — multi-extension test added at should-spawn-deduplication.test.js:238

Breaking changes: ✅ No issues — all additions are additive (force_deduplication_agent is optional with safe default, new $GITHUB_ENV vars don't affect existing steps, regex/map extensions are strictly additive)

Found 1 issue(s)

Issue 1: spawnSync calls through to real find in shouldSpawnDeduplication tests

ID: should-spawn-dedup-test-spawnthrough-e3b1
File: claude/auto-review/scripts/__tests__/should-spawn-deduplication.test.js:21
Severity: LOW
Category: code_quality

Context:

  • Pattern: vi.mock('child_process') wraps spawnSync with vi.fn(actual.spawnSync) — a spy that calls through to the real implementation. Tests in the listRepoFilesByExtension block stub it explicitly, but shouldSpawnDeduplication tests do not. Those tests run real find . against the CI filesystem.
  • Risk: When a test injects repoFileContents: new Map() or a small map, shouldSpawnDeduplication still calls listRepoFilesByExtension, reads real repo .js/.ts files from disk, and populates the injected map with them. Any real file with >70% n-gram Jaccard similarity to the test content would add unexpected pairs.
  • Impact: The test at line 201 (expect(result.similarPairs.length).toBe(1)) is the most fragile — it asserts an exact count. If a real repo file matches the injected content, the count exceeds 1 and the test fails. Currently unlikely given the generic content vs. the repo's script files, but fragile as a pattern.
  • Trigger: Any future test using generic/common content in addedFileContents while relying on exact similarPairs.length counts.

Recommendation: Add a default stub in the shouldSpawnDeduplication describe block:

beforeEach(() => {
  spawnSync.mockReturnValue({ stdout: '', status: 0, error: null });
});

This prevents real filesystem calls in all shouldSpawnDeduplication tests. The listRepoFilesByExtension describe block already overrides per-test, so this doesn't interfere.

| [Branch](https://github.com/WalletConnect/actions/tree/feat/deduplication-subagent)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant