feat(auto-review): add deduplication detection subagent #70
Add a conditional subagent that detects near-duplicate files in PRs using n-gram Jaccard similarity. It spawns when newly added files share >70% similarity with existing repo files or with other added files.

- Heuristic script with configurable threshold, binary/small file exclusion, docs/test-only skip, label and force overrides
- Agent spec with severity scale, false-positive guardrails, DRY refactoring suggestions
- 34 tests covering n-gram computation, similarity, decision logic, filesystem helpers, and GitHub API calls
- Integrated into action.yml pipeline and findings extraction
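For readers unfamiliar with the metric, here is a minimal sketch of n-gram Jaccard similarity. It is illustrative only: the function names, the use of character 5-grams, and the whitespace normalization are assumptions, not necessarily what the heuristic script in this PR does. Only the 70% threshold comes from the description above.

```javascript
// Sketch under stated assumptions: character n-grams over
// whitespace-normalized text. The script in this PR may tokenize differently.
function ngrams(text, n = 5) {
  const normalized = text.replace(/\s+/g, ' ').trim();
  const grams = new Set();
  for (let i = 0; i + n <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + n));
  }
  return grams;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  // Assumption: two empty sets are treated as "not similar".
  if (a.size === 0 && b.size === 0) return 0;
  let intersection = 0;
  for (const gram of a) {
    if (b.has(gram)) intersection++;
  }
  return intersection / (a.size + b.size - intersection);
}

// Hypothetical helper: true when similarity exceeds the threshold (70% by default).
function isNearDuplicate(textA, textB, threshold = 0.7) {
  return jaccard(ngrams(textA), ngrams(textB)) > threshold;
}
```

Using sets rather than multisets keeps the score in the range 0 to 1 and makes it insensitive to how often a fragment repeats within a file.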
Claude finished @bkrem's task in 45s

Found 2 issue(s)

Issue 1: Multi-extension repo file loading is broken
ID: should-spawn-dedup-multi-ext-repo-load-b3f1
Context:
Recommendation: Remove the inner `if (!repoContents.size)` guard:

```js
if (!checkedExtensions.has(ext)) {
  checkedExtensions.add(ext);
  // removed: if (!repoContents.size) {
  const repoPaths = listRepoFilesByExtension(ext);
  for (const rp of repoPaths) {
    const normalized = rp.startsWith('./') ? rp.slice(2) : rp;
    if (addedNGrams.has(normalized)) continue;
    try {
      repoContents.set(normalized, fs.readFileSync(rp, 'utf8'));
    } catch { /* skip unreadable */ }
  }
  // removed: }
}
```

Issue 2: Missing test coverage for multi-extension scenario
ID: should-spawn-dedup-test-multi-ext-coverage-c7a2
Context:
Recommendation: Add a test:

```js
it('should detect similarity across different extensions in same PR', () => {
  const content = 'export function handler(req, res) {\n validate(req);\n process(req);\n res.json({ ok: true });\n return;\n}\n';
  const files = [
    { filename: 'src/new.js', status: 'added' },
    { filename: 'src/new.ts', status: 'added' },
  ];
  const result = shouldSpawnDeduplication(files, {}, {
    addedFileContents: new Map([['src/new.js', content], ['src/new.ts', content]]),
    repoFileContents: new Map([['src/old.js', content], ['src/old.ts', content]]),
  });
  expect(result.spawn).toBe(true);
  expect(result.similarPairs.length).toBeGreaterThanOrEqual(2);
});
```

Breaking changes: ✅ No issues (new
The `if (!repoContents.size)` guard prevented repo file loading for 2nd+ extensions since the shared Map was already populated. The outer `checkedExtensions` Set already deduplicates per extension. Also add multi-extension test to cover this scenario.
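To make the failure mode concrete, here is a small self-contained sketch. `fakeRepo` and `loadRepoFiles` are hypothetical stand-ins for the real filesystem helper and loading loop, and the `withSizeGuard` flag exists only to compare the two behaviours side by side.

```javascript
// Hypothetical in-memory stand-in for listing repo files by extension.
const fakeRepo = { '.js': ['src/old.js'], '.ts': ['src/old.ts'] };

function loadRepoFiles(extensions, { withSizeGuard }) {
  const repoContents = new Map();
  const checkedExtensions = new Set();
  for (const ext of extensions) {
    if (checkedExtensions.has(ext)) continue;
    checkedExtensions.add(ext);
    // Buggy variant: the shared Map is non-empty after the first extension,
    // so loading is skipped for every later extension.
    if (withSizeGuard && repoContents.size) continue;
    for (const path of fakeRepo[ext] ?? []) {
      repoContents.set(path, `contents of ${path}`);
    }
  }
  return [...repoContents.keys()];
}

const buggy = loadRepoFiles(['.js', '.ts'], { withSizeGuard: true });
const fixed = loadRepoFiles(['.js', '.ts'], { withSizeGuard: false });
console.log(buggy); // [ 'src/old.js' ]
console.log(fixed); // [ 'src/old.js', 'src/old.ts' ]
```

The `checkedExtensions` Set alone is enough to avoid loading the same extension twice, which is why dropping the size check does not reintroduce duplicate work.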
@claude review
Claude finished @bkrem's task in 4m 17s

🔄 Incremental Review: Previously flagged issues:

Breaking changes: ✅ No issues; all additions are additive (

Found 1 issue(s)

Issue 1:
Summary
- Heuristic script (`should-spawn-deduplication.js`) compares newly added files against existing repo files and other added files, with binary/small file exclusion, docs/test-only skip, label and force overrides
- Agent spec (`review-deduplication.md`) with severity scale (CRITICAL >90%, HIGH >70%), false-positive guardrails, and DRY refactoring suggestions
- Integrated into `action.yml` pipeline with `force_deduplication_agent` input and findings extraction (`dup-` prefix)

Test plan
- … (`pnpm test`), including 35 new tests for deduplication script
- `action.yml` syntax valid in CI
- `force_deduplication_agent: true` input triggers spawn regardless of heuristic
- `deduplication` label triggers spawn with empty pairs
- `skip-review` label suppresses spawn
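The override behaviour in the test plan and the severity scale in the summary can be sketched as follows. This is not the code from the PR: `decideSpawn`, `severityFor`, and the `reason` field are hypothetical, and the precedence shown (`skip-review` first, then force, then the `deduplication` label, then the heuristic) is an assumption, since the source only says what each override does on its own.

```javascript
// Hypothetical decision helper; the precedence order is an assumption.
function decideSpawn({ labels = [], force = false, similarPairs = [] } = {}) {
  if (labels.includes('skip-review')) {
    return { spawn: false, reason: 'skip-review label', similarPairs: [] };
  }
  if (force) {
    // Forced spawn ignores the heuristic result.
    return { spawn: true, reason: 'force_deduplication_agent', similarPairs };
  }
  if (labels.includes('deduplication')) {
    // Label-triggered spawn reports empty pairs, as in the test plan.
    return { spawn: true, reason: 'deduplication label', similarPairs: [] };
  }
  return {
    spawn: similarPairs.length > 0,
    reason: similarPairs.length > 0 ? 'heuristic' : 'no similar pairs',
    similarPairs,
  };
}

// Severity tiers from the summary: CRITICAL above 90%, HIGH above 70%.
// Any lower tiers are not stated in the source, so this returns null below 70%.
function severityFor(similarity) {
  if (similarity > 0.9) return 'CRITICAL';
  if (similarity > 0.7) return 'HIGH';
  return null;
}
```

For example, `decideSpawn({ labels: ['deduplication'] })` spawns with an empty `similarPairs` array, while `decideSpawn({ labels: ['skip-review'], force: true })` does not spawn under the precedence assumed here.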