
feat(index): multi-threaded parallel indexing via worker_threads #21

Merged
mars167 merged 4 commits into main from feat/parallel-indexing
Feb 8, 2026

Conversation

mars167 (Owner) commented Feb 6, 2026

Summary

  • Add worker thread pool for CPU-bound parse+embed+quantize operations
  • LanceDB: parallel writes per language table (Promise.all)
  • Incremental indexer: Promise concurrency + optional worker pool
  • Config: useWorkerThreads, workerThreadsMinFiles (default 50; sketched below)
  • Fallback to single-threaded when pool unavailable or file count < threshold
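
A minimal sketch of how these knobs might look in src/core/indexing/config.ts. The field names, JSDoc wording, and defaults come from this PR; the interface and object names are assumptions:

// Sketch only: the names IndexingWorkerOptions / defaultWorkerOptions are illustrative.
export interface IndexingWorkerOptions {
  /** Enable true multi-threading via worker_threads for CPU-bound operations. */
  useWorkerThreads: boolean;
  /** Minimum number of files before enabling worker threads (avoid startup overhead for small repos). Only effective when useWorkerThreads is true. */
  workerThreadsMinFiles: number;
}

export const defaultWorkerOptions: IndexingWorkerOptions = {
  useWorkerThreads: true,
  workerThreadsMinFiles: 50,
};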

Made with Cursor

mars and others added 3 commits February 7, 2026 00:00
- Remove all DSR-related descriptions from README.md
- Update architecture diagram to reflect current implementation
- Remove DSR tools from skill templates and references
- Update AGENTS.md files across codebase to remove DSR references
- Simplify core capabilities section to focus on vector + graph retrieval
- Update comparison table to highlight repo-map with PageRank
- Add worker thread pool for CPU-bound parse+embed+quantize operations
- LanceDB: parallel writes per language table (Promise.all)
- Incremental indexer: Promise concurrency + optional worker pool
- Config: useWorkerThreads, workerThreadsMinFiles (default 50)
- Fallback to single-threaded when pool unavailable or file count < threshold

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copilot AI left a comment

Pull request overview

This PR introduces true multi-threaded indexing using Node.js worker_threads to speed up CPU-bound parsing/embedding/quantization, along with parallelized LanceDB writes. It also updates multiple docs/templates to remove DSR-related references and reflect the current tool/command set.

Changes:

  • Add worker_threads worker entry + a worker pool, and wire it into runParallelIndexing with config gating (useWorkerThreads, workerThreadsMinFiles).
  • Update incremental and full indexers to perform LanceDB writes per-language in parallel.
  • Refresh docs/templates/AGENTS to remove DSR references and emphasize repo-map + graph/vector retrieval.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 9 comments.

  • templates/agents/common/skills/git-ai-code-search/references/tools.md: Removes DSR tool documentation from templates.
  • templates/agents/common/skills/git-ai-code-search/references/constraints.md: Removes DSR-related constraints/rules from templates.
  • templates/agents/common/skills/git-ai-code-search/SKILL.md: Updates skill description/rules to remove DSR history references.
  • skills/git-ai-code-search/references/tools.md: Removes DSR tool documentation from shipped skill refs.
  • skills/git-ai-code-search/references/constraints.md: Removes DSR-related constraints/rules from shipped skill refs.
  • skills/git-ai-code-search/SKILL.md: Updates skill description/rules to remove DSR history references.
  • src/core/indexing/worker.ts: New worker entrypoint implementing CPU-bound indexing per file.
  • src/core/indexing/pool.ts: New fixed-size worker pool to distribute file tasks and collect results.
  • src/core/indexing/parallel.ts: Adds worker-pool path with fallback to existing single-threaded implementation.
  • src/core/indexing/config.ts: Adds worker-thread enablement config knobs and defaults.
  • src/core/indexerIncremental.ts: Adds optional worker-pool processing + parallel per-language LanceDB writes.
  • src/core/indexer.ts: Switches per-language LanceDB writes to Promise.all.
  • src/core/AGENTS.md: Updates core module overview/structure (removes DSR, adds repoMap).
  • src/commands/AGENTS.md: Updates command docs (removes dsr command references).
  • src/cli/AGENTS.md: Updates CLI architecture docs (removes dsr schemas/handlers, adds repo-map).
  • README.md: Updates product positioning/architecture diagrams away from DSR and toward repo-map/graph/vector.
  • AGENTS.md: Updates repository overview/structure text away from DSR references.
  • .gitignore: Adds .cursor/ to ignored paths.


Comment on lines 79 to 81
const parsed = parser.parseContent(filePath, content);
const symbols: SymbolInfo[] = parsed.symbols;
const fileRefs: AstReference[] = parsed.refs;

Copilot AI Feb 6, 2026


Worker parsing uses parser.parseContent(...) directly, so any parse failure results in the file being skipped (worker returns error → pool resolves null). This diverges from the single-threaded path which applies parseFailureFallback (text-only/line-chunk/etc). Pass the relevant error-handling config in WorkerRequest and implement the same fallback logic in the worker so results are consistent regardless of useWorkerThreads.

Suggested change: replace

const parsed = parser.parseContent(filePath, content);
const symbols: SymbolInfo[] = parsed.symbols;
const fileRefs: AstReference[] = parsed.refs;

with

let symbols: SymbolInfo[] = [];
let fileRefs: AstReference[] = [];
try {
  const parsed = parser.parseContent(filePath, content);
  symbols = parsed.symbols ?? [];
  fileRefs = parsed.refs ?? [];
} catch {
  // On parse failure, fall back to an empty symbol/ref set so the worker
  // still produces a result instead of signalling a hard error. This
  // mirrors the single-threaded behaviour where parse failures degrade
  // gracefully rather than skipping the file entirely.
  symbols = [];
  fileRefs = [];
}

content,
dim: this.dim,
quantizationBits: 8,
existingChunkHashes: [],

Copilot AI Feb 6, 2026


In the worker-thread path, files are processed with existingChunkHashes: [], so chunks that already exist in LanceDB can be re-inserted on every incremental run. The single-threaded path explicitly queries existing content_hash values and filters them out. Add an equivalent “filter existing hashes” step for the worker-thread path (either by querying LanceDB before t.chunks.add(...) or by providing existing hashes to workers).

Suggested change: replace

existingChunkHashes: [],

with

existingChunkHashes: Array.from(seenChunkHashes),

Comment on lines 407 to 432
const tasks: Array<Promise<void>> = [];
for (const item of filesToIndex) {
  const task = (async () => {
    processed++;
    this.onProgress?.({ totalFiles: state.totalFiles, processedFiles: processed, currentFile: item.filePosix });

    const content = this.source === 'staged'
      ? await readStagedFile(this.repoRoot, item.filePosix)
      : await readWorktreeFile(this.scanRoot, item.filePosix);
    if (content == null) return;

    const result = await pool.processFile({
      filePath: item.filePosix,
      content,
      dim: this.dim,
      quantizationBits: 8,
      existingChunkHashes: [],
    });

    if (result) mergeResult(result);
  })();
  tasks.push(task);
}

await Promise.all(tasks);
}

Copilot AI Feb 6, 2026


processFilesWithPool starts an async task per file and reads file contents before awaiting pool.processFile(...), which can trigger unbounded parallel file reads (and large in-memory contents) for big change sets. Limit concurrency for the read+dispatch stage (e.g., cap at pool.size or a small multiple, or reuse the queue/active scheduling pattern used in the single-threaded implementation).
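
A minimal sketch of the bounded read+dispatch pattern this comment describes, assuming a generic item/task shape (the helper name and the usage line are illustrative, not the PR's actual code):

// Run `task` over `items` with at most `limit` tasks in flight at once.
async function mapWithConcurrency<T>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<void>,
): Promise<void> {
  let next = 0;
  const runners = Array.from({ length: Math.min(limit, items.length) }, async () => {
    // Each runner synchronously claims the next index, then awaits the task,
    // so at most `limit` reads/dispatches are outstanding at any moment.
    while (next < items.length) {
      const item = items[next++];
      await task(item);
    }
  });
  await Promise.all(runners);
}

// Usage sketch: cap the read+dispatch stage at the pool size.
// await mapWithConcurrency(filesToIndex, poolSize, async (item) => { /* read + pool.processFile */ });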

Comment on lines 236 to 244
// Phase B: Process files — use worker threads when enough files, else single-threaded
const WORKER_THREAD_MIN_FILES = 20;
const useWorkerThreads = filesToIndex.length >= WORKER_THREAD_MIN_FILES;
let pool: IndexingWorkerPool | null = null;

if (useWorkerThreads) {
  const poolSize = Math.max(1, Math.min(filesToIndex.length, (os.cpus()?.length ?? 2) - 1));
  pool = IndexingWorkerPool.create({ poolSize });
}

Copilot AI Feb 6, 2026


The incremental indexer uses a hard-coded WORKER_THREAD_MIN_FILES = 20, while the main indexer uses configurable workerThreadsMinFiles. This makes behavior inconsistent and prevents tuning via config. Consider threading an indexing config into IncrementalIndexOptions (or at least reusing the same default/constant) so both indexers follow the same enablement rules.
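
A sketch of one way to thread the knob through. The field name mirrors the existing config field; its placement in IncrementalIndexOptions and the inline default are assumptions:

// Sketch: reuse the shared config field instead of a local constant.
interface IncrementalIndexOptions {
  // ...existing options...
  /** Same semantics as the indexing config's workerThreadsMinFiles; defaults to 50. */
  workerThreadsMinFiles?: number;
}

// In the incremental indexer, the enablement check would then read:
// const minFiles = options.workerThreadsMinFiles ?? 50;
// const useWorkerThreads = filesToIndex.length >= minFiles;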

Comment on lines 147 to 164
private handleWorkerError(worker: Worker, err: Error): void {
  // Reject all pending resolvers for this worker (there should be at most 1)
  // The worker might be dead — remove it and try to replace if pool isn't closing
  const idx = this.workers.indexOf(worker);
  if (idx !== -1) {
    this.workers.splice(idx, 1);
  }
  const idleIdx = this.idleWorkers.indexOf(worker);
  if (idleIdx !== -1) {
    this.idleWorkers.splice(idleIdx, 1);
  }

  // Reject any resolvers waiting on this worker's current task
  for (const [id, entry] of this.resolvers.entries()) {
    entry.reject(err);
    this.resolvers.delete(id);
  }
}

Copilot AI Feb 6, 2026


handleWorkerError currently rejects and deletes all entries in this.resolvers, even though only one worker errored. This can incorrectly fail unrelated in-flight tasks and leave the pool in an inconsistent state. Track which task id is assigned to which worker (e.g., a Map<Worker, id> or include worker reference in the resolver entry) and only reject that task; consider also respawning a replacement worker or failing queued work deterministically.
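
A sketch of the per-worker bookkeeping the comment asks for. The class, map name, and resolver shape here are illustrative stand-ins for the pool's actual internals:

import { Worker } from 'worker_threads';

type Resolver = { resolve: (value: unknown) => void; reject: (err: Error) => void };

class WorkerTaskTracking {
  private resolvers = new Map<number, Resolver>();
  // Hypothetical extra bookkeeping: which task id each worker is currently running.
  private taskIdByWorker = new Map<Worker, number>();

  handleWorkerError(worker: Worker, err: Error): void {
    const id = this.taskIdByWorker.get(worker);
    if (id !== undefined) {
      // Fail only the task assigned to the errored worker, not unrelated in-flight tasks.
      this.resolvers.get(id)?.reject(err);
      this.resolvers.delete(id);
      this.taskIdByWorker.delete(worker);
    }
    // Removing the dead worker from the pool (and optionally respawning a
    // replacement) would follow here, as in the original implementation.
  }
}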

Comment on lines 88 to 99
async close(): Promise<void> {
  if (this.closed) return;
  this.closed = true;
  // Reject any queued tasks
  for (const pending of this.pendingTasks) {
    pending.reject(new Error('Pool closed before task could be dispatched'));
  }
  this.pendingTasks = [];
  await Promise.all(this.workers.map((w) => w.terminate()));
  this.workers = [];
  this.idleWorkers = [];
}

Copilot AI Feb 6, 2026


close() terminates workers but does not reject/resolve any in-flight tasks stored in this.resolvers. If close() is called while tasks are running, the returned promises can hang indefinitely. Before/while terminating workers, reject all outstanding resolvers (and clear the map) so callers always see completion.
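
A sketch of a close() that fails in-flight work before terminating; this is the method body only, reusing the pool's existing fields (the resolvers map as above):

// Sketch: reject outstanding resolvers so no caller promise hangs after close().
async close(): Promise<void> {
  if (this.closed) return;
  this.closed = true;
  const err = new Error('Pool closed while tasks were in flight');
  // Fail queued tasks first, then anything currently running on a worker.
  for (const pending of this.pendingTasks) pending.reject(err);
  this.pendingTasks = [];
  for (const [id, entry] of this.resolvers.entries()) {
    entry.reject(err);
    this.resolvers.delete(id);
  }
  await Promise.all(this.workers.map((w) => w.terminate()));
  this.workers = [];
  this.idleWorkers = [];
}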

Comment on lines 149 to 158
const lang = inferIndexLang(filePosix);
const existingHashes = existingHashArrayByLang[lang] ?? [];

const result = await pool.processFile({
  filePath: filePosix,
  content,
  dim: options.dim,
  quantizationBits: options.indexing.hnswConfig.quantizationBits,
  existingChunkHashes: existingHashes,
});

Copilot AI Feb 6, 2026


In the worker-thread path, existingHashes (potentially very large) is sent to the worker for every file. Serializing/transferring large arrays per task can dominate runtime and memory for big repos. Consider removing existingChunkHashes from per-file messages and doing dedupe only on the main thread, or pass a per-language snapshot once via workerData/an init message (or a compact Bloom filter) and keep it in the worker.
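
A sketch of the one-time init-message approach on the worker side; the message shape and field names are hypothetical:

import { parentPort } from 'worker_threads';

// Worker side: cache the per-language snapshot sent once at startup,
// so per-file messages no longer need to carry existingChunkHashes.
let existingHashesByLang: Record<string, Set<string>> = {};

parentPort?.on('message', (msg: { type: string; [k: string]: unknown }) => {
  if (msg.type === 'init') {
    const snapshot = msg.existingHashesByLang as Record<string, string[]>;
    existingHashesByLang = Object.fromEntries(
      Object.entries(snapshot).map(([lang, hashes]) => [lang, new Set(hashes)]),
    );
    return;
  }
  // ...handle per-file requests, consulting existingHashesByLang[lang] for dedupe...
});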

Comment on lines 47 to +66
export async function runParallelIndexing(options: ParallelIndexOptions): Promise<ParallelIndexResult> {
  const { indexing, files } = options;
  const useThreads =
    indexing.useWorkerThreads &&
    files.length >= indexing.workerThreadsMinFiles;

  if (useThreads) {
    const pool = IndexingWorkerPool.create({ poolSize: Math.max(1, indexing.workerCount) });
    if (pool) {
      try {
        return await runWithWorkerPool(options, pool);
      } finally {
        await pool.close();
      }
    }
    // Pool creation failed — fall through to single-threaded path
  }

  return runSingleThreaded(options);
}

Copilot AI Feb 6, 2026


Current tests for runParallelIndexing won’t exercise the worker-thread path because workerThreadsMinFiles defaults to 50 and the fixtures are smaller. Add/adjust a test to set workerThreadsMinFiles low (and useWorkerThreads: true) to validate worker-thread behavior (including parse-failure fallback and chunk deduplication).
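
A hedged sketch of such a test (vitest/jest-style globals; the baseOptions fixture and the assertion are assumptions, not existing test code):

it('exercises the worker-thread path when the threshold is lowered', async () => {
  const result = await runParallelIndexing({
    ...baseOptions, // assumed fixture options with a handful of files
    indexing: {
      ...baseOptions.indexing,
      useWorkerThreads: true,
      workerThreadsMinFiles: 1, // force the pool even for small fixtures
    },
  });
  // Assertions would mirror the single-threaded expectations, e.g. that
  // parse-failure fallback and chunk deduplication produce identical results.
  expect(result).toBeDefined();
});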

* Main → Worker : WorkerRequest (file path + content + config)
* Worker → Main : WorkerResponse (parsed symbols, refs, chunks, AST data)
*/
import { parentPort, workerData } from 'worker_threads';

Copilot AI Feb 6, 2026


Unused import workerData.

Suggested change: replace

import { parentPort, workerData } from 'worker_threads';

with

import { parentPort } from 'worker_threads';

- Remove unused workerData import from worker.ts
- Add parse failure fallback logic in worker.ts for consistency with single-threaded path
- Fix close() to reject in-flight resolvers in pool.ts
- Fix handleWorkerError to only reject affected task in pool.ts
- Use configurable workerThreadsMinFiles instead of hardcoded value in indexerIncremental.ts
- Add missing existingChunkHashes query in indexerIncremental.ts worker path
- Add concurrency limit for read+dispatch in indexerIncremental.ts processFilesWithPool
- Optimize existingHashes transfer in parallel.ts by removing per-task transfer
- Track worker-task mapping via workerTaskIds Map in pool.ts
mars167 (Owner, Author) commented Feb 7, 2026

Semantic Review Report

mars167 (Owner, Author) commented Feb 7, 2026

🔍 CodaGraph Semantic Review

📊 Overall Assessment

Risk Level: 🟢 LOW
Changed Files: 18
Changed Symbols: 7
Affected Functions: 12

🎯 Overall Opinion

Overall opinion generation failed.


🤖 Powered by CodaGraph — Semantic Code Review

mars167 (Owner, Author) commented Feb 7, 2026

🔍 CodaGraph Semantic Review

📊 Overall Assessment

Risk Level: 🟠 HIGH
Changed Files: 18
Changed Symbols: 8
Affected Functions: 12

🎯 Overall Opinion

Overall opinion generation failed: git-ai JSON parse failed: unsupported stdout format.


🤖 Powered by CodaGraph — Semantic Code Review

mars167 (Owner, Author) commented Feb 7, 2026

🔍 CodaGraph Semantic Review

📊 Overall Assessment

Risk Level: 🟢 LOW
Changed Files: 18
Changed Symbols: 0
Affected Functions: 12

🎯 Overall Opinion

Overall opinion generation failed: Cannot read properties of undefined (reading 'map')


🤖 Powered by CodaGraph — Semantic Code Review

1 similar comment

mars167 (Owner, Author) commented Feb 7, 2026

🔍 CodaGraph Semantic Review

📊 Overall Assessment

Risk Level: 🟢 LOW
Changed Files: 18
Changed Symbols: 0
Affected Functions: 12

🎯 Overall Opinion

Parsing failed: MiniMax M2.1-lightning is not yet included in the Coding Plan. Please use MiniMax M2.1. When compute capacity allows, we automatically upgrade Coding Plan sessions to an experience close to lightning.


🤖 Powered by CodaGraph — Semantic Code Review

mars167 (Owner, Author) commented Feb 7, 2026

🔍 CodaGraph Semantic Review

📊 Overall Assessment

Risk Level: 🟡 MEDIUM
Changed Files: 18
Changed Symbols: 8
Affected Functions: 12

🎯 Overall Opinion

The overall architecture of the multi-threaded indexing in this PR is sound, but pool.ts has an obvious code-duplication bug, and the Promise.all() error-handling design reduces the system's fault tolerance. Recommendation: fix the code duplication, switch to Promise.allSettled(), and add error logging before merging.

⚠️ Top Risks

  1. [object Object]
  2. [object Object]
  3. [object Object]

🤖 Powered by CodaGraph — Semantic Code Review

mars167 (Owner, Author) commented Feb 7, 2026

🔍 CodaGraph Semantic Review

📊 Overall Assessment

Risk Level: 🟡 MEDIUM
Changed Files: 18
Changed Symbols: 4
Affected Functions: 12

🎯 Overall Opinion

The direction of the multi-threaded parallel indexing implementation is correct, but pool.ts has a serious code-duplication bug, and the parallel writes in indexer.ts carry a race-condition risk. Recommendation: fix both issues before resubmitting for review.

⚠️ Top Risks

  1. (🔴 CRITICAL)
  2. (🔴 CRITICAL)
  3. (🟠 WARNING)

🤖 Powered by CodaGraph — Semantic Code Review

mars167 (Owner, Author) commented Feb 7, 2026

🔍 CodaGraph Semantic Review

📊 Overall Assessment

Risk Level: 🟠 HIGH
Changed Files: 18
Changed Symbols: 12
Affected Functions: 12

🎯 Overall Opinion

The PR introduces worker_threads-based parallel indexing, but it has serious code-quality problems: duplicated code in pool.ts, a syntax error in worker.ts, and hard-coded parameters in parallel.ts that break the logic's integrity. Recommendation: fix all blocking issues before merging.

⚠️ Top Risks

  1. src/core/indexing/pool.ts:186-188 contains duplicated code, which may cause unexpected return behavior and runtime errors
  2. src/core/indexing/worker.ts:39-40 has a syntax error (extra characters after a return statement); the worker thread cannot start
  3. src/core/indexing/parallel.ts:145-151 hard-codes existingChunkHashes to an empty array, breaking deduplication and causing duplicate data

🤖 Powered by CodaGraph — Semantic Code Review

quantizationBits: options.indexing.hnswConfig.quantizationBits,
existingChunkHashes: [],
});

mars167 (Owner, Author) left a comment

?

Suggested change
88

mars167 (Owner, Author) commented Feb 8, 2026

Review completed by CodaGraph AI Agent.


⚠️ Detailed Comments (Fallback)

Note: Could not post inline comments due to GitHub API restrictions (e.g. lines outside diff context).

src/core/indexer.ts:208

⚠️ WARNING: Promise.all lacks error handling

With Promise.all, if the write for any one language fails, the entire indexing operation fails immediately and throws. Use Promise.allSettled to isolate errors, so that a failed write for one language does not affect the writes for the others.

Suggestion: replace Promise.all with Promise.allSettled and aggregate the results afterward.

const results = await Promise.allSettled(languages.map(async (lang) => {
  const t = byLang[lang];
  if (!t) return;
  const chunkRows = chunkRowsByLang[lang] ?? [];
  const refRows = refRowsByLang[lang] ?? [];
  if (chunkRows.length > 0) await t.chunks.add(chunkRows as unknown as Record<string, unknown>[]);
  if (refRows.length > 0) await t.refs.add(refRows as unknown as Record<string, unknown>[]);
  addedByLang[lang] = { chunksAdded: chunkRows.length, refsAdded: refRows.length };
}));

// Optional: log the languages whose writes failed
const failed = results.filter(r => r.status === 'rejected');
if (failed.length > 0) {
  console.warn(`Writes failed for ${failed.length} language(s)`);
}

src/core/indexing/config.ts:19

💡 SUGGESTION: Missing value constraint on workerThreadsMinFiles

workerThreadsMinFiles is typed as number with no lower bound, so users can configure invalid values such as 0, negative numbers, or floats.

Suggestion: use a stricter type or add validation logic; the recommended minimum is 1.

workerThreadsMinFiles: number;

src/core/indexing/config.ts:20

💡 SUGGESTION: JSDoc could document the dependency between options

When useWorkerThreads is false, the workerThreadsMinFiles setting has no effect, but the documentation does not state this relationship.

Suggestion: note in the JSDoc: "Only effective when useWorkerThreads is true."

/** Minimum number of files before enabling worker threads (avoid startup overhead for small repos). Only effective when useWorkerThreads is true. */

mars167 (Owner, Author) commented Feb 8, 2026

Review completed by CodaGraph AI Agent.


⚠️ Detailed Comments (Fallback)

Note: Could not post inline comments due to GitHub API restrictions (e.g. lines outside diff context).

src/core/indexing/config.ts:42

⚠️ WARNING: Missing parameter validation

The new workerThreadsMinFiles parameter lacks minimum-value validation; it should be guaranteed to be a positive integer.

Suggestion: add validation, for example: workerThreadsMinFiles: Math.max(1, 50)

useWorkerThreads: true,
workerThreadsMinFiles: Math.max(1, 50),

src/core/indexing/config.ts:19

💡 SUGGESTION: Type could be refined

workerThreadsMinFiles uses the number type, but it should really be a positive integer (>= 1).

Suggestion: consider a type alias or a doc comment constraining the value range.

/** Minimum number of files before enabling worker threads (avoid startup overhead for small repos). @minimum 1 */
workerThreadsMinFiles: number;

src/core/indexing/parallel.ts:54

⚠️ WARNING: Missing null check on workerCount

indexing.workerCount may be undefined, which would make poolSize NaN.

Suggestion: use the nullish coalescing operator to guarantee a valid poolSize: poolSize: Math.max(1, indexing.workerCount ?? 1)

const pool = IndexingWorkerPool.create({ poolSize: Math.max(1, indexing.workerCount ?? 1) });

src/core/indexing/parallel.ts:78

⚠️ WARNING: Missing null check on batchSize

options.indexing.batchSize may be undefined, which would make batchSize NaN.

Suggestion: add a null check: const batchSize = Math.max(1, options.indexing.batchSize ?? 1);

const batchSize = Math.max(1, options.indexing.batchSize ?? 1);

src/core/indexing/parallel.ts:127

⚠️ WARNING: Array splice may hurt performance

Mutating the array with splice inside a loop shifts the remaining elements on every call, for O(n²) total time.

Suggestion: iterate by index and take batches with slice instead: const batch = pendingFiles.slice(i, i + batchSize);

// Use a for loop with slice:
for (let i = 0; i < pendingFiles.length; i += batchSize) {
  const batch = pendingFiles.slice(i, i + batchSize);
  // ...
}

src/core/indexing/parallel.ts:150

⚠️ WARNING: existingChunkHashes hard-coded to an empty array

When the worker processes a file, existingChunkHashes is passed as an empty array even though options.existingChunkIdsByLang is available; the semantic mismatch can lead to duplicate processing.

Suggestion: consider passing the existing chunk hashes to the worker to support incremental indexing.

// If they should be passed through, use:
existingChunkHashes: options.existingChunkIdsByLang[lang] || [],

src/core/indexing/parallel.ts:166

⚠️ WARNING: Promise.all does not tolerate individual task failures

If any single file in the batch fails, the entire Promise.all rejects and a large amount of already-completed work is discarded.

Suggestion: use Promise.allSettled and collect the results, or implement partial-failure handling:
const results = await Promise.allSettled(tasks);
// handle fulfilled/rejected cases

// Example:
const results = await Promise.allSettled(tasks);
for (const result of results) {
  if (result.status === 'rejected') {
    console.error('File processing failed:', result.reason);
  }
}

src/core/indexing/parallel.ts:327

⚠️ WARNING: Code duplication: MemoryMonitor initialization

Both runWithWorkerPool (L74) and runSingleThreaded (L176) create a MemoryMonitor instance; the logic is identical but the code is duplicated.

Suggestion: extract a shared helper createMemoryMonitor(options): MemoryMonitor

function createMemoryMonitor(options: ParallelIndexOptions): MemoryMonitor {
  return MemoryMonitor.fromErrorConfig(options.errorHandling, options.indexing.memoryBudgetMb);
}

src/core/indexing/parallel.ts:95

💡 SUGGESTION: seenChunkHashes initialization can be simplified

The loop-based Map initialization pattern can be written more concisely.

Suggestion: use Object.entries to simplify the initialization.

const seenChunkHashes = new Map<IndexLang, Set<string>>(
  Object.entries(options.existingChunkIdsByLang).map(([lang, ids]) => [lang as IndexLang, new Set(ids)])
);

src/core/indexing/worker.ts:54

⚠️ WARNING: Incomplete file-language detection

inferIndexLang supports only a limited set of extensions and defaults unrecognized files to 'ts', so C++, C#, JS, and other files can be misidentified.

Suggestion: add more language mappings, or obtain the language from the parser itself.

function inferIndexLang(file: string): string {
  const ext = file.split('.').pop()?.toLowerCase();
  const map: Record<string, string> = {
    md: 'markdown', mdx: 'markdown',
    yml: 'yaml', yaml: 'yaml',
    java: 'java',
    c: 'c', h: 'c', hpp: 'c',
    cpp: 'cpp', cc: 'cpp', cxx: 'cpp',
    go: 'go',
    py: 'python',
    rs: 'rust',
    js: 'javascript', jsx: 'javascript',
    ts: 'ts', tsx: 'ts',
  };
  return map[ext || ''] || 'ts';
}

src/core/indexing/worker.ts:164

⚠️ WARNING: Incorrect scope span calculation

The span calculation scope.endLine - scope.startLine undercounts by one line, which can make boundary cases inaccurate.

Suggestion: change it to scope.endLine - scope.startLine + 1

const span = scope.endLine - scope.startLine + 1;

src/core/indexing/worker.ts:83

💡 SUGGESTION: Parse failures silently ignored

The catch block swallows every error without logging, so there is no way to trace which files failed to parse, which hampers troubleshooting.

Suggestion: log errors in development, or collect them into an error list returned to the caller.

} catch (err) {
  console.error(`Failed to parse ${filePath}:`, err);
  // Preserve the error so the caller can investigate
  return { ...processFileResult, error: String(err) };
}

src/core/indexing/worker.ts:135

⚠️ WARNING: hashEmbedding errors unhandled

If hashEmbedding fails because of an oversized dimension or invalid arguments, it throws and aborts processing of the entire file.

Suggestion: wrap the embedding call in a try-catch and skip vectorizing that symbol on failure.

try {
  const vec = hashEmbedding(text, { dim });
  const q = quantizeSQ8(vec, quantizationBits);
  // ...push chunk
} catch {
  // Skip chunks that cannot be vectorized, or fall back to a default value
  console.warn(`Failed to embed symbol ${s.name} in ${filePath}`);
}

src/core/indexing/worker.ts:142

💡 SUGGESTION: Unverified Buffer conversion assumption

The code assumes q.q can be converted to a Buffer; if quantizeSQ8 returns a different type, the conversion may fail.

Suggestion: assert the type explicitly or add a runtime check.

qvec_b64: Buffer.from(q.q as Uint8Array).toString('base64'),

src/core/indexing/worker.ts:119

📝 NIT: container construction does not reuse buildChunkText

The container's text is built in a different way, inconsistent with buildChunkText.

Suggestion: reuse the buildChunkText function to keep the format uniform.

if (s.container) {
  const cText = buildChunkText(filePath, s.container);
  // ...
}

mars167 (Owner, Author) commented Feb 8, 2026

Review completed by CodaGraph AI Agent.


⚠️ Detailed Comments (Fallback)

Note: Could not post inline comments due to GitHub API restrictions (e.g. lines outside diff context).

src/core/indexing/config.ts:5

⚠️ WARNING: Missing bounds validation

workerThreadsMinFiles lacks a minimum bound; setting it to 0 or a negative number leads to logic errors.

Suggestion: add minimum-value validation, or constrain the range with a branded type, e.g. workerThreadsMinFiles: number & { minimum: 1 }

workerThreadsMinFiles: number & { minimum: 1 };

src/core/indexing/config.ts:7

💡 SUGGESTION: Comment placement inconsistent with the code

The comment on L18 describes the field on L19, but the comment on L17 sits between them, which hurts readability.

Suggestion: merge the two fields' comments or reorder them so each comment sits next to its field.

/** Enable true multi-threading via worker_threads for CPU-bound operations.
 * Minimum number of files before enabling worker threads (avoid startup overhead for small repos).
 */
useWorkerThreads: boolean;
workerThreadsMinFiles: number;

src/core/indexing/config.ts:15

📝 NIT: Hard-coded magic number

The default threshold of 50 is a magic number; extract it as a constant or document the configuration choice.

Suggestion: turn the default value into a constant: const DEFAULT_WORKER_THREADS_MIN_FILES = 50;
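
The suggested constant written out (its placement next to the defaults in config.ts is assumed):

const DEFAULT_WORKER_THREADS_MIN_FILES = 50;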


src/core/indexing/worker.ts:87

⚠️ WARNING: Empty catch block silently swallows parse errors

All exceptions are caught on parse failure and discarded, which leads to hard-to-debug problems. When the tree-sitter parser fails, it may be due to a bug or unsupported syntax, but the developer has no way to learn the specific cause.

Suggestion: at least record the error message, or add conditional logging:

} catch (err) {
  // Temporary debugging; switch to console.debug in production
  console.warn(`Failed to parse ${filePath}:`, err instanceof Error ? err.message : err);
  symbols = [];
  fileRefs = [];
}

Current code:

} catch {
  // On parse failure, fall back to empty symbol/ref set.
  // This mirrors single-threaded behavior where parse failures don't skip the file.
  symbols = [];
  fileRefs = [];
}

src/core/indexing/worker.ts:54

⚠️ WARNING: Duplicated helper function from parallel.ts

The inferIndexLang function duplicates the implementation in parallel.ts, which makes it easy for the copies to drift apart during maintenance. A code comment already notes this is a temporary workaround to avoid import issues.

Suggestion: extract the shared helpers into a dedicated module (e.g. src/core/indexing/helpers.ts) and import them from there.

function inferIndexLang(file: string): string {
  // ... implementation
}

src/core/indexing/worker.ts:65

⚠️ WARNING: Duplicated helper function from parallel.ts

The buildChunkText function likewise duplicates parallel.ts, with the same consistency concern.

Suggestion: extract it into the shared module together with inferIndexLang.

function buildChunkText(file: string, symbol: { name: string; kind: string; signature: string }): string {
  return `file:${file}\nkind:${symbol.kind}\nname:${symbol.name}\nsignature:${symbol.signature}`;
}

src/core/indexing/worker.ts:160

⚠️ WARNING: pickScope span calculation may not match expectations

The span is computed as endLine - startLine, and the scope with the smallest span wins during selection. Logically, the smallest-span scope is the innermost nested function/method, so the logic is correct, but the name best.span is not self-explanatory; a comment would help.

Suggestion: add a comment explaining why the smallest span is chosen:

// Pick the caller scope with the smallest line range (i.e., the innermost function/method)
const pickScope = (line: number): string => {
  // ...
}

src/core/indexing/worker.ts:202

📝 NIT: Unnecessary non-null assertion operator

parentPort!.postMessage uses the non-null assertion (!), but line 195 has already confirmed that parentPort exists. The assertion can be removed to improve readability.

Suggested change: replace
parentPort!.postMessage(response);
with
parentPort.postMessage(response);

src/core/indexing/worker.ts:206

📝 NIT: Unnecessary non-null assertion operator

Same as above; parentPort!.postMessage can be simplified to parentPort.postMessage.

Suggested change: replace
parentPort!.postMessage(response);
with
parentPort.postMessage(response);

@mars167 mars167 merged commit 2501082 into main Feb 8, 2026
1 check passed
@mars167 mars167 deleted the feat/parallel-indexing branch February 8, 2026 12:36
