fix: prevent learn tool from hanging + background indexing + memory optimizations by jaggernaut007 · Pull Request #3 · dondetir/CodeGrok_mcp

jaggernaut007 · 2026-03-11T19:25:21Z

Summary

Fixes the learn tool hanging on large codebases with many folders/subfolders, reduces peak memory consumption during indexing, adds configurable timeout protection, and implements background indexing to prevent MCP transport timeouts.

Closes #1

Changes (3 commits)

Commit 1: `fix: prevent learn tool from hanging on large codebases`

.gitignore support — discover_files() rewritten with os.walk() + pathspec for directory pruning; respects root and nested .gitignore files; followlinks=False prevents symlink loops
Safety limits — max_files=200,000 circuit breaker (addresses SECURITY_REVIEW HIGH-003)
Upsert-based indexing — get_or_create_collection() + collection.upsert() instead of destructive delete-recreate; stale chunks cleaned after embedding
Resumable checkpointing — .codegrok/checkpoint.json saved every 1000 chunks via atomic os.replace(); deleted on success
Progress reporting — new discovery_progress event + ETA in embedding progress messages
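The checkpointing bullet above relies on a write-then-rename pattern. A minimal sketch of that pattern follows; the helper names and the `chunks_done` field are hypothetical, not taken from the PR, and only the `os.replace()` swap and the `.codegrok/checkpoint.json` location come from the description.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_INTERVAL = 1000  # chunks between saves, as stated in the PR


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file in the same directory, then swap it in."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so a reader sees either the old file or the new one, never a mix.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved state, or None if no usable checkpoint exists."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def clear_checkpoint(path: Path) -> None:
    """Delete the checkpoint after a successful run."""
    path.unlink(missing_ok=True)
```

Writing the temp file next to the target keeps both on the same filesystem, which is what lets the rename be atomic.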

Commit 2: `perf: reduce memory consumption and add configurable timeout`

Inline symbol→chunk conversion — symbols converted to chunks per file batch then freed
File batch processing — parse 500 files at a time (FILE_BATCH_SIZE) with gc.collect() between batches
Worker cap — parallel parse workers capped at 4 (MAX_PARSE_WORKERS, was up to 32)
Post-embedding cleanup — explicit del chunks + gc.collect() after embedding
Configurable timeout — default 600s; CODEGROK_TIMEOUT env var or timeout_seconds param
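The batching and worker cap listed above could look roughly like the sketch below. The two constants use the values from the PR; `chunks_from_files`, `parse_file`, and `to_chunks` are hypothetical names, and the use of a thread pool (rather than processes) is an assumption.

```python
import gc
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence

FILE_BATCH_SIZE = 500   # value stated in the PR
MAX_PARSE_WORKERS = 4   # value stated in the PR


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def chunks_from_files(
    files: Sequence[str],
    parse_file: Callable[[str], list],   # hypothetical: file path -> symbols
    to_chunks: Callable[[list], list],   # hypothetical: symbols -> chunks
    batch_size: int = FILE_BATCH_SIZE,
    max_workers: int = MAX_PARSE_WORKERS,
) -> Iterator[List]:
    """Parse files batch by batch, yielding one list of chunks per batch."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(files, batch_size):
            symbols_per_file = list(pool.map(parse_file, batch))
            batch_chunks = [
                chunk
                for symbols in symbols_per_file
                for chunk in to_chunks(symbols)
            ]
            # Drop the symbol lists before handing chunks to the caller,
            # so symbols and chunks for many batches never pile up.
            del symbols_per_file
            yield batch_chunks
            del batch_chunks
            gc.collect()
```

Because this is a generator, the caller can embed or upsert each batch as it arrives, and peak memory stays proportional to one batch rather than the whole codebase.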

Commit 3: `fix: background indexing to prevent MCP transport timeout`

Background indexing — learn returns immediately; indexing runs in threading.Thread(daemon=True); client polls get_stats() for progress
IndexingStatus — Thread-safe dataclass in state.py with threading.Lock for progress tracking
Stateful responses — learn returns indexing_started (new), indexing_in_progress (already running), complete (done), or raises ToolError (failed)
get_stats() enhanced — includes indexing field with active, progress, message, error during indexing
load_only unchanged — remains synchronous since it's fast (just loads existing index)
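To make the moving parts above concrete, here is a hedged sketch of a lock-guarded status record and a `learn` function that starts a daemon thread. It is not the code in state.py: the method names, the `index_fn` callback, and the exact response fields are assumptions, and the `complete` and `ToolError` branches are left out for brevity.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class IndexingStatus:
    """Progress record shared between the indexing thread and tool handlers."""

    active: bool = False
    progress: int = 0
    message: str = ""
    error: Optional[str] = None
    _lock: threading.Lock = field(
        default_factory=threading.Lock, repr=False, compare=False
    )

    def try_start(self) -> bool:
        """Mark indexing as active; return False if a run is already going."""
        with self._lock:
            if self.active:
                return False
            self.active, self.progress = True, 0
            self.message, self.error = "Starting", None
            return True

    def update(self, progress: int, message: str) -> None:
        with self._lock:
            self.progress, self.message = progress, message

    def finish(self, error: Optional[str] = None) -> None:
        with self._lock:
            self.active = False
            self.error = error
            if error is None:
                self.progress, self.message = 100, "Complete"

    def snapshot(self) -> dict:
        """Return a consistent copy of the public fields."""
        with self._lock:
            return {"active": self.active, "progress": self.progress,
                    "message": self.message, "error": self.error}


def learn(status: IndexingStatus,
          index_fn: Callable[[Callable[[int, str], None]], None]) -> dict:
    """Start indexing in the background and return immediately."""
    if not status.try_start():
        return {"status": "indexing_in_progress",
                "progress": status.snapshot()["progress"]}

    def worker() -> None:
        try:
            index_fn(status.update)  # index_fn reports progress via callback
            status.finish()
        except Exception as exc:     # surface the failure to later polls
            status.finish(error=str(exc))

    threading.Thread(target=worker, daemon=True).start()
    return {"status": "indexing_started", "progress": 0}
```

Checking and setting `active` under one lock acquisition in `try_start` is what prevents two overlapping `learn` calls from both launching a thread.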

Why background indexing? MCP clients (Claude Desktop, etc.) have ~60-120s transport-level timeouts that kill long-running tool calls. Internal timeouts can't help because the transport drops first. Background threading solves this by returning immediately.

Polling workflow:

1. learn(path="/project")   → {"status": "indexing_started", "progress": 0}
2. get_stats()              → {"indexing": {"active": true, "progress": 42, "message": "Embedding..."}}
3. get_stats()              → {"loaded": true, "stats": {...}}  (indexing complete)
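A client-side loop for the workflow above might look like the following sketch. It assumes only the response shapes shown in the three steps; `wait_for_index`, its parameters, and the `get_stats` callable (standing in for however the client invokes the MCP tool) are hypothetical.

```python
import time
from typing import Callable


def wait_for_index(
    get_stats: Callable[[], dict],   # hypothetical wrapper around the tool call
    poll_interval: float = 2.0,
    max_wait: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll get_stats() until indexing finishes, fails, or max_wait elapses."""
    deadline = clock() + max_wait
    while True:
        stats = get_stats()
        indexing = stats.get("indexing") or {}
        if indexing.get("error"):
            raise RuntimeError(f"indexing failed: {indexing['error']}")
        if not indexing.get("active") and stats.get("loaded"):
            return stats
        if clock() >= deadline:
            raise TimeoutError("indexing did not finish in time")
        sleep(poll_interval)
```

Each poll is a short tool call, so no single request comes near the transport timeout even when the overall wait is long.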

New Dependency

pathspec>=0.11.0 — pure Python .gitignore pattern matching
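For readers unfamiliar with pathspec, the sketch below shows how it can be combined with `os.walk()` to prune ignored directories before descending into them. It is a simplified illustration, not the PR's `discover_files()`: it reads only the root .gitignore (the PR also handles nested ones), the function name is hypothetical, and raising an error at the limit is an assumption about how the circuit breaker behaves.

```python
import os
from pathlib import Path
from typing import List

import pathspec


def discover_files_sketch(root: str, max_files: int = 200_000) -> List[str]:
    """Walk `root`, skipping paths matched by the root .gitignore."""
    ignore_file = Path(root) / ".gitignore"
    lines = (
        ignore_file.read_text(encoding="utf-8").splitlines()
        if ignore_file.is_file()
        else []
    )
    spec = pathspec.PathSpec.from_lines("gitwildmatch", lines)

    found: List[str] = []
    # followlinks=False keeps the walk from looping through symlinked dirs.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        rel_dir = os.path.relpath(dirpath, root)
        prefix = "" if rel_dir == "." else rel_dir.replace(os.sep, "/") + "/"

        # Assigning to dirnames[:] prunes in place, so os.walk never enters
        # ignored directories such as node_modules.
        dirnames[:] = [
            d for d in dirnames
            if d != ".git" and not spec.match_file(prefix + d + "/")
        ]

        for name in filenames:
            if spec.match_file(prefix + name):
                continue
            found.append(os.path.join(dirpath, name))
            if len(found) > max_files:
                raise RuntimeError(f"more than {max_files} files under {root}")
    return found
```

Pruning at the directory level matters more than filtering files afterwards: an ignored directory with hundreds of thousands of entries is never listed at all.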

Security

Snyk Code Scan: 0 issues on all modified files (3 scans across all commits)
Addresses HIGH-003 (Unbounded Resource Consumption / DoS) via max_files limit
Addresses LOW-009 (Symlink Following) via followlinks=False

Test plan

10 unit tests for discover_files + memory constants (tests/unit/test_discover_files.py)
9 unit tests for timeout configuration (tests/unit/test_timeout.py)
33 unit tests for background indexing (tests/unit/test_background_indexing.py)
11 integration tests for upsert, checkpointing, memory optimizations (tests/integration/test_source_retriever.py)
191/191 tests pass in full suite (30s)
black --check passes on all modified files
Snyk SAST scan: 0 issues

Configuration

{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp",
      "env": {
        "CODEGROK_TIMEOUT": "1200"
      }
    }
  }
}
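On the server side, the precedence between the per-call `timeout_seconds` parameter, the `CODEGROK_TIMEOUT` variable set above, and the 600s default could be resolved as in this sketch. The function name and the handling of invalid values are assumptions; only the three sources and the default come from the PR.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 600.0  # default stated in the PR


def resolve_timeout(
    timeout_seconds: Optional[float] = None,
    env: Mapping[str, str] = os.environ,
) -> float:
    """Per-call value wins, then CODEGROK_TIMEOUT, then the default."""
    if timeout_seconds is not None:
        raw, source = timeout_seconds, "timeout_seconds"
    elif env.get("CODEGROK_TIMEOUT"):
        raw, source = env["CODEGROK_TIMEOUT"], "CODEGROK_TIMEOUT"
    else:
        return DEFAULT_TIMEOUT_SECONDS

    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        # Assumption: reject non-positive values rather than fall back.
        raise ValueError(f"{source} must be positive, got {raw!r}")
    return value
```

With the configuration above, `resolve_timeout()` would return 1200.0, while `resolve_timeout(30)` would return 30.0 for that one call.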

🤖 Generated with Claude Code

Resolves dondetir#1. The learn tool would get stuck on large folders because discover_files() traversed everything (including node_modules, build artifacts) and had no checkpointing for interruption recovery.

Changes:
- Rewrite discover_files() to use os.walk() with pathspec for .gitignore support and directory pruning (followlinks=False for symlink safety)
- Add max_files=200K safety limit (addresses SECURITY_REVIEW HIGH-003)
- Switch index_codebase() from delete-recreate to get_or_create + upsert with stale chunk cleanup
- Add resumable checkpointing (atomic writes every 1000 chunks)
- Improve progress reporting with discovery events and ETA
- Add 15 new tests (8 unit + 7 integration)
- Apply black formatting across codebase
- Snyk code scan: 0 issues

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Memory optimizations:
- Inline symbol-to-chunk conversion per file batch (never hold both lists)
- Process files in batches of 500 with gc.collect() between batches
- Cap parallel parse workers at 4 (was up to 32)
- Explicit del + gc.collect() after embedding completes

Configurable MCP timeout:
- Default 600s, configurable via CODEGROK_TIMEOUT env var
- Per-call override via timeout_seconds parameter on learn tool
- Indexing runs in asyncio.to_thread() so wait_for can cancel
- On timeout: checkpoint preserved, re-run resumes

Tests: 15 new tests (9 timeout unit, 4 memory integration, 2 constant unit)
Snyk: 0 issues on modified files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MCP clients (Claude Desktop, etc.) have ~60-120s transport-level timeouts that kill long-running tool calls. The learn tool now returns immediately and runs indexing in a background daemon thread. Clients poll get_stats() for progress.

- Add IndexingStatus thread-safe dataclass for background progress tracking
- Rewrite learn tool: starts threading.Thread(daemon=True), returns immediately
- get_stats() includes indexing progress when active or errored
- Stateful learn responses: indexing_started, indexing_in_progress, complete, error
- load_only mode stays synchronous (fast, no background needed)
- 33 new tests for IndexingStatus, progress callbacks, learn behavior, get_stats
- Snyk SAST scan: 0 issues
- All 191 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jaggernaut007 and others added 3 commits March 11, 2026 17:48

jaggernaut007 changed the title ~~fix: prevent learn tool from hanging + reduce memory + add timeout~~ fix: prevent learn tool from hanging + background indexing + memory optimizations Mar 11, 2026

jaggernaut007 wants to merge 3 commits into dondetir:main from jaggernaut007:fix/large-folder-indexing
