
fix: prevent learn tool from hanging + background indexing + memory optimizations #3

Open
jaggernaut007 wants to merge 3 commits into dondetir:main from jaggernaut007:fix/large-folder-indexing


jaggernaut007 commented Mar 11, 2026

Summary

Fixes the learn tool hanging on large codebases with many folders/subfolders, reduces peak memory consumption during indexing, adds configurable timeout protection, and implements background indexing to prevent MCP transport timeouts.

Closes #1

Changes (3 commits)

Commit 1: fix: prevent learn tool from hanging on large codebases

  • .gitignore support: discover_files() rewritten with os.walk() + pathspec for directory pruning; respects root and nested .gitignore files; followlinks=False prevents symlink loops
  • Safety limits: max_files=200,000 circuit breaker (addresses SECURITY_REVIEW HIGH-003)
  • Upsert-based indexing: get_or_create_collection() + collection.upsert() instead of destructive delete-recreate; stale chunks cleaned after embedding
  • Resumable checkpointing: .codegrok/checkpoint.json saved every 1000 chunks via atomic os.replace(); deleted on success
  • Progress reporting: new discovery_progress event + ETA in embedding progress messages
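For readers unfamiliar with directory pruning in os.walk(), the following is a minimal sketch of the idea, not the PR's actual discover_files(). It assumes a hypothetical function name (discover_files_sketch), swaps pathspec for plain fnmatch matching on base names (so it does not implement real .gitignore semantics), and assumes the circuit breaker raises an error when max_files is exceeded; the PR does not say whether it raises or truncates.

```python
import fnmatch
import os
from typing import Iterator, List


def discover_files_sketch(
    root: str,
    ignore_patterns: List[str],
    max_files: int = 200_000,
) -> Iterator[str]:
    """Yield file paths under root, skipping ignored names (hypothetical helper)."""

    def ignored(name: str) -> bool:
        # Stand-in for pathspec: fnmatch on the base name only, no .gitignore rules.
        return any(fnmatch.fnmatch(name, pattern) for pattern in ignore_patterns)

    count = 0
    # followlinks=False keeps os.walk from descending into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Assigning to dirnames[:] prunes the walk in place, so ignored
        # directories (node_modules, build output) are never entered.
        dirnames[:] = sorted(d for d in dirnames if not ignored(d))
        for name in sorted(filenames):
            if ignored(name):
                continue
            count += 1
            if count > max_files:
                # Assumed behavior: fail loudly instead of indexing without bound.
                raise RuntimeError(f"more than {max_files} files under {root}")
            yield os.path.join(dirpath, name)
```

Pruning through dirnames[:] matters because filtering paths after the walk would still pay the cost of traversing every ignored subtree.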

Commit 2: perf: reduce memory consumption and add configurable timeout

  • Inline symbol→chunk conversion — symbols converted to chunks per file batch then freed
  • File batch processing — parse 500 files at a time (FILE_BATCH_SIZE) with gc.collect() between batches
  • Worker cap — parallel parse workers capped at 4 (MAX_PARSE_WORKERS, was up to 32)
  • Post-embedding cleanup — explicit del chunks + gc.collect() after embedding
  • Configurable timeout — default 600s; CODEGROK_TIMEOUT env var or timeout_seconds param
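The batching and worker cap described above could look roughly like this. It is a sketch under assumptions: FILE_BATCH_SIZE and MAX_PARSE_WORKERS take the values quoted in this description, while iter_batches, index_in_batches, parse_worker_count, and the parse_batch / consume_chunks callables are hypothetical names introduced for illustration.

```python
import gc
import os
from typing import Callable, Iterator, List, Sequence

FILE_BATCH_SIZE = 500    # value quoted in the PR description
MAX_PARSE_WORKERS = 4    # value quoted in the PR description


def parse_worker_count() -> int:
    """Cap parallel parse workers regardless of how many cores are available."""
    return max(1, min(MAX_PARSE_WORKERS, os.cpu_count() or 1))


def iter_batches(items: Sequence[str], size: int = FILE_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def index_in_batches(
    files: Sequence[str],
    parse_batch: Callable[[List[str]], List[str]],
    consume_chunks: Callable[[List[str]], None],
    size: int = FILE_BATCH_SIZE,
) -> int:
    """Parse one batch at a time so only one batch of chunks is alive at once."""
    total = 0
    for batch in iter_batches(files, size):
        chunks = parse_batch(batch)   # hypothetical: parse files, convert symbols to chunks
        consume_chunks(chunks)        # hypothetical: embed or store the chunks
        total += len(chunks)
        del chunks                    # drop the reference before the next batch
        gc.collect()                  # reclaim cyclic garbage between batches
    return total
```

Peak memory then scales with the batch size rather than with the size of the whole codebase, at the cost of a gc.collect() pause per batch.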

Commit 3: fix: background indexing to prevent MCP transport timeout

  • Background indexing: learn returns immediately; indexing runs in threading.Thread(daemon=True); client polls get_stats() for progress
  • IndexingStatus: thread-safe dataclass in state.py with threading.Lock for progress tracking
  • Stateful responses: learn returns indexing_started (new), indexing_in_progress (already running), complete (done), or raises ToolError (failed)
  • get_stats() enhanced: includes indexing field with active, progress, message, error during indexing
  • load_only unchanged: remains synchronous since it's fast (just loads existing index)
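A minimal sketch of the pattern, assuming nothing about the real state.py beyond what is listed above: a dataclass guarded by a threading.Lock, plus a daemon thread that reports into it. The names IndexingStatusSketch, try_begin, snapshot, and start_background_indexing are hypothetical, and the field set mirrors the active / progress / message / error keys mentioned for get_stats().

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class IndexingStatusSketch:
    """Progress record shared between the tool handler and the worker thread."""
    active: bool = False
    progress: int = 0
    message: str = ""
    error: Optional[str] = None
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False, compare=False)

    def try_begin(self) -> bool:
        """Atomically mark a run as started; False if one is already active."""
        with self._lock:
            if self.active:
                return False
            self.active, self.progress, self.message, self.error = True, 0, "Starting", None
            return True

    def update(self, **changes: Any) -> None:
        with self._lock:
            for key, value in changes.items():
                setattr(self, key, value)

    def snapshot(self) -> Dict[str, Any]:
        """Return a consistent copy that is safe to serialize."""
        with self._lock:
            return {"active": self.active, "progress": self.progress,
                    "message": self.message, "error": self.error}


def start_background_indexing(
    status: IndexingStatusSketch,
    work: Callable[[IndexingStatusSketch], None],
) -> Optional[threading.Thread]:
    """Run work(status) on a daemon thread; return None if a run is in progress."""
    if not status.try_begin():
        return None

    def runner() -> None:
        try:
            work(status)
            status.update(active=False, progress=100, message="Complete")
        except Exception as exc:  # surface the failure to pollers instead of losing it
            status.update(active=False, message="Failed", error=str(exc))

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    return thread
```

In this sketch the None return corresponds to the indexing_in_progress response, and a non-None thread to indexing_started. A daemon thread does not block interpreter exit, which is why a checkpoint is useful if the server shuts down mid-run.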

Why background indexing? MCP clients (Claude Desktop, etc.) have ~60-120s transport-level timeouts that kill long-running tool calls. Internal timeouts can't help because the transport drops first. Background threading solves this by returning immediately.

Polling workflow:

1. learn(path="/project")   → {"status": "indexing_started", "progress": 0}
2. get_stats()              → {"indexing": {"active": true, "progress": 42, "message": "Embedding..."}}
3. get_stats()              → {"loaded": true, "stats": {...}}  (indexing complete)
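On the client side, the three steps above amount to a polling loop. The sketch below assumes a hypothetical get_stats callable that wraps the MCP tool call and returns the dictionaries shown in the workflow; wait_for_index, poll_interval, and max_wait are illustrative names, not part of this PR.

```python
import time
from typing import Any, Callable, Dict


def wait_for_index(
    get_stats: Callable[[], Dict[str, Any]],
    poll_interval: float = 2.0,
    max_wait: float = 1200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll get_stats() until the index is loaded, indexing fails, or max_wait passes."""
    deadline = time.monotonic() + max_wait
    while True:
        stats = get_stats()
        indexing = stats.get("indexing") or {}
        if indexing.get("error"):
            raise RuntimeError(f"indexing failed: {indexing['error']}")
        if stats.get("loaded") and not indexing.get("active"):
            return stats  # step 3: indexing complete
        if time.monotonic() >= deadline:
            raise TimeoutError("index not ready before max_wait elapsed")
        sleep(poll_interval)  # step 2: still indexing, check again later
```

Each get_stats() call is short, so no single request approaches the transport-level timeout even when indexing takes many minutes.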

New Dependency

  • pathspec>=0.11.0 — pure Python .gitignore pattern matching

Security

  • Snyk Code Scan: 0 issues on all modified files (3 scans across all commits)
  • Addresses HIGH-003 (Unbounded Resource Consumption / DoS) via max_files limit
  • Addresses LOW-009 (Symlink Following) via followlinks=False

Test plan

  • 10 unit tests for discover_files + memory constants (tests/unit/test_discover_files.py)
  • 9 unit tests for timeout configuration (tests/unit/test_timeout.py)
  • 33 unit tests for background indexing (tests/unit/test_background_indexing.py)
  • 11 integration tests for upsert, checkpointing, memory optimizations (tests/integration/test_source_retriever.py)
  • 191/191 tests pass in full suite (30s)
  • black --check passes on all modified files
  • Snyk SAST scan: 0 issues

Configuration

{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp",
      "env": {
        "CODEGROK_TIMEOUT": "1200"
      }
    }
  }
}
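How the server might turn this setting into an effective timeout can be sketched as follows. The precedence (timeout_seconds parameter, then CODEGROK_TIMEOUT, then the 600s default) follows the description above; the function name resolve_timeout and the fallback to the default for missing, non-numeric, or non-positive values are assumptions, not confirmed behavior.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 600.0  # default quoted in the PR description


def resolve_timeout(
    timeout_seconds: Optional[float] = None,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Pick the per-call value, else CODEGROK_TIMEOUT, else the default."""
    if env is None:
        env = os.environ
    if timeout_seconds is not None:
        candidate = float(timeout_seconds)
    else:
        try:
            candidate = float(env.get("CODEGROK_TIMEOUT", ""))
        except ValueError:
            # Assumed fallback: unset or non-numeric values use the default.
            return DEFAULT_TIMEOUT_SECONDS
    # Assumed fallback: zero, negative, or NaN values also use the default.
    return candidate if candidate > 0 else DEFAULT_TIMEOUT_SECONDS
```

With the configuration above, resolve_timeout() would yield 1200.0 unless a call passes its own timeout_seconds.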

🤖 Generated with Claude Code

jaggernaut007 and others added 3 commits March 11, 2026 17:48
Resolves dondetir#1. The learn tool would get stuck on large folders because
discover_files() traversed everything (including node_modules, build
artifacts) and had no checkpointing for interruption recovery.

Changes:
- Rewrite discover_files() to use os.walk() with pathspec for .gitignore
  support and directory pruning (followlinks=False for symlink safety)
- Add max_files=200K safety limit (addresses SECURITY_REVIEW HIGH-003)
- Switch index_codebase() from delete-recreate to get_or_create + upsert
  with stale chunk cleanup
- Add resumable checkpointing (atomic writes every 1000 chunks)
- Improve progress reporting with discovery events and ETA
- Add 15 new tests (8 unit + 7 integration)
- Apply black formatting across codebase
- Snyk code scan: 0 issues

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Memory optimizations:
- Inline symbol-to-chunk conversion per file batch (never hold both lists)
- Process files in batches of 500 with gc.collect() between batches
- Cap parallel parse workers at 4 (was up to 32)
- Explicit del + gc.collect() after embedding completes

Configurable MCP timeout:
- Default 600s, configurable via CODEGROK_TIMEOUT env var
- Per-call override via timeout_seconds parameter on learn tool
- Indexing runs in asyncio.to_thread() so asyncio.wait_for() can time out the awaiting call (the worker thread itself is not interrupted)
- On timeout: checkpoint preserved, re-run resumes

Tests: 15 new tests (9 timeout unit, 4 memory integration, 2 constant unit)
Snyk: 0 issues on modified files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

MCP clients (Claude Desktop, etc.) have ~60-120s transport-level timeouts
that kill long-running tool calls. The learn tool now returns immediately
and runs indexing in a background daemon thread. Clients poll get_stats()
for progress.

- Add IndexingStatus thread-safe dataclass for background progress tracking
- Rewrite learn tool: starts threading.Thread(daemon=True), returns immediately
- get_stats() includes indexing progress when active or errored
- Stateful learn responses: indexing_started, indexing_in_progress, complete, error
- load_only mode stays synchronous (fast, no background needed)
- 33 new tests for IndexingStatus, progress callbacks, learn behavior, get_stats
- Snyk SAST scan: 0 issues
- All 191 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jaggernaut007 changed the title from "fix: prevent learn tool from hanging + reduce memory + add timeout" to "fix: prevent learn tool from hanging + background indexing + memory optimizations" Mar 11, 2026

Development

Successfully merging this pull request may close these issues.

learn tool hangs on large codebases with many folders/subfolders
