fix: prevent learn tool from hanging + background indexing + memory optimizations #3
Open
jaggernaut007 wants to merge 3 commits into dondetir:main from
Resolves dondetir#1. The learn tool would get stuck on large folders because discover_files() traversed everything (including node_modules and build artifacts) and had no checkpointing for interruption recovery.

Changes:
- Rewrite discover_files() to use os.walk() with pathspec for .gitignore support and directory pruning (followlinks=False for symlink safety)
- Add max_files=200K safety limit (addresses SECURITY_REVIEW HIGH-003)
- Switch index_codebase() from delete-recreate to get_or_create + upsert with stale chunk cleanup
- Add resumable checkpointing (atomic writes every 1000 chunks)
- Improve progress reporting with discovery events and ETA
- Add 15 new tests (8 unit + 7 integration)
- Apply black formatting across codebase
- Snyk code scan: 0 issues

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
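The directory pruning described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: it swaps pathspec for a fixed set of directory names (SKIP_DIRS is a hypothetical stand-in for .gitignore matching), and the function signature is an assumption.

```python
import os
from pathlib import Path
from typing import Iterator

# Hypothetical stand-in for .gitignore matching; the PR uses pathspec instead.
SKIP_DIRS = {".git", "node_modules", "build", "dist", "__pycache__"}
MAX_FILES = 200_000  # safety limit named in the commit message


def discover_files(root: str, max_files: int = MAX_FILES) -> Iterator[Path]:
    """Yield files under root, pruning skipped directories before descent."""
    count = 0
    # followlinks=False keeps os.walk from descending into symlinked
    # directories, which avoids symlink loops.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Assigning to dirnames[:] edits the list os.walk recurses into,
        # so pruned directories are never traversed at all.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            if count >= max_files:
                return  # limit reached: stop walking instead of continuing
            count += 1
            yield Path(dirpath) / name
```

Pruning in place matters more than filtering results afterwards: a filter applied after the walk would still pay the cost of traversing node_modules.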
Memory optimizations:
- Inline symbol-to-chunk conversion per file batch (never hold both lists)
- Process files in batches of 500 with gc.collect() between batches
- Cap parallel parse workers at 4 (was up to 32)
- Explicit del + gc.collect() after embedding completes

Configurable MCP timeout:
- Default 600s, configurable via CODEGROK_TIMEOUT env var
- Per-call override via timeout_seconds parameter on learn tool
- Indexing runs in asyncio.to_thread() so wait_for can cancel
- On timeout: checkpoint preserved, re-run resumes

Tests: 15 new tests (9 timeout unit, 4 memory integration, 2 constant unit)
Snyk: 0 issues on modified files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
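A rough sketch of how the timeout resolution and the asyncio.to_thread() + wait_for combination could fit together. The helper names (resolve_timeout, run_with_timeout), the returned dictionaries, and the fallback on a malformed env value are assumptions for illustration; only the 600s default, the CODEGROK_TIMEOUT variable, and the timeout_seconds override come from the commit message.

```python
import asyncio
import os
from typing import Callable, Optional

DEFAULT_TIMEOUT_SECONDS = 600.0  # default stated in the commit message


def resolve_timeout(timeout_seconds: Optional[float] = None) -> float:
    """Per-call value wins, then CODEGROK_TIMEOUT, then the default."""
    if timeout_seconds is not None:
        return float(timeout_seconds)
    raw = os.environ.get("CODEGROK_TIMEOUT")
    if raw:
        try:
            return float(raw)
        except ValueError:
            pass  # assumption: a malformed value falls back to the default
    return DEFAULT_TIMEOUT_SECONDS


async def run_with_timeout(blocking_fn: Callable[[], object],
                           timeout_seconds: Optional[float] = None) -> dict:
    """Run a blocking function in a worker thread, bounded by a timeout."""
    timeout = resolve_timeout(timeout_seconds)
    try:
        # to_thread keeps the event loop responsive, so wait_for can
        # stop awaiting once the deadline passes.
        result = await asyncio.wait_for(asyncio.to_thread(blocking_fn), timeout)
        return {"status": "complete", "result": result}
    except asyncio.TimeoutError:
        return {"status": "timeout", "timeout_seconds": timeout}
```

Note that in this sketch wait_for only abandons the await; the worker thread itself is not interrupted and runs until its function returns, which is one reason a preserved checkpoint is useful on timeout.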
MCP clients (Claude Desktop, etc.) have ~60-120s transport-level timeouts that kill long-running tool calls. The learn tool now returns immediately and runs indexing in a background daemon thread. Clients poll get_stats() for progress.

- Add IndexingStatus thread-safe dataclass for background progress tracking
- Rewrite learn tool: starts threading.Thread(daemon=True), returns immediately
- get_stats() includes indexing progress when active or errored
- Stateful learn responses: indexing_started, indexing_in_progress, complete, error
- load_only mode stays synchronous (fast, no background needed)
- 33 new tests for IndexingStatus, progress callbacks, learn behavior, get_stats
- Snyk SAST scan: 0 issues
- All 191 tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
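The background pattern described above might look roughly like this. It is a sketch under stated assumptions: the field names follow the indexing fields listed in the PR, but the done flag, the try_start/update/snapshot methods, and the learn signature are hypothetical, and failure handling is reduced to recording the error message rather than raising ToolError.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class IndexingStatus:
    """Progress record shared between the worker thread and tool calls."""
    active: bool = False
    done: bool = False  # hypothetical flag marking a finished run
    progress: float = 0.0
    message: str = ""
    error: Optional[str] = None
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False, compare=False
    )

    def try_start(self) -> bool:
        """Atomically claim the 'active' slot; False if unavailable."""
        with self._lock:
            if self.active or self.done:
                return False
            self.active, self.progress, self.error = True, 0.0, None
            return True

    def update(self, **changes) -> None:
        with self._lock:
            for key, value in changes.items():
                setattr(self, key, value)

    def snapshot(self) -> dict:
        with self._lock:
            return {"active": self.active, "done": self.done,
                    "progress": self.progress, "message": self.message,
                    "error": self.error}


def learn(status: IndexingStatus,
          index_fn: Callable[[Callable[[float, str], None]], None]) -> dict:
    """Start indexing in a daemon thread and return without waiting."""
    if not status.try_start():
        snap = status.snapshot()
        state = "complete" if snap["done"] else "indexing_in_progress"
        return {"status": state, **snap}

    def worker() -> None:
        try:
            # index_fn receives a progress callback (fraction, message).
            index_fn(lambda p, m: status.update(progress=p, message=m))
            status.update(active=False, done=True, progress=1.0,
                          message="complete")
        except Exception as exc:  # record the failure for later polling
            status.update(active=False, error=str(exc))

    threading.Thread(target=worker, daemon=True).start()
    return {"status": "indexing_started"}
```

Claiming the active slot under the lock before the thread starts means two overlapping learn calls cannot both launch a worker.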
Summary
Fixes the learn tool hanging on large codebases with many folders/subfolders, reduces peak memory consumption during indexing, adds configurable timeout protection, and implements background indexing to prevent MCP transport timeouts.

Closes #1
Changes (3 commits)
Commit 1: fix: prevent learn tool from hanging on large codebases
- .gitignore support: discover_files() rewritten with os.walk() + pathspec for directory pruning; respects root and nested .gitignore files; followlinks=False prevents symlink loops
- max_files=200,000 circuit breaker (addresses SECURITY_REVIEW HIGH-003)
- get_or_create_collection() + collection.upsert() instead of destructive delete-recreate; stale chunks cleaned after embedding
- .codegrok/checkpoint.json saved every 1000 chunks via atomic os.replace(); deleted on success
- discovery_progress event + ETA in embedding progress messages

Commit 2: perf: reduce memory consumption and add configurable timeout
- Files processed in batches of 500 (FILE_BATCH_SIZE) with gc.collect() between batches
- Parallel parse workers capped at 4 (MAX_PARSE_WORKERS, was up to 32)
- Explicit del chunks + gc.collect() after embedding
- Timeout defaults to 600s and is configurable via CODEGROK_TIMEOUT env var or timeout_seconds param

Commit 3: fix: background indexing to prevent MCP transport timeout
- learn returns immediately; indexing runs in threading.Thread(daemon=True); client polls get_stats() for progress
- IndexingStatus thread-safe dataclass in state.py with threading.Lock for progress tracking
- learn returns indexing_started (new), indexing_in_progress (already running), complete (done), or raises ToolError (failed)
- get_stats() includes an indexing field with active, progress, message, error during indexing

Why background indexing? MCP clients (Claude Desktop, etc.) have ~60-120s transport-level timeouts that kill long-running tool calls. Internal timeouts can't help because the transport drops the call before they fire. Running indexing in a background thread avoids this because the tool call itself returns immediately.
Polling workflow:
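A minimal client-side sketch of this workflow, assuming get_stats is exposed to the caller as a plain callable that returns a dictionary with the indexing field described above. The wait_for_indexing helper, its parameters, and the exception types are hypothetical and not part of the PR.

```python
import time
from typing import Callable


def wait_for_indexing(get_stats: Callable[[], dict],
                      poll_interval: float = 2.0,
                      max_wait: float = 1800.0) -> dict:
    """Poll get_stats() until indexing is no longer active."""
    deadline = time.monotonic() + max_wait
    while True:
        stats = get_stats()
        # The indexing field is only expected while indexing is active
        # or has errored, so a missing field is treated as "finished".
        indexing = stats.get("indexing") or {}
        if indexing.get("error"):
            raise RuntimeError(f"indexing failed: {indexing['error']}")
        if not indexing.get("active"):
            return stats
        if time.monotonic() >= deadline:
            raise TimeoutError("gave up waiting for indexing to finish")
        time.sleep(poll_interval)
```

A caller would invoke learn once, then call wait_for_indexing with a function that wraps the get_stats tool call; each poll is a short request, so none of them approaches the transport timeout.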
New Dependency
- pathspec>=0.11.0: pure Python .gitignore pattern matching

Security
- max_files limit (addresses SECURITY_REVIEW HIGH-003)
- followlinks=False (symlink safety)

Test plan
- discover_files + memory constants (tests/unit/test_discover_files.py)
- Timeout tests (tests/unit/test_timeout.py)
- Background indexing tests (tests/unit/test_background_indexing.py)
- Integration tests (tests/integration/test_source_retriever.py)
- black --check passes on all modified files

Configuration

    {
      "mcpServers": {
        "codegrok": {
          "command": "codegrok-mcp",
          "env": {
            "CODEGROK_TIMEOUT": "1200"
          }
        }
      }
    }

🤖 Generated with Claude Code