supercharge codebase-memory-mcp: streamline and consolidate api, autoindexing, PageRank, dependency indexing, speedup, cli config, autotune#151
Conversation
Apply 8 token reduction techniques inspired by RTK (Rust Token Killer):
1. Default search limits: search_graph/search_code default limit 500K→50
(CBM_DEFAULT_SEARCH_LIMIT constant). Callers can override explicitly.
2. Smart truncation for get_code_snippet: 3 modes (full/signature/head_tail)
with max_lines=200 default (CBM_DEFAULT_SNIPPET_MAX_LINES). head_tail
preserves function signature + return/cleanup code. Signature mode
returns only API surface without reading source files.
3. Compact mode for search_graph/trace_call_path: omits redundant name
field when it's the last segment of qualified_name.
4. Summary mode for search_graph: returns aggregated counts by label and
file (top 20) instead of individual results. 95% token reduction.
5. Trace edge case fixes: max_results param (default 25), BFS cycle
deduplication by node ID, candidates array for ambiguous function names,
callees_total/callers_total counts.
6. query_graph output truncation: max_output_bytes (default 32KB) caps
worst-case output. Does NOT change max_rows (which is a scan-limit
that would break aggregation queries).
7. Token metadata: _result_bytes and _est_tokens in all MCP tool responses
for LLM token awareness.
8. Stable pagination: ORDER BY name, id for deterministic pagination.
All defaults use named constants (CBM_DEFAULT_*) — no magic numbers.
CYPHER_RESULT_CEILING reduced 100K→10K as safety net.
Tests: 22 new tests in test_token_reduction.c, all passing.
All 2060+ existing tests pass with zero regressions.
Register index_dependencies MCP tool for indexing dependency/library
source code into a separate dependency graph. Dependencies are stored
in {project}_deps.db (separate from project.db) and are NOT included
in queries unless include_dependencies=true is passed.
AI grounding safeguards (7-layer defense):
1. Storage: separate _deps.db not touched by index_repository
2. Query default: include_dependencies=false (deps excluded by default)
3. QN prefix: dep.{mgr}.{package}.{symbol} convention documented
4. Response field: "source":"project" / "source":"dependency" labels
5. Properties: "external":true on dependency nodes
6. Tool description: explicitly states "SEPARATE dependency graph"
7. Boundary markers: trace_call_path shows project→dep edges
Current state:
- Tool registered with full parameter validation (project, package_manager required)
- include_dependencies param added to search_graph with source field
- Handler returns structured "not_yet_implemented" status
- Full dep resolution pipeline (depindex module) designed but deferred
Tests: 12 new tests in test_depindex.c, all passing.
All 2042 existing tests pass with zero regressions.
Next: implement src/depindex/ module for actual package resolution
(uv/cargo/npm/bun), dependency file discovery, and pipeline
integration per the plan in plans/serialized-pondering-puppy.md.
…ndexing (reference-api-indexing)
Resolve 5 conflicts across 3 files:
- Makefile.cbm: combine TEST_TOKEN_REDUCTION_SRCS + TEST_DEPINDEX_SRCS in ALL_TEST_SRCS
- src/mcp/mcp.c: merge compact/summary/search_mode params with include_dependencies param
in handle_search_graph; add source:"project" field in full-mode results loop at line ~912
- tests/test_main.c: register both suite_token_reduction and suite_depindex
Combined capabilities on merged branch:
- 8 RTK-inspired token reduction strategies (CBM_DEFAULT_* constants, 3 snippet modes,
compact/summary search, BFS dedup, query output cap, _result_bytes/_est_tokens metadata)
- index_dependencies MCP tool with AI grounding (7-layer defense: separate _deps.db,
include_dependencies=false default, dep.{mgr}.{pkg} QN prefix, source field, external property)
Tests: 2064 passed, 0 failed (22 token_reduction + 12 depindex + 2030 existing)
Summary mode bug: by_label only counted 50 results (the default limit) instead of all symbols. Fix: override effective_limit to 10000 when mode=summary so aggregation covers representative sample. Pagination: when has_more=true, add pagination_hint field: "Use offset:50 and limit:50 for next page (13818 total)" This guides LLMs to use offset/limit for progressive exploration. Verified on RTK codebase (45,388 symbols): - Summary mode: 1,317 bytes with accurate label counts - Default search: pagination_hint present when has_more=true - All 2064 tests pass
Summary mode bug: by_label only counted 50 results (the default limit) instead of all symbols. Fix: override effective_limit to 10000 when mode=summary so aggregation covers representative sample. Pagination: when has_more=true, add pagination_hint field: "Use offset:50 and limit:50 for next page (13818 total)" This guides LLMs to use offset/limit for progressive exploration. Verified on RTK codebase (45,388 symbols): - Summary mode: 1,317 bytes with accurate label counts - Default search: pagination_hint present when has_more=true - All 2064 tests pass
…branch The TEST_DEPINDEX_SRCS and suite_depindex belong on the reference-api-indexing branch only. Remove from this branch to fix build error (test_depindex.c not present here).
All token reduction defaults are now configurable at runtime via the config system (cbm_config_get_int). Config keys: - search_limit: default result limit for search_graph/search_code - snippet_max_lines: default max source lines for get_code_snippet - trace_max_results: default max BFS nodes for trace_call_path - query_max_output_bytes: default output cap for query_graph Tool schema descriptions no longer contain hardcoded numbers — they reference config keys instead, so changing a default won't make the description misleading. Tool descriptions now include comprehensive AI guidance: - search_graph: how to paginate (offset+limit), mode=summary for overview - query_graph: max_output_bytes=0 for unlimited, LIMIT in Cypher - get_code_snippet: mode=signature for API lookup, mode=head_tail for preserving return/cleanup, max_lines=0 for full source - trace_call_path: max_results for exhaustive traces, callees_total for truncation awareness - All tools: config key names documented for runtime override Tests: 2052 passed, 0 failed
Previous behavior: Merging reduce-token-usage (which removed depindex refs from its branch) into the combined branch dropped TEST_DEPINDEX_SRCS and suite_depindex, reducing test count from 2064 to 2052. What changed: - Makefile.cbm: re-add TEST_DEPINDEX_SRCS = tests/test_depindex.c and include $(TEST_DEPINDEX_SRCS) in ALL_TEST_SRCS - tests/test_main.c: re-add extern suite_depindex declaration and RUN_SUITE(depindex) call before integration suite Why: The merged branch must run both test suites (token_reduction + depindex). The upstream reduce-token-usage branch correctly excludes depindex (it doesn't have that feature), but the combined branch needs both. Testable: make -f Makefile.cbm test → 2064 passed, 0 failed
1. Remove misleading "Set limit=0 for no cap" from search_graph schema description — store.c maps limit=0 to 500K, not truly unlimited 2. Eliminate redundant is_summary_early variable — merge into single is_summary bool computed once before the search query 3. Add bounds-check comment for summary mode labels[64] array explaining the cap matches CBM's ~12 label types with margin 4. Replace %zu with %lu + (unsigned long) cast in query_graph truncation snprintf for portability (existing codebase avoids %zu) 5. Add include_dependencies parameter to search_graph tool schema so LLMs can discover the opt-in dependency inclusion feature 6. Remove hardcoded "default":50 from search_code JSON schema — actual default comes from config key search_limit at runtime Tests: 2064 passed, 0 failed
1. Remove misleading "Set limit=0 for no cap" from search_graph schema description — store.c maps limit=0 to 500K, not truly unlimited 2. Eliminate redundant is_summary_early variable — merge into single is_summary bool computed once before the search query 3. Add bounds-check comment for summary mode labels[64] array explaining the cap matches CBM's ~12 label types with margin 4. Replace %zu with %lu + (unsigned long) cast in query_graph truncation snprintf for portability (existing codebase avoids %zu) 5. Add include_dependencies parameter to search_graph tool schema so LLMs can discover the opt-in dependency inclusion feature 6. Remove hardcoded "default":50 from search_code JSON schema — actual default comes from config key search_limit at runtime Tests: 2064 passed, 0 failed
The include_dependencies parameter belongs to the reference-api-indexing branch only. It was accidentally introduced via cherry-pick of the code review fix. The schema declared a parameter that the handler on this branch doesn't read — a maintainer would flag this as a schema/code mismatch. Removed the include_dependencies property from the search_graph tool schema JSON. The parameter remains in the combined branch where the handler code exists. Tests: 2052 passed, 0 failed
- Token metadata comment: explain _result_bytes (byte length of inner JSON text) and _est_tokens (bytes/4, same heuristic as RTK's estimate_tokens function in tracking.rs) - Pagination hint: add comment explaining the pagination_hint field purpose (tells caller how to get next page) - Head/tail mode: document the 60/40 split rationale (60% head captures signature/setup, 40% tail captures return/cleanup; middle implementation detail is what gets omitted) Tests: 2064 passed, 0 failed
- Token metadata comment: explain _result_bytes (byte length of inner JSON text) and _est_tokens (bytes/4, same heuristic as RTK's estimate_tokens function in tracking.rs) - Pagination hint: add comment explaining the pagination_hint field purpose (tells caller how to get next page) - Head/tail mode: document the 60/40 split rationale (60% head captures signature/setup, 40% tail captures return/cleanup; middle implementation detail is what gets omitted) Tests: 2064 passed, 0 failed
Three defensive guards for out-of-memory conditions: 1. trace_call_path: calloc for seen_out/seen_in dedup arrays now gracefully degrades — if calloc returns NULL, dedup is skipped (may return duplicates) instead of NULL-dereference crash 2. build_snippet_response: head_tail combined buffer malloc is NULL-checked — on OOM, falls back to outputting head portion only instead of passing NULL to snprintf All guards are idiomatic C (if-pointer-check, no gotos). Existing tests cover the functional behavior; OOM paths are defensive safety nets for production resilience. Tests: 2052 passed, 0 failed
The include_dependencies parameter was parsed in the handler (line 776) but not declared in the TOOLS[] schema JSON. This meant LLMs could not discover the parameter from tool descriptions — it was silently accepted but undiscoverable. Added include_dependencies boolean property with description to the search_graph tool schema, matching the merged branch's schema. Tests: 2042 passed, 0 failed
OOM fixes (applied to both feature branches): 1. trace_call_path: calloc for seen_out/seen_in dedup arrays gracefully degrades on OOM — skips dedup instead of NULL-dereference crash 2. build_snippet_response: head_tail combined buffer malloc falls back to head-only output on OOM instead of NULL snprintf Documentation (notes/ folder): - notes/token-reduction-changes.md: 8 RTK-inspired strategies, config system, real-world results, mermaid architecture diagram - notes/reference-api-indexing-changes.md: 7-layer AI grounding defense, QN prefix format, deferred work, mermaid flow diagram - notes/merged-branch-changes.md: branch lineage gitGraph, combined architecture diagram, snippet mode decision flow, token reduction pipeline per tool, test coverage, merge conflict resolution Tests: 2064 passed, 0 failed
…l 4 SKILL.md files codebase-memory-reference/SKILL.md: - Update tool count from 14 to 15 (add index_dependencies) - Remove read_file/list_directory (not in TOOLS[] array) - Add "Token Reduction Parameters" section documenting mode, compact, max_lines, max_output_bytes, max_results, include_dependencies - Add config key reference for runtime overrides - Update Critical Pitfalls: search_graph defaults to 50, query_graph capped at 32KB - Add decision matrix entries for summary, signature, head_tail, dependency search codebase-memory-tracing/SKILL.md: - Add mode=signature example to Step 5 for quick API inspection - Document max_results default (25) and compact=true for token savings codebase-memory-exploring/SKILL.md: - Add mode=summary to Step 2 as alternative overview method - Update default from 10 to 50 results per page - Add compact=true and pagination_hint tips codebase-memory-quality/SKILL.md: - Add mode=summary and compact=true tips - Update pagination guidance with pagination_hint Tests: 2064 passed, 0 failed
codebase-memory-reference/SKILL.md: - Update search_graph, trace_call_path, query_graph, get_code_snippet tool descriptions with new parameters - Remove read_file/list_directory (not in TOOLS[] array) - Add "Token Reduction Parameters" section with mode, compact, max_lines, max_output_bytes, max_results documentation - Add config key reference for runtime overrides - Update Critical Pitfalls for new defaults - Add decision matrix entries for summary, signature, head_tail codebase-memory-tracing/SKILL.md: - Add mode=signature example, max_results default, compact=true tip codebase-memory-exploring/SKILL.md: - Add mode=summary to Step 2, update default to 50, add compact tip codebase-memory-quality/SKILL.md: - Add mode=summary, compact=true, pagination_hint tips Tests: 2052 passed, 0 failed
codebase-memory-reference/SKILL.md: - Update tool count from 14 to 15 (add index_dependencies) - Remove read_file/list_directory (not in TOOLS[] array) - Add include_dependencies note to search_graph description - Add decision matrix entries for dependency search and indexing Tests: 2042 passed, 0 failed
…grams Comprehensive feature matrix documenting: - Branch availability for all 13 existing + new features - Composability matrix showing how features interact when combined - Detailed interaction table with justifications for each combination - Strengths and limitations of each feature with specific measurements - AI grounding 7-layer defense failure mode analysis - Architecture diagram showing composable pipeline stages - 5 generalizable design patterns extracted from the implementation Key composability findings: - summary mode overrides limit (uses 10K for accurate aggregation) - signature mode overrides max_lines (no file I/O needed) - compact applies independently at serialization stage - include_dependencies composes with all token reduction features - _result_bytes/_est_tokens always reflects final output size Tests: 2064 passed, 0 failed
…h_code fix
New module src/depindex/ with package resolution (uv/cargo/npm/bun),
ecosystem detection, dep discovery from indexed graph, auto-index helper,
and cross-boundary edge creation stub. Dependencies stored in same db
with {project}.dep.{package} naming convention.
Pipeline changes:
- Add CBM_MODE_DEP index mode (keeps vendor/, .d.ts for dep source)
- Add cbm_pipeline_set_project_name() to override auto-derived name
- Add cbm_pipeline_set_flush_store() for upsert vs fresh dump
- Conditional dump/flush at pipeline.c:646
Store changes:
- Add project_pattern (LIKE) and project_exact fields to cbm_search_params_t
- Support LIKE queries for glob-style project filtering
- Add project-first ORDER BY for mixed project+dep results
- Stable pagination via ORDER BY name, id
MCP changes:
- Replace index_dependencies stub with full implementation
(source_paths[] primary interface, package_manager optional shortcut)
- Fix detect_session() to use cbm_project_name_from_path (Bug DeusData#12)
- REQUIRE_STORE error now includes actionable hint field
- search_code: fix -m limit exhaustion (limit*50 min 500 vs limit*3)
- search_code: add case_sensitive param (default false = case-insensitive)
DRY improvements:
- CBM_MANIFEST_FILES shared list in depindex.h used by pass_configlink.c
and dep discovery (adds pyproject.toml, setup.py, Pipfile)
- Remove package.json and composer.json from IGNORED_JSON_FILES
(needed by pass_configlink and dep auto-discovery)
Tests: 25 depindex tests (2055 total, all passing)
- Package manager parse/str roundtrip, dep naming, is_dep detection
- Ecosystem detection (python/rust/none), manifest path matching
- npm resolution with fixture, pipeline set_project_name
- MCP tool validation, AI grounding, dep reindex replaces
Merge reduce-token-usage branch into token-reduction-and-reference-indexing. Conflict in codebase-memory-reference/SKILL.md resolved by taking the reduce-token-usage version (has complete token reduction documentation).
Merge reference-api-indexing into token-reduction-and-reference-indexing. Three conflicts resolved with superset approach: 1. search_graph schema (mcp.c:272): Keep reduce-token-usage's mode/compact/ pagination_hint params AND reference-api-indexing's include_dependencies 2. search_code schema (mcp.c:349): Keep reduce-token-usage's configurable limit description AND reference-api-indexing's case_sensitive param 3. store.c ORDER BY (1866): Keep reference-api-indexing's project-first sort for mixed project+dep results (superset of stable name,id sort) SKILL.md: took reduce-token-usage version (complete token reduction docs). All 2077 tests pass (2042 base + 22 token-reduction + 13 dep-indexing).
…paths expand_project_param() (mcp.c:764-840): - "self" → session project exact match - "dep"/"deps" → session.dep prefix match - "dep.pandas" → session.dep.pandas prefix - "myapp.pandas" → myapp.dep.pandas (auto-insert .dep.) - Glob "*" → SQL LIKE with % substitution - fill_project_params() helper sets cbm_search_params_t fields search_graph result tagging (mcp.c:930-960): - Every result tagged source:"project" or source:"dependency" - Dep results get package name + read_only:true - session_project added to response for AI project name awareness - Uses cbm_is_dep_project() with session context for precision handle_index_status (mcp.c:1046-1100): - Reports dependencies[] array with package names and node counts - Reports detected_ecosystem from project root marker files - session_project in response Dep auto-reindex in all 3 re-index paths: - handle_index_repository (mcp.c:1472): cbm_dep_auto_index after dump - watcher_index_fn (main.c:86-96): cbm_dep_auto_index after dump - autoindex_thread (mcp.c:2496-2501): cbm_dep_auto_index after dump All use DRY cbm_dep_auto_index() with CBM_DEFAULT_AUTO_DEP_LIMIT cbm_mcp_server_set_session_project() added (mcp.h:128, mcp.c:526) Fix: yyjson_mut_obj_add_strcpy for dep package names from search results (heap-use-after-free when cbm_store_search_free frees borrowed strings) Fix: db_project selection when session_project is empty (integration test integ_mcp_delete_project was failing — resolve_store got NULL instead of project name after expand_project_param) Tests: 29 depindex tests (2059 total, all passing) - test_search_results_have_source_field: project results tagged - test_search_dep_results_tagged_dependency: dep results have package+read_only - test_search_response_has_session_project: session_project in response - test_index_status_shows_deps: dependencies[] in index_status response
…ep re-index Merge latest reference-api-indexing (bd09623) with gap implementations: Conflicts resolved (2 in mcp.c search_graph handler): 1. Param parsing: keep reduce-token-usage compact/summary/mode params, use fill_project_params() from reference-api-indexing for smart project 2. Result loop: keep reduce-token-usage compact mode + pagination_hint, add unconditional source/package/read_only tagging from ref-api-indexing Superset features in merged handler: - expand_project_param (self/dep/deps/glob/prefix resolution) - compact mode (omit redundant name when suffix of QN) - summary mode (aggregate counts by label/file) - pagination_hint with offset/limit guidance - session_project in all responses - source:"project"/"dependency" on every result - package + read_only:true on dep results - dep auto-reindex in all 3 paths (handler, watcher, autoindex) - index_status reports dependencies[] + detected_ecosystem All 2081 tests pass (2042 base + 22 token-reduction + 17 dep-indexing).
Root cause: handle_list_projects opens every .db file in ~/.cache/codebase-memory-mcp/ via cbm_store_open_path (which runs CREATE TABLE IF NOT EXISTS, modifying foreign databases). With 62 stale .db files (1.3GB) including a corrupt 223MB "..db" (empty project name), the server hung during Claude Code health checks. Fixes: - Add validate_cbm_db(): read-only SQLite validation with magic byte check + 'nodes' table schema check + 1s busy_timeout. Never modifies foreign databases. Logs actionable warnings on skip. - Guard detect_session() against empty/dot project names that produce the corrupt "..db" filename - Skip "..db" and ".db" filenames in handle_list_projects - Skip empty/dot project names after filename-to-name extraction - Force unbuffered stdin/stdout via setvbuf for MCP stdio protocol - Add #include <sqlite3.h> for read-only validation queries Files: src/main.c (setvbuf), src/mcp/mcp.c (validate_cbm_db, detect_session guard, list_projects guards, sqlite3.h include)
Root cause: handle_list_projects opens every .db file in ~/.cache/codebase-memory-mcp/ via cbm_store_open_path (which runs CREATE TABLE IF NOT EXISTS, modifying foreign databases). With 62 stale .db files (1.3GB) including a corrupt 223MB "..db" (empty project name), the server hung during Claude Code health checks. Fixes: - Add validate_cbm_db(): read-only SQLite validation with magic byte check + 'nodes' table schema check + 1s busy_timeout. Never modifies foreign databases. Logs actionable warnings on skip. - Guard detect_session() against empty/dot project names that produce the corrupt "..db" filename - Skip "..db" and ".db" filenames in handle_list_projects - Skip empty/dot project names after filename-to-name extraction - Force unbuffered stdin/stdout via setvbuf for MCP stdio protocol - Add #include <sqlite3.h> for read-only validation queries Files: src/main.c (setvbuf), src/mcp/mcp.c (validate_cbm_db, detect_session guard, list_projects guards, sqlite3.h include)
Gap 3 (trace boundary tagging): trace_call_path now tags each caller and callee with source:"project"|"dependency" and read_only:true for dep nodes. Uses cbm_is_dep_project() for consistent tagging. Gap 4 (snippet provenance): build_snippet_response adds source and read_only fields so get_code_snippet results indicate whether code is from the project or a dependency. Cross-edges: cbm_dep_link_cross_edges implemented — searches project Variable nodes, looks for matching Module nodes in dep projects (project.dep.%), creates IMPORTS edges to link them. Enables trace_call_path to follow imports across project/dep boundary. Gap 1 (watcher dep re-index) was already done in prior commit. Files: src/mcp/mcp.c (trace + snippet tagging), src/depindex/depindex.c (cross-edge implementation)
Gap 3 (trace boundary tagging): trace_call_path now tags each caller and callee with source:"project"|"dependency" and read_only:true for dep nodes. Uses cbm_is_dep_project() for consistent tagging. Gap 4 (snippet provenance): build_snippet_response adds source and read_only fields so get_code_snippet results indicate whether code is from the project or a dependency. Cross-edges: cbm_dep_link_cross_edges implemented — searches project Variable nodes, looks for matching Module nodes in dep projects (project.dep.%), creates IMPORTS edges to link them. Enables trace_call_path to follow imports across project/dep boundary. Gap 1 (watcher dep re-index) was already done in prior commit. Files: src/mcp/mcp.c (trace + snippet tagging), src/depindex/depindex.c (cross-edge implementation)
Config registry (CBM_CONFIG_REGISTRY in cli.c): - 25 config keys across 5 categories: Indexing, Search, Tools, PageRank, Dependencies. Each entry has key, default, env var name, category, description. - All defaults verified against code-level #define values. cbm_config_get_effective(): priority chain env > DB > default. - Checks registry for env var name, reads env first, falls back to DB. - Used by config get CLI and auto_index in maybe_auto_index. Env var overrides for key settings: - CBM_AUTO_INDEX (bool), CBM_AUTO_INDEX_LIMIT (int) - CBM_REINDEX_ON_STARTUP (bool) - CBM_KEY_FUNCTIONS_EXCLUDE (comma-separated globs) - CBM_TOOL_MODE (streamlined/classic) config list output: - Grouped by category with [Category] headers - Shows (env) when env var is active, (set) when DB value differs from default - All 25 keys visible (was: only 2) config help: - Shows storage location (~/.cache/codebase-memory-mcp/_config.db) - Priority explanation (env > config set > default) - Examples for config set and env var usage - Keys grouped by category with [env: VAR_NAME] annotation Fixed: auto_dep_limit default 5→20, dep_max_files default 5000→1000 to match code-level CBM_DEFAULT_AUTO_DEP_LIMIT and CBM_DEFAULT_DEP_MAX_FILES. Fixed: hint message provides complete commands, not fragments. Improved: dependency config descriptions explain what packages/files mean.
Root cause: pass_configlink.c allocated ~4.2MB on the stack: - config_entries[4096] × 520 bytes = 2.0MB - code_entries[8192] × 264 bytes = 2.1MB - deps[2048] × 264 bytes = 0.5MB Background threads get 512KB stack (macOS default) → SIGBUS. Fix: heap-allocate all three arrays with calloc, free on every return path. Verified: autorun repo (311 files, 6766 nodes) completes in 409ms. Also fix: main.c shutdown order — join autoindex thread BEFORE freeing watcher and watch_store. Previously watcher was freed while autoindex thread still had a reference to srv->watcher, causing use-after-free. Tested: CBM_AUTO_INDEX=true on ~/.claude/autorun — clean completion, no SIGBUS, no hang. 2201 tests pass.
MEMBER_OF edges (Method→Class): - Pipeline inserts MEMBER_OF reverse edge alongside each DEFINES_METHOD edge in both parallel (pass_parallel.c) and sequential (pass_definitions.c) paths. PageRank power iteration naturally propagates member importance to parent classes via the graph structure — no post-hoc hacks. - Config: edge_weight_member_of (default 0.5, 0=disabled) Edge weight tuning: - USAGE: 0.2→0.7 (type refs dominant in Python/JS) - DEFINES: 0.5→0.1 (structural noise) - DEFINES_METHOD: 0.8→0.5 - default_weight: 0.3→0.1 - New explicit: TESTS=0.05, WRITES=0.15, DECORATES=0.2 Result on autorun (no hacks, pure algorithm): EventContext DeusData#5, SessionStateManager DeusData#4, classes throughout top 10 Test functions dampened, structural noise reduced
…ere silently ignored
Previous behavior: search_graph accepted qn_pattern, relationship, exclude_entry_points,
include_connected, and include_dependencies in its JSON schema but never extracted or
applied them — all 5 were silently ignored. trace_call_path hardcoded edge_types=["CALLS"]
regardless of user input, and its compact default (true) disagreed with the schema (false).
include_dependencies schema default was false, opposite to the prefix-match behavior that
already included dep sub-projects by default.
What changed:
- src/mcp/mcp.c: extract qn_pattern and relationship in handle_search_graph Phase 1
(after name_pattern, before file_pattern); extract exclude_entry_points, include_connected,
include_dependencies as bools after max_degree; wire all 5 into cbm_search_params_t;
add include_dependencies=false guard: sets project_exact=true when project is set without
glob pattern, scoping results to exact project name (excludes .dep.* sub-projects);
add free(qn_pattern) and free(relationship) to cleanup block
- src/mcp/mcp.c: replace hardcoded edge_types[]={"CALLS"} in handle_trace_call_path with
user-supplied edge_types array extracted after all three early-return guards (lines 2062,
2069, 2086) to avoid memory leaks on those paths; use free_string_array() for cleanup;
fix compact default from false to true (matches schema); fix include_dependencies schema
default from false to true with updated description
- src/store/store.c: add qn_pattern REGEXP/iregexp dual-branch WHERE clause after
name_pattern block (same pattern as name_pattern at lines 1835-1844); add relationship
EXISTS filter using local rel_cond[128] (exceeds bind_buf[64]) with both edge directions
(source OR target); merge exclude_entry_points "in_deg > 0" condition into the existing
degree-filter subquery block to avoid double subquery nesting; fix has_degree_wrap to
include exclude_entry_points so ORDER BY uses bare column names in the outer wrapped query
- tests/test_token_reduction.c: add setup_sp_server() fixture (4 nodes: main, process_request,
fetch_data, dep_helper; 2 edges: CALLS main->process_request, HTTP_CALLS fetch_data->process_request);
add 12 new parameterization accuracy tests in token_reduction suite covering qn_pattern filter,
relationship filter, exclude_entry_points, include_dependencies=true/false, compact default,
edge_types traversal
Why: parameters declared in the MCP schema but not implemented silently accept user input
and return wrong results — AI agents and users passing these params get misleading output.
The include_dependencies schema default disagreed with actual behavior. The trace edge_types
hardcoding prevented traversal of non-CALLS relationships (HTTP_CALLS, IMPORTS, etc.).
Testable: make -f Makefile.cbm test (2213 passed, 0 failed)
search_graph '{"qn_pattern":".*handlers.*","project":"sp-test"}' returns only handlers
search_graph '{"relationship":"HTTP_CALLS","project":"sp-test"}' returns nodes with HTTP edges
search_graph '{"exclude_entry_points":true}' removes nodes with in_deg=0 (CALLS)
search_graph '{"include_dependencies":false,"project":"myapp"}' excludes myapp.dep.* nodes
trace_call_path '{"function_name":"f","edge_types":["HTTP_CALLS"]}' follows HTTP edges
…token efficiency search_graph compact: enumerate all omitted fields explicitly (name, empty label/file_path, zero degrees) with concrete example and absent-field defaults, replacing ambiguous "Absent:" footnote that didn't connect omission to compact. search_graph include_dependencies: remove redundant "Default: true" restatement (already in schema) and duplicate "dep sub-projects" mention. trace_call_path compact: add missing omission condition (name == qualified_name last segment) and example, replacing unexplained "redundant" jargon. query_graph max_rows: tighten prose without losing the "default: unlimited" fact (absent from schema) or the scanned-vs-returned distinction. search_code case_sensitive: consolidate into single clause "Match case-sensitively (default: case-insensitive)." Also includes (from prior commits on this branch): - search_graph: omit empty label/file_path fields instead of emitting "" - search_graph: omit zero in_degree/out_degree instead of emitting 0 - trace_call_path candidates: omit empty file_path instead of emitting ""
Replace hardcoded /Users/martinvogel path (and intermediate ~ which MCP clients don't expand) with sh -c "exec \$HOME/.local/bin/..." so the shell expands \$HOME at launch time on any machine.
…on to 17 managers, improve compact output Memory leaks fixed (0 leaks confirmed via leaks --atExit): - mcp.c resolve_store: cbm_project_free_fields was gated on proj.root_path[0] — empty string paths silently skipped free. Separated free from the watcher call; now always frees after successful cbm_store_get_project. - mcp.c handle_index_status: cbm_store_search_free skipped when dep_out.count==0 — cbm_store_search allocates even for empty results. Restructured to free whenever search succeeds. Same fix for cbm_project_free_fields call in ecosystem detection path. - pagerank.c: node_labels leaked on two early return paths (N==0 and id_map_init failure). Both paths now free node_ids and node_labels (with per-element free for strdup'd entries before the N==0 branch assigns any). - pass_envscan.c: 8 static regexes compiled once by compile_patterns() were never freed. Added cbm_envscan_free_patterns() that calls cbm_regfree on each and resets patterns_compiled=0. - pipeline.h/pipeline.c: public cbm_pipeline_global_cleanup() wraps cbm_envscan_free_patterns(). Called in main.c after ALL server threads joined (HTTP + stdio) to avoid racing with autoindex threads. Also called in run_cli() path and test_pipeline.c teardown. Ecosystem detection expanded from 8 to 17 package managers: - depindex.h: added CBM_PKG_MAKE, CBM_PKG_CMAKE, CBM_PKG_MESON, CBM_PKG_CONAN (C/C++ build systems). Expanded CBM_MANIFEST_FILES with build.gradle.kts, bun.lockb, global.json, Directory.Build.props, NuGet.Config, Makefile, GNUmakefile, Makefile.cbm, CMakeLists.txt, meson.build, conanfile.txt, conanfile.py, vcpkg.json. - depindex.c: rewrote cbm_detect_ecosystem to cover all 17 managers using CHECK() macro for exact filename matches and dir_contains_suffix() for wildcard patterns (*.csproj, *.fsproj). Added has_vendored_deps_dir() helper. Added discover_vendored_deps() which scans vendor/ vendored/ third_party/ thirdparty/ deps/ external/ ext/ contrib/ lib/ _vendor/ submodules/ for C/C++ and CBM_PKG_CUSTOM build systems. dep search hint in handle_search_graph: - When a dep project search (project:"dep", expanded to prefix "<session>.dep") returns 0 results, emits a "hint" field with an ecosystem-aware actionable message. If cbm_detect_ecosystem succeeds, the hint names the detected build system and instructs to re-run index_repository. If no ecosystem detected, lists all 17 supported manifest file types. Compact output improvements in mcp.c: - handle_search_graph: skip emitting "name" when it equals the last segment of qualified_name (ends_with_segment check) or when empty. - handle_trace_call_path: same fix for both outbound (callees) and inbound (callers) node arrays. Added callers_total emission to match callees_total (was documented in tool description but never emitted). - build_snippet_response: skip empty name, label, and file_path fields. Compact param now wired through all six call sites in handle_get_code. Zero-value numeric fields skipped in compact mode. - handle_get_architecture / build_resource_architecture: skip redundant name (when equals last qualified_name segment) and empty label/fp in key_functions arrays. Test coverage: - test_token_reduction.c: 504-line new file covering compact suppression of redundant name/label/empty fields, callers_total presence, get_code compact param propagation, architecture key_functions, and dep search hint emission. - test_mcp.c, test_pipeline.c: minor additions for new behaviors. Makefile.cbm: - Added nosan build (CFLAGS_NOSAN, LDFLAGS_NOSAN, MONGOOSE_CFLAGS_NOSAN, per-object NOSAN variants for sqlite3/lsp/grammar/ts_runtime/mongoose). - Added test-leak target: macOS uses leaks --atExit on test-runner-nosan; Linux uses ASAN_OPTIONS=detect_leaks=1 on regular test-runner. - Added test-analyze target: Clang --analyze on production + test sources (skipped with message when IS_GCC=yes). - Updated .PHONY with test-leak, test-analyze, test-runner-nosan.
CLAUDE.md (new): project-level developer notes for Claude with concrete
commands for make test, make test-leak, make test-analyze, and explanation
of why macOS requires test-runner-nosan (ASan replaces malloc, blocking
leaks --atExit from walking the heap).
CONTRIBUTING.md: added "Run C Server Tests" section after the Go test
section. Covers make -f Makefile.cbm test/test-leak/test-analyze, the
macOS vs Linux difference in leak detection approach, and the expected
clean-run output ("0 leaks for 0 total leaked bytes").
Makefile.cbm HOW TO USE block (committed previously) already documents
the commands inline — these docs surface the same info for contributors
who read CONTRIBUTING.md first.
…te, pass_normalize, 11 TDD tests Phases 1-8 from comprehensive plan (notes/2026-03-26-0013-plan-*.md): Phase 1 — Input validation (F1,F4,F6,F7,F9,F10,F15): mcp.c: empty label→NULL, limit≤0→default, sort_by/mode enum validation, regex pre-validation via cbm_regcomp, depth clamp, direction validation Phase 2 — B7 Cypher param fix + CQ-2 project expansion: mcp.c:handle_query_graph reads "cypher" first with "query" fallback, uses resolve_project_store for "self"/"dep"/path shortcuts Phase 3 — DRY resolve_project_store in 5 handlers: handle_get_graph_schema, handle_index_status, handle_get_architecture, handle_get_code_snippet, handle_index_dependencies Phase 4 — DF-1 degree precompute (100× faster queries): store.c: node_degree table DDL, search SELECT uses LEFT JOIN with HC-6 fallback to edge COUNT, cbm_store_node_degree reads precomputed table, arch_hotspots uses nd.calls_in, arch_boundaries adds behavioral types pagerank.c: is_calls field, degree accumulation during edge iteration, node_degree batch INSERT after LinkRank, OOM-safe allocations Phase 5 — B2/B5 name-based caller fallback: pass_calls.c: 3-step resolution (exact QN → shared helper → Module) graph_buffer.c: cbm_gbuf_resolve_by_name_in_file DRY helper (HC-1) Phase 6 — B17/B13 class-method edge repair: NEW pass_normalize.c: enforces I2 (Method→Class) and I3 (Field→Class) invariants via QN prefix + name+label+file fallback. O(M+F) runtime. pipeline.c: normalize pass before dump. Makefile.cbm updated. Phase 7 — CBMLangSpec section_node_types field: lang_specs.h: added section_node_types (17th field) lang_specs.c: all 64 language specs updated with NULL initializer Phase 8 — IX-1..3 indexing pathway fixes: mcp.c: autoindex_failed + just_autoindexed flags in server struct, REQUIRE_STORE captures pipeline return code, build_resource_status shows "indexing" state + failure detail + action_required hints Additional fixes: G1: summary mode adds results=[] + results_suppressed=true CQ-3: Cypher + filter params produces warning Tests: 2238 pass (11 new in test_input_validation.c covering F1,F6,F9, F10,F15 edge cases, G1, CQ-3, IX-2). Updated test_store_nodes.c for total degree. Updated test_token_reduction.c for G1 results key.
.clangd: mirrors Makefile.cbm CFLAGS_COMMON include paths so clangd resolves headers without compile_commands.json. .gitignore: add .worktrees/, session_project, project, conductor/, with — runtime/session artifacts from Claude Code subagents.
…/3 indexing status Phase 3 — DRY project resolution in 5 handlers: handle_get_graph_schema, handle_index_status, handle_get_architecture, handle_get_code_snippet: resolve_store → resolve_project_store handle_index_dependencies: expand raw_project before resolve_store Forward declaration added for resolve_project_store (needed by handle_get_graph_schema which precedes the definition) Phase 8 — Indexing pathway status state machine: IX-1: autoindex_failed flag in server struct. REQUIRE_STORE captures pipeline_run return code — on failure sets flag + logs error. Error response includes "auto-indexing failed" with detail and fix hint. IX-2: build_resource_status checks autoindex_active → "indexing" state with timing hint. Not-indexed path shows failure detail or action_required. Empty store path shows hint about no recognized source files. IX-3: just_autoindexed flag set on successful auto-index in REQUIRE_STORE. All 2238 tests pass. Installed to ~/.local/bin/.
search_code_graph: add auto-index on first query, cypher filter ignore note, summary mode results_suppressed behavior. trace_call_path: add auto-index, depth<1 clamped to 1, invalid direction returns error. get_code: add Module metadata-only note with auto_resolve hint. codebase://status resource: add indexing state, project name field, action_required hint, auto-index failure detail. _hidden_tools: add auto-index note, list all 4 status states. All 2238 tests pass. Installed to ~/.local/bin/.
…totune Fixes codebase://architecture returning only 10 results all from graph-ui by wiring hardcoded limits through the config system and raising defaults to 25. Key changes: - mcp.c: add key_functions_count config (default 25); wire into build_key_functions_sql (was hardcoded LIMIT 10 at line 4317) and build_resource_architecture call site - mcp.c: add arch_hotspot_limit config (default 25); wire into classic get_architecture tool handler - store.c/store.h: raise CBM_ARCH_HOTSPOT_DEFAULT_LIMIT 10->25; add hotspot_limit param to cbm_store_get_architecture - store.c/store.h: add sort_by=calls (ORDER BY calls_in+calls_out DESC) and sort_by=linkrank (ORDER BY linkrank_in DESC) dispatch cases; add degree_mode config (weighted|unweighted|calls_only) for min_degree/max_degree filter column selection - watcher.c/watcher.h: add poll_base_ms/poll_max_ms to struct cbm_watcher; change cbm_watcher_run and cbm_watcher_poll_interval_ms signatures to accept base_ms/max_ms params (0=defaults); wire watcher_poll_base_ms and watcher_poll_max_ms config keys through main.c - cli.h: extend cbm_config_entry_t with range and guidance fields (5->7) - cli.c: replace entire CBM_CONFIG_REGISTRY with 7-field entries for all 32 config keys with broadest feasible ranges and actionable guidance strings; update config list/get/help display to print [range] + guidance per entry - scripts/autotune.py: new standalone Python 3.9+ script that sends JSON-RPC directly to the binary via stdin/stdout, tries 7 experiments, scores against expected top-10 ground truth for 3 repos, resets config on exit - tests: update all callers of cbm_store_get_architecture (pass 0 for hotspot_limit) and cbm_watcher_poll_interval_ms (pass 0,0 for defaults) All 2238 tests pass.
…ults, CLI params Previous behavior: autotune set config keys but never triggered PageRank recompute between experiments — all experiments read stale stored scores, producing identical results. The binary also got SIGKILL'd on macOS 25+ due to invalidated ad-hoc signature after `cp` during install. What changed: - scripts/autotune.py: replace query_architecture() (async REQUIRE_STORE reindex) with index_and_query_architecture() — opens one persistent stdio MCP session per repo per experiment sending 3 sequential messages: initialize → tools/call index_repository (synchronous, blocks until full pipeline+PageRank completes with current edge weights) → resources/read codebase://architecture - scripts/autotune.py: add project_name_from_path() mirroring cbm_project_name_from_path() from src/pipeline/fqn.c, and delete_project_db() to remove stale DBs - scripts/autotune.py: add _send_batch() env+cwd params; pass CBM_TOOL_MODE=classic so index_repository tool is available in MCP session - scripts/autotune.py: add --top-matches (default 10) and --key-count (default 25) CLI params; show matched expected names + top-N per repo in output - scripts/autotune.py: default timeout 60s → 1200s (indexing takes ~40s per repo) - scripts/autotune.py: add exclude_ui_tests experiment; rename calls_boost_excl → calls_boost_excl_tests with tests/** added to exclude list - scripts/autotune.py: save every run to scripts/autotune_results.json (appended, with timestamp/binary/repos/experiments/best fields) - scripts/autotune.py: show progress bar (█/░) and ◀ BEST marker in final report - .gitignore: add scripts/autotune_results.json (generated artifact, not tracked) Why: edge weights and PageRank iterations are only applied at index time via cbm_pagerank_compute_with_config(); querying a DB indexed with old weights produces wrong rankings regardless of config changes. Full reindex per experiment is required. Also fixes macOS 25+ SIGKILL by rebuilding binary (Makefile.cbm re-signs with codesign --force --sign - after install). First run result: calls_boost_excl_tests scores 6/30 (best), baseline 0/30. Testable: python3 scripts/autotune.py
…l_tests) Previous defaults: edge_weight_calls=1.0, edge_weight_usage=0.7, key_functions_exclude="" (no exclusions). What changed: - scripts/autotune.py DEFAULTS: edge_weight_calls 1.0 → 2.0 (call edges are the strongest signal for production importance) - scripts/autotune.py DEFAULTS: edge_weight_usage 0.7 → 0.3 (type-reference edges add noise, dampening improves ranking signal) - scripts/autotune.py DEFAULTS: key_functions_exclude "" → "graph-ui/**, tools/**,scripts/**,tests/**" (excluding non-production paths surfaces core library functions instead of test helpers) Why: autotune run on 2026-03-26 scored calls_boost_excl_tests at 6/30 across 3 repos (codebase-memory-mcp, autorun, rtk), best of 8 experiments. Baseline scored 0/30. These defaults are now the baseline that experiments diverge from, so future autotune runs search the config space around the current best. Testable: python3 scripts/autotune.py (baseline_25 now starts from these values)
|
Hey, thanks for the commit! Looks cool, but evaluating and checking will take a bit of time as this is quite a lot. Will give you feedback as soon as I can :) |
|
@DeusData yeah it is quite a bit haha, i appreciate the patience! also i edited the summary a bit to be clearer |
|
@ahundt would it maybe be possible that you split this up into seperate PRs? Then its easier to review and we can discuss on the individual things you have implemented. That would be awesome! |
|
I also see that some of the issues were already addressed. Maybe you can rebase here with latest main and check what might be already here |
|
Thanks @ahundt — ambitious PR! Note: we have a strict no- |
|
@ahundt — appreciate the energy and the thorough write-up. A few things: On PageRank — same feedback as on #147: PageRank on a call graph would rank On the API consolidation (3 tools instead of 15) — this is a major UX decision that changes how every user interacts with the tool. Reducing to 3 tools looks clean on paper, but it front-loads complexity into each tool (your On the bug fixes — the 5 silently-ignored parameters and the memory leaks sound like genuine issues worth addressing. These are the kind of contributions I'd love to merge. Could you split those into a focused PR? Same for the input validation tests. On scope — as mentioned earlier, this needs to be split up. A PR that touches I'd suggest: (1) bug fixes + leak fixes, (2) input validation tests, (3) config registry, (4) dependency indexing, (5) API consolidation — each as separate PRs rebased on latest main. |
|
@ahundt — one more note: I've updated CONTRIBUTING.md with clearer guidelines. Going forward, feature work requires opening an issue first to discuss the approach before implementation. This is especially important for changes that affect the API surface, pipeline, or project defaults. The bug fixes you identified (silently-ignored params, memory leaks) are genuinely valuable — I'd happily review those as standalone PRs. But the architectural changes (PageRank, API consolidation, config registry, dependency indexing) each need their own design discussion in an issue before code is written. |
Thanks for building codebase-memory-mcp! I really like how it works and the performance, plus i saw some opportunities to make it even better, so I supercharged it!
What This PR Proposes
This PR brings together a number of development tracks I've been working on:
1. a streamlined & modernized MCP tool surface,
2. automatic indexing on first call,
3. PageRank-based result ranking,
4. dependency source indexing,
5. token reduction strategies,
6. a full config registry, and
7. autotuning of algorithm performance to get better results.
It also fixes several bugs where search parameters were accepted by the API but silently never applied to queries, and i fixed a bunch of memory leaks. I'd appreciate feedback on any of these areas!
Built on peer reviewed research
I built these changes upon several peer reviewed papers and other repos that show empirically strong performance:
Here is a few iterations of ai generated description about what changes, and yes I've actually been running and using and testing the code, there is a whole test suite which is a good portion of the changes:
A Cleaner Tool Surface (
src/mcp/mcp.c)Why This Matters
The classic API exposed 15 tools by default. When an AI client receives a large tool list, it has to reason about which tool applies before doing any actual work, and the likelihood of choosing a suboptimal tool grows with list size. Presenting three well-chosen tools lowers that overhead and makes the common workflows (find code, trace a call chain, read a symbol) straightforward to discover and use. The tradeoff is worth it: the three streamlined tools cover the large majority of real usage, and the full classic set is one config change away, which is the best practice known as "progressive disclosure".
The Streamlined API (default)
Three tools are exposed by default (
tool_mode=streamlined):search_code_graphsearches the knowledge graph for functions, classes, routes, and other symbols, sorted by PageRank by default. It also accepts Cypher queries for multi-hop patterns.projectself,dep/deps,dep.{name}, or globcypherlabelname_patternqn_patternfile_patternrelationshipexclude_entry_pointsinclude_dependenciesexcludesort_byrelevance,name,degree,calls,linkrankmodefull(default) orsummary(counts only, no individual results)compactlimitsearch_limit)offsetmin_degree/max_degreemax_output_bytestrace_call_pathtraces who calls a function and what it calls, using BFS over the graph.function_nameprojectdirectioninbound,outbound, orbothdepthmax_resultstrace_max_results)edge_types["CALLS", "HTTP_CALLS"])excludecompactget_codereturns source code for a symbol by qualified name.qualified_nameprojectmodefull(default),signature(declaration only),head_tail(first 60% + last 40%)max_linessnippet_max_lines)auto_resolveinclude_neighbors_hidden_toolsis a discovery tool that describes the full classic API and explains how to unlock it. It takes no parameters and returns a description of all 15 classic tools, how to enable them individually (config set tool_<name> true) or globally (config set tool_mode classic). This allows AI clients operating in streamlined mode to find out what else exists without being presented with the full list upfront.Modern MCP: Resources
This PR adds three MCP resources, following the modern MCP spec (
resources/list+resources/read). Resources are the right mechanism for ambient context that clients can read passively without invoking a tool call, which preserves tool slots for active operations.codebase://architecture: highest-importance functions by PageRank score, hotspot files, graph statisticscodebase://status: current index state (ready / indexing / not_indexed / empty)codebase://schema: node labels, edge types, and example Cypher queriesBackwards Compatibility
The classic API is fully preserved. Nothing is removed. Clients can restore the full 15-tool list with a single setting:
Individual classic tools can be enabled selectively (
config set tool_search_graph true). A_hidden_toolspseudo-tool in the streamlined list explains this to AI clients, so they can discover what exists and unlock what they need without breaking existing integrations.Every tool response now appends
_result_bytesand_est_tokensso clients can gauge context cost.Ranked Results: PageRank + LinkRank (
src/pagerank/pagerank.c,pagerank.h, new)Search results previously came back in arbitrary SQLite order, so core utilities and test helpers ranked equally. I've added a new
src/pagerank/module that runs power-iteration PageRank (Google, d=0.85) over the graph after indexing and stores scores in apagerankcolumn on the nodes table. Results fromsearch_code_graphandcodebase://architecturesort by this score by default.Edge types carry configurable weights: call edges are weighted more heavily than type-reference edges, which matter more than test edges. Thirteen edge weight keys are exposed in the config registry (
edge_weight_calls,edge_weight_usage, etc.). The defaults were tuned usingscripts/autotune.py, which runs full reindex cycles against known codebases and scores results against a ground-truth function list.Dependency Source Indexing (
src/depindex/depindex.c,depindex.h, new)I've added a new
index_dependenciestool that indexes library source code as searchable sub-projects named{project}.dep.{package}. It accepts a package manager name (uv, cargo, npm, and 10 others) and resolves installed source paths automatically. Results are taggedsource: "dependency"so project code and library code stay distinguishable. Theinclude_dependenciesparameter onsearch_code_graphcontrols whether dependency nodes appear in results.Token Reduction (
src/mcp/mcp.c)Several additions reduce how much context search and trace results consume:
search_limit.compact=trueomits fields that are redundant given other fields (e.g.namewhen it's the last segment ofqualified_name, and zero-value degree counts).mode=summaryreturns label and file counts rather than individual nodes, useful for getting a sense of a module's contents.mode=signatureon source retrieval returns the function signature only.mode=head_tailreturns the first and last portions of the body.max_linesdefaults to 200.has_moreandnext_offset.Search Parameters That Were Silently Ignored (bug fixes,
src/mcp/mcp.c,src/store/store.c)I found five parameters that were being extracted from incoming requests but never passed to the query layer. The server accepted them without error and returned results, giving no indication the parameters had no effect:
qn_pattern: regex filter onqualified_namerelationship: filter to nodes connected by a specific edge typeexclude_entry_points: filter out nodes with no inbound edgesinclude_dependencies: scope results to project-only or include dep sub-projectsedge_typesintrace_call_path: restrict BFS traversal to specific edge typesAll five are now wired through to the store layer. Results for queries using these parameters will differ from before.
Config Registry (
src/cli/cli.c,cli.h)I've replaced hardcoded
#defineconstants with a typed config registry of 33 keys, settable viaconfig setorCBM_*environment variables. Groups: Indexing, Search, Ranking, Watcher, Dependencies. Each key has a description and valid range hint.Notable keys:
search_limit,trace_max_results,key_functions_count,key_functions_exclude,pagerank_max_iter, all 13edge_weight_*keys,auto_index,tool_mode,snippet_max_lines,query_max_output_bytes.Reliability Fixes
Heap leaks (
src/mcp/mcp.c,src/depindex/,src/pipeline/,src/watcher/): Over 200 leaks identified under macOSleaks --atExitand Linux LSan, now fixed. I've addedmake test-leakandmake test-analyze(Clang static analyzer) targets to make these checks easy to run.Graph normalization pass (
src/pipeline/pass_normalize.c, new): Runs after extraction, before the SQLite dump. Enforces that every Method node has a parent Class viaDEFINES_METHOD/MEMBER_OFedges and every Field is attached to its containing type. Extraction gaps previously left orphaned nodes that confused traversal queries.Watcher (
src/watcher/watcher.c): Now tracks all projects accessed in a session, not just the startup project. Uses an adaptive poll interval (1s to 30s). Fixed a SIGBUS crash caused by a stack overflow in the background thread.macOS 25+ code signing (
Makefile.cbm): Addedcodesign --force --sign -aftermake cbmandmake install. Without this, macOS kills the binary after acpinvalidates the ad-hoc signature.New Test Files
tests/test_tool_consolidation.c_hidden_tools, all 3 streamlined tool schemastests/test_token_reduction.ctests/test_depindex.cinclude_dependenciesfiltertests/test_pagerank.ctests/test_input_validation.cThe full suite runs under ASan + UBSan.
make test-leakruns the leak check separately.Key Files Changed
src/mcp/mcp.csrc/pagerank/pagerank.c+.hsrc/depindex/depindex.c+.hsrc/cli/cli.c+.hsrc/store/store.c+.hcbm_search_params_tfields, degree precompute columns, stable paginationsrc/pipeline/pass_normalize.cscripts/autotune.pyMakefile.cbmBreaking Changes
CBM_TOOL_MODE=classic.limitor raisesearch_limitconfig to get more.query_max_output_bytesor addLIMITin Cypher if needed.Additional useful list of repos:
https://github.com/YerbaPage/Awesome-Repo-Level-Code-Generation