feat: add PageRank ranking, architecture summary, and token-budgeted responses #147

Open
maplenk wants to merge 6 commits into DeusData:main from maplenk:feat/pagerank-arch-summary


maplenk commented Mar 26, 2026

Summary

Adds structural importance ranking (PageRank), a one-call architecture overview tool, and token-budgeted responses to prevent context window overflow.

New tools

  • get_architecture_summary — Structured markdown overview of the project: top files by connectivity, route→controller→service chains, Louvain clusters, high fan-in functions, entry points. Supports max_tokens for output size control and focus for narrowing to a specific area.

  • get_key_symbols — Returns top-K functions/classes ranked by PageRank. Enables "what are the most important functions in this codebase?" queries.

Enhanced tools

  • search_graph — New ranked parameter (default true). When enabled, results are sorted by PageRank score. The PageRank score is included in the response JSON.

  • trace_call_path — New ranked parameter. BFS results post-sorted by PageRank when enabled.

  • search_graph, trace_call_path, query_graph — New max_tokens parameter. Two-tier truncation: top 5 results in full detail, remainder as compact signatures. Emits truncated, total_results, shown metadata.
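
The two-tier truncation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: result_t, budget_meta_t, apply_token_budget, and the 4-characters-per-token estimator are all hypothetical, and dropping results that still do not fit after compaction is an assumed policy. Only the "top 5 in full, remainder compact" split and the truncated / total_results / shown fields come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FULL_DETAIL_COUNT 5 /* "top 5 results in full detail" per the PR text */

typedef struct {
    const char *full;    /* full-detail rendering of one result */
    const char *compact; /* compact signature for the same result */
} result_t;

typedef struct {
    int truncated;        /* 1 if anything was shortened or omitted */
    size_t total_results; /* number of results before truncation */
    size_t shown;         /* number of results emitted in any form */
} budget_meta_t;

/* Hypothetical estimator: assumes roughly 4 characters per token. */
static size_t estimate_tokens(const char *s) { return (strlen(s) + 3) / 4; }

/*
 * Decides how each result is rendered under max_tokens.
 * mode[i] is set to 2 (full), 1 (compact), or 0 (omitted).
 */
budget_meta_t apply_token_budget(const result_t *r, size_t n,
                                 size_t max_tokens, int *mode)
{
    assert((r != NULL && mode != NULL) || n == 0);
    budget_meta_t meta = { 0, n, n };
    size_t total = 0;

    /* Build-then-check: cost the full response first. */
    for (size_t i = 0; i < n; i++) {
        mode[i] = 2;
        total += estimate_tokens(r[i].full);
    }
    if (total <= max_tokens)
        return meta; /* happy path: nothing to truncate */

    /* Over budget: first FULL_DETAIL_COUNT stay full, the rest go compact. */
    meta.truncated = 1;
    meta.shown = 0;
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        int full = i < FULL_DETAIL_COUNT;
        size_t cost = estimate_tokens(full ? r[i].full : r[i].compact);
        if (used + cost > max_tokens) {
            /* Assumed policy: omit everything that no longer fits. */
            for (size_t j = i; j < n; j++)
                mode[j] = 0;
            break;
        }
        mode[i] = full ? 2 : 1;
        used += cost;
        meta.shown++;
    }
    return meta;
}
```

A caller would render each result according to mode[i] and attach the returned metadata to the response, so an agent can tell that results were shortened and how many exist in total.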

Implementation details

  • PageRank: standard iterative algorithm (d=0.85, 20 iterations) with dangling node handling. Persisted in node_scores table. Runs as pipeline post-processing step. Non-fatal on failure.
  • Architecture summary: SQL queries against existing graph — no new indexing. Hash table lookups for O(1) file resolution. yyjson route property extraction.
  • Token budget: build-then-check approach (zero overhead on happy path). Compact chain summaries (A → ... (3 more) → Z) for truncated traces.
  • WAL-mode fix: read-only query opens use immutable SQLite URIs (fixes corrupt DB misclassification).
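
For readers unfamiliar with the algorithm, the iterative PageRank named above (d=0.85, 20 iterations, dangling node handling) can be sketched as below. This is a minimal in-memory version under assumptions: edge_t and pagerank are hypothetical names, the graph is a plain edge array rather than the project's SQLite store, and dangling rank is redistributed uniformly, which is one common convention and may differ from what the PR does.

```c
#include <assert.h>
#include <stdlib.h>

/* One directed edge in a call graph: src calls dst (hypothetical type). */
typedef struct { int src; int dst; } edge_t;

/*
 * Iterative PageRank over n nodes and m edges.
 * d is the damping factor, iters the fixed iteration count.
 * Node ids in edges must lie in [0, n). Scores are written to
 * rank[0..n-1] and sum to 1 up to rounding.
 * Returns 0 on success, -1 on allocation failure.
 */
int pagerank(int n, const edge_t *edges, int m, double d, int iters,
             double *rank)
{
    assert(n > 0 && m >= 0 && rank != NULL);
    int *outdeg = calloc((size_t)n, sizeof *outdeg);
    double *next = malloc((size_t)n * sizeof *next);
    if (!outdeg || !next) {
        free(outdeg);
        free(next);
        return -1;
    }

    for (int e = 0; e < m; e++)
        outdeg[edges[e].src]++;
    for (int i = 0; i < n; i++)
        rank[i] = 1.0 / n;

    for (int it = 0; it < iters; it++) {
        /* Dangling nodes (no outgoing edges) spread their rank uniformly. */
        double dangling = 0.0;
        for (int i = 0; i < n; i++)
            if (outdeg[i] == 0)
                dangling += rank[i];

        /* Teleport term plus each node's share of the dangling mass. */
        double base = (1.0 - d) / n + d * dangling / n;
        for (int i = 0; i < n; i++)
            next[i] = base;

        /* Each node passes its rank in equal parts to the nodes it calls. */
        for (int e = 0; e < m; e++)
            next[edges[e].dst] += d * rank[edges[e].src] / outdeg[edges[e].src];

        for (int i = 0; i < n; i++)
            rank[i] = next[i];
    }

    free(outdeg);
    free(next);
    return 0;
}
```

Calling pagerank(n, edges, m, 0.85, 20, rank) matches the parameters quoted above. Because rank flows along call edges from caller to callee, a node that many others call accumulates the highest score.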

Tests

  • test_store_arch.c: architecture summary (basic, focus, many_files, cluster_growth)
  • test_store_search.c: PageRank computation + ranking
  • test_mcp.c: get_key_symbols, ranked search, truncation for all 3 tools
  • test_pipeline.c: PageRank in pipeline
  • test_integration.c: live index tests

Motivation

AI coding agents consume 7–38% of the context window per structural query. PageRank ranking ensures the most important results appear first. Token budgets let agents request "give me the answer in under 2000 tokens." Architecture summaries eliminate entire categories of exploratory queries: one call replaces 3–5 tool invocations.

Benchmarked on a 32K-node / 70K-edge production Laravel codebase.


Part 1 of a 4-PR series. PRs 2–4 build on this foundation.


Built with OpenAI Codex and Claude Code.

maplenk added a commit to maplenk/codebase-memory-mcp that referenced this pull request Mar 26, 2026
All install paths, download URLs, self-update checks, CI workflows,
and documentation now reference maplenk/codebase-memory-mcp so the
fork can operate independently with its own releases while upstream
PRs (DeusData#147-DeusData#150) are pending. Upstream attribution in README fork
section and LICENSE preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@DeusData added the enhancement (New feature or request) label Mar 26, 2026
@DeusData
Owner

Thanks @maplenk — PageRank for code ranking and architecture summaries is a great idea. Large PR — will review carefully.

Naman Khator and others added 6 commits March 27, 2026 18:10
Account for optional signatures in the search_graph and trace_call_path size estimators, and improve compact trace chains to report omitted-node counts.

This also documents the normal-path output enrichment introduced with Task 4: search_graph results now include file_path, start_line, end_line, and signature, and trace_call_path hop items now include file_path, start_line, and signature.
- Guard cbm_mcp_text_result() against NULL text
- Fix memory leak in handle_get_key_symbols() REQUIRE_STORE path (focus not freed)
- Wire qn_pattern through handle_search_graph()
- Fix OOM infinite loop in markdown_builder_reserve()
- Return 0 instead of CBM_STORE_ERR from summary_count_nodes() on prepare fail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@maplenk force-pushed the feat/pagerank-arch-summary branch from 1e02f10 to f3e93e7 on March 27, 2026 12:46
@DeusData
Owner

Thanks for the effort here, @maplenk. I want to give honest feedback on the core premise before we go further.

PageRank is the wrong algorithm for code graphs. PageRank measures "if you randomly follow edges, where do you end up?" On the web, being linked-to is an editorial signal. In a call graph, being called by many things means you're a leaf utility — log.Error(), fmt.Sprintf(), strings.Contains(). These would rank highest, which is the opposite of architecturally important code. Handlers, orchestrators, and pipeline stages — the code that actually matters — typically have few callers but many callees. PageRank would rank them low.

We already expose min_degree/max_degree on search_graph, which gives you direct fan-in/fan-out filtering with zero computational overhead. That covers the "find heavily-connected code" use case without the conceptual mismatch.
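
To make the fan-in/fan-out distinction concrete, here is a minimal sketch of degree counting over a list of call edges. The names call_edge_t, count_degrees, and degree_in_range are hypothetical, and how search_graph actually defines or evaluates min_degree/max_degree is not shown in this thread; the sketch only assumes that degree means the number of incident call edges.

```c
#include <assert.h>
#include <string.h>

/* One call edge: caller invokes callee (hypothetical type). */
typedef struct { int caller; int callee; } call_edge_t;

/*
 * Counts callers (fan-in) and callees (fan-out) for each of n nodes
 * in a single pass over m edges. Node ids must lie in [0, n).
 */
void count_degrees(int n, const call_edge_t *edges, int m,
                   int *fan_in, int *fan_out)
{
    assert(n > 0 && m >= 0 && fan_in != NULL && fan_out != NULL);
    memset(fan_in, 0, (size_t)n * sizeof *fan_in);
    memset(fan_out, 0, (size_t)n * sizeof *fan_out);
    for (int e = 0; e < m; e++) {
        fan_out[edges[e].caller]++;
        fan_in[edges[e].callee]++;
    }
}

/* Returns 1 if degree lies in the inclusive range [min_degree, max_degree]. */
int degree_in_range(int degree, int min_degree, int max_degree)
{
    return degree >= min_degree && degree <= max_degree;
}
```

With counts like these, a leaf utility shows up as high fan-in and low fan-out, while a handler or orchestrator shows the reverse, so the two can be separated by filtering on each count independently.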

The architecture summary and token-budget features are separate ideas worth discussing on their own merits — but they're bundled here with PageRank as the foundation, which makes it hard to evaluate them independently. Could you split those into standalone PRs?

Also noting: this PR modifies store.c (+1,587 lines) and mcp.c (+944 lines), which are core files. Changes of that magnitude to the store and MCP layers need very careful review, especially since this is part 1 of 4 — I need to understand the full scope before committing to a direction.

@maplenk
Author

maplenk commented Mar 27, 2026

Hey @DeusData
Thanks for the details.

Will split the other features first and check on the PageRank algorithm as well!

2 participants