codebase-mcp

Local-first MCP toolkit for fast code search, dependency-aware module discovery, visual code atlas pages, and DeepWiki-style repository documentation.

Rust | MCP | tree-sitter indexing | local-first | Minish Model2Vec

Project Overview

codebase-mcp turns a local repository into a persistent MCP code intelligence service. It keeps tree-sitter indexed source data, symbols, references, dependencies, graph metadata, lexical indexes, and vector search data under the target repo's .codedb-mcp directory.

Warm MCP calls are designed to be millisecond-level inside a persistent server process. See Benchmark Snapshot and MCP Tool Benchmark Matrix for measured latency, peak memory, and rg comparisons.

Feature Overview

| Area | What It Provides |
| --- | --- |
| Fast MCP tools | Indexed exact/regex search, BM25/symbol search, lazy vector search, outlines, definitions, callers, dependencies, fuzzy file lookup, query pipelines, and 100-call bundles. |
| Module discovery | Dependency-connected file components plus dependency-weighted label propagation, with terms and paths used as explainable labels and evidence. |
| Code Module Atlas | A packaged meet-blog-style 3D viewer with one star per source file, module/file lists, dependency edges, and file focus/details. |
| DeepWiki | Local repository documentation generated from MCP evidence and the active agent's reasoning, with business-module-first pages and cited source files. |
| Local deployment | Explicit .codedb-mcp/codedb-mcp.toml, project-local storage, bundled skills, and no hidden environment-variable behavior. |

MCP Tools

The server keeps a tree-sitter indexed, project-local code database under .codedb-mcp and exposes tools for:

  • fast exact/regex search, BM25/symbol search, and lazy vector search;
  • symbol outlines and definition lookup;
  • LSP-like callers anchored to a definition path and line;
  • direct and reverse file dependencies, including transitive walks;
  • fuzzy file lookup, path globbing, compact query pipelines, and 100-call bundles;
  • graph summaries, lazy Louvain communities, module planning, atlas export, and DeepWiki evidence gathering.
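
The transitive dependency walk mentioned in the list above can be pictured as a breadth-first traversal over a file-to-file dependency map. The following is a minimal Python sketch under stated assumptions: the dictionary layout, the function names `transitive_deps` and `reverse_graph`, and the sample file names are illustrative only and do not describe the server's Rust implementation or its tool arguments.

```python
from collections import deque

def transitive_deps(graph, start, max_depth=None):
    """Return every file reachable from `start`, mapped to its hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand past the requested depth
        for dep in graph.get(node, ()):
            if dep not in seen:  # the seen-map also protects against cycles
                seen[dep] = depth + 1
                queue.append(dep)
    del seen[start]
    return seen

def reverse_graph(graph):
    """Flip edge direction so the same walk answers 'who depends on me?'."""
    rev = {}
    for src, deps in graph.items():
        for dep in deps:
            rev.setdefault(dep, []).append(src)
    return rev

# Hypothetical forward dependencies: file -> files it depends on.
forward = {
    "A.cs": ["B.cs", "C.cs"],
    "B.cs": ["D.cs"],
    "C.cs": ["D.cs"],
    "D.cs": [],
}
```

Calling `transitive_deps(forward, "A.cs")` walks forward dependencies, while `transitive_deps(reverse_graph(forward), "D.cs")` walks reverse dependencies with the same code.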

Code Module Atlas

The atlas page is generated by the skills/code-module-atlas skill. It calls the local MCP module-atlas export, converts the result into the bundled meet-blog-style 3D viewer dataset, and shows one star node per source file.

Module boundaries are computed from the dependency-connected file graph first. Inside each connected component, the Rust module planner uses dependency-weighted label propagation; paths and distinctive terms are used for names, evidence, and oversized-component splitting, not as the primary grouping rule. The page then provides a module list, a file list for the selected module, file-to-file dependency edges, and file focus/details.

node skills\code-module-atlas\scripts\build-module-atlas.mjs u3dclient
cd skills\code-module-atlas\assets\viewer
npm run dev -- --port 5174 --strictPort
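
The grouping described above (connected components first, then dependency-weighted label propagation inside each component) can be sketched in a few lines of Python. This is an illustration only: the real planner is written in Rust, and its edge weights, tie-breaking, iteration schedule, and oversized-component splitting are not specified in this README. The function names, the `(source, target, weight)` edge tuples, and the sample files are assumptions.

```python
from collections import defaultdict

def connected_components(edges, files):
    """Group files into components, ignoring edge direction."""
    adj = defaultdict(set)
    for a, b, _w in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for f in sorted(files):
        if f in seen:
            continue
        stack, comp = [f], []
        seen.add(f)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        comps.append(sorted(comp))
    return comps

def label_propagation(component, edges, rounds=10):
    """Each file adopts the neighbour label with the largest total edge weight."""
    members = set(component)
    weight = defaultdict(dict)
    for a, b, w in edges:
        if a in members and b in members:
            weight[a][b] = weight[a].get(b, 0) + w
            weight[b][a] = weight[b].get(a, 0) + w
    labels = {f: f for f in component}  # every file starts as its own module
    for _ in range(rounds):
        changed = False
        for f in component:  # fixed order keeps the sketch deterministic
            totals = defaultdict(float)
            for nb, w in weight[f].items():
                totals[labels[nb]] += w
            if not totals:
                continue
            best = max(sorted(totals), key=lambda lab: totals[lab])
            if best != labels[f] and totals[best] > totals.get(labels[f], 0.0):
                labels[f] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical files and weighted dependency edges (source, target, weight).
files = ["net/A.cs", "net/B.cs", "net/C.cs", "ui/D.cs", "ui/E.cs", "ui/F.cs",
         "tools/G.cs", "tools/H.cs"]
edges = [
    ("net/A.cs", "net/B.cs", 5), ("net/B.cs", "net/C.cs", 5), ("net/A.cs", "net/C.cs", 5),
    ("net/C.cs", "ui/D.cs", 1),
    ("ui/D.cs", "ui/E.cs", 5), ("ui/E.cs", "ui/F.cs", 5), ("ui/D.cs", "ui/F.cs", 5),
    ("tools/G.cs", "tools/H.cs", 1),
]
components = connected_components(edges, files)
labels = label_propagation(components[0], edges)
```

In this toy graph the weak `net/C.cs` to `ui/D.cs` edge keeps both groups in one connected component, and label propagation then separates the tightly coupled `net` files from the `ui` files. Paths play no role in the grouping itself, which matches the README's statement that paths and terms are used for naming and evidence rather than as the primary rule.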

DeepWiki

The skills/deepwiki skill builds local DeepWiki-style documentation from MCP evidence and the active agent's reasoning. It starts from dependency-aware module candidates, then writes business-module-first pages with cited files, entry points, flows, dependencies, and risk notes. It does not require a separate model API.

The intended distribution model is setup-guide first. Give an agent setup-for-agent.md and let it create .codedb-mcp. The agent uses the default HuggingFace cache when it already exists and falls back to a second-drive cache when it does not, then asks the human whether this specific agent should register the MCP server. The codedb-mcp skill is for using the tools after setup, not for installing them.

Benchmark Snapshot

Benchmark target: u3dclient.

Benchmarks were rerun on 2026-05-29 on Windows. Warm timings run inside one loaded MCP process after warmup. One-shot timings launch a CLI child process and include startup and cache load. Peak memory is sampled as Working Set / Private Bytes, in MB.

Current index status with the Unity C# benchmark config:

  • Indexed files: 19,035.
  • Chunks: 31,949.
  • Symbols: 277,213.
  • Graph: 19,941 nodes and 166,132 edges.
  • Vector search: Model2Vec minishlab/potion-code-16M file embeddings are built lazily on first natural-language search and queried with flat cosine scan.
  • Storage: u3dclient\.codedb-mcp.
  • Cache v20 sidecars: compact index.bin, spilled bm25.postings, lazy word_index.bin/word_hits.bin, lazy callers.bin, lazy deps.bin, optional legacy embeddings.bin, and binary source fingerprints.
  • Peak memory below is sampled Working Set / Private Bytes for child processes. The cold rebuild row was measured with memory sampling enabled, so its wall time is not directly comparable to the faster no-sampling rebuild pass.
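
A flat cosine scan, as named in the vector search bullet above, compares the query vector against every stored file vector and keeps the best matches, with no approximate-nearest-neighbour index involved. The sketch below uses plain Python lists and made-up three-dimensional vectors; the function names and file paths are hypothetical, and real Model2Vec embeddings have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def flat_scan(query_vec, file_vecs, k=5):
    """Score every file vector and return the top-k (path, score) pairs."""
    scored = [(cosine(query_vec, vec), path) for path, vec in file_vecs.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, path breaks ties
    return [(path, score) for score, path in scored[:k]]

# Made-up embeddings, one per file.
file_vecs = {
    "Net/Listener.cs": [1.0, 0.0, 0.0],
    "Pool/PoolManager.cs": [0.0, 1.0, 0.0],
    "Net/Socket.cs": [0.8, 0.6, 0.0],
}
top = flat_scan([1.0, 0.0, 0.0], file_vecs, k=2)
```

The cost grows linearly with the number of embedded files, which is the usual trade-off of a flat scan: no index to build or keep in memory, at the price of touching every vector per query.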

Index and cache baseline:

| Scenario | Time | Peak memory | Notes |
| --- | --- | --- | --- |
| Cache v20 cold rebuild | 30.258s wall | 255.8 / 249.6 MB | tree-sitter declaration parse, source-on-demand dependencies, spill-to-disk BM25, lazy embeddings, compact cache save |
| Cache-hit index open | 0.873s internal / 1.132s wall | 134.9 / 136.0 MB | process startup, source fingerprint validation, and cache load |
| codedb_index cache-hit tool call | 1.556s wall | 141.5 / 140.4 MB | explicit tool call after cache is already valid |

MCP Tool Benchmark Matrix

The table is intentionally three columns so it fits GitHub README pages without horizontal scrolling. Memory values are MB Working Set / Private Bytes.

| Tool / Purpose | MCP benchmark | rg comparison |
| --- | --- | --- |
| codedb_index: Build/rebuild local index | cold 30.258s, 255.8 / 249.6 MB; cache-hit tool 1.556s, 141.5 / 140.4 MB | none |
| codedb_status: Health, counts, scan state | one-shot 0.561s, 14.2 / 7.9 MB | none |
| codedb_tree: Indexed tree with language, lines, symbols | warm 11.891ms; one-shot 1.018s, 142.0 / 141.0 MB | partial file list only |
| codedb_outline: One-file symbol outline | warm 0.074ms; one-shot 1.279s, 140.2 / 140.3 MB | none |
| codedb_symbol: Symbol definition lookup | warm 2.106ms; one-shot 1.034s, 140.7 / 140.0 MB | regex approximates text only |
| codedb_search: Hybrid search, regex, batch queries | warm scoped regex 7.120ms; one-shot 1.097s, 142.3 / 140.5 MB | scoped rg 0.047s, MCP 6.6x faster warm; broad grep is 1.5-1.8x slower |
| codedb_word: Exact identifier inverted index | warm first lazy load 94.403ms; one-shot 1.033s, 167.3 / 172.6 MB | partial word grep only |
| codedb_callers: Definition-anchored references | warm 3.422ms; one-shot 1.309s, 168.5 / 173.0 MB | no semantic anchor |
| codedb_hot: Recently modified indexed files | warm 7.069ms; one-shot 1.454s, 141.4 / 140.5 MB | none |
| codedb_deps: Direct/reverse/transitive file deps | warm 0.098ms; one-shot 0.528s, 29.5 / 23.0 MB | none |
| codedb_read: Indexed file or line-range read | warm 0.757ms; one-shot 1.307s, 141.7 / 140.1 MB | partial file print only |
| codedb_edit: Read-only compatibility stub | one-shot 0.128s, 4.8 / 1.2 MB | none |
| codedb_changes: Files changed since sequence | warm 10.818ms; one-shot 0.871s, 144.7 / 145.8 MB | none |
| codedb_snapshot: JSON snapshot of files/symbols/deps | one-shot 2.421s, 634.0 / 715.8 MB | none |
| codedb_bundle: Up to 100 tools in one request | warm 100 fast ops 57.725ms; one-shot 20 searches 1.107s, 143.3 / 141.5 MB | no MCP batching |
| codedb_remote: Remote compatibility stub | one-shot 0.136s, 5.4 / 1.3 MB | none |
| codedb_projects: Projects loaded in server process | one-shot 0.114s, 3.8 / 1.0 MB | none |
| codedb_find: Fuzzy file/path lookup | warm 18.019-20.230ms; one-shot 0.406s, 14.1 / 7.8 MB | no fuzzy ranking |
| codedb_query: find/search/filter/limit/outline pipeline | warm 6.786-25.139ms; one-shot 1.149s, 141.6 / 140.6 MB | no equivalent single tool |
| codedb_glob: Glob over indexed paths | warm 4.231ms; one-shot 0.956s, 140.7 / 140.1 MB | rg --files -g 0.045s; MCP 10.6x faster warm |
| codedb_ls: Immediate indexed directory children | warm 4.027ms; one-shot 0.940s, 139.3 / 138.8 MB | partial file list only |
| codedb_graph: Graph summary/export | one-shot 1.988s, 389.4 / 396.8 MB | none |
| codedb_explain: Explain graph node and edges | warm first graph explain 845.369ms; one-shot 1.854s, 392.8 / 397.6 MB | none |
| codedb_path: Shortest graph path | warm after graph load 13.073ms; one-shot 1.790s, 392.6 / 397.2 MB | none |
| codedb_communities: Lazy Louvain communities | warm 265.593ms; one-shot 1.905s, 390.8 / 400.1 MB | none |
| codedb_module_map: DeepWiki module planning | warm 1.679s; one-shot 2.236s, 214.4 / 215.3 MB | none |
| codedb_module_atlas: Module/file atlas JSON export | Rust export 8.548s, 319.8 / 323.5 MB; full skill 10.870s wall, 371.8 / 369.9 MB sampled | none |
| codedb_analyze: Graph stats and suggested questions | warm graph analysis 830.637ms; one-shot 2.936s, 392.2 / 397.5 MB | none |
| codedb_export: Graph JSON/GraphML/Cypher export | warm after graph load 10.313ms; one-shot 1.963s, 390.0 / 397.0 MB | none |
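
The "x faster warm" figures in the rg comparison column follow from dividing the rg wall time by the warm MCP latency. A short check in Python, using only numbers from the matrix (the helper name `speedup` is illustrative):

```python
def speedup(baseline_seconds, mcp_milliseconds):
    """How many times faster the warm MCP call is than the baseline command."""
    return baseline_seconds * 1000.0 / mcp_milliseconds

# scoped rg (0.047s) vs warm codedb_search scoped regex (7.120ms)
search_ratio = speedup(0.047, 7.120)
# rg --files -g (0.045s) vs warm codedb_glob (4.231ms)
glob_ratio = speedup(0.045, 4.231)

print(round(search_ratio, 1), round(glob_ratio, 1))  # prints "6.6 10.6"
```

These ratios apply to warm calls only. The one-shot rows include process startup and cache load, so comparing them with rg would give a very different picture.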

Java smoke benchmark on gameserver:

| Scenario | Files | Chunks | Symbols | Time | Peak memory |
| --- | --- | --- | --- | --- | --- |
| Cold build after config/model-path change | 6,940 | 55,057 | 245,238 | 10.477s | 656.0 / 664.4 MB |
| Reopen with unchanged files/config | 6,940 | 55,057 | 245,238 | 1.027s | 129.4 / 176.4 MB |

Multi-language smoke coverage includes C#, Java, Rust, Python, Lua, TypeScript, C, and C++ parser paths: 8 files, 8 chunks, 14 symbols, 0.219s. Rust smoke check on this repository: 29 indexed files, 1,752 chunks, 1,901 symbols; codedb_outline, codedb_search, and codedb_deps all returned Rust results.

Recommended Setup Flow

  1. Give the target agent setup-for-agent.md.
  2. The agent creates <repo-root>\.codedb-mcp and <repo-root>\.codedb-mcp\models.
  3. On Windows, the agent checks the default HuggingFace hub cache first. If minishlab/potion-code-16M already has a valid snapshot there, config points to that snapshot. If the hub cache exists but the model is missing, the agent downloads to C:\Users\<user>\.cache\huggingface\hub\codedb-mcp\models\potion-code-16M. If the default hub cache does not exist, it uses the second available drive, such as D:\codedb-mcp-cache\models\potion-code-16M.
  4. The agent writes <repo-root>\.codedb-mcp\codedb-mcp.toml from the demo config, writes the model as an absolute path, and shows the human which languages are configured.
  5. The human can edit extensions, root_paths, include_paths, exclude_paths, skip_dirs, and the model path before first indexing.
  6. The agent runs an index check.
  7. The agent asks whether this specific agent should register MCP. If yes, it uses its own MCP mechanism.
  8. Restart or reload the agent MCP session and check /mcp.
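
The model-location decision in step 3 can be written as a small pure function. This is a hedged sketch of the three cases as the README describes them, not the setup guide's actual logic: the function name `resolve_model_dir`, its parameters, and the returned reason strings are hypothetical, and checking whether a snapshot is "valid" is left to the caller.

```python
from pathlib import PureWindowsPath

MODEL_NAME = "potion-code-16M"

def resolve_model_dir(home, hub_exists, cached_snapshot, second_drive):
    """Pick the model directory for the config file.

    home            -- user profile directory, e.g. C:\\Users\\<user>
    hub_exists      -- True if the default HuggingFace hub cache directory exists
    cached_snapshot -- path of a valid existing snapshot in that cache, or None
    second_drive    -- root of the second available drive, e.g. D:\\
    Returns (path, reason).
    """
    hub = PureWindowsPath(home) / ".cache" / "huggingface" / "hub"
    if hub_exists and cached_snapshot:
        # Case 1: reuse the snapshot that is already in the hub cache.
        return PureWindowsPath(cached_snapshot), "existing-snapshot"
    if hub_exists:
        # Case 2: hub cache exists but the model is missing.
        return hub / "codedb-mcp" / "models" / MODEL_NAME, "hub-download"
    # Case 3: no default hub cache, so fall back to the second drive.
    return PureWindowsPath(second_drive) / "codedb-mcp-cache" / "models" / MODEL_NAME, "second-drive"

path, reason = resolve_model_dir("C:\\Users\\me", False, None, "D:\\")
```

Whichever branch is taken, the result is an absolute path, which is what step 4 writes into codedb-mcp.toml.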

The MCP command shape is:

<package-root>\skills\codedb-mcp\assets\codebase-mcp.exe --config <repo-root>\.codedb-mcp\codedb-mcp.toml mcp <repo-root>

This project intentionally keeps installation explicit: setup prepares local project files, while the agent/user chooses when and where to register MCP.

What It Does

  • Exposes local MCP tools for code search, outlines, symbols, typed callers, dependencies, file discovery, graph analysis, DeepWiki module planning, module atlas export, batching, and exports.
  • Indexes configured source languages through one explicit config file: <repo-root>/.codedb-mcp/codedb-mcp.toml.
  • Stores generated data inside the target repo under .codedb-mcp. Delete that directory to remove local cache and generated wiki/index data.
  • Uses a unified tree-sitter parser layer, not Roslyn/JDT. C#, Java, Rust, Python, Lua, JavaScript, TypeScript/TSX, C, and C++ all emit the same FileEntry/Symbol model. C#/Java typed callers and dependencies remain the strongest path because their namespace/package import rules are implemented on top of that shared AST output.
  • Uses Minish ecosystem pieces: model2vec-rs with explicit-path minishlab/potion-code-16M, file-level semantic units, BM25 lexical ranking, exact identifier indexes, and on-demand flat-cosine vectors for natural-language search.
  • Builds a graphify-style code graph, computes Louvain communities lazily for codedb_communities, and exposes Rust-native codedb_module_map/codedb_module_atlas outputs from a dependency-connected file graph with label propagation, dependency cohesion, cross-folder evidence, semantic-neighbor probes, key symbols, and c-TF-IDF-like labels.
  • Watches configured source extensions in MCP mode and rebuilds after a debounce.
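
For readers unfamiliar with the BM25 lexical ranking mentioned above, here is a compact Python sketch of the standard Okapi BM25 formula. The `k1` and `b` values are common textbook defaults, not values taken from this project, and the tokenized "documents" are invented. The project's own BM25 implementation (including its spill-to-disk postings) is not shown here.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation dampens long documents.
            norm = tf[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores[name] = score
    return scores

# Invented token lists standing in for indexed chunks.
docs = {
    "PoolManager.cs": ["pool", "manager", "spawn", "pool"],
    "Joystick.cs": ["joystick", "input", "drag"],
    "NetworkListenerManager.cs": ["network", "listener", "manager"],
}
scores = bm25_scores(["pool", "manager"], docs)
```

The document that contains both query terms ranks first, the one that shares only the common term `manager` ranks second, and the unrelated document scores zero.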

Technology Architecture

  1. Explicit project-local config: all behavior comes from .codedb-mcp/codedb-mcp.toml. There are no environment-variable switches for indexing behavior.
  2. Project-local storage: cache payloads, manifests, Louvain caches, and DeepWiki output live under .codedb-mcp. Deleting that directory removes all generated data for the repo.
  3. Scanner: walks the repo with explicit extensions, max file size, project .gitignore behavior, scan roots, include paths, exclude globs, and skip dirs. Nested Git worktrees/submodules under the target root are scanned as normal source directories. Unity runtime scans can be limited to Assets, Packages, and Library/PackageCache while excluding **/Editor/**.
  4. Unified language layer: extension dispatch selects a tree-sitter grammar for C#, Java, Rust, Python, Lua, JavaScript, TypeScript/TSX, C, or C++. The parser emits the same FileEntry/Symbol model for every language and visits declarations without descending into large method bodies.
  5. Code-aware references: C#/Java namespace/package imports, qualified names, aliases, static using, annotations, and attribute suffixes feed typed callers and dependency edges. Rust and the other non-C#/Java languages currently provide indexed search, outlines, imports/includes/use declarations, Lua require() imports, and graph nodes, but not Roslyn/JDT-level semantic binding.
  6. Search indexes: builds chunk metadata, symbol-definition chunk hits, dependency references, and spill-to-disk BM25 lexical search during cold indexing. Exact identifier hits and Model2Vec file embeddings are generated lazily when callers or natural-language search actually need them.
  7. Memory-shaped cache: cache v20 follows the bounded-content-cache lesson from justrach/codedb: full file bodies, chunk preview text, repeated chunk file paths, repeated language/kind strings, BM25 postings, word-index hits, caller results, embeddings, forward/reverse dependencies, graph objects, and Louvain results are no longer all resident by default. Tools read exact lines, postings, word hits, caller sidecars, embeddings, dependencies, or graph data on demand.
  8. Graph layer: builds a graphify-style code graph lazily. Small repos keep file, namespace/package, symbol, dependency, and reference edges; large repos keep graph construction behind graph/community/module tools while symbol data stays in outline/search/callers indexes. Louvain communities and subcommunities are computed lazily on first request and cached under .codedb-mcp.
  9. Module atlas layer: codedb_module_map and codedb_module_atlas run in Rust. They first split files by dependency-connected components, then do dependency-weighted label propagation inside each component. Path and token terms are used for naming, evidence, and oversized-component splitting, not as the primary clustering basis. codedb_module_atlas exports Embedding Atlas-ready JSON.
  10. MCP runtime: implemented with the Rust rmcp SDK over stdio. Tools operate against a warm in-process index, and batch-capable tools plus codedb_bundle reduce MCP round trips.
  11. Setup guide and skills package: setup-for-agent.md owns installation guidance. skills/codedb-mcp is standalone for tool usage and includes the executable, config template, MCP reference, and tool guidance. skills/deepwiki builds local DeepWiki-style docs from MCP evidence plus the active agent's reasoning. skills/code-module-atlas calls codedb_module_atlas and packages the local meet-blog-style module/file graph webpage.
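
The lazy, on-demand loading described in items 6 and 7 follows a familiar pattern: nothing is read until a tool first needs it, and later calls reuse the loaded value. A minimal Python sketch of that pattern, with a hypothetical `LazySidecar` class and a stand-in loader (the real sidecar formats and Rust loading code are not shown in this README):

```python
class LazySidecar:
    """Load a value on first access and keep it for later calls."""

    def __init__(self, name, loader):
        self.name = name
        self._loader = loader
        self._value = None
        self._loaded = False
        self.load_count = 0  # exposed only to make the behaviour observable

    def get(self):
        if not self._loaded:
            self._value = self._loader()  # first access pays the load cost
            self._loaded = True
            self.load_count += 1
        return self._value

    def drop(self):
        """Release the value so memory can be reclaimed; next get() reloads."""
        self._value = None
        self._loaded = False

# Stand-in loader; a real one would read a file such as word_index.bin.
word_index = LazySidecar("word_index.bin", lambda: {"PoolManager": [26]})
first = word_index.get()
second = word_index.get()
```

This is consistent with the benchmark rows that distinguish a "first lazy load" from later warm calls: the first call pays the load cost and subsequent calls do not.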

Configuration

Default config path:

<repo-root>/.codedb-mcp/codedb-mcp.toml

The repo includes a working example at .codedb-mcp/codedb-mcp.toml and a distributable template at skills/codedb-mcp/assets/codedb-mcp.toml.template.

Important defaults:

[scan]
extensions = ["cs", "java", "rs", "py", "pyw", "lua", "js", "jsx", "mjs", "cjs", "ts", "tsx", "c", "h", "cc", "cpp", "cxx", "hpp", "hh", "hxx"]
max_file_bytes = 50000000
respect_gitignore = true
root_paths = []
include_paths = ["Library/PackageCache"]
exclude_paths = []

[embedding]
model = "C:/Users/<user>/.cache/huggingface/hub/codedb-mcp/models/potion-code-16M"

[storage]
enabled = true
dir = ".codedb-mcp"

There are no environment-variable toggles. Edit the config file explicitly.

  • root_paths can limit scanning to source roots such as Assets, Packages, and Library/PackageCache.
  • include_paths adds extra roots even when a parent is skipped.
  • exclude_paths accepts globs such as **/Editor/** for Unity runtime-only scans.
  • respect_gitignore=true reads project .gitignore files, but nested Git worktrees/submodules inside the target root are still indexed unless excluded by skip_dirs, exclude_paths, or file extension rules.
  • The model path is explicit and absolute; on Windows the setup guide uses the default HuggingFace cache when present, otherwise it falls back to the second available drive.
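
To make the interaction of extensions, max_file_bytes, skip_dirs, and exclude_paths concrete, here is a small Python sketch of a per-file filter. It is an approximation under stated assumptions: the glob translation covers only `**/`, `**`, `*`, and `?`, skip_dirs is treated as a set of directory names, and .gitignore, root_paths, and include_paths handling are omitted. The scanner's real matching rules may differ, and the function names are hypothetical.

```python
import re

def glob_to_regex(pattern):
    """Translate a small glob subset into an anchored regular expression."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")        # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")     # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def should_index(path, size, extensions, max_file_bytes, exclude_paths, skip_dirs=()):
    """Decide whether one forward-slash relative path would be indexed."""
    parts = path.split("/")
    if any(part in skip_dirs for part in parts[:-1]):
        return False
    filename = parts[-1]
    ext = filename.rsplit(".", 1)[-1] if "." in filename else ""
    if ext not in extensions:
        return False
    if size > max_file_bytes:
        return False
    return not any(glob_to_regex(g).match(path) for g in exclude_paths)

exts = {"cs", "java", "rs"}
keep = should_index("Assets/Scripts/Pool.cs", 1200, exts, 50000000, ["**/Editor/**"])
drop = should_index("Assets/Scripts/Editor/Tool.cs", 1200, exts, 50000000, ["**/Editor/**"])
```

With the Unity runtime-only glob from the text, files under any Editor directory are rejected while sibling runtime scripts are kept.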

Build And CLI

Build:

cargo build --release

Run MCP directly:

target\release\codebase-mcp.exe --config u3dclient\.codedb-mcp\codedb-mcp.toml mcp u3dclient

Quick CLI checks:

target\release\codebase-mcp.exe --config u3dclient\.codedb-mcp\codedb-mcp.toml index u3dclient
target\release\codebase-mcp.exe --config u3dclient\.codedb-mcp\codedb-mcp.toml search "network listener manager" u3dclient -k 5
target\release\codebase-mcp.exe --config u3dclient\.codedb-mcp\codedb-mcp.toml --root u3dclient tool codedb_status "{}"

MCP mode answers the protocol handshake before the initial index finishes, then builds the default project index in the background. Early tool calls may wait for that first build. It also watches indexed extensions by default; when a configured source file changes, the server debounces events, rebuilds the project index in the background, and swaps in the new index after it is ready. Use --no-watch for static benchmark runs.
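
The debounce step can be illustrated with a tiny state holder that collects changed paths and reports when a quiet period has passed. This Python sketch is an assumption-laden illustration: the class name, the quiet-period value, and the explicit `now` arguments are hypothetical, and the server's actual watcher, debounce interval, and index swap are implemented in Rust and not documented here.

```python
class Debouncer:
    """Collect change events and release them after a quiet period."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self._last_event = None
        self.pending = set()

    def record(self, path, now):
        """Note a changed file; every new event restarts the quiet period."""
        self.pending.add(path)
        self._last_event = now

    def due(self, now):
        """True when there is pending work and no event arrived recently."""
        return bool(self.pending) and now - self._last_event >= self.quiet_seconds

    def take(self):
        """Hand over the batch that should trigger one rebuild."""
        batch, self.pending = self.pending, set()
        self._last_event = None
        return batch

watch = Debouncer(quiet_seconds=2.0)
watch.record("Assets/Scripts/A.cs", now=0.0)
watch.record("Assets/Scripts/B.cs", now=1.0)
too_early = watch.due(now=2.5)   # only 1.5s since the last event
ready = watch.due(now=3.5)       # 2.5s of quiet
batch = watch.take()
```

A burst of saves therefore triggers one rebuild rather than one per file, and the old index can keep serving requests until the replacement is ready.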

Batch Examples

codedb_search accepts queries:

{
  "max_results": 3,
  "queries": [
    "PoolManager",
    {
      "query": "Joystick",
      "path_glob": "Assets/Plugins/3rdPlugins/Joystick Pack/**"
    },
    {
      "query": "NetworkListenerManager",
      "regex": true,
      "compact": true
    }
  ]
}
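
When calling the server without an agent in between, the arguments above are wrapped in a standard MCP `tools/call` JSON-RPC request and written to the server's stdin as one line of JSON. The Python sketch below only builds and frames that message; the helper names are illustrative, and a real client must first complete the MCP `initialize` handshake, which most agent runtimes handle automatically.

```python
import json

def tools_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

def frame(message):
    """Serialize for the stdio transport: one JSON object per line."""
    return json.dumps(message, separators=(",", ":")) + "\n"

request = tools_call(1, "codedb_search", {
    "max_results": 3,
    "queries": [
        "PoolManager",
        {"query": "NetworkListenerManager", "regex": True, "compact": True},
    ],
})
line = frame(request)
```

Batching several queries into one request keeps the number of round trips low, which is the same motivation the README gives for codedb_bundle.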

codedb_callers accepts targets:

{
  "max_results": 10,
  "targets": [
    {
      "name": "PoolManager",
      "definition_path": "Assets/Scripts/HotFix/3rdExtend/Runtime/PoolManager/PoolManager.cs",
      "definition_line": 26
    },
    {
      "name": "Joystick",
      "definition_path": "Assets/Plugins/3rdPlugins/Joystick Pack/Scripts/Runtime/Base/Joystick.cs",
      "definition_line": 8
    }
  ]
}

codedb_communities uses lazy Louvain clustering:

target\release\codebase-mcp.exe --config u3dclient\.codedb-mcp\codedb-mcp.toml --root u3dclient tool codedb_communities "{`"community_limit`":10}"
target\release\codebase-mcp.exe --config u3dclient\.codedb-mcp\codedb-mcp.toml --root u3dclient tool codedb_communities "{`"community_id`":0,`"children`":true,`"community_limit`":20}"

Overview calls return community IDs, labels, member counts, and cohesion. Add children=true or subcommunities=true with a community_id to split only that community's subgraph; child clusters are cached in .codedb-mcp/louvain-subcommunities.bin.

codedb_module_map is the preferred DeepWiki planning call. It uses the Rust dependency-connected module graph, then adds dependency cohesion, cross-folder roots, semantic-neighbor probes, entry points, key symbols, and c-TF-IDF-like labels:

target\release\codebase-mcp.exe --config u3dclient\.codedb-mcp\codedb-mcp.toml --root u3dclient tool codedb_module_map "{`"path_prefix`":`"Assets/Scripts`",`"limit`":40,`"min_files`":2,`"semantic_neighbors`":5}"

Skills

The skills/ directory is intended to be copied as a standalone package.

  • setup-for-agent.md: installation guide for agents. It reuses the default HuggingFace cache when present, falls back to the second Windows drive when absent, and writes project-local config with an absolute model path.
  • skills/codedb-mcp: includes assets/codebase-mcp.exe, a config template, MCP registration reference, and tool guidance. It does not own setup.
  • skills/deepwiki: creates DeepWiki-style local documentation using local codedb_* tools plus the active agent's reasoning. It emphasizes business module boundaries over folder-only or community-only grouping.
  • skills/code-module-atlas: creates a local 3D module/file atlas webpage by calling codedb_module_atlas, then adapting the bundled meet-blog-style viewer. Generated repo-specific JSON stays ignored.
