Merged
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,29 @@

## [Unreleased]

## [Unshipped - Phase 09] - High-Signal Search + Decision Card

Cleaned up the edit decision card and sharpened search ranking for exact-name queries.

### Added

- **Definition-first ranking (SEARCH-01)**: For exact-name queries (PascalCase/camelCase), the file that *defines* a symbol now ranks above files that merely use it. Symbol-level dedup ensures multiple methods from the same class don't clog the top slots.
- **Smart snippets with scope headers (SEARCH-02)**: When `includeSnippets: true`, code chunks from symbol-aware analysis include a scope comment header (`// ClassName.methodName`) before the snippet, giving structural context without extra disk reads.
- **Clean decision card (PREF-01-04)**: The preflight response for `intent="edit"|"refactor"|"migrate"` is now a decision card: `ready`, `nextAction` (if not ready), `warnings`, `patterns` (do/avoid capped at 3), `bestExample` (top golden file), `impact` (caller coverage + top files), and `whatWouldHelp`. Internal fields like `evidenceLock`, `riskLevel`, `confidence` are no longer exposed.
- **Impact coverage gating (PREF-02)**: When result files have known callers (from import graph), the card shows caller coverage: "X/Y callers in results". Low coverage (< 40% with > 3 total callers) triggers an epistemic stress alert.
- **whatWouldHelp recommendations (PREF-03)**: When `ready=false`, concrete next steps appear: search more specifically, call `get_team_patterns`, search for uncovered callers, or check memories. Each is actionable in 1-2 sentences.

### Changed

- **Preflight shape**: `{ ready, reason?, ... }` → `{ ready, nextAction?, warnings?, patterns?, bestExample?, impact?, whatWouldHelp? }`. `reason` renamed to `nextAction` for clarity. No breaking changes to `ready` (stays top-level).
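
Consumers that used to read `reason` may want a small shim while they migrate. The following is a minimal sketch, assuming a client could receive either the old or the new shape; `LegacyPreflight`, `DecisionCard`, and `explainNotReady` are hypothetical names used only for illustration.

```typescript
// Hypothetical types: only the field names come from the changelog entry above.
interface LegacyPreflight {
  ready: boolean;
  reason?: string;
}

interface DecisionCard {
  ready: boolean;
  nextAction?: string;
  whatWouldHelp?: string[];
}

// Returns the "why not ready" text regardless of which shape was received.
function explainNotReady(preflight: LegacyPreflight | DecisionCard): string | undefined {
  if (preflight.ready) return undefined;
  // New shape: prefer nextAction.
  if ('nextAction' in preflight && preflight.nextAction) return preflight.nextAction;
  // Old shape: fall back to reason.
  if ('reason' in preflight && preflight.reason) return preflight.reason;
  return undefined;
}
```

Because `ready` stays top-level in both shapes, the readiness check itself needs no change.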

### Fixed

- Agents no longer parse unstable internal fields. Preflight output is stable by design.
- Snippets now include scope context, reducing ambiguity for symbol-heavy edits.

## [Unreleased]

### Added

- **Index versioning (Phase 06)**: Index artifacts are versioned via `index-meta.json`. Mixed-version indexes are never served; version mismatches or corruption trigger automatic rebuild.
2 changes: 1 addition & 1 deletion MOTIVATION.md
@@ -49,7 +49,7 @@ Correct the agent once. Record the decision. From then on, it surfaces in search

### Evidence gating

Before an edit, the agent gets a curated "preflight" check from three sources (code, patterns, memories). If evidence is thin or contradictory, the response tells the AI Agent to look for more evidence with a concrete next step. This is the difference between "confident assumption" and "informed decision."
Before an edit, the response includes a decision card. `ready: true` means there's enough evidence from the codebase, patterns, and team memory to proceed. `ready: false` comes with `whatWouldHelp` — specific searches to run, specific files to check, or calls to `get_team_patterns` that would close the gap. The card also surfaces caller coverage: if you're editing a function that five files import but only two of them appear in your results, you know which ones you haven't looked at yet (`coverage: "2/5 callers in results"`). This is the difference between "confident assumption" and "informed decision."
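
To make the caller-coverage idea concrete, here is a minimal sketch of how such a string could be derived. It assumes the importing files are already known as a plain list of paths; `callerCoverage` and the file names in the usage below are hypothetical and not taken from the implementation.

```typescript
// Hypothetical helper: compares known callers against the files present in search results.
function callerCoverage(
  callers: string[],
  resultFiles: string[]
): { coverage: string; missing: string[] } {
  const inResults = new Set(resultFiles);
  // De-duplicate callers so a file importing twice is counted once.
  const uniqueCallers = Array.from(new Set(callers));
  const missing = uniqueCallers.filter((file) => !inResults.has(file));
  const covered = uniqueCallers.length - missing.length;
  return {
    coverage: `${covered}/${uniqueCallers.length} callers in results`,
    missing
  };
}

// Five files import the function, two of them appear in the results.
const example = callerCoverage(
  ['a.ts', 'b.ts', 'c.ts', 'd.ts', 'e.ts'],
  ['a.ts', 'b.ts', 'unrelated.ts']
);
console.log(example.coverage); // "2/5 callers in results"
console.log(example.missing); // the three callers not yet looked at
```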

### Guardrails via frozen eval + regressions

22 changes: 19 additions & 3 deletions README.md
@@ -122,14 +122,30 @@ This is where it all comes together. One call returns:
- **Relationships** per result: `importedByCount` and `hasTests` (condensed) + **hints** (capped ranked callers, consumers, tests)
- **Related memories**: up to 3 team decisions, gotchas, and failures matched to the query
- **Search quality**: `ok` or `low_confidence` with confidence score and `hint` when low
- **Preflight**: `ready` (boolean) + `reason` when evidence is thin. Pass `intent="edit"` to get the full preflight card. If search quality is low, `ready` is always `false`.
- **Preflight**: `ready` (boolean) with decision card when `intent="edit"|"refactor"|"migrate"`. Shows `nextAction` (if not ready), `warnings`, `patterns` (do/avoid), `bestExample`, `impact` (caller coverage), and `whatWouldHelp` (next steps). If search quality is low, `ready` is always `false`.

Snippets are opt-in (`includeSnippets: true`). Default output is lean — if the agent wants code, it calls `read_file`.

```json
{
"searchQuality": { "status": "ok", "confidence": 0.72 },
"preflight": { "ready": true },
"preflight": {
"ready": false,
"nextAction": "2 of 5 callers aren't in results — search for src/app.module.ts",
"patterns": {
"do": ["HttpInterceptorFn — 97%", "standalone components — 84%"],
"avoid": ["constructor injection — 3% (declining)"]
},
"bestExample": "src/auth/auth.interceptor.ts",
"impact": {
"coverage": "3/5 callers in results",
"files": ["src/app.module.ts", "src/boot.ts"]
},
"whatWouldHelp": [
"Search for src/app.module.ts to cover the main caller",
"Call get_team_patterns for auth/ injection patterns"
]
},
"results": [
{
"file": "src/auth/auth.interceptor.ts:1-20",
@@ -171,7 +187,7 @@ Record a decision once. It surfaces automatically in search results and prefligh

| Tool | What it does |
| ------------------------------ | ------------------------------------------------------------------------------------------- |
| `search_codebase` | Hybrid search with enrichment + preflight + ranked relationship hints. Pass `intent="edit"` for edit readiness check. |
| `search_codebase` | Hybrid search + decision card. Pass `intent="edit"` to get `ready`, `nextAction`, patterns, caller coverage, and `whatWouldHelp`. |
| `get_team_patterns` | Pattern frequencies, golden files, conflict detection |
| `get_symbol_references` | Find concrete references to a symbol (usageCount + top snippets + confidence + completeness) |
| `remember` | Record a convention, decision, gotcha, or failure |
71 changes: 50 additions & 21 deletions docs/capabilities.md
@@ -10,7 +10,7 @@ Technical reference for what `codebase-context` ships today. For the user-facing

| Tool | Input | Output |
| ----------------------- | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`, `relationships`, `hints`) + `searchQuality` (with `hint` when low confidence) + `preflight` ({ready, reason}). Hints capped at 3 per category. |
| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`, `relationships`, `hints`) + `searchQuality` + decision card (`ready`, `nextAction`, `patterns`, `bestExample`, `impact`, `whatWouldHelp`) when `intent="edit"`. Hints capped at 3 per category. |
| `get_team_patterns` | optional `category` | Pattern frequencies, trends, golden files, conflicts |
| `get_symbol_references` | `symbol`, optional `limit` | Concrete symbol usage evidence: `usageCount` + top usage snippets + `confidence` ("syntactic") + `isComplete` boolean |
| `remember` | `type`, `category`, `memory`, `reason` | Persists to `.codebase-context/memory.json` |
@@ -34,11 +34,13 @@ Ordered by execution:
2. **Query expansion** — bounded domain term expansion for conceptual queries.
3. **Dual retrieval** — keyword (Fuse.js) + semantic (local embeddings or OpenAI).
4. **RRF fusion** — Reciprocal Rank Fusion (k=60) across all retrieval channels.
5. **Structure-aware boosting** — import centrality, composition root boost, path overlap, definition demotion for action queries.
6. **Contamination control** — test file filtering for non-test queries.
7. **File deduplication** — best chunk per file.
8. **Stage-2 reranking** — cross-encoder (`Xenova/ms-marco-MiniLM-L-6-v2`) triggers when the scores of the top files are very close. CPU-only, top-10 bounded.
9. **Result enrichment** — compact type (`componentType:layer`), pattern momentum (`trend` Rising/Declining only, Stable omitted), `patternWarning`, condensed relationships (`importedByCount`/`hasTests`), structured hints (capped callers/consumers/tests ranked by frequency), related memories (capped to 3), search quality assessment with `hint` when low confidence.
5. **Definition-first boost** — for EXACT_NAME intent, results whose symbol name matches the query (case-insensitive) get a +15% score boost, so the defining file ranks above files that only use the symbol.
6. **Structure-aware boosting** — import centrality, composition root boost, path overlap, definition demotion for action queries.
7. **Contamination control** — test file filtering for non-test queries.
8. **File deduplication** — best chunk per file.
9. **Symbol-level deduplication** — within each `symbolPath` group, keep only the highest-scoring chunk (prevents repeated chunks of the same symbol from clogging results).
10. **Stage-2 reranking** — cross-encoder (`Xenova/ms-marco-MiniLM-L-6-v2`) triggers when the scores of the top files are very close. CPU-only, top-10 bounded.
11. **Result enrichment** — compact type (`componentType:layer`), pattern momentum (`trend` Rising/Declining only, Stable omitted), `patternWarning`, condensed relationships (`importedByCount`/`hasTests`), structured hints (capped callers/consumers/tests ranked by frequency), scope header for symbol-aware snippets (`// ClassName.methodName`), related memories (capped to 3), search quality assessment with `hint` when low confidence.
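
The RRF fusion step in the list above can be sketched as follows. This is the textbook form of Reciprocal Rank Fusion (each document scores `1 / (k + rank)` per channel, summed), shown with `k = 60` as stated; the function name `rrfFuse` and the plain string IDs are assumptions, and the shipped implementation may weight channels differently.

```typescript
// Sketch of standard Reciprocal Rank Fusion; not the project's actual code.
function rrfFuse(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

// Two channels (e.g. keyword and semantic) ranking the same three files differently.
const fused = rrfFuse([
  ['a.ts', 'b.ts', 'c.ts'],
  ['b.ts', 'c.ts', 'a.ts']
]);
console.log(fused.map((entry) => entry.id)); // b.ts first: it ranks 2nd and 1st
```

A document that places well in both channels beats one that tops a single channel, which is the property the later boosting and deduplication steps build on.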

### Defaults

@@ -47,29 +49,56 @@ Ordered by execution:
- **Embedding model**: Granite (`ibm-granite/granite-embedding-30m-english`, 8192 token context) via `@huggingface/transformers` v3
- **Vector DB**: LanceDB with cosine distance

## Preflight (Edit Intent)

Returned as `preflight` when search `intent` is `edit`, `refactor`, or `migrate`. Also returned for default searches when intelligence is available.

Output: `{ ready: boolean, reason?: string }`

- `ready`: whether evidence is sufficient to proceed with edits
- `reason`: when `ready` is false, explains why (e.g., "Search quality is low", "Insufficient pattern evidence")
## Decision Card (Edit Intent)

Returned as `preflight` when search `intent` is `edit`, `refactor`, or `migrate`.

**Output shape:**

```typescript
{
ready: boolean;
nextAction?: string; // Only when ready=false; what to search for next
warnings?: string[]; // Failure memories (capped at 3)
patterns?: {
do: string[]; // Top 3 preferred patterns with adoption %
avoid: string[]; // Top 3 declining patterns
};
bestExample?: string; // Top 1 golden file (path format)
impact?: {
coverage: string; // "X/Y callers in results"
files: string[]; // Top 3 impact candidates (files importing results)
};
whatWouldHelp?: string[]; // Concrete next steps (max 4) when ready=false
}
```

**Fields explained:**

- `ready`: boolean, whether evidence is sufficient to proceed
- `nextAction`: actionable reason why `ready=false` (e.g., "2 of 5 callers missing")
- `warnings`: failure memories from team (auto-surfaces past mistakes)
- `patterns.do`: patterns the team is adopting, ranked by adoption %
- `patterns.avoid`: declining patterns, ranked by % (useful for migrations)
- `bestExample`: exemplar file for the area under edit
- `impact.coverage`: shows caller visibility ("3/5 callers in results" means 2 known callers do not appear in the results yet)
- `impact.files`: which files import the results (helps find blind spots)
- `whatWouldHelp`: specific next searches, tool calls, or files to check that would close evidence gaps
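
As an illustration of how an agent might act on these fields, here is a minimal sketch. The `DecisionCard` interface mirrors the output shape above; `AgentStep` and `planNextStep` are hypothetical and only show one possible consumption strategy.

```typescript
// Mirrors the documented output shape.
interface DecisionCard {
  ready: boolean;
  nextAction?: string;
  warnings?: string[];
  patterns?: { do: string[]; avoid: string[] };
  bestExample?: string;
  impact?: { coverage: string; files: string[] };
  whatWouldHelp?: string[];
}

// Hypothetical result type: either proceed with the edit or gather more evidence.
type AgentStep =
  | { kind: 'edit'; followExample?: string; warnings: string[] }
  | { kind: 'gather'; steps: string[] };

function planNextStep(card: DecisionCard): AgentStep {
  if (card.ready) {
    // Proceed, carrying the exemplar file and any failure memories along.
    return { kind: 'edit', followExample: card.bestExample, warnings: card.warnings ?? [] };
  }
  // Prefer the concrete recommendations; fall back to nextAction if none are given.
  const steps =
    card.whatWouldHelp && card.whatWouldHelp.length > 0
      ? card.whatWouldHelp
      : card.nextAction
        ? [card.nextAction]
        : [];
  return { kind: 'gather', steps };
}
```

Branching only on `ready` keeps the agent independent of the internal signals that are no longer exposed.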

### How `ready` is determined

1. **Evidence triangulation** — scores code match (45%), pattern alignment (30%), and memory support (25%). Needs combined score ≥ 40 to pass.
2. **Epistemic stress check** — if pattern conflicts, stale memories, or thin evidence are detected, `ready` is set to false with an abstain signal.
3. **Search quality gate** — if `searchQuality.status` is `low_confidence`, `ready` is forced to false regardless of evidence scores. This prevents the "confidently wrong" problem where evidence counts look good but retrieval quality is poor.
2. **Epistemic stress check** — if pattern conflicts, stale memories, thin evidence, or low caller coverage are detected, `ready` is set to false.
3. **Search quality gate** — if `searchQuality.status` is `low_confidence`, `ready` is forced to false regardless of evidence scores. This prevents the "confidently wrong" problem.
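
The three checks above combine roughly as in the sketch below. It assumes each evidence source is already scored on a 0-100 scale and that the stress check is reduced to a single boolean; `ReadinessInput` and `isReady` are hypothetical names, and the real computation in `buildEvidenceLock` derives its inputs differently.

```typescript
// Hypothetical input: per-source scores assumed to be on a 0-100 scale.
interface ReadinessInput {
  codeScore: number;
  patternScore: number;
  memoryScore: number;
  stressAbstain: boolean; // true when the epistemic stress check says "abstain"
  searchQualityStatus: 'ok' | 'low_confidence';
}

function isReady(input: ReadinessInput): boolean {
  // 1. Evidence triangulation: weighted sum, must reach 40.
  const combined =
    0.45 * input.codeScore + 0.3 * input.patternScore + 0.25 * input.memoryScore;
  const enoughEvidence = combined >= 40;
  // 2. Epistemic stress and 3. search quality both act as hard gates.
  return (
    enoughEvidence && !input.stressAbstain && input.searchQualityStatus !== 'low_confidence'
  );
}

// Strong code match, moderate patterns, no memories: 36 + 12 + 0 = 48, so evidence passes.
const base = { codeScore: 80, patternScore: 40, memoryScore: 0, stressAbstain: false };
console.log(isReady({ ...base, searchQualityStatus: 'ok' })); // true
console.log(isReady({ ...base, searchQualityStatus: 'low_confidence' })); // false
```

The second call shows the point of the quality gate: the same evidence scores are rejected once retrieval quality is flagged as low.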

### Internal analysis (not in output, used to compute `ready`)
### Internal signals (not in output, feed `ready` computation)

- Risk level from circular deps + impact breadth + failure memories
- Risk level from circular deps, impact breadth, and failure memories
- Preferred/avoid patterns from team pattern analysis
- Golden files by pattern density
- Impact candidates from import graph
- Failure warnings from related memories
- Golden files ranked by pattern density
- Caller coverage from import graph (X of Y callers appearing in results)
- Pattern conflicts when two patterns in the same category are both > 20% adoption
- Confidence decay of related memories
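
The pattern-conflict signal in the list above can be illustrated with a short sketch. It assumes adoption is available as a percentage per pattern; `PatternStat` and `findPatternConflicts` are hypothetical names, and the pattern names in the usage are example data only.

```typescript
// Hypothetical input shape: adoption is a percentage in the range 0-100.
interface PatternStat {
  category: string;
  name: string;
  adoption: number;
}

// Flags categories where at least two patterns each exceed the adoption threshold.
function findPatternConflicts(
  patterns: PatternStat[],
  threshold = 20
): Array<{ category: string; patterns: string[] }> {
  const byCategory = new Map<string, string[]>();
  for (const pattern of patterns) {
    if (pattern.adoption <= threshold) continue;
    const names = byCategory.get(pattern.category) ?? [];
    names.push(pattern.name);
    byCategory.set(pattern.category, names);
  }
  const conflicts: Array<{ category: string; patterns: string[] }> = [];
  byCategory.forEach((names, category) => {
    if (names.length >= 2) conflicts.push({ category, patterns: names });
  });
  return conflicts;
}

const conflicts = findPatternConflicts([
  { category: 'di', name: 'inject() function', adoption: 55 },
  { category: 'di', name: 'constructor injection', adoption: 45 },
  { category: 'http', name: 'HttpInterceptorFn', adoption: 97 },
  { category: 'http', name: 'class interceptor', adoption: 3 }
]);
console.log(conflicts); // only the "di" category is reported
```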

## Memory System

46 changes: 45 additions & 1 deletion src/core/search.ts
@@ -693,6 +693,21 @@ export class CodebaseSearcher {
})
.sort((a, b) => b.score - a.score);

// SEARCH-01: Definition-first boost for EXACT_NAME intent
// Boost results where symbolName matches query (case-insensitive)
if (intent === 'EXACT_NAME') {
const queryNormalized = query.toLowerCase();
for (const result of scoredResults) {
const symbolName = result.metadata?.symbolName;
if (symbolName && symbolName.toLowerCase() === queryNormalized) {
result.score *= 1.15; // +15% boost for definition
}
}
// Re-sort after boost
scoredResults.sort((a, b) => b.score - a.score);
}

// File-level deduplication
const seenFiles = new Set<string>();
const deduped: SearchResult[] = [];
for (const result of scoredResults) {
@@ -702,7 +717,36 @@ export class CodebaseSearcher {
deduped.push(result);
if (deduped.length >= limit) break;
}
const finalResults = deduped;

// SEARCH-01: Symbol-level deduplication
// Within each symbol group (symbolPath), keep only the highest-scoring chunk
const seenSymbols = new Map<string, SearchResult>();
const symbolDeduped: SearchResult[] = [];
for (const result of deduped) {
const symbolPath = result.metadata?.symbolPath;
if (!symbolPath) {
// No symbol info, keep as-is
symbolDeduped.push(result);
continue;
}

const symbolPathKey = Array.isArray(symbolPath) ? symbolPath.join('.') : String(symbolPath);
const existing = seenSymbols.get(symbolPathKey);
if (!existing || result.score > existing.score) {
if (existing) {
// Replace lower-scoring version
const idx = symbolDeduped.indexOf(existing);
if (idx >= 0) {
symbolDeduped[idx] = result;
}
} else {
symbolDeduped.push(result);
}
seenSymbols.set(symbolPathKey, result);
}
}

const finalResults = symbolDeduped;

if (
isNonTestQuery &&
52 changes: 51 additions & 1 deletion src/preflight/evidence-lock.ts
@@ -25,6 +25,7 @@ export interface EvidenceLock {
gaps?: string[];
nextAction?: string;
epistemicStress?: EpistemicStress;
whatWouldHelp?: string[];
}

interface PatternConflict {
@@ -41,6 +42,8 @@ interface BuildEvidenceLockInput {
patternConflicts?: PatternConflict[];
/** When search quality is low_confidence, evidence lock MUST block edits. */
searchQualityStatus?: 'ok' | 'low_confidence';
/** Impact coverage: number of known callers covered by results */
impactCoverage?: { covered: number; total: number };
}

function strengthFactor(strength: EvidenceStrength): number {
@@ -162,6 +165,17 @@ export function buildEvidenceLock(input: BuildEvidenceLockInput): EvidenceLock {
stressTriggers.push('Insufficient evidence: most evidence sources are empty');
}

// Trigger: low caller coverage
if (
input.impactCoverage &&
input.impactCoverage.total > 3 &&
input.impactCoverage.covered / input.impactCoverage.total < 0.4
) {
stressTriggers.push(
`Low caller coverage: only ${input.impactCoverage.covered} of ${input.impactCoverage.total} callers appear in results`
);
}

let epistemicStress: EpistemicStress | undefined;
if (stressTriggers.length > 0) {
const level: EpistemicStress['level'] =
@@ -195,6 +209,41 @@ export function buildEvidenceLock(input: BuildEvidenceLockInput): EvidenceLock {
(!epistemicStress || !epistemicStress.abstain) &&
input.searchQualityStatus !== 'low_confidence';

// Generate whatWouldHelp recommendations
const whatWouldHelp: string[] = [];
if (!readyToEdit) {
// Code evidence weak/missing
if (codeStrength === 'weak' || codeStrength === 'missing') {
whatWouldHelp.push(
'Search with a more specific query targeting the implementation files'
);
}

// Pattern evidence missing
if (patternsStrength === 'missing') {
whatWouldHelp.push('Call get_team_patterns to see what patterns apply to this area');
}

// Low caller coverage with many callers
if (
input.impactCoverage &&
input.impactCoverage.total > 3 &&
input.impactCoverage.covered / input.impactCoverage.total < 0.4
) {
const uncoveredCallers = input.impactCoverage.total - input.impactCoverage.covered;
if (uncoveredCallers > 0) {
whatWouldHelp.push(
`Search specifically for uncovered callers to check ${Math.min(2, uncoveredCallers)} more files`
);
}
}

// Memory evidence missing + failure warnings
if (memoriesStrength === 'missing' && input.failureWarnings.length > 0) {
whatWouldHelp.push('Review related memories with get_memory to understand past issues');
}
}

return {
mode: 'triangulated',
status,
@@ -203,6 +252,7 @@ export function buildEvidenceLock(input: BuildEvidenceLockInput): EvidenceLock {
sources,
...(gaps.length > 0 && { gaps }),
...(nextAction && { nextAction }),
...(epistemicStress && { epistemicStress })
...(epistemicStress && { epistemicStress }),
...(whatWouldHelp.length > 0 && { whatWouldHelp: whatWouldHelp.slice(0, 4) })
};
}