Updated 2026-05-08 (rounds 1 + 2 per Codex review on #165). Round 1 added quality-gate cells. Round 2 sharpened: (a) "non-empty" is necessary but not sufficient — concept-only and stopword-heavy checks now require top-K relevance, not just any results; (b) added pathological-behavior hard-fails covering tokenizer parity, all-stopword queries, duplicate terms, edge-length tokens, missing display-join rows, filter composition; (c) NO-GO framing makes hosted-search a permanent contingency for either v1 failure or future v2+ quality requirements.
Sub-issue of #165. Depends on #171 (browser query prototype + benchmark data).
Goal
Mechanical decision gate. No budget renegotiation here — the budgets in #169's SEARCH_INDEX_V1.md are the contract.
Decision criterion
Does the prototype meet ALL of the cells below — every latency/bytes target, every quality target, and every hard-fail check?
A fast-but-mediocre search that fails any quality cell or any hard-fail check is NOT a GO, regardless of how it performs on the latency/bytes table.
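The all-or-nothing rule can be sketched as a small function. This is only an illustration under stated assumptions: the `Cell` type and `decide` function are hypothetical names, not part of the prototype, and a cell still marked `(fill)` is treated as not passing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One row of a gate table. `passed` is None while the cell still reads (fill)."""
    name: str
    passed: Optional[bool]


def decide(performance, quality, hard_fail):
    """Return ("GO", []) only when every cell in every group passed.

    Any failed or unfilled cell yields ("NO-GO", [names of offending cells]).
    """
    offending = [
        cell.name
        for group in (performance, quality, hard_fail)
        for cell in group
        if cell.passed is not True
    ]
    return ("GO" if not offending else "NO-GO", offending)
```

Under this sketch a prototype that clears every latency and bytes cell but misses one quality cell still comes back as NO-GO, which is the behavior the criterion asks for.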
Performance gates (hard)
| metric | contract | prototype | pass? |
|---|---|---|---|
| cold first search (P50) | ≤ 2 s | (fill) | |
| warm repeat-same-query search | ≤ 500 ms | (fill) | |
| warm new-query-after-warm-up search | ≤ 500 ms | (fill) | |
| filter-composed cold search | ≤ 3 s | (fill) | |
| bytes transferred cold | ≤ 5 MB | (fill) | |
| bytes transferred warm | ≤ 1 MB | (fill) | |
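A minimal sketch of how the performance cells could be filled mechanically from benchmark samples. The budgets are the ones in the table; everything else is an assumption: the metric keys and `evaluate_performance` are hypothetical names, the median is used for every metric (the contract names P50 only for the cold first search), and 1 MB is taken as 1,000,000 bytes.

```python
from statistics import median

# Budgets copied from the performance table. Latencies in ms, transfers in bytes.
# Assumption: 1 MB = 1_000_000 bytes; the contract does not say MB vs MiB.
BUDGETS = {
    "cold_first_search_ms": 2000,
    "warm_repeat_ms": 500,
    "warm_new_query_ms": 500,
    "filter_composed_cold_ms": 3000,
    "bytes_cold": 5_000_000,
    "bytes_warm": 1_000_000,
}


def evaluate_performance(samples):
    """Map each budgeted metric to (p50, passed).

    `samples` maps a metric key to a list of measurements.
    A metric with no measurements is reported as (None, False).
    """
    results = {}
    for metric, budget in BUDGETS.items():
        values = samples.get(metric, [])
        if not values:
            results[metric] = (None, False)
            continue
        p50 = median(values)
        results[metric] = (p50, p50 <= budget)
    return results
```

Treating a missing measurement as a failure keeps the gate mechanical: an unmeasured cell cannot pass by omission.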
Quality gates (hard, not advisory)
| metric | contract | prototype | pass? |
|---|---|---|---|
| top-3 overlap vs hand-labeled set | ≥ TBD% | (fill) | |
| top-10 overlap vs hand-labeled set | ≥ TBD% | (fill) | |
| top-10 overlap vs DuckDB FTS local oracle | ≥ TBD% | (fill) | |
| concept-only top-3 relevance: each of `ceramic`, `bone`, `mammal` (+ 1-2 more) returns ≥ 2 of 3 hand-labeled known-good PIDs | 2/3 each | (fill) | |
| stopword-heavy near-equivalence: top-K Jaccard between `pottery from Cyprus` and `pottery Cyprus` (stopword-stripped form) | ≥ 0.8 top-10 | (fill) | |
Numeric thresholds get filled in once #167 baseline + #171 prototype + DuckDB FTS oracle numbers land, so we know what "beats ILIKE" and "approaches BM25 oracle" mean on the canonical query set.
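The quality cells reduce to a few set computations over ranked PID lists. The sketch below shows one way to compute them; the function names are hypothetical, overlap is defined here as the share of the expected top-K that the prototype also returns (the contract does not spell out the formula), and the still-TBD thresholds are deliberately left out.

```python
def top_k_overlap(predicted, expected, k):
    """Fraction of the expected top-k PIDs that also appear in the predicted top-k."""
    expected_set = set(expected[:k])
    if not expected_set:
        return 0.0
    return len(set(predicted[:k]) & expected_set) / len(expected_set)


def jaccard(results_a, results_b, k=10):
    """Jaccard similarity of two top-k result sets, ignoring rank order.

    Two empty lists compare as identical (1.0); the separate non-empty
    hard-fail check is what catches empty results.
    """
    set_a, set_b = set(results_a[:k]), set(results_b[:k])
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def concept_only_pass(top_results, known_good, min_hits=2):
    """True when at least `min_hits` of the top-3 results are known-good PIDs."""
    return len(set(top_results[:3]) & set(known_good)) >= min_hits
```

One consequence worth noting for the stopword row: with two full top-10 lists, sharing 9 PIDs gives 9/11 ≈ 0.82 and passes the 0.8 bar, while sharing 8 gives 8/12 ≈ 0.67 and fails, so the bar tolerates a single differing result per list.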
Hard-fail checks (any single failure = NO-GO)
Semantics
| check | pass? |
|---|---|
| Concept-only queries (`ceramic`, `bone`, `mammal`) all return non-empty results | |
| Concept-only queries hit the top-3 relevance bar in the quality table above | |
| Stopword-heavy queries (`pottery from Cyprus`) return non-empty results | |
| Stopword-heavy queries hit the near-equivalence bar in the quality table above | |
| Diacritic queries (`Çatalhöyük`) match the diacritic-stripped index | |
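For the diacritic row, the query has to be folded the same way the index was. A minimal sketch of one common folding approach follows, assuming Unicode NFD decomposition plus case folding; the actual index may use different rules, and `fold_diacritics` is a hypothetical name.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks after NFD decomposition, then case-fold.

    Assumption: the index was built with an equivalent folding step.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold_diacritics("Çatalhöyük"))  # catalhoyuk
```

Letters that have no canonical decomposition, such as `ø`, pass through this function unchanged, so a check built on it covers only accent-style diacritics.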
Tokenizer + query parsing
| check | pass? |
|---|---|
| Tokenizer parity: Python and JS produce identical token sequences for every term in the curated benchmark (not just the regression set) | |
| All-stopword query (`a the of`) yields a controlled empty state with helpful copy, not an error or a full-corpus dump | |
| Duplicate terms (`pottery pottery cyprus`) produce the same top-K result identity as `pottery cyprus`, within ranking-order tolerance | |
| Empty / 1-char / very-long token queries: do not fetch broad shards; return an empty or error-with-copy state without long stalls | |
| Wildcard literals (`%`, `_`) tokenize without errors | |
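The query-parsing checks above can be illustrated with a small parser that never reaches the shard-fetch step for degenerate input. This is a sketch, not the project's tokenizer: the stopword list, the length limits, the empty-state copy, and the choice to treat `%` and `_` as separators are all assumptions made for the example.

```python
import re

# Hypothetical values; the real stopword list and limits live with the tokenizer.
STOPWORDS = frozenset({"a", "an", "the", "of", "from", "in", "and"})
MIN_TOKEN_LEN = 2
MAX_TOKEN_LEN = 64

# Runs of letters and digits. Punctuation, "%" and "_" act as separators here.
_TOKEN_RE = re.compile(r"[^\W_]+")


def parse_query(raw):
    """Return a dict describing what the UI should do with `raw`.

    state == "search": `terms` holds de-duplicated, order-preserving terms.
    state == "empty":  nothing searchable remained; show `message`, fetch nothing.
    """
    terms = []
    for token in _TOKEN_RE.findall(raw):
        token = token.casefold()
        if token in STOPWORDS:
            continue
        if not (MIN_TOKEN_LEN <= len(token) <= MAX_TOKEN_LEN):
            continue
        if token not in terms:  # drop duplicate terms, keep first occurrence
            terms.append(token)
    if not terms:
        return {
            "state": "empty",
            "terms": [],
            "message": "No searchable words in that query. Try a more specific term.",
        }
    return {"state": "search", "terms": terms, "message": ""}
```

With these rules `pottery pottery cyprus`, `pottery from Cyprus`, and `pottery cyprus` all parse to the same term list, and `a the of`, an empty string, a single character, or a bare `%` all land in the controlled empty state. For the parity check, the same rules would have to be mirrored exactly in the JS tokenizer and compared token by token.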
Display + composition
| check | pass? |
|---|---|
| Missing display-join rows: a substrate hit whose `pid` has no row in `samples_map_lite` does not crash and does not silently drop a top hit (either show it with a placeholder or document it as a known limit) | |
| Filter composition matches a labeled expectation, in one of two modes per (query, filter) pair. (a) The pair has a hand-labeled expected filtered top-K (in `tests/search_benchmark.json`); the substrate's filtered top-K must match it. (b) The filter is chosen such that ALL hand-labeled unfiltered top-K results satisfy it (e.g., a source filter whose set covers every top-K result's source); the filtered top-K must equal the unfiltered top-K. The earlier "implicitly satisfies" wording was too loose: a top result that doesn't satisfy the filter legitimately drops out, so a raw top-K change is not necessarily a bug, and the invariant has to be tied to a labeled expectation. Tested on at least 3 distinct (query, filter) pairs. | |
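Both checks in this table can be expressed as short assertions over PID lists. The sketch below is illustrative only: the function names are hypothetical, "match" is read as ordered equality (set equality would be a looser reading), and the placeholder row shape is invented for the example.

```python
def check_filter_composition(unfiltered_top_k, filtered_top_k, satisfies_filter,
                             expected_filtered=None):
    """Check one (query, filter) pair against its labeled expectation.

    Mode (a): `expected_filtered` is the hand-labeled filtered top-K.
    Mode (b): no label is given, so the filter must be one that every
              unfiltered top-K result satisfies, and filtering must then
              leave the top-K unchanged.
    """
    if expected_filtered is not None:
        return filtered_top_k == expected_filtered
    if not all(satisfies_filter(pid) for pid in unfiltered_top_k):
        raise ValueError("mode (b) needs a filter satisfied by every unfiltered top-K result")
    return filtered_top_k == unfiltered_top_k


def join_display_rows(hit_pids, display_rows):
    """Attach display rows to hits, keeping hits that have no row.

    `display_rows` maps pid -> row dict (standing in for samples_map_lite).
    A missing row becomes a placeholder instead of a crash or a dropped hit.
    """
    return [
        display_rows.get(pid, {"pid": pid, "label": "(details unavailable)", "placeholder": True})
        for pid in hit_pids
    ]
```

Raising on a mode (b) pair whose filter excludes a labeled result keeps a badly chosen test pair from being misread as a substrate bug.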
Two outcomes
GO
- Remove the `?fts=v1` flag, route `doSearch()` permanently to substrate path, deprecate the ILIKE path.
- Update `query-spec.qmd:225` to describe the substrate-backed search.

A v1 GO does not close the hosted-search-backend question. It defers it. See NO-GO framing below for why hosted search remains a permanent contingency for v2+ requirements (richer analyzers, phrase search, typo tolerance, v2 field growth).
NO-GO
- File the `Explorer FTS Track 6: Hosted-search backend` issue with:
  - `searchText` semantics from `query-spec.qmd:213-221`
- Keep the `?fts=v1` flag in place as a measurement tool until the hosted backend lands.

Hosted-search backend as a permanent contingency
The Track 6 hosted-search-backend issue may be triggered by either:
- (a) v1 GO/NO-GO failure: at least one cell fails the gate above.
- (b) Post-ship v2+ requirements: even on a v1 GO, future quality requirements (phrase search, typo tolerance, richer analyzers, v2 field growth that exceeds the static substrate's byte budget) may exceed what a static-Parquet substrate can deliver. When that happens, Track 6 fires for the same reasons the budget data would have triggered it under (a).

Both triggers file the same downstream issue with the same starter requirements doc.
Refs
#165, #169, #171