# Explorer FTS Track 5: GO/NO-GO decision gate (#172)

> **Updated 2026-05-08** (rounds 1 + 2 per Codex review on #165). Round 1 added quality-gate cells. Round 2 sharpened: (a) "non-empty" is necessary but not sufficient — concept-only and stopword-heavy checks now require top-K relevance, not just any results; (b) added pathological-behavior hard-fails covering tokenizer parity, all-stopword queries, duplicate terms, edge-length tokens, missing display-join rows, filter composition; (c) NO-GO framing makes hosted-search a permanent contingency for *either* v1 failure or future v2+ quality requirements.

Sub-issue of #165. **Depends on #171 (browser query prototype + benchmark data).**

## Goal

Mechanical decision gate. **No budget renegotiation here** — the budgets in #169's `SEARCH_INDEX_V1.md` are the contract.

## Decision criterion

> Does the prototype meet **ALL** of the cells below — every latency/bytes target, every quality target, **and** every hard-fail check?

A fast-but-mediocre search that fails any quality cell or any hard-fail check is **NOT a GO**, regardless of how it performs on the latency/bytes table.
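
The all-or-nothing rule above can be sketched as a small function. This is only an illustration: the `Cell` type, the `decide` name, and the sample values are hypothetical and not part of #169 or #171. An unfilled cell (`passed=None`) is treated as a failure here, on the assumption that a gate with blank cells cannot be a GO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One row of a gate table (hypothetical structure)."""
    name: str
    kind: str                 # "performance", "quality", or "hard_fail"
    passed: Optional[bool]    # None means the cell is still "(fill)"


def decide(cells):
    """Return ("GO", []) only if every cell is explicitly True.

    Any False or unfilled cell yields ("NO-GO", [names of those cells]).
    An empty cell list is also a NO-GO: nothing was measured.
    """
    failed = [c.name for c in cells if c.passed is not True]
    verdict = "GO" if cells and not failed else "NO-GO"
    return verdict, failed


# Illustrative values only; these are not measured results.
cells = [
    Cell("cold first search (P50)", "performance", True),
    Cell("top-10 overlap vs hand-labeled set", "quality", False),
    Cell("tokenizer parity", "hard_fail", True),
]
verdict, failed = decide(cells)
print(verdict, failed)
```

Because the verdict depends only on the list of failed cells, the same list can be attached to the Track 6 issue as the failed-cell data.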

### Performance gates (hard)

| metric                                          | contract   | prototype | pass? |
|-------------------------------------------------|------------|-----------|-------|
| cold first search (P50)                         | ≤ 2 s      | (fill)    |       |
| warm repeat-same-query search                   | ≤ 500 ms   | (fill)    |       |
| warm new-query-after-warm-up search             | ≤ 500 ms   | (fill)    |       |
| filter-composed cold search                     | ≤ 3 s      | (fill)    |       |
| bytes transferred cold                          | ≤ 5 MB     | (fill)    |       |
| bytes transferred warm                          | ≤ 1 MB     | (fill)    |       |
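
A minimal sketch of how the contract column could be checked mechanically. The dictionary keys, the `check_budgets` name, and the sample latencies are hypothetical. The sketch also assumes 1 MB = 1,000,000 bytes, which the table does not specify, and that each warm and filter-composed number is already a single aggregate from the #171 benchmark.

```python
import statistics

# Contract values copied from the performance table (milliseconds and bytes).
# Assumption: 1 MB = 1_000_000 bytes; the issue does not say MB vs MiB.
BUDGETS = {
    "cold_first_search_p50_ms": 2_000,
    "warm_repeat_same_query_ms": 500,
    "warm_new_query_ms": 500,
    "filter_composed_cold_ms": 3_000,
    "bytes_cold": 5_000_000,
    "bytes_warm": 1_000_000,
}


def p50(samples_ms):
    """Median of a list of latency samples."""
    return statistics.median(samples_ms)


def check_budgets(measured):
    """Map each budget name to True/False; a missing measurement is a fail."""
    return {
        name: name in measured and measured[name] <= limit
        for name, limit in BUDGETS.items()
    }


# Illustrative samples only; these are not benchmark results.
measured = {
    "cold_first_search_p50_ms": p50([1800, 2100, 1900]),
    "warm_repeat_same_query_ms": 120,
    "bytes_cold": 5_200_000,
}
for name, ok in check_budgets(measured).items():
    print(f"{name}: {'pass' if ok else 'FAIL'}")
```

Treating a missing measurement as a fail keeps a partially filled table from being read as a pass.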

### Quality gates (hard, not advisory)

| metric                                                                                          | contract   | prototype | pass? |
|-------------------------------------------------------------------------------------------------|------------|-----------|-------|
| top-3 overlap vs hand-labeled set                                                               | ≥ TBD%     | (fill)    |       |
| top-10 overlap vs hand-labeled set                                                              | ≥ TBD%     | (fill)    |       |
| top-10 overlap vs DuckDB FTS local oracle                                                       | ≥ TBD%     | (fill)    |       |
| **concept-only top-3 relevance**: each of `ceramic`, `bone`, `mammal` (+ 1-2 more) returns ≥ 2 of 3 hand-labeled known-good PIDs | 2/3 each | (fill) | |
| **stopword-heavy near-equivalence**: top-10 Jaccard between the results for `pottery from Cyprus` and `pottery Cyprus` (stopword-stripped form) | ≥ 0.8 | (fill) | |

Numeric thresholds get filled in once #167 baseline + #171 prototype + DuckDB FTS oracle numbers land, so we know what "beats ILIKE" and "approaches BM25 oracle" mean on the canonical query set.
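
For reference, the three kinds of quality metric in the table could be computed roughly as follows. The issue does not define "overlap" precisely, so this sketch assumes it means the fraction of the reference top-K that also appears in the prototype's top-K, ignoring order. All function names and PIDs below are hypothetical.

```python
def top_k_overlap(results, reference, k):
    """Fraction of the reference top-k found in the results top-k (order ignored)."""
    ref = set(reference[:k])
    if not ref:
        return 0.0
    return len(set(results[:k]) & ref) / len(ref)


def top_k_jaccard(results_a, results_b, k):
    """Jaccard similarity of two top-k result sets.

    Two empty result sets score 0.0 here (a deliberate choice), so an
    empty pair cannot pass the near-equivalence gate by accident.
    """
    a, b = set(results_a[:k]), set(results_b[:k])
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def concept_top3_ok(results, known_good, required=2):
    """True if at least `required` of the top-3 results are known-good PIDs."""
    return len(set(results[:3]) & set(known_good)) >= required


# Illustrative PIDs only.
with_stopword = ["p1", "p2", "p3", "p4", "p5"]
stripped = ["p1", "p2", "p3", "p4", "p6"]
print(top_k_jaccard(with_stopword, stripped, k=5))   # 4 shared / 6 distinct
print(concept_top3_ok(["p9", "p1", "p2"], known_good={"p1", "p2", "p3"}))
```

Note that Jaccard penalizes each swapped result twice (once in the intersection, once in the union), so a 0.8 bar on a top-10 allows at most one differing result per list.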

### Hard-fail checks (any single failure = NO-GO)

#### Semantics

| check                                                                                 | pass? |
|---------------------------------------------------------------------------------------|-------|
| Concept-only queries (`ceramic`, `bone`, `mammal`) all return non-empty results       |       |
| Concept-only queries hit the top-3 relevance bar in the quality table above           |       |
| Stopword-heavy queries (`pottery from Cyprus`) return non-empty results               |       |
| Stopword-heavy queries hit the near-equivalence bar in the quality table above        |       |
| Diacritic queries (`Çatalhöyük`) match the diacritic-stripped index                   |       |
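
The diacritic check presumes that the query side folds accents the same way the index build did. A minimal sketch of one common folding approach, using Unicode decomposition from the Python standard library; whether the actual index uses exactly this normalization is an assumption, and `fold_diacritics` is a hypothetical name.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks after NFKD decomposition, then case-fold.

    Letters with no canonical decomposition (for example "ø") pass
    through unchanged, so they need an explicit mapping if the index
    folds them.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold_diacritics("Çatalhöyük"))  # catalhoyuk
```

Whatever folding is chosen, the Python build step and the JS query path have to apply the identical rule, which is what the tokenizer-parity check in the next table enforces.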

#### Tokenizer + query parsing

| check                                                                                          | pass? |
|------------------------------------------------------------------------------------------------|-------|
| **Tokenizer parity**: Python and JS produce identical token sequences for every term in the curated benchmark (not just the regression set) | |
| **All-stopword query** (`a the of`) yields a *controlled* empty state with helpful copy, not an error or a full-corpus dump | |
| **Duplicate terms** (`pottery pottery cyprus`) produce the same top-K result identity as `pottery cyprus`, within ranking-order tolerance | |
| **Empty / 1-char / very-long token queries**: do not fetch broad shards; return an empty or error-with-copy state without long stalls | |
| Wildcard literals (`%`, `_`) tokenize without errors                                          |       |
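
The query-parsing checks above can be illustrated with a small Python parser. This is a sketch under stated assumptions, not the #171 tokenizer: the stopword list, the length bounds, the empty-state copy, and the `parse_query` and `parity_mismatches` names are all hypothetical.

```python
import re

# Hypothetical values; the real stopword list and limits live with the tokenizer.
STOPWORDS = frozenset({"a", "an", "the", "of", "from", "in", "and"})
MIN_LEN, MAX_LEN = 2, 64
EMPTY_COPY = "No searchable terms. Try a more specific word."

# Runs of letters/digits only; "%" and "_" act as separators, never as wildcards.
TOKEN_RE = re.compile(r"[^\W_]+")


def parse_query(query):
    """Tokenize a query, dropping stopwords, duplicates, and edge-length tokens."""
    terms = []
    for tok in TOKEN_RE.findall(query.casefold()):
        if tok in STOPWORDS or not (MIN_LEN <= len(tok) <= MAX_LEN):
            continue
        if tok not in terms:          # keep first occurrence, preserve order
            terms.append(tok)
    if not terms:
        # Controlled empty state: no shard fetch, no error, no corpus dump.
        return {"state": "empty", "terms": [], "message": EMPTY_COPY}
    return {"state": "ok", "terms": terms, "message": ""}


def parity_mismatches(queries, js_terms_by_query):
    """Queries whose Python terms differ from terms recorded from the JS side."""
    return [q for q in queries
            if parse_query(q)["terms"] != js_terms_by_query.get(q)]


print(parse_query("a the of")["state"])
print(parse_query("pottery pottery cyprus")["terms"])
print(parse_query("50% off_site")["terms"])
```

Deduplicating at parse time makes the duplicate-terms check hold by construction for result identity. Ranking could still differ if the scorer counts query-term frequency elsewhere.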

#### Display + composition

| check                                                                                          | pass? |
|------------------------------------------------------------------------------------------------|-------|
| **Missing display-join rows**: substrate hit whose `pid` has no row in `samples_map_lite` does not crash and does not silently drop a top hit (either show with placeholder or document as known limit) | |
| **Filter composition matches a labeled expectation.** Each (query, filter) pair is tested in one of two modes. **(a)** The pair has a hand-labeled expected *filtered* top-K (in `tests/search_benchmark.json`), and the substrate's filtered top-K must match it. **(b)** The filter is chosen so that ALL hand-labeled unfiltered top-K results satisfy it (e.g., a source filter whose set covers every top-K result's source), and the filtered top-K must equal the unfiltered top-K. A top result that does not satisfy the filter legitimately drops out, so a raw top-K change is not necessarily a bug; this is why the invariant is tied to a labeled expectation rather than the earlier, looser "implicitly satisfies" wording. Tested on at least 3 distinct (query, filter) pairs. | |
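
Both checks in this table can be sketched in a few lines. The sketch assumes that "match" and "equal" mean identical PIDs in identical order, which the issue does not state, and it takes the "show with placeholder" option for missing display rows. The function names, placeholder copy, and sample data are hypothetical.

```python
PLACEHOLDER_LABEL = "(details unavailable)"  # hypothetical copy


def join_display(hit_pids, display_rows):
    """Attach display rows to ranked hits without dropping or reordering any hit."""
    joined = []
    for pid in hit_pids:
        row = display_rows.get(pid)
        if row is None:
            # No row in the display table: keep the hit, mark it as a placeholder.
            row = {"pid": pid, "label": PLACEHOLDER_LABEL, "placeholder": True}
        joined.append(row)
    return joined


def check_mode_a(filtered_top_k, expected_filtered_top_k):
    """Mode (a): the filtered top-K equals the hand-labeled filtered top-K."""
    return list(filtered_top_k) == list(expected_filtered_top_k)


def check_mode_b(unfiltered_top_k, filtered_top_k, labeled_top_k, satisfies_filter):
    """Mode (b): the filter covers every labeled result, so it must be a no-op."""
    if not all(satisfies_filter(pid) for pid in labeled_top_k):
        raise ValueError("pair does not qualify for mode (b): use mode (a)")
    return list(filtered_top_k) == list(unfiltered_top_k)


# Illustrative data only.
sources = {"p1": "A", "p2": "B", "p3": "A"}


def in_sources(pid):
    return sources[pid] in {"A", "B"}


top_k = ["p1", "p2", "p3"]
print(check_mode_b(top_k, top_k, top_k, in_sources))
print(join_display(["p1", "p7"], {"p1": {"pid": "p1", "label": "Bowl"}}))
```

Raising when the mode (b) precondition fails keeps a mislabeled pair from silently passing or failing for the wrong reason.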

## Two outcomes

### GO

- [ ] All performance cells pass.
- [ ] All quality cells pass.
- [ ] All hard-fail checks pass.
- [ ] Open ship issue: remove `?fts=v1` flag, route `doSearch()` permanently to substrate path, deprecate the ILIKE path.
- [ ] Update `query-spec.qmd:225` to describe the substrate-backed search.
- [ ] Close #165 once ship issue lands.

A v1 GO **does not** close the hosted-search-backend question; it defers it. See "Hosted-search backend as a permanent contingency" below for why hosted search remains a contingency for v2+ requirements (richer analyzers, phrase search, typo tolerance, v2 field growth).

### NO-GO

- [ ] At least one cell fails.
- [ ] File `Explorer FTS Track 6: Hosted-search backend` issue with:
  - the failed-cell data attached (which budgets, which quality, which hard-fails)
  - a starter requirements doc referencing Solr `searchText` semantics from `query-spec.qmd:213-221`
  - the DuckDB FTS local oracle numbers from #171 §5 as the relevance bar to clear
  - explicit framing: hosted-search is the answer if the static substrate is structurally limited; static-site constraint should not permanently cap search quality
- [ ] Keep the `?fts=v1` flag in place as a measurement tool until the hosted backend lands.
- [ ] Close #165 with a pointer to the hosted-search issue.

### Hosted-search backend as a permanent contingency

The Track 6 hosted-search-backend issue may be triggered by **either**:

- **(a) v1 GO/NO-GO failure** — at least one cell fails the gate above.
- **(b) Post-ship v2+ requirements** — even on a v1 GO, future quality requirements (phrase search, typo tolerance, richer analyzers, v2 field growth that exceeds the static substrate's byte budget) may exceed what a static-Parquet substrate can deliver. When that happens, Track 6 fires for the same reasons the budget data would have triggered it under (a).

Both triggers file the same downstream issue with the same starter requirements doc.

## Refs

#165, #169, #171


