Merged
8 changes: 5 additions & 3 deletions README.md
@@ -21,7 +21,7 @@ An Apple-first Swift Package family for local document search and semantic retri

SwiftlyFetch is the umbrella product direction for a small family of Apple-first local search packages. The product goal is simple: hand the system a local corpus and get back a real search engine, with conventional search and semantic retrieval both living under one coherent Swift-native story. In practical terms, SwiftlyFetch is the family for "drop in a corpus, get back local search," with `FetchKit` covering conventional full-document search and `RAGKit` covering semantic retrieval over the same broader corpus model.

Today, the package exposes shipped semantic retrieval work through `RAGCore` and `RAGKit`, plus the first conventional-search foundation through `FetchCore` and `FetchKit`. `FetchCore` now owns the portable conventional-search vocabulary, the durable document-record model, and the indexing-changeset boundary. That record model carries first-class typed lifecycle and source fields like `kind`, `language`, `createdAt`, `updatedAt`, `sourceURI`, and `lastIndexedAt`, while leaving the freeform metadata bag string-based. `FetchCore` also distinguishes between the durable stored record, the lean search-facing document view, and the richer index-facing payload used by the sync boundary. `FetchKitLibrary` now supports a default in-memory construction path, and `FetchKit` includes a Core Data-backed `FetchDocumentStore`, a persisted pending-sync queue, and the first thin macOS SearchKit-backed index. Conventional-search results also carry field evidence through `matchedFields` and `snippetField`, so UI code can tell whether a result matched title text, body text, or both.
Today, the package exposes shipped semantic retrieval work through `RAGCore` and `RAGKit`, plus the first conventional-search foundation through `FetchCore` and `FetchKit`. `FetchCore` now owns the portable conventional-search vocabulary, the durable document-record model, and the indexing-changeset boundary. That record model carries first-class typed lifecycle and source fields like `kind`, `language`, `createdAt`, `updatedAt`, `sourceURI`, and `lastIndexedAt`, while leaving the freeform metadata bag string-based. `FetchCore` also distinguishes between the durable stored record, the lean search-facing document view, and the richer index-facing payload used by the sync boundary. `FetchKitLibrary` now supports a default in-memory construction path, and `FetchKit` includes a Core Data-backed `FetchDocumentStore`, a persisted pending-sync queue, and the first thin macOS SearchKit-backed index. Conventional-search results carry field evidence through `matchedFields` and `snippetField`, so UI code can tell whether a result matched title text, body text, or both. The default in-memory search path now also rewards tighter all-term evidence, so a focused passage can rank ahead of a scattered near-miss instead of relying on document ID tie-breaking.

The intended family split is:

@@ -30,7 +30,7 @@ The intended family split is:
- `FetchKit` for traditional search, with `FetchKitLibrary` as the first public facade and Core Data plus SearchKit as the intended Apple implementation model
- `SwiftlyFetch` as the umbrella story tying those sibling package surfaces together over time

That intended split does not change the current package boundary: `RAGKit` still owns semantic retrieval work, not conventional document search. The next family step is refinement, not first existence: improve conventional-search ranking and snippets, keep the persistent library surface polished, and continue making it realistic for one local corpus to support both traditional search and semantic retrieval without forcing those jobs into one module.
That intended split does not change the current package boundary: `RAGKit` still owns semantic retrieval work, not conventional document search. The next family step is caller-driven polish, not first existence: keep conventional-search result quality under pressure with fixture and app corpora, keep the persistent library surface polished, and continue making it realistic for one local corpus to support both traditional search and semantic retrieval without forcing those jobs into one module.

Platform-wise, the family target is still "macOS and iOS are both first-class," but the first concrete full-text backend is intentionally macOS-first. Apple documents Search Kit as a Mac app indexing and search framework, while Core Spotlight is the more obvious Apple-side indexing/search direction for iOS later. That means the current plan is not to pretend one backend fits both platforms immediately. Instead, `FetchCore` stays portable, `FetchKit` starts with the honest macOS path, and iOS remains a first-class family target through a future sibling backend rather than through fake cross-platform wording.

@@ -144,6 +144,7 @@ Current defaults:
- markdown images keep alt text primary in chunk text while recording image references as chunk metadata, and whitelisted HTML blocks currently cover `img` plus `details` / `summary`
- markdown fallback is selective: ordinary supported prose still chunks normally, but policy-rejected markdown like unsupported raw-HTML-only or reference-definition-only content does not fall back through the plain paragraph chunker
- conventional search now uses modest field-aware ranking, prefers title hits over body-only hits when both are relevant, and builds query-aware snippets with multi-term highlights instead of a single fixed-width first-term window
- in-memory all-term search gives a small boost to compact evidence, so focused passages rank ahead of scattered near-matches when they satisfy the same query terms
- conventional-search results report `matchedFields` and `snippetField`, keeping title-only snippets visible while letting consumers distinguish title evidence from body evidence
- `makeContext(...)` suppresses redundant same-document chunk text, groups annotated output by document, and skips annotated sections that only have room for labels
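
The `matchedFields` and `snippetField` defaults above can be pictured with a minimal sketch. The diff does not show the real `FetchSearchResult` declaration, so `EvidenceField`, `ResultEvidence`, `evidenceLabel(for:)`, and `bodyExcerpt(for:)` below are hypothetical stand-ins that only mirror the two property names, not the package's actual API:

```swift
// Hypothetical stand-in types; the real FetchKit declarations may differ.
enum EvidenceField: Hashable {
    case title
    case body
}

struct ResultEvidence {
    let matchedFields: Set<EvidenceField>  // every field that matched the query
    let snippetField: EvidenceField        // the field the snippet text came from
    let snippet: String
}

/// Builds a short "why did this result appear?" label from the matched fields.
func evidenceLabel(for result: ResultEvidence) -> String {
    switch (result.matchedFields.contains(.title), result.matchedFields.contains(.body)) {
    case (true, true): return "Matched title and body"
    case (true, false): return "Matched title"
    case (false, true): return "Matched body"
    case (false, false): return "No field evidence"
    }
}

/// Returns the snippet only when it is body evidence, so a title snippet
/// is never presented as if it were a body excerpt.
func bodyExcerpt(for result: ResultEvidence) -> String? {
    result.snippetField == .body ? result.snippet : nil
}
```

Under these assumptions, a simple result list can always render `snippet`, while a richer consumer checks `snippetField` before treating the text as body evidence.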

@@ -155,7 +156,7 @@ Supported today:
- use `FetchKitLibrary()` with a default in-memory backend or inject custom `FetchDocumentStore` and `FetchIndex` implementations explicitly
- use a real Core Data-backed `FetchDocumentStore` in `FetchKit` with the first thin macOS SearchKit index backend
- persist and retry pending index-sync work through `FetchKitLibrary.pendingIndexSyncs()` and `retryPendingIndexSyncs(...)`
- return conventional-search results with query-aware snippets, field-aware ranking, matched-field metadata, and snippet-source metadata across title and body matches
- return conventional-search results with query-aware snippets, field-aware ranking, compact-evidence ranking in the default in-memory path, matched-field metadata, and snippet-source metadata across title and body matches
- narrow retrieval with typed metadata filters
- preserve meaningful markdown structure for retrieval, including heading paths, list semantics, quote-heavy documents, code-heavy documents, short section breaks, images, and a narrow raw-HTML whitelist
- turn ranked search results into plain or annotated context text for downstream UI or model consumers
@@ -175,6 +176,7 @@ Current constraints:
- the SearchKit backend is macOS-first
- Natural Language asset-backed verification runs in local maintainer validation by default, but stays out of the default GitHub-hosted CI lane because hosted macOS still stalls in the asset-backed step
- the package family direction is broader than the currently shipped polished surface, especially on the `FetchKit` side
- conventional-search quality coverage uses a small checked-in Project Gutenberg fixture corpus plus synthetic near-miss and longer-body records; larger app-like corpora are still future validation work

If you want to contribute to the package itself, use [CONTRIBUTING.md](./CONTRIBUTING.md). Maintainer planning and architecture notes live under [docs/maintainers/](./docs/maintainers/).

9 changes: 7 additions & 2 deletions ROADMAP.md
@@ -175,7 +175,8 @@ In Progress
### Scope

- [x] Refine conventional-search ranking and snippet behavior now that the first SearchKit backend works end to end.
- [ ] Validate whether the current refinement pass is enough for ordinary app callers or whether another real-corpus quality pass is needed.
- [x] Validate the current refinement pass against a broader checked-in fixture corpus with near-miss ranking and longer-body snippet cases.
- [ ] Validate whether the current refinement pass is enough for ordinary app callers against larger real app corpora.
- [ ] Keep the public `FetchKitLibrary` surface polished as the conventional-search side moves from foundation into quality work.

### Tickets
@@ -184,7 +185,9 @@ In Progress
- [x] Improve snippet behavior and result presentation without bloating `FetchCore` into a larger query or rendering DSL.
- [x] Add the first checked-in fixture corpus and cover title/body result-evidence behavior across the default in-memory path and the macOS SearchKit-backed path.
- [x] Decide that title-only hits should keep title snippets while exposing `matchedFields` and `snippetField` so consumers can distinguish title evidence from body evidence.
- [ ] Audit broader real-corpus result quality now that field-aware ranking, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
- [x] Add broader fixture-corpus pressure for near-miss all-term ranking and longer-body snippet selection across the default in-memory path and the macOS SearchKit-backed path.
- [x] Refine the default in-memory all-term ranker so tighter evidence beats scattered term mentions instead of falling through to document ID tie-breaking.
- [ ] Audit larger app-like corpus result quality now that field-aware ranking, compact all-term evidence, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
- [ ] Keep the persistent `FetchKitLibrary` construction and search API surface under review as real callers exercise the current design.
- [ ] Explore an opt-in extended snippet surface that can use idle time to precompute short document summaries for larger records, with Apple's [`FoundationModels`](https://developer.apple.com/documentation/foundationmodels) or another local summarization path as the first candidate instead of making foreground full-text search wait on summarization.

@@ -223,6 +226,7 @@ Planned
- [ ] If parser-backed markdown chunking still leaves retrieval-quality gaps, add retrieval-specific chunking heuristics on top of the chosen markdown parser instead of rebuilding markdown parsing rules locally.
- [ ] If asset-backed automation becomes important again, evaluate a self-hosted macOS runner with prewarmed assets before retrying a hosted GitHub Actions lane.
- [ ] Consider a follow-on conventional-search quality pass only if real corpora show ranking, snippet, or result-presentation gaps beyond the current field-aware heuristics.
- [ ] Evaluate whether fixture-corpus coverage should grow through additional checked-in micro-records, generated local fixtures, or an opt-in live dataset lane before adopting a Swift Hub dependency.

## History

@@ -269,3 +273,4 @@ Planned
- Promoted the SearchKit-backed test suite from a local opt-in lane into normal XCTest validation and the default GitHub CI path once the lane proved fast and stable enough.
- Promoted the Natural Language integration lane into default local maintainer validation, but kept it out of GitHub-hosted CI after another hosted experiment remained stuck in the asset-backed step for minutes while the local path completed in seconds.
- Opened the next roadmap phase around SearchKit/Natural Language verification strategy, iOS conventional-search backend direction, and another caller-driven `FetchKitLibrary` polish pass if real usage shows it is needed.
- Broadened the checked-in fixture corpus with synthetic near-miss and longer-body records, added in-memory and SearchKit parity coverage for those cases, and refined in-memory all-term ranking so compact evidence beats scattered mentions.
5 changes: 3 additions & 2 deletions docs/maintainers/fetchkit-product-plan.md
@@ -83,8 +83,9 @@ Current status:
- because the Search Kit tests now finish quickly and pass reliably, the repo now runs them in ordinary local validation and the default GitHub macOS CI lane instead of keeping them behind a local-only opt-in gate
- the persistent `FetchKitLibrary` construction path is now intentionally caller-shaped around one storage location, with an Application Support default plus a direct directory override, instead of asking app code to assemble separate Core Data and Search Kit URLs itself
- the first refinement pass on conventional-search result quality is now in place: SearchKit scores are normalized per field, title hits get a modest weight bump, cross-field matches accumulate instead of collapsing to the single best field, and snippets now highlight multiple query terms instead of showing only the first term in a fixed-width window
- the default in-memory all-term ranker now gives a small compactness boost to tighter evidence, so a focused passage can rank ahead of a scattered near-miss when both documents satisfy the same query terms
- title-only hits intentionally keep a title snippet, and `FetchSearchResult` now reports `matchedFields` plus `snippetField` so consumers can distinguish title evidence from body evidence without losing the simple "why did this result appear?" explanation
- the first checked-in fixture corpus now covers both the default in-memory index path and the macOS SearchKit-backed path, using a tiny attributed Project Gutenberg sample from Hugging Face instead of making CI download a live dataset
- the first checked-in fixture corpus now covers both the default in-memory index path and the macOS SearchKit-backed path, using a tiny attributed Project Gutenberg sample from Hugging Face plus small synthetic near-miss and longer-body records instead of making CI download a live dataset
- the CI investigation on GitHub-hosted macOS found that the Core Data-backed store path could abort under Swift Testing with `Incorrect actor executor assumption`, even after global test parallelism was disabled
- that investigation surfaced two store-shape fixes worth keeping regardless of the runner: the durable Core Data store should use a private-queue background context instead of `viewContext`, and it should use Core Data's async `perform` API directly instead of manually bridging context work through checked continuations
- the Core Data-backed store coverage now lives on XCTest rather than Swift Testing so the package keeps the newer test surface where it is stable while reserving the older runner for framework-heavy Core Data verification
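
The field-aware scoring described in the status list above (per-field normalization, a modest title weight, and cross-field accumulation) can be sketched roughly as follows. This is an illustration under assumptions, not the package's implementation: `combineFieldScores` is a hypothetical helper, the dictionaries of raw per-field scores keyed by document ID are an assumed input shape, and the `1.25` title weight is an invented value.

```swift
/// Hypothetical sketch: combine raw per-field scores into one score per document.
/// - title/body: raw backend scores keyed by document ID (assumed input shape).
/// - titleWeight: assumed "modest" bump for title evidence.
func combineFieldScores(
    title: [String: Double],
    body: [String: Double],
    titleWeight: Double = 1.25
) -> [String: Double] {
    // Normalize each field independently so its best hit scores 1.0.
    func normalized(_ scores: [String: Double]) -> [String: Double] {
        guard let maxScore = scores.values.max(), maxScore > 0 else { return [:] }
        return scores.mapValues { $0 / maxScore }
    }

    var combined: [String: Double] = [:]
    // Accumulate across fields instead of keeping only the single best field.
    for (id, score) in normalized(title) {
        combined[id, default: 0] += score * titleWeight
    }
    for (id, score) in normalized(body) {
        combined[id, default: 0] += score
    }
    return combined
}
```

With this shape, a document that matches in both title and body can outrank one that has only the strongest title hit, which is the accumulation behavior the status note describes.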
@@ -162,7 +163,7 @@ The next work is refinement, not first architecture:

- keep the persistent `FetchKitLibrary` surface polished as real callers exercise it
- keep the SearchKit-backed path inside ordinary validation unless a future framework regression forces it back out
- use broader fixture corpora or real app corpora to decide whether the current ranking, snippet, and result-evidence heuristics are already enough for ordinary callers
- use larger app corpora to decide whether the current ranking, snippet, and result-evidence heuristics are already enough for ordinary callers now that the checked-in fixture corpus covers title evidence, body evidence, near misses, and longer-body snippets
- explore opt-in extended snippets later as background summary metadata for larger documents, not as work that foreground full-text search has to perform before returning results

## First Core Data Entity Shape
15 changes: 11 additions & 4 deletions docs/maintainers/fixture-corpus.md
@@ -2,13 +2,13 @@

## Purpose

This note records the first checked-in fixture corpus used for `FetchKit` conventional-search quality tests.
This note records the checked-in fixture corpus used for `FetchKit` conventional-search quality tests.

The job of this fixture is deliberately narrow: give the default `FetchKitLibrary` and macOS SearchKit tests enough title/body variety to characterize ranking, snippet, and result-evidence behavior without making local or hosted CI download a dataset.

## Current Fixture Source

The first mini corpus is derived from the [`zkeown/gutenberg-corpus`](https://huggingface.co/datasets/zkeown/gutenberg-corpus) dataset on Hugging Face.
The first source-derived mini corpus comes from the [`zkeown/gutenberg-corpus`](https://huggingface.co/datasets/zkeown/gutenberg-corpus) dataset on Hugging Face.

Why this source fits the first pass:

@@ -20,6 +20,11 @@ Why this source fits the first pass:

The fixture records live in `Tests/FetchKitTests/Fixtures/GutenbergMiniCorpus.swift`. Each source-derived record carries dataset, config, split, row, and Gutenberg ID metadata so the sample remains attributable and replaceable. The fixture also includes small synthetic near-miss and longer-body records derived from the same topic shape. Those synthetic records exist to stress ranking and snippet selection without expanding the checked-in corpus into a large text dump.

Current synthetic records:

- `fixture-botany-near-miss` scatters the query terms from the seed-storage chapter across an unrelated classroom-supply note. It protects the user-facing expectation that a focused passage about storage of food in seeds ranks ahead of a document that merely mentions storage, food, and seeds separately.
- `fixture-long-frontier-body` places the useful passage in the middle of a longer body. It protects the expectation that snippet selection moves toward the relevant passage and keeps visible truncation markers when the returned snippet is cropped.
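
The longer-body expectation can be illustrated with a small sketch of query-aware snippet selection. It is not the package's snippet builder: `makeSnippet` is a hypothetical function, the word-window approach and the `maxWords` default are assumptions, and `…` is an assumed truncation marker since this note does not say which marker the package emits.

```swift
import Foundation

/// Hypothetical sketch: pick the first fixed-size word window that contains the
/// most query-term hits, and mark any cropped edge with an assumed "…" marker.
func makeSnippet(body: String, query: String, maxWords: Int = 12) -> String {
    let words = body.split(separator: " ").map(String.init)
    // Short bodies fit entirely, so nothing is cropped and no marker is needed.
    guard words.count > maxWords else { return body }

    let terms = Set(query.lowercased().split(separator: " ").map(String.init))
    func isHit(_ word: String) -> Bool {
        terms.contains(word.lowercased().trimmingCharacters(in: .punctuationCharacters))
    }

    var bestStart = 0
    var bestHits = -1
    var hits = 0
    for i in 0..<words.count {
        if isHit(words[i]) { hits += 1 }
        // Drop the word that just slid out of the window on the left.
        if i >= maxWords, isHit(words[i - maxWords]) { hits -= 1 }
        let start = i - maxWords + 1
        if start >= 0, hits > bestHits {
            bestHits = hits
            bestStart = start
        }
    }

    let end = bestStart + maxWords
    var snippet = words[bestStart..<end].joined(separator: " ")
    if bestStart > 0 { snippet = "…" + snippet }
    if end < words.count { snippet += "…" }
    return snippet
}
```

For a body whose useful passage sits in the middle, this sketch moves the window to that passage and marks both cropped edges, which is the behavior `fixture-long-frontier-body` is meant to protect.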

## Result Evidence Policy

The first fixture pass settled the title-only snippet policy for the current public surface:
@@ -31,6 +36,8 @@ The first fixture pass settled the title-only snippet policy for the current pub

In practical terms, simple result lists can keep rendering a snippet for every explained hit, while richer consumers can avoid treating a title snippet as body evidence.

The second fixture pass added a compact-evidence ranking expectation for the default in-memory path. For `allTerms` search, a document that places all terms close together should beat a near-miss that satisfies the same terms only through scattered mentions. That keeps the default backend closer to what an app user means by "this result is about my query" without turning `FetchCore` into a larger ranking DSL.
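
One way to express that compact-evidence expectation is a boost based on the smallest token window that covers every query term. The sketch below is an assumption about how such a boost could work, not the ranker shipped in the package: `tokenize`, `smallestCoveringWindow`, `compactnessBoost`, and the `maxBoost` value are all hypothetical.

```swift
import Foundation

/// Hypothetical tokenizer: lowercase words split on non-alphanumeric characters.
func tokenize(_ text: String) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !$0.isLetter && !$0.isNumber })
        .map(String.init)
}

/// Length in tokens of the smallest window containing every term,
/// or nil when at least one term never appears.
func smallestCoveringWindow(tokens: [String], terms: Set<String>) -> Int? {
    guard !terms.isEmpty else { return nil }
    var counts: [String: Int] = [:]
    var covered = 0
    var best: Int? = nil
    var left = 0
    for (right, token) in tokens.enumerated() {
        guard terms.contains(token) else { continue }
        counts[token, default: 0] += 1
        if counts[token] == 1 { covered += 1 }
        // Shrink from the left while the window still covers all terms.
        while covered == terms.count {
            let width = right - left + 1
            best = min(best ?? width, width)
            let leaving = tokens[left]
            if terms.contains(leaving) {
                counts[leaving, default: 0] -= 1
                if counts[leaving] == 0 { covered -= 1 }
            }
            left += 1
        }
    }
    return best
}

/// Small additive boost: largest when all terms are adjacent, shrinking as the
/// covering window widens, and zero when the document misses a term.
func compactnessBoost(body: String, query: String, maxBoost: Double = 0.25) -> Double {
    let terms = Set(tokenize(query))
    guard let window = smallestCoveringWindow(tokens: tokenize(body), terms: terms) else {
        return 0
    }
    return maxBoost * Double(terms.count) / Double(window)
}
```

Because the boost is bounded by `maxBoost`, it can break ties between documents that satisfy the same terms without overriding larger field-aware score differences, which matches the "small boost" framing used for the in-memory path.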

## Hugging Face Dependency Boundary

Do not add a Hugging Face Swift dependency for the default fixture lane yet. The current checked-in fixture keeps CI deterministic and avoids adding a network, token, cache, or package-resolution requirement to ordinary tests.
@@ -56,6 +63,6 @@ Hugging Face documents dataset parquet discovery through the Dataset Viewer serv
Use this fixture to keep the settled Milestone 4 result-evidence behavior honest while broader quality work continues:

- whether the current ranking and snippet heuristics are enough for ordinary app callers
- whether a larger fixture corpus exposes ranking or snippet gaps that the mini corpus cannot show
- whether near-miss records and longer-body records keep ranking and snippet behavior aligned between the default in-memory path and the SearchKit-backed path
- whether larger app-like corpora expose ranking or snippet gaps that the mini corpus cannot show
- whether additional checked-in micro-records, generated local fixtures, or an opt-in live dataset lane would add enough value to justify the extra maintenance
- whether future extended snippets should be backed by precomputed summaries for larger documents rather than by foreground search-time work