
tests: broaden fixture corpus quality coverage #17

Merged
gaelic-ghost merged 1 commit into main from tests/broader-corpus-quality
May 2, 2026

Conversation

@gaelic-ghost
Owner

Summary

  • add synthetic near-miss and longer-body records to the Gutenberg fixture corpus
  • cover focused-vs-scattered all-term ranking and longer-body snippet selection in in-memory FetchKit tests
  • add SearchKit parity coverage for the same fixture behavior
  • reward tighter all-term evidence in the in-memory ranker so focused passages beat scattered matches
  • document why the synthetic fixture records exist
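The ranking change in the summary can be illustrated with a small sketch. This is not the code from the PR: `smallestCoveringSpan` and `compactnessBonus` are hypothetical names, and the `1 / (1 + span)` bonus is an assumed formula chosen only to show how a focused passage (all terms close together) can outscore a scattered one.

```swift
/// Hypothetical helper: the smallest distance (in characters) of a window that
/// contains at least one location from every term's location list.
/// Returns nil when any term has no match, or when there are no terms.
func smallestCoveringSpan(_ locationsPerTerm: [[Int]]) -> Int? {
    guard !locationsPerTerm.isEmpty,
          locationsPerTerm.allSatisfy({ !$0.isEmpty }) else { return nil }

    // Flatten into (position, term) pairs and sort by position.
    var events: [(position: Int, term: Int)] = []
    for (termIndex, locations) in locationsPerTerm.enumerated() {
        for location in locations {
            events.append((position: location, term: termIndex))
        }
    }
    events.sort { $0.position < $1.position }

    // Classic sliding window: grow on the right, shrink on the left
    // while every term is still represented inside the window.
    var counts = Array(repeating: 0, count: locationsPerTerm.count)
    var covered = 0
    var best = Int.max
    var left = 0
    for right in events.indices {
        let rightTerm = events[right].term
        if counts[rightTerm] == 0 { covered += 1 }
        counts[rightTerm] += 1
        while covered == locationsPerTerm.count {
            best = min(best, events[right].position - events[left].position)
            let leftTerm = events[left].term
            counts[leftTerm] -= 1
            if counts[leftTerm] == 0 { covered -= 1 }
            left += 1
        }
    }
    return best
}

/// Assumed scoring shape: tighter windows earn a larger bonus.
func compactnessBonus(_ locationsPerTerm: [[Int]]) -> Double {
    guard let span = smallestCoveringSpan(locationsPerTerm) else { return 0 }
    return 1.0 / (1.0 + Double(span))
}

let focused = compactnessBonus([[10, 200], [14]])   // terms 4 characters apart
let scattered = compactnessBonus([[10], [500]])     // terms 490 characters apart
```

Under these assumptions `focused` is larger than `scattered`, which is the ordering the new fixture tests are described as checking.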

Verification

  • swift test --filter FixtureCorpusQualityTests
  • swift test --filter SearchKitFetchIndexTests/testSearchKitFetchIndexMatchesFixtureCorpusNearMissAndLongBodyBehavior
  • swift test
  • scripts/repo-maintenance/validate-all.sh

@gaelic-ghost gaelic-ghost merged commit 532579a into main May 2, 2026
1 check passed
@gaelic-ghost gaelic-ghost deleted the tests/broader-corpus-quality branch May 2, 2026 21:28

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0305735ba0


while searchStart < lowercaseText.endIndex,
      let range = lowercaseText.range(of: term, range: searchStart..<lowercaseText.endIndex) {
    locations.append(lowercaseText.distance(from: lowercaseText.startIndex, to: range.lowerBound))
P2: Avoid quadratic index-distance scans in term location loop

The new compactness scorer makes every .allTerms query walk each matching document body via termLocations, and this loop computes distance(from: startIndex, to:) for every hit. On long texts with frequent terms (for example, a common word appearing thousands of times), those repeated distance calculations accumulate to roughly O(n²) work per term, which can make in-memory search latency spike substantially compared with the previous constant-time scoring path. Converting the search text to a random-access representation once (or tracking offsets incrementally) avoids this regression.
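A minimal sketch of the "tracking offsets incrementally" option follows. It is not the fix applied in the repository. The function name and signature are hypothetical, and it assumes matches are non-overlapping (the search resumes at the end of each match), which may differ from how the original loop advances `searchStart`.

```swift
import Foundation

/// Hypothetical rewrite of the term-location loop: returns the character
/// offset of each non-overlapping occurrence of `term` in `lowercaseText`.
func termLocations(of term: String, in lowercaseText: String) -> [Int] {
    guard !term.isEmpty else { return [] }

    var locations: [Int] = []
    var searchStart = lowercaseText.startIndex
    // Character offset of `searchStart`, maintained alongside the index so
    // we never have to measure from `startIndex` again.
    var offsetAtSearchStart = 0

    while searchStart < lowercaseText.endIndex,
          let range = lowercaseText.range(of: term, range: searchStart..<lowercaseText.endIndex) {
        // Measure only the gap between the previous search start and this hit.
        offsetAtSearchStart += lowercaseText.distance(from: searchStart, to: range.lowerBound)
        locations.append(offsetAtSearchStart)

        // Step past the match (assumption: overlapping matches are not needed).
        offsetAtSearchStart += lowercaseText.distance(from: range.lowerBound, to: range.upperBound)
        searchStart = range.upperBound
    }
    return locations
}

let hits = termLocations(of: "ab", in: "xxabyyabab")
```

Because each `distance(from:to:)` call covers only text that has not been measured before, the total index walking stays roughly proportional to the text length rather than growing with the number of hits times the text length.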

