perf: improve wildcard query perf with predicate and contains-check pushdown by cheb0 · Pull Request #397 · ozontech/seq-db

cheb0 · 2026-04-03T06:42:33Z

Description

Currently, wildcard queries spend only a fraction of their time actually calling bytes.Index; the rest is overhead around that call. This PR partially addresses that.

This PR pushes pattern.Searcher down to the Block level, so that a Block can stream its tokens through the searcher. For ordinary wildcards like *error* there is a direct FindContains method, which is even faster.

For example, query message:*foobarf*:
main: 86 ms
using FindToken: 50 ms
using FindContains: 37 ms

So, FindContains simply drops costly abstractions to gain additional performance. We could also provide a dedicated func like FindSuffix, for example. This is a typical case where performance requires additional code.

| Query | Type | Ids | cold (main), ms | hot (main), ms | cold (branch), ms | hot (branch), ms | cold diff | hot diff |
|---|---|---|---|---|---|---|---|---|
| `trace_id:*foobar` | reg | 0 | 18.76 | 4.37 | 16.14 | 1.84 | -14% | -57.9% |
| `k8s_pod:*6` | reg | 100 | 13.3 | 0.67 | 13.03 | 0.47 | -2% | -29.9% |
| `message:err` | reg | 100 | 138.72 | 26.97 | 120.27 | 12.36 | -13.3% | -54.2% |
| `message:foo` | reg | 100 | 77.69 | 27.08 | 60.54 | 11.84 | -22.1% | -56.3% |
| `message:request` | reg | 100 | 124.95 | 25.45 | 104.13 | 10.37 | -16.7% | -59.3% |
| `message:foobarfoobar*` | reg | 0 | 187.54 | 64.25 | 147.31 | 30.5 | -21.5% | -52.5% |
| `message:foobarfoobar` | reg | 0 | 184.93 | 63.87 | 121.51 | 20.39 | -34.3% | -68.1% |
| `message:very_very_message_aggregator_events` | reg | 0 | 173.45 | 51.62 | 116.9 | 12.81 | -32.6% | -75.2% |

Next steps:

try calling bytes.Index over Block payload - already shows good results
build Offsets lazily - once the previous step is done
modernize token Block, boost Unpack speed

I have read and followed all requirements in CONTRIBUTING.md;
I used LLM/AI assistance to make this pull request;

codecov-commenter · 2026-04-03T06:46:05Z

Codecov Report

❌ Patch coverage is 88.46154% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.51%. Comparing base (5115f7b) to head (cee0a60).

Files with missing lines	Patch %	Lines
frac/active_token_list.go	78.94%	2 Missing and 2 partials ⚠️
frac/sealed/token/provider.go	91.48%	2 Missing and 2 partials ⚠️
frac/sealed/token/block_loader.go	86.66%	1 Missing and 1 partial ⚠️
pattern/pattern.go	91.30%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #397      +/-   ##
==========================================
+ Coverage   71.28%   71.51%   +0.22%     
==========================================
  Files         210      210              
  Lines       15579    15662      +83     
==========================================
+ Hits        11105    11200      +95     
+ Misses       3673     3663      -10     
+ Partials      801      799       -2


dkharms · 2026-04-03T10:07:17Z

 	return b.Payload[offset : offset+l]
 }

+func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) {


We've discussed that you can perform bytes.Contains on the block payload before checking each token individually. Have you measured the performance of such an optimization?

We've discussed that you can perform bytes.Contains on the block payload before checking each token individually.

Yes, I tried calling bytes.Index on the entire payload. It is even faster than this PR:
message:foobar
35 ms => 9 ms

However, this means that whenever bytes.Index returns a match position, we need to do a binary search on Offsets to find the token index and then check for a false positive. It also comes with the neat property that we can call Unpack (build offsets) lazily, which boosts cold query performance (by roughly an extra 20%).

I put a task in the backlog; I decided that it's too much for a single PR.

dkharms · 2026-04-03T10:31:41Z

 }

+func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) {
+	indices := make([]int, 0)


I guess you could also pass a slice of needles here to handle queries like message:*foo*bar* with multiple needles. Or is there something blocking such an improvement?

No, I think it's doable. I may do it.

upd: will do in a separate PR
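As an illustration of the multi-needle idea, a per-token check for a pattern like *foo*bar* could look like the sketch below. The helper name `containsInOrder` is hypothetical and not part of this PR; it assumes the needles must appear in order and without overlapping.

```go
package main

import (
	"bytes"
	"fmt"
)

// containsInOrder reports whether all needles occur in token in the given
// order without overlapping, which is what *n1*n2*...* requires.
// Taking the leftmost match of each needle leaves the longest possible
// remainder for the following needles, so the greedy scan is sufficient.
func containsInOrder(token []byte, needles [][]byte) bool {
	for _, n := range needles {
		i := bytes.Index(token, n)
		if i < 0 {
			return false
		}
		token = token[i+len(n):]
	}
	return true
}

func main() {
	needles := [][]byte{[]byte("foo"), []byte("bar")}
	fmt.Println(containsInOrder([]byte("xfooybarz"), needles)) // true
	fmt.Println(containsInOrder([]byte("barfoo"), needles))    // false
}
```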

dkharms · 2026-05-19T18:13:54Z

 	return b.Payload[offset : offset+l]
 }

+func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) {


I guess it's better to rename Block.FindContains and Block.FindToken for several reasons:

The names conflict with those in the tokenProvider interface, even though the methods serve a different purpose;

These methods should be private, in my opinion;

Maybe something like this would be better?

func (b *Block) contains(from, to int, needle []byte) ([]int, error) { ... }
func (b *Block) find(from, to int, searcher pattern.Searcher) ([]int, error) { ... }

dkharms · 2026-05-19T18:20:23Z


 type tokenProvider interface {
 	GetToken(uint32) []byte
+	FindContains(firstTID uint32, lastTID uint32, needle []byte) ([]uint32, error)


Why did you decide to make firstTID and lastTID part of the API?

It seems that for this specific case (e.g. query foo:'*bar*') we cannot narrow the TID search boundaries.

And right now we always pass the first and last TID to this method:
https://github.com/ozontech/seq-db/blob/0-wildcard-predicate-pushdown/pattern/pattern.go#L411

dkharms · 2026-05-20T07:46:44Z

+func (tp *Provider) narrowEntries(firstTID, lastTID uint32) []*TableEntry {
+	firstIdx := sort.Search(len(tp.entries), func(i int) bool {
+		return tp.entries[i].getLastTID() >= firstTID
+	})
+	if firstIdx >= len(tp.entries) {
+		return nil
+	}
+	lastIdx := sort.Search(len(tp.entries), func(i int) bool {
+		return tp.entries[i].StartTID > lastTID
+	})
+	lastIdx--
+	if lastIdx < firstIdx {
+		return nil
+	}
+	entries := tp.entries[firstIdx : lastIdx+1]
+	return entries
+}


It is totally safe to rewrite this method as follows, and it raises fewer questions about why we decrement and then increment lastIdx:

func (tp *Provider) narrowEntries(firstTID, lastTID uint32) []*TableEntry {
	firstIdx := sort.Search(len(tp.entries), func(i int) bool {
		return tp.entries[i].getLastTID() >= firstTID
	})
	if firstIdx >= len(tp.entries) {
		return nil
	}
	lastIdx := sort.Search(len(tp.entries), func(i int) bool {
		return tp.entries[i].StartTID > lastTID
	})
	// INVARIANT: Following condition always holds:
	// lastIdx <= len(tp.entries) && firstIdx <= lastIdx
	return tp.entries[firstIdx:lastIdx]
}

dkharms · 2026-05-20T08:01:40Z

+
+	for _, entry := range entries {
+		block := tp.findBlock(entry.BlockIndex)
+		firstIndex, lastIndex := tp.narrowTIDs(entry, firstTID, lastTID)


It seems beneficial to narrow TIDs only for the first and last entries -- for everything in between it is just the extra overhead of a method call.

And I guess this name is ambiguous as well -- what we really do here is derive the local index of a token inside a specific block from its universal TID.

dkharms · 2026-05-20T08:03:14Z

+	return tids, nil
+}
+
+func (tp *Provider) narrowTIDs(entry *TableEntry, firstTID, fromTID uint32) (int, int) {


Incorrect argument name fromTID -- I guess you meant lastTID here.
I also suggest using the builtin functions for getting min/max:

func (tp *Provider) narrowTIDs(entry *TableEntry, firstTID, lastTID uint32) (int, int) {
	tidStart := max(firstTID, entry.StartTID)
	tidEnd := min(lastTID, entry.getLastTID())
	firstIndex := entry.GetIndexInTokensBlock(tidStart)
	lastIndex := entry.GetIndexInTokensBlock(tidEnd)
	return firstIndex, lastIndex
}
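A self-contained version of this suggestion is sketched below. The `TableEntry` fields and the bodies of `getLastTID` and `GetIndexInTokensBlock` are assumptions made for illustration (the real type may carry a block-local start index, for example). The builtin min and max require Go 1.21 or newer.

```go
package main

import "fmt"

// TableEntry is a hypothetical, simplified version of the real type:
// it covers ValCount consecutive TIDs starting at StartTID.
type TableEntry struct {
	StartTID uint32
	ValCount uint32
}

// getLastTID returns the last TID covered by the entry (inclusive).
func (e *TableEntry) getLastTID() uint32 { return e.StartTID + e.ValCount - 1 }

// GetIndexInTokensBlock maps a global TID to a local index; this sketch
// assumes the entry starts at index 0 of its block.
func (e *TableEntry) GetIndexInTokensBlock(tid uint32) int {
	return int(tid - e.StartTID)
}

// narrowTIDs clamps [firstTID, lastTID] to the entry and returns the
// corresponding inclusive local index range.
func narrowTIDs(entry *TableEntry, firstTID, lastTID uint32) (int, int) {
	tidStart := max(firstTID, entry.StartTID)
	tidEnd := min(lastTID, entry.getLastTID())
	return entry.GetIndexInTokensBlock(tidStart), entry.GetIndexInTokensBlock(tidEnd)
}

func main() {
	e := &TableEntry{StartTID: 100, ValCount: 50} // covers TIDs 100..149
	fmt.Println(narrowTIDs(e, 90, 120))
	fmt.Println(narrowTIDs(e, 110, 500))
}
```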

pushdown predicate for wildcards

0fa1d0e

lint fixes

cee0a60

dkharms reviewed Apr 3, 2026

eguguchkin self-requested a review April 6, 2026 10:20

eguguchkin modified the milestones: v0.72.0, v0.73.0 Apr 13, 2026

cheb0 added the performance Features or improvements that positively affect seq-db performance label May 12, 2026

eguguchkin modified the milestones: v0.73.0, v0.72.0 May 18, 2026

dkharms reviewed May 20, 2026

cheb0 wants to merge 2 commits into main from 0-wildcard-predicate-pushdown
