perf: improve wildcard query perf with predicate and contains-check pushdown #397
perf: improve wildcard query perf with predicate and contains-check pushdown #397cheb0 wants to merge 2 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #397 +/- ##
==========================================
+ Coverage 71.28% 71.51% +0.22%
==========================================
Files 210 210
Lines 15579 15662 +83
==========================================
+ Hits 11105 11200 +95
+ Misses 3673 3663 -10
+ Partials 801 799 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| return b.Payload[offset : offset+l] | ||
| } | ||
|
|
||
| func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) { |
There was a problem hiding this comment.
We've discussed that you can perform bytes.Contains on the block payload before checking each token individually. Have you measured performance of such optimization?
There was a problem hiding this comment.
We've discussed that you can perform bytes.Contains on the block payload before checking each token individually.
Yes, I tried calling bytes.Index on entire payload. It boosts even further comparing to this PR:
message:foobar
35 ms => 9 ms
However, this means that when bytes.Index returns and if we have some proper index returned, then we need to do a bin search on Offsets to find an index and then check for false positive. It also comes with neat property that we can avoid call Unpack (build offsets) lazily which boosts cold query performance (somewhat around extra 20%).
I put a task to the backlog, decided that it's too much for a single PR.
| } | ||
|
|
||
| func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) { | ||
| indices := make([]int, 0) |
There was a problem hiding this comment.
I guess you could pass here slice of needles as well to handle queries like message:*foo*bar* with multiple needles. Or there is something that blocks such improvement?
There was a problem hiding this comment.
No, I think it's doable. Maybe will do
There was a problem hiding this comment.
upd: will do in a separate PR
| return b.Payload[offset : offset+l] | ||
| } | ||
|
|
||
| func (b *Block) FindContains(from, to int, needle []byte) ([]int, error) { |
There was a problem hiding this comment.
I guess it's better to rename Block.FindContains and Block.FindToken for several reasons:
- The name is conflicting with what we have in
tokenProviderinterface however they serve different purpose; - These methods are should be private, in my opinion;
Maybe something like will be better?
func (b *Block) contains(from, to int, needle []byte) ([]int, error) { ... }
func (b *Block) find(from, to int, searcher pattern.Searcher) ([]int, error) { ... }|
|
||
| type tokenProvider interface { | ||
| GetToken(uint32) []byte | ||
| FindContains(firstTID uint32, lastTID uint32, needle []byte) ([]uint32, error) |
There was a problem hiding this comment.
Why did you decide to make firstTID and lastTID a part of an API?
Seems like for this specific case (e.g. query foo:'*bar*') we cannot narrow the TID search boundaries.
And now we always pass the first and last TID in this method:
https://github.com/ozontech/seq-db/blob/0-wildcard-predicate-pushdown/pattern/pattern.go#L411
| func (tp *Provider) narrowEntries(firstTID, lastTID uint32) []*TableEntry { | ||
| firstIdx := sort.Search(len(tp.entries), func(i int) bool { | ||
| return tp.entries[i].getLastTID() >= firstTID | ||
| }) | ||
| if firstIdx >= len(tp.entries) { | ||
| return nil | ||
| } | ||
| lastIdx := sort.Search(len(tp.entries), func(i int) bool { | ||
| return tp.entries[i].StartTID > lastTID | ||
| }) | ||
| lastIdx-- | ||
| if lastIdx < firstIdx { | ||
| return nil | ||
| } | ||
| entries := tp.entries[firstIdx : lastIdx+1] | ||
| return entries | ||
| } |
There was a problem hiding this comment.
It is totally safe to rewrite this method in this way and it raises less questions on why we perform decrement and increment in lastIdx:
func (tp *Provider) narrowEntries(firstTID, lastTID uint32) []*TableEntry {
firstIdx := sort.Search(len(tp.entries), func(i int) bool {
return tp.entries[i].getLastTID() >= firstTID
})
if firstIdx >= len(tp.entries) {
return nil
}
lastIdx := sort.Search(len(tp.entries), func(i int) bool {
return tp.entries[i].StartTID > lastTID
})
// INVARIANT: Following condition always holds:
// lastIdx <= len(tp.entries) && firstIdx <= lastIdx
return tp.entries[firstIdx:lastIdx]
}|
|
||
| for _, entry := range entries { | ||
| block := tp.findBlock(entry.BlockIndex) | ||
| firstIndex, lastIndex := tp.narrowTIDs(entry, firstTID, lastTID) |
There was a problem hiding this comment.
Seems like it is beneficial to narrow tids only for the first and last entries -- for everything in-between it is just an additional overhead on method call.
And I guess this is name is ambiguous as well -- what we really do here is deriving local index of token inside specific block from its universal tid.
| return tids, nil | ||
| } | ||
|
|
||
| func (tp *Provider) narrowTIDs(entry *TableEntry, firstTID, fromTID uint32) (int, int) { |
There was a problem hiding this comment.
Incorrect argument name fromTID -- I guess you've meant lastTID here.
And I suggest to use builtin functions for getting min/max:
func (tp *Provider) narrowTIDs(entry *TableEntry, firstTID, lastTID uint32) (int, int) {
tidStart := max(firstTID, entry.StartTID)
tidEnd := min(lastTID, entry.getLastTID())
firstIndex := entry.GetIndexInTokensBlock(tidStart)
lastIndex := entry.GetIndexInTokensBlock(tidEnd)
return firstIndex, lastIndex
}
Description
Currently we spend only a fraction of time calling
bytes.Index. This PR partially addresses that.This PR pushes
pattern.SearchertoBlocklevel, so thatBlockis able to stream tokens through searcher. For ordinary wildcards like*error*there is directFindContainsmethod which is even faster.For example, query
message:*foobarf*:main: 86 ms
using
FindToken: 50 msusing
FindContains: 37 msSo, FindContains just throws out costly abstractions to get additional performance. We could also provide a dedicated func like FindSuffix, for example. This is a typical example when performance requires additional code.
trace_id:*foobark8s_pod:*6message:*err*message:*foo*message:*request*message:*foobar*foobar*message:*foobarfoobar*message:*very_very_message_aggregator_events*Next steps:
bytes.IndexoverBlockpayload - already shows good results