Commit 55d5510

This makes XorBinaryFuse16 and XorBinaryFuse8 more robust
1 parent 5d2b41d commit 55d5510

6 files changed: +138 -104 lines changed


README.md

Lines changed: 11 additions & 11 deletions
@@ -22,7 +22,7 @@ The following additional types are implemented, but less tested:
 
 ## Reference
 
-* Thomas Mueller Graf, Daniel Lemire, [Binary Fuse Filters: Fast and Smaller Than Xor Filters](http://arxiv.org/abs/2201.01174), Journal of Experimental Algorithmics 27, 2022. DOI: 10.1145/3510449
+* Thomas Mueller Graf, Daniel Lemire, [Binary Fuse Filters: Fast and Smaller Than Xor Filters](http://arxiv.org/abs/2201.01174), Journal of Experimental Algorithmics 27, 2022. DOI: 10.1145/3510449
 * Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
 
 ## Usage
@@ -31,17 +31,13 @@ The following additional types are implemented, but less tested:
 To use the XOR and Binary Fuse filters, first prepare an array of keys, then construct the filter:
 
 ```java
-import org.fastfilter.xor.Xor8;
-import org.fastfilter.xor.Xor16;
 import org.fastfilter.xor.XorBinaryFuse8;
 import org.fastfilter.xor.XorBinaryFuse16;
 
 // Example keys
 long[] keys = {1, 2, 3, 4, 5};
 
-// Construct XOR filters
-Xor8 xor8 = Xor8.construct(keys);
-Xor16 xor16 = Xor16.construct(keys);
+// Construct binary fuse filters=
 XorBinaryFuse8 xorBinaryFuse8 = XorBinaryFuse8.construct(keys);
 XorBinaryFuse16 xorBinaryFuse16 = XorBinaryFuse16.construct(keys);
 
@@ -52,6 +48,11 @@ boolean mightContain2 = xor8.mayContain(6L); // false (with high probability)
 
 All filters implement the `Filter` interface and support the `mayContain(long key)` method to check if a key might be in the set. Note that false positives are possible, but false negatives are not.
 
+### Generating the Hash Values
+
+The library is written to process `long` values that are meant to be hash values. Though you do not need to use
+cryptographically strong hashing, you should make sure that your hash functions are reasonable: they should
+not generate too many collisions (two objects mapping to the same `long` value).
 
 ### Serialization and Deserialization
 
@@ -60,25 +61,24 @@ Filters can be serialized to and deserialized from a `ByteBuffer` for persistenc
 ```java
 import java.nio.ByteBuffer;
 
-// Assuming you have a constructed filter, e.g., Xor8 xor8 = Xor8.construct(keys);
+// Assuming you have a constructed filter
 
 // Get the serialized size
-int size = xor8.getSerializedSize();
+int size = XorBinaryFuse8.getSerializedSize();
 
 // Allocate a ByteBuffer
 ByteBuffer buffer = ByteBuffer.allocate(size);
 
 // Serialize the filter
-xor8.serialize(buffer);
+XorBinaryFuse8.serialize(buffer);
 
 // Prepare buffer for reading (flip)
 buffer.flip();
 
 // Deserialize the filter
-Xor8 deserializedXor8 = Xor8.deserialize(buffer);
+XorBinaryFuse8 deserializedXorBinaryFuse8 = Xor8.deserialize(buffer);
 
 // The deserialized filter behaves identically to the original
-boolean result = deserializedXor8.mayContain(1L); // true
 ```
 
 This allows saving filters to files, databases, or sending them over networks.
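
The "Generating the Hash Values" text added to the README above asks for `long` keys with few collisions but does not prescribe a hash function. As one possible approach (an assumption, not something this commit or the library provides), the sketch below turns strings into 64-bit keys using FNV-1a followed by the MurmurHash3 64-bit finalizer. The names `KeyHashing`, `hash64`, `mix`, and `toKeys` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of fastfilter: derives 64-bit keys from strings.
final class KeyHashing {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    // FNV-1a over the UTF-8 bytes, then an extra mixing step.
    static long hash64(String s) {
        long h = FNV_OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= FNV_PRIME;
        }
        return mix(h);
    }

    // MurmurHash3 64-bit finalizer; it is a bijection, so it adds no collisions.
    static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // Hashes every item into a long[] suitable as filter input.
    static long[] toKeys(String[] items) {
        long[] keys = new long[items.length];
        for (int i = 0; i < items.length; i++) {
            keys[i] = hash64(items[i]);
        }
        return keys;
    }

    // Counts distinct hash values, a quick check for accidental collisions.
    static int countDistinct(long[] keys) {
        Set<Long> seen = new HashSet<>();
        for (long k : keys) {
            seen.add(k);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String[] items = {"apple", "banana", "cherry"};
        long[] keys = toKeys(items);
        System.out.println(countDistinct(keys) + " distinct keys out of " + keys.length);
    }
}
```

The resulting `long[]` is the kind of array the README passes to `construct(keys)`. A 32-bit value such as `String.hashCode()` is a weaker choice for large sets, since collisions among 32-bit values become likely once the set reaches tens of thousands of items.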

fastfilter/src/main/java/org/fastfilter/xor/Xor16.java

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,10 @@
66
import org.fastfilter.utils.Hash;
77

88
/**
9+
* The Xor16 filter implementation is experimental. We recommend using XorBinaryFuse16 instead. Use at your own risks.
10+
*
911
* The xor filter, a new algorithm that can replace a Bloom filter.
12+
* Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
1013
*
1114
* It needs 1.23 log(1/fpp) bits per key. It is related to the BDZ algorithm [1]
1215
* (a minimal perfect hash function algorithm).

fastfilter/src/main/java/org/fastfilter/xor/Xor8.java

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,8 +6,12 @@
66
import org.fastfilter.Filter;
77
import org.fastfilter.utils.Hash;
88

9+
910
/**
11+
* The Xor8 filter implementation is experimental. We recommend using XorBinaryFuse8 instead. Use at your own risks.
12+
*
1013
* The xor filter, a new algorithm that can replace a Bloom filter.
14+
* Thomas Mueller Graf, Daniel Lemire, [Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters](https://arxiv.org/abs/1912.08258), Journal of Experimental Algorithmics 25 (1), 2020. DOI: 10.1145/3376122
1115
*
1216
* It needs 1.23 log(1/fpp) bits per key. It is related to the BDZ algorithm [1]
1317
* (a minimal perfect hash function algorithm).
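
The Javadoc for Xor8 and Xor16 gives the space cost as 1.23 log(1/fpp) bits per key. As a worked example, the sketch below evaluates that formula under two assumptions that the comment does not state: that the logarithm is base 2, and that a filter with k-bit fingerprints has a false-positive probability of roughly 2^-k. The class and method names are hypothetical.

```java
// Hypothetical helper: evaluates the space estimate quoted in the Javadoc.
final class XorFilterSpace {
    // 1.23 * log2(1 / fpp), assuming the Javadoc's "log" is base 2.
    static double bitsPerKey(double fpp) {
        return 1.23 * (Math.log(1.0 / fpp) / Math.log(2.0));
    }

    public static void main(String[] args) {
        // Assumed fpp for 8-bit and 16-bit fingerprints: 2^-8 and 2^-16.
        System.out.printf("fpp = 1/256   : %.2f bits per key%n", bitsPerKey(1.0 / 256));
        System.out.printf("fpp = 1/65536 : %.2f bits per key%n", bitsPerKey(1.0 / 65536));
    }
}
```

Under these assumptions the estimate is 1.23 × 8 = 9.84 bits per key for 8-bit fingerprints and 1.23 × 16 = 19.68 bits per key for 16-bit fingerprints.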

fastfilter/src/main/java/org/fastfilter/xor/XorBinaryFuse16.java

Lines changed: 59 additions & 46 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77

88
/**
99
* The xor binary fuse filter, a new algorithm that can replace a Bloom filter.
10+
* Thomas Mueller Graf, Daniel Lemire, [Binary Fuse Filters: Fast and Smaller Than Xor Filters](http://arxiv.org/abs/2201.01174), Journal of Experimental Algorithmics 27, 2022. DOI: 10.1145/3510449
1011
*/
1112
public class XorBinaryFuse16 implements Filter {
1213

@@ -78,6 +79,15 @@ private static int mod3(int x) {
         return x;
     }
 
+    /**
+     * Constructs a new XorBinaryFuse16 filter from the given array of keys.
+     * The filter is designed to have a low false positive rate while being space-efficient.
+     * The keys array should contain unique values. The array may be mutated during construction
+     * (e.g., sorted and deduplicated) if the algorithm detects that there are likely too many duplicates.
+     *
+     * @param keys the array of long keys to add to the filter
+     * @return a new XorBinaryFuse16 filter containing all the keys
+     */
     public static XorBinaryFuse16 construct(long[] keys) {
         int size = keys.length;
         int segmentLength = calculateSegmentLength(ARITY, size);
@@ -102,6 +112,7 @@ private void addAll(long[] keys) {
         long[] reverseOrder = new long[size + 1];
         byte[] reverseH = new byte[size];
         int reverseOrderPos = 0;
+        boolean duplicated = false;
 
         // the lowest 2 bits are the h index (0, 1, or 2)
         // so we only have 6 bits for counting;
@@ -117,7 +128,6 @@ private void addAll(long[] keys) {
             blockBits++;
         }
         int block = 1 << blockBits;
-        mainloop:
         while (true) {
             reverseOrder[size] = 1;
             int[] startPos = new int[block];
@@ -126,7 +136,8 @@ private void addAll(long[] keys) {
             }
             // counting sort
 
-            for (long key : keys) {
+            for(int i = 0; i < size; i++) {
+                long key = keys[i];
                 long hash = Hash.hash64(key, seed);
                 int segmentIndex = (int) (hash >>> (64 - blockBits));
                 // We only overwrite when the hash was zero. Zero hash values
@@ -150,60 +161,62 @@ private void addAll(long[] keys) {
                 }
             }
             startPos = null;
-            if (countMask < 0) {
-                // we have a possible counter overflow
-                continue mainloop;
-            }
-
-            reverseOrderPos = 0;
-            int alonePos = 0;
-            for (int i = 0; i < arrayLength; i++) {
-                alone[alonePos] = i;
-                int inc = (t2count[i] >> 2) == 1 ? 1 : 0;
-                alonePos += inc;
-            }
+            if (countMask >= 0) {
+                reverseOrderPos = 0;
+                int alonePos = 0;
+                for (int i = 0; i < arrayLength; i++) {
+                    alone[alonePos] = i;
+                    int inc = (t2count[i] >> 2) == 1 ? 1 : 0;
+                    alonePos += inc;
+                }
 
-            while (alonePos > 0) {
-                alonePos--;
-                int index = alone[alonePos];
-                if ((t2count[index] >> 2) == 1) {
-                    // It is still there!
-                    long hash = t2hash[index];
-                    byte found = (byte) (t2count[index] & 3);
-
-                    reverseH[reverseOrderPos] = found;
-                    reverseOrder[reverseOrderPos] = hash;
-
-                    h012[0] = getHashFromHash(hash, 0);
-                    h012[1] = getHashFromHash(hash, 1);
-                    h012[2] = getHashFromHash(hash, 2);
-
-                    int index3 = h012[mod3(found + 1)];
-                    alone[alonePos] = index3;
-                    alonePos += ((t2count[index3] >> 2) == 2 ? 1 : 0);
-                    t2count[index3] -= 4;
-                    t2count[index3] ^= mod3(found + 1);
-                    t2hash[index3] ^= hash;
-
-                    index3 = h012[mod3(found + 2)];
-                    alone[alonePos] = index3;
-                    alonePos += ((t2count[index3] >> 2) == 2 ? 1 : 0);
-                    t2count[index3] -= 4;
-                    t2count[index3] ^= mod3(found + 2);
-                    t2hash[index3] ^= hash;
-
-                    reverseOrderPos++;
+                while (alonePos > 0) {
+                    alonePos--;
+                    int index = alone[alonePos];
+                    if ((t2count[index] >> 2) == 1) {
+                        // It is still there!
+                        long hash = t2hash[index];
+                        byte found = (byte) (t2count[index] & 3);
+
+                        reverseH[reverseOrderPos] = found;
+                        reverseOrder[reverseOrderPos] = hash;
+
+                        h012[0] = getHashFromHash(hash, 0);
+                        h012[1] = getHashFromHash(hash, 1);
+                        h012[2] = getHashFromHash(hash, 2);
+
+                        int index3 = h012[mod3(found + 1)];
+                        alone[alonePos] = index3;
+                        alonePos += ((t2count[index3] >> 2) == 2 ? 1 : 0);
+                        t2count[index3] -= 4;
+                        t2count[index3] ^= mod3(found + 1);
+                        t2hash[index3] ^= hash;
+
+                        index3 = h012[mod3(found + 2)];
+                        alone[alonePos] = index3;
+                        alonePos += ((t2count[index3] >> 2) == 2 ? 1 : 0);
+                        t2count[index3] -= 4;
+                        t2count[index3] ^= mod3(found + 2);
+                        t2hash[index3] ^= hash;
+
+                        reverseOrderPos++;
+                    }
                 }
             }
-
             if (reverseOrderPos == size) {
                 break;
             }
             hashIndex++;
             Arrays.fill(t2count, (byte) 0);
             Arrays.fill(t2hash, 0);
             Arrays.fill(reverseOrder, 0);
-
+            // If we reach 10 passes, we assume that there are too many duplicates
+            // in the input key set. We then sort and remove duplicates in place.
+            // This should almost never happen.
+            if (countMask < 0 && !duplicated) {
+                size = Deduplicator.sortAndRemoveDup(keys, size);
+                duplicated = true;
+            }
             if (hashIndex > 100) {
                 // if construction doesn't succeed eventually,
                 // then there is likely a problem with the hash function.
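
The hunk above calls `Deduplicator.sortAndRemoveDup(keys, size)`, whose body is not part of this diff. The sketch below is only a guess at what an in-place "sort and remove duplicates" step could look like, inferred from the call site and the Javadoc for `construct`: it sorts the first `size` entries, compacts the unique values to the front, and returns the new logical size. The class name `DedupSketch` is hypothetical, and the library's actual implementation may differ.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Deduplicator helper referenced in the diff.
final class DedupSketch {
    // Sorts keys[0..size) and moves the unique values to the front.
    // Returns the number of unique values; entries past that index are stale.
    static int sortAndRemoveDup(long[] keys, int size) {
        if (size <= 1) {
            return size;
        }
        Arrays.sort(keys, 0, size);
        int unique = 1;
        for (int i = 1; i < size; i++) {
            // After sorting, duplicates are adjacent, so compare with the last kept value.
            if (keys[i] != keys[unique - 1]) {
                keys[unique] = keys[i];
                unique++;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        long[] keys = {5L, 1L, 3L, 1L, 5L, 5L, 2L};
        int size = sortAndRemoveDup(keys, keys.length);
        System.out.println(Arrays.toString(Arrays.copyOf(keys, size)));
    }
}
```

Because the array is modified in place, a caller that still needs the original key order would have to pass a copy, which matches the Javadoc warning that the array may be mutated during construction.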

fastfilter/src/main/java/org/fastfilter/xor/XorBinaryFuse32.java

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,8 @@
77
import org.fastfilter.utils.Hash;
88

99
/**
10-
* The xor binary fuse filter, a new algorithm that can replace a Bloom filter.
10+
* The XorBinaryFuse32 filter is experimental. We recommend using XorBinaryFuse8 or XorBinaryFuse16 instead.
11+
* Use at your own risks.
1112
*/
1213
public class XorBinaryFuse32 implements Filter {
1314
