
@yuzawa-san
Contributor

I tried out the 4.X branch in a real production environment under real load. This allowed me to profile and eliminate more hotspots. Here are the main changes:

  • Optimize hot loops in AbstractLazilyEncodableSection: we can skip allocating and calling iterators and just use simple indexed for loops.
  • Remove ManagedIntegerSet and ManagedFixedList. Both of these wrapped the values returned by the field getValue() methods so that we could determine whether the returned value was mutated; in that case the parent field would be marked as dirty. However, we found we called getValue() many times on the same decoded field, and it returned a new wrapper on each call. I was never fully satisfied with my original implementation here, so I reworked this into a different architecture. I introduced a Dirtyable interface, which means the values themselves can track mutations, and I made the fields that return a collection return a Dirtyable implementation: IntegerSet, FixedIntegerList, FixedList.
  • This allowed me to clean up the class hierarchy for IntegerSet. It is also now a concrete class, which means it uses virtual dispatch instead of the more expensive interface dispatch.
  • I also converted the method signatures to return FixedIntegerList instead of List<Integer>. FixedIntegerList is much more optimized: it stores values in a byte array and adds methods for unboxed access.
  • Added tests.
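The Dirtyable rework can be pictured with a small sketch. Dirtyable and FixedIntegerList are the PR's names, but the members shown here are illustrative assumptions, not the library's actual code:

```java
// Illustrative sketch: the collection itself records mutations, so fields
// no longer need to wrap every getValue() result in a new tracker object.
// (Member names are assumptions; only the class/interface names are from the PR.)
interface Dirtyable {
    boolean isDirty();
    void markClean();
}

// A fixed-size list of small integers backed by a byte array,
// with unboxed accessors.
class FixedIntegerList implements Dirtyable {
    private final byte[] values;
    private boolean dirty;

    FixedIntegerList(int size) {
        this.values = new byte[size];
    }

    int getInt(int index) {
        return values[index]; // unboxed read, no Integer allocation
    }

    void setInt(int index, int value) {
        values[index] = (byte) value;
        dirty = true; // the collection tracks its own mutation
    }

    @Override
    public boolean isDirty() {
        return dirty;
    }

    @Override
    public void markClean() {
        dirty = false;
    }
}
```

Because the collection records its own mutations, the parent section can simply check isDirty() at encode time instead of wrapping each returned value.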

@Kevin-P-Kerr Kevin-P-Kerr left a comment

My only concern is that FixedIntegerList can only represent (I believe) 8-bit values. Does that work for this use case?

@yuzawa-san
Contributor Author

@Kevin-P-Kerr Yes, the GPP state specs only use 0, 1, and 2 as values (e.g. SensitiveDataProcessing, KnownChildSensitiveDataConsents). Theoretically it could have been packed even smaller, but byte is the smallest primitive with built-in casts to/from int.

public int setInt(int index, int value) {
    // NOTE: int 128 is prevented since it would get turned into byte -128
    if (value < 0 || value >= 128) {
        throw new IllegalArgumentException("FixedIntegerList only supports positive integers less than 128.");
    }
    // (rest of method reconstructed for completeness: store as byte, return prior value)
    int previous = values[index];
    values[index] = (byte) value;
    return previous;
}
This should be fine, as the most typical values are 0, 1, and 2, I believe.
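For context on the 0–127 bound: Java's byte is signed, so the narrowing cast keeps only the low 8 bits and sign-extends on the way back to int. A tiny hypothetical helper (ByteCastDemo is not part of the library) demonstrates the wrap-around the guard prevents:

```java
// Why setInt rejects values >= 128: Java's byte is signed, so the
// narrowing cast wraps 128 to -128 and the round-trip is lossy.
// ByteCastDemo is a hypothetical helper for illustration only.
class ByteCastDemo {
    static int roundTrip(int value) {
        byte stored = (byte) value; // narrowing: keeps the low 8 bits
        return stored;              // widening back to int sign-extends
    }
}
```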

@yuzawa-san
Contributor Author

yuzawa-san commented Jan 28, 2026

I was able to do further optimizations:

  • I was able to fully remove the substring operations, at great savings. This means we no longer need to slice BitSets; instead, I introduced a BitStringReader which progresses forward through the BitString. It provides the primitive readInt, readLong, and readFibonacci methods. I was also able to remove the need to search for the end of a Fibonacci string: we can accumulate the value while we search for the end.
  • I implemented BitSet in a simpler manner than the JDK version; it uses bytes instead of longs. I settled on a block-based base64 decoding algorithm inspired by the JDK's version. This method exploits the fact that 4 base64 characters fit exactly into 3 bytes, which allows for good loop unrolling and fewer bit-shift operations during decode, which had been taking up a good bit of CPU relative to other things.
  • I converted the field keys to enums. This allows for a large number of optimizations, since we can use offsets and lists to store things.
  • I made the class hierarchy more DRY, which reduces the amount of boilerplate required to add new sections.
  • Refactored the lazy encode/decode and dirty logic into a parent abstract class for reuse. The GppModel, the sections, and the segments all now have consistent and reusable logic.
  • I cleaned up GppModel to store the sorted segment IDs in the header, which allows the dirty logic to be unified. I changed the section map to be keyed on Integer, since that is cheaper and its instances (via valueOf) are cached. I kept support for lookups via section names. I modified the per-section setters and getters to use FieldKey instead of raw strings, since we gain a lot of performance by not having to keep a string-to-field map.
  • Replaced interfaces with abstract classes, since they have faster method dispatch.
  • Upgraded SlicedCharSequence.split() to use String.indexOf(), which is significantly faster than using charAt to find the split locations.
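The "accumulate while searching for the end" idea behind readFibonacci can be sketched as follows. The real BitStringReader operates on packed bits; a String of '0'/'1' characters is used here purely for illustration, and the class and field names are assumptions:

```java
// Sketch of single-pass Fibonacci decoding: instead of first scanning for
// the terminating "11" and then re-reading the bits, we add the current
// Fibonacci weight as we go and return as soon as the terminator appears.
class FibonacciBitReader {
    private final String bits; // '0'/'1' chars stand in for packed bits
    private int position;

    FibonacciBitReader(String bits) {
        this.bits = bits;
    }

    int readFibonacci() {
        int result = 0;
        int prev = 1; // F(1)
        int curr = 1; // F(2): the weight of the first code bit
        boolean lastWasOne = false;
        while (position < bits.length()) {
            boolean one = bits.charAt(position++) == '1';
            if (one && lastWasOne) {
                return result; // two consecutive 1s terminate the code
            }
            if (one) {
                result += curr; // accumulate while scanning for the terminator
            }
            lastWasOne = one;
            int next = prev + curr; // advance to the next Fibonacci weight
            prev = curr;
            curr = next;
        }
        throw new IllegalStateException("unterminated Fibonacci code");
    }
}
```

Since the reader only moves forward through the bit string, no substring or BitSet slice is ever allocated.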
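The block-based base64 decode relies on 4 characters packing exactly into one 24-bit word (3 bytes). A minimal sketch, assuming the URL-safe alphabet used by consent strings and an input length that is a multiple of 4; the real implementation must also handle the tail block and reject invalid characters:

```java
// Sketch of block-based base64url decoding: four 6-bit values are packed
// into one 24-bit word, then split into three bytes with a fixed shift
// pattern, avoiding per-bit bookkeeping in the inner loop.
class Base64BlockDecoder {
    private static final int[] LOOKUP = new int[128];
    static {
        String alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
        for (int i = 0; i < alphabet.length(); i++) {
            LOOKUP[alphabet.charAt(i)] = i;
        }
    }

    // Decodes a base64url string whose length is a multiple of 4 (no padding).
    static byte[] decode(String in) {
        byte[] out = new byte[in.length() / 4 * 3];
        int o = 0;
        for (int i = 0; i < in.length(); i += 4) {
            // pack four sextets into one 24-bit word
            int word = (LOOKUP[in.charAt(i)] << 18)
                     | (LOOKUP[in.charAt(i + 1)] << 12)
                     | (LOOKUP[in.charAt(i + 2)] << 6)
                     |  LOOKUP[in.charAt(i + 3)];
            out[o++] = (byte) (word >>> 16);
            out[o++] = (byte) (word >>> 8);
            out[o++] = (byte) word;
        }
        return out;
    }
}
```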

I have a benchmark:

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@State(org.openjdk.jmh.annotations.Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class Microbenchmark {
  	private static final String in = "DBABMA~CQCDewAQCDewAPoABABGA9EMAP-AAB4AAIAAKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-ClbVfxv_215b_l-9-n7ZHmNX_f4e-7EMQYXybPuBcy71vycF99hMzRN-ramCpkSALt2wSEDbRyY1E1QoWqIFa8w7GpNnKE7SifgZIhzEXtnWAheb5sLY_kCmeb6__d3edkf7f3a_t3c8s-VZ792vf_m9ViYnSufrR_7_20Tm_vyPvf-fv-L_Pzf6xNv3k9bf7Xr7e9_fvLb__f___f___-______9__gAAAAA.QKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-.IKVtV_G__bXlv-X736ftkeY1f9_h77sQxBhfJs-4FzLvW_JwX32EzNE36tqYKmRIAu3bBIQNtHJjUTVChaogVrzDsak2coTtKJ-BkiHMRe2dYCF5vmwtj-QKZ5vr_93d52R_t_dr-3dzyz5Vnv3a9_-b1WJidK5-tH_v_bROb-_I-9_5-_4v8_N_rE2_eT1t_tevt739-8tv_9___9____7______3_-";
    

    @Benchmark
    @Threads(Threads.MAX)
    public void run(Blackhole bh) throws Exception {
        TcfEuV2 nu = new GppModel(in).getTcfEuV2Section();
        bh.consume(nu.getPublisherConsents());
        bh.consume(nu.getPurposeConsents());
        bh.consume(nu.getVendorConsents());
        bh.consume(nu.getPurposeLegitimateInterests());
        bh.consume(nu.getVendorLegitimateInterests());
        bh.consume(nu.getSpecialFeatureOptins());
        bh.consume(nu.getCmpId());
        bh.consume(nu.getPublisherRestrictions());
    }
}

Here are the results:

6ac876f6 (4.X):
Benchmark                              Mode  Cnt      Score      Error   Units
Microbenchmark.run                     avgt   15  22775.215 ± 1694.869   ns/op
Microbenchmark.run:gc.alloc.rate       avgt   15   3219.333 ±  226.438  MB/sec
Microbenchmark.run:gc.alloc.rate.norm  avgt   15   6376.014 ±   25.040    B/op
Microbenchmark.run:gc.count            avgt   15    292.000             counts
Microbenchmark.run:gc.time             avgt   15    107.000                 ms

5c1d473 (4.X-perf-optimizations):
Benchmark                              Mode  Cnt      Score     Error   Units
Microbenchmark.run                     avgt   15   1994.077 ±  94.978   ns/op
Microbenchmark.run:gc.alloc.rate       avgt   15  21475.516 ± 991.532  MB/sec
Microbenchmark.run:gc.alloc.rate.norm  avgt   15   3736.001 ±  12.520    B/op
Microbenchmark.run:gc.count            avgt   15    920.000            counts
Microbenchmark.run:gc.time             avgt   15    348.000                ms

That is roughly an 11x improvement in speed and almost a 2x reduction in allocations per operation (6376 B/op down to 3736 B/op; the higher MB/sec allocation rate simply reflects the much higher throughput).
