Are there any benchmarks to validate those claims?
Performance fixes from Claude!
iso_alloc_zone_t field reordering — include/iso_alloc_ds.h
The biggest win. is_full was previously at offset ~2,119 bytes (cache line 33), buried after the 2,040-byte free_bit_slots[255] array. Every call to is_zone_usable(), the first check in the hot allocation path, would likely take a cache miss loading that field.
The new layout puts all hot fields (user_pages_start, bitmap_start, next_free_bit_slot, canary_secret, pointer_mask, max_bitmap_idx, chunk_size, free_bit_slots_usable, free_bit_slots_index, is_full, internal) in the first 64 bytes (one cache line). The large free_bit_slots[255] array, which is only accessed during free-list refills, moves to the end.
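A minimal sketch of the hot/cold split described above. The field names come from this description, but every field type below is an assumption, and `example_zone_t` and `hot_region_end` are hypothetical names, not the real `iso_alloc_zone_t` definition. The compile-time check only verifies that the hot fields end within the first 64 bytes of the struct; they share one cache line only if the zone itself is 64-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumed cache line size */
#define FREE_SLOT_COUNT 255

/* Hypothetical layout; types are guesses for illustration only. */
typedef struct {
    /* Hot fields: touched on every allocation. */
    void *user_pages_start;
    void *bitmap_start;
    int64_t next_free_bit_slot;
    uint64_t canary_secret;
    uint64_t pointer_mask;
    uint32_t max_bitmap_idx;
    uint32_t chunk_size;
    uint8_t free_bit_slots_usable;
    uint8_t free_bit_slots_index;
    bool is_full;
    bool internal;
    /* Cold field: only touched when the free-list cache is refilled. */
    int64_t free_bit_slots[FREE_SLOT_COUNT];
} example_zone_t;

/* Fail the build if a later edit pushes a hot field past the first line. */
static_assert(offsetof(example_zone_t, internal) + sizeof(bool) <= CACHE_LINE_SIZE,
              "hot fields must end within the first cache line");

/* Byte offset just past the last hot field. */
static size_t hot_region_end(void) {
    return offsetof(example_zone_t, internal) + sizeof(bool);
}
```

A static assertion like this keeps the layout from silently regressing when fields are added later.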
__builtin_ctzll in iso_scan_zone_free_slot_slow — src/iso_alloc.c
Replaced all inner for(j = 0; j < 64; j += 2) loops with:
uint64_t free_mask = ~(uint64_t)bts & USED_BIT_VECTOR;
if (free_mask) return (offset + __builtin_ctzll(free_mask));
USED_BIT_VECTOR = 0x5555... selects even-position bits (one per chunk). Inverting + ANDing gives a mask of free slots. CTZ finds the first in one instruction instead of 32 iterations. Applied to all three paths: NEON, __int128 (split into two 64-bit halves), and standard.
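To make the mask-and-CTZ step concrete, here is a standalone sketch that compares it against a 32-iteration reference loop. The function names are hypothetical, and the loop body is an assumption about what the original code tested (the even-position "in use" bit). The sketch returns -1 for a word with no free slot because `__builtin_ctzll(0)` is undefined, which is why the snippet above guards with `if (free_mask)`.

```c
#include <assert.h>
#include <stdint.h>

#define USED_BIT_VECTOR 0x5555555555555555ULL /* even-position bits */

/* Bit index of the first free chunk in one bitmap word, or -1 if none. */
static inline int first_free_bit(uint64_t word) {
    uint64_t free_mask = ~word & USED_BIT_VECTOR;
    if (free_mask == 0) {
        return -1; /* __builtin_ctzll(0) is undefined, so bail out first */
    }
    return __builtin_ctzll(free_mask);
}

/* Reference version: test each even bit in turn (assumed old behavior). */
static inline int first_free_bit_loop(uint64_t word) {
    for (int j = 0; j < 64; j += 2) {
        if (((word >> j) & 1ULL) == 0) {
            return j;
        }
    }
    return -1;
}

/* Both versions must agree for any input word. */
static inline void check_equivalence(uint64_t word) {
    assert(first_free_bit(word) == first_free_bit_loop(word));
}
```

`__builtin_ctzll` is a GCC/Clang builtin; whether it compiles to a single instruction depends on the target.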
__builtin_ctzll in fill_free_bit_slots — src/iso_alloc.c
Same technique for populating the free-list cache, replacing the 32-iteration inner loop in the partial-word case with a free_mask &= free_mask - 1 iteration (classic "iterate over set bits" idiom).
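The "iterate over set bits" idiom can be sketched as follows. `collect_free_bits`, its parameters, and the output buffer are hypothetical and stand in for whatever fill_free_bit_slots actually writes to. Each pass records the lowest set bit of the mask and then clears it, so the loop runs once per free slot rather than once per bit pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define USED_BIT_VECTOR 0x5555555555555555ULL /* even-position bits */

/* Append the bit index of every free chunk in `word` to `out`.
 * `base` is the bit offset of this word within the whole bitmap.
 * Returns the number of entries written (at most `cap`). */
static size_t collect_free_bits(uint64_t word, uint64_t base,
                                uint64_t *out, size_t cap) {
    assert(out != NULL);
    size_t n = 0;
    uint64_t free_mask = ~word & USED_BIT_VECTOR;
    while (free_mask != 0 && n < cap) {
        out[n++] = base + (uint64_t)__builtin_ctzll(free_mask);
        free_mask &= free_mask - 1; /* clear the lowest set bit */
    }
    return n;
}
```

For a word with only two free chunks this does two iterations instead of 32.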
Zone cache scan direction — src/iso_alloc.c
Changed the thread zone-cache scan from oldest-first (0 → count-1) to newest-first (count-1 → 0). The cache is populated LIFO, so the most recently used zone — most likely to still have free slots — is found on the first iteration instead of the last.
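A rough sketch of the newest-first scan, assuming the cache is a small array where new entries are appended at the end. The types `cached_zone_t` and `cache_entry_t`, the function `find_cached_zone`, and the fit test are all hypothetical and not taken from src/iso_alloc.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real zone and cache entry types. */
typedef struct {
    size_t chunk_size;
    bool is_full;
} cached_zone_t;

typedef struct {
    cached_zone_t *zone;
    size_t chunk_size;
} cache_entry_t;

/* Scan from the newest entry (count - 1) down to the oldest (0) and
 * return the first zone that fits `size` and still has room. */
static cached_zone_t *find_cached_zone(cache_entry_t *cache, size_t count,
                                       size_t size) {
    assert(cache != NULL || count == 0);
    for (size_t i = count; i-- > 0;) {
        cached_zone_t *zone = cache[i].zone;
        if (cache[i].chunk_size >= size && !zone->is_full) {
            return zone;
        }
    }
    return NULL;
}
```

The `i-- > 0` form counts down safely with an unsigned index, avoiding the wraparound that `i >= 0` would cause.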