mm/zsmalloc: reduce zs_free() latency on swap release path #757
blktests-ci[bot] wants to merge 4 commits into linus-master_base
Currently in zs_free(), the class->lock is held until the zspage is completely freed and the counters are updated. However, freeing pages back to the buddy allocator requires acquiring the zone lock. Under heavy memory pressure, zone lock contention can be severe. When this happens, the CPU holding class->lock stalls waiting for the zone lock, blocking all other CPUs attempting to acquire the same class->lock.

Shrink the class->lock critical section to reduce lock contention: by moving the actual page freeing outside the class->lock, the concurrency of zs_free() improves. Testing on the RADXA O6 platform shows that with 12 CPUs concurrently performing zs_free() operations, execution time is reduced by 20%.

Signed-off-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
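The lock-shrinking idea above can be sketched as a minimal userspace model: detach the zspage's pages under the class lock, then perform the expensive freeing after dropping it. This is not the kernel code; `struct size_class`, `detach_pages_locked()` and `free_zspage_outside_lock()` are illustrative stand-ins, with a pthread mutex standing in for class->lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct page { struct page *next; };

struct size_class {
	pthread_mutex_t lock;	/* stands in for class->lock */
	struct page *pages;	/* pages of a zspage pending release */
};

/* Under the lock, only unlink the page list; do not free anything yet. */
static struct page *detach_pages_locked(struct size_class *c)
{
	struct page *list = c->pages;

	c->pages = NULL;
	return list;
}

static int free_zspage_outside_lock(struct size_class *c)
{
	struct page *list;
	int freed = 0;

	pthread_mutex_lock(&c->lock);
	list = detach_pages_locked(c);	/* short critical section */
	pthread_mutex_unlock(&c->lock);

	/* Expensive part (zone lock in the kernel) runs with no class lock held. */
	while (list) {
		struct page *p = list;

		list = p->next;
		free(p);
		freed++;
	}
	return freed;
}
```

The point of the detach step is that other CPUs contending on class->lock no longer wait behind the buddy-allocator work.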
zs_free() is expensive due to internal locking (pool->lock, class->lock) and potential zspage freeing. On the process exit path, the slow zs_free() blocks memory reclamation, delaying overall memory release. This has been reported to significantly impact Android low-memory killing, where slot_free() accounts for over 80% of the total swap entry freeing cost.

Introduce zs_free_deferred(), which queues handles into a fixed-size per-pool array for later processing by a workqueue. This allows callers to defer the expensive zs_free() and return quickly, so the process exit path can release memory faster.

The array capacity is derived from a 128MB uncompressed data budget (128MB >> PAGE_SHIFT entries), which scales naturally with PAGE_SIZE. When the array reaches half capacity, the workqueue is scheduled to drain pending handles.

zs_free_deferred() uses spin_trylock() to access the deferred queue. If the lock is contended (e.g. a drain is in progress) or the queue is full, it falls back to synchronous zs_free() to guarantee correctness.

Also introduce zs_free_deferred_flush() for use during pool teardown to ensure all pending handles are freed.

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
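The queue-with-fallback pattern described above can be modelled in userspace as follows. This is a sketch, not the kernel implementation: the capacity is shrunk for illustration (the patch uses 128MB >> PAGE_SHIFT), a pthread trylock stands in for spin_trylock(), and the drain runs inline where the kernel would schedule_work() instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DEFERRED_CAPACITY 8	/* kernel: (128 << 20) >> PAGE_SHIFT */

struct pool {
	pthread_mutex_t deferred_lock;
	unsigned long deferred[DEFERRED_CAPACITY];
	int nr_deferred;
	int drained;		/* handles freed via the deferred path */
	int sync_freed;		/* handles freed via the fallback path */
};

static void zs_free_sync(struct pool *p, unsigned long handle)
{
	(void)handle;
	p->sync_freed++;	/* stands in for the real zs_free() */
}

/* Workqueue body: take the lock and free everything queued so far. */
static void drain_deferred(struct pool *p)
{
	pthread_mutex_lock(&p->deferred_lock);
	p->drained += p->nr_deferred;
	p->nr_deferred = 0;
	pthread_mutex_unlock(&p->deferred_lock);
}

static void zs_free_deferred(struct pool *p, unsigned long handle)
{
	bool kick;

	/* Lock contended (e.g. drain in progress): free synchronously. */
	if (pthread_mutex_trylock(&p->deferred_lock) != 0) {
		zs_free_sync(p, handle);
		return;
	}
	/* Queue full: free synchronously rather than drop the handle. */
	if (p->nr_deferred == DEFERRED_CAPACITY) {
		pthread_mutex_unlock(&p->deferred_lock);
		zs_free_sync(p, handle);
		return;
	}
	p->deferred[p->nr_deferred++] = handle;
	kick = p->nr_deferred >= DEFERRED_CAPACITY / 2;
	pthread_mutex_unlock(&p->deferred_lock);
	if (kick)
		drain_deferred(p);	/* kernel: schedule the workqueue */
}
```

The trylock fallback is what keeps the fast path wait-free: the exit path never blocks on the deferred queue, and correctness is preserved because every handle is freed on exactly one of the two paths.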
zram_slot_free_notify() is called on the process exit path when unmapping swap entries. The slot_free() it calls internally invokes zs_free(), which accounts for ~87% of the slot_free() cost due to zsmalloc internal locking (pool->lock, class->lock) and potential zspage freeing. This blocks the process exit path, delaying overall memory release during Android low-memory killing.

Split slot_free() into slot_free_extract() and the actual zs_free() call. slot_free_extract() handles all slot metadata cleanup (clearing flags, updating stats, zeroing handle/size) and returns the zsmalloc handle that needs freeing. This separation has two benefits:

1. It makes the two responsibilities of slot_free() explicit: slot metadata management (must be done under the slot lock) vs zsmalloc memory release (can be deferred).

2. It allows zram_slot_free_notify() to use zs_free_deferred() for the handle, deferring the expensive zs_free() to a workqueue so the exit path can release memory faster.

While at it, merge the three separate clear_slot_flag() calls for ZRAM_IDLE, ZRAM_INCOMPRESSIBLE and ZRAM_PP_SLOT into a single bitmask operation via clear_slot_flags_on_free(), reducing redundant read-modify-write cycles on the same flags word.

All other slot_free() callers (write, discard, meta_free) continue to use synchronous zs_free() through the unchanged slot_free() wrapper.

Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
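The extract-then-free split and the single-bitmask flag clear can be sketched as below. This is an illustrative model, not the zram code: the flag bit values, the simplified `struct zram_slot`, and the omission of locking and stats updates are all assumptions made for a self-contained example.

```c
#include <assert.h>

#define ZRAM_IDLE		(1u << 0)
#define ZRAM_INCOMPRESSIBLE	(1u << 1)
#define ZRAM_PP_SLOT		(1u << 2)

struct zram_slot {
	unsigned long handle;
	unsigned int size;
	unsigned int flags;
};

/* One read-modify-write on the flags word instead of three
 * separate clear_slot_flag() calls. */
static void clear_slot_flags_on_free(struct zram_slot *slot)
{
	slot->flags &= ~(ZRAM_IDLE | ZRAM_INCOMPRESSIBLE | ZRAM_PP_SLOT);
}

/* Metadata cleanup only (done under the slot lock in zram); returns the
 * zsmalloc handle so the caller can pick zs_free() or zs_free_deferred(). */
static unsigned long slot_free_extract(struct zram_slot *slot)
{
	unsigned long handle = slot->handle;

	clear_slot_flags_on_free(slot);
	slot->handle = 0;
	slot->size = 0;
	return handle;
}
```

With this shape, the synchronous slot_free() wrapper is just slot_free_extract() followed by zs_free(), while zram_slot_free_notify() passes the returned handle to zs_free_deferred() instead.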
zswap_invalidate() is called on the same process exit path as zram_slot_free_notify(). The zswap_entry_free() it calls internally performs zs_free(), which is expensive due to zsmalloc internal locking. Unlike zram, which has a trylock fallback, zswap_invalidate() executes unconditionally, making the latency impact potentially worse. As with zram, the expensive zs_free() here blocks the process exit path, delaying overall memory release. Additionally, zswap_entry_free() performs extra work beyond zs_free(): list_lru_del() (which takes its own spinlock), obj_cgroup accounting, and kmem_cache_free() for the entry itself.

Use zs_free_deferred() on the zswap_invalidate() path to defer the expensive zsmalloc handle freeing to a workqueue, allowing the exit path to release memory faster. All other callers (zswap_load, zswap_writeback_entry, zswap_store error paths) run in process context and continue to use synchronous zs_free().

Signed-off-by: Wenchao Hao <haowenchao@xiaomi.com>
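The caller-dependent choice described above can be sketched as a small model: only the invalidate (exit) path defers the handle, while the rest of the entry teardown stays synchronous on every path. Everything here is an illustrative stand-in; the stub counters replace the real zs_free()/zs_free_deferred(), and the lru, cgroup and kmem_cache steps are reduced to a comment.

```c
#include <assert.h>
#include <stdbool.h>

struct zswap_entry { unsigned long handle; };

static int deferred_handles, sync_handles, entries_freed;

static void zs_free_deferred(unsigned long h) { (void)h; deferred_handles++; }
static void zs_free(unsigned long h)          { (void)h; sync_handles++; }

/* defer_handle is true only on the zswap_invalidate() (process exit) path;
 * load, writeback and store-error paths keep the synchronous zs_free(). */
static void zswap_entry_free(struct zswap_entry *e, bool defer_handle)
{
	if (defer_handle)
		zs_free_deferred(e->handle);
	else
		zs_free(e->handle);
	/* list_lru_del(), obj_cgroup uncharge and kmem_cache_free()
	 * would follow here; only the zsmalloc free is deferred. */
	entries_freed++;
}
```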
Upstream branch: aa54b1d
Pull request for series with
subject: mm/zsmalloc: reduce zs_free() latency on swap release path
version: 2
url: https://patchwork.kernel.org/project/linux-block/list/?series=1083830