Commit 850ed2e

perf(hybrid): prevent expensive array slicing when cache is disabled
Added a `max_checkpoints > 0` check to the `finally` block of the generation loop. Previously, even though the underlying C++ state extraction was bypassed, the Python layer still executed `self._input_ids[:self.n_tokens].tolist()`. For long contexts, slicing this array and converting it to a Python list caused unnecessary CPU overhead and garbage-collection pressure. Together with the existing bypass in the C++ layer, this guard acts as a second layer of isolation, so hybrid models running in single-turn mode incur no history-list allocation and no checkpointing overhead at all.
1 parent 191e334 commit 850ed2e

1 file changed: llama_cpp/llama.py

Lines changed: 5 additions & 1 deletion
@@ -1396,7 +1396,11 @@ def adapter(token_data_array: llama_cpp.llama_token_data_array):
                 ]
             )
         finally:
-            if self.is_hybrid and self._hybrid_cache_mgr is not None:
+            if (
+                self.is_hybrid
+                and self._hybrid_cache_mgr is not None
+                and self._hybrid_cache_mgr.max_checkpoints > 0
+            ):
                 current_history = self._input_ids[:self.n_tokens].tolist()

                 self._hybrid_cache_mgr.save_checkpoint(
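As a rough illustration of the pattern, the sketch below shows the guard in isolation: the `max_checkpoints > 0` check runs before the slice-and-convert, so a disabled cache never materializes the token history as a Python list. `HybridCacheManager` and `maybe_checkpoint` here are hypothetical stand-ins, not the library's actual classes.

```python
import numpy as np


class HybridCacheManager:
    """Hypothetical stand-in for the hybrid cache manager in the commit."""

    def __init__(self, max_checkpoints: int):
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []

    def save_checkpoint(self, history):
        self.checkpoints.append(history)


def maybe_checkpoint(cache_mgr, input_ids: np.ndarray, n_tokens: int) -> bool:
    # Guard first: when checkpointing is disabled (max_checkpoints == 0),
    # skip the O(n) slice-and-convert entirely, so no Python list of
    # token ids is ever allocated for long contexts.
    if cache_mgr is not None and cache_mgr.max_checkpoints > 0:
        current_history = input_ids[:n_tokens].tolist()  # the expensive step
        cache_mgr.save_checkpoint(current_history)
        return True
    return False


ids = np.arange(100_000, dtype=np.int32)
enabled = HybridCacheManager(max_checkpoints=4)
disabled = HybridCacheManager(max_checkpoints=0)
```

With the guard, `maybe_checkpoint(disabled, ids, 50_000)` returns without touching `ids`, which is exactly the overhead this commit eliminates for single-turn hybrid runs.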
