Commit 850ed2e

perf(hybrid): prevent expensive array slicing when cache is disabled
Added a `max_checkpoints > 0` check to the `finally` block of the generation loop. Previously, even though the underlying C++ state extraction was bypassed, the Python layer still executed `self._input_ids[:self.n_tokens].tolist()`. For long contexts, slicing this array and converting it to a Python list caused unnecessary CPU overhead and garbage-collection pressure. Together with the existing bypass in the C++ layer, this guard acts as a second layer of isolation, so hybrid models running in single-turn mode incur no history-list allocation and no checkpointing overhead at all.
1 parent 191e334 commit 850ed2e

1 file changed: llama_cpp/llama.py

Lines changed: 5 additions & 1 deletion
@@ -1396,7 +1396,11 @@ def adapter(token_data_array: llama_cpp.llama_token_data_array):
                 ]
             )
         finally:
-            if self.is_hybrid and self._hybrid_cache_mgr is not None:
+            if (
+                self.is_hybrid
+                and self._hybrid_cache_mgr is not None
+                and self._hybrid_cache_mgr.max_checkpoints > 0
+            ):
                 current_history = self._input_ids[:self.n_tokens].tolist()

                 self._hybrid_cache_mgr.save_checkpoint(
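As a rough illustration of the pattern, the sketch below shows the guard in isolation: the `max_checkpoints > 0` check runs before the slice-and-convert, so a disabled cache never materializes the token history as a Python list. `HybridCacheManager` and `maybe_checkpoint` here are hypothetical stand-ins, not the library's actual classes.

```python
import numpy as np


class HybridCacheManager:
    """Hypothetical stand-in for the hybrid cache manager in the commit."""

    def __init__(self, max_checkpoints: int):
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []

    def save_checkpoint(self, history):
        self.checkpoints.append(history)


def maybe_checkpoint(cache_mgr, input_ids: np.ndarray, n_tokens: int) -> bool:
    # Guard first: when checkpointing is disabled (max_checkpoints == 0),
    # skip the O(n) slice-and-convert entirely, so no Python list of
    # token ids is ever allocated for long contexts.
    if cache_mgr is not None and cache_mgr.max_checkpoints > 0:
        current_history = input_ids[:n_tokens].tolist()  # the expensive step
        cache_mgr.save_checkpoint(current_history)
        return True
    return False


ids = np.arange(100_000, dtype=np.int32)
enabled = HybridCacheManager(max_checkpoints=4)
disabled = HybridCacheManager(max_checkpoints=0)
```

With the guard, `maybe_checkpoint(disabled, ids, 50_000)` returns without touching `ids`, which is exactly the overhead this commit eliminates for single-turn hybrid runs.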
