
Commit 41959f5

Bump version to milestone version 0.3.30.
1 parent 3f8f0f8 commit 41959f5

File tree

2 files changed: +128 −1 lines changed


CHANGELOG.md

Lines changed: 127 additions & 0 deletions
@@ -7,6 +7,133 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

## [0.3.30] Milestone Release

The release notes for version 0.3.30 will be updated in the [discussion](https://github.com/JamePeng/llama-cpp-python/discussions).

- refactor(mtmd): redesign multimodal pipeline for concurrent I/O and hybrid state management

  This commit fundamentally restructures the `MTMDChatHandler` pipeline, decoupling the prefill and evaluation stages to resolve previous I/O bottlenecks and state-sync issues. The new architecture fully supports hybrid/recurrent multimodal models (e.g., Qwen3.5, LFM2-VL) with robust context management.

  Key structural advantages and changes:
  - Concurrent media decoding: implemented a `ThreadPoolExecutor` in `_process_mtmd_prompt` with pre-allocated arrays, allowing thread-safe parallel image/audio decoding while strictly preserving the chronological order of user inputs; this can later be extended to process large numbers of video frames.
  - O(1) prefix matching ("negative reverse vocabulary"): replaced slow dictionary lookups with a deterministic hash-to-negative-integer mapping for media IDs. This isolates media tokens from the LLM's non-negative vocabulary space, enabling fast, native `longest_token_prefix` array comparisons in Python.
  - Hybrid model state management: replaced aggressive mid-turn saving with efficient end-of-turn checkpointing. This ensures multi-image prompts consume only a single LRU slot while allowing precise rollback to the nearest valid state on a cache miss.
  - Robust context shift (OOM defense): the `__call__` loop now preemptively calculates token boundaries for upcoming multimodal chunks, safely discarding the oldest unpinned tokens from both the physical KV cache and the Python virtual ledger to prevent backend crashes.
  - Qwen3.5 support CONFIRMED; awaiting the Qwen35ChatHandler PR merge.

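The negative-ID scheme can be sketched roughly as follows. `media_id_for` and the exact hash-to-integer mapping are illustrative assumptions, not the handler's actual code; the point is that real vocabulary tokens are always non-negative, so negative media IDs can never collide with them and prefix matching stays a flat integer comparison:

```python
import hashlib

def media_id_for(chunk_bytes: bytes) -> int:
    """Map a media chunk to a deterministic *negative* pseudo-token ID.

    Vocabulary tokens are always >= 0, so a negative ID can never collide
    with a real token and remains a plain int in the token list.
    """
    digest = hashlib.sha256(chunk_bytes).digest()
    # Take 63 bits of the hash, then force the result negative (never 0).
    return -(int.from_bytes(digest[:8], "big") >> 1) - 1

def longest_token_prefix(a: list[int], b: list[int]) -> int:
    """Length of the shared prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Media placeholders interleave with normal (non-negative) tokens.
img = media_id_for(b"fake-image-bytes")
cached = [1, 2, img, 3, 4]
prompt = [1, 2, img, 3, 9]
assert img < 0
assert longest_token_prefix(cached, prompt) == 4
```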
- merge: Implemented `Qwen35ChatHandler` for Qwen3.5 (by **@alcoftTAO**)

- fix: Correct the mtmd vision check condition bug

- refactor(chat_handler): extract `MTMDChatHandler` base class to simplify subsequent multimodal adaptation
  - Extracted the core multimodal processing pipeline from `Llava15ChatHandler` into a generic `MTMDChatHandler` base class, separating pipeline logic from model-specific prompt formats.
  - Updated all multimodal subclass handlers (e.g., Gemma3, Granite-Docling, PaddleOCR, Qwen2.5vl, Qwen3-vl, MiniCPM, GLM4.xV, LFM2-VL) to inherit from the new `MTMDChatHandler` base class.
  - Implemented strict `**kwargs` validation in the base constructor to gracefully intercept and report unsupported parameters, significantly improving developer experience (DX).
  - Introduced a dynamic `self.log_prefix` (`self.__class__.__name__`) for accurate and consistent logging across all subclasses.
  - Cleaned up redundant state-clearing and image-count logic and hardcoded print statements across subclass `__call__` implementations.
  - Guarded the `close` method so that a failed initialization (and the resulting `exit_stack` call) no longer raises an exception.

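The strict `**kwargs` validation can be sketched as below. The class and parameter names are stand-ins for illustration, not the library's actual signatures:

```python
class MTMDChatHandler:
    """Illustrative base class; constructor parameters are assumptions."""

    def __init__(self, clip_model_path: str, verbose: bool = False, **kwargs):
        # Strict kwargs validation: report typos / unsupported parameters
        # loudly instead of silently swallowing them.
        if kwargs:
            unsupported = ", ".join(sorted(kwargs))
            raise TypeError(
                f"{self.__class__.__name__} got unsupported parameter(s): {unsupported}"
            )
        self.clip_model_path = clip_model_path
        self.verbose = verbose
        # Dynamic prefix so every subclass logs under its own name.
        self.log_prefix = self.__class__.__name__


class Qwen3VLChatHandler(MTMDChatHandler):
    """Hypothetical subclass standing in for the real handlers."""


handler = Qwen3VLChatHandler(clip_model_path="mmproj.gguf")
print(handler.log_prefix)  # → Qwen3VLChatHandler

try:
    Qwen3VLChatHandler(clip_model_path="mmproj.gguf", temperture=0.7)  # typo on purpose
except TypeError as err:
    print(err)
```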
- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/2afcdb9777b1bac79fa4bfe284b9bf23085b0469](https://github.com/ggml-org/llama.cpp/commit/2afcdb9777b1bac79fa4bfe284b9bf23085b0469)

- feat: Sync llama.cpp llama/mtmd API binding 20260301

Many thanks to **@yamikumo-DSD** and **@roj234** for providing detailed testing and valuable suggestions.

For more information, see: https://github.com/JamePeng/llama-cpp-python/compare/e4861df5fd44bb83ec2b9063ca3375759416aead...3f8f0f89a2b72ec2f9494fa5f14206591a5cde49

## [0.3.29]

- perf(eval): implement adaptive checkpoint intervals for hybrid models
  - Dynamically scale checkpoint frequency during large prompt pre-filling (at most 3 triggers per eval) to minimize I/O bottlenecks and stuttering.
  - Add success validation to `save_checkpoint`, ensuring the `last_ckpt_pos` tracker is only updated when the state is successfully saved to disk/memory.
  - Enhance verbose logging to track dynamic interval calculations and save failures.

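The interval scaling can be sketched as a ceiling division; the function name and the exact heuristic are assumptions for illustration only:

```python
def adaptive_checkpoint_interval(n_tokens: int, base_interval: int = 4096,
                                 max_triggers: int = 3) -> int:
    """Widen the checkpoint interval so one eval saves at most max_triggers times.

    Illustrative sketch: small prompts use the base interval; huge prefills
    grow the interval so at most max_triggers checkpoints are written,
    keeping I/O stalls bounded.
    """
    # Ceiling division without math.ceil: -(-a // b)
    return max(base_interval, -(-n_tokens // max_triggers))
```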
- fix(eval): make context shift mathematically robust and architecture-safe
  - Added a `memory_can_shift()` pre-flight check to proactively intercept and abort gracefully on architectures that physically forbid shifting (e.g., multimodal mmproj where `n_pos_per_embd > 1`, or incompatible M-RoPE), preventing fatal `GGML_ASSERT` C++ crashes.
  - Implemented dynamic mathematical bounds for `n_keep` and `n_discard` to guarantee that enough space is always freed, eliminating the edge case where `n_discard` evaluates to 0 (causing a dead loop when `n_ctx` is extremely small).
  - Wrapped the underlying C++ memory shift operations in a try-except block for defense in depth against unexpected backend failures.
  - Expanded in-code documentation to clarify the arithmetic constraints and architectural limitations of the KV shift mechanism.

- Add the `memory_can_shift` API to the `LlamaContext` class

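The bounds described above can be sketched as follows. `plan_context_shift` is a hypothetical helper, and the half-window fallback is an assumed heuristic, not the library's actual arithmetic:

```python
def plan_context_shift(n_ctx: int, n_past: int, n_keep: int, n_new: int):
    """Pick (n_keep, n_discard) so that n_new tokens always fit after the shift.

    n_keep is clamped so a discardable region always exists, and n_discard is
    clamped to >= 1 so the shift can never degenerate into a dead loop when
    n_ctx is tiny.
    """
    n_keep = max(0, min(n_keep, n_past - 1))   # keep at most n_past - 1 pinned tokens
    needed = max(0, n_past + n_new - n_ctx)    # overflow that must be freed
    n_left = n_past - n_keep                   # size of the discardable region
    n_discard = max(1, min(n_left, max(needed, n_left // 2)))
    return n_keep, n_discard

# Even with an extremely small context, at least one token is discarded.
print(plan_context_shift(n_ctx=2, n_past=2, n_keep=2, n_new=1))  # → (1, 1)
```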
- feat(eval): enable native context shift for hybrid/recurrent models
  - Removed the `RuntimeError` that previously blocked context shifting for hybrid and SWA architectures.
  - Delegated the shift logic to the underlying C++ backend, which automatically handles attention KV removal and RNN `pos` shifting.
  - Added dynamic verbose logging to clearly identify the model type (Transformer vs. Hybrid/Recurrent/SWA) during a context shift event.

- fix(eval): prevent batch size from halving below 1 during KV slot exhaustion
  - Added an explicit guard to break the dynamic batch downgrade loop when `current_batch_size` is exactly 1 and a Code 1 (no KV slot) is returned.
  - Prevents the engine from executing an invalid `1 // 2` operation and generating the confusing "Halving batch size from 1 to 0" verbose log.
  - Ensures the evaluation process fails fast and aborts gracefully when physical VRAM is completely depleted and no further fallback is mathematically possible.

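The downgrade loop with its size-1 guard can be sketched like this; `eval_with_downgrade` and the `decode` callable are stand-ins for the real eval machinery:

```python
def eval_with_downgrade(decode, batch_size: int) -> int:
    """Retry with halved batches on 'no KV slot' (code 1); abort at size 1.

    `decode` is a stand-in for the llama_decode wrapper: it returns 0 on
    success and 1 when no KV slot is available for the requested batch.
    """
    while True:
        status = decode(batch_size)
        if status == 0:
            return batch_size
        if status != 1:
            raise RuntimeError(f"decode failed with fatal code {status}")
        if batch_size == 1:
            # 1 // 2 would be 0: no smaller batch exists, so fail fast.
            raise RuntimeError("KV cache exhausted: cannot place even a single token")
        batch_size //= 2
        print(f"Halving batch size from {batch_size * 2} to {batch_size}")

# A fake backend that only has room for batches of at most 2 tokens.
print(eval_with_downgrade(lambda n: 0 if n <= 2 else 1, batch_size=8))  # → 2
```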
- feat(hybrid): add periodic checkpointing and adaptive batch handling
  - Increase default `ctx_checkpoints` from 16 to 32
  - Add new parameter `checkpoint_interval` (default: 4096) for hybrid model state snapshots
  - Implement robust dynamic batch downgrade on KV cache exhaustion (status=1)
  - Introduce periodic checkpoint saves during eval in hybrid mode
  - Improve error handling and logging around context shifts and decoding failures

- perf(decode): treat KV slot exhaustion (code 1) as a recoverable return value
  - Updated the `decode` wrapper to explicitly return `1` instead of raising a `RuntimeError` when `llama_decode` indicates no KV slots are available.
  - Aligned the Python API behavior with the underlying C++ contract, treating code 1 as a recoverable signal rather than a fatal crash.
  - Enabled upper-level caller loops (like `eval`) to gracefully handle VRAM fragmentation via dynamic batch halving, without relying on brittle try-except string parsing.
  - Retained strict `RuntimeError` exceptions for truly fatal backend failures (e.g., codes -1, -2, -3).
  - Added comprehensive docstrings detailing return codes and exception scenarios.

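The contract can be sketched as a small mapping function; `decode_status` is an illustrative name, not the wrapper's real signature:

```python
def decode_status(raw_ret: int) -> int:
    """Map a raw llama_decode return code to the Python-level contract.

    Sketch of the contract described above: 0 = success, 1 = recoverable
    'no KV slot' signal, any other code = fatal backend failure.
    """
    if raw_ret in (0, 1):
        # Code 1 is returned, not raised: the caller may halve its batch
        # and retry instead of parsing an exception message.
        return raw_ret
    raise RuntimeError(f"llama_decode returned fatal code {raw_ret}")
```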
- feat(core): overhaul generate and eval for hybrid model support (Qwen3-Next, Qwen3.5, etc.)
  - Integrated `HybridCheckpointCache` into the generation loop to support state rollback for recurrent/hybrid architectures.
  - Implemented context shift (sliding window) in `eval` to gracefully prevent OOM when exceeding `n_ctx`.
  - Adapted `eval` to use the newly vectorized `LlamaBatch.add_sequence` API with dynamic `logits_array` configuration.
  - Fixed the full-prefix-match bug by forcing a 1-token re-evaluation to refresh logits.
  - Disabled speculative decoding for hybrid models to prevent irreversible state pollution.
  - Wrapped the generation loop in a `try...finally` block to guarantee safe checkpoint saving.

- refactor(LlamaBatch): replace `set_batch` with granular `add_token` + vectorized `add_sequence`
  - Introduce high-performance `add_token()` for single-token appends in the generation loop
  - Add flexible `add_sequence()` with per-token `pos`/`seq_ids`/`logits` arrays
  - Remove the old `set_batch()`, which assumed a single sequence and forced the last logit
  - Better support for multi-sequence decoding and precise logit control

## [0.3.28]

- fix(HybridCheckpointCache): `ValueError: bytes must be in range(0, 256)`

- feat: add `HybridCheckpointCache` detection support for recurrent/hybrid/SWA models
  - Introduce `ctx_checkpoints` parameter (default 16)
  - Detect recurrent / hybrid / `n_swa > 0` models in `__init__`
  - Automatically use `HybridCheckpointCache` when a hybrid architecture is detected
  - Properly close and clear `HybridCheckpointCache` in `__del__`

- fix(cache): add safety guards to checkpoint restore and optimize API calls
  - Replaced direct `llama_cpp` API calls with cached function pointers (`self._get_size_ext`, etc.) for better performance and consistency.
  - Added sequence ID validation with verbose error logging to prevent cross-sequence contamination.
  - Added strict state size validation before restoration to prevent buffer overflows and backend segmentation faults.

- Remove redundant `seq_id` and add resource cleanup
  - Removed `seq_id` from `HybridCheckpointCache` initialization to make it a stateless, global multi-sequence manager.
  - Added `close()` and `__del__()` methods to safely release C++ context references and prevent memory leaks.

- feat(cache): implement `HybridCheckpointCache` for hybrid/recurrent models

  Introduces a dedicated caching mechanism to support state rollback for models that cannot physically truncate their KV cache (e.g., Qwen3-Next, Qwen3.5, etc.).

  Key additions and changes:
  - Add `HybridCheckpoint` dataclass to store RNN state snapshots along with their binary data and metadata.
  - Implement `HybridCheckpointCache` to manage sequence-specific states using the `llama_state_seq_*_ext` C++ APIs.
  - Introduce `_hash_prefix` using SHA-256 to guarantee cryptographic certainty when matching prompt histories, preventing state corruption.
  - Add `save_checkpoint` with a FIFO eviction policy to strictly bound memory usage based on `max_checkpoints`.
  - Add `restore_checkpoint` to securely inject valid RNN states back into the C++ backend.
  - Explicitly disable incompatible dictionary interfaces (`__getitem__`, `__setitem__`, `__contains__`) inherited from `BaseLlamaCache`.
  - Refactor module imports (alphabetical sorting) and relocate `LlamaDiskCache` for better structural consistency.

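Roughly, such a cache might look like the sketch below. The class is a simplified stand-in (the real implementation snapshots state via the `llama_state_seq_*_ext` C++ APIs rather than storing caller-supplied bytes):

```python
import hashlib
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class HybridCheckpoint:
    """Snapshot of a recurrent/hybrid state (fields simplified for illustration)."""
    n_tokens: int
    state_data: bytes


class HybridCheckpointCacheSketch:
    """FIFO-bounded checkpoint store keyed by a SHA-256 hash of the token prefix."""

    def __init__(self, max_checkpoints: int = 16):
        self.max_checkpoints = max_checkpoints
        self._store = OrderedDict()

    @staticmethod
    def _hash_prefix(tokens):
        # signed=True tolerates negative media IDs without a ValueError.
        raw = b"".join(t.to_bytes(8, "little", signed=True) for t in tokens)
        return hashlib.sha256(raw).hexdigest()

    def save_checkpoint(self, tokens, state_data: bytes) -> None:
        if len(self._store) >= self.max_checkpoints:
            self._store.popitem(last=False)  # FIFO eviction: drop the oldest entry
        self._store[self._hash_prefix(tokens)] = HybridCheckpoint(len(tokens), state_data)

    def restore_checkpoint(self, tokens):
        # SHA-256 keys make an accidental match between two different
        # prompt histories practically impossible.
        return self._store.get(self._hash_prefix(tokens))
```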
- Remove the hack code in `llama_chat_format.py`

- Llama: optimize KV cache management for multi-round conversations
  - Implements prefix-matching logic to truncate stale "ghost" tokens in the C++ KV cache
  - Prevents attention misalignment and context poisoning during multi-turn interactions
  - Reduces memory overhead by reusing matched prefixes efficiently

## [0.3.27]

- feat: add `PaddleOCR-VL-1.5` multimodal chat handler `PaddleOCRChatHandler`

llama_cpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 from .llama_cpp import *
 from .llama import *

-__version__ = "0.3.27"
+__version__ = "0.3.30"
