and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]
## [0.3.30] Milestone Release
I will post the release notes for version 0.3.30 in the project [discussions](https://github.com/JamePeng/llama-cpp-python/discussions).
- refactor(mtmd): redesign multimodal pipeline for concurrent I/O and hybrid state management
This commit fundamentally restructures the `MTMDChatHandler` pipeline, decoupling the prefill and evaluation stages to resolve previous I/O bottlenecks and state-sync issues. The new architecture fully supports hybrid/recurrent multimodal models (e.g., Qwen3.5, LFM2-VL) with robust context management.
Key structural advantages and changes:
- Concurrent Media Decoding: Implemented `ThreadPoolExecutor` in `_process_mtmd_prompt` with pre-allocated arrays, allowing thread-safe parallel image/audio decoding while strictly preserving the chronological order of user inputs; this also lays the groundwork for processing large numbers of video frames in the future.
- O(1) Prefix Matching ("Negative Reverse Vocabulary"): Replaced slow dictionary lookups with a deterministic hash-to-negative-integer mapping for media IDs. This isolates media tokens from the LLM's positive vocabulary space, enabling native, ultra-fast `longest_token_prefix` array comparisons in Python.
- Hybrid Model State Management: Replaced aggressive mid-turn saving with highly efficient "End-of-Turn" checkpointing. This ensures multi-image prompts consume only a single LRU slot while allowing precise rollback to the nearest valid state upon cache misses.
- Robust Context Shift (OOM Defense): The `__call__` loop now preemptively calculates token boundaries for upcoming multimodal chunks, safely discarding the oldest unpinned tokens from both the physical KV cache and the Python virtual ledger to prevent backend crashes.
- Qwen3.5 support confirmed; awaiting the Qwen35ChatHandler PR merge
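
The concurrent decoding approach above can be illustrated with a minimal sketch; `decode_media`, `decode_all`, and the worker layout are hypothetical stand-ins for the actual `_process_mtmd_prompt` code, showing only the pre-allocated-slot pattern:

```python
# Minimal sketch of order-preserving parallel media decoding.
# decode_media/decode_all are hypothetical names; only the pattern
# (pre-allocated slots + ThreadPoolExecutor) mirrors the change above.
from concurrent.futures import ThreadPoolExecutor

def decode_media(item):
    # Stand-in for the real image/audio decode step.
    return f"decoded:{item}"

def decode_all(media_items, max_workers=4):
    # Pre-allocate one result slot per input: each worker writes to a fixed
    # index, so chronological order survives out-of-order completion.
    results = [None] * len(media_items)

    def worker(i):
        results[i] = decode_media(media_items[i])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consume the iterator so any worker exception propagates here.
        list(pool.map(worker, range(len(media_items))))
    return results
```

Because each worker owns exactly one pre-allocated slot, no locking is needed to keep the output aligned with the input order.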
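
The negative-ID scheme can likewise be sketched under stated assumptions: `media_token_id` and `longest_token_prefix` below are illustrative names and the hash width is arbitrary; only the idea of folding media hashes into a negative range disjoint from the model vocabulary comes from the entry above:

```python
# Sketch of the "negative reverse vocabulary" idea: media chunks hash to
# negative integers, so they can never collide with real (non-negative)
# token IDs, and plain array comparison works on mixed sequences.
import hashlib

def media_token_id(media_bytes: bytes) -> int:
    # Deterministic hash folded into the strictly negative range.
    h = int.from_bytes(hashlib.sha256(media_bytes).digest()[:8], "big")
    return -(h % (2**31 - 1)) - 1

def longest_token_prefix(a, b) -> int:
    # Element-wise scan; media IDs compare exactly like text tokens
    # because they occupy a disjoint ID space.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```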
- merge: Implemented `Qwen35ChatHandler` for Qwen3.5 (by **@alcoftTAO**)
- fix: Correct the mtmd vision check condition bug
- refactor(chat_handler): extract `MTMDChatHandler` base class to simplify subsequent multimodal adaptation
- Extracted the core multimodal processing pipeline from `Llava15ChatHandler` into a generic `MTMDChatHandler` base class, separating pipeline logic from model-specific prompt formats.
- Updated all multimodal subclass handlers (e.g., Gemma3, Granite-Docling, PaddleOCR, Qwen2.5vl, Qwen3-vl, MiniCPM, GLM4.xV, LFM2-VL) to inherit from the new base class `MTMDChatHandler`.
- Implemented strict `**kwargs` validation in the base constructor to gracefully intercept and report unsupported parameters, significantly improving Developer Experience (DX).
- Introduced dynamic `self.log_prefix` (`self.__class__.__name__`) for accurate and consistent logging across all subclasses.
- Cleaned up redundant state-clearing and image-count logic, and removed hardcoded print statements across subclass `__call__` implementations.
- Guarded the `close` method against exceptions when it is called after an initialization failure, before `exit_stack` has been fully set up.
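
The strict `**kwargs` interception might look roughly like this; the class name and accepted parameters here are invented for illustration, not the library's real constructor signature:

```python
# Hypothetical sketch of strict **kwargs validation in a base constructor.
# MTMDBase and its parameter list are illustrative, not the real API.
class MTMDBase:
    def __init__(self, clip_model_path=None, verbose=False, **kwargs):
        if kwargs:
            # Report every unsupported name at once instead of silently
            # swallowing typos such as `clip_model_pth=...`.
            unsupported = ", ".join(sorted(kwargs))
            raise TypeError(
                f"{self.__class__.__name__} got unsupported parameter(s): {unsupported}"
            )
        # Dynamic log prefix stays accurate in every subclass.
        self.log_prefix = self.__class__.__name__
        self.clip_model_path = clip_model_path
        self.verbose = verbose
```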
- feat: Update llama.cpp to [ggml-org/llama.cpp/commit/2afcdb9777b1bac79fa4bfe284b9bf23085b0469](https://github.com/ggml-org/llama.cpp/commit/2afcdb9777b1bac79fa4bfe284b9bf23085b0469)
- feat: Sync llama.cpp llama/mtmd API Binding 20260301
39
+
40
+
Many thanks to **@yamikumo-DSD** and **@roj234** for their detailed testing and valuable suggestions.
For more information, see: https://github.com/JamePeng/llama-cpp-python/compare/e4861df5fd44bb83ec2b9063ca3375759416aead...3f8f0f89a2b72ec2f9494fa5f14206591a5cde49
## [0.3.29]
- perf(eval): implement adaptive checkpoint intervals for hybrid models
47
+
- Dynamically scale checkpoint frequency during large prompt pre-filling (max 3 triggers per eval) to minimize I/O bottlenecks and stuttering.
48
+
- Add success validation to `save_checkpoint`, ensuring the `last_ckpt_pos` tracker is only updated when the state is successfully saved to disk/memory.
49
+
- Enhance verbose logging to track dynamic interval calculations and save failures.
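
One plausible reading of the adaptive interval, as a sketch (the function name and the ceiling-division formula are assumptions, not the shipped code):

```python
# Sketch: stretch the checkpoint interval with prompt size so that at most
# `max_triggers` checkpoints fire during one large pre-fill pass.
def effective_interval(n_prompt_tokens, base_interval=4096, max_triggers=3):
    # Ceiling division: for prompts larger than base_interval * max_triggers,
    # the interval grows so checkpoint I/O is capped at max_triggers saves.
    return max(base_interval, -(-n_prompt_tokens // max_triggers))
```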
- fix(eval): make context shift mathematically robust and architecture-safe
52
+
- Added a `memory_can_shift()` pre-flight check to proactively intercept and abort gracefully on architectures that physically forbid shifting (e.g., multimodal mmproj where `n_pos_per_embd > 1` or incompatible M-RoPE), preventing fatal `GGML_ASSERT` C++ crashes.
53
+
- Implemented dynamic mathematical bounds for `n_keep` and `n_discard` to guarantee that enough space is always freed, completely eliminating the edge-case where `n_discard` evaluates to 0 (causing a dead-loop when `n_ctx` is extremely small).
54
+
- Wrapped underlying C++ memory shift operations in a try-except block for defense-in-depth against unexpected backend failures.
55
+
- Expanded in-code documentation to clarify the arithmetic constraints and architectural limitations of the KV shift mechanism.
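
The arithmetic guarantees can be expressed as a small sketch; `shift_bounds` and its signature are hypothetical, but it encodes the invariants described above (`n_discard` is never 0, and enough room is freed for the incoming tokens):

```python
# Sketch of mathematically robust shift bounds; names are illustrative.
def shift_bounds(n_ctx, n_past, n_keep, n_needed):
    # Clamp n_keep so at least one token remains discardable and the
    # incoming n_needed tokens can still fit inside the context window.
    n_keep = max(0, min(n_keep, n_past - 1, n_ctx - n_needed - 1))
    # Discard at least 1 (no dead-loop on tiny n_ctx), at least half the
    # movable region (amortize shifts), and enough to admit n_needed tokens.
    n_discard = max(1, (n_past - n_keep) // 2, n_past + n_needed - n_ctx)
    return n_keep, n_discard
```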
- Added the `memory_can_shift` API to the `LlamaContext` class
- feat(eval): enable native context shift for hybrid/recurrent models
60
+
- Removed the `RuntimeError` that previously blocked context shifting for hybrid and SWA architectures.
61
+
- Delegated the shift logic to the underlying C++ backend, which automatically handles Attention KV removal and RNN `pos` shifting.
62
+
- Added dynamic verbose logging to clearly identify the model type (Transformer vs. Hybrid/Recurrent/SWA) during a context shift event.
- fix(eval): prevent batch size from halving below 1 during KV slot exhaustion
- Added an explicit guard to break the dynamic batch downgrade loop when `current_batch_size` is exactly 1 and a Code 1 (No KV slot) is returned.
- Prevents the engine from executing an invalid `1 // 2` operation and generating the confusing "Halving batch size from 1 to 0" verbose log.
- Ensures the evaluation process fails fast and aborts gracefully when physical VRAM is completely depleted and no further fallback is mathematically possible.
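
The guarded downgrade loop can be sketched as follows; the `decode_batch` callable and its return-code semantics (0 = success, 1 = no KV slot) are placeholders for the real backend call, not the library's API:

```python
# Sketch of the batch-downgrade loop with the fail-fast guard.
def eval_with_fallback(decode_batch, tokens, n_batch):
    current = n_batch
    while True:
        rc = decode_batch(tokens[:current])
        if rc == 0:
            return current                 # decoded at this batch size
        if rc == 1 and current == 1:
            # KV cache fully exhausted and no smaller batch exists:
            # abort instead of computing 1 // 2 == 0 and looping.
            raise RuntimeError("no KV slot at batch size 1; aborting")
        if rc == 1:
            current //= 2                  # halve and retry
        else:
            raise RuntimeError(f"decode failed with code {rc}")
```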
- feat(hybrid): add periodic checkpointing and adaptive batch handling
- Increase default `ctx_checkpoints` from 16 to 32
- Add new parameter `checkpoint_interval` (default: 4096) for hybrid model state snapshots