Restore BeeLlama local changes on cleaned history by Anbeeld · Pull Request #1 · Anbeeld/beellama.cpp

Anbeeld · 2026-05-09T15:00:03Z

DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer; the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for the target to verify.
TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) span 4x to 7.5x compression, and the higher-bit options are practically lossless in many cases. Set the K and V cache types independently with --cache-type-k and --cache-type-v.
Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline; the fringe alternative maps acceptance-rate bands to draft depth. Use --no-spec-dm-adaptive for a static horizon.
Full multimodal support: When --mmproj is active, the server keeps flat DFlash available for text generation. The model can be fully offloaded to CPU without issue to reduce VRAM pressure.
Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. The default mode is force-close; --reasoning-loop-window and --reasoning-loop-max-period are available for tuning.
Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. It activates when both the draft and target temperatures exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.
DDTree branch verification: The optional --spec-branch-budget adds branch nodes beyond the main draft path, using GPU parent_ids, tree masks, and recurrent tree kernels. It is disabled automatically when the target model spans more than one GPU. This one is very much a work in progress!
Request-level speculative overrides: Draft-max and branch budget can be overridden per request through JSON fields, without restarting the server.
CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model. Results must be benchmarked per workload.
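To picture the hidden-state ring buffer from the DFlash item above, here is a minimal Python sketch. It is not BeeLlama's code: the class name `HiddenStateRing` and its methods are hypothetical, and only the 4096-slot capacity and the idea of reading the most recent `--spec-dflash-cross-ctx` entries come from the description.

```python
class HiddenStateRing:
    """Fixed-capacity ring buffer for one layer's hidden states (illustrative only)."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.count = 0  # total number of hidden states ever written

    def push(self, hidden):
        # Overwrite the oldest slot once the buffer has wrapped around.
        self.slots[self.count % self.capacity] = hidden
        self.count += 1

    def recent(self, cross_ctx):
        # Return up to cross_ctx of the newest entries, oldest first.
        n = min(cross_ctx, self.count, self.capacity)
        start = self.count - n
        return [self.slots[i % self.capacity] for i in range(start, self.count)]
```

With a capacity of 4 and six pushes, `recent(3)` yields the last three entries in order, and a request larger than the capacity is clamped to what the buffer still holds.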
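The fringe controller in the adaptive draft-max item maps acceptance-rate bands to draft depth. A rough sketch of that idea follows; the band thresholds, the depths, and the function name `draft_depth_for` are all made up for illustration and are not the values BeeLlama uses.

```python
# Illustrative bands only: (minimum acceptance rate, draft depth).
# These numbers are assumptions, not BeeLlama's actual configuration.
BANDS = [(0.8, 16), (0.6, 8), (0.4, 4), (0.0, 2)]


def draft_depth_for(accept_rate, n_max):
    """Pick a draft depth from the first band whose floor the rate reaches."""
    for floor, depth in BANDS:
        if accept_rate >= floor:
            # Never exceed the configured upper bound on draft length.
            return min(depth, n_max)
    # Rates below every floor (e.g. negative input) fall back to the last band.
    return min(BANDS[-1][1], n_max)
```

The second argument plays the role of the configured maximum, so a high acceptance rate can never push the horizon past it.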
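For the sampled verification item, the sketch below shows the standard speculative rejection-sampling rule for a single drafted token: accept with probability min(1, p/q), otherwise resample from the renormalised residual max(0, p - q). It works on plain probabilities rather than log probabilities, and the function name and argument layout are assumptions, not BeeLlama's interface.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng=random):
    """Return (accepted, output_token) for one drafted token (illustrative only)."""
    p, q = p_target[token], q_draft[token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, token
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    if total <= 0.0:
        # Degenerate case: fall back to sampling from the target distribution.
        residual, total = list(p_target), sum(p_target)
    r = rng.random() * total
    acc = 0.0
    for tok, weight in enumerate(residual):
        acc += weight
        if r < acc:
            return False, tok
    return False, len(residual) - 1
```

This is why the draft probabilities have to be available: without q, neither the acceptance ratio nor the residual distribution can be computed.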
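As a rough illustration of the CopySpec idea, the sketch below hashes the last few tokens with a polynomial rolling hash, looks for the most recent earlier window with the same hash, and proposes the tokens that followed it. The function name, parameters, and hash constants are hypothetical, and integer token IDs are assumed; the actual implementation may differ.

```python
def copyspec_propose(tokens, suffix_len=3, max_draft=4,
                     base=1_000_003, mod=(1 << 61) - 1):
    """Propose draft tokens by copying what followed an earlier suffix match."""
    n = len(tokens)
    if n <= suffix_len:
        return []

    def window_hash(window):
        h = 0
        for t in window:
            h = (h * base + t) % mod
        return h

    suffix = tokens[n - suffix_len:]
    target = window_hash(suffix)
    high = pow(base, suffix_len - 1, mod)  # weight of the window's oldest token
    h = window_hash(tokens[:suffix_len])
    best = -1
    # Slide over every window that ends before the final token.
    for start in range(n - suffix_len):
        if start > 0:
            # Roll the hash: drop the oldest token, append the newest.
            h = ((h - tokens[start - 1] * high) * base
                 + tokens[start + suffix_len - 1]) % mod
        # Compare tokens on a hash hit to rule out collisions.
        if h == target and tokens[start:start + suffix_len] == suffix:
            best = start  # keep the most recent match
    if best < 0:
        return []
    follow = best + suffix_len
    return tokens[follow:follow + max_draft]
```

When no earlier match exists the function returns an empty draft, which is one reason results depend so heavily on how repetitive the workload is.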

github-actions Bot added the documentation, Nvidia GPU, testing, examples, devops, python, script, server, ggml, and model labels May 9, 2026

Restore BeeLlama local changes on cleaned history

5ccca01

Anbeeld force-pushed the restore-local-state branch from 02cd0c6 to 5ccca01 May 9, 2026 15:07

Anbeeld marked this pull request as ready for review May 9, 2026 15:12

Anbeeld merged commit 686e63a into main May 9, 2026
8 of 77 checks passed

Anbeeld deleted the restore-local-state branch May 9, 2026 15:13
