Restore BeeLlama local changes on cleaned history#1
Merged
Conversation
02cd0c6 to
5ccca01
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
--spec-type dflashdrives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer, the drafter cross-attends to the most recent--spec-dflash-cross-ctxhidden-state tokens and proposes drafts for target verification.turbo2,turbo3,turbo4,turbo2_tcq,turbo3_tcq) spanning from 4x to 7.5x compression, with higher-bit options being practically lossless in many cases. Set independently with--cache-type-kand--cache-type-v.--spec-draft-n-max. The defaultprofitcontroller compares speculative throughput against a no-spec baseline; thefringealternative maps acceptance-rate bands to draft depth. Use--no-spec-dm-adaptivefor a static horizon.--mmprojis active, the server keeps flat DFlash available for text generation. The model can be fully offloaded to CPU with no problems to reduce VRAM pressure.force-closewith--reasoning-loop-windowand--reasoning-loop-max-periodtuning available.--spec-draft-tempenables rejection-sampling drafter behavior. Activates when both draft and target temperature exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.--spec-branch-budgetadds branch nodes beyond the main draft path with GPUparent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much work in progress!--spec-type copyspecprovides rolling-hash suffix matching over previous tokens without a draft model. Results must be benchmarked per workload.