llama : extend embeddings API #22728

Draft

ggerganov wants to merge 1 commit into master from gg/llama-extract-embeddings

Conversation

@ggerganov
Member

Overview

Preparing some base functionality needed for extracting embeddings from different stages of the inference. This is needed to support speculative decoding methods such as Eagle3, MTP, etc. A rough sketch of where this is heading follows the list below.

  • Layer input embeddings extraction
  • Token embedding tensor replacement
  • TBD
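
Purely as an illustration of the first two items (the declarations below are hypothetical and are not part of llama.h or of this PR), the extended API could look along these lines:

```cpp
// Hypothetical sketch only - these declarations are NOT part of llama.h.
// They illustrate what "layer input embeddings extraction" and
// "token embedding tensor replacement" could mean at the C API level.

// Extract the embeddings that were fed as input to layer `il` during the
// last llama_decode() call; layout: [n_embd, n_tokens].
float * llama_get_embeddings_layer(struct llama_context * ctx, int32_t il);

// Replace the token embedding lookup for the next decode: instead of
// embedding the batch tokens via the model's tok_embd, use the provided
// tensor (e.g. hidden states exported from the target model for a drafter).
void llama_set_embd_input(struct llama_context * ctx,
                          const float * embd,     // [n_embd, n_tokens]
                          int32_t       n_tokens);
```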

Additional information

TBD - still figuring out what's needed

Requirements

@github-actions bot added the model (Model specific) label May 5, 2026
@am17an
Contributor

am17an commented May 6, 2026

Let me add my comment here since that PR is getting a bit crowded.

IMO the only blocker to running this inside the same llama graph was that the KV cache becomes awkward to handle. Perhaps we can restructure the KV cache so that there is an auxiliary cache for these "speculators"; then everything more or less works already from a sync perspective. Or maybe I am missing something.

@am17an
Contributor

am17an commented May 6, 2026

It looks like for Gemma4 MTP, the MTP head actually attends to the target's KV cache, which is something to keep in mind.

@ggerganov
Member Author

> run this inside the same llama graph

I don't think it is feasible to have the main and MTP graphs combined. Having the MTP graph (and any other speculative decoding graph) in a separate context has many advantages:

  • It has a separate backend scheduler
  • We have finer control over which devices the drafter is placed on
  • We can work with draft models that are separate from the main model
  • Multi-sequence drafting is much easier to conceptualize
  • etc.

There are many different variants of speculative decoding and more will appear in the future. We cannot stuff all this logic inside the llama_context. I think the current common/speculative foundation is overall good. We have to extend it to manage the prompt embeddings, and for that we first need a mechanism to extract them.
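
For reference, the "separate context per drafter" layout already looks roughly like this at the public API level (a minimal sketch using model/context creation calls from the current llama.h; the file names are placeholders, error handling is omitted, and the embedding hand-off between the two contexts is exactly the missing piece this PR prepares):

```cpp
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();

    // Target and draft models are loaded independently, so the drafter can
    // be a different model and can be placed on different devices.
    llama_model * model_tgt = llama_model_load_from_file("target.gguf", mparams);
    llama_model * model_dft = llama_model_load_from_file("draft.gguf",  mparams);

    llama_context_params cparams = llama_context_default_params();

    // Each context gets its own backend scheduler and its own memory.
    llama_context * ctx_tgt = llama_init_from_model(model_tgt, cparams);
    llama_context * ctx_dft = llama_init_from_model(model_dft, cparams);

    // ... decode on ctx_tgt, draft on ctx_dft, verify, repeat ...

    llama_free(ctx_dft);
    llama_free(ctx_tgt);
    llama_model_free(model_dft);
    llama_model_free(model_tgt);
    return 0;
}
```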

For the Gemma4 MTP, I would try to create a memory-less llama_context and assign the llama_memory from the target context to it so it can use it directly. After drafting, we would need to wipe the drafted tokens from the memory with seq_rm to restore the memory state for the target model.
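
A minimal sketch of that rollback step, assuming the llama_memory_* calls from the current llama.h (the rollback_draft helper and the position bookkeeping are illustrative only):

```cpp
// Hypothetical helper: remove the drafted tokens for sequence `seq_id`,
// assumed to occupy positions [n_past, n_past + n_draft), so the target
// model sees the memory state it had before drafting.
void rollback_draft(llama_context * ctx_tgt, llama_seq_id seq_id,
                    llama_pos n_past, int32_t n_draft) {
    llama_memory_t mem = llama_get_memory(ctx_tgt);

    // [p0, p1) with p1 exclusive
    llama_memory_seq_rm(mem, seq_id, n_past, n_past + n_draft);
}
```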
