llama : extend embeddings API #22728

Draft

ggerganov wants to merge 1 commit into master from gg/llama-extract-embeddings

Conversation

@ggerganov
Member

Overview

Preparing some base functionality needed for extracting embeddings from different stages of the inference. This is needed to support speculative decoding methods such as Eagle3, MTP, etc. A rough sketch of where this is heading follows the list below.

  • Layer input embeddings extraction
  • Token embedding tensor replacement
  • TBD
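
Purely as an illustration of the first two items (the declarations below are hypothetical and are not part of llama.h or of this PR), the extended API could look along these lines:

```cpp
// Hypothetical sketch only - these declarations are NOT part of llama.h.
// They illustrate what "layer input embeddings extraction" and
// "token embedding tensor replacement" could mean at the C API level.

// Extract the embeddings that were fed as input to layer `il` during the
// last llama_decode() call; layout: [n_embd, n_tokens].
float * llama_get_embeddings_layer(struct llama_context * ctx, int32_t il);

// Replace the token embedding lookup for the next decode: instead of
// embedding the batch tokens via the model's tok_embd, use the provided
// tensor (e.g. hidden states exported from the target model for a drafter).
void llama_set_embd_input(struct llama_context * ctx,
                          const float * embd,     // [n_embd, n_tokens]
                          int32_t       n_tokens);
```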

Additional information

TBD - still figuring out what's needed

Requirements

@github-actions bot added the model (Model specific) label May 5, 2026
@am17an
Contributor

am17an commented May 6, 2026

Let me add my comment here since that PR is getting a bit crowded.

IMO the only blocker to running this inside the same llama graph was that the KV cache becomes awkward to handle. Perhaps we can restructure the KV cache so that there is an auxiliary cache for these "speculators"; then everything more or less works already from a sync perspective. Or maybe I am missing something.

@am17an
Contributor

am17an commented May 6, 2026

It looks like for Gemma4 MTP, the MTP head actually attends to the target's KV cache, which is something to keep in mind.

@ggerganov
Member Author

> run this inside the same llama graph

I don't think it is feasible to have the main and MTP graphs combined. Having the MTP graph (and any other speculative decoding graph) in a separate context has many advantages:

  • It has a separate backend scheduler
  • We have finer control over which devices the drafter is placed on
  • We can work with draft models that are separate from the main model
  • Multi-sequence drafting is much easier to conceptualize
  • etc.

There are many different variants of speculative decoding and more will appear in the future. We cannot stuff all this logic inside the llama_context. I think the current common/speculative foundation is overall good. We have to extend it to manage the prompt embeddings, and for that we first need a mechanism to extract them.
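
For reference, the "separate context per drafter" layout already looks roughly like this at the public API level (a minimal sketch using model/context creation calls from the current llama.h; the file names are placeholders, error handling is omitted, and the embedding hand-off between the two contexts is exactly the missing piece this PR prepares):

```cpp
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();

    // Target and draft models are loaded independently, so the drafter can
    // be a different model and can be placed on different devices.
    llama_model * model_tgt = llama_model_load_from_file("target.gguf", mparams);
    llama_model * model_dft = llama_model_load_from_file("draft.gguf",  mparams);

    llama_context_params cparams = llama_context_default_params();

    // Each context gets its own backend scheduler and its own memory.
    llama_context * ctx_tgt = llama_init_from_model(model_tgt, cparams);
    llama_context * ctx_dft = llama_init_from_model(model_dft, cparams);

    // ... decode on ctx_tgt, draft on ctx_dft, verify, repeat ...

    llama_free(ctx_dft);
    llama_free(ctx_tgt);
    llama_model_free(model_dft);
    llama_model_free(model_tgt);
    return 0;
}
```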

For the Gemma4 MTP, I would try to create a memory-less llama_context and assign the llama_memory from the target context to it so it can use it directly. After drafting, we would need to wipe the drafted tokens from the memory with seq_rm to restore the memory state for the target model.
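
A minimal sketch of that rollback step, assuming the llama_memory_* calls from the current llama.h (the rollback_draft helper and the position bookkeeping are illustrative only):

```cpp
// Hypothetical helper: remove the drafted tokens for sequence `seq_id`,
// assumed to occupy positions [n_past, n_past + n_draft), so the target
// model sees the memory state it had before drafting.
void rollback_draft(llama_context * ctx_tgt, llama_seq_id seq_id,
                    llama_pos n_past, int32_t n_draft) {
    llama_memory_t mem = llama_get_memory(ctx_tgt);

    // [p0, p1) with p1 exclusive
    llama_memory_seq_rm(mem, seq_id, n_past, n_past + n_draft);
}
```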
