Let me add my comment here since that PR is getting a bit crowded.
It looks like for Gemma4 MTP, the MTP head actually attends to the target's KV cache. Something to keep in mind.
I don't think it is feasible to have the main and MTP graphs combined. Having the MTP graph (and any other speculative decoding graph) in a separate context has many advantages:
There are many different variants of speculative decoding and more will appear in the future. We cannot stuff all this logic inside the
For the Gemma4 MTP, I would try to create a memory-less
Overview
Prepares the base functionality needed for extracting embeddings from different stages of inference. This is needed to support speculative decoding methods such as Eagle3, MTP, etc.
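To make the intent a bit more concrete, here is a rough toy sketch in Python (not code from this PR) of what "extracting embeddings from different stages" could look like when the drafting logic lives in a separate object. Everything here is an assumption made for illustration: `ToyTarget`, `ToyDraftHead`, `capture_layers`, and `taps` are hypothetical names, the "layers" are trivial arithmetic, and the real interface is still to be decided.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTarget:
    """Hypothetical stand-in for the main model: a stack of trivial 'layers'."""
    n_layers: int = 4
    # layer index -> embedding captured at that stage of the forward pass
    taps: dict = field(default_factory=dict)

    def forward(self, x, capture_layers=()):
        self.taps.clear()
        h = list(x)
        for il in range(self.n_layers):
            # toy layer: add (layer index + 1) to every element
            h = [v + (il + 1) for v in h]
            if il in capture_layers:
                # store a copy so later layers cannot modify the captured state
                self.taps[il] = list(h)
        return h


class ToyDraftHead:
    """Hypothetical stand-in for a separate speculative-decoding component."""

    def propose(self, embeddings):
        # toy combination: element-wise mean of the captured embeddings
        n = len(embeddings)
        return [sum(vals) / n for vals in zip(*embeddings)]


target = ToyTarget()
draft = ToyDraftHead()

# The target runs once and exposes intermediate states from layers 1 and 3.
output = target.forward([0.0, 1.0], capture_layers=(1, 3))

# The draft component only consumes the exposed states; it holds no reference
# to the target's internals.
draft_input = draft.propose([target.taps[1], target.taps[3]])
```

The point of the sketch is only the separation: the target exposes intermediate states on request, and the drafting logic consumes them without being part of the target's own forward pass.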
Additional information
TBD - still figuring out what's needed
Requirements