
Enable efficient KV-cache based generation for Coconut latent reasoning #10

Open

somepatt wants to merge 1 commit into InternLM:main from somepatt:optim-inference

Conversation

@somepatt

This PR refactors the generate() logic in Coconut to correctly and efficiently support KV-cache–based autoregressive decoding after latent token replacement.

Previously, generation recomputed the entire prefix on every decoding step by passing the full inputs_embeds sequence, which (see the sketch after this list):

  • disabled effective KV-cache usage,

  • caused quadratic recomputation over the sequence length,

  • and diverged from how causal LMs are intended to be decoded.
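
For concreteness, a minimal sketch of the pre-PR pattern. All names here (model, inputs_embeds, max_new_tokens) are hypothetical stand-ins for the actual Coconut code, not quotes from it:

```python
import torch

def generate_full_recompute(model, inputs_embeds, max_new_tokens):
    # Illustrative only: every step re-encodes the entire prefix, so step t
    # costs O(t) attention work, the loop is quadratic overall, and the
    # KV cache is never reused.
    for _ in range(max_new_tokens):
        outputs = model(inputs_embeds=inputs_embeds, use_cache=False)
        next_id = outputs.logits[:, -1, :].argmax(dim=-1)  # greedy, for brevity
        next_embed = model.get_input_embeddings()(next_id).unsqueeze(1)
        inputs_embeds = torch.cat([inputs_embeds, next_embed], dim=1)
    return inputs_embeds
```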

This change aligns Coconut generation with standard causal LM decoding while preserving Coconut’s latent-replacement semantics.
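
The refactor keeps latent replacement in the prefill step and only changes how subsequent tokens are decoded. Below is a minimal sketch of the intended loop, assuming a standard Hugging Face–style causal LM interface (use_cache, past_key_values); the function name and the assumption that latent tokens have already been replaced in inputs_embeds are mine, not the PR's:

```python
import torch

@torch.no_grad()
def generate_with_cache(model, inputs_embeds, max_new_tokens):
    # One full forward pass over the latent-augmented prefix populates the cache.
    outputs = model(inputs_embeds=inputs_embeds, use_cache=True)
    past_key_values = outputs.past_key_values
    next_id = outputs.logits[:, -1, :].argmax(dim=-1)  # greedy, for brevity
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        next_embed = model.get_input_embeddings()(next_id).unsqueeze(1)
        # Each step feeds only the newest embedding and reuses the cached
        # keys/values: constant work per token instead of O(t).
        outputs = model(inputs_embeds=next_embed,
                        past_key_values=past_key_values,
                        use_cache=True)
        past_key_values = outputs.past_key_values
        next_id = outputs.logits[:, -1, :].argmax(dim=-1)
        generated.append(next_id)
    return torch.stack(generated, dim=1)  # (batch, max_new_tokens)
```

Depending on the backbone, explicit position_ids or an attention_mask may also need to be threaded through; the sketch omits them for brevity.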
