
Enable efficient KV-cache based generation for Coconut latent reasoning #10

Open

somepatt wants to merge 1 commit into InternLM:main from somepatt:optim-inference

Conversation

@somepatt

This PR refactors the generate() logic in Coconut to correctly and efficiently support KV-cache–based autoregressive decoding after latent token replacement.

Previously, generation recomputed the entire prefix on every decoding step by passing the full inputs_embeds sequence, which (see the sketch after this list):

  • disabled effective KV-cache usage,

  • caused quadratic recomputation over the sequence length,

  • and diverged from how causal LMs are intended to be decoded.
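
For concreteness, a minimal sketch of the pre-PR pattern. All names here (model, inputs_embeds, max_new_tokens) are hypothetical stand-ins for the actual Coconut code, not quotes from it:

```python
import torch

def generate_full_recompute(model, inputs_embeds, max_new_tokens):
    # Illustrative only: every step re-encodes the entire prefix, so step t
    # costs O(t) attention work, the loop is quadratic overall, and the
    # KV cache is never reused.
    for _ in range(max_new_tokens):
        outputs = model(inputs_embeds=inputs_embeds, use_cache=False)
        next_id = outputs.logits[:, -1, :].argmax(dim=-1)  # greedy, for brevity
        next_embed = model.get_input_embeddings()(next_id).unsqueeze(1)
        inputs_embeds = torch.cat([inputs_embeds, next_embed], dim=1)
    return inputs_embeds
```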

This change aligns Coconut generation with standard causal LM decoding while preserving Coconut’s latent-replacement semantics.
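
The refactor keeps latent replacement in the prefill step and only changes how subsequent tokens are decoded. Below is a minimal sketch of the intended loop, assuming a standard Hugging Face–style causal LM interface (use_cache, past_key_values); the function name and the assumption that latent tokens have already been replaced in inputs_embeds are mine, not the PR's:

```python
import torch

@torch.no_grad()
def generate_with_cache(model, inputs_embeds, max_new_tokens):
    # One full forward pass over the latent-augmented prefix populates the cache.
    outputs = model(inputs_embeds=inputs_embeds, use_cache=True)
    past_key_values = outputs.past_key_values
    next_id = outputs.logits[:, -1, :].argmax(dim=-1)  # greedy, for brevity
    generated = [next_id]
    for _ in range(max_new_tokens - 1):
        next_embed = model.get_input_embeddings()(next_id).unsqueeze(1)
        # Each step feeds only the newest embedding and reuses the cached
        # keys/values: constant work per token instead of O(t).
        outputs = model(inputs_embeds=next_embed,
                        past_key_values=past_key_values,
                        use_cache=True)
        past_key_values = outputs.past_key_values
        next_id = outputs.logits[:, -1, :].argmax(dim=-1)
        generated.append(next_id)
    return torch.stack(generated, dim=1)  # (batch, max_new_tokens)
```

Depending on the backbone, explicit position_ids or an attention_mask may also need to be threaded through; the sketch omits them for brevity.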
