server, webui: support continue generation on reasoning models #22727
ServeurpersoCom wants to merge 1 commit into ggml-org:master
Conversation
I added a video.

It's possible that a space or line break goes missing during retrieval. Could this be an idempotence issue with the tokenizer? It's not critical, but it slightly skews the distribution (like a micro-error in the KV cache that the model could not have generated on its own).

It makes me want to write a Python hammer script that does lots of pause/resume cycles to see what happens and isolate whether the bug is in the frontend or the backend.
```cpp
const bool thinking_active = chat_params.supports_thinking && !chat_params.thinking_end_tag.empty();
const bool has_reasoning   = !last_message.reasoning_content.empty();
const bool has_content     = !last_message.content.empty() || !last_message.content_parts.empty();
const bool mid_reasoning   = has_reasoning && !has_content;

// some templates inject thinking_start in generation_prompt, others let the model emit it
const bool gp_has_think = thinking_active
    && chat_params.generation_prompt.find(chat_params.thinking_start_tag) != std::string::npos;

// open the thinking block when reasoning is present and the template did not inject it
if (has_reasoning) {
    if (thinking_active && !gp_has_think) {
        chat_params.prompt += chat_params.thinking_start_tag;
    }
    chat_params.prompt += last_message.reasoning_content;
}

if (thinking_active) {
    if (mid_reasoning) {
        // model continues inside the thinking block, keep generation_prompt open on think
        if (!gp_has_think) {
            chat_params.generation_prompt += chat_params.thinking_start_tag;
        }
    } else {
        // close thinking block when reasoning is followed by content, or when the template forced it open
        if (has_reasoning || gp_has_think) {
            chat_params.prompt += chat_params.thinking_end_tag;
        }
        // strip thinking_start from generation_prompt so the parser routes model output as content
        auto pos = chat_params.generation_prompt.rfind(chat_params.thinking_start_tag);
        if (pos != std::string::npos) {
            chat_params.generation_prompt = chat_params.generation_prompt.substr(0, pos);
        }
    }
}
```
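The prompt-rebuild branches above can be sketched with plain strings. This is a hypothetical Python mirror (the `rebuild_prompt` helper and the default `<think>` tags are illustrative assumptions, not the server's API) showing the mid-reasoning and post-reasoning prefill cases:

```python
def rebuild_prompt(prompt, generation_prompt, reasoning, content,
                   think_start="<think>", think_end="</think>",
                   supports_thinking=True):
    # Hypothetical mirror of the C++ branches; rebuild_prompt and the
    # default tag values are illustrative, not part of the server API.
    thinking_active = supports_thinking and bool(think_end)
    has_reasoning = bool(reasoning)
    mid_reasoning = has_reasoning and not content

    # some templates inject think_start in generation_prompt themselves
    gp_has_think = thinking_active and think_start in generation_prompt

    # open the thinking block only when the template did not inject it
    if has_reasoning:
        if thinking_active and not gp_has_think:
            prompt += think_start
        prompt += reasoning

    if thinking_active:
        if mid_reasoning:
            # keep generation open inside the thinking block
            if not gp_has_think:
                generation_prompt += think_start
        else:
            # close the block, then strip think_start from generation_prompt
            # so model output is parsed as content
            if has_reasoning or gp_has_think:
                prompt += think_end
            pos = generation_prompt.rfind(think_start)
            if pos != -1:
                generation_prompt = generation_prompt[:pos]
    return prompt, generation_prompt

# mid reasoning: stopped inside the thinking block
p, g = rebuild_prompt("", "<|assistant|>", "partial thought", "")
# -> prompt "<think>partial thought", generation_prompt "<|assistant|><think>"

# post reasoning: stopped after content started
p2, g2 = rebuild_prompt("", "<|assistant|>", "done thinking", "partial answer")
# -> prompt "<think>done thinking</think>", generation_prompt "<|assistant|>"
```

In the mid-reasoning case the thinking block is deliberately left open so the model resumes its chain of thought; in the post-reasoning case it is closed so the parser routes the next chunks as content.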
This is not universal. Some models, like gpt-oss, wrap the assistant content as well. This logic must be delegated to the chat handler for the given template. That is where the knowledge required to properly rebuild the assistant message exists.
I am working on the plumbing in common required to support this.
Great! I'm iterating in the pod with Opus on a Python client that stress-tests all this across different models; GPT is going to fail. I'll rewrite the code later on top of your new API.
Gr8 stuff, @ServeurpersoCom @aldehir, let's push further to have it working in production asap :)
We will have the missing layer of abstraction! The frontend will pose no problem (at most, a small commit to add a state backup in the callback for network outages), and the C++ will have a clean, abstracted OAI chunk redirector!
@allozaur Understood. I'll focus on the common API and open a PR soon.
Overview
Reasoning models can now use the Continue button. Stopping mid-thought saves the partial chain of thought, F5 keeps it, and clicking Continue resumes inside the thinking block instead of restarting from scratch. The same applies to stops after the thinking ends. Plain-content prefill is unchanged.
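As a rough sketch of the data flow on a Continue click: the webui replays the partial assistant message including its saved reasoning. The `reasoning_content` field on the assistant prefill is what this PR wires through; the rest of this payload shape is an illustrative assumption, not the exact webui request:

```python
import json

# Hypothetical Continue request after a mid-thought stop: the assistant
# prefill carries the partial chain of thought in "reasoning_content",
# while "content" is still empty (stopped before any content token).
payload = {
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {
            "role": "assistant",
            "content": "",  # no visible answer yet
            "reasoning_content": "Rayleigh scattering favors shorter wavelengths,",
        },
    ],
    "stream": True,
}
body = json.dumps(payload)
```

Because `content` is empty and `reasoning_content` is not, the server treats this as the mid-reasoning case and rebuilds the prompt with the thinking block left open.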
21754.reasoning-continue-prefill.mp4
Additional information
Backend: resolves the old TODO in oaicompat_chat_params_parse by removing the throw that blocked assistant prefill on reasoning models and the forced reasoning_format = NONE workaround. It then orchestrates thinking_start_tag, thinking_end_tag and generation_prompt around the prefilled message so the prompt is rebuilt correctly and the parser introduced in PR #20424 routes the next stream chunks to reasoning_content or content, depending on whether the prefill is plain content, mid reasoning, or post reasoning. This bridges the API field from #21036, the parser routing from #20424 and the webui storage from #21249.
Frontend: drops the reasoning_content guard on the Continue button, sends reasoning_content with the prefilled assistant message in continueAssistantMessage, persists partial reasoningContent on stop so the CoT survives F5 and Continue, and marks the streaming state on reasoning chunks so savePartialResponseIfNeeded does not return early when the stop happens before any content token. The settings hint is updated.
First step toward #21754: this covers voluntary stop and reload; full network resilience (SSE resume) is left for a follow-up.
Requirements