
server, webui: support continue generation on reasoning models #22727

Open

ServeurpersoCom wants to merge 1 commit into ggml-org:master from ServeurpersoCom:reasoning-continue-prefill

Conversation

@ServeurpersoCom (Contributor) commented May 5, 2026

Overview

Reasoning models can now use the Continue button. Stopping mid-thought saves the partial chain of thought, F5 keeps it, and clicking Continue resumes inside the thinking block instead of restarting from scratch. The same behavior applies to stops that happen after the thinking ends. Plain-content prefill is unchanged.

21754.reasoning-continue-prefill.mp4

Additional information

The backend resolves the old TODO in oaicompat_chat_params_parse: it removes the throw that blocked assistant prefill on reasoning models and the forced reasoning_format = NONE workaround, then orchestrates thinking_start_tag, thinking_end_tag, and generation_prompt around the prefilled message so that the prompt is rebuilt correctly and the parser introduced in PR #20424 routes the next stream chunks to reasoning_content or content, depending on whether the prefill is plain content, mid-reasoning, or post-reasoning. This bridges the API field from #21036, the parser routing from #20424, and the webui storage from #21249.
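For illustration, a hedged sketch of the kind of request body the parser now accepts, with the final assistant message acting as the prefill. The model name and message text are placeholders; only the reasoning_content field itself comes from #21036, everything else is an assumption about the payload shape:

// Hypothetical continue request after a mid-thought stop: content is empty and
// reasoning_content carries the partial chain of thought, so the server reopens
// the thinking block and resumes generation inside it.
const continueRequest = {
    model: "some-reasoning-model", // placeholder
    stream: true,
    messages: [
        { role: "user", content: "Why is the sky blue?" },
        {
            role: "assistant",
            content: "",                                                             // empty => mid-reasoning prefill
            reasoning_content: "Rayleigh scattering favors shorter wavelengths, so", // partial CoT saved on stop
        },
    ],
};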

The frontend drops the reasoning_content guard on the Continue button, sends reasoning_content with the prefilled assistant message in continueAssistantMessage, persists partial reasoningContent on stop so the CoT survives F5 and Continue, and marks the streaming state on reasoning chunks so that savePartialResponseIfNeeded does not return early when the stop happens before any content token. The settings hint is updated.
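A minimal sketch of that frontend flow; continueAssistantMessage and savePartialResponseIfNeeded are the functions named above, but the types, field mapping, and storage call shown here are illustrative assumptions, not the actual webui code:

// Assumed message shape: reasoningContent is the partial CoT kept in webui storage.
interface StoredAssistantMessage {
    role: "assistant";
    content: string;
    reasoningContent?: string;
}

// When continuing, the prefilled assistant message is sent with reasoning_content
// so the server can rebuild the prompt inside or after the thinking block.
function toApiMessage(msg: StoredAssistantMessage) {
    return msg.reasoningContent
        ? { role: msg.role, content: msg.content, reasoning_content: msg.reasoningContent }
        : { role: msg.role, content: msg.content };
}

// On stop, persist reasoning-only partials too, so the CoT survives F5 and the
// save path does not return early when no content token has arrived yet.
function persistPartial(msg: StoredAssistantMessage, reasoning: string, content: string) {
    msg.reasoningContent = reasoning;
    msg.content = content;
    // ...write msg to the conversation store here (illustrative)
}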

This is a first step toward #21754: it covers voluntary stop and reload; full network resilience (SSE resume) is left for a follow-up.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Opus 4.7 + local MCP server with rootless pod

@ServeurpersoCom ServeurpersoCom requested review from a team as code owners May 5, 2026 17:15
@ServeurpersoCom (Contributor, Author)

I added a video.

@ServeurpersoCom (Contributor, Author)

It's possible that a space or line break goes missing during retrieval, perhaps due to a tokenizer idempotence (round-trip) issue. It's not critical, but it slightly skews the distribution, like a micro-error in the KV cache that the model could not have generated on its own.

@ServeurpersoCom (Contributor, Author)

It makes me want to write a Python hammer script that does lots of pause/resume cycles to see what happens and to isolate whether the bug is in the frontend or the backend.
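A rough sketch of what such a hammer loop could look like; it is written in TypeScript here only to match the other snippets (the author mentions Python), and the endpoint, payload shape, and naive SSE parsing are assumptions, not an actual test script from this PR:

// Repeatedly start a streamed completion, abort it mid-stream, then resume by
// prefilling whatever reasoning/content was captured so far, to look for drift
// such as missing spaces or line breaks at the resume boundary.
async function hammer(baseUrl: string, rounds: number) {
    let reasoning = "";
    let content = "";
    for (let i = 0; i < rounds; i++) {
        const messages: Array<Record<string, string>> = [
            { role: "user", content: "Explain quicksort step by step." },
        ];
        if (reasoning || content) {
            // resume: prefill what was captured before the last abort
            messages.push({ role: "assistant", content, reasoning_content: reasoning });
        }
        const abort = new AbortController();
        const res = await fetch(`${baseUrl}/v1/chat/completions`, {
            method: "POST",
            signal: abort.signal,
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ stream: true, messages }),
        });
        const reader = res.body!.getReader();
        const decoder = new TextDecoder();
        for (let chunks = 0; chunks < 20; chunks++) {
            const { value, done } = await reader.read();
            if (done) {
                return { reasoning, content }; // model finished on its own
            }
            // naive SSE parsing, good enough for a quick test
            for (const line of decoder.decode(value, { stream: true }).split("\n")) {
                if (!line.startsWith("data: ") || line.includes("[DONE]")) continue;
                try {
                    const delta = JSON.parse(line.slice(6)).choices?.[0]?.delta ?? {};
                    reasoning += delta.reasoning_content ?? "";
                    content += delta.content ?? "";
                } catch { /* partial line split across chunks, skip */ }
            }
        }
        abort.abort(); // simulate the user pressing Stop, then Continue
    }
    return { reasoning, content };
}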

Comment on lines +1094 to +1129
const bool thinking_active = chat_params.supports_thinking && !chat_params.thinking_end_tag.empty();
const bool has_reasoning = !last_message.reasoning_content.empty();
const bool has_content = !last_message.content.empty() || !last_message.content_parts.empty();
const bool mid_reasoning = has_reasoning && !has_content;

// some templates inject thinking_start in generation_prompt, others let the model emit it
const bool gp_has_think = thinking_active
    && chat_params.generation_prompt.find(chat_params.thinking_start_tag) != std::string::npos;

// open the thinking block when reasoning is present and the template did not inject it
if (has_reasoning) {
    if (thinking_active && !gp_has_think) {
        chat_params.prompt += chat_params.thinking_start_tag;
    }
    chat_params.prompt += last_message.reasoning_content;
}

if (thinking_active) {
    if (mid_reasoning) {
        // model continues inside the thinking block, keep generation_prompt open on think
        if (!gp_has_think) {
            chat_params.generation_prompt += chat_params.thinking_start_tag;
        }
    } else {
        // close thinking block when reasoning is followed by content, or when the template forced it open
        if (has_reasoning || gp_has_think) {
            chat_params.prompt += chat_params.thinking_end_tag;
        }
        // strip thinking_start from generation_prompt so the parser routes model output as content
        auto pos = chat_params.generation_prompt.rfind(chat_params.thinking_start_tag);
        if (pos != std::string::npos) {
            chat_params.generation_prompt = chat_params.generation_prompt.substr(0, pos);
        }
    }
}

Contributor

This is not universal. Some models, like gpt-oss, wrap the assistant content as well. This logic must be delegated to the chat handler for the given template. That is where the knowledge required to properly rebuild the assistant message exists.

I am working on the plumbing in common required to support this.

Contributor Author

Great! I'm iterating in the pod with Opus: a Python client that stress-tests all this on different models; GPT is going to fail. I will rewrite the code later based on your new API.

@allozaur (Contributor) commented May 6, 2026

Gr8 stuff, @ServeurpersoCom @aldehir! Let's push further to have it working in production asap :)

@ServeurpersoCom (Contributor, Author)

We will have the missing layer of abstraction! The frontend will pose no problem (at most a small commit to add a state backup in the callback for network outages), and the C++ side will get a clean, abstracted OAI chunk redirector!

@aldehir (Contributor) commented May 6, 2026

@allozaur Understood. I'll focus on the common API and open a PR soon.
