server, webui: support continue generation on reasoning models #22727
ServeurpersoCom wants to merge 1 commit into ggml-org:master
Conversation
I added a video.

It's possible that a space or line break goes missing during retrieval. Could this be an idempotence issue with the tokenizer? It's not critical, but it slightly skews the distribution (like a micro-error in the KV cache that the model could not have generated on its own).

It makes me want to write a Python hammer script that does lots of pause/resume cycles to see what happens and isolate whether the bug is in the frontend or the backend.
```cpp
const bool thinking_active = chat_params.supports_thinking && !chat_params.thinking_end_tag.empty();
const bool has_reasoning   = !last_message.reasoning_content.empty();
const bool has_content     = !last_message.content.empty() || !last_message.content_parts.empty();
const bool mid_reasoning   = has_reasoning && !has_content;

// some templates inject thinking_start in generation_prompt, others let the model emit it
const bool gp_has_think = thinking_active
    && chat_params.generation_prompt.find(chat_params.thinking_start_tag) != std::string::npos;

// open the thinking block when reasoning is present and the template did not inject it
if (has_reasoning) {
    if (thinking_active && !gp_has_think) {
        chat_params.prompt += chat_params.thinking_start_tag;
    }
    chat_params.prompt += last_message.reasoning_content;
}

if (thinking_active) {
    if (mid_reasoning) {
        // model continues inside the thinking block, keep generation_prompt open on think
        if (!gp_has_think) {
            chat_params.generation_prompt += chat_params.thinking_start_tag;
        }
    } else {
        // close thinking block when reasoning is followed by content, or when the template forced it open
        if (has_reasoning || gp_has_think) {
            chat_params.prompt += chat_params.thinking_end_tag;
        }
        // strip thinking_start from generation_prompt so the parser routes model output as content
        auto pos = chat_params.generation_prompt.rfind(chat_params.thinking_start_tag);
        if (pos != std::string::npos) {
            chat_params.generation_prompt = chat_params.generation_prompt.substr(0, pos);
        }
    }
}
```
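The prompt-rebuild branches above can be sketched with plain strings. This is a hypothetical Python mirror (the `rebuild_prompt` helper and the default `<think>` tags are illustrative assumptions, not the server's API) showing the mid-reasoning and post-reasoning prefill cases:

```python
def rebuild_prompt(prompt, generation_prompt, reasoning, content,
                   think_start="<think>", think_end="</think>",
                   supports_thinking=True):
    # Hypothetical mirror of the C++ branches; rebuild_prompt and the
    # default tag values are illustrative, not part of the server API.
    thinking_active = supports_thinking and bool(think_end)
    has_reasoning = bool(reasoning)
    mid_reasoning = has_reasoning and not content

    # some templates inject think_start in generation_prompt themselves
    gp_has_think = thinking_active and think_start in generation_prompt

    # open the thinking block only when the template did not inject it
    if has_reasoning:
        if thinking_active and not gp_has_think:
            prompt += think_start
        prompt += reasoning

    if thinking_active:
        if mid_reasoning:
            # keep generation open inside the thinking block
            if not gp_has_think:
                generation_prompt += think_start
        else:
            # close the block, then strip think_start from generation_prompt
            # so model output is parsed as content
            if has_reasoning or gp_has_think:
                prompt += think_end
            pos = generation_prompt.rfind(think_start)
            if pos != -1:
                generation_prompt = generation_prompt[:pos]
    return prompt, generation_prompt

# mid reasoning: stopped inside the thinking block
p, g = rebuild_prompt("", "<|assistant|>", "partial thought", "")
# -> prompt "<think>partial thought", generation_prompt "<|assistant|><think>"

# post reasoning: stopped after content started
p2, g2 = rebuild_prompt("", "<|assistant|>", "done thinking", "partial answer")
# -> prompt "<think>done thinking</think>", generation_prompt "<|assistant|>"
```

In the mid-reasoning case the thinking block is deliberately left open so the model resumes its chain of thought; in the post-reasoning case it is closed so the parser routes the next chunks as content.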
This is not universal. Some models, like gpt-oss, wrap the assistant content as well. This logic must be delegated to the chat handler for the given template. That is where the knowledge required to properly rebuild the assistant message exists.
I am working on the plumbing in common required to support this.
Great! I'm iterating in the pod with Opus on a Python client that stress-tests all this across different models; GPT is going to fail. I'll rewrite the code later on top of your new API.
Gr8 stuff, @ServeurpersoCom @aldehir, let's push further to have it working in production asap :)
We will have the missing layer of abstraction! The frontend will pose no problem (at most, a small commit to add a state backup in the callback for network outages), and the C++ will have a clean, abstracted OAI chunk redirector!
@allozaur Understood. I'll focus on the common API and open a PR soon.
Overview
Reasoning models can now use the Continue button. Stopping mid-thought saves the partial chain of thought, F5 keeps it, and clicking Continue resumes inside the thinking block instead of restarting from scratch. The same applies to stops after the thinking ends. Plain-content prefill is unchanged.
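As a rough sketch of the data flow on a Continue click: the webui replays the partial assistant message including its saved reasoning. The `reasoning_content` field on the assistant prefill is what this PR wires through; the rest of this payload shape is an illustrative assumption, not the exact webui request:

```python
import json

# Hypothetical Continue request after a mid-thought stop: the assistant
# prefill carries the partial chain of thought in "reasoning_content",
# while "content" is still empty (stopped before any content token).
payload = {
    "messages": [
        {"role": "user", "content": "Why is the sky blue?"},
        {
            "role": "assistant",
            "content": "",  # no visible answer yet
            "reasoning_content": "Rayleigh scattering favors shorter wavelengths,",
        },
    ],
    "stream": True,
}
body = json.dumps(payload)
```

Because `content` is empty and `reasoning_content` is not, the server treats this as the mid-reasoning case and rebuilds the prompt with the thinking block left open.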
21754.reasoning-continue-prefill.mp4
Additional information
Backend: resolves the old TODO in oaicompat_chat_params_parse by removing the throw that blocked assistant prefill on reasoning models and the forced reasoning_format = NONE workaround. It then orchestrates thinking_start_tag, thinking_end_tag and generation_prompt around the prefilled message so the prompt is rebuilt correctly and the parser introduced in PR #20424 routes the next stream chunks to reasoning_content or content, depending on whether the prefill is plain content, mid reasoning, or post reasoning. This bridges the API field from #21036, the parser routing from #20424 and the webui storage from #21249.
Frontend: drops the reasoning_content guard on the Continue button, sends reasoning_content with the prefilled assistant message in continueAssistantMessage, persists partial reasoningContent on stop so the CoT survives F5 and Continue, and marks the streaming state on reasoning chunks so savePartialResponseIfNeeded does not return early when the stop happens before any content token. The settings hint is updated.
First step toward #21754: this covers voluntary stop and reload; full network resilience (SSE resume) is left for a follow-up.
Requirements