
Conversation

@juanmichelini (Collaborator) commented Dec 3, 2025

Remove the early return that caused conversations to terminate prematurely when the LLM produced content, even while tool calls were still being processed. This ensures the conversation continues through the full execution flow.

This change removes the problematic block that checked has_content and immediately finished the conversation, which prevented tool calls and other agent actions from being processed.
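
For illustration only, a minimal sketch of the shape of the removed block; the names LLMResponse, has_content, finish, and run_tool are hypothetical, not the actual SDK API:

from dataclasses import dataclass, field

@dataclass
class LLMResponse:
    content: str = ""
    tool_calls: list = field(default_factory=list)

    @property
    def has_content(self) -> bool:
        return bool(self.content.strip())

def step(response: LLMResponse, finish, run_tool) -> None:
    # Removed block (the bug): finish as soon as content appears,
    # even with tool calls still queued.
    #
    # if response.has_content:
    #     finish()
    #     return
    #
    # With the early return gone, pending tool calls run first; the
    # conversation is only finished once nothing is left to execute.
    for call in response.tool_calls:
        run_tool(call)
    if not response.tool_calls and response.has_content:
        finish()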


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6eba439-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-6eba439-python \
  ghcr.io/openhands/agent-server:6eba439-python

All tags pushed for this build

ghcr.io/openhands/agent-server:6eba439-golang-amd64
ghcr.io/openhands/agent-server:6eba439-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:6eba439-golang-arm64
ghcr.io/openhands/agent-server:6eba439-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:6eba439-java-amd64
ghcr.io/openhands/agent-server:6eba439-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:6eba439-java-arm64
ghcr.io/openhands/agent-server:6eba439-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:6eba439-python-amd64
ghcr.io/openhands/agent-server:6eba439-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:6eba439-python-arm64
ghcr.io/openhands/agent-server:6eba439-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:6eba439-golang
ghcr.io/openhands/agent-server:6eba439-java
ghcr.io/openhands/agent-server:6eba439-python

About Multi-Architecture Support

  • Each variant tag (e.g., 6eba439-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 6eba439-python-amd64) are also available if needed
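
To confirm which platforms a variant tag covers, inspecting the manifest with a recent Docker CLI should work:

# Show the platforms included in a variant tag's multi-arch manifest
docker manifest inspect ghcr.io/openhands/agent-server:6eba439-python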

@enyst added the integration-test label (Runs the integration tests and comments the results) Dec 3, 2025
github-actions bot (Contributor) commented Dec 3, 2025

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 3, 2025

🧪 Integration Tests Results

Overall Success Rate: 89.5%
Total Cost: $1.34
Models Tested: 5
Timestamp: 2025-12-03 07:43:16 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Success Rate | Tests Passed | Skipped | Total Tests | Cost |
|-------|--------------|--------------|---------|-------------|------|
| litellm_proxy_claude_sonnet_4_5_20250929 | 62.5% | 5/8 | 0 | 8 | $0.44 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 7/7 | 1 | 8 | $0.49 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 8/8 | 0 | 8 | $0.31 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 7/7 | 1 | 8 | $0.05 |
| litellm_proxy_gpt_5_mini_2025_08_07 | 87.5% | 7/8 | 0 | 8 | $0.04 |

📋 Detailed Results

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 62.5% (5/8)
  • Total Cost: $0.44
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_0c6f36a_sonnet_run_N8_20251203_073610

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: ... (Cost: $0.04)
  • t06_github_pr_browsing: No final answer found from agent. Events: 9, LLM messages: 4 (Cost: $0.10)
  • t08_image_file_viewing: Agent did not identify yellow color in the logo. Response: (Cost: $0.04)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.49
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_0c6f36a_kimi_k2_run_N8_20251203_073615
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.31
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_0c6f36a_gemini_3_pro_run_N8_20251203_073615

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.05
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_0c6f36a_deepseek_run_N8_20251203_073615
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.04
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_0c6f36a_gpt5_mini_run_N8_20251203_073615

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000 and reported the revealed answer. Next steps: I can simulate clicking the button, save the page, or inspect other endpoints on the server—what would you like me to do?... (Cost: $0.0033)

@enyst (Collaborator) commented Dec 3, 2025

@juanmichelini It might be worth looking into these logs; not sure what happens here with Sonnet getting 5/8. 5/8 is really surprising.

Or gpt-5-mini:

t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000/ and reported the revealed answer. Next steps: I can simulate clicking the button, save the page, or inspect other endpoints on the server—what would you like me to do?...

Looks like it just talks to the user, so a content response.

Maybe we need the fake user message...?

@juanmichelini (Collaborator, Author) commented

> @juanmichelini It might be worth looking into these logs; not sure what happens here with Sonnet getting 5/8. 5/8 is really surprising.
>
> Or gpt-5-mini:
>
> t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000/ and reported the revealed answer. Next steps: I can simulate clicking the button, save the page, or inspect other endpoints on the server—what would you like me to do?...
>
> Looks like it just talks to the user, so a content response.
>
> Maybe we need the fake user message...?

Hey Engel! I've uploaded yesterday's logs here: https://drive.google.com/drive/folders/1KMAq14ztG8-ug6aLVWDoGR6zp6ifVlHF I'm doing small runs (10~20 issues), but the number of empty patches I got is pretty consistent with and without the fix.

@juanmichelini changed the title from "Fix premature conversation termination when LLM produces content" to "Fix premature conversation termination when LLM produces content (GPT-5 Codex and GLM 4.6)" Dec 5, 2025
@neubig (Contributor) commented Dec 5, 2025

Hey @juanmichelini , within the logs, are there any particular traces that are indicative of the problem in the original code? I'd like to take a closer look to better understand the problem.

@enyst added and then removed the integration-test label Dec 5, 2025
github-actions bot (Contributor) commented Dec 5, 2025

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot (Contributor) commented Dec 5, 2025

🧪 Integration Tests Results

Overall Success Rate: 86.8%
Total Cost: $1.11
Models Tested: 5
Timestamp: 2025-12-05 17:01:16 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Success Rate | Tests Passed | Skipped | Total Tests | Cost |
|-------|--------------|--------------|---------|-------------|------|
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 7/7 | 1 | 8 | $0.05 |
| litellm_proxy_gpt_5_mini_2025_08_07 | 87.5% | 7/8 | 0 | 8 | $0.07 |
| litellm_proxy_moonshot_kimi_k2_thinking | 85.7% | 6/7 | 1 | 8 | $0.10 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 8/8 | 0 | 8 | $0.39 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 62.5% | 5/8 | 0 | 8 | $0.49 |

📋 Detailed Results

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.05
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_b6e3767_deepseek_run_N8_20251205_165711
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.07
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_b6e3767_gpt5_mini_run_N8_20251205_165713

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: Fetched http://localhost:8000 and extracted the displayed answer. Next steps: ask the user if they want me to simulate clicking the button in a headless browser, fetch other endpoints, or make any mod... (Cost: $0.0047)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 85.7% (6/7)
  • Total Cost: $0.10
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_b6e3767_kimi_k2_run_N8_20251205_165711
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.02)

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.39
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_b6e3767_gemini_3_pro_run_N8_20251205_165714

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 62.5% (5/8)
  • Total Cost: $0.49
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_b6e3767_sonnet_run_N8_20251205_165716

Failed Tests:

  • t05_simple_browsing: Agent did not find the answer. Response: ... (Cost: $0.04)
  • t06_github_pr_browsing: No final answer found from agent. Events: 9, LLM messages: 4 (Cost: $0.09)
  • t08_image_file_viewing: Agent did not identify yellow color in the logo. Response: (Cost: $0.04)

@enyst (Collaborator) commented Dec 5, 2025

In some conversations uploaded by Juan, I see:
gpt-5-codex MessageEvent

The conversation ended with a MessageEvent. This MessageEvent has content, so the LLM is speaking to the user and waiting for the user. That's the kind of case, it seems to me, for which we created and sent a fake user message in V0, to tell it to continue working.


Related, though maybe unnecessary: in the codex-cli, they have this text for GPT-5:

You are a coding agent. Please keep going until the query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved. Autonomously resolve the query to the best of your ability, using the tools available to you, before coming back to the user. Do NOT guess or make up an answer.

However, this text doesn't exist in the system prompts for the codex variants. I assume that means it might not be needed for gpt-5-codex (although... it seems to be what we see), or maybe something else took its place in their SWE-bench instruction, as opposed to the system message.
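
A minimal sketch of that failure mode; the Event shape and kind strings here are illustrative, not the real SDK types:

from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str
    content: str = ""
    tool_calls: list = field(default_factory=list)

def needs_continuation(events: list[Event]) -> bool:
    # A turn ending in a content-only MessageEvent means the model is
    # "speaking to the user". In a benchmark there is no user, so finishing
    # here yields an empty patch; this is the case the fake user message
    # was meant to handle.
    last = events[-1]
    return (last.kind == "MessageEvent"
            and bool(last.content)
            and not last.tool_calls)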

@blacksmith-sh bot requested a review from raymyers December 6, 2025 12:57
blacksmith-sh bot (Contributor) commented Dec 6, 2025

[Automatic Post]: I have assigned @raymyers as a reviewer based on git blame information. Thanks in advance for the help!

@neubig requested review from neubig and removed the request for raymyers December 6, 2025 16:12
openhands-ai bot commented Dec 6, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1304 at branch `fix-premature-conversation-termination-clean`

Feel free to include any additional details that might help me get this PR into a better state.


@neubig (Contributor) commented Dec 6, 2025

Thanks @enyst . @xingyaoww: based on your previous experience, I wonder what you think we should do here. Should we re-implement the fake user message, modify the swe-bench prompt, something else?

@enyst (Collaborator) commented Dec 8, 2025

Related:
I think Simon is seeing the same thing I mentioned above: a MessageEvent where we mark the conversation FINISHED.

We could try this PR as it is, if you'd like, but I think if we send back the last message with the 'assistant' role we get a 400, so we need a 'user'-role message, which is what the 'fake user message' was doing.
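
At the API level the constraint looks roughly like this (the wording of the nudge is illustrative):

# Many chat-completions providers reject two consecutive 'assistant'
# messages with an HTTP 400, so the continuation nudge must use role='user'.
messages = [
    {"role": "user", "content": "Fix the failing test."},
    {"role": "assistant", "content": "I fetched the page and found the answer."},
    # {"role": "assistant", "content": "..."}   # <- rejected by many APIs
    {"role": "user", "content": "Please continue working on the task. "
                                "If you are done, call the finish tool."},
]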

@xingyaoww (Collaborator) commented

How about we solve issue #1351 that Engel mentioned, force the agent to emit FinishAction when it is actually done, and send back a "fake user message" when it sends a MessageEvent?
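
A rough sketch of that policy, with hypothetical names (agent.step, send_user_message, and the event kinds are assumptions, not the SDK API):

# Only FinishAction ends the run; a content-only MessageEvent gets a fake
# user message and the loop continues; tool calls execute inside step().
FAKE_USER_MESSAGE = ("Please continue working on the task. "
                     "When you are fully done, emit a FinishAction.")

def run(agent, task: str, max_steps: int = 50):
    agent.send_user_message(task)
    for _ in range(max_steps):
        event = agent.step()
        if event.kind == "FinishAction":   # explicit, unambiguous stop
            return event
        if event.kind == "MessageEvent":   # model is talking to the (absent) user
            agent.send_user_message(FAKE_USER_MESSAGE)
    raise TimeoutError("agent did not emit FinishAction within max_steps")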

@juanmichelini (Collaborator, Author) commented

This issue happens only when benchmarking or testing the models GPT 5 Codex and GLM 4.6, that is, when there is no user to ask the agent to continue.

Those two models behave differently from the Claude family, and the current fix fails the tests for Claude Sonnet 4. As Engel mentions, there is a related issue that fails with Claude: #1351.

(Side note, possibly related: unlike GPT 5 Codex, GPT 5 gives patches in most cases, but only in 20% of the cases does it show a FinishAction, which makes GPT 5 more costly to run when doing multiple iterations. See the benchmark sheet and compare both GPT 5 variants.)

Changing the conditions for FinishAction might impact all model evaluations, so I do not think we should merge the fix as is, but we could:

  • Add some LLM-specific logic, either in the agent or in the system prompts, for GPT 5 Codex and GLM 4.6. The tests would work, and other models would not be affected when benchmarking.

  • Keep the fix but test differently: see if the tests can be modified to work with Claude, and also run SWE-bench with Claude on this fix to make sure there is no regression.

  • Give "[BUG] Agent produces MessageEvent without ActionEvent, causing premature execution termination" (#1351) a try, and also check whether fixing it resolves the empty patches from GPT 5 Codex and GLM 4.6.

  • In the meantime, we could evaluate GPT 5 Codex and GLM 4.6 with this fix without merging.
