Skip to content

feat: add LlamaCppInterpreter for toolshim, fix streaming chunk handling#8281

Closed
eugenio wants to merge 3 commits intoblock:mainfrom
eugenio:feat/llamacpp-toolshim-interpreter
Closed

feat: add LlamaCppInterpreter for toolshim, fix streaming chunk handling#8281
eugenio wants to merge 3 commits intoblock:mainfrom
eugenio:feat/llamacpp-toolshim-interpreter

Conversation

@eugenio
Copy link
Copy Markdown

@eugenio eugenio commented Apr 3, 2026

Summary

  • LlamaCppInterpreter: New ToolInterpreter implementation that uses the built-in llama.cpp backend instead of requiring an external Ollama server. Configured via GOOSE_TOOLSHIM_MODEL env var (default: bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M). Falls back to OllamaInterpreter when the local-inference feature is disabled or the model is unavailable.
  • Streaming fix: toolshim_postprocess now runs only on the final streaming message (when usage is present) instead of on every intermediate chunk. This prevents the interpreter model from seeing incomplete text fragments and returning noop for all of them. Ref: bug: Toolshim OllamaInterpreter returns noop for all streaming tool call fragments #8275

Motivation

The toolshim currently hardcodes an OllamaInterpreter that requires a running Ollama server on localhost:11434. This is a friction point for users who:

  1. Don't want to install/manage a separate Ollama service
  2. Already have the local-inference feature compiled in
  3. Use Windows where Ollama runs as a service that consumes resources even when idle

The LlamaCppInterpreter reuses the existing InferenceRuntime and model cache, loading a small (~800MB) model on first use.

Visibility changes

  • inference_engine module: pub(super)pub(crate) for LoadedModel, TokenAction, validate_and_compute_context, create_and_prefill_context, generation_loop, build_context_params, estimate_max_context_for_memory, effective_context_size
  • InferenceRuntime::get_or_create_model_slot: fnpub fn
  • LocalInferenceProvider::load_model_sync_public: new pub(crate) wrapper

Test plan

  • Compiles with --features local-inference (full default features)
  • Compiles without local-inference (graceful fallback to Ollama)
  • All existing unit tests pass (1014/1022 — 8 pre-existing env-specific failures)
  • Manual: verify goose session with GOOSE_TOOLSHIM=true uses llama.cpp (check logs for LlamaCpp tool interpreter response)
  • Manual: verify streaming tool calls are interpreted correctly with small models

🤖 Generated with Claude Code

eugenio and others added 2 commits April 3, 2026 12:19
- Add `repair_truncated_json()` fallback to `safely_parse_json()` that
  closes unclosed strings and brackets in truncated tool call arguments.
  Fixes large-payload EOF errors (col 27409+) from streaming models. (block#8272)

- Downgrade summon frontmatter parse warning from WARN to DEBUG and
  include the source file path for easier debugging. Files without a
  `name` field are expected (non-extension markdown) and should not
  pollute logs. (block#8273)

- Add "Merged text content", "Removed trailing assistant message", and
  "Trimmed trailing whitespace from assistant message" to the MOIM
  expected-issues list so they no longer trigger WARN-level log entries
  on every LLM call. (block#8274)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a built-in llama.cpp-based ToolInterpreter that eliminates the hard
dependency on an external Ollama server for toolshim functionality.

LlamaCppInterpreter:
- Implements the ToolInterpreter trait using the existing InferenceRuntime
- Uses a small local GGUF model (default: Llama-3.2-1B-Instruct Q4_K_M)
- Configurable via GOOSE_TOOLSHIM_MODEL env var
- Falls back to OllamaInterpreter if local-inference feature is disabled
  or model is unavailable

Streaming fix:
- Toolshim postprocessing now runs only on the final streaming message
  (when usage data is present) instead of on every intermediate chunk
- Prevents small interpreter models from seeing partial text fragments
  that they cannot interpret, fixing the "noop for all chunks" bug (block#8275)

Visibility changes:
- Widen inference_engine types to pub(crate) for cross-module access
- Add load_model_sync_public() wrapper on LocalInferenceProvider

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a9552930d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +133 to +136
return augment_message_with_tool_calls(&interpreter, response, toolshim_tools)
.await
.map_err(|e| {
ProviderError::ExecutionError(format!("Failed to augment message: {}", e))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fall back to Ollama when llama.cpp interpretation fails

This branch returns immediately after LlamaCppInterpreter::new() succeeds, so any later runtime error from llama.cpp (for example, generate_sync failing because GOOSE_TOOLSHIM_MODEL is not downloaded) aborts toolshim instead of taking the intended Ollama fallback path. In builds with local-inference enabled but no local interpreter model present, toolshim now hard-fails rather than degrading gracefully.

Useful? React with 👍 / 👎.

Comment on lines +356 to +357
if config.toolshim && usage.is_some() {
if let Some(msg) = last_message.take() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Decouple final toolshim pass from usage availability

The final post-processing is now gated on usage.is_some(), but streaming providers can legally complete without usage metadata (e.g., OpenAI-compatible parsing only forwards usage when output token counts are present). In those cases this condition is never met, so toolshim is skipped entirely and tool-call JSON stays as plain assistant text.

Useful? React with 👍 / 👎.

Comment on lines +348 to +349
if message.is_some() {
last_message = message.clone();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Aggregate streamed text before one-shot tool interpretation

This stores only the latest chunk in last_message, then interprets that single message at stream end. For delta-style streams, each message contains only a fragment, so the interpreter often receives just the trailing token(s) instead of the full assistant output and returns noop. The final toolshim pass needs accumulated content across chunks, not just the last chunk object.

Useful? React with 👍 / 👎.

- P1: LlamaCpp augmentation errors now fall back to Ollama instead of
  hard-failing. Previously only creation failure triggered fallback.

- P1: Toolshim runs after stream ends even when usage metadata is absent
  (some OpenAI-compatible providers omit it). Added post-loop fallback.

- P1: Store accumulated message across chunks instead of only the last
  chunk. The final toolshim pass now sees the complete assistant output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93b016a627

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +356 to +357
if let Some(msg) = &message {
accumulated_message = Some(msg.clone());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Accumulate full streamed text before running toolshim

accumulated_message is overwritten with each incoming chunk, so the final toolshim pass only sees the last fragment instead of the full assistant response. This breaks tool extraction for delta-style streams (for example, OpenAI-format streams emit per-chunk deltas), where tool-call JSON is split across many chunks; the interpreter then receives an incomplete suffix and typically returns no tool calls.

Useful? React with 👍 / 👎.

Comment on lines +385 to +388
if config.toolshim && !stream_done {
if let Some(msg) = accumulated_message.take() {
let augmented = toolshim_postprocess(msg, &toolshim_tools).await?;
yield (Some(augmented), None);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent duplicate output when usage metadata is absent

In toolshim mode, chunks are already yielded inside the loop, but the post-loop fallback emits accumulated_message again when no usage metadata was seen. For providers that omit usage, this produces duplicated assistant content (at least the final delta chunk, or the full message for snapshot streams), which can corrupt the final transcript and downstream parsing.

Useful? React with 👍 / 👎.

@eugenio
Copy link
Copy Markdown
Author

eugenio commented Apr 4, 2026

Superseded by #8310 which includes all changes from this PR (LlamaCppInterpreter, streaming chunk accumulation, direct parsers) plus additional marker sanitization fixes.

@eugenio eugenio closed this Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant