feat: add LlamaCppInterpreter for toolshim, fix streaming chunk handling#8281
feat: add LlamaCppInterpreter for toolshim, fix streaming chunk handling#8281eugenio wants to merge 3 commits intoblock:mainfrom
Conversation
- Add `repair_truncated_json()` fallback to `safely_parse_json()` that closes unclosed strings and brackets in truncated tool call arguments. Fixes large-payload EOF errors (col 27409+) from streaming models. (block#8272) - Downgrade summon frontmatter parse warning from WARN to DEBUG and include the source file path for easier debugging. Files without a `name` field are expected (non-extension markdown) and should not pollute logs. (block#8273) - Add "Merged text content", "Removed trailing assistant message", and "Trimmed trailing whitespace from assistant message" to the MOIM expected-issues list so they no longer trigger WARN-level log entries on every LLM call. (block#8274) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a built-in llama.cpp-based ToolInterpreter that eliminates the hard dependency on an external Ollama server for toolshim functionality. LlamaCppInterpreter: - Implements the ToolInterpreter trait using the existing InferenceRuntime - Uses a small local GGUF model (default: Llama-3.2-1B-Instruct Q4_K_M) - Configurable via GOOSE_TOOLSHIM_MODEL env var - Falls back to OllamaInterpreter if local-inference feature is disabled or model is unavailable Streaming fix: - Toolshim postprocessing now runs only on the final streaming message (when usage data is present) instead of on every intermediate chunk - Prevents small interpreter models from seeing partial text fragments that they cannot interpret, fixing the "noop for all chunks" bug (block#8275) Visibility changes: - Widen inference_engine types to pub(crate) for cross-module access - Add load_model_sync_public() wrapper on LocalInferenceProvider Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a9552930d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| return augment_message_with_tool_calls(&interpreter, response, toolshim_tools) | ||
| .await | ||
| .map_err(|e| { | ||
| ProviderError::ExecutionError(format!("Failed to augment message: {}", e)) |
There was a problem hiding this comment.
Fall back to Ollama when llama.cpp interpretation fails
This branch returns immediately after LlamaCppInterpreter::new() succeeds, so any later runtime error from llama.cpp (for example, generate_sync failing because GOOSE_TOOLSHIM_MODEL is not downloaded) aborts toolshim instead of taking the intended Ollama fallback path. In builds with local-inference enabled but no local interpreter model present, toolshim now hard-fails rather than degrading gracefully.
Useful? React with 👍 / 👎.
| if config.toolshim && usage.is_some() { | ||
| if let Some(msg) = last_message.take() { |
There was a problem hiding this comment.
Decouple final toolshim pass from usage availability
The final post-processing is now gated on usage.is_some(), but streaming providers can legally complete without usage metadata (e.g., OpenAI-compatible parsing only forwards usage when output token counts are present). In those cases this condition is never met, so toolshim is skipped entirely and tool-call JSON stays as plain assistant text.
Useful? React with 👍 / 👎.
| if message.is_some() { | ||
| last_message = message.clone(); |
There was a problem hiding this comment.
Aggregate streamed text before one-shot tool interpretation
This stores only the latest chunk in last_message, then interprets that single message at stream end. For delta-style streams, each message contains only a fragment, so the interpreter often receives just the trailing token(s) instead of the full assistant output and returns noop. The final toolshim pass needs accumulated content across chunks, not just the last chunk object.
Useful? React with 👍 / 👎.
- P1: LlamaCpp augmentation errors now fall back to Ollama instead of hard-failing. Previously only creation failure triggered fallback. - P1: Toolshim runs after stream ends even when usage metadata is absent (some OpenAI-compatible providers omit it). Added post-loop fallback. - P1: Store accumulated message across chunks instead of only the last chunk. The final toolshim pass now sees the complete assistant output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93b016a627
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if let Some(msg) = &message { | ||
| accumulated_message = Some(msg.clone()); |
There was a problem hiding this comment.
Accumulate full streamed text before running toolshim
accumulated_message is overwritten with each incoming chunk, so the final toolshim pass only sees the last fragment instead of the full assistant response. This breaks tool extraction for delta-style streams (for example, OpenAI-format streams emit per-chunk deltas), where tool-call JSON is split across many chunks; the interpreter then receives an incomplete suffix and typically returns no tool calls.
Useful? React with 👍 / 👎.
| if config.toolshim && !stream_done { | ||
| if let Some(msg) = accumulated_message.take() { | ||
| let augmented = toolshim_postprocess(msg, &toolshim_tools).await?; | ||
| yield (Some(augmented), None); |
There was a problem hiding this comment.
Prevent duplicate output when usage metadata is absent
In toolshim mode, chunks are already yielded inside the loop, but the post-loop fallback emits accumulated_message again when no usage metadata was seen. For providers that omit usage, this produces duplicated assistant content (at least the final delta chunk, or the full message for snapshot streams), which can corrupt the final transcript and downstream parsing.
Useful? React with 👍 / 👎.
|
Superseded by #8310 which includes all changes from this PR (LlamaCppInterpreter, streaming chunk accumulation, direct parsers) plus additional marker sanitization fixes. |
Summary
ToolInterpreterimplementation that uses the built-in llama.cpp backend instead of requiring an external Ollama server. Configured viaGOOSE_TOOLSHIM_MODELenv var (default:bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M). Falls back toOllamaInterpreterwhen thelocal-inferencefeature is disabled or the model is unavailable.toolshim_postprocessnow runs only on the final streaming message (whenusageis present) instead of on every intermediate chunk. This prevents the interpreter model from seeing incomplete text fragments and returningnoopfor all of them. Ref: bug: Toolshim OllamaInterpreter returns noop for all streaming tool call fragments #8275Motivation
The toolshim currently hardcodes an
OllamaInterpreterthat requires a running Ollama server onlocalhost:11434. This is a friction point for users who:local-inferencefeature compiled inThe
LlamaCppInterpreterreuses the existingInferenceRuntimeand model cache, loading a small (~800MB) model on first use.Visibility changes
inference_enginemodule:pub(super)→pub(crate)forLoadedModel,TokenAction,validate_and_compute_context,create_and_prefill_context,generation_loop,build_context_params,estimate_max_context_for_memory,effective_context_sizeInferenceRuntime::get_or_create_model_slot:fn→pub fnLocalInferenceProvider::load_model_sync_public: newpub(crate)wrapperTest plan
--features local-inference(full default features)local-inference(graceful fallback to Ollama)goose sessionwithGOOSE_TOOLSHIM=trueuses llama.cpp (check logs forLlamaCpp tool interpreter response)🤖 Generated with Claude Code