feat: add LlamaCppInterpreter for toolshim, fix streaming chunk handling by eugenio · Pull Request #8281 · block/goose

eugenio · 2026-04-03T10:22:37Z

Summary

LlamaCppInterpreter: New ToolInterpreter implementation that uses the built-in llama.cpp backend instead of requiring an external Ollama server. Configured via GOOSE_TOOLSHIM_MODEL env var (default: bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M). Falls back to OllamaInterpreter when the local-inference feature is disabled or the model is unavailable.
Streaming fix: toolshim_postprocess now runs only on the final streaming message (when usage is present) instead of on every intermediate chunk. This prevents the interpreter model from seeing incomplete text fragments and returning noop for all of them. Ref: bug: Toolshim OllamaInterpreter returns noop for all streaming tool call fragments #8275

Motivation

The toolshim currently hardcodes an OllamaInterpreter that requires a running Ollama server on localhost:11434. This is a friction point for users who:

Don't want to install/manage a separate Ollama service
Already have the local-inference feature compiled in
Use Windows where Ollama runs as a service that consumes resources even when idle

The LlamaCppInterpreter reuses the existing InferenceRuntime and model cache, loading a small (~800MB) model on first use.

Visibility changes

inference_engine module: pub(super) → pub(crate) for LoadedModel, TokenAction, validate_and_compute_context, create_and_prefill_context, generation_loop, build_context_params, estimate_max_context_for_memory, effective_context_size
InferenceRuntime::get_or_create_model_slot: fn → pub fn
LocalInferenceProvider::load_model_sync_public: new pub(crate) wrapper

Test plan

Compiles with --features local-inference (full default features)
Compiles without local-inference (graceful fallback to Ollama)
All existing unit tests pass (1014/1022 — 8 pre-existing env-specific failures)
Manual: verify goose session with GOOSE_TOOLSHIM=true uses llama.cpp (check logs for LlamaCpp tool interpreter response)
Manual: verify streaming tool calls are interpreted correctly with small models

🤖 Generated with Claude Code

- Add `repair_truncated_json()` fallback to `safely_parse_json()` that closes unclosed strings and brackets in truncated tool call arguments. Fixes large-payload EOF errors (col 27409+) from streaming models. (block#8272) - Downgrade summon frontmatter parse warning from WARN to DEBUG and include the source file path for easier debugging. Files without a `name` field are expected (non-extension markdown) and should not pollute logs. (block#8273) - Add "Merged text content", "Removed trailing assistant message", and "Trimmed trailing whitespace from assistant message" to the MOIM expected-issues list so they no longer trigger WARN-level log entries on every LLM call. (block#8274) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add a built-in llama.cpp-based ToolInterpreter that eliminates the hard dependency on an external Ollama server for toolshim functionality. LlamaCppInterpreter: - Implements the ToolInterpreter trait using the existing InferenceRuntime - Uses a small local GGUF model (default: Llama-3.2-1B-Instruct Q4_K_M) - Configurable via GOOSE_TOOLSHIM_MODEL env var - Falls back to OllamaInterpreter if local-inference feature is disabled or model is unavailable Streaming fix: - Toolshim postprocessing now runs only on the final streaming message (when usage data is present) instead of on every intermediate chunk - Prevents small interpreter models from seeing partial text fragments that they cannot interpret, fixing the "noop for all chunks" bug (block#8275) Visibility changes: - Widen inference_engine types to pub(crate) for cross-module access - Add load_model_sync_public() wrapper on LocalInferenceProvider Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a9552930d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-03T10:26:58Z

crates/goose/src/agents/reply_parts.rs

+                return augment_message_with_tool_calls(&interpreter, response, toolshim_tools)
+                    .await
+                    .map_err(|e| {
+                        ProviderError::ExecutionError(format!("Failed to augment message: {}", e))


Fall back to Ollama when llama.cpp interpretation fails

This branch returns immediately after LlamaCppInterpreter::new() succeeds, so any later runtime error from llama.cpp (for example, generate_sync failing because GOOSE_TOOLSHIM_MODEL is not downloaded) aborts toolshim instead of taking the intended Ollama fallback path. In builds with local-inference enabled but no local interpreter model present, toolshim now hard-fails rather than degrading gracefully.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-03T10:26:58Z

crates/goose/src/agents/reply_parts.rs

+                if config.toolshim && usage.is_some() {
+                    if let Some(msg) = last_message.take() {


Decouple final toolshim pass from usage availability

The final post-processing is now gated on usage.is_some(), but streaming providers can legally complete without usage metadata (e.g., OpenAI-compatible parsing only forwards usage when output token counts are present). In those cases this condition is never met, so toolshim is skipped entirely and tool-call JSON stays as plain assistant text.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-03T10:26:58Z

crates/goose/src/agents/reply_parts.rs

+                if message.is_some() {
+                    last_message = message.clone();


Aggregate streamed text before one-shot tool interpretation

This stores only the latest chunk in last_message, then interprets that single message at stream end. For delta-style streams, each message contains only a fragment, so the interpreter often receives just the trailing token(s) instead of the full assistant output and returns noop. The final toolshim pass needs accumulated content across chunks, not just the last chunk object.

Useful? React with 👍 / 👎.

- P1: LlamaCpp augmentation errors now fall back to Ollama instead of hard-failing. Previously only creation failure triggered fallback. - P1: Toolshim runs after stream ends even when usage metadata is absent (some OpenAI-compatible providers omit it). Added post-loop fallback. - P1: Store accumulated message across chunks instead of only the last chunk. The final toolshim pass now sees the complete assistant output. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93b016a627

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-03T10:40:40Z

crates/goose/src/agents/reply_parts.rs

+                if let Some(msg) = &message {
+                    accumulated_message = Some(msg.clone());


Accumulate full streamed text before running toolshim

accumulated_message is overwritten with each incoming chunk, so the final toolshim pass only sees the last fragment instead of the full assistant response. This breaks tool extraction for delta-style streams (for example, OpenAI-format streams emit per-chunk deltas), where tool-call JSON is split across many chunks; the interpreter then receives an incomplete suffix and typically returns no tool calls.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-03T10:40:40Z

crates/goose/src/agents/reply_parts.rs

+            if config.toolshim && !stream_done {
+                if let Some(msg) = accumulated_message.take() {
+                    let augmented = toolshim_postprocess(msg, &toolshim_tools).await?;
+                    yield (Some(augmented), None);


Prevent duplicate output when usage metadata is absent

In toolshim mode, chunks are already yielded inside the loop, but the post-loop fallback emits accumulated_message again when no usage metadata was seen. For providers that omit usage, this produces duplicated assistant content (at least the final delta chunk, or the full message for snapshot streams), which can corrupt the final transcript and downstream parsing.

Useful? React with 👍 / 👎.

eugenio · 2026-04-04T07:55:05Z

Superseded by #8310 which includes all changes from this PR (LlamaCppInterpreter, streaming chunk accumulation, direct parsers) plus additional marker sanitization fixes.

eugenio and others added 2 commits April 3, 2026 12:19

chatgpt-codex-connector bot reviewed Apr 3, 2026

View reviewed changes

eugenio closed this Apr 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add LlamaCppInterpreter for toolshim, fix streaming chunk handling#8281

feat: add LlamaCppInterpreter for toolshim, fix streaming chunk handling#8281
eugenio wants to merge 3 commits intoblock:mainfrom
eugenio:feat/llamacpp-toolshim-interpreter

eugenio commented Apr 3, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Uh oh!

eugenio commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		if config.toolshim && usage.is_some() {
		if let Some(msg) = last_message.take() {

		if let Some(msg) = &message {
		accumulated_message = Some(msg.clone());

Conversation

eugenio commented Apr 3, 2026

Summary

Motivation

Visibility changes

Test plan

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

eugenio commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant