Problem
Tool results are always converted to strings via JSON.stringify(), which prevents sending multimodal content (e.g. images) in tool responses when using the OpenAI Responses API.
The OpenAI Responses API supports multimodal tool outputs via function_call_output, where output can be either a string or an array of content parts:
```json
{
  "type": "function_call_output",
  "call_id": "call_xyz",
  "output": [
    { "type": "input_image", "image_url": "https://example.com/screenshot.png" },
    { "type": "input_text", "text": "Screenshot of the current state" }
  ]
}
```

This is documented in the migration guide and the function calling guide ("For functions that return images or files, you can pass an array of image or file objects instead of a string").
However, TanStack AI forces all tool results to strings at three points, making it impossible to use this capability:
1. Server-side tool execution — `ai/packages/typescript/ai/src/activities/chat/tools/tool-calls.ts`, lines 181 to 182 in fda4b06:

```typescript
toolResultContent =
  typeof result === 'string' ? result : JSON.stringify(result)
```
2. Client-side tool result processing — `ai/packages/typescript/ai/src/activities/chat/stream/processor.ts`, lines 316 to 317 in fda4b06:

```typescript
// Step 2: Create a tool-result part (for LLM conversation history)
const content = typeof output === 'string' ? output : JSON.stringify(output)
```
3. OpenAI adapter message conversion — `ai/packages/typescript/ai-openai/src/adapters/text.ts`, lines 709 to 713 in fda4b06:

```typescript
output:
  typeof message.content === 'string'
    ? message.content
    : JSON.stringify(message.content),
})
```
Use Case
I am building a visual editor where the AI designs on a canvas. After each tool call (e.g. add_text, edit_image), I need the AI to see a screenshot of the canvas to verify its changes and iterate. The canvas screenshot is available via a URL endpoint.
With the current implementation, returning { type: "input_image", image_url: "https://..." } from a tool gets stringified to "{\"type\":\"input_image\",\"image_url\":\"https://...\"}" — the model receives it as plain text and cannot actually see the image.
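A minimal sketch of that behavior (the URL is a placeholder; the guard mirrors the stringify logic quoted above):

```typescript
// What happens today: a content-part object returned from a tool is
// stringified, so the model receives JSON text rather than an image.
const result = { type: "input_image", image_url: "https://example.com/x.png" };
const content = typeof result === "string" ? result : JSON.stringify(result);
// `content` is now a plain string; the model cannot see the image.
```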
I verified manually that sending the function_call_output with output as an array (with input_image parts) to the OpenAI Responses API works correctly — the model receives and processes the image.
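For reference, this is the payload shape I sent manually (a TypeScript sketch; the type names, `call_id`, and URL are placeholders of mine, not TanStack AI types):

```typescript
// Shape of a multimodal function_call_output item for the Responses API.
type ContentPart =
  | { type: "input_image"; image_url: string }
  | { type: "input_text"; text: string };

interface FunctionCallOutput {
  type: "function_call_output";
  call_id: string;
  // The array form is what enables multimodal tool results.
  output: string | Array<ContentPart>;
}

const toolResult: FunctionCallOutput = {
  type: "function_call_output",
  call_id: "call_xyz",
  output: [
    { type: "input_image", image_url: "https://example.com/screenshot.png" },
    { type: "input_text", text: "Screenshot of the current state" },
  ],
};
```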
Proposed Solution
The ModelMessage type for role: "tool" should support content as either string or Array<ContentPart>. The stringify logic in the three locations above should preserve arrays/objects when they match the expected multimodal format, and only stringify plain objects that are not content part arrays.
The OpenAI adapter should pass the array through to function_call_output.output instead of stringifying it.
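A minimal sketch of that guard, assuming a hypothetical `isContentPartArray` helper (the helper and function names are mine; the part type names follow the Responses API input types):

```typescript
// Content parts accepted by the Responses API in tool outputs.
type ContentPart =
  | { type: "input_text"; text: string }
  | { type: "input_image"; image_url: string }
  | { type: "input_file"; file_id: string };

// Returns true only for non-empty arrays whose elements all look like
// known content parts; anything else falls back to stringification.
function isContentPartArray(value: unknown): value is Array<ContentPart> {
  return (
    Array.isArray(value) &&
    value.length > 0 &&
    value.every(
      (part) =>
        typeof part === "object" &&
        part !== null &&
        ["input_text", "input_image", "input_file"].includes(
          (part as { type?: string }).type ?? "",
        ),
    )
  );
}

function toToolResultContent(result: unknown): string | Array<ContentPart> {
  if (typeof result === "string") return result;
  if (isContentPartArray(result)) return result; // pass through untouched
  return JSON.stringify(result); // existing behavior for plain objects
}
```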
I am aware this raises a broader question: tools can currently return anything, and forcing the output to a string was presumably the simplest way to keep communication consistent across all models (I am not sure which models accept content parts as tool results). Still, this is a crucial capability to support.
Environment
- @tanstack/ai: latest
- @tanstack/ai-openai: latest
- @cloudflare/tanstack-ai: latest
- OpenAI adapter uses Responses API (`client.responses.create()`)