Tool results are always stringified, preventing multimodal (image) tool responses with OpenAI Responses API #363

@tecoad

Description

Problem

Tool results are always converted to strings via JSON.stringify(), which prevents sending multimodal content (e.g. images) in tool responses when using the OpenAI Responses API.

The OpenAI Responses API supports multimodal tool outputs via function_call_output, where output can be either a string or an array of content parts:

{
  "type": "function_call_output",
  "call_id": "call_xyz",
  "output": [
    { "type": "input_image", "image_url": "https://example.com/screenshot.png" },
    { "type": "input_text", "text": "Screenshot of the current state" }
  ]
}

This is documented in the migration guide and the function calling guide ("For functions that return images or files, you can pass an array of image or file objects instead of a string").

However, TanStack AI forces all tool results to strings at three points, making it impossible to use this capability:

1. Server-side tool execution — tool-calls.ts

toolResultContent =
  typeof result === 'string' ? result : JSON.stringify(result)
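A minimal sketch of how this step could preserve multimodal arrays instead. The `isContentPartArray` guard and `toToolResultContent` helper are hypothetical names, not existing TanStack AI code; the guard simply checks for the `{ type: string }` shape the Responses API content parts share:

```typescript
// Hypothetical guard: treat an array as multimodal content when every
// element is an object with a string `type` field (e.g. "input_image").
function isContentPartArray(value: unknown): value is Array<{ type: string }> {
  return (
    Array.isArray(value) &&
    value.length > 0 &&
    value.every(
      (part) =>
        typeof part === 'object' &&
        part !== null &&
        typeof (part as { type?: unknown }).type === 'string',
    )
  )
}

// Preserve strings and content-part arrays; stringify everything else.
function toToolResultContent(
  result: unknown,
): string | Array<{ type: string }> {
  if (typeof result === 'string') return result
  if (isContentPartArray(result)) return result
  return JSON.stringify(result)
}
```

The same check could be reused at the client-side and adapter locations below, so only genuinely structured (non-content-part) results get stringified.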

2. Client-side tool result processing — processor.ts

// Step 2: Create a tool-result part (for LLM conversation history)
const content = typeof output === 'string' ? output : JSON.stringify(output)

3. OpenAI adapter message conversion — text.ts

output:
  typeof message.content === 'string'
    ? message.content
    : JSON.stringify(message.content),

Use Case

I am building a visual editor where the AI designs on a canvas. After each tool call (e.g. add_text, edit_image), I need the AI to see a screenshot of the canvas to verify its changes and iterate. The canvas screenshot is available via a URL endpoint.

With the current implementation, returning { type: "input_image", image_url: "https://..." } from a tool gets stringified to "{\"type\":\"input_image\",\"image_url\":\"https://...\"}" — the model receives it as plain text and cannot actually see the image.

I verified manually that sending the function_call_output with output as an array (with input_image parts) to the OpenAI Responses API works correctly — the model receives and processes the image.

Proposed Solution

The ModelMessage type for role: "tool" should support content as either string or Array<ContentPart>. The stringify logic in the three locations above should preserve arrays/objects when they match the expected multimodal format, and only stringify plain objects that are not content part arrays.
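As a sketch, the widened type could look like this. The names `ContentPart` and `ToolModelMessage` and their fields are assumptions modeled on the Responses API shapes above, not the actual TanStack AI types:

```typescript
// Hypothetical content-part shapes mirroring the Responses API.
type ContentPart =
  | { type: 'input_text'; text: string }
  | { type: 'input_image'; image_url: string }

// role: "tool" messages would accept a string or an array of parts.
interface ToolModelMessage {
  role: 'tool'
  toolCallId: string
  content: string | Array<ContentPart>
}

// Example: the canvas-screenshot use case from above.
const multimodalResult: ToolModelMessage = {
  role: 'tool',
  toolCallId: 'call_xyz',
  content: [
    { type: 'input_image', image_url: 'https://example.com/screenshot.png' },
    { type: 'input_text', text: 'Screenshot of the current state' },
  ],
}
```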

The OpenAI adapter should pass the array through to function_call_output.output instead of stringifying it.
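A sketch of what that adapter change might look like, assuming a helper that builds the `function_call_output` item (the `toFunctionCallOutputItem` name and signature are illustrative, not the actual `text.ts` code):

```typescript
type ContentPart = { type: string; [key: string]: unknown }

interface FunctionCallOutputItem {
  type: 'function_call_output'
  call_id: string
  output: string | Array<ContentPart>
}

// Sketch: build the Responses API item, passing strings and
// content-part arrays through untouched and stringifying the rest.
function toFunctionCallOutputItem(
  callId: string,
  content: string | Array<ContentPart> | Record<string, unknown>,
): FunctionCallOutputItem {
  return {
    type: 'function_call_output',
    call_id: callId,
    output:
      typeof content === 'string' || Array.isArray(content)
        ? content
        : JSON.stringify(content),
  }
}
```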

I am aware this raises a broader discussion: since tools can currently return anything, forcing the output to a string was presumably the immediate fix to keep tool-result handling consistent across all models (I am not sure which models accept content parts as tool results). Still, this is a crucial capability to support.

Environment

  • @tanstack/ai: latest
  • @tanstack/ai-openai: latest
  • @cloudflare/tanstack-ai: latest
  • OpenAI adapter uses Responses API (client.responses.create())
