Add multimodal input support to generateImage() and generateVideo() (image-to-image, image-to-video) #481

yiss · 2026-04-20T16:45:36Z

Problem

generateImage() and generateVideo() are currently centered around text-prompt inputs, but several providers and models support image-conditioned generation workflows.

Examples include:

image-to-image generation
prompt + reference image generation
multi-reference image generation
image-to-video generation
video generation from a starting frame
model-specific image editing / transformation workflows

Today there is no obvious provider-agnostic way to pass image inputs into generateImage() and generateVideo().

TanStack AI already has a clean multimodal abstraction for content parts (ImagePart with source.type: 'data' | 'url'). It would be great if media generation APIs reused that same shape instead of introducing provider-specific one-offs for image-conditioned generation.

Why this matters

Modern image and video models are increasingly multimodal. Generation is no longer only text-to-image or text-to-video.

A unified way to pass image inputs would make it much easier for adapters to support workflows like:

image editing
reference-guided generation
image-to-video
multi-image composition

Proposal

Add an optional inputs field to both generateImage() and generateVideo() that accepts reusable multimodal content parts, ideally existing ImagePart values.

This would provide a consistent, provider-agnostic way to pass image-conditioned inputs into media generation APIs.
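
To make the proposed shape concrete, here is a minimal type-level sketch. `ImagePart` is re-declared locally to mirror the shape described above (`source.type: 'data' | 'url'`), with `mimeType` assumed to be required only on data sources, as in the examples in this post. `ImageConditionedOptions` and `countInputs` are hypothetical names used only for illustration; they are not part of `@tanstack/ai`.

```typescript
// Local mirror of the ImagePart shape described in this proposal.
// Re-declared so the sketch is self-contained; the real type lives in @tanstack/ai
// and may differ in detail.
type ImageSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string } // mimeType requirement is an assumption

interface ImagePart {
  type: 'image'
  source: ImageSource
}

// Hypothetical options shape. Because `inputs` is optional, existing
// text-only calls would remain valid without changes.
interface ImageConditionedOptions {
  prompt: string
  inputs?: ReadonlyArray<ImagePart>
}

// Hypothetical helper: how many image inputs a call carries.
function countInputs(options: ImageConditionedOptions): number {
  return options.inputs?.length ?? 0
}

const textOnly: ImageConditionedOptions = { prompt: 'A red bicycle' }

const withReference: ImageConditionedOptions = {
  prompt: 'Turn this into a cinematic product photo',
  inputs: [
    {
      type: 'image',
      source: { type: 'url', value: 'https://example.com/reference.png' },
    },
  ],
}
```

Keeping the field optional is what would let text-only providers continue to work unchanged.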

Example API

`generateImage()`

import { generateImage, type ImagePart } from '@tanstack/ai'

const reference: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/reference.png',
  },
}

await generateImage({
  adapter: openaiImage('gpt-image-1.5'),
  prompt: 'Turn this into a cinematic product photo',
  inputs: [reference],
})

`generateVideo()`

import { generateVideo, type ImagePart } from '@tanstack/ai'

const startingFrame: ImagePart = {
  type: 'image',
  source: {
    type: 'data',
    value: base64Image, // base64-encoded PNG data, assumed to be defined elsewhere
    mimeType: 'image/png',
  },
}

await generateVideo({
  adapter: googleVideo('veo-3.1'),
  prompt: 'Animate this still into a slow cinematic push-in with subtle motion',
  inputs: [startingFrame],
})

Multiple reference images

import { generateImage, type ImagePart } from '@tanstack/ai'

const product: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/product.png',
  },
}

const style: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/style.png',
  },
}

await generateImage({
  adapter: geminiImage('nano-banana'),
  prompt: 'Generate a new image of the product using the style of the second reference',
  inputs: [product, style],
})

Expected behavior

generateImage() and generateVideo() should both accept image-conditioned inputs through the same field name.
The input format should ideally reuse existing TanStack AI multimodal primitives such as ImagePart.
Adapters should map those inputs into the provider-native request shape.
Unsupported combinations can be rejected by adapters at runtime or by adapter-specific validation.
Providers that only support text prompts should continue to work unchanged.
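
The adapter-side behavior listed above could look roughly like the sketch below. Everything here is an assumption for illustration: `InputCapabilities`, `ProviderImage`, and `mapInputs` are hypothetical names, and the "provider-native" shape is invented rather than taken from any real provider API.

```typescript
// Local mirror of the ImagePart shape described in this proposal.
type ImageSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: ImageSource
}

// Hypothetical capability descriptor an adapter might declare.
interface InputCapabilities {
  maxImages: number // 0 means the adapter is text-only
  acceptsUrls: boolean
}

// Hypothetical provider-native request shape, invented for this sketch.
type ProviderImage =
  | { kind: 'remote'; url: string }
  | { kind: 'inline'; base64: string; contentType: string }

// Maps provider-agnostic inputs into the provider shape and rejects
// unsupported combinations at runtime.
function mapInputs(
  inputs: ReadonlyArray<ImagePart> | undefined,
  caps: InputCapabilities,
): ProviderImage[] {
  const parts = inputs ?? []
  if (parts.length > caps.maxImages) {
    throw new Error(
      `Adapter accepts at most ${caps.maxImages} image input(s), received ${parts.length}`,
    )
  }
  return parts.map((part): ProviderImage => {
    const source = part.source
    if (source.type === 'url') {
      if (!caps.acceptsUrls) {
        throw new Error('Adapter does not accept URL image inputs')
      }
      return { kind: 'remote', url: source.value }
    }
    return { kind: 'inline', base64: source.value, contentType: source.mimeType }
  })
}
```

With this shape, a text-only adapter would declare `maxImages: 0`: calls without `inputs` pass through untouched, and calls with `inputs` fail with a clear error.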

Open design questions

Should the field be named inputs, references, or something else?
Should it accept only ImagePart[], or broader content parts for future extensibility?
Should generateVideo() support multiple input images as well, or only one initially?

Summary

Request: add a unified, provider-agnostic way to pass image-conditioned inputs into both generateImage() and generateVideo(), ideally by reusing existing multimodal content-part types such as ImagePart.

tombeckenham · 2026-05-05T22:07:06Z

Maintainer

It is possible to provide images today, but only via model options, which are provider-specific. I like this idea, although it may be complicated to translate those inputs into every format that's expected. There are now multiple types of images that can be provided: references, elements, start frame, and end frame. This is worth exploring, though.
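
One possible way to account for the different kinds of images mentioned in this comment is to tag each input with a role. The sketch below is not from the thread or from `@tanstack/ai`; `ImageRole`, `RoledImageInput`, and `groupByRole` are hypothetical names, and the role strings are simply the four kinds listed above.

```typescript
// Local mirror of the ImagePart shape described in the original post.
type ImageSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: ImageSource
}

// Hypothetical role names, taken from the kinds of images listed in the comment.
type ImageRole = 'reference' | 'element' | 'start_frame' | 'end_frame'

// Hypothetical wrapper: an ImagePart plus the role it plays in the request.
interface RoledImageInput {
  role: ImageRole
  image: ImagePart
}

// Groups inputs by role so an adapter could place each image into the
// matching provider-specific field.
function groupByRole(
  inputs: ReadonlyArray<RoledImageInput>,
): Record<ImageRole, ImagePart[]> {
  const grouped: Record<ImageRole, ImagePart[]> = {
    reference: [],
    element: [],
    start_frame: [],
    end_frame: [],
  }
  for (const input of inputs) {
    grouped[input.role].push(input.image)
  }
  return grouped
}
```

An adapter that only understands a starting frame could then read `start_frame` and reject the call if any other group is non-empty.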
