feat: add multi-modal attachment propagation to all worker agents thr…#1196

Merged
Wendong-Fan merged 15 commits into eigent-ai:main from bittoby:feat/multi-modal-worker-agents
Feb 21, 2026

Conversation

@bittoby
Contributor

@bittoby bittoby commented Feb 9, 2026

Enable Image Analysis for All Worker Agents

Problem

Only Multi-Modal Agent could analyze images. Worker agents (Developer, Browser, Document) couldn't see or process image attachments, causing tasks to fail.

Solution

  • Added ImageAnalysisToolkit to Developer, Browser, and Document agents
  • Registered toolkits with toolkits_to_register_agent parameter
  • Updated system prompts with explicit image analysis instructions
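
The registration change could be sketched as follows. This is a hypothetical illustration only: the class names (`ImageAnalysisToolkit`, `WorkerAgent`, `build_developer_agent`) stand in for the real eigent/camel classes and are not the actual codebase API.

```python
# Hypothetical sketch of the factory change described above.
class ImageAnalysisToolkit:
    """Minimal stand-in: exposes an image-analysis function as a tool."""

    def analyze_image(self, image_path: str) -> str:
        # In the real toolkit this would call a vision model.
        return f"description of {image_path}"

    def get_tools(self):
        return [self.analyze_image]


class WorkerAgent:
    """Minimal stand-in for a worker agent that accepts registered tools."""

    def __init__(self, system_prompt: str, tools):
        self.system_prompt = system_prompt
        self.tools = list(tools)


def build_developer_agent() -> WorkerAgent:
    # Register the toolkit so the agent can call analyze_image explicitly,
    # and prime the system prompt to use it for image attachments.
    toolkit = ImageAnalysisToolkit()
    prompt = (
        "You are the Developer Agent. If the task includes image "
        "attachments, analyze them with the analyze_image tool first."
    )
    return WorkerAgent(system_prompt=prompt, tools=toolkit.get_tools())
```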

Changes

Agent Factories:

  • backend/app/agent/factory/developer.py
  • backend/app/agent/factory/browser.py
  • backend/app/agent/factory/document.py

System Prompts:

  • backend/app/agent/prompt.py

Impact

All worker agents can now:

  • Analyze screenshots and images
  • Extract text from images
  • Answer questions about visual content
  • Process images alongside text in complex tasks

Testing

✅ Developer Agent: Extracts code from screenshots
✅ Browser Agent: Analyzes webpage screenshots
✅ Document Agent: Ready for document image analysis

Type

  • Feature
  • Bug Fix

Closes #956

@bittoby
Contributor Author

bittoby commented Feb 9, 2026

@Wendong-Fan Could you please review this PR? thanks

@Wendong-Fan Wendong-Fan added this to the Sprint 14 milestone Feb 9, 2026
@Wendong-Fan
Contributor

@Wendong-Fan Could you please review this PR? thanks

Thanks for @bittoby's contribution! Could @nitpicker55555 and @Zephyroam help check this?

@bittoby
Contributor Author

bittoby commented Feb 10, 2026

@Zephyroam @nitpicker55555 I would appreciate your feedback. Please review the PR! Thanks

Collaborator

@nitpicker55555 nitpicker55555 left a comment


Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?

Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.

@bittoby
Contributor Author

bittoby commented Feb 11, 2026

@nitpicker55555 I did test it before pushing the PR. When I attach a test image and use "As a developer, analyze this screenshot and write code" as the prompt, the developer agent generates HTML that matches the screenshot.

@bittoby
Contributor Author

bittoby commented Feb 11, 2026

Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?

Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.

you're right - I haven't added ImageAnalysisToolkit to the worker agents yet. The current implementation works because it relies on the LLM's native vision capability (like gpt-5), but I agree we should add the toolkit for explicit tool calls and non-vision model support. Would you prefer I complete the current approach by adding the toolkit, or switch to your suggestion of modifying the decompose prompt to selectively pass images?

@nitpicker55555
Collaborator

Obviously, you didn’t test your code. You didn’t register the ImageAnalysisToolkit for the agent—so how is the agent supposed to gain image-reading capability?
Additionally, this design is overly complicated. Why not simply modify the prompt of the decompose agent so that, when breaking down the task, it passes the image location to the corresponding agent? That way, we would only need to adjust the prompt and register the ImageAnalysisToolkit for the agent.

you're right - I haven't added ImageAnalysisToolkit to the worker agents yet. The current implementation works because it relies on the LLM's native vision capability (like gpt-5), but I agree we should add the toolkit for explicit tool calls and non-vision model support. Would you prefer I complete the current approach by adding the toolkit, or switch to your suggestion of modifying the decompose prompt to selectively pass images?

@bittoby Thank you for the explanation, but I still don’t understand: without providing a dedicated tool, how can the model read an image from a passed-in image URL instead of a base64-encoded input? Could you point out which part of your PR implements this functionality?

@bittoby
Contributor Author

bittoby commented Feb 11, 2026

Right now it only passes the image file paths along as additional_info. The current setup depends on the model's built-in vision, which needs base64 data or a URL in the actual message. My PR doesn't do that conversion or attach the images to the message - it just sends the paths as metadata. I misunderstood earlier; it's clear to me now, and I will update the PR.

For this to really work, we need to either:

  1. add ImageAnalysisToolkit to the worker agents, or
  2. change how we build the LLM request so it includes the images as base64.
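
Option 2 could look like this. A minimal sketch, assuming an OpenAI-style chat message format; `image_to_message_part` is a hypothetical helper, not the actual eigent request builder.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_message_part(path: str) -> dict:
    """Read an image from disk and wrap it as a base64 data URL in the
    OpenAI-style "image_url" content-part format, so a vision model can
    actually see it instead of just receiving the file path as text."""
    mime, _ = mimetypes.guess_type(path)
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime or 'image/png'};base64,{encoded}"},
    }
```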

Which option do you recommend?

@nitpicker55555
Collaborator

My idea is to tweak ImageAnalysisToolkit so it supports taking an image path as input and returning the actual image data back to the agent (right now ImageAnalysisToolkit only returns an image description).

The benefits are: the change is relatively small, we’d only need to update the decompose agent’s prompt to pass along the image path. It would also allow the agent to read other image files, not just images uploaded by the user in the prompt, and it could support users providing an image path instead of the image itself.
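
A minimal sketch of this tweak, with a hypothetical API (the real camel ImageAnalysisToolkit differs): one tool that either describes an image or returns its raw data, depending on an argument.

```python
import base64
from pathlib import Path


class ImageAnalysisToolkit:
    """Illustrative stand-in, not the real toolkit."""

    def analyze_image(self, image_path: str, return_raw: bool = False) -> str:
        data = Path(image_path).read_bytes()
        if return_raw:
            # "read-only" mode: hand the base64 payload back to the agent
            return base64.b64encode(data).decode("ascii")
        # normal mode: delegate to a vision model for a description
        return self._describe(data)

    def _describe(self, data: bytes) -> str:
        # Placeholder for the real vision-model call.
        return f"an image of {len(data)} bytes"
```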

I also looked into how Claude Code does image reading, and it works in a similar way, via a Read tool.

What do you think @Wendong-Fan @bittoby

@bittoby
Contributor Author

bittoby commented Feb 11, 2026

@nitpicker55555 Thanks for your explanation!
Your idea (extending ImageAnalysisToolkit) is a solid quick win, but I see some limitations:

Concerns:

  • Mixed responsibility: one toolkit ends up doing both “read” and “analyze”
  • Less reusable: e.g., Document Agent might only need to read images for embeddings, but it would still have to pull in the whole analysis toolkit
  • Heavier deps everywhere: any agent that needs to read an image would also load the analysis dependencies

My approach (like Claude Code): make a small, separate tool:

  • read_image(path) -> base64
  • Keep ImageAnalysisToolkit focused on analysis only
  • More flexible: agents can read images for embeddings, uploads, etc.

But it requires more code, a bit more wiring

I agree with your approach for now since it's faster to ship. We can refactor to a separate ReadImageTool later if we see reusability issues.

@Wendong-Fan what's your preference?

@nitpicker55555
Collaborator


@bittoby What do you mean by the document agent reading an image for embedding?

@bittoby
Contributor Author

bittoby commented Feb 11, 2026

By "embedding" I mean putting the image into a document, not analyzing it.
For example: a user uploads screenshot.png and asks, "Create a PDF report with this screenshot."
The Document Agent then needs to read the PNG from disk, convert it to base64 data, and insert it into the PDF. It doesn't need to analyze the PNG, describe it, or extract text from it.
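
The read-without-analyze flow could be sketched like this. Illustrative only: `embed_image_in_html_report` is a hypothetical helper, and HTML is used to stay dependency-free; a real PDF path would go through a PDF library instead.

```python
import base64
from pathlib import Path


def embed_image_in_html_report(image_path: str, title: str) -> str:
    # Read the bytes and inline them as a data URI -- the image is
    # embedded into the document without any vision model involved.
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return (
        f"<html><body><h1>{title}</h1>"
        f'<img src="data:image/png;base64,{encoded}"/>'
        "</body></html>"
    )
```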

@nitpicker55555
Collaborator

@bittoby It seems you haven’t reviewed this tool at all. I strongly recommend checking the codebase before continuing the discussion.
ImageAnalysisToolkit has only one dependency—Pillow—which is also required for simple image reading. The difference between this tool and the “read-only image” functionality we need is merely whether we leverage the built-in agent to generate a description. From the agent’s perspective, the only difference lies in the arguments passed in.
We could, of course, implement a separate tool as you suggested, but the supposed benefit of being “more flexible: agents can read images for embeddings, uploads, etc.” does not really exist.

@bittoby
Contributor Author

bittoby commented Feb 11, 2026

Okay. @nitpicker55555 I’ll update the PR to follow your approach.

@bittoby bittoby closed this Feb 11, 2026
@bittoby bittoby force-pushed the feat/multi-modal-worker-agents branch from 1e21ba1 to 805ed97 Compare February 11, 2026 16:59
@bittoby bittoby reopened this Feb 11, 2026
@bittoby bittoby closed this Feb 11, 2026
@bittoby bittoby force-pushed the feat/multi-modal-worker-agents branch from 1e21ba1 to 53d8830 Compare February 11, 2026 17:11
@bittoby bittoby reopened this Feb 11, 2026
…ocument agents by integrating ImageAnalysisToolkit with proper agent registration and explicit priority instructions in system prompts
@bittoby bittoby force-pushed the feat/multi-modal-worker-agents branch from 698b49b to dba3f58 Compare February 11, 2026 19:03
@bittoby
Contributor Author

bittoby commented Feb 11, 2026

@nitpicker55555 I updated the PR to follow your feedback. Please review again.

@bittoby
Contributor Author

bittoby commented Feb 12, 2026

@Wendong-Fan @nitpicker55555 I would appreciate your feedback.

@bittoby
Contributor Author

bittoby commented Feb 13, 2026

test.webm

@nitpicker55555 Please check this test video. My changes work well.

…ker agents

ScreenshotToolkit already provides read_image capability via the agent's
own vision model, making ImageAnalysisToolkit redundant. Add ScreenshotToolkit
to browser and document agents for image reading support, and revert all
ImageAnalysisToolkit additions from worker agents.
@nitpicker55555
Collaborator

nitpicker55555 commented Feb 13, 2026

I can see the problem now: the developer agent already has ScreenshotToolkit for image reading, so we can use this tool directly. Check bittoby#1, and test the browser agent/document agent to see if they can read images. You can verify this by modifying the generated task plan to: "use browser agent/document agent to check the xxx image file content".

@bittoby
Contributor Author

bittoby commented Feb 13, 2026

browser.webm

The browser agent works, too.

@bittoby
Contributor Author

bittoby commented Feb 13, 2026

@nitpicker55555 I merged your changes. All agents now handle image reading well. I think it's okay to merge this PR. @Wendong-Fan

Comment on lines 20 to 22
from app.utils.listen.toolkit_listen import auto_listen_toolkit


Collaborator


You do not need to modify this file anymore since it is not used.

Others LGTM

@bittoby
Contributor Author

bittoby commented Feb 13, 2026

@nitpicker55555 I reverted image_analysis_toolkit since it is not used. It's ready to merge.
Thank you

Collaborator

@nitpicker55555 nitpicker55555 left a comment


Thanks @bittoby ! @Wendong-Fan can you take a look for this?

@bittoby
Contributor Author

bittoby commented Feb 14, 2026

@Wendong-Fan Would you review this PR and merge it if there are no problems?
Thank you

@Zephyroam
Collaborator

Zephyroam commented Feb 17, 2026

Screenshot 2026-02-17 at 11 33 41

Do we need to update the agent skills here?

@Zephyroam
Collaborator

Zephyroam commented Feb 17, 2026

I tested several times. All of them failed. Could you upload a successful demo?

Logs here
eigent-0.0.83-darwin-arm64-1771326253031.log

@bittoby
Contributor Author

bittoby commented Feb 17, 2026

browser.mp4
developer.mp4

@Zephyroam It works well without any errors. Please check these demos.

@bittoby
Contributor Author

bittoby commented Feb 18, 2026

@Wendong-Fan Please review this PR; I would appreciate it if you could merge it if there are no problems.

@bittoby bittoby force-pushed the feat/multi-modal-worker-agents branch from ed49def to 93dba85 Compare February 19, 2026 12:12
@bittoby
Contributor Author

bittoby commented Feb 19, 2026

@Wendong-Fan @bytecii would appreciate feedback.
thank you

Contributor

@Wendong-Fan Wendong-Fan left a comment


Thanks for @bittoby's PR, and sorry for the late review. One issue to address: ScreenshotToolkit is now used in multiple agents, which can misattribute toolkit activation/deactivation events in workflow logs/UI. We need to pass agent_name when initializing ScreenshotToolkit for those agents; I will fix this in another commit.

cc @nitpicker55555 @Zephyroam

@Wendong-Fan Wendong-Fan merged commit 1831d2a into eigent-ai:main Feb 21, 2026
6 checks passed
@bittoby bittoby deleted the feat/multi-modal-worker-agents branch February 22, 2026 01:24


Development

Successfully merging this pull request may close these issues.

[Feature Request] All worker could accept multi-modal information

4 participants