Let me tell you something, about multi-modal

Amitava_Ghosh · July 11, 2025, 12:06pm

The multimodal thing just plain sucks, and it's not a CrewAI issue.

So, I took an example right out of CrewAI's docs:

import sys

from crewai import Agent, Crew, Task
from crewai.llm import LLM

image_url = sys.argv[1]

# Create a multimodal agent
image_analyst = Agent(
    role="Product Analyst",
    goal="Analyze product images and provide detailed descriptions",
    backstory="Expert in visual product analysis with deep knowledge of design and features",
    llm=LLM(model="openai/gpt-4.1", temperature=1.0),
    multimodal=True,
    verbose=True,
)

# Create a task for image analysis
task = Task(
    description=f"Analyze the product image at {image_url} and provide a detailed description",
    expected_output="A detailed description of the product image",
    agent=image_analyst,
)

# Create and run the crew
crew = Crew(agents=[image_analyst], tasks=[task], verbose=True)

result = crew.kickoff()
print(result.raw)

Run with:

uv run --active run.py "https://images.unsplash.com/photo-1554866585-cd94860890b7?q=80&w=1065&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"

The image is of a Coke can, but the agent says:

The product in the image is a classic Moka pot, commonly known as a stovetop espresso maker. Its design features a polished metal body, likely made of aluminum or stainless steel, with an octagonal, faceted shape. The pot consists of three main chambers: a lower chamber for water, a middle funnel for ground coffee, and an upper chamber for collecting brewed coffee.

The Moka pot has a black, heat-resistant handle and a matching knob on the lid, likely made from plastic or bakelite, to facilitate safe handling. The overall construction shows visible joints and rivets, pointing to its durability and robust build. The shiny metal surface gives it a clean, reflective appearance. The product’s form is iconic and practical, balancing traditional craftsmanship with modern usability, making it easy to assemble, use, and clean. This coffee maker is designed for stovetop brewing and is celebrated for its rich, concentrated coffee output and timeless design appeal.

I was worried that the image URL had been copied wrong or something, but that didn't happen. So I tried it on ChatGPT, and it called the Coke can a dog.

I am sure it would work if we uploaded the image, because the API works:

curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "what is in this image?"},
          {
            "type": "input_image",
            "image_url": "https://images.unsplash.com/photo-1554866585-cd94860890b7?q=80&w=1065&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
          }
        ]
      }
    ]
  }'

This shit is documented in the OpenAI API docs.
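For anyone who prefers Python over curl, here is a minimal sketch of the same request using only the standard library. The helper names `build_payload` and `describe_image` are made up for illustration; the endpoint, model, and payload shape are copied from the curl call above. It assumes `OPENAI_API_KEY` is set in the environment and does no error handling.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example above.
RESPONSES_URL = "https://api.openai.com/v1/responses"


def build_payload(
    image_url: str,
    question: str = "what is in this image?",
    model: str = "gpt-4.1",
) -> dict:
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": question},
                    {"type": "input_image", "image_url": image_url},
                ],
            }
        ],
    }


def describe_image(image_url: str) -> dict:
    """POST the payload and return the decoded JSON response (hypothetical helper)."""
    request = urllib.request.Request(
        RESPONSES_URL,
        data=json.dumps(build_payload(image_url)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the key is exported as OPENAI_API_KEY, as in the curl call.
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `describe_image(url)` with the Unsplash URL from above sends the image to the model as an actual image input, rather than as a URL embedded in the task text.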

It would be really great if we could change this and instead use the right approach to do an actual analysis.
There is no point in having a feature that is the conventional way to do things but in the end sucks.

Thanks,

Amitava_Ghosh · July 11, 2025, 1:12pm

This is how you do multimodal image analysis:

from typing import Optional

from crewai.tools.agent_tools.add_image_tool import AddImageToolSchema
from crewai.tools.base_tool import BaseTool
from crewai.utilities import I18N
from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

i18n = I18N()


class ImageVisionTool(BaseTool):
    """Tool for adding images to the content"""

    name: str = "ImageInfo"  # type: ignore
    description: str = Field(default_factory=lambda: i18n.tools("add_image")["description"])  # type: ignore
    args_schema: type[BaseModel] = AddImageToolSchema
    model: str = "openai:gpt-4.1"
    model_config: dict = {}

    def _run(
        self,
        image_url: str,
        action: Optional[str] = None,
        **kwargs,
    ) -> str:
        action = action or i18n.tools("add_image")["default_action"]  # type: ignore

        llm = init_chat_model(model=self.model, **self.model_config)

        # Define prompt
        prompt = ChatPromptTemplate(
            [
                {
                    "role": "system",
                    "content": "Describe the image provided.",
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "input_image",
                            "image_url": "{image_url}",
                        },
                    ],
                },
            ]
        )

        chain = prompt | llm
        response = chain.invoke({"image_url": image_url})

        return response.text()
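
One detail worth keeping in mind when adapting the tool: OpenAI's two endpoints expect different shapes for an image part. The Responses API (the curl call earlier in the thread) takes `input_image` with a plain URL string, while Chat Completions takes `image_url` with a nested `url` object. Here is a small reference sketch of both, with hypothetical helper names. Which shape your LangChain version actually sends on the wire depends on its own conversion, so this is not a description of what the tool above does internally.

```python
def responses_image_part(image_url: str) -> dict:
    """Image content part in the Responses API shape (hypothetical helper)."""
    return {"type": "input_image", "image_url": image_url}


def chat_completions_image_part(image_url: str) -> dict:
    """Image content part in the Chat Completions shape (hypothetical helper)."""
    return {"type": "image_url", "image_url": {"url": image_url}}
```

If a provider rejects the request with an invalid content type error, a mismatch between these two shapes is a likely first thing to check.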
