The multimodal feature just plain sucks, and it’s not a crew issue.
So, I took an example right out of crew’s docs:
import sys
from crewai import Agent, Crew, Task
from crewai.llm import LLM
image_url = sys.argv[1]
# Create a multimodal agent
image_analyst = Agent(
    role="Product Analyst",
    goal="Analyze product images and provide detailed descriptions",
    backstory="Expert in visual product analysis with deep knowledge of design and features",
    llm=LLM(model="openai/gpt-4.1", temperature=1.0),
    multimodal=True,
    verbose=True,
)
# Create a task for image analysis
task = Task(
    description=f"Analyze the product image at {image_url} and provide a detailed description",
    expected_output="A detailed description of the product image",
    agent=image_analyst,
)
# Create and run the crew
crew = Crew(agents=[image_analyst], tasks=[task], verbose=True)
result = crew.kickoff()
print(result.raw)
Run with:
uv run --active run.py "https://images.unsplash.com/photo-1554866585-cd94860890b7?q=80&w=1065&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
The image is of a Coke can, but the agent says:
The product in the image is a classic Moka pot, commonly known as a stovetop espresso maker. Its design features a polished metal body, likely made of aluminum or stainless steel, with an octagonal, faceted shape. The pot consists of three main chambers: a lower chamber for water, a middle funnel for ground coffee, and an upper chamber for collecting brewed coffee.
The Moka pot has a black, heat-resistant handle and a matching knob on the lid, likely made from plastic or bakelite, to facilitate safe handling. The overall construction shows visible joints and rivets, pointing to its durability and robust build. The shiny metal surface gives it a clean, reflective appearance. The product’s form is iconic and practical, balancing traditional craftsmanship with modern usability, making it easy to assemble, use, and clean. This coffee maker is designed for stovetop brewing and is celebrated for its rich, concentrated coffee output and timeless design appeal.
I was worried that the image URL had been copied wrong or something, but that wasn’t the case. So I tried the same URL on ChatGPT, and it called the Coke can a dog.
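For anyone who wants to rule out the URL themselves, here is a minimal sketch of the kind of check I mean. The helper names (`is_image_content_type`, `url_serves_image`) and the `User-Agent` string are made up for illustration, and it only looks at the `Content-Type` header the server returns, so treat it as a rough sanity check rather than proof that the image is the right one.

```python
import urllib.request


def is_image_content_type(content_type):
    """Return True when a Content-Type header value names an image type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=..." before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def url_serves_image(url, timeout=10):
    """Fetch the URL and report whether the server labels it as an image.

    Hypothetical helper: the User-Agent value is arbitrary.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "url-check/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return is_image_content_type(reply.headers.get("Content-Type"))
```

Calling `url_serves_image(image_url)` with the Unsplash link from the run command is enough to tell a broken link apart from a model that never looked at the picture.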
I’m sure it would work if we actually uploaded the image, because the API itself works:
curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "what is in this image?"},
          {
            "type": "input_image",
            "image_url": "https://images.unsplash.com/photo-1554866585-cd94860890b7?q=80&w=1065&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
          }
        ]
      }
    ]
  }'
This stuff is documented in the OpenAI API docs.
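Since the crew example is Python, here is roughly the same request as the curl call, sketched with only the standard library. The function names (`build_payload`, `extract_text`, `describe_image`) are mine, not from crew or the OpenAI SDK, and `extract_text` assumes the reply has an `output` list whose `message` items carry `output_text` parts, so adjust it if the response looks different for you.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/responses"


def build_payload(image_url, prompt="what is in this image?", model="gpt-4.1"):
    """Build the same request body as the curl call: the image goes in as
    an input_image part, not as a URL pasted into the prompt text."""
    return {
        "model": model,
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": prompt},
                    {"type": "input_image", "image_url": image_url},
                ],
            }
        ],
    }


def extract_text(response):
    """Collect the output_text parts from a reply (assumed response shape)."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "\n".join(parts)


def describe_image(image_url):
    """POST the payload to the Responses API and return the model's text."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(image_url)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_text(json.load(reply))
```

With `OPENAI_API_KEY` set, `print(describe_image(image_url))` is the Python counterpart of the curl command. The point is the `input_image` part: the model receives the picture itself instead of having to guess from a URL string in the task description.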
It would be really great if we could change this and instead pass the image to the model as actual image input, the way the API call above does, so that it performs a real analysis.
There is no point in having a feature that is presented as the conventional way to do things but, in the end, doesn’t work.
Thanks,