CrewAI Multimodal Capability

Hello Everyone!

I want to analyze screenshots for their visual cues and elements. Can the Vision tool help with this, or is it just a kind of OCR that only extracts text from images? Also, does multimodal mode currently work with image links? Has anyone tried it, and if so, with which LLM? I've seen some people commenting that it doesn't work. I want to replicate what happens when you give ChatGPT an image and ask it to analyze it: it can describe details like which buttons are present and which colors are used.

Please let me know.

Thanks!


Hi @Aashay_Kulkarni,

Thank you for joining the CrewAI community and for asking the question.

You can use the multimodal=True parameter in your agent class definition like this example:

from crewai import Agent, Task, Crew, Process, LLM
import datetime
llm = LLM(model="gemini/gemini-1.5-flash", temperature=0)


# Define a multimodal agent with vision capabilities
researcher = Agent(
    role="Product Quality Inspector",
    goal="Analyze product {image} and report on quality attributes",
    backstory="Senior visual inspector with extensive industry knowledge in product quality inspection",
    llm=llm,
    multimodal=True,
    verbose=True
)

# Create a task that involves image analysis
task = Task(
    description="Analyze the product image at {image} and provide a detailed report",
    expected_output="A detailed report on quality assessment with today's {date}.",
    agent=researcher
)

# Assemble the crew and execute tasks sequentially
crew = Crew(
    agents=[researcher],
    tasks=[task],
    process=Process.sequential,
    verbose=True
)

inputs = {
    'image': 'https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL0ZIm6wh06croVC6i2mo_vWm6MDYZPgd5h0jXQGB13-4Z1ugRQD5OuF9HnZEOfe6mC_62S_bPiOIocO1ljytrOfwxNsOynJO8TYzJw31NkfG4cVwOJ-kKnrTBtZ_wC2A3YuAdaVOO7YA/s1600/broken+cup+2.jpg',
    'date': datetime.datetime.now().strftime("%Y-%m-%d")
}

result = crew.kickoff(inputs=inputs)
print(result)
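A note on the `{image}` and `{date}` placeholders above: `crew.kickoff(inputs=...)` interpolates them into the task description and expected output from the `inputs` dict, much like Python's own `str.format`. A minimal sketch of the idea (the URL here is a placeholder, not the one from the example):

```python
description = "Analyze the product image at {image} and provide a detailed report"

inputs = {
    "image": "https://example.com/broken_cup.jpg",
    "date": "2024-01-01",
}

# Conceptually, CrewAI fills each {placeholder} from the inputs dict;
# str.format ignores extra keys like "date" that the template doesn't use.
print(description.format(**inputs))
```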

You can try it here in Colab: Google Colab


Hi @tonykipkemboi! Does it work with every multimodal LLM, or just with Gemini?

This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.

It should work with any multimodal LLM; I haven't tried them all, though.
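For what it's worth, CrewAI's `LLM` class routes calls through LiteLLM, which identifies models with a "provider/model" string, so trying a different multimodal model is usually just a matter of changing that one string. The model names below are illustrative examples, not an exhaustive or guaranteed list; you'd need access to the model and the matching API key set in your environment:

```python
# Illustrative "provider/model" strings in LiteLLM's naming scheme
# (assumption: your account has access and the provider's API key is set).
candidates = [
    "gemini/gemini-1.5-flash",
    "openai/gpt-4o",
    "anthropic/claude-3-5-sonnet-20240620",
]

# The prefix before the slash selects the provider; the rest names the model.
for model in candidates:
    provider, _, name = model.partition("/")
    print(f"provider={provider}, model={name}")
```

So swapping models in the example above would just mean changing the `model=` argument of `LLM(...)`.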