CrewAI Multimodal Capability

Hello Everyone!

I want to analyze screenshots for their visual cues and elements. Can the Vision tool help with this, or is it just a kind of OCR that only extracts text from images? Also, does multimodal mode currently work with image links? Has anyone tried it, and if so, with which LLM? I've seen some people commenting that it doesn't work. I want to replicate what happens when you give ChatGPT an image and ask it to analyze it: it can describe details like which buttons are present and which colors are used.

Please let me know.

Thanks!


Hi @Aashay_Kulkarni,

Thank you for joining the CrewAI community and for asking the question.

You can use the multimodal=True parameter in your agent class definition like this example:

from crewai import Agent, Task, Crew, Process, LLM
import datetime
llm = LLM(model="gemini/gemini-1.5-flash", temperature=0)


# Define a multimodal agent with vision capabilities
researcher = Agent(
    role="Product Quality Inspector",
    goal="Analyze product {image} and report on quality attributes",
    backstory="Senior visual inspector with extensive industry knowledge in product quality inspection",
    llm=llm,
    multimodal=True,
    verbose=True
)

# Create a task that involves image analysis
task = Task(
    description="Analyze the product image at {image} and provide a detailed report",
    expected_output="A detailed report on quality assessment with today's {date}.",
    agent=researcher
)

# Assemble the crew and execute tasks sequentially
crew = Crew(
    agents=[researcher],
    tasks=[task],
    process=Process.sequential,
    verbose=True
)

inputs = {
    'image': 'https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhL0ZIm6wh06croVC6i2mo_vWm6MDYZPgd5h0jXQGB13-4Z1ugRQD5OuF9HnZEOfe6mC_62S_bPiOIocO1ljytrOfwxNsOynJO8TYzJw31NkfG4cVwOJ-kKnrTBtZ_wC2A3YuAdaVOO7YA/s1600/broken+cup+2.jpg',
    'date': datetime.datetime.now().strftime("%Y-%m-%d")
}

result = crew.kickoff(inputs=inputs)
print(result)
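A note on the `{image}` and `{date}` placeholders above: `crew.kickoff(inputs=...)` interpolates them into the task description and expected output from the `inputs` dict, much like Python's own `str.format`. A minimal sketch of the idea (the URL here is a placeholder, not the one from the example):

```python
description = "Analyze the product image at {image} and provide a detailed report"

inputs = {
    "image": "https://example.com/broken_cup.jpg",
    "date": "2024-01-01",
}

# Conceptually, CrewAI fills each {placeholder} from the inputs dict;
# str.format ignores extra keys like "date" that the template doesn't use.
print(description.format(**inputs))
```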

You can try it here in Colab: Google Colab


Hi @tonykipkemboi! Does it work with every multimodal LLM, or just with Gemini?

This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.

It should work with any multimodal LLM; I haven't tried them all, though.
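For what it's worth, CrewAI's `LLM` class routes calls through LiteLLM, which identifies models with a "provider/model" string, so trying a different multimodal model is usually just a matter of changing that one string. The model names below are illustrative examples, not an exhaustive or guaranteed list; you'd need access to the model and the matching API key set in your environment:

```python
# Illustrative "provider/model" strings in LiteLLM's naming scheme
# (assumption: your account has access and the provider's API key is set).
candidates = [
    "gemini/gemini-1.5-flash",
    "openai/gpt-4o",
    "anthropic/claude-3-5-sonnet-20240620",
]

# The prefix before the slash selects the provider; the rest names the model.
for model in candidates:
    provider, _, name = model.partition("/")
    print(f"provider={provider}, model={name}")
```

So swapping models in the example above would just mean changing the `model=` argument of `LLM(...)`.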