Crewai + multimodal

Paarttipaabhalaji · November 26, 2024, 5:53pm

Hi Team,

I need to use a multimodal LLM in CrewAI. With multimodal models, the image usually needs to be converted to a base64 string, and that encoded value is sent to the LLM as an “image_url” entry in the message request body. I would like to know whether CrewAI has an option to send the encoded image via CrewAI’s LLM() class.

Here is my request body:

import requests

# system_prompt, user_text_input, imgBase64EncValue, credentials, and headers
# are assumed to be defined earlier.
request_body_sample = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_text_input},
                # The image is passed as a base64 data URL.
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{imgBase64EncValue}"}},
            ],
        },
    ],
    "project_id": credentials.get("project_id"),
    "model_id": "watsonx/meta-llama/llama-3-2-90b-vision-instruct",
    "decoding_method": "sample",
    "random_seed": 568743,
    "temperature": 0,
    "top_k": 50,
    "top_p": 1,
    "repetition_penalty": 1,
    "max_tokens": 8000,
}

response = requests.post(
    credentials.get("url"),
    headers=headers,
    json=request_body_sample,
)
if response.status_code != 200:
    raise Exception("Non-200 response: " + str(response.text))
data = response.json()
print(data["choices"][0]["message"]["content"])

Alternatively, please suggest a better way to use multimodal models in CrewAI.

When I asked the documentation chatbot about “multimodal support”, it replied with a message like:

crewai doesn’t support multimodal?

Thanking you.
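
The request body in the post above assumes `imgBase64EncValue` already holds the encoded image. A minimal sketch of producing that value and the matching `image_url` content part with only the Python standard library might look like the following. The helper names `image_to_data_url` and `image_part` are hypothetical, and guessing the MIME type from the file extension (with a JPEG fallback) is an assumption about how the image is stored.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        # Assumption: fall back to JPEG when the extension is unknown.
        mime = "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build the 'image_url' content part used in the request body above."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The dictionary returned by `image_part` can take the place of the hand-written `image_url` entry in the user message's `content` list.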

uma-08 · November 30, 2024, 3:10pm

hey @Paarttipaabhalaji did you find a solution for this?

Paarttipaabhalaji · December 2, 2024, 4:34am

No @uma-08, please help me with this.

uma-08 · December 2, 2024, 6:52pm

Sure, I would love to discuss workflows for this; I’m also stuck on it. What’s the best way to connect with you, @Paarttipaabhalaji?

Paarttipaabhalaji · December 3, 2024, 6:37am

Kindly connect with me on LinkedIn.

matt · December 3, 2024, 12:02pm

We do sort of support multimodal through one of our tools, the Vision Tool, but it is only a tool, so it may or may not help.

Paarttipaabhalaji · December 4, 2024, 6:01am

@matt I need to use multimodal from the provider watsonx.

Colin_Ng · December 5, 2024, 5:21am

I also faced the same issue.

I want to pass the image path directly to my local multimodal model, without using an additional tool.

Paarttipaabhalaji · December 18, 2024, 12:02pm

@matt, any guidance or update on this?

MayInRain · January 6, 2025, 12:48pm

Same here! I’m trying to work with two images using a multimodal model, but can only send them as base64 strings.
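
For the two-image case, the chat-completions message shape shown in the first post allows more than one `image_url` part in the same `content` list. The sketch below builds such a message from raw image bytes. `build_user_message` is a hypothetical helper, and whether a given provider or model accepts several images in one message is an assumption that this thread does not confirm.

```python
import base64


def build_user_message(text: str, images: list[bytes], mime: str = "image/jpeg") -> dict:
    """Build one user message with a text part followed by one part per image."""
    content = [{"type": "text", "text": text}]
    for raw in images:
        encoded = base64.b64encode(raw).decode("ascii")
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}
        )
    return {"role": "user", "content": content}
```

The returned dictionary goes into the `messages` list next to the system message, in the same position as the single-image user message in the first post.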

BradLeon · April 9, 2025, 2:48pm

Same issue, looking forward to a solution.

Max_Moura · April 10, 2025, 10:44am

It’d be great if you could provide a detailed description of what you’re trying to accomplish, how you’re attempting to do it, and ideally, include a short code snippet. That way, you’ll have a much better shot at getting the help you need.

netcmcc · April 10, 2025, 11:50am

github.com/crewAIInc/crewAI

[BUG]Agent multimodal cannot send images to LLM correctly

opened 11:02AM - 10 Apr 25 UTC

netcmcc

bug

### Description

The multimodality feature in CrewAI v0.114.0 does not properly handle image inputs when sending requests to LLMs. The agent fails to correctly format and send image data to the LLM, resulting in incomplete or failed image analysis tasks.

Because the image was not submitted to LLM in the format agreed by OpenAI API, the image type task did not work properly. See the log for details.

https://platform.openai.com/docs/guides/images?format=base64-encoded#provide-multiple-image-inputs
https://docs.anthropic.com/en/docs/build-with-claude/vision

### Steps to Reproduce

1. Create an agent with `multimodal=True`
2. Set up a task that involves image analysis
3. Run the crew with an image URL

### Expected behavior

According to the OpenAI API documentation, the request should be formatted to properly handle multimodal inputs. The image URL should be sent as part of a structured content array with proper type specifications.

Expected request format (from OpenAI API documentation):

```json
{
  "model": "gpt-4o",
  "input": [
    {
      "role": "user",
      "content": [
        {"type": "input_text", "text": "what is in this image?"},
        {
          "type": "input_image",
          "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
        }
      ]
    }
  ]
}
```

### Screenshots/Code snippets

## Test Code

```python
from crewai import Agent, Task, Crew, LLM
from dotenv import load_dotenv
import logging

# Configure logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

# Define variables
IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

# Configure LLM using CrewAI's configuration
llm = LLM(
    model="openrouter/openai/gpt-4o"
)

# Create a multimodal agent for image analysis
image_analyst = Agent(
    role="Image Analyst",
    goal="Analyze and describe image contents",
    backstory="Expert in image analysis and description",
    multimodal=True,
    llm=llm
)

# Create a task for image analysis with simple prompt
task = Task(
    description=f"what's in the image? {IMAGE_URL}",
    expected_output="A description of what's in the image",
    agent=image_analyst
)

# Create and run the crew
crew = Crew(
    agents=[image_analyst],
    tasks=[task]
)

result = crew.kickoff()
print(result)
```

### Operating System

macOS Sonoma

### Python Version

3.12

### crewAI Version

0.114.0

### crewAI Tools Version

0.40.1

### Virtual Environment

Venv

### Evidence

## Actual Behavior

The request sent to the LLM does not properly format the image data. From the logs, we can see that the image URL is being sent as plain text in the message content rather than being properly structured as a multimodal input.

Logs excerpt:

```
2025-04-10 18:38:56,411 - DEBUG - POST Request Sent from LiteLLM:
curl -X POST \
https://openrouter.ai/api/v1/chat/completions \
-H 'HTTP-Referer: *****' -H 'X-Title: *****' -H 'Authorization: Bearer sk-or-v1-aaa075b9de083684530d********************************************' \
-d '{'model': 'openai/gpt-4o', 'messages': [{'role': 'system', 'content': 'You are Image Analyst. Expert in image analysis and description\nYour personal goal is: Analyze and describe image contents\nYou ONLY have access to the following tools, and should NEVER make up tools that are not listed here:\n\nTool Name: Add image to content\nTool Arguments: {\'image_url\': {\'description\': \'The URL or path of the image to add\', \'type\': \'str\'}, \'action\': {\'description\': \'Optional context or question about the image\', \'type\': \'Union[str, NoneType]\'}}\nTool Description: See image to understand its content, you can optionally ask a question about the image\n\nIMPORTANT: Use the following format in your response:\n\n```\nThought: you should always think about what to do\nAction: the action to take, only one name of [Add image to content], just the name, exactly as it\'s written.\nAction Input: the input to the action, just a simple JSON object, enclosed in curly braces, using " to wrap keys and values.\nObservation: the result of the action\n```\n\nOnce all necessary information is gathered, return the following format:\n\n```\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n```'}, {'role': 'user', 'content': "\nCurrent Task: what's in the image? https://cdn-picture.jingdaka.com/backend_pic/dst/poster/6ob/2025/03/24/3d6fdf18-aa8d-b7a9-a121-83691464e955.jpg\n\nThis is the expected criteria for your final answer: A description of what's in the image\nyou MUST return the actual complete content as the final answer, not a summary.\n\nBegin! This is VERY important to you, use the tools available and give your best Final Answer, your job depends on it!\n\nThought:"}, {'role': 'assistant', 'content': "{'role': 'user', 'content': [{'type': 'text', 'text': 'Describe the contents of the image'}, {'type': 'image_url', 'image_url': {'url': 'https://cdn-picture.jingdaka.com/backend_pic/dst/poster/6ob/2025/03/24/3d6fdf18-aa8d-b7a9-a121-83691464e955.jpg'}}]}"}, {'role': 'assistant', 'content': 'To analyze and describe the content of the image provided, I will use the image analysis tool.\n\nAction: Add image to content\nAction Input: {"image_url": "https://cdn-picture.jingdaka.com/backend_pic/dst/poster/6ob/2025/03/24/3d6fdf18-aa8d-b7a9-a121-83691464e955.jpg", "action": "Describe the contents of the image"}'}], 'stop': ['\nObservation:'], 'stream': False}'
```

### Possible Solution

none

### Additional context

- The issue appears to be in how CrewAI formats the request to the LLM
- The current implementation treats the image URL as plain text rather than properly structuring it as a multimodal input
- This prevents the LLM from properly processing and analyzing the image content

## Suggested Fix

The CrewAI library should be updated to properly format multimodal requests according to the OpenAI API specification. The request formatting should be modified to:

1. Structure the content as an array of different content types
2. Properly specify the image URL as an `input_image` type
3. Include the appropriate content type headers and request structure

Would appreciate any guidance or updates on when this functionality might be properly implemented.
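
The difference the issue describes, an image sent as plain text versus an image sent as a structured content part, can be checked mechanically on a logged `messages` list. The sketch below is only a diagnostic aid and not part of CrewAI or LiteLLM. `has_structured_image` is a hypothetical helper, the `example.com` URL is a placeholder, and the check assumes the chat-completions style `image_url` part that appears in the log excerpt.

```python
def has_structured_image(messages: list[dict]) -> bool:
    """Return True if any message carries an image as a structured content part."""
    for message in messages:
        content = message.get("content")
        # A plain string cannot carry an image part, even if it mentions a URL.
        if not isinstance(content, list):
            continue
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                return True
    return False


# Shape seen in the log: the image part is serialized inside a string.
logged_style = [
    {"role": "user", "content": "what's in the image? https://example.com/cat.jpg"},
    {"role": "assistant", "content": "{'role': 'user', 'content': [{'type': 'image_url'}]}"},
]

# Structured shape: 'content' is a list containing a real image part.
structured_style = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "what's in the image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    },
]
```

Running the check against the two samples separates them: `logged_style` has no structured image part because every `content` value is a string, while `structured_style` does.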

like · June 10, 2025, 1:58pm

Hello, have you solved it successfully? I am currently working on a project on video content understanding and need to convert multiple video frames into base64 format for sending!
