Multimodal agents dont work with Gemini due to LiteLLM errors

This code works:

import os
from dotenv import load_dotenv
from crewai import Agent, Task, Crew, LLM
import google.generativeai as genai

# Load environment variables
load_dotenv()

# Initialize Gemini
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
if not GOOGLE_API_KEY:
    raise ValueError("Please set GOOGLE_API_KEY environment variable")

genai.configure(api_key=GOOGLE_API_KEY)

# Configure CrewAI to use Gemini
llm = LLM(
    model="gemini/gemini-1.5-pro-latest",
    temperature=0.7,
    api_key=GOOGLE_API_KEY
)

def analyze_image():
    print("\nStarting image analysis...\n")
    
    # Validate image path
    image_path = "data/test2.png"
    abs_image_path = os.path.abspath(image_path)
    
    if not os.path.exists(abs_image_path):
        raise FileNotFoundError(f"Image not found at: {abs_image_path}")
    
    print(f"Found image at: {abs_image_path}")
    
    # Create a multimodal image analyst agent
    image_analyst = Agent(
        role='Image Analyst',
        goal='Analyze images and provide detailed, accurate descriptions with meaningful insights',
        backstory='''You are an expert image analyst with years of experience in visual content 
        interpretation. You have a keen eye for detail and can identify subtle elements that others 
        might miss. Your expertise spans across various types of images, from photographs to 
        technical diagrams, and you excel at providing clear, structured analysis.''',
        llm=llm,
        multimodal=True,  # Enable multimodal capabilities
        verbose=True,  # Enable detailed execution logs for debugging
        max_iter=1
    )

    # Create an image analysis task
    image_task = Task(
        description=f'''Analyze the image at {abs_image_path} and generate a comprehensive description.
        Focus on:
        1. Main subjects and objects in the image
        2. Visual composition, colors, and lighting
        3. Any text or notable details
        4. Context and setting
        5. Any notable patterns or relationships between elements''',
        expected_output='''A detailed, structured analysis of the image covering all the requested focus areas.
        The output should be clear, concise, and organized into sections for easy reading.''',
        agent=image_analyst
    )

    # Create and run the crew
    crew = Crew(
        agents=[image_analyst],
        tasks=[image_task],
        verbose=True  # Enable detailed execution logs
    )

    result = crew.kickoff()
    print("\nImage Analysis Result:")
    print(result)

if __name__ == "__main__":
    analyze_image()

However the console output shows either a 429, 500 or 503 error (regardless of how many requests are being sent):

I have tried to use max_iter=1, max_execution_time=5, max_rpm=5 and max_retry_limit - no matter what I try, the result is always the same: a literal spam of 429, 500 or 503 errors.

What confuses me, is when I make a similar request using the gemini SDK it works fine, I never get 429, 500 or 503 errors. Also using Crew AI with Gemini just for text also works fine - no errors.

Has anybody gotten multimodal agents working with Gemini? Do we know if it works?

Let me know if I can provide any additional information.

2 Likes

Same here! I’ve got EXACTLY the same error (but for text only)… LiteLLM really starts to annoy me; it seems impossible to use gemini reliably with Crewai due to these kinds of errors.

Even when I use “gemini/gemini-1.5-pro”

    gemini_creative_1_5_pro = LLM(
        model="gemini/gemini-1.5-pro",
        api_key=os.environ["GOOGLE_API_KEY"],
        temperature=1,
        max_completion_tokens=8192
    )

Anyone know how to solve this?

Is it because we us the LLM object instead of the Agent(llm=“gemini/gemini-1.5-pro”)?

2 Likes

Follow-up; this works now since the 0.100 crewai version.