Hi,
We have been using crewAI to prototype a feature, and it's running under Amazon Bedrock AgentCore. We ran into issues when using azure/o3 and azure/o3-mini as our LLMs. The agent is supposed to use crewAI's FileReadTool, and while it sometimes does, tool usage is very intermittent; the common behaviour is skipping the tool entirely. For instance, it may call the tool in 1 out of 10 runs.
We also tested the models below, and all of them performed the tool calling accurately in every run, without exception:
- bedrock/us.anthropic.claude-sonnet-4-20250514-v1:0
- gemini/gemini-2.5-pro
- azure/gpt-4.1
OS and package versions:
- macOS: 15.5 (24F74)
- crewai>=0.150.0
- crewai_tools==0.58.0
We have also tried `function_calling_llm: azure/gpt-4.1`, and it did not work for either o3 or o3-mini.
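For reference, this is roughly how we set it in the agent YAML (a minimal sketch; the full config is below):

```yaml
analyst:
  llm: azure/o3
  function_calling_llm: azure/gpt-4.1  # delegate tool-call formatting to GPT-4.1
```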
Below are the agent's YAML, the task's YAML, and the Python code. I can't share everything, so I tried to keep it minimally reproducible.
```yaml
analyst:
  role: >
    Strategic Analyst
  goal: >
    Analyze multiple markdown files ({markdown_files}).
  backstory: >
    You are a top-tier analyst.
  allow_delegation: false
  llm: azure/o3
  temperature: 0.1
```
```yaml
markdown_generation_task:
  description: >
    Read all files ({markdown_files}) using the FileReadTool.
  expected_output: >
    A single, comprehensive Markdown document.
  agent: analyst
```
```python
from crewai import Agent, Task
from crewai.project import agent, task
from crewai_tools import FileReadTool

file_tool = FileReadTool()

# These two methods live inside our @CrewBase class (trimmed for brevity).

@agent
def analyst(self) -> Agent:
    return Agent(
        config=self.agents_config["analyst"],
        tools=[file_tool],
        verbose=True,
    )

@task
def markdown_generation_task(self) -> Task:
    return Task(
        config=self.tasks_config["markdown_generation_task"],
        output_file="file.md",
        tools=[file_tool],
    )
```
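For completeness, the crew is wired up with the standard `@CrewBase` pattern; a trimmed sketch (the class name is a placeholder):

```python
from crewai import Crew, Process
from crewai.project import CrewBase, crew


@CrewBase
class AnalysisCrew:
    """Holds the @agent and @task methods shown above."""

    @crew
    def crew(self) -> Crew:
        return Crew(
            agents=self.agents,  # collected from the @agent methods
            tasks=self.tasks,    # collected from the @task methods
            process=Process.sequential,
            verbose=True,
        )
```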
Well, there’s a pretty long discussion on the CrewAI GitHub about failures in function calling (tool usage). Here I shared my two cents, and here I proposed a proof-of-concept for how we might be able to improve this aspect of the framework.
Right off the bat, I’d suggest paying closer attention to your prompt engineering. Try to be more explicit about how your agent is supposed to use its tools. I think that could significantly improve the chances of them being used correctly.
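For instance, something along these lines in your task description (just a rough sketch, untested with o3):

```yaml
markdown_generation_task:
  description: >
    For EACH file in {markdown_files}, you MUST first call the
    FileReadTool with that file's path. Never answer from memory;
    if a file cannot be read, report the tool error instead of
    inventing content.
```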
I’d also suggest sharing the logs/output from your execution. That way, other users can get a better handle on the issues you’re running into and can actually contribute to the discussion.
Hey Max,
Thank you for taking the time to answer and for the heads-up and insights.
I have been following the GitHub issues, but I’m not sure if they are related to my case. I have run the same prompt with Claude 4 over 30 times, and a smaller number of times with Gemini-2.5-Pro and GPT-4.1. With these models, the process works as intended (i.e., by calling the tool) 100% of the time.
This leads me to believe the issue lies with the combination of the o3 family of models and CrewAI. When I run the same query using only LiteLLM, the models seem to work correctly, although I don’t have reliable data on that yet. This points toward the issue mentioned in your second link; it’s likely that the way CrewAI prompts the o3 models is not optimal, and I will investigate this further.
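For the record, the plain-LiteLLM check looked roughly like this (the tool schema is hand-written to mirror FileReadTool, so treat it as a sketch):

```python
import litellm

# Hand-written schema approximating FileReadTool, just for this check.
tools = [{
    "type": "function",
    "function": {
        "name": "file_read_tool",
        "description": "Read the contents of a file at the given path.",
        "parameters": {
            "type": "object",
            "properties": {"file_path": {"type": "string"}},
            "required": ["file_path"],
        },
    },
}]

response = litellm.completion(
    model="azure/o3",
    messages=[{"role": "user", "content": "Read report.md and summarize it."}],
    tools=tools,
)
# o3 seemed to return tool_calls here, though we haven't measured it rigorously.
print(response.choices[0].message.tool_calls)
```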
Regarding the incorrect output, the failures manifested in two ways: either a blank output, or a message saying, “I do not have access to the files.” In the latter case, the LLM would still generate the markdown following the given instructions, but fill it with made-up content.
We’d like to use the o3 models because of their cost-effectiveness and their capabilities with “agency” and function calling. It seems we might have to move to a “Flow” paradigm and call the model directly within a function.
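i.e., something like this, where we read the files ourselves and hand the content straight to the model (a rough sketch; the file paths are hypothetical):

```python
from pathlib import Path

from crewai.flow.flow import Flow, start
from litellm import completion


class MarkdownFlow(Flow):
    @start()
    def generate(self) -> str:
        # Read the files ourselves instead of relying on the LLM
        # to decide to call FileReadTool.
        body = "\n\n".join(
            Path(p).read_text() for p in ["notes/a.md", "notes/b.md"]
        )
        response = completion(
            model="azure/o3",
            messages=[{"role": "user", "content": f"Analyze these files:\n\n{body}"}],
        )
        return response.choices[0].message.content


# MarkdownFlow().kickoff()
```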
Edit: Just as a final thought, the fact that `function_calling_llm: azure/gpt-4.1` (or any other of the aforementioned working models) won't work intrigues me.
Hey Jon, thanks for the feedback.
Looks like you’ve run into a situation where the framework’s default behavior isn’t working as expected, and at this point, the best solution is to debug your agentic system. To do that, you’ll need to isolate the problem and get an x-ray of what’s happening under the hood:
- Build a simple, specific Crew that deals directly with the LLM and the tool you're having trouble with. This way, you avoid sharing sensitive customer (or personal) data while still having a minimal reproducible example of the issue (see the sketch after this list).
- Then I recommend adding a crucial tool for this debugging/optimization phase of your agentic system: a monitoring and observability platform. I suggest AgentOps or Phoenix, as they both offer smooth integration with CrewAI and a decent enough free tier.
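Something as bare-bones as this would do for the first step (a sketch; the path, role, and wording are placeholders):

```python
from crewai import Agent, Crew, Task
from crewai_tools import FileReadTool

file_tool = FileReadTool()

analyst = Agent(
    role="Strategic Analyst",
    goal="Read sample.md with the FileReadTool and summarize it.",
    backstory="You are a top-tier analyst.",
    llm="azure/o3",
    tools=[file_tool],
    verbose=True,
)

read_task = Task(
    description="Read sample.md using the FileReadTool and summarize it.",
    expected_output="A short Markdown summary of sample.md.",
    agent=analyst,
)

result = Crew(agents=[analyst], tasks=[read_task], verbose=True).kickoff()
print(result)
```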
Keep the community posted on your progress so everyone can benefit from your findings.