This is additional work, but I think it would be informative to ask what purpose each LLM is being used for. I use gpt4omini for general inference a lot, but I have to use gpt4o as the planning LLM because I can never get a decent plan out of gpt4omini. For vision RAG, etc., I always use gemini 1.5 flash because it has a 1M token context, is of course multimodal, and is not too expensive. Even there, though, I use gemini 1.5 pro for planning LLM work, as it does a better job.
For hierarchical/planned processes, I have found that Phi3-medium and, to some extent, Phi3-mini have far greater reasoning skills than other LLMs.
While I am still relatively new to CrewAI, I find that choosing the right LLM for each agent within a multi-agent system, as opposed to defaulting to one, gives far better results. E.g. manager_llm = Phi3, general agents default to gpt4omini, etc.
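To make the per-agent idea concrete, here is a minimal sketch of a role-to-model lookup in plain Python. It does not call CrewAI; `ROLE_MODELS`, `DEFAULT_MODEL`, `pick_model` and the model strings are hypothetical names I made up for illustration, and you would pass the result to whatever your framework expects.

```python
# Hypothetical role-to-model table; the role names and model strings are
# illustrative assumptions, not identifiers taken from CrewAI.
ROLE_MODELS = {
    "manager": "phi3-medium",
}

# Every role without an explicit entry falls back to this model.
DEFAULT_MODEL = "gpt-4o-mini"


def pick_model(role: str) -> str:
    """Return the model configured for a role, or the default if none is set."""
    return ROLE_MODELS.get(role.strip().lower(), DEFAULT_MODEL)


if __name__ == "__main__":
    for role in ("manager", "researcher"):
        print(role, "->", pick_model(role))
```

Keeping the mapping in one table makes it easy to swap a single role's model and compare results without touching the agents themselves.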
Among the paid ones, Claude Sonnet and DeepSeek both work for task planning. For tool calling and structured output, qwen2.5-coder 7b has been working out. We are experimenting with passing in what a user prompt would look like, which can act as a trigger to identify an action on our platform.
Even there, qwen2.5-coder acting as both planner and executor LLM has worked out for a limited test case. I am fairly sure that, for such a case, mapping a user's intent to an org-specific action would be better done by OpenAI, and from there on the classification can be done with Claude or qwen2.5-coder.
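Roughly, the two-stage split could look like the sketch below: one model states the intent, a second maps it to a known action. Each model is just a callable here so the stages can be swapped independently; `route_prompt`, the prompt wording, the action names and the stub models are all hypothetical, and no real provider API is called.

```python
from typing import Callable, Optional, Set

# A model is represented as a plain callable: prompt text in, response text out.
LLM = Callable[[str], str]


def route_prompt(prompt: str, intent_llm: LLM, classifier_llm: LLM,
                 actions: Set[str]) -> Optional[str]:
    """Stage 1 states the user's intent; stage 2 maps it to a known action."""
    intent = intent_llm(f"State the user's intent in one sentence: {prompt}")
    label = classifier_llm(
        f"Pick one of {sorted(actions)} for this intent: {intent}"
    ).strip()
    # Reject anything outside the allowed action set instead of trusting the model.
    return label if label in actions else None


# Stub models so the sketch runs offline; real calls would replace these.
def fake_intent(prompt: str) -> str:
    return "The user wants to create an invoice."


def fake_classifier(prompt: str) -> str:
    return "create_invoice\n"


if __name__ == "__main__":
    known = {"create_invoice", "cancel_order"}
    print(route_prompt("Bill ACME for March", fake_intent, fake_classifier, known))
```

Checking the label against the allowed set means an off-list answer from the classifier is dropped rather than executed.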
We are setting up a test framework, so we will be able to experiment better.