Ok, this one is going to be a doozy. I am working on building a complex simulation benchmarking system for local LLMs using CrewAI. I was an early adopter of the tech but had to spend some time away, and boy, has it changed. (Super glad to see CrewAI take off, Joao is awesome.)
So, a breakdown of the problem: the local LLMs are having issues with tool calls and memory. (I haven't had this issue with local LLMs in the past.)
Agents interact in a virtual world that is basically controlled by two JSON files: one is the world state, the other is an internal "email" communication system between the agents.
NOTE: THIS ALL NEEDS TO RUN IN AN AIR-GAPPED ENVIRONMENT
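For context, the tools the agents are supposed to call are basically thin wrappers around those two JSON files. The real ones are redacted, but the shape is roughly this (simplified sketch with placeholder names/paths; depending on your CrewAI version, BaseTool may come from crewai_tools instead of crewai.tools):

```python
# Rough shape of the JSON-backed tools (simplified, real names/paths redacted).
# Assumes a recent CrewAI where BaseTool is exposed as crewai.tools.BaseTool.
import json
from pathlib import Path

from crewai.tools import BaseTool

WORLD_STATE = Path("world_state.json")  # placeholder path, not the real one


class WorldStateReader(BaseTool):
    name: str = "World State Reader"
    description: str = "Returns the current world state JSON as a string."

    def _run(self) -> str:
        # Plain local file I/O -- nothing leaves the air-gapped box.
        return WORLD_STATE.read_text(encoding="utf-8")


class WorldStateWriter(BaseTool):
    name: str = "World State Writer"
    description: str = "Merges a JSON patch (passed as a string) into the world state file."

    def _run(self, patch: str) -> str:
        state = json.loads(WORLD_STATE.read_text(encoding="utf-8"))
        state.update(json.loads(patch))
        WORLD_STATE.write_text(json.dumps(state, indent=2), encoding="utf-8")
        return "world state updated"
```

The internal "email" tools (sender/reader/analyzer) follow the same pattern against the second JSON file.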
I am dynamically generating the agents and crews from YAML config files. The crews run, BUT the agents either refuse to use tools or use placeholder text instead of the proper commands. The logs run to a few million lines, so breaking them all down by hand is pretty hard. I tried to use Gemini, but it choked, so I had to use some custom longer-context models to do the breakdown for me. Their summary of the errors is below.
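But first, here's roughly what the generation step looks like. This is a heavily simplified sketch with placeholder names and models, not the real configs, and it assumes a recent CrewAI with the LLM wrapper pointed at a local Ollama endpoint:

```python
# Simplified sketch of the YAML-driven agent/crew generation (placeholder
# names and model; the real configs are redacted). Assumes a local Ollama
# server so everything stays inside the air-gapped environment.
import yaml

from crewai import Agent, Crew, LLM, Task

AGENTS_YAML = """
analyst_a:
  role: "World State Analyst"
  goal: "Read the world state and report changes via JSON tool calls only."
  backstory: "You never respond with prose, only with a single JSON tool call."
"""

local_llm = LLM(
    model="ollama/llama3.1",            # placeholder; I've tried about a dozen models
    base_url="http://localhost:11434",  # local endpoint, nothing external
)


def build_agents(raw_yaml: str) -> list[Agent]:
    """Turn each YAML entry into an Agent bound to the local model."""
    specs = yaml.safe_load(raw_yaml)
    return [
        Agent(
            role=spec["role"],
            goal=spec["goal"],
            backstory=spec["backstory"],
            llm=local_llm,
            verbose=True,
        )
        for spec in specs.values()
    ]


agents = build_agents(AGENTS_YAML)
tasks = [
    Task(
        description="Summarise the current world state.",
        expected_output="A single JSON tool call.",
        agent=agents[0],
    )
]
crew = Crew(agents=agents, tasks=tasks, memory=True, verbose=True)
```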
IV. Summary of Errors
- CRITICAL Configuration Error: Both Crew '{redacted} A' and Crew '{redacted} B' are missing their dedicated function_calling_llm instance. This is the primary and most severe error.
- Repeated Memory Failure: As a direct result of the missing crew LLM, every single task execution attempt for both crews fails to add information to long-term memory, which prevents agents from building context and remembering information between steps. The error each time is:
  Failed to add to long term memory: Failed to convert text into a Pydantic model due to error: 'NoneType' object has no attribute 'function_calling_llm'
- Incorrect Agent Output Format: Many agents (especially in Crew B) failed to adhere to the critical instruction of responding only with a JSON tool call. They output plain text, explanatory text, or error messages instead.
- Tool Call Failures:
  - Agents frequently attempted tool calls using placeholder data ("...") instead of context gathered from (failed) memory.
  - Some agents called the wrong tool for their assigned task (e.g., calling the reader instead of the sender/analyzer/{world state} tool).
  - One agent ({redacted}) attempted to call a tool (Draft {redacted} Message) that doesn't appear in its assigned tool list or the registry.
- Context Contamination / Hallucination: Agents produced outputs based on incorrect or generic information, likely due to the memory failure (e.g., {redacted}_a's {redacted} report, {redacted}_a's inputs, {redacted}_b repeating {redacted}_b's errors).
- Output Corruption: In later stages of Crew B's execution, {redacted}_b started producing garbage/corrupted token sequences instead of coherent text or JSON.
- Inconsistent Formatting: Minor issue where Crew B agents sometimes prefixed their output with "ToolCall:".
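Based on the first two items, my current guess is that I need to hand the crews (and maybe each agent) an explicit function_calling_llm, and point memory at a local embedder so the air gap isn't a problem. Something like the sketch below, which is untested on my side, and where the Ollama embedder config is my assumption rather than something from my current setup:

```python
# Untested guess at the fix: give the crew a dedicated function_calling_llm
# and a local embedder for memory. Model names are placeholders; `agents`
# and `tasks` are built from YAML exactly as in the earlier sketch.
from crewai import Crew, LLM

tool_llm = LLM(
    model="ollama/llama3.1",            # dedicated instance just for tool calling
    base_url="http://localhost:11434",
)

crew = Crew(
    agents=agents,
    tasks=tasks,
    function_calling_llm=tool_llm,      # the thing the logs say is missing
    memory=True,
    embedder={                          # assumption: local Ollama embedder, stays air-gapped
        "provider": "ollama",
        "config": {"model": "nomic-embed-text"},
    },
    verbose=True,
)
```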
The weird thing is that only one agent seems able to use tools; all the others either refuse or send malformed data. I have tried about a dozen models so far with no luck.
Any thoughts on how to troubleshoot this?