I cannot make structured outputs to work consistently, and I’m thinking this is an issue with crewai, since I didn’t run into this problem at all when using Agents SDK.
This is my code:
```python
async def get_founders_names(self):
    query = (
        f"Search the web for the names of the founders of company {self.state.company_name}. "
        f"Do NOT include any explanation, thought, or commentary. Only output JSON."
    )
    result = await researcher_agent.kickoff_async(query, response_format=FounderNames)
    if result.pydantic:
        self.state.founder_names = result.pydantic
        return self.state.founder_names
    else:
        raise ValueError(f"Failed to get structured output from researcher agent. Raw result: {result}")
```
Where:
```python
class FounderNames(BaseModel):
    names: Annotated[List[str], Field(description="List of the full names of the company founders.")] = None
```
I get incorrectly formatted outputs about 90% of the time, something like:

````
```json {"names": ["<founder 1>", "<founder 2>"]}```
````

Rather than something clean like:

```json
{"names": ["<founder 1>", "<founder 2>"]}
```
Is there a recommendation for how to address this problem?
Which LLM are you using? It’s working for me on every run (I’ve tested it with both Gemini and OpenAI):
```python
from crewai import Agent, LLM
from pydantic import BaseModel, Field
from typing import Optional, List, Annotated
import asyncio
import os

os.environ["OPENAI_API_KEY"] = "<your-key>"


async def get_founders_names():
    class FounderNames(BaseModel):
        names: Optional[Annotated[
            List[str],
            Field(description="List of the full names of the company founders")
        ]] = None

    llm = LLM(
        model="gpt-4o",
        temperature=0.5
    )

    researcher_agent = Agent(
        role="Data Extraction Research Specialist",
        goal=(
            "Extract relevant information from sources and structure it into "
            "clean, organized data formats"
        ),
        backstory=(
            "You are a detail-oriented researcher who specializes in finding "
            "key information and organizing it into clear, structured outputs."
        ),
        llm=llm,
        allow_delegation=False,
        verbose=True
    )

    query = (
        "Extract the names of the founders of Dunder Mifflin company from the "
        "text below:\n\n"
        "Willy Wonka founded the Wonka Chocolate Factory. Dunder Mifflin (The "
        "Office) was founded in 1949 by Robert Dunder and Robert Mifflin. Elon "
        "Musk founded SpaceX in 2002."
    )

    result = await researcher_agent.kickoff_async(
        query, response_format=FounderNames
    )

    print("\nresult:\n")
    print(f"{result.pydantic=}")
    print(f"{type(result.pydantic)=}\n")


async def main():
    await get_founders_names()


if __name__ == "__main__":
    asyncio.run(main())
```
```python
model = "gemini/gemini-2.0-flash"
```
I tried other models, including gpt-4o-mini, and still get unreliable results.
I'm using crewai version "0.159.0".
Let me know what else I should be looking at. I’m impressed you’re able to get good results with such a high temperature.
BTW, I literally ran your example with gemini-2.0-flash and got this:
````
Raw result: ```json
{
  "names": [
    "Robert Dunder",
    "Robert Mifflin"
  ]
}
```

Pydantic result: None
````
I was able to get the right output with gemini-2.5-flash, though.
Based on your problem description, I decided to check if there’s a significant difference in structured output generation when you call an Agent directly versus when you use a full-fledged Crew. After reviewing the code, it turns out there is, and this difference explains the issue you’re facing.
As I mentioned, there are basically two ways to get structured output in CrewAI:
- The full-fledged version, with at least one Agent, a Task (with its output_pydantic parameter set), and a Crew. You then call Crew.kickoff() (or one of its variants) to start the agentic loop, which will likely result in structured output.
- Or the streamlined approach, where you just define an Agent and then kick off the process with the response_format parameter set to start the agentic loop.
When it’s time to generate the structured output, the simplified Agent version (which just becomes a LiteAgent when you call kickoff()) attempts to validate the LLM’s output in a very basic and direct way. If the LLM adds any extra text (like “Here is your result:”), the validation fails. So, this streamlined version is very streamlined, even in its attempt to generate structured output.
The full-fledged version, on the other hand, is robust. It uses a sequence of up to three different methods (with increasing levels of robustness) to try and generate the structured output. The third and most robust method involves an LLM call using the Instructor library.
So, this difference in approach is the reason for the unreliability you’re experiencing with structured output generation.
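If you want a stopgap while staying on the streamlined Agent.kickoff() path, you can strip the markdown fences yourself before validating the model's output. The helper below is purely illustrative (parse_fenced_json is my own name, not part of the CrewAI API), and it assumes the same FounderNames model from the original post:

```python
import json
import re
from typing import List, Optional

from pydantic import BaseModel, Field


class FounderNames(BaseModel):
    names: Optional[List[str]] = Field(
        default=None, description="List of the full names of the company founders"
    )


def parse_fenced_json(raw: str) -> FounderNames:
    """Strip optional ```json ... ``` fences before validating.

    Hypothetical fallback for when result.pydantic comes back None
    because the LLM wrapped its JSON in markdown fences.
    """
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    payload = match.group(1) if match else raw.strip()
    return FounderNames.model_validate(json.loads(payload))


# The kind of fenced output gemini-2.0-flash produced:
raw = '```json\n{"names": ["Robert Dunder", "Robert Mifflin"]}\n```'
print(parse_fenced_json(raw).names)  # ['Robert Dunder', 'Robert Mifflin']
```

It's a band-aid, though; the proper fix is below.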
My immediate recommendation is to refactor your code to use the full-fledged setup with an Agent, a Task (with its output_pydantic parameter set), and a Crew. If you think the CrewAI devs should improve the robustness of this step for the LiteAgent, feel free to open an issue on GitHub for them to consider. Just keep in mind there’s a reason it’s called Lite. It’s natural to trade some robustness for the sake of simplicity.
Thanks @maxmoura, I appreciate you going deeper into this. Your explanation makes perfect sense.
I think the challenge is that I’m using Flows, not Crews, and refactoring it could mean I lose the benefits of the flow. I could also create a new crew for every step of the flow, but that doesn’t sound practical.
There is perhaps a possibility of creating a crew for some of the critical steps of my flow where output reliability matters most, so I'll explore that. It's ironic that flows are supposed to create more reliability, but I have to go back to crews in this case.
Meanwhile, I’ll open an issue and see if the crewai team can provide support.
Thank you!
Not at all! Think of the Flow as just a step orchestrator. Each step of your Flow (each function that represents a step) can contain any code, whether it’s a simple Python function that handles data, an Agent that’s triggered directly with Agent.kickoff(), or even a full-fledged Crew.
In every function (step) where you’re currently doing Agent ➜ Agent.kickoff(), you just switch to the Agent + Task + Crew ➜ Crew.kickoff() setup.
Makes sense. I can create a wrapper around the crew and just call that at each step of the flow. I thought calling a full crew added a lot of overhead, but sounds like I was wrong.