ArkSim: a testing framework for CrewAI

We’ve been working on ArkSim, a testing framework for CrewAI agents.

ArkSim simulates multi-turn conversations with diverse synthetic users, so you can detect and capture issues early, before they hit production. There are currently integration examples for CrewAI.

repo: https://github.com/arklexai/arksim/tree/main/examples/integrations/crewai
docs: https://docs.arklex.ai/

Happy to answer any questions and would love feedback from people currently working on agents!


Nice work on the multi-turn simulation. Curious how you handle scoring consistency between runs? I’ve been running the same crew 30+ times and found the variance in scores higher than expected, even with temperature near zero. Hard to tell if a change actually improved things or if it’s just noise.

I’ve been exploring statistical comparison on trace-level metrics (duration, tokens, cost per task) as a complement to quality scoring to separate real changes from run-to-run variance.
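As a rough illustration of what I mean (the function and score data here are hypothetical, not part of ArkSim): with only the standard library you can check whether a before/after mean shift is large relative to run-to-run spread, which is a first cut at separating real improvement from noise.

```python
import statistics

def mean_diff_vs_noise(before, after):
    """How large is the before/after mean shift relative to the
    pooled run-to-run standard deviation (Cohen's d style)?"""
    diff = statistics.mean(after) - statistics.mean(before)
    pooled_sd = statistics.pstdev(before + after)
    return diff / pooled_sd if pooled_sd else 0.0

# Hypothetical quality scores from 5 runs before and after a prompt change.
before = [0.71, 0.68, 0.74, 0.70, 0.69]
after = [0.73, 0.75, 0.72, 0.76, 0.74]
print(round(mean_diff_vs_noise(before, after), 2))
```

With 30+ runs per side you get enough samples for this kind of effect-size estimate to stabilize; with only a handful, the estimate itself is noisy.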


OpenAI has deprecated temperature controls, which introduces some inherent variance in outputs. Anthropic models tend to be more deterministic by comparison (due to their training methods). A few additional considerations that may help:

  • Increasing num_conversations_per_scenario will provide a more statistically robust picture of scores at the scenario level.

  • Beyond raw scores, tracking agent behavior failures and unique error distributions can offer meaningful signal on model improvement over time.

Your point about trace-level metrics as a complement is interesting. Deterministic signals like token count, tool call sequences, and latency can catch regressions that quality scores might miss or be noisy about. We’re working on OpenTelemetry integration that would make this kind of analysis easier.
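To make the deterministic-signal idea concrete, here's a minimal sketch (hypothetical trace data and function, not ArkSim's API) of diffing tool-call sequences between a baseline run and a candidate run, which can flag a regression that a noisy quality score might miss:

```python
def first_divergence(baseline_calls, candidate_calls):
    """Return (index, baseline_item, candidate_item) where two
    tool-call sequences first differ, or None if they match."""
    for i, (a, b) in enumerate(zip(baseline_calls, candidate_calls)):
        if a != b:
            return i, a, b
    if len(baseline_calls) != len(candidate_calls):
        i = min(len(baseline_calls), len(candidate_calls))
        # One sequence is a prefix of the other; report the extra tail.
        return i, baseline_calls[i:] or None, candidate_calls[i:] or None
    return None

# Hypothetical traces: tool names in call order for the same scenario.
baseline = ["search_docs", "summarize", "write_report"]
candidate = ["search_docs", "search_docs", "summarize", "write_report"]
print(first_divergence(baseline, candidate))  # → (1, 'summarize', 'search_docs')
```

Here the duplicated `search_docs` call shows up immediately as a structural change, even if the final answer quality is unchanged.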
Would love to hear more about what statistical methods you’re using on the trace side.


Yep, can confirm: matches what I’ve seen on Anthropic models being more deterministic.

For the trace side, I’ve been treating before/after runs as statistical populations - bootstrap CIs on operational metrics, per-task breakdowns to catch regressions hiding in aggregates. Packaged it into an open-source tool called Kalibra if you want to take a look.
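For anyone curious, here's the general shape of the bootstrap approach (a generic sketch of the method, not Kalibra's actual implementation; the metric values are hypothetical): resample each run population with replacement and take percentiles of the resampled mean difference.

```python
import random

def bootstrap_ci_of_diff(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(after) - mean(before).
    If the interval excludes 0, the shift likely isn't run-to-run noise."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        b = [rng.choice(before) for _ in before]   # resample with replacement
        a = [rng.choice(after) for _ in after]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-run token counts before and after a change.
before = [1450, 1380, 1520, 1410, 1490, 1470]
after = [1210, 1180, 1260, 1230, 1200, 1250]
lo, hi = bootstrap_ci_of_diff(before, after)
print(f"95% CI for mean diff: [{lo:.0f}, {hi:.0f}]")
```

The nice property is that it makes no normality assumption about the metric, which matters for heavy-tailed things like latency.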

Your OTel integration sounds like a natural complement - ArkSim generates the test scenarios, traces capture the operational data, statistical comparison catches the regressions. Different layers of the same problem.