ArkSim: a testing framework for CrewAI

We’ve been working on ArkSim, a testing framework for CrewAI agents.

ArkSim simulates multi-turn conversations with diverse synthetic users, so you can detect and capture issues early, before they hit production. There are currently integration examples for CrewAI.

repo: https://github.com/arklexai/arksim/tree/main/examples/integrations/crewai
docs: https://docs.arklex.ai/

Happy to answer any questions and would love feedback from people currently working on agents!


Nice work on the multi-turn simulation. Curious how you handle scoring consistency between runs? I’ve been running the same crew 30+ times and found the variance in scores higher than expected, even with temperature near zero. Hard to tell if a change actually improved things or if it’s just noise.

I’ve been exploring statistical comparison on trace-level metrics (duration, tokens, cost per task) as a complement to quality scoring to separate real changes from run-to-run variance.
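As a rough illustration of what I mean (the function and score data here are hypothetical, not part of ArkSim): with only the standard library you can check whether a before/after mean shift is large relative to run-to-run spread, which is a first cut at separating real improvement from noise.

```python
import statistics

def mean_diff_vs_noise(before, after):
    """How large is the before/after mean shift relative to the
    pooled run-to-run standard deviation (Cohen's d style)?"""
    diff = statistics.mean(after) - statistics.mean(before)
    pooled_sd = statistics.pstdev(before + after)
    return diff / pooled_sd if pooled_sd else 0.0

# Hypothetical quality scores from 5 runs before and after a prompt change.
before = [0.71, 0.68, 0.74, 0.70, 0.69]
after = [0.73, 0.75, 0.72, 0.76, 0.74]
print(round(mean_diff_vs_noise(before, after), 2))
```

With 30+ runs per side you get enough samples for this kind of effect-size estimate to stabilize; with only a handful, the estimate itself is noisy.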


OpenAI has deprecated temperature controls, which introduces some inherent variance in outputs. Anthropic models tend to be more deterministic by comparison (due to their training methods). A few additional considerations that may help:

  • Increasing num_conversations_per_scenario will provide a more statistically robust picture of scores at the scenario level.

  • Beyond raw scores, tracking agent behavior failures and unique error distributions can offer meaningful signal on model improvement over time.

Your point about trace-level metrics as a complement is interesting. Deterministic signals like token count, tool call sequences, and latency can catch regressions that quality scores might miss or be noisy about. We’re working on OpenTelemetry integration that would make this kind of analysis easier.
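To make the deterministic-signal idea concrete, here's a minimal sketch (hypothetical trace data and function, not ArkSim's API) of diffing tool-call sequences between a baseline run and a candidate run, which can flag a regression that a noisy quality score might miss:

```python
def first_divergence(baseline_calls, candidate_calls):
    """Return (index, baseline_item, candidate_item) where two
    tool-call sequences first differ, or None if they match."""
    for i, (a, b) in enumerate(zip(baseline_calls, candidate_calls)):
        if a != b:
            return i, a, b
    if len(baseline_calls) != len(candidate_calls):
        i = min(len(baseline_calls), len(candidate_calls))
        # One sequence is a prefix of the other; report the extra tail.
        return i, baseline_calls[i:] or None, candidate_calls[i:] or None
    return None

# Hypothetical traces: tool names in call order for the same scenario.
baseline = ["search_docs", "summarize", "write_report"]
candidate = ["search_docs", "search_docs", "summarize", "write_report"]
print(first_divergence(baseline, candidate))  # → (1, 'summarize', 'search_docs')
```

Here the duplicated `search_docs` call shows up immediately as a structural change, even if the final answer quality is unchanged.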
Would love to hear more about what statistical methods you’re using on the trace side.


Yep, can confirm: matches what I’ve seen on Anthropic models being more deterministic.

For the trace side, I’ve been treating before/after runs as statistical populations - bootstrap CIs on operational metrics, per-task breakdowns to catch regressions hiding in aggregates. Packaged it into an open-source tool called Kalibra if you want to take a look.
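For anyone curious, here's the general shape of the bootstrap approach (a generic sketch of the method, not Kalibra's actual implementation; the metric values are hypothetical): resample each run population with replacement and take percentiles of the resampled mean difference.

```python
import random

def bootstrap_ci_of_diff(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(after) - mean(before).
    If the interval excludes 0, the shift likely isn't run-to-run noise."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        b = [rng.choice(before) for _ in before]   # resample with replacement
        a = [rng.choice(after) for _ in after]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-run token counts before and after a change.
before = [1450, 1380, 1520, 1410, 1490, 1470]
after = [1210, 1180, 1260, 1230, 1200, 1250]
lo, hi = bootstrap_ci_of_diff(before, after)
print(f"95% CI for mean diff: [{lo:.0f}, {hi:.0f}]")
```

The nice property is that it makes no normality assumption about the metric, which matters for heavy-tailed things like latency.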

Your OTel integration sounds like a natural complement - ArkSim generates the test scenarios, traces capture the operational data, statistical comparison catches the regressions. Different layers of the same problem.