Agent eval tools?

Hiya all,

I’m curious how people here are evaluating their CrewAI agents once things start getting a bit more complex.

I mean stuff like:

  • output quality
  • whether tool usage is actually improving
  • regressions after changes
  • comparing runs over time
  • sharing results in a way that’s easy for others to understand

I tried test.qlankr.com, but I’m wondering what other tools or approaches people here are using.

Are you mostly doing this manually, building your own eval setup (something like the sketch below), or using some other tools?
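
For context, this is roughly the kind of homegrown harness I mean. A minimal sketch, not a real implementation: `run_agent`, the test cases, and the keyword check are all hypothetical stand-ins you'd swap for your own crew invocation and scoring logic:

```python
import json
import time
from pathlib import Path

# Hypothetical adapter: wrap however you invoke your crew, e.g.
# lambda prompt: crew.kickoff(inputs={"query": prompt}).raw
def run_agent(prompt: str) -> str:
    raise NotImplementedError("plug in your CrewAI call here")

# Tiny eval set: a prompt plus a must-contain check as a cheap proxy for quality.
CASES = [
    {"id": "refund-policy", "prompt": "What is our refund window?", "must_contain": "30 days"},
    {"id": "tool-math", "prompt": "What is 17 * 43?", "must_contain": "731"},
]

def score(output: str, case: dict) -> float:
    # Replace with an LLM judge or a rubric once keyword checks get too crude.
    return 1.0 if case["must_contain"].lower() in output.lower() else 0.0

def run_suite(results_dir: str = "eval_runs") -> dict:
    results = {c["id"]: score(run_agent(c["prompt"]), c) for c in CASES}
    run = {
        "ts": int(time.time()),
        "scores": results,
        "mean": sum(results.values()) / len(results),
    }
    Path(results_dir).mkdir(exist_ok=True)
    # One JSON file per run, so you can diff runs over time and spot regressions.
    Path(results_dir, f"run_{run['ts']}.json").write_text(json.dumps(run, indent=2))
    return run
```

It covers regressions and comparing runs over time (just diff the JSON files), but it doesn't really solve sharing results or judging tool usage, which is partly why I'm asking.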

Would love to hear what people do!