How do you test if an agent really completed a task?

Hey everyone, beginner question.

I’m still learning how people test CrewAI / multi-agent workflows properly, so sorry if this is basic.

How do you usually check whether an agent actually completed a task, rather than just looking like it did?

The case I’m confused about is when an agent skips part of the evidence chain.

For example:

  • it makes up an ID
  • skips a schema check
  • builds an output from incomplete context
  • says the task is complete without clear evidence
  • claims something was submitted without a receipt

How do people usually catch this in practice?

Do you rely on logs, evaluator agents, human review, structured outputs, or some kind of run ledger?
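
To make the "structured outputs plus run ledger" option concrete, here is a rough sketch of what such a check might look like. Everything in it is hypothetical (the field names, `RunLedger`, `verify_claim`) and none of it is a CrewAI API. The idea is only that tool calls write evidence to a ledger, and the agent's final claim is compared against that ledger instead of being taken at its word.

```python
from dataclasses import dataclass, field

# Hypothetical field names for the agent's final structured output.
REQUIRED_FIELDS = {"record_id", "status", "receipt_id"}


@dataclass
class RunLedger:
    """Append-only record of what tools actually returned during a run."""
    issued_ids: set = field(default_factory=set)
    receipts: set = field(default_factory=set)

    def record_id(self, record_id: str) -> None:
        # Called by the tool wrapper, never by the agent itself.
        self.issued_ids.add(record_id)

    def record_receipt(self, receipt_id: str) -> None:
        self.receipts.add(receipt_id)


def verify_claim(output: dict, ledger: RunLedger) -> list:
    """Return a list of problems; an empty list means the claim has evidence."""
    problems = []

    # Schema check: the output must contain every required field.
    missing = REQUIRED_FIELDS - output.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    # Made-up ID check: the ID must have come from a real tool call.
    if output.get("record_id") not in ledger.issued_ids:
        problems.append("record_id was never returned by a tool")

    # Receipt check: "submitted" needs a receipt the ledger has seen.
    if (output.get("status") == "submitted"
            and output.get("receipt_id") not in ledger.receipts):
        problems.append("claims submission without a recorded receipt")

    return problems
```

With something like this, a run would count as complete only when `verify_claim(agent_output, ledger)` returns an empty list, and anything else would go to logs or human review.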

I’m testing a tiny task-flow myself, but I don’t want to approach it in a naive way. I’d like to understand the proper lightweight pattern before trusting agents with more serious workflows.