How do you track LLM cost per customer for CrewAI workflows?

Has anyone here figured out how to accurately track what each customer is actually costing you in LLM and agent execution spend?

I’ve been talking to a handful of founders building agent-powered SaaS on CrewAI and similar frameworks, and this keeps coming up. At small scale it’s manageable, but somewhere around 20-30 customers things seem to get messy fast.

The challenge I keep hearing about is that most observability tools give you visibility at the LLM call level, but not at the “which customer triggered this entire agent run” level. So when you have multi-step crews with tool calls, retries, and sub-agents, stitching that back to a specific customer for billing or margin analysis gets painful.

Curious what people here are actually doing in practice:

  • Are you attributing per-customer AI costs at all, or is it just baked into a flat margin assumption?

  • If you are tracking it, what does your setup look like? Something stitched together, a specific tool, custom instrumentation?

  • Has inaccurate cost attribution ever actually burned you, like a customer segment turning out to be way more expensive than you priced for?

Not looking to sell anything here. I’m genuinely trying to understand whether this is a real operational headache for people shipping agent products, or whether most teams have figured out a reasonable approach I haven’t seen yet.

Would appreciate any honest takes, even if your answer is “we just don’t bother and it’s fine.”

The pattern that’s held up best for me is: **don’t treat cost as an LLM metric; treat it as a run-level accounting problem.**

I’d give every customer-triggered execution a `run_id` + `customer_id` at the edge, then propagate that through every Crew/agent/tool/sub-agent call.
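
A minimal sketch of that propagation, assuming a Python service and using `contextvars` (the names here are illustrative, not a CrewAI API):

```python
import uuid
import contextvars
from dataclasses import dataclass

@dataclass(frozen=True)
class RunContext:
    run_id: str
    customer_id: str

# Hypothetical process-wide context; every cost-bearing step reads from it,
# so callbacks and tool wrappers never need customer_id passed explicitly.
run_context: contextvars.ContextVar[RunContext] = contextvars.ContextVar("run_context")

def start_run(customer_id: str) -> RunContext:
    """Mint the run_id at the edge (API handler, queue consumer) and bind it."""
    ctx = RunContext(run_id=str(uuid.uuid4()), customer_id=customer_id)
    run_context.set(ctx)
    return ctx
```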

Minimal setup that works (sketched in code after the list):

1. **Create a run ledger row before kickoff**

  • `run_id`

  • `customer_id`

  • workflow / crew name

  • model(s)

  • started_at / finished_at

2. **Emit an event for every cost-bearing step**

  • prompt tokens / completion tokens

  • model unit price at time of call

  • tool/runtime costs if you care about margins

  • retries as separate child events

3. **Aggregate back to the run**

  • total AI cost

  • total latency

  • retry count

  • optional per-step breakdown for debugging

4. **Roll that up to customer/day/month tables**

  • this is what finance / pricing decisions actually need
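
As a concrete sketch of the first two pieces (field names are illustrative; map them onto whatever ORM or warehouse schema you already use):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RunLedgerRow:
    run_id: str
    customer_id: str
    crew_name: str
    models: list[str]
    started_at: datetime
    finished_at: datetime | None = None
    parent_run_id: str | None = None  # internal child runs, covered below

@dataclass
class CostEvent:
    run_id: str
    step: str                  # which agent/tool/sub-agent incurred the cost
    prompt_tokens: int
    completion_tokens: int
    prompt_price: float        # $/token captured at call time, not today's price
    completion_price: float
    is_retry: bool = False

    @property
    def cost(self) -> float:
        return (self.prompt_tokens * self.prompt_price
                + self.completion_tokens * self.completion_price)
```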

A few practical lessons:

- If you only log per-call token usage, you’ll lose the business context fast.

- If you only log run totals, debugging margin explosions becomes painful.

- So I’d keep both: **event-level detail, run-level aggregation**.
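
Continuing the sketch above, the "keep both" part is just two aggregations over the same event rows (pure Python here; in practice a couple of GROUP BYs in your warehouse):

```python
from collections import defaultdict

def aggregate(events: list[CostEvent], ledger: dict[str, RunLedgerRow]):
    # Run-level totals; the event rows stay around for debugging.
    run_totals: dict[str, float] = defaultdict(float)
    retry_counts: dict[str, int] = defaultdict(int)
    for e in events:
        run_totals[e.run_id] += e.cost
        retry_counts[e.run_id] += int(e.is_retry)

    # Customer/day rollup: the view finance and pricing actually consume.
    by_customer_day: dict[tuple[str, str], float] = defaultdict(float)
    for run_id, total in run_totals.items():
        row = ledger[run_id]
        by_customer_day[(row.customer_id, row.started_at.date().isoformat())] += total

    return run_totals, retry_counts, by_customer_day
```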

For messy multi-agent systems, I’d also separate:

- **customer-visible run** (what the customer thinks happened)

- **internal child runs** (planner, retries, sub-agents, browser steps)

That parent/child run tree makes weird spend spikes much easier to explain.
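
With `parent_run_id` on the ledger row, attributing internal spend to the customer-visible run is a small tree rollup (sketch; assumes the ledger fits in memory):

```python
def customer_visible_cost(run_id: str,
                          ledger: dict[str, RunLedgerRow],
                          run_totals: dict[str, float]) -> float:
    """A run's own cost plus everything its children (planner, retries,
    sub-agents, browser steps) spent on its behalf."""
    children = [r.run_id for r in ledger.values() if r.parent_run_id == run_id]
    return run_totals.get(run_id, 0.0) + sum(
        customer_visible_cost(c, ledger, run_totals) for c in children
    )
```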

Also: if your workflows are semi-structured, a portable workflow artifact helps here too. A lot of teams already have an implicit workflow doc; making it explicit (steps, branches, expected inputs/outputs) makes both replay and cost attribution cleaner because you know what unit of work you’re measuring.
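
For example (purely illustrative structure, not any framework's format), the explicit artifact can be tiny and still pay for itself:

```python
WORKFLOW = {
    "name": "account_research",
    "steps": [
        {"id": "plan",   "agent": "planner",    "expects": "customer brief"},
        {"id": "search", "agent": "researcher", "expects": "plan", "may_retry": True},
        {"id": "write",  "agent": "writer",     "expects": "search results"},
    ],
}
# If each step id doubles as the `step` field on your cost events, every
# dollar maps to a named unit of work you can replay and price.
```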

At small scale many teams absolutely *don’t* bother. But once you have enough customers that a few pathological workflows can distort margins, I think you need run-level accounting, not just observability dashboards.

The setup that’s worked for me: Phoenix (arize-phoenix) with the CrewAI + Anthropic OpenInference instrumentors. The CrewAI one captures orchestration spans, the Anthropic one captures per-call token counts (`llm.token_count.prompt` / `llm.token_count.completion`). Both activate automatically with `auto_instrument=True`.
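
Setup is roughly the following (worth checking against your Phoenix version; `register` comes from the arize-phoenix-otel package):

```python
import phoenix as px
from phoenix.otel import register

px.launch_app()  # local Phoenix UI; skip if you point at a hosted collector

tracer_provider = register(
    project_name="agent-costs",  # project name is arbitrary
    auto_instrument=True,        # activates installed OpenInference instrumentors
)
```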

For customer attribution, I believe OpenInference has a context manager that tags all child spans:

```python
from openinference.instrumentation import using_attributes

with using_attributes(metadata={"customer_id": "cust-123"}):
    crew.kickoff()  # all spans created inside inherit customer_id
```

Then `px.Client().get_spans_dataframe()` gives you everything in a DataFrame: group by customer, multiply tokens by rate ($1/$5 per MTok for Haiku 4.5), done.
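
A rough version of that rollup, assuming the OpenInference attributes land flattened as `attributes.*` columns (worth verifying on your Phoenix version; the metadata column sometimes comes back as a JSON string):

```python
import json
import phoenix as px

df = px.Client().get_spans_dataframe()

def customer_of(meta):
    # Metadata set via using_attributes; may arrive as a dict or JSON string.
    if isinstance(meta, str):
        meta = json.loads(meta)
    return meta.get("customer_id") if isinstance(meta, dict) else None

PROMPT_RATE, COMPLETION_RATE = 1 / 1e6, 5 / 1e6  # Haiku 4.5: $1 / $5 per MTok

df["customer_id"] = df["attributes.metadata"].map(customer_of)
df["cost"] = (
    df["attributes.llm.token_count.prompt"].fillna(0) * PROMPT_RATE
    + df["attributes.llm.token_count.completion"].fillna(0) * COMPLETION_RATE
)
print(df.groupby("customer_id")["cost"].sum().sort_values(ascending=False))
```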

To @skillforge_team’s point on the run ledger - agreed for billing. The trace approach adds the diagnostic layer: when one customer’s runs cost 3x more, you can see it was the research agent re-running a tool 6 times, not just the total.