How do you track LLM cost per customer for CrewAI workflows?

Has anyone here figured out how to accurately track what each customer is actually costing you in LLM and agent execution spend?

I’ve been talking to a handful of founders building agent-powered SaaS on CrewAI and similar frameworks, and this keeps coming up. At small scale it’s manageable, but somewhere around 20-30 customers things seem to get messy fast.

The challenge I keep hearing about is that most observability tools give you visibility at the LLM call level, but not at the “which customer triggered this entire agent run” level. So when you have multi-step crews with tool calls, retries, and sub-agents, stitching that back to a specific customer for billing or margin analysis gets painful.

Curious what people here are actually doing in practice:

  • Are you attributing per-customer AI costs at all, or is it just baked into a flat margin assumption?

  • If you are tracking it, what does your setup look like? Something stitched together, a specific tool, custom instrumentation?

  • Has inaccurate cost attribution ever actually burned you, like a customer segment turning out to be way more expensive than you priced for?

Not looking to sell anything here. I’m genuinely trying to understand whether this is a real operational headache for people shipping agent products, or whether most teams have figured out a reasonable approach I haven’t seen yet.

Would appreciate any honest takes, even if your answer is “we just don’t bother and it’s fine.”

The pattern that’s held up best for me is: **don’t treat cost as an LLM metric; treat it as a run-level accounting problem.**

I’d give every customer-triggered execution a `run_id` + `customer_id` at the edge, then propagate that through every Crew/agent/tool/sub-agent call.
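
A minimal sketch of that propagation, assuming a Python service and using `contextvars` (the names here are illustrative, not a CrewAI API):

```python
import uuid
import contextvars
from dataclasses import dataclass

@dataclass(frozen=True)
class RunContext:
    run_id: str
    customer_id: str

# Hypothetical process-wide context; every cost-bearing step reads from it,
# so callbacks and tool wrappers never need customer_id passed explicitly.
run_context: contextvars.ContextVar[RunContext] = contextvars.ContextVar("run_context")

def start_run(customer_id: str) -> RunContext:
    """Mint the run_id at the edge (API handler, queue consumer) and bind it."""
    ctx = RunContext(run_id=str(uuid.uuid4()), customer_id=customer_id)
    run_context.set(ctx)
    return ctx
```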

Minimal setup that works (sketched in code after the list):

1. **Create a run ledger row before kickoff**

  • `run_id`

  • `customer_id`

  • workflow / crew name

  • model(s)

  • started_at / finished_at

2. **Emit an event for every cost-bearing step**

  • prompt tokens / completion tokens

  • model unit price at time of call

  • tool/runtime costs if you care about margins

  • retries as separate child events

3. **Aggregate back to the run**

  • total AI cost

  • total latency

  • retry count

  • optional per-step breakdown for debugging

4. **Roll that up to customer/day/month tables**

  • this is what finance / pricing decisions actually need
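
As a concrete sketch of the first two pieces (field names are illustrative; map them onto whatever ORM or warehouse schema you already use):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RunLedgerRow:
    run_id: str
    customer_id: str
    crew_name: str
    models: list[str]
    started_at: datetime
    finished_at: datetime | None = None
    parent_run_id: str | None = None  # internal child runs, covered below

@dataclass
class CostEvent:
    run_id: str
    step: str                  # which agent/tool/sub-agent incurred the cost
    prompt_tokens: int
    completion_tokens: int
    prompt_price: float        # $/token captured at call time, not today's price
    completion_price: float
    is_retry: bool = False

    @property
    def cost(self) -> float:
        return (self.prompt_tokens * self.prompt_price
                + self.completion_tokens * self.completion_price)
```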

A few practical lessons:

- If you only log per-call token usage, you’ll lose the business context fast.

- If you only log run totals, debugging margin explosions becomes painful.

- So I’d keep both: **event-level detail, run-level aggregation**.
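
Continuing the sketch above, the "keep both" part is just two aggregations over the same event rows (pure Python here; in practice a couple of GROUP BYs in your warehouse):

```python
from collections import defaultdict

def aggregate(events: list[CostEvent], ledger: dict[str, RunLedgerRow]):
    # Run-level totals; the event rows stay around for debugging.
    run_totals: dict[str, float] = defaultdict(float)
    retry_counts: dict[str, int] = defaultdict(int)
    for e in events:
        run_totals[e.run_id] += e.cost
        retry_counts[e.run_id] += int(e.is_retry)

    # Customer/day rollup: the view finance and pricing actually consume.
    by_customer_day: dict[tuple[str, str], float] = defaultdict(float)
    for run_id, total in run_totals.items():
        row = ledger[run_id]
        by_customer_day[(row.customer_id, row.started_at.date().isoformat())] += total

    return run_totals, retry_counts, by_customer_day
```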

For messy multi-agent systems, I’d also separate:

- **customer-visible run** (what the customer thinks happened)

- **internal child runs** (planner, retries, sub-agents, browser steps)

That parent/child run tree makes weird spend spikes much easier to explain.
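
With `parent_run_id` on the ledger row, attributing internal spend to the customer-visible run is a small tree rollup (sketch; assumes the ledger fits in memory):

```python
def customer_visible_cost(run_id: str,
                          ledger: dict[str, RunLedgerRow],
                          run_totals: dict[str, float]) -> float:
    """A run's own cost plus everything its children (planner, retries,
    sub-agents, browser steps) spent on its behalf."""
    children = [r.run_id for r in ledger.values() if r.parent_run_id == run_id]
    return run_totals.get(run_id, 0.0) + sum(
        customer_visible_cost(c, ledger, run_totals) for c in children
    )
```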

Also: if your workflows are semi-structured, a portable workflow artifact helps here too. A lot of teams already have an implicit workflow doc; making it explicit (steps, branches, expected inputs/outputs) makes both replay and cost attribution cleaner because you know what unit of work you’re measuring.
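
For example (purely illustrative structure, not any framework's format), the explicit artifact can be tiny and still pay for itself:

```python
WORKFLOW = {
    "name": "account_research",
    "steps": [
        {"id": "plan",   "agent": "planner",    "expects": "customer brief"},
        {"id": "search", "agent": "researcher", "expects": "plan", "may_retry": True},
        {"id": "write",  "agent": "writer",     "expects": "search results"},
    ],
}
# If each step id doubles as the `step` field on your cost events, every
# dollar maps to a named unit of work you can replay and price.
```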

At small scale many teams absolutely *don’t* bother. But once you have enough customers that a few pathological workflows can distort margins, I think you need run-level accounting, not just observability dashboards.

The setup that’s worked for me: Phoenix (arize-phoenix) with the CrewAI + Anthropic OpenInference instrumentors. The CrewAI one captures orchestration spans, the Anthropic one captures per-call token counts (`llm.token_count.prompt` / `llm.token_count.completion`). Both activate automatically with `auto_instrument=True`.
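
Setup is roughly the following (worth checking against your Phoenix version; `register` comes from the arize-phoenix-otel package):

```python
import phoenix as px
from phoenix.otel import register

px.launch_app()  # local Phoenix UI; skip if you point at a hosted collector

tracer_provider = register(
    project_name="agent-costs",  # project name is arbitrary
    auto_instrument=True,        # activates installed OpenInference instrumentors
)
```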

For customer attribution, I believe OpenInference has a context manager that tags all child spans:

```python
from openinference.instrumentation import using_attributes

with using_attributes(metadata={"customer_id": "cust-123"}):
    crew.kickoff()  # all spans created inside inherit customer_id
```

Then `px.Client().get_spans_dataframe()` gives you everything in a DataFrame: group by customer, multiply tokens by rate ($1/$5 per MTok for Haiku 4.5), done.
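
A rough version of that rollup, assuming the OpenInference attributes land flattened as `attributes.*` columns (worth verifying on your Phoenix version; the metadata column sometimes comes back as a JSON string):

```python
import json
import phoenix as px

df = px.Client().get_spans_dataframe()

def customer_of(meta):
    # Metadata set via using_attributes; may arrive as a dict or JSON string.
    if isinstance(meta, str):
        meta = json.loads(meta)
    return meta.get("customer_id") if isinstance(meta, dict) else None

PROMPT_RATE, COMPLETION_RATE = 1 / 1e6, 5 / 1e6  # Haiku 4.5: $1 / $5 per MTok

df["customer_id"] = df["attributes.metadata"].map(customer_of)
df["cost"] = (
    df["attributes.llm.token_count.prompt"].fillna(0) * PROMPT_RATE
    + df["attributes.llm.token_count.completion"].fillna(0) * COMPLETION_RATE
)
print(df.groupby("customer_id")["cost"].sum().sort_values(ascending=False))
```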

To @skillforge_team’s point on the run ledger - agreed for billing. The trace approach adds the diagnostic layer: when one customer’s runs cost 3x more, you can see it was the research agent re-running a tool 6 times, not just the total.