We currently have a crew capable of answering data analytics questions using datasets stored in Google BigQuery. This crew generates and executes BigQuery statements based on user queries, consolidates the query results, and then provides responses directly to users.
Could you please clarify the following:
Is the concept of groundedness applicable in this scenario? (My understanding is that groundedness typically pertains to Retrieval-Augmented Generation (RAG) applications.)
If groundedness does indeed apply, given our current implementation, how can we ensure our responses remain grounded in the dataset?
Yep, groundedness definitely applies to your BigQuery-based analytics system. Your setup differs from a traditional RAG architecture (which is probably why you're asking), but at its heart groundedness is about making sure the AI's responses are actually backed by verifiable data sources, not by stuff the model cooked up on its own.
“To ensure” is a strong phrase, and I'd advise against treating it as a sure thing. Whenever you're designing agentic systems, whether they're workflows or full-blown agents, never assume you can ensure anything. Feeding an LLM a grounded context (whether from traditional RAG or from your BigQuery setup) doesn't guarantee it won't hallucinate. So your best bet is to apply solid techniques to mitigate hallucinations (a quick sketch follows the list):
Keep temperature settings low (like 0.1-0.3).
Log the generated SQL for auditing and verification.
Format your query results in a consistent, machine-readable structure.
Cap the result set size so query output doesn't overwhelm the context window.
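To make the last three points concrete, here's a minimal sketch assuming the standard google-cloud-bigquery client. The function name run_grounded_query and the MAX_ROWS cap are just illustrative choices on my part, not anything CrewAI or BigQuery prescribe; something along these lines can be wrapped as the tool your SQL-executing agent calls:

```python
import json
import logging

from google.cloud import bigquery

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("bq_audit")

MAX_ROWS = 500  # cap the result set so it can't blow up the context window


def run_grounded_query(sql: str) -> str:
    """Execute a generated SQL statement, log it for auditing,
    and return the rows in a consistent, machine-readable format."""
    logger.info("Generated SQL: %s", sql)  # audit trail for the generated statement

    client = bigquery.Client()
    rows = client.query(sql).result(max_results=MAX_ROWS)

    records = [dict(row) for row in rows]  # uniform structure the LLM can cite verbatim
    logger.info("Returned %d rows (capped at %d)", len(records), MAX_ROWS)
    return json.dumps(records, default=str)  # default=str handles dates/decimals
```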
Prompt Engineering: This is probably the best piece of advice I can give you, because it often gets underestimated when we're working with frameworks. Frameworks are great for handling boilerplate and hiding the tedious stuff, and, like CrewAI, they aim to be elegant. But please don't fall into the trap of writing a one-paragraph Task.description and expecting consistent output. Seriously, don't do it.

For every critical step in the solution you're building, I really encourage you to craft a good old-fashioned prompt with some test data and feed it to your LLM's chatbot (yes, just the basic chatbot, no fancy features). Keep refining the prompt and applying prompt engineering techniques until the chatbot starts giving you more “predictable” responses. Use Chain-of-Thought prompting to make its reasoning explicit, and include system messages like: “You must only reference information from the query results. If the data is insufficient to answer the question, say so directly.”

Once your prompt is mature enough, transfer those instructions over to your Agent and Task definitions, and I'm pretty sure you'll start feeling much more confident about the results.
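Once those instructions have stabilized, moving them into CrewAI looks roughly like the sketch below. Treat it as an assumption-laden illustration: it assumes a recent CrewAI release where LLM, Agent, and Task are importable directly from crewai, the model string is a placeholder, and the role, goal, and description wording is just an example of where the grounding instructions and the Chain-of-Thought nudge end up.

```python
from crewai import Agent, Task, LLM

# Placeholder model string; keep temperature low to reduce creative drift.
llm = LLM(model="your-provider/your-model", temperature=0.2)

analyst = Agent(
    role="BigQuery Data Analyst",
    goal="Answer analytics questions strictly from the BigQuery query results",
    backstory=(
        "You must only reference information from the query results. "
        "If the data is insufficient to answer the question, say so directly. "
        "Never invent numbers, table names, or column values."
    ),
    llm=llm,
)

answer_task = Task(
    description=(
        "Answer the user's question: {question}\n"
        "Think step by step: first list which rows and columns from the query "
        "results support each claim, then write the final answer, citing the "
        "exact values you used."
    ),
    expected_output=(
        "A concise answer in which every figure is traceable to a row in the "
        "query results, or an explicit statement that the data is insufficient."
    ),
    agent=analyst,
)
```

The {question} placeholder gets filled in at run time through crew.kickoff(inputs={"question": ...}).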
For critical applications, the next step is to implement a review process for the responses, likely involving both another agent and a human. To give you an example: in a pilot project I did for a small law firm, after querying legal precedents (our grounded data), I implemented a second, fact-checking stage in which another LLM would simply re-run the queries and verify that the first LLM had relied exclusively on actual case-law citations. In the third and final stage, the final report was still reviewed by a real lawyer. It's a real-world example of “Trust, but Verify,” which Mike Conover talked about in this AI Engineer video.
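Purely for illustration, and building on the previous sketch (it reuses llm and answer_task), a second-stage reviewer in CrewAI terms could look something like this. The wording is invented, human_input=True assumes a CrewAI version that supports pausing for human sign-off, and this is the shape of the idea rather than what I actually shipped for the law firm:

```python
from crewai import Agent, Task

fact_checker = Agent(
    role="Fact-Checking Reviewer",
    goal="Verify that every claim in the draft answer is backed by the query results",
    backstory=(
        "You re-examine the draft answer against the raw query results and flag "
        "any number, name, or citation that does not appear in the data."
    ),
    llm=llm,  # reuse the low-temperature LLM from the previous sketch
)

review_task = Task(
    description=(
        "Compare the analyst's draft answer with the attached query results. "
        "List every unsupported claim, or state 'fully grounded' if none are found."
    ),
    expected_output="A verification report listing unsupported claims, if any.",
    agent=fact_checker,
    context=[answer_task],  # receives the analyst's output as its input context
    human_input=True,       # the final report still goes through a human reviewer
)
```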
I hope these simple, general tips help you make your project a success story. Good luck!