Hi,
I am building a CrewAI agent that converts natural-language questions to SQL (BigQuery). The agent needs to understand my BigQuery dataset and business logic so it can handle business-specific jargon, so I have provided two documents for this purpose: a data dictionary and the business logic. I use two agents. The first is a Data Explorer that calls list_tables and get_schema, with the mentioned documents as additional reference. The second is a Query Agent that generates the SQL query from the schema and business logic, verifies it with the sql_verifier tool, and finally runs it with the execute_sql tool. I then use an LLM separately to analyze the results and send them back to the frontend. I am using GPT-4.1 and GPT-4.1 mini.
The approach is good in terms of accuracy, but my main issue is response speed. My average response time is around 30 seconds; the best is 15 seconds and the worst 60 seconds.
I have two questions that I want help with:
- How can I improve my agents' response time?
- How can I serve multiple users at the same time?
Right now, if one user's query is being processed and a second user asks a question, the second user has to wait a long time because the first request must complete first. (I built a FastAPI backend and a Streamlit frontend for the demo.)
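To show what I mean by the blocking behavior, here is a minimal stdlib sketch (`run_crew` is a stand-in for a synchronous `crew.kickoff` call, not a real CrewAI API): offloading the blocking call to a worker thread with `asyncio.to_thread` lets two requests overlap instead of queuing.

```python
import asyncio
import time

def run_crew(question: str) -> str:
    # Stand-in for a blocking crew.kickoff(inputs={"query": question}) call.
    time.sleep(0.2)
    return f"result for: {question}"

async def handle(question: str) -> str:
    # Offload the blocking call to a worker thread so the event loop
    # stays free to accept other users' requests.
    return await asyncio.to_thread(run_crew, question)

async def main():
    start = time.perf_counter()
    # Two "users" ask at once; both are served concurrently.
    results = await asyncio.gather(handle("q1"), handle("q2"))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))  # elapsed ~0.2 s, not ~0.4 s
```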
I have also provided some code.
Data Dictionary and Logic
BUSINESS_RULES = """
1. Completed Revenue = SUM(revenue) filtered by completed_on_date for date range where job_status = 'Completed'.
2. Total Revenue (or simply Revenue) = SUM(revenue) filtered by created_on_date. Do not filter by job_status.
3. Created Jobs or Total Jobs = COUNT(DISTINCT job_id) using created_on_date. Include all job_status ['Completed', 'Canceled', 'In Progress', 'Scheduled', 'Hold'].
4. Completed Jobs = COUNT(DISTINCT job_id) using completed_on_date where job_status = 'Completed'.
5. Cancelled Jobs = COUNT(DISTINCT job_id) using created_on_date where job_status = 'Canceled'.
6. In Progress Jobs = COUNT(DISTINCT job_id) using created_on_date where job_status = 'In Progress'.
7. Scheduled Jobs = COUNT(DISTINCT job_id) using created_on_date where job_status = 'Scheduled'.
8. Hold Jobs = COUNT(DISTINCT job_id) using created_on_date where job_status = 'Hold'.
9. New Customers = COUNT(DISTINCT customer_id) where customer_created_on_date_job_created >= created_on_date.
10. Active Customers = COUNT(DISTINCT customer_id) where:
- job_status = 'Completed'
- completed_on_date is between the most recent date (end of time period) and 36 months prior.
- Use: DATE_SUB(DATE(start_date), INTERVAL 36 MONTH) as the lower bound.
"""
DATA_DICTIONARY = """Use only the following tables and columns to generate SQL queries. Ignore all other sources.
---
Table Usage Rules:
1. **demo_jobs (Primary Table)**
- Use for all general job and customer queries.
- Allowed date columns: created_on_date, completed_on_date, scheduled_on_date, customer_created_on_date_job_created
2. **demo_memberships**
- Use only if the query explicitly involves "member" or "membership".
- Always use date columns, brands, regions, and other columns from demo_jobs for filtering.
---
Column Reference:
demo_jobs:
- job_id: Unique ID for each job.
- region: Region of the job.
- customer_id: Unique ID for each customer; may appear multiple times.
- csr_name: CSR who booked and assigned the job.
- agent_name: Technician responsible for completing the job.
- business_unit_name: Business unit names.
- trades_group: Trade group names.
- trades_type: Trade types.
- revenue: Revenue in USD generated by the job.
- job_status: ['Completed', 'Canceled', 'In Progress', 'Scheduled', 'Hold']
- campaign_name: Source marketing campaign of the job.
- created_on_date: Date when the job was created.
- completed_on_date: Date when the job was completed.
- scheduled_on_date: Date when the job was scheduled.
- brand: Brand name extracted from business_unit_name.
- customer_created_on_date_job_created: The date the customer was created (same as their first job creation date).
demo_memberships:
- membership_id: Unique ID for each membership.
- customer_id: Foreign key to demo_jobs.
- customer_name: Name of the customer.
- customer_type: 'Residential' or 'Commercial'
- city: Customer's city.
- membership_type: Describes duration and type of membership (e.g., '1 Yr', 'Monthly', '2 Year').
---
Guidelines:
- Always prefer demo_jobs unless "membership" related question is asked.
- All date-based filters must use dates from demo_jobs, even for membership queries.
- Do not reference columns not listed above.
- Keep all queries focused and aligned with these rules to ensure relevance and accuracy."""
Prompts
DATA_EXPLORER_BACKSTORY = """
You are an Expert Data Explorer.
You are in charge of exploring the data in the database. Your goal is to list the tables using "list_tables" and carefully select relevant tables according to
the user question. Then extract their schema using "tables_schema" tool.
---
System Configuration:
- Project ID: {project}
- Dataset ID: {dataset}
---
Guidelines:
- Use `list_tables` to see all tables, then `tables_schema` to inspect them.
- Carefully analyze the question and think which are the relevant tables using data dictionary.
- Then use "tables_schema" tool to inspect only the relevant tables and column structures.
- Always use data dictionary to understand the table relationships and relevant columns.
---
References:
**Data Dictionary**:
{data_dictionary}
"""
SQL_GENERATOR_BACKSTORY = """
You are an Expert Data Engineer.
Your goal is to:
1. Generate a valid production-grade BigQuery SQL query based on the user question using the provided Data Dictionary, Business Rules, and Date Filtering Rules.
2. Verify the generated SQL query using the tool 'check_sql'.
3. Execute the verified SQL query using the tool 'execute_sql'.
---
Data Handling Rules:
- Only use columns listed in the provided Data Dictionary.
- Use **fully-qualified table names** in the format `{project}.{dataset}.<table_name>`.
- Always **filter out `NULL` and `'None'` values** from any date fields or categorical fields before processing.
- Ensure all filters are valid for the intended data type (e.g., wrap `DATETIME` or `TIMESTAMP` columns with `DATE()` for comparison).
- Avoid string casting of dates unless absolutely necessary.
---
Date Filtering Rules:
- Always use the `BETWEEN` operator when filtering over a date range.
- Use `DATE(column_name)` to normalize any date field before filtering.
1. **Last Week (Calendar Week)**
- Definition: Previous full week, **Monday through Sunday**.
- Start: `DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)), INTERVAL 7 DAY)`
- End: `DATE_SUB(DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY)), INTERVAL 1 DAY)`
2. **This Week (Current Week-to-Date)**
- Start: `DATE_TRUNC(CURRENT_DATE(), WEEK(MONDAY))`
- End: `CURRENT_DATE()`
3. **Past 7 Days (Rolling)**
- Start: `DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)`
- End: `DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)`
4. **Previous Month (Calendar Month)**
- Start: `DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)`
- End: `LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))`
5. **This Month (Month-to-Date)**
- Start: `DATE_TRUNC(CURRENT_DATE(), MONTH)`
- End: `CURRENT_DATE()`
6. **Year-to-Date (YTD)**
- Start: `DATE_TRUNC(CURRENT_DATE(), YEAR)`
- End: `CURRENT_DATE()`
7. **Last 30 Days (Rolling)**
- Start: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- End: `CURRENT_DATE()`
8. **Previous Quarter (Calendar Quarter)**
- Previous full quarter (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec), not including current quarter.
- Start: `DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 QUARTER), QUARTER)`
- End: `LAST_DAY(DATE_SUB(DATE_TRUNC(CURRENT_DATE(), QUARTER), INTERVAL 1 DAY))`
9. **This Quarter (Calendar Quarter)**
- From the start of the current quarter through today.
- Start: `DATE_TRUNC(CURRENT_DATE(), QUARTER)`
- End: `CURRENT_DATE()`
10. **Same Quarter Last Year**
- The same quarter last year (for year-over-year comparison).
- Start: `DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR), QUARTER)`
- End: `LAST_DAY(DATE_ADD(DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR), QUARTER), INTERVAL 2 MONTH))`
11. **Past 90 Days (Rolling)**
- Last 90 days, not including today.
- Start: `DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)`
- End: `DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)`
12. **First Quarter (Q1 of Current Year)**
- The first calendar quarter, January 1 through March 31 of the current year.
- Start: `DATE_TRUNC(CURRENT_DATE(), YEAR)`
- End: `LAST_DAY(DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 2 MONTH))`
13. **Second Quarter (Q2 of Current Year)**
- Second quarter of the current year (Apr 1 to Jun 30).
- Start: `DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 3 MONTH)`
- End: `LAST_DAY(DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 5 MONTH))`
14. **Third Quarter (Q3 of Current Year)**
- Third quarter of the current year (Jul 1 to Sep 30).
- Start: `DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 6 MONTH)`
- End: `LAST_DAY(DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 8 MONTH))`
15. **Fourth Quarter or last quarter (Q4 of Current Year)**
- Fourth quarter of the current year (Oct 1 to Dec 31).
- Start: `DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 9 MONTH)`
- End: `LAST_DAY(DATE_ADD(DATE_TRUNC(CURRENT_DATE(), YEAR), INTERVAL 11 MONTH))`
---
SQL Query Guidelines:
- Always apply the Business Rules when computing metrics.
- Always refer to Data Dictionary for table relations and relevant columns.
- Use **clear, business-relevant aliases** (e.g., `scheduled_job_count`, `active_customers`, `avg_days_to_schedule`).
- Use `LIMIT` for multi-row result sets unless otherwise stated.
- Prefer **CTEs only when necessary** for readability or reuse.
- Avoid SELECT *; explicitly name columns.
- Ensure **logical consistency**, **performance efficiency**, and **syntactic correctness**.
- Format queries cleanly with proper indentation and keyword casing.
---
Available References:
**Data Dictionary**:
{data_dictionary}
**Business Rules and Logic**:
{business_rules}
Respond only with the final SQL query; no explanations unless explicitly asked.
"""
Agent Definitions
data_explorer = Agent(
role="Data Explorer",
goal="Explore the data in the database. List the tables and schemas of the tables.",
backstory=PromptTemplates.get_prompt('DATA_EXPLORER_BACKSTORY', data_dictionary=DATA_DICTIONARY, business_rules=BUSINESS_RULES, dataset=DATASET, project=PROJECT),
tools=[list_tables, tables_schema],
verbose=VERBOSE,
max_iter=10,
max_execution_time=90,
llm=secondary_llm,
)
sql_generator = Agent(
role="SQL Generator",
goal="""Execute verified and Optimized SQL query for BigQuery. The SQL queries should reflect business logic and domain understanding.
Always use clear column aliases and include appropriate filters for data quality.""",
backstory=PromptTemplates.get_prompt('SQL_GENERATOR_BACKSTORY', data_dictionary=DATA_DICTIONARY, business_rules=BUSINESS_RULES, dataset=DATASET, project=PROJECT),
tools=[check_sql, execute_sql],
verbose=VERBOSE,
max_iter=10,
max_execution_time=90,
llm=main_llm,
)
# Task Definitions for Agents
explore_data = Task(
description="Explore the BigQuery Dataset and Tables. Decide which tables are relevant for the query: {query}. Extract schemas of the relevant tables.",
expected_output="The output of 'list_tables' should be a string of table names comma separated. And the output of schema should be a dictionary with table name as key and schema as value.",
agent=data_explorer,
async_execution=True,
)
generate_sql = Task(
description="""Execute the generated Optimized SQL query for BigQuery for question: {query}.
Make sure to use clear column aliases that will be preserved in the output.
    Please respond ONLY with a valid JSON object following this precise structure:
{{
"query" (str): < The user question >,
"sql_query" (str): < The optimized SQL query >,
"query_result" (list[dict[str, Any]]): < The result of the query >
}}
Output Guidelines:
- Ensure the JSON is properly formatted and parsable.
    - Do NOT output any text outside the JSON object.
""",
expected_output="""A JSON object with the following keys: 'query', 'sql_query', 'query_result'""",
agent=sql_generator,
context=[explore_data],
output_pydantic=QueryResult,
)
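One latency idea I am experimenting with (a sketch with stand-in names, not the real tool internals): since the dataset schema is static, the lookup inside the explorer's schema tool could be memoized so that only the first request pays the BigQuery round trip.

```python
from functools import lru_cache

CALLS = {"schema_fetches": 0}

@lru_cache(maxsize=None)
def cached_table_schema(table_name: str) -> str:
    # Stand-in for the BigQuery client call inside the tables_schema tool.
    # The dataset schema rarely changes, so cache it per process: repeated
    # user questions then skip the network round trip entirely.
    CALLS["schema_fetches"] += 1
    return f"schema-for-{table_name}"

# First request hits BigQuery; every later request is served from cache.
for _ in range(3):
    cached_table_schema("demo_jobs")
print(CALLS["schema_fetches"])  # 1
```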