Hi Support Team,
I would like to understand how the crew/agent behaves when a tool we design returns a large output that, in general, exceeds the token limit. How does the crew handle this? Does it put the whole output into the next prompt? If so, how does the provider respond, and how does the agent handle it gracefully?
I'm also interested in this topic. Is there a way to pass a large amount of data (that exceeds the token limit) to another agent?
Hey there, here’s a Reddit thread on tool output issues where others faced similar problems.
I’ve also bumped into the same issue when using tools that output a lot. It seems like there’s no built-in way to cap the output from these tools directly. I noticed this especially when working with the FileReadTool; configuring it to split up the responses into smaller parts didn’t work. Even when I set max_tokens and max_completion_tokens to a small number on the LLM object, it still throws errors:
llm.py-llm:187 - ERROR: LiteLLM call failed: litellm.BadRequestError: Error code: 400 - {'error': {'message': 'litellm.BadRequestError: VertexAIException BadRequestError - {\n "error": {\n "code": 400,\n "message": "The input token count (2036021) exceeds the maximum number of tokens allowed (1000000).",\n "status": "INVALID_ARGUMENT"\n }
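For reference, the setup I was testing looked roughly like this (model name and file path are just illustrative):

from crewai import LLM
from crewai_tools import FileReadTool

# Capping the completion size on the LLM object.
llm = LLM(
    model="vertex_ai/gemini-1.5-flash",
    max_tokens=512,
    max_completion_tokens=512
)

# The tool still returns the whole file, so as far as I can tell these caps
# only limit the model's output and do nothing about the huge input that the
# tool result adds to the next prompt.
file_read_tool = FileReadTool(file_path="very_large_file.txt")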
any solution other than changing the llm model?
You might need to use RAG; that way, you can control the context size. Other than that, I can’t think of anything else right now.
I thought RAG and [context] between tasks handled this but I may be wrong.
Alright folks, lemme throw in my two cents on this topic. But before I dive in, let me make one thing crystal clear:
There’s no magic bullet for cramming more context into your LLM than its context window can handle. Any attempt to do so will inevitably boil down to some form of content synthesis, and let’s be honest, any synthesis means you’re losing something, okay? The RAG technique is still one of the most solid and reliable approaches because the way it trims things down uses semantic criteria to try and fetch the best chunks. This shrinks the context but actually boosts the quality of the answers. So, yeah, what @zinyando and @jklre suggested is definitely your best bet.
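Just so we’re all picturing the same thing, “RAG” here basically means: keep the big document outside the prompt and only pull the most relevant chunks into the Agent’s context. Here’s a bare-bones sketch of that idea as a custom tool (the tool name is made up, the file path is a placeholder, and keyword overlap stands in for real embeddings, so treat it as an illustration rather than a drop-in solution):

from typing import Type
from pydantic import BaseModel, Field
from crewai.tools import BaseTool

class RetrievalInput(BaseModel):
    query: str = Field(..., description="What to look for in the document.")

class NaiveRetrievalTool(BaseTool):
    name: str = "Naive Retrieval Tool"
    description: str = "Returns only the document chunks most relevant to a query."
    args_schema: Type[BaseModel] = RetrievalInput
    file_path: str = "some_big_file.txt"  # placeholder path
    chunk_size: int = 1000
    top_k: int = 5

    def _run(self, query: str) -> str:
        with open(self.file_path, encoding="utf-8") as f:
            text = f.read()
        chunks = [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]
        # Rank chunks by naive keyword overlap with the query; a real RAG
        # setup would use embeddings and a vector store instead.
        words = set(query.lower().split())
        chunks.sort(key=lambda c: sum(w in c.lower() for w in words), reverse=True)
        return "\n---\n".join(chunks[:self.top_k])

Whatever the retrieval trick, the point is the same: the Agent only ever sees around top_k * chunk_size characters per call, no matter how big the underlying file is.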
Now, for those situations where you figure RAG isn’t strictly necessary, but you still gotta watch that context window limit (like when an Agent keeps calling tools over and over for the same task, and the context from those tools just keeps piling up), nothing beats having the tool itself offer some way to truncate the content. To encourage y’all to get into the habit of cooking up custom tools that actually fit what your projects need, I’m sharing some simple alternative versions of FileReadTool and ScrapeWebsiteTool. These are two of the usual suspects for retrieving context, and they can easily blow past those limits. The cool thing about these alternative versions is they give you four ways to retrieve content:
- Full content extraction: This makes the tools behave just like their counterparts in crewai_tools – fully compatible.
- Partial reads based on character limits (head truncation): This is your classic truncation. The tool just grabs the first max_chars characters from the content. For some things, that’s all you need.
- Partial reads based on chunking (random chunks): With this mode, the content gets sliced into 1000-character chunks. max_chars is rounded down to the nearest multiple of 1000, and that many chunks (max_chars / 1000) are returned. The kicker is, the first and last chunks of the original content are always included, and the ones in between are picked at random. This mode is pretty handy if you need to keep a good mix of the content while making sure you don’t send more than max_chars to your Agent’s LLM. (There’s a rough sketch of this selection logic right after the trade-off note below.)
- AI-powered summarization: Here, you pass in a crewai.LLM, and it takes care of summarizing the content. To stop the summarizer LLM itself from hitting its context limit, it uses method #3 (random chunking) to pull up to 34,000 characters (around 8,500 tokens) and then whips up a summary of up to 6,000 characters (around 1,500 tokens). How many tokens you actually get back depends on how well the summarizer LLM plays ball. Now, this summarization LLM doesn’t need to be (and frankly, I don’t think it should be) super sophisticated. Summarizing is a pretty straightforward job, and any LLM with a 12,000-token context window can handle it, so you can get good summaries without spending a fortune.
Just keep in mind that these three truncation methods come with a trade-off: your Agent won’t be getting the entire content anymore (though, let’s be real, it might not have been getting it all anyway, which is probably why we’re even talking about this).
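For the curious, the random-chunks mode boils down to something like this (a rough sketch of the selection logic; the actual tool handles a few more edge cases):

import random

def pick_random_chunks(content: str, max_chars: int, chunk_size: int = 1000) -> str:
    # Round max_chars down to a whole number of chunks, keeping room for at
    # least the first and last chunk.
    budget = max(2, max_chars // chunk_size)
    chunks = [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)]
    if len(chunks) <= budget:
        return content
    # Always keep the first and last chunk; fill the rest of the budget with
    # randomly sampled middle chunks, kept in their original order.
    middle = sorted(random.sample(range(1, len(chunks) - 1), budget - 2))
    return "".join(chunks[i] for i in [0, *middle, len(chunks) - 1])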
You can find the alternative version of FileReadTool here, and the alternative version of ScrapeWebsiteTool here. In the examples below, I used this file (Dracula, a hefty 865,173 characters) to test out the FileReadTool, and this site (24,319 characters) for the ScrapeWebsiteTool.
VersatileFileReadTool - Full Content Extraction (Default/Compatibility Mode):
from versatile_file_read_tool import VersatileFileReadTool
file_read_tool = VersatileFileReadTool()
content = file_read_tool.run(
file_path="dracula_by_bram_stoker.txt"
)
print(f"Retrieved {len(content)} chars")
# Using Tool: File Reading Tool
# Retrieved 865173 chars
VersatileFileReadTool - Partial Extraction (Head Truncation Mode):
from versatile_file_read_tool import VersatileFileReadTool
file_read_tool = VersatileFileReadTool(
retrieval_mode="head",
max_chars=13002
)
content = file_read_tool.run(
file_path="dracula_by_bram_stoker.txt"
)
print(f"Retrieved {len(content)} chars")
# Using Tool: File Reading Tool
# Retrieved 13002 chars
VersatileFileReadTool - Partial Extraction (Random Chunking Mode):
from versatile_file_read_tool import VersatileFileReadTool
file_read_tool = VersatileFileReadTool(
retrieval_mode="random_chunks",
max_chars=13321
)
content = file_read_tool.run(
file_path="dracula_by_bram_stoker.txt"
)
print(f"Retrieved {len(content)} chars")
# Using Tool: File Reading Tool
# Retrieved 12212 chars
VersatileFileReadTool - AI-Powered Extraction (Summarization Mode):
from versatile_file_read_tool import VersatileFileReadTool
from crewai import LLM
import os
os.environ["GEMINI_API_KEY"] = "<YOUR_API_KEY>"
gemini_llm = LLM(
model="gemini/gemini-2.5-flash-preview-04-17",
temperature=0.3
)
file_read_tool = VersatileFileReadTool(
retrieval_mode="summarize",
llm=gemini_llm
)
content = file_read_tool.run(
file_path="dracula_by_bram_stoker.txt"
)
print(f"Retrieved {len(content)} chars")
print(content)
# Using Tool: File Reading Tool
# Retrieved 3967 chars
# This text contains excerpts from Bram Stoker's novel *Dracula*, presented...
VersatileScrapeWebsiteTool - AI-Powered Extraction (Summarization Mode):
from versatile_scrape_website_tool import VersatileScrapeWebsiteTool
from crewai import LLM
import os
os.environ["GEMINI_API_KEY"] = "<YOUR_API_KEY>"
gemini_llm = LLM(
model="gemini/gemini-2.5-flash-preview-04-17",
temperature=0.3
)
scrape_website_tool = VersatileScrapeWebsiteTool(
retrieval_mode="summarize",
llm=gemini_llm
)
content = scrape_website_tool.run(
website_url="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/"
)
print(f"Retrieved {len(content)} chars")
print(content)
# Using Tool: Website Scraping Tool
# Retrieved 2621 chars
# Google Cloud has announced the Agent2Agent Protocol (A2A), a new open protocol...
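And to close the loop, dropping one of these tools into a crew works the same as any other tool. The agent and task below are just placeholders to show the wiring:

from crewai import Agent, Task, Crew
from versatile_file_read_tool import VersatileFileReadTool

file_read_tool = VersatileFileReadTool(
    retrieval_mode="random_chunks",
    max_chars=13000
)

analyst = Agent(
    role="Literature Analyst",
    goal="Describe the main themes of large text files without blowing past the context window",
    backstory="You analyze long documents from partial reads.",
    tools=[file_read_tool],
    verbose=True
)

theme_task = Task(
    description="Read dracula_by_bram_stoker.txt and describe its main themes.",
    expected_output="A short thematic summary.",
    agent=analyst
)

crew = Crew(agents=[analyst], tasks=[theme_task])
print(crew.kickoff())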
Thank you for this… This is much needed as we start to push up against context windows on our projects. It would be great if these ideas sat alongside the official CrewAI docs.
Keep them coming