Hierarchical process and context loss (Long context vs RAG) for documents

Is there a pattern that solves the following issue?

I have a large document (~5MB) that I want to keep in context across the conversation, especially when the manager asks questions or delegates to an agent (or retries until it gets a final answer). There are times when the manager summarizes or re-interprets the task description, so the document context is lost and the run goes badly off track.

I would use the knowledge module; however, it relies on embeddings, and I can't use a single chunk for the whole document because I hit a maximum payload size error imposed by the Google embedder (`google.api_core.exceptions.InvalidArgument: 400 Request payload size exceeds the limit: 36000 bytes.` raised during upsert). If I use smaller chunks, retrieval never returns enough of the document, or all of the relevant results, to produce a good answer.

So I prefer to use long context (1 million input tokens is more than enough) rather than a RAG approach. I'm not worried about the number of tokens used by sending the whole document with every request; the LLM does a much better job of returning quality results when it has the whole document in context each time (the results are deterministic in nature), and it then knows to refer to the provided document rather than relying on training data or any grounded results.
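For illustration, "sending the whole document each time" looks roughly like this on my side (a minimal sketch only; the file path, agent definition, and manager model are placeholders):

```python
from pathlib import Path
from crewai import Agent, Crew, Process, Task

# Placeholder path; the real document is ~5MB of text.
full_document = Path("large_document.txt").read_text()

analyst = Agent(
    role="Document analyst",
    goal="Answer questions strictly from the provided document",
    backstory="Works only from the reference text supplied in the task.",
)

# Inlining the whole document into the initial task description works for
# the first call, but the manager's summarized / re-interpreted delegations
# can drop it, which is where things go off track.
task = Task(
    description=(
        "Answer the question using ONLY the document below.\n\n"
        f"<document>\n{full_document}\n</document>"
    ),
    expected_output="An answer grounded entirely in the provided document.",
    agent=analyst,
)

crew = Crew(
    agents=[analyst],
    tasks=[task],
    process=Process.hierarchical,
    manager_llm="gemini/gemini-1.5-pro",  # placeholder manager model
)
```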

Since callbacks and guardrails run after the task has executed, I can't see any mechanism to force the document to be inserted before the request is passed to the LLM.

So I'm out of ideas at the moment, other than to suggest it would be good if the knowledge module supported full-text attachment (rather than chunking/embedding), so that each request generated by the manager would keep the document in context.

Is there any approach to solve this, or is this an enhancement?

You really seem set on using the entire document in your context window, don’t you? So, I believe (and this is just a guess) that you could create a custom tool, as explained in the documentation here. Your custom tool would have the sole function of loading and delivering your file. Then you provide this tool to your agent. This way (at least in theory), whenever it calls the tool, it would get the entire content of your file in the context.
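If it helps, here is a minimal sketch of that idea (the class name and file path are placeholders, and this assumes a recent CrewAI version; older releases import `BaseTool` from `crewai_tools` instead):

```python
from pathlib import Path

from crewai.tools import BaseTool  # older versions: from crewai_tools import BaseTool


class FullDocumentTool(BaseTool):
    """Return the complete document, with no chunking or embedding involved."""

    name: str = "Full document loader"
    description: str = (
        "Returns the ENTIRE reference document. Always call this tool "
        "before answering any question about the document."
    )
    file_path: str = "large_document.txt"  # placeholder path

    def _run(self) -> str:
        # Read and return the whole file so it lands in the agent's context.
        return Path(self.file_path).read_text()
```

You would then pass it to the agent with `tools=[FullDocumentTool()]` and, ideally, tell the agent in its instructions to always call the tool before answering.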

A less exotic possibility would be a custom RAG tool, maybe using the new gemini-embedding-exp-03-07 model, with parameters along these lines:

- `chunk_size = 20000` to generate large chunks of 20k characters,
- `chunk_overlap = 1000` for 1k characters of overlap,
- `summarize = False` to receive the raw chunks, and
- `number_documents = 5` to retrieve a large enough number of chunks for your context after the search.
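As a rough sketch of what that could look like with `RagTool` from `crewai_tools` (which wraps embedchain): the exact config keys, the place where `number_documents` goes, and the model string may differ between versions, so treat all of the names below as assumptions to check against your install:

```python
from crewai_tools import RagTool

rag_tool = RagTool(
    summarize=False,  # return the raw retrieved chunks rather than a summary
    config={
        "embedder": {
            "provider": "google",
            "config": {"model": "models/gemini-embedding-exp-03-07"},
        },
        "chunker": {
            "chunk_size": 20000,    # large chunks of ~20k characters
            "chunk_overlap": 1000,  # ~1k characters of overlap
        },
        # Assumption: some versions expose number_documents (chunks returned
        # per query) via the llm section of the embedchain config.
        "llm": {"config": {"number_documents": 5}},
    },
)
rag_tool.add("large_document.txt")  # placeholder path to your document
```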

Thanks, much appreciated! I will give the custom tool approach a try. Hopefully the manager always triggers the agent's tool use when asking questions, and during the back and forth with the LLM to reach a final answer, not just on the first call.

Regarding the newer embedder: even though it supports bigger chunks, I suspect it still won't return good enough results across different sections, but I might give it a try.