PDF reader agent returns a summary of the last page

Hello!

I have seen many examples where an agent is responsible for reading the content of a file (in my case, a PDF from a URL) and returning the file content verbatim.

Ex.: Efficient PDF Summarization with CrewAI and Intel® XPU Optimization - Intel Community

What I observe is that the “Agent Final Answer” for this task is a summary of the last PDF page. For example, in a technical paper where the last page shows the biographies of the authors, the “Agent Final Answer” is a summary of the biographies. What I would like to have is the whole text verbatim.

Code:

import logging
from dotenv import load_dotenv
from crewai import Agent, Task, Crew, Process, LLM
from crewai.tools import BaseTool
from PyPDF2 import PdfReader
import os
from typing import Dict
import urllib.request
import io

logging.basicConfig(level=logging.DEBUG, format='%(asctime)-15s [%(levelname)s] %(message)s')

class PDFReaderTool(BaseTool):
    name: str = "PDF Reader"
    description: str = "Reads the content of a PDF file and returns the text."

    def _run(self, pdf_url: str) -> Dict[str, str]:
        req = urllib.request.Request(pdf_url, headers={'User-Agent': 'Mozilla/5.0'})
        # Open the URL and read the content as bytes
        with urllib.request.urlopen(req) as response:
            remote_file_bytes = response.read()
        # Create an in-memory binary stream
        pdf_file_obj = io.BytesIO(remote_file_bytes)

        reader = PdfReader(pdf_file_obj)
        # extract_text() can return None for pages without a text layer
        extracted_text = "".join(page.extract_text() or "" for page in reader.pages)
        return {'extracted_text': extracted_text}

def main():
    logging.info(f"extract_text_verbatim.main()")

    pdf_url = r"https://arxiv.org/pdf/2507.23746"

    load_dotenv()
    # Check for .env variables
    ollama_model_name = os.getenv('OLLAMA_MODEL_NAME')
    if not ollama_model_name:
        raise EnvironmentError("OLLAMA_MODEL_NAME not found in the environment variables")
    ollama_api_base = os.getenv('OLLAMA_API_BASE')
    if not ollama_api_base:
        raise EnvironmentError("OLLAMA_API_BASE not found in the environment variables")
    crewai_disable_telemetry = os.getenv('CREWAI_DISABLE_TELEMETRY')
    if not crewai_disable_telemetry:
        raise EnvironmentError("CREWAI_DISABLE_TELEMETRY not found in the environment variables")

    pdf_reader_tool = PDFReaderTool()

    llm = LLM(
        model=ollama_model_name,
        base_url=ollama_api_base,
        temperature=0.0,
    )
    reader_agent = Agent(
        role="Reader",
        goal="Extract text from a PDF document. Your Final Answer is nothing else that the text verbatim.",
        verbose=True,
        memory=True,
        backstory="You are an expert in extracting text from PDF documents.",
        tools=[pdf_reader_tool],
        allow_delegation=False,
        llm=llm
    )

    read_text_task = Task(
        description="""Extract the text from the PDF document located at {pdf_url}. \
        Do not synthesize. Do not summarize. Return the whole text as-is.""",
        expected_output="""The value for the key 'extracted_text', from the dict object returned by the tool 'PDF Reader'. \
                        For example, if the tool returns {'extracted_text': 'Once upon a time in Wonderland...'}, \
                        the output must be 'Once upon a time in Wonderland...' \
                        If the tool returns {'extracted_text': 'hdhay6 djandf9 29dfhkhfehdf'}, \
                        the output must be 'hdhay6 djandf9 29dfhkhfehdf'. \
                        If the tool returns {'extracted_text': 'Yoobidoobida!'}, \
                        the output must be 'Yoobidoobida!'.""",
        agent=reader_agent,
        name="Read Text Task"
    )

    crew = Crew(
        agents=[reader_agent],
        tasks=[read_text_task],
        process=Process.sequential,
        manager_llm=llm
    )

    result = crew.kickoff(inputs={"pdf_url": pdf_url})
    print(f"result:\n{result}")

if __name__ == '__main__':
    main()

The tool call returns a dictionary with the single key 'extracted_text', as expected:

╭──────────────────────────────────────────────────── Tool Output ─────────────────────────────────────────────────────╮
│ │
│ {‘extracted_text’: '1\nReal-Time Transmission of Uncompressed High-Definition Video Via\nA VCSEL-Based Optical │
│ Wireless Link With Ultra-Low Latency\nHossein Kazemi, Isaac N. O. Osahon, Tiankuo Jiao, David Butler, Nikolay │
│ Ledentsov Jr., Ilya Titkov,\nNikolay Ledentsov, and Harald Haas\nAbstract —Real-time transmission of │
│ high-resolution video sig-\nnals in an uncompressed and unencrypted format requires an\nultra-reliable and │
│ low-latency communications (URLLC) medium\nwith high bandwidth to maintain the quality of experience\n(QoE) for │
│ users. We put forward the design and experimental\ndemonstration of a high-performance laser-based optical │
│ wireless\ncommunication (OWC) system that enables high-definition (HD)\nvideo transmission with submillisecond │
│ latencies. The serial\ndigital interface (SDI) output of a camera is used to transmit\nthe live video stream over │
│ an optical wireless link by directly\nmodulating the SDI signal on the intensity of a 940 nm vertical\ncavity │
│ surface emitting laser (VCSEL). The proposed SDI over\nlight fidelity (LiFi) system corroborates error-free │
│ transmission of\nfull HD (FHD) and 4K ultra-high-definition (UHD) resolutions\nat data rates of 2.97Gb/s and │
│ 5.94Gb/s, respectively, with a\nmeasured end-to-end latency of under 35ns. Since SDI standards\nsupport various …

Unfortunately, the “Agent Final Answer” is not the whole text verbatim, as requested:

╭─────────────────────────────────────────────── ✅ Agent Final Answer ────────────────────────────────────────────────╮
│ │
│ Agent: Reader │
│ │
│ Final Answer: │
│ The provided text appears to be a list of author profiles or bios related to a research paper or project on │
│ optical wireless communication systems. Here’s a brief summary: │
│ │
│ 1. Authors include researchers from various institutions and countries, such as the University of Cambridge, Aston │
│ University, VI Systems GmbH, and others. │
│ │
│ 2. Their areas of expertise include photonics, communication theory, signal processing, semiconductor │
│ nanostructures, and advanced optoelectronic devices. │
│ │
│ 3. Notable researchers include Harald Haas, who is a Fellow of IEEE and the founder of pureLiFi Ltd., and Nikolay │
│ Ledentsov, a Senior Member of IEEE with over 900 publications and numerous awards. │
│ │
│ 4. Their recent work focuses on photonics, communication theory, and signal processing for optical wireless │
│ communication systems, with Harald Haas being recognized as a highly cited researcher by Clarivate/Web of Science │
│ since 2017. │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I used models ollama/deepseek-r1:14b and ollama/granite3.3:8b with similar results.

Any advice would be welcome!

Since you’re using a model through Ollama, it’s worth checking if you’re hitting a context window limit, especially since you are requesting a full dump of the PDF’s content. By default, Ollama truncates input at 2048 tokens, even if the input is longer and the model itself could handle much larger contexts.

I discussed this over in this other thread, which provides the solution and a test case. It might be helpful for your issue as well.
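
For reference, here is a minimal sketch of what raising that limit can look like. It assumes your CrewAI/LiteLLM versions forward extra keyword arguments such as num_ctx through to Ollama; if they don't, the same parameter can instead be baked into a custom Ollama Modelfile.

from crewai import LLM

# Hedged sketch: num_ctx is only honored if CrewAI/LiteLLM pass it through to Ollama.
llm = LLM(
    model="ollama/granite3.3:8b",       # your OLLAMA_MODEL_NAME
    base_url="http://localhost:11434",  # your OLLAMA_API_BASE
    temperature=0.0,
    num_ctx=8192,  # raise Ollama's default 2048-token input window
)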

Thank you so much for pointing this out!

Although increasing num_ctx to 16k and 32k resulted in failures (the “Agent Final Answer” was “&lt;think&gt;”), increasing it to 8k did improve the answer: I received a summary of the whole paper. This seems to confirm that the context length was a bottleneck. I assume the prompt processed by the Ollama LLM had the beginning of the text truncated.

I still receive a summary of the text, though. Is it possible that the instructions written in the task are also truncated, so the LLM doesn’t know what to do, apart from generating a summary of what it reads?

Well, Sébastien, let’s talk a bit more about what’s happening under the hood.

The file you’re using as an example, after being converted to text by your tool, has 42,372 characters. The text alone amounts to 11,970 tokens (using the Gemini models’ tokenization method). Add to that the prompt text that the CrewAI framework is generating under the hood. So, this enormous sandwich is ultimately being passed to your LLM. And your instruction asks it to repeat the same thing it received as input, so you’ll also get a very large response. Your LLM would have to be very competent to be able to absorb and respect instructions (signal) amidst so much content (noise).
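
If you want to sanity-check the size of what you're sending, a rough estimate is enough. Here is an illustrative sketch using tiktoken's cl100k_base encoding as a proxy (your Ollama models ship their own tokenizers, so treat the number as an order of magnitude only):

import tiktoken  # rough proxy only; not the tokenizer your Ollama model actually uses

def estimate_tokens(text: str) -> int:
    # cl100k_base will not match your model exactly, but the order of magnitude is what matters
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# e.g. estimate_tokens(extracted_text) -> roughly 12k tokens for this paper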

I’d like to believe that the example you presented is just a minimal reproducible example and not a concrete use case, right? Still, what’s happening is that you’re generating a terrible signal-to-noise ratio (SNR) for your LLM. Are you really sure that your model running via Ollama is robust enough to handle this? (If it is, please let me borrow your computer; I could really use a friend who would let me run a 400B+ parameter LLM on their machine :blush:).

Since I’m just speculating, I believe you are experiencing firsthand what has been termed “Context Rot.” I suggest checking out this excellent video from the folks at ChromaDB on the topic. Context rot quickly degrades a model’s ability to follow even very simple instructions (like “repeat what I said”). One way to test this hypothesis is to run your Crew, simply swapping the LLM for a larger commercial model, and see if the result improves.
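
Swapping in a hosted model is a one-line change in your script, something like the sketch below (the model name is just an example and assumes the corresponding API key is set in your environment):

from crewai import LLM

# Example only: any larger hosted model will do; gpt-4o assumes OPENAI_API_KEY is set.
llm = LLM(model="gpt-4o", temperature=0.0)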

Everything I’ve mentioned so far is still at the level of the LLM’s own limitations. I’m not even considering whether CrewAI attempts to do any kind of automatic context window management. I recall seeing that something in this regard was recently modified, but I can’t say for sure if the action is reactive (when there’s an actual context window overflow) or if it’s preventive (if automatic summarizations are being executed).

Finally, as I said above, I want to believe that the example you provided doesn’t correspond to a real-world use case. A real-world agentic application would definitely handle similar situations with a map-reduce pattern, that is, splitting the text into chunks, processing each chunk (even in parallel), and finally having a last step that merges the parts. I myself have proposed drop-in replacements for two of CrewAI’s base tools (FileReadTool and ScrapeWebsiteTool) that offer the ability to perform some form of content truncation/summarization to better manage the LLM’s context window. You can find these versions in this repository.
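
To make the map-reduce idea concrete, here is a minimal, framework-agnostic sketch of the “map” side. The chunk size and overlap are illustrative values, not recommendations, and none of this is a CrewAI API:

def split_into_chunks(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    # Naive character-based chunking with a small overlap between consecutive chunks.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

# Map: run one Task per chunk (possibly in parallel).
# Reduce: a final Task merges the per-chunk outputs into a single result.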

Hello Max,

You’re right: the code I pasted was only the first part of an agentic flow. I was trying to adhere to the pattern of having an agent dedicated to extracting the text verbatim. Then, another agent would do something else (like summarizing, or extracting relevant information). I observed that the Reader agent wouldn’t return the text verbatim, so I am trying to pinpoint the root cause.

The size of the text is typical for the application I have in mind, so the context window is a problem. As I understand from your comments, the ratio between the instructions and the payload is an additional problem, and that could be solved by chunking the text into smaller pieces.

I will go through the references you provided and see how far I can get!

Thanks again!

P.S.: Sadly, no, I can’t run a 400B+ parameter LLM on any machine I can log on to! :wink:
