Hello!
I have seen many examples where an agent reads the content of a file (in my case, a PDF from a URL) and returns the file content verbatim.
Ex.: Efficient PDF Summarization with CrewAI and Intel® XPU Optimization - Intel Community
What I observe is that the “Agent Final Answer” for this task is a summary of the last PDF page. For example, in a technical paper where the last page shows the biographies of the authors, the “Agent Final Answer” is a summary of those biographies. What I would like to have is the whole text verbatim.
Code:
import logging
from dotenv import load_dotenv
from crewai import Agent, Task, Crew, Process, LLM
from crewai.tools import BaseTool
from PyPDF2 import PdfReader
import os
from typing import Type, Dict
from pydantic import BaseModel, Field
import urllib.request
import io
logging.basicConfig(level=logging.DEBUG, format='%(asctime)-15s [%(levelname)s] %(message)s')
class PDFReaderTool(BaseTool):
    name: str = "PDF Reader"
    description: str = "Reads the content of a PDF file and returns the text."

    def _run(self, pdf_url: str) -> Dict[str, str]:
        req = urllib.request.Request(pdf_url, headers={'User-Agent': 'Mozilla/5.0'})
        # Open the URL and read the content as bytes
        with urllib.request.urlopen(req) as response:
            remote_file_bytes = response.read()
        # Create an in-memory binary stream
        pdf_file_obj = io.BytesIO(remote_file_bytes)
        reader = PdfReader(pdf_file_obj)
        extracted_text = "".join([page.extract_text() for page in reader.pages])
        return {'extracted_text': extracted_text}
def main():
    logging.info("extract_text_verbatim.main()")
    pdf_url = r"https://arxiv.org/pdf/2507.23746"

    load_dotenv()
    # Check for .env variables
    ollama_model_name = os.getenv('OLLAMA_MODEL_NAME')
    if not ollama_model_name:
        raise EnvironmentError("OLLAMA_MODEL_NAME not found in the environment variables")
    ollama_api_base = os.getenv('OLLAMA_API_BASE')
    if not ollama_api_base:
        raise EnvironmentError("OLLAMA_API_BASE not found in the environment variables")
    crewai_disable_telemetry = os.getenv('CREWAI_DISABLE_TELEMETRY')
    if not crewai_disable_telemetry:
        raise EnvironmentError("CREWAI_DISABLE_TELEMETRY not found in the environment variables")

    pdf_reader_tool = PDFReaderTool()

    llm = LLM(
        model=ollama_model_name,
        base_url=ollama_api_base,
        temperature=0.0,
    )

    reader_agent = Agent(
        role="Reader",
        goal="Extract text from a PDF document. Your Final Answer is nothing else than the text verbatim.",
        verbose=True,
        memory=True,
        backstory="You are an expert in extracting text from PDF documents.",
        tools=[pdf_reader_tool],
        allow_delegation=False,
        llm=llm
    )

    read_text_task = Task(
        description="""Extract the text from the PDF document located at {pdf_url}. \
Do not synthesize. Do not summarize. Return the whole text as-is.""",
        expected_output="""The value for the key 'extracted_text', from the dict object returned by the tool 'PDF Reader'. \
For example, if the tool returns {'extracted_text': 'Once upon a time in Wonderland...'}, \
the output must be 'Once upon a time in Wonderland...' \
If the tool returns {'extracted_text': 'hdhay6 djandf9 29dfhkhfehdf'}, \
the output must be 'hdhay6 djandf9 29dfhkhfehdf'. \
If the tool returns {'extracted_text': 'Yoobidoobida!'}, \
the output must be 'Yoobidoobida!'.""",
        agent=reader_agent,
        name="Read Text Task"
    )

    crew = Crew(
        agents=[reader_agent],
        tasks=[read_text_task],
        process=Process.sequential,
        manager_llm=llm
    )

    result = crew.kickoff(inputs={"pdf_url": pdf_url})
    print(f"result:\n{result}")


if __name__ == '__main__':
    main()
The tool call returns a dictionary with the single key 'extracted_text', as expected:
╭──────────────────────────────────────────────────── Tool Output ─────────────────────────────────────────────────────╮
│ │
│ {'extracted_text': '1\nReal-Time Transmission of Uncompressed High-Definition Video Via\nA VCSEL-Based Optical │
│ Wireless Link With Ultra-Low Latency\nHossein Kazemi, Isaac N. O. Osahon, Tiankuo Jiao, David Butler, Nikolay │
│ Ledentsov Jr., Ilya Titkov,\nNikolay Ledentsov, and Harald Haas\nAbstract —Real-time transmission of │
│ high-resolution video sig-\nnals in an uncompressed and unencrypted format requires an\nultra-reliable and │
│ low-latency communications (URLLC) medium\nwith high bandwidth to maintain the quality of experience\n(QoE) for │
│ users. We put forward the design and experimental\ndemonstration of a high-performance laser-based optical │
│ wireless\ncommunication (OWC) system that enables high-definition (HD)\nvideo transmission with submillisecond │
│ latencies. The serial\ndigital interface (SDI) output of a camera is used to transmit\nthe live video stream over │
│ an optical wireless link by directly\nmodulating the SDI signal on the intensity of a 940 nm vertical\ncavity │
│ surface emitting laser (VCSEL). The proposed SDI over\nlight fidelity (LiFi) system corroborates error-free │
│ transmission of\nfull HD (FHD) and 4K ultra-high-definition (UHD) resolutions\nat data rates of 2.97Gb/s and │
│ 5.94Gb/s, respectively, with a\nmeasured end-to-end latency of under 35ns. Since SDI standards\nsupport various …
Unfortunately, the “Agent Final Answer” is not the whole text verbatim, as requested:
╭─────────────────────────────────────────────── ✅ Agent Final Answer ────────────────────────────────────────────────╮
│ │
│ Agent: Reader │
│ │
│ Final Answer: │
│ The provided text appears to be a list of author profiles or bios related to a research paper or project on │
│ optical wireless communication systems. Here’s a brief summary: │
│ │
│ 1. Authors include researchers from various institutions and countries, such as the University of Cambridge, Aston │
│ University, VI Systems GmbH, and others. │
│ │
│ 2. Their areas of expertise include photonics, communication theory, signal processing, semiconductor │
│ nanostructures, and advanced optoelectronic devices. │
│ │
│ 3. Notable researchers include Harald Haas, who is a Fellow of IEEE and the founder of pureLiFi Ltd., and Nikolay │
│ Ledentsov, a Senior Member of IEEE with over 900 publications and numerous awards. │
│ │
│ 4. Their recent work focuses on photonics, communication theory, and signal processing for optical wireless │
│ communication systems, with Harald Haas being recognized as a highly cited researcher by Clarivate/Web of Science │
│ since 2017. │
│ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
I used models ollama/deepseek-r1:14b and ollama/granite3.3:8b with similar results.
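One thing that may be worth checking, offered as an unverified guess rather than a diagnosis: if the extracted text is longer than the model's context window, the model may only see the tail of the document, which would be consistent with a summary of the last page. The sketch below roughly estimates the token count, assuming about four characters per token (a rule of thumb, not an exact figure). The names `estimate_tokens` and `fits_in_context`, and the `context_tokens` value, are hypothetical and not part of CrewAI or Ollama.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes ~4 characters per token (a heuristic only)."""
    return int(len(text) / chars_per_token)


def fits_in_context(text: str, context_tokens: int, chars_per_token: float = 4.0) -> bool:
    """Return True if the estimated token count is within the given window."""
    return estimate_tokens(text, chars_per_token) <= context_tokens


if __name__ == "__main__":
    # Stand-in for the tool's 'extracted_text' value (hypothetical sample data).
    extracted_text = "x" * 60_000
    # Hypothetical window size; check the value actually configured for your model.
    context_tokens = 2048
    print("estimated tokens:", estimate_tokens(extracted_text))
    print("fits in context:", fits_in_context(extracted_text, context_tokens))
```

If the estimate is well above the window configured for the model, truncation of the tool output would be a plausible thing to rule out before changing the prompts.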
Any advice would be welcome!