How to implement PDF as a knowledge source?

Hello,
I see a knowledge source for strings on the crewAI documentation page, but I am looking for help with using a PDF as a knowledge source. Also, is it possible to keep the PDF document at a cloud storage location so that the crew can access it as a knowledge source?

Thanks

@Shwet_Ketu Use the PDFKnowledgeSource class as follows:

from crewai import Crew
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource

my_pdf_source = PDFKnowledgeSource(
    file_path="your/path/to/filename.pdf"
)

my_crew = Crew(
    ...,
    knowledge_sources=[my_pdf_source],
)
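
On the cloud storage part of the question: nothing in this thread confirms that PDFKnowledgeSource can read a remote URL directly, so one possible workaround is to download the file to local disk first and then pass the local file as shown above. A minimal sketch using only the standard library; the URL, the default `knowledge` folder, and the `fetch_pdf` helper are hypothetical placeholders, and a private bucket would need its own authentication on top of this:

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen
import shutil


def fetch_pdf(url: str, dest_dir: str = "knowledge") -> Path:
    """Download the file at `url` into `dest_dir` and return the local path."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Use the last path segment of the URL as the local file name.
    name = Path(urlparse(url).path).name or "download.pdf"
    local_path = target_dir / name

    # Stream the response to disk instead of loading it all into memory.
    with urlopen(url) as response, open(local_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return local_path


# Hypothetical usage (the URL is a placeholder, not a real document):
# local_pdf = fetch_pdf("https://example.com/files/filename.pdf")
# Then build the PDFKnowledgeSource from the downloaded file.
```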

It seems the contents of the PDF are not in the my_pdf_source object. What other steps need to be carried out? Do we have to do something with the memory?

I get this message:

[2024-12-17 13:02:36][ERROR]: Failed to upsert documents: Expected metadata to be a non-empty dict, got 0 metadata attributes in upsert.

[2024-12-17 13:02:36][WARNING]: Failed to init knowledge: Expected metadata to be a non-empty dict, got 0 metadata attributes in upsert.

Hi @Sudesh_K

I was having the same issue. I believe there is a bug, see:

In this bug post, @opahopa prepared their own temporary fix by creating their own LocalTxTFileKnowledgeSource class.

My solution was to convert my .pdf file to a .txt file first, then use this LocalTxTFileKnowledgeSource class instead.

Hi,

Can you share your code? I want to implement the same logic.

Regards,


I used the following to convert the PDF to a text file:

import fitz  # PyMuPDF


def pdf_to_text(pdf_path, txt_path):
    # Open the PDF
    pdf_document = fitz.open(pdf_path)

    # Write the extracted text of every page to the output file
    with open(txt_path, "w", encoding="utf-8") as text_file:
        for page_number in range(len(pdf_document)):
            page = pdf_document.load_page(page_number)
            text = page.get_text()
            text_file.write(text)

    # Close the PDF
    pdf_document.close()

# Example usage
pdf_path = "/home/user/crewai/demo/knowledge/filename.pdf"
txt_path = "/home/user/crewai/demo/knowledge/filename.txt"
pdf_to_text(pdf_path, txt_path)

And here is my crew.py

from crewai import Agent, Crew, Process, Task, LLM
from crewai.project import CrewBase, agent, crew, task, before_kickoff, after_kickoff

# Knowledge temporary fix
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from pydantic import Field
from typing import Dict
import uuid


class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")
    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()

        chunks_metadata = []
        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)

            # One metadata dict per chunk, accumulated across all loaded files
            chunks_metadata.extend(
                {
                    "chunk_id": str(uuid.uuid4()),
                    "source": self.file_path,
                    "description": f"Chunk {i + 1} from file {self.file_path}"
                }
                for i in range(len(chunks))
            )

        self.save_documents(metadata=chunks_metadata)

@CrewBase
class Demo2():
	"""Demo2 crew"""

	agents_config = 'config/agents.yaml'
	tasks_config = 'config/tasks.yaml'

	@agent
	def reviewer(self) -> Agent:
		return Agent(
			config=self.agents_config['reviewer'],
			memory=True,
			verbose=True,
			max_rpm=10,  # Limit API calls
		)

	@task
	def documentation_review_task(self) -> Task:
		return Task(
			config=self.tasks_config['documentation_review_task'],
			output_file='outputs/1_documentation_review_task.md'
		)

	@crew
	def crew(self) -> Crew:
		"""Creates the Demo2 crew"""
		local_txt_source = LocalTxTFileKnowledgeSource(file_path="knowledge/filename.txt", metadata={"version": "15.1"})
		return Crew(
			agents=self.agents, # Automatically created by the @agent decorator
			tasks=self.tasks, # Automatically created by the @task decorator
			process=Process.sequential,
			verbose=True,
			knowledge_sources=[local_txt_source],
			full_output=True,
			output_log_file='outputs/0_crew_output_log_file.md'
		)

I hope this helps, let me know how you get on.


Hi, as a follow-up on this question, is there a way to set the knowledge search as a directory of PDFs for RAG, instead of just a single PDF as in this example?

Hi, I don’t see the PDFKnowledgeSource in the online docs. Is it a newer, not yet documented class?

And why should we (or should we at all) use it instead of the PDFSearchTool?

Thx.

The knowledge_sources parameter lets you add more than one source to the list. But as far as I know, it's not possible to pass a directory to it.
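
Since the list accepts several sources, one way to approximate a directory is to list the PDFs yourself and build one source per file. A rough sketch, assuming the PDFs sit directly inside a `knowledge` folder; `list_pdf_names` is a hypothetical helper, not part of crewAI, and the path form you pass on may need adjusting for your version:

```python
from pathlib import Path
from typing import List


def list_pdf_names(directory: str = "knowledge") -> List[str]:
    """Return the sorted file names of all PDFs directly inside `directory`."""
    folder = Path(directory)
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Hypothetical usage with the class from this thread:
# pdf_sources = [PDFKnowledgeSource(file_path=name) for name in list_pdf_names()]
# my_crew = Crew(..., knowledge_sources=pdf_sources)
```

Subfolders are ignored here on purpose; swap `iterdir()` for `rglob("*.pdf")` if you need a recursive scan.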

It’s not newer. It was released at the same time as the StringKnowledgeSource class. It’s just that there’s no example in the docs.

As stated in the docs:

Knowledge in CrewAI is a powerful system that allows AI agents to access and utilize external information sources during their tasks. Think of it as giving your agents a reference library they can consult while working.

Key benefits of using Knowledge:

  • Enhance agents with domain-specific information
  • Support decisions with real-world data
  • Maintain context across conversations
  • Ground responses in factual information

Thanks for sharing. It will really help a lot.

@kapenge You’re welcome! :slight_smile:

If I use “LocalTxTFileKnowledgeSource” or “PDFKnowledgeSource”, I get this error:

[2024-12-20 16:05:31][ERROR]: Failed to upsert documents: timed out in upsert.
[2024-12-20 16:05:31][WARNING]: Failed to init knowledge: timed out in upsert.

Any hint?

Here’s how to implement PDF as a knowledge source.

Make sure you have a folder named knowledge in the root of your project, and save your PDF(s) there.

# Imports
from crewai import Agent, Task, Crew, Process, LLM
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource

# Pass the PDF to the knowledge class
# IMPORTANT: the file path should be just the PDF's file name, not `knowledge/pdf_name.pdf`
pdf_source = PDFKnowledgeSource(file_path="pdf_name.pdf")

...


my_crew = Crew(
    ...,
    knowledge_sources=[pdf_source],
)
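
Because the path is given relative to the knowledge folder, a quick check before building the crew can catch a wrong path early. A small sketch; `knowledge_relative_name` is a hypothetical helper (not a crewAI function), and the folder name is taken from the note above:

```python
from pathlib import Path


def knowledge_relative_name(file_name: str, knowledge_dir: str = "knowledge") -> str:
    """Check that `file_name` exists inside `knowledge_dir` and return it unchanged.

    Raises ValueError if the name already starts with the knowledge folder,
    and FileNotFoundError if the file is not there.
    """
    folder = Path(knowledge_dir)
    parts = Path(file_name).parts
    if parts and parts[0] == folder.name:
        raise ValueError(
            f"Drop the leading '{folder.name}/' from '{file_name}'"
        )
    if not (folder / file_name).is_file():
        raise FileNotFoundError(f"'{file_name}' not found in '{knowledge_dir}'")
    return file_name


# Hypothetical usage:
# pdf_source = PDFKnowledgeSource(file_path=knowledge_relative_name("pdf_name.pdf"))
```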

If you get the metadata error (this will be resolved when we cut v0.86.1), add a dummy input to the metadata like so:

...
pdf_source = PDFKnowledgeSource(
    file_path="pdf_name.pdf",
    metadata={"title": "some-title"},
)
...
