How to implement PDF as a knowledge source?

Shwet_Ketu · December 16, 2024, 6:25am

Hello,
I see a knowledge source for strings on the crewAI documentation page, but I am looking for help with using a PDF as a knowledge source. Also, is it possible to keep the PDF document in a cloud storage location and have the crew access it as a knowledge source?

Thanks

rokbenko · December 16, 2024, 7:40am

@Shwet_Ketu Use the PDFKnowledgeSource class as follows:

from crewai import Crew
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource

my_pdf_source = PDFKnowledgeSource(
    file_path="your/path/to/filename.pdf"
)

my_crew = Crew(
    ...,
    knowledge_sources=[my_pdf_source],
)
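
On the cloud storage part of the question: nothing in this thread confirms that PDFKnowledgeSource can read from a remote location, so one possible workaround is to download the file before creating the source. Below is a minimal sketch using only the standard library. The helper name `fetch_pdf` and the `knowledge` folder default are assumptions for illustration, not part of crewAI.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_pdf(url, filename, knowledge_dir="knowledge"):
    """Download a PDF into the local knowledge folder and return its file name.

    Hypothetical helper: crewAI does not provide this function.
    """
    target_dir = Path(knowledge_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename

    # Stream the response to disk so large PDFs are not held in memory.
    with urllib.request.urlopen(url, timeout=30) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)

    return target.name
```

The returned file name could then be passed as `file_path` to PDFKnowledgeSource. Private buckets would need authenticated requests, which this sketch does not cover.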

Sudesh_K · December 17, 2024, 5:07am

It seems the contents of the PDF are not in the my_pdf_source object. What other steps need to be carried out? Do we have to do something with the memory?

I get this message: [2024-12-17 13:02:36][ERROR]: Failed to upsert documents: Expected metadata to be a non-empty dict, got 0 metadata attributes in upsert.

[2024-12-17 13:02:36][WARNING]: Failed to init knowledge: Expected metadata to be a non-empty dict, got 0 metadata attributes in upsert.

Daniel_Ryan · December 17, 2024, 10:57am

Hi @Sudesh_K

I was having the same issue. I believe there is a bug, see:

In this BUG post, @opahopa prepared their own temporary fix by creating their own LocalTxTFileKnowledgeSource class.

My solution was to convert my .pdf file to a .txt file first, then use this LocalTxTFileKnowledgeSource class instead.

kapenge · December 17, 2024, 1:05pm

Hi,

Can you share your code? I want to implement the same logic.

Regards,

Daniel_Ryan · December 17, 2024, 7:50pm

I used the following to convert the PDF to text:

import fitz  # PyMuPDF


def pdf_to_text(pdf_path, txt_path):
    # Open the PDF
    pdf_document = fitz.open(pdf_path)

    # Create a text file to store the extracted text
    with open(txt_path, "w", encoding="utf-8") as text_file:
        for page_number in range(len(pdf_document)):
            page = pdf_document.load_page(page_number)
            text = page.get_text()
            text_file.write(text)

    # Close the PDF
    pdf_document.close()

# Example usage
pdf_path = "/home/user/crewai/demo/knowledge/filename.pdf"
txt_path = "/home/user/crewai/demo/knowledge/filename.txt"
pdf_to_text(pdf_path, txt_path)

And here is my crew.py

from crewai import Agent, Crew, Process, Task, LLM
from crewai.project import CrewBase, agent, crew, task, before_kickoff, after_kickoff

# Knowledge temporary fix
from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from pydantic import Field
from typing import Dict
import uuid


class LocalTxTFileKnowledgeSource(BaseKnowledgeSource):
    file_path: str = Field(description="Path to the local .txt file")
    def load_content(self) -> Dict[str, str]:
        try:
            with open(self.file_path, "r", encoding="utf-8") as file:
                content = file.read()
            return {self.file_path: content}
        except Exception as e:
            raise ValueError(f"Failed to read the file {self.file_path}: {str(e)}")

    def add(self) -> None:
        """Process and store the file content."""
        content = self.load_content()

        for _, text in content.items():
            chunks = self._chunk_text(text)
            self.chunks.extend(chunks)

            chunks_metadata = [
                {
                    "chunk_id": str(uuid.uuid4()),
                    "source": self.file_path,
                    "description": f"Chunk {i + 1} from file {self.file_path}"
                }
                for i in range(len(chunks))
            ]

        self.save_documents(metadata=chunks_metadata)

@CrewBase
class Demo2():
	"""Demo2 crew"""

	agents_config = 'config/agents.yaml'
	tasks_config = 'config/tasks.yaml'

	@agent
	def reviewer(self) -> Agent:
		return Agent(
			config=self.agents_config['reviewer'],
			memory=True,
			verbose=True,
			max_rpm=10,  # Limit API calls
		)

	@task
	def documentation_review_task(self) -> Task:
		return Task(
			config=self.tasks_config['documentation_review_task'],
			output_file='outputs/1_documentation_review_task.md'
		)

	@crew
	def crew(self) -> Crew:
		"""Creates the Demo2 crew"""
		local_txt_source = LocalTxTFileKnowledgeSource(file_path="knowledge/filename.txt", metadata={"version": "15.1"})
		return Crew(
			agents=self.agents, # Automatically created by the @agent decorator
			tasks=self.tasks, # Automatically created by the @task decorator
			process=Process.sequential,
			verbose=True,
			knowledge_sources=[local_txt_source],
			full_output=True,
			output_log_file='outputs/0_crew_output_log_file.md'
		)

I hope this helps, let me know how you get on.
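
The part of the class above that avoids the "non-empty dict" error is that every chunk gets its own populated metadata dict. Here is a standalone sketch of that idea, runnable without crewAI. `chunk_text` is a hypothetical stand-in for the base class's `_chunk_text` (its real chunk size and overlap may differ), and `build_chunk_metadata` is a name I made up for illustration.

```python
import uuid


def chunk_text(text, chunk_size=4000, overlap=200):
    """Split text into overlapping chunks (assumed behavior, not crewAI's code)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def build_chunk_metadata(chunks, source):
    """Return one non-empty metadata dict per chunk, in chunk order."""
    return [
        {
            "chunk_id": str(uuid.uuid4()),  # unique id per chunk
            "source": source,
            "description": f"Chunk {i + 1} from file {source}",
        }
        for i in range(len(chunks))
    ]
```

Because the metadata list has the same length as the chunk list and no dict is empty, the upsert no longer receives "0 metadata attributes".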

Baalakay · December 17, 2024, 10:10pm

Hi, as a follow-up on this question, is there a way to set the knowledge search as a directory of PDFs for RAG, instead of just a single PDF as in this example?

Baalakay · December 17, 2024, 11:16pm

Hi, I don’t see the PDFKnowledgeSource in the online docs. Is it a newer, not yet documented class?

And why should we use it (or should we at all) instead of the PDFSearchTool?

Thx.

rokbenko · December 18, 2024, 8:34am

The knowledge_sources parameter allows you to add more than one source to the list. But it’s not possible to provide a directory to it, as far as I know.
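
Since the list accepts several sources, a directory can be approximated by creating one source per PDF file. This is a rough sketch under some assumptions: the helper names `find_pdfs` and `build_pdf_sources` are made up, the PDFs sit directly in a `knowledge` folder, and each source gets a small metadata dict to sidestep the metadata error reported in this thread.

```python
from pathlib import Path


def find_pdfs(knowledge_dir="knowledge"):
    """Return the sorted file names of all PDFs directly inside knowledge_dir."""
    root = Path(knowledge_dir)
    return sorted(p.name for p in root.glob("*.pdf") if p.is_file())


def build_pdf_sources(knowledge_dir="knowledge"):
    """Create one PDFKnowledgeSource per PDF found (hypothetical helper)."""
    # Imported here so the file listing above works without crewAI installed.
    from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource

    return [
        PDFKnowledgeSource(file_path=name, metadata={"title": name})
        for name in find_pdfs(knowledge_dir)
    ]
```

The result of `build_pdf_sources()` could then be passed as `knowledge_sources` when creating the Crew. Subfolders are ignored in this sketch.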

It’s not newer. It was released at the same time as the StringKnowledgeSource class. It’s just that there’s no example in the docs.

As stated in the docs:

Knowledge in CrewAI is a powerful system that allows AI agents to access and utilize external information sources during their tasks. Think of it as giving your agents a reference library they can consult while working.

Key benefits of using Knowledge:

Enhance agents with domain-specific information

Support decisions with real-world data

Maintain context across conversations

Ground responses in factual information

kapenge · December 18, 2024, 2:22pm

Thanks for sharing. It will really help a lot.

rokbenko · December 18, 2024, 2:34pm

@kapenge You’re welcome!

kapenge · December 20, 2024, 3:10pm

If I use “LocalTxTFileKnowledgeSource” or “PDFKnowledgeSource”, I get this error:

[2024-12-20 16:05:31][ERROR]: Failed to upsert documents: timed out in upsert.
[2024-12-20 16:05:31][WARNING]: Failed to init knowledge: timed out in upsert.

Any hint?

tonykipkemboi · December 20, 2024, 10:25pm

Here’s how to implement PDF as a knowledge source.

Make sure you have a folder named knowledge in the root of your directory where you should save your PDF(s).

# Imports
from crewai import Agent, Task, Crew, Process, LLM
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource

# Pass the PDF to the knowledge class
# IMPORTANT: the file path should be the name of the PDF only, not a path like `knowledge/pdf_name.pdf`
pdf_source = PDFKnowledgeSource(file_path="pdf_name.pdf")

...


my_crew = Crew(
    ...,
    knowledge_sources=[pdf_source],
)

If you get the metadata error (this will be resolved when we cut v0.86.1), add a dummy input to the metadata like so:

...
pdf_source = PDFKnowledgeSource(
    file_path="pdf_name.pdf",
    metadata={"title": "some-title"}
)
...
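
To guard against passing the wrong kind of path, a small check can turn whatever path you have into the name relative to the knowledge folder and fail early if the file is not there. This is only a sketch: `knowledge_file_name` is a hypothetical helper, and the `knowledge` default assumes the folder layout described above.

```python
from pathlib import Path


def knowledge_file_name(path, knowledge_dir="knowledge"):
    """Return the path of an existing file relative to knowledge_dir.

    Accepts either a bare name ("pdf_name.pdf") or a path that already
    points into the knowledge folder ("knowledge/pdf_name.pdf").
    """
    root = Path(knowledge_dir).resolve()
    # Try the path as given first, then relative to the knowledge folder.
    for candidate in (Path(path), root / path):
        resolved = candidate.resolve()
        if resolved.is_file() and root in resolved.parents:
            return resolved.relative_to(root).as_posix()
    raise FileNotFoundError(f"{path} was not found inside {root}")
```

The returned value is what would go into `file_path`.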

system · December 21, 2024, 10:26pm

This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.
