Agent setup and configuration

I’m trying to build an agent that can replace our appointment coordinators at the agency.

The database is about 100 pages long, and the agent is having trouble answering questions from it. Instead of reading the document, it answers directly from the language model, which leads to significant inaccuracy in the answers.

How can I get it to read from large files?

How are you querying the database? Via text-to-SQL, or is it in a knowledge base? Are you using a tool?

The more details you share the better people can help you.

The data is in a docx file.

At first I tried to extract it with a function that has the agent read from the document before it answers…

Then the file got big and the agent’s memory wasn’t enough… I also tried the RAG tool, but it still doesn’t work for me…

Have you tried using the knowledge feature?

An alternative is to do plain old RAG: you can use a local database like ChromaDB for that, and then create a tool that fetches the data.
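Something along these lines (a rough sketch, not tested; it assumes the chromadb package, a recent CrewAI where the tool decorator lives in crewai.tools, and that you’ve already extracted the document text into chunks somehow):

import chromadb
from crewai.tools import tool

# Persist the index on disk so the document is only embedded once
# (Chroma falls back to its default local embedding function here).
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("agency_docs")

# One-time ingestion: `chunks` is a hypothetical list[str] you build
# by splitting the docx text into pieces.
chunks = ["...chunk 1...", "...chunk 2..."]
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

@tool("Document Search")
def document_search(query: str) -> str:
    """Return the passages most relevant to the query."""
    results = collection.query(query_texts=[query], n_results=3)
    return "\n\n".join(results["documents"][0])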

I haven’t tried that; I’ll give it a try. Thank you very much!

Hi, I recently faced this problem.

You can use the CrewAI knowledge feature (Knowledge - CrewAI), but it only has native support for PDF, CSV, TXT, and Excel; it lacks .docx support. You can use Docling (CrewDoclingSource), but I have not had a lot of success with this. You can also use DOCX RAG Search (DOCX RAG Search - CrewAI), which acts as a form of RAG, so it’s useful but not brilliant.

In the end I converted all my DOCX and PDF files to TXT to make them work, but you could convert to PDF in a workflow.

So in summary:

Option 1: use DOCX RAG Search (DOCX RAG Search - CrewAI) and configure your task YAML to search for files in your /knowledge directory.

Option 2: convert to text and then use TextFileKnowledgeSource (Knowledge - CrewAI).

Play around with both and find the method that works for your particular problem; rough sketches of both options follow below.
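In code, the two options look roughly like this (a sketch; file names are placeholders, and it assumes crewai_tools ships DOCXSearchTool):

from crewai_tools import DOCXSearchTool
from crewai.knowledge.source.text_file_knowledge_source import (
    TextFileKnowledgeSource,
)
import docx

# Option 1: a RAG search tool scoped to one .docx file.
docx_tool = DOCXSearchTool(docx="knowledge/my_document.docx")

# Option 2: convert the .docx to plain text once, then use the
# native knowledge feature on the .txt file.
doc = docx.Document("knowledge/my_document.docx")
text = "\n".join(p.text for p in doc.paragraphs)
with open("knowledge/my_document.txt", "w", encoding="utf-8") as f:
    f.write(text)

text_source = TextFileKnowledgeSource(file_paths=["my_document.txt"])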

Request for CrewAI: can we have .docx added natively to the knowledge feature (Knowledge - CrewAI), please? I have found Docling to be slow. 🙂

Well, I’m not CrewAI, but since I had a bit of spare time, I whipped up a poor man’s version of it. Before diving into the code, though, here are a few caveats.

This proposed implementation uses the python-docx library. Install it with pip install python-docx if you’re old-school like me, or go with uv add python-docx if you’re into the latest trends.

If you look closely at the _read_docx_file method, you’ll see there’s not much magic happening — I’m just grabbing all the text I can find in your .docx file (later on, it’s split into chunks, embeddings are computed, and everything’s stored in a vector database). It’s a basic starting point for you — or anyone else — to build a more robust parser that can handle tables, images, and other complex structures down the line.
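If you want to push past that limitation, python-docx also exposes tables via document.tables. Here’s a rough, standalone sketch of what an extended reader might look like (the function name is hypothetical, and it only handles plain tables, not images or nested structures):

import docx

def read_docx_with_tables(file_path: str) -> str:
    """Hypothetical extension: grab paragraph *and* table text."""
    document = docx.Document(file_path)
    parts = [p.text for p in document.paragraphs if p.text.strip()]
    for table in document.tables:
        for row in table.rows:
            # Join the cells so each table row becomes one line of context.
            cells = [cell.text.strip() for cell in row.cells]
            parts.append(" | ".join(c for c in cells if c))
    return "\n".join(parts)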

Speaking of complex document structures, that’s exactly where Docling shines. It’s really good at dealing with intricate documents, but of course, that comes at a processing cost (a.k.a. it’s slower). Keep in mind that while simpler approaches like mine below will improve latency, they might hurt your crew’s ability to find relevant info, alright?
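For comparison, here’s roughly what the Docling route looks like (a sketch based on Docling’s DocumentConverter API; expect it to be slower, but headings and tables survive into the Markdown export):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("knowledge/IndyreadseBookseReaderHelp.docx")
# Structure (headings, tables) is preserved in the Markdown export.
markdown_text = result.document.export_to_markdown()
print(markdown_text[:500])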

Enough chit-chat. Directory layout:

crewai_docx_knowledge_source/
├── knowledge/
│   └── IndyreadseBookseReaderHelp.docx
├── docx_knowledge_source.py
└── test_docx_knowledge_source.py

File docx_knowledge_source.py:

# -*- coding: utf-8 -*-
"""Knowledge source for loading and querying DOCX file content."""

from pathlib import Path
from typing import Dict, List, Optional, Union
from typing_extensions import Self
from pydantic import Field, field_validator, model_validator

try:
    import docx
    from docx.document import Document

    DOCX_AVAILABLE = True
except ImportError:
    DOCX_AVAILABLE = False

from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from crewai.utilities.constants import KNOWLEDGE_DIRECTORY
from crewai.utilities.logger import Logger


class DOCXKnowledgeSource(BaseKnowledgeSource):
    """
    Manages loading and querying content from DOCX files using embeddings.

    Attributes:
        file_paths: A list of paths (str or Path) to the DOCX files.
        chunks: A list storing the text chunks generated from the DOCX
                files after processing.
        content: A dictionary caching the text content of each file.
                 Maps Path objects to their string content.
        safe_file_paths: A list of validated Path objects corresponding to
                         the input file_paths.
    """

    file_paths: Optional[Union[Path, List[Path], str, List[str]]] = Field(
        default=None, description="List of DOCX file paths"
    )
    chunks: List[str] = Field(default_factory=list)
    content: Dict[Path, str] = Field(default_factory=dict)
    safe_file_paths: List[Path] = Field(default_factory=list)

    _logger: Logger = Logger(verbose=True, default_color="yellow")

    @field_validator("file_paths", mode="before")
    @classmethod
    def _validate_file_paths_provided(
        cls, value: Optional[Union[Path, List[Path], str, List[str]]]
    ) -> Union[Path, List[Path], str, List[str]]:
        """Validate that file_paths is provided and not empty."""
        if value is None or (isinstance(value, list) and not value):
            raise ValueError(
                "Attribute 'file_paths' must be provided and cannot be empty."
            )
        return value

    @model_validator(mode="after")
    def _post_init_validation(self) -> Self:
        """
        Post-initialization hook: process paths, validate, load content.
        """
        if not DOCX_AVAILABLE:
            raise ImportError(
                "python-docx is not installed. "
                "Please install it with `uv add python-docx` "
                "or `pip install python-docx`."
            )

        self.safe_file_paths = self._process_file_paths()
        self._validate_paths_exist()
        self.content = self._load_content()
        return self

    def _process_file_paths(self) -> List[Path]:
        """
        Convert input file paths to a list of validated Path objects.

        Handles single strings, Path objects, or lists thereof. Resolves
        paths relative to KNOWLEDGE_DIRECTORY if applicable.

        Returns:
            A list of validated Path objects.

        Raises:
            ValueError: If file_paths is None, empty, or of an invalid type
                        after initial validation.
        """
        if self.file_paths is None:
            # Should be caught by the validator, but acts as a safeguard.
            raise ValueError("'file_paths' cannot be None at this stage.")

        # Standardize input to a list
        path_list: List[Union[Path, str]]
        if isinstance(self.file_paths, (str, Path)):
            path_list = [self.file_paths]
        elif isinstance(self.file_paths, list):
            # Ensure all elements are either str or Path before proceeding
            if not all(isinstance(p, (str, Path)) for p in self.file_paths):
                raise ValueError(
                    "All items in 'file_paths' list must be str or Path."
                )
            path_list = list(self.file_paths)
        else:
            # Should not happen due to Pydantic validation, but good practice.
            raise ValueError(
                "'file_paths' must be a Path, str, or a list of these types."
            )

        if not path_list:
            # Should be caught by the validator, but acts as a safeguard.
            raise ValueError("'file_paths' list cannot be empty.")

        # Convert all elements to validated Path objects
        processed_paths: List[Path] = []
        for path_input in path_list:
            processed_paths.append(self._convert_to_path(path_input))

        return processed_paths

    def _convert_to_path(self, path_input: Union[str, Path]) -> Path:
        """
        Convert a string or Path input to a resolved Path object.

        Attempts to resolve relative paths against KNOWLEDGE_DIRECTORY.

        Args:
            path_input: The input path (string or Path object).

        Returns:
            A resolved Path object.

        Raises:
            TypeError: If the input is not a str or Path.
        """
        if isinstance(path_input, str):
            # Try resolving relative to the knowledge directory first
            path_in_knowledge_dir = Path(KNOWLEDGE_DIRECTORY) / path_input
            if path_in_knowledge_dir.exists():
                return path_in_knowledge_dir.resolve()

            # Fallback to checking if it's absolute or relative to CWD
            absolute_or_cwd_path = Path(path_input)
            if absolute_or_cwd_path.exists():
                return absolute_or_cwd_path.resolve()

            # If neither exists, return the knowledge dir path attempt.
            # Validation later will raise the FileNotFoundError.
            self._logger.log(
                "warning",
                f"Path '{path_input}' not found directly or in "
                f"'{KNOWLEDGE_DIRECTORY}'. Assuming it should be in "
                f"'{KNOWLEDGE_DIRECTORY}' for subsequent validation.",
                color="yellow",
            )
            return path_in_knowledge_dir.resolve()
        elif isinstance(path_input, Path):
            return path_input.resolve()  # Ensure absolute path
        else:
            raise TypeError(
                f"Invalid path type: {type(path_input)}. Expected str or Path."
            )

    def _validate_paths_exist(self) -> None:
        """Check if all processed file paths exist and are files."""
        for path in self.safe_file_paths:
            if not path.exists():
                self._logger.log(
                    "error",
                    f"File not found: {path}. Ensure it exists. Searched "
                    f"relative to '{KNOWLEDGE_DIRECTORY}' and current directory.",
                    color="red",
                )
                raise FileNotFoundError(f"File not found: {path}")
            if not path.is_file():
                self._logger.log(
                    "error",
                    f"Path is not a file: {path}",
                    color="red",
                )
                raise IsADirectoryError(f"Path is not a file: {path}")

    def _read_docx_file(self, file_path: Path) -> str:
        """
        Reads text content from a single DOCX file.

        Args:
            file_path: The Path object pointing to the DOCX file.

        Returns:
            The extracted text content as a single string, or an empty
            string if reading fails or the file is empty.
        """
        try:
            document: Document = docx.Document(str(file_path))
            paragraphs_text: List[str] = [
                p.text for p in document.paragraphs if p.text and p.text.strip()
            ]
            return "\n".join(paragraphs_text)
        except Exception as e:
            # Catch potential errors during file reading (e.g., corrupted file)
            self._logger.log(
                "error",
                f"Failed to read DOCX file {file_path}: {e}",
                color="red",
            )
            # Return empty string for this file to allow processing others
            return ""

    def _load_content(self) -> Dict[Path, str]:
        """
        Load text content from all validated DOCX files into the cache.

        Returns:
            A dictionary mapping each file Path to its extracted text content.
            Files with read errors or no content are skipped.
        """
        loaded_content: Dict[Path, str] = {}
        for file_path in self.safe_file_paths:
            self._logger.log(
                "info", f"Loading content from: {file_path}"
            )
            file_content = self._read_docx_file(file_path)
            if file_content:
                loaded_content[file_path] = file_content
            else:
                self._logger.log(
                    "warning",
                    f"Skipping file due to read error or empty content: "
                    f"{file_path}",
                    color="yellow",
                )

        if not loaded_content:
            self._logger.log(
                "warning",
                "No content could be loaded from the provided DOCX files.",
                color="yellow",
            )

        return loaded_content

    def validate_content(self):
        """Validate the paths."""
        self._validate_paths_exist()

    def add(self) -> None:
        """
        Processes the loaded DOCX content for querying.

        Combines content from all files, chunks it based on configured
        settings, stores the chunks, and triggers the embedding/saving
        process via the base class's `_save_documents` method.
        """
        if not self.content:
            self._logger.log(
                "warning",
                "No content loaded from DOCX files. Nothing to add.",
                color="yellow",
            )
            return

        # Combine content from all files. Adding file path context before
        # each document's content could be a future enhancement.
        # Using Path objects as keys ensures uniqueness if paths are complex.
        combined_content = "\n\n---\n\n".join(self.content.values())

        if not combined_content.strip():
            self._logger.log(
                "warning",
                "Loaded DOCX content is empty or whitespace only. "
                "Nothing to add.",
                color="yellow",
            )
            return

        new_chunks = self._chunk_text(combined_content)
        self.chunks.extend(new_chunks)
        self._save_documents()  # Save chunks
        self._logger.log(
            "info", f"Added {len(new_chunks)} chunks from DOCX files."
        )

    def _chunk_text(self, text: str) -> List[str]:
        """
        Splits the text into manageable chunks based on configured size
        and overlap.

        Args:
            text: The text content to chunk.

        Returns:
            A list of text chunks. Returns an empty list if input text is empty
            or if chunk_size is invalid.
        """
        if not text:
            return []

        # Ensure chunk_size is positive.
        chunk_size = getattr(self, "chunk_size", 0)
        chunk_overlap = getattr(self, "chunk_overlap", 0)

        if chunk_size <= 0:
            self._logger.log(
                "error",
                f"Invalid or missing chunk_size ({chunk_size}). "
                "Cannot chunk text.",
                color="red",
            )
            return []  # Cannot proceed without a valid chunk size

        safe_chunk_size = chunk_size

        # Ensure chunk_overlap is valid.
        if chunk_overlap < 0:
            self._logger.log(
                "warning",
                f"Negative chunk_overlap ({chunk_overlap}) is invalid. "
                "Setting overlap to 0.",
                color="yellow",
            )
            safe_chunk_overlap = 0
        elif chunk_overlap >= safe_chunk_size:
            self._logger.log(
                "warning",
                f"Chunk overlap ({chunk_overlap}) is greater than or "
                f"equal to chunk size ({safe_chunk_size}). Setting overlap to 0.",
                color="yellow",
            )
            safe_chunk_overlap = 0
        else:
            safe_chunk_overlap = chunk_overlap

        step = safe_chunk_size - safe_chunk_overlap # Step will always be > 0

        chunks = [
            text[i : i + safe_chunk_size] for i in range(0, len(text), step)
        ]

        # Remove potential empty strings resulting from chunking
        return [chunk for chunk in chunks if chunk and chunk.strip()]
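One note on the chunking math: the step between chunk starts is chunk_size minus chunk_overlap. With hypothetical values chunk_size=1000 and chunk_overlap=200, the step is 800, so a 2,500-character document yields chunks starting at offsets 0, 800, 1600, and 2400, with consecutive chunks overlapping by 200 characters (the final, shorter chunk aside).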

Download the sample file from https://www.innerwest.nsw.gov.au/ArticleDocuments/1619/IndyreadseBookseReaderHelp.docx.aspx and save it into the knowledge directory.
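If you’d rather script that step, here’s a quick sketch using the requests library (the target path assumes the directory layout above):

from pathlib import Path
import requests

URL = (
    "https://www.innerwest.nsw.gov.au/ArticleDocuments/1619/"
    "IndyreadseBookseReaderHelp.docx.aspx"
)

response = requests.get(URL, timeout=60)
response.raise_for_status()
Path("knowledge").mkdir(exist_ok=True)
Path("knowledge/IndyreadseBookseReaderHelp.docx").write_bytes(response.content)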

File test_docx_knowledge_source.py:

# -*- coding: utf-8 -*-
"""Test for the DOCXKnowledgeSource class."""

# It'd be `from crewai.knowledge.source.docx_knowledge_source import DOCXKnowledgeSource`,
# if I weren't so lazy and actually submitted a PR. But for now, just import it like this:
from docx_knowledge_source import DOCXKnowledgeSource

from crewai import Agent, Task, Crew, Process, LLM
import os

os.environ["GEMINI_API_KEY"] = "YOUR-KEY"

docx_source = DOCXKnowledgeSource(
    file_paths=["IndyreadseBookseReaderHelp.docx"]
)

# See: https://docs.embedchain.ai/components/embedding-models
embedder_config = {
    "provider": "google",
    "config": {
        "model": "models/text-embedding-004",
        "api_key": os.environ["GEMINI_API_KEY"]
    }
}

# See: https://docs.crewai.com/concepts/llms#provider-configuration-examples
llm = LLM(
    model="gemini/gemini-2.0-flash",
    temperature=0.3
)

rag_assistant = Agent(
    role="Knowledge Base Assistant",
    goal=(
        "Answer questions accurately using *only* information found "
        "within the provided knowledge base."
    ),
    backstory=(
        "You are an AI assistant specialized in retrieving information "
        "*exclusively* from the provided knowledge base. Your sole "
        "purpose is to answer user questions based on this document. "
        "If required information isn't present, you must state that "
        "clearly. Decline tasks requiring external knowledge."
    ),
    llm=llm,
    verbose=True,
    allow_delegation=False,
)

rag_task = Task(
    description=(
        "Instructions:\n"
        "1. Analyze the user's question: '{user_question}'.\n"
        "2. Thoroughly search the provided knowledge base for the "
        "answer.\n"
        "3. Formulate a concise answer based **STRICTLY** on the "
        "information found within the knowledge base.\n"
        "4. **DO NOT** use external knowledge or make assumptions.\n"
        "5. If the answer is found, present the concise answer clearly.\n"
        "6. If the answer is **NOT FOUND** in the knowledge base, "
        "explicitly state that. For example: 'Sorry, I could not find "
        "information about that in the provided document.'\n"
        "7. Maintain a helpful and focused tone."
    ),
    expected_output=(
        "A helpful, concise answer (under 300 characters) based "
        "*strictly* on the knowledge base, OR a clear statement that "
        "the information was not found within the provided document."
    ),
    agent=rag_assistant,
)

rag_crew = Crew(
    agents=[rag_assistant],
    tasks=[rag_task],
    process=Process.sequential,
    verbose=True,
    knowledge_sources=[docx_source],
    embedder=embedder_config
)

result = rag_crew.kickoff(
    inputs={
        "user_question": "Can I borrow an eBook from the website?"
    }
)

print(f"\n🤖 Answer:\n\n{result.raw}\n")

And there you have it. Enjoy!

First: wow… thank you for taking the time to make this.
Second: you are really good at explaining how things work.
Third: I’ll get to work testing, but I wanted to thank you first.
