Hi,
I am working on enabling the Docling knowledge source in CrewAI.
CrewAI version: 0.108.0
docling==2.29.0
content_source = CrewDoclingSource(
    file_paths=["Browser Related.docx"],
    collection_name="sourcedocling",
    storage=KnowledgeStorage(collection_name="sourcedocling")
    # storage="./storage"
)
It is working fine, but I am not able to use the storage parameter correctly. I want my Docling vector DB to be saved in my own directory so that my crew and agents can access it later.
No luck so far. I have tried different approaches, and I have also declared CREWAI_STORAGE_DIR="./storage_DB" in my .env file.
In fact, you can pass an instance of KnowledgeStorage as a custom storage configuration, just like you did.
I took a look at the definition of KnowledgeStorage (crewai/knowledge/storage/knowledge_storage.py):
class KnowledgeStorage(BaseKnowledgeStorage):
    """
    Extends Storage to handle embeddings for memory entries, improving
    search efficiency.
    """

    collection: Optional[chromadb.Collection] = None
    collection_name: Optional[str] = "knowledge"
    app: Optional[ClientAPI] = None

    def __init__(
        self,
        embedder: Optional[Dict[str, Any]] = None,
        collection_name: Optional[str] = None,
    ):
        self.collection_name = collection_name
        self._set_embedder_config(embedder)

    # [...]

    def initialize_knowledge_storage(self):
        base_path = os.path.join(db_storage_path(), "knowledge")
        chroma_client = chromadb.PersistentClient(
            path=base_path,
            settings=Settings(allow_reset=True),
        )
        self.app = chroma_client
        try:
            collection_name = (
                f"knowledge_{self.collection_name}"
                if self.collection_name
                else "knowledge"
            )
            if self.app:
                self.collection = self.app.get_or_create_collection(
                    name=sanitize_collection_name(collection_name),
                    embedding_function=self.embedder,
                )
            else:
                raise Exception("Vector Database Client not initialized")
        except Exception:
            raise Exception("Failed to create or get collection")
Maybe I’m missing something, but it looks like base_path is hardcoded. You can actually see where your vector database will be stored with:
from pathlib import Path
from crewai.utilities.paths import db_storage_path
base_knowledge_db_path = Path(db_storage_path()) / "knowledge"
print(base_knowledge_db_path)
Anyway, your Crew and Agents will automatically handle queries to your vector database for you.
Yes, I have also gone through this module, but I had no luck with the base path either: the database is not getting saved anywhere. I am not sure whether I am initializing the KnowledgeStorage instance correctly, or whether there is an issue in the module that prevents save and initialize_knowledge_storage(self) from being called correctly.
CrewDoclingSource also has a save function (save_documents()), but the vector DB still isn't saved anywhere.
Yes, Crew and Agent handle it correctly by default, but in that case the document is processed again on every call, which adds latency that I don’t think is acceptable.
Additionally, do you have any idea how the collection name works, and how I can use it if I need to create 3 different vector DBs and pass them to 3 different agents?
Update:
Yep, turns out I was missing something: db_storage_path() itself relies on get_project_directory_name(), which ultimately respects the CREWAI_STORAGE_DIR environment variable. So you absolutely can control where your vector database gets saved, just as you need. For example:
from pathlib import Path
from crewai.utilities.paths import db_storage_path
import os
os.environ["CREWAI_STORAGE_DIR"] = str(Path.cwd()) # Set to current working directory
base_knowledge_db_path = Path(db_storage_path()) / "knowledge"
print(base_knowledge_db_path)
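On the collection name question: going only by the initialize_knowledge_storage code quoted above, it looks like each collection_name ends up as a Chroma collection called knowledge_<name> under the same base path, so three names would most likely give you three collections in one database rather than three separate databases. Here is a minimal pure-Python sketch of just that naming rule. The helper knowledge_collection_name and the three example names are hypothetical, and I'm skipping the sanitize_collection_name() step the real code applies, so treat it as an illustration rather than what CrewAI literally runs:

```python
from typing import Optional


def knowledge_collection_name(collection_name: Optional[str] = None) -> str:
    """Mirror the naming rule seen in the quoted initialize_knowledge_storage()."""
    # NOTE: the real code also passes the result through
    # sanitize_collection_name(); that step is omitted in this sketch.
    return f"knowledge_{collection_name}" if collection_name else "knowledge"


# Hypothetical collection names, one per agent (not taken from the thread).
agent_collections = ["browser_docs", "network_docs", "hr_docs"]

# Map each name to the Chroma collection name the rule above would produce.
resolved = {name: knowledge_collection_name(name) for name in agent_collections}

for name, chroma_name in resolved.items():
    print(f"{name} -> {chroma_name}")
```

If that reading is right, giving each agent's knowledge source its own distinct collection_name should be enough to keep their data apart, while CREWAI_STORAGE_DIR controls where the shared database lives.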