TextFileKnowledgeSource - ChromaDB "Failed to create or get collection"

Hello,

Basically, my project takes text coming from a Streamlit page, and I need the agent to determine whether this incoming text is valid according to a set of business rules. I'm trying to run the knowledge base from a .txt file. My configuration follows the docs, and I've set up the embedder as well as chunking.
When I try to run the knowledge source, I keep getting the same error no matter what I try.

My error is
raise Exception(f"An error occurred while running the crew: {e}")
Exception: An error occurred while running the crew: Failed to create or get collection

File "D:\anaconda\envs\ocr\Lib\site-packages\chromadb\rate_limit\simple_rate_limit_init .py", line 23, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File “D:\anaconda\envs\ocr\Lib\site-packages\chromadb\api\segment.py”, line 202, in create_collection check_index_name(name) File “D:\anaconda\envs\ocr\Lib\site-packages\chromadb\api\segment.py”, line 90, in check_index_name raise ValueError(msg) ValueError: Expected collection name that (1) contains 3-63 characters, (2) starts and ends with an alphanumeric character, (3) otherwise contains only alphanumeric characters, underscores or hyphens (-), (4) contains no two consecutive periods (..) and (5) is not a valid IPv4 address,

The problem is that ChromaDB is generating the collection name from my data itself, and the name it generates on its own is more than 68 characters long. I tried hardcoding a name that satisfies the rules, but ChromaDB still generates its own, invalid name.
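For reference, here's a small standalone check I put together just to see which names ChromaDB will accept. This is only my own illustration of the rules quoted in the error, not CrewAI or ChromaDB code, and the safe_collection_name helper is a hypothetical fallback:

import hashlib
import ipaddress
import re

def is_valid_chroma_name(name: str) -> bool:
    """Mirror the rules listed in ChromaDB's error message (illustrative only)."""
    if not (3 <= len(name) <= 63):
        return False
    # Must start and end with an alphanumeric character and otherwise
    # contain only alphanumerics, underscores or hyphens.
    if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9_-]*[A-Za-z0-9]", name):
        return False
    # No two consecutive periods.
    if ".." in name:
        return False
    # Must not be a valid IPv4 address.
    try:
        ipaddress.IPv4Address(name)
        return False
    except ValueError:
        return True

def safe_collection_name(text: str) -> str:
    """Derive a rule-compliant name from arbitrary text via a truncated hash."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    return f"kb_{digest[:40]}"  # well under the 63-character limit

print(is_valid_chroma_name("text_file_rules"))        # True
print(safe_collection_name("my business rules text")) # e.g. kb_<40 hex chars>

A name like "text_file_rules" passes this check, so the hardcoded name itself should be fine; the issue is that it never gets used.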

My code is:

import os

from crewai import Agent, Crew, Process, Task, LLM
from crewai.project import CrewBase, agent, crew, task
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource

text_source = TextFileKnowledgeSource(
    file_paths=["Businessrulex.txt"],
    chunk_size=200,     # Maximum size of each chunk (default: 4000)
    chunk_overlap=10,   # Overlap between chunks (default: 200)
    collection_name="text_file_rules"
)

llm = LLM(model="gpt-4o-mini", temperature=0)

@CrewBase
class Visionproj():
    """Visionproj crew"""

    agents_config = 'config/agents.yaml'
    tasks_config = 'config/tasks.yaml'

    @agent
    def claimagent(self) -> Agent:
        return Agent(
            config=self.agents_config['cagent'],
            knowledge_sources=[text_source],
            verbose=True,
            llm=llm,
            embedder={
                "provider": "openai",
                "config": {
                    "model": "text-embedding-3-small",
                    "api_key": os.environ['OPENAI_API_KEY'],
                },
            },
        )

    @task
    def claimtask(self) -> Task:
        return Task(
            config=self.tasks_config['ctask'],
            verbose=True,
        )

    @crew
    def crew(self) -> Crew:
        """Creates the Visionproj crew"""
        return Crew(
            agents=self.agents,  # Automatically created by the @agent decorator
            tasks=self.tasks,    # Automatically created by the @task decorator
            process=Process.sequential,
            verbose=True,
            knowledge_sources=[text_source],
            embedder={
                "provider": "openai",
                "config": {
                    "model": "text-embedding-3-small",
                    "api_key": os.environ['OPENAI_API_KEY'],
                },
            },
        )

I tried hashlib.sha256(text.encode()), removing the collection name and letting it create its own (it still created a very long, unusable name), client.get_or_create_collection(name="test"), hardcoding collection_name="", and so on. I'm still stuck on this ChromaDB issue, and I don't know what other problems I'll run into later. I updated the libraries but nothing worked. My agents and tasks YAML files are configured correctly, by the way.

Any kind of help would be very helpful.
Thank you!

Welcome to our community!

Regarding the issue you’re having with ChromaDB, honestly, it looks to me like the TextFileKnowledgeSource.collection_name just isn’t being taken into account at all. What actually determines the final name of the collection is KnowledgeStorage.collection_name. So, try something like this and see if it resolves that particular issue:

from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage

text_source = TextFileKnowledgeSource(
    file_paths=["BusinessruleX.txt"],
    chunk_size=200,
    chunk_overlap=10,
    storage=KnowledgeStorage(
        collection_name="text_file_rules"
    )
)

Let us know if this helps!

Hello,

Thank you for the suggestion. I tried importing and implementing it in code but I still got the same error.

What I've noticed is that it's completely failing to pick up the collection name. Others in the community seem to have run this without hitting any ChromaDB errors like mine. I face this issue with both PDF and text files.
Another thing is that it's generating its own name from the incoming data rather than from my business rules, which is weird, because I want it to chunk, embed, and store the business rules, not the data. I don't know where it's taking this from. I'm using the full CrewAI framework with YAML files and Python, not a standalone Python file.
Do I need to get a ChromaDB API key? There are embedding requests hitting my OpenAI account because I'm using that API. Is there something else I can try? Here is my code:

crew.py
import os

from crewai import Agent, Crew, Process, Task, LLM
from crewai.project import CrewBase, agent, crew, task
from crewai.knowledge.source.text_file_knowledge_source import TextFileKnowledgeSource
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage

text_source = TextFileKnowledgeSource(
    file_paths=["Businessrulex.txt"],
    chunk_size=200,
    chunk_overlap=10,
    storage=KnowledgeStorage(
        collection_name="text_file_rules"
    )
)

llm = LLM(model="gpt-4o-mini", temperature=0)

@CrewBase
class Visionproj():
    """Visionproj crew"""

    agents_config = 'config/agents.yaml'
    tasks_config = 'config/tasks.yaml'

    @agent
    def claimagent(self) -> Agent:
        return Agent(
            config=self.agents_config['cagent'],
            knowledge_sources=[text_source],
            verbose=True,
            llm=llm,
            embedder={
                "provider": "openai",
                "config": {
                    "model": "text-embedding-3-small",
                    "api_key": os.environ['OPENAI_API_KEY'],
                },
            },
        )

    @task
    def claimtask(self) -> Task:
        return Task(
            config=self.tasks_config['ctask'],
            verbose=True,
            output_file="claim.md"
        )

    @crew
    def crew(self) -> Crew:
        """Creates the Visionproj crew"""
        return Crew(
            agents=self.agents,  # Automatically created by the @agent decorator
            tasks=self.tasks,    # Automatically created by the @task decorator
            process=Process.sequential,
            verbose=True,
            knowledge_sources=[text_source],
            embedder={
                "provider": "openai",
                "config": {
                    "model": "text-embedding-3-small",
                    "api_key": os.environ['OPENAI_API_KEY'],
                },
            },
        )

main.py
import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../')))

import json
import warnings
import logging

from proj.crew import Visionproj


def run():
    """
    Run the crew.
    """
    try:
        inputs = json.loads(sys.stdin.read())
        extracted_text = inputs['extracted_text']
        Visionproj().crew().kickoff(inputs={'extracted_text': extracted_text})
    except Exception as e:
        raise Exception(f"An error occurred while running the crew: {e}")


if __name__ == "__main__":
    run()

Any help would be appreciated, been working on this for months!
Thank you

Hey, @i_lipi. I took a quick look at the CrewAI code, and here’s what I’ve gathered:

You need to provide an embedder for both your custom KnowledgeStorage and your Crew. This is especially important if you’re not using OpenAI, so I wanted to document it here in case anyone else runs into the same situation. It should look something like this:

embedder_config = {
    "provider": "google",
    "config": {
        "model": "models/text-embedding-004",
        "api_key": os.environ["GEMINI_API_KEY"]
    }
}

txt_source = TextFileKnowledgeSource(
    file_paths=["user_info.txt"],
    chunk_size=200,
    chunk_overlap=10,
    storage=KnowledgeStorage(
        collection_name="text_file_rules",
        embedder=embedder_config    # <-- Embedder goes here
    )
)

crew = Crew(
    agents=[agent],
    tasks=[task],
    verbose=True,
    process=Process.sequential,
    knowledge_sources=[txt_source],
    embedder=embedder_config    # <-- And here as well
)

Now, about your collection_name parameter not being respected — that’s actually true. In the example above, KnowledgeStorage is instantiated twice: first in txt_source = TextFileKnowledgeSource(... and then again in crew = Crew(....

Check out this part of the Crew class code in crewai/crew.py:

@model_validator(mode="after")
def create_crew_knowledge(self) -> "Crew":
    """Create the knowledge for the crew."""
    if self.knowledge_sources:
        try:
            if isinstance(self.knowledge_sources, list) and all(
                isinstance(k, BaseKnowledgeSource) for k in self.knowledge_sources
            ):
                self.knowledge = Knowledge(
                    sources=self.knowledge_sources,
                    embedder=self.embedder,
                    collection_name="crew",    # <-- Hardcoded here
                )

        except Exception as e:
            self._logger.log(
                "warning", f"Failed to init knowledge: {e}", color="yellow"
            )
    return self

At this stage, during validation of the Crew class, self.knowledge.collection_name gets set to "crew", and later on it’s changed to "knowledge_crew", which is the collection_name you’ll actually see in your ChromaDB.
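If you want to double-check what actually ended up in your ChromaDB, a quick way is to open the persisted database directly and list the collections. This is just a rough sketch: db_path below is a placeholder, so point it at wherever CrewAI actually persisted its knowledge ChromaDB files on your machine.

import chromadb

# Placeholder path: adjust this to the directory where CrewAI persisted
# its knowledge ChromaDB on your machine.
db_path = r"C:\path\to\crewai\knowledge\chromadb"

client = chromadb.PersistentClient(path=db_path)
for collection in client.list_collections():
    # Depending on your chromadb version this is a Collection object or just a name string.
    name = getattr(collection, "name", collection)
    print(name)  # expect to see "knowledge_crew" listed here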

That’s what I’ve found so far. Hope these pointers help you get started on a real fix for your issue!

Hello,
I just removed the chunk parameters from my code, so it fell back to the default chunk size and overlap, which seem to suit my text.

DEBUG - Knowledge sources: {'chunk_size': 4000, 'chunk_overlap': 200, 'chunks': , 'chunk_embeddings': , 'storage': <crewai.knowledge.storage.knowledge_storage.KnowledgeStorage object at 0x000001C9FA3A0B60>, 'metadata': {}, 'collection_name': None, 'file_path': None, 'file_paths': ['Businessrulex.txt'], 'content': {WindowsPath('knowledge/Businessrulex.txt'): "###

Since this is used as short-term memory, I think this data gets stored in a temp file under my AppData folder until the session ends.
This worked and my agent is answering accordingly.
Thank you very much for your guidance. Really appreciate it.

Please mark this thread as resolved, thank you!!! (^ω^ʃƪ)