String Knowledge sources not working with Gemini

Same issue encountered when trying to use Azure/any LLM that isn’t “OpenAI”.

Update:
@tonykipkemboi @rokbenko

I’ve traced the problem to a few areas inside KnowledgeStorage:

def _set_embedder_config(
    self, embedder_config: Optional[Dict[str, Any]] = None
) -> None:
    """Set the embedding configuration for the knowledge storage.

    Args:
        embedder_config (Optional[Dict[str, Any]]): Configuration dictionary for the embedder.
            If None or empty, defaults to the default embedding function.
    """
    self.embedder_config = (
        EmbeddingConfigurator().configure_embedder(embedder_config)
        if embedder_config
        else self._create_default_embedding_function()
    )

What seems to be happening is that the embedder_config data is not reaching _set_embedder_config() (i.e. it shows up as None) when attempting to use a different provider/model/LLM.

Here are the logs for the config I’m passing directly into KnowledgeSource. I am printing these results from inside the configure_embedder() method:

Configuring embedding function with: {'provider': 'azure', 'config': {'model': 'text-embedding-ada-002'}}

Returning self.embedding_functions[provider]: <function EmbeddingConfigurator._configure_azure at 0x10a8d2de0>

The error seems to be that the config isn’t being recognized (it shows up as None in configure_embedder()), so it falls back to _create_default_embedding_function() and defaults to prompting for an OpenAI key.
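For anyone debugging the same path, here is a minimal sketch for narrowing down where the config gets dropped: construct KnowledgeStorage directly with an explicit embedder_config. The import path and config shape mirror the snippets elsewhere in this thread; the Azure values are placeholders.

import os

from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage

# If this honors the Azure config (no OpenAI key prompt), the config is being
# dropped somewhere upstream of KnowledgeStorage rather than inside
# _set_embedder_config() itself.
storage = KnowledgeStorage(
    embedder_config={
        "provider": "azure",
        "config": {
            "model": "text-embedding-ada-002",
            "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
        },
    }
)
storage.initialize_knowledge_storage()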

Let me know if there’s any more info I can provide or if you want to try to pass in something different.

We are looking into this internally. More details to follow.


Hi, I think I’m having the same issue when trying to use watsonx as the embedder. I’ve raised an issue here: [BUG] Watsonx as embedder is not working - script errors and stops · Issue #1790 · crewAIInc/crewAI · GitHub

@tonykipkemboi
@subbu
@rokbenko

I was able to find a workaround/solution that enables Knowledge to work with non-OpenAI models, confirmed as working with the knowledge_sources attribute placed on either the Crew instance or the Agent. In this example I am using a modified version of the SpaceNewsKnowledgeSource example from Knowledge - CrewAI.
I’m going to do a bit of research on integrating it into StringKnowledgeSource next. The following elements were utilized:

Custom storage: It appears that embedding configuration errors can be pinpointed to a couple of different areas, BaseKnowledgeSource being one of them. To prevent the embedder_config from being handled by the default mechanism (geared towards OpenAI embeddings), I created a custom storage object and initialized it:

import os

from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage

# Create and configure knowledge storage with an explicit (non-OpenAI) embedder
storage: KnowledgeStorage = KnowledgeStorage(
    embedder_config={
        "provider": "azure",
        "config": {
            "model": "text-embedding-3-small",
            "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
        },
    }
)

storage.initialize_knowledge_storage()

In the actual SpaceNewsKnowledgeSource class I had to use an altered add() method, based on the BaseKnowledgeSource class that it extends. One of the issues was that the metadata was not being correctly associated with the formatted data chunks, so unique IDs needed to be created and associated with each chunk:

import hashlib

# Methods defined on the SpaceNewsKnowledgeSource class (which extends BaseKnowledgeSource)

def generate_unique_id(self, content: str) -> str:
    """Generate a unique ID using a hash of the content."""
    return hashlib.sha256(content.encode('utf-8')).hexdigest()

def add(self) -> None:
    """Process and store the articles."""
    content = self.load_content()
    all_chunks = []
    # Chunk the text content
    for _, text in content.items():
        chunks = self._chunk_text(text)
        all_chunks.extend(chunks)

    # Deduplicate chunks
    unique_chunks = list(set(all_chunks))

    # Assign deduplicated chunks to self.chunks
    self.chunks = unique_chunks  # Ensure self.chunks is fully aligned

    # Create one metadata entry per chunk
    self.metadata = [{"id": self.generate_unique_id(chunk)} for chunk in self.chunks]

    # Validate metadata and chunk alignment
    if len(self.chunks) != len(self.metadata):
        raise ValueError(
            f"Mismatch in chunks and metadata lengths: {len(self.chunks)} vs {len(self.metadata)}"
        )

    # Save documents with the associated metadata
    self.save_documents(metadata=self.metadata)

Later on, I associate the storage configuration I created above with the custom knowledge source:

recent_news = SpaceNewsKnowledgeSource(
    api_endpoint="https://api.spaceflightnewsapi.net/v4/articles",
    limit=10,
)

recent_news.storage = storage
# Add content to the knowledge source
recent_news.add()

When it comes to the crew configuration, knowledge_sources and embedder both need to be set up with the following structure (alter the model name/params as needed for your application):

crew = Crew(
    # ... agents, tasks, and other settings ...
    knowledge_sources=[recent_news],
    embedder={
        "provider": "azure",
        "config": {
            "model": "text-embedding-3-small",
            "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
        },
    },
)

This is currently configured with Crew serving as the central location for the Knowledge source, but it can work with Agents and their associated Knowledge sources as well. Don’t forget to also manually set your LLMs on your agents, e.g.:

space_analyst = Agent(
    role="Analyst",
    goal="Answer questions using the knowledge sources you are provided with, accurately and comprehensively.",
    backstory="""You are an expert analyst. You excel at answering questions
    about the information provided in knowledge_sources (Knowledge) with accurate information.""",
    # knowledge_sources=[recent_news],
    # embedder_config={
    #     "provider": "azure",
    #     "config": {
    #         "model": "text-embedding-3-small",
    #         "api_key": os.environ.get("AZURE_OPENAI_API_KEY")
    #     }},
    llm=LLM(model="azure/gpt-4-32k", temperature=0.0)
)

The commented-out lines are there in case you’d like to use the Agent-Knowledge association instead, as sketched below.
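For reference, here is a hedged sketch of that Agent-level variant; it simply uncomments those lines, and the embedder_config parameter name on Agent mirrors the comment above (it may differ across crewai versions):

space_analyst = Agent(
    role="Analyst",
    goal="Answer questions using the knowledge sources you are provided with, accurately and comprehensively.",
    backstory="You are an expert analyst who answers questions from the provided knowledge sources with accurate information.",
    knowledge_sources=[recent_news],  # Agent-level Knowledge association
    embedder_config={                 # same Azure embedder as above
        "provider": "azure",
        "config": {
            "model": "text-embedding-3-small",
            "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
        },
    },
    llm=LLM(model="azure/gpt-4-32k", temperature=0.0),
)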

I also have the following .env variables in place (see the example .env sketch after the list). I’m currently stress-testing everything and finding out what isn’t explicitly necessary for this to work:

  • AZURE_OPENAI_API_KEY
  • AZURE_API_BASE
  • AZURE_OPENAI_ENDPOINT (same as “AZURE_API_BASE”)
  • AZURE_OPENAI_DEPLOYMENT_NAME (make this one the model name of your embedder)
  • AZURE_API_KEY
  • AZURE_API_VERSION
  • OPENAI_API_VERSION (embedding model version)
  • OPENAI_API_KEY (same as Azure API key)
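
For reference, a sketch of the corresponding .env file; every value is a placeholder, and as noted above some entries may turn out to be redundant:

# Placeholder values only; trim once you confirm what your setup actually needs
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_API_BASE=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/   # same as AZURE_API_BASE
AZURE_OPENAI_DEPLOYMENT_NAME=text-embedding-3-small               # the embedder model name
AZURE_API_KEY=<your-azure-openai-key>
AZURE_API_VERSION=<your-api-version>
OPENAI_API_VERSION=<your-api-version>                             # embedding model version
OPENAI_API_KEY=<your-azure-openai-key>                            # same as the Azure API key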

Those are my findings so far. Planning to iterate on this for StringKnowledgeSource next. Hope this helps!

Here’s a custom class/implementation that successfully utilizes a string Knowledge source with a non-OpenAI provider (such as Azure or Gemini), along with a quick breakdown/demo:

The setup:
*Note: Make sure to also follow the other configuration instructions I mentioned in the previous post.

Custom Knowledge Source - String

import hashlib
import os
from typing import Any, Dict

from pydantic import Field

from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage

# Create and configure knowledge storage
storage: KnowledgeStorage = KnowledgeStorage(
    embedder_config={
        "provider": "azure",
        "config": {
            "model": "text-embedding-3-small",
            "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
        },
    }
)
storage.initialize_knowledge_storage()

class AzureStringKnowledgeSource(BaseKnowledgeSource):
    """Knowledge source that stores and queries a raw text string."""
    text_content: str = Field(description="The string text content")

    def load_content(self) -> Dict[Any, str]:
        try:
            return {"data": self.text_content}
        except Exception as e:
            raise ValueError(f"Failed to acquire string: {str(e)}")

    def generate_unique_id(self, content: str) -> str:
        """Generate a unique ID using a hash of the content."""
        return hashlib.sha256(content.encode('utf-8')).hexdigest()

    def add(self) -> None:
        """Process and store the text content."""
        content = self.load_content()
        all_chunks = []
        # Chunk the text content
        for _, text in content.items():
            chunks = self._chunk_text(text)
            all_chunks.extend(chunks)
        # Deduplicate chunks
        unique_chunks = list(set(all_chunks))
        # Assign deduplicated chunks to self.chunks
        self.chunks = unique_chunks  # Ensure self.chunks is fully aligned
        # Create one metadata entry per chunk
        self.metadata = [{"id": self.generate_unique_id(chunk)} for chunk in self.chunks]
        # Validate metadata and chunk alignment
        if len(self.chunks) != len(self.metadata):
            raise ValueError(
                f"Mismatch in chunks and metadata lengths: {len(self.chunks)} vs {len(self.metadata)}"
            )
        # Save documents with the associated metadata
        self.save_documents(metadata=self.metadata)

text_data = AzureStringKnowledgeSource(text_content="The secret number is 42.")
text_data.storage = storage
# Add content to the knowledge source
text_data.add()
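To tie it together, a hedged sketch of attaching this string source to a crew; the space_analyst agent comes from the earlier post, the task is assumed to be defined elsewhere, and the embedder block mirrors the storage config above:

# space_analyst defined as in the earlier post; task defined elsewhere
crew = Crew(
    agents=[space_analyst],
    tasks=[task],
    process=Process.sequential,
    knowledge_sources=[text_data],  # the AzureStringKnowledgeSource from above
    embedder={
        "provider": "azure",
        "config": {
            "model": "text-embedding-3-small",
            "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
        },
    },
    verbose=True,
)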

(Screenshot/breakdown omitted.)

Hi @Evan_Scallan,
I’m working on a chatbot using CrewAI, and I’m setting up a GoogleGenerativeAiEmbeddingFunction with KnowledgeStorage. I wanted to confirm whether my current approach is correct for initializing the embedder and linking it with the knowledge base:

# Imports assume the crewai.knowledge layout used elsewhere in this thread
from chromadb.utils import embedding_functions

from crewai.knowledge.knowledge import Knowledge
from crewai.knowledge.source.excel_knowledge_source import ExcelKnowledgeSource
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage

google_ai = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key=google_api_key
)

# Configuring Knowledge Storage with Google Embedder
storage = KnowledgeStorage(
    embedder_config={
        "provider": "google",
        "embedding_function": google_ai  # Passing the instance directly
    }
)
storage.initialize_knowledge_storage()

# Setting up Excel data source and linking to storage
excel_source = ExcelKnowledgeSource(file_paths=["knowledge/cit_alumni.xls"])
knowledge = Knowledge(collection_name="excel_knowledge", sources=[excel_source])

# Attaching storage and adding the knowledge
knowledge.storage = storage
knowledge.add()

My Questions:

  1. Is it correct to pass an initialized GoogleGenerativeAiEmbeddingFunction directly as shown above?
  2. Should I use "embedding_function": google_ai directly, or does it need a different structure?
  3. Does ExcelKnowledgeSource support .xls files, or should I switch to .xlsx for compatibility?

I appreciate any guidance from the community. Thanks in advance for your help!
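For context, this is the alternative config-dict structure I’m considering instead of passing the function directly, based on the examples earlier in this thread (the model name is a placeholder I have not verified against the docs):

storage = KnowledgeStorage(
    embedder_config={
        "provider": "google",
        "config": {
            "model": "<your-gemini-embedding-model>",  # placeholder, not verified
            "api_key": google_api_key,
        },
    }
)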

We just updated the Knowledge docs with a Gemini example.


@Satish_V

I’m not entirely sure. I’ve tried a similar approach using AzureOpenAIEmbeddings directly on the Crew instance, but you may have some luck.
I noticed a similar approach in the documentation for memory: Memory - CrewAI
That being said, I do know that the following structure for embedder_config has given me some success:

embedder_config={
    "provider": "<your provider>",
    "config": {
        "model": "<your embedding model>",
        "api_key": "<your api key>"
    }
}

@Evan_Scallan please check our docs above. We updated them so you don’t need to use “embedder_config” but rather “embedder”.


@tonykipkemboi As you mentioned, I tried the updated Knowledge docs with the Gemini example, but the code still prompts for an OpenAI API key.

We’re cutting a new version today that will resolve the issue. For now, you can test the fix before it gets officially cut by installing from the main GitHub branch of the crewai repo (e.g. pip install git+https://github.com/crewAIInc/crewAI.git).

Hi @tonykipkemboi, sorry for possibly mixing topics. Since the other topic where you replied is closed, I’m taking advantage of this one to raise my question.

I tried:

from crewai.knowledge.source.base_knowledge_source import BaseKnowledgeSource
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

No matter which one I use, I always get the same warning/error:

[2025-01-03 21:26:46][ERROR]: Failed to upsert documents: Unequal lengths for fields: ids: 2, metadatas: 1, documents: 2 in upsert.

[2025-01-03 21:26:46][WARNING]: Failed to init knowledge: Unequal lengths for fields: ids: 2, metadatas: 1, documents: 2 in upsert.

I’m using Ollama and an IBM (granite) embedder:

crew = Crew(
    agents=[
        proposal_writer_agent,
        # sales_manager_agent,
        account_executive_agent
    ],
    tasks=[
        proposal_creation_task,
        proposal_review_task,
        proposal_revision,
        proposal_email_task
    ],
    knowledge_sources=[string_knowledge_source_brandIdentity],
    verbose=True,
    embedder={
        "provider": "ollama",
        "config": {
            "model": "granite-embedding:278m"
        }
    },
    process=Process.hierarchical,
    manager_llm='ollama/gemma2-27b-32k',
    # manager_agent=sales_manager_agent,
    # planning=True,
    # planning_llm=llm_manager,
    full_output=True,
    output_log_file=f"sbd_logs_{self.timestamp}.txt"
)

To get around the unequal lengths error, I had to implement my own custom class and method that handles the metadata differently, at least in the current state. You’ll see that in my larger post from a day ago.
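Condensed, the key part of that workaround is producing exactly one metadata entry per chunk so ids, metadatas, and documents all end up the same length. A sketch of the add() override, trimmed from the larger post above:

def add(self) -> None:
    """Chunk, deduplicate, and store content with one metadata entry per chunk."""
    content = self.load_content()
    all_chunks = []
    for _, text in content.items():
        all_chunks.extend(self._chunk_text(text))

    # Deduplicate, then build exactly one metadata dict per chunk
    self.chunks = list(set(all_chunks))
    self.metadata = [{"id": self.generate_unique_id(chunk)} for chunk in self.chunks]

    self.save_documents(metadata=self.metadata)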

Just so you know, that behaviour of failing or defaulting to OpenAI also happens with Ollama and the nomic-embed-text model.

Using this →

from crewai import Crew, Process
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

# agent, task, and string_source are defined elsewhere
crew = Crew(
    agents=[agent],
    tasks=[task],
    verbose=True,
    process=Process.sequential,
    knowledge_sources=[string_source],
    embedder={
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
        }
    }
)

Traceback

  File "/path/to/project/.venv/bin/run_crew", line 5, in <module>
    from tax_crew.main import run
  File "/path/to/project/src/tax_crew/main.py", line 5, in <module>
    from tax_crew.crew import TaxCrew
  File "/path/to/project/src/tax_crew/crew.py", line 6, in <module>
    string_source = StringKnowledgeSource(
                    ^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.11/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.11/site-packages/crewai/knowledge/storage/knowledge_storage.py", line 51, in __init__
    self._set_embedder_config(embedder_config)
  File "/path/to/venv/lib/python3.11/site-packages/crewai/knowledge/storage/knowledge_storage.py", line 174, in _set_embedder_config
    else self._create_default_embedding_function()
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.11/site-packages/crewai/knowledge/storage/knowledge_storage.py", line 158, in _create_default_embedding_function
    return OpenAIEmbeddingFunction(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.11/site-packages/chromadb/utils/embedding_functions/openai_embedding_function.py", line 56, in __init__
    raise ValueError(
ValueError: Please provide an OpenAI API key. You can get one at https://platform.openai.com/account/api-keys
An error occurred while running the crew: Command '['uv', 'run', 'run_crew']' returned non-zero exit status 1.

@tonykipkemboi I’m using 0.95. I have tested the String Knowledge code as per the docs, and it still prompts for an OpenAI API key.

Try adding api_key to the config and setting it to a random string or empty quotes.

crew = Crew(
    agents=[agent],
    tasks=[task],
    verbose=True,
    process=Process.sequential,
    knowledge_sources=[string_source],
    embedder={
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text",
            "api_key": ""
        }
    }
)

In my case, when I run only

csv_source = PDFKnowledgeSource(file_paths=["data.csv"])

it still gives me the error asking to provide an OpenAI API key.
So the error might be caused by the ‘knowledge source’ code and not by the ‘crew initialization’ code.

This might be helpful to reproduce the error.
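
If it helps, the workaround from earlier in this thread (attaching a pre-initialized KnowledgeStorage with an explicit embedder_config to the source before calling add()) may sidestep the default OpenAI path. A rough sketch, with the provider details and file path left as placeholders:

from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai.knowledge.storage.knowledge_storage import KnowledgeStorage

storage = KnowledgeStorage(
    embedder_config={
        "provider": "<your provider>",
        "config": {"model": "<your embedding model>", "api_key": "<your api key>"},
    }
)
storage.initialize_knowledge_storage()

pdf_source = PDFKnowledgeSource(file_paths=["data.pdf"])  # placeholder path
pdf_source.storage = storage
pdf_source.add()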

This is still giving the same error.


Same here! The suggested code doesn’t work.