8 changes: 7 additions & 1 deletion api/config/embedder.json
@@ -1,10 +1,13 @@
{
  "embedder": {
    "client_class": "OpenAIClient",
    "initialize_kwargs": {
      "api_key": "${OPENAI_API_KEY}",
      "base_url": "${OPENAI_BASE_URL}"
    },
Comment on lines +4 to +7
Contributor


critical

This change introduces a potential issue when environment variables are not set. The replace_env_placeholders function will pass the literal placeholder string (e.g., "${OPENAI_BASE_URL}") to the OpenAIClient constructor if the corresponding environment variable is missing.

This causes two problems:

  1. For base_url, it bypasses the client's logic to use the default OpenAI URL (https://api.openai.com/v1) when OPENAI_BASE_URL is not set. The client will instead try to connect to the invalid URL "${OPENAI_BASE_URL}".
  2. For api_key, it bypasses the client's validation that raises a ValueError if the API key is missing. Instead, it will attempt to authenticate with the invalid key "${OPENAI_API_KEY}".

The OpenAIClient is already designed to read api_key and base_url from environment variables if they are not passed to the constructor. Relying on that existing mechanism is more robust.

While the removal of the dimensions field is a good improvement, I recommend removing the initialize_kwargs section entirely. The client will correctly handle credentials on its own. If the original issue persists, it might be due to the environment variables not being correctly propagated to the application, which should be investigated.
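The failure mode described above can be reproduced with a minimal sketch. The real `replace_env_placeholders` is not shown in this thread, so `substitute_env` below is a hypothetical stand-in that assumes the behavior the comment describes: a `${VAR}` placeholder is left untouched when `VAR` is not set.

```python
import os
import re

# Hypothetical stand-in for the project's replace_env_placeholders;
# the actual implementation may differ.
_PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")


def substitute_env(value, env=None):
    """Replace ${VAR} with env[VAR]; keep the placeholder text if VAR is unset."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {k: substitute_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute_env(v, env) for v in value]
    if isinstance(value, str):
        # m.group(0) is the whole "${VAR}" text, used as the fallback.
        return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    return value


kwargs = {"api_key": "${OPENAI_API_KEY}", "base_url": "${OPENAI_BASE_URL}"}

# Simulate an environment where only the API key is defined.
resolved = substitute_env(kwargs, env={"OPENAI_API_KEY": "sk-test"})
print(resolved)
```

Under that assumption, `base_url` reaches the client constructor as the literal placeholder string rather than as a missing argument, so the client's own default never applies.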

Contributor


> The OpenAIClient is already designed to read api_key and base_url from environment variables if they are not passed to the constructor. Relying on that existing mechanism is more robust.

I am curious: what issues are you seeing when these values are not defined directly in the kwargs?

    "batch_size": 500,
    "model_kwargs": {
      "model": "text-embedding-3-small",
      "dimensions": 256,
      "encoding_format": "float"
    }
Comment on lines 9 to 12


🔴 Removal of dimensions parameter causes embedding dimension mismatch with cached databases

Removing the dimensions: 256 parameter from the OpenAI embedder configuration causes a breaking change for users with existing cached databases.


Background

The text-embedding-3-small model defaults to 1536 dimensions when no dimensions parameter is specified. The previous configuration explicitly set dimensions: 256 to reduce embedding size.

How the bug is triggered

  1. User has an existing cached database (.pkl file) at ~/.adalflow/databases/{repo_name}.pkl with 256-dimensional embeddings (created before this change)
  2. User updates to the new configuration (without dimensions parameter)
  3. System loads the cached 256-dimensional embeddings (the loading code is at api/data_pipeline.py:869-892)
  4. When querying, the system generates a 1536-dimensional query embedding using the new configuration
  5. FAISS retriever fails because query embedding dimension (1536) doesn't match document embedding dimension (256)

Code flow

  • api/data_pipeline.py:869-892 loads existing databases without checking embedding dimension compatibility:
if self.repo_paths and os.path.exists(self.repo_paths["save_db_file"]):
    logger.info("Loading existing database...")
    self.db = LocalDB.load_state(self.repo_paths["save_db_file"])
    documents = self.db.get_transformed_data(key="split_and_embed")
    if documents:
        # ... logs dimensions but doesn't validate against current config
        return documents  # Returns old embeddings
  • api/rag.py:385-390 creates FAISS retriever with mismatched dimensions:
self.retriever = FAISSRetriever(
    **configs["retriever"],
    embedder=retrieve_embedder,  # Uses new 1536-dim embedder
    documents=self.transformed_docs,  # Contains old 256-dim embeddings
    document_map_func=lambda doc: doc.vector,
)

Impact

  • Runtime errors when querying repositories that have cached databases
  • Error message: "All embeddings should be of the same size" or similar FAISS dimension mismatch error
  • Users must manually delete cached databases to recover

Recommendation: either (1) keep the dimensions: 256 parameter to maintain backward compatibility, or (2) add dimension validation in api/data_pipeline.py to detect and rebuild databases whose embedding dimensions don't match the current configuration.
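Option (2) could look roughly like the sketch below. It is only an illustration under stated assumptions: `cached_dimension` and `needs_rebuild` are hypothetical helpers, not functions in api/data_pipeline.py, and the cached vectors are modeled as plain lists of floats rather than the project's document objects.

```python
def cached_dimension(vectors):
    """Return the shared length of the cached vectors, or None if empty or mixed."""
    sizes = {len(v) for v in vectors}
    return sizes.pop() if len(sizes) == 1 else None


def needs_rebuild(vectors, expected_dim):
    """True when the cache cannot be reused with an embedder producing expected_dim."""
    return cached_dimension(vectors) != expected_dim


# A cache built under the old configuration (dimensions: 256).
old_cache = [[0.0] * 256, [0.1] * 256]

# With dimensions removed, text-embedding-3-small produces 1536-dim vectors.
if needs_rebuild(old_cache, expected_dim=1536):
    print("Cached embeddings do not match the current embedder; rebuild required.")
```

A check of this kind would run right after the cached documents are loaded, so a stale cache is rebuilt instead of being handed to the retriever.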


  },
@@ -16,6 +19,9 @@
  },
  "embedder_google": {
    "client_class": "GoogleEmbedderClient",
    "initialize_kwargs": {
      "api_key": "${GOOGLE_API_KEY}"
    },
Comment on lines +22 to +24
Contributor


critical

Similar to the OpenAI configuration, this change can cause issues if the GOOGLE_API_KEY environment variable is not set. The GoogleEmbedderClient will be initialized with api_key as the literal string "${GOOGLE_API_KEY}".

This bypasses the client's logic to raise a ValueError for a missing key, and will instead cause a failure later when genai.configure() is called with an invalid key. The PR description also mentions this key is optional, which makes this behavior particularly problematic.

The GoogleEmbedderClient already handles reading the API key from the environment. It's better to rely on the client's implementation.

I recommend removing this initialize_kwargs section to allow the client to manage its own credential loading.
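If the initialize_kwargs section were kept instead, one possible mitigation (an alternative to the removal recommended above, not something proposed in this PR) would be to drop any value that still contains an unresolved placeholder before constructing the client. `drop_unresolved` is a hypothetical helper written for this sketch.

```python
import re

# Matches a ${VAR} placeholder that survived environment substitution.
_UNRESOLVED = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*\}")


def drop_unresolved(kwargs):
    """Return a copy of kwargs without string values that still contain ${VAR}."""
    return {
        key: value
        for key, value in kwargs.items()
        if not (isinstance(value, str) and _UNRESOLVED.search(value))
    }


# GOOGLE_API_KEY was not set, so the placeholder survived substitution.
cleaned = drop_unresolved({"api_key": "${GOOGLE_API_KEY}"})
print(cleaned)
```

With the key omitted, the client is constructed without an `api_key` argument, which leaves its own environment lookup and missing-key validation in charge.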

    "batch_size": 100,
    "model_kwargs": {
      "model": "text-embedding-004",