feat: add Voyage AI as embedding provider #474

Open

mcinteerj wants to merge 2 commits into AsyncFuncAI:main from mcinteerj:feat/voyage-ai-embedder

Conversation

@mcinteerj

Summary

Adds Voyage AI as an embedding provider via a new VoyageEmbedderClient, alongside the existing OpenAI, Google, Ollama, and Bedrock providers. Voyage AI's code-optimised models (voyage-code-3) provide strong retrieval quality for code repositories.

Usage

DEEPWIKI_EMBEDDER_TYPE=voyage
VOYAGE_AI_API_KEY=your_key_here

Changes

  • api/voyage_client.py (new): VoyageEmbedderClient wrapping the voyageai SDK. Handles serialisation carefully — voyageai.Client stores tenacity retry objects internally that cannot be pickled; solved via lazy init, __getstate__/__setstate__, and a to_dict() override covering adalflow's own DB serialisation path. Retry via backoff.on_exception with a giveup guard for programmer errors. acall implemented via voyageai.AsyncClient.
  • api/config.py: register VoyageEmbedderClient, read VOYAGE_AI_API_KEY (aliased to VOYAGE_API_KEY for the SDK), route embedder_voyage config key, add is_voyage_embedder() consistent with other provider detection patterns.
  • api/config/embedder.json: add embedder_voyage block (voyage-code-3, batch_size: 100, input_type: document).
  • api/tools/embedder.py: add input_type param to get_embedder() for query vs document distinction.
  • api/rag.py: Voyage query_embedder initialised once at startup with input_type=query (critical — Voyage models distinguish query vs document embeddings for retrieval quality). FAISS retriever uses query_embedder for both Ollama and Voyage.
  • api/main.py: remove OPENAI_API_KEY from required env vars (was incorrectly flagged as required for all providers).
  • api/config/generator.json: add gemini-2.0-flash-exp model params.
  • docker-compose.yml: use ${HOME} for .adalflow volume path (more portable).
  • README.md: document voyage embedder type, VOYAGE_AI_API_KEY, and switching instructions.
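The serialisation handling in api/voyage_client.py is the subtlest part of the change list above. A minimal sketch of the pattern (class and method names are from the PR; the bodies are illustrative assumptions, with the real voyageai.Client stubbed out):

```python
import pickle


class VoyageEmbedderClient:
    """Sketch: defer SDK construction so instances stay picklable."""

    def __init__(self, api_key=None, env_api_key_name="VOYAGE_API_KEY"):
        self._api_key = api_key
        self._env_api_key_name = env_api_key_name
        self._client = None  # the real voyageai.Client is built lazily

    @property
    def client(self):
        if self._client is None:
            # Real code would do: self._client = voyageai.Client(api_key=...)
            self._client = object()
        return self._client

    def __getstate__(self):
        # Drop the SDK client: it holds unpicklable tenacity retry objects
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)  # client is rebuilt lazily on next access

    def to_dict(self):
        # Covers adalflow's own DB serialisation path, which bypasses pickle
        return {k: v for k, v in self.__dict__.items() if k != "_client"}


# Round-trips even after the lazy client has been created
original = VoyageEmbedderClient(api_key="demo")
_ = original.client
clone = pickle.loads(pickle.dumps(original))
```

The key point is that serialisation never sees the SDK handle; any consumer that unpickles the client gets a fresh lazy handle on first access.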

Notes

  • voyageai added to pyproject.toml / poetry.lock
  • Default model is voyage-code-3 (latest Voyage code model, 1024 dimensions)

@gemini-code-assist

Summary of Changes

Hello @mcinteerj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's embedding capabilities by integrating Voyage AI as a new provider. This addition offers specialized models, particularly for code-related tasks, and includes robust client-side handling for proper serialization and distinct query embedding. The changes also streamline environment variable requirements and update documentation to reflect the new options, providing users with more choice and flexibility in their embedding strategies.

Highlights

  • New Embedding Provider: Voyage AI: Integrated Voyage AI as a new embedding provider, enabling the use of models like voyage-code-3 which are optimized for code retrieval. This adds another powerful option alongside existing OpenAI, Google, Ollama, and Bedrock providers.
  • Robust Client Implementation for Voyage AI: Implemented VoyageEmbedderClient with careful handling of serialization for voyageai SDK objects, including lazy initialization, __getstate__/__setstate__ methods, and a to_dict() override to manage unpicklable internal retry objects.
  • Voyage AI Specific Query Embedding: Introduced a dedicated query_embedder for Voyage AI, initialized with input_type=query, which is critical for maintaining retrieval quality as Voyage models distinguish between query and document embeddings.
  • Configuration and Documentation Updates: Updated configuration files (api/config.py, api/config/embedder.json) to register the new provider and its default model (voyage-code-3), and extensively documented its usage and API key requirements in README.md.
  • Refined Environment Variable Handling: Removed OPENAI_API_KEY from the list of universally required environment variables, making the application more flexible for users not relying on OpenAI.

Changelog
  • .gitignore
    • Added docker-compose.override.yml to the ignore list.
  • README.md
    • Added documentation for the voyage embedder type, including its API key (VOYAGE_AI_API_KEY) and model (voyage-code-3).
    • Included instructions for switching to Voyage AI embeddings.
    • Updated the list of required environment variables to include VOYAGE_AI_API_KEY when DEEPWIKI_EMBEDDER_TYPE=voyage.
  • api/config.py
    • Imported VoyageEmbedderClient.
    • Added VOYAGE_AI_API_KEY to environment variable retrieval and set VOYAGE_API_KEY for the SDK.
    • Registered VoyageEmbedderClient in CLIENT_CLASSES.
    • Updated load_embedder_config to process embedder_voyage configuration.
    • Modified get_embedder_config to return embedder_voyage settings when embedder_type is 'voyage'.
    • Added is_voyage_embedder function to check if Voyage AI embeddings are in use.
    • Updated get_embedder_type to include 'voyage' as a possible return value.
    • Included embedder_voyage in the list of keys to update from embedder_config.
  • api/config/embedder.json
    • Added a new configuration block for embedder_voyage, specifying VoyageEmbedderClient, a batch_size of 100, and model_kwargs for voyage-code-3 with input_type: document.
  • api/config/generator.json
    • Added configuration parameters for the gemini-2.0-flash-exp model.
  • api/main.py
    • Removed OPENAI_API_KEY from the list of universally required environment variables.
  • api/poetry.lock
    • Added aiolimiter package with version 1.2.1.
    • Added tenacity package with version 9.1.4.
    • Added voyageai package with version 0.2.4 and its dependencies.
  • api/pyproject.toml
    • Added voyageai = ">=0.2.0" as a new dependency.
  • api/rag.py
    • Modified the __init__ method to conditionally initialize a dedicated query_embedder for Voyage AI with input_type='query'.
    • Adjusted the prepare_retriever method to use the query_embedder for both Ollama and Voyage AI embedder types.
  • api/tools/embedder.py
    • Added an input_type parameter to the get_embedder function signature and docstring.
    • Updated the logic within get_embedder to retrieve embedder_voyage configuration when specified.
    • Implemented logic to override the input_type in model_kwargs if provided, specifically for Voyage AI retrieval vs. indexing.
  • api/voyage_client.py
    • Added a new file defining the VoyageEmbedderClient class, which wraps the voyageai SDK.
    • Implemented __init__, _initialize_client, client property, __getstate__, __setstate__, and to_dict for proper client management and serialization.
    • Included parse_embedding_response to convert Voyage AI responses to EmbedderOutput format.
    • Defined convert_inputs_to_api_kwargs to prepare inputs for the Voyage AI API, including default model and input_type handling.
    • Added call method with backoff.on_exception for synchronous API calls.
    • Implemented voyage_async_client property and acall method for asynchronous API calls.
  • docker-compose.yml
    • Updated the volume path for .adalflow to use ${HOME}/.adalflow for improved portability.

- tests/unit/test_all_embedders.py: add embedder_voyage config check,
  VoyageEmbedderClient CLIENT_CLASSES check, is_voyage_embedder()
  detection test and mutual exclusivity assertion, matching the
  pattern from the Bedrock embedder PR (AsyncFuncAI#419)
- api/data_pipeline.py: add voyage branch to count_tokens() and
  update docstring to include voyage type
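The detection test and mutual-exclusivity assertion mentioned above might look like this (hypothetical sketch: the real test imports is_voyage_embedder() and friends from api.config, stubbed here inline):

```python
# Stand-ins for the provider detectors in api/config.py; the real functions
# read DEEPWIKI_EMBEDDER_TYPE from the process environment.

def get_embedder_type(env: dict) -> str:
    return env.get("DEEPWIKI_EMBEDDER_TYPE", "openai")

def is_voyage_embedder(env: dict) -> bool:
    return get_embedder_type(env) == "voyage"

def is_ollama_embedder(env: dict) -> bool:
    return get_embedder_type(env) == "ollama"


env = {"DEEPWIKI_EMBEDDER_TYPE": "voyage"}
voyage_active = is_voyage_embedder(env)
# Mutual exclusivity: at most one provider detector is true at a time
exclusive = not (is_voyage_embedder(env) and is_ollama_embedder(env))
```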
@gemini-code-assist left a comment

Code Review

This pull request adds Voyage AI as a new embedding provider, which is a great addition, especially with its code-optimized models. The implementation is thorough, with careful handling of serialization for the Voyage client and integration into the existing configuration and RAG pipeline. The changes are consistent with the existing provider patterns.

I've added a couple of suggestions to improve maintainability by refactoring some repetitive logic. These are minor points in an otherwise excellent contribution.

Comment on lines +183 to +184 of api/config.py:

    elif embedder_type == 'voyage' and 'embedder_voyage' in configs:
        return configs.get("embedder_voyage", {})

Severity: medium

While this addition is correct and consistent with the existing structure, the if/elif chain in get_embedder_config is becoming long and less maintainable. Each new provider requires adding another block.

Consider refactoring the entire function to be more scalable, for example:

def get_embedder_config():
    """
    Get the current embedder configuration based on DEEPWIKI_EMBEDDER_TYPE.

    Returns:
        dict: The embedder configuration with model_client resolved
    """
    embedder_type = EMBEDDER_TYPE
    config_key = f"embedder_{embedder_type}"

    if embedder_type != 'openai' and config_key in configs:
        return configs.get(config_key, {})
    
    # Default to 'openai' config which has the key 'embedder'
    return configs.get("embedder", {})

This approach avoids modifying the function for every new provider, as long as the configuration key in embedder.json follows the embedder_<type> pattern.

Comment on lines +66 to +70 of api/voyage_client.py:

    api_key = self._api_key or os.getenv(self._env_api_key_name)
    if not api_key:
        raise ValueError(
            f"Environment variable {self._env_api_key_name} must be set"
        )

Severity: medium

The logic for retrieving the API key is duplicated here and in the voyage_async_client property (lines 211-213). To avoid repetition and improve maintainability, you could extract this into a private helper method, like _get_api_key(), and call it from both places.
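The suggested extraction could look like this (sketch only: the helper name _get_api_key() comes from the review comment; the surrounding class fragment is a hypothetical reconstruction):

```python
import os


class VoyageEmbedderClient:
    """Fragment showing the suggested _get_api_key() extraction."""

    def __init__(self, api_key=None, env_api_key_name="VOYAGE_API_KEY"):
        self._api_key = api_key
        self._env_api_key_name = env_api_key_name

    def _get_api_key(self) -> str:
        # Single source of truth for key resolution, shared by the sync
        # `client` and async `voyage_async_client` property paths
        api_key = self._api_key or os.getenv(self._env_api_key_name)
        if not api_key:
            raise ValueError(
                f"Environment variable {self._env_api_key_name} must be set"
            )
        return api_key
```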
