1 change: 1 addition & 0 deletions .gitignore
@@ -67,3 +67,4 @@ next-env.d.ts

# ignore adding self-signed certs
certs/
docker-compose.override.yml
8 changes: 7 additions & 1 deletion README.md
@@ -342,6 +342,7 @@ docker-compose up
| `openai` | OpenAI embeddings (default) | `OPENAI_API_KEY` | Uses `text-embedding-3-small` model |
| `google` | Google AI embeddings | `GOOGLE_API_KEY` | Uses `text-embedding-004` model |
| `ollama` | Local Ollama embeddings | None | Requires local Ollama installation |
| `voyage` | Voyage AI embeddings | `VOYAGE_AI_API_KEY` | Uses `voyage-code-3` model; optimized for code retrieval |

### Why Use Google AI Embeddings?

@@ -363,6 +364,9 @@ export DEEPWIKI_EMBEDDER_TYPE=google

# Use local Ollama embeddings
export DEEPWIKI_EMBEDDER_TYPE=ollama

# Use Voyage AI embeddings (optimized for code retrieval)
export DEEPWIKI_EMBEDDER_TYPE=voyage
```

**Note**: When switching embedders, you may need to regenerate your repository embeddings as different models produce different vector spaces.
@@ -421,7 +425,8 @@ docker-compose up
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint | No | Required only if you want to use Azure OpenAI models |
| `AZURE_OPENAI_VERSION` | Azure OpenAI version | No | Required only if you want to use Azure OpenAI models |
| `OLLAMA_HOST` | Ollama Host (default: http://localhost:11434) | No | Required only if you want to use external Ollama server |
| `DEEPWIKI_EMBEDDER_TYPE` | Embedder type: `openai`, `google`, `ollama`, or `bedrock` (default: `openai`) | No | Controls which embedding provider to use |
| `VOYAGE_AI_API_KEY` | Voyage AI API key | No | Required only if `DEEPWIKI_EMBEDDER_TYPE=voyage` |
| `DEEPWIKI_EMBEDDER_TYPE` | Embedder type: `openai`, `google`, `ollama`, `bedrock`, or `voyage` (default: `openai`) | No | Controls which embedding provider to use |
| `PORT` | Port for the API server (default: 8001) | No | If you host the API and frontend on the same machine, make sure to change the port in `SERVER_BASE_URL` accordingly |
| `SERVER_BASE_URL` | Base URL for the API server (default: http://localhost:8001) | No | |
| `DEEPWIKI_AUTH_MODE` | Set to `true` or `1` to enable authorization mode. | No | Defaults to `false`. If enabled, `DEEPWIKI_AUTH_CODE` is required. |
@@ -432,6 +437,7 @@ docker-compose up
- If using `DEEPWIKI_EMBEDDER_TYPE=google`: `GOOGLE_API_KEY` is required
- If using `DEEPWIKI_EMBEDDER_TYPE=ollama`: No API key required (local processing)
- If using `DEEPWIKI_EMBEDDER_TYPE=bedrock`: AWS credentials (or role-based credentials) are required
- If using `DEEPWIKI_EMBEDDER_TYPE=voyage`: `VOYAGE_AI_API_KEY` is required

Other API keys are only required when configuring and using models from the corresponding providers.

33 changes: 29 additions & 4 deletions api/config.py
@@ -13,12 +13,14 @@
from api.google_embedder_client import GoogleEmbedderClient
from api.azureai_client import AzureAIClient
from api.dashscope_client import DashscopeClient
from api.voyage_client import VoyageEmbedderClient
from adalflow import GoogleGenAIClient, OllamaClient

# Get API keys from environment variables
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
GOOGLE_API_KEY = os.environ.get('GOOGLE_API_KEY')
OPENROUTER_API_KEY = os.environ.get('OPENROUTER_API_KEY')
VOYAGE_AI_API_KEY = os.environ.get('VOYAGE_AI_API_KEY')
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
AWS_SESSION_TOKEN = os.environ.get('AWS_SESSION_TOKEN')
@@ -32,6 +34,9 @@
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
if OPENROUTER_API_KEY:
os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY
if VOYAGE_AI_API_KEY:
os.environ["VOYAGE_AI_API_KEY"] = VOYAGE_AI_API_KEY
os.environ["VOYAGE_API_KEY"] = VOYAGE_AI_API_KEY
if AWS_ACCESS_KEY_ID:
os.environ["AWS_ACCESS_KEY_ID"] = AWS_ACCESS_KEY_ID
if AWS_SECRET_ACCESS_KEY:
@@ -63,7 +68,8 @@
"OllamaClient": OllamaClient,
"BedrockClient": BedrockClient,
"AzureAIClient": AzureAIClient,
"DashscopeClient": DashscopeClient
"DashscopeClient": DashscopeClient,
"VoyageEmbedderClient": VoyageEmbedderClient
}

def replace_env_placeholders(config: Union[Dict[str, Any], List[Any], str, Any]) -> Union[Dict[str, Any], List[Any], str, Any]:
@@ -152,7 +158,7 @@ def load_embedder_config():
embedder_config = load_json_config("embedder.json")

# Process client classes
for key in ["embedder", "embedder_ollama", "embedder_google", "embedder_bedrock"]:
for key in ["embedder", "embedder_ollama", "embedder_google", "embedder_bedrock", "embedder_voyage"]:
if key in embedder_config and "client_class" in embedder_config[key]:
class_name = embedder_config[key]["client_class"]
if class_name in CLIENT_CLASSES:
@@ -174,6 +180,8 @@ def get_embedder_config():
return configs.get("embedder_google", {})
elif embedder_type == 'ollama' and 'embedder_ollama' in configs:
return configs.get("embedder_ollama", {})
elif embedder_type == 'voyage' and 'embedder_voyage' in configs:
return configs.get("embedder_voyage", {})
Comment on lines +183 to +184 — Contributor review (severity: medium)

While this addition is correct and consistent with the existing structure, the if/elif chain in get_embedder_config is becoming long and less maintainable. Each new provider requires adding another block.

Consider refactoring the entire function to be more scalable, for example:

    def get_embedder_config():
        """
        Get the current embedder configuration based on DEEPWIKI_EMBEDDER_TYPE.

        Returns:
            dict: The embedder configuration with model_client resolved
        """
        embedder_type = EMBEDDER_TYPE
        config_key = f"embedder_{embedder_type}"

        if embedder_type != 'openai' and config_key in configs:
            return configs.get(config_key, {})

        # Default to the 'openai' config, which uses the key 'embedder'
        return configs.get("embedder", {})

This approach avoids modifying the function for every new provider, as long as the configuration key in embedder.json follows the embedder_<type> pattern.
else:
return configs.get("embedder", {})

@@ -235,19 +243,36 @@ def is_bedrock_embedder():
client_class = embedder_config.get("client_class", "")
return client_class == "BedrockClient"

def is_voyage_embedder():
"""
Check if the current embedder configuration uses VoyageEmbedderClient.

Returns:
bool: True if using Voyage AI embeddings, False otherwise
"""
embedder_config = get_embedder_config()
if not embedder_config:
return False
model_client = embedder_config.get("model_client")
if model_client:
return model_client.__name__ == "VoyageEmbedderClient"
return embedder_config.get("client_class", "") == "VoyageEmbedderClient"

def get_embedder_type():
"""
Get the current embedder type based on configuration.

Returns:
str: 'bedrock', 'ollama', 'google', or 'openai' (default)
str: 'bedrock', 'ollama', 'google', 'voyage', or 'openai' (default)
"""
if is_bedrock_embedder():
return 'bedrock'
elif is_ollama_embedder():
return 'ollama'
elif is_google_embedder():
return 'google'
elif is_voyage_embedder():
return 'voyage'
else:
return 'openai'

@@ -341,7 +366,7 @@ def load_lang_config():

# Update embedder configuration
if embedder_config:
for key in ["embedder", "embedder_ollama", "embedder_google", "embedder_bedrock", "retriever", "text_splitter"]:
for key in ["embedder", "embedder_ollama", "embedder_google", "embedder_bedrock", "embedder_voyage", "retriever", "text_splitter"]:
if key in embedder_config:
configs[key] = embedder_config[key]

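The new is_voyage_embedder helper must cope with two shapes of config: after load_embedder_config() resolves "client_class" strings, the config holds the class object under model_client; before resolution it only holds the raw string. A minimal standalone sketch of that check — VoyageEmbedderClient below is a stand-in for api.voyage_client.VoyageEmbedderClient, and is_voyage is a hypothetical name:

```python
class VoyageEmbedderClient:
    """Stand-in for api.voyage_client.VoyageEmbedderClient."""
    pass

def is_voyage(embedder_config):
    # Empty or missing config can never be a Voyage embedder.
    if not embedder_config:
        return False
    # Resolved form: model_client is the actual client class.
    model_client = embedder_config.get("model_client")
    if model_client:
        return model_client.__name__ == "VoyageEmbedderClient"
    # Unresolved form: client_class is still a string from embedder.json.
    return embedder_config.get("client_class", "") == "VoyageEmbedderClient"
```

Checking both forms keeps the helper usable whether it is called before or after the config-loading pass has swapped strings for classes.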
8 changes: 8 additions & 0 deletions api/config/embedder.json
@@ -30,6 +30,14 @@
"dimensions": 256
}
},
"embedder_voyage": {
"client_class": "VoyageEmbedderClient",
"batch_size": 100,
"model_kwargs": {
"model": "voyage-code-3",
"input_type": "document"
}
},
"retriever": {
"top_k": 20
},
5 changes: 5 additions & 0 deletions api/config/generator.json
@@ -28,6 +28,11 @@
"top_p": 0.8,
"top_k": 20
},
"gemini-2.0-flash-exp": {
"temperature": 1.0,
"top_p": 0.8,
"top_k": 20
},
"gemini-2.5-flash-lite": {
"temperature": 1.0,
"top_p": 0.8,
5 changes: 4 additions & 1 deletion api/data_pipeline.py
@@ -30,7 +30,7 @@ def count_tokens(text: str, embedder_type: str = None, is_ollama_embedder: bool

Args:
text (str): The text to count tokens for.
embedder_type (str, optional): The embedder type ('openai', 'google', 'ollama', 'bedrock').
embedder_type (str, optional): The embedder type ('openai', 'google', 'ollama', 'bedrock', 'voyage').
If None, will be determined from configuration.
is_ollama_embedder (bool, optional): DEPRECATED. Use embedder_type instead.
If None, will be determined from configuration.
@@ -58,6 +58,9 @@ def count_tokens(text: str, embedder_type: str = None, is_ollama_embedder: bool
elif embedder_type == 'bedrock':
# Bedrock embedding models vary; use a common GPT-like encoding for rough estimation
encoding = tiktoken.get_encoding("cl100k_base")
elif embedder_type == 'voyage':
# Voyage AI uses similar tokenization to GPT models for rough estimation
encoding = tiktoken.get_encoding("cl100k_base")
else: # OpenAI or default
# Use OpenAI embedding model encoding
encoding = tiktoken.encoding_for_model("text-embedding-3-small")
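The branch added for Voyage mirrors the Bedrock one: neither provider has a tiktoken-native encoding, so `cl100k_base` serves only as a rough GPT-like estimate (Voyage ships its own tokenizer). A condensed sketch of the dispatch, returning which tiktoken lookup count_tokens would make rather than the encoder object so it runs without tiktoken installed — `encoding_lookup_for` is a hypothetical name:

```python
def encoding_lookup_for(embedder_type=None):
    """Return the (tiktoken function, argument) pair count_tokens would use."""
    if embedder_type in ("bedrock", "voyage"):
        # No exact local tokenizer for these providers; cl100k_base gives a
        # rough GPT-like token estimate.
        return ("get_encoding", "cl100k_base")
    # OpenAI / default path uses the model-specific lookup.
    return ("encoding_for_model", "text-embedding-3-small")
```

Since the estimate only gates chunking, a rough count is acceptable here; exact counts would require each provider's own tokenizer.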
2 changes: 1 addition & 1 deletion api/main.py
@@ -44,7 +44,7 @@ def patched_watch(*args, **kwargs):
import uvicorn

# Check for required environment variables
required_env_vars = ['GOOGLE_API_KEY', 'OPENAI_API_KEY']
required_env_vars = ['GOOGLE_API_KEY']
missing_vars = [var for var in required_env_vars if not os.environ.get(var)]
if missing_vars:
logger.warning(f"Missing environment variables: {', '.join(missing_vars)}")
49 changes: 48 additions & 1 deletion api/poetry.lock


1 change: 1 addition & 0 deletions api/pyproject.toml
@@ -28,6 +28,7 @@ boto3 = ">=1.34.0"
websockets = ">=11.0.3"
azure-identity = ">=1.12.0"
azure-core = ">=1.24.0"
voyageai = ">=0.2.0"


[build-system]
34 changes: 19 additions & 15 deletions api/rag.py
@@ -190,20 +190,24 @@ def __init__(self, provider="google", model=None, use_s3: bool = False): # noqa
self.memory = Memory()
self.embedder = get_embedder(embedder_type=self.embedder_type)

self_weakref = weakref.ref(self)
# Patch: ensure query embedding is always single string for Ollama
def single_string_embedder(query):
# Accepts either a string or a list, always returns embedding for a single string
if isinstance(query, list):
if len(query) != 1:
raise ValueError("Ollama embedder only supports a single string")
query = query[0]
instance = self_weakref()
assert instance is not None, "RAG instance is no longer available, but the query embedder was called."
return instance.embedder(input=query)

# Use single string embedder for Ollama, regular embedder for others
self.query_embedder = single_string_embedder if self.is_ollama_embedder else self.embedder
# Voyage requires input_type="query" for retrieval — initialize a dedicated query
# embedder once at startup rather than allocating a new client per call.
if self.embedder_type == 'voyage':
self.query_embedder = get_embedder(embedder_type='voyage', input_type='query')
elif self.is_ollama_embedder:
# Ollama needs single-string coercion (doesn't accept list inputs)
self_weakref = weakref.ref(self)
def single_string_embedder(query):
if isinstance(query, list):
if len(query) != 1:
raise ValueError("Ollama embedder only supports a single string")
query = query[0]
instance = self_weakref()
assert instance is not None, "RAG instance is no longer available"
return instance.embedder(input=query)
self.query_embedder = single_string_embedder
else:
self.query_embedder = self.embedder

self.initialize_db_manager()

@@ -381,7 +385,7 @@ def prepare_retriever(self, repo_url_or_path: str, type: str = "github", access_

try:
# Use the appropriate embedder for retrieval
retrieve_embedder = self.query_embedder if self.is_ollama_embedder else self.embedder
retrieve_embedder = self.query_embedder if self.embedder_type in ('ollama', 'voyage') else self.embedder
self.retriever = FAISSRetriever(
**configs["retriever"],
embedder=retrieve_embedder,
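The rewritten __init__ picks one of three query-embedder strategies: a dedicated Voyage embedder built with input_type="query", a single-string wrapper for Ollama, or the document embedder unchanged. A condensed sketch of that selection — `pick_query_embedder` is a hypothetical name, and the embedder is called with a plain positional argument here rather than the real `embedder(input=...)` signature:

```python
def pick_query_embedder(embedder_type, embedder, build_voyage_query_embedder):
    if embedder_type == "voyage":
        # Voyage distinguishes indexing from retrieval; build a separate
        # embedder configured with input_type="query".
        return build_voyage_query_embedder()
    if embedder_type == "ollama":
        # Ollama only accepts a single string, so coerce one-element lists.
        def single_string_embedder(query):
            if isinstance(query, list):
                if len(query) != 1:
                    raise ValueError("Ollama embedder only supports a single string")
                query = query[0]
            return embedder(query)
        return single_string_embedder
    # All other providers embed queries and documents identically.
    return embedder
```

This keeps the per-provider quirks in one place, so prepare_retriever only needs to decide whether to hand FAISS the query embedder or the document embedder.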
15 changes: 13 additions & 2 deletions api/tools/embedder.py
@@ -3,13 +3,14 @@
from api.config import configs, get_embedder_type


def get_embedder(is_local_ollama: bool = False, use_google_embedder: bool = False, embedder_type: str = None) -> adal.Embedder:
def get_embedder(is_local_ollama: bool = False, use_google_embedder: bool = False, embedder_type: str = None, input_type: str = None) -> adal.Embedder:
"""Get embedder based on configuration or parameters.

Args:
is_local_ollama: Legacy parameter for Ollama embedder
use_google_embedder: Legacy parameter for Google embedder
embedder_type: Direct specification of embedder type ('ollama', 'google', 'bedrock', 'openai')
embedder_type: Direct specification of embedder type ('ollama', 'google', 'bedrock', 'openai', 'voyage')
input_type: Optional input_type for Voyage/other embedders ('document' or 'query')

Returns:
adal.Embedder: Configured embedder instance
@@ -22,6 +23,8 @@ def get_embedder(is_local_ollama: bool = False, use_google_embedder: bool = Fals
embedder_config = configs["embedder_google"]
elif embedder_type == 'bedrock':
embedder_config = configs["embedder_bedrock"]
elif embedder_type == 'voyage':
embedder_config = configs["embedder_voyage"]
else: # default to openai
embedder_config = configs["embedder"]
elif is_local_ollama:
@@ -37,6 +40,8 @@ def get_embedder(is_local_ollama: bool = False, use_google_embedder: bool = Fals
embedder_config = configs["embedder_ollama"]
elif current_type == 'google':
embedder_config = configs["embedder_google"]
elif current_type == 'voyage':
embedder_config = configs["embedder_voyage"]
else:
embedder_config = configs["embedder"]

@@ -50,6 +55,12 @@ def get_embedder(is_local_ollama: bool = False, use_google_embedder: bool = Fals
# Create embedder with basic parameters
embedder_kwargs = {"model_client": model_client, "model_kwargs": embedder_config["model_kwargs"]}

# Override input_type if provided (critical for Voyage AI retrieval vs indexing)
if input_type and "model_kwargs" in embedder_kwargs:
# Create a copy to avoid modifying the global config
embedder_kwargs["model_kwargs"] = embedder_kwargs["model_kwargs"].copy()
embedder_kwargs["model_kwargs"]["input_type"] = input_type

embedder = adal.Embedder(**embedder_kwargs)

# Set batch_size as an attribute if available (not a constructor parameter)
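The input_type override copies model_kwargs before mutating it — without the copy, building a query embedder would flip the shared global config's input_type from "document" to "query" and silently break subsequent indexing runs. A minimal sketch of just that pattern, with `build_kwargs` as a hypothetical name:

```python
def build_kwargs(embedder_config, input_type=None):
    kwargs = {"model_kwargs": embedder_config["model_kwargs"]}
    if input_type:
        # Copy before mutating so the shared config dict is left untouched.
        kwargs["model_kwargs"] = kwargs["model_kwargs"].copy()
        kwargs["model_kwargs"]["input_type"] = input_type
    return kwargs

base_config = {"model_kwargs": {"model": "voyage-code-3", "input_type": "document"}}
query_kwargs = build_kwargs(base_config, input_type="query")
# base_config still says "document"; only query_kwargs says "query".
```

Note the copy is shallow, which is sufficient here because only a top-level key is replaced; nested mutation would need copy.deepcopy.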