
Commit 19c5058

ochatclaude and claude committed
feat: Complete RAG API optimization suite with intelligent backup embeddings
This comprehensive update adds intelligent backup embedding providers, performance optimizations, and comprehensive error handling:

## 🚀 New Features

### Intelligent Backup Embedding System
- **Ultra-fast failover**: Socket check detects dead ports in 0.5 seconds
- **Immediate failover**: Primary failure triggers an instant backup attempt (no retries)
- **Smart cooldown**: 1-minute cooldown after primary provider failure
- **Seamless switching**: LibreChat receives a 200 status when the backup succeeds
- **Fast recovery**: Optimized retry logic prevents cascading failures
- **Clear logging**: Prominent failure messages and accurate provider tracking

### Custom NVIDIA Embeddings Provider
- **Full NVIDIA API compatibility** for LLaMA embedding models
- **Fast port detection**: Socket check fails immediately if nothing is listening
- **Optimized timeouts**: 0.5s socket check, 2s connection, 3s read timeout
- **Configurable parameters**: batch size, retries, timeout, input types
- **Fast failover mode**: Reduced retries when a backup provider is configured
- **Proper error handling** for NVIDIA-specific API responses

### Enhanced AWS Bedrock Support
- **Titan V2 embeddings** with configurable dimensions (256/512/1024)
- **Reactive rate limiting**: only activates when AWS throttles requests
- **Graceful error handling** with user-friendly configuration messages
- **Backward compatibility** with Titan V1 models

### Database & Performance Optimizations
- **Graceful PostgreSQL error handling**: 503 responses for connection issues
- **Optimized chunking strategy**: adaptive batch sizes based on chunk size
- **Request throttling middleware**: prevents LibreChat overload (configurable)
- **Improved UTF-8 file processing** with proper cleanup and null checks
- **Enhanced connection pooling** with optimized timeout settings

## 📋 Configuration

### Backup Provider Setup
```env
# Primary Provider
EMBEDDINGS_PROVIDER=nvidia
EMBEDDINGS_MODEL=nvidia/llama-3.2-nemoretriever-300m-embed-v1
NVIDIA_TIMEOUT=3  # Fast failover - 3 second read timeout

# Backup Provider
EMBEDDINGS_PROVIDER_BACKUP=bedrock
EMBEDDINGS_MODEL_BACKUP=amazon.titan-embed-text-v2:0
PRIMARY_FAILOVER_COOLDOWN_MINUTES=1

# Performance Tuning
EMBED_CONCURRENCY_LIMIT=3
```

### Bedrock Titan V2 Configuration
```env
BEDROCK_EMBEDDING_DIMENSIONS=512  # 256, 512, or 1024
BEDROCK_EMBEDDING_NORMALIZE=true
BEDROCK_MAX_BATCH_SIZE=15
```

## 🧪 Testing
- **31 comprehensive tests** covering V1/V2 compatibility
- **Error simulation and recovery** testing
- **Integration tests** for backup failover scenarios

## 🛠️ Technical Improvements
- **Ultra-fast port detection**: socket check with 0.5s timeout before connection
- **Immediate failover logic**: no retry delays when a backup is available
- **Triple-layer timeout strategy**: socket (0.5s), connection (2s), read (3s)
- **Conditional AWS credential loading**: only when Bedrock is configured
- **Thread-safe state management** with proper locking
- **Pydantic v2 compatibility** with proper field declarations
- **Comprehensive error categorization** and user-friendly messages

## 📚 Documentation
- **Complete environment variable documentation** in README.md
- **High availability configuration examples** with NVIDIA + Bedrock setup
- **Detailed provider configuration guides** for all supported embedding services

This update ensures robust, production-ready embedding operations with lightning-fast failover (0.5-3 seconds), optimal performance, and an excellent user experience.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
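The 0.5-second dead-port detection described above can be approximated with a plain TCP pre-check before any HTTP request is attempted. This is a hedged sketch, not the repository's actual implementation; the host/port values are placeholders:

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts a TCP connection on host:port
    within `timeout` seconds.

    A refused or timed-out connect means the provider is almost certainly
    down, so the caller can fail over immediately instead of waiting for a
    full HTTP connect/read timeout to expire.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check a local embedding server before sending a batch.
if not port_is_open("localhost", 8003):
    print("primary port closed - failing over to backup provider")
```

The point of the pre-check is latency, not certainty: an open port does not guarantee a healthy server, so the HTTP-level connection (2s) and read (3s) timeouts still apply afterwards.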
1 parent 2f08fc3 commit 19c5058

17 files changed: +2104 −42 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -1,10 +1,12 @@
 .idea
 .venv
 .env
+.env.beta
 __pycache__
 uploads/
 myenv/
 venv/
 *.pyc
 dev.yml
 SHOPIFY.md
+CLAUDE.md

README.md

Lines changed: 71 additions & 1 deletion

@@ -64,7 +64,7 @@ The following environment variables are required to run the application:
 - `DEBUG_RAG_API`: (Optional) Set to "True" to show more verbose logging output in the server console, and to enable postgresql database routes
 - `DEBUG_PGVECTOR_QUERIES`: (Optional) Set to "True" to enable detailed PostgreSQL query logging for pgvector operations. Useful for debugging performance issues with vector database queries.
 - `CONSOLE_JSON`: (Optional) Set to "True" to log as json for Cloud Logging aggregations
-- `EMBEDDINGS_PROVIDER`: (Optional) either "openai", "bedrock", "azure", "huggingface", "huggingfacetei", "vertexai", or "ollama", where "huggingface" uses sentence_transformers; defaults to "openai"
+- `EMBEDDINGS_PROVIDER`: (Optional) either "openai", "bedrock", "azure", "huggingface", "huggingfacetei", "vertexai", "ollama", or "nvidia", where "huggingface" uses sentence_transformers; defaults to "openai"
 - `EMBEDDINGS_MODEL`: (Optional) Set a valid embeddings model to use from the configured provider.
   - **Defaults**
     - openai: "text-embedding-3-small"
@@ -74,6 +74,37 @@ The following environment variables are required to run the application:
     - vertexai: "text-embedding-004"
     - ollama: "nomic-embed-text"
     - bedrock: "amazon.titan-embed-text-v1"
+    - nvidia: "nvidia/llama-3.2-nemoretriever-300m-embed-v1"
+- `EMBEDDINGS_PROVIDER_BACKUP`: (Optional) Backup provider for automatic failover ("openai", "bedrock", "azure", "huggingface", "huggingfacetei", "vertexai", "ollama", "nvidia")
+- `EMBEDDINGS_MODEL_BACKUP`: (Optional) Backup model to use when the primary provider fails
+- `PRIMARY_FAILOVER_COOLDOWN_MINUTES`: (Optional) Minutes to wait before retrying a failed primary provider (default: 1)
+- `EMBED_CONCURRENCY_LIMIT`: (Optional) Maximum concurrent embedding requests to prevent overload (default: 3)
+
+#### Backup Embedding Provider Configuration
+The RAG API supports intelligent backup embedding providers for high availability:
+- **Automatic failover**: When the primary provider fails, requests automatically switch to the backup
+- **Smart cooldown**: Failed primary providers are avoided for a configurable time period
+- **Transparent operation**: LibreChat receives success responses when the backup succeeds
+- **Automatic recovery**: The primary provider is retried when the cooldown expires
+
+#### NVIDIA Embedding Provider Configuration
+- `NVIDIA_BASE_URL`: (Optional) NVIDIA API endpoint URL (default: "http://localhost:8003/v1")
+- `NVIDIA_API_KEY`: (Optional) API key for the NVIDIA embedding service
+- `NVIDIA_MODEL`: (Optional) NVIDIA model to use (default: "nvidia/llama-3.2-nemoretriever-300m-embed-v1")
+- `NVIDIA_INPUT_TYPE`: (Optional) Input type for embeddings ("query", "passage"; default: "passage")
+- `NVIDIA_ENCODING_FORMAT`: (Optional) Encoding format ("float", "base64"; default: "float")
+- `NVIDIA_TRUNCATE`: (Optional) Truncate input if too long ("NONE", "START", "END"; default: "NONE")
+- `NVIDIA_MAX_RETRIES`: (Optional) Maximum retry attempts (default: 3)
+- `NVIDIA_TIMEOUT`: (Optional) Read timeout in seconds (default: 3; connection timeout: 2s)
+- `NVIDIA_MAX_BATCH_SIZE`: (Optional) Maximum texts per batch (default: 32)
+
+#### AWS Bedrock Enhanced Configuration
+- `BEDROCK_EMBEDDING_DIMENSIONS`: (Optional) For Titan V2 models - embedding dimensions (256, 512, or 1024; default: 1024)
+- `BEDROCK_EMBEDDING_NORMALIZE`: (Optional) For Titan V2 models - normalize embeddings ("true"/"false"; default: "true")
+- `BEDROCK_MAX_BATCH_SIZE`: (Optional) Maximum texts per Bedrock batch (default: 15)
+- `BEDROCK_INITIAL_RETRY_DELAY`: (Optional) Initial retry delay in seconds for rate limiting (default: 1.0)
+- `BEDROCK_MAX_RETRY_DELAY`: (Optional) Maximum retry delay in seconds (default: 60.0)
+- `BEDROCK_BACKOFF_FACTOR`: (Optional) Exponential backoff multiplier (default: 2.0)
 - `RAG_AZURE_OPENAI_API_VERSION`: (Optional) Default is `2023-05-15`. The version of the Azure OpenAI API.
 - `RAG_AZURE_OPENAI_API_KEY`: (Optional) The API key for Azure OpenAI service.
   - Note: `AZURE_OPENAI_API_KEY` will work but `RAG_AZURE_OPENAI_API_KEY` will override it in order to not conflict with LibreChat setting.
@@ -125,6 +156,45 @@ The `ATLAS_MONGO_DB_URI` could be the same or different from what is used by Lib
 
 Follow one of the [four documented methods](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-index/#procedure) to create the vector index.
 
+### High Availability Configuration Example
+
+For production environments requiring maximum uptime, you can configure redundant embedding providers with automatic failover. This example uses NVIDIA as the primary provider with AWS Bedrock as backup:
+
+```env
+# Primary Provider - NVIDIA Embeddings (Local/On-Premises)
+EMBEDDINGS_PROVIDER=nvidia
+EMBEDDINGS_MODEL=nvidia/llama-3.2-nemoretriever-300m-embed-v1
+NVIDIA_BASE_URL=http://your-nvidia-server:8003/v1
+NVIDIA_API_KEY=your-nvidia-api-key
+NVIDIA_MAX_BATCH_SIZE=32
+NVIDIA_TIMEOUT=3
+
+# Backup Provider - AWS Bedrock Titan V2 Embeddings
+EMBEDDINGS_PROVIDER_BACKUP=bedrock
+EMBEDDINGS_MODEL_BACKUP=amazon.titan-embed-text-v2:0
+AWS_ACCESS_KEY_ID=your-aws-access-key
+AWS_SECRET_ACCESS_KEY=your-aws-secret-key
+AWS_DEFAULT_REGION=us-west-2
+BEDROCK_EMBEDDING_DIMENSIONS=512
+BEDROCK_EMBEDDING_NORMALIZE=true
+
+# Failover Configuration
+PRIMARY_FAILOVER_COOLDOWN_MINUTES=2
+EMBED_CONCURRENCY_LIMIT=3
+
+# Performance Tuning
+CHUNK_SIZE=1500
+CHUNK_OVERLAP=100
+```
+
+**How this works:**
+- **Primary**: NVIDIA embeddings serve all requests when available
+- **Failover**: If NVIDIA fails, requests automatically switch to Bedrock
+- **Cooldown**: After a failure, NVIDIA is not retried for 2 minutes (prevents cascading failures)
+- **Recovery**: NVIDIA is automatically retried when the cooldown expires
+- **Transparency**: LibreChat receives successful responses when the backup succeeds
+
+This configuration ensures high availability with seamless failover while maintaining optimal performance and cost efficiency.
 
 ### Proxy Configuration
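The `BEDROCK_INITIAL_RETRY_DELAY`, `BEDROCK_MAX_RETRY_DELAY`, and `BEDROCK_BACKOFF_FACTOR` variables documented above describe a standard capped exponential backoff schedule. As a rough sketch of how those three knobs interact (this is illustrative, not the repository's actual retry code):

```python
def backoff_schedule(initial: float, factor: float, cap: float, attempts: int) -> list[float]:
    """Delay before retry n is initial * factor**n, clipped at cap."""
    return [min(initial * factor ** n, cap) for n in range(attempts)]

# With the README defaults (initial=1.0s, factor=2.0, cap=60.0s):
print(backoff_schedule(1.0, 2.0, 60.0, 8))
# -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The cap keeps worst-case waits bounded under sustained throttling, while the multiplicative growth backs off quickly when AWS starts rejecting requests.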

app/config.py

Lines changed: 176 additions & 6 deletions
@@ -27,6 +27,7 @@ class EmbeddingsProvider(Enum):
     OLLAMA = "ollama"
     BEDROCK = "bedrock"
     GOOGLE_VERTEXAI = "vertexai"
+    NVIDIA = "nvidia"
 
 
 def get_env_variable(
@@ -37,6 +38,9 @@ def get_env_variable(
         if default_value is None and required:
             raise ValueError(f"Environment variable '{var_name}' not found.")
         return default_value
+    # Strip comments and whitespace from environment variables
+    if isinstance(value, str) and '#' in value:
+        value = value.split('#')[0].strip()
     return value
 
 
@@ -236,7 +240,7 @@ def init_embeddings(provider, model):
 
         return VertexAIEmbeddings(model=model)
     elif provider == EmbeddingsProvider.BEDROCK:
-        from langchain_aws import BedrockEmbeddings
+        from app.services.embeddings.bedrock_rate_limited import RateLimitedBedrockEmbeddings
 
         session_kwargs = {
             "aws_access_key_id": AWS_ACCESS_KEY_ID,
@@ -248,10 +252,53 @@ def init_embeddings(provider, model):
             session_kwargs["aws_session_token"] = AWS_SESSION_TOKEN
 
         session = boto3.Session(**session_kwargs)
-        return BedrockEmbeddings(
-            client=session.client("bedrock-runtime"),
+
+        # Get reactive rate limiting configuration from environment
+        max_batch = int(get_env_variable("BEDROCK_MAX_BATCH_SIZE", "15"))
+        max_retries = int(get_env_variable("BEDROCK_MAX_RETRIES", "5"))
+        initial_delay = float(get_env_variable("BEDROCK_INITIAL_RETRY_DELAY", "0.1"))
+        max_delay = float(get_env_variable("BEDROCK_MAX_RETRY_DELAY", "30.0"))
+        backoff_factor = float(get_env_variable("BEDROCK_BACKOFF_FACTOR", "2.0"))
+        recovery_factor = float(get_env_variable("BEDROCK_RECOVERY_FACTOR", "0.9"))
+
+        # Get Titan V2 specific parameters
+        dimensions = get_env_variable("BEDROCK_EMBEDDING_DIMENSIONS", None)
+        if dimensions is not None:
+            dimensions = int(dimensions)
+        normalize = get_env_variable("BEDROCK_EMBEDDING_NORMALIZE", "true").lower() == "true"
+
+        # Create client with connection pooling for maximum performance
+        config = boto3.session.Config(
+            max_pool_connections=50,  # Increased for better concurrency
+            retries={'max_attempts': 0}  # We handle retries in our wrapper
+        )
+
+        return RateLimitedBedrockEmbeddings(
+            client=session.client("bedrock-runtime", config=config),
             model_id=model,
             region_name=AWS_DEFAULT_REGION,
+            max_batch_size=max_batch,
+            max_retries=max_retries,
+            initial_retry_delay=initial_delay,
+            max_retry_delay=max_delay,
+            backoff_factor=backoff_factor,
+            recovery_factor=recovery_factor,
+            dimensions=dimensions,
+            normalize=normalize,
+        )
+    elif provider == EmbeddingsProvider.NVIDIA:
+        from app.services.embeddings.nvidia_embeddings import NVIDIAEmbeddings
+
+        return NVIDIAEmbeddings(
+            base_url=RAG_OPENAI_BASEURL,
+            model=model,
+            api_key=RAG_OPENAI_API_KEY,
+            max_batch_size=int(get_env_variable("NVIDIA_MAX_BATCH_SIZE", "20")),
+            max_retries=int(get_env_variable("NVIDIA_MAX_RETRIES", "3")),
+            timeout=float(get_env_variable("NVIDIA_TIMEOUT", "3.0")),  # Fast failover - 3 seconds
+            input_type=get_env_variable("NVIDIA_INPUT_TYPE", "query"),
+            encoding_format=get_env_variable("NVIDIA_ENCODING_FORMAT", "float"),
+            truncate=get_env_variable("NVIDIA_TRUNCATE", "NONE"),
         )
     else:
         raise ValueError(f"Unsupported embeddings provider: {provider}")
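Both providers above accept a `max_batch_size`, which is typically enforced by slicing the input texts into chunks before each API call. A generic sketch of that slicing (illustrative only, not the internals of `NVIDIAEmbeddings` or `RateLimitedBedrockEmbeddings`):

```python
from typing import Iterator

def batched(texts: list[str], max_batch_size: int) -> Iterator[list[str]]:
    """Yield successive slices of at most `max_batch_size` texts."""
    for start in range(0, len(texts), max_batch_size):
        yield texts[start:start + max_batch_size]

docs = [f"doc-{i}" for i in range(45)]
print([len(chunk) for chunk in batched(docs, 20)])  # -> [20, 20, 5]
```

Keeping batches under the provider's limit avoids request-size rejections, and smaller batches also shrink the blast radius of a single failed or throttled call.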
@@ -285,13 +332,136 @@ def init_embeddings(provider, model):
     EMBEDDINGS_MODEL = get_env_variable(
         "EMBEDDINGS_MODEL", "amazon.titan-embed-text-v1"
     )
-    AWS_DEFAULT_REGION = get_env_variable("AWS_DEFAULT_REGION", "us-east-1")
+elif EMBEDDINGS_PROVIDER == EmbeddingsProvider.NVIDIA:
+    EMBEDDINGS_MODEL = get_env_variable(
+        "EMBEDDINGS_MODEL", "nvidia/llama-3.2-nemoretriever-300m-embed-v1"
+    )
 else:
     raise ValueError(f"Unsupported embeddings provider: {EMBEDDINGS_PROVIDER}")
 
-embeddings = init_embeddings(EMBEDDINGS_PROVIDER, EMBEDDINGS_MODEL)
+# Load AWS credentials ONLY if Bedrock is used as primary or backup
+backup_provider_str = get_env_variable("EMBEDDINGS_PROVIDER_BACKUP", None)
+bedrock_needed = (
+    EMBEDDINGS_PROVIDER == EmbeddingsProvider.BEDROCK or
+    (backup_provider_str and backup_provider_str.lower() == "bedrock")
+)
 
-logger.info(f"Initialized embeddings of type: {type(embeddings)}")
+if bedrock_needed:
+    AWS_DEFAULT_REGION = get_env_variable("AWS_DEFAULT_REGION", "us-east-1")
+    AWS_ACCESS_KEY_ID = get_env_variable("AWS_ACCESS_KEY_ID", None)
+    AWS_SECRET_ACCESS_KEY = get_env_variable("AWS_SECRET_ACCESS_KEY", None)
+    AWS_SESSION_TOKEN = get_env_variable("AWS_SESSION_TOKEN", None)
+    logger.debug("AWS credentials loaded for Bedrock provider")
+else:
+    # Set to None when not needed
+    AWS_DEFAULT_REGION = None
+    AWS_ACCESS_KEY_ID = None
+    AWS_SECRET_ACCESS_KEY = None
+    AWS_SESSION_TOKEN = None
+    logger.debug("AWS credentials not required - no Bedrock provider configured")
+
+# Initialize embeddings with backup support
+def init_embeddings_with_backup():
+    """Initialize embeddings with automatic backup failover."""
+    # Use already loaded backup provider string
+    backup_model = get_env_variable("EMBEDDINGS_MODEL_BACKUP", None)
+
+    if backup_provider_str and backup_model:
+        # Backup is configured, create backup embeddings with failover
+        backup_provider = EmbeddingsProvider(backup_provider_str.lower())
+
+        logger.info(f"Backup provider configured: {backup_provider.value} / {backup_model}")
+
+        try:
+            # Initialize primary provider
+            primary_embeddings = init_embeddings(EMBEDDINGS_PROVIDER, EMBEDDINGS_MODEL)
+            logger.info(f"✅ Primary provider initialized: {EMBEDDINGS_PROVIDER.value}")
+
+            try:
+                # Initialize backup provider
+                backup_embeddings = init_embeddings(backup_provider, backup_model)
+                logger.info(f"✅ Backup provider initialized: {backup_provider.value}")
+
+                # Create backup wrapper
+                from app.services.embeddings.backup_embeddings import BackupEmbeddingsProvider
+
+                # Get cooldown configuration
+                primary_cooldown_minutes = int(get_env_variable("PRIMARY_FAILOVER_COOLDOWN_MINUTES", "1"))
+
+                # For fast failover, reduce retries on primary provider if it's NVIDIA
+                if EMBEDDINGS_PROVIDER == EmbeddingsProvider.NVIDIA and hasattr(primary_embeddings, 'max_retries'):
+                    logger.info(f"Reducing NVIDIA max_retries from {primary_embeddings.max_retries} to 1 for faster backup failover")
+                    primary_embeddings.max_retries = 1
+
+                return BackupEmbeddingsProvider(
+                    primary_provider=primary_embeddings,
+                    backup_provider=backup_embeddings,
+                    primary_name=f"{EMBEDDINGS_PROVIDER.value}:{EMBEDDINGS_MODEL}",
+                    backup_name=f"{backup_provider.value}:{backup_model}",
+                    primary_cooldown_minutes=primary_cooldown_minutes
+                )
+
+            except Exception as backup_error:
+                logger.warning(f"⚠️ Backup provider failed to initialize: {str(backup_error)}")
+                logger.info(f"Continuing with primary provider only: {EMBEDDINGS_PROVIDER.value}")
+                return primary_embeddings
+
+        except Exception as primary_error:
+            logger.error(f"❌ Primary provider failed to initialize: {str(primary_error)}")
+
+            # Try to initialize backup as primary
+            try:
+                backup_embeddings = init_embeddings(backup_provider, backup_model)
+                logger.warning(f"🔄 Using backup provider as primary: {backup_provider.value}")
+                return backup_embeddings
+            except Exception as backup_error:
+                logger.error(f"❌ Both providers failed to initialize!")
+                raise RuntimeError(
+                    f"Failed to initialize any embedding provider. "
+                    f"Primary ({EMBEDDINGS_PROVIDER.value}): {str(primary_error)}, "
+                    f"Backup ({backup_provider.value}): {str(backup_error)}"
+                ) from primary_error
+    else:
+        # No backup configured, use single provider
+        return init_embeddings(EMBEDDINGS_PROVIDER, EMBEDDINGS_MODEL)
+
+try:
+    embeddings = init_embeddings_with_backup()
+    logger.info(f"Initialized embeddings of type: {type(embeddings)}")
+except Exception as e:
+    error_message = str(e)
+
+    # Provide helpful configuration error messages
+    if EMBEDDINGS_PROVIDER == EmbeddingsProvider.BEDROCK:
+        if "model identifier is invalid" in error_message:
+            logger.error(
+                f"❌ BEDROCK CONFIGURATION ERROR ❌\n\n"
+                f"The Bedrock model '{EMBEDDINGS_MODEL}' is not available in region '{AWS_DEFAULT_REGION}'.\n\n"
+                f"💡 Quick Fix:\n"
+                f"   Set EMBEDDINGS_MODEL=amazon.titan-embed-text-v1 in your .env file\n\n"
+                f"🔍 Available models in most regions:\n"
+                f"   • amazon.titan-embed-text-v1\n"
+                f"   • cohere.embed-english-v3\n"
+                f"   • cohere.embed-multilingual-v3\n\n"
+                f"🌍 To check available models in {AWS_DEFAULT_REGION}:\n"
+                f"   AWS Console → Bedrock → Foundation models → Embedding"
+            )
+        elif "AccessDeniedException" in error_message:
+            logger.error(
+                f"❌ BEDROCK ACCESS ERROR ❌\n\n"
+                f"Your AWS account doesn't have access to Bedrock in '{AWS_DEFAULT_REGION}'.\n\n"
+                f"💡 Solutions:\n"
+                f"   1. AWS Console → Bedrock → Model access → Request model access\n"
+                f"   2. Enable foundation models you want to use\n"
+                f"   3. Verify IAM permissions include 'bedrock:InvokeModel'\n\n"
+                f"⚠️ Note: Bedrock may not be available in all regions"
+            )
+        else:
+            logger.error(f"❌ BEDROCK ERROR: {error_message}")
+    else:
+        logger.error(f"❌ EMBEDDINGS ERROR ({EMBEDDINGS_PROVIDER}): {error_message}")
+
+    raise RuntimeError(f"Failed to initialize embeddings: {error_message}") from e
 
 # Vector store
 if VECTOR_DB_TYPE == VectorDBType.PGVECTOR:
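The `BackupEmbeddingsProvider` imported in the diff is not itself part of this hunk. The cooldown behavior it is described as implementing can be sketched as a minimal standalone wrapper; this is a hypothetical illustration assuming both providers expose an `embed_documents` method, not the repository's actual class:

```python
import time

class FailoverEmbeddings:
    """Minimal sketch of primary/backup failover with a cooldown window.

    After a primary failure, the primary is skipped until `cooldown_minutes`
    have elapsed; meanwhile all calls go straight to the backup, so the
    caller (e.g. LibreChat) still receives a normal success response.
    """

    def __init__(self, primary, backup, cooldown_minutes: float = 1.0):
        self.primary = primary
        self.backup = backup
        self.cooldown_seconds = cooldown_minutes * 60
        self._primary_failed_at: float | None = None

    def _primary_available(self) -> bool:
        if self._primary_failed_at is None:
            return True
        return time.monotonic() - self._primary_failed_at >= self.cooldown_seconds

    def embed_documents(self, texts):
        if self._primary_available():
            try:
                result = self.primary.embed_documents(texts)
                self._primary_failed_at = None  # primary recovered
                return result
            except Exception:
                self._primary_failed_at = time.monotonic()
        # Primary down or cooling down: serve from the backup provider.
        return self.backup.embed_documents(texts)
```

The cooldown is what prevents the cascading failures mentioned in the commit message: while the window is open, no request pays the primary's timeout cost before reaching the backup.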
