Comprehensive guide to configuring the Multi-Modal Academic Research System for optimal performance.
- Environment Variables
- OpenSearch Configuration
- API Configuration
- Logging Configuration
- Application Settings
- Advanced Configuration
- Performance Tuning
- Security Considerations
The system uses a .env file in the project root to manage configuration. All environment variables are optional except GEMINI_API_KEY.
# Copy the example file
cp .env.example .env
# Edit with your preferred editor
nano .env
# or
code .env
# or
vim .env

GEMINI_API_KEY (Required)
- Description: Google Gemini API key for AI-powered analysis and generation
- Default: None (must be provided)
- How to get: https://makersuite.google.com/app/apikey
- Example:
GEMINI_API_KEY=AIzaSyAbc123def456ghi789jkl012mno345pqr
- Notes:
- Free tier available (no credit card required)
- Used for PDF diagram analysis and research query responses
- Do NOT use quotes around the key
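The quoting mistake above is easy to catch at startup. A minimal sketch of such a check (a hypothetical helper, not part of the system):

```python
import os

def check_gemini_key(env=None):
    """Return a list of problems with GEMINI_API_KEY (empty list = looks usable)."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "")
    problems = []
    if not key:
        problems.append("GEMINI_API_KEY is not set")
    elif key != key.strip():
        problems.append("remove leading/trailing whitespace")
    elif key[0] in "\"'" or key[-1] in "\"'":
        problems.append("remove the quotes around the key")
    return problems
```

Calling it early in `main.py` turns a confusing authentication failure into a clear message.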
OPENSEARCH_HOST
- Description: Hostname or IP address of the OpenSearch server
- Default: localhost
- Example: OPENSEARCH_HOST=localhost
- Valid values: localhost (local Docker container), 127.0.0.1 (local IP), or any valid hostname or IP address
OPENSEARCH_PORT
- Description: Port number for OpenSearch connection
- Default: 9200
- Example: OPENSEARCH_PORT=9200
- Valid values: Any valid port number (1-65535)
- Notes: Default OpenSearch port is 9200
OPENSEARCH_USERNAME (Not in .env.example, but supported)
- Description: Username for OpenSearch authentication
- Default: admin (hardcoded in opensearch_manager.py)
- Example: OPENSEARCH_USERNAME=admin
- Notes: Currently not read from .env, but support can be added
OPENSEARCH_PASSWORD (Not in .env.example, but supported)
- Description: Password for OpenSearch authentication
- Default: MyStrongPassword@2024! (hardcoded in opensearch_manager.py)
- Example: OPENSEARCH_PASSWORD=MyStrongPassword@2024!
- Security: Must match the password set when starting the OpenSearch container
Minimal configuration:
GEMINI_API_KEY=AIzaSyAbc123def456ghi789jkl012mno345pqr
OPENSEARCH_HOST=localhost
OPENSEARCH_PORT=9200

Custom OpenSearch server:
GEMINI_API_KEY=AIzaSyAbc123def456ghi789jkl012mno345pqr
OPENSEARCH_HOST=192.168.1.100
OPENSEARCH_PORT=9201

Development environment:
# Gemini API
GEMINI_API_KEY=AIzaSyAbc123def456ghi789jkl012mno345pqr
# OpenSearch (local Docker)
OPENSEARCH_HOST=localhost
OPENSEARCH_PORT=9200
# Optional: Add comments for clarity
# Get API key from: https://makersuite.google.com/app/apikey

Best practices:
- Never commit .env to version control (already in .gitignore); keep sensitive keys private
- Use separate .env files for different environments: .env.development, .env.production, .env.testing
- No quotes around values:
  # Correct
  GEMINI_API_KEY=AIzaSyAbc123
  # Incorrect (quotes become part of the value)
  GEMINI_API_KEY="AIzaSyAbc123"
- No spaces around the equals sign:
  # Correct
  OPENSEARCH_HOST=localhost
  # Incorrect
  OPENSEARCH_HOST = localhost
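The quoting and spacing rules exist because naive .env loaders split each line on the first = and keep everything else verbatim. A minimal sketch of that behavior (not the actual loader the system uses):

```python
def parse_env_line(line):
    """Parse one KEY=VALUE line the way a naive .env loader would.

    Returns (key, value), or None for blank lines and comments.
    Nothing is stripped from the value, so quotes and stray spaces survive.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or "=" not in stripped:
        return None
    key, _, value = stripped.partition("=")
    return key, value
```

For example, `parse_env_line('GEMINI_API_KEY="abc"')` returns `('GEMINI_API_KEY', '"abc"')` with the quotes still attached, which is why the key then fails authentication.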
OpenSearch is the vector database and search engine powering the research system.
Standard configuration:
docker run -d \
--name opensearch-research \
-p 9200:9200 \
-p 9600:9600 \
-e "discovery.type=single-node" \
-e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=MyStrongPassword@2024!" \
opensearchproject/opensearch:latest

With persistent storage:
docker run -d \
--name opensearch-research \
-p 9200:9200 \
-p 9600:9600 \
-e "discovery.type=single-node" \
-e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=MyStrongPassword@2024!" \
-v opensearch-data:/usr/share/opensearch/data \
opensearchproject/opensearch:latest

With custom memory limits:
docker run -d \
--name opensearch-research \
-p 9200:9200 \
-p 9600:9600 \
-e "discovery.type=single-node" \
-e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=MyStrongPassword@2024!" \
-e "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" \
opensearchproject/opensearch:latest

The system creates a research_assistant index with the following settings:
Default Index Settings (defined in opensearch_manager.py):
{
'settings': {
'index': {
'number_of_shards': 2,
'number_of_replicas': 1,
'knn': True # Enable k-nearest neighbors for vector search
}
}
}

Index Mappings:
- content_type: Keyword (paper, video, podcast)
- title: Text with keyword sub-field
- abstract: Text
- content: Text (main searchable content)
- authors: Keyword array
- publication_date: Date
- url: Keyword
- transcript: Text (for videos/podcasts)
- diagram_descriptions: Text (from Gemini Vision)
- key_concepts: Keyword array
- citations: Nested objects
- embedding: kNN vector (384 dimensions, all-MiniLM-L6-v2)
- metadata: Object (flexible additional data)
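Expressed as an OpenSearch mapping body, the field list above might look like the following sketch. Field types are inferred from the list; the authoritative mapping lives in opensearch_manager.py:

```python
# Sketch of the index mapping implied by the field list above;
# the exact definition is in opensearch_manager.py.
INDEX_MAPPINGS = {
    "properties": {
        "content_type": {"type": "keyword"},
        "title": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "abstract": {"type": "text"},
        "content": {"type": "text"},
        "authors": {"type": "keyword"},
        "publication_date": {"type": "date"},
        "url": {"type": "keyword"},
        "transcript": {"type": "text"},
        "diagram_descriptions": {"type": "text"},
        "key_concepts": {"type": "keyword"},
        "citations": {"type": "nested"},
        "embedding": {"type": "knn_vector", "dimension": 384},
        "metadata": {"type": "object"},
    }
}
```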
Check index status (via Settings tab in UI):
- View connection status
- See index statistics
- Verify document count
Delete and recreate index:
# Via Python console
from multi_modal_rag.indexing.opensearch_manager import OpenSearchManager
manager = OpenSearchManager()
manager.client.indices.delete(index='research_assistant')
manager.create_index('research_assistant')

Backup index data:
# Using OpenSearch API
curl -X POST "https://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" \
-ku admin:MyStrongPassword@2024!

SSL Configuration (in opensearch_manager.py):
OpenSearch(
hosts=[{'host': host, 'port': port}],
http_auth=(username, password),
http_compress=True,
use_ssl=True, # Enable SSL
verify_certs=False, # Disable cert verification (local dev)
ssl_assert_hostname=False,
ssl_show_warn=False,
timeout=5 # 5 second timeout
)

Modify connection timeout:
Edit multi_modal_rag/indexing/opensearch_manager.py:
timeout=30 # Increase for slow networks

API Key Setup:
- Visit: https://makersuite.google.com/app/apikey
- Sign in with Google account
- Create new API key
- Copy to .env file
Rate Limits (Free Tier):
- 60 requests per minute
- 1,500 requests per day
- The system includes rate limiting to avoid exceeding these limits
Models Used:
- Gemini Pro (gemini-1.5-pro-latest):
  - Research query responses
  - Text analysis and synthesis
  - Citation generation
- Gemini Vision (gemini-1.5-flash):
  - PDF diagram analysis
  - Image description generation
  - Visual content understanding
Switching Models (in code):
Edit multi_modal_rag/orchestration/research_orchestrator.py:
# Current
self.llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro-latest")
# Alternative (faster, less capable)
self.llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

ArXiv (via arxiv package):
- No API key required
- Rate limit: 1 request per 3 seconds (built-in)
- Free and unlimited
Semantic Scholar (via scholarly package):
- No API key required
- May require rate limiting for heavy use
- Free tier available
YouTube (via youtube-transcript-api):
- No API key required
- Uses public transcript API
- Subject to YouTube rate limits
PubMed Central:
- No API key required
- E-utilities API (free)
- Rate limit: 3 requests per second
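All four sources boil down to enforcing a minimum spacing between requests (3 s for ArXiv, 1/3 s for PubMed). A generic minimum-interval limiter can be sketched in a few lines; this is illustrative, not the system's actual implementation:

```python
import time

class MinIntervalLimiter:
    """Block until at least min_interval seconds have passed since the last call.

    e.g. MinIntervalLimiter(3.0) for ArXiv, MinIntervalLimiter(1 / 3) for PubMed.
    """
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never waits

    def wait(self):
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

A collector would call `limiter.wait()` immediately before each outgoing request.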
The system uses Python's built-in logging with custom configuration.
Default: logs/research_system_YYYYMMDD_HHMMSS.log
Example: logs/research_system_20241002_143022.log
File Handler (detailed logging):
- Level: DEBUG (captures all events, including debug information)
- Format: timestamp - module - level - file:line - function() - message
Console Handler (important messages):
- Level: INFO (shows only important events in the terminal)
- Format: level - module - message
Defined in: multi_modal_rag/logging_config.py
File Handler Configuration:
file_handler = logging.FileHandler(log_file, mode='w', encoding='utf-8')
file_handler.setLevel(logging.DEBUG)
file_formatter = logging.Formatter(
'%(asctime)s - %(name)s - %(levelname)s - %(filename)s:%(lineno)d - %(funcName)s() - %(message)s',
datefmt='%Y-%m-%d %H:%M:%S'
)

Console Handler Configuration:
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
console_formatter = logging.Formatter(
'%(levelname)s - %(name)s - %(message)s'
)

Change log level (more or less verbose):
Edit multi_modal_rag/logging_config.py:
# More verbose (show everything)
logger.setLevel(logging.DEBUG)
console_handler.setLevel(logging.DEBUG)
# Less verbose (errors only)
logger.setLevel(logging.ERROR)
console_handler.setLevel(logging.ERROR)

Change log file location:
# In setup_logging() function
log_dir = "custom_logs" # Change from "logs"

Disable console output:
# Comment out console handler
# logger.addHandler(console_handler)

Log rotation (for long-running applications):
from logging.handlers import RotatingFileHandler
file_handler = RotatingFileHandler(
log_file,
maxBytes=10*1024*1024, # 10MB
backupCount=5
)

Automatic: Log files accumulate in the logs/ directory
Manual cleanup:
# Delete logs older than 7 days
find logs/ -name "*.log" -mtime +7 -delete
# Keep only last 10 log files
ls -t logs/*.log | tail -n +11 | xargs rm

Archive logs:
# Compress old logs
tar -czf logs_archive_$(date +%Y%m%d).tar.gz logs/*.log

Defined in: main.py
Default settings:
app.launch(
server_name="0.0.0.0", # Listen on all interfaces
server_port=7860, # Default Gradio port
share=True # Create public URL
)

Custom port:
app.launch(
server_name="0.0.0.0",
server_port=8080, # Custom port
share=True
)

Disable public sharing:
app.launch(
server_name="127.0.0.1", # Localhost only
server_port=7860,
share=False # No public URL
)

Authentication (add password protection):
app.launch(
server_name="0.0.0.0",
server_port=7860,
share=True,
auth=("username", "password") # Basic auth
)

Default directories (created automatically):
data/
├── papers/ # Downloaded PDFs
├── videos/ # Video metadata and transcripts
├── podcasts/ # Podcast episode data
└── processed/ # Processed content ready for indexing
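The collectors create these directories automatically, but after a fresh clone (or when changing the data location) they can also be recreated with a few lines of stdlib Python — a sketch, not part of the codebase:

```python
from pathlib import Path

def ensure_data_dirs(root="data"):
    """Create the default data layout; safe to call repeatedly."""
    subdirs = ("papers", "videos", "podcasts", "processed")
    paths = [Path(root) / name for name in subdirs]
    for p in paths:
        p.mkdir(parents=True, exist_ok=True)
    return paths
```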
Change data location (edit collector classes):
# In data_collectors/paper_collector.py
self.download_dir = "custom_papers_location"

Default model: all-MiniLM-L6-v2
- Dimension: 384
- Speed: Fast
- Quality: Good for most use cases
Change model (in opensearch_manager.py):
# Higher quality, slower
self.embedding_model = SentenceTransformer('all-mpnet-base-v2') # 768 dim
# Slightly slower than L6, marginally better quality, same dimension
self.embedding_model = SentenceTransformer('all-MiniLM-L12-v2') # 384 dim

Important: If you change the embedding model, you must:
- Update the dimension in index mapping
- Delete and recreate the index
- Re-index all documents
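Step 1 is a one-line change to the mapping body. A hypothetical helper that returns an updated copy (the real mapping lives in opensearch_manager.py, and the nested key path here is assumed):

```python
import copy

def with_embedding_dimension(mapping, new_dim):
    """Return a deep copy of an index mapping with the embedding dimension replaced."""
    updated = copy.deepcopy(mapping)
    updated["mappings"]["properties"]["embedding"]["dimension"] = new_dim
    return updated
```

For example, switching to all-mpnet-base-v2 means calling `with_embedding_dimension(mapping, 768)` before recreating the index.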
Defined in: multi_modal_rag/indexing/opensearch_manager.py
Search query structure:
query = {
"query": {
"bool": {
"should": [
# Keyword search (BM25)
{
"multi_match": {
"query": query_text,
"fields": ["title^2", "abstract", "content", "transcript"],
"type": "best_fields"
}
},
# Vector similarity search
{
"knn": {
"embedding": {
"vector": query_embedding,
"k": 10
}
}
}
]
}
}
}

Adjust field weights:
"fields": ["title^3", "abstract^2", "content", "transcript"]
# title is 3x more important
# abstract is 2x more important

Change k-NN neighbors:
"k": 20 # Consider top 20 nearest neighbors instead of 10

Memory settings (in research_orchestrator.py):
# Current: Last 10 messages
self.memory = ConversationBufferWindowMemory(k=10)
# More context (uses more tokens)
self.memory = ConversationBufferWindowMemory(k=20)
# Full conversation history
self.memory = ConversationBufferMemory()

Prompt template customization:
# Edit in research_orchestrator.py
RESEARCH_TEMPLATE = """
Custom instructions here...
Context: {context}
Question: {question}
"""

Defined in: multi_modal_rag/data_processors/pdf_processor.py
Diagram extraction settings:
- Currently uses Gemini Vision for analysis
- Can be configured to extract specific image types
- Adjustable image resolution and quality
Text extraction:
- Uses PyPDF2 and PyMuPDF
- Fallback mechanisms for different PDF formats
OpenSearch JVM heap:
# Start with more memory
docker run -d \
-e "OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g" \
opensearchproject/opensearch:latest

Python process:
- Process documents in batches
- Clear memory between large operations
- Use generators for large datasets
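The last point can be sketched as a small batching generator (illustrative; `helpers.bulk` already chunks internally via `chunk_size`):

```python
def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without materializing the whole input."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

Feeding documents through a generator like this keeps peak memory bounded by the batch size rather than the corpus size.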
Bulk indexing (already implemented):
# Batch multiple documents
helpers.bulk(client, actions, chunk_size=100)

Optimize for speed:
# Reduce replicas during bulk indexing
index_settings = {
'number_of_shards': 2,
'number_of_replicas': 0, # Restore to 1 after indexing
'refresh_interval': '30s' # Default is 1s
}

Limit result size:
# In hybrid_search()
results = self.client.search(
index=index_name,
body=query,
size=5 # Return top 5 instead of 10
)

Cache embeddings:
- Embeddings are generated once during indexing
- Reused for all searches (already optimized)
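The field-weight, k, and result-size knobs above can be gathered into a single query builder — a sketch mirroring the hybrid query structure shown earlier; the function name and parameters are illustrative:

```python
def build_hybrid_query(query_text, query_embedding, field_weights=None, k=10):
    """Build a hybrid BM25 + k-NN query body with tunable weights and neighbors."""
    fields = field_weights or ["title^2", "abstract", "content", "transcript"]
    return {
        "query": {
            "bool": {
                "should": [
                    # Keyword search (BM25)
                    {"multi_match": {
                        "query": query_text,
                        "fields": fields,
                        "type": "best_fields",
                    }},
                    # Vector similarity search
                    {"knn": {"embedding": {"vector": query_embedding, "k": k}}},
                ]
            }
        }
    }
```

Tuning then becomes a matter of passing different arguments instead of editing the query dict in place.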
OpenSearch connection pooling (already enabled):
http_compress=True # Compress HTTP traffic

Timeout tuning:
timeout=30 # Increase for slow networks

Best practices:
- Never commit .env to version control (already in .gitignore)
- Use separate API keys for development/production
- Rotate keys periodically
- Limit API key permissions when possible
Environment-specific keys:
# Development
GEMINI_API_KEY=dev_key_here
# Production
GEMINI_API_KEY=prod_key_here

Production recommendations:
- Enable SSL certificate verification:
  verify_certs=True, ssl_assert_hostname=True
- Use strong passwords:
  OPENSEARCH_INITIAL_ADMIN_PASSWORD=VeryStr0ng!P@ssw0rd#2024
- Network security:
  - Don't expose the OpenSearch port (9200) publicly
  - Use firewall rules
  - Run on a private network
- Authentication:
  - Change default admin credentials
  - Create role-based access
  - Use separate users for read/write operations
For production:
app.launch(
server_name="127.0.0.1", # Localhost only
share=False, # No public URL
auth=("admin", "secure_password"), # Require authentication
ssl_keyfile="path/to/key.pem",
ssl_certfile="path/to/cert.pem"
)

Behind a reverse proxy (recommended):
- Use nginx or Apache as reverse proxy
- Handle SSL/TLS termination at proxy
- Add rate limiting
- Implement authentication at proxy level
Sensitive data handling:
- Review PDFs before indexing (no proprietary content)
- Clear conversation history regularly
- Don't index personal or confidential information
- Consider data retention policies
Local-first approach:
- All data stored locally by default
- No data sent to third parties except API calls
- OpenSearch runs on your machine
- Control your own data
- Copy .env.example to .env
- Add Gemini API key
- Configure OpenSearch host/port
- Start OpenSearch container
- Verify connection in Settings tab
- Use strong OpenSearch password
- Enable SSL certificate verification
- Set up Gradio authentication
- Configure firewall rules
- Set up log rotation
- Configure backups for OpenSearch data
- Use environment-specific API keys
- Implement rate limiting
- Set up monitoring and alerts
- Adjust OpenSearch JVM heap size
- Configure index refresh interval
- Tune k-NN parameters
- Optimize field weights for your use case
- Set appropriate timeout values
- Enable HTTP compression
- Regular log cleanup
- Periodic API key rotation
- Index optimization
- Data backup schedule
- Update dependencies regularly
Environment variable changes not taking effect:
Solution:
- Restart the application (Ctrl+C and run python main.py again)
- Verify .env is in the project root directory
- Check for syntax errors (no quotes, no spaces around =)
- Ensure variable names are spelled correctly
OpenSearch connection errors:
Solution:
- Check verify_certs=False in opensearch_manager.py
- Verify OpenSearch is running: docker ps
- Test connection: curl -k https://localhost:9200
- Check firewall settings
Logs not being written:
Solution:
- Verify the logs/ directory exists and is writable
- Check disk space
- Review file permissions
- Check logging configuration in logging_config.py
Out-of-memory errors:
Solution:
- Increase Docker memory limit
- Process fewer documents at once
- Reduce OpenSearch JVM heap size if needed
- Close other applications
After configuring your system:
- Test Configuration: Run through Quick Start Guide to verify everything works
- Explore Features: Learn about the system in Technology Stack
- Optimize: Use this guide to tune performance for your use case
- Monitor: Check logs regularly for issues
Your Multi-Modal Academic Research System is now fully configured and ready for production use!