An automatic, metadata-based Retrieval-Augmented Generation (RAG) framework for optimized document retrieval using hybrid search (semantic similarity + metadata filtering).
- LLM-Powered Metadata Extraction: Automatically extracts structured metadata from documents
- Hybrid Search: Combines semantic similarity search with metadata filtering
- Qdrant Vector Database: Efficient vector storage and retrieval
- Configurable Pipeline: Easy configuration through environment variables
- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables using a `.env` file:

```bash
# Copy the example file
cp env.example .env
# Edit .env with your actual credentials
```

Your `.env` file should contain:
```bash
OPENAI_API_KEY=sk-your-openai-api-key
QDRANT_URL=https://your-cluster.qdrant.io
QDRANT_API_KEY=your-qdrant-api-key

# Optional
QDRANT_COLLECTION=AutoMetaRAG  # default: AutoMetaRAG
DATA_DIR=./data                # default: ./data
```

- Configure metadata settings in `config.ini`:
```ini
[Metadata]
probable_questions = "Your example questions"
document_info = "Description of your dataset"
```

Process documents, extract metadata, and create the vector database:

```bash
python -m AutoMetaRAG --mode ingest
```

Optional arguments:

```bash
python -m AutoMetaRAG --mode ingest --config custom.ini --data-dir ./my_docs
```

Available CLI Arguments:
| Argument | Mode | Type | Default | Description |
|---|---|---|---|---|
| `--mode` | Both | Required | - | `ingest` or `query` |
| `--config` | Both | Optional | `config.ini` | Path to config file |
| `--data-dir` | ingest | Optional | `./data` | Data directory path |
| `--query` | query | Optional | - | Query string (if not provided, enters interactive mode) |
| `--score-threshold` | query | Optional | `0.3` | Minimum relevance score (0.0-1.0) |
| `--vector-name` | query | Optional | `""` | Vector name for multi-vector search |
Option A: Single query (direct)

```bash
python -m AutoMetaRAG --mode query --query "What is this paper about?"
```

With custom score threshold:

```bash
python -m AutoMetaRAG --mode query --query "Your question" --score-threshold 0.5
```

With multi-vector approach:

```bash
python -m AutoMetaRAG --mode query --query "Your question" --vector-name my_vector
```

Option B: Interactive session

```bash
python -m AutoMetaRAG --mode query
```

Then enter your questions interactively. Type `exit` to quit.

Interactive with custom threshold:

```bash
python -m AutoMetaRAG --mode query --score-threshold 0.5
```

AutoMetaRAG uses python-dotenv to load environment variables from a `.env` file.
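An interactive session of this kind boils down to a small read-answer loop. The sketch below is hypothetical (`answer_fn` stands in for the pipeline's query method; the injectable `input_fn`/`output_fn` parameters exist only to make the loop easy to drive without a terminal):

```python
def interactive_session(answer_fn, input_fn=input, output_fn=print):
    # Hypothetical sketch of the interactive query mode: keep reading
    # questions until the user types 'exit', answering each one.
    while True:
        question = input_fn("Query> ").strip()
        if question.lower() == "exit":
            break
        if question:
            output_fn(answer_fn(question))

# Drive the loop with canned input instead of a live terminal:
transcript = iter(["What is AutoMetaRAG?", "exit"])
answers = []
interactive_session(lambda q: f"answer to: {q}",
                    input_fn=lambda prompt: next(transcript),
                    output_fn=answers.append)
```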
Required variables:

- `OPENAI_API_KEY`: Your OpenAI API key
- `QDRANT_URL`: Qdrant server URL (e.g., https://your-cluster.qdrant.io)
- `QDRANT_API_KEY`: Qdrant API key

Optional variables:

- `QDRANT_COLLECTION`: Collection name (default: "AutoMetaRAG")
- `DATA_DIR`: Directory containing documents (default: "./data")

Setup:

- Copy `env.example` to `.env`
- Fill in your actual credentials
- The `.env` file is automatically loaded when running AutoMetaRAG
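python-dotenv's `load_dotenv()` handles reading the `.env` file itself; the required/optional split above can then be enforced with a small stdlib check. `load_settings` is a hypothetical helper for illustration, not the package's actual configuration code:

```python
import os

REQUIRED = ("OPENAI_API_KEY", "QDRANT_URL", "QDRANT_API_KEY")
DEFAULTS = {"QDRANT_COLLECTION": "AutoMetaRAG", "DATA_DIR": "./data"}

def load_settings(env=os.environ) -> dict:
    # Hypothetical helper: fail fast if a required variable is unset,
    # and fall back to the documented defaults for the optional ones.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings

demo_env = {
    "OPENAI_API_KEY": "sk-test",
    "QDRANT_URL": "https://example.qdrant.io",
    "QDRANT_API_KEY": "qd-test",
}
settings = load_settings(demo_env)
```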
The `config.ini` file contains metadata configuration for schema generation:

```ini
[Metadata]
probable_questions = "Question 1?", "Question 2?", "Question 3?"
document_info = "Description of your dataset"
```
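A file in this shape can be read with the stdlib `configparser`; this is a sketch of how the values might be parsed, and the actual parsing in `AutoMetaRAG/config.py` may differ:

```python
import configparser

CONFIG_TEXT = """
[Metadata]
probable_questions = "Question 1?", "Question 2?", "Question 3?"
document_info = "Description of your dataset"
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)  # the pipeline would use parser.read("config.ini")

# configparser returns raw strings; split the quoted question list ourselves.
questions = [q.strip().strip('"') for q in
             parser["Metadata"]["probable_questions"].split(",")]
document_info = parser["Metadata"]["document_info"].strip('"')
```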
```
├── AutoMetaRAG/          # Main package directory
│   ├── __init__.py
│   ├── __main__.py       # CLI entry point
│   ├── pipeline.py       # Main pipeline class
│   ├── config.py         # Configuration management
│   ├── metadata.py       # Metadata generation/extraction
│   ├── document.py       # Document processing
│   ├── indexer.py        # Qdrant indexing
│   ├── query.py          # Query engine
│   └── utils.py          # Utility functions
├── examples/             # Example scripts
│   ├── test1.py          # Query example
│   └── test2.py          # Ingestion example
├── requirements.txt      # Python dependencies
├── env.example           # Environment variables template
├── README.md             # This file
└── USAGE.md              # Detailed usage guide
```
- Environment Setup: Loads API keys and configuration from the `.env` file using python-dotenv
- Metadata Schema Generation: LLM analyzes your data and suggests metadata schemas
- Document Processing: Documents are loaded and metadata is extracted using the LLM
- Vector Database Creation: Documents are embedded and stored in Qdrant with metadata
- Query Processing: User queries are analyzed to extract metadata filters
- Hybrid Search: Uses Qdrant's `query_points` method with:
  - Semantic similarity search
  - Metadata filtering
  - Score threshold filtering (minimum relevance score: 0.3)
  - Multi-vector support (optional)
- Response Generation: LLM generates answers from retrieved context
```python
from AutoMetaRAG import AutoMetaRAGPipeline

# Initialize pipeline
pipeline = AutoMetaRAGPipeline('config.ini')

# Run ingestion
pipeline.run_indexing_pipeline()

# Query
answer = pipeline.query("What did the paper discuss about transformers?")
print(answer)
```