A high-precision search tool for biomedical literature
Solves the specificity-recall trade-off problem (e.g., distinguishing between APOE3 and APOE4) using metadata-based re-ranking and a hybrid search architecture
Demo • Quick start • Architecture • Results

Query → Retrieved Papers → Entity-Aware Highlighting
- API Keys: Access to OpenRouter (Llama-3, Grok, Qwen, and other models) is required
- Google AI: Be careful with Gemma; it sometimes fails when accessed via an OpenRouter API key from Russia, but you can always switch to another model in the `.env` file
```bash
cp .env.example .env
```

Add your keys to `.env`:

```
OPENROUTER_API_KEY="sk-or-v1-..."
RAG_MODEL_NAME="arcee-ai/trinity-mini:free"
```

I use `uv` for deterministic environment assembly.
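As an aside, these two variables can be loaded without any extra dependency. A minimal, dependency-free sketch (the variable names match `.env.example`; the actual app may well use `python-dotenv` instead — this parser is an assumption, not the repo's code):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY="value" lines from a .env file into os.environ.
    Skips blank lines and comments; strips surrounding double quotes."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')
```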
Option A: via uv (Recommended)
```bash
git clone https://github.com/effes3/alzheimerRAG.git
cd alzheimerRAG
uv sync
cp .env.example .env         # After copying, set your own OpenRouter API key
sh scripts/download_db.sh    # Downloads the vector database from Hugging Face (64 MB)
uv run streamlit run src/app.py
```

Option B: via `pip` (Slow)
```bash
git clone https://github.com/effes3/alzheimerRAG.git
cd alzheimerRAG
python -m venv .venv
source .venv/bin/activate    # Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env         # After copying, set your own OpenRouter API key
sh scripts/download_db.sh    # Downloads the vector database from Hugging Face (64 MB)
streamlit run src/app.py
```

Project tree:
```
alzheimerRAG/
├── data/                # PDFs, extracted texts, and entity annotations
├── results_docling/     # Reports and evaluation results via Docling
├── results_pdfplumber/  # Reports and evaluation results via pdfplumber
├── scripts/             # Data processing and EDA scripts
├── src/                 # Main application and retriever logic
├── README.md
├── requirements.txt
└── pyproject.toml
```

```mermaid
graph TD
    A[User Query] --> B{Pipeline}
    subgraph Server with T4
        K[Raw PDFs] --> L[Docling: Layout Analysis]
        L --> M[Structured Markdown]
    end
    B --> C[Vector Search: PubMedBERT]
    B --> D[Lexical Search: BM25]
    A --> E[HyDE: Hypothetical Abstract]
    E --> C
    subgraph Entity Intelligence
        A --> F[NER: Entity Extraction]
        F --> G[Metadata Filter & Entity Boost]
    end
    C --> H[Hybrid Fusion]
    D --> H
    G --> H
    H --> I[Context Window]
    I --> J[Generator: Gemma-3-27b-it]
    J --> N[Precise Biological Answer]
```
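The HyDE branch in the diagram generates a hypothetical abstract and embeds it in place of the raw query. A minimal sketch with stubbed `generate`/`embed` callables (the real pipeline calls `google/gemma-3n-e4b-it` through OpenRouter and embeds with PubMedBERT; the function names and prompt here are illustrative):

```python
def hyde_query_embedding(query, generate, embed):
    """HyDE: embed a hypothetical abstract that answers the query,
    rather than the short, keyword-poor query itself."""
    prompt = (
        "Write a short scientific abstract that would answer "
        "the question: " + query
    )
    hypothetical_abstract = generate(prompt)  # LLM call in the real pipeline
    return embed(hypothetical_abstract)       # PubMedBERT embedding in the real pipeline

# Stub usage; real generate/embed would wrap OpenRouter and PubMedBERT.
fake_generate = lambda prompt: "APOE4 promotes amyloid-beta aggregation."
fake_embed = lambda text: [float(len(text.split()))]
vector = hyde_query_embedding("Does APOE4 affect amyloid?", fake_generate, fake_embed)
```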
Hybrid search combines Semantic Density and Lexical Exactness, enhanced by the Metadata Injection mechanism.

The final document score blends the two retrieval signals and applies the entity boost:

$$Score(d) = \big(\alpha \cdot S_{vec}(d) + (1 - \alpha) \cdot S_{bm25}(d)\big) \cdot Boost(d)$$

where $\alpha$ is the hybrid-search weight (0.7 in the experiments below):
| Component | Implementation | Why is it needed? |
|---|---|---|
| S_vec 🟢 | `NeuML/pubmedbert-base-embeddings` | Captures conceptual similarity ("meaning") |
| S_bm25 🔵 | `BM25Okapi` | Ensures exact term matching |
| Boost ⚡ | Metadata Injection | Multiplies the score if entities from the query are present in the metadata |
The evaluation dataset was synthetically generated using an "LLM-as-a-Researcher" pipeline (via NotebookLM). The goal was to create complex, multi-document queries that require cross-referencing information. The dataset is defined in `evaluate.py`.

Infrastructure: Judge: `gpt-4o-mini` | Generator: `gemma-3-27b-it` | HyDE and query NER: `google/gemma-3n-e4b-it:free`
Table 1. RAG results on a DB built by cleaning PDF texts with an LLM (hybrid-search parameter `alpha` = 0.7)
| Architecture | Faithfulness | Relevancy | Precision | Recall | Latency (s) |
|---|---|---|---|---|---|
| Vector + Entity Boost 🏆 | 0.816 | 0.410 | 0.706 | 0.579 | 9.80 |
| Hybrid + Entity Boost | 0.800 | 0.367 | 0.706 | 0.579 | 11.25 |
| Hybrid + HyDE + Entity Boost | 0.777 | 0.310 | 0.596 | 0.421 | 14.79 |
Key insights:
- Simplicity wins: The Vector + Entity Boost configuration showed the best faithfulness with minimal latency
- HyDE is noisy: Using hypothetical embeddings worsened metrics due to hallucinations in the biomedical context
- Recall Ceiling: The identical recall across configurations points to a bottleneck in the ingestion stage rather than in retrieval
But after implementing Docling to process PDFs, the system achieved a massive leap in Faithfulness and Recall with the same infrastructure.
Table 2. RAG results on a DB built from `.md` files produced by Docling (hybrid-search parameter `alpha` = 0.7)
| Architecture | Faithfulness | Relevancy | Recall | Latency (s) |
|---|---|---|---|---|
| Vector + Entity Boost 🏆 | 0.935 | 0.709 | 0.789 | 8.50 |
| Hybrid + Entity Boost | 0.938 | 0.682 | 0.778 | 14.94 |
| Hybrid + HyDE + Entity Boost 🏆 | 0.956 | 0.614 | 0.895 | 19.33 |
Key insights:
- HyDE for Discovery: Using HyDE (Hypothetical Document Embeddings) increased Recall to ~90%, making it the best mode for identifying hidden drug targets
- Docling Effect: Faithfulness scores above 0.93 indicate that the LLM has almost stopped hallucinating or falling back to "Sorry, I can't answer this question", since it now receives cleanly structured table data
- The Specificity Win: Metadata-based Entity Boosting ensures that APOE4 related queries prioritize chunks explicitly tagged with that isoform
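A toy illustration of the specificity win: re-ranking so that chunks whose metadata is tagged with a query entity (APOE4 rather than APOE3) come out on top. The chunk schema and boost factor are illustrative; the repo's NER/metadata format may differ:

```python
def entity_boost_rank(chunks, query_entities, boost=2.0):
    """Re-rank chunks so those explicitly tagged with a query entity
    (e.g. 'APOE4' rather than 'APOE3') are ordered first.
    Each chunk is {'text': str, 'entities': set, 'score': float}."""
    def boosted(chunk):
        hit = bool(chunk["entities"] & query_entities)
        return chunk["score"] * (boost if hit else 1.0)
    return sorted(chunks, key=boosted, reverse=True)

chunks = [
    {"text": "APOE3 is considered the neutral isoform...", "entities": {"APOE3"}, "score": 0.82},
    {"text": "APOE4 accelerates amyloid aggregation...",   "entities": {"APOE4"}, "score": 0.78},
]
# A query about APOE4 promotes the slightly lower-scored but correctly tagged chunk.
ranked = entity_boost_rank(chunks, {"APOE4"})
```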
To reproduce the results from Table 2 (possibly with slight variation due to OpenRouter), run `evaluate.py` and adjust the settings in lines 48-56.
- 🧪 Domain-Specific Re-ranking: Integrating cross-encoders trained on PubMed to further refine the Top-K
- 🌐 GraphRAG: Transitioning to knowledge-graph-based retrieval to map complex gene-protein-disease pathways
- 🛠️ Automated NER Pipeline: Full integration of the Entity Extraction step into the ingestion workflow via MARCUS or specialized LLMs
Known limitations:
- Pilot Dataset Scale: The current evaluation was performed on a high-quality pilot network of 24 papers. While the architecture is scalable, behavior may shift when moving to million-scale document collections (requiring HNSW or DiskANN indexing)
- Distributed Ingestion Workflow: To maintain high performance without local GPU costs, document parsing (via Docling) is currently performed in a Google Colab environment. A unified, server-side ingestion API is planned for the next release
I am a chemist by education @ HSE who switched to ML engineering. My goal is to create tools that automate science
- Olympiad background: Multiple-time winner of chemistry competitions, 5 sessions at the Sirius Educational Centre
- ML Experience: Graduate of T-Bank's ML programme (top 20 out of 600+ participants), the only chemist in a cohort of BigTech developers
- Domain Expertise: I understand the difference between protein isoforms not only in terms of text, but also in terms of biological function