
AlzheimerRAG 🧬 Precision Biomedical Retrieval


A high-precision search tool for biomedical literature
Solves the specificity-recall trade-off problem (e.g., distinguishing between APOE3 and APOE4) using metadata-based re-ranking and a hybrid search architecture

Demo · Quick Start · Architecture · Results


🎬 Demo: Quick Look

Application Demo
Query → Retrieved Papers → Entity-Aware Highlighting


🚀 Quick Start

1. Prerequisites

  • API Keys: Access to OpenRouter (Llama-3, Grok, Qwen, etc. models) is required
  • Google AI: Gemma is sometimes unavailable via an OpenRouter API key from Russia; if that happens, you can always switch to another model in the .env file

2. Configuration

cp .env.example .env

Add your keys to .env:

OPENROUTER_API_KEY="sk-or-v1-..."
RAG_MODEL_NAME="arcee-ai/trinity-mini:free"
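
For reference, here is a minimal sketch of how these variables can be read at startup. It assumes python-dotenv; the app's actual loading code may differ:

```python
# Minimal sketch of reading the .env configuration above, assuming python-dotenv;
# the application's actual loading code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment

OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]                       # required
RAG_MODEL_NAME = os.getenv("RAG_MODEL_NAME", "arcee-ai/trinity-mini:free")  # optional override
```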

3. Installation & Execution

I use uv for deterministic environment setup

Option A: via uv (Recommended)

git clone https://github.com/effes3/alzheimerRAG.git
cd alzheimerRAG
uv sync
cp .env.example .env # After copying, set your own OpenRouter API key
sh scripts/download_db.sh # Downloads the vector database from Hugging Face (64 MB)
uv run streamlit run src/app.py
Option B: via `pip` (Slow)
git clone https://github.com/effes3/alzheimerRAG.git
cd alzheimerRAG
python -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env # After copying, set your own OpenRouter API key
sh scripts/download_db.sh # Downloads the vector database from Hugging Face (64 MB)
streamlit run src/app.py
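
For reference, the download step can be approximated in Python via huggingface_hub; the repo id and target directory below are placeholders, so check scripts/download_db.sh for the real values:

```python
# Hypothetical Python equivalent of scripts/download_db.sh, assuming huggingface_hub.
# repo_id and local_dir are placeholders; see the shell script for the real values.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-user>/<vector-db-repo>",  # placeholder
    repo_type="dataset",                   # assumed: the DB is published as a dataset
    local_dir="data/vector_db",            # placeholder target directory
)
```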

📂 Project Structure

alzheimerRAG/
├── data/                # PDFs, extracted texts, and entity annotations
├── results_docling/     # Reports and evaluation results via Docling
├── results_pdfplumber/  # Reports and evaluation results via pdfplumber
├── scripts/             # Data processing and EDA scripts
├── src/                 # Main application and retriever logic        
├── README.md
├── requirements.txt
└── pyproject.toml

🧬 Retrieval Architecture

graph TD
    A[User Query] --> B{Pipeline}
    
    subgraph Server with T4
    K[Raw PDFs] --> L[Docling: Layout Analysis]
    L --> M[Structured Markdown]
    end

    B --> C[Vector Search: PubMedBERT]
    B --> D[Lexical Search: BM25]
    A --> E[HyDE: Hypothetical Abstract]
    
    E --> C
    
    subgraph Entity Intelligence
    A --> F[NER: Entity Extraction]
    F --> G[Metadata Filter & Entity Boost]
    end
    
    C --> H[Hybrid Fusion]
    D --> H
    G --> H
    
    H --> I[Context Window]
    I --> J[Generator: Gemma-3-27b-it]
    J --> N[Precise Biological Answer]

Hybrid search combines Semantic Density and Lexical Exactness, enhanced by the Metadata Injection mechanism

Scoring Logic

The formula for the final document score ($Score(d)$):

$$ Score(d) = \underbrace{\left[ \alpha \left(1 - \frac{S_{\text{vec}}}{Max_{\text{vec}}}\right) + (1-\alpha) \left(\frac{S_{\text{bm25}}}{Max_{\text{bm25}}}\right) \right]}_{\text{Hybrid Base Score}} \times \underbrace{\text{Boost}(Entities)}_{\text{Metadata Multiplier}} $$
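
For intuition, a worked example with illustrative numbers ($\alpha = 0.7$ matches the evaluation setting below; the other values are made up): take a normalized vector distance $S_{\text{vec}}/Max_{\text{vec}} = 0.2$, a normalized BM25 score $S_{\text{bm25}}/Max_{\text{bm25}} = 0.5$, and an entity boost of $1.5$:

$$ Score(d) = \left[\, 0.7 \times (1 - 0.2) + 0.3 \times 0.5 \,\right] \times 1.5 = (0.56 + 0.15) \times 1.5 = 1.065 $$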

| Component | Implementation | Why is it needed? |
|-----------|----------------|-------------------|
| S_vec | 🟢 NeuML/pubmedbert-base-embeddings | Captures conceptual similarity ("meaning") |
| S_bm25 | 🔵 BM25Okapi | Ensures exact term matching |
| Boost | Metadata Injection | Multiplies the score if entities from the query are present in the metadata |
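
A minimal sketch of this scoring logic (illustrative, not the exact code in src/), assuming sentence-transformers for the PubMedBERT embeddings and rank_bm25 for BM25Okapi; the boost factor of 1.5 is a placeholder:

```python
# Illustrative sketch of the scoring formula above; not the exact code in src/.
# Assumes sentence-transformers and rank_bm25; the boost factor is a placeholder
# (the evaluation below uses alpha = 0.7).
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")

def hybrid_scores(query, docs, doc_entities, query_entities, alpha=0.7, boost=1.5):
    # Vector term: distances are "lower is better", hence 1 - S_vec / Max_vec.
    doc_emb = model.encode(docs)
    q_emb = model.encode([query])[0]
    vec_dist = np.linalg.norm(doc_emb - q_emb, axis=1)
    vec_term = 1.0 - vec_dist / max(vec_dist.max(), 1e-9)

    # BM25 term: scores are "higher is better", normalized by the max score.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_term = bm25_scores / max(bm25_scores.max(), 1e-9)

    base = alpha * vec_term + (1.0 - alpha) * bm25_term

    # Metadata multiplier: boost chunks whose metadata contains a query entity
    # (e.g. a query about APOE4 boosts chunks explicitly tagged with "APOE4").
    boosts = np.array([
        boost if set(query_entities) & set(ents) else 1.0
        for ents in doc_entities
    ])
    return base * boosts
```

Here doc_entities is a list of entity-tag lists (one per chunk), and query_entities is the set produced by the NER step in the diagram above.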

📊 Performance Evaluation (Results@10)

The evaluation dataset was synthetically generated using an "LLM-as-a-Researcher" pipeline (via NotebookLM). The goal was to create complex, multi-document queries that require cross-referencing information

You can find this dataset in evaluate.py

Infrastructure: Judge: gpt-4o-mini | Generator: gemma-3-27b-it | HyDE and NER from Query: google/gemma-3n-e4b-it:free
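
For context, a hedged sketch of what an LLM-as-judge faithfulness check can look like; the prompt wording and scoring scale are illustrative, not the exact evaluate.py logic:

```python
# Illustrative LLM-as-judge faithfulness check; the prompt is a placeholder,
# not the exact evaluate.py logic. Assumes the openai client package.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_faithfulness(question: str, context: str, answer: str) -> float:
    prompt = (
        "Rate from 0 to 1 how faithful the ANSWER is to the CONTEXT.\n"
        f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}\n"
        "Reply with a single number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())
```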

Table 1. RAG results on a DB built from PDF texts cleaned via LLM (hybrid search parameter α = 0.7)

| Architecture | Faithfulness | Relevancy | Precision | Recall | Latency (s) |
|--------------|--------------|-----------|-----------|--------|-------------|
| Vector + Entity Boost 🏆 | 0.816 | 0.410 | 0.706 | 0.579 | 9.80 |
| Hybrid + Entity Boost | 0.800 | 0.367 | 0.706 | 0.579 | 11.25 |
| Hybrid + HyDE + Entity Boost | 0.777 | 0.310 | 0.596 | 0.421 | 14.79 |

Key insights:

  1. Simplicity wins: The Vector + Entity Boost configuration showed the best faithfulness with minimal latency
  2. HyDE is noisy: Using hypothetical embeddings worsened metrics due to hallucinations in the biomedical context
  3. Recall Ceiling: Identical recall across the top configurations points to a bottleneck in the ingestion stage rather than in retrieval

After switching to Docling for PDF processing, however, the system achieved a massive leap in Faithfulness and Recall with the same infrastructure

Table 2. RAG results on a DB built from the Markdown files produced by Docling (hybrid search parameter α = 0.7)

| Architecture | Faithfulness | Relevancy | Recall | Latency (s) |
|--------------|--------------|-----------|--------|-------------|
| Vector + Entity Boost 🏆 | 0.935 | 0.709 | 0.789 | 8.50 |
| Hybrid + Entity Boost | 0.938 | 0.682 | 0.778 | 14.94 |
| Hybrid + HyDE + Entity Boost 🏆 | 0.956 | 0.614 | 0.895 | 19.33 |

Key insights:

  1. HyDE for Discovery: HyDE (Hypothetical Document Embeddings) raised Recall to ~90%, making it the best mode for surfacing hidden drug targets (see the sketch after this list)
  2. Docling Effect: Faithfulness above 0.93 means the LLM has almost stopped hallucinating or replying "Sorry, I can't answer this question", since it now receives cleanly structured table data
  3. The Specificity Win: Metadata-based Entity Boosting ensures that APOE4-related queries prioritize chunks explicitly tagged with that isoform
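
A minimal sketch of the HyDE step from insight 1 (illustrative; the prompt is a placeholder): the LLM drafts a hypothetical abstract for the query, and that abstract, rather than the raw query, is embedded for vector search:

```python
# Illustrative HyDE sketch: embed an LLM-written hypothetical abstract instead
# of the raw query. Assumes the openai client pointed at OpenRouter; the prompt
# is a placeholder.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def hyde_embedding(query: str, embed_model):
    resp = client.chat.completions.create(
        model="google/gemma-3n-e4b-it:free",  # the HyDE model from the setup above
        messages=[{
            "role": "user",
            "content": f"Write a short PubMed-style abstract answering: {query}",
        }],
    )
    hypothetical_abstract = resp.choices[0].message.content
    # Vector search then runs on this embedding instead of the raw query's.
    return embed_model.encode([hypothetical_abstract])[0]
```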

To reproduce the results from Table 2 (exactly, or with slight variation due to OpenRouter), run evaluate.py and adjust the settings in lines 48-56


🔮 Future Roadmap

  • 🧪 Domain-Specific Re-ranking: Integrating cross-encoders trained on PubMed to further refine the Top-K (see the sketch after this list)
  • 🌐 GraphRAG: Transitioning to a knowledge-graph-based retrieval to map complex gene-protein-disease pathways
  • 🛠️ Automated NER Pipeline: Full integration of the Entity Extraction step into the ingestion workflow via MARCUS or specialized LLMs
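
As a sketch of what that re-ranking step could look like, assuming sentence-transformers (the checkpoint below is a generic MS MARCO cross-encoder standing in for a future PubMed-tuned one):

```python
# Illustrative cross-encoder re-ranking sketch, assuming sentence-transformers.
# The checkpoint is a generic stand-in for a PubMed-trained cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # Score each (query, document) pair jointly, then keep the best top_k.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```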

⚠️ Limitations

  • Pilot Dataset Scale: The current evaluation was performed on a high-quality pilot corpus of 24 papers. While the architecture is scalable, behavior may shift on million-scale document collections (which would require HNSW or DiskANN indexing)
  • Distributed Ingestion Workflow: To maintain high performance without local GPU costs, document parsing (via Docling) is currently performed in a Google Colab environment. A unified, server-side ingestion API is planned for the next release

👨‍🔬 About the Author

I am a chemist by education @ HSE who switched to ML engineering. My goal is to create tools that automate science

  • Olympiad background: Multiple-time winner of chemistry olympiads, with 5 sessions at the Sirius Educational Centre
  • ML Experience: Graduate of T-Bank's ML programme (top 20 of 600+ participants), the only chemist among developers from BigTech
  • Domain Expertise: I understand the difference between protein isoforms not only in terms of text, but also in terms of biological function
