AI FOR BHARAT 2026
An AI-powered platform for de novo catalyst and enzyme discovery, replacing trial-and-error with a continuous, self-improving Lab-in-the-Loop.
The global push toward sustainable aviation fuel and bio-based chemicals is fundamentally constrained by a discovery problem:
- Design space: >10^60 potential molecules
- Lab throughput: ~hundreds per month
- Time to discovery: Years of manual screening
- GPS Renewables' need: Optimize 2G Ethanol-to-SAF pathway catalysts
CatalystAI compresses discovery from years to weeks using generative AI + Bayesian active learning.
+---------------+      +----------------+      +---------------+      +----------------+
|   Generate    | ---> |    Predict     | ---> |   Rank (EI)   | ---> |   Synthesize   |
|   (DiffCSP)   |      |   (GNN/FBA)    |      |  (Bayesian)   |      |   (Wet Lab)    |
+---------------+      +----------------+      +---------------+      +----------------+
                                                                               |
+---------------+      +----------------+      +---------------+               |
|    Update     | <--- |    Validate    | <--- |    Measure    | <-------------+
| Model Weights |      |   (Human-in-   |      |    Results    |
|               |      |   the-Loop)    |      |               |
+---------------+      +----------------+      +---------------+
- DiffCSP for novel crystal lattice generation
- Conditional cDVAE for material structure design
- Novel IP generation beyond database retrieval
- Vector database semantic search (Milvus-style)
- Extract baseline performance from scientific literature
- Knowledge-grounded candidate generation
- Literature context for explainability
- GNN surrogate models (trained on Open Catalyst Project data)
- Activity, selectivity, and stability prediction
- Epistemic uncertainty quantification via deep ensembles
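
Deep-ensemble uncertainty can be illustrated with a minimal sketch: several independently trained models score the same candidate, and the spread of their predictions is used as the epistemic uncertainty. The `ensemble_predict` and `make_linear` helpers and the toy weights below are hypothetical stand-ins, not the project's GNN surrogates.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]


def ensemble_predict(models: Sequence[Model], features: Sequence[float]) -> tuple[float, float]:
    """Return (mean prediction, spread across members) for one candidate."""
    preds = [model(features) for model in models]
    return mean(preds), pstdev(preds)


def make_linear(weights: Sequence[float]) -> Model:
    """Toy stand-in for one independently trained surrogate (hypothetical)."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))


# Three ensemble members with slightly different, made-up weights.
models = [make_linear(w) for w in ([0.5, 0.2], [0.6, 0.1], [0.4, 0.3])]
mu, sigma = ensemble_predict(models, [1.0, 2.0])
print(f"predicted activity = {mu:.3f} +/- {sigma:.3f}")
```

Members that disagree strongly produce a large `sigma`, which is the signal the ranking step can use to favor informative experiments.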
- 5 Acquisition Functions:
  - Expected Improvement (EI): Default, balanced
  - Probability of Improvement (PI): Conservative
  - Upper Confidence Bound (UCB): Optimistic exploration
  - EI per Cost: Resource-aware optimization
  - Knowledge Gradient (KG): One-step-lookahead optimal
- Multi-objective Pareto front optimization
- Side-by-side strategy comparison
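
The closed-form acquisition functions listed above can be sketched as follows, assuming a Gaussian predictive distribution with mean `mu` and standard deviation `sigma` and a maximization objective. The function names, the `xi` and `kappa` defaults, and the example numbers are illustrative assumptions rather than the contents of `acquisition_service.py`; Knowledge Gradient is omitted because it has no equally short closed form.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, provides pdf() and cdf()


def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """EI = E[max(f - best - xi, 0)] for f ~ N(mu, sigma^2)."""
    gap = mu - best - xi
    if sigma <= 0.0:
        return max(gap, 0.0)
    z = gap / sigma
    return gap * _STD_NORMAL.cdf(z) + sigma * _STD_NORMAL.pdf(z)


def probability_of_improvement(mu: float, sigma: float, best: float, xi: float = 0.0) -> float:
    """PI = P(f > best + xi)."""
    if sigma <= 0.0:
        return 1.0 if mu > best + xi else 0.0
    return _STD_NORMAL.cdf((mu - best - xi) / sigma)


def upper_confidence_bound(mu: float, sigma: float, kappa: float = 2.0) -> float:
    """UCB = mu + kappa * sigma; larger kappa explores more."""
    return mu + kappa * sigma


def ei_per_cost(mu: float, sigma: float, best: float, cost: float) -> float:
    """EI divided by the (positive) cost of running the experiment."""
    return expected_improvement(mu, sigma, best) / cost


# Two hypothetical candidates: a safe one and an uncertain one.
best_so_far = 0.70
safe, risky = (0.72, 0.02), (0.68, 0.15)
print("EI:", expected_improvement(*safe, best_so_far), expected_improvement(*risky, best_so_far))
print("PI:", probability_of_improvement(*safe, best_so_far), probability_of_improvement(*risky, best_so_far))
```

With these numbers EI prefers the uncertain candidate while PI prefers the safe one, which is the kind of difference a side-by-side strategy comparison is meant to surface.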
- Genome-scale metabolic modeling (COBRApy-style)
- Lignocellulosic biomass → ethanol pathway simulation
- Bottleneck identification (cellulase, xylanase, etc.)
- Enzyme impact prediction
- Optimal enzyme cocktail design
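
Bottleneck identification and enzyme impact prediction can be illustrated without a full genome-scale model. For a strictly linear pathway at steady state every step carries the same flux, so the flux-balance optimum reduces to the smallest step capacity. The sketch below uses that special case only; it is not COBRApy, and the step names and capacities are made-up values in arbitrary units.

```python
def pathway_flux(capacities: dict[str, float]) -> tuple[float, str]:
    """Return (max steady-state flux, limiting step) for a linear pathway."""
    bottleneck = min(capacities, key=capacities.get)
    return capacities[bottleneck], bottleneck


def enzyme_impact(capacities: dict[str, float], enzyme: str, fold_change: float) -> float:
    """Flux gained by scaling one step's capacity by fold_change."""
    base, _ = pathway_flux(capacities)
    boosted = {**capacities, enzyme: capacities[enzyme] * fold_change}
    new, _ = pathway_flux(boosted)
    return new - base


# Hypothetical capacities for a biomass -> glucose -> ethanol chain.
capacities = {"cellulase": 4.0, "glucose_uptake": 10.0, "fermentation": 6.0}

flux, limiting = pathway_flux(capacities)
print(f"flux = {flux}, bottleneck = {limiting}")
print("gain from 2x cellulase:", enzyme_impact(capacities, "cellulase", 2.0))
print("gain from 2x fermentation:", enzyme_impact(capacities, "fermentation", 2.0))
```

Improving a non-limiting step yields no gain, which is why bottleneck identification comes before enzyme engineering. Branched networks need a real linear-programming solve rather than this shortcut.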
- RFdiffusion backbone generation
- ProteinMPNN sequence design
- ESM3 zero-shot fitness prediction
- Mandatory screening against:
  - CDC Select Agents Registry
  - Known toxin databases (diphtheria, botulinum, ricin, anthrax)
  - Virulence factor motifs
- Automatic blocking of restricted sequences
- Compliance reporting (NIH Guidelines, DURC, NSABB)
- Audit logging for regulatory documentation
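
The screen, block, and log flow described above can be sketched with a deliberately simplified matcher. Production screening would rely on homology search against curated databases; the exact shared k-mer test here is a swapped-in stand-in, and the blocklist entry is a synthetic placeholder string, not a real sequence. All names (`screen_sequence`, `ScreenResult`, `blocklist`) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ScreenResult:
    sequence_id: str
    blocked: bool
    matches: list[str]
    timestamp: str


def kmers(seq: str, k: int) -> set[str]:
    """All length-k windows of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_sequence(sequence_id, sequence, blocklist, k=6, audit_log=None):
    """Block a sequence that shares any k-mer with a blocklist entry."""
    query = kmers(sequence.upper(), k)
    matches = sorted(
        name for name, ref in blocklist.items() if query & kmers(ref.upper(), k)
    )
    result = ScreenResult(
        sequence_id, bool(matches), matches, datetime.now(timezone.utc).isoformat()
    )
    if audit_log is not None:
        audit_log.append(result)  # every decision is recorded, blocked or not
    return result


# Synthetic placeholder entry for illustration only.
blocklist = {"placeholder_motif_A": "AAAWWWCCCDDD"}
audit_log: list[ScreenResult] = []

flagged = screen_sequence("cand-1", "GGGAAAWWWCCCGG", blocklist, audit_log=audit_log)
cleared = screen_sequence("cand-2", "GGGGGGGGGG", blocklist, audit_log=audit_log)
print(flagged.blocked, cleared.blocked, len(audit_log))
```

Logging cleared sequences as well as blocked ones keeps the audit trail complete for compliance reporting.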
- ELN webhook integration (eLabFTW/Labtools.AI)
- Human-in-the-loop validation
- Transfer learning retraining simulation
- Model versioning with provenance tracking
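
One way to picture model versioning with provenance tracking is a content-addressed registry in which each version records its parent and the measurements that triggered the retraining. This is a sketch under assumptions: `register_version`, `lineage`, and the record fields are hypothetical and do not describe the project's storage layer.

```python
import hashlib
import json
from typing import Optional


def register_version(registry: dict, parent: Optional[str], measurements: list) -> str:
    """Store a new model version keyed by a hash of its provenance."""
    payload = json.dumps({"parent": parent, "measurements": measurements}, sort_keys=True)
    version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    registry[version_id] = {"parent": parent, "measurements": measurements}
    return version_id


def lineage(registry: dict, version_id: Optional[str]) -> list:
    """Walk parent links from a version back to the initial model."""
    chain = []
    while version_id is not None:
        chain.append(version_id)
        version_id = registry[version_id]["parent"]
    return chain


registry: dict = {}
v1 = register_version(registry, None, [])
v2 = register_version(registry, v1, [{"candidate": "cat-03", "activity": 0.85}])
print(lineage(registry, v2))
```

Because the identifier is derived from the parent and the data, the same retraining input always maps to the same version, which makes a provenance record easy to verify.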
backend/
├── app/
│   ├── main.py                      # API endpoints
│   ├── schemas.py                   # Pydantic models
│   └── services/
│       ├── bayesian_service.py      # Expected Improvement
│       ├── acquisition_service.py   # Multi-acquisition functions
│       ├── rag_service.py           # Literature retrieval
│       ├── fba_service.py           # Metabolic modeling
│       ├── biosecurity_service.py   # TEVV screening
│       └── mock_db.py               # In-memory storage
└── requirements.txt

streamlit_app/
└── app.py                           # 8-tab comprehensive dashboard
    ├── Generate & Rank
    ├── Pareto Front
    ├── Why EI? (Comparison)
    ├── Multi-Acquisition Comparison
    ├── ELN Feedback Loop
    ├── RAG Literature Retrieval
    ├── Metabolic FBA (Enzyme Track)
    └── Biosecurity TEVV
- 3D Pareto Front (Plotly): Activity × Selectivity × Stability
- Molstar Molecular Viewer: WebGL crystal lattices
- Metabolic Flux Diagrams: Cytoscape.js network graphs (planned)
- Acquisition Function Heatmaps: Exploration vs. exploitation trade-offs
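
The points shown on an Activity × Selectivity × Stability Pareto front are the candidates that no other candidate beats on all three objectives at once. A minimal non-dominated filter, assuming every objective is maximized and using made-up scores, might look like this (the `dominates` and `pareto_front` names are illustrative):

```python
def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: dict) -> dict:
    """Keep only candidates that no other candidate dominates."""
    return {
        name: obj
        for name, obj in points.items()
        if not any(
            dominates(other, obj)
            for other_name, other in points.items()
            if other_name != name
        )
    }


# Hypothetical (activity, selectivity, stability) scores.
candidates = {
    "A": (0.9, 0.5, 0.6),
    "B": (0.7, 0.8, 0.7),
    "C": (0.6, 0.4, 0.5),  # worse than A on every objective
}
front = pareto_front(candidates)
print(sorted(front))
```

This pairwise check is quadratic in the number of candidates, which is adequate for the small batches a dashboard displays.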
| Domain | Technology | Justification |
|---|---|---|
| Backend API | FastAPI | Async performance; native ML compatibility |
| Deep Learning | PyTorch + PyTorch Geometric | Industry standard for GNNs |
| Vector DB | Milvus (simulated) | Low-latency semantic vector search |
| Metabolic Modeling | COBRApy | SBML parsing; Python native |
| Cheminformatics | RDKit | Valency validation; SMILES parsing |
| Frontend (MVP) | Streamlit | Rapid scientific prototyping |
| Frontend (Production) | Next.js | Scalable multi-tenant SaaS |
| Molecular Viz | Molstar (React) | WebGL-accelerated; PDB-grade |
| ELN Integration | eLabFTW API | Open-source; webhook support |
- Docker & Docker Compose
- Python 3.10+ (if running locally)
# Clone repository
cd CatalystAI
# Start services
docker-compose up --build
# Access endpoints
# - Streamlit Dashboard: http://localhost:8501
# - FastAPI Docs: http://localhost:8000/docs

# Backend
cd backend
pip install -r requirements.txt
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
# Frontend (separate terminal)
cd streamlit_app
pip install -r requirements.txt
streamlit run app.py --server.port=8501

The repository includes render.yaml for a two-service Render Blueprint:
- catalystai-backend: FastAPI service from backend/
- catalystai-streamlit: Streamlit service from streamlit_app/
Create a new Render Blueprint from the repo root and Render will use the correct subdirectory requirements files, start commands, Python version, and health checks. If you create the services manually instead, use:
# Backend
Root Directory: backend
Build Command: pip install --upgrade pip && pip install -r requirements.txt
Start Command: uvicorn app.main:app --host 0.0.0.0 --port $PORT
Health Check Path: /health
# Streamlit
Root Directory: streamlit_app
Build Command: pip install --upgrade pip && pip install -r requirements.txt
Start Command: streamlit run app.py --server.address 0.0.0.0 --server.port $PORT --server.headless true
Health Check Path: /_stcore/health

For a manual Streamlit service, set API_URL to the backend URL. For Blueprint deploys, API_HOSTPORT is wired automatically through Render's private network.
- POST /api/sessions: Create new discovery session
- POST /api/generate: Generate catalyst/enzyme candidates
- POST /api/rank: Rank by Expected Improvement
- POST /api/rank-by-score: Rank by predicted activity (baseline)
- POST /api/webhook/mock-eln: Log experimental results

- GET /api/literature/search: Semantic literature search
- GET /api/literature/baseline: Extract baseline performance

- POST /api/fba/simulate: Run flux balance analysis
- POST /api/fba/enzyme-impact: Predict enzyme engineering impact
- GET /api/fba/design-cocktail: Design optimal enzyme cocktail

- POST /api/biosecurity/screen: Screen single sequence
- POST /api/biosecurity/batch-screen: Batch screen candidates
- GET /api/biosecurity/compliance-report: TEVV compliance report

- POST /api/rank/acquisition: Rank with custom acquisition function
- GET /api/rank/compare-acquisition: Compare multiple strategies
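
A client call to the acquisition-ranking endpoint could be assembled with the standard library alone. The request body is not documented here, so the `session_id` and `acquisition` field names below are assumptions; the Pydantic models in schemas.py are the authority. The sketch builds the request without sending it.

```python
import json
from urllib import request

API_URL = "http://localhost:8000"  # local FastAPI address from the quick start


def build_rank_request(session_id: str, acquisition: str) -> request.Request:
    """Build (but do not send) a POST to /api/rank/acquisition.

    The JSON field names are assumptions, not the confirmed schema.
    """
    body = json.dumps({"session_id": session_id, "acquisition": acquisition})
    return request.Request(
        url=f"{API_URL}/api/rank/acquisition",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_rank_request("demo-session", "ucb")
print(req.get_method(), req.full_url)

# To send it against a running backend:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

The interactive docs at /docs show the actual request and response shapes once the backend is running.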
- Direction 1 (Chemical Catalysis): Generative design, GNN prediction, Bayesian optimization
- Direction 2 (Synthetic Biology): FBA, enzyme engineering, metabolic pathway design
- 5 acquisition functions (not just EI)
- Uncertainty quantification with epistemic variance
- Multi-objective Pareto optimization
- Literature-grounded generation (RAG)
- Mandatory biosecurity screening (TEVV)
- Toxin/pathogen homology detection
- Regulatory compliance documentation
- Audit trail for restricted agents
- Literature baseline extraction
- Physics-informed heuristic filters
- Stoichiometric constraint modeling (FBA)
- Experimental cost/time optimization
- Modular microservice design
- ELN webhook integration
- Model versioning system
- Multi-tenant data isolation (design)
- Start: Pre-loaded session with 10 catalyst candidates
- Literature: Extract baseline performance (70% activity from literature)
- Rank (EI): Bayesian ranking highlights high-uncertainty candidates
- Compare: Show how EI differs from greedy score ranking
- Log Result: Simulate lab measurement (85% activity achieved)
- Re-rank: Active learning updates best_so_far and reprioritizes the remaining candidates
- Improvement: Show a +15 percentage-point gain over the literature baseline (70% → 85%)
Optional: Switch to enzyme track β run FBA β screen biosecurity
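
The rank, log, and re-rank steps of the demo can be sketched end to end. The 70% baseline and 85% measurement come from the flow above; the candidate names and their predicted means and uncertainties are made up, and a real update would also retrain the surrogate rather than only raising `best_so_far`.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for maximization under a Gaussian prediction."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * _STD_NORMAL.cdf(z) + sigma * _STD_NORMAL.pdf(z)


def rank_by_ei(candidates: dict, best: float) -> list:
    """Candidate names sorted by descending EI."""
    return sorted(
        candidates,
        key=lambda name: expected_improvement(*candidates[name], best),
        reverse=True,
    )


# Hypothetical (predicted mean, predicted std) per candidate.
candidates = {"cat-01": (0.72, 0.02), "cat-02": (0.68, 0.15), "cat-03": (0.80, 0.05)}

baseline = 0.70                      # literature baseline
best_so_far = baseline
before = rank_by_ei(candidates, best_so_far)
greedy = sorted(candidates, key=lambda n: candidates[n][0], reverse=True)

# Log a simulated lab result for the top-ranked candidate.
tested, measured = before[0], 0.85
best_so_far = max(best_so_far, measured)
remaining = {n: v for n, v in candidates.items() if n != tested}
after = rank_by_ei(remaining, best_so_far)

print("EI ranking:    ", before)
print("greedy ranking:", greedy)
print("re-ranked:     ", after)
print(f"gain over baseline: {best_so_far - baseline:+.2f}")
```

EI and greedy agree on the top pick here but order the other two differently, because EI credits the high-uncertainty candidate for its upside.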
- Direct Impact: Accelerates 2G Ethanol-to-SAF catalyst discovery
- Economic: Reduces screening costs by prioritizing high-EI candidates
- Scalable: Handles both catalyst and enzyme tracks (paddy straw pretreatment + conversion)
- Completeness: Covers all 4 ML layers (Retrieval, Generation, Prediction, Biology)
- Innovation: First to combine RAG + Bayesian BO + FBA + TEVV in one platform
- Explainability: Side-by-side comparisons show why AI picks certain candidates
- Safety: Only submission with biosecurity screening layer
- MVP → Production Path: Clear architecture for scaling
- Scientific Validation: Built on proven methods (DiffCSP, GNoME, COBRApy)
- Regulatory Awareness: TEVV compliance from day one
- Open Catalyst Project (OCP): 260M+ DFT calculations
- Materials Project: Crystal structure database
- COBRApy: Constraint-based metabolic modeling
- RFdiffusion: Protein backbone generation
- BRENDA, KEGG, BiGG: Enzyme/pathway databases
- Live GNN surrogate on OCP data
- Real eLabFTW integration
- Multi-user collaboration + RBAC
- GPU cluster for live diffusion
- Next.js enterprise frontend
- Robotic lab hardware API
- Full audit trail + IP provenance
- CO₂ capture catalyst expansion