Leanda.io is an extensible open science data repository that enables researchers to consume, process, visualize, and analyze diverse scientific data types, formats, and volumes. Unlike traditional file stores or narrow-purpose databases, it features a modular microservices architecture designed for seamless extension with new domain-specific services.
Leanda.io addresses key deficiencies in existing open science tools by providing:
- Real-time automated and manual data curation with AI-powered metadata extraction
- Ontology-based property assignment and complex semantic searches
- On-the-fly data mining, text extraction, and format conversion during deposition
- Granular security model supporting private, shared, and public data
- Rapid ML training dataset composition from integrated sources and processed data
- Embedded ML framework for research and drug discovery pipelines
Leanda.io handles a wide range of scientific formats with automatic import/export conversions:
- Generic images (PNG, GIF, TIFF, BMP)
- Documents (PDF, MS Office, OpenOffice)
- Tabular data (CSV, TSV, Excel)
- Chemical structures (SDF, MOL, SMILES, CDX)
- Chemical reactions (RXN)
- Crystallographic data (CIF)
- Spectra (JDX)
- Microscopy imaging files
- Machine learning models & weights
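As a rough illustration of how the formats above might be dispatched to the planned parser services, here is a minimal Python sketch. Nothing like it exists in the repository yet: `route_file` and `EXTENSION_TO_SERVICE` are hypothetical names, the service names are taken from the planned services table in this README, and the file extensions chosen for each format are assumptions.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-service table. Service names match the planned
# services in this README; the extension mapping itself is an assumption.
EXTENSION_TO_SERVICE = {
    ".sdf": "chemical-parser",
    ".mol": "chemical-parser",
    ".rxn": "reaction-parser",
    ".cif": "crystal-parser",
    ".jdx": "spectra-parser",
    ".png": "imaging",
    ".gif": "imaging",
    ".tiff": "imaging",
    ".bmp": "imaging",
    ".docx": "office-processor",
    ".odt": "office-processor",
}


def route_file(filename: str) -> Optional[str]:
    """Return the planned service for a file, or None if no parser applies."""
    # Compare case-insensitively so "aspirin.SDF" and "aspirin.sdf" match.
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_SERVICE.get(suffix)
```

For example, `route_file("aspirin.SDF")` returns `"chemical-parser"`, while a file with an unlisted extension returns `None` and would presumably be stored without domain-specific parsing.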
The infrastructure is designed, the documentation is comprehensive, and the API contracts are defined, but no code is implemented or runnable yet. The ~92% completion figure refers to design and planning work, not implementation.
See Project Summary Report for detailed progress.
Journey: The Journey Master Summary gives a single narrative from Week 1 through Week 5 (discovery, reality check, front-end progress, tool limits, and the DynamoDB/S3 decision). It summarizes wins and lessons, and links to all journey docs and images.
Current repository layout:
leanda-ng/
├── LICENSE
├── infrastructure/ # CDK config (cdk.json, package.json, tsconfig.json), iam/ (IAM policy JSONs)
├── docs/ # architecture, adr, deployment, security, monitoring, testing, finops, agents, journey, phases, frontend, infrastructure
├── shared/ # contracts/ (AsyncAPI events, blob-storage-api), specs/ (OpenAPI core-api, models, events, implementation, tests)
├── docker/ # docker-compose.yml, Grafana/Prometheus config, test runners (not runnable without services)
└── scripts/ # agents/ (QA and automation scripts)
Planned directories (not yet present):
leanda-ng/
├── services/ # Java/Quarkus microservices (core-api, parsers, blob-storage, etc.)
├── frontend/ # Angular 21 app
├── ml-services/ # Python/FastAPI ML pipelines
└── tests/ # Integration and E2E
When implementation begins, you will need:
- Java 21 LTS (for backend services)
- Python 3.12+ (for ML services)
- Node.js 20+ (for frontend and CDK)
- Docker & Docker Compose (for local development)
- AWS CLI v2 (for infrastructure deployment)
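The version minimums above can be checked with a short script. The following is a sketch under stated assumptions, not a tool shipped with the repository: the names `REQUIREMENTS`, `parse_version`, `meets_minimum`, and `check_tool` are hypothetical, and it assumes each tool is invoked as shown and that the first dotted number in its output is the version.

```python
import re
import shutil
import subprocess
from typing import Tuple

# Minimum versions taken from the prerequisites list above. The commands
# used to query each tool are assumptions about a typical installation.
REQUIREMENTS = {
    "java": ((21,), ["java", "-version"]),
    "python": ((3, 12), ["python3", "--version"]),
    "node": ((20,), ["node", "--version"]),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted number from text, e.g. 'v20.11.0' -> (20, 11, 0)."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found: Tuple[int, ...], minimum: Tuple[int, ...]) -> bool:
    """Compare only as many components as the minimum specifies."""
    return found[: len(minimum)] >= minimum


def check_tool(name: str) -> bool:
    """Return True if the tool is on PATH and reports a sufficient version."""
    minimum, command = REQUIREMENTS[name]
    if shutil.which(command[0]) is None:
        return False
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # `java -version` prints to stderr, so both streams are inspected.
    return meets_minimum(parse_version(result.stdout + result.stderr), minimum)
```

Calling `check_tool("java")` returns `False` both when Java is missing and when it is older than 21. Docker and the AWS CLI are left out here but could be added to `REQUIREMENTS` in the same way.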
- ✅ Infrastructure design complete (AWS CDK stacks designed)
- ✅ Documentation complete (architecture, ADRs, security, deployment guides)
- ✅ API contracts defined (OpenAPI/AsyncAPI specifications)
- ⏳ Service implementation (not started)
- ⏳ Frontend implementation (not started)
- ⏳ ML services implementation (not started)
See the Development Journey for progress updates.
| Layer | Technology | Description |
|---|---|---|
| Frontend | Angular 21 | Zoneless architecture, Signal Forms |
| Backend | Java 21, Quarkus 3.17+ | Cloud-native microservices |
| ML Services | Python 3.12+, FastAPI | ML pipelines and inference |
| Database | MongoDB 7.0 | DocumentDB compatible |
| Cache | Redis 7.2 | Session and data caching |
| Messaging | Redpanda | Kafka-compatible streaming |
| Search | OpenSearch 2.11 | Full-text and vector search |
| Storage | MinIO | S3-compatible object storage |
| Infrastructure | AWS CDK | Infrastructure as Code |
| Monitoring | Prometheus, Grafana | Metrics and dashboards |
- Architecture Overview - Project structure and design
- Modernization Plan - Migration strategy
- Engineering Strategy - Lakehouse approach
- Journey Master Summary - Week 1–5 narrative, wins, lessons, and links to all journey docs
- Development Journey - Progress logs
- Phase Documentation - Migration phase details
- Infrastructure (CDK) - Stacks and deployment
More: Cloud architecture · ADRs · Deployment · Security · Monitoring · FinOps · Testing · Agents / coordination · Journey
Note: These services are planned but not yet implemented. See API contracts in shared/contracts/ and shared/specs/ for specifications.
| Service | Port | Description | Status |
|---|---|---|---|
| core-api | 8080 | User management, events, WebSocket | ⏳ Planned |
| blob-storage | 8084 | File storage and retrieval | ⏳ Planned |
| chemical-parser | 8083 | Parse SDF, MOL files | ⏳ Planned |
| chemical-properties | 8086 | Calculate molecular properties | ⏳ Planned |
| reaction-parser | 8087 | Parse RXN files | ⏳ Planned |
| crystal-parser | 8089 | Parse CIF files | ⏳ Planned |
| spectra-parser | 8090 | Parse JDX files | ⏳ Planned |
| imaging | 8091 | Image processing | ⏳ Planned |
| office-processor | 8088 | Office document conversion | ⏳ Planned |
| metadata-processing | 8098 | Metadata extraction | ⏳ Planned |
| indexing | 8099 | OpenSearch indexing | ⏳ Planned |
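Since none of these services run yet, the port plan can only be checked on paper. The sketch below copies the ports from the table and verifies that no two services share one; `PLANNED_PORTS`, `duplicate_ports`, and `base_url` are hypothetical names, and `localhost` is an assumed host for local development.

```python
from collections import Counter
from typing import Dict, List

# Planned ports copied from the table above; nothing listens on them yet.
PLANNED_PORTS: Dict[str, int] = {
    "core-api": 8080,
    "blob-storage": 8084,
    "chemical-parser": 8083,
    "chemical-properties": 8086,
    "reaction-parser": 8087,
    "crystal-parser": 8089,
    "spectra-parser": 8090,
    "imaging": 8091,
    "office-processor": 8088,
    "metadata-processing": 8098,
    "indexing": 8099,
}


def duplicate_ports(ports: Dict[str, int]) -> List[int]:
    """Return ports assigned to more than one service, sorted ascending."""
    counts = Counter(ports.values())
    return sorted(port for port, n in counts.items() if n > 1)


def base_url(service: str, host: str = "localhost") -> str:
    """Build a base URL for a planned service (the host is an assumption)."""
    return f"http://{host}:{PLANNED_PORTS[service]}"
```

With the table as written, `duplicate_ports(PLANNED_PORTS)` returns an empty list, so the current plan has no port conflicts.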
- ✅ Infrastructure Design: AWS CDK stacks designed (9 stacks)
- ✅ Documentation: Comprehensive architecture, ADRs, security, deployment guides
- ✅ API Contracts: OpenAPI/AsyncAPI specifications defined
- ✅ Architecture Decisions: 12 ADRs documented
- ✅ Planning: Multi-agent coordination framework designed
- ⏳ Phase 1: Core services implementation (not started)
- ⏳ Phase 2: Domain parsers implementation (not started)
- ⏳ Phase 3: ML services implementation (not started)
- ⏳ Phase 4: Frontend implementation (not started)
- ⏳ Phase 5: Infrastructure deployment and testing (not started)
Note: ~92% refers to design/planning work, not implementation. See Project Summary Report for details.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please see:
- Architecture Documentation - Project structure
- Development Guidelines - Coding standards
Leanda.io was developed by the ArqiSoft team and, before that, by the Science Data Software team. This modernization effort aims to revitalize the platform for the open science community using modern AWS-native technologies and best practices.
Built with care for the open science community