Releases: SamoraHunter/pat2vec
v0.3.1
Full Changelog: v0.3.0...v0.3.1
Release v0.3.0: Elasticsearch Testing & Data Safety
🚀 What's New in v0.3.0
This release focuses on hardening the testing pipeline and improving data safety when interacting with Elasticsearch.
✨ Highlights
🔍 Integrated Elasticsearch Testing
Developers can now validate their clinical pipelines against a real, temporary Elasticsearch instance inside Docker. This replaces static mocks with actual search behavior.
🧪 Automated Synthetic Data Seeding
Includes new utilities to generate and seed realistic patient timelines into test clusters, complete with automated schema management via elastic_schemas.json.
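As a rough illustration of what seeding a synthetic timeline involves, the sketch below builds bulk-style action dictionaries for one patient. It is not pat2vec's actual utility: the helper name `make_patient_timeline`, the index name `synthetic_observations`, and the fields `observation_datetime` and `value` are assumptions made for this example. The sketch only constructs the documents and does not connect to a cluster.

```python
import random
from datetime import datetime, timedelta


def make_patient_timeline(client_idcode, n_events, start, seed=0):
    """Build synthetic bulk actions for one patient (hypothetical helper)."""
    # Seed per patient so repeated test runs produce identical data.
    rng = random.Random(f"{seed}-{client_idcode}")
    actions = []
    timestamp = start
    for i in range(n_events):
        # Space events 1 to 30 days apart so the timeline is irregular.
        timestamp += timedelta(days=rng.randint(1, 30))
        actions.append(
            {
                "_index": "synthetic_observations",  # assumed index name
                "_id": f"{client_idcode}-{i}",
                "_source": {
                    "client_idcode": client_idcode,
                    "observation_datetime": timestamp.isoformat(),  # assumed field
                    "value": round(rng.gauss(100.0, 15.0), 1),  # assumed field
                },
            }
        )
    return actions


timeline = make_patient_timeline("P0001", n_events=5, start=datetime(2020, 1, 1))
```

Dictionaries with `_index`, `_id`, and `_source` keys match the action shape accepted by the bulk helpers in the official Elasticsearch Python client, so a list like `timeline` could be passed to them once a test cluster is available.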
🛡️ Data Ingestion Safety
Introduced strict guardrails that prevent accidental write operations to production Elasticsearch clusters during testing or development runs.
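One common way to build such a guardrail is a host allowlist checked before any write. The sketch below shows that pattern only; the release notes do not describe how pat2vec implements its guardrails, and the names `SAFE_WRITE_HOSTS`, `UnsafeWriteError`, and `assert_safe_to_write` are hypothetical.

```python
from urllib.parse import urlparse

# Hosts treated as safe for writes; an assumed allowlist for illustration.
SAFE_WRITE_HOSTS = {"localhost", "127.0.0.1", "elasticsearch-test"}


class UnsafeWriteError(RuntimeError):
    """Raised when a write targets a host outside the allowlist."""


def assert_safe_to_write(es_url):
    """Return the host if it is allowlisted, otherwise refuse the write."""
    host = urlparse(es_url).hostname
    if host not in SAFE_WRITE_HOSTS:
        raise UnsafeWriteError(f"refusing to write to non-test host: {host!r}")
    return host
```

Calling `assert_safe_to_write(url)` at the top of every seeding or indexing function makes the failure loud and early, before any request is sent.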
🤖 CI/CD Enhancements
Full support for running GitHub Actions workflows locally (via act), making it easier to debug complex notebook-based tests before pushing.
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Release v0.2.0
Database Backend Implementation
This release introduces a robust database backend using SQLAlchemy, which replaces the legacy file-based system as the default storage mechanism.
New Features
- Database Support: Added support for SQLite (default) and PostgreSQL.
  - Defaults to a local `{project_name}.db` SQLite database if no connection string is provided.
  - Supports in-memory SQLite for testing.
- Schema Management: The pipeline now handles automatic table creation and schema updates for:
  - Raw Data: `raw_data` tables (e.g., `raw_data.raw_bloods`).
  - Annotations: MedCAT annotations tables.
  - Features: Feature vectors with JSON serialization for sparse/high-dimensional data.
- Migration Utility: Added `pat2vec/util/migrate_to_db.py` to migrate existing file-based projects to the new database structure.
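To make the JSON-serialized feature storage concrete, here is a minimal sketch using an in-memory SQLite database. It uses the standard-library `sqlite3` module rather than SQLAlchemy to stay dependency-free, and the table name `features`, its columns, and the feature names are assumptions, not pat2vec's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory SQLite, convenient for tests
conn.execute(
    "CREATE TABLE IF NOT EXISTS features ("
    "client_idcode TEXT, feature_date TEXT, vector_json TEXT)"
)

# Store only the populated entries of a sparse feature vector.
sparse_vector = {"bmi_mean": 27.4, "hba1c_count": 3}
conn.execute(
    "INSERT INTO features VALUES (?, ?, ?)",
    ("P0001", "2020-01-01", json.dumps(sparse_vector)),
)

# Read the row back and decode the JSON column into a dict.
row = conn.execute(
    "SELECT vector_json FROM features WHERE client_idcode = ?", ("P0001",)
).fetchone()
restored = json.loads(row[0])
```

Keeping the vector in a single JSON column avoids creating one database column per feature, which is the usual motivation for this layout when vectors are wide and mostly empty.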
Configuration Changes
- Added `storage_backend` option to `config_class` (values: `'database'`, `'file'`).
- Added `db_connection_string` option to `config_class`.
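The interaction between these two options and the SQLite default can be sketched as follows. `StorageSettings` is a stand-in written for this example and is not pat2vec's `config_class`, whose constructor is not shown in these notes; `resolve_connection_string` is likewise hypothetical, and the exact default path pat2vec builds may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageSettings:
    """Stand-in for the storage-related options; not pat2vec's config_class."""

    project_name: str
    storage_backend: str = "database"  # 'database' or 'file'
    db_connection_string: Optional[str] = None


def resolve_connection_string(settings):
    """Return the connection URL to use, or None for the file backend."""
    if settings.storage_backend not in ("database", "file"):
        raise ValueError(f"unknown storage_backend: {settings.storage_backend!r}")
    if settings.storage_backend == "file":
        return None
    if settings.db_connection_string:
        return settings.db_connection_string
    # Fall back to a local SQLite file named after the project.
    return f"sqlite:///{settings.project_name}.db"
```

The `sqlite:///name.db` form is the SQLAlchemy URL syntax for a SQLite file at a relative path; a PostgreSQL deployment would supply its own URL through `db_connection_string` instead.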
Technical Improvements
- Centralized Data Retrieval: Implemented `get_df_from_db` and updated `retrieve_patient_data` to abstract data access.
- Performance: Implemented batch insertion and automatic index creation on primary keys (e.g., `client_idcode`, timestamps) to improve query performance.
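The batch insertion and indexing pattern looks roughly like the sketch below. It uses the standard-library `sqlite3` module instead of SQLAlchemy, and the table layout, the `sample_time` and `result` columns, and the index name are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_bloods (client_idcode TEXT, sample_time TEXT, result REAL)"
)

rows = [
    ("P0001", "2020-01-01T09:00:00", 5.1),
    ("P0001", "2020-02-01T09:00:00", 5.4),
    ("P0002", "2020-01-15T10:30:00", 6.2),
]

# One executemany call in a single transaction instead of row-by-row commits.
with conn:
    conn.executemany("INSERT INTO raw_bloods VALUES (?, ?, ?)", rows)
    # Index the columns used for patient and time-window lookups.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_raw_bloods_patient_time "
        "ON raw_bloods (client_idcode, sample_time)"
    )

count = conn.execute(
    "SELECT COUNT(*) FROM raw_bloods WHERE client_idcode = ?", ("P0001",)
).fetchone()[0]
```

A composite index on the patient identifier and timestamp serves the typical access pattern of fetching one patient's records within a date range.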