A VAST DataEngine serverless function that embeds enriched variants and stores them in VastDB for semantic search.
- Receives enriched variant records from `genomics-vcf-parser` via function chaining
- Generates 2048-dimensional vector embeddings for each variant description using NVIDIA NIM
- Skips re-embedding for variants that already have cached vectors (from the memoization layer)
- Bulk-inserts variant records with embeddings into the VastDB `variants` table (batches of 50)
- Updates the sample record: status → `completed`, sets `vcf_path` and `variant_count`
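The cache-skip and batching steps can be sketched as below. This is a minimal illustration, not the function's actual code: the `description` field name, the helper names `embed_missing` and `batches`, and the injected `embed_fn` callable are all hypothetical. Only the `vectors` column and the batch size of 50 come from this README.

```python
from typing import Callable, Dict, Iterator, List

BATCH_SIZE = 50  # batch size stated in this README


def embed_missing(variants: List[Dict], embed_fn: Callable[[List[str]], List[List[float]]]) -> int:
    """Fill in 'vectors' only for variants that have no cached embedding.

    Returns the number of variants that were actually sent for embedding.
    'description' is a hypothetical field name for the text being embedded.
    """
    pending = [v for v in variants if not v.get("vectors")]
    if pending:
        vectors = embed_fn([v["description"] for v in pending])
        if len(vectors) != len(pending):
            raise ValueError("embedding backend returned an unexpected number of vectors")
        for variant, vector in zip(pending, vectors):
            variant["vectors"] = vector
    return len(pending)


def batches(rows: List, size: int = BATCH_SIZE) -> Iterator[List]:
    """Yield consecutive slices of at most `size` rows for bulk insertion."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

Passing the embedding call in as `embed_fn` keeps the skip logic testable without a running NIM endpoint. Each list yielded by `batches` would then be handed to whatever bulk-insert call the function uses against the `variants` table.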
Configure in `deployments/dataengine-genomics-pipeline/genomics-ingest.yaml`:

| Key | Description |
|---|---|
| `embeddinghost` / `embeddingport` / `embeddinghttpscheme` | Self-hosted NIM embedding endpoint |
| `embeddingmodel` | Embedding model (default: `nvidia/llama-nemotron-embed-1b-v2`) |
| `embeddingdimensions` | Vector dimensions; must match the model output (default: 2048) |
| `use_api_catalog` | `true` = NVIDIA API Catalog, `false` = self-hosted NIM |
| `nvidia_api_key` | Required when `use_api_catalog: true` |

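One way these keys could be turned into an endpoint is sketched below. The `resolve_endpoint` helper is hypothetical, and several details are assumptions rather than documented behavior of this function: the API Catalog base URL, the `/v1` path suffix on the self-hosted endpoint, the `http` fallback scheme, and the placeholder key for self-hosted NIM.

```python
# Assumed base URL for the hosted NVIDIA API Catalog; verify against your account.
API_CATALOG_URL = "https://integrate.api.nvidia.com/v1"


def resolve_endpoint(cfg: dict) -> tuple:
    """Return (base_url, api_key) for the embedding client from the config keys."""
    if cfg.get("use_api_catalog"):
        key = cfg.get("nvidia_api_key")
        if not key:
            raise ValueError("nvidia_api_key is required when use_api_catalog is true")
        return API_CATALOG_URL, key

    # Self-hosted NIM: assemble the URL from scheme, host, and port.
    scheme = cfg.get("embeddinghttpscheme", "http")  # fallback scheme is an assumption
    host = cfg["embeddinghost"]
    port = cfg["embeddingport"]
    # A placeholder key is used because OpenAI-compatible clients expect a non-empty value.
    return f"{scheme}://{host}:{port}/v1", cfg.get("nvidia_api_key") or "not-used"
```

For example, `resolve_endpoint({"use_api_catalog": False, "embeddinghost": "nim.local", "embeddingport": 8000})` yields a base URL of `http://nim.local:8000/v1`, where `nim.local` is a made-up host name.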
- Trigger: Receives events from `genomics-vcf-parser` via function chaining (Kafka topic `genomics`)
- Input: Enriched variant list with sample and patient IDs
- Output: Variant rows with `vectors` column written to VastDB; sample marked `completed`
- Runtime: VAST DataEngine serverless runtime
- Image: `<your-registry>/genomic-engine-variant-processor:<tag>`
- Build: `vastde functions build` (Cloud Native Buildpacks, no Dockerfile)
- Dependencies: `vastdb`, `openai` (NIM-compatible client), Python 3.11
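Since the function reaches NIM through the `openai` client, the embedding call might look roughly like the sketch below. The helper names are hypothetical, and the `extra_body` fields (`input_type`, `truncate`) are an assumption based on how NVIDIA retrieval embedding models are commonly called; check them against the model you deploy. The dimension check mirrors the `embeddingdimensions` requirement in the config table.

```python
from typing import List, Sequence

DEFAULT_MODEL = "nvidia/llama-nemotron-embed-1b-v2"  # default from the config table
EXPECTED_DIMENSIONS = 2048                           # default from the config table


def embedding_request(texts: Sequence[str], model: str = DEFAULT_MODEL) -> dict:
    """Build keyword arguments for client.embeddings.create()."""
    return {
        "model": model,
        "input": list(texts),
        # Assumption: the model expects an input_type; "passage" marks text
        # that is being indexed rather than used as a search query.
        "extra_body": {"input_type": "passage", "truncate": "END"},
    }


def check_dimensions(vectors: List[List[float]], expected: int = EXPECTED_DIMENSIONS) -> List[List[float]]:
    """Raise if any vector length differs from the configured dimension."""
    for index, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(f"vector {index} has {len(vector)} dimensions, expected {expected}")
    return vectors


def embed_texts(texts: Sequence[str], base_url: str, api_key: str = "not-used") -> List[List[float]]:
    """Call an OpenAI-compatible embeddings endpoint and validate the result."""
    from openai import OpenAI  # imported lazily so the helpers above work without the package

    client = OpenAI(base_url=base_url, api_key=api_key)
    response = client.embeddings.create(**embedding_request(texts))
    return check_dimensions([item.embedding for item in response.data])
```

Failing fast on a dimension mismatch avoids writing rows whose `vectors` column cannot be compared with the rest of the table during semantic search.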