Merged
23 changes: 22 additions & 1 deletion docs/assay-schema.md
@@ -43,6 +43,11 @@ analyses:
      - "g1-site-1.yaml"
      - "g1-site-2.yaml"
      - "g2-site.yaml"
    assets:
      - id: "variants"
        path: "variants.tsv"
      - id: "findings"
        path: "findings.tsv"
    emits:
      - key: "apol1_status"
        label: "APOL1 status"
@@ -83,11 +88,27 @@ Rules:
- `kind` is currently `bioscript`
- `path` points to a BioScript-compatible Python file
- `output_format` is optional and defaults to `tsv`; use `json` or `jsonl` when the script writes structured JSON output
- `derived_from` lists the variant YAML files used by the interpretation
- `derived_from` lists the variant YAML or variant catalogue files used by the interpretation
- `assets` is optional and lists local files the analysis script can read through `bioscript.context["assets"]`; `asset_paths` is also injected during migration
- `emits` is optional but recommended so report generators know which output columns to display and how to label them
- `logic` is optional; use `logic.description` and `logic.source.url` to document where the script's derivation rules came from
- Analysis rows may emit `notes` or `report_notes` as a reporting convention. HTML reports render those notes below the analysis table and omit them from the table columns; this avoids a manifest-level template language while still letting the script build human-readable text from computed values.

Analysis scripts receive these injected variables:

- `input_file`: virtual input genotype/VCF/CRAM path, normally `/input/genotypes`
- `output_file`: virtual output path the script must write, normally under `/output`
- `participant_id`: current participant/sample identifier
- `observations_file`: virtual TSV path containing the variant observations already gathered from assay/panel members for this participant
- `asset_paths`: dict from each `assets[].id` to its virtual path

The preferred API is `bioscript.context`, which includes `participant_id`,
`input_files`, `pipeline_files`, `assets`, `observations_file`, and
`output_file`. Scripts should use `bioscript.read_tsv(path)` for TSV assets and
observation files.

This design supports large catalogues: Rust/BioScript first gathers all variant observations, then Python joins those observations to attached TSV assets such as `findings.tsv`, `conditions.tsv`, or `rules.tsv` and emits derived classification rows.
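
The gather-then-join flow can be sketched as follows. This is a minimal illustration, not the BioScript implementation: the column names (`variant_id`, `genotype`, `label`) and the sample rows are hypothetical, and the standard `csv` module stands in for `bioscript.read_tsv` so the sketch runs outside the BioScript runtime. Inside a real analysis script the paths would come from `bioscript.context`.

```python
import csv
import io


def read_tsv(handle):
    """Read a TSV stream into a list of dict rows (stand-in for bioscript.read_tsv)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def join_observations(observations, findings, key="variant_id"):
    """Attach every finding row to each observation that shares the same key."""
    by_variant = {}
    for finding in findings:
        by_variant.setdefault(finding[key], []).append(finding)
    joined = []
    for obs in observations:
        for finding in by_variant.get(obs[key], []):
            # Observation fields win on name clashes; both rows share the key.
            joined.append({**finding, **obs})
    return joined


# Hypothetical sample data; a real script would read observations_file and an asset path.
observations = read_tsv(io.StringIO("variant_id\tgenotype\nv1\tA/G\nv2\tC/C\n"))
findings = read_tsv(io.StringIO("variant_id\tlabel\nv1\tcarrier\n"))
rows = join_observations(observations, findings)
```

Observations without a matching finding (here `v2`) produce no derived row; a script that needs to report them would have to handle that case explicitly.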

## Findings

Use `findings` for evidence that binds either to a variant observation or an emitted analysis value. Keep the executable logic in `analyses`; keep PGx evidence and reporting semantics in YAML.
154 changes: 154 additions & 0 deletions docs/variant-catalogue-schema.md
@@ -0,0 +1,154 @@
# Variant Catalogue Schema

For the assay/runtime design, including direct catalogue members and Python
analysis assets, see [variant-catalogue.md](variant-catalogue.md).

This schema describes a compact, auditable source catalogue for projects with
many variants. It is intended for curation and packaging. Current BioScript
runtime lookup still consumes normal `bioscript:variant:1.0` variant manifests;
a packaging step can expand this catalogue into per-variant YAML files.

Use `bioscript:variant-catalogue:1.0` when a project has more than about 10
variants, or when repeated provenance and per-file boilerplate make review
harder than a tabular source.

## Shape

```yaml
schema: "bioscript:variant-catalogue:1.0"
version: "1.0"
name: "thalassemia-variants"

variants:
  source: "variants.tsv"
  format: "tsv"
  key: "variant_id"
  columns:
    variant_id:
      role: "id"
      required: true
    rsid:
      role: "identifier.rsid"
    alts:
      role: "alleles.alts"
      list_separator: "|"
    grch38_pos:
      role: "coordinates.grch38.pos"
      type: "integer"

findings:
  source: "findings.tsv"
  format: "tsv"
  key: "variant_id"
  columns:
    variant_id:
      role: "variant.id"
      required: true
    finding_id:
      role: "finding.id"
    alt:
      role: "finding.alt"
    notes:
      role: "finding.notes"

provenance:
  sources:
    - id: "ithagenes"
      kind: "database"
      label: "IthaGenes"
      url: "https://www.ithanet.eu/db/ithagenes?action=list"
    - id: "dbsnp"
      kind: "database"
      label: "dbSNP"
      url_template: "https://www.ncbi.nlm.nih.gov/snp/{rsid}"
```

## Required Fields

- `schema`: must be `bioscript:variant-catalogue:1.0`
- `version`: must be `1.0`
- `name`
- `variants.source`

`variants.format` and `findings.format`, when present, must be `tsv`.
`variants.columns` and `findings.columns` are recommended for real catalogues so
tools can validate and interpret TSV columns without relying on hardcoded column
names.
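
The required-field rules above could be checked with a small validator along these lines. This is a sketch under assumptions: it operates on a manifest that has already been parsed into a plain dict, and the function name `validate_catalogue` and its message strings are hypothetical rather than part of BioScript.

```python
SCHEMA_ID = "bioscript:variant-catalogue:1.0"


def validate_catalogue(manifest):
    """Return a list of problems found in a parsed catalogue manifest (a plain dict)."""
    problems = []
    if manifest.get("schema") != SCHEMA_ID:
        problems.append(f"schema must be {SCHEMA_ID}")
    if manifest.get("version") != "1.0":
        problems.append("version must be 1.0")
    if not manifest.get("name"):
        problems.append("name is required")
    variants = manifest.get("variants") or {}
    if not variants.get("source"):
        problems.append("variants.source is required")
    # format is optional, but only tsv is allowed when it is present
    for section in ("variants", "findings"):
        fmt = (manifest.get(section) or {}).get("format")
        if fmt is not None and fmt != "tsv":
            problems.append(f"{section}.format must be tsv")
    return problems


# Hypothetical manifests used only to exercise the checks.
good = {"schema": SCHEMA_ID, "version": "1.0", "name": "example",
        "variants": {"source": "variants.tsv", "format": "tsv"}}
bad = {"schema": SCHEMA_ID, "version": "1.0", "name": "example",
       "variants": {"format": "csv"}}
```

Returning a list of messages rather than raising on the first failure lets a curation tool report every problem in one pass.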

## Recommended Files

Keep files separate during curation:

- `variants.yaml`: catalogue manifest and shared provenance
- `variants.tsv`: biological identity and matching fields
- `findings.tsv`: one interpretation row per finding

Generate per-variant `bioscript:variant:1.0` YAML as a packaging/build artifact.

## `variants.tsv`

Recommended columns:

```text
variant_id name gene rsid aliases kind ref alts observed_alts grch37_chrom grch37_pos grch37_start grch37_end grch38_chrom grch38_pos grch38_start grch38_end
```

Use `|` inside list cells such as `aliases`, `alts`, and `observed_alts`.
Use `pos` columns for SNVs and `start`/`end` columns for spans.

Declare the semantic mapping in `variants.columns`. Recommended roles include:

```text
id
name
gene
identifier.rsid
identifier.aliases
alleles.kind
alleles.ref
alleles.alts
alleles.observed_alts
coordinates.grch37.chrom
coordinates.grch37.pos
coordinates.grch37.start
coordinates.grch37.end
coordinates.grch38.chrom
coordinates.grch38.pos
coordinates.grch38.start
coordinates.grch38.end
```
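
One way a tool might apply a `variants.columns` mapping to a TSV row is sketched below. The source does not specify how roles are interpreted, so treating a dotted role as a nested path is an assumption, as are the helper names `set_path` and `map_row` and the placeholder row values.

```python
def set_path(target, dotted, value):
    """Store value in a nested dict at a dotted path such as 'coordinates.grch38.pos'."""
    parts = dotted.split(".")
    for part in parts[:-1]:
        target = target.setdefault(part, {})
    target[parts[-1]] = value


def map_row(row, columns):
    """Convert one TSV row (dict of strings) into a nested dict keyed by role."""
    out = {}
    for name, spec in columns.items():
        raw = row.get(name, "")
        if raw == "":
            if spec.get("required"):
                raise ValueError(f"missing required column value: {name}")
            continue
        if "list_separator" in spec:
            value = raw.split(spec["list_separator"])
        elif spec.get("type") == "integer":
            value = int(raw)
        else:
            value = raw
        set_path(out, spec["role"], value)
    return out


# Column specs mirror the Shape example; the row values are placeholders.
columns = {
    "variant_id": {"role": "id", "required": True},
    "rsid": {"role": "identifier.rsid"},
    "alts": {"role": "alleles.alts", "list_separator": "|"},
    "grch38_pos": {"role": "coordinates.grch38.pos", "type": "integer"},
}
row = {"variant_id": "v1", "rsid": "rs0000", "alts": "A|T", "grch38_pos": "12345"}
variant = map_row(row, columns)
```

Because the mapping is driven entirely by the declared roles, the TSV column names can change without touching the code, which is the stated reason for declaring `columns` in the manifest.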

## `findings.tsv`

Recommended columns:

```text
variant_id finding_id schema alt label summary notes
```

Additional source-specific columns such as `itha_id`, `functionality`,
`phenotype`, `transcript_hgvs`, `genomic_hgvs`, and `ncbi_spdi` are allowed as
curation data. The packaging tool can combine those atoms into standard variant
finding entries.

Each `findings.tsv.variant_id` should match a row in `variants.tsv`. Each
allele-specific `alt` should be present in the corresponding variant row's
`alts`, unless `alt` is `*`.
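
The consistency rule above can be sketched as a cross-check between the two tables. This is an illustrative helper, not a BioScript function: `check_findings` and its message strings are hypothetical, rows are assumed to be dicts of strings as produced by a TSV reader, and an empty `alt` is assumed to mean the finding is not allele-specific.

```python
def check_findings(variant_rows, finding_rows, sep="|"):
    """Return messages for findings that reference unknown variants or alts."""
    alts_by_id = {
        v["variant_id"]: set(filter(None, v.get("alts", "").split(sep)))
        for v in variant_rows
    }
    problems = []
    for f in finding_rows:
        vid = f["variant_id"]
        if vid not in alts_by_id:
            problems.append(f"{f['finding_id']}: unknown variant_id {vid}")
            continue
        alt = f.get("alt", "")
        # An empty alt is treated as not allele-specific; '*' matches any alt.
        if alt and alt != "*" and alt not in alts_by_id[vid]:
            problems.append(f"{f['finding_id']}: alt {alt} not in alts of {vid}")
    return problems


# Hypothetical rows covering a match, a wildcard, a bad alt, and an unknown variant.
variants = [{"variant_id": "v1", "alts": "A|T"}]
findings = [
    {"variant_id": "v1", "finding_id": "f1", "alt": "A"},
    {"variant_id": "v1", "finding_id": "f2", "alt": "*"},
    {"variant_id": "v1", "finding_id": "f3", "alt": "G"},
    {"variant_id": "v2", "finding_id": "f4", "alt": "A"},
]
problems = check_findings(variants, findings)
```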

Declare the semantic mapping in `findings.columns`. Recommended roles include:

```text
variant.id
finding.id
finding.schema
finding.alt
finding.label
finding.summary
finding.notes
source.itha_id
source.functionality
source.phenotype
source.transcript_hgvs
source.genomic_hgvs
source.ncbi_spdi
```
26 changes: 26 additions & 0 deletions docs/variant-catalogue-schema.yaml
@@ -0,0 +1,26 @@
schema: "bioscript:variant-catalogue:1.0"
version: "1.0"
name: "example-variant-catalogue"
label: "Example Variant Catalogue"
summary: "Compact source catalogue for projects with many variants."

variants:
  source: "variants.tsv"
  format: "tsv"
  key: "variant_id"

findings:
  source: "findings.tsv"
  format: "tsv"
  key: "variant_id"

provenance:
  sources:
    - id: "dbsnp"
      kind: "database"
      label: "dbSNP"
      url_template: "https://www.ncbi.nlm.nih.gov/snp/{rsid}"
    - id: "ncbi_variation"
      kind: "database"
      label: "NCBI Variation Services"
      url_template: "https://api.ncbi.nlm.nih.gov/variation/v0/hgvs/{genomic_hgvs}/contextuals"