Merged
21 commits
f499b53
KL/AT spec: update definitions to latest in biolink-model
colleenXu May 14, 2026
22da0d4
KL/AT spec: minor grammar, typo fixes
colleenXu May 14, 2026
f6433b6
KL/AT spec: put example in json code block
colleenXu May 14, 2026
9480daa
old qualifier spec: update, change wording of rules
colleenXu May 14, 2026
37b7e44
old qualifier spec: put examples in yaml code blocks, spacing
colleenXu May 14, 2026
131df4b
old query spec: adjust asyncquery requirements
colleenXu May 14, 2026
f243e6c
old query spec: remove null mentions
colleenXu May 14, 2026
c9a532f
old query spec: fix typos, minor wording/format changes
colleenXu May 14, 2026
d694f76
old query spec: adjust QNode.categories requirements
colleenXu May 14, 2026
d80a0dd
old query spec: update predicate examples, fix typos, minor wording/f…
colleenXu May 14, 2026
b4bf412
old sources spec: update title, fix typos, minor wording/format changes
colleenXu May 14, 2026
83195fa
pathfinder spec: fix a few typos
colleenXu May 14, 2026
52ad10c
old publications spec: fix typos, update info, minor wording/format c…
colleenXu May 14, 2026
831e9fc
Update docs/reference.md [no ci]
colleenXu May 14, 2026
778ef7c
Fix formatting of manual validation agent type
edeutsch May 14, 2026
bdd940c
Fix formatting of manual_validation_of_automated_agent entry
edeutsch May 14, 2026
70bd3c5
old sources spec: add italics back
colleenXu May 14, 2026
f4eaee2
old sources spec: take out extra spaces
colleenXu May 14, 2026
cd56abd
Merge branch '2.0' into 2.0-specification-update
colleenXu May 14, 2026
7f65bba
Merge branch '2.0' into 2.0-specification-update
colleenXu May 14, 2026
edc3746
Update docs/reference.md [no ci]
colleenXu May 14, 2026
@@ -10,56 +10,59 @@

These properties are complemented by a larger model that is being developed to support a more detailed representation of evidence and provenance metadata, and will be documented elsewhere.

The scope of this initial specification covers only **Agent Type** and **Knowledge Level** metadata - which describes the type of knowledge expressed in an edge based on type of agent that originally generated a statement of knowledge encoded in an Edge, and the level of knowledge expressed in an Edge. It is complemented by a [Supplemental Guidance document](https://docs.google.com/document/d/140dtM5CjWM97JiBRdAmDT-9IKqHoOj-xbE_5TWkdYqg/edit) that provides detailed examples implementation support.
The scope of this initial specification covers only **Agent Type** and **Knowledge Level** metadata - which describes the type of knowledge expressed in an edge based on the type of agent that originally generated a statement of knowledge encoded in an Edge, and the level of knowledge expressed in an Edge. It is complemented by a [Supplemental Guidance document](https://docs.google.com/document/d/140dtM5CjWM97JiBRdAmDT-9IKqHoOj-xbE_5TWkdYqg/edit) that provides detailed examples and implementation support.

## Biolink properties and enumerations
Biolink defines hte following properties and enumerations for classifying agent type and knowledge level, which are used to annotate individual edges in knowledge graphs.
Biolink defines the following properties and enumerations for classifying agent type and knowledge level, which are used to annotate individual edges in knowledge graphs.

### Agent Type
**Biolink Edge Property**:
- `agent_type`: describes the high-level category of agent that originally generated a statement of knowledge or other type of information. Permissible values are provided by the **biolink:AgentTypeEnum** enumeration, as defined below:
- `agent_type`: Describes the high-level category of agent who originally generated a statement of knowledge or other type of information. Permissible values are provided by the **biolink:AgentTypeEnum** enumeration, as defined below:

**Permissible Values**:

- `manual_agent`: a human agent is responsible for generating the knowledge expressed in the Edge. The human may utilize computationally generated information as evidence for the resulting knowledge, but the human is the one who ultimately generates this knowledge.
- `automated_agent`: an automated agent, typically a software program or tool, is responsible for generating the knowledge expressed in the Edge. Human contribution to the knowledge creation process ends with the definition and coding of algorithms or analysis pipelines that get executed by the automated agent.
- `data_analysis_pipeline`: an automated agent that executes an analysis workflow over data and reports results in an Edge. These typically report statistical associations/correlations between variables in the input data.
- `computational_model`: an automated agent that generates knowledge (typically predictions) based on rules/logic explicitly encoded in an algorithm (e.g. heuristic models, supervised classifiers), or learned from patterns observed in data (e.g. ML models, unsupervised classifiers).
- `text_mining_agent`: an automated agent that uses Natural Language Processing to recognize concepts and/or relationships in text, and generates Edges relating these concepts with formally encoded semantics.
- `image_processing_agent`: an automated agent that processes images to generate Edges reporting knowledge derived from the image and/or expressed in text the image depicts (e.g. via OCR).
- ` manual_validation_of_ automated_agent`: a human agent reviews and validates/approves the veracity of knowledge that is initially generated by an automated agent.
- `not_provided`: the agent type is not provided, typically because it cannot be determined from available information if the agent that generated the knowledge is manual or automated.
- `manual_agent`: A human agent who is responsible for generating a statement of knowledge. The human may utilize computationally generated information as evidence for the resulting knowledge, but the human is the one who ultimately interprets/reasons with this evidence to produce a statement of knowledge.
- `automated_agent`: An automated agent, typically a software program or tool, that is responsible for generating a statement of knowledge. Human contribution to the knowledge creation process ends with the definition and coding of algorithms or analysis pipelines that get executed by the automated agent.
- `data_analysis_pipeline`: An automated agent that executes an analysis workflow over data and reports the direct results of the analysis. These typically report statistical associations/correlations between variables in the input dataset, and do not interpret/infer broader conclusions from associations the analysis reveals in the data.
- `computational_model`: An automated agent that generates knowledge statements (typically predictions) based on rules/logic explicitly encoded in an algorithm (e.g. heuristic models, supervised classifiers), or learned from patterns observed in data (e.g. ML models, unsupervised classifiers).
- `text_mining_agent`: An automated agent that uses Natural Language Processing to recognize concepts and/or relationships in text, and report them using formally encoded semantics (e.g. as an edge in a knowledge graph).
- `image_processing_agent`: An automated agent that processes images to generate textual statements of knowledge derived from the image and/or expressed in text the image depicts (e.g. via OCR).
- `manual_validation_of_automated_agent`: A human agent reviews and validates/approves the veracity of knowledge that is initially generated by an automated agent.
- `not_provided`: The agent type is not provided, typically because it cannot be determined from available information if the agent that generated the knowledge is manual or automated.

Note that this property indicates the type of agent who *produced a final statement of knowledge*, which is often different from the agent or agents who produced *information used as evidence* to support the generation of this knowledge. For example, if a human curator concludes that a particular gene variant causes a medical condition - based on their interpretation of information produced by computational modeling tools, automated statistical analyses, and robotic laboratory assay systems - the agent type for this statement is "manual_agent" despite all of the evidence being created by automated agents. But if any of these systems is programmed to generate knowledge statements directly and without human assistance, the statement would be attributed to an "automated_agent".
Note that this property indicates the type of agent who *produced a final statement of knowledge*, which is often different from the agent or agents who produced *information used as evidence* to support generation of this knowledge. For example, if a human curator concludes that a particular gene variant causes a medical condition - based on their interpretation of information produced by computational modeling tools, automated data analysis pipelines, and robotic laboratory assay systems - the agent_type for this statement is "manual_agent" - despite all of the evidence being created by automated agents. But if any of these systems is programmed to generate knowledge statements directly and without human assistance, the statement would be attributed to an "automated_agent".

### Knowledge Level
**Biolink Edge Property**:
- `knowledge_level`: describes the level of knowledge expressed in a statement, based on the reasoning or analysis methods used to generate the statement, or the scope or specificity of what the statement expresses to be true. Permissible values are defined in the **biolink:KnowledgeLevelEnum** enumeration:
- `knowledge_level`: Describes the level of knowledge expressed in a statement, based on the reasoning or analysis methods used to generate the statement, or the scope or specificity of what the statement expresses to be true. Permissible values are defined in the **biolink:KnowledgeLevelEnum** enumeration:

**Permissible Values**:
- `knowledge_assertion`: a statement of purported fact that is put forth by an agent as true, based on assessment of direct evidence. Assertions generally have a high confidence of being true based on the strength of evidence supporting them.
- `logical_entailment`: a statement reporting a conclusion that follows logically from premises, which are typically well-established facts or knowledge assertions. (e.g. fingernail part of finger, finger part of hand → fingernail part of hand)). Logical entailments are based on dedictive inference, and generally have a high degree of confidence when based on sound premises and inference logic.
- `prediction`: a statement of a possible fact based on more probabilistic (non-deductive) forms of reasoning over indirect forms of evidence, that lead to more speculative conclusions. Predictions often have a lower degree of confidence based on the indirect nature of their evidence and reasoning supporting them.
- `statistical_association`: a statement that reports concepts representing variables in a dataset to be statistically associated in the context of a particular cohort or dataset (e.g. “Metformin Treatment (variable 1) is correlated with Diabetes Diagnosis (variable 2) in EHR dataset X”). These associations are inherently true in that they simple report the results of some statistical analysis, but do not interpret these data to draw broader conclusions about general types in the domain of discourse.
- `observation`: a statement reporting (and possibly quantifying) a phenomenon that was observed to occur - absent any analysis or interpretation that generates a statistical association or supports a broader conclusion or inference. Observation statements are also inherently true in that they simple report what an agent observed - without any interpretation or inference.
- `not_provided`: the knowledge level/type fora statement is not provided, typically because it cannot be determined from available information.
- `knowledge_assertion`: A statement of purported fact that is put forth by an agent as true, based on assessment of direct evidence. Assertions are likely but not definitively true.
- `logical_entailment`: A statement reporting a conclusion that follows logically from premises representing established facts or knowledge assertions (e.g. fingernail part of finger, finger part of hand --> fingernail part of hand).
- `prediction`: A statement of a possible fact based on probabilistic forms of reasoning over more indirect forms of evidence, that lead to more speculative conclusions.
- `statistical_association`: A statement that reports concepts representing variables in a dataset to be statistically associated with each other in a particular cohort (e.g. 'Metformin Treatment (variable 1) is correlated with Diabetes Diagnosis (variable 2) in EHR dataset X').
- `text_co_occurrence`: A statement reporting that mentions of two concepts in some corpus of text (e.g. the biomedical literature) occur together at a statistically significant frequency - suggesting that a real-world biological or clinical relationship may exist between the concepts.
- `observation`: A statement reporting (and possibly quantifying) a phenomenon that was observed to occur - absent any analysis or interpretation that generates a statistical association or supports a broader conclusion or inference.
- `not_provided`: The knowledge level is not provided, typically because it cannot be determined from available information.

NOTE that the notion of a 'level' of knowledge can in one sense relate to the strength of a statement - i.e. how confident we are that it says something true about our domain of discourse. Here, we can generally consider Knowledge Assertions to be stronger than Entailments to be stronger than Predictions. But in another sense, 'level' of knowledge can refer to the scope or specificity of what a statement expresses - on a spectrum from context-specific results of a data analysis, to generalized assertions of knowledge or fact. Here, Statistical Associations and Observations represent more foundational statements that are only slightly removed from the data on which they are based (the former reporting the direct results of an analysis in terms of correlations between variables in the data, and the latter describing phenomena that were observed/reported to have occurred).
NOTE that the notion of a 'level' of knowledge can in one sense relate to the strength of a statement - i.e. how confident we are that it says something true about our domain of discourse. Here, we can generally consider Assertions to be stronger than Entailments to be stronger than Predictions. But in another sense, 'level' of knowledge can refer to the scope or specificity of what a statement expresses - on a spectrum from context-specific results of a data analysis, to generalized assertions of knowledge or fact. Here, Statistical Associations and Observations represent more foundational statements that are only slightly removed from the data on which they are based (the former reporting the direct results of an analysis in terms of correlations between variables in the data, and the latter describing phenomena that were observed/reported to have occurred).

## Implementation Guidance

1. Knowledge Providers MUST report one and only one `agent type` on each Edge they return in a TRAPI message, using an `agent_type` property. The value of this property MUST come from the [biolink:AgentTypeEnum](https://biolink.github.io/biolink-model/AgentTypeEnum/).

2. Knowledge Providers MUST report one and only one `knowledge level` for each Edge returned in a TRAPI message, using a `knowledge_level` property. The value MUST come from the [biolink:KnowledgeLevelEnum](https://biolink.github.io/biolink-model/KnowledgeLevelEnum/).

```json
"edge_id": {
"subject": "PUBCHEM.COMPOUND:6623",
"predicate": "biolink:affects",
"object": "HGNC:3467",
"sources": { . . . }
"sources": { . . . },
"knowledge_level": "knowledge_assertion",
"agent_type": "manual_agent"
}
```

3. The main challenge in applying this standard concerns selecting appropriate agent type and knowledge level terms for a given Edge. To assist KPs in this task, a [Supplemental Guidance document](https://docs.google.com/document/d/140dtM5CjWM97JiBRdAmDT-9IKqHoOj-xbE_5TWkdYqg/edit) provides additional implementation support beyond the base specification above. This includes clarification of key distinctions, tips for proper term selection, and a corpus of examples illustrating how agent type and knowledge level terms are applied to the diverse kinds of Edges provided in Translator knowledge graphs.
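The two MUST rules above can be checked mechanically. The following is a minimal sketch, not part of the specification: the function name `check_kl_at` is hypothetical, and the permissible values are transcribed from the lists in this document, so the Biolink enumerations remain the authoritative source.

```python
# Permissible values transcribed from the lists in this document (assumption:
# the Biolink enumerations are authoritative and may change).
AGENT_TYPES = {
    "manual_agent",
    "automated_agent",
    "data_analysis_pipeline",
    "computational_model",
    "text_mining_agent",
    "image_processing_agent",
    "manual_validation_of_automated_agent",
    "not_provided",
}
KNOWLEDGE_LEVELS = {
    "knowledge_assertion",
    "logical_entailment",
    "prediction",
    "statistical_association",
    "text_co_occurrence",
    "observation",
    "not_provided",
}


def check_kl_at(edge: dict) -> list[str]:
    """Return a list of problems; an empty list means the edge passes.

    `check_kl_at` is a hypothetical helper, not a TRAPI or Biolink function.
    """
    problems = []
    checks = (("agent_type", AGENT_TYPES), ("knowledge_level", KNOWLEDGE_LEVELS))
    for prop, allowed in checks:
        value = edge.get(prop)
        if value is None:
            # "one and only one" rules out a missing value ...
            problems.append(f"missing {prop}")
        elif not isinstance(value, str):
            # ... and also a list of several values.
            problems.append(f"{prop} must be a single string, got {type(value).__name__}")
        elif value not in allowed:
            problems.append(f"{prop} has non-permissible value {value!r}")
    return problems
```

Run against the example Edge above, the check returns an empty list. An Edge that omits either property, or supplies a list of values, is reported as a problem.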

@@ -148,7 +148,7 @@ Paths are represented within AuxiliaryGraphs. Each Path references edges from th
AuxiliaryGraph's `edges` field. These Edges are unordered; their sequence does not convey Path structure. Instead, the
Path must be reconstructed from the graph itself.

Paths are expected to be linearthere should be no way to skip or bypass any node in the Path. However, parallel edges
Paths are expected to be linear - there should be no way to skip or bypass any node in the Path. However, parallel edges
between two nodes are allowed.
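
Because the Edges are unordered, a consumer has to rebuild the node sequence itself. The sketch below shows one possible way to do that and to reject non-linear Paths. It is not a reference implementation: `reconstruct_path` is a hypothetical helper, Edges are assumed to be dictionaries with `subject` and `object` keys, and Edge direction is ignored when walking the Path.

```python
from collections import defaultdict


def reconstruct_path(edge_ids, kg_edges, start, end):
    """Order the nodes of a linear Path from an unordered list of edge IDs.

    Hypothetical helper: `kg_edges` maps edge ID -> {"subject": ..., "object": ...}.
    Raises ValueError if the edges do not form a single linear Path.
    """
    # Undirected adjacency; using sets collapses parallel edges between two nodes.
    neighbors = defaultdict(set)
    for edge_id in edge_ids:
        edge = kg_edges[edge_id]
        neighbors[edge["subject"]].add(edge["object"])
        neighbors[edge["object"]].add(edge["subject"])

    if start not in neighbors or end not in neighbors:
        raise ValueError("start or end node is not touched by any edge")

    path, visited = [start], {start}
    previous, current = None, start
    while current != end:
        candidates = neighbors[current] - {previous}
        if len(candidates) != 1:
            # A branch (a way to skip or bypass a node) or a dead end.
            raise ValueError(f"Path is not linear at node {current!r}")
        previous, current = current, candidates.pop()
        if current in visited:
            raise ValueError(f"Path revisits node {current!r}")
        visited.add(current)
        path.append(current)

    # The end node must be a true endpoint, and no node may be left over.
    if neighbors[end] - {previous} or len(path) != len(neighbors):
        raise ValueError("edges contain nodes outside the linear Path")
    return path
```

With edges `A-B`, `B-C` and a parallel `B-A` edge, the function returns `["A", "B", "C"]`. Adding a direct `A-C` edge makes it raise, since that edge would bypass `B`.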

Using the Knowledge Graph above, we can construct the AuxiliaryGraphs shown in the example below:
@@ -185,7 +185,7 @@ Path `a2` represents a direct lookup edge between Crohn’s and Parkinson’s. I
input nodes, it should be included as a valid path.

The Paths shown correspond to the unconstrained version of the initial query. Applying `intermediate_categories`
constraints would yield a slightly different set of paths. As show below in the constrained version of the
constraints would yield a slightly different set of paths. As shown below in the constrained version of the
AuxiliaryGraphs, `a2` is removed because it does not contain any `Gene` nodes. Therefore, it is not a valid Path
for the constrained version of the query.

@@ -212,7 +212,7 @@ for the constrained version of the query.
## Results

Each individual Result is structured similarly, with NodeBindings and Analyses. Both input
nodes must be pinned, and therefor no unpinned nodes will exist in the QueryGraph. This means there is only one result
nodes must be pinned, and therefore no unpinned nodes will exist in the QueryGraph. This means there is only one result
for each query, contained within the Results field of the Message. This Result can have many Analyses, each one
corresponding to a different Path. This follows the same Result-merging rules used in other query types, where results
that contain the same nodes but different analyses are combined into a single result, with their analyses concatenated.
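
The merging rule described above can be sketched as follows. This is an illustration only: `merge_results` is a hypothetical helper, and the `node_bindings`, `analyses`, and `id` field names are assumed to follow the TRAPI Result and NodeBinding schemas.

```python
def merge_results(results):
    """Combine Results that bind the same nodes, concatenating their analyses.

    Hypothetical helper; assumes each Result looks like
    {"node_bindings": {qnode_key: [{"id": ...}, ...]}, "analyses": [...]}.
    """
    merged = {}
    for result in results:
        # Results are "the same" when every query node is bound to the same IDs.
        key = frozenset(
            (qnode, tuple(sorted(binding["id"] for binding in bindings)))
            for qnode, bindings in result["node_bindings"].items()
        )
        if key not in merged:
            merged[key] = {"node_bindings": result["node_bindings"], "analyses": []}
        merged[key]["analyses"].extend(result["analyses"])
    return list(merged.values())
```

For a Pathfinder query, where both input nodes are pinned, every partial Result carries the same node bindings, so this merge produces a single Result whose analyses cover all of the Paths.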