
Conversation

ialarmedalien (Collaborator) commented Aug 20, 2020

  • use manifest file to specify files to be included in release
  • update DJORNL parser to apply manifest file
  • add tests for manifest file validation
  • small refactor of parser to apply the same QC to every file
  • small formatting updates to DJORNL source files
  • add VERSION (fixes #5) and CHANGELOG files

This is the remainder of the changes in kbaseattic/relation_engine_spec#144 from the old spec repo.

  • I updated the README.md docs to reflect this change.
  • This is a breaking API change and I have incremented the API version.
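For context, a minimal sketch of how such a manifest might be loaded and checked in Python. The file name (manifest.yaml) and the field names (file_list, path) are illustrative assumptions, not the actual DJORNL manifest schema:

# Hypothetical sketch: load a dataset manifest and confirm every listed file exists.
import os
import yaml

def load_manifest(root_dir):
    with open(os.path.join(root_dir, 'manifest.yaml')) as fd:
        manifest = yaml.safe_load(fd)
    errors = []
    for entry in manifest.get('file_list', []):
        if not os.path.exists(os.path.join(root_dir, entry['path'])):
            errors.append(entry['path'] + ': file does not exist')
    if errors:
        raise RuntimeError('\n'.join(errors))
    return manifest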

    delimiter = ','
    return csv.reader(fd, delimiter=delimiter)

def parser_gen(self, file):
ialarmedalien (Collaborator, Author) commented:

exciting generator function!
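A minimal, hypothetical sketch of a generator in this style, reading delimited rows and yielding them one at a time; the real parser_gen's signature and behaviour may differ:

import csv

def parser_gen(self, file):
    """Yield (line_number, row) tuples from a delimited file, skipping comments."""
    # assumption: file is a dict with a 'path' key; delimiter inferred from extension
    delimiter = '\t' if file['path'].endswith('.tsv') else ','
    with open(file['path']) as fd:
        for line_no, row in enumerate(csv.reader(fd, delimiter=delimiter), start=1):
            if not row or row[0].startswith('#'):
                continue
            yield (line_no, row)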

Comment on lines 165 to 232
def _key(row):
    return '__'.join([
        row['node1'],
        row['node2'],
        edge_type(row),
        row['edge'],
    ])
ialarmedalien (Collaborator, Author) commented:

A unique key is required by the DJORNL dataset because the default _key is derived from _to and _from, which is not unique enough.
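To illustrate why (hypothetical rows; the real code derives the type via edge_type(row)): two edges between the same node pair but of different types would collide on a _from/_to-derived key, while the compound key keeps them distinct.

rows = [
    {'node1': 'AT1G01100', 'node2': 'AT1G01010', 'edge_type': 'ppi_liter', 'edge': '0.8'},
    {'node1': 'AT1G01100', 'node2': 'AT1G01010', 'edge_type': 'gene_coexpr', 'edge': '0.5'},
]
keys = {'__'.join([r['node1'], r['node2'], r['edge_type'], r['edge']]) for r in rows}
assert len(keys) == 2  # distinct keys despite an identical node pair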

Comment on lines 247 to 251
'go_description': lambda row: row['go_descr'],
'mapman_description': lambda row: row['mapman_descr'],
'pheno_description': lambda row: row['pheno_descrip1'],
'pheno_pto_name': lambda row: row['pheno_descrip2'],
'pheno_pto_description': lambda row: row['pheno_descrip3'],
ialarmedalien (Collaborator, Author) commented Aug 20, 2020:

I'm going to ask the Jacobson group to rename these fields so that this remapping is not necessary.
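For illustration, how a remap like the one above would be applied to a raw row (a sketch with abbreviated field names, not the parser's actual code):

remap = {
    'go_description': lambda row: row['go_descr'],
    'pheno_description': lambda row: row['pheno_descrip1'],
}
raw_row = {'go_descr': 'structural constituent of ribosome', 'pheno_descrip1': ''}
clean_row = {new: extract(raw_row) for new, extract in remap.items()}
# => {'go_description': 'structural constituent of ribosome', 'pheno_description': ''}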

Comment on lines 305 to 306
cluster_id = cluster_label + ':' + cols[0].replace('Cluster', '')
node_keys = [n.strip() for n in cols[1].split(',')]
ialarmedalien (Collaborator, Author) commented:

Asked the DJORNL people to change the format of the cluster files so that they're valid TSV/CSV.
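For reference, an illustrative input line for the snippet above; the actual cluster file format, the ':' delimiter, and the cluster_label value are assumptions here:

cluster_label = 'markov_i2'  # hypothetical label
line = 'Cluster1:AT1G01100, AT1G01010'
cols = [c.strip() for c in line.split(':')]
cluster_id = cluster_label + ':' + cols[0].replace('Cluster', '')  # 'markov_i2:1'
node_keys = [n.strip() for n in cols[1].split(',')]  # ['AT1G01100', 'AT1G01010']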

lgtm-com bot commented Aug 20, 2020

This pull request introduces 1 alert when merging 380854a into 7e9165b - view on LGTM.com

new alerts:

  • 1 for Duplicate key in dict literal

Comment on lines +77 to +93
with self.assertRaisesRegex(RuntimeError, err_str):
    self.init_parser_with_path(RES_ROOT_DATA_PATH)
ialarmedalien (Collaborator, Author) commented:

All the validation now happens before file parsing starts, so the individual tests in the previous version can be replaced by this single test. Woohoo!
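The validate-first flow, as a hypothetical sketch (the method names here are made up for illustration): gather every validation error up front and fail once, before any parsing begins.

def load_data(self):
    # hypothetical helpers; each returns a list of error strings
    errors = self.validate_manifest() + self.validate_files()
    if errors:
        raise RuntimeError('\n'.join(errors))
    for file in self.manifest['file_list']:
        self.parse_file(file)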

{"_key": "AT1G01100", "node_type": "gene", "transcript": "AT1G01100.4", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "protein_coding", "tair_computational_desc": "60S acidic ribosomal protein family;(source:Araport11)", "tair_curator_summary": "", "tair_short_desc": "60S acidic ribosomal protein family", "go_descr": "structural constituent of ribosome, ribonucleoprotein complex binding, protein kinase activator activity", "go_terms": ["GO:0003735", "GO:0043021", "GO:0030295"], "mapman_bin": "17.1.2.1.46", "mapman_name": ".Protein biosynthesis.ribosome biogenesis.large ribosomal subunit (LSU).LSU proteome.component RPP1", "mapman_desc": "component RPP1 of LSU proteome component (original description: pep chromosome:TAIR10:1:50090:51187:-1 gene:AT1G01100 transcript:AT1G01100.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:RPP1A description:60S acidic ribosomal protein P1-1 [Source:UniProtKB/Swiss-Prot;Acc:Q8LCW9])", "pheno_aragwas_id": "", "pheno_desc1": "", "pheno_desc2": "", "pheno_desc3": "", "pheno_ref": "", "user_notes": ""},
{"_key": "Na23", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_desc": "", "tair_curator_summary": "", "tair_short_desc": "", "go_descr": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_desc": "", "pheno_aragwas_id": "10.21958/phenotype:5", "pheno_desc1": "Sodium concentrations in leaves, grown in soil. Elemental analysis was performed with an ICP-MS (PerkinElmer). Sample normalized to calculated weights as described in Baxter et al., 2008", "pheno_desc2": "sodium concentration", "pheno_desc3": "The total sodium ion concentration measured in a given volume of a plant or a plant part or plant extract. [GR:pj]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""},
{"_key": "SDV", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_desc": "", "tair_curator_summary": "", "tair_short_desc": "", "go_descr": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_desc": "", "pheno_aragwas_id": "10.21958/phenotype:104", "pheno_desc1": "Number of days following stratification to opening of first flower. The experiment was stopped at 200 d, and accessions that had not flowered at that point were assigned a value of 200", "pheno_desc2": "days to flowering trait", "pheno_desc3": "A flowering time trait (TO:0002616)which is the number of days required for an individual flower (PO:0009046), a whole plant (PO:0000003) or a plant population to reach flowering stage (PO:0007616) from a predetermined time point (e.g. the date of seed sowing, seedling transplant, or seedling emergence). [GR:pj, TO:cooperl]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""}
{"_key": "As2", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_description": "", "tair_curator_summary": "", "tair_short_description": "", "go_description": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_description": "", "pheno_aragwas_id": "10.21958/phenotype:103", "pheno_description": "", "pheno_pto_name": "bacterial disease resistance", "pheno_pto_description": "The resistance exhibited by a plant or a group of plants (population) in response to the disease caused by a bacterial pathogen infection as compared to the susceptible and/or the reference plants of the same species. [GR:pj]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""},
ialarmedalien (Collaborator, Author) commented:

field renaming (desc ==> description)

    parser._configure()
    return parser

def test_load_no_manifest(self):
Contributor commented:

I would argue that failure tests and edge cases aren't necessary when the users are just ourselves running this via CLI. I mean, it's good we have it and you already did it, but I would consider the ROI pretty low.

ialarmedalien (Collaborator, Author) replied:

I would always test failures and error cases, because

  1. you should know how the software acts when things go wrong; it is difficult and annoying to define how software should work if the environment or input isn't as expected, but that's frequently where bugs crop up
  2. it doesn't take long for the nuances of some piece of code to disappear from working memory, and it's much easier to write tests now whilst it's uppermost in my mind than have to come back in several months' time (or for someone else new to come in) and get familiar with the code
  3. I'm planning to use this parser to do automated testing on the exascale_data repo to QA data releases
  4. ideally the whole data upload would be automated at some point, so a new release in the exascale_data repo triggers an Arango data load. A generic dataset upload/update script should be fairly easy to put together (if there isn't one already) so there's less reliance on humans having to trigger these processes manually.

Contributor replied:

👍 I do admire the work ethic, and agree it would be especially useful if we automate this importer. I don't think we will have such high test standards for the other importers :p

jayrbolton (Contributor) commented:

Assuming you will come back to this PR at some point, @ialarmedalien?

Comment on lines -69 to -75
edge_remap = {
    'AraGWAS-Phenotype_Associations': 'pheno_assn',
    'AraNetv2-CX_pairwise-gene-coexpression': 'gene_coexpr',
    'AraNetv2-DC_domain-co-occurrence': 'domain_co_occur',
    'AraNetv2-HT_high-throughput-ppi': 'ppi_hithru',
    'AraNetv2-LC_lit-curated-ppi': 'ppi_liter',
}
ialarmedalien (Collaborator, Author) commented:

No longer bothering with this remapping; I'll just remap the names in the UI.

Comment on lines +103 to +108
edge_err_msg = "\n".join([
    r"edges.tsv line 3: 'Same-Old-Stuff' is not valid under any of the given schemas",
    r"edges.tsv line 7: '2.' does not match .*?",
    r"edges.tsv line 8: 'raNetv2-DC_' is not valid under any of the given schemas",
    r"edges.tsv line 10: 'score!' does not match .*?"
])
ialarmedalien (Collaborator, Author) commented:

All the errors in the input files get collected, so if the dataset is full of crap, you don't have to run the parser over and over and over again to find all the problems.
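A sketch of that collect-everything pattern using the jsonschema library (the schema and rows are invented for illustration; the parser's actual validation code may differ):

import jsonschema

schema = {'type': 'object', 'required': ['score'],
          'properties': {'score': {'type': 'number'}}}
rows = [{'score': 0.5}, {'score': 'score!'}, {}]

errors = []
validator = jsonschema.Draft7Validator(schema)
for line_no, row in enumerate(rows, start=1):
    for err in validator.iter_errors(row):
        errors.append(f"edges.tsv line {line_no}: {err.message}")
if errors:
    # every problem reported at once, in a single exception
    raise RuntimeError('\n'.join(errors))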

Refactor spec files to use a definitions file
Add tests for duplicated data

if not hasattr(self, '_dataset_schema_dir'):
    dir_path = os.path.dirname(os.path.realpath(__file__))
    self._dataset_schema_dir = os.path.join(dir_path, '../', '../', 'spec', 'datasets', 'djornl')
Contributor commented:

You don't have to change anything, but it may be easier to just use the current working directory and assume/assert that it is always the root of the project.

ialarmedalien (Collaborator, Author) replied:

That would break the tests (and other use cases). This ensures that you'll get the correct dir regardless of where you're calling from.
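To spell out the difference (a sketch; the paths are illustrative): os.getcwd() depends on where the process was launched, while __file__ is fixed at the module's location, so the __file__-based path resolves the same way from anywhere.

import os

cwd_based = os.path.join(os.getcwd(), 'spec', 'datasets', 'djornl')  # varies per invocation
file_based = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                          '..', '..', 'spec', 'datasets', 'djornl')  # always the same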

Contributor replied:

You have to cd within the project to run certain tests?

jayrbolton merged commit 2019ef3 into develop on Aug 28, 2020
jayrbolton deleted the parser_updates branch on Aug 28, 2020 at 20:36
Development

Successfully merging this pull request may close these issues: Docker build version
