Add manifests and refactor DJORNL parser #19

Conversation
```python
delimiter = ','
return csv.reader(fd, delimiter=delimiter)


def parser_gen(self, file):
```
exciting generator function!
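For reference, a minimal standalone sketch of a row generator in this style. The header handling and the comment-skipping rule are illustrative assumptions, not necessarily the parser's actual behaviour:

```python
import csv
from io import StringIO


def parser_gen(fd, delimiter='\t'):
    """Yield each data row as a dict keyed by the header line,
    skipping blank lines and '#' comments."""
    reader = csv.reader(fd, delimiter=delimiter)
    headers = next(reader)
    for row in reader:
        if not row or row[0].startswith('#'):
            continue
        yield dict(zip(headers, row))


rows = list(parser_gen(StringIO("node1\tnode2\tscore\nA\tB\t0.5\n")))
```

Because it yields one row at a time, the whole file never has to be held in memory.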
importers/djornl/parser.py
Outdated
```python
def _key(row):
    return '__'.join([
        row['node1'],
        row['node2'],
        edge_type(row),
        row['edge'],
    ])
```
unique key required by the djornl dataset because the default _key is derived from _to and _from, which is not unique enough
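As a sketch of why this matters, assuming hypothetical `node1`/`node2`/`edge` fields and a stand-in `edge_type` helper: two parallel edges between the same node pair would collide under a `_from`/`_to`-derived key, but get distinct composite keys here.

```python
def edge_type(row):
    # hypothetical stand-in for the parser's real edge-type lookup
    return row['edge_type']


def make_key(row):
    # combine endpoints, edge type and score so parallel edges stay distinct
    return '__'.join([row['node1'], row['node2'], edge_type(row), row['edge']])


a = {'node1': 'AT1G01010', 'node2': 'AT1G01030', 'edge_type': 'gene_coexpr', 'edge': '0.8'}
b = {'node1': 'AT1G01010', 'node2': 'AT1G01030', 'edge_type': 'ppi_hithru', 'edge': '0.8'}
```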
importers/djornl/parser.py
Outdated
```python
'go_description': lambda row: row['go_descr'],
'mapman_description': lambda row: row['mapman_descr'],
'pheno_description': lambda row: row['pheno_descrip1'],
'pheno_pto_name': lambda row: row['pheno_descrip2'],
'pheno_pto_description': lambda row: row['pheno_descrip3'],
```
I'm going to ask the Jacobson group to rename these fields so that this remapping is not necessary.
importers/djornl/parser.py
Outdated
```python
cluster_id = cluster_label + ':' + cols[0].replace('Cluster', '')
node_keys = [n.strip() for n in cols[1].split(',')]
```
Asked the DJORNL people to change the format of the cluster files so they're valid TSV/CSV.
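For illustration, the cluster-line parsing above can be sketched as a standalone function. The `markov_i2` label, the default argument, and the sample line are made up:

```python
def parse_cluster_line(line, cluster_label='markov_i2'):
    # assumed line format: 'ClusterN<TAB>comma, separated, node, IDs'
    cols = line.rstrip('\n').split('\t')
    cluster_id = cluster_label + ':' + cols[0].replace('Cluster', '')
    node_keys = [n.strip() for n in cols[1].split(',')]
    return cluster_id, node_keys


cid, nodes = parse_cluster_line('Cluster1\tAT1G01100, AT1G01010\n')
```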
This pull request introduces 1 alert when merging 380854a into 7e9165b - view new alerts on LGTM.com.
```python
with self.assertRaisesRegex(RuntimeError, err_str):
    self.init_parser_with_path(RES_ROOT_DATA_PATH)
```
all the validation now happens before file parsing starts, so the individual tests in the previous version can be replaced by this single test. Woohoo!
```json
{"_key": "AT1G01100", "node_type": "gene", "transcript": "AT1G01100.4", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "protein_coding", "tair_computational_desc": "60S acidic ribosomal protein family;(source:Araport11)", "tair_curator_summary": "", "tair_short_desc": "60S acidic ribosomal protein family", "go_descr": "structural constituent of ribosome, ribonucleoprotein complex binding, protein kinase activator activity", "go_terms": ["GO:0003735", "GO:0043021", "GO:0030295"], "mapman_bin": "17.1.2.1.46", "mapman_name": ".Protein biosynthesis.ribosome biogenesis.large ribosomal subunit (LSU).LSU proteome.component RPP1", "mapman_desc": "component RPP1 of LSU proteome component (original description: pep chromosome:TAIR10:1:50090:51187:-1 gene:AT1G01100 transcript:AT1G01100.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:RPP1A description:60S acidic ribosomal protein P1-1 [Source:UniProtKB/Swiss-Prot;Acc:Q8LCW9])", "pheno_aragwas_id": "", "pheno_desc1": "", "pheno_desc2": "", "pheno_desc3": "", "pheno_ref": "", "user_notes": ""},
{"_key": "Na23", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_desc": "", "tair_curator_summary": "", "tair_short_desc": "", "go_descr": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_desc": "", "pheno_aragwas_id": "10.21958/phenotype:5", "pheno_desc1": "Sodium concentrations in leaves, grown in soil. Elemental analysis was performed with an ICP-MS (PerkinElmer). Sample normalized to calculated weights as described in Baxter et al., 2008", "pheno_desc2": "sodium concentration", "pheno_desc3": "The total sodium ion concentration measured in a given volume of a plant or a plant part or plant extract. [GR:pj]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""},
{"_key": "SDV", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_desc": "", "tair_curator_summary": "", "tair_short_desc": "", "go_descr": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_desc": "", "pheno_aragwas_id": "10.21958/phenotype:104", "pheno_desc1": "Number of days following stratification to opening of first flower. The experiment was stopped at 200 d, and accessions that had not flowered at that point were assigned a value of 200", "pheno_desc2": "days to flowering trait", "pheno_desc3": "A flowering time trait (TO:0002616)which is the number of days required for an individual flower (PO:0009046), a whole plant (PO:0000003) or a plant population to reach flowering stage (PO:0007616) from a predetermined time point (e.g. the date of seed sowing, seedling transplant, or seedling emergence). [GR:pj, TO:cooperl]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""}
{"_key": "As2", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_description": "", "tair_curator_summary": "", "tair_short_description": "", "go_description": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_description": "", "pheno_aragwas_id": "10.21958/phenotype:103", "pheno_description": "", "pheno_pto_name": "bacterial disease resistance", "pheno_pto_description": "The resistance exhibited by a plant or a group of plants (population) in response to the disease caused by a bacterial pathogen infection as compared to the susceptible and/or the reference plants of the same species. [GR:pj]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""},
```
field renaming (desc ==> description)
```python
parser._configure()
return parser


def test_load_no_manifest(self):
```
I would argue that failure tests and edge cases aren't necessary when the users are just ourselves running this via CLI. I mean it's good we have it and you already did it, but I would consider the ROI pretty low
I would always test failures and error cases, because
- you should know how the software acts when things go wrong; it is difficult and annoying to define how software should behave when the environment or input isn't as expected, but that's frequently where bugs crop up
- it doesn't take long for the nuances of a piece of code to disappear from working memory, and it's much easier to write tests now, while it's uppermost in my mind, than to come back in several months' time (or for someone else new to come in) and get familiar with the code again
- I'm planning to use this parser to do automated testing on the exascale_datarepo to QA data releases. Ideally the whole data upload would be automated at some point, so that a new release in the exascale_datarepo triggers an Arango data load. A generic dataset upload/update script should be fairly easy to put together (if there isn't one already), reducing the reliance on humans to trigger these processes manually.
👍 I do admire the work ethic and agree it would be especially useful if we automate this importer. I don't think we will have this high of test standards for the other importers :p
Assuming you will come back to this PR at some point @ialarmedalien?
Force-pushed bc81b02 to 9d50c6c.
- update DJORNL parser to apply manifest file
- add tests for manifest file validation
- small refactor of parser to apply the same QC to every file
- small formatting updates to DJORNL source files
Force-pushed 9d50c6c to 2606afd.
```python
edge_remap = {
    'AraGWAS-Phenotype_Associations': 'pheno_assn',
    'AraNetv2-CX_pairwise-gene-coexpression': 'gene_coexpr',
    'AraNetv2-DC_domain-co-occurrence': 'domain_co_occur',
    'AraNetv2-HT_high-throughput-ppi': 'ppi_hithru',
    'AraNetv2-LC_lit-curated-ppi': 'ppi_liter',
}
```
no longer bothering with this remapping - I'll just remap the names in the UI
```python
edge_err_msg = "\n".join([
    r"edges.tsv line 3: 'Same-Old-Stuff' is not valid under any of the given schemas",
    r"edges.tsv line 7: '2.' does not match .*?",
    r"edges.tsv line 8: 'raNetv2-DC_' is not valid under any of the given schemas",
    r"edges.tsv line 10: 'score!' does not match .*?"
])
```
All the errors in the input files get collected, so if the dataset is full of crap, you don't have to run the parser over and over and over again to find all the problems.
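The accumulate-then-report pattern might look roughly like this; the validator and the row shape are illustrative, not the parser's actual code:

```python
def validate_rows(rows, check):
    """Run the check on every row, collecting all failure messages
    instead of stopping at the first one."""
    errors = []
    for n, row in enumerate(rows, start=1):
        msg = check(row)
        if msg:
            errors.append(f"line {n}: {msg}")
    return errors


# toy check: 'score' must look like a decimal number
def score_check(row):
    if row['score'].replace('.', '', 1).isdigit():
        return None
    return f"{row['score']!r} does not match"


errs = validate_rows(
    [{'score': '0.5'}, {'score': 'score!'}, {'score': '0.7'}],
    score_check,
)
```

The caller can then raise once with the full list, so every problem in a bad input file is reported in a single run.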
Refactor spec files to use a definitions file
Add tests for duplicated data
Force-pushed 2606afd to 541dd4e.
Add a couple more parser tests
```python
if not hasattr(self, '_dataset_schema_dir'):
    dir_path = os.path.dirname(os.path.realpath(__file__))
    self._dataset_schema_dir = os.path.join(dir_path, '../', '../', 'spec', 'datasets', 'djornl')
```
You don't have to change anything, but it may be easier to just use the current working directory and assume/assert that it is always the root of the project
That would break the tests (and other use cases). This ensures that you'll get the correct dir regardless of where you're calling from.
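A minimal sketch of the `__file__`-relative approach, pulled out as a function so it can be shown with a made-up path; the `/repo/...` layout is hypothetical:

```python
import os


def dataset_schema_dir(source_file):
    """Resolve the schema directory relative to the parser's own source
    file rather than the current working directory, so the result is the
    same no matter which directory the caller is invoked from."""
    dir_path = os.path.dirname(os.path.realpath(source_file))
    return os.path.normpath(
        os.path.join(dir_path, '..', '..', 'spec', 'datasets', 'djornl')
    )


# inside the parser module this would be called as dataset_schema_dir(__file__)
path = dataset_schema_dir('/repo/importers/djornl/parser.py')
```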
You have to cd within the project to run certain tests?
This is the remainder of the changes in kbaseattic/relation_engine_spec#144 from the old spec repo.