
Conversation

ialarmedalien (Collaborator) commented Aug 20, 2020

  • use manifest file to specify files to be included in release
  • update DJORNL parser to apply manifest file
  • add tests for manifest file validation
  • small refactor of parser to apply the same QC to every file
  • small formatting updates to DJORNL source files
  • add VERSION (fixes #5) and CHANGELOG files

This is the remainder of the changes in kbaseattic/relation_engine_spec#144 from the old spec repo.

  • I updated the README.md docs to reflect this change.
  • This is a breaking API change and I have incremented the API version.
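For context, a minimal sketch of how such a manifest might be loaded and checked in Python. The file name (manifest.yaml) and the field names (file_list, path) are illustrative assumptions, not the actual DJORNL manifest schema:

# Hypothetical sketch: load a dataset manifest and confirm every listed file exists.
import os
import yaml

def load_manifest(root_dir):
    with open(os.path.join(root_dir, 'manifest.yaml')) as fd:
        manifest = yaml.safe_load(fd)
    errors = []
    for entry in manifest.get('file_list', []):
        if not os.path.exists(os.path.join(root_dir, entry['path'])):
            errors.append(entry['path'] + ': file does not exist')
    if errors:
        raise RuntimeError('\n'.join(errors))
    return manifest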

    delimiter = ','
    return csv.reader(fd, delimiter=delimiter)

def parser_gen(self, file):
ialarmedalien (Collaborator, Author) commented:

exciting generator function!
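A minimal, hypothetical sketch of a generator in this style, reading delimited rows and yielding them one at a time; the real parser_gen's signature and behaviour may differ:

import csv

def parser_gen(self, file):
    """Yield (line_number, row) tuples from a delimited file, skipping comments."""
    # assumption: file is a dict with a 'path' key; delimiter inferred from extension
    delimiter = '\t' if file['path'].endswith('.tsv') else ','
    with open(file['path']) as fd:
        for line_no, row in enumerate(csv.reader(fd, delimiter=delimiter), start=1):
            if not row or row[0].startswith('#'):
                continue
            yield (line_no, row)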

Comment on lines 165 to 232
def _key(row):
    return '__'.join([
        row['node1'],
        row['node2'],
        edge_type(row),
        row['edge'],
    ])
ialarmedalien (Collaborator, Author) commented:

A unique key is required by the DJORNL dataset because the default _key is derived from _to and _from, which is not unique enough.
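To illustrate why (hypothetical rows; the real code derives the type via edge_type(row)): two edges between the same node pair but of different types would collide on a _from/_to-derived key, while the compound key keeps them distinct.

rows = [
    {'node1': 'AT1G01100', 'node2': 'AT1G01010', 'edge_type': 'ppi_liter', 'edge': '0.8'},
    {'node1': 'AT1G01100', 'node2': 'AT1G01010', 'edge_type': 'gene_coexpr', 'edge': '0.5'},
]
keys = {'__'.join([r['node1'], r['node2'], r['edge_type'], r['edge']]) for r in rows}
assert len(keys) == 2  # distinct keys despite an identical node pair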

Comment on lines 247 to 251
'go_description': lambda row: row['go_descr'],
'mapman_description': lambda row: row['mapman_descr'],
'pheno_description': lambda row: row['pheno_descrip1'],
'pheno_pto_name': lambda row: row['pheno_descrip2'],
'pheno_pto_description': lambda row: row['pheno_descrip3'],
ialarmedalien (Collaborator, Author) commented Aug 20, 2020:

I'm going to ask the Jacobson group to rename these fields so that this remapping is not necessary.
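For illustration, how a remap like the one above would be applied to a raw row (a sketch with abbreviated field names, not the parser's actual code):

remap = {
    'go_description': lambda row: row['go_descr'],
    'pheno_description': lambda row: row['pheno_descrip1'],
}
raw_row = {'go_descr': 'structural constituent of ribosome', 'pheno_descrip1': ''}
clean_row = {new: extract(raw_row) for new, extract in remap.items()}
# => {'go_description': 'structural constituent of ribosome', 'pheno_description': ''}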

Comment on lines 305 to 306
cluster_id = cluster_label + ':' + cols[0].replace('Cluster', '')
node_keys = [n.strip() for n in cols[1].split(',')]
ialarmedalien (Collaborator, Author) commented:

Asked the DJORNL people to change the format of the cluster files so that they're valid TSV/CSV.
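For reference, an illustrative input line for the snippet above; the actual cluster file format, the ':' delimiter, and the cluster_label value are assumptions here:

cluster_label = 'markov_i2'  # hypothetical label
line = 'Cluster1:AT1G01100, AT1G01010'
cols = [c.strip() for c in line.split(':')]
cluster_id = cluster_label + ':' + cols[0].replace('Cluster', '')  # 'markov_i2:1'
node_keys = [n.strip() for n in cols[1].split(',')]  # ['AT1G01100', 'AT1G01010']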

lgtm-com bot commented Aug 20, 2020

This pull request introduces 1 alert when merging 380854a into 7e9165b - view on LGTM.com

new alerts:

  • 1 for Duplicate key in dict literal

Comment on lines +77 to +93
with self.assertRaisesRegex(RuntimeError, err_str):
    self.init_parser_with_path(RES_ROOT_DATA_PATH)
ialarmedalien (Collaborator, Author) commented:

All the validation now happens before file parsing starts, so the individual tests in the previous version can be replaced by this single test. Woohoo!
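The validate-first flow, as a hypothetical sketch (the method names here are made up for illustration): gather every validation error up front and fail once, before any parsing begins.

def load_data(self):
    # hypothetical helpers; each returns a list of error strings
    errors = self.validate_manifest() + self.validate_files()
    if errors:
        raise RuntimeError('\n'.join(errors))
    for file in self.manifest['file_list']:
        self.parse_file(file)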

{"_key": "AT1G01100", "node_type": "gene", "transcript": "AT1G01100.4", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "protein_coding", "tair_computational_desc": "60S acidic ribosomal protein family;(source:Araport11)", "tair_curator_summary": "", "tair_short_desc": "60S acidic ribosomal protein family", "go_descr": "structural constituent of ribosome, ribonucleoprotein complex binding, protein kinase activator activity", "go_terms": ["GO:0003735", "GO:0043021", "GO:0030295"], "mapman_bin": "17.1.2.1.46", "mapman_name": ".Protein biosynthesis.ribosome biogenesis.large ribosomal subunit (LSU).LSU proteome.component RPP1", "mapman_desc": "component RPP1 of LSU proteome component (original description: pep chromosome:TAIR10:1:50090:51187:-1 gene:AT1G01100 transcript:AT1G01100.4 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:RPP1A description:60S acidic ribosomal protein P1-1 [Source:UniProtKB/Swiss-Prot;Acc:Q8LCW9])", "pheno_aragwas_id": "", "pheno_desc1": "", "pheno_desc2": "", "pheno_desc3": "", "pheno_ref": "", "user_notes": ""},
{"_key": "Na23", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_desc": "", "tair_curator_summary": "", "tair_short_desc": "", "go_descr": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_desc": "", "pheno_aragwas_id": "10.21958/phenotype:5", "pheno_desc1": "Sodium concentrations in leaves, grown in soil. Elemental analysis was performed with an ICP-MS (PerkinElmer). Sample normalized to calculated weights as described in Baxter et al., 2008", "pheno_desc2": "sodium concentration", "pheno_desc3": "The total sodium ion concentration measured in a given volume of a plant or a plant part or plant extract. [GR:pj]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""},
{"_key": "SDV", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_desc": "", "tair_curator_summary": "", "tair_short_desc": "", "go_descr": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_desc": "", "pheno_aragwas_id": "10.21958/phenotype:104", "pheno_desc1": "Number of days following stratification to opening of first flower. The experiment was stopped at 200 d, and accessions that had not flowered at that point were assigned a value of 200", "pheno_desc2": "days to flowering trait", "pheno_desc3": "A flowering time trait (TO:0002616)which is the number of days required for an individual flower (PO:0009046), a whole plant (PO:0000003) or a plant population to reach flowering stage (PO:0007616) from a predetermined time point (e.g. the date of seed sowing, seedling transplant, or seedling emergence). [GR:pj, TO:cooperl]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""}
{"_key": "As2", "node_type": "pheno", "transcript": "", "gene_symbol": "", "gene_full_name": "", "gene_model_type": "", "tair_computational_description": "", "tair_curator_summary": "", "tair_short_description": "", "go_description": "", "go_terms": [], "mapman_bin": "", "mapman_name": "", "mapman_description": "", "pheno_aragwas_id": "10.21958/phenotype:103", "pheno_description": "", "pheno_pto_name": "bacterial disease resistance", "pheno_pto_description": "The resistance exhibited by a plant or a group of plants (population) in response to the disease caused by a bacterial pathogen infection as compared to the susceptible and/or the reference plants of the same species. [GR:pj]", "pheno_ref": "Atwell et. al, Nature 2010", "user_notes": ""},
ialarmedalien (Collaborator, Author) commented:

field renaming (desc ==> description)

    parser._configure()
    return parser

def test_load_no_manifest(self):
Contributor commented:

I would argue that failure tests and edge cases aren't necessary when the users are just ourselves running this via CLI. I mean, it's good we have it and you already did it, but I would consider the ROI pretty low.

ialarmedalien (Collaborator, Author) replied:

I would always test failures and error cases, because

  1. you should know how the software acts when things go wrong; it is difficult and annoying to define how software should work if the environment or input isn't as expected, but that's frequently where bugs crop up
  2. it doesn't take long for the nuances of some piece of code to disappear from working memory, and it's much easier to write tests now whilst it's uppermost in my mind than have to come back in several months' time (or for someone else new to come in) and get familiar with the code
  3. I'm planning to use this parser to do automated testing on the exascale_data repo to QA data releases
  4. ideally the whole data upload would be automated at some point, so a new release in the exascale_data repo triggers an Arango data load. A generic dataset upload/update script should be fairly easy to put together (if there isn't one already) so there's less reliance on humans having to trigger these processes manually.

Contributor replied:

👍 I do admire the work ethic, and agree it would be especially useful if we automate this importer. I don't think we will have such high test standards for the other importers :p

jayrbolton (Contributor) commented:

Assuming you will come back to this PR at some point, @ialarmedalien?

Comment on lines -69 to -75
edge_remap = {
    'AraGWAS-Phenotype_Associations': 'pheno_assn',
    'AraNetv2-CX_pairwise-gene-coexpression': 'gene_coexpr',
    'AraNetv2-DC_domain-co-occurrence': 'domain_co_occur',
    'AraNetv2-HT_high-throughput-ppi': 'ppi_hithru',
    'AraNetv2-LC_lit-curated-ppi': 'ppi_liter',
}
ialarmedalien (Collaborator, Author) commented:

No longer bothering with this remapping; I'll just remap the names in the UI.

Comment on lines +103 to +108
edge_err_msg = "\n".join([
    r"edges.tsv line 3: 'Same-Old-Stuff' is not valid under any of the given schemas",
    r"edges.tsv line 7: '2.' does not match .*?",
    r"edges.tsv line 8: 'raNetv2-DC_' is not valid under any of the given schemas",
    r"edges.tsv line 10: 'score!' does not match .*?"
])
ialarmedalien (Collaborator, Author) commented:

All the errors in the input files get collected, so if the dataset is full of crap, you don't have to run the parser over and over and over again to find all the problems.
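A sketch of that collect-everything pattern using the jsonschema library (the schema and rows are invented for illustration; the parser's actual validation code may differ):

import jsonschema

schema = {'type': 'object', 'required': ['score'],
          'properties': {'score': {'type': 'number'}}}
rows = [{'score': 0.5}, {'score': 'score!'}, {}]

errors = []
validator = jsonschema.Draft7Validator(schema)
for line_no, row in enumerate(rows, start=1):
    for err in validator.iter_errors(row):
        errors.append(f"edges.tsv line {line_no}: {err.message}")
if errors:
    # every problem reported at once, in a single exception
    raise RuntimeError('\n'.join(errors))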

Refactor spec files to use a definitions file
Add tests for duplicated data

if not hasattr(self, '_dataset_schema_dir'):
    dir_path = os.path.dirname(os.path.realpath(__file__))
    self._dataset_schema_dir = os.path.join(dir_path, '../', '../', 'spec', 'datasets', 'djornl')
Contributor commented:

You don't have to change anything, but it may be easier to just use the current working directory and assume/assert that it is always the root of the project.

ialarmedalien (Collaborator, Author) replied:

That would break the tests (and other use cases). This ensures that you'll get the correct dir regardless of where you're calling from.
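To spell out the difference (a sketch; the paths are illustrative): os.getcwd() depends on where the process was launched, while __file__ is fixed at the module's location, so the __file__-based path resolves the same way from anywhere.

import os

cwd_based = os.path.join(os.getcwd(), 'spec', 'datasets', 'djornl')  # varies per invocation
file_based = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                          '..', '..', 'spec', 'datasets', 'djornl')  # always the same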

Contributor replied:

You have to cd within the project to run certain tests?

jayrbolton merged commit 2019ef3 into develop on Aug 28, 2020
jayrbolton deleted the parser_updates branch on Aug 28, 2020 at 20:36
Development

Successfully merging this pull request may close these issues: Docker build version
