Roundtrip differences

I ran a roundtrip starting from the files [here](https://github.com/NuriaQueralt/ngly1-graph/tree/master/neo4j-graphs/ngly1-v3.1/import/ngly1), and then dumped the result back out (to nodes_out.csv and edges_out.csv).

**Comparing nodes file**
- The input file has both a "synonyms:IGNORE" column and a "name" column. I merged "name" into the synonyms and cannot separate it back out, so in the output file the name column is always blank and the synonyms column may contain what used to be in "name".
- Missing values are blank in the output and "NA" in the input. Is this important?
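
If the NA-versus-blank difference turns out to matter, one option is to normalize the input before comparing. This is only a minimal sketch: the sample rows and the `normalize_na` helper are made up for illustration, and splitting on every comma assumes no quoted field contains a comma, which may not hold for the real files.

```shell
# Hypothetical sample rows standing in for a nodes file (not real data).
tmp=$(mktemp -d)
printf '%s\n' 'id,name,synonyms' 'X:1,NA,foo' 'X:2,bar,NA' 'X:3,NAD,NA' > "$tmp/nodes_in.csv"

# Blank out every field that is exactly "NA" (a value like "NAD" is left alone).
# Assumption: plain comma-separated fields with no quoted commas.
normalize_na() {
  awk 'BEGIN { FS = OFS = "," }
       { for (i = 1; i <= NF; i++) if ($i == "NA") $i = ""; print }' "$1"
}

normalized=$(normalize_na "$tmp/nodes_in.csv")
printf '%s\n' "$normalized"
```

The normalized file could then be compared against nodes_out.csv column by column, instead of only by ID.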

Looking only at the IDs
$ cut -f1 -d, nodes_out.csv | sort > nodes_out_id.csv
$ cut -f1 -d, ngly1_concepts.csv | sort > ngly1_concepts_sort.csv
$ diff nodes_out_id.csv ngly1_concepts_sort.csv
Result: everything is there except for the 4 items with huge IDs (https://github.com/SuLab/Krusty/issues/2)

**Comparing edges file**
- I ignored the column "reference_date" in the input file.
- Nuria's file has some edges where the prop is "None". I ignored those as well.

$ cut -f1-3 -d, edges_out.csv | sort > edges_out_id.csv
$ cut -f1-3 -d, ngly1_statements.csv | grep -v ",None," | sort > ngly1_statements_id.csv
$ wc -l edges_out_id.csv ngly1_statements_id.csv
  786913 edges_out_id.csv
  791161 ngly1_statements_id.csv
We're missing 4248 lines... 
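
To see the missing lines themselves rather than just the count, `comm` is one alternative to `diff`. The sketch below uses small made-up files in place of edges_out_id.csv and ngly1_statements_id.csv; the file names and rows are hypothetical. `comm` requires both inputs to be sorted with the same collation, hence the `LC_ALL=C`.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-ins for the two sorted ID files (not real data).
printf '%s\n' 'A:1,p,B:1' 'A:2,p,B:2' \
  | LC_ALL=C sort > "$tmp/out_id.csv"
printf '%s\n' 'A:1,p,B:1' 'A:1,p,B:1' 'A:2,p,B:2' 'A:3,q,B:3' \
  | LC_ALL=C sort > "$tmp/in_id.csv"

# -13 suppresses lines unique to the first file and lines common to both,
# leaving only lines that appear in the input but not in the output.
# Duplicates are matched pairwise, so an edge present twice in the input
# and once in the output is reported once.
missing=$(LC_ALL=C comm -13 "$tmp/out_id.csv" "$tmp/in_id.csv")
printf '%s\n' "$missing"
```

Unlike the zsh-only `=( )` substitution used below, this form also runs in plain sh or bash.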

Which subj IDs am I missing?
$ diff -U0 =(cut -f1 -d, edges_out_id.csv) =(cut -f1 -d, ngly1_statements_id.csv) |  grep -E "^\+" | uniq -c
```
      1 +FlyBase:FBgn0000180
      1 +HGNC:17646
      1 +HGNC:633
   2827 +HGNC:6914
   1402 +HGNC:8031
      2 +MGI:102709
      1 +MGI:103201
      1 +RGD:2141
      3 +RGD:2280
      1 +SGD:S000000763
      1 +UniProt:O94778
      1 +UniProt:P29972
      1 +UniProt:P30301
      1 +UniProt:P41181
      1 +UniProt:P55064
      1 +UniProt:P55087
      1 +UniProt:Q13520
      1 +UniProt:Q9UKM7
```
We're missing 2827 edges from HGNC:6914 and 1402 from HGNC:8031, which we already [know](https://github.com/SuLab/Krusty/issues/1) about.
What are the 19 others?

$ diff -U0 =(grep -v HGNC:6914 edges_out_id.csv | grep -v HGNC:8031 | cut -f2 -d,) =(grep -v HGNC:6914 ngly1_statements_id.csv | grep -v HGNC:8031 | cut -f2 -d,) |  grep -E "^\+" | uniq -c

      1 +RO:0002200
      2 +RO:0002331
      7 +colocalizes_with
      1 +contributes_to
      8 +rdf:type

The rdf:type issue: https://github.com/SuLab/Krusty/issues/5
I know about colocalizes_with and contributes_to (https://github.com/NuriaQueralt/ngly1-graph/issues/3)

The other two (RO:0002200 and RO:0002331) look like weird edge cases. For example:
```
FlyBase:FBgn0000180,RO:0002200,FBcv:0000435,NA,NA,NA,has phenotype,NA,http://purl.obolibrary.org/obo/RO_0002200
FlyBase:FBgn0000180,RO:0002200,FBcv:0000435,https://www.ncbi.nlm.nih.gov/pubmed/15534205,This edge comes from the Monarch Knowledge Graph 2018.,NA,has phenotype,NA,http://purl.obolibrary.org/obo/RO_0002200
```
There are two lines for the same edge in the input file: one has no ref and one does. In Wikidata they become a single edge, so on output we end up with one line instead of two. This isn't an issue, as we aren't actually missing anything.
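
A quick way to check how many lines this kind of duplication accounts for is to look for repeated (subject, predicate, object) triples in the input. This is a rough sketch under assumptions: the rows are modelled on the example above but trimmed to four columns, and the `X:1` row is invented for contrast.

```shell
tmp=$(mktemp -d)
# Rows modelled on the example above, trimmed to 4 columns; the X:1 row is hypothetical.
printf '%s\n' \
  'FlyBase:FBgn0000180,RO:0002200,FBcv:0000435,NA' \
  'FlyBase:FBgn0000180,RO:0002200,FBcv:0000435,https://www.ncbi.nlm.nih.gov/pubmed/15534205' \
  'X:1,RO:0002331,Y:1,NA' > "$tmp/statements.csv"

# Keep only the triple, sort, and print each triple that occurs more than once.
dup_triples=$(cut -f1-3 -d, "$tmp/statements.csv" | LC_ALL=C sort | uniq -d)
printf '%s\n' "$dup_triples"
```

Running the same pipeline on ngly1_statements.csv should list the edges that collapse into one on the roundtrip.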

