Conversation
openfe_benchmarks/data/data_generation/generate_industry_ref_data.py
Outdated
Show resolved
Hide resolved
openfe_benchmarks/data/data_generation/generate_industry_ref_data.py
Outdated
Show resolved
Hide resolved
| if system_group not in available_groups: | ||
| raise ValueError(f"System group {system_group} not found. Available groups: {available_groups}") | ||
| # check the system name | ||
| group_data = ref_dg_data[ref_dg_data["system group"] == system_group].reset_index(drop=True) |
There was a problem hiding this comment.
Would we also need to do the name conversions here, similar to the network script?
There was a problem hiding this comment.
Do you mean the name of the output file?
There was a problem hiding this comment.
Ah yes good catch I think we would need to do the reverse to make them match. Or would it make sense to go the other way and clean the ligand names to remove the spaces?
There was a problem hiding this comment.
Do we still rely on the names nowadays? My understanding is that this is mostly fixed in openfe now. If so, it might be better to just have the original names from the original publications (with spaces if necessary)
There was a problem hiding this comment.
Yes thats how we are matching the reference data to the edges, we could store other identifiers like smiles and inchikey to help mitigate the effects of naming issues though.
jaclark5
left a comment
There was a problem hiding this comment.
LGTM but I'll leave it for @hannahbaumann to approve. The only concern I have is the the names experimental_binding_data.json is really general if there happens to be additional data to add in the future. We could handle that just like the network files where the stem of the filename is the key to a dictionary. These files could be more specific if they were named something like experimental_binding_data_schrodinger.json.
After this is merges I can draft a PR with an attribute BenchmarkData.experimental_binding_data.
The data factory is now merged so if you want to add a tag to the yaml "exp_bfe" or something, the CI will pick it up if you put the conditions for the tags in the header of test_benchmark_index.py with ("exp_bfe", ["experimental_binding_data_*.json"]) or whatever glob should be associated with the tag.
I'm not sure I understand - why would there be experimental data specificly from Schrodinger? All that Schrodinger did was gather the experimental data from published sources. |
| if system_group not in available_groups: | ||
| raise ValueError(f"System group {system_group} not found. Available groups: {available_groups}") | ||
| # check the system name | ||
| group_data = ref_dg_data[ref_dg_data["system group"] == system_group].reset_index(drop=True) |
There was a problem hiding this comment.
Do we still rely on the names nowadays? My understanding is that this is mostly fixed in openfe now. If so, it might be better to just have the original names from the original publications (with spaces if necessary)
| "pint_unit_registry": "openff_units" | ||
| }, | ||
| "uncertainty": { | ||
| "magnitude": 0.0, |
There was a problem hiding this comment.
Are we happy having to assert of near zero floats to be "no experimental uncertainty" or would it be better to use -1 like what David Hahn did in the PLB?
Or maybe even better, don't have the dictionary entry if there's not uncertainty?
There was a problem hiding this comment.
Good point lets just remove the entry to avoid any confusion in future if we have no data for it.
@IAlibay Yes that my quick naive suggestion to make a point that something more specific would allows additional data to follow a similar naming scheme. However @jthorton are the dois or information on where this data came from recorded anywhere? If you don't want to reference the compiled dataset it seems to me that you need to include the direct ref for the experimental data. |
|
Adding some kind of metadata to the json files with provenance would be good - ideally we should point to e.g. the original paper and the Ross et al one. |
…ata.py Co-authored-by: Hannah Baumann <43765638+hannahbaumann@users.noreply.github.com>
…ata.py Co-authored-by: Hannah Baumann <43765638+hannahbaumann@users.noreply.github.com>
|
If we are happy with this format, I'll look into adding the DOI metadata for at least the Ross et al paper. |
|
@jthorton format looks great to me! |
IAlibay
left a comment
There was a problem hiding this comment.
From a quick look, this seems good to me - will defer to the others for final approval though.
Adds a JSON to each folder with the experimental binding free energy data tagged with openff units, this can be loaded with gufe.
I plan on adding some helper functions in another PR to load the data into a DF with units for the analysis scripts.