We already have some tests for data preprocessing. However, those are integration tests that capture the behaviour of the tool as a whole rather than unit tests for specific functions.
In order to test the different preprocessing functionalities efficiently, we need to add some smaller-scale unit tests. These should not rely on real data, but on sample input values that can be generated from scratch.
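As a minimal sketch of what such a small-scale unit test could look like: the function under test here (`pad_ragged`) is a hypothetical stand-in, not the actual implementation, but it mirrors the kind of ragged-row padding a collator has to perform, and the input is hand-written sample data rather than real data.

```python
def pad_ragged(rows, fill=0):
    # Hypothetical helper (not the real implementation): pad
    # variable-length token lists to the length of the longest row.
    width = max(len(r) for r in rows)
    return [r + [fill] * (width - len(r)) for r in rows]


def test_pads_to_longest_row():
    # Sample input generated from scratch -- no real dataset needed.
    rows = [[1, 2, 3], [4], [5, 6]]
    assert pad_ragged(rows) == [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```

Tests in this style run under plain `pytest` and need no fixtures or on-disk data.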
Here are the classes / functions that should be covered (from the implementation in the protein_prediction branch):
reader.py:
- DataReader:
to_data()
- ChemDataReader:
_read_data()
- DeepChemDataReader:
_read_data()
- SelfiesReader:
_read_data()
- ProteinDataReader:
_read_data()
collate.py:
- DefaultCollator:
__call__()
- RaggedCollator:
__call__(), process_label_rows()
datasets/base.py:
- XYBaseDataModule:
_filter_labels()
- DynamicDataset:
get_test_split(), get_train_val_splits_given_test()
datasets/chebi.py:
- _ChEBIDataExtractor:
_extract_class_hierarchy(), _graph_to_raw_dataset(), _load_dict(), _setup_pruned_test_set()
- ChEBIOverX:
select_classes()
- ChEBIOverXPartial:
extract_class_hierarchy()
term_callback()
datasets/go_uniprot.py:
- _GOUniprotDataExtractor:
_extract_class_hierarchy(), term_callback(), _graph_to_raw_dataset(), _get_swiss_to_go_mapping(), _load_dict()
- _GoUniProtOverX:
select_classes()
datasets/tox21.py:
- Tox21MolNet:
setup_processed(), _load_data_from_file()
- Tox21Challenge:
setup_processed(), _load_data_from_file(), _load_dict()
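For the split-related methods (get_test_split(), get_train_val_splits_given_test()), the properties worth asserting are easy to state without real data: the splits should be disjoint, cover all indices, and be deterministic for a fixed seed. A sketch, using an assumed helper rather than the actual method signatures:

```python
import random


def split_indices(n, test_frac=0.2, seed=0):
    # Hypothetical stand-in for a seeded train/test split:
    # shuffle deterministically, then cut off the test fraction.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * test_frac)
    return idx[cut:], idx[:cut]


def test_split_is_disjoint_and_complete():
    train, test = split_indices(10)
    assert len(train) == 8 and len(test) == 2
    assert set(train) & set(test) == set()
    assert set(train) | set(test) == set(range(10))


def test_split_is_deterministic():
    assert split_indices(10, seed=42) == split_indices(10, seed=42)
```

The same three assertions (disjoint, complete, deterministic) carry over directly to the real methods, whatever their exact signatures turn out to be.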
For some functions, it is necessary to read from / write to files. Instead of real files, I would suggest using mock objects (see e.g. this comment).
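For the file-reading methods, `unittest.mock.mock_open` lets a test feed in file contents as a string, so no fixture files are needed. The reader function below is a hypothetical line-based parser, not the actual `_read_data()` implementation; the point is the mocking pattern:

```python
from unittest import mock


def read_smiles_lines(path):
    # Hypothetical stand-in for a reader's _read_data():
    # one record per non-empty line of the input file.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def test_read_data_with_mocked_file():
    fake_contents = "CCO\nc1ccccc1\n"
    # Patch builtins.open so no file is touched on disk.
    with mock.patch("builtins.open", mock.mock_open(read_data=fake_contents)):
        assert read_smiles_lines("any/path.smi") == ["CCO", "c1ccccc1"]
```

For the writing side, the same patched handle can be inspected via `handle().write.call_args_list` to assert what would have been written.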