This stage of the project focuses on generating non-redundant and unbiased datasets that can be used for training and benchmarking machine learning models. The process involves clustering, representative selection, TSV filtering, and final dataset splitting.
Positive and negative datasets were clustered independently using MMseqs2 in order to reduce redundancy and ensure that no highly similar proteins appear in both training and benchmarking sets.
Main outputs:
- pos_cluster-results_cluster.tsv / neg_cluster-results_cluster.tsv
  Tables mapping each input sequence to its cluster representative.
- pos-cluster-results_rep_seq.fasta / neg-cluster-results_rep_seq.fasta
  FASTA files containing only the representative sequences (one per cluster).
Representative identifiers were retrieved directly from the FASTA headers of the MMseqs2 output.
Generated files:
- rep_positive.ids → list of representative IDs for the positive dataset
- rep_negative.ids → list of representative IDs for the negative dataset
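The ID extraction step can be sketched as follows. This is a minimal illustration, not the project's actual code; it assumes the standard FASTA convention that the identifier is the first whitespace-delimited token after the ">" in each header line.

```python
# Illustrative sketch: collect representative IDs from an MMseqs2
# representative FASTA file (assumes ">ID optional description" headers).
def extract_ids(fasta_path):
    """Return the identifier token of every header line, in file order."""
    ids = []
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                # Drop the ">" and keep only the first token as the ID
                ids.append(line[1:].split()[0])
    return ids
```

An equivalent shell one-liner would be `grep '^>' rep_seq.fasta | cut -d' ' -f1 | tr -d '>'`.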
The script get_tsv.py takes the representative ID lists (.ids) and the cluster mapping tables (*_cluster.tsv) as input. It produces reduced TSV files that contain only the representative entries.
Outputs:
- pos_cluster_results.tsv → positive dataset, reduced to representatives
- neg_cluster_results.tsv → negative dataset, reduced to representatives
These files serve as the non-redundant reference datasets for downstream analysis.
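The filtering performed by get_tsv.py can be sketched roughly as below. This is an assumption about its internals, not the script itself: it assumes the standard two-column MMseqs2 cluster TSV (representative, member, tab-separated) and a plain-text .ids file with one identifier per line.

```python
# Hypothetical sketch of the get_tsv.py step: keep only rows of the
# MMseqs2 cluster table whose member is itself a cluster representative.
def filter_tsv(ids_path, cluster_tsv_path, out_path):
    with open(ids_path) as fh:
        rep_ids = {line.strip() for line in fh if line.strip()}
    with open(cluster_tsv_path) as src, open(out_path, "w") as dst:
        for line in src:
            rep, member = line.rstrip("\n").split("\t")[:2]
            # Representative entries appear as self-mapped rows (rep == member)
            if member in rep_ids:
                dst.write(line)
```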
The script get_sets.py takes the representative ID lists (.ids) as input and generates:
- Training set (80%): used for model training and hyperparameter tuning. Within this set, each sequence is also assigned to one of five cross-validation folds, preserving the positive/negative ratio.
- Benchmarking set (20%): held out and never used during model training, providing an unbiased evaluation of generalization performance.
Outputs in Cross_Validation/:
- pos_train.tsv and neg_train.tsv → training data for positive and negative classes, including fold assignments
- pos_bench.tsv and neg_bench.tsv → benchmarking data for positive and negative classes
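The split-and-fold logic can be sketched as below. This is an illustrative reconstruction, not get_sets.py itself; names and the random seed are assumptions. Performing the 80/20 split and fold assignment separately per class is what preserves the positive/negative ratio within each fold.

```python
import random

# Hypothetical sketch of the get_sets.py logic for ONE class: shuffle the
# representative IDs, take 80% for training (with round-robin fold labels
# 1-5), and hold out the remaining 20% for benchmarking.
def split_ids(ids, train_frac=0.8, n_folds=5, seed=42):
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # fixed seed for reproducibility
    n_train = int(len(ids) * train_frac)
    # Each training sequence gets a fold index cycling through 1..n_folds
    train = [(seq_id, i % n_folds + 1) for i, seq_id in enumerate(ids[:n_train])]
    bench = ids[n_train:]
    return train, bench
```

Running this once on the positive IDs and once on the negative IDs yields class-balanced folds by construction.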
All data preparation steps (clustering, representative selection, TSV filtering, and train/benchmark splitting)
are automated in the script clustering.sh:
bash clustering.sh

MMseqs2 (Many-against-Many sequence searching) was used in this project to:
- Cluster sequences based on ≥30% identity and ≥40% alignment coverage
- Select a single representative per cluster
- Prevent redundancy and overlap between subsets
These thresholds are specifically chosen to minimize the risk of data leakage, which occurs if similar sequences are present in both training and benchmarking datasets.
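A clustering call matching these thresholds would look roughly like the following; the exact invocation inside clustering.sh may differ, and the input filename here is an assumption.

```shell
# Illustrative MMseqs2 invocation for the positive set (input name assumed).
# easy-cluster writes <prefix>_cluster.tsv and <prefix>_rep_seq.fasta
# among its outputs; tmp is a scratch directory.
mmseqs easy-cluster positive.fasta pos_cluster-results tmp \
    --min-seq-id 0.3 -c 0.4 --cov-mode 0
```

Here `--min-seq-id 0.3` enforces the ≥30% identity threshold and `-c 0.4` with `--cov-mode 0` enforces ≥40% bidirectional alignment coverage.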
Official repository: soedinglab/MMseqs2
After clustering and representative selection:
| Dataset | Total sequences | Representative sequences | File containing representatives |
|---|---|---|---|
| Positive | 2,949 | 1,093 | pos_cluster_results.tsv |
| Negative | 20,615 | 8,934 | neg_cluster_results.tsv |
After splitting with get_sets.py:
| Subset | Positive | Negative | Files generated |
|---|---|---|---|
| Training (80%) | 874 | 7,147 | pos_train.tsv, neg_train.tsv |
| Benchmarking (20%) | 219 | 1,787 | pos_bench.tsv, neg_bench.tsv |
Each sequence in the training set is also annotated with a cross-validation fold index (1–5).
Next step → The non-redundant training and benchmarking sets generated in this stage
are used as input for the Data_Analysis module, where exploratory data analysis (EDA) is performed to assess dataset quality, visualize protein length distributions, and identify conserved sequence motifs before proceeding to feature extraction and modeling.