FUSE-PhyloTree identifies sequence conservation modules and associates them with functional annotations through phylogenomic analysis.
It introduces an original approach to characterizing functional motifs. The methodology comprises:
- Detection of conserved sequence modules (using Partial Local Multiple Alignment)
- Phylogenetic inference of species/genes/modules/functions evolutionary histories
- Identification of co-appearances of modules and functions
The process considers protein sequences and their associated functional annotations as input. It then returns the presence of the conserved sequence modules and of the functional annotations in each ancestral gene along the gene phylogeny.
Detailed Pipeline
Detailed flowchart of the pipeline:
---
config:
look: classic
theme: default
themeVariables:
fontSize: 50px
---
flowchart TB
U@{ shape: manual-input, label: "User Input"} --> A
U --> B
U --> C
A@{ shape: doc, label: "sequences.fasta" } --> m1@{ shape: rectangle, label: "modules_segm.py: segmentation_and_module_phylo()"}
m1 --> m2@{ shape: docs, label: "module_segm_dir_seq/" }
m2 --> m3@{ shape: rectangle, label: "tools.py: segmentation()"}
m5@{ shape: flag, label: "paloma-D" } o--o|call| m3
m3 --> m4@{ shape: doc, label: "seq_filename.dot" }
m4 --> m6@{ shape: rectangle, label: "tools.py: modules_fasta()"}
m6 --> m7@{ shape: docs, label: "module_seq/*.fasta" }
m7 --> m8@{ shape: rectangle, label: "tools.py: all_phylo()"}
m9@{ shape: flag, label: "PhyML" } o--o|call| m8
m8 --> m12@{ shape: docs, label: "module_seq/*.tree" }
m12 --> m10@{ shape: rectangle, label: "modules_segm.py: correct_modules_tree()"}
m11@{ shape: flag, label: "TreeFix" } o--o|call| m10
m10 --> m13@{ shape: docs, label: "module_seq/*.treefix.tree" }
m10 --> m14@{ shape: doc, label: "modules_path_modules.txt" }
m13 o--o|describe paths of| m14
B@{ shape: doc, label: "sequences.tree" } --> g1{gene tree as input ?}
g1 -->|yes| g2@{ shape: doc, label: "rooted binary gene tree .tree" }
g1 -->|no| g3@{ shape: rectangle, label: "gene_phylo.py: whole_phylo()"}
A -.-> g3
g4@{ shape: flag, label: "Muscle" } o--o|call| g3
g5@{ shape: flag, label: "TrimAl" } o--o|call| g3
g6@{ shape: flag, label: "PhyML" } o--o|call| g3
g7@{ shape: flag, label: "TreeFix" } o--o|call| g3
g8@{ shape: flag, label: "PhyML (branch len)" } o--o|call| g3
g3 --> g2
A --> s1@{ shape: rectangle, label: "species_phylo.py: taxid_from_fasta()"}
s2@{ shape: cyl, label: "NCBI Taxonomy" } o--o|use| s1
s1 --> s3@{ shape: doc, label: "species.tree" }
A --> d1@{ shape: rectangle, label: "tools.py: known_domains()"}
d1 --> d2@{ shape: doc, label: "domains.csv" }
g2 --> r1@{ shape: rectangle, label: "tools.py: seadog_md()"}
m14 --> r1
s3 --> r1
r2@{ shape: flag, label: "Seadog-MD" } o--o|call| r1
r1 --> r3@{ shape: doc, label: "seadogMD.output" }
r3 --> r4@{ shape: rectangle, label: "integrate_3phylo.py: write_sp_gene_event()"}
r4 --> r5@{ shape: doc, label: "seadogMD_sp_gene_event.csv" }
r4 --> r6@{ shape: doc, label: "seadogMD_gene.tree" }
C@{ shape: doc, label: "functional annotations.csv" } --> a1@{ shape: rectangle, label: "ances_scenario.py: acs_inference()"}
r5 --> a1
r6 --> a1
a1 --> a2@{ shape: doc, label: "pastml_seadogMD.csv" }
a2 --> a3@{ shape: rectangle, label: "tools.py: pastml()"}
a4@{ shape: flag, label: "PastML" } o--o|call| a3
a3 --> a5@{ shape: doc, label: "pastml_seadogMD_combined_ancestral_states.tab" }
a3 --> a6@{ shape: docs, label: "acs_dir_seadogMD/" }
r3 --> i1@{ shape: rectangle, label: "integrate_3phylo.py: __main__()"}
g2 --> i1
a5 --> i1
d2 --> i1
i1 --> i2@{ shape: doc, label: "0_gene.tree" }
i1 --> i3@{ shape: doc, label: "1_modules_and_functions_evolution.csv" }
i1 --> i4@{ shape: doc, label: "2_module_descriptions.csv" }
i1 --> i5@{ shape: docs, label: "3_visuReconc/" }
i1 --> i6@{ shape: docs, label: "working_dir/" }
linkStyle 0 stroke:#670067, stroke-width:25px
linkStyle 1 stroke:#670067, stroke-width:25px
linkStyle 2 stroke:#670067, stroke-width:25px
linkStyle 3 stroke:#EE6677, stroke-width:20px
linkStyle 4 stroke:#EE6677, stroke-width:20px
linkStyle 5 stroke:#EE6677, stroke-width:20px
linkStyle 6 stroke:#EE6677, stroke-width:20px
linkStyle 7 stroke:#EE6677, stroke-width:20px
linkStyle 8 stroke:#EE6677, stroke-width:20px
linkStyle 9 stroke:#EE6677, stroke-width:20px
linkStyle 10 stroke:#EE6677, stroke-width:20px
linkStyle 11 stroke:#EE6677, stroke-width:20px
linkStyle 12 stroke:#EE6677, stroke-width:20px
linkStyle 13 stroke:#EE6677, stroke-width:20px
linkStyle 14 stroke:#EE6677, stroke-width:20px
linkStyle 15 stroke:#EE6677, stroke-width:20px
linkStyle 16 stroke:#EE6677, stroke-width:20px
linkStyle 17 stroke:#EE6677, stroke-width:20px
linkStyle 18 stroke:#BBBBBB, stroke-width:20px
linkStyle 19 stroke:#BBBBBB, stroke-width:20px
linkStyle 20 stroke:#BBBBBB, stroke-width:20px
linkStyle 21 stroke:#BBBBBB, stroke-width:20px
linkStyle 22 stroke:#BBBBBB, stroke-width:20px
linkStyle 23 stroke:#BBBBBB, stroke-width:20px
linkStyle 24 stroke:#BBBBBB, stroke-width:20px
linkStyle 25 stroke:#BBBBBB, stroke-width:20px
linkStyle 26 stroke:#BBBBBB, stroke-width:20px
linkStyle 27 stroke:#BBBBBB, stroke-width:20px
linkStyle 28 stroke:#66CCEE, stroke-width:20px
linkStyle 29 stroke:#66CCEE, stroke-width:20px
linkStyle 30 stroke:#66CCEE, stroke-width:20px
linkStyle 31 stroke:brown, stroke-width:20px
linkStyle 32 stroke:brown, stroke-width:20px
linkStyle 33 stroke:gold, stroke-width:20px
linkStyle 34 stroke:gold, stroke-width:20px
linkStyle 35 stroke:gold, stroke-width:20px
linkStyle 36 stroke:gold, stroke-width:20px
linkStyle 37 stroke:gold, stroke-width:20px
linkStyle 38 stroke:#4477AA, stroke-width:20px
linkStyle 39 stroke:#4477AA, stroke-width:20px
linkStyle 40 stroke:#4477AA, stroke-width:20px
linkStyle 41 stroke:#4477AA, stroke-width:20px
linkStyle 42 stroke:#4477AA, stroke-width:20px
linkStyle 43 stroke:#4477AA, stroke-width:20px
linkStyle 44 stroke:#4477AA, stroke-width:20px
linkStyle 45 stroke:#4477AA, stroke-width:20px
linkStyle 46 stroke:#4477AA, stroke-width:20px
linkStyle 47 stroke:#4477AA, stroke-width:20px
linkStyle 48 stroke:#4477AA, stroke-width:20px
linkStyle 49 stroke:gold, stroke-width:20px
linkStyle 50 stroke:#BBBBBB, stroke-width:20px
linkStyle 51 stroke:#4477AA, stroke-width:20px
linkStyle 52 stroke:#228833, stroke-width:20px
linkStyle 53 stroke:#228833, stroke-width:20px
linkStyle 54 stroke:#228833, stroke-width:20px
linkStyle 55 stroke:#228833, stroke-width:20px
linkStyle 56 stroke:#228833, stroke-width:20px
linkStyle 57 stroke:#228833, stroke-width:20px
classDef tools fill:#E0B0FF,stroke:#000,stroke-width:4px;
class m5,m9,m11,g4,g5,g6,g7,g8,s2,r2,a4,l4 tools
classDef files fill:#fff,stroke:#000,stroke-width:2px;
class m2,m4,m7,m12,m13,m14,g2,s3,r3,r5,r6,a2,a5,a6,d2,l2 files
classDef input stroke:#670067,stroke-width:15px;
class U,A,B,C,i2,i3,i4,i5,i6,l1 input
classDef scripts fill:#9BD184,stroke:#000,stroke-width:2px;
class m1,m3,m6,m8,m10,g3,s1,d1,r1,r4,a1,a3,i1,l3 scripts
Legend of the detailed pipeline
---
config:
look: classic
theme: default
themeVariables:
fontSize: 50px
---
flowchart LR
l1@{ shape: doc, label: "input/output file" } --- l2@{ shape: doc, label: "intermediate file" }
l2 --- l3@{ shape: rectangle, label: "python3_script.py: function()"}
l3 --- l4@{ shape: flag, label: "external software" }
classDef legend1 fill:#E0B0FF,stroke:#000,stroke-width:4px
class l4 legend1
classDef legend2 fill:#fff,stroke:#000,stroke-width:2px;
class l2 legend2
classDef legend3 stroke:#670067,stroke-width:15px;
class l1 legend3
classDef legend4 fill:#9BD184,stroke:#000,stroke-width:2px;
class l3 legend4
❓ Differences from PhyloCharMod: FUSE-PhyloTree is an actively maintained fork of PhyloCharMod, originally developed for studying the ADAMTS-TSL protein family.
Application Note on Bioinformatics
Olivier Dennler, Elisa Chenel, François Coste, Samuel Blanquart, Catherine Belleannée, Nathalie Théret, FUSE-PhyloTree: Linking functions and sequence conservation modules of a protein family through phylogenomic analysis, Bioinformatics, 2025;, btaf479, https://doi.org/10.1093/bioinformatics/btaf479
For a more detailed explanation of the methodology, refer to the following article [Dennler et al. 2023], or the following PhD thesis (only available in French).
To run the full pipeline, simply use:
python3 fuse-phylotree.py <sequences.fasta> <annotations.csv> <gene_tree.tree>
To install and run this command, you need a proper environment. We strongly recommend using the available Docker image (recommended for personal computers) or Singularity image (recommended for high-performance computing clusters).
- <sequences.fasta>: Each sequence header must satisfy the format >SeqID_taxid, where SeqID is the unique sequence identifier and taxid is the NCBI species taxid (e.g., >NP_031426.2_10090).
>NP_00001.1_9913
MKAAPTKARREGALVV
>XP_00002.1_9606
MKVAPTTKAKRGVLVA
>NP_00003.1_7955
MTKARRMTGGIW
>NP_00004.3_9606
MTKARRMTGGLLAV
Refer to this file for an example.
Please refrain from using special characters in the header (such as , | ( ) ` " : ;). Use only _ as a separator.
These files can be generated using orthogroups and GFF files, all of which are included in the Docker and Singularity images for nine species (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus, Xenopus tropicalis, Danio rerio, Ciona intestinalis, Drosophila melanogaster, Caenorhabditis elegans). You only need to provide a file listing the RefSeq IDs of interest. For detailed instructions, please refer to To build a sequence dataset based on orthogroups.
- <annotations.csv>: This file contains the functional annotations associated with the different sequences (e.g., PPI).
Each line should be formatted as SeqID,Annotation_1|Annotation_2. Here, SeqID is the unique sequence identifier. It is separated from the list of annotations by a comma (,), and individual annotations are separated by pipe symbols (|) (e.g., NP_620594.1,P00451_F8|P04275_VWF). Each annotation must be more than one character long.
NP_00001.1,Interact_with_A|Interact_with_B
XP_00002.1,Interact_with_A|Interact_with_B
NP_00003.1,Interact_with_A
NP_00004.3,Interact_with_A
Refer to this file for an example.
- <gene_tree.tree>: This file contains a binary rooted gene tree of the sequences in Newick format.
((NP_00001.1:0.16,XP_00002.1:0.23):0.87,(NP_00003.1:0.35,NP_00004.3:0.42):0.47);
Refer to this file for an example.
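The three input formats described above can be checked for mutual consistency before launching a run. The following is a minimal sketch and not part of FUSE-PhyloTree: all function names are hypothetical, the taxid is assumed to be the digits after the last underscore of the header, and tree leaves are assumed to be named by SeqID alone, as in the examples above. Quoted Newick labels and bracketed comments are not handled.

```python
import re

# >SeqID_taxid: the taxid is assumed to be the digits after the last underscore.
HEADER_RE = re.compile(r"^>(?P<seqid>[A-Za-z0-9_.]+)_(?P<taxid>\d+)$")


def parse_fasta_headers(fasta_text):
    """Map each SeqID to its taxid; raise ValueError on a malformed header."""
    ids = {}
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line.startswith(">"):
            continue  # sequence lines are ignored here
        match = HEADER_RE.match(line)
        if match is None:
            raise ValueError(f"header is not >SeqID_taxid: {line!r}")
        ids[match.group("seqid")] = match.group("taxid")
    return ids


def parse_annotations(csv_text):
    """Map each SeqID to its list of annotations (SeqID,Annot_1|Annot_2)."""
    annotations = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        seqid, _, raw = line.partition(",")
        labels = [label.strip() for label in raw.split("|") if label.strip()]
        too_short = [label for label in labels if len(label) < 2]
        if too_short:
            raise ValueError(f"{seqid}: annotations too short: {too_short}")
        annotations[seqid] = labels
    return annotations


def newick_leaves(newick):
    """Leaf names: unquoted labels that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def is_rooted_binary(newick):
    """True if every internal node, root included, has exactly two children."""
    children = []  # one child counter per currently open parenthesis
    for char in newick:
        if char == "(":
            children.append(1)
        elif char == "," and children:
            children[-1] += 1
        elif char == ")":
            if not children or children.pop() != 2:
                return False
    return not children


def check_inputs(fasta_text, csv_text, newick):
    """Return a list of human-readable problems (empty if none were found)."""
    seq_ids = set(parse_fasta_headers(fasta_text))
    problems = []
    unknown = sorted(set(parse_annotations(csv_text)) - seq_ids)
    if unknown:
        problems.append(f"annotated IDs missing from the FASTA: {unknown}")
    if set(newick_leaves(newick)) != seq_ids:
        problems.append("tree leaves and FASTA SeqIDs differ")
    if not is_rooted_binary(newick):
        problems.append("gene tree is not a rooted binary tree")
    return problems
```

Calling `check_inputs` with the contents of the three files returns an empty list when nothing suspicious is found. An unrooted tree with three children at the root, such as `(A:1,B:1,C:1);`, is reported as not rooted binary.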
💡 Other options
Pre-computed phylogenetic trees and/or paloma module decompositions can be used, as long as they respect the required header format.
usage: fuse-phylotree.py [-h] [--output_directory OUTPUT_DIRECTORY] [--iter ITER] [--ml_thres ML_THRES] [--pres_fthr PRES_FTHR] [--gain_fthr GAIN_FTHR] [--lost_fthr LOST_FTHR]
[--species_tree SPECIES_TREE] [--infer_gene_tree] [--plma_file PLMA_FILE] [--user_pastml_csv USER_PASTML_CSV] [--reconc_domains] [--paloma_args PALOMA_ARGS]
[--phyml_args PHYML_ARGS] [--treefix_args TREEFIX_ARGS] [--raxml_args RAXML_ARGS] [--seadog_args SEADOG_ARGS] [--pastml_args PASTML_ARGS] [--itol] [--itol_api ITOL_API]
[--itol_project_name ITOL_PROJECT_NAME]
multi_fasta_file [leaf_functions_csv] [gene_tree]
positional arguments:
multi_fasta_file Multi fasta file, with specific formated header >RefSeq_taxid (ex : >XP_012810820.2_8364)
leaf_functions_csv csv file containing for each of our sequence, the list of his functions (ex : XP_012810820.2, P59509 | P999999)
gene_tree Gene tree to use as a support for the pastML and DGS reconciliation inference (WARNING, must correspond to the sequences in the multi fasta file !)
optional arguments:
-h, --help show this help message and exit
--output_directory OUTPUT_DIRECTORY
output directory name
--iter ITER Number of times the whole module evolution inference will be performed, ie: module tree inference; their corrections; DGS reconciliation (default: 10)
--ml_thres ML_THRES Module length thresold: minimum module length, shorter modules will be systematically filtered out (default: 5)
--pres_fthr PRES_FTHR
Sets the presence frequency threshold for modules. A module is considered present at a given gene only if its presence frequency is strictly greater than (>) the specified
threshold. Default: 0.0.
--gain_fthr GAIN_FTHR
Sets the gained frequency threshold for modules. A module is considered gained at a given gene only if its gain frequency is strictly greater than (>) the specified threshold.
Default: 0.0.
--lost_fthr LOST_FTHR
Sets the lost frequency threshold for modules. A module is considered lost at a given gene only if its lost frequency is strictly greater than (>) the specified threshold.
Default: 0.0.
--species_tree SPECIES_TREE
Species tree to use as a support for the reconciliations (WARNING, must correspond to the taxid use in the other files !)
--infer_gene_tree Infer gene tree to use as a support for the pastML and DGS reconciliation inference (WARNING, user should check it and reroot it - we advise to only use it if you know what you
are doing !)
--plma_file PLMA_FILE
Paloma-D output file (.agraph format, .dot, or .oplma format)
--user_pastml_csv USER_PASTML_CSV
PastML full input file, corresponding full custom states to use for the different sequence id (.csv format); eg, header: 'id,P59509,P999999', data: 'XP_012810820.2,1,0' or
'NP_001278744.1,0,,' ; unknown states (empty) will be inferred based on known states; sequence id will be converted to fit the reconcilied gene tree ids;
--reconc_domains Do a DGS reconciliation with known modules (pfam / prosite) ; not tested
--paloma_args PALOMA_ARGS
Custom arguments to pass to paloma-D (e.g, --paloma_args "--thr 5 --min-size 5")
--phyml_args PHYML_ARGS
Custom arguments to pass to PhyML for module trees inference (e.g, --phyml_args "--model JTT")
--treefix_args TREEFIX_ARGS
Custom arguments to pass to TreeFix for gene-modules (e.g, --treefix_args "--niter 100 -D 1 -L 1" - corresponds to options inside -E from treefix or to --niter)
--raxml_args RAXML_ARGS
Custom arguments to pass to RaxML (for TreeFix) for gene-modules (e.g, --raxml_args "-m PROTGAMMAJTT" - corresponds at options inside -e from treefix)
--seadog_args SEADOG_ARGS
Custom arguments to pass to SEADOG-MD (e.g, --seadog_args "--DD 5 --DL 1 --DTA 20 --GD 2 --GL 1")
--pastml_args PASTML_ARGS
Custom arguments to pass to PastML (e.g, --pastml_args "--prediction_method ACCTRAN -m JTT")
--itol Upload directly on my itol account (need --itol_api and --itol_project_name)
--itol_api ITOL_API User iTOL api key for batch upload
--itol_project_name ITOL_PROJECT_NAME
iTOL project name where to upload
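The three frequency thresholds in the help text above (--pres_fthr, --gain_fthr, --lost_fthr) all use a strict comparison. The sketch below illustrates what that implies. It assumes that a frequency is the number of iterations in which an event was observed divided by the total number of iterations (--iter); the function name `keep_modules` and the counts are hypothetical and do not come from the FUSE-PhyloTree code.

```python
def keep_modules(counts, n_iter, threshold=0.0):
    """Return {module: frequency} for modules strictly above the threshold.

    counts maps a module name to the number of iterations in which the
    event (presence, gain or loss) was observed at a given gene node.
    """
    if n_iter <= 0:
        raise ValueError("n_iter must be a positive integer")
    frequencies = {module: count / n_iter for module, count in counts.items()}
    # Strict comparison: a frequency equal to the threshold is filtered out.
    return {m: f for m, f in frequencies.items() if f > threshold}


observed = {"B25": 10, "B11": 6, "B40": 0}  # hypothetical counts over 10 iterations
default_kept = keep_modules(observed, n_iter=10)                # threshold 0.0
strict_kept = keep_modules(observed, n_iter=10, threshold=0.6)  # drops B11 (0.6 is not > 0.6)
```

With the default threshold of 0.0, any event seen in at least one iteration is kept, and only events never observed are dropped.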
⚠️ Inferring the gene tree using the pipeline option --infer_gene_tree
When running the analysis without a gene tree as input (with --infer_gene_tree), a default rooted tree will be generated. However, for optimal results, it is strongly recommended to infer a properly rooted gene tree prior to analysis and use it as the input for the gene_tree option. The gene phylogenetic tree serves as the template for the entire analysis, so it is essential to prepare and use a reliable and accurately rooted gene tree.
The main workflow output is the list of modules and functions that are present, gained or lost at the different ancestral genes. This output is presented as a table in the file 1_modules_and_functions_evolution.csv (Example here). It is strongly advised to also look at the final gene tree (with internal node names) 0_gene.tree to visualise the annotated gene nodes. The file 0_gene.tree corresponds to the input phylogenetic tree and assigns an ancestor name to each internal node. The description of all the modules is available in 2_module_descriptions.csv, which lets you retrieve module segments (sequences and positions) from module names. For an interactive visualisation of these data, various iTOL files are generated in 3_visuReconc/ and compressed in 3_visuReconc.zip for batch upload on iTOL.
- 0_gene.tree: This file contains the binary rooted gene tree of the sequences in Newick format with internal node names.
- 1_modules_and_functions_evolution.csv: This file contains the list of modules and functions that are present, gained or lost at the different ancestral and actual genes (i.e., at all nodes in the gene tree) and records the frequency with which each presence, gain, or loss is observed across iterations.
gene               modules_present              function_present  modules_gained  function_gained  modules_lost  function_lost
XP0000041_9606_1   B25:1.0 | B30:1.0            ACAN | MFAP2      –               MFAP2            –             –
XP0000042_10090_2  B25:1.0 | B30:1.0 | B33:1.0  ACAN | VCAN       B33:1.0         VCAN             –             –
G10_1_2            B25:1.0 | B30:1.0            ACAN              B30:0.9         ACAN             B11:0.4       –
G11_10_3           B25:1.0 | B11:0.6            –                 –               –                –             –
- 2_module_descriptions.csv: This file contains all module descriptions: all their segments, proteins, and positions in the protein sequences.
module  protein         start  end   segment
B723    NP001004019.1   888    902   LICARGYHASDDGTK
B723    XP018668509.1   1252   1266  LTCTAGYEISDDGES
B723    NP034310.2      341    355   PNCGRGYHLNEEGTR
B723    NP032018.2      878    892   LVCGRGYHANEEGSE
B419    NP114141.2      2521   2527  QEDEAHH
B419    NP001179466.3   2521   2527  KEDDTHH
B419    XP006529820.1   2544   2550  QEDEAHH
- 3_visuReconc/: This directory contains various iTOL files for the interactive visualisation.
- X_module_function_assoc_stats.csv: This file contains co-presence statistics for every module-function pair. Each row lists a module-function pair, flags whether they co-emerged, and reports the Jaccard overlap, Fisher exact p-value, and odds ratio comparing genes that contain the module with genes annotated for the function.
module  function  co_emergence  co-presence_jaccard_index  co-presence_FET_p_value  co-presence_FET-odds_ratio
B734    MFAP3     False         0.11764705882352941        0.18209949232049583      3.4482758620689653
B734    COL8A1    True          0.07692307692307693        0.3016405747582532       3.888888888888889
B734    FBXW5     True          0.07692307692307693        0.3016405747582532       3.888888888888889
Working directory
- working_dir_*: The intermediate outputs obtained at each pipeline step are provided in a directory called working_dir_* (see this directory for an example).
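The co-presence statistics reported in X_module_function_assoc_stats.csv (Jaccard index, Fisher exact test, odds ratio) can be illustrated on a 2x2 contingency table. The sketch below is one standard way to compute them, not the FUSE-PhyloTree implementation: how the pipeline builds its table is not described here, so the sketch assumes one table per module-function pair counted over a set of genes. The function names and the gene sets are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    total = comb(n, col1)

    def prob(k):
        # Hypergeometric probability of k genes having both module and function.
        return comb(row1, k) * comb(n - row1, col1 - k) / total

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    observed = prob(a)
    # Sum the probability of every table at most as likely as the observed one.
    p_value = sum(prob(k) for k in range(lowest, highest + 1)
                  if prob(k) <= observed * (1 + 1e-9))
    return min(1.0, p_value)


def association_stats(with_module, with_function, all_genes):
    """Jaccard index, Fisher p-value and odds ratio for one module-function pair."""
    genes = set(all_genes)
    module = set(with_module) & genes
    function = set(with_function) & genes
    a = len(module & function)   # module and function
    b = len(module - function)   # module only
    c = len(function - module)   # function only
    d = len(genes) - a - b - c   # neither
    union = a + b + c
    if b * c:
        odds_ratio = (a * d) / (b * c)
    else:
        odds_ratio = float("inf") if a * d else float("nan")
    return {
        "jaccard": a / union if union else 0.0,
        "fet_p_value": fisher_exact_two_sided(a, b, c, d),
        "odds_ratio": odds_ratio,
    }


genes = [f"g{i}" for i in range(1, 9)]  # eight hypothetical genes
stats = association_stats({"g1", "g2", "g3", "g4"}, {"g1", "g2", "g3", "g5"}, genes)
```

For larger tables, `scipy.stats.fisher_exact` provides the same test together with the odds ratio.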
Detailed Output & Working Directory
working_directory
├── 0_gene.tree -----> file contains the binary rooted gene tree of the sequences in newick format with internal node names
├── 1_modules_and_functions_evolution.csv -----> file contains the list of modules/functions present/gained/lost at the different ancestral and actual genes, i.e., for all nodes in the gene tree
├── 2_module_descriptions.csv -----> file contains all module descriptions: all their segments, proteins, and positions in the protein sequences
├── 3_visuReconc -----> All itol visualisation files
│ ├── geneReconc_seadogMD_*.tree -----> The "final" gene tree after treefix correction / internal node labelling by seadog mDGS reconciliation / branch length computing by PhyML
│ ├── itolAnnotPresence_seadogMD_*.txt -----> Annotation presences in leaves as heatmap
│ ├── itolBarModulesNb_seadogMD_*.txt -----> Module number in leaves as barplot
│ ├── itolDomains_seadogMD_*.txt -----> Domain decompositions as domain mosaics
│ ├── itolGOt_seadogMD_*.txt -----> Annotation presence as symbols
│ ├── itolModPresence_seadogMD_*.txt -----> Module presences in leaves as heatmap
│ ├── itolModTransfer_seadogMD_*.txt -----> Module transfers as arrows
│ ├── itol_modules_PieGainsLost_seadogMD_*.txt -----> Number of modules gained / lost at each gene node as pie charts
│ ├── itolModulesPresent_G*_*_*_only_mod.txt -----> For each gene node * : all module segments in actual gene present at this * (ancestral) gene as domain mosaics (brown squares)
│ ├── itolModules_seadogMD_*.txt -----> Module decompositions as domain mosaics
│ ├── itolModulesThatChanged_G*_*_*_only_mod.txt -----> For each gene node * : all module segments in actual gene gained at this * (ancestral) gene as domain mosaics (green squares)
│ ├── itolPopup_seadogMD_*.txt -----> Gene nodes popup with module and annotation descriptions
│ ├── itol_ppi_PieGainsLost_seadogMD_*.txt -----> Number of annotations (e.g., PPI) gained / lost at each gene node as pie charts
│ └── itolSpGeneEvents_seadogMD_*.txt -----> Gene - Species reconciliation events as symbols
├── 3_visuReconc.zip -----> All itol visualisation files compressed for itol batch upload
└── working_dir_*
├── gene.fasta -----> The input fasta file
├── *_species.tree -----> Species tree extracted from ncbi taxonomy
├── leaf_Manual.csv -----> The input with gene functional annotation file
├── gene.tree -----> The input gene tree file (if given as input)
│
├── iter_module_output -----> All module presence/gain/lost observed frequencies across iterations
│ ├── gene_module_presence_freq.csv -----> All module-gene presence frequencies observed
│ ├── gene_module_gain_freq.csv -----> All module-gene gain event frequencies observed
│ └── gene_module_lost_freq.csv -----> All module-gene lost event frequencies observed
│
├── acs_dir_seadogMD_*_gene -----> All ancestral scenario reconstruction files from pastML
│ ├── leaf_Manual.csv -----> The input with gene functional annotation file
│ ├── pastml_seadogMD_*_gene_leaf_Manual_combined_ancestral_states.tab -----> pastML output regrouping all annotation ancestral states
│ ├── pastml_seadogMD_*_gene_leaf_Manual.csv -----> The input annotation formatted as a {0,1,?} matrix for pastML use
│ ├── seadogMD_*_gene_pastml -----> pastML outputs
│ │ ├── marginal_probabilities.character_*.model_*.tab
│ │ ├── params.character_*.method_MPPA.model_*.tab
│ ├── seadogMD_*_gene.tree -----> The "final" gene tree after treefix correction / internal node labelling by seadog mDGS reconciliation / branch length computing by PhyML
│ └── seadogMD_*_sp_gene_event.csv -----> Gene nodes event from Species - Gene reconciliation (e.g., Gene duplication, Speciation)
│
├── gene_phylo_dir_* -----> All tree construction files
│ ├── gene.fasta -----> The input fasta file
│ ├── gene.tree -----> The "final" gene tree after treefix correction / internal node labelling by seadog mDGS reconciliation / branch length computing by PhyML
│ ├── *_species.tree -----> Species tree extracted from ncbi taxonomy
│ ├── muscle_*.fasta -----> Alignment file from muscle
│ ├── phyml_trimal_muscle_*.tree -----> Tree generated by PhyML after alignment with MUSCLE and filtration with trimAl
│ ├── phyml_trimal_muscle_*_treefix_dir -----> Directory containing TreeFix tree correction files
│ │ ├── phyml_trimal_muscle_*.fasta -----> Sequence alignment used for tree correction
│ │ ├── phyml_trimal_muscle_*.tree -----> Initial tree generated by PhyML before correction
│ │ └── phyml_trimal_muscle_*.treefix.tree -----> Tree corrected after treatment with TreeFix
│ ├── phyml_trimal_muscle_*_treefix_dir.smap -----> Node correspondence file between initial tree and corrected tree
│ ├── phyml_trimal_muscle_*_treefix_dir_treeToFixPath.txt -----> Path of the file containing the tree to be corrected
│ ├── trimal_muscle_*.fasta -----> Sequence alignment after filtration with trimAl
│ ├── trimal_muscle_*.phylip -----> Phylip file containing filtered alignment
│ ├── trimal_muscle_*.phylip_phyml_stats.txt -----> Phylogenetic model statistics used by PhyML
│ └── trimal_muscle_*.phylip_phyml_tree.txt -----> Phylogenetic tree generated by PhyML from Phylip alignment
│
├── gene_tree_*
│ └── gene.tree -----> The "final" gene tree after treefix correction / internal node labelling by seadog mDGS reconciliation / branch length computing by PhyML
│
├── modulesChange_seadogMD_*.csv -----> Lists of modules gained / lost at each gene nodes
├── modulesCompo_seadogMD_*.csv -----> Lists of modules presents at each gene nodes
├── modules_paths_modules_*_t*m*M*.txt -----> Define list of module trees to use as seadog input
├── modules_segm_dir_* -----> All modules decomposition files: paloma plma -> for all modules: module fasta -> PhyML tree -> treefix corrected tree
│ ├── *.dnd
│ ├── *.fasta
│ ├── *_t*m*M*.dot -----> The plma file (from input or computed by paloma)
│ ├── *_t*m*M*.oplma
│ ├── *_t*m*M*.plma
│ ├── *_t*m*M*_q*.agraph -----> The plma file (from input or computed by paloma)
│ ├── *_t*M*.afc
│ ├── *.weights
│ └── modules_*_t*m*M*
│ ├── B*.fasta -----> A module fasta file
│ ├── B*.phylip -----> A module in phylip format
│ ├── B*.phylip_phyml_stats.txt -----> PhyML output
│ ├── B*.phylip_phyml_tree.txt -----> PhyML output
│ ├── B*.tree -----> PhyML output
│ ├── B*_treefix_dir -----> treefix working directory
│ │ ├── B*.fasta -----> Module fasta file
│ │ ├── B*.tree -----> Module PhyML tree
│ │ └── B*.treefix.tree -----> Module treefix corrected tree - "final"
│ ├── B*_treefix_dir.smap -----> Gene - Module nodes mapping
│ └── B*_treefix_dir_treeToFixPath.txt -----> Define treefix input
│
├── domains_*.csv -----> Known domains/motifs from scans
├── complete_functionChange_moduleChange_seadogMD_*.csv -----> Table of all annotations and modules gained / lost at each gene nodes (actual and ancestral)
├── functionChange_moduleChange_expand_seadogMD_*.csv -----> Table of modules gained / lost at each gene nodes where there is annotation changes (actual and ancestral) with all module segment details in leaf
├── functionChange_moduleChange_seadogMD_*.csv -----> Table of modules gained / lost at each gene nodes where there is annotation changes (actual and ancestral)
├── seadogMD_*_gene.tree -----> Gene tree from seadog mDGS, with internal gene labelled
├── seadogMD_*.output -----> Seadog mDGS output file
├── seadogMD_*_sp_gene_event.csv -----> Gene nodes event from Species - Gene reconciliation (e.g., Gene duplication, Speciation)
└── specieGeneEvent_seadogMD_*.csv -----> Gene nodes event from Species - Gene reconciliation (e.g., Gene duplication, Speciation)
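Among the files listed above, 2_module_descriptions.csv is convenient to query by module name. The following sketch groups its segments per module. It assumes a comma-delimited file with a header row using the column names shown earlier (module, protein, start, end, segment); the delimiter is a guess to be checked against your own output, and the function name `segments_by_module` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def segments_by_module(handle, delimiter=","):
    """Group (protein, start, end, segment) tuples by module name."""
    modules = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        modules[row["module"]].append(
            (row["protein"], int(row["start"]), int(row["end"]), row["segment"])
        )
    return dict(modules)


# In-memory stand-in for the real file, using rows from the example table.
example = io.StringIO(
    "module,protein,start,end,segment\n"
    "B723,NP001004019.1,888,902,LICARGYHASDDGTK\n"
    "B419,NP114141.2,2521,2527,QEDEAHH\n"
    "B419,XP006529820.1,2544,2550,QEDEAHH\n"
)
modules = segments_by_module(example)
```

With a real run, pass an open file instead, for example `open("2_module_descriptions.csv", newline="")`. In the example rows, each segment length equals end - start + 1, which suggests inclusive positions.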
For ease of use, considering the various software and dependencies required, we strongly recommend using our Docker image. After installing Docker, you can pull our Docker image using the following command:
docker pull ghcr.io/ocmalde/fuse-phylotree:v1.1.0
First, you need to create a container, linked to your workspace, and connect to it interactively:
docker run -it --entrypoint /bin/bash -v "$(pwd):/workspace" -w /workspace ghcr.io/ocmalde/fuse-phylotree:v1.1.0
Then you can simply execute the whole pipeline using:
python3 /fuse-phylotree/fuse-phylotree.py <sequences.fasta> <annotations.csv> <gene_tree.tree>
To leave the container, simply type exit.
Example with test files provided in the Docker image (you need to connect to the container first):
cd test_dir/ && python3 ../fuse-phylotree/fuse-phylotree.py --infer_gene_tree 712buddy37seq.fasta leaf_Manual_712.csv
If you already made a container, you can obtain its <CONTAINER ID> using docker ps -a.
Using the <CONTAINER ID>, you can connect to the existing container with:
docker start <CONTAINER ID> && docker attach <CONTAINER ID>
To move files from or to the container, use:
docker cp <CONTAINER ID>:/path/in/container/ /path/in/local
On most High Performance Computing clusters, you can directly pull our Singularity image using the following command:
singularity pull fuse-phylotree.sif https://github.com/OcMalde/fuse-phylotree/releases/download/V1.1.0/fuse-phylotree.sif
singularity exec fuse-phylotree.sif python3 /fuse-phylotree/fuse-phylotree.py <sequences.fasta> <annotations.csv> <gene_tree.tree>
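When many runs have to be launched, for instance on a cluster, the command line above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of FUSE-PhyloTree. It only builds the argument list; the option names passed as keyword arguments are expected to be those listed in the usage text (for example output_directory, iter, infer_gene_tree).

```python
import shlex


def build_command(fasta, annotations, gene_tree=None, image=None,
                  script="/fuse-phylotree/fuse-phylotree.py", **options):
    """Build the argument list for one pipeline run.

    image:   path to a Singularity image; if given, the call is wrapped
             in `singularity exec <image>`.
    options: pipeline options without the leading dashes; True adds a bare
             flag, None or False skips the option, anything else is passed
             as the option value.
    """
    command = ["singularity", "exec", image] if image else []
    command += ["python3", script]
    for name, value in options.items():
        if value is True:
            command.append(f"--{name}")
        elif value is not None and value is not False:
            command += [f"--{name}", str(value)]
    command += [fasta, annotations]
    if gene_tree:
        command.append(gene_tree)
    return command


cmd = build_command("sequences.fasta", "annotations.csv", "gene_tree.tree",
                    image="fuse-phylotree.sif", output_directory="run1", iter=5)
printable = shlex.join(cmd)  # single string, safe to paste into a shell
```

The list can then be executed with `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names.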
Using a simple list of RefSeq protein IDs and a FUSE-PhyloTree Docker or Singularity image, you can easily generate a formatted FASTA file for the pipeline. This file will contain the longest protein sequence for each homolog (both orthologs and paralogs) in your input family. Simply run the following script:
Usage: make_orthogroup_fasta.sh <id_seq_file> <name> <docker_image|singularity_image_path>
📑 <id_seq_file> short example:
NP_001987.2
NP_006476.2
NP_006477.2
NP_006478.2
...
Refer to this file for the fibulin file example.
Specific modules of the pipeline can be executed independently (see their --help for all usage details).
For example, to compute a phylogenetic tree using Muscle/Trimal/PhyML/Treefix:
python3 /fuse-phylotree/gene_phylo.py <fasta_file> <species_tree>
To only compute the final integration module on computed files:
python3 /fuse-phylotree/integrate_3phylo.py <seadogMD.output> <gene_tree.tree> --pastml_tab <pastml_seadogMD_combined_ancestral_states.tab> --domains_csv <domains.csv>
Example: Running only the final integration step on the fibulin working directory
Move in the working directory (available here):
cd dir_fibuline_phylocharmod
and execute the integration module with:
python3 /fuse-phylotree/integrate_3phylo.py seadogMD_fibulin59.output gene_tree_fibulin59/fibulin59.tree --pastml_tab acs_dir_seadogMD_fibulin59_gene/pastml_seadogMD_fibulin59_gene_leaf_Manual_fibulin59_combined_ancestral_states.tab --domains_csv domains_fibulin59.csv
All the included software can be used through the Docker or Singularity image. For example, paloma-D can be used with:
docker start <CONTAINER ID> && docker exec <CONTAINER ID> /bin/bash -c ". ~/.bashrc && paloma-D --help && exit"
Or simply paloma-D --help if you are already connected to a container.
With Singularity:
singularity exec fuse-phylotree.sif paloma-D --help
All these programs are required to run FUSE-PhyloTree and must be located in:
/usr/local/bin/
If not, their paths must be specified in the config file:
fuse-phylotree/config.txt
Muscle (v3.8.31), one of the best-performing multiple alignment programs, Conda package
PhyML (v3.3.20190909), maximum likelihood phylogenetic inference for the gene and the module trees, Conda package
TreeFix (v1.1.10), Statistically Informed Gene Tree Error Correction Using Species Trees, Conda package
trimAL (v1.4.rev15), Alignment filtering, Conda package
SEADOG-MD, for DGS-reconciliation.
PastML (v1.9.41), for ancestral characters inference, Pip package
paloma-D (v0.1), for sequence segmentation.
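For a manual installation, a quick way to see which of the programs listed above are reachable is sketched below. It looks in /usr/local/bin/ first and then on the PATH. Apart from paloma-D, the executable names in the usage note are assumptions and may differ on your system; the function name `find_tools` is hypothetical, and the sketch does not read fuse-phylotree/config.txt.

```python
import shutil
from pathlib import Path


def find_tools(names, directory="/usr/local/bin"):
    """Map each program name to the path where it was found, or None."""
    found = {}
    for name in names:
        candidate = Path(directory) / name
        if candidate.is_file():
            found[name] = str(candidate)       # present in the expected directory
        else:
            found[name] = shutil.which(name)   # fall back to the PATH, may be None
    return found
```

A typical call would be `find_tools(["muscle", "phyml", "treefix", "trimal", "pastml", "paloma-D"])`. Any entry mapped to `None` was not found, and its location would have to be given in the config file.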
