A Snakemake pipeline for quantifying transposable element (TE) presence across one or more genome assemblies. Given a TE consensus FASTA and a set of genome FASTAs, the pipeline uses BLAT to find all full and partial TE copies, summarizes them per-species and across species, and produces UCSC Genome Browser–ready files for visualization.
The pipeline:
- BLATs each TE consensus sequence against each genome assembly
- Reconstructs full TE copies from BLAT hits
- Computes per-TE statistics: number of genomic regions, total nucleotides covered, full vs. partial copy counts
- Aligns found copies back to the consensus (via minimap2) and converts to indexed BAM
- Converts the TE consensus to a 2bit file and generates UCSC hub files (assembly hub, track hub, annotation hub)
- Combines per-species summaries into a single cross-species TSV
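
As a rough illustration of the per-TE statistics step, the sketch below tallies regions, covered nucleotides, and full vs. partial hits from PSL/PSLX rows. It is not the pipeline's actual code. It assumes the TE consensus is the BLAT query and the genome is the target, it treats each hit as one region (no merging of adjacent or overlapping hits), and the 90% cutoff for a "full" copy is a hypothetical threshold. The function name `summarize_psl` is likewise made up for this example.

```python
from collections import defaultdict

# 0-based column positions in a PSL/PSLX row, per the UCSC PSL format.
Q_NAME, Q_SIZE, Q_START, Q_END = 9, 10, 11, 12
T_START, T_END = 15, 16

# Assumption: a hit is "full" if it spans at least 90% of the consensus.
FULL_FRACTION = 0.9


def summarize_psl(lines, full_fraction=FULL_FRACTION):
    """Return {te_name: {"regions", "nucleotides", "full", "partial"}}."""
    stats = defaultdict(
        lambda: {"regions": 0, "nucleotides": 0, "full": 0, "partial": 0}
    )
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional PSL header and anything that is not a data row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        entry = stats[fields[Q_NAME]]
        q_span = int(fields[Q_END]) - int(fields[Q_START])
        t_span = int(fields[T_END]) - int(fields[T_START])
        entry["regions"] += 1
        entry["nucleotides"] += t_span  # genomic span of this hit
        if q_span >= full_fraction * int(fields[Q_SIZE]):
            entry["full"] += 1
        else:
            entry["partial"] += 1
    return dict(stats)
```

Called as `summarize_psl(open("<species>_blat_hits.pslx"))`, this returns one small dictionary per TE name. Because hits are not merged, overlapping hits would be counted more than once in the nucleotide total.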
| Input | Description |
|---|---|
| Genome FASTA folder | Directory containing one .fa file per genome assembly |
| TE consensus FASTA | FASTA of TE consensus sequences (e.g. from Dfam or RepBase/DRoSophila REPbase) |
| BigBed annotation file | .bb file annotating internal regions, LTRs, and protein-coding domains relative to the TE consensus |
| Config file | summarize_TE_config.yaml — paths, species list, and email |
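
Since the species list in the config has to line up with the .fa files in the genome folder, a quick pre-flight check can catch mismatches before a run. The helper below is a minimal sketch and not part of the pipeline. The name `missing_genomes` is hypothetical, and it assumes every genome file uses the `.fa` extension exactly.

```python
from pathlib import Path


def missing_genomes(genome_fasta_folder, all_species):
    """Return the species names that have no <species>.fa file in the folder."""
    folder = Path(genome_fasta_folder)
    # Path.stem drops only the final extension, so "dmel.fa" -> "dmel".
    present = {p.stem for p in folder.glob("*.fa")}
    return [name for name in all_species if name not in present]
```

For example, `missing_genomes("/path/to/genome/fastas", ["species_name_1", "species_name_2"])` returns an empty list when both FASTA files are present.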
output/
├── combined_summary.tsv # cross-species summary
├── species_summaries/
│   ├── <species>_blat_hits.pslx # raw BLAT output
│   ├── <species>_blat_regions_full.fa # FASTA of full TE copies found
│   ├── <species>_blat_regions_full_and_partial.fa
│   └── <species>_summary_file.tsv # per-species TE counts
└── ucsc_browser_files/
    ├── <TE_consensus_name>.2bit # TE consensus as 2bit
    ├── assembly_hub.txt # UCSC assembly hub
    ├── annotation_hub.txt # UCSC annotation hub
    ├── groups.txt
    └── <species>/
        ├── <species>_full_sorted.bam[.bai]
        ├── <species>_full_and_partial_sorted.bam[.bai]
        ├── <species>_full_track_hub.txt
        └── <species>_full_and_partial_track_hub.txt
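
To confirm that a run finished for a given species, the layout above can be turned into a list of expected paths. This is only a sketch based on the file names shown in the tree. The function name `expected_species_outputs` is hypothetical, and it assumes each `.bam` has a matching `.bam.bai` index beside it.

```python
from pathlib import PurePosixPath


def expected_species_outputs(output_folder, species):
    """List the per-species files shown in the output tree, as strings."""
    root = PurePosixPath(output_folder)
    summaries = root / "species_summaries"
    browser = root / "ucsc_browser_files" / species
    paths = [
        summaries / f"{species}_blat_hits.pslx",
        summaries / f"{species}_blat_regions_full.fa",
        summaries / f"{species}_blat_regions_full_and_partial.fa",
        summaries / f"{species}_summary_file.tsv",
    ]
    for kind in ("full", "full_and_partial"):
        paths.append(browser / f"{species}_{kind}_sorted.bam")
        paths.append(browser / f"{species}_{kind}_sorted.bam.bai")
        paths.append(browser / f"{species}_{kind}_track_hub.txt")
    return [str(p) for p in paths]
```

Checking each returned path with `os.path.exists` gives a quick completeness report for one species.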
Install with conda/mamba:
conda env create -f environment.yaml
conda activate <env_name>

Key tools: snakemake, blat, minimap2, samtools, faToTwoBit
Edit summarize_TE_config.yaml:
genome_fasta_folder: /path/to/genome/fastas # one .fa per species
output_folder: output
TE_fasta: /path/to/TE_consensus.fa
big_bed_file: /path/to/annotations.bb
email: your@email.com
all_species:
- species_name_1 # must match the .fa filename without extension
- species_name_2

Run the pipeline:
conda activate <env_name>
snakemake -s summarize_TE.smk --use-conda --cores 4

To view the results in the UCSC Genome Browser:
- Copy the ucsc_browser_files/ folder to your public web server
- Go to the UCSC Genome Browser → My Data → Connected Hubs
- Add URLs pointing to:
  - assembly_hub.txt: loads the TE consensus as a custom assembly
  - <species>_full_track_hub.txt: track of full TE copies aligned to the consensus
  - <species>_full_and_partial_track_hub.txt: track of full + partial copies
  - annotation_hub.txt: annotates internal regions, LTRs, and protein domains
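
The hub URLs to paste into the browser can be assembled from the server location and the species list. The sketch below assumes the ucsc_browser_files/ folder is served unchanged at a base URL of your choosing. The function name `hub_urls` and the example.org address are placeholders, not something the pipeline provides.

```python
def hub_urls(base_url, all_species):
    """Build the hub URLs for a served copy of ucsc_browser_files/."""
    base = base_url.rstrip("/")
    urls = [f"{base}/assembly_hub.txt"]
    for species in all_species:
        # Per-species track hubs live in a subfolder named after the species.
        urls.append(f"{base}/{species}/{species}_full_track_hub.txt")
        urls.append(f"{base}/{species}/{species}_full_and_partial_track_hub.txt")
    urls.append(f"{base}/annotation_hub.txt")
    return urls


if __name__ == "__main__":
    # Hypothetical server address, for illustration only.
    for url in hub_urls("https://example.org/ucsc_browser_files", ["species_name_1"]):
        print(url)
```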
The included annotation files (annotation_files/) have coordinates relative to the TE consensus sequence, not the host genome. This allows the annotation track to align correctly when browsing TE copies in the UCSC custom assembly view.