summarize_TE

A Snakemake pipeline for quantifying transposable element (TE) presence across one or more genome assemblies. Given a TE consensus FASTA and a set of genome FASTAs, the pipeline uses BLAT to find all full and partial TE copies, summarizes them per-species and across species, and produces UCSC Genome Browser–ready files for visualization.

Overview

The pipeline:

1. BLATs each TE consensus sequence against each genome assembly
2. Reconstructs full TE copies from BLAT hits
3. Computes per-TE statistics: number of genomic regions, total nucleotides covered, full vs. partial copy counts
4. Aligns found copies back to the consensus (via minimap2) and converts to indexed BAM
5. Converts the TE consensus to a 2bit file and generates UCSC hub files (assembly hub, track hub, annotation hub)
6. Combines per-species summaries into a single cross-species TSV
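
To make the statistics step more concrete, here is a minimal Python sketch of how hits in a PSL/PSLX file could be tallied per TE. It is not the pipeline's own code. It assumes the TE consensus is the BLAT query and the genome is the target, uses the standard PSL column order, and treats a hit as "full" when it spans at least 90% of the consensus; `FULL_COPY_FRACTION` and `summarize_psl` are hypothetical names, and the real cutoff and overlap handling may differ.

```python
from collections import defaultdict

# Assumption: a hit counts as a "full" copy when it spans at least this
# fraction of the consensus. The pipeline's actual cutoff may differ.
FULL_COPY_FRACTION = 0.9


def summarize_psl(lines, full_fraction=FULL_COPY_FRACTION):
    """Tally hits per TE from PSL/PSLX data lines (standard column order)."""
    stats = defaultdict(
        lambda: {"regions": 0, "nucleotides": 0, "full": 0, "partial": 0}
    )
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and the optional PSL header; data rows start
        # with an integer (the match count) and have at least 21 columns.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        te_name = fields[9]  # qName: the TE consensus (assumed to be the query)
        q_size, q_start, q_end = (int(x) for x in fields[10:13])
        t_start, t_end = int(fields[15]), int(fields[16])
        entry = stats[te_name]
        entry["regions"] += 1
        # Genomic span of the hit; overlapping hits are not merged here.
        entry["nucleotides"] += t_end - t_start
        if (q_end - q_start) / q_size >= full_fraction:
            entry["full"] += 1
        else:
            entry["partial"] += 1
    return dict(stats)
```

It could be applied to a raw hits file with `summarize_psl(open("output/species_summaries/<species>_blat_hits.pslx"))`, substituting a real species name.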

Inputs

Input	Description
Genome FASTA folder	Directory containing one `.fa` file per genome assembly
TE consensus FASTA	FASTA of TE consensus sequences (e.g. from Dfam or Repbase/Drosophila Repbase)
BigBed annotation file	`.bb` file annotating internal regions, LTRs, and protein-coding domains relative to the TE consensus
Config file	`summarize_TE_config.yaml` — paths, species list, and email

Outputs

output/
├── combined_summary.tsv                        # cross-species summary
├── species_summaries/
│   ├── <species>_blat_hits.pslx                # raw BLAT output
│   ├── <species>_blat_regions_full.fa          # FASTA of full TE copies found
│   ├── <species>_blat_regions_full_and_partial.fa
│   └── <species>_summary_file.tsv              # per-species TE counts
└── ucsc_browser_files/
    ├── <TE_consensus_name>.2bit                 # TE consensus as 2bit
    ├── assembly_hub.txt                         # UCSC assembly hub
    ├── annotation_hub.txt                       # UCSC annotation hub
    ├── groups.txt
    └── <species>/
        ├── <species>_full_sorted.bam[.bai]
        ├── <species>_full_and_partial_sorted.bam[.bai]
        ├── <species>_full_track_hub.txt
        └── <species>_full_and_partial_track_hub.txt
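
As a quick way to confirm that a run produced everything in the tree above, the layout can be written out as a list of expected paths. This is a sketch only: `expected_outputs` and `missing_outputs` are hypothetical helpers that are not part of the repository, and `te_name` stands in for the `<TE_consensus_name>` placeholder.

```python
from pathlib import Path


def expected_outputs(output_folder, species_list, te_name):
    """List the files shown in the output tree for the given species."""
    root = Path(output_folder)
    summaries = root / "species_summaries"
    browser = root / "ucsc_browser_files"
    paths = [
        root / "combined_summary.tsv",
        browser / f"{te_name}.2bit",
        browser / "assembly_hub.txt",
        browser / "annotation_hub.txt",
        browser / "groups.txt",
    ]
    for sp in species_list:
        # Per-species BLAT results and summary table.
        for suffix in (
            "_blat_hits.pslx",
            "_blat_regions_full.fa",
            "_blat_regions_full_and_partial.fa",
            "_summary_file.tsv",
        ):
            paths.append(summaries / f"{sp}{suffix}")
        # Per-species browser files: BAM, BAM index, and track hub.
        for kind in ("full", "full_and_partial"):
            paths.append(browser / sp / f"{sp}_{kind}_sorted.bam")
            paths.append(browser / sp / f"{sp}_{kind}_sorted.bam.bai")
            paths.append(browser / sp / f"{sp}_{kind}_track_hub.txt")
    return paths


def missing_outputs(output_folder, species_list, te_name):
    """Return the expected paths that do not exist on disk."""
    expected = expected_outputs(output_folder, species_list, te_name)
    return [p for p in expected if not p.exists()]
```

An empty list from `missing_outputs` suggests the run completed, under the assumption that the tree above is exhaustive.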

Dependencies

Install with conda/mamba:

conda env create -f environment.yaml
conda activate <env_name>

Key tools: snakemake, blat, minimap2, samtools, faToTwoBit

Configuration

Edit summarize_TE_config.yaml:

genome_fasta_folder: /path/to/genome/fastas   # one .fa per species
output_folder: output
TE_fasta: /path/to/TE_consensus.fa
big_bed_file: /path/to/annotations.bb
email: your@email.com

all_species:
  - species_name_1     # must match the .fa filename without extension
  - species_name_2
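
Because each entry in all_species must match a `.fa` filename without its extension, a mismatch is an easy mistake to make. The sketch below shows one way to check the names before launching a run. `check_species` is a hypothetical helper, not part of the pipeline, and it takes the two config values as plain arguments rather than parsing the YAML itself.

```python
from pathlib import Path


def check_species(genome_fasta_folder, all_species):
    """Compare configured species names with the .fa files in the folder."""
    # Path.stem drops only the final extension, so "dmel.fa" -> "dmel".
    available = {p.stem for p in Path(genome_fasta_folder).glob("*.fa")}
    configured = set(all_species)
    return {
        # Listed in the config but no matching <name>.fa in the folder.
        "missing_fasta": sorted(configured - available),
        # Present in the folder but not listed in the config.
        "not_in_config": sorted(available - configured),
    }
```

Both lists being empty means the config and the genome folder agree.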

Running the Pipeline

conda activate <env_name>
snakemake -s summarize_TE.smk --use-conda --cores 4

Uploading to the UCSC Genome Browser

1. Copy the ucsc_browser_files/ folder to your public web server
2. Go to the UCSC Genome Browser → My Data → Connected Hubs
3. Add URLs pointing to:
   - assembly_hub.txt: loads the TE consensus as a custom assembly
   - <species>_full_track_hub.txt: track of full TE copies aligned to the consensus
   - <species>_full_and_partial_track_hub.txt: track of full + partial copies
   - annotation_hub.txt: annotates internal regions, LTRs, and protein domains
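
The hub URLs to paste into Connected Hubs follow directly from the folder layout, so they can be generated rather than typed by hand. A minimal sketch, assuming the folder is served unchanged from a single base URL; `hub_urls` is a hypothetical helper and `https://example.org/ucsc_browser_files` is a placeholder address.

```python
def hub_urls(base_url, species_list):
    """Build the hub URLs for a copied ucsc_browser_files/ folder."""
    base = base_url.rstrip("/")
    # The assembly hub goes first so the custom assembly exists
    # before the tracks that refer to it are added.
    urls = [f"{base}/assembly_hub.txt"]
    for sp in species_list:
        urls.append(f"{base}/{sp}/{sp}_full_track_hub.txt")
        urls.append(f"{base}/{sp}/{sp}_full_and_partial_track_hub.txt")
    urls.append(f"{base}/annotation_hub.txt")
    return urls


if __name__ == "__main__":
    # Placeholder base URL and species name, for illustration only.
    for url in hub_urls("https://example.org/ucsc_browser_files/", ["species_name_1"]):
        print(url)
```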

Note on the BigBed Annotation File

The included annotation files (annotation_files/) have coordinates relative to the TE consensus sequence, not the host genome. This allows the annotation track to align correctly when browsing TE copies in the UCSC custom assembly view.
