simkin-bioinformatics/summarize_TE

summarize_TE

A Snakemake pipeline for quantifying transposable element (TE) presence across one or more genome assemblies. Given a TE consensus FASTA and a set of genome FASTAs, the pipeline uses BLAT to find all full and partial TE copies, summarizes them per-species and across species, and produces UCSC Genome Browser–ready files for visualization.

Overview

The pipeline:

  1. BLATs each TE consensus sequence against each genome assembly
  2. Reconstructs full TE copies from BLAT hits
  3. Computes per-TE statistics: number of genomic regions, total nucleotides covered, full vs. partial copy counts
  4. Aligns found copies back to the consensus (via minimap2) and converts to indexed BAM
  5. Converts the TE consensus to a 2bit file and generates UCSC hub files (assembly hub, track hub, annotation hub)
  6. Combines per-species summaries into a single cross-species TSV
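Step 3 can be illustrated with a minimal Python sketch that tallies per-TE statistics directly from PSL/PSLX rows. This is not the pipeline's own code: the function name `summarize_psl` and the 90% coverage cutoff for a "full" copy are assumptions made for illustration, and each BLAT hit is treated as its own region without the merging that step 2 performs.

```python
from collections import defaultdict

# Assumption: a hit counts as "full" when it spans at least this fraction of
# the consensus. The pipeline's actual criterion is not stated in this README.
FULL_FRACTION = 0.9


def summarize_psl(lines, full_fraction=FULL_FRACTION):
    """Tally regions, covered bases, and full/partial counts per TE (query)."""
    stats = defaultdict(lambda: {"regions": 0, "bases": 0, "full": 0, "partial": 0})
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows have at least 21 tab-separated columns and start with an
        # integer match count; header and blank lines are skipped.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        q_name = fields[9]                      # TE consensus name
        q_size = int(fields[10])                # consensus length
        q_start, q_end = int(fields[11]), int(fields[12])
        t_start, t_end = int(fields[15]), int(fields[16])
        entry = stats[q_name]
        entry["regions"] += 1
        # Simplification: overlapping hits are not merged before counting bases.
        entry["bases"] += t_end - t_start
        if (q_end - q_start) / q_size >= full_fraction:
            entry["full"] += 1
        else:
            entry["partial"] += 1
    return dict(stats)
```

Calling `summarize_psl(open("output/species_summaries/<species>_blat_hits.pslx"))` would return one dictionary of counts per TE name, which can then be compared against the per-species summary file.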

Inputs

Input                   Description
Genome FASTA folder     Directory containing one .fa file per genome assembly
TE consensus FASTA      FASTA of TE consensus sequences (e.g. from Dfam or Repbase/Drosophila Repbase)
BigBed annotation file  .bb file annotating internal regions, LTRs, and protein-coding domains relative to the TE consensus
Config file             summarize_TE_config.yaml: paths, species list, and email

Outputs

output/
├── combined_summary.tsv                        # cross-species summary
├── species_summaries/
│   ├── <species>_blat_hits.pslx                # raw BLAT output
│   ├── <species>_blat_regions_full.fa          # FASTA of full TE copies found
│   ├── <species>_blat_regions_full_and_partial.fa
│   └── <species>_summary_file.tsv              # per-species TE counts
└── ucsc_browser_files/
    ├── <TE_consensus_name>.2bit                 # TE consensus as 2bit
    ├── assembly_hub.txt                         # UCSC assembly hub
    ├── annotation_hub.txt                       # UCSC annotation hub
    ├── groups.txt
    └── <species>/
        ├── <species>_full_sorted.bam[.bai]
        ├── <species>_full_and_partial_sorted.bam[.bai]
        ├── <species>_full_track_hub.txt
        └── <species>_full_and_partial_track_hub.txt
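To confirm that a run produced everything in the tree above for a given species, a small helper along these lines may be useful. It is only a sketch: `expected_outputs` and `missing_outputs` are hypothetical names, and the index files are assumed to be named by appending `.bai` to the BAM filename, as the `[.bai]` notation suggests.

```python
from pathlib import Path


def expected_outputs(output_folder, species):
    """List the per-species files shown in the output tree."""
    root = Path(output_folder)
    summaries = root / "species_summaries"
    browser = root / "ucsc_browser_files" / species
    return [
        summaries / f"{species}_blat_hits.pslx",
        summaries / f"{species}_blat_regions_full.fa",
        summaries / f"{species}_blat_regions_full_and_partial.fa",
        summaries / f"{species}_summary_file.tsv",
        browser / f"{species}_full_sorted.bam",
        browser / f"{species}_full_sorted.bam.bai",  # assumed index name
        browser / f"{species}_full_and_partial_sorted.bam",
        browser / f"{species}_full_and_partial_sorted.bam.bai",  # assumed index name
        browser / f"{species}_full_track_hub.txt",
        browser / f"{species}_full_and_partial_track_hub.txt",
    ]


def missing_outputs(output_folder, species):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_outputs(output_folder, species) if not p.exists()]
```

An empty list from `missing_outputs("output", "species_name_1")` indicates that all per-species files are present. The shared files (`combined_summary.tsv`, the 2bit file, and the hub files) are not covered here.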

Dependencies

Install with conda/mamba:

conda env create -f environment.yaml
conda activate <env_name>

Key tools: snakemake, blat, minimap2, samtools, faToTwoBit

Configuration

Edit summarize_TE_config.yaml:

genome_fasta_folder: /path/to/genome/fastas   # one .fa per species
output_folder: output
TE_fasta: /path/to/TE_consensus.fa
big_bed_file: /path/to/annotations.bb
email: your@email.com

all_species:
  - species_name_1     # must match the .fa filename without extension
  - species_name_2
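Because every entry in all_species must match a .fa filename, a quick check before launching the pipeline can catch mismatches early. The sketch below assumes the config has already been loaded into a dictionary (for instance with PyYAML's `yaml.safe_load`); `check_config` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Keys taken from the example configuration above.
REQUIRED_KEYS = (
    "genome_fasta_folder",
    "output_folder",
    "TE_fasta",
    "big_bed_file",
    "email",
    "all_species",
)


def check_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if problems:
        return problems
    folder = Path(config["genome_fasta_folder"])
    for species in config["all_species"]:
        # Each species name must match a .fa filename without its extension.
        if not (folder / f"{species}.fa").is_file():
            problems.append(f"no genome FASTA for species: {species}")
    return problems
```

Printing the returned list and stopping when it is non-empty gives a clearer error than a failed BLAT rule later in the run.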

Running the Pipeline

conda activate <env_name>
snakemake -s summarize_TE.smk --use-conda --cores 4

Uploading to the UCSC Genome Browser

  1. Copy the ucsc_browser_files/ folder to your public web server
  2. Go to the UCSC Genome Browser → My Data → Connected Hubs
  3. Add URLs pointing to:
    • assembly_hub.txt — loads the TE consensus as a custom assembly
    • <species>_full_track_hub.txt — track of full TE copies aligned to the consensus
    • <species>_full_and_partial_track_hub.txt — track of full + partial copies
    • annotation_hub.txt — annotates internal regions, LTRs, and protein domains
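The URLs in step 3 follow directly from the output layout, so they can be generated rather than typed by hand. The following is a sketch under assumptions: `hub_urls` is a hypothetical helper, and `base_url` stands for wherever the ucsc_browser_files/ folder ends up on your server.

```python
def hub_urls(base_url, species_list):
    """Build the hub URLs to paste into the UCSC Genome Browser.

    base_url is assumed to point at the hosted ucsc_browser_files/ folder.
    """
    base = base_url.rstrip("/")
    # The assembly hub goes first so the TE consensus assembly is loaded
    # before the tracks that depend on it.
    urls = [f"{base}/assembly_hub.txt"]
    for species in species_list:
        urls.append(f"{base}/{species}/{species}_full_track_hub.txt")
        urls.append(f"{base}/{species}/{species}_full_and_partial_track_hub.txt")
    urls.append(f"{base}/annotation_hub.txt")
    return urls
```

For example, `hub_urls("https://example.org/ucsc_browser_files", ["species_name_1"])` returns four URLs, where example.org is a placeholder host.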

Note on the BigBed Annotation File

The included annotation files (annotation_files/) have coordinates relative to the TE consensus sequence, not the host genome. This allows the annotation track to align correctly when browsing TE copies in the UCSC custom assembly view.
