assembly_qc

assembly_qc is a Snakemake workflow for assembly quality control. The workflow starts from haplotype-resolved assembly FASTA files, performs contamination/adaptor/rDNA cleaning, then generates assembly statistics, reference-alignment summaries, QV/completeness metrics, BUSCO/compleasm summaries, and human-specific visualization outputs.

Main outputs

For each sample and available haplotype, the workflow can generate:

cleaned FASTA files after short-contig filtering, FCS-GX/FCS-adaptor screening, optional mitochondrial extraction, rDNA splitting, and final FASTA renaming
scaffold/contig assembly statistics, including N50-style summaries and telomere annotations
assembly-to-reference PAF, SAF, chromosome coverage summaries, and chain files
Merqury QV and completeness outputs from Illumina reads
optional trio Merqury hap-mer plots when parental IDs are provided
compleasm/BUSCO gene completeness summaries
contig length plots, ideograms, ploidy plots, and assembly-eval config files
human-only ModDotPlot summaries for acrocentric regions
optional SMaHT DSA metadata tables when SMAHT_METADATA: True

Repository layout

.
├── Snakefile
├── config/
│   ├── config_asm_qc.yaml
│   └── manifest_asm_qc.tab
├── rules/
│   ├── merqury.smk
│   ├── saffire.smk
│   ├── compleasm.smk
│   ├── fasta_stats.smk
│   ├── moddotplot.smk
│   └── plots.smk
├── scripts/
├── runcluster
└── runlocal

Configuration

The default config is:

MANIFEST: config/manifest_asm_qc.tab

REF:
  CHM13:
    PATH: /net/eichler/vol28/eee_shared/assemblies/CHM13/T2T/v2.0/T2T-CHM13v2.fasta
    CHROMS: /net/eichler/vol28/eee_shared/assemblies/CHM13/T2T/v2.0/genome.txt
    CYTO: /net/eichler/vol28/eee_shared/assemblies/CHM13/T2T/v2.0/anno/cyto.bed
  GRCh38:
    PATH: /net/eichler/vol28/eee_shared/assemblies/hg38/no_alt/hg38.no_alt.fa
    CHROMS: /net/eichler/vol28/eee_shared/assemblies/hg38/no_alt/genome.txt
    CYTO: /net/eichler/vol28/eee_shared/assemblies/hg38/no_alt/anno/cyto.bed

DB_PATH: /net/eichler/vol28/eee_shared/buscodb/
LINEAGE: primates
MODE: busco

TAXID: 9606
INCLUDE_MITO: True
HPRC_NAMING: True

Key config fields

Field	Description
`MANIFEST`	Tab-delimited manifest describing assemblies, read FOFNs, and optional trio information.
`REF`	Reference genomes to use for minimap2/Saffire outputs, ideograms, ploidy plots, and chain files. Each reference requires `PATH` and `CHROMS`; `CYTO` is optional.
`DB_PATH`	BUSCO/compleasm database path.
`LINEAGE`	compleasm lineage, such as `primates`.
`MODE`	compleasm mode, such as `busco`.
`TAXID`	Expected organism taxid for FCS-GX screening. Human-specific downstream outputs are enabled for `9606`.
`INCLUDE_MITO`	Whether mitochondrial FASTA extraction should be included when supported by the workflow settings.
`HPRC_NAMING`	Whether final FASTA names should follow HPRC-style naming logic.

Manifest

The manifest is tab-delimited. Current expected columns are:

SAMPLE  NAME  H1  H2  UNASSIGNED  ILLUMINA  FOFN  TRIO  MO_ID  FA_ID

Example:

SAMPLE_verkko  SAMPLE  test_data/asm/verkko.hap1.fasta  test_data/asm/verkko.hap2.fasta  test_data/asm/verkko.unassigned.fasta  SAMPLE  test_data/fofn/reads.fofn  NO  NA  NA

Manifest columns

Column	Description
`SAMPLE`	Workflow sample ID. This is used in `results/{SAMPLE}`.
`NAME`	Display or output naming value used by the workflow.
`H1`	Haplotype 1 assembly FASTA. Required for a sample to be included in the main workflow.
`H2`	Haplotype 2 assembly FASTA. Optional.
`UNASSIGNED`	Unassigned assembly FASTA. Optional.
`ILLUMINA`	Illumina read sample ID used to locate/read the FOFN for Merqury. Multiple assemblies can share the same Illumina ID.
`FOFN`	File of file names for Illumina FASTQ inputs used by Merqury/meryl. One FASTQ path per line.
`TRIO`	`YES` or `NO`; enables trio Merqury outputs when parental IDs are available.
`MO_ID`	Maternal Illumina sample ID for trio Merqury. Use `NA` if not available.
`FA_ID`	Paternal Illumina sample ID for trio Merqury. Use `NA` if not available.

Running the workflow

Run a dry-run first:

./runcluster 30 -np

Run on the cluster:

./runcluster 30

Run locally:

./runlocal 30

runcluster and runlocal both resolve the repository path automatically, load miniconda/4.12.0, create log/, and pass the remaining arguments to Snakemake. runcluster submits jobs through DRMAA/SGE with mfree, h_rt, threads, and -V -cwd -j y settings. Both wrappers use Singularity/Apptainer and bind /net.

To use a different config file:

./runcluster 30 --configfile path/to/config.yaml

Partial workflow targets

You can run selected output groups instead of the full workflow:

./runcluster 30 get_cleaned_fasta_only
./runcluster 30 get_stats_only
./runcluster 30 get_saf_only
./runcluster 30 get_qv_only
./runcluster 30 get_busco_only
./runcluster 30 get_plots_only
./runcluster 30 get_moddotplots_only

Target	Outputs
`get_cleaned_fasta_only`	Final cleaned FASTA files under `results/{sample}/contamination_screening/outputs/final_fasta/`.
`get_stats_only`	Per-haplotype FASTA statistics under `results/{sample}/stats/outputs/summary/`.
`get_saf_only`	Saffire SAF files, chromosome coverage summaries, trimmed PAFs, and human chain files.
`get_qv_only`	Merqury QV and completeness outputs. Trio plots are included when `TRIO=YES`.
`get_busco_only`	compleasm/BUSCO summary TSV.
`get_plots_only`	Contig length plots, ideograms, ploidy plots, and assembly-eval config files.
`get_moddotplots_only`	Human acrocentric ModDotPlot summary and contig statistics.

Workflow overview

The full all target creates:

results/{sample}/complete_flag/all_done

for every sample with a valid H1 assembly in the manifest.

Internally, the workflow includes these rule modules:

fcs_gx.smk
Links/indexes raw FASTA input, removes very short contigs, runs FCS-GX/adaptor screening, detects mitochondrial and rDNA sequence, splits rDNA, and writes final cleaned FASTA outputs.
merqury.smk
Builds meryl databases from Illumina FASTQ FOFNs, computes Merqury QV/completeness, and optionally generates trio hap-mer outputs.
saffire.smk
Prepares reference FASTA resources, aligns cleaned assemblies to each configured reference with minimap2, trims/filters PAFs, generates SAF/chromosome coverage summaries, and creates chain files for human assemblies.
compleasm.smk
Runs compleasm and summarizes BUSCO-style completeness metrics.
fasta_stats.smk
Splits scaffolds into contigs, calculates telomere annotations, computes per-haplotype and full-genome assembly statistics, and summarizes stats tables.
moddotplot.smk
For human assemblies, identifies acrocentric contigs from CHM13 alignments and generates ModDotPlot-related summaries/plots.
plots.smk
Generates contig length plots, ideogram plots, ploidy plots, and assembly-eval configuration files.

Important output locations

results/{sample}/
├── complete_flag/
│   └── all_done
├── contamination_screening/
│   └── outputs/
│       ├── final_fasta/{sample}_{hap}.fasta
│       ├── contig_fasta/{sample}_{hap}.fasta
│       ├── mito_fasta/{hap}-mt.fasta
│       └── rdna_fasta/{hap}-rdna.fasta
├── stats/
│   └── outputs/
│       └── summary/{hap}.summary.stats
├── saffire/
│   └── outputs/
│       ├── trimmed_pafs/{ref}/{hap}.minimap2.trimmed.paf
│       ├── safs/{ref}/{hap}.minimap2.saf
│       └── chrom_cov/{ref}/{hap}.minimap2.chrom_cov.tsv
├── chain_files/
│   └── outputs/
│       ├── {sample}_{hap}_To_{ref}.chain
│       └── {sample}_{hap}_To_{ref}.invert.chain
├── merqury/
│   └── outputs/
│       ├── {sample}.qv
│       └── {sample}_{hap}.qv
├── compleasm/
│   └── outputs/
│       └── summary/{sample}.summary.tsv
├── plots/
│   └── outputs/
│       ├── contig_length/{hap}.scaffold.scatter_logged.png
│       ├── ideo/{ref}/pdf/{sample}.minimap2.ideoplot.pdf
│       ├── ideo/{ref}/pdf/{sample}.minimap2.ideoplot_wide.pdf
│       └── ploidy/{ref}/pdf/{sample}.minimap2.ploidy.pdf
└── moddotplot/
    └── outputs/
        ├── summary/{hap}.generated_acros.tsv
        └── contig_stats/{hap}.CHM13_lifted_contigs.tsv

Shared resources are written under:

resources/
├── reference/{ref}/
│   ├── genome.fa
│   ├── genome.fa.fai
│   └── genome_index.txt
├── meryl/{ILLUMINA}/
└── acro_target_beds/

Notes

Human-specific outputs, including ModDotPlot and chain files, depend on TAXID: 9606.
H1 is required for a sample to be selected by the default workflow.
H2 and UNASSIGNED are optional; the workflow runs only for haplotypes present in the manifest.
FOFN is used for Merqury/meryl input and should list Illumina FASTQ files one per line.
The workflow uses Singularity/Apptainer containers through Snakemake’s --use-singularity option.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

assembly_qc

Main outputs

Repository layout

Configuration

Key config fields

Manifest

Manifest columns

Running the workflow

Partial workflow targets

Workflow overview

Important output locations

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
config		config
rules		rules
scripts		scripts
test_data		test_data
.gitignore		.gitignore
README.md		README.md
Snakefile		Snakefile
runcluster		runcluster
runlocal		runlocal

Folders and files

Latest commit

History

Repository files navigation

assembly_qc

Main outputs

Repository layout

Configuration

Key config fields

Manifest

Manifest columns

Running the workflow

Partial workflow targets

Workflow overview

Important output locations

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages