- Genome Downsampler
This section details the available CLI options for configuring the genome-downsampler application. Each option allows you to customize the behavior and output of the downsampling process.
- INPUT_FILEPATH: Input file path
  - Path to the input .bam file containing DNA sequences. This argument is required.
- MAX_COVERAGE: Maximum coverage
  - Maximum coverage per base pair index of the reference genome. This argument is required.
- -h, --help: Help
  - Print the help message and exit.
- -o, --output TEXT: Output file path
  - Path to the output .bam file. Default is "output.bam" in the input file's directory.
- -a, --algorithm TEXT: Algorithm
  - Algorithm to use for downsampling. Options are: quasi-mcp-cpu, quasi-mcp-cuda, mcp-cpu, qmcp-cpu. Default is quasi-mcp-cpu.
- -b, --bed TEXT:FILE: BED file with amplicon bounds
  - Path to a .bed file specifying amplicon bounds for filtering or prioritization, depending on the selected algorithm.
- -t, --tsv TEXT:FILE: TSV file with primer pairings
  - Path to a .tsv file describing pairs of primers from the .bed file to create amplicons.
- -p, --preprocessing-out TEXT: Output file for preprocessed reads
  - Path to a .bam file for storing reads filtered out during preprocessing. Useful for debugging purposes.
- -l, --min-length UINT: Minimum sequence length
  - Minimum length of DNA sequences to retain. Sequences shorter than this length will be filtered out. Default is 90.
- -q, --min-mapq UINT: Minimum MAPQ value
  - Minimum Mapping Quality (MAPQ) value of sequences to retain. Sequences with MAPQ lower than this value will be filtered out. Default is 30.
- -@, --threads UINT: Number of threads
  - Number of threads for htslib read/write operations. Default is 2.
- -v, --verbose: Verbose mode
  - Enable additional logging for detailed execution information.
- Basic usage:
  genome-downsampler /data/input.bam 100
- Advanced usage with optional arguments:
  genome-downsampler /data/input.bam 100 -a quasi-mcp-cuda -v -l 100 -q 50 -p /data/filtered_out_prep.bam -o /data/output.bam -b /data/primers.bed -t /data/pairs.tsv
- Verbose mode with preprocessing output:
  genome-downsampler /data/input.bam 100 -v -p /data/filtered_out_prep.bam -o /data/output.bam
- Using amplicon filtering:
  genome-downsampler /data/input.bam 100 -v -o /data/output.bam -b /data/primers.bed -t /data/pairs.tsv
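
The single-file invocations above extend naturally to a whole directory of .bam files. The following is a minimal sketch, not part of the tool: the function name downsample_dir, the DOWNSAMPLER override variable, and the ".downsampled.bam" output suffix are all assumptions made for illustration. Only the documented positional arguments and the -o option are used.

```shell
#!/bin/sh
# Hypothetical helper: run the downsampler on every .bam file in a directory.
# DOWNSAMPLER is an assumed override for the binary name (handy for dry runs);
# it is not an option of genome-downsampler itself.
DOWNSAMPLER="${DOWNSAMPLER:-genome-downsampler}"

# Usage: downsample_dir INPUT_DIR OUTPUT_DIR MAX_COVERAGE
# The body runs in a subshell, so its variables do not leak to the caller.
downsample_dir() (
    in_dir=$1
    out_dir=$2
    max_coverage=$3
    mkdir -p "$out_dir" || exit 1
    for bam in "$in_dir"/*.bam; do
        # Skip the unexpanded pattern when the directory has no .bam files.
        [ -e "$bam" ] || continue
        name=$(basename "$bam" .bam)
        # Stop at the first failure so a broken input is not silently skipped.
        "$DOWNSAMPLER" "$bam" "$max_coverage" -o "$out_dir/$name.downsampled.bam" || exit 1
    done
)
```

For example, `downsample_dir /data/runs /data/downsampled 100` would process each file in /data/runs with a maximum coverage of 100. Setting `DOWNSAMPLER=echo` first prints the commands instead of running them.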
To run the app in Docker, clone the repository and navigate to it:
git clone https://github.com/migoox/genome-downsampler
cd genome-downsampler
Build the Docker image using:
docker build -t genome-downsampler .
Now you can run the container with the --help argument:
docker run -it genome-downsampler --help
To provide the data and get the output, run the app with a mounted data volume. Assuming you want to work with a file sample.bam located in /home/user/data, and you'd like the output to appear in the same folder:
docker run -it -v /home/user/data:/data genome-downsampler /data/sample.bam 100 -o /data/output.bam

This software supports only GNU/Linux systems. If you are a Windows user, we recommend using WSL. To run (or compile) it, HTSlib and OR-Tools must be installed on your machine.
Install the following common dependencies by running:
Debian/Ubuntu/Linux Mint:
sudo apt install autoconf automake make gcc zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev
Fedora/Red Hat:
sudo dnf install autoconf automake make gcc zlib-devel bzip2-devel xz-devel libcurl-devel openssl-devel
OpenSUSE:
sudo zypper install autoconf automake make gcc zlib-devel libbz2-devel xz-devel libcurl-devel libopenssl-devel
Arch Linux:
sudo pacman -S autoconf automake make gcc zlib bzip2 xz curl openssl

You can install HTSlib and OR-Tools using the installation script scripts/install_libs.sh if your package manager does not provide those dependencies and you want to avoid installing them manually. Assuming your current working directory is the repository directory and you want to install the libraries in /usr/local/:
./scripts/install_libs.sh install
If you want to install only OR-Tools and change the prefix directory to, say, /opt, use:
./scripts/install_libs.sh install --prefix /opt --subset ortools

If you prefer to install only HTSlib, you can use the following commands (the HTSlib project provides the full installation guide):
wget https://github.com/samtools/htslib/releases/download/1.20/htslib-1.20.tar.bz2
tar -xf htslib-1.20.tar.bz2
cd htslib-1.20
make
sudo make install
Alternatively, you can install samtools via your package manager, since HTSlib is part of the samtools project.
Debian/Ubuntu/Linux Mint:
sudo apt install samtools
Fedora/Red Hat:
sudo dnf install samtools
OpenSUSE:
sudo zypper install samtools
Arch Linux:
sudo pacman -S samtools

Many package managers do not provide the OR-Tools library. To install it on your machine, download the appropriate binaries from the OR-Tools project and extract the files.
Now, supposing that ORTOOLS_DIR_NAME represents the path to the extracted directory, use the following commands:
sudo cp -r ${ORTOOLS_DIR_NAME}/bin/* /usr/local/bin/
sudo cp -r ${ORTOOLS_DIR_NAME}/lib/* /usr/local/lib/
sudo cp -r ${ORTOOLS_DIR_NAME}/include/* /usr/local/include/
sudo cp -r ${ORTOOLS_DIR_NAME}/share/* /usr/local/share/

By default, CUDA algorithms are not included in the build. To enable them, set the WITH_CUDA flag when configuring with CMake. To build with CUDA support, you must first install the CUDA library (see the CUDA installation guide and download page). Note that you need a CUDA-capable GPU to run the CUDA algorithms.
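
Because the CUDA algorithms need a CUDA-capable GPU at run time, a wrapper script may want to fall back to a CPU algorithm when no GPU is visible. The sketch below is one possible way to do that and rests on assumptions: the function names has_cuda_gpu and pick_algorithm are hypothetical, and the check relies on NVIDIA's nvidia-smi utility being on PATH and on its `-L` listing, whose lines begin with "GPU". It does not verify that the binary was actually built with WITH_CUDA enabled.

```shell
#!/bin/sh
# Hypothetical check: succeeds when nvidia-smi is on PATH and lists a GPU.
# "nvidia-smi -L" prints one line per device, e.g. "GPU 0: <name> (UUID: ...)".
has_cuda_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

# Print an algorithm name from the documented option list:
# the CUDA variant when a GPU is visible, the default CPU variant otherwise.
pick_algorithm() {
    if has_cuda_gpu; then
        echo quasi-mcp-cuda
    else
        echo quasi-mcp-cpu
    fi
}
```

A call such as `genome-downsampler /data/input.bam 100 -a "$(pick_algorithm)"` would then select the algorithm automatically.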
- Install the dependencies.
- Clone the repository.
- Navigate to the repository directory and run one of the following:
  - Default build: cmake --preset gcc-x64-release && cmake --build --preset gcc-x64-release
  - Build with CUDA algorithms: cmake --preset gcc-x64-release -DWITH_CUDA=ON && cmake --build --preset gcc-x64-release
- The binary file is located at <repository-dir>/build/release/src/genome-downsampler.
Available cmake flags:
- WITH_CUDA: builds the program with CUDA algorithms. When this option is disabled, the CUDA library is not required.
- WITH_TESTS: builds the program with an additional test subcommand for testing the correctness of the algorithms.
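
The two flags can be combined in a single configure step. As a minimal sketch, the hypothetical helper below (cmake_configure_args is not part of the repository) prints the configure arguments for a chosen combination, using CMake's standard -D<var>=<value> syntax for cache variables together with the gcc-x64-release preset named above.

```shell
#!/bin/sh
# Hypothetical helper: print cmake configure arguments for the given flags.
# Usage: cmake_configure_args [WITH_CUDA] [WITH_TESTS]   (each ON or OFF)
# Both flags default to OFF here, which is an assumption of this sketch.
# The body runs in a subshell, so its variables do not leak to the caller.
cmake_configure_args() (
    with_cuda=${1:-OFF}
    with_tests=${2:-OFF}
    for value in "$with_cuda" "$with_tests"; do
        case $value in
            ON|OFF) ;;
            *) echo "expected ON or OFF, got: $value" >&2; exit 1 ;;
        esac
    done
    echo "--preset gcc-x64-release -DWITH_CUDA=$with_cuda -DWITH_TESTS=$with_tests"
)
```

For instance, `cmake $(cmake_configure_args ON ON) && cmake --build --preset gcc-x64-release` would configure a build with both the CUDA algorithms and the test subcommand, then compile it.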