Intersection & Union of BED4 Intervals

A clean, dependency-free Python CLI for computing the union or intersection of two BED4 files. Designed for clarity, explicit edge-case handling, and easy reuse in bioinformatics pipelines.

  • Input: two whitespace-separated BED4 files with columns: chrom start end name
  • Output: a BED4 file you specify
  • Operations: union (merge by feature name within a chromosome), isec (pairwise interval overlap per chromosome)
  • Python: 3.8+

Note: BED coordinates are 0-based and half-open, and chromosome labels must match exactly (chr1 and 1 are treated as different chromosomes).


Motivation

Goal: Read two BED4 files and, based on user choice, compute either the union or the intersection.

Intersection (isec)
Report all non-empty overlaps between each interval in file 1 and each interval in file 2 on the same chromosome.

  • Overlap rule (half-open intervals): [a,b) and [c,d) overlap iff a < d and c < b.
  • Example: [30,50) & [50,70) → no overlap
  • Example: [30,52) & [50,70) → overlap [50,52)
  • The feature name in the output is taken from file 1.
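
The overlap rule above can be checked with a small predicate. This is a minimal sketch, not the code from mycode.py; the function name half_open_overlap is hypothetical.

```python
def half_open_overlap(a: int, b: int, c: int, d: int) -> bool:
    """Return True if the half-open intervals [a, b) and [c, d) overlap."""
    # Each interval must start before the other one ends.
    return a < d and c < b


print(half_open_overlap(30, 50, 50, 70))  # False: the intervals only touch at 50
print(half_open_overlap(30, 52, 50, 70))  # True: they share [50, 52)
```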

Union (union)
Output all features that occur in at least one file.

  • If a feature name occurs in only one file → include as-is.
  • If a feature name occurs in both files on different chromosomes → exclude both.
  • If a feature name occurs in both files on the same chromosome → output a single interval using the smallest covering span of both:
    • [30,40) + [70,90) → [30,90)
    • [30,50) + [40,45) → [30,50)

CLI contract (argparse)

  • operation (union or isec)
  • input1 path
  • input2 path
  • output path

The order of output lines is not significant.
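
A parser matching this contract might look like the sketch below. The positional names operation, input1, input2 and output follow the list above; the help strings and the build_parser helper are assumptions, and the parser in mycode.py may be set up differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with four positional arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Union or intersection of two BED4 files."
    )
    # argparse rejects any operation other than the two listed choices.
    parser.add_argument("operation", choices=["union", "isec"],
                        help="operation to perform")
    parser.add_argument("input1", help="path to the first BED4 file")
    parser.add_argument("input2", help="path to the second BED4 file")
    parser.add_argument("output", help="path of the BED4 file to write")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["union", "a.bed", "b.bed", "out.bed"])
print(args.operation, args.input1, args.input2, args.output)
```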


Quick Start

# Union (merge by feature name if on the same chromosome)
python mycode.py union main.bed.txt unionsecondfile.bed.txt union_results.bed.txt
cat union_results.bed.txt

# Intersection (interval overlap by chromosome; name is taken from file1)
python mycode.py isec main.bed.txt intersectionsecondfile.bed.txt isec_results.bed.txt
cat isec_results.bed.txt

# CLI help
python mycode.py -h

Features included

Robust parsing

  • Skips blank lines and # comments
  • Accepts whitespace-separated columns (tabs or spaces)
  • Auto-swaps inverted intervals (start > end)
  • Defaults name to . if the 4th column is missing

Union

  • Groups intervals by name across both files
  • Merges only when intervals with the same name are on the same chromosome
  • Flags and drops names that appear on different chromosomes

Intersection

  • Checks pairwise overlaps by chromosome (no requirement to match names)
  • Output name is inherited from the file1 interval

Clear output & errors

  • Summary stats printed to stdout
  • Warnings/parse errors printed to stderr (with line numbers)

Usage

python mycode.py {union|isec} <input1.bed> <input2.bed> <output.bed>

  • union: merge intervals by feature name if they are on the same chromosome, with start = min(start_a, start_b) and end = max(end_a, end_b). Names that appear on different chromosomes are excluded.

  • isec: report overlapping regions by chromosome for all pairs between file1 and file2. Overlap calculation:

overlap_start = max(start_a, start_b)
overlap_end   = min(end_a, end_b)
emit if overlap_start < overlap_end

Output name is taken from file1’s interval.
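
Put together, the pairwise intersection could be sketched as follows. This is an illustration under assumptions rather than the code in mycode.py: the function name pairwise_isec and the (chrom, start, end, name) tuple layout are hypothetical, and file2 is indexed by chromosome only to avoid comparing intervals on different chromosomes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int, str]  # (chrom, start, end, name)


def pairwise_isec(file1: List[Interval], file2: List[Interval]) -> List[Interval]:
    """Return every non-empty overlap between file1 and file2 intervals."""
    # Index file2 by chromosome so only same-chromosome pairs are compared.
    by_chrom: Dict[str, List[Interval]] = defaultdict(list)
    for interval in file2:
        by_chrom[interval[0]].append(interval)

    overlaps: List[Interval] = []
    for chrom, start_a, end_a, name_a in file1:
        for _, start_b, end_b, _ in by_chrom.get(chrom, []):
            overlap_start = max(start_a, start_b)
            overlap_end = min(end_a, end_b)
            if overlap_start < overlap_end:
                # The output name is inherited from the file1 interval.
                overlaps.append((chrom, overlap_start, overlap_end, name_a))
    return overlaps


file1 = [("chr1", 30, 52, "geneA"), ("chr2", 10, 20, "geneB")]
file2 = [("chr1", 50, 70, "x"), ("chr1", 52, 60, "y"), ("chr3", 10, 20, "z")]
print(pairwise_isec(file1, file2))  # [('chr1', 50, 52, 'geneA')]
```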


How it works

read_bed_file(path)

Checks that the file exists, parses each row into (chrom, start, end, name), skips malformed rows and rows with non-integer coordinates, and prints warnings with line numbers.
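
The parsing rules can be sketched on an iterable of lines. This is a simplified stand-in for read_bed_file, not its actual body: the function name parse_bed_lines and the warning wording are assumptions, and the file-existence check is left out so the example runs without any files.

```python
import sys
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, str]  # (chrom, start, end, name)


def parse_bed_lines(lines: Iterable[str]) -> List[Interval]:
    """Parse BED4 rows, skipping comments, blanks and malformed rows."""
    intervals: List[Interval] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()  # splits on tabs or spaces
        if len(fields) < 3:
            print(f"warning: line {lineno}: fewer than 3 columns", file=sys.stderr)
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            print(f"warning: line {lineno}: non-integer coordinate", file=sys.stderr)
            continue
        if start > end:
            start, end = end, start  # auto-swap inverted intervals
        name = fields[3] if len(fields) > 3 else "."  # default name
        intervals.append((fields[0], start, end, name))
    return intervals


rows = ["# comment", "", "chr1\t30\t50\tgeneA", "chr1 90 70", "chr2 x 10 bad"]
print(parse_bed_lines(rows))  # [('chr1', 30, 50, 'geneA'), ('chr1', 70, 90, '.')]
```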

find_unions(intervals1, intervals2)

Concatenates both lists; groups by name. Ensures all intervals for a given name share the same chromosome; otherwise marks the group as invalid. Emits one merged interval per valid name.
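
The grouping logic described here could look roughly like the sketch below. It follows the description rather than the source of find_unions; the name union_by_name and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int, str]  # (chrom, start, end, name)


def union_by_name(intervals1: List[Interval], intervals2: List[Interval]) -> List[Interval]:
    """Merge intervals that share a name, dropping names seen on several chromosomes."""
    # Group all intervals from both inputs by feature name.
    groups: Dict[str, List[Interval]] = {}
    for interval in intervals1 + intervals2:
        groups.setdefault(interval[3], []).append(interval)

    merged: List[Interval] = []
    for name, members in groups.items():
        chroms = {interval[0] for interval in members}
        if len(chroms) != 1:
            continue  # same name on different chromosomes: exclude the group
        start = min(interval[1] for interval in members)
        end = max(interval[2] for interval in members)
        merged.append((chroms.pop(), start, end, name))
    return merged


a = [("chr1", 30, 40, "g1"), ("chr1", 30, 50, "g2"), ("chr2", 5, 9, "g3"), ("chr1", 1, 2, "g4")]
b = [("chr1", 70, 90, "g1"), ("chr1", 40, 45, "g2"), ("chr3", 5, 9, "g3")]
print(union_by_name(a, b))
# [('chr1', 30, 90, 'g1'), ('chr1', 30, 50, 'g2'), ('chr1', 1, 2, 'g4')]
```

Here g1 and g2 are merged into their smallest covering spans, g3 is dropped because it appears on chr2 and chr3, and g4 passes through unchanged because it occurs in only one input.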


Examples (files included)

python mycode.py union main.bed.txt unionsecondfile.bed.txt union_results.bed.txt
python mycode.py isec main.bed.txt intersectionsecondfile.bed.txt isec_results.bed.txt

The repository includes the following example inputs and outputs at the repo root:

  • main.bed.txt
  • unionsecondfile.bed.txt
  • intersectionsecondfile.bed.txt
  • union_results.bed.txt (generated by the union command above)
  • isec_results.bed.txt (generated by the isec command above)

To view the generated results:

Union example

cat union_results.bed.txt

Intersection example

cat isec_results.bed.txt

Project Structure

.
├── mycode.py
├── main.bed.txt
├── unionsecondfile.bed.txt
├── intersectionsecondfile.bed.txt
├── union_results.bed.txt         # generated
└── isec_results.bed.txt          # generated

Appendix: BED4 Files

A BED4 file has four columns (whitespace separated):

  1. chrom – chromosome/contig label (e.g., chr1)

  2. start – 0-based start (inclusive)

  3. end – 0-based end (exclusive)

  4. name – feature label (optional; defaults to . in this tool)

Lines beginning with # are treated as comments and ignored.


Contact

Developed by Martina Debnath | MSc Genetics and Multiomics in Medicine | UCL

Thank you for using my intersection-and-union CLI <3

Feel free to reach out for collaboration.

GitHub: https://github.com/marti-dotcom

Email: martinadebnath@gmail.com
