Skip to content

radiant-systems-lab/CDMT

Repository files navigation

CDMT — Content-Defined Merkle Tree

CDMT is an implementation of a Content-Defined Merkle Tree (CDMT): a hash tree whose internal node grouping is determined by the hash values of its children rather than by a fixed branching factor. This makes it robust against chunk-shift — the problem where a single byte insertion or deletion causes an entire standard Merkle tree to restructure because its fixed-width chunking falls on different boundaries.

The primary use case is efficient similarity comparison of Docker image layers across versions.


Background

Concept Problem it solves
Content-Defined Chunking (CDC) Fixed-size chunking is not robust to byte insertions/deletions — one change shifts every subsequent chunk boundary. CDC cuts at content-defined boundaries (e.g., a hash window ending in N zero bits), so nearby chunks stay aligned after small edits.
Merkle Tree Allows O(log N) comparison of two large datasets over a low-bandwidth link: if two internal nodes have the same hash, their entire subtrees are identical.
Chunk-shift in standard Merkle trees When CDC is combined with a fixed-branching-factor Merkle tree, a byte shift can cause children to be grouped into different internal nodes, breaking the tree's deduplication.
CDMT (this repo) Makes the grouping threshold content-defined too: a new internal node is created whenever the rolling hash of the current window ends in K zero bits. The tree structure is now shift-resistant at both the chunk and the internal-node level.

Repository Layout

CDMT/
├── cdmt/                       # Core CDMT implementation (C11)
│   ├── main.c/h                # Entry point: iterates a directory, builds CDMT per file
│   ├── pos.c/h                 # POS (partial-order-based similarity) tree construction
│   ├── parse.c/h               # Input file parsing into key-value chunk pairs
│   ├── node.c/h                # Tree node type and operations
│   ├── hashmap.c/h             # Hash map for node deduplication across versions
│   ├── queue.c/h               # Queue utility
│   └── Makefile
├── spew/                       # Tool to pack a directory tree into a single binary blob
│   ├── src/main.c              # Recursively writes directory contents + metadata
│   └── Makefile
├── tools/                      # Python helper scripts
│   ├── fetch.py                # Copy and tar Docker layer diff directories
│   ├── tar.py                  # Tar utilities
│   ├── untar.py                # Untar utilities
│   ├── mapping.py              # Layer-to-image mapping helpers
│   └── CONSTANTS.py (?)
├── order/                      # Layer ordering files for each Docker image
│   └── <image>.txt             # Ordered list of .tar layer filenames
├── experiment2/                # Experiment 2: compression ratio (gzip vs. CDMT)
│   ├── experiment2.py
│   ├── graph2.py / graph2_2.py
├── experiment3/                # Experiment 3: CDMT vs. standard Merkle tree accuracy
│   ├── cdmtvsmerkle/           # C implementation comparing both trees side by side
│   ├── experiment3.py
│   ├── graph3.py
│   └── setup3.sh
├── experiment4/                # Experiment 4: node comparison count and dedup ratio
│   ├── experiment4.py
│   └── graph4.py
├── experiment5/                # Experiment 5: CDMT build time vs. hash-only approaches
│   ├── experiment5_cdmt.py
│   ├── experiment5_hash.py
│   ├── experiment5_size.py
│   └── graph5.py
├── experiment7/                # Experiment 7
│   └── experiment7.py
├── results/                    # Pre-computed experiment results
│   └── experiment5/
├── data.txt                    # List of Docker images used in experiments
└── README.md

Building

CDMT

cd cdmt
make          # optimized build → ./main
make debug    # debug build     → ./debug

Dependencies: gcc, OpenSSL (libssl-dev, libcrypto), librt

sudo apt-get install gcc libssl-dev

CDMT vs. Merkle comparison tool (experiment 3)

cd experiment3/cdmtvsmerkle
make
# Output: ./main

spew (directory packer)

cd spew
make
# Output: ./main

Usage

Running CDMT on a directory

./cdmt/main <directory>

CDMT iterates over every file in <directory>, builds the content-defined Merkle tree for each file, and deduplicates nodes across files using a shared hash map. Outputs:

  • times.txt — per-file timing and node statistics: path:leaves:total_nodes:node_difference:nanoseconds,...
  • posmeta/ — serialized tree node metadata written by POS_write

Input file format

Each input file contains CDC chunks encoded as key-value pairs (one per line). The parse.c module reads these into key_value structs that become the leaves of the CDMT.


Experiments

All experiments use Docker image layer tarballs as benchmark data. The order/<image>.txt files specify the sequence of layer versions for each image (e.g., golang, nginx, postgres, ...).

Experiment 2 — Compression ratio

Measures gzip compression ratio across Docker layer directories.

cd experiment2
python3 experiment2.py
python3 graph2.py      # generate figures

Experiment 3 — CDMT vs. standard Merkle tree

Compares the number of nodes flagged as changed by CDMT vs. a standard Merkle tree when one layer version transitions to the next. CDMT should flag fewer false-positive changes due to shift-robustness.

# Run setup first (downloads/extracts image layers):
cd experiment3
bash setup3.sh <image>

# Run comparison:
./cdmtvsmerkle/main <directory>

# Or run the full benchmark:
python3 experiment3.py
python3 graph3.py

Experiment 4 — Node comparison count and deduplication ratio

Measures how many tree nodes must be compared (transferred) when syncing one layer version to the next, and what fraction of chunks can be deduplicated.

cd experiment4
python3 experiment4.py
python3 graph4.py

Experiment 5 — Build time

Measures end-to-end CDMT construction time vs. a hash-only baseline.

cd experiment5
python3 experiment5_cdmt.py    # CDMT timing
python3 experiment5_hash.py    # hash-only baseline
python3 experiment5_size.py    # size metrics
python3 graph5.py              # generate figures

Experiment 7

cd experiment7
python3 experiment7.py

Docker Images Used

Benchmarks cover 15 Docker Hub images, each tested across multiple version tags:

deepmind, django, golang, httpd, mysql, nginx, node, postgres, pytorch, python, r-base, rails, redis, tensorflow, tomcat

The order/<image>.txt files list the .tar layer filenames in version order.


Known Limitations

  • The experiment scripts use hardcoded paths (/home/yutan/cdmt/...) — update these before running on a new machine
  • Experiments require Docker layer tarballs to be fetched and extracted in advance (see tools/fetch.py)
  • Multi-threaded comparison is not yet implemented

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors