CDMT is an implementation of a Content-Defined Merkle Tree (CDMT): a hash tree whose internal node grouping is determined by the hash values of its children rather than by a fixed branching factor. This makes it robust against chunk-shift — the problem where a single byte insertion or deletion causes an entire standard Merkle tree to restructure because its fixed-width chunking falls on different boundaries.
The primary use case is efficient similarity comparison of Docker image layers across versions.
| Concept | Problem it solves |
|---|---|
| Content-Defined Chunking (CDC) | Fixed-size chunking is not robust to byte insertions/deletions — one change shifts every subsequent chunk boundary. CDC cuts at content-defined boundaries (e.g., a hash window ending in N zero bits), so nearby chunks stay aligned after small edits. |
| Merkle Tree | Allows O(log N) comparison of two large datasets over a low-bandwidth link: if two internal nodes have the same hash, their entire subtrees are identical. |
| Chunk-shift in standard Merkle trees | When CDC is combined with a fixed-branching-factor Merkle tree, a byte shift can cause children to be grouped into different internal nodes, breaking the tree's deduplication. |
| CDMT (this repo) | Makes the grouping threshold content-defined too: a new internal node is created whenever the rolling hash of the current window ends in K zero bits. The tree structure is now shift-resistant at both the chunk and the internal-node level. |
CDMT/
├── cdmt/ # Core CDMT implementation (C11)
│ ├── main.c/h # Entry point: iterates a directory, builds CDMT per file
│ ├── pos.c/h # POS (partial-order-based similarity) tree construction
│ ├── parse.c/h # Input file parsing into key-value chunk pairs
│ ├── node.c/h # Tree node type and operations
│ ├── hashmap.c/h # Hash map for node deduplication across versions
│ ├── queue.c/h # Queue utility
│ └── Makefile
├── spew/ # Tool to pack a directory tree into a single binary blob
│ ├── src/main.c # Recursively writes directory contents + metadata
│ └── Makefile
├── tools/ # Python helper scripts
│ ├── fetch.py # Copy and tar Docker layer diff directories
│ ├── tar.py # Tar utilities
│ ├── untar.py # Untar utilities
│ ├── mapping.py # Layer-to-image mapping helpers
│ └── CONSTANTS.py (?)
├── order/ # Layer ordering files for each Docker image
│ └── <image>.txt # Ordered list of .tar layer filenames
├── experiment2/ # Experiment 2: compression ratio (gzip vs. CDMT)
│ ├── experiment2.py
│ ├── graph2.py / graph2_2.py
├── experiment3/ # Experiment 3: CDMT vs. standard Merkle tree accuracy
│ ├── cdmtvsmerkle/ # C implementation comparing both trees side by side
│ ├── experiment3.py
│ ├── graph3.py
│ └── setup3.sh
├── experiment4/ # Experiment 4: node comparison count and dedup ratio
│ ├── experiment4.py
│ └── graph4.py
├── experiment5/ # Experiment 5: CDMT build time vs. hash-only approaches
│ ├── experiment5_cdmt.py
│ ├── experiment5_hash.py
│ ├── experiment5_size.py
│ └── graph5.py
├── experiment7/ # Experiment 7
│ └── experiment7.py
├── results/ # Pre-computed experiment results
│ └── experiment5/
├── data.txt # List of Docker images used in experiments
└── README.md
cd cdmt
make # optimized build → ./main
make debug # debug build → ./debugDependencies: gcc, OpenSSL (libssl-dev, libcrypto), librt
sudo apt-get install gcc libssl-devcd experiment3/cdmtvsmerkle
make
# Output: ./maincd spew
make
# Output: ./main./cdmt/main <directory>CDMT iterates over every file in <directory>, builds the content-defined Merkle tree for each file, and deduplicates nodes across files using a shared hash map. Outputs:
times.txt— per-file timing and node statistics:path:leaves:total_nodes:node_difference:nanoseconds,...posmeta/— serialized tree node metadata written byPOS_write
Each input file contains CDC chunks encoded as key-value pairs (one per line). The parse.c module reads these into key_value structs that become the leaves of the CDMT.
All experiments use Docker image layer tarballs as benchmark data. The order/<image>.txt files specify the sequence of layer versions for each image (e.g., golang, nginx, postgres, ...).
Measures gzip compression ratio across Docker layer directories.
cd experiment2
python3 experiment2.py
python3 graph2.py # generate figuresCompares the number of nodes flagged as changed by CDMT vs. a standard Merkle tree when one layer version transitions to the next. CDMT should flag fewer false-positive changes due to shift-robustness.
# Run setup first (downloads/extracts image layers):
cd experiment3
bash setup3.sh <image>
# Run comparison:
./cdmtvsmerkle/main <directory>
# Or run the full benchmark:
python3 experiment3.py
python3 graph3.pyMeasures how many tree nodes must be compared (transferred) when syncing one layer version to the next, and what fraction of chunks can be deduplicated.
cd experiment4
python3 experiment4.py
python3 graph4.pyMeasures end-to-end CDMT construction time vs. a hash-only baseline.
cd experiment5
python3 experiment5_cdmt.py # CDMT timing
python3 experiment5_hash.py # hash-only baseline
python3 experiment5_size.py # size metrics
python3 graph5.py # generate figurescd experiment7
python3 experiment7.pyBenchmarks cover 15 Docker Hub images, each tested across multiple version tags:
deepmind, django, golang, httpd, mysql, nginx, node, postgres, pytorch, python, r-base, rails, redis, tensorflow, tomcat
The order/<image>.txt files list the .tar layer filenames in version order.
- The experiment scripts use hardcoded paths (
/home/yutan/cdmt/...) — update these before running on a new machine - Experiments require Docker layer tarballs to be fetched and extracted in advance (see
tools/fetch.py) - Multi-threaded comparison is not yet implemented