docs: add pack file format design and internals documentation, refs #8572 by mr-raj12 · Pull Request #9669 · borgbackup/borg

mr-raj12 · 2026-05-27T07:59:35Z

pack format design and documentation

Adds docs/internals/packs.rst covering the pack file format design: binary layout, pack ID, Phase 1 (N=1), write order/crash safety, index namespace, recovery path, and repository version requirements.
Includes three diagrams (ObjHeader structure, binary layout, write order).

refs #8572.


ThomasWaldmann

Here is some feedback. If there are questions or other ideas, we can either discuss here or on IRC.

ThomasWaldmann · 2026-05-27T08:32:03Z

This is quite close to what we have currently, so it might be a good low-change first step.

But I have thought about these changes:

blob_len is somewhat redundant: it can be computed from the meta and data lengths.

If the length values are corrupted, all entries afterwards are undiscoverable, because we can't determine their offsets. I thought about repeating the "BORGPACK" magic (and version) before EACH of the blob entries. Then we could scan for it in case a length is corrupted (and we could also easily notice THAT a length is corrupted). This would make all pack entries look the same and remove the special case for the magic header.
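
A minimal sketch of the resync idea, assuming a hypothetical per-entry header of magic, version, meta length and data length (only the "BORGPACK" magic comes from this discussion; the field order, widths and function names are made up for illustration). There is no separate blob_len, since it follows from the two lengths:

```python
import struct

MAGIC = b"BORGPACK"  # magic from the discussion above
# Hypothetical header layout: magic, version, meta_len, data_len (little-endian).
HEADER = struct.Struct("<8sBII")


def pack_entry(meta: bytes, data: bytes, version: int = 1) -> bytes:
    """Serialize one entry; every entry starts with the same magic."""
    return HEADER.pack(MAGIC, version, len(meta), len(data)) + meta + data


def scan_entries(buf: bytes):
    """Yield (offset, meta, data) per entry, resyncing on MAGIC after damage."""
    pos = 0
    while True:
        pos = buf.find(MAGIC, pos)
        if pos < 0 or pos + HEADER.size > len(buf):
            return
        _, _version, meta_len, data_len = HEADER.unpack_from(buf, pos)
        body = pos + HEADER.size
        end = body + meta_len + data_len  # the "blob length" is simply derived
        # Plausibility check: the entry must end at EOF or at the next magic.
        if end == len(buf) or buf.startswith(MAGIC, end):
            yield pos, buf[body:body + meta_len], buf[body + meta_len:end]
            pos = end
        else:
            # Lengths look corrupted: skip this magic and search for the next one.
            pos += len(MAGIC)
```

With a corrupted length in one entry, the scan drops that entry and still finds the ones after it. The plausibility check here is only a stand-in; with real packs, AEAD authentication of the meta and data blobs would be the actual arbiter.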

ThomasWaldmann · 2026-05-27T08:34:51Z

We can now check full pack integrity server-side by sha256 (that was not the case before), so what would xxh64(meta) and xxh64(data) give us additionally? If borg check runs on the client and has detected that a pack is damaged (by server-side sha256), it could read the pack from the client. The AEAD decryption would detect any problem in the meta and data blobs (and just fail), since each of these is encrypted/authenticated separately.
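
To illustrate the whole-pack check only: a small sketch that hashes a pack as a stream and compares it to an expected digest. How the expected sha256 is obtained (for example from the pack name) is an assumption here, and the function names are hypothetical:

```python
import hashlib
import io


def sha256_of_stream(stream, bufsize: int = 1 << 20) -> str:
    """Hash a file-like object in fixed-size reads, without loading it fully."""
    h = hashlib.sha256()
    while chunk := stream.read(bufsize):
        h.update(chunk)
    return h.hexdigest()


def pack_is_intact(stream, expected_sha256_hex: str) -> bool:
    """True if the stream's sha256 matches the expected digest (assumed known)."""
    return sha256_of_stream(stream) == expected_sha256_hex
```

A single flipped bit anywhere in the pack makes this check fail, but it does not say which entry is damaged; in the approach described above, that would be left to the per-blob AEAD decryption on the client.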

The magic and version could just be part of this header structure.

ThomasWaldmann · 2026-05-27T08:38:06Z

As I mentioned, there is no real "manifest" (meaning a signed list of archives) in borg2 now. Each single archive is a trust root now. The manifest is rather an artifact from the past which I would like to get rid of.

So, "write archive" + "write archive pointer" instead of "write manifest"?

ThomasWaldmann · 2026-05-27T08:39:48Z

+Borg currently stores each repository object (chunk) as a separate object in the
+borgstore.  For large repositories this means millions of individual objects, each
+requiring its own I/O round trip to read or write.  On high-latency backends (SFTP,
+cloud object storage) this overhead dominates backup and restore times.


As the docs describe the current state, we will have to remove this later.

ThomasWaldmann · 2026-05-27T08:44:44Z

+RepoObj wire format and are left as-is.  They remain useful for the keyless recovery
+scan (see :ref:`pack-recovery`) where AEAD decryption is not available.


In the encryption mode docs, we recommend that everybody is at least using an "authenticated" mode (and not "none"). So we could use the crypto authentication usually.

Those people who insist on using the least safe mode would still have the full pack sha256, but would lose a full pack if something goes wrong.

ThomasWaldmann · 2026-05-27T09:00:35Z

+Index Namespace
+---------------
+
+Borg does not embed a table of contents inside each pack file.  Chunk-to-location


Don't document what neither exists nor is planned.

ThomasWaldmann · 2026-05-27T09:01:56Z

+mappings are stored as a separate set of encrypted piece files under the ``index/``
+namespace.
+
+Each piece file covers the packs written in one backup session.  Its name is the


Maybe "partial index file" is a better term?

ThomasWaldmann · 2026-05-27T09:07:39Z

+``borg compact`` consolidates all existing piece files into a single replacement file
+that covers only live chunks, writes it to ``index/``, and removes the files it
+supersedes.  This keeps the namespace small and open-time merge cost bounded.


In the end, we rather want medium-sized partial indexes and avoid creation of all-in-one indexes.

If we create big single index files, we will cause a lot of traffic due to cache invalidation on other clients.
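
One way to picture "medium-sized" is a greedy merge plan that only touches small partial index files. This is purely illustrative: the function name, the size-based threshold and the idea of leaving already large files alone are assumptions, not an agreed design:

```python
def plan_merges(sizes: dict[str, int], target: int) -> list[list[str]]:
    """Group small partial index files into merge batches of roughly `target` bytes."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name, size in sorted(sizes.items()):
        if size >= target:
            # Already medium-sized: leave it untouched so cached copies stay valid.
            continue
        current.append(name)
        current_size += size
        if current_size >= target:
            batches.append(current)
            current, current_size = [], 0
    if len(current) > 1:
        # Rewriting a single leftover file would gain nothing.
        batches.append(current)
    return batches
```

Files that are never rewritten never invalidate another client's cache, which is the traffic concern raised above.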

ThomasWaldmann · 2026-05-27T09:09:00Z

+it by forward-scanning all pack files in ``packs/``.
+
+The 4-byte ``blob_len`` prefix before each blob makes the scan self-contained: no
+prior knowledge of blob sizes or count is required.  The algorithm for one pack file::


Maybe don't put detailed algorithms into the docs; they will change in the real code and become outdated soon.

ThomasWaldmann · 2026-05-27T09:11:58Z

+Reconstructing ``chunk_id`` values requires the repository key because the chunk ID
+is a keyed MAC of the plaintext data (``id_hash(plaintext_data)``).  Without the key,


I guess we have the chunk_id inside that "meta" part, so we do not have to read all the data just to recompute it when rebuilding the index.
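
A rough sketch of what that would allow, under loud assumptions: a hypothetical header of two lengths per entry, a 32-byte chunk_id at the start of the meta part, and meta treated as plaintext (real meta is encrypted, so decryption would have to happen where noted). The point is only that the data blob can be skipped with a seek:

```python
import io
import struct

HEADER = struct.Struct("<II")  # hypothetical: meta_len, data_len (magic omitted for brevity)
ID_LEN = 32  # assumed chunk_id size


def rebuild_index(pack) -> dict:
    """Map chunk_id -> (entry offset, data_len), reading only headers and meta."""
    index = {}
    while True:
        offset = pack.tell()
        raw = pack.read(HEADER.size)
        if len(raw) < HEADER.size:
            return index  # end of pack (or a truncated trailing header)
        meta_len, data_len = HEADER.unpack(raw)
        meta = pack.read(meta_len)  # real meta would need AEAD decryption here
        chunk_id = meta[:ID_LEN]  # assumption: meta starts with the chunk_id
        index[chunk_id] = (offset, data_len)
        pack.seek(data_len, io.SEEK_CUR)  # skip the data blob without reading it
```

On a backend that supports ranged reads, this would turn an index rebuild into reading a small fraction of each pack rather than all of it.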

docs: add pack file format design and internals documentation, refs b…

5d6cafe

…orgbackup#8572

ThomasWaldmann requested changes May 27, 2026

mr-raj12 wants to merge 1 commit into borgbackup:master from mr-raj12:docs/internals-pack-files
