docs: add pack file format design and internals documentation, refs #8572 #9669

Draft
mr-raj12 wants to merge 1 commit into borgbackup:master from mr-raj12:docs/internals-pack-files

@mr-raj12
Contributor

pack format design and documentation

Adds docs/internals/packs.rst covering the pack file format design: binary layout, pack ID, Phase 1 (N=1), write order/crash safety, index namespace, recovery path, and repository version requirements.
Includes three diagrams (ObjHeader structure, binary layout, write order).

refs #8572.

Member

@ThomasWaldmann left a comment

Here is some feedback. If there are questions or other ideas, we can either discuss here or on IRC.

Member

@ThomasWaldmann, May 27, 2026

This is quite close to what we have currently, so it might be good as a minimal-change first step.

But I have thought about these changes:

  • blob_len is somewhat redundant; it can be computed from the meta and data lengths
  • if the length values are corrupted, all entries afterwards are undiscoverable, because we can't determine their offsets. I thought about repeating the "BORGPACK" magic (and version) before EACH of the blob entries. Then we could scan for it if a length is corrupted (and we could also easily notice THAT a length is corrupted). This would make all pack entries look the same and remove the special case for the magic header.
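
The per-entry magic idea can be sketched as follows. This is only an illustration under stated assumptions: the header layout (magic, 1-byte version, two little-endian 32-bit lengths) and the names `HEADER`, `make_entry`, `scan_for_entries` and `walk_by_lengths` are hypothetical and are not taken from the PR. Only the "BORGPACK" magic comes from the comment above.

```python
import struct

MAGIC = b"BORGPACK"  # magic string quoted in the review comment
# Hypothetical per-entry header: magic, version, meta_len, data_len (no padding).
HEADER = struct.Struct("<8sBII")


def make_entry(meta: bytes, data: bytes, version: int = 1) -> bytes:
    """Build one entry: header followed by the meta and data blobs."""
    return HEADER.pack(MAGIC, version, len(meta), len(data)) + meta + data


def walk_by_lengths(pack: bytes) -> list[int]:
    """Follow the length fields; stop where the next header does not check out."""
    offsets = []
    pos = 0
    while pos + HEADER.size <= len(pack):
        magic, _version, meta_len, data_len = HEADER.unpack_from(pack, pos)
        if magic != MAGIC:
            break  # an earlier length was wrong, or this header is damaged
        offsets.append(pos)
        # No separate blob_len is needed: the entry size follows from the two lengths.
        pos += HEADER.size + meta_len + data_len
    return offsets


def scan_for_entries(pack: bytes) -> list[int]:
    """Return every offset where the magic occurs (candidate entry starts)."""
    offsets = []
    pos = pack.find(MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = pack.find(MAGIC, pos + len(MAGIC))
    return offsets
```

If the two functions disagree for a pack, a length field is likely corrupted, and the scan result still gives candidate offsets for the remaining entries. The magic could also occur by chance inside blob contents, so a real implementation would need to validate each candidate.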

Member

  • we can now check full pack integrity server-side via sha256 (that was not possible before), so what would xxh64(meta) and xxh64(data) give us in addition? If borg check runs on the client and has detected (via the server-side sha256) that a pack is damaged, it could read the pack from the client. The AEAD decryption would detect any problem in the meta and data blobs (and just fail), since each of these is encrypted/authenticated separately.
  • the magic and version could just be part of this header structure
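
A whole-pack check of the kind mentioned above can be sketched with the standard library. The function name `pack_is_intact` and the idea that the expected digest is available to the checker as a hex string are assumptions for illustration; the thread does not say where the digest is stored.

```python
import hashlib
import hmac


def pack_is_intact(pack: bytes, expected_sha256_hex: str) -> bool:
    """Compare the SHA-256 of the full pack against the expected digest."""
    actual = hashlib.sha256(pack).hexdigest()
    # compare_digest keeps the comparison constant-time; plain == would also work here.
    return hmac.compare_digest(actual, expected_sha256_hex)
```

A check like this only says that the pack as a whole changed. It does not say which entry is damaged, which is where the per-blob AEAD authentication comes in.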

Member

As I mentioned, there is no real "manifest" (meaning a signed list of archives) in borg2 now. Each single archive is a trust root now. The manifest is rather an artifact from the past which I would like to get rid of.

So, "write archive" + "write archive pointer" instead of "write manifest"?

Comment thread docs/internals/packs.rst
Comment on lines +9 to +12
Borg currently stores each repository object (chunk) as a separate object in the
borgstore. For large repositories this means millions of individual objects, each
requiring its own I/O round trip to read or write. On high-latency backends (SFTP,
cloud object storage) this overhead dominates backup and restore times.
Member

As the docs describe the current state, we will have to remove this later.

Comment thread docs/internals/packs.rst
Comment on lines +74 to +75
RepoObj wire format and are left as-is. They remain useful for the keyless recovery
scan (see :ref:`pack-recovery`) where AEAD decryption is not available.
Member

In the encryption mode docs, we recommend that everybody use at least an "authenticated" mode (and not "none"). So we could usually rely on the crypto authentication.

Those people who insist on using the least safe mode would still have the full pack sha256, but would lose a full pack if something goes wrong.

Comment thread docs/internals/packs.rst
Index Namespace
---------------

Borg does not embed a table of contents inside each pack file. Chunk-to-location
Member

Don't document what neither exists nor is planned.

Comment thread docs/internals/packs.rst
mappings are stored as a separate set of encrypted piece files under the ``index/``
namespace.

Each piece file covers the packs written in one backup session. Its name is the
Member

maybe "partial index file" is a better term?

Comment thread docs/internals/packs.rst
Comment on lines +234 to +236
``borg compact`` consolidates all existing piece files into a single replacement file
that covers only live chunks, writes it to ``index/``, and removes the files it
supersedes. This keeps the namespace small and open-time merge cost bounded.
Member

In the end, we rather want medium-sized partial indexes and avoid creation of all-in-one indexes.

If we create big single index files, we will cause a lot of traffic due to cache invalidation on other clients.
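
To make the trade-off concrete: a client merges all partial indexes when it opens the repository, and compaction could re-split the live entries into several medium-sized files instead of one. The sketch below uses hypothetical names (`Location`, `merge_partial_indexes`, `split_index`) and an arbitrary entry cap; none of them come from the PR, and the real index entries may carry different fields.

```python
from typing import Iterable, NamedTuple


class Location(NamedTuple):
    """Hypothetical index value: where a chunk lives inside a pack."""
    pack_id: str
    offset: int
    length: int


def merge_partial_indexes(
    partials: Iterable[dict[bytes, Location]],
) -> dict[bytes, Location]:
    """Merge partial indexes into one mapping; later partials win on conflicts."""
    merged: dict[bytes, Location] = {}
    for partial in partials:
        merged.update(partial)
    return merged


def split_index(
    index: dict[bytes, Location], max_entries: int
) -> list[dict[bytes, Location]]:
    """Split an index into partial indexes of at most max_entries entries each."""
    items = sorted(index.items())
    return [
        dict(items[i:i + max_entries])
        for i in range(0, len(items), max_entries)
    ]
```

With bounded partial indexes, a change to a few chunks invalidates only the affected files in other clients' caches, not one large all-in-one index.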

Comment thread docs/internals/packs.rst
it by forward-scanning all pack files in ``packs/``.

The 4-byte ``blob_len`` prefix before each blob makes the scan self-contained: no
prior knowledge of blob sizes or count is required. The algorithm for one pack file::
Member

Maybe don't put detailed algorithms into the docs; they will change in the real code and the docs will soon be outdated.
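
For readers of this thread only (not proposed for the docs), the forward scan that the quoted lines describe can be sketched generically. The little-endian byte order, the start offset and the name `iter_blobs` are assumptions; the quoted text only states that a 4-byte ``blob_len`` precedes each blob.

```python
import struct
from typing import Iterator


def iter_blobs(buf: bytes, start: int = 0) -> Iterator[tuple[int, bytes]]:
    """Yield (offset, blob) for each length-prefixed blob, scanning forward."""
    pos = start
    while pos + 4 <= len(buf):
        # Assumed encoding: unsigned 32-bit little-endian length prefix.
        (blob_len,) = struct.unpack_from("<I", buf, pos)
        end = pos + 4 + blob_len
        if end > len(buf):
            break  # truncated pack or corrupted length: stop scanning
        yield pos + 4, buf[pos + 4:end]
        pos = end
```

Note that a single corrupted length stops the scan, which is the weakness raised earlier in this review.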

Comment thread docs/internals/packs.rst
Comment on lines +283 to +284
Reconstructing ``chunk_id`` values requires the repository key because the chunk ID
is a keyed MAC of the plaintext data (``id_hash(plaintext_data)``). Without the key,
Member

I guess we have the chunk_id inside that "meta" part, so we do not have to read all the data just to recompute it when rebuilding the index.
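
Why the key is needed to recompute a chunk ID can be shown with a small sketch. It assumes HMAC-SHA256 as the keyed MAC; the quoted docs only say "keyed MAC", and the actual function depends on the repository's key type. The `id_key` parameter name is illustrative.

```python
import hashlib
import hmac


def id_hash(id_key: bytes, plaintext: bytes) -> bytes:
    """Keyed MAC over the plaintext chunk data (assumed here: HMAC-SHA256)."""
    return hmac.new(id_key, plaintext, hashlib.sha256).digest()
```

The same plaintext yields a different ID under a different key, so a keyless scan cannot derive chunk IDs from the data. Reading the ID from the "meta" part, as suggested above, avoids recomputing it even when the key is available.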
