Add binary output format RFC by sims1253 · Pull Request #58 · stan-dev/design-docs

sims1253 · 2026-03-09T23:05:13Z

No description provided.

jgabry · 2026-03-09T23:55:01Z

A rendered version is at https://github.com/sims1253/stan-design-docs/blob/master/designs/0035-stanbin-binary-output.md

WardBrian

Thanks so much @sims1253!

A bunch of comments to start. Note that if I could wave a wand and get exactly what you wrote, I'd take it! But I do think there are some minor improvements possible.

WardBrian · 2026-03-09T23:20:02Z

designs/0035-stanbin-binary-output.md

+| 12 | 4 | uint32 | Flags (`0` in v1; reserved for extensions) |
+| 16 | 8 | uint64 | Number of rows (draws) |
+| 24 | 8 | uint64 | Number of columns (parameters) |
+| 32 | 4 | uint32 | Data section offset in bytes (`64 + names_size` in v1) |


We will want to avoid unaligned reads in the data section, so I think it will be important to at least 8-byte align the start of the data section. So this will be something like 64 + ((names_size + 7) / 8) * 8
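To make the padding arithmetic concrete, here is a minimal sketch of the rounding suggested above. The 64-byte header size comes from the RFC table; the names `align_up` and `data_section_offset` are hypothetical helpers, not part of the proposal, and the arithmetic is done in `uint64_t` even though the header field itself is a `uint32`.

```cpp
#include <cstdint>

// Header size in bytes, taken from the v1 header table in the RFC.
constexpr std::uint64_t kHeaderSize = 64;

// Round `n` up to the next multiple of `alignment`.
// `alignment` must be a power of two for the bit mask to be valid.
constexpr std::uint64_t align_up(std::uint64_t n, std::uint64_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: offset of the data section when the names section
// is padded so that the first double starts on an 8-byte boundary.
constexpr std::uint64_t data_section_offset(std::uint64_t names_size) {
  return kHeaderSize + align_up(names_size, 8);
}

// Same result as 64 + ((names_size + 7) / 8) * 8 from the comment above.
static_assert(data_section_offset(0) == 64, "empty names section");
static_assert(data_section_offset(13) == 80, "13 bytes padded to 16");
static_assert(data_section_offset(16) == 80, "already aligned");
```

With this rule the names section is followed by up to 7 padding bytes, so a reader should rely on the stored offset rather than recomputing `64 + names_size`.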

WardBrian · 2026-03-10T00:16:27Z

designs/0035-stanbin-binary-output.md

+
+2. Should stanbin become the default output format in a future CmdStan release, or remain opt-in indefinitely?
+
+3. Is the trailing metadata section (raw CSV comment text) the right long-term metadata representation, or should v1 adopt a structured format (e.g., key-value pairs) from the start?


My thought is either start as structured key-value metadata, or cut it entirely, but I'm mostly ambivalent (cmdstanpy can just stick to reading this information from the json version cmdstan can provide).

imo this is also a good opportunity to have the metadata go to a separate file completely. So we would write a little json file with all the metadata.

Curious if there is some kind of consensus here. I thought having it as one file was desired. 2 files would actually be easier to implement :D My first draft of this actually used a separate file for the metadata.

We already have an argument called save_cmdstan_config which creates a json file equivalent of the opening comments. The adaptation metadata is also saved elsewhere, so it’s really only timing that is still available only as a comment, and we’d also like to change that (stan-dev/stan#3340)

WardBrian · 2026-03-10T13:59:09Z

designs/0035-stanbin-binary-output.md

+  void operator()(const std::vector<double>& state) override {
+    write_row(state);
+    ++num_rows_;
+    update_rows_in_header(num_rows_);


You say earlier:

A reader that wants to attempt partial recovery from an incomplete file can compute the number of usable rows from the file size

but here you show the header being modified each write. I think the text earlier is much better: the rows in the header should just be set to the intended number of rows, to avoid needing to constantly seek back and forth.

If we really want to avoid a number of rows that is not 'true', it should be moved to finalization instead and left as 0 for the intermediary steps.
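For reference, the file-size based recovery quoted above could look roughly like this. It is a sketch under stated assumptions: row-major `double` values, no trailing metadata in the unfinalized file, and a hypothetical function name `usable_rows`.

```cpp
#include <cstdint>

// Number of complete rows that fit between the start of the data section
// and the end of the file. Returns 0 for degenerate inputs.
// Assumes row-major doubles and an unfinalized file (no trailing metadata).
inline std::uint64_t usable_rows(std::uint64_t file_size,
                                 std::uint64_t data_offset,
                                 std::uint64_t num_cols) {
  if (num_cols == 0 || file_size <= data_offset) {
    return 0;
  }
  const std::uint64_t row_bytes = num_cols * sizeof(double);
  // Integer division drops a partially written final row.
  return (file_size - data_offset) / row_bytes;
}
```

Because the row count is derived from the file size, the writer never needs to seek back to the header while sampling is in progress.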

WardBrian · 2026-03-10T14:00:58Z

designs/0035-stanbin-binary-output.md

+# Drawbacks
+[drawbacks]: #drawbacks
+
+1. Not human-readable: Unlike CSV, stanbin files cannot simply be inspected with a text editor. Users must use provided reader functions.


[note] a hex editor would be usable in a pinch

WardBrian · 2026-03-10T14:01:54Z

designs/0035-stanbin-binary-output.md

+
+1. Not human-readable: Unlike CSV, stanbin files cannot simply be inspected with a text editor. Users must use provided reader functions.
+
+2. Tool ecosystem: Existing tools (stansummary, arviz, etc.) expect CSV format. These would need updates to support stanbin.


Note: I think it will also be important to provide a new cmdstan/bin/stanbin2csv that converts these files into more-or-less the normal stan csvs

WardBrian · 2026-03-10T14:03:17Z

designs/0035-stanbin-binary-output.md

+
+## To resolve before merging this RFC
+
+1. Should the diagnostic file also support stanbin output in v1, or should it remain CSV/JSON only? The current proposal excludes it, but reviewers may want a unified binary path.


Inasmuch as the API proposed here matches the writer interface, I think we should be able to get this 'for free'/with very minimal additional effort

WardBrian · 2026-03-10T14:04:07Z

designs/0035-stanbin-binary-output.md

+
+1. Should the diagnostic file also support stanbin output in v1, or should it remain CSV/JSON only? The current proposal excludes it, but reviewers may want a unified binary path.
+
+2. Should stanbin become the default output format in a future CmdStan release, or remain opt-in indefinitely?


I think that whatever the default is in cmdstanr/cmdstanpy will end up being the most used thing, regardless

WardBrian · 2026-03-10T14:06:03Z

designs/0035-stanbin-binary-output.md

+
+2. Finalization edge cases: The writer rewrites the header on successful close. The behavior when sampling is interrupted (e.g., SIGINT during warmup) should be tested to confirm that readers can detect and recover from incomplete files via `metadata_offset == 0`.
+
+3. Integration scope: Which sampling algorithms are tested and supported at initial release.


I would say it should be all or nothing: we already have too many edge cases in the command-line parser up in cmdstan to add more partial coverage like this. Plus, since it follows the writer interface, it should be fine either way.

WardBrian · 2026-03-10T14:06:40Z

designs/0035-stanbin-binary-output.md

+    update_rows_in_header(num_rows_);
+  }
+
+  void finalize() {


Is there something preventing this from being a destructor?

From what I understand, destructors should not throw errors, and given that I/O can fail, a separate finalize makes handling that easier. The real implementation could do some kind of best-effort cleanup in a destructor, but I think that might be beyond the scope of the spec?

sims1253 · 2026-03-10T16:12:32Z

Just a quick question re. the process: There are a few things that I would just count as oversights in the proposal that I would simply fix/adapt with a new commit. Is there anything to consider there or do I just push a new commit?

WardBrian · 2026-03-10T16:13:42Z

Yep, until the PR is merged it is open for modification via normal commits during the discussions

SteveBronder · 2026-03-11T17:02:17Z

(one minor note after addressing @WardBrian 's comments): For Markdown, it's nice to start each sentence on a new line so that reviewers can comment on individual sentences more easily. Markdown only puts a true \n newline into the rendered document if there is a blank line between the two lines.

i.e. this will be rendered on the same line

The quick brown fox.
Jumps over the lazy dog.

but this will be rendered with a newline

The quick brown fox.

Jumps over the lazy dog.

You should be able to just do a regex find and replace to replace . with . \n

SteveBronder

I like it! imo the only questions from me, besides below, are things we will resolve during the PR

SteveBronder · 2026-03-11T17:21:14Z

designs/0035-stanbin-binary-output.md

+| Offset | Size | Type | Description |
+|--------|------|------|-------------|
+| 0 | 8 | char[8] | Magic: `"STANBIN\0"` |
+| 8 | 4 | uint32 | Version (`1`) |
+| 12 | 4 | uint32 | Flags (`0` in v1; reserved for extensions) |
+| 16 | 8 | uint64 | Number of rows (draws) |
+| 24 | 8 | uint64 | Number of columns (parameters) |
+| 32 | 4 | uint32 | Data section offset in bytes (`64 + names_size` in v1) |
+| 36 | 4 | uint32 | Names section size in bytes |
+| 40 | 4 | uint32 | Layout parameter (`0` = row-major in v1; non-zero values reserved for extensions such as chunking) |
+| 44 | 8 | uint64 | Metadata section offset (`0` if file not yet finalized) |
+| 52 | 8 | uint64 | Metadata section size in bytes |
+| 60 | 4 | reserved | Reserved for future use |


A few minor things here.

Do we expect the version to need a uint32? We could probably just use a uint8 here, since I don't think we will exceed 255 versions.

If we are treating flags like a bitset, then I think we can remove Layout and specify it as a flag bit. I think making the flags field a uint64 would also be nice, since that gives us 64 options to choose from in the future.

I think the metadata section size should just be a uint32. If the size is in bytes, I don't think we will ever go over 4GB of metadata.

Suggested change

-| Offset | Size | Type | Description |
-|--------|------|------|-------------|
-| 0 | 8 | char[8] | Magic: `"STANBIN\0"` |
-| 8 | 4 | uint32 | Version (`1`) |
-| 12 | 4 | uint32 | Flags (`0` in v1; reserved for extensions) |
-| 16 | 8 | uint64 | Number of rows (draws) |
-| 24 | 8 | uint64 | Number of columns (parameters) |
-| 32 | 4 | uint32 | Data section offset in bytes (`64 + names_size` in v1) |
-| 36 | 4 | uint32 | Names section size in bytes |
-| 40 | 4 | uint32 | Layout parameter (`0` = row-major in v1; non-zero values reserved for extensions such as chunking) |
-| 44 | 8 | uint64 | Metadata section offset (`0` if file not yet finalized) |
-| 52 | 8 | uint64 | Metadata section size in bytes |
-| 60 | 4 | reserved | Reserved for future use |
+| Offset (bytes) | Size (bytes) | Type | Description |
+|--------|------|------|-------------|
+| 0 | 8 | char[8] | Magic: `"STANBIN\0"` |
+| 8 | 8 | uint64 | Flags (see flags below) |
+| 16 | 8 | uint64 | Number of rows (draws) |
+| 24 | 8 | uint64 | Number of columns (parameters) |
+| 32 | 4 | uint32 | Data section offset in bytes (`64 + names_size` in v1) |
+| 36 | 4 | uint32 | Names section size in bytes |
+| 40 | 8 | uint64 | Metadata section offset (`0` if file not yet finalized) |
+| 48 | 4 | uint32 | Metadata section size in bytes |
+| 52 | 1 | uint8 | Version (`1`) |
+| 53 | 11 | reserved | Reserved for future use |

Flags for each bit

`STAN_CHUNKING_FORMAT`: Specifies that the data is in Stan's chunking format

...

I incorporated the metadata-size change to uint32. I left the current version/flags/layout split in place for now because it still reads a bit more directly in the RFC, but I’m happy to revisit that if there is a stronger preference to collapse layout into flags.
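As an illustration of the suggested layout (not the layout currently in the RFC), the sketch below packs the fields into a 64-byte buffer at the offsets from the suggested table. Everything named here is an assumption for illustration: `header_fields`, `put`, `pack_header`, and in particular the bit position chosen for `STAN_CHUNKING_FORMAT`. Fields are copied in native byte order, since the thread does not discuss endianness.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical flag bit: the review only names the flag,
// the bit position (bit 0) is an assumption.
constexpr std::uint64_t STAN_CHUNKING_FORMAT = std::uint64_t{1} << 0;

// Field values for the suggested header layout.
struct header_fields {
  std::uint64_t flags = 0;
  std::uint64_t num_rows = 0;
  std::uint64_t num_cols = 0;
  std::uint32_t data_offset = 64;
  std::uint32_t names_size = 0;
  std::uint64_t metadata_offset = 0;  // 0 means "not yet finalized"
  std::uint32_t metadata_size = 0;
  std::uint8_t version = 1;
};

// Copy one field into the buffer at a fixed byte offset (native byte order).
template <typename T>
void put(std::array<unsigned char, 64>& buf, std::size_t offset, T value) {
  std::memcpy(buf.data() + offset, &value, sizeof(T));
}

// Serialize field by field so the on-disk layout never depends on
// compiler-inserted struct padding.
std::array<unsigned char, 64> pack_header(const header_fields& h) {
  std::array<unsigned char, 64> buf{};  // zero-filled: reserved bytes stay 0
  std::memcpy(buf.data(), "STANBIN\0", 8);
  put(buf, 8, h.flags);
  put(buf, 16, h.num_rows);
  put(buf, 24, h.num_cols);
  put(buf, 32, h.data_offset);
  put(buf, 36, h.names_size);
  put(buf, 40, h.metadata_offset);
  put(buf, 48, h.metadata_size);
  put(buf, 52, h.version);
  return buf;
}
```

In this arrangement every multi-byte field sits at an offset that is a multiple of its size, whereas the original table places a uint64 at offset 44.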

SteveBronder · 2026-03-11T17:35:46Z

designs/0035-stanbin-binary-output.md

+    write_row(state);
+    ++num_rows_;
+    update_rows_in_header(num_rows_);


I would add a return flag from write_row so we only update the number of rows / the header if we successfully wrote

Suggested change

-    write_row(state);
-    ++num_rows_;
-    update_rows_in_header(num_rows_);
+    const bool write_success = write_row(state);
+    if (write_success) {
+      ++num_rows_;
+      update_rows_in_header(num_rows_);
+    }

SteveBronder · 2026-03-11T17:36:56Z

designs/0035-stanbin-binary-output.md

+    write_metadata();
+    rewrite_header();


Same as above I'd have each operation return a flag so you know if writing was successful or not. You could also have an enum that you return for different error codes.

SteveBronder · 2026-03-11T17:38:58Z

designs/0035-stanbin-binary-output.md

+  stream_.write(reinterpret_cast<const char*>(state.data()),
+                state.size() * sizeof(double));


Suggested change

-  stream_.write(reinterpret_cast<const char*>(state.data()),
-                state.size() * sizeof(double));
+  stream_.write(reinterpret_cast<const std::byte*>(state.data()),
+                state.size() * sizeof(double));

(generally I just like this and think it is more standard now)

SteveBronder · 2026-03-11T17:41:43Z

designs/0035-stanbin-binary-output.md

+
+2. Should stanbin become the default output format in a future CmdStan release, or remain opt-in indefinitely?
+
+3. Is the trailing metadata section (raw CSV comment text) the right long-term metadata representation, or should v1 adopt a structured format (e.g., key-value pairs) from the start?


imo this is also a good opportunity to have the metadata go to a separate file completely. So we would write a little json file with all the metadata.

ahartikainen · 2026-03-11T21:50:44Z

Hi, I wanted to add a comment on the external metadata issue. I like the idea of having 1 file, but I think external metadata would be a much more flexible way of handling it.

We could copy an idea from the zarr world, where they put multiple files in an uncompressed zip. I think you can stream data to the zip file and append to it too.

On the reading side, accessing the metadata would be easy.

And this would still keep the file count as 1.

sims1253 · 2026-03-11T22:18:02Z

I pushed a revision addressing the straightforward spec fixes from the review. I left a few things out for now as I wasn't sure which way would be the best choice. Happy to adjust things further.
Re. the example code feedback, I meant the code mainly as illustrations. I fixed some obvious oversights but I think it might make more sense to keep implementation details light here?

How do you prefer handling of resolved comments? Should I resolve them if I think I addressed them or should it be up to the authors?

Add binary output format RFC

50193c8

WardBrian reviewed Mar 10, 2026


SteveBronder reviewed Mar 11, 2026


Incorporate PR feedback

2078035

