Skip to content

Sort datasets by shot/time#520

Open
gtrevisan wants to merge 5 commits intodevfrom
glt/xr-sortby
Open

Sort datasets by shot/time#520
gtrevisan wants to merge 5 commits intodevfrom
glt/xr-sortby

Conversation

@gtrevisan
Copy link
Member

@gtrevisan gtrevisan commented Feb 12, 2026

rationale

although for most purposes this should not be an issue, it's better to sort idx-indexed datasets by shot and time.

if the datasets were indexed by physics-based coordinates, eg:

# watch out! possible huge memory footprint
ds.set_index(idx=["shot", "time"]).unstack("idx")

then downstream comparisons would be handled by xarray itself, eg:

a.equals(b)
a == b
xr.testing.assert_equal(a, b)
# etc

but with idx-indexed datasets this fails and requires sorting -- after all, idx is left un-coordinated because it's unphysical.

a local full-scale run on the C-MOD database shows that the new sorting requirement is hardly taxing on the workflow:

[ INFO  ] Completed workflow: retrieved 10,429/10,435 shots (99.94%) in 9m 45s (0.056 s/shot)
[ DEBUG ] Concatenating 10,429 shots...
[ DEBUG ] Sorted 940,543 rows in 0.179s.
[ INFO  ] Concatenated 10,429 shots in 22.627s.

as this drastically simplifies downstream comparison of idx-indexed datasets, I'd switch on by default.
reviewers, please take some time to understand and try this out.

test

simple test to execute two iterations with multi-processed shuffled shots.

for ITER in 1 2
do
   echo {1150805012..1150805021} \
   | xargs -n 1 \
   | shuf \
   | tee "input-$ITER.txt" \
   | xargs uv run disruption-py -p 10 \
   | grep output.nc
done

result on dev:

garbled!

Coordinates:
    shot                  (idx) int64 7kB 1150805018 1150805018 ... 1150805017

vs

Coordinates:
    shot                  (idx) int64 7kB 1150805021 1150805021 ... 1150805020

comparison:

import xarray as xr
a, b = (xr.open_dataset(p) for p in ["one/output.nc", "another/output.nc"])
print(a.equals(b))
False

result on this branch:

same!

Coordinates:
    shot                  (idx) int64 7kB 1150805012 1150805012 ... 1150805021

comparison:

True

thoughts?

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds deterministic sorting of concatenated datasets by shot and time to ensure reproducible outputs when processing shots in parallel. The changes address a subtle but important issue where multi-processed workflows with shuffled shot order would produce different dataset ordering, making dataset comparisons fail even when the data content was identical.

Changes:

  • Added sorting by shot and time after concatenating datasets in DatasetOutputSetting.concat()
  • Enhanced logging to show concatenation timing and sorting performance
  • Updated debug logging format to use thousand separators for shot counts

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@yumouwei yumouwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never thought about this, but yes I'd prefer to have shot & time sorted.

Copy link
Contributor

@ZanderKeith ZanderKeith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having this option and the implementation looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants