
feat: add incremental checkpointing for nat eval (#1631) #1652

Open
Akshat8510 wants to merge 4 commits into NVIDIA:develop from Akshat8510:feature/incremental-eval

Conversation


@Akshat8510 Akshat8510 commented Feb 25, 2026

Description

Addresses #1631. Currently, nat eval only writes the final workflow_output.json after the entire dataset has finished processing. If the evaluation hangs or is interrupted, all progress is lost.

This PR introduces incremental checkpointing by appending results to a workflow_output.jsonl file in real-time as each dataset item completes. This ensures that partial results are always recoverable from disk.

Changes:

  • Updated run_workflow_local and run_workflow_remote in evaluate.py to accept dataset_handler.
  • Implemented real-time serialization of EvalInputItem to workflow_output.jsonl.
  • Used f.flush() to ensure immediate disk persistence, preventing data loss during process interruptions.
  • Maintained backward compatibility: the final aggregated workflow_output.json is still generated at the end of successful runs.
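
To illustrate the recovery this enables, a downstream script could read back whatever completed before an interruption. This is a hypothetical sketch, not part of the PR: `load_partial_results` and its skip-malformed-tail behavior are assumptions about how one might consume the JSONL checkpoint.

```python
import json
from pathlib import Path


def load_partial_results(checkpoint_file: Path) -> list[dict]:
    """Recover completed items from an interrupted run.

    Each line of the JSONL checkpoint is one serialized item; a malformed
    trailing line (from a crash mid-write) ends the scan gracefully.
    """
    items: list[dict] = []
    if not checkpoint_file.exists():
        return items
    with open(checkpoint_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                items.append(json.loads(line))
            except json.JSONDecodeError:
                break  # partial final record; keep everything before it
    return items
```
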

Closes #1631

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Evaluation outputs are now written incrementally per item during both local and remote runs, enabling real-time progress persistence.
    • Per-item checkpoint writes are performed off the main execution thread to avoid blocking.
  • Reliability

    • Checkpointing is guarded by configuration and includes error handling; write failures log warnings and do not interrupt main execution.

Signed-off-by: Akshat Kumar <akshat230405@gmail.com>
@Akshat8510 Akshat8510 requested a review from a team as a code owner February 25, 2026 14:53

copy-pr-bot bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 25, 2026

Walkthrough

Adds per-item incremental checkpointing to evaluation: new helper to write JSONL checkpoint entries; run_workflow_local and run_workflow_remote now accept a dataset_handler; both local and remote flows write per-item checkpoint records asynchronously after logging predictions and usage stats, with exception handling and warnings.

Changes

Cohort / File(s): Incremental checkpointing & API updates (packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py)
Summary: Added _write_checkpoint_item(checkpoint_file, item_dict) to append one JSONL record. Updated run_workflow_local(self, session_manager: SessionManager, dataset_handler: DatasetHandler) and run_workflow_remote(self, dataset_handler: DatasetHandler) signatures. Integrated per-item checkpoint writes in both local and remote flows (using dataset_handler.publish_eval_input and asyncio.to_thread), guarded by config.write_output and wrapped in exception handling with warnings. Updated internal call sites to pass dataset_handler.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Workflow Runner
    participant Session as SessionManager
    participant Dataset as DatasetHandler
    participant Disk as Filesystem
    participant Log as Logger

    Runner->>Dataset: request next item / publish_eval_input
    Dataset-->>Runner: item_dict
    Runner->>Session: (local) obtain session / run eval
    Runner->>Runner: evaluate item
    Runner->>Log: log_prediction(item)
    Runner->>Log: log_usage_stats(item)
    Runner->>Dataset: publish_eval_input(item) -> item_dict
    Runner->>Disk: asyncio.to_thread(_write_checkpoint_item, workflow_output.jsonl, item_dict)
    Disk-->>Runner: write success / exception
    Runner->>Log: warn on write failure (if any)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 75.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title clearly describes the main change (adding incremental checkpointing for nat eval), follows imperative mood convention, and is concise at 56 characters.
  • Linked Issues check ✅ Passed: The changes implement incremental checkpointing by adding dataset_handler parameters, creating per-item checkpointing logic with the _write_checkpoint_item helper, and writing to workflow_output.jsonl, directly addressing issue #1631's requirement to persist per-item outputs incrementally.
  • Out of Scope Changes check ✅ Passed: All changes are scoped to incremental checkpointing (new _write_checkpoint_item helper, updated run_workflow_local/remote signatures, per-item writing logic with exception handling, and call site updates); no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py`:
- Around line 323-337: The remote runner currently waits for asyncio.gather in
EvaluationRemoteWorkflowHandler.run_workflow_remote and so never writes per-item
checkpoints; change the implementation to persist each item as it completes
(e.g., use asyncio.as_completed or attach per-task callbacks) so you call
dataset_handler.publish_eval_input and append the resulting item_dict to the
checkpoint_file ("workflow_output.jsonl") immediately when each task finishes;
ensure you reuse the existing logic that builds temp_input from
eval_input.eval_input_items and the step_filter from
eval_config.general.output.workflow_output_step_filter, open checkpoint_file
with append and flush on each write to provide incremental persistence and
recovery.
- Line 175: The public async methods run_workflow_local and run_workflow_remote
currently lack explicit return type hints; update their function signatures to
include the Python 3.11+ return annotation "-> None" (e.g., change "async def
run_workflow_local(self, session_manager: SessionManager, dataset_handler:
DatasetHandler):" to "async def run_workflow_local(self, session_manager:
SessionManager, dataset_handler: DatasetHandler) -> None:") and do the same for
run_workflow_remote so both public APIs have explicit return type hints; ensure
any related type stubs or usages remain compatible.
- Around line 276-297: The synchronous checkpoint write block (the code that
builds EvalInput, calls dataset_handler.publish_eval_input, json.loads and opens
checkpoint_file for append) must be moved off the event loop: create a
synchronous helper method named _append_checkpoint_line(self, checkpoint_file,
line) that opens the file, writes the line and flushes, then invoke that helper
via asyncio.to_thread(...) from the async context (e.g., inside run_one where
self.config.write_output is checked) and wrap the entire checkpoint sequence
(building temp_input with EvalInput, calling publish_eval_input, extracting
item_dict, and the to_thread call) in a try/except so any serialization or I/O
error is caught and logged but does not propagate to fail the task; keep
existing symbols dataset_handler.publish_eval_input, EvalInput, checkpoint_file,
and step_filter to locate the code to change.
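
The streaming (asyncio.as_completed) and off-thread write (asyncio.to_thread plus a synchronous append helper) suggestions above can be sketched together. Everything here is illustrative: `evaluate_one`, `run_with_streaming_checkpoints`, and `_append_checkpoint_line` are hypothetical names, not the repository's actual API.

```python
import asyncio
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def _append_checkpoint_line(checkpoint_file: Path, line: str) -> None:
    # Synchronous helper: open in append mode, write, flush. Running it in
    # a worker thread keeps blocking file I/O off the event loop.
    with open(checkpoint_file, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()


async def run_with_streaming_checkpoints(items, evaluate_one, checkpoint_file: Path):
    # Persist each result the moment its task finishes, instead of waiting
    # for asyncio.gather over the whole dataset.
    tasks = [asyncio.create_task(evaluate_one(item)) for item in items]
    results = []
    for fut in asyncio.as_completed(tasks):
        result = await fut  # yields in completion order, not submit order
        results.append(result)
        try:
            line = json.dumps(result) + "\n"
            await asyncio.to_thread(_append_checkpoint_line, checkpoint_file, line)
        except Exception:
            # A serialization or I/O failure must not fail the evaluation task.
            logger.exception("Failed to write incremental checkpoint")
    return results
```
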

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2d9703 and b5a8270.

📒 Files selected for processing (1)
  • packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py

Signed-off-by: Akshat Kumar <akshat230405@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py (2)

292-293: Remove redundant local import of EvalInput.

EvalInput is already imported at module level (line 42). The local re-import here and at line 336 (inside a loop) is unnecessary clutter.

Proposed fix:

```diff
-                                from nat.data_models.evaluator import EvalInput
                                 temp_input = EvalInput(eval_input_items=[item])
```

And similarly at line 336:

```diff
-                    from nat.data_models.evaluator import EvalInput
                     temp_input = EvalInput(eval_input_items=[item])
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py` around
lines 292 - 293, The local re-imports of EvalInput inside evaluate.py are
redundant; remove the inner "from nat.data_models.evaluator import EvalInput"
occurrences (the ones adjacent to creating temp_input and the one at the loop
around line 336) and just use the module-level EvalInput already imported at
top, leaving the temp_input = EvalInput(eval_input_items=[item]) constructions
intact; this eliminates duplicate imports and cleans up the function(s) that
construct temp_input.

282-301: Extract duplicated checkpoint logic into a shared async helper.

The checkpoint block here (lines 282–301) and in run_workflow_remote (lines 327–343) share nearly identical logic: compute step_filter, wrap item in EvalInput, serialize via publish_eval_input, parse JSON, and write via _write_checkpoint_item. Consider extracting an async method like _checkpoint_item(self, item, dataset_handler, checkpoint_file, step_filter) to DRY this up.

Additionally, output_dir.mkdir(parents=True, exist_ok=True) is called on every concurrent item in the local path (line 286). Move this to a one-time setup before the asyncio.gather to avoid redundant syscalls under concurrency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py` around
lines 282 - 301, Duplicate incremental checkpointing logic should be extracted
into a shared async helper and the output directory creation moved out of
per-item concurrent execution. Implement an async method (e.g.,
_checkpoint_item(self, item, dataset_handler, checkpoint_file, step_filter))
that encapsulates computing step_filter (use
self.eval_config.general.output.workflow_output_step_filter), wrapping the item
in nat.data_models.evaluator.EvalInput, calling
dataset_handler.publish_eval_input, parsing the JSON and invoking
self._write_checkpoint_item via asyncio.to_thread; replace the duplicated blocks
in the current evaluate logic and in run_workflow_remote to call this helper.
Also perform output_dir.mkdir(parents=True, exist_ok=True) once before launching
asyncio.gather (not per item) so the filesystem creation isn't repeated
concurrently.
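
The mkdir-once part of this suggestion amounts to hoisting directory creation out of the per-item coroutine. A schematic with hypothetical names (`run_all`, `run_one` are not the repository's actual functions):

```python
import asyncio
from pathlib import Path


async def run_all(items, run_one, output_dir: Path):
    # Create the output directory once, before launching the concurrent
    # per-item tasks, instead of calling mkdir inside every task.
    output_dir.mkdir(parents=True, exist_ok=True)
    return await asyncio.gather(*(run_one(item) for item in items))
```
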
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py`:
- Around line 322-343: The current remote run writes checkpoints only after
handler.run_workflow_remote(self.eval_input) completes; change
run_workflow_remote in evaluate.py to stream per-item checkpoint writes during
the remote execution by coordinating with EvaluationRemoteWorkflowHandler so
each finished item triggers publishing and writing immediately (use
asyncio.as_completed or per-task callbacks inside/around
handler.run_workflow_remote rather than waiting for the full gather); for each
completed task call dataset_handler.publish_eval_input(...) and await
asyncio.to_thread(self._write_checkpoint_item, checkpoint_file, item_dict) to
ensure incremental JSONL writes; update any associated
handler.run_workflow_remote signatures or add a streaming iterator/callback hook
so you can process individual results as they finish without blocking on the
entire gather.
- Around line 299-300: Replace the two exception logging calls that currently
use logger.warning(...) inside the incremental checkpoint write handlers with
logger.exception(...), so the full traceback is recorded; specifically update
the except blocks that catch Exception as e around the incremental checkpoint
write for item (the handler using "Failed to write incremental checkpoint for
item %s: %s", item.id, e) and the other similar except block further down
(around the code referenced at the second exception handler) to call
logger.exception with the same message format and arguments.
- Around line 149-155: The helper method _write_checkpoint_item was inserted
such that the subsequent block that computes llm_latencies and assigns to
self.usage_stats.usage_stats_items[item.id] is accidentally nested inside it;
move the entire _write_checkpoint_item method so it sits at class level (either
before or after _compute_usage_stats) and then dedent the lines that compute
llm_latencies and set self.usage_stats.usage_stats_items[item.id] (the block
that references steps and item.id) so that they are part of _compute_usage_stats
again, restoring proper scope for variables like steps and item.id.
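
The logger.warning versus logger.exception distinction raised above is easy to demonstrate in isolation: inside an except block, logger.exception records the full traceback alongside the message.

```python
import io
import logging

# Capture log output in memory so the difference is observable.
logger = logging.getLogger("checkpoint_demo")
logger.propagate = False
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.DEBUG)

try:
    raise OSError("disk full")
except OSError:
    # Unlike logger.warning, this appends the active traceback to the record.
    logger.exception("Failed to write incremental checkpoint for item %s", "item-42")

log_text = stream.getvalue()
```
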

---

Nitpick comments:
In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py`:
- Around line 292-293: The local re-imports of EvalInput inside evaluate.py are
redundant; remove the inner "from nat.data_models.evaluator import EvalInput"
occurrences (the ones adjacent to creating temp_input and the one at the loop
around line 336) and just use the module-level EvalInput already imported at
top, leaving the temp_input = EvalInput(eval_input_items=[item]) constructions
intact; this eliminates duplicate imports and cleans up the function(s) that
construct temp_input.
- Around line 282-301: Duplicate incremental checkpointing logic should be
extracted into a shared async helper and the output directory creation moved out
of per-item concurrent execution. Implement an async method (e.g.,
_checkpoint_item(self, item, dataset_handler, checkpoint_file, step_filter))
that encapsulates computing step_filter (use
self.eval_config.general.output.workflow_output_step_filter), wrapping the item
in nat.data_models.evaluator.EvalInput, calling
dataset_handler.publish_eval_input, parsing the JSON and invoking
self._write_checkpoint_item via asyncio.to_thread; replace the duplicated blocks
in the current evaluate logic and in run_workflow_remote to call this helper.
Also perform output_dir.mkdir(parents=True, exist_ok=True) once before launching
asyncio.gather (not per item) so the filesystem creation isn't repeated
concurrently.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5a8270 and bb2de02.

📒 Files selected for processing (1)
  • packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py

Signed-off-by: Akshat Kumar <akshat230405@gmail.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py (1)

322-343: ⚠️ Potential issue | 🟠 Major

Remote checkpointing is still post-run, not incremental during execution.

At Line 325, the method waits for handler.run_workflow_remote(...) to fully finish before any checkpoint write at Line 335+, so a hang/interruption during remote execution still yields no partial JSONL progress. Also, the outer try (Lines 329-343) means one item serialization/write failure aborts checkpointing for all remaining items in that loop.

Suggested direction:

```diff
 async def run_workflow_remote(self, dataset_handler: DatasetHandler) -> None:
     from nat.plugins.eval.runtime.remote_workflow import EvaluationRemoteWorkflowHandler
     handler = EvaluationRemoteWorkflowHandler(self.config, self.eval_config.general.max_concurrency)
-    await handler.run_workflow_remote(self.eval_input)
-
-    if self.config.write_output:
-        try:
-            output_dir = self.eval_config.general.output_dir
-            output_dir.mkdir(parents=True, exist_ok=True)
-            checkpoint_file = output_dir / "workflow_output.jsonl"
-            step_filter = self.eval_config.general.output.workflow_output_step_filter if self.eval_config.general.output else None
-
-            for item in self.eval_input.eval_input_items:
-                from nat.data_models.evaluator import EvalInput
-                temp_input = EvalInput(eval_input_items=[item])
-                item_dict = json.loads(dataset_handler.publish_eval_input(temp_input, step_filter))[0]
-                await asyncio.to_thread(self._write_checkpoint_item, checkpoint_file, item_dict)
-        except Exception:
-            logger.exception("Failed to write remote checkpoint items")
+    if not self.config.write_output:
+        await handler.run_workflow_remote(self.eval_input)
+        return
+
+    output_dir = self.eval_config.general.output_dir
+    output_dir.mkdir(parents=True, exist_ok=True)
+    checkpoint_file = output_dir / "workflow_output.jsonl"
+    step_filter = self.eval_config.general.output.workflow_output_step_filter if self.eval_config.general.output else None
+
+    async def on_item_complete(item: EvalInputItem) -> None:
+        try:
+            temp_input = EvalInput(eval_input_items=[item])
+            item_dict = json.loads(dataset_handler.publish_eval_input(temp_input, step_filter))[0]
+            await asyncio.to_thread(self._write_checkpoint_item, checkpoint_file, item_dict)
+        except Exception:
+            logger.exception("Failed to write remote checkpoint for item %s", item.id)
+
+    await handler.run_workflow_remote(self.eval_input, on_item_complete=on_item_complete)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py` around
lines 322 - 343, The checkpointing currently runs only after
EvaluationRemoteWorkflowHandler.run_workflow_remote completes, so move
checkpoint writes to run concurrently and make them per-item-fault tolerant:
start handler.run_workflow_remote as an asyncio.create_task (or modify
EvaluationRemoteWorkflowHandler to provide an async iterator/callback for
per-item progress), and in parallel iterate self.eval_input.eval_input_items to
serialize and write each item to checkpoint via await
asyncio.to_thread(self._write_checkpoint_item, checkpoint_file, item_dict) as
they become available; ensure the per-item loop around
dataset_handler.publish_eval_input and _write_checkpoint_item catches and logs
exceptions per item (do not let one failure abort the rest) and keep use of
eval_config.general.output.workflow_output_step_filter when calling
publish_eval_input.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py`:
- Around line 322-343: The checkpointing currently runs only after
EvaluationRemoteWorkflowHandler.run_workflow_remote completes, so move
checkpoint writes to run concurrently and make them per-item-fault tolerant:
start handler.run_workflow_remote as an asyncio.create_task (or modify
EvaluationRemoteWorkflowHandler to provide an async iterator/callback for
per-item progress), and in parallel iterate self.eval_input.eval_input_items to
serialize and write each item to checkpoint via await
asyncio.to_thread(self._write_checkpoint_item, checkpoint_file, item_dict) as
they become available; ensure the per-item loop around
dataset_handler.publish_eval_input and _write_checkpoint_item catches and logs
exceptions per item (do not let one failure abort the rest) and keep use of
eval_config.general.output.workflow_output_step_filter when calling
publish_eval_input.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb2de02 and 693a770.

📒 Files selected for processing (1)
  • packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py

@Akshat8510
Contributor Author

Hi @AnuradhaKaruppiah, I have addressed all the feedback provided by the CodeRabbit reviewer.

@Akshat8510
Contributor Author

Akshat8510 commented Feb 25, 2026

The code has been verified via py_compile. This PR is now ready for your review!



Development

Successfully merging this pull request may close these issues.

Checkpointing workflow.json incrementally for nat eval
