
feat: add stream_validate() hook to Requirement (#900)#925

Merged
planetf1 merged 8 commits into generative-computing:main from planetf1:feat/900-stream-validate
Apr 28, 2026

Conversation

Contributor

@planetf1 planetf1 commented Apr 24, 2026

Requirement PR

Description

Adds async stream_validate(chunk, *, backend, ctx) -> PartialValidationResult to the base Requirement class as a per-chunk streaming validation hook. The default implementation returns PartialValidationResult("unknown"); subclasses override to inspect the current chunk (the semantic delta from the chunking strategy — one sentence, word, or paragraph per call) and signal "pass" or "fail" early. Stateful implementations maintain their own running state across calls (e.g. self._count += chunk.count("•")); the orchestrator (in the stacked PR #942) clones the requirement before each attempt so state does not bleed across retries.

Part of streaming epic #891. Builds on #924 (merged — PartialValidationResult) and #923 (merged — ChunkingStrategy).

Design decisions

Per the agreed Phase 1 spec:

No LLM-as-a-Judge logic here — this is a pure hook for custom validation overrides.

Review feedback addressed

@jakelorocco flagged on first pass that an earlier iteration of this branch passed the accumulated output to stream_validate rather than a single chunk, which deviated from the spec. Commit 82bdd3a5 restores the spec-compliant behaviour:

  • requirement.py docstring rewritten to describe chunk as the delta from the chunking strategy, with notes on ctx being incomplete during streaming and the MOT single-consumer constraint.
  • test_stateful_subclass_accumulates_state rewritten: BulletCounter now accumulates on self._count across delta calls, dropping the earlier self._seen_len workaround that only made sense under accumulated semantics.

The corresponding orchestrator change (pass single chunk, not accumulated) is in the stacked PR #942.

Implementation Checklist

Base Class

  • Extends appropriate base class:
    • Requirement - standard requirement

Validation Logic

  • Hook method on base Requirement; no default validation_fn added (stream_validate is an override point, not a validation implementation)
  • Returns PartialValidationResult with tri-state success and optional reason / score

Integration

  • No change to mellea/stdlib/requirements/__init__.py needed — this PR modifies the base class, not a new requirement

Testing

  • Tests added to test/core/test_stream_validate.py:
    • test_default_returns_unknown — base class always returns "unknown"
    • test_default_returns_partial_validation_result_instance — correct return type
    • test_stream_validate_is_coroutine — method is async
    • test_subclass_can_return_pass / test_subclass_can_return_fail — subclass overrides work
    • test_does_not_mutate_requirement / test_stream_validate_idempotent — base class is stateless
    • test_stateful_subclass_accumulates_state — delta-semantics bullet counter correctly accumulates on self
    • test_stateful_subclass_clone_isolation — copy() produces independent clones for orchestrator use
  • New code has 100% coverage
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation once the rest of the PR is populated)
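The clone-isolation pattern that the last test exercises can be sketched like this (a hypothetical stand-in class, not the real test code in test/core/test_stream_validate.py):

```python
import asyncio
from copy import copy


class CountingRequirement:
    """Hypothetical stateful requirement that counts chunks it has seen."""

    def __init__(self):
        self._calls = 0

    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        self._calls += 1
        return "unknown"


original = CountingRequirement()

# Orchestrator pattern: never call the original directly; clone a fresh
# copy per attempt so state cannot bleed across retries.
for attempt in range(2):
    req = copy(original)
    assert req._calls == 0          # every attempt starts clean
    asyncio.run(req.stream_validate("chunk a"))
    asyncio.run(req.stream_validate("chunk b"))
    assert req._calls == 2

assert original._calls == 0         # the original was never mutated
```

A shallow `copy()` suffices here because the accumulated state is an immutable int; requirements holding mutable containers would need to account for that.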

Attribution

  • AI coding assistants used

@github-actions github-actions Bot added the enhancement New feature or request label Apr 24, 2026
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from 96f1919 to 0c030fb Compare April 24, 2026 12:18
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 24, 2026
…eam_validate

- Remove "In Phase 1" temporal qualifier from docstring — reworded to
  timeless statement about orchestrator responsibility
- Add type annotations (str, Backend, Context) to test subclass overrides
- Add idempotency test: multiple calls on the same Requirement instance
  leave state unchanged

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 24, 2026
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from d922a2c to 58128a7 Compare April 24, 2026 14:16
…#900)

Add an async `stream_validate(chunk, backend, ctx)` method to the base
`Requirement` class. The default implementation returns
`PartialValidationResult("unknown")`; subclasses override to inspect the
accumulated chunk and return `"pass"` or `"fail"` early.

Per the Phase 1 design: `"pass"` is informational and does not
short-circuit the final `validate()` call. The method must not mutate
`self` — state isolation is the orchestrator's responsibility.

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…eam_validate

- Remove "In Phase 1" temporal qualifier from docstring — reworded to
  timeless statement about orchestrator responsibility
- Add type annotations (str, Backend, Context) to test subclass overrides
- Add idempotency test: multiple calls on the same Requirement instance
  leave state unchanged

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Prevents positional confusion and makes future parameter additions
to the signature non-breaking for existing subclass overrides.

Assisted-by: Claude Code
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from 58128a7 to 358e4d1 Compare April 27, 2026 13:12
The docstring incorrectly stated that implementations must not mutate
self. Issue generative-computing#900 spec explicitly allows stateful accumulation and
requires the shallow-copy caveat to be documented. Fix the docstring
to match the spec.

Add two tests required by the issue acceptance criteria:
- test_stateful_subclass_accumulates_state: verifies a subclass can
  accumulate state (bullet counter) across stream_validate calls
- test_stateful_subclass_clone_isolation: verifies copy() gives an
  independent clone, confirming the orchestrator clone pattern

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 marked this pull request as ready for review April 27, 2026 13:19
@planetf1 planetf1 requested review from a team, jakelorocco and nrfulton as code owners April 27, 2026 13:19
The previous implementation overwrote _bullet_count from the full
accumulated chunk on each call — equivalent to a pure function with
no real dependency on prior state.

Use _seen_len to extract only the new portion of each accumulated
chunk, accumulating the count additively. This genuinely requires
prior-call state to know where to slice, making the test name
"accumulates_state" accurate.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
In multi-line calls, # type: ignore only suppresses errors on its own
line. The backend=None argument was uncovered; add the ignore there too.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Use the public API for imports: Backend and Context both appear in
mellea.core.__all__, so import from mellea.core rather than the
internal submodules.

Rewrite test_stateful_subclass_clone_isolation to simulate the correct
orchestrator pattern: the original requirement is never called directly;
each attempt clones from the fresh original, giving _calls == 0 at the
start of every attempt. The previous test cloned mid-stream, which
tested shallow-copy isolation but demonstrated the wrong usage pattern.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 28, 2026
@planetf1
Contributor Author

@jakelorocco how does this look to you? I have another stacked PR behind this one too. (I'll just go one level deep.)

Contributor

@jakelorocco jakelorocco left a comment


@planetf1, I might've missed this in the original proposal, but passing the accumulated chunks to the requirements is different from what I thought the proposal was. Could you please elaborate more on the design choice?

Comment thread mellea/core/requirement.py Outdated
isolation.

Args:
chunk: The accumulated model output so far (not just the latest token).
Contributor


This worries me. @nrfulton, your initial proposal was to have the requirement only see new chunks. This would show all accumulated chunks so far.

This forces all streaming requirements to be stateful; all requirements must now keep track of what chunks they have processed. The alternative would be to only provide new chunks to requirements; then streaming validation would be stateless except when needed. Requirements can choose whether they need to store and process multiple chunks, or just check each chunk independently.

I guess checking each chunk independently is unlikely to be helpful, so I can see why accumulating and forcing requirements to track their own progress doesn't actually add much complexity.

If so, I think we should actually pre-define functions to help with this (either as a new class of requirements or functions that implementors can draw on).

Also, if we are passing the accumulated chunk through, I almost think we should just pass in some point-in-time copy of the model output thunk, i.e. one that doesn't get streamed the new chunks but has all the data fields from the point in time it was copied at.

Contributor Author


I would suggest reverting to the simplicity of the original proposal. The change was made incorrectly while revising the initial PR. The initial approach is simple, and requirements can maintain state if they need to work on accumulated chunks. If we need to support accumulated content generally we can consider it in a later phase.

Which works best will vary by use case. Per-chunk works better for:

  • checking for forbidden words/phrases
  • ensuring a paragraph or sentence is coherent
  • structural checks (e.g. code fencing)
  • format validation

especially when the MoT manages the semantic chunking (later).

There will be cases where accumulation is better -- these are probably more complex checks: does a story line flow, are we taking the response in an unexpected direction, do we have a complete enough response?

Importantly, if we stick to per-chunk we could still implement this second approach, albeit not as cleanly.

So in summary: I'll revert to the original -- but if you now think that's wrong, we can adjust?
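The per-chunk cases listed above are expressible as stateless checks on the delta alone. A hypothetical sketch (the forbidden-phrase list and function name are made up for illustration, not mellea code):

```python
FORBIDDEN = {"guarantee", "refund"}   # hypothetical policy phrases


def check_chunk(chunk: str) -> str:
    """Stateless per-chunk check: inspects only the delta, no history needed."""
    lowered = chunk.lower()
    return "fail" if any(phrase in lowered for phrase in FORBIDDEN) else "unknown"
```

A check like this slots directly into a `stream_validate` override with no state on `self`; only the accumulation-style checks (story flow, completeness) would need to retain earlier chunks.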

Restores the chunk-at-a-time semantics set out in the generative-computing#891 epic and
generative-computing#900 spec: stream_validate is called once per complete chunk produced
by the chunking strategy, and receives that single chunk. Requirements
that need history accumulate it on self.

Commit 315a98c inadvertently flipped this: the BulletCounter test was
rewritten to recover deltas from accumulated text via self._seen_len,
and the docstring was updated to match ("The accumulated model output
so far"). Neither change reflected a design decision — it was drift
during a test fix, and buries a confusing workaround in what should be
a straightforward stateful override.

Changes:
- requirement.py: rewrite chunk Args description to name the
  chunking-strategy-produced delta, clarify that ctx does not contain
  the generated output during streaming, and note the MOT
  single-consumer constraint
- test_stream_validate.py: rewrite BulletCounter to accumulate its own
  running count (no self._seen_len); calls pass delta chunks
  ("\n- one\n- two") rather than re-sending accumulated text

The corresponding orchestrator fix in stream_with_chunking() -- pass
the chunk, iterate per chunk -- is in the stacked Wave 3 branch.

Assisted-by: Claude Code
Contributor

@jakelorocco jakelorocco left a comment


lgtm; I think the single chunk approach is good. If we want to revert back to the accumulation, I don't think we are stuck with this approach yet.

Comment on lines +305 to +321
Implementations must not call ``mot.astream()`` or otherwise read the
underlying stream; the orchestrator is the single consumer of the MOT
stream (see ``ModelOutputThunk.astream``). Requirements that need access
to the text seen so far should accumulate it themselves from the
``chunk`` values they receive.

Args:
    chunk: A single complete semantic chunk produced by the chunking
        strategy (e.g. one sentence for ``SentenceChunker``). This is
        the delta since the previous ``stream_validate`` call for this
        attempt, not the accumulated output. Requirements that need
        earlier context should retain it on ``self`` across calls.
    backend: The inference backend, available for backend-assisted checks.
    ctx: The current generation context. During streaming the MOT is
        not yet computed, so ``ctx`` does not contain the generated
        output; use ``chunk`` (and any state accumulated on ``self``)
        instead.
Contributor


Clarification, the generation context has the uncomputed mot at this point, right? That's the reason for the warning?

Contributor Author


Yes - they need to rely on the chunk & anything they've accumulated themselves. mot_is_computed stays false until streaming ends, so we can't say for sure what's in it. I've tried to ensure we capture the behaviour in the docstrings.

I think any changes to this initial approach -- and making the mot more responsible -- is a later phase.

Thanks for approval!

I have a stacked PR which I'll get out asap too (may not be until tomorrow)

@planetf1 planetf1 added this pull request to the merge queue Apr 28, 2026
Merged via the queue into generative-computing:main with commit 7912a1d Apr 28, 2026
10 of 11 checks passed
@planetf1 planetf1 deleted the feat/900-stream-validate branch April 28, 2026 19:19

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

feat(core): add stream_validate() to Requirement

2 participants