feat: add stream_validate() hook to Requirement (#900) #925

planetf1 merged 8 commits into generative-computing:main
Conversation
Force-pushed 96f1919 to 0c030fb
…eam_validate

- Remove "In Phase 1" temporal qualifier from docstring — reworded to timeless statement about orchestrator responsibility
- Add type annotations (str, Backend, Context) to test subclass overrides
- Add idempotency test: multiple calls on the same Requirement instance leave state unchanged

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Force-pushed d922a2c to 58128a7
…#900)

Add an async `stream_validate(chunk, backend, ctx)` method to the base `Requirement` class. The default implementation returns `PartialValidationResult("unknown")`; subclasses override to inspect the accumulated chunk and return `"pass"` or `"fail"` early.

Per the Phase 1 design: `"pass"` is informational and does not short-circuit the final `validate()` call. The method must not mutate `self` — state isolation is the orchestrator's responsibility.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
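A minimal sketch of the hook this commit describes, using stand-in `Requirement` and `PartialValidationResult` classes rather than the real mellea types (the subclass, its forbidden-word check, and the dataclass fields are illustrative assumptions):

```python
import asyncio
from dataclasses import dataclass

# Stand-in for mellea's PartialValidationResult; the real class may differ.
@dataclass
class PartialValidationResult:
    success: str  # "pass" | "fail" | "unknown"

class Requirement:
    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        # Default: no early signal; the final validate() still runs regardless.
        return PartialValidationResult("unknown")

class NoForbiddenWord(Requirement):
    """Example override: signal failure the moment a forbidden word appears."""
    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        if "TODO" in chunk:
            return PartialValidationResult("fail")
        return PartialValidationResult("unknown")

result = asyncio.run(NoForbiddenWord().stream_validate("draft text with a TODO marker"))
print(result.success)  # fail
```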
Prevents positional confusion and makes future parameter additions to the signature non-breaking for existing subclass overrides.

Assisted-by: Claude Code
Force-pushed 58128a7 to 358e4d1
The docstring incorrectly stated that implementations must not mutate self. The issue generative-computing#900 spec explicitly allows stateful accumulation and requires the shallow-copy caveat to be documented. Fix the docstring to match the spec.

Add two tests required by the issue acceptance criteria:
- test_stateful_subclass_accumulates_state: verifies a subclass can accumulate state (bullet counter) across stream_validate calls
- test_stateful_subclass_clone_isolation: verifies copy() gives an independent clone, confirming the orchestrator clone pattern

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The previous implementation overwrote _bullet_count from the full accumulated chunk on each call — equivalent to a pure function with no real dependency on prior state. Use _seen_len to extract only the new portion of each accumulated chunk, accumulating the count additively. This genuinely requires prior-call state to know where to slice, making the test name "accumulates_state" accurate.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
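The slicing trick this commit describes can be sketched like this (`_bullet_count` and `_seen_len` come from the commit; the `count` method name and the `- ` bullet marker are illustrative assumptions):

```python
class BulletCounter:
    """Counts bullets additively from accumulated chunks.

    Each call receives the full accumulated text; _seen_len records how much
    was already processed, so only the unseen tail is counted. Prior-call
    state is genuinely required to know where to slice.
    """

    def __init__(self):
        self._bullet_count = 0
        self._seen_len = 0

    def count(self, accumulated):
        new_text = accumulated[self._seen_len:]  # only the unseen portion
        self._bullet_count += new_text.count("- ")
        self._seen_len = len(accumulated)
        return self._bullet_count

bc = BulletCounter()
bc.count("- one\n")                   # first call sees one bullet
print(bc.count("- one\n- two\n"))     # accumulated text; only "- two" is new -> 2
```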
In multi-line calls, # type: ignore only suppresses errors on its own line. The backend=None argument was uncovered; add the ignore there too.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
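A hypothetical multi-line call illustrating the line-scoped behaviour (the function and its annotations are stand-ins; mypy reads `# type: ignore` per physical line, so each offending argument line needs its own comment):

```python
import asyncio
from typing import Any

async def stream_validate(chunk: str, *, backend: Any, ctx: Any) -> str:
    return "unknown"

async def main() -> str:
    # Each "# type: ignore" suppresses errors only on its own physical line;
    # a comment on the opening line would not cover backend=None below it.
    return await stream_validate(
        "chunk text",
        backend=None,  # type: ignore[arg-type]
        ctx=None,  # type: ignore[arg-type]
    )

print(asyncio.run(main()))  # unknown
```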
Use the public API for imports: Backend and Context both appear in mellea.core.__all__, so import from mellea.core rather than the internal submodules.

Rewrite test_stateful_subclass_clone_isolation to simulate the correct orchestrator pattern: the original requirement is never called directly; each attempt clones from the fresh original, giving _calls == 0 at the start of every attempt. The previous test cloned mid-stream, which tested shallow-copy isolation but demonstrated the wrong usage pattern.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
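The orchestrator pattern this commit describes, sketched with a hypothetical stand-in requirement (`_calls` comes from the commit; the class itself is illustrative):

```python
from copy import copy

class CountingRequirement:
    """Stand-in stateful requirement that counts its stream_validate calls."""

    def __init__(self):
        self._calls = 0

    def stream_validate(self, chunk):
        self._calls += 1

original = CountingRequirement()

# The original is never called directly; each attempt clones from the fresh
# original, so _calls == 0 at the start of every attempt.
for _attempt in range(3):
    clone = copy(original)
    assert clone._calls == 0
    clone.stream_validate("chunk a")
    clone.stream_validate("chunk b")
    assert clone._calls == 2

print(original._calls)  # 0: state never bled back to the original
```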
@jakelorocco how does this look for you - I have another stacked PR behind this one too. (I'll just go one level deep)
jakelorocco
left a comment
@planetf1, I might've missed this in the original proposal, but passing the accumulated chunks to the requirements is different from what I thought the proposal was. Could you please elaborate more on the design choice?
>     isolation.
>
>     Args:
>         chunk: The accumulated model output so far (not just the latest token).
This worries me. @nrfulton, your initial proposal was to have the requirement only see new chunks. This would show all accumulated chunks so far.
This forces all streaming requirements to be stateful; all requirements must now keep track of what chunks they have processed. The alternative would be to only provide new chunks to requirements; then streaming validation would be stateless except when needed. Requirements can choose whether they need to store and process multiple chunks, or just check each chunk independently.
I guess checking each chunk independently is unlikely to be helpful, so I can see why accumulating and forcing requirements to track their own progress doesn't actually add much complexity.
If so, I think we should actually pre-define functions to help with this (either as a new class of requirements or functions that implementors can draw on).
Also, if we are passing the accumulated chunk through, I almost think we should just pass in some point-in-time copy of the model output thunk, i.e. one that doesn't get streamed the new chunks but has all the data fields from the point in time it was copied at.
I would suggest reverting to the simplicity of the original proposal. The change was made incorrectly when reviewing the initial PR. The initial approach is simple - and requirements can maintain state if they need to work on accumulated chunks. If we need to support accumulated content generally we can consider it in a later phase.
Which works best will vary by use case. Per-chunk likely works better for:
- checking for forbidden words/phrases
- ensuring a paragraph or sentence is coherent
- structural checks - code fencing
- format validation
especially when the MoT manages the semantic chunking (later)
There will be cases where accumulation is better -- these are probably more complex checks: does the story line flow, are we taking the response in an unexpected direction, do we have a complete enough response?
Importantly if we stick to per-chunk we could still implement this second approach - albeit not as cleanly.
So in summary - I'll revert to original -- but if you now think that's wrong we can adjust?
Restores the chunk-at-a-time semantics set out in the generative-computing#891 epic and generative-computing#900 spec: stream_validate is called once per complete chunk produced by the chunking strategy, and receives that single chunk. Requirements that need history accumulate it on self.

Commit 315a98c inadvertently flipped this: the BulletCounter test was rewritten to recover deltas from accumulated text via self._seen_len, and the docstring was updated to match ("The accumulated model output so far"). Neither change reflected a design decision — it was drift during a test fix, and it buried a confusing workaround in what should be a straightforward stateful override.

Changes:
- requirement.py: rewrite the chunk Args description to name the chunking-strategy-produced delta, clarify that ctx does not contain the generated output during streaming, and note the MOT single-consumer constraint
- test_stream_validate.py: rewrite BulletCounter to accumulate its own running count (no self._seen_len); calls pass delta chunks ("\n- one\n- two") rather than re-sending accumulated text

The corresponding orchestrator fix in stream_with_chunking() -- pass the chunk, iterate per chunk -- is in the stacked Wave 3 branch.

Assisted-by: Claude Code
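Under the restored delta semantics, the bullet-counter sketch collapses to plain accumulation on `self`, with no `_seen_len` bookkeeping (the class and attribute names follow the commit; the method signature and bullet marker are illustrative assumptions):

```python
class BulletCounter:
    """Delta semantics: each stream_validate call receives only the new
    chunk, so the requirement just adds to its own running total."""

    def __init__(self):
        self._count = 0

    def stream_validate(self, chunk):
        self._count += chunk.count("\n- ")
        return self._count

bc = BulletCounter()
bc.stream_validate("\n- one\n- two")    # a delta chunk, not accumulated text
print(bc.stream_validate("\n- three"))  # 3
```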
jakelorocco
left a comment
lgtm; I think the single chunk approach is good. If we want to revert back to the accumulation, I don't think we are stuck with this approach yet.
>     Implementations must not call ``mot.astream()`` or otherwise read the
>     underlying stream; the orchestrator is the single consumer of the MOT
>     stream (see ``ModelOutputThunk.astream``). Requirements that need access
>     to the text seen so far should accumulate it themselves from the
>     ``chunk`` values they receive.
>
>     Args:
>         chunk: A single complete semantic chunk produced by the chunking
>             strategy (e.g. one sentence for ``SentenceChunker``). This is
>             the delta since the previous ``stream_validate`` call for this
>             attempt, not the accumulated output. Requirements that need
>             earlier context should retain it on ``self`` across calls.
>         backend: The inference backend, available for backend-assisted checks.
>         ctx: The current generation context. During streaming the MOT is
>             not yet computed, so ``ctx`` does not contain the generated
>             output; use ``chunk`` (and any state accumulated on ``self``)
>             instead.
Clarification: the generation context has the uncomputed mot at this point, right? That's the reason for the warning?
Yes - they need to rely on the chunk & anything they've accumulated themselves. mot_is_computed stays false until streaming ends so we can't really say for sure what's in it. Tried to ensure we capture the behaviour in the docstrings
I think any changes to this initial approach -- and making the mot more responsible -- is a later phase.
Thanks for approval!
I have a stacked PR which I'll get out asap too (may not be until tomorrow)
7912a1d
Requirement PR
Description
Adds async `stream_validate(chunk, *, backend, ctx) -> PartialValidationResult` to the base `Requirement` class as a per-chunk streaming validation hook. The default implementation returns `PartialValidationResult("unknown")`; subclasses override to inspect the current chunk (the semantic delta from the chunking strategy — one sentence, word, or paragraph per call) and signal `"pass"` or `"fail"` early. Stateful implementations maintain their own running state across calls (e.g. `self._count += chunk.count("•")`); the orchestrator (in the stacked PR #942) clones the requirement before each attempt so state does not bleed across retries.

Part of streaming epic #891. Builds on #924 (merged — `PartialValidationResult`) and #923 (merged — `ChunkingStrategy`).

Design decisions
Per the agreed Phase 1 spec:

- `stream_validate` (not a `validate` overload) — different operation, different semantics
- `PartialValidationResult` (tri-state `"pass"`/`"fail"`/`"unknown"`) — from "feat(core): add PartialValidationResult with tri-state semantics" #898 / #924
- `chunk` is a single complete chunk from the chunking strategy (the delta since the previous call), not the full accumulated output. This matches the "one chunk at a time" contract of the "feat: streaming validation — per-chunk requirement checking with early exit" #891 epic and the "feat(core): add stream_validate() to Requirement" #900 spec.
- `backend` and `ctx` are keyword-only: prevents positional confusion and keeps future parameter additions non-breaking
- `"pass"` is informational: it does not short-circuit the final `validate()` call in Phase 1
- No `reset()` method: state isolation is handled by orchestrator cloning (`copy(req)`) before each attempt

No LLM-as-a-Judge logic here — this is a pure hook for custom validation overrides.
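The keyword-only signature and tri-state result described above can be sketched with stand-in types (field names follow the `PartialValidationResult` description in this PR; everything else is illustrative):

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PartialValidationResult:
    success: str                     # tri-state: "pass" | "fail" | "unknown"
    reason: Optional[str] = None
    score: Optional[float] = None

class Requirement:
    # backend and ctx sit after '*': callers must pass them by keyword, so
    # inserting future parameters cannot silently break existing call sites.
    async def stream_validate(self, chunk: str, *, backend: Any = None,
                              ctx: Any = None) -> PartialValidationResult:
        return PartialValidationResult("unknown")

req = Requirement()
print(asyncio.run(req.stream_validate("one sentence chunk")).success)  # unknown

# Passing backend positionally is rejected by the signature itself.
try:
    req.stream_validate("chunk", None)
except TypeError:
    print("positional backend rejected")
```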
Review feedback addressed
@jakelorocco flagged on first pass that an earlier iteration of this branch passed the accumulated output to `stream_validate` rather than a single chunk, which deviated from the spec. Commit 82bdd3a5 restores the spec-compliant behaviour:

- `requirement.py` docstring rewritten to describe `chunk` as the delta from the chunking strategy, with notes on `ctx` being incomplete during streaming and the MOT single-consumer constraint.
- `test_stateful_subclass_accumulates_state` rewritten: `BulletCounter` now accumulates on `self._count` across delta calls, dropping the earlier `self._seen_len` workaround that only made sense under accumulated semantics.

The corresponding orchestrator change (pass single chunk, not accumulated) is in the stacked PR #942.
Implementation Checklist

Base Class
- `Requirement` — standard requirement

Validation Logic
- `Requirement`; no default `validation_fn` added (`stream_validate` is an override point, not a validation implementation)
- `PartialValidationResult` with tri-state `success` and optional `reason`/`score`

Integration
- No changes to `mellea/stdlib/requirements/__init__.py` needed — this PR modifies the base class, not a new requirement

Testing
`test/core/test_stream_validate.py`:
- `test_default_returns_unknown` — base class always returns `"unknown"`
- `test_default_returns_partial_validation_result_instance` — correct return type
- `test_stream_validate_is_coroutine` — method is async
- `test_subclass_can_return_pass` / `test_subclass_can_return_fail` — subclass overrides work
- `test_does_not_mutate_requirement` / `test_stream_validate_idempotent` — base class is stateless
- `test_stateful_subclass_accumulates_state` — delta-semantics bullet counter correctly accumulates on `self`
- `test_stateful_subclass_clone_isolation` — `copy()` produces independent clones for orchestrator use

Attribution