
feat: add stream_validate() hook to Requirement (#900)#925

Merged
planetf1 merged 8 commits into generative-computing:main from planetf1:feat/900-stream-validate
Apr 28, 2026

Conversation

Contributor

@planetf1 planetf1 commented Apr 24, 2026

Requirement PR

Description

Adds async stream_validate(chunk, *, backend, ctx) -> PartialValidationResult to the base Requirement class as a per-chunk streaming validation hook. The default implementation returns PartialValidationResult("unknown"); subclasses override to inspect the current chunk (the semantic delta from the chunking strategy — one sentence, word, or paragraph per call) and signal "pass" or "fail" early. Stateful implementations maintain their own running state across calls (e.g. self._count += chunk.count("•")); the orchestrator (in the stacked PR #942) clones the requirement before each attempt so state does not bleed across retries.

Part of streaming epic #891. Builds on #924 (merged — PartialValidationResult) and #923 (merged — ChunkingStrategy).

Design decisions

Per the agreed Phase 1 spec:

No LLM-as-a-Judge logic here — this is a pure hook for custom validation overrides.

Review feedback addressed

@jakelorocco flagged on first pass that an earlier iteration of this branch passed the accumulated output to stream_validate rather than a single chunk, which deviated from the spec. Commit 82bdd3a5 restores the spec-compliant behaviour:

  • requirement.py docstring rewritten to describe chunk as the delta from the chunking strategy, with notes on ctx being incomplete during streaming and the MOT single-consumer constraint.
  • test_stateful_subclass_accumulates_state rewritten: BulletCounter now accumulates on self._count across delta calls, dropping the earlier self._seen_len workaround that only made sense under accumulated semantics.

The corresponding orchestrator change (pass single chunk, not accumulated) is in the stacked PR #942.

Implementation Checklist

Base Class

  • Extends appropriate base class:
    • Requirement - standard requirement

Validation Logic

  • Hook method on base Requirement; no default validation_fn added (stream_validate is an override point, not a validation implementation)
  • Returns PartialValidationResult with tri-state success and optional reason / score

Integration

  • No change to mellea/stdlib/requirements/__init__.py needed — this PR modifies the base class, not a new requirement

Testing

  • Tests added to test/core/test_stream_validate.py:
    • test_default_returns_unknown — base class always returns "unknown"
    • test_default_returns_partial_validation_result_instance — correct return type
    • test_stream_validate_is_coroutine — method is async
    • test_subclass_can_return_pass / test_subclass_can_return_fail — subclass overrides work
    • test_does_not_mutate_requirement / test_stream_validate_idempotent — base class is stateless
    • test_stateful_subclass_accumulates_state — delta-semantics bullet counter correctly accumulates on self
    • test_stateful_subclass_clone_isolation — copy() produces independent clones for orchestrator use
  • New code has 100% coverage
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation once the rest of the PR is populated)
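The clone-isolation pattern that the last test exercises can be sketched like this (a hypothetical stand-in class, not the real test code in test/core/test_stream_validate.py):

```python
import asyncio
from copy import copy


class CountingRequirement:
    """Hypothetical stateful requirement that counts chunks it has seen."""

    def __init__(self):
        self._calls = 0

    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        self._calls += 1
        return "unknown"


original = CountingRequirement()

# Orchestrator pattern: never call the original directly; clone a fresh
# copy per attempt so state cannot bleed across retries.
for attempt in range(2):
    req = copy(original)
    assert req._calls == 0          # every attempt starts clean
    asyncio.run(req.stream_validate("chunk a"))
    asyncio.run(req.stream_validate("chunk b"))
    assert req._calls == 2

assert original._calls == 0         # the original was never mutated
```

A shallow `copy()` suffices here because the accumulated state is an immutable int; requirements holding mutable containers would need to account for that.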

Attribution

  • AI coding assistants used

@github-actions github-actions Bot added the enhancement New feature or request label Apr 24, 2026
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from 96f1919 to 0c030fb Compare April 24, 2026 12:18
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 24, 2026
…eam_validate

- Remove "In Phase 1" temporal qualifier from docstring — reworded to
  timeless statement about orchestrator responsibility
- Add type annotations (str, Backend, Context) to test subclass overrides
- Add idempotency test: multiple calls on the same Requirement instance
  leave state unchanged

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 24, 2026
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from d922a2c to 58128a7 Compare April 24, 2026 14:16
…#900)

Add an async `stream_validate(chunk, backend, ctx)` method to the base
`Requirement` class. The default implementation returns
`PartialValidationResult("unknown")`; subclasses override to inspect the
accumulated chunk and return `"pass"` or `"fail"` early.

Per the Phase 1 design: `"pass"` is informational and does not
short-circuit the final `validate()` call. The method must not mutate
`self` — state isolation is the orchestrator's responsibility.

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…eam_validate

- Remove "In Phase 1" temporal qualifier from docstring — reworded to
  timeless statement about orchestrator responsibility
- Add type annotations (str, Backend, Context) to test subclass overrides
- Add idempotency test: multiple calls on the same Requirement instance
  leave state unchanged

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Prevents positional confusion and makes future parameter additions
to the signature non-breaking for existing subclass overrides.

Assisted-by: Claude Code
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from 58128a7 to 358e4d1 Compare April 27, 2026 13:12
The docstring incorrectly stated that implementations must not mutate
self. Issue generative-computing#900 spec explicitly allows stateful accumulation and
requires the shallow-copy caveat to be documented. Fix the docstring
to match the spec.

Add two tests required by the issue acceptance criteria:
- test_stateful_subclass_accumulates_state: verifies a subclass can
  accumulate state (bullet counter) across stream_validate calls
- test_stateful_subclass_clone_isolation: verifies copy() gives an
  independent clone, confirming the orchestrator clone pattern

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 marked this pull request as ready for review April 27, 2026 13:19
@planetf1 planetf1 requested review from a team, jakelorocco and nrfulton as code owners April 27, 2026 13:19
The previous implementation overwrote _bullet_count from the full
accumulated chunk on each call — equivalent to a pure function with
no real dependency on prior state.

Use _seen_len to extract only the new portion of each accumulated
chunk, accumulating the count additively. This genuinely requires
prior-call state to know where to slice, making the test name
"accumulates_state" accurate.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
In multi-line calls, # type: ignore only suppresses errors on its own
line. The backend=None argument was uncovered; add the ignore there too.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Use the public API for imports: Backend and Context both appear in
mellea.core.__all__, so import from mellea.core rather than the
internal submodules.

Rewrite test_stateful_subclass_clone_isolation to simulate the correct
orchestrator pattern: the original requirement is never called directly;
each attempt clones from the fresh original, giving _calls == 0 at the
start of every attempt. The previous test cloned mid-stream, which
tested shallow-copy isolation but demonstrated the wrong usage pattern.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 28, 2026
@planetf1
Contributor Author

@jakelorocco how does this look to you? I have another stacked PR behind this one too. (I'll just go one level deep.)

Contributor

@jakelorocco jakelorocco left a comment


@planetf1, I might've missed this in the original proposal, but passing the accumulated chunks to the requirements is different from what I thought the proposal was. Could you please elaborate more on the design choice?

Comment thread mellea/core/requirement.py Outdated
isolation.

Args:
chunk: The accumulated model output so far (not just the latest token).
Contributor


This worries me. @nrfulton, your initial proposal was to have the requirement only see new chunks. This would show all accumulated chunks so far.

This forces all streaming requirements to be stateful; all requirements must now keep track of what chunks they have processed. The alternative would be to only provide new chunks to requirements; then streaming validation would be stateless except when needed. Requirements can choose whether they need to store and process multiple chunks, or just check each chunk independently.

I guess checking each chunk independently is unlikely to be helpful, so I can see why accumulating and forcing requirements to track their own progress doesn't actually add much complexity.

If so, I think we should actually pre-define functions to help with this (either as a new class of requirements or functions that implementors can draw on).

Also, if we are passing the accumulated chunk through, I almost think we should just pass in some point-in-time copy of the model output thunk, i.e. one that doesn't get streamed the new chunks but has all the data fields from the point in time it was copied at.

Contributor Author


I would suggest reverting to the simplicity of the original proposal. The change was made incorrectly while revising the initial PR. The initial approach is simple, and requirements can maintain state if they need to work on accumulated chunks. If we need to support accumulated content generally we can consider it in a later phase.

Which works best will vary by use case. Per-chunk works better for:

  • checking for forbidden words/phrases
  • ensuring a paragraph or sentence is coherent
  • structural checks (e.g. code fencing)
  • format validation

especially when the MoT manages the semantic chunking (later).

There will be cases where accumulation is better -- these are probably more complex checks: does a story line flow, are we taking the response in an unexpected direction, do we have a complete enough response?

Importantly, if we stick to per-chunk we could still implement this second approach, albeit not as cleanly.

So in summary: I'll revert to the original -- but if you now think that's wrong, we can adjust?
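The per-chunk cases listed above are expressible as stateless checks on the delta alone. A hypothetical sketch (the forbidden-phrase list and function name are made up for illustration, not mellea code):

```python
FORBIDDEN = {"guarantee", "refund"}   # hypothetical policy phrases


def check_chunk(chunk: str) -> str:
    """Stateless per-chunk check: inspects only the delta, no history needed."""
    lowered = chunk.lower()
    return "fail" if any(phrase in lowered for phrase in FORBIDDEN) else "unknown"
```

A check like this slots directly into a `stream_validate` override with no state on `self`; only the accumulation-style checks (story flow, completeness) would need to retain earlier chunks.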

Restores the chunk-at-a-time semantics set out in the generative-computing#891 epic and
generative-computing#900 spec: stream_validate is called once per complete chunk produced
by the chunking strategy, and receives that single chunk. Requirements
that need history accumulate it on self.

Commit 315a98c inadvertently flipped this: the BulletCounter test was
rewritten to recover deltas from accumulated text via self._seen_len,
and the docstring was updated to match ("The accumulated model output
so far"). Neither change reflected a design decision — it was drift
during a test fix, and buries a confusing workaround in what should be
a straightforward stateful override.

Changes:
- requirement.py: rewrite chunk Args description to name the
  chunking-strategy-produced delta, clarify that ctx does not contain
  the generated output during streaming, and note the MOT
  single-consumer constraint
- test_stream_validate.py: rewrite BulletCounter to accumulate its own
  running count (no self._seen_len); calls pass delta chunks
  ("\n- one\n- two") rather than re-sending accumulated text

The corresponding orchestrator fix in stream_with_chunking() -- pass
the chunk, iterate per chunk -- is in the stacked Wave 3 branch.

Assisted-by: Claude Code
Contributor

@jakelorocco jakelorocco left a comment


lgtm; I think the single chunk approach is good. If we want to revert back to the accumulation, I don't think we are stuck with this approach yet.

Comment on lines +305 to +321
Implementations must not call ``mot.astream()`` or otherwise read the
underlying stream; the orchestrator is the single consumer of the MOT
stream (see ``ModelOutputThunk.astream``). Requirements that need access
to the text seen so far should accumulate it themselves from the
``chunk`` values they receive.

Args:
    chunk: A single complete semantic chunk produced by the chunking
        strategy (e.g. one sentence for ``SentenceChunker``). This is
        the delta since the previous ``stream_validate`` call for this
        attempt, not the accumulated output. Requirements that need
        earlier context should retain it on ``self`` across calls.
    backend: The inference backend, available for backend-assisted checks.
    ctx: The current generation context. During streaming the MOT is
        not yet computed, so ``ctx`` does not contain the generated
        output; use ``chunk`` (and any state accumulated on ``self``)
        instead.
Contributor


Clarification, the generation context has the uncomputed mot at this point, right? That's the reason for the warning?

Contributor Author


Yes - they need to rely on the chunk & anything they've accumulated themselves. mot_is_computed stays false until streaming ends, so we can't say for sure what's in it. I've tried to ensure we capture the behaviour in the docstrings.

I think any changes to this initial approach -- and making the mot more responsible -- is a later phase.

Thanks for approval!

I have a stacked PR which I'll get out asap too (may not be until tomorrow)

@planetf1 planetf1 added this pull request to the merge queue Apr 28, 2026
Merged via the queue into generative-computing:main with commit 7912a1d Apr 28, 2026
10 of 11 checks passed
@planetf1 planetf1 deleted the feat/900-stream-validate branch April 28, 2026 19:19

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

feat(core): add stream_validate() to Requirement

2 participants