feat(eval): add ClassifierEvaluator (pure-metadata aggregator)#1674
Open
ajay-kesavan wants to merge 3 commits into
Adds a new evaluator type whose role is to carry a `classes` list and a `source_evaluator` name to downstream consumers. It does not compute classification metrics per datapoint; that work moves to the Studio Web C# backend, which reads each datapoint's agent output and the source evaluator's expected label after the per-datapoint loop finishes, scans the output for each configured class, and builds the confusion matrix.

The per-datapoint evaluate() returns score=0.0 with a ClassifierJustification(classes, source_evaluator) details payload. This payload survives the existing CLI -> backend wire path via StudioWebProgressReporter._serialize_justification (json.dumps of the model_dump), arriving in the backend as a JSON string inside CodedEvaluatorScore.Justification where the C# layer can read it.

Replaces the design in earlier draft PRs #1669 and #5307: the SDK no longer owns the dataset-level computation. The pure-config approach is ~50 LOC instead of ~1500 LOC of dataset-evaluator framework + worker workflow + factory + child workflow plumbing.

Files:
  src/uipath/eval/evaluators/classifier_evaluator.py    new (~90 LOC)
  src/uipath/eval/evaluators/__init__.py                re-export + EVALUATORS list
  src/uipath/eval/evaluators/evaluator.py               discriminator + Union entry
  src/uipath/eval/models/models.py                      EvaluatorType.CLASSIFIER
  tests/evaluators/test_classifier_evaluator.py         9 unit tests, all passing

Verified:
  pytest tests/evaluators tests/cli/eval --no-cov -> 824 passed
  ruff check / ruff format / mypy -> clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A minimal 3-class intent classification agent (book / cancel / reschedule)
that exercises the new ClassifierEvaluator end-to-end via `uipath eval`.
Mirrors the wire shape Studio Web will see once the C# backend and frontend
PRs land, so SDK changes can be validated standalone before the full stack
is brought up.
Layout:
  main.py                      keyword classifier returning {"intent": "..."}
  evaluations/
    eval-sets/main.json
    evaluators/
      intent_match.json        per-datapoint ExactMatch on .intent
      intent_classifier.json   new uipath-classifier with classes + sourceEvaluator
  README.md                    Path A (SDK CLI) + Path B (Studio Web) instructions
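For orientation, a keyword classifier of the kind the layout describes for main.py might look like the sketch below. The keyword table, the `classify` name, and the fallback to "book" are all assumptions made for illustration; the sample's actual main.py is not shown here and may differ.

```python
# Hypothetical keyword table. "reschedule" is listed before "book" because
# the substring "schedule" would otherwise match reschedule requests too.
KEYWORDS = {
    "cancel": ("cancel", "call off"),
    "reschedule": ("reschedule", "postpone"),
    "book": ("book", "schedule", "reserve"),
}


def classify(message: str) -> dict[str, str]:
    """Return {"intent": <class>} based on the first keyword group that matches."""
    text = message.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return {"intent": intent}
    # Assumed fallback: a naive classifier has to answer something, which is
    # one way a fixture ends up with deliberately misclassified datapoints.
    return {"intent": "book"}


examples = [
    "I'd like to book a table for two",
    "Please cancel my appointment",
    "Can we reschedule to Friday?",
]
predictions = [classify(message) for message in examples]
```

The output shape {"intent": "..."} is what lets a per-datapoint ExactMatch on .intent supply the expected label that the classifier evaluator points to.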
Each datapoint has `evaluationCriterias.intent_classifier: {}` (the runtime
skips evaluators that aren't keyed there). 6/9 datapoints are correctly
classified by design; the resulting (expected, actual) pairs flow through
the existing CLI -> backend wire path inside the classifier's justification
payload as classes/source_evaluator metadata.
Verified live:
- ExactMatch averages to 0.7 (6/9 correct, i.e. 0.667 before rounding).
- ClassifierEvaluator emits {"expected":"","actual":"","classes":[...],
"source_evaluator":"intent_match"} per datapoint.
- Plugging the (expected, actual) pairs from the resulting output into the
same confusion-matrix math the C# helper implements yields macro F1 of
0.667 on this fixture — the number Studio Web's Aggregations panel
would render once the backend pipeline is live.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
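The confusion-matrix arithmetic mentioned above can be sketched as follows. This is not the C# helper; it assumes the standard per-class definition F1 = 2TP / (2TP + FP + FN) averaged with equal class weights. The nine (expected, actual) pairs are hypothetical, chosen only to have 6/9 correct, and are not the fixture's real data.

```python
from collections import Counter


def confusion_matrix(pairs, classes):
    """Rows are expected labels, columns are actual labels."""
    counts = Counter(pairs)
    return [[counts[(exp, act)] for act in classes] for exp in classes]


def macro_f1(matrix):
    """Unweighted mean of per-class F1 scores."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                         # expected i, predicted other
        fp = sum(matrix[r][i] for r in range(n)) - tp    # predicted i, expected other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


CLASSES = ["book", "cancel", "reschedule"]

# Hypothetical pairs: two hits and one miss per class.
PAIRS = [
    ("book", "book"), ("book", "book"), ("book", "cancel"),
    ("cancel", "cancel"), ("cancel", "cancel"), ("cancel", "reschedule"),
    ("reschedule", "reschedule"), ("reschedule", "reschedule"), ("reschedule", "book"),
]

matrix = confusion_matrix(PAIRS, CLASSES)
accuracy = sum(matrix[i][i] for i in range(len(CLASSES))) / len(PAIRS)
score = macro_f1(matrix)
```

With these symmetric pairs every class has TP=2, FP=1, FN=1, so accuracy and macro F1 both come out to 2/3. A less symmetric error pattern with the same 6/9 accuracy would generally give a different macro F1.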
Pydantic's generic resolution leaves T = typing.Any when a TypeVar is parameterized with its own bound (BaseEvaluationCriteria here), so BaseEvaluator[BaseEvaluationCriteria, ...] tripped the runtime's "X must be a subclass of BaseEvaluationCriteria" guard at load time:

  Failed to create evaluator from file 'evaluations/evaluators/classifier-*.json':
  typing.Any must be a subclass of BaseEvaluationCriteria.

Introduce an empty ClassifierEvaluationCriteria(BaseEvaluationCriteria) subclass and parameterize Config + Evaluator with it. Mirrors how every other built-in evaluator (ExactMatch via OutputEvaluationCriteria, etc.) provides a concrete criteria type even when no per-datapoint fields are needed.
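The shape of the guard and of the fix can be illustrated without Pydantic. In the sketch below the classes are plain stand-ins for the SDK types, `require_criteria_type` is a hypothetical name, and the resolved type is passed in directly rather than derived from generic parameters, so this shows only why an empty concrete subclass satisfies the check while typing.Any does not.

```python
from typing import Any


class BaseEvaluationCriteria:
    """Stand-in for the SDK base class."""


class ClassifierEvaluationCriteria(BaseEvaluationCriteria):
    """Empty on purpose: exists only to give the evaluator a concrete type."""


def require_criteria_type(resolved: object) -> type:
    """Hypothetical load-time guard: accept only real subclasses of the base."""
    # isinstance(..., type) comes first because on older Python versions
    # typing.Any is not a class and issubclass() would raise on it.
    if not (isinstance(resolved, type) and issubclass(resolved, BaseEvaluationCriteria)):
        raise TypeError(f"{resolved!r} must be a subclass of BaseEvaluationCriteria")
    return resolved


# The concrete subclass passes the guard.
accepted = require_criteria_type(ClassifierEvaluationCriteria)

# typing.Any, which is what the generic parameter reportedly collapsed to, does not.
try:
    require_criteria_type(Any)
    rejected = False
except TypeError:
    rejected = True
```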
This was referenced May 22, 2026



Summary
Adds a new evaluator type whose role is to carry a `classes` list and a `source_evaluator` name to downstream consumers (the C# Studio Web backend). It does not compute classification metrics per datapoint; that work moves out of the SDK and into the C# layer, which scans each datapoint's agent output for the configured class strings and builds the confusion matrix after the per-datapoint loop finishes.

Replaces the earlier draft architecture in #1669 / #5307 (Python dataset evaluator framework + Temporal worker workflow). The pure-metadata approach is ~50 LOC instead of ~1500 LOC and ships through the existing CLI → backend wire path with zero new endpoints.
How it works
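A minimal sketch of the flow, under stated assumptions: the real `ClassifierJustification` is a Pydantic model serialized with json.dumps of its model_dump, whereas this sketch uses stdlib dataclasses to stay self-contained. `EvaluationResult`, the free-standing `evaluate`, and `serialize_justification` are hypothetical stand-ins, not the SDK's actual signatures.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClassifierJustification:
    """Stand-in for the details payload: pure metadata, no metrics."""
    classes: list[str]
    source_evaluator: str


@dataclass
class EvaluationResult:
    """Hypothetical result container; the SDK's real result type may differ."""
    score: float
    details: ClassifierJustification


def evaluate(classes: list[str], source_evaluator: str) -> EvaluationResult:
    # Nothing is computed per datapoint: the score is a constant placeholder
    # and the configuration is simply echoed back for the backend to read.
    return EvaluationResult(
        score=0.0,
        details=ClassifierJustification(list(classes), source_evaluator),
    )


def serialize_justification(details: ClassifierJustification) -> str:
    # Approximates json.dumps(model.model_dump()) for a Pydantic model.
    return json.dumps(asdict(details))


result = evaluate(["book", "cancel", "reschedule"], "intent_match")
wire = serialize_justification(result.details)  # JSON string sent to the backend
decoded = json.loads(wire)                      # what a consumer would parse back
```

Because the payload is already a JSON string by the time it reaches the justification field, a consumer only needs to parse it to recover the class list and the name of the evaluator that holds the expected label.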
Files
New
- `eval/evaluators/classifier_evaluator.py`: `ClassifierEvaluator` + `ClassifierEvaluatorConfig` + `ClassifierJustification`
- `tests/evaluators/test_classifier_evaluator.py`: 9 unit tests

Modified
- `eval/models/models.py`: `EvaluatorType.CLASSIFIER = "uipath-classifier"`
- `eval/evaluators/evaluator.py`: discriminator + `CodedEvaluator` union entry
- `eval/evaluators/__init__.py`: re-export + `EVALUATORS` list entry

Total: 5 files, +297 / -0.
Test plan
- `pytest tests/evaluators/test_classifier_evaluator.py`: 9 tests passing
- `pytest tests/evaluators tests/cli/eval`: 824 passing (815 existing + 9 new), zero regressions
- `ruff check` / `ruff format` / `mypy`: clean on all changed files
- `EvaluatorFactory.create_evaluator({"version":"1.0","evaluatorTypeId":"uipath-classifier", ...})` builds it correctly
- `evaluate()` returns `score=0.0` + `ClassifierJustification(classes=..., source_evaluator=...)` with the wire JSON shape that the C# layer expects
- `uipath eval` against a real eval set with a classifier: pending companion Agents PR landing

Disposition
This branch supersedes the SDK changes in #1669 (Python dataset evaluator framework). I'll close #1669 once this lands.
🤖 Generated with Claude Code