feat(eval): add ClassifierEvaluator (pure-metadata aggregator)#1674

Open
ajay-kesavan wants to merge 3 commits into main from feat/eval-classifier-evaluator

@ajay-kesavan

Summary

Adds a new evaluator type whose role is to carry a classes list and a source_evaluator name to downstream consumers (the C# Studio Web backend). It does not compute classification metrics per datapoint — that work moves out of the SDK and into the C# layer, which scans each datapoint's agent output for the configured class strings and builds the confusion matrix after the per-datapoint loop finishes.

Replaces the earlier draft architecture in #1669 / #5307 (Python dataset evaluator framework + Temporal worker workflow). The pure-metadata approach is ~50 LOC instead of ~1500 LOC and ships through the existing CLI → backend wire path with zero new endpoints.

How it works

Eval set:
  evaluatorRefs:
    - intent_match            ExactMatch (existing) — produces expected/actual per datapoint
    - intent_classifier       NEW ClassifierEvaluator — carries classes list

CLI / SDK runtime (this PR):
  For each datapoint:
    ExactMatch.evaluate(...)        → result with BaseEvaluatorJustification(expected, actual)
    ClassifierEvaluator.evaluate()  → result with ClassifierJustification(classes, source_evaluator)
                                     (no per-datapoint computation; the result is metadata)
  CLI POSTs both results to C# via the existing per-evaluator-run update path.

C# (in companion Agents PR):
  After the per-datapoint loop, the C# layer detects the classifier evaluator by
  inspecting Justification payloads, reads (output, expected_class) per datapoint,
  builds the confusion matrix + per-class TP/TN/FP/FN, and persists them into the
  EvaluatorScores envelope.
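
The post-loop aggregation described above can be sketched in Python. This is an illustration only: the real logic lives in the C# companion PR and is not shown here, so the helper names (`detect_class`, `aggregate`), the case-insensitive substring match, and the "first configured class wins" rule are all assumptions.

```python
from collections import Counter
from typing import Optional


def detect_class(output: str, classes: list) -> Optional[str]:
    """Return the first configured class found in the output, else None.

    Assumption: case-insensitive substring scan, first match wins.
    """
    lowered = output.lower()
    for cls in classes:
        if cls.lower() in lowered:
            return cls
    return None


def aggregate(rows, classes):
    """Build a confusion matrix and per-class counts.

    rows: iterable of (agent_output, expected_class) pairs.
    """
    matrix = Counter()
    for output, expected in rows:
        matrix[(expected, detect_class(output, classes))] += 1

    total = sum(matrix.values())
    per_class = {}
    for cls in classes:
        tp = matrix[(cls, cls)]
        fp = sum(n for (exp, pred), n in matrix.items() if pred == cls and exp != cls)
        fn = sum(n for (exp, pred), n in matrix.items() if exp == cls and pred != cls)
        per_class[cls] = {"tp": tp, "fp": fp, "fn": fn, "tn": total - tp - fp - fn}
    return matrix, per_class


# Hypothetical datapoints, not taken from the sample project.
classes = ["book", "cancel", "reschedule"]
rows = [
    ('{"intent": "book"}', "book"),
    ('{"intent": "cancel"}', "book"),
    ('{"intent": "cancel"}', "cancel"),
]
matrix, per_class = aggregate(rows, classes)
```

An output that matches no configured class is recorded under a `None` prediction here, which counts as a false negative for the expected class. How the C# layer treats that case is not stated in this PR.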

Files

New

  • eval/evaluators/classifier_evaluator.py: ClassifierEvaluator + ClassifierEvaluatorConfig + ClassifierJustification
  • tests/evaluators/test_classifier_evaluator.py: 9 unit tests

Modified

  • eval/models/models.py: EvaluatorType.CLASSIFIER = "uipath-classifier"
  • eval/evaluators/evaluator.py: discriminator + CodedEvaluator union entry
  • eval/evaluators/__init__.py: re-export + EVALUATORS list entry

Total: 5 files, +297 / -0.

Test plan

  • pytest tests/evaluators/test_classifier_evaluator.py — 9 tests passing
  • pytest tests/evaluators tests/cli/eval — 824 passing (815 existing + 9 new), zero regressions
  • ruff check / ruff format / mypy — clean on all changed files
  • Factory smoke: EvaluatorFactory.create_evaluator({"version":"1.0","evaluatorTypeId":"uipath-classifier", ...}) builds it correctly
  • Per-datapoint smoke: evaluate() returns score=0.0 + ClassifierJustification(classes=..., source_evaluator=...) with the wire JSON shape that the C# layer expects
  • End-to-end via uipath eval against a real eval set with a classifier — pending companion Agents PR landing

Disposition

This branch supersedes the SDK changes in #1669 (Python dataset evaluator framework). I'll close #1669 once this lands.

🤖 Generated with Claude Code

Adds a new evaluator type whose role is to carry a `classes` list and a
`source_evaluator` name to downstream consumers. It does not compute
classification metrics per datapoint — that work moves to the Studio Web
C# backend, which reads each datapoint's agent output and the source
evaluator's expected label after the per-datapoint loop finishes, scans
the output for each configured class, and builds the confusion matrix.

The per-datapoint evaluate() returns score=0.0 with a
ClassifierJustification(classes, source_evaluator) details payload. This
payload survives the existing CLI -> backend wire path via
StudioWebProgressReporter._serialize_justification (json.dumps of the
model_dump), arriving in the backend as a JSON string inside
CodedEvaluatorScore.Justification where the C# layer can read it.
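
A rough sketch of that round trip, using only the standard library. The dataclass below is a stand-in for the Pydantic ClassifierJustification model (with `asdict` playing the role of `model_dump`). The four field names follow the sample payload quoted in the sample-project commit message, while the class and function names are hypothetical and are not the SDK's.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ClassifierJustificationSketch:
    # Stand-in for the real Pydantic model; defaults mirror the empty
    # expected/actual strings seen in the emitted payload.
    expected: str = ""
    actual: str = ""
    classes: list = field(default_factory=list)
    source_evaluator: str = ""


def serialize_justification(justification) -> str:
    """Dump the model to a dict, then to a JSON string for the wire."""
    return json.dumps(asdict(justification))


wire = serialize_justification(
    ClassifierJustificationSketch(
        classes=["book", "cancel", "reschedule"],
        source_evaluator="intent_match",
    )
)

# The backend receives a string and has to parse it to recover the metadata.
decoded = json.loads(wire)
```

The point of the sketch is that the justification travels as an opaque JSON string, so no new endpoint or schema change is needed on the wire path.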

Replaces the design in earlier draft PRs #1669 and #5307: the SDK no
longer owns the dataset-level computation. The pure-config approach is
~50 LOC instead of ~1500 LOC of dataset-evaluator framework + worker
workflow + factory + child workflow plumbing.

Files:
  src/uipath/eval/evaluators/classifier_evaluator.py  new (~90 LOC)
  src/uipath/eval/evaluators/__init__.py              re-export + EVALUATORS list
  src/uipath/eval/evaluators/evaluator.py             discriminator + Union entry
  src/uipath/eval/models/models.py                    EvaluatorType.CLASSIFIER
  tests/evaluators/test_classifier_evaluator.py       9 unit tests, all passing

Verified:
  pytest tests/evaluators tests/cli/eval --no-cov  -> 824 passed
  ruff check / ruff format / mypy                  -> clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions (bot) added the test:uipath-langchain and test:uipath-integrations labels on May 21, 2026
A minimal 3-class intent classification agent (book / cancel / reschedule)
that exercises the new ClassifierEvaluator end-to-end via `uipath eval`.
Mirrors the wire shape Studio Web will see once the C# backend and frontend
PRs land, so SDK changes can be validated standalone before the full stack
is brought up.

Layout:
  main.py             — keyword classifier returning {"intent": "..."}
  evaluations/
    eval-sets/main.json
    evaluators/
      intent_match.json       per-datapoint ExactMatch on .intent
      intent_classifier.json  new uipath-classifier with classes + sourceEvaluator
  README.md           — Path A (SDK CLI) + Path B (Studio Web) instructions

Each datapoint has `evaluationCriterias.intent_classifier: {}` (the runtime
skips evaluators that aren't keyed there). 6/9 datapoints are correctly
classified by design; the resulting (expected, actual) pairs flow through
the existing CLI -> backend wire path in the ExactMatch justification,
while the classifier's justification payload carries the
classes/source_evaluator metadata.

Verified live:
  - ExactMatch averages to 0.667 (6/9 correct).
  - ClassifierEvaluator emits {"expected":"","actual":"","classes":[...],
    "source_evaluator":"intent_match"} per datapoint.
  - Plugging the (expected, actual) pairs from the resulting output into the
    same confusion-matrix math the C# helper implements yields macro F1 of
    0.667 on this fixture — the number Studio Web's Aggregations panel
    would render once the backend pipeline is live.
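
For reference, a minimal macro-F1 sketch over (expected, actual) pairs. The pairs below are hypothetical: they are chosen so that 6 of 9 match, but they are not the sample project's actual datapoints, which are not listed here. Treating a 0/0 precision or recall as 0.0 is also an assumption, since the C# helper's convention is not shown.

```python
def macro_f1(pairs, classes):
    """Unweighted mean of per-class F1 over (expected, actual) pairs."""
    scores = []
    for cls in classes:
        tp = sum(1 for exp, act in pairs if exp == cls and act == cls)
        fp = sum(1 for exp, act in pairs if exp != cls and act == cls)
        fn = sum(1 for exp, act in pairs if exp == cls and act != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


classes = ["book", "cancel", "reschedule"]
# Hypothetical pairs: each class has two correct and one misclassified datapoint.
pairs = [
    ("book", "book"), ("book", "book"), ("book", "cancel"),
    ("cancel", "cancel"), ("cancel", "cancel"), ("cancel", "reschedule"),
    ("reschedule", "reschedule"), ("reschedule", "reschedule"), ("reschedule", "book"),
]
accuracy = sum(1 for exp, act in pairs if exp == act) / len(pairs)
score = macro_f1(pairs, classes)
```

With this symmetric error pattern every class has precision and recall of 2/3, so macro F1 equals accuracy. A different distribution of the three misses would generally give a different macro F1 for the same 6/9 accuracy.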

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajay-kesavan ajay-kesavan marked this pull request as ready for review May 21, 2026 17:34
Pydantic's generic resolution leaves T = typing.Any when a TypeVar is
parameterized with its own bound (BaseEvaluationCriteria here), so
BaseEvaluator[BaseEvaluationCriteria, ...] tripped the runtime's
"X must be a subclass of BaseEvaluationCriteria" guard at load time:

  Failed to create evaluator from file 'evaluations/evaluators/classifier-*.json':
  typing.Any must be a subclass of BaseEvaluationCriteria.

Introduce an empty ClassifierEvaluationCriteria(BaseEvaluationCriteria)
subclass and parameterize Config + Evaluator with it. Mirrors how every
other built-in evaluator (ExactMatch via OutputEvaluationCriteria, etc.)
provides a concrete criteria type even when no per-datapoint fields are
needed.
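
The shape of the fix can be illustrated with a standard-library sketch. It does not reproduce Pydantic's generic resolution: `typing.Any` is passed in directly to stand in for what the runtime saw, and `BaseEvaluatorSketch` with its `__init_subclass__` guard is a hypothetical simplification of the SDK's load-time check.

```python
from typing import Any, Generic, TypeVar, get_args, get_origin


class BaseEvaluationCriteria:
    """Stand-in for the SDK's criteria base class."""


T = TypeVar("T", bound=BaseEvaluationCriteria)


class BaseEvaluatorSketch(Generic[T]):
    """Hypothetical base that validates its criteria type argument."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        for base in getattr(cls, "__orig_bases__", ()):
            if get_origin(base) is BaseEvaluatorSketch:
                (arg,) = get_args(base)
                if not (isinstance(arg, type) and issubclass(arg, BaseEvaluationCriteria)):
                    raise TypeError(f"{arg} must be a subclass of BaseEvaluationCriteria")
                cls.criteria_type = arg


class ClassifierEvaluationCriteria(BaseEvaluationCriteria):
    """Empty on purpose: exists only to give the generic a concrete type."""


class ClassifierEvaluatorSketch(BaseEvaluatorSketch[ClassifierEvaluationCriteria]):
    pass


# Simulate the failing case: the type argument has degraded to typing.Any.
try:
    class BrokenEvaluatorSketch(BaseEvaluatorSketch[Any]):
        pass
    error = None
except TypeError as exc:
    error = str(exc)
```

An empty concrete subclass passes the guard, whereas a type argument that has collapsed to `Any` does not, which is why a dedicated criteria class is needed even with no per-datapoint fields.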

Labels

test:uipath-integrations test:uipath-langchain Triggers tests in the uipath-langchain-python repository
