feat(eval): classification evaluator schemas + sample projects + e2e tests by ajay-kesavan · Pull Request #1663 · UiPath/uipath-python

ajay-kesavan · 2026-05-20T00:48:06Z

Summary

Completes the classification evaluator feature shipped in #1397 by adding the three pieces that PR didn't carry:

- Generated type schemas: `BinaryClassificationEvaluator.json` and `MulticlassClassificationEvaluator.json` under `packages/uipath/src/uipath/eval/evaluators_types/`, produced by `python -m uipath.eval.evaluators_types.generate_types`. These are the machine-readable schemas that external tooling (Flow UI evaluator picker, `uip maestro flow eval`) reads to learn each evaluator's config / criteria / justification shape.
- Sample projects under `packages/uipath/samples/`:
  - `binary_classification_agent/`: rule-based spam/ham classifier wired to the binary classification evaluator with `metric_type=precision`. The eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive, which demonstrates the dataset-level metric diverging from a simple per-row pass rate.
  - `multiclass_classification_simple/`: rule-based 3-class router (payments / support / spam) wired to the multiclass classification evaluator with `averaging=macro`. The eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = (0.8 + 0.8 + 1.0) / 3 = 26/30.
- End-to-end test at `packages/uipath/tests/cli/eval/test_classification_samples_e2e.py`: loads each sample's eval set, wires its `main.py` into a stand-in runtime, calls `evaluate()`, and asserts both the per-row scores and the aggregated metric produced by `reduce_scores`. This locks in the dataset-level math.
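
The arithmetic behind both samples can be reproduced in a few lines. This is a minimal sketch, not the evaluator code from #1397: the helper names (`precision_recall`, `f1`, `macro_f1`) and the rows below are hypothetical, chosen only so the counts match the figures quoted above (one false positive in the binary set, one support request misrouted to payments in the multiclass set). The real eval sets may contain different rows.

```python
from fractions import Fraction


def precision_recall(expected, predicted, cls):
    """Return (precision, recall) for one class as exact fractions."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e == cls and p == cls)
    fp = sum(1 for e, p in pairs if e != cls and p == cls)
    fn = sum(1 for e, p in pairs if e == cls and p != cls)
    precision = Fraction(tp, tp + fp) if tp + fp else Fraction(0)
    recall = Fraction(tp, tp + fn) if tp + fn else Fraction(0)
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return Fraction(0)
    return 2 * precision * recall / (precision + recall)


def macro_f1(expected, predicted, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1(*precision_recall(expected, predicted, c)) for c in classes]
    return sum(scores) / len(scores)


# Hypothetical binary rows (positive class "spam"): one ham row is flagged as spam.
bin_expected = ["spam", "spam", "ham", "ham", "ham"]
bin_predicted = ["spam", "spam", "ham", "ham", "spam"]
pass_rate = Fraction(
    sum(1 for e, p in zip(bin_expected, bin_predicted) if e == p), len(bin_expected)
)
bin_precision, _ = precision_recall(bin_expected, bin_predicted, "spam")

# Hypothetical multiclass rows: one support request is misrouted to payments.
mc_expected = ["payments", "payments", "support", "support", "support", "spam"]
mc_predicted = ["payments", "payments", "support", "support", "payments", "spam"]
mc_macro_f1 = macro_f1(mc_expected, mc_predicted, ["payments", "support", "spam"])

print(pass_rate, bin_precision, mc_macro_f1)  # 4/5 2/3 13/15
```

The per-row pass rate (4/5) and the precision (2/3) differ because precision only counts rows predicted as the positive class. In the multiclass case the single misroute lowers payments precision and support recall to 2/3 each, so both classes land at F1 = 0.8 and the macro average is 13/15, which equals 26/30.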

Why split this PR

PR #1397 added the Python implementation and registered the new evaluator type IDs (`uipath-binary-classification`, `uipath-multiclass-classification`) in the coded-evaluator discriminator, but didn't regenerate the JSON type files or add a runnable example. Without these, the evaluators are merged in name only.

Test plan

- `pytest tests/cli/eval/test_classification_samples_e2e.py`: both samples pass
- `ruff check tests/cli/eval/test_classification_samples_e2e.py`: clean
- `ruff format --check`: clean
- `cat packages/uipath/src/uipath/eval/evaluators_types/BinaryClassificationEvaluator.json` exposes `positive_class`, `metric_type`, `f_value` in `evaluatorConfigSchema.properties`
- `cat packages/uipath/src/uipath/eval/evaluators_types/MulticlassClassificationEvaluator.json` exposes `classes`, `averaging`, `metric_type`, `f_value`
- CI passes
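
The two `cat` checks could also be scripted. The sketch below assumes the generated files keep their config fields under `evaluatorConfigSchema.properties`, as stated for the binary schema; that the multiclass fields live under the same key is an assumption. `missing_config_fields` and `check_schema_file` are hypothetical helpers, not part of `uipath`, and `example_schema` is a stand-in dict rather than the real generated file.

```python
import json
from pathlib import Path


def missing_config_fields(schema, expected_fields):
    """Return expected fields absent from evaluatorConfigSchema.properties."""
    properties = schema.get("evaluatorConfigSchema", {}).get("properties", {})
    return sorted(set(expected_fields) - set(properties))


def check_schema_file(path, expected_fields):
    """Load a generated schema file and report any missing config fields."""
    schema = json.loads(Path(path).read_text(encoding="utf-8"))
    return missing_config_fields(schema, expected_fields)


# Stand-in for a generated schema; the real file has more keys than shown here.
example_schema = {
    "evaluatorConfigSchema": {
        "properties": {"positive_class": {}, "metric_type": {}, "f_value": {}}
    }
}

binary_missing = missing_config_fields(
    example_schema, ["positive_class", "metric_type", "f_value"]
)
multiclass_missing = missing_config_fields(
    example_schema, ["classes", "averaging", "metric_type", "f_value"]
)
print(binary_missing)      # []
print(multiclass_missing)  # ['averaging', 'classes']
```

Pointing `check_schema_file` at each JSON file under `evaluators_types/` should return an empty list when the listed fields are present.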

Related PRs

chore(eval): resync evaluator type schemas with Python source #1664 — companion PR that refreshes the 11 unrelated stale schemas in the same directory (split out for review hygiene; no functional overlap with this PR).
UiPath/cli#2128 — TypeScript-side flow-tool registry entries that wire these evaluators into the Flow UI evaluator picker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

🤖 Generated with Claude Code

Generates BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json from the new evaluators added in #1397 so external tooling (Flow UI evaluator picker, `uip maestro flow eval`) can read the config / criteria / justification schemas.

Files produced by `python -m uipath.eval.evaluators_types.generate_types`, restricted to the two new evaluator types. A companion PR refreshes the other 11 stale schemas in evaluators_types/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tors

Adds two sample projects under packages/uipath/samples/ that double as end-to-end test fixtures for the binary and multiclass classification evaluators added in #1397:

- binary_classification_agent: rule-based spam/ham classifier wired up to the binary classification evaluator with metric_type=precision. Eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive.
- multiclass_classification_simple: rule-based 3-class router (payments / support / spam) wired up to the multiclass classification evaluator with macro-averaged F1. Eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = 26/30.

Adds tests/cli/eval/test_classification_samples_e2e.py which loads each sample's eval-sets/default.json, wires its main.py into a stand-in runtime, calls evaluate(), and asserts both the per-row scores and the aggregated metric produced by reduce_scores. Locks in the dataset-level math, not just per-row correct/incorrect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-05-20T01:30:48Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

ajay-kesavan · 2026-05-22T03:38:18Z

Superseded by #1674 (ClassifierEvaluator). The schema/sample work here was replaced by the simpler single-evaluator approach.

ajay-kesavan force-pushed the feat/classification-evaluator-types branch from 6931598 to 6b11767 Compare May 20, 2026 00:54

ajay-kesavan mentioned this pull request May 20, 2026

chore(eval): resync evaluator type schemas with Python source #1664

Draft

3 tasks

ajay-kesavan changed the title ~~chore(eval): regenerate evaluator type schemas with classification evaluators~~ feat(eval): add evaluator type schemas for classification evaluators May 20, 2026

ajay-kesavan changed the title ~~feat(eval): add evaluator type schemas for classification evaluators~~ feat(eval): classification evaluator schemas + sample projects + e2e tests May 20, 2026

ajay-kesavan closed this May 22, 2026

ajay-kesavan wants to merge 2 commits into main from feat/classification-evaluator-types
