feat(eval): classification evaluator schemas + sample projects + e2e tests #1663
Closed
ajay-kesavan wants to merge 2 commits into
Generates BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json from the new evaluators added in #1397 so external tooling (Flow UI evaluator picker, `uip maestro flow eval`) can read the config / criteria / justification schemas. Files produced by `python -m uipath.eval.evaluators_types.generate_types`, restricted to the two new evaluator types. A companion PR refreshes the other 11 stale schemas in evaluators_types/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
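A consumer of these schemas could check the config fields roughly as follows. This is only a sketch: the inline JSON is a hypothetical, trimmed stand-in for the generated file (only the `evaluatorConfigSchema.properties` key path and the three field names come from this PR's test plan), and `missing_config_fields` is an illustrative helper, not part of `uipath`.

```python
import json

# Hypothetical, trimmed stand-in for BinaryClassificationEvaluator.json.
# The real generated file is assumed to contain more keys than shown here.
SCHEMA_TEXT = """
{
  "evaluatorConfigSchema": {
    "properties": {
      "positive_class": {"type": "string"},
      "metric_type": {"type": "string"},
      "f_value": {"type": "number"}
    }
  }
}
"""


def missing_config_fields(schema: dict, expected: set) -> set:
    """Return the expected config fields that the schema does not expose."""
    props = schema.get("evaluatorConfigSchema", {}).get("properties", {})
    return expected - set(props)


schema = json.loads(SCHEMA_TEXT)
missing = missing_config_fields(schema, {"positive_class", "metric_type", "f_value"})
print(sorted(missing))
```

Pointing `json.loads` at the contents of the real file instead of `SCHEMA_TEXT` would turn the manual `cat` check into an assertion.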
6931598 to 6b11767
3 tasks
…tors

Adds two sample projects under packages/uipath/samples/ that double as end-to-end test fixtures for the binary and multiclass classification evaluators added in #1397:

- binary_classification_agent: rule-based spam/ham classifier wired up to the binary classification evaluator with metric_type=precision. The eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive.
- multiclass_classification_simple: rule-based 3-class router (payments / support / spam) wired up to the multiclass classification evaluator with macro-averaged F1. The eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = 26/30.

Adds tests/cli/eval/test_classification_samples_e2e.py, which loads each sample's eval-sets/default.json, wires its main.py into a stand-in runtime, calls evaluate(), and asserts both the per-row scores and the aggregated metric produced by reduce_scores. This locks in the dataset-level math, not just per-row correct/incorrect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
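The gap between the 4/5 pass rate and the 2/3 precision can be reproduced with a few lines of arithmetic. The rows below are a hypothetical reconstruction that matches the stated counts, not the sample's actual eval set, and treating `spam` as the positive class is an assumption.

```python
from fractions import Fraction

POSITIVE = "spam"  # assumed positive class

# Hypothetical (expected, predicted) pairs consistent with the commit message:
# five datapoints, one deliberate false positive.
rows = [
    ("spam", "spam"),
    ("spam", "spam"),
    ("ham", "spam"),  # the false positive
    ("ham", "ham"),
    ("ham", "ham"),
]

# Per-row view: a datapoint passes when the prediction matches the label.
pass_rate = Fraction(sum(e == p for e, p in rows), len(rows))

# Dataset-level view: precision = TP / (TP + FP) for the positive class.
tp = sum(e == POSITIVE and p == POSITIVE for e, p in rows)
fp = sum(e != POSITIVE and p == POSITIVE for e, p in rows)
precision = Fraction(tp, tp + fp)

print(f"pass rate = {pass_rate}, precision = {precision}")
```

True negatives raise the pass rate but do not enter the precision formula, which is why one false positive costs 1/5 in the first figure and 1/3 in the second.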
Author
Superseded by #1674 (ClassifierEvaluator). The schema/sample work here was replaced by the simpler single-evaluator approach.



Summary
Completes the classification evaluator feature shipped in #1397 by adding the three pieces that PR didn't carry:
- Generated type schemas: `BinaryClassificationEvaluator.json` and `MulticlassClassificationEvaluator.json` under `packages/uipath/src/uipath/eval/evaluators_types/`, produced by `python -m uipath.eval.evaluators_types.generate_types`. These are the machine-readable schemas that external tooling (Flow UI evaluator picker, `uip maestro flow eval`) uses to learn each evaluator's config / criteria / justification shape.
- Sample projects under `packages/uipath/samples/`:
  - `binary_classification_agent/`: rule-based spam/ham classifier wired to the binary classification evaluator with `metric_type=precision`. The eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive, which demonstrates the dataset-level metric diverging from a simple per-row pass rate.
  - `multiclass_classification_simple/`: rule-based 3-class router (payments / support / spam) wired to the multiclass classification evaluator with `averaging=macro`. The eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = (0.8 + 0.8 + 1.0) / 3.
- End-to-end test at `packages/uipath/tests/cli/eval/test_classification_samples_e2e.py`: loads each sample's eval set, wires its `main.py` into a stand-in runtime, calls `evaluate()`, and asserts both the per-row scores and the aggregated metric produced by `reduce_scores`. This locks in the dataset-level math.

Why split this PR
PR #1397 added the Python implementation and registered the new evaluator type IDs (`uipath-binary-classification`, `uipath-multiclass-classification`) in the coded-evaluator discriminator, but didn't regenerate the JSON type files or add a runnable example. Without these, the evaluators are merged in name only.

Test plan
- `pytest tests/cli/eval/test_classification_samples_e2e.py`: both samples pass
- `ruff check tests/cli/eval/test_classification_samples_e2e.py`: clean
- `ruff format --check`: clean
- `cat packages/uipath/src/uipath/eval/evaluators_types/BinaryClassificationEvaluator.json` exposes `positive_class`, `metric_type`, `f_value` in `evaluatorConfigSchema.properties`
- `cat packages/uipath/src/uipath/eval/evaluators_types/MulticlassClassificationEvaluator.json` exposes `classes`, `averaging`, `metric_type`, `f_value`

Related PRs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 Generated with Claude Code
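
The macro F1 figure quoted in the Summary, (0.8 + 0.8 + 1.0) / 3, can be checked with a short sketch. The rows are a hypothetical reconstruction chosen to reproduce the stated per-class F1 values (one support item misrouted to payments); they are not the sample's actual eval set, and `f1_for` is an illustrative helper rather than the evaluator's implementation.

```python
from fractions import Fraction

# Hypothetical (expected, predicted) pairs; one support item is misrouted
# to payments, which lowers payments precision and support recall.
rows = [
    ("payments", "payments"),
    ("payments", "payments"),
    ("support", "support"),
    ("support", "support"),
    ("support", "payments"),  # the forced misroute
    ("spam", "spam"),
]


def f1_for(label, pairs):
    """One-vs-rest F1 for a class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for e, p in pairs if e == label and p == label)
    fp = sum(1 for e, p in pairs if e != label and p == label)
    fn = sum(1 for e, p in pairs if e == label and p != label)
    denom = 2 * tp + fp + fn
    return Fraction(2 * tp, denom) if denom else Fraction(0)


classes = ["payments", "support", "spam"]
per_class = {c: f1_for(c, rows) for c in classes}

# Macro averaging: unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class.values(), Fraction(0)) / len(classes)

print(per_class, macro_f1, float(macro_f1))
```

Because macro averaging weights every class equally, the single misroute is counted twice (once in each affected class) while the untouched spam class contributes a perfect 1.0.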