Intermittent Connection Error with Evaluation for Custom Evaluators #46627
Open
Labels
AI Projects, Evaluation, Service Attention, bug, customer-reported, needs-team-attention
Describe the bug
When running an evaluation via the Azure AI Foundry Evals API with custom (prompty-based) evaluators, individual evaluator executions intermittently fail with FAILED_EXECUTION / "Connection error." The overall run status is still reported as completed, but the affected items have score: null. The same evaluator succeeds in other runs with no configuration changes.
To Reproduce
Steps to reproduce the behavior:
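The failure is timing-dependent, but for context, here is a minimal sketch of the call pattern involved, assuming the project's OpenAI-compatible evals surface via the openai Python package. The endpoint, key, item schema, and data_source payload are placeholders, and the azure_ai_evaluator grader entries are inferred from the run output below rather than a verified schema:

```python
import time
from openai import OpenAI

# Assumption: the Foundry project is reached through its OpenAI-compatible
# endpoint; base_url and api_key are placeholders.
client = OpenAI(base_url="https://<project-endpoint>/openai/v1", api_key="<key>")

# Register an eval whose graders are custom (prompty-based) evaluators.
# The azure_ai_evaluator entries are sketched from the output in this
# report, not a documented contract.
ev = client.evals.create(
    name="ifd-handoff-eval",
    data_source_config={
        "type": "custom",
        "item_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
        "include_sample_schema": True,
    },
    testing_criteria=[
        {"type": "azure_ai_evaluator", "name": "ifd_handoff_completeness"},
        {"type": "azure_ai_evaluator", "name": "ifd_handoff_accuracy"},
    ],
)

# Kick off a run over inline items and poll until it settles.
run = client.evals.runs.create(
    ev.id,
    name="repro-run",
    data_source={
        "type": "jsonl",
        "source": {"type": "file_content", "content": [{"item": {"query": "..."}}]},
    },
)
while True:
    run = client.evals.runs.retrieve(run.id, eval_id=ev.id)
    if run.status in ("completed", "failed", "canceled"):
        break
    time.sleep(10)

print(run.status)  # reports "completed" even when individual evaluators errored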
Expected behavior
Custom evaluators should either complete successfully or be retried automatically on transient connection failures.
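Until the service retries internally, a client-side workaround is to scan a run's output items for the transient error and resubmit the affected rows. A minimal sketch, assuming the same OpenAI-compatible client as above; make_run is a hypothetical helper, and the error shape is taken from the output in this report:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://<project-endpoint>/openai/v1", api_key="<key>")  # placeholders

def items_with_connection_errors(eval_id: str, run_id: str) -> list[dict]:
    """Output items whose evaluator results carry the intermittent
    FAILED_EXECUTION / "Connection error." failure."""
    failed = []
    for item in client.evals.runs.output_items.list(run_id, eval_id=eval_id):
        data = item.model_dump()
        for r in data.get("results") or []:
            err = (r.get("sample") or {}).get("error") or {}
            if err.get("code") == "FAILED_EXECUTION" and "Connection error" in (err.get("message") or ""):
                failed.append(data)
                break
    return failed

def rerun_until_clean(eval_id: str, run_id: str, make_run, max_attempts: int = 3) -> str:
    """Re-run failed rows with backoff. make_run(items) -> new run id is a
    hypothetical helper that resubmits just those data-source rows."""
    for attempt in range(max_attempts):
        failed = items_with_connection_errors(eval_id, run_id)
        if not failed:
            return run_id
        time.sleep(2 ** attempt * 10)  # back off before retrying the transient failure
        run_id = make_run(failed)
    return run_id
```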
Screenshots
Run 1: all items succeeded (all inputs were very similar).

Result 1:

"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": 5.0,
    "label": "pass",
    "threshold": 3,
    "passed": true
  }
]

Result 2:

"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": 5.0,
    "label": "pass",
    "threshold": 3,
    "passed": true
  }
]

Run 2: only one of two items succeeded; the other failed with connection errors.

Result 1:

"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": 4.0,
    "label": "pass",
    "reason": "The tool call includes case CAS-4941DA05 with mismatch details (Lumity vs Life), user ID, and description/context. However, missing critical items like order number, customer contact details, and explicit urgency prevent a fully handoff-ready payload.",
    "threshold": 3.0,
    "passed": true
  }
]

Result 2 (connection errors):

"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_accuracy",
    "metric": "ifd_handoff_accuracy",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": false
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": false
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_accuracy",
    "metric": "result",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": null,
    "sample": { "error": { "code": "FAILED_EXECUTION", "message": "Connection error." } }
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "result",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": null,
    "sample": { "error": { "code": "FAILED_EXECUTION", "message": "Connection error." } }
  }
]

Additional context
- Intermittent: identical runs succeed without any changes.
- Only custom (prompty-based) evaluators are affected; built-in evaluators (e.g. builtin.tool_selection) in the same run are unaffected.
- When a custom evaluator errors, it appears twice in the output item results: once with passed: false, label: "NaN", score: null, and once as a separate "metric": "result" entry carrying the FAILED_EXECUTION error (see the sketch after this list).
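To illustrate the double-reporting, a small sketch that collapses the paired entries from a results array like the one in Run 2 / Result 2 above (plain data handling, no SDK assumptions):

```python
def collapse_failed_evaluators(results: list[dict]) -> dict[str, dict]:
    """Group the two rows a failed custom evaluator emits: the score-less
    row (passed: false, score: null) and the 'metric': 'result' row that
    carries the FAILED_EXECUTION error."""
    by_name: dict[str, dict] = {}
    for r in results:
        entry = by_name.setdefault(r["name"], {})
        if r.get("metric") == "result":
            entry["error"] = (r.get("sample") or {}).get("error")
        elif r.get("score") is None:
            entry["score_row"] = r
    # Keep only evaluators that actually carried an error row.
    return {name: e for name, e in by_name.items() if e.get("error")}

# With the Run 2 / Result 2 payload above, this yields one entry each for
# ifd_handoff_accuracy and ifd_handoff_completeness instead of four rows.
```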