
Intermittent Connection Error with Evaluation for Custom Evaluators #46627

@hante-sonova

  • Package Name: azure-ai-projects
  • Package Version: 2.1.0
  • Operating System: macOS
  • Python Version: Python 3.12.13

Describe the bug
When running an evaluation via the Azure AI Foundry Evals API with custom (prompty-based) evaluators, individual evaluator executions intermittently fail with the error code FAILED_EXECUTION and the message "Connection error.". The overall run status is still reported as completed, but affected items have score: null. The same evaluator succeeds in other runs without any change to the configuration.

To Reproduce
Steps to reproduce the behavior:

  1. Register a custom prompty-based evaluator via the Azure AI Projects SDK
  2. Create an evaluation with that custom evaluator as a testing criterion
  3. Create and run an evaluation run against a dataset with multiple items
  4. Observe output items — some evaluator results contain "code": "FAILED_EXECUTION", "message": "Connection error." with score: null
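
For step 4, a small helper can make the affected items easier to spot. The following is a minimal sketch that works on plain dictionaries shaped like the "results" arrays shown later in this report; the function name `failed_evaluator_results` and the trimmed sample data are illustrative assumptions, not part of the SDK.

```python
from typing import Any


def failed_evaluator_results(results: list[dict[str, Any]]) -> list[tuple[str, str, str]]:
    """Return (evaluator name, error code, error message) for each errored result."""
    failures = []
    for result in results:
        # Errored entries carry the error under "sample" -> "error".
        error = (result.get("sample") or {}).get("error")
        if error:
            failures.append(
                (result.get("name", ""), error.get("code", ""), error.get("message", ""))
            )
    return failures


# Trimmed example data, shaped like the output item results in this report.
example_results = [
    {"name": "ifd_handoff_completeness", "metric": "ifd_handoff_completeness",
     "score": 4.0, "passed": True},
    {"name": "ifd_handoff_accuracy", "metric": "result", "score": None, "passed": None,
     "sample": {"error": {"code": "FAILED_EXECUTION", "message": "Connection error."}}},
]

failures = failed_evaluator_results(example_results)
print(failures)
```

Because the overall run status is completed even when items fail, a check like this has to look at each output item rather than at the run status.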

Expected behavior
Custom evaluators should complete successfully or retry on transient connection failures.
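
To illustrate the kind of retry behavior meant here, the following is a generic sketch of retrying a call with exponential backoff. It is not the service's implementation: `retry_on_connection_error`, `flaky_evaluator`, and the use of Python's built-in `ConnectionError` as the transient failure are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_connection_error(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the original error.
            sleep(base_delay * 2 ** attempt)  # Wait 1x, 2x, 4x, ... base_delay.
    raise AssertionError("unreachable")


# Hypothetical evaluator that fails twice with a connection error, then succeeds.
calls = {"count": 0}


def flaky_evaluator() -> float:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Connection error.")
    return 5.0


delays: list[float] = []
# Record the delays instead of actually sleeping, to keep the example fast.
score = retry_on_connection_error(flaky_evaluator, sleep=delays.append)
print(score, delays)
```

The `sleep` parameter is injectable only so the example can run without waiting; a real caller would leave the default.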

Screenshots
Run 1: all evaluators succeeded (all inputs were very similar)


Result 1:
"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": 5.0,
    "label": "pass",
    "threshold": 3,
    "passed": true
  }
],
Result 2:
"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": 5.0,
    "label": "pass",
    "threshold": 3,
    "passed": true
  }
],

Run 2: only 1 of 2 items succeeded; the other failed with connection errors


Result 1:
"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": 4.0,
    "label": "pass",
    "reason": "The tool call includes case CAS-4941DA05 with mismatch details (Lumity vs Life), user ID, and description/context. However, missing critical items like order number, customer contact details, and explicit urgency prevent a fully handoff-ready payload.",
    "threshold": 3.0,
    "passed": true
  }
],
Result 2 (connection errors):
"results": [
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_accuracy",
    "metric": "ifd_handoff_accuracy",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": false
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "ifd_handoff_completeness",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": false
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_accuracy",
    "metric": "result",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": null,
    "sample": { "error": { "code": "FAILED_EXECUTION", "message": "Connection error." } }
  },
  {
    "type": "azure_ai_evaluator",
    "name": "ifd_handoff_completeness",
    "metric": "result",
    "score": null,
    "label": null,
    "reason": null,
    "threshold": null,
    "passed": null,
    "sample": { "error": { "code": "FAILED_EXECUTION", "message": "Connection error." } }
  }
],

Additional context
  • Intermittent: identical runs succeed without any changes
  • Only custom (prompty-based) evaluators are affected; built-in evaluators (e.g. builtin.tool_selection) in the same run are not affected
  • When a custom evaluator errors, it appears twice in the output item results: once with passed: false, label: "NaN", score: null, and once as a separate "metric": "result" entry with the FAILED_EXECUTION error
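
Since an errored evaluator shows up as two entries, anyone consuming the output may want to collapse them into one record per evaluator. The sketch below shows one way to do that on plain dictionaries; `summarize_by_evaluator` is a hypothetical helper, and the choice to report an errored evaluator as "errored" rather than "failed" (ignoring its passed: false entry) is an assumption about what is useful, not documented service behavior.

```python
from typing import Any


def summarize_by_evaluator(results: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Merge all result entries that share an evaluator name into one record."""
    summary: dict[str, dict[str, Any]] = {}
    for result in results:
        entry = summary.setdefault(
            result["name"], {"score": None, "passed": None, "error": None}
        )
        error = (result.get("sample") or {}).get("error")
        if error:
            # The separate "metric": "result" entry carries the error details.
            entry["error"] = error.get("message")
        elif result.get("score") is not None:
            # Only a real score counts; the null-score placeholder is ignored.
            entry["score"] = result["score"]
            entry["passed"] = result.get("passed")
    return summary


# Trimmed copy of the errored item from Run 2 in this report.
errored_item = [
    {"name": "ifd_handoff_accuracy", "metric": "ifd_handoff_accuracy",
     "score": None, "passed": False},
    {"name": "ifd_handoff_completeness", "metric": "ifd_handoff_completeness",
     "score": None, "passed": False},
    {"name": "ifd_handoff_accuracy", "metric": "result", "score": None, "passed": None,
     "sample": {"error": {"code": "FAILED_EXECUTION", "message": "Connection error."}}},
    {"name": "ifd_handoff_completeness", "metric": "result", "score": None, "passed": None,
     "sample": {"error": {"code": "FAILED_EXECUTION", "message": "Connection error."}}},
]

summary = summarize_by_evaluator(errored_item)
print(summary)
```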

Labels: AI Projects, Evaluation, Service Attention, bug, customer-reported, needs-team-attention
