Intermittent Connection Error with Evaluation for Custom Evaluators

- **Package Name**:  azure-ai-projects
- **Package Version**: 2.1.0
- **Operating System**: macOS
- **Python Version**: Python 3.12.13

**Describe the bug**
When running an evaluation via the Azure AI Foundry Evals API with custom (prompty-based) evaluators, individual evaluator executions intermittently fail with `FAILED_EXECUTION` / `"Connection error."`. The overall run status is still reported as `completed`, but the affected items have `score: null`. The same evaluator succeeds in other runs without any change to the configuration.

**To Reproduce**
Steps to reproduce the behavior:
1. Register a custom prompty-based evaluator via the Azure AI Projects SDK
2. Create an evaluation with that custom evaluator as a testing criterion
3. Create and run an evaluation run against a dataset with multiple items
4. Observe output items — some evaluator results contain "code": "FAILED_EXECUTION", "message": "Connection error." with score: null
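Step 4 can be automated with a small scan over the output items. The sketch below assumes the items have already been fetched and converted to plain dicts shaped like the JSON excerpts in this issue; the helper name `find_failed_results` and the `"id"` key on each item are illustrative assumptions, not part of the SDK.

```python
from typing import Any, Iterable


def find_failed_results(output_items: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one record per evaluator result that carries a sample error."""
    failures = []
    for item in output_items:
        for result in item.get("results", []):
            # In the payloads shown in this issue, error details sit under
            # result["sample"]["error"]; successful results have no such key.
            error = (result.get("sample") or {}).get("error")
            if error:
                failures.append(
                    {
                        "item_id": item.get("id"),  # assumed key name
                        "evaluator": result.get("name"),
                        "code": error.get("code"),
                        "message": error.get("message"),
                    }
                )
    return failures
```

Because the run itself reports `completed`, a check like this is currently the only way to notice the failures without inspecting each item by hand.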

**Expected behavior**
Custom evaluators should complete successfully or retry on transient connection failures.
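To illustrate the retry behavior being asked for, here is a minimal client-side sketch of exponential backoff around a single evaluator call. It is not how the service executes evaluators; `retry_on_connection_error` is a hypothetical helper, and Python's built-in `ConnectionError` stands in for whatever exception the underlying model call actually raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_connection_error(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke `call`, retrying with exponential backoff on ConnectionError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Delay doubles each attempt (base, 2*base, 4*base, ...),
            # plus up to 10% jitter so parallel retries do not align.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

With something along these lines applied per evaluator execution, a transient failure would only surface as `FAILED_EXECUTION` after the attempts are exhausted.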

**Screenshots**
Run 1: all items succeeded (all inputs are very similar)

<img width="2333" height="616" alt="Image" src="https://github.com/user-attachments/assets/6a29bd77-47b4-4e71-b232-d422eef694e4" />

Result 1:
```json
"results": [
    {
      "type": "azure_ai_evaluator",
      "name": "ifd_handoff_completeness",
      "metric": "ifd_handoff_completeness",
      "score": 5.0,
      "label": "pass",
      "threshold": 3,
      "passed": true
    }
  ],
```
Result 2:
```json
  "results": [
    {
      "type": "azure_ai_evaluator",
      "name": "ifd_handoff_completeness",
      "metric": "ifd_handoff_completeness",
      "score": 5.0,
      "label": "pass",
      "threshold": 3,
      "passed": true
    }
  ],
```

Run 2: only 1 of 2 items succeeded; the other failed with an error

<img width="2328" height="598" alt="Image" src="https://github.com/user-attachments/assets/2b57174f-4430-496f-ad01-5ca5a8362c46" />

Result 1:
```json
"results": [
    {
      "type": "azure_ai_evaluator",
      "name": "ifd_handoff_completeness",
      "metric": "ifd_handoff_completeness",
      "score": 4.0,
      "label": "pass",
      "reason": "The tool call includes case CAS-4941DA05 with mismatch details (Lumity vs Life), user ID, and description/context. However, missing critical items like order number, customer contact details, and explicit urgency prevent a fully handoff-ready payload.",
      "threshold": 3.0,
      "passed": true
    }
  ],
```
**Result 2 - CONNECTION ERRORS**
```json
"results": [
    {
      "type": "azure_ai_evaluator",
      "name": "ifd_handoff_accuracy",
      "metric": "ifd_handoff_accuracy",
      "score": null,
      "label": null,
      "reason": null,
      "threshold": null,
      "passed": false
    },
    {
      "type": "azure_ai_evaluator",
      "name": "ifd_handoff_completeness",
      "metric": "ifd_handoff_completeness",
      "score": null,
      "label": null,
      "reason": null,
      "threshold": null,
      "passed": false
    },
    {
      "type": "azure_ai_evaluator",
      "name": "ifd_handoff_accuracy",
      "metric": "result",
      "score": null,
      "label": null,
      "reason": null,
      "threshold": null,
      "passed": null,
      "sample": {
        "error": {
          "code": "FAILED_EXECUTION",
          "message": "Connection error."
        }
      }
    },
    {
      "type": "azure_ai_evaluator",
      "name": "ifd_handoff_completeness",
      "metric": "result",
      "score": null,
      "label": null,
      "reason": null,
      "threshold": null,
      "passed": null,
      "sample": {
        "error": {
          "code": "FAILED_EXECUTION",
          "message": "Connection error."
        }
      }
    }
  ],
```

**Additional context**
- Intermittent: identical runs succeed without any changes.
- Only custom (prompty-based) evaluators are affected; built-in evaluators (e.g. `builtin.tool_selection`) in the same run are not affected.
- When a custom evaluator errors, it appears twice in the output item results: once with `passed: false`, `label: "NaN"`, `score: null`, and once as a separate `"metric": "result"` entry with the `FAILED_EXECUTION` error.
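As a possible client-side workaround for the duplicated entries, the results of one output item can be collapsed to a single record per evaluator, so that "failed to execute" is distinguishable from "scored as a fail". This is only a sketch over plain dicts shaped like the excerpts above; `collapse_results` and the merged record layout are illustrative assumptions.

```python
from typing import Any


def collapse_results(results: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Merge all result entries for the same evaluator name into one record."""
    merged: dict[str, dict[str, Any]] = {}
    for result in results:
        record = merged.setdefault(
            result["name"], {"score": None, "passed": None, "error": None}
        )
        error = (result.get("sample") or {}).get("error")
        if error:
            # The separate "metric": "result" entry carries the error payload.
            record["error"] = error
        elif result.get("score") is not None:
            # A real score only exists when the evaluator actually ran.
            record["score"] = result["score"]
            record["passed"] = result.get("passed")
    return merged
```

For the failing item in Run 2 this yields one record each for `ifd_handoff_accuracy` and `ifd_handoff_completeness`, with `score` and `passed` left as `None` and `error` set, instead of a misleading `passed: false`.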

