
Add tag probabilities to AI Guard evaluation result#5580

Merged
y9v merged 1 commit into master from ai-guard-tag-probabilities
Apr 13, 2026

Conversation

@y9v y9v (Member) commented Apr 10, 2026

What does this PR do?
This PR adds the AIGuard::Evaluation::Result#tag_probabilities attribute, exposing the tag_probs field from the AI Guard API response.

Motivation:
We want to expose evaluation result tag probabilities in our SDK and attach them to the metastruct.
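A minimal sketch of how the attribute described above could surface the API field. Only the attribute name #tag_probabilities and the tag_probs response key come from this PR; the surrounding class shape and the example probability values are assumptions for illustration.

```ruby
# Illustrative sketch, not the actual implementation in the dd-trace-rb gem.
module AIGuard
  module Evaluation
    class Result
      # Hash of tag name => probability, taken from the API response's
      # "tag_probs" field (assumed shape, e.g. {"prompt_injection" => 0.92}).
      attr_reader :tag_probabilities

      def initialize(raw_response_body)
        @tag_probabilities = raw_response_body["tag_probs"] || {}
      end
    end
  end
end
```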

Change log entry
Yes. AI Guard evaluation results now include tag probabilities from the API response.

Additional Notes:
APPSEC-61897

How to test the change?
CI and manual testing

@y9v y9v self-assigned this Apr 10, 2026
@y9v y9v requested review from a team as code owners April 10, 2026 16:05

github-actions bot commented Apr 10, 2026

Typing analysis

Note: Ignored files are excluded from the next sections.

steep:ignore comments

This PR introduces 1 steep:ignore comment, and clears 1 steep:ignore comment.

steep:ignore comments (+1, -1)
Introduced:
lib/datadog/ai_guard/evaluation.rb:66
Cleared:
lib/datadog/ai_guard/evaluation.rb:65

Untyped methods

This PR introduces 1 partially typed method, and clears 1 partially typed method.

Partially typed methods (+1, -1)
Introduced:
sig/datadog/ai_guard/evaluation/result.rbs:29
└── def initialize: (::Hash[::String, untyped] raw_response_body) -> void
Cleared:
sig/datadog/ai_guard/evaluation/result.rbs:28
└── def initialize: (::Hash[::String, untyped] raw_response_body) -> void

If you believe a method or an attribute is rightfully untyped or partially typed, you can add # untyped:accept on the line before the definition to remove it from the stats.
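The opt-out the bot describes is a plain comment on the line before the definition. A minimal sketch, assuming a hypothetical class (the Example name and method body are illustrative; only the # untyped:accept directive comes from the note above):

```ruby
# Hypothetical class illustrating where the directive goes; the comment on
# the line before the definition excludes the method from the typing stats.
class Example
  # untyped:accept
  def initialize(raw_response_body)
    @raw_response_body = raw_response_body
  end
end
```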

datadog-datadog-prod-us1 bot (Contributor) commented Apr 10, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 94.12%
Overall Coverage: 95.35% (+0.05%)

🔗 Commit SHA: 672a7c7

@y9v y9v force-pushed the ai-guard-tag-probabilities branch from 0c20b88 to 672a7c7 Compare April 13, 2026 11:30

pr-commenter bot commented Apr 13, 2026

Benchmarks

Benchmark execution time: 2026-04-13 11:56:19

Comparing candidate commit 672a7c7 in PR branch ai-guard-tag-probabilities with baseline commit b9193de in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics; 1 metric was unstable.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.
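The significance rule above can be sketched as a single check: the change is significant only when the whole confidence interval lies outside the band [-threshold, +threshold]. This is an illustrative sketch, not the benchmarking platform's actual code; the method name and argument conventions are assumptions.

```ruby
# ci_low/ci_high: bounds of the CI on the relative difference of means
# (e.g. 0.013 for +1.3%); threshold: the SIGNIFICANT_IMPACT_THRESHOLD.
# Significant only if the entire interval is above +threshold (worse for a
# time metric) or entirely below -threshold (better).
def significant?(ci_low, ci_high, threshold)
  ci_low > threshold || ci_high < -threshold
end
```

For the worse-performance diagram below (CI of +1.3% to +3.1% against a 1% threshold), this check returns true; for an interval straddling the threshold it returns false.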

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

@y9v y9v merged commit 103d9c5 into master Apr 13, 2026
629 checks passed
@y9v y9v deleted the ai-guard-tag-probabilities branch April 13, 2026 12:26
@github-actions github-actions bot added this to the 2.31.0 milestone Apr 13, 2026
@y9v y9v mentioned this pull request Apr 16, 2026

2 participants