MAF-19750: feat(deploy): provision Grafana alerts for Heimdall via sidecar#130

Open
seongsu-dev wants to merge 5 commits into main from MAF-19750-mif-alert-provisioning

@seongsu-dev
Contributor

Summary

  • Adds file-based Grafana Unified Alerting provisioning to the MIF chart so that installing MIF immediately wires up the Heimdall error-log alert pipeline.
  • Mirrors the existing templates/grafana/dashboard-configmap.yaml pattern: each file under moai-inference-framework/files/alerts/*.yaml becomes one ConfigMap labelled grafana_alert=1, picked up by the grafana-sc-alerts sidecar and mounted into /etc/grafana/provisioning/alerting/.
  • Externalises cluster-specific values via __GRAFANA_URL__ (Slack deep links) and __RECEIVER__ (contact point name) placeholders substituted with replace, so that Grafana's own Go template syntax inside alert rules (e.g. {{ printf "%.180s" .message }} inside LogQL) is preserved as raw text.
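The choice of literal `replace` substitution over template evaluation can be illustrated outside Helm. The following is a minimal Python sketch, not the chart's template: the function name `substitute_placeholders` and the sample alert text are assumptions made for illustration. It shows that plain string replacement swaps the two placeholders while leaving Grafana's `{{ ... }}` expressions untouched.

```python
def substitute_placeholders(raw: str, grafana_url: str, receiver: str) -> str:
    """Replace the chart's placeholders literally, with no template evaluation."""
    return raw.replace("__GRAFANA_URL__", grafana_url).replace("__RECEIVER__", receiver)


# Hypothetical alert-file excerpt; the real files live under files/alerts/.
SAMPLE = (
    'summary: \'{{ printf "%.180s" .message }}\'\n'
    "exploreURL: __GRAFANA_URL__/explore\n"
    "receiver: __RECEIVER__\n"
)

rendered = substitute_placeholders(
    SAMPLE, "https://grafana.example.com", "heimdall-slack"
)
print(rendered)
```

Because nothing in this path parses `{{` or `}}`, the Grafana expression reaches the output byte for byte, which is the property the PR relies on.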

Background

This is the productisation of the PoC validated on the p-cluster mif release, where three ConfigMaps (rules / templates / policies) were applied directly to deliver an end-to-end pipeline: Heimdall error logs → Loki → Grafana Unified Alerting → Slack.

The PoC's environment-specific bits (namespace="seongsu", owner: seongsu, environment: p-cluster, hardcoded grafana.product.moreh.dev URLs) are removed here. The LogQL selector {app="heimdall-inference-scheduler", level="error"} is chart-name based, so a single alert rule covers every Heimdall release in the cluster without per-release duplication.

Files

  • templates/grafana/alert-configmap.yaml — new template, replicates the dashboard ConfigMap pattern.
  • files/alerts/heimdall-rules.yaml — error-log-burst alert rule.
  • files/alerts/heimdall-templates.yaml — Slack notification template.
  • files/alerts/heimdall-policies.yaml — routing policy (routes component=heimdall to __RECEIVER__).
  • values.yaml — enables prometheus-stack.grafana.sidecar.alerts and adds the top-level alerts.heimdall.{enabled, grafanaURL, receiver} section.
  • deploy/helm/AGENTS.md — new "Alert Provisioning" section documenting the pattern, the out-of-scope contact point boundary, and the explicit prohibition on wrapping alert ConfigMap data with tpl.
  • deploy/helm/moai-inference-framework/README.md — regenerated by make helm-docs.
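The one-ConfigMap-per-file pattern and its two gates can be modelled roughly as below. This is a Python sketch of the idea only, not the Helm template itself: `render_alert_configmaps` is a hypothetical helper, and the `<release>-alert-<file stem>` naming is an assumption inferred from the `*-alert-heimdall-{rules,templates,policies}` names in the test plan.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def render_alert_configmaps(
    release: str,
    files: Dict[str, str],
    heimdall_enabled: bool,
    sidecar_enabled: bool,
) -> List[dict]:
    """Return one ConfigMap-shaped dict per alert file, or none if gated off."""
    # Both switches must be on; either one being false renders nothing.
    if not (heimdall_enabled and sidecar_enabled):
        return []
    configmaps = []
    for path, content in sorted(files.items()):
        name = PurePosixPath(path)
        configmaps.append(
            {
                "apiVersion": "v1",
                "kind": "ConfigMap",
                "metadata": {
                    # Naming scheme is an assumption, see the lead-in.
                    "name": f"{release}-alert-{name.stem}",
                    # Label the alerts sidecar watches for.
                    "labels": {"grafana_alert": "1"},
                },
                "data": {name.name: content},
            }
        )
    return configmaps


# Placeholder file contents; only the paths matter for this illustration.
alert_files = {
    "files/alerts/heimdall-rules.yaml": "apiVersion: 1\n",
    "files/alerts/heimdall-templates.yaml": "apiVersion: 1\n",
    "files/alerts/heimdall-policies.yaml": "apiVersion: 1\n",
}
for cm in render_alert_configmaps("mif", alert_files, True, True):
    print(cm["metadata"]["name"])
```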

Out of scope (follow-up work)

  • Contact point provisioning. The Slack webhook URL is a secret and must be supplied through a separate Secret-backed Grafana provisioning file (or created via the UI). The chart only references the contact point name through alerts.heimdall.receiver.
  • moreh-iac integration. A later change to moreh-iac/SNUSHC/p-cluster/mif/mif.tf should bump the chart version and set alerts.heimdall.grafanaURL=https://grafana.product.moreh.dev/. Once that lands, the temporary ConfigMaps applied during the PoC can be removed.
  • Additional alert rules. Heimdall panic detection, gRPC 5xx burst, responses-store backend errors, etc. would land as additional files/alerts/heimdall-<category>.yaml files in follow-up PRs.

Test plan

  • helm lint deploy/helm/moai-inference-framework — passes.
  • helm template with --set alerts.heimdall.grafanaURL=https://grafana.example.com --set alerts.heimdall.receiver=heimdall-slack renders three ConfigMaps (*-alert-heimdall-{rules,templates,policies}) with the grafana_alert: "1" label.
  • kubectl apply --dry-run=client on the rendered alert ConfigMaps — created (dry run) for all three.
  • ConfigMap data parses as valid YAML and matches the Grafana provisioning schema (required apiVersion, groups/templates/policies; alert rule uid/title/condition/data/noDataState/execErrState; object_matchers 3-tuple form).
  • LogQL selector is chart-name based (app="heimdall-inference-scheduler") with no namespace pin, so all Heimdall releases are covered by a single rule.
  • __GRAFANA_URL__ and __RECEIVER__ placeholders are fully substituted in the rendered output, with no residue.
  • Grafana's own Go template syntax ({{ printf "%.180s" .message }}, {{ define "heimdall-slack.title" }}, {{ if .Annotations.exploreURL }}) is preserved as raw text in the rendered ConfigMap (Helm does not evaluate it).
  • alerts.heimdall.enabled=false and prometheus-stack.grafana.sidecar.alerts.enabled=false each gate the ConfigMaps off (0 rendered).
  • The rendered Grafana Deployment includes the grafana-sc-alerts sidecar container with LABEL=grafana_alert, FOLDER=/etc/grafana/provisioning/alerting, and the correct reload URL.
  • make helm-docs regenerates deploy/helm/moai-inference-framework/README.md to include the new alerts.heimdall.* and prometheus-stack.grafana.sidecar.alerts.enabled keys.
  • End-to-end validation in a dev cluster (PoC already proved the underlying pipeline; deferred unless a reviewer wants a fresh dry run).
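Two of the checks above (no placeholder residue, Grafana template syntax preserved) lend themselves to automation. The sketch below is one possible way to do that in Python; the helper names and the placeholder regex are assumptions, and the input is expected to be the text produced by `helm template`.

```python
import re
from typing import Iterable, List

# Matches tokens shaped like __GRAFANA_URL__ or __RECEIVER__ (assumed convention).
PLACEHOLDER = re.compile(r"__[A-Z0-9]+(?:_[A-Z0-9]+)*__")


def find_placeholder_residue(rendered: str) -> List[str]:
    """Return the distinct placeholder tokens still present in rendered output."""
    return sorted(set(PLACEHOLDER.findall(rendered)))


def missing_go_templates(rendered: str, expected: Iterable[str]) -> List[str]:
    """Return the expected Grafana template snippets that did not survive rendering."""
    return [snippet for snippet in expected if snippet not in rendered]


# Snippets quoted from the test plan above.
EXPECTED_SNIPPETS = [
    '{{ printf "%.180s" .message }}',
    '{{ define "heimdall-slack.title" }}',
    "{{ if .Annotations.exploreURL }}",
]
```

A CI step could feed the rendered manifest to both helpers and fail when either returns a non-empty list.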

🤖 Generated with Claude Code

…decar

Add file-based Grafana Unified Alerting provisioning to the MIF chart so that
installing MIF immediately wires up the Heimdall error-log alert pipeline that
was validated as a PoC in the p-cluster `mif` release.

The new `templates/grafana/alert-configmap.yaml` mirrors the existing
`dashboard-configmap.yaml`: it iterates over `files/alerts/*.yaml` and emits one
ConfigMap per file with the `grafana_alert` label, which the
`grafana-sc-alerts` sidecar picks up and mounts into
`/etc/grafana/provisioning/alerting/`. Cluster-specific values
(`__GRAFANA_URL__` for Slack deep links, `__RECEIVER__` for the contact point
name) are substituted from chart values via `replace` rather than `tpl`, so
that Grafana's own Go template syntax embedded in alert rules (e.g.
`{{ printf "%.180s" .message }}` inside LogQL) is preserved as raw text.

Contact points (Slack webhook URLs) are intentionally out of scope because the
webhook is a secret. Operators must create the contact point named by
`alerts.heimdall.receiver` separately via the Grafana UI or a Secret-backed
provisioning file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 14, 2026 07:41
@seongsu-dev seongsu-dev requested a review from a team as a code owner May 14, 2026 07:41
Contributor

Copilot AI left a comment

Pull request overview

Adds Helm-managed file-based Grafana Unified Alerting provisioning to the MIF chart, mirroring the existing dashboard ConfigMap pattern so that Heimdall error-log alerts (LogQL → Loki → Slack) are installed automatically alongside the chart. Cluster-specific values are passed via __GRAFANA_URL__ / __RECEIVER__ placeholders that are replaced literally (rather than via tpl) so Grafana's own Go template syntax inside the alert YAML survives Helm rendering.

Changes:

  • New templates/grafana/alert-configmap.yaml that generates one ConfigMap per file under files/alerts/*.yaml, gated by prometheus-stack.grafana.sidecar.alerts.enabled and alerts.heimdall.enabled.
  • New alert content: heimdall-rules.yaml (error-log-burst rule), heimdall-templates.yaml (Slack message templates), and heimdall-policies.yaml (component=heimdall routing policy).
  • values.yaml, README.md, and AGENTS.md updated to expose the new alerts.heimdall.* keys, enable the alerts sidecar, and document the no-tpl constraint and out-of-scope contact-point boundary.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| deploy/helm/moai-inference-framework/templates/grafana/alert-configmap.yaml | New template rendering one ConfigMap per alert file with placeholder substitution. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-rules.yaml | Error-log-burst alert rule with Explore/rule deep links using __GRAFANA_URL__. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-templates.yaml | Slack notification templates for Heimdall alerts. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-policies.yaml | Routing policy mapping component=heimdall to __RECEIVER__. |
| deploy/helm/moai-inference-framework/values.yaml | Enables prometheus-stack.grafana.sidecar.alerts and adds alerts.heimdall.{enabled,grafanaURL,receiver}. |
| deploy/helm/moai-inference-framework/README.md | Regenerated docs for the new values keys. |
| deploy/helm/AGENTS.md | Documents the alert-provisioning pattern and the tpl prohibition. |

Comment thread deploy/helm/moai-inference-framework/templates/grafana/alert-configmap.yaml Outdated
…fanaURL

Prevents double-slash URLs in Slack notification links when operators configure
`alerts.heimdall.grafanaURL` with a trailing slash (e.g.
`https://grafana.example.com/`). Without trimming, the alert rule annotations
would render `https://grafana.example.com//explore?...` and
`https://grafana.example.com//alerting/...`, which most browsers tolerate but
reverse proxies and OAuth redirect path matchers may reject.

Apply `trimSuffix "/"` to the value before substituting `__GRAFANA_URL__`, so
both `https://grafana.example.com` and `https://grafana.example.com/` produce
the same single-slash result. Also document the trimming behavior in the
values.yaml comment and add a "Placeholder conventions" subsection to
deploy/helm/AGENTS.md so authors of future alert files use the same pattern.
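The effect of the trim can be shown with a small Python sketch. It is an illustration rather than the chart code: `normalize_grafana_url` and `explore_link` are hypothetical names, and the function removes a single trailing slash, which is how a one-shot suffix trim such as `trimSuffix "/"` behaves.

```python
def normalize_grafana_url(url: str) -> str:
    """Drop one trailing slash, if present, before the URL is substituted."""
    return url[:-1] if url.endswith("/") else url


def explore_link(grafana_url: str) -> str:
    """Build a deep link the way the alert annotations append a path."""
    return f"{normalize_grafana_url(grafana_url)}/explore"


print(explore_link("https://grafana.example.com"))
print(explore_link("https://grafana.example.com/"))
```

Both calls print the same single-slash link, so operators do not need to remember which form the value expects.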

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

seongsu-dev and others added 2 commits May 19, 2026 12:40
…nfigMap

Add chart-managed provisioning for the Heimdall Slack contact point as a
ConfigMap labelled `grafana_alert=1`, picked up by the existing
`grafana-sc-alerts` sidecar alongside the rules / templates / policies.
Removes the need for operators to create the contact point through the
Grafana UI (which requires `grafana.persistence.enabled=true` to survive
pod restarts) or via a separate Secret-backed provisioning file.

The webhook URL is sourced either from `alerts.heimdall.slack.existingSecret`
(resolved through Helm `lookup`, following the same convention as the
sibling MongoDB and Redis Sentinel charts) or from `alerts.heimdall.slack.webhookUrl`
when no external Secret is referenced. With neither set, the contact-points
ConfigMap is skipped and Slack delivery is silently off — alert rules
still fire but do not route anywhere.

Other adjustments:

- Default `alerts.heimdall.enabled` to false; the chart cannot deliver
  Slack messages without a webhook URL, so the operator must opt in
  explicitly after providing one.
- Hardcode the receiver name to `heimdall-slack` inside both the policy
  routing file and the contact-points ConfigMap, and drop the now-unused
  `alerts.heimdall.receiver` value and `__RECEIVER__` placeholder.
- Trim duplication in `deploy/helm/AGENTS.md` Alert Provisioning section
  and document the new ConfigMap-only layout plus the `helm template`
  limitation when using `existingSecret`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ontact point

Replace the implicit `webhook-url` data key with an explicit
`alerts.heimdall.slack.secretKey` value (default `SLACK_WEBHOOK_URL`), and
rename `slack.webhookUrl` to `slack.secretValue` so operators can see both
the key and the value contract in the same shape used elsewhere (e.g.
mongodb `existingSecret` pattern in the heimdall-inference-scheduler repo).

`existingSecret` retains precedence over `secretValue`: when set, the chart
reads the URL from `existingSecret.data[secretKey]` via Helm `lookup`. When
`existingSecret` is empty the chart embeds `secretValue` directly into the
contact-points ConfigMap. With neither producing a URL the ConfigMap is
skipped and Slack delivery is silently off — alert rules still fire.
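The precedence described above can be summarised as a small decision function. This Python sketch is a model of the rule, not the template: `resolve_webhook_url` is a hypothetical name, the `existing_secret_data` argument stands in for the `data` map a Helm `lookup` would return (values base64-encoded, as Kubernetes stores them), and returning no URL when the referenced Secret lacks the key is an assumption, since the commit message does not spell out that case.

```python
import base64
from typing import Mapping, Optional


def resolve_webhook_url(
    existing_secret_data: Optional[Mapping[str, str]],
    secret_key: str,
    secret_value: str,
) -> Optional[str]:
    """Pick the Slack webhook URL; None means the contact-points ConfigMap is skipped.

    existing_secret_data is None when existingSecret is not configured.
    """
    if existing_secret_data is not None:
        # existingSecret takes precedence over secretValue.
        encoded = existing_secret_data.get(secret_key)
        if not encoded:
            # Assumption: no fallback to secretValue when the key is absent.
            return None
        return base64.b64decode(encoded).decode("utf-8")
    # No external Secret referenced: use the inline value if one was given.
    return secret_value or None


# Hypothetical Secret contents for illustration only.
secret = {
    "SLACK_WEBHOOK_URL": base64.b64encode(b"https://hooks.example.invalid/a").decode()
}
print(resolve_webhook_url(secret, "SLACK_WEBHOOK_URL", "https://hooks.example.invalid/b"))
```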

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 03:51
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

…e UID

Two issues surfaced while validating the alert pipeline end-to-end on a
local kind cluster against a real Slack workspace.

1. Operators who pass the Slack webhook URL via `helm install --set-file`
   pick up a trailing newline from the source file, which Grafana then
   rejects when parsing the contact-points provisioning YAML:

       invalid URL "https://hooks.slack.com/services/.../...
"

   Trim leading/trailing whitespace from the resolved URL (both the
   `secretValue` path and the `existingSecret` `lookup` path) so the
   chart is robust regardless of how the URL is supplied.

2. The Heimdall alert rule hardcodes `datasourceUid: loki`, but the
   `datasource-loki.yaml` ConfigMap left `uid` unset, so Grafana assigned
   a random UID instead and rule evaluation failed with:

       failed to build query 'A': data source not found

   Pin the Loki datasource UID to `loki` so the alert rule can resolve
   it reliably.
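The first issue is easy to reproduce in isolation. The Python sketch below uses a made-up webhook URL and a hypothetical `clean_webhook_url` helper; it shows how a file-sourced value carries a trailing newline into a quoted field, and how trimming whitespace removes it.

```python
def clean_webhook_url(raw: str) -> str:
    """Strip leading and trailing whitespace from a resolved webhook URL."""
    return raw.strip()


# A value read from a file typically ends with a newline (made-up URL).
from_file = "https://hooks.example.invalid/services/PLACEHOLDER\n"

# Embedded as-is, the closing quote lands on its own line.
broken = f'url: "{from_file}"'
fixed = f'url: "{clean_webhook_url(from_file)}"'

print(len(broken.splitlines()), len(fixed.splitlines()))
```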

With both fixes, the kind cluster e2e flow succeeds: mock JSON error logs
→ Vector → Loki → LogQL rule evaluation → routing policy match
(`component=heimdall`) → Slack contact point → channel message delivered.
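The datasource UID mismatch behind the second fix is the kind of problem a pre-deploy check could catch. The following Python sketch assumes the alert rule's `data` entries and the provisioned datasources have already been parsed into plain dicts; the helper name is hypothetical, and treating `__expr__` as a built-in UID for expression nodes is an assumption about Grafana's conventions.

```python
from typing import Iterable, List, Mapping, Set


def unresolved_datasource_uids(
    rule_data: Iterable[Mapping[str, str]],
    provisioned: Iterable[Mapping[str, str]],
    builtin: Set[str] = frozenset({"__expr__"}),  # assumed built-in expression UID
) -> List[str]:
    """Return datasource UIDs referenced by a rule but pinned by no datasource."""
    # A datasource with no explicit uid gets a random one, so it cannot match.
    known = {ds["uid"] for ds in provisioned if ds.get("uid")}
    referenced = {q["datasourceUid"] for q in rule_data if "datasourceUid" in q}
    return sorted(referenced - known - set(builtin))


rule_data = [
    {"refId": "A", "datasourceUid": "loki"},
    {"refId": "B", "datasourceUid": "__expr__"},
]
print(unresolved_datasource_uids(rule_data, [{"name": "Loki"}]))
print(unresolved_datasource_uids(rule_data, [{"name": "Loki", "uid": "loki"}]))
```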

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants