MAF-19750: feat(deploy): provision Grafana alerts for Heimdall via sidecar #130
Open
seongsu-dev wants to merge 5 commits into
…decar
Add file-based Grafana Unified Alerting provisioning to the MIF chart so that
installing MIF immediately wires up the Heimdall error-log alert pipeline that
was validated as a PoC in the p-cluster `mif` release.
The new `templates/grafana/alert-configmap.yaml` mirrors the existing
`dashboard-configmap.yaml`: it iterates over `files/alerts/*.yaml` and emits one
ConfigMap per file with the `grafana_alert` label, which the
`grafana-sc-alerts` sidecar picks up and mounts into
`/etc/grafana/provisioning/alerting/`. Cluster-specific values
(`__GRAFANA_URL__` for Slack deep links, `__RECEIVER__` for the contact point
name) are substituted from chart values via `replace` rather than `tpl`, so
that Grafana's own Go template syntax embedded in alert rules (e.g.
`{{ printf "%.180s" .message }}` inside LogQL) is preserved as raw text.
Contact points (Slack webhook URLs) are intentionally out of scope because the
webhook is a secret. Operators must create the contact point named by
`alerts.heimdall.receiver` separately via the Grafana UI or a Secret-backed
provisioning file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
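A minimal sketch of what such a template could look like, assuming the dashboard pattern is a `.Files.Glob` loop. The ConfigMap naming, the `$grafana` and `$body` variables, and the exact gating expression are assumptions for illustration, not the chart's actual template.

```yaml
# Illustrative sketch only: names and structure are assumed, not copied from the chart.
{{- $grafana := index .Values "prometheus-stack" "grafana" }}
{{- if and $grafana.sidecar.alerts.enabled .Values.alerts.heimdall.enabled }}
{{- range $path, $_ := .Files.Glob "files/alerts/*.yaml" }}
{{- /* Literal substitution: the file body is never passed through tpl. */}}
{{- $body := $.Files.Get $path }}
{{- $body = replace "__GRAFANA_URL__" $.Values.alerts.heimdall.grafanaURL $body }}
{{- $body = replace "__RECEIVER__" $.Values.alerts.heimdall.receiver $body }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ $.Release.Name }}-alert-{{ base $path | trimSuffix ".yaml" }}
  labels:
    grafana_alert: "1"
data:
  {{ base $path }}: |
    {{- $body | nindent 4 }}
{{- end }}
{{- end }}
```

Because `replace` only swaps literal substrings, any `{{ ... }}` sequences inside the alert files reach Grafana untouched, whereas `tpl` would try to evaluate them as Helm template actions.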
Pull request overview
Adds Helm-managed file-based Grafana Unified Alerting provisioning to the MIF chart, mirroring the existing dashboard ConfigMap pattern so that Heimdall error-log alerts (LogQL → Loki → Slack) are installed automatically alongside the chart. Cluster-specific values are passed via `__GRAFANA_URL__` / `__RECEIVER__` placeholders that are replaced literally (rather than via `tpl`) so Grafana's own Go template syntax inside the alert YAML survives Helm rendering.
Changes:
- New `templates/grafana/alert-configmap.yaml` that generates one ConfigMap per file under `files/alerts/*.yaml`, gated by `prometheus-stack.grafana.sidecar.alerts.enabled` and `alerts.heimdall.enabled`.
- New alert content: `heimdall-rules.yaml` (error-log-burst rule), `heimdall-templates.yaml` (Slack message templates), and `heimdall-policies.yaml` (`component=heimdall` routing policy).
- `values.yaml`, `README.md`, and `AGENTS.md` updated to expose the new `alerts.heimdall.*` keys, enable the alerts sidecar, and document the no-`tpl` constraint and out-of-scope contact-point boundary.
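For orientation, a routing policy file of the kind listed above might look roughly like the following in Grafana's file-provisioning format. This is a sketch under assumptions: the root receiver name and the overall tree shape are illustrative and not taken from `heimdall-policies.yaml`; only the `component=heimdall` matcher and the `__RECEIVER__` placeholder come from the PR.

```yaml
# Hypothetical policy file; the root receiver below is an assumed placeholder.
apiVersion: 1
policies:
  - orgId: 1
    receiver: grafana-default-email   # assumed root receiver, not from the PR
    routes:
      - receiver: __RECEIVER__        # substituted by the chart at render time
        object_matchers:
          - ["component", "=", "heimdall"]
```

Each `object_matchers` entry is a three-element list of label, operator, and value.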
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deploy/helm/moai-inference-framework/templates/grafana/alert-configmap.yaml | New template rendering one ConfigMap per alert file with placeholder substitution. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-rules.yaml | Error-log-burst alert rule with Explore/rule deep links using __GRAFANA_URL__. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-templates.yaml | Slack notification templates for Heimdall alerts. |
| deploy/helm/moai-inference-framework/files/alerts/heimdall-policies.yaml | Routing policy mapping component=heimdall to __RECEIVER__. |
| deploy/helm/moai-inference-framework/values.yaml | Enables prometheus-stack.grafana.sidecar.alerts and adds alerts.heimdall.{enabled,grafanaURL,receiver}. |
| deploy/helm/moai-inference-framework/README.md | Regenerated docs for the new values keys. |
| deploy/helm/AGENTS.md | Documents the alert-provisioning pattern and the tpl prohibition. |
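As a rough illustration of the values keys named in the table, an operator override file at this stage of the PR might look like the sketch below. The concrete URL and receiver name are example values (the same ones used in the PR's test plan), and the chart's actual defaults may differ.

```yaml
# Example override values; URL and receiver name are illustrative.
prometheus-stack:
  grafana:
    sidecar:
      alerts:
        enabled: true
alerts:
  heimdall:
    enabled: true
    grafanaURL: https://grafana.example.com
    receiver: heimdall-slack
```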
…fanaURL
Prevents double-slash URLs in Slack notification links when operators configure `alerts.heimdall.grafanaURL` with a trailing slash (e.g. `https://grafana.example.com/`). Without trimming, the alert rule annotations would render `https://grafana.example.com//explore?...` and `https://grafana.example.com//alerting/...`, which most browsers tolerate but reverse proxies and OAuth redirect path matchers may reject.
Apply `trimSuffix "/"` to the value before substituting `__GRAFANA_URL__`, so both `https://grafana.example.com` and `https://grafana.example.com/` produce the same single-slash result.
Also document the trimming behavior in the values.yaml comment and add a "Placeholder conventions" subsection to deploy/helm/AGENTS.md so authors of future alert files use the same pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
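The trimming step could be sketched as follows. The `$url` variable and the single-file `data` layout are assumptions made to keep the excerpt short; the real template presumably applies the same idea inside its per-file loop.

```yaml
# Hypothetical excerpt of a ConfigMap template; $url and the fixed file path are illustrative.
{{- $url := .Values.alerts.heimdall.grafanaURL | trimSuffix "/" }}
data:
  heimdall-rules.yaml: |
    {{- .Files.Get "files/alerts/heimdall-rules.yaml" | replace "__GRAFANA_URL__" $url | nindent 4 }}
```

`trimSuffix "/"` removes a single trailing slash, so an annotation written as `__GRAFANA_URL__/explore?...` renders with exactly one slash for both `https://grafana.example.com` and `https://grafana.example.com/`.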
…nfigMap
Add chart-managed provisioning for the Heimdall Slack contact point as a ConfigMap labelled `grafana_alert=1`, picked up by the existing `grafana-sc-alerts` sidecar alongside the rules / templates / policies. Removes the need for operators to create the contact point through the Grafana UI (which requires `grafana.persistence.enabled=true` to survive pod restarts) or via a separate Secret-backed provisioning file.
The webhook URL is sourced either from `alerts.heimdall.slack.existingSecret` (resolved through Helm `lookup`, following the same convention as the sibling MongoDB and Redis Sentinel charts) or from `alerts.heimdall.slack.webhookUrl` when no external Secret is referenced. With neither set, the contact-points ConfigMap is skipped and Slack delivery is silently off: alert rules still fire but do not route anywhere.
Other adjustments:
- Default `alerts.heimdall.enabled` to false; the chart cannot deliver Slack messages without a webhook URL, so the operator must opt in explicitly after providing one.
- Hardcode the receiver name to `heimdall-slack` inside both the policy routing file and the contact-points ConfigMap, and drop the now-unused `alerts.heimdall.receiver` value and `__RECEIVER__` placeholder.
- Trim duplication in `deploy/helm/AGENTS.md` Alert Provisioning section and document the new ConfigMap-only layout plus the `helm template` limitation when using `existingSecret`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
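For readers unfamiliar with Grafana's contact-point provisioning format, the rendered ConfigMap might resemble the sketch below. The ConfigMap name, the data key, and the receiver `uid` are assumptions; only the `grafana_alert` label and the `heimdall-slack` receiver name come from the commit message, and the webhook URL is left elided.

```yaml
# Hypothetical rendered output; metadata.name, the data key, and uid are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-alert-heimdall-contact-points
  labels:
    grafana_alert: "1"
data:
  heimdall-contact-points.yaml: |
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: heimdall-slack
        receivers:
          - uid: heimdall-slack
            type: slack
            settings:
              url: https://hooks.slack.com/services/...
```

Note that in this layout the webhook URL ends up in a ConfigMap in plain text, which is the trade-off implied by the ConfigMap-only design.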
…ontact point
Replace the implicit `webhook-url` data key with an explicit `alerts.heimdall.slack.secretKey` value (default `SLACK_WEBHOOK_URL`), and rename `slack.webhookUrl` to `slack.secretValue` so operators can see both the key and the value contract in the same shape used elsewhere (e.g. mongodb `existingSecret` pattern in the heimdall-inference-scheduler repo).
`existingSecret` retains precedence over `secretValue`: when set, the chart reads the URL from `existingSecret.data[secretKey]` via Helm `lookup`. When `existingSecret` is empty the chart embeds `secretValue` directly into the contact-points ConfigMap. With neither producing a URL the ConfigMap is skipped and Slack delivery is silently off; alert rules still fire.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
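The precedence described above could be expressed along these lines. This is a sketch, not the chart's template: the variable names, the use of `.Release.Namespace` for the lookup, and the ConfigMap metadata are assumptions.

```yaml
# Illustrative sketch of URL resolution; variable names and namespace choice are assumed.
{{- $slack := .Values.alerts.heimdall.slack }}
{{- $url := "" }}
{{- if $slack.existingSecret }}
  {{- /* lookup returns an empty map when the Secret is absent or under helm template. */}}
  {{- $secret := lookup "v1" "Secret" .Release.Namespace $slack.existingSecret }}
  {{- $data := $secret.data | default dict }}
  {{- $url = get $data $slack.secretKey | b64dec }}
{{- else }}
  {{- $url = $slack.secretValue | default "" }}
{{- end }}
{{- if $url }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-alert-heimdall-contact-points
  labels:
    grafana_alert: "1"
data:
  heimdall-contact-points.yaml: |
    apiVersion: 1
    contactPoints:
      - orgId: 1
        name: heimdall-slack
        receivers:
          - uid: heimdall-slack
            type: slack
            settings:
              url: {{ $url | quote }}
{{- end }}
```

Since `lookup` needs a live cluster, a plain `helm template` run with `existingSecret` set would resolve an empty URL in this sketch and skip the ConfigMap.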
…e UID
Two issues surfaced while validating the alert pipeline end-to-end on a
local kind cluster against a real Slack workspace.
1. Operators who pass the Slack webhook URL via `helm install --set-file`
pick up a trailing newline from the source file, which Grafana then
rejects when parsing the contact-points provisioning YAML:
invalid URL "https://hooks.slack.com/services/.../...
"
Trim leading/trailing whitespace from the resolved URL (both the
`secretValue` path and the `existingSecret` `lookup` path) so the
chart is robust regardless of how the URL is supplied.
2. The Heimdall alert rule hardcodes `datasourceUid: loki`, but the
`datasource-loki.yaml` ConfigMap left `uid` unset, so Grafana assigned
a random UID instead and rule evaluation failed with:
failed to build query 'A': data source not found
Pin the Loki datasource UID to `loki` so the alert rule can resolve
it reliably.
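To make the second fix concrete, the two sides that must agree are sketched below as two separate provisioning files. The datasource name and URL are placeholders, and the rule fragment omits every field except the one that references the datasource; neither is copied from the chart.

```yaml
# Datasource provisioning (hypothetical name and URL): uid is pinned explicitly.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki.example.svc:3100
---
# Alert rule fragment (other fields omitted): the query refers to the same uid.
apiVersion: 1
groups:
  - orgId: 1
    name: heimdall
    rules:
      - uid: heimdall-error-log-burst   # illustrative rule uid
        data:
          - refId: A
            datasourceUid: loki
```

If `uid` is left unset on the datasource, Grafana generates one, and the hardcoded `datasourceUid: loki` no longer resolves.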
With both fixes, the kind cluster e2e flow succeeds: mock JSON error logs
→ Vector → Loki → LogQL rule evaluation → routing policy match
(`component=heimdall`) → Slack contact point → channel message delivered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Mirrors the existing `templates/grafana/dashboard-configmap.yaml` pattern: each file under `moai-inference-framework/files/alerts/*.yaml` becomes one ConfigMap labelled `grafana_alert=1`, picked up by the `grafana-sc-alerts` sidecar and mounted into `/etc/grafana/provisioning/alerting/`.
- `__GRAFANA_URL__` (Slack deep links) and `__RECEIVER__` (contact point name) placeholders are substituted with `replace`, so that Grafana's own Go template syntax inside alert rules (e.g. `{{ printf "%.180s" .message }}` inside LogQL) is preserved as raw text.
Background
This is the productisation of the PoC validated on the p-cluster `mif` release, where three ConfigMaps (rules / templates / policies) were applied directly to deliver an end-to-end pipeline: Heimdall error logs → Loki → Grafana Unified Alerting → Slack.
The PoC's environment-specific bits (`namespace="seongsu"`, `owner: seongsu`, `environment: p-cluster`, hardcoded `grafana.product.moreh.dev` URLs) are removed here. The LogQL selector `{app="heimdall-inference-scheduler", level="error"}` is chart-name based, so a single alert rule covers every Heimdall release in the cluster without per-release duplication.
Files
- `templates/grafana/alert-configmap.yaml`: new template, replicates the dashboard ConfigMap pattern.
- `files/alerts/heimdall-rules.yaml`: error-log-burst alert rule.
- `files/alerts/heimdall-templates.yaml`: Slack notification template.
- `files/alerts/heimdall-policies.yaml`: routing policy (matches `component=heimdall` → `__RECEIVER__`).
- `values.yaml`: enables `prometheus-stack.grafana.sidecar.alerts` and adds the top-level `alerts.heimdall.{enabled, grafanaURL, receiver}` section.
- `deploy/helm/AGENTS.md`: new "Alert Provisioning" section documenting the pattern, the out-of-scope contact point boundary, and the explicit prohibition on wrapping alert ConfigMap data with `tpl`.
- `deploy/helm/moai-inference-framework/README.md`: regenerated by `make helm-docs`.
Out of scope (follow-up work)
- Creating the Slack contact point named by `alerts.heimdall.receiver`.
- `moreh-iac/SNUSHC/p-cluster/mif/mif.tf` should bump the chart version and set `alerts.heimdall.grafanaURL=https://grafana.product.moreh.dev/`. Once that lands, the temporary ConfigMaps applied during the PoC can be removed.
- Further `files/alerts/heimdall-<category>.yaml` files in follow-up PRs.
Test plan
- `helm lint deploy/helm/moai-inference-framework`: passes.
- `helm template` with `--set alerts.heimdall.grafanaURL=https://grafana.example.com --set alerts.heimdall.receiver=heimdall-slack` renders three ConfigMaps (`*-alert-heimdall-{rules,templates,policies}`) with the `grafana_alert: "1"` label.
- `kubectl apply --dry-run=client` on the rendered alert ConfigMaps: `created (dry run)` for all three.
- Rendered ConfigMap `data` parses as valid YAML and matches the Grafana provisioning schema (required `apiVersion`, `groups`/`templates`/`policies`; alert rule `uid`/`title`/`condition`/`data`/`noDataState`/`execErrState`; `object_matchers` 3-tuple form).
- The LogQL selector is chart-name based (`app="heimdall-inference-scheduler"`) with no namespace pin, so all Heimdall releases are covered by a single rule.
- `__GRAFANA_URL__` and `__RECEIVER__` placeholders are fully substituted in the rendered output, with no residue.
- Grafana's own Go template syntax (`{{ printf "%.180s" .message }}`, `{{ define "heimdall-slack.title" }}`, `{{ if .Annotations.exploreURL }}`) is preserved as raw text in the rendered ConfigMap (Helm does not evaluate it).
- `alerts.heimdall.enabled=false` and `prometheus-stack.grafana.sidecar.alerts.enabled=false` each gate the ConfigMaps off (0 rendered).
- Rendered output includes the `grafana-sc-alerts` sidecar container with `LABEL=grafana_alert`, `FOLDER=/etc/grafana/provisioning/alerting`, and the correct reload URL.
- `make helm-docs` regenerates `deploy/helm/moai-inference-framework/README.md` to include the new `alerts.heimdall.*` and `prometheus-stack.grafana.sidecar.alerts.enabled` keys.
🤖 Generated with Claude Code