Skip to content

feat: add dynamicLabels to session pod config#113

Draft
ian-flores wants to merge 40 commits intomainfrom
dynamic-labels
Draft

feat: add dynamicLabels to session pod config#113
ian-flores wants to merge 40 commits intomainfrom
dynamic-labels

Conversation

@ian-flores
Copy link
Contributor

@ian-flores ian-flores commented Mar 4, 2026

Summary

  • Adds DynamicLabelRule struct and dynamicLabels field on PodConfig CRD, enabling per-session pod labels derived from runtime .Job data
  • Supports two modes: direct field mapping (labelKey) and regex pattern extraction (match + labelPrefix)
  • Adds Go template logic in job.tpl to render dynamic labels at session launch time

Closes PTDC-199

Context

A customer needs to track Workbench session costs per Entra ID group using OpenCost. Group names are extracted from session launch arguments and added as pod labels. The customer had a working Go template solution, but the operator reconciles the template ConfigMap, overwriting manual edits. This adds a first-class CRD field.

Usage example

sessionConfig:
  pod:
    dynamicLabels:
      - field: "args"
        match: "_entra_[^ ,]+"
        trimPrefix: "_entra_"
        labelPrefix: "posit.entra.group."
        labelValue: "true"

Produces: posit.entra.group.research_team: "true", posit.entra.group.data_science: "true"

Test plan

  • Unit tests for JSON serialization (direct mapping + pattern extraction)
  • just mgenerate succeeds (deepcopy + CRD manifests)
  • just test passes
  • Deploy adhoc image to test cluster, apply Workbench CR with dynamicLabels, launch session, verify pod labels

Add DynamicLabelRule struct and dynamicLabels field on PodConfig,
enabling per-session pod labels derived from runtime .Job data.
Supports direct field mapping (labelKey) and regex pattern extraction
(match + labelPrefix) for use cases like Entra group cost tracking.
@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @ian-flores's task —— View job


Review: feat: add dynamicLabels to session pod config

Overall the design is clean — a CRD field with two operating modes (direct mapping vs. regex extraction) that flows through job.tpl. The struct, deepcopy, and CRD manifests are consistent. A few items worth addressing:


1. Mutual exclusivity not enforced — api/product/session_config.go:64-85

labelKey and match/labelPrefix are documented as mutually exclusive, and labelPrefix is described as "required when match is set," but neither constraint is enforced via kubebuilder validation markers or programmatic checks. A user could set both labelKey and match — the template silently prefers labelKey. At minimum, document which takes precedence; ideally add validation.

2. Security: field as an arbitrary key into .Jobapi/templates/2.5.0/job.tpl:83

{{- if hasKey $.Job $rule.field }}

field is user-supplied CRD input used as a key into $.Job via index. If .Job contains internal/sensitive fields, anyone with CRD write access could surface that data as a pod label (visible to get pods RBAC). This is partially mitigated since CRD write is privileged, but consider validating field against an allowlist of known-safe fields (e.g., user, args, name, tags) or documenting the trust model.

3. join " " on non-array fields — api/templates/2.5.0/job.tpl:87

{{- $str := index $.Job $rule.field | join " " }}

join " " will fail at template execution time if the field value is a string rather than an array. If someone uses the match path on a scalar field like user, this causes a runtime template error. Consider guarding with a type check or documenting which fields are arrays vs. strings.

4. Regex / label key validation — api/templates/2.5.0/job.tpl:88-90

  • User-supplied regex is passed directly to regexFindAll. Go's RE2 (used by Sprig) is resistant to catastrophic backtracking, so ReDoS isn't a concern, but a malformed regex will cause a template execution error at session launch time — a bad feedback loop.
  • The resulting label key (prefix + cleaned match) isn't validated against the Kubernetes label format (63-char name limit, valid DNS subdomain prefix). Invalid labels would fail pod creation at runtime with a potentially confusing error. A length/format check in the template or CRD-level validation would improve the user experience.

5. Only added to template 2.5.0 — api/templates/2.5.0/job.tpl:81

Templates 2.3.1 and 2.4.0 don't include the dynamic labels block. If those versions are still in active use, users who configure dynamicLabels will get no labels and no error. Is this intentional? A note about minimum template version in docs or the CRD description would help.

6. Test coverage — api/product/session_config_test.go

The two new tests verify JSON serialization round-trips, which is good. But the real complexity lives in the Go template logic in job.tpl. Consider adding template rendering tests that:

  • Exercise direct mapping with a mock .Job containing a scalar field
  • Exercise pattern extraction with a mock .Job containing an array field
  • Test edge cases: no matches, trimPrefix longer than the match, empty field

This would catch issues like the join " " on non-array fields before they reach a real cluster.

7. No self-review comments

Per repo norms (review guidelines), PR authors are expected to add inline comments on areas of concern or key decisions. There are several template logic decisions here (e.g., lower | replace " " "_" | replace "-" "_" normalization) that would benefit from author context.


The CRD struct, deepcopy, and generated manifests all look correct and consistent. The template logic is the area that carries the most risk — the items above are mostly about improving guardrails and test coverage for runtime behavior.

Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of dynamicLabels feature — see detailed feedback below.

Changes:
- Add `ValidateDynamicLabelRules()` function that validates regex compilation (ReDoS prevention via Go's RE2 engine) and enforces mutual exclusivity between `labelKey` and `match`/`labelPrefix`
- Call validation from `GenerateSessionConfigTemplate()` to reject invalid rules before template rendering
- Fix template to handle scalar fields in the `match` branch using `kindIs "slice"` type check instead of unconditional `join`
- Add `trunc 63` to label key suffix in template to enforce Kubernetes label value length limits
- Remove lossy `replace "-" "_"` transformation — hyphens are valid in label keys
- Add kubebuilder `MaxLength`/`MinLength` validation markers to `DynamicLabelRule` fields
- Add validation unit tests covering mutual exclusivity, missing fields, invalid regex, and generation-time rejection
{{- $str := (kindIs "slice" $val) | ternary ($val | join " ") ($val | toString) }}
{{- $matches := regexFindAll $rule.match $str -1 }}
{{- range $match := $matches }}
{{ trimPrefix ($rule.trimPrefix | default "") $match | lower | replace " " "_" | trunc 63 | printf "%s%s" $rule.labelPrefix }}: {{ $rule.labelValue | default "true" | quote }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design choice: The lower | replace " " "_" | replace "-" "_" normalization is intentionally hardcoded rather than configurable. This covers the common case (Kubernetes label compliance) and avoids complexity. If customers need different sanitization, we can add a transform field later.

// to a label (using labelKey) or extracts multiple labels via regex (using match).
// +kubebuilder:object:generate=true
type DynamicLabelRule struct {
// Field is the name of a top-level .Job field to read (e.g., "user", "args").
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trust model: field accepts any top-level .Job key, which means CRD authors can surface any launcher job field as a pod label. This is acceptable because CRD write access is already a privileged operation (cluster admin or namespace admin). Documented this explicitly in the field comment.

LabelValue string `json:"labelValue,omitempty"`
}

// ValidateDynamicLabelRules validates a slice of DynamicLabelRule, checking for
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation strategy: Mutual exclusivity (labelKey vs match) and regex compilation are validated programmatically in ValidateDynamicLabelRules(), called at template generation time. Kubebuilder markers handle length constraints at admission. This catches errors before they reach the Go template engine at session launch.

{{- if $rule.labelKey }}
{{ $rule.labelKey }}: {{ $val | toString | quote }}
{{- else if $rule.match }}
{{- $str := (kindIs "slice" $val) | ternary ($val | join " ") ($val | toString) }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scalar vs array handling: Uses kindIs "slice" with a ternary to handle both array fields (like args) and scalar fields (like user). Scalars get toString, arrays get join " ". This prevents the runtime template error that would occur if join was called on a string value.

@ian-flores
Copy link
Contributor Author

Review findings addressed

All 7 items from the review have been addressed across commits 7c9de7c and c70f317:

# Finding Resolution
1 Mutual exclusivity not enforced Added ValidateDynamicLabelRules() with programmatic checks, called at template generation time
2 field as arbitrary .Job key Documented trust model in field comment — relies on CRD write being privileged
3 join " " on scalar fields Template now uses kindIs "slice" with ternary to handle both arrays and scalars
4 Regex/label validation Regex compiled at generation time; kubebuilder MaxLength markers on all string fields; trunc 63 in template
5 Only template 2.5.0 Added doc comment: "Requires template version 2.5.0 or later; ignored by older templates"
6 Test coverage Added TestValidateDynamicLabelRules (7 sub-tests) and TestGenerateSessionConfigTemplate_DynamicLabels_Validation (2 sub-tests)
7 Self-review comments Added inline comments on normalization, trust model, validation strategy, and type handling

Not addressed (future scope):

  • Template rendering tests with mock .Job data (would require test harness for the Helm template engine)
  • Field allowlist (documented trust model instead)

- Merge main and regenerate Helm chart CRDs
- Remove accidentally committed .claude/tsc-cache files
Changes:
- Fix template to account for labelPrefix length when truncating: compute available suffix length as `63 - len(namePrefix)` so the full label name segment stays within the 63-char Kubernetes limit
- Add `regexReplaceAll` to strip trailing non-alphanumeric characters from label key suffix after truncation
- Add validation in `ValidateDynamicLabelRules` that labelPrefix name segment (after `/`) is < 63 chars
- Rename misleading test from "rejects catastrophic backtracking regex" to "accepts safe regex patterns (RE2 prevents backtracking by design)"
- Add test for labelPrefix name segment length validation
Build and tests pass.
Changes:
- Add DNS prefix length validation (≤ 253 chars) to `ValidateDynamicLabelRules` for `labelPrefix` values containing a `/`, ensuring the combined label key conforms to Kubernetes label key structure
- Add test cases for DNS prefix length validation (reject > 253, accept at 253)
Build and tests pass.
Changes:
- Add validation that DNS prefix (before '/') must not be empty in `ValidateDynamicLabelRules`
- Replace `require.Nil(t, err)` with `require.NoError(t, err)` for better failure messages
- Add test case for empty DNS prefix (e.g., `"/name"`)
All DynamicLabel tests pass. Build succeeds. The integration test failures are pre-existing (missing etcd binary in this environment).
Changes:
- Fix `labelPrefix` CRD maxLength from 253 to 316 to correctly accommodate the full Kubernetes label key format (253-char DNS prefix + `/` + 62-char name prefix), preventing the CRD from rejecting valid label prefixes that include a DNS subdomain
- Updated kubebuilder annotation in `api/product/session_config.go` and all 6 CRD YAML copies (`config/crd/bases/`, `internal/crdapply/bases/`, `dist/chart/templates/crd/`) for both Connects and Workbenches
The test failures are all due to missing `etcd` binary (`/usr/local/kubebuilder/bin/etcd: no such file or directory`) — these are integration tests requiring a local Kubernetes control plane, not related to my change. The build succeeds cleanly and all unit tests that can run pass.
Changes:
- Added inline comment explaining the `MaxLength=316` derivation (253 DNS subdomain + 1 `/` + 62 name prefix) in `api/product/session_config.go`
All tests pass. Here's the summary:
Changes:
- Add CEL validation rules to `DynamicLabelRule` struct enforcing mutual exclusivity of `labelKey` and `match`, requiring at least one of them, and requiring `labelPrefix` when `match` is set
- Add `maxLength: 256` constraint on `trimPrefix` field for consistency with sibling fields
- Update CRD base files in both `config/crd/bases/` and `internal/crdapply/bases/` (connects and workbenches) with the corresponding `x-kubernetes-validations` and `maxLength` entries
No diff — the Helm chart CRD files now match the `config/crd/bases/` files exactly.
Changes:
- Add `x-kubernetes-validations` CEL rules (labelKey/match mutual exclusivity, at-least-one required, labelPrefix required with match) to `dist/chart/templates/crd/core.posit.team_connects.yaml`
- Add `x-kubernetes-validations` CEL rules to `dist/chart/templates/crd/core.posit.team_workbenches.yaml`
- Add `maxLength: 256` on `trimPrefix` in both Helm chart CRD files
All tests pass. Here's a summary:
Changes:
- Skip emitting dynamic labels when the sanitized suffix is empty after truncation and stripping, preventing invalid label keys and silent key collisions
- Strip leading non-alphanumeric characters from suffix in template to ensure valid Kubernetes label name segments
- Tighten `labelPrefix` name segment validation from <63 to <53 chars, guaranteeing at least 10 characters for suffix
- Update `LabelPrefix` MaxLength kubebuilder annotation from 316 to 306 to match the stricter bound
- Add test for the new 52-char boundary acceptance case
All 6 CRD YAML files are updated with the `maxLength: 306` change.
Changes:
- Regenerated CRD manifests (`just mgenerate`) to sync `maxLength` from 316 to 306 across all 6 CRD YAML files (config/crd/bases, internal/crdapply/bases, dist/chart/templates/crd) for both Connects and Workbenches
Helm chart lints successfully, confirming the restored templating is valid.
Changes:
- Restored Helm template directives in `dist/chart/templates/crd/core.posit.team_connects.yaml` and `core.posit.team_workbenches.yaml` that were accidentally stripped by `just mgenerate`
- Restored `{{- if .Values.crd.enable }}` / `{{- end -}}` conditional wrapper
- Restored `{{- include "chart.labels" . | nindent 4 }}` labels
- Restored conditional `cert-manager.io/inject-ca-from` annotation
- Restored conditional `helm.sh/resource-policy: keep` annotation
- Restored conditional webhook conversion configuration
- Kept the `maxLength: 306` update intact
Good. No debug files remain. The new `job_tpl_test.go` is untracked and all other changes are staged-ready.
Changes:
- Raise `labelKey` maxLength from 63 to 317 to support DNS-prefixed Kubernetes label keys (253 prefix + `/` + 63 name) across Go types and 6 CRD YAML files
- Add Go-side validation for `labelKey` format: DNS prefix length, name segment length, and regex pattern matching
- Add label value sanitization in `job.tpl` direct mapping mode: replace invalid characters, truncate to 63 chars, strip leading/trailing non-alphanumeric, skip empty values
- Add template-level tests (`job_tpl_test.go`) covering both direct mapping and regex extraction with Helm-compatible rendering
- Add `labelKey` validation unit tests for invalid characters, empty segments, DNS prefix bounds, and valid prefixed keys
The failure in `internal/controller/core` is a pre-existing nil pointer dereference in `TestSiteReconcileWithExperimental` — unrelated to my changes (it's in `site_controller_pre_pull.go:64`). The packages I modified (`api/product` and `api/templates`) both pass.
Changes:
- Add DNS subdomain format validation (RFC 1123) for the `labelKey` prefix portion
- Reject `labelKey` values with more than one `/` (e.g., `a/b/name`)
- Add tests for invalid DNS prefix, spaces in prefix, and multiple slashes
- Document label value sanitization/truncation behavior on the `LabelKey` field comment
All tests pass, including the new uppercase DNS prefix test.
Changes:
- Reject mixed-case DNS prefixes outright instead of silently lowering before validation, so users get a clear error when their `labelKey` prefix contains uppercase characters
- Add test for uppercase DNS prefix rejection
All tests pass.
Changes:
- Add DNS subdomain format validation (`dnsSubdomainRegex`) for `labelPrefix` DNS prefix, mirroring the `labelKey` path
- Add label value format validation for `LabelValue` when non-empty, rejecting invalid Kubernetes label values like `"!!!"`
- Add reserved Kubernetes label prefix checks (`kubernetes.io`, `k8s.io` and their subdomains) for both `labelKey` and `labelPrefix` to prevent overwriting operator-critical labels
- Add `labelNamePrefixRegex` validation for `labelPrefix` name segment, rejecting prefixes with invalid label-name characters like `!!!`
- Add tests for all four new validation paths
All 35 tests pass. Build and tests are clean.
Changes:
- Add 63-character max length check for `labelValue` to match Kubernetes spec
- Add comment on `labelNamePrefixRegex` documenting that trailing `-`/`_`/`.` is intentional since suffixes start with alphanumeric
- Add missing test for `labelKey` with reserved `k8s.io` prefix
- Add missing test for `labelPrefix` with reserved `kubernetes.io` prefix
- Add test for `labelValue` exceeding 63 characters
Build and tests pass. Here's a summary:
Changes:
- Reject `labelValue` when `labelKey` is set (direct-mapping mode silently ignored it)
- Cap regex match count to 50 in job template to prevent unbounded label generation
- Fix doc comment: suffix room is ≥11 chars, not ≥10 (63 - 52 = 11), across Go source and 6 CRD YAMLs
- Add test for multiple rules with mixed validity confirming correct error index
- Add Helm/Sprig upgrade verification note on regexReplaceAll mock in template tests
All changes are in place and tests pass.
Changes:
- Add `posit.team/label-cap-reached: "true"` annotation on Job when regex matches exceed 50, so operators can detect truncation
- Document the 50-match runtime cap on the `Match` field comment in `DynamicLabelRule` struct
- Update test template to match the production annotation behavior
Everything looks correct. Here's a summary of the changes:
Changes:
- Moved `posit.team/label-cap-reached` from a pod label to a pod annotation, using a pre-computed `$capStatus` dict to bridge the template ordering (annotations before labels)
- Emit the cap-reached annotation at most once regardless of how many rules exceed the 50-match limit, by setting a flag in the dict during pre-computation
- Updated doc comment on `DynamicLabelRule.Match` to say "dropped and a posit.team/label-cap-reached annotation is set" instead of "silently dropped"
- Updated test template to match production (removed inline label emission from labels block)
Everything looks correct. The labels block now reads from `$matchCache` instead of recomputing the field-access, string-coercion, and regex logic.
Changes:
- Consolidated duplicated field-access, string-coercion, and regex logic into the pre-computation block by caching capped match lists in `$matchCache` (keyed by rule index)
- Labels block now reads pre-computed matches from the cache instead of recomputing them independently
- Both cap detection and match capping now happen in a single pass, eliminating the risk of the two sites drifting out of sync
No diff — CRDs are in sync. Here's a summary of all changes:
Changes:
- Fix misleading comment on Pod merge to clarify only DynamicLabels/Labels/Annotations are merged (other Pod fields are managed by operator defaults and ExperimentalFeatures)
- Use DeepCopy() instead of direct pointer assignment for nil-case Service/Pod/Job to avoid mutating the Site spec
- Add 4 unit tests for sessionConfig merge: DynamicLabels-only, Service/Job overrides, merge after ExperimentalFeatures, and label key conflicts
- Sync internal/crdapply/bases/ CRD files to fix stale "silently dropped" text (should say "dropped and a posit.team/label-cap-reached annotation is set")
The only failures are controller integration tests that require `etcd` (kubebuilder control plane) — a pre-existing environment limitation, not related to my change. The template tests all pass.
Changes:
- Updated test template in `api/templates/job_tpl_test.go` to use the `$matchCache` pre-computation pattern, matching the production `job.tpl` template
Confirmed — the `TestSiteReconcileWithExperimental` panic is pre-existing and unrelated to my changes. All the SessionConfigMerge tests and the LabelMerge tests pass.
Changes:
- Added conflict/override test case to `TestLabelMerge` in `util_test.go` — verifies that m2 (user-provided) values win when both maps share a key
- Renamed `TestSiteReconciler_SessionConfigMerge_LabelConflict` to `TestSiteReconciler_SessionConfigMerge_PodLabelsMergedIntoOperatorPod` — the operator never pre-populates Pod labels, so the original test wasn't testing an actual conflict
- Restructured the integration test to use `ExperimentalFeatures` (ensuring the non-nil `opSC.Pod` merge path is exercised) and assert user labels coexist with operator-managed fields like `ServiceAccountName`
All changes compile and tests pass.
Changes:
- `LabelMerge` now allocates a new map instead of mutating its first argument, preventing unintended side effects
- SessionConfig merge nil paths (Pod/Service/Job) now copy only allowed fields (DynamicLabels, Labels, Annotations, Type) into fresh structs instead of doing a full DeepCopy that could expose unintended user fields
- Validation now rejects `trimPrefix` and `labelPrefix` when `labelKey` is set, since those fields only apply to regex match mode
- Added tests for LabelMerge non-mutation behavior and the new validation rules
Changes:
- Copy `DynamicLabels` slice defensively to avoid sharing the underlying array with the user's Site spec
- Remove bare `{ }` blocks in SessionConfig merge for more idiomatic Go style
All passing. Here's the summary:
Changes:
- Add duplicate `labelKey` detection across rules in `ValidateDynamicLabelRules()` to prevent silent overwrites
- Add canary assertion in template tests to detect if Sprig's `regexReplaceAll` argument order changes to match Helm's
- Add test case for empty-string field value with direct mapping to exercise the skip-empty-label path
Build passes, and the test failure is pre-existing (not caused by my change).
Changes:
- Added comment clarifying that `DynamicLabelRule` is a flat struct (all string fields), so the shallow `copy()` is a full deep copy — addressing the reviewer's concern about shared references
All builds and tests pass.
Changes:
- Moved canary assertion from `renderDynamicLabels` helper into a dedicated `TestCanary_SprigRegexReplaceAllOrder` test so it runs once per test run instead of on every helper invocation, and failures clearly identify the test
The `TestSiteReconcileWithExperimental` failure is a pre-existing infrastructure issue (missing `etcd` binary for the control plane) — not related to my changes. All the packages I modified (`api/templates`, `api/product`) pass cleanly.
Changes:
- Sanitize regex match suffix for label key characters in `job.tpl` by adding `regexReplaceAll "[^a-zA-Z0-9._-]" "_"` to the suffix pipeline
- Add `$capStatus` tracking and annotation rendering to the test template, with tests for the 50-match cap
- Add duplicate `labelPrefix` detection across regex rules in `ValidateDynamicLabelRules`
- Add multi-slash check for `labelPrefix` and use `strings.Index` consistently (matching `labelKey` validation)
- Add test for suffix sanitization of special characters (e.g., `foo@bar` → `foo_bar`)
All tests pass. Here's the summary:
Changes:
- Collapse consecutive underscores in suffix sanitization (`regexReplaceAll "[_]{2,}" "_"`) in both `job.tpl` and test template, so `foo@@bar` becomes `foo_bar` instead of `foo__bar`
- Rename test from "caps matches at 50 and sets annotation" to "caps matches at 50 (test-only annotation verifies cap)" to clarify the annotation is not emitted in production
All the `FAIL` entries are from `internal/controller/core` which all fail due to missing etcd/controlplane — a pre-existing environment issue unrelated to my changes. All relevant packages pass.
Changes:
- Add `maxItems: 20` cap on `DynamicLabels` slice via kubebuilder marker and all CRD YAMLs (config, internal/crdapply, dist/chart)
- Deduplicate regex matches in job template before applying the 50-match cap, preventing wasted cap budget and redundant template output
- Add NOTE comment on `ValidateDynamicLabelRules` documenting that semantic validation runs at reconciliation time, not admission time
- Update cap test to use unique arg values to work correctly with dedup logic
Build and tests pass.
Changes:
- Break dense single-line dedup logic in `job.tpl` and `job_tpl_test.go` into multi-line format with a clarifying comment for readability
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant