Skip to content

fix(designer): Resolve RBAC propagation delay causing Foundry proxy failures#8950

Open
ccastrotrejo wants to merge 5 commits intomainfrom
fix/foundry-rbac-propagation
Open

fix(designer): Resolve RBAC propagation delay causing Foundry proxy failures#8950
ccastrotrejo wants to merge 5 commits intomainfrom
fix/foundry-rbac-propagation

Conversation

@ccastrotrejo
Copy link
Contributor

Commit Type

  • fix - Bug fix

Risk Level

  • Medium - Moderate changes, some user impact

What & Why

When selecting a Foundry agent connection for the first time, proxy API calls (/agents, /deployments) and RBAC role assignments fired simultaneously from independent useEffect hooks. The proxy calls failed immediately (401/403) because roles hadn't been assigned yet, and React Query's default 3 retries exhausted within ~10 seconds — far shorter than Azure's 30s–2min RBAC propagation window.

HAR evidence (timeline from real capture):

T+0.0s   Proxy /agents    → 400 (Forbidden)     ← no roles exist yet
T+0.0s   Proxy /deployments → 401 (PermissionDenied)
T+0.3s   RBAC PUT ×4       → 201 (Created)       ← roles assigned, not yet propagated
T+2.9s   Retry #1           → 401 (still propagating)
T+5.4s   Retry #2           → 401
T+9.9s   Retry #3           → 401                 ← React Query gives up. Dead end.

Fix: Three layered approach

1. RBAC signal + query gating

  • Track RBAC assignment status (idle → assigning → assigned/not-needed/failed)
  • Gate Foundry proxy queries with enabled: rbacReady so they don't fire before RBAC PUTs complete
  • Eliminates the T+0 race condition

2. Extended retry for auth errors

  • Added isFoundryAuthError() helper that detects 401/403 from various error shapes (status codes, .code, .message)
  • Foundry proxy queries retry up to 12 times on auth errors with exponential backoff (3s → 6s → 12s → 24s → 30s cap), providing a ~3 minute retry window — enough for Azure RBAC propagation
  • Non-auth errors still use standard 3 retries

3. Propagation UX

  • Shows spinner with "Assigning permissions for Foundry access..." during RBAC PUT phase
  • Shows "Setting up permissions for Foundry access. This may take up to a minute..." during propagation phase
  • Agent picker is dimmed and inert during propagation
  • Auto-transitions to normal UI once retry succeeds — no user action needed
  • Non-Foundry params remain visible below the spinner

Impact of Change

  • Users: Foundry agent connections work reliably on first use. No more cryptic 401 errors — users see a purposeful loading state that auto-resolves. Returning users (RBAC already exists) see zero difference.
  • Developers: isFoundryAuthError exported for reuse. Foundry hooks now accept optional rbacReady parameter. foundryQueryOpts includes RBAC-aware retry config.
  • System: Reduced wasted network calls (queries gated until RBAC completes). Extended retry window covers Azure's documented propagation delay.

Test Plan

  • Unit tests added/updated
  • E2E tests added/updated
  • Manual testing completed
  • Tested in: Standalone designer with Foundry connections

New Tests

  • foundryRetry.spec.ts12 tests for isFoundryAuthError covering: null/undefined, status codes (401/403), error codes, error messages, real-world HAR error shapes, and negative cases

Existing Tests (all passing)

  • foundryUpdates.spec.ts (v1 + v2) — 40 tests ✅
  • foundryAgentService.spec.ts — 33 tests ✅
  • foundryAgentDetails.spec.tsx — 19 tests ✅
  • foundryAgentPicker.spec.tsx — 6 tests ✅
  • role.spec.ts — 6 tests ✅

Files Changed

  • useCognitiveService.ts (×2) — isFoundryAuthError, retry config, rbacReady param on hooks
  • parametersTab/index.tsx (×2) — RBAC status tracking, query gating, propagation UI
  • foundryRetry.spec.ts (new) — Unit tests for auth error detection

Contributors

@ccastrotrejo

…ailures

When selecting a Foundry agent connection for the first time, proxy API
calls and RBAC role assignments fired simultaneously from independent
useEffect hooks. The proxy calls failed immediately (401/403) because
roles hadn't been assigned yet, and React Query's default 3 retries
exhausted within ~10 seconds — far shorter than Azure's 30s–2min RBAC
propagation window.

Three layered fixes:

1. RBAC signal + query gating: Track RBAC assignment status
   (idle/assigning/assigned/not-needed/failed) and gate Foundry proxy
   queries behind completion. Proxy calls no longer fire before RBAC
   PUTs finish.

2. Extended retry for auth errors: Foundry proxy queries now retry up
   to 12 times on 401/403 with exponential backoff (3s, 6s, 12s, 24s,
   30s cap) providing a ~3 minute window for Azure RBAC propagation.

3. Propagation UX: Shows a purposeful spinner with 'Setting up
   permissions for Foundry access...' instead of a cryptic error.
   Auto-transitions to normal UI once retries succeed.

All changes mirrored in both designer and designer-v2.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 20, 2026 20:13
@github-actions
Copy link

github-actions bot commented Mar 20, 2026

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: fix(designer): Resolve RBAC propagation delay causing Foundry proxy failures
  • Issue: None — the title is clear, follows conventional commit style, and describes the scope and problem.
  • Recommendation: Keep as-is.

Commit Type

  • Properly selected (fix)
  • Only one commit type selected which is correct for this change.

Risk Level

  • Claim: Medium — label risk:medium is present on the PR and matches the body.
  • Assessment: Appropriate. The changes perform RBAC role assignments and change retry/backoff behavior and UI gating: medium risk is reasonable.

What & Why

  • Current: The PR body contains a clear and concise "What & Why" describing the RBAC race, HAR evidence, and the three-layered fix (RBAC signal + gating, extended retries, propagation UX).
  • Issue: None.
  • Recommendation: Good. The context and rationale are well documented.

Impact of Change

  • The Impact section is present and well-structured for Users, Developers, and System.
  • Recommendation: Consider briefly noting whether any telemetry or metrics were added/changed to observe retry behavior or failed role-assignment rates in production (optional but helpful for operations).

Test Plan

  • Test plan claims unit tests added and manual testing completed.
  • I checked the diff: foundryRetry.spec.ts is added and covers isFoundryAuthError. Unit tests are present in the diff and appear to match the Test Plan description.
  • E2E tests are not included; that's acceptable but consider adding an integration/E2E test that simulates RBAC propagation if feasible (optional).

Contributors

  • Contributors section includes @ccastrotrejo. This is fine. (If any PMs/designers helped, consider calling them out — optional.)

⚠️ Screenshots/Videos

  • No screenshots/videos provided.
  • Assessment: The PR includes UI changes (new spinner/UX state). It's recommended (not required) to add a quick screenshot or GIF of the new propagation/assigning spinner and the retry state so reviewers and designers can validate visual/UX aspects quickly.

Summary Table

Section Status Recommendation
Title Keep as-is.
Commit Type Correct selection (fix).
Risk Level Matches diff; risk:medium appropriate.
What & Why Good explanation and HAR evidence included.
Impact of Change Clear; consider adding note about telemetry.
Test Plan Unit tests added; consider E2E/integration test.
Contributors OK; add more credits if applicable.
Screenshots/Videos ⚠️ Add screenshot/GIF of new UI for reviewer ease.

Final notes & recommended small improvements

  • The PR body and title pass the template checks — nice work.
  • A couple of suggested (non-blocking) items to consider addressing in follow-ups or before merge:
    • Add a small integration/E2E test (if practical) that simulates RBAC propagation delays so the extended retry behavior and gating are exercised end-to-end. If E2E is not possible, add a short note in the Test Plan explaining why.
    • Consider adding lightweight telemetry/metric events around: role-assignment attempts/success/failure, number of extended retries used, and when retries are exhausted. This aids operations if customers encounter propagation issues in the wild.
    • Provide a screenshot or short GIF showing the "Assigning permissions..." spinner and the exhausted/retry UI state so product/design reviewers can validate the UX.
    • (Optional) Add a short note in the PR or code comments clarifying that when RBAC assignment fails the gating will move forward (failed state treated as ready). This is okay, but clarifying intent helps future maintenance.

Please update/add the optional items above if you'd like, otherwise this PR is good to merge from a PR title/body compliance perspective. Thank you for the thorough description and tests!


Last updated: Fri, 20 Mar 2026 21:19:05 GMT

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race between Foundry proxy queries and RBAC role assignment/propagation in the Designer parameters tab by gating proxy queries until RBAC assignment completes and extending retries for auth/permission errors that commonly occur during Azure RBAC propagation.

Changes:

  • Add RBAC assignment status tracking in ParametersTab and gate Foundry proxy queries until RBAC is ready (plus propagation UI while waiting).
  • Introduce isFoundryAuthError() and extend React Query retry/backoff behavior for Foundry proxy calls on 401/403-like errors.
  • Add unit tests validating isFoundryAuthError() against common/real-world error shapes.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
libs/designer/src/lib/ui/panel/nodeDetailsPanel/tabs/parametersTab/index.tsx Adds RBAC state machine + query gating + propagation loading UI for Foundry agent connections.
libs/designer/src/lib/ui/panel/connectionsPanel/createConnection/custom/useCognitiveService.ts Adds isFoundryAuthError() and RBAC-aware retry/backoff + rbacReady gate parameter for Foundry queries.
libs/designer/src/lib/ui/panel/connectionsPanel/createConnection/custom/test/foundryRetry.spec.ts New Vitest suite covering isFoundryAuthError() behavior.
libs/designer-v2/src/lib/ui/panel/nodeDetailsPanel/tabs/parametersTab/index.tsx Mirrors RBAC gating + propagation UI changes in Designer v2.
libs/designer-v2/src/lib/ui/panel/connectionsPanel/createConnection/custom/useCognitiveService.ts Mirrors isFoundryAuthError() + retry/backoff + rbacReady gate in Designer v2.

Comment on lines +443 to +452
.then((missingRoles) => {
if (missingRoles.length === 0) {
setFoundryRbacStatus('not-needed');
return Promise.resolve();
}
return Promise.all(missingRoles.map((role) => RoleService().addAppRoleAssignmentForResource(targetResourceId, role.id))).then(
() => {
setFoundryRbacStatus('assigned');
}
);
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RBAC assignment promise chain can update foundryRbacStatus for a stale foundryAccountResourceId if the connection/resource changes while the request is in flight. catch guards against this, but the .then path sets state unconditionally, which can incorrectly mark RBAC as assigned/not-needed and enable proxy queries for the wrong resource. Add a guard in the .then path (similar to the catch) to ensure rbacAssignedResourceRef.current === targetResourceId before calling setFoundryRbacStatus(...) (and consider a cleanup flag to avoid setState after unmount).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 94a8fa4 — added rbacAssignedResourceRef guard in both .then() paths to match the existing .catch() guard. Both v1 and v2 updated.

Comment on lines +1519 to +1524
size="tiny"
label={
foundryRbacStatus === 'assigning'
? 'Assigning permissions for Foundry access...'
: 'Setting up permissions for Foundry access. This may take up to a minute...'
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new spinner labels are user-facing strings, but this file otherwise uses react-intl for UI text. Please move these messages to intl.formatMessage(...) (and add loc IDs) so they get localized consistently.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 94a8fa4 — all user-facing strings now use intl.formatMessage() with loc IDs (fRbac1-fRbac5). Added useIntl() to ParameterSection.

Comment on lines +442 to +451
.then((missingRoles) => {
if (missingRoles.length === 0) {
setFoundryRbacStatus('not-needed');
return Promise.resolve();
}
return Promise.all(missingRoles.map((role) => RoleService().addAppRoleAssignmentForResource(targetResourceId, role.id))).then(
() => {
setFoundryRbacStatus('assigned');
}
);
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RBAC assignment promise chain can update foundryRbacStatus for a stale foundryAccountResourceId if the connection/resource changes while the request is in flight. catch guards against this, but the .then path sets state unconditionally, which can incorrectly mark RBAC as assigned/not-needed and enable proxy queries for the wrong resource. Add a guard in the .then path (similar to the catch) to ensure rbacAssignedResourceRef.current === targetResourceId before calling setFoundryRbacStatus(...) (and consider a cleanup flag to avoid setState after unmount).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 94a8fa4 — same stale-resource guard applied to v1 .then() path.

Comment on lines +1491 to +1496
size="tiny"
label={
foundryRbacStatus === 'assigning'
? 'Assigning permissions for Foundry access...'
: 'Setting up permissions for Foundry access. This may take up to a minute...'
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new spinner labels are user-facing strings, but this file otherwise uses react-intl for UI text. Please move these messages to intl.formatMessage(...) (and add loc IDs) so they get localized consistently.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 94a8fa4 — same i18n fix applied to v1. All spinner labels, button text, and aria-labels use intl.formatMessage().

@ccastrotrejo ccastrotrejo added the risk:medium Medium risk change with potential impact label Mar 20, 2026
When RBAC propagation retries exhaust (~3min), the UI now transitions
from the spinner to a clear message with a 'Retry now' button instead
of sitting on the spinner forever.

Three states now distinguished:
- assigning: 'Assigning permissions for Foundry access...' (spinner)
- propagating: 'Setting up permissions... may take up to a minute' (spinner)
- exhausted: 'Permissions are still propagating' + Retry button

The retry button calls refetch() to restart the query with fresh retries.
Uses role='alert' and aria-live='polite' for screen reader announcements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link

github-actions bot commented Mar 20, 2026

📊 Coverage Check

The following changed files need attention:

⚠️ libs/designer-v2/src/lib/ui/panel/connectionsPanel/createConnection/custom/useCognitiveService.ts - 17% covered (needs improvement)
⚠️ libs/designer/src/lib/ui/panel/connectionsPanel/createConnection/custom/useCognitiveService.ts - 25% covered (needs improvement)

Please add tests for the uncovered files before merging.

ccastrotrejo and others added 2 commits March 20, 2026 16:36
Address Copilot review feedback:

1. Add rbacAssignedResourceRef guard in .then() path to prevent setting
   foundryRbacStatus for a stale resource if the user switches Foundry
   connections while RBAC assignment is in flight. Previously only the
   .catch() path had this guard.

2. Replace hardcoded English spinner/button labels with
   intl.formatMessage() calls for localization consistency. Adds
   useIntl() to ParameterSection component (both v1 and v2).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes 8 audit findings from React performance, accessibility, and code
quality reviews:

P1 (Perf): Hoist EMPTY_PARAM_GROUPS to module constant — prevents
    useSelector from returning a new {} reference on every render.
P2 (Perf): Add cancelled flag to RBAC useEffect cleanup — prevents
    setState after component unmount.
P4 (Perf): Fix retry window comment (actual is ~5min, not ~3min).
A1 (A11y): Replace role='alert' + aria-live='polite' with role='status'
    — fixes contradictory ARIA semantics.
A4 (A11y): Auto-focus retry button via ref when retries exhaust.
A5 (A11y): Use tokens.colorNeutralForeground2 instead of raw CSS var.
C5 (Quality): Log errors in .catch() instead of swallowing silently.
C9 (Quality): Move FoundryRbacStatus type to module scope.

All changes mirrored in both designer and designer-v2.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-validated risk:medium Medium risk change with potential impact

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants