feat(AGX1-274): record task creator identity and FGAC migration safety#246

Draft
asherfink wants to merge 2 commits into main from asher.fink/agx1-274-task-dual-write


asherfink commented May 21, 2026

Related work

Parent epic: AGX1-264 — per-task FGAC. Follow-ups bundled in AGX1-291.

This change is part of a 5-PR stack across 3 repos. Merge order: scaleapi/scaleapi#144783 (release sgp-authz 0.7.1) → scaleapi/agentex#353 → scaleapi/agentex#356 → this PR → #249.

| Repo | PR | Purpose |
| --- | --- | --- |
| scaleapi/scaleapi | scaleapi/scaleapi#144783 | sgp-authz 0.7.1: Action.CANCEL |
| scaleapi/agentex | scaleapi/agentex#353 | agentex-auth per-account routing + cancel op |
| scaleapi/agentex | scaleapi/agentex#356 | agentex-auth register_resource API + cancel cleanup |
| scaleapi/scale-agentex | #246 (this PR) | task creator audit columns + FGAC dual-write + flag |
| scaleapi/scale-agentex | #249 | per-RPC operation rewire + 404/403 wrap |

Two commits. Keep them separate during review: the audit-column schema change is independent of the dual-write call sites.

Summary

Commit 1 — passive audit columns:

  • Adds creator_user_id / creator_service_account_id columns to the tasks table, populated from the request principal on AgentTaskService.create_task. Best-effort (NULLable; see caveat below).
  • Adds a CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL)) to enforce at-most-one creator type at the DB layer (constraint name: ck_tasks_at_most_one_creator).
  • Adds partial indexes ix_tasks_creator_user_id and ix_tasks_creator_service_account_id (CREATE INDEX CONCURRENTLY) for future "tasks created by X" lookups.
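
To make the at-most-one predicate concrete, the sketch below runs the same CHECK expression against an in-memory SQLite table. SQLite stands in for Postgres only so the example runs without a database server. The `id` column, the TEXT column types, and the `try_insert` helper are assumptions for illustration; only the two creator columns, the constraint name, and the predicate come from this PR.

```python
import sqlite3

# Illustrative stand-in for the tasks table. Only the creator columns, the
# constraint name, and the CHECK predicate are taken from the PR description.
DDL = """
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    creator_user_id TEXT,
    creator_service_account_id TEXT,
    CONSTRAINT ck_tasks_at_most_one_creator
        CHECK ((creator_user_id IS NULL) OR (creator_service_account_id IS NULL))
)
"""


def try_insert(conn, task_id, user_id, service_account_id):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO tasks VALUES (?, ?, ?)",
                (task_id, user_id, service_account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

results = {
    "user_only": try_insert(conn, "t1", "user-1", None),
    "service_account_only": try_insert(conn, "t2", None, "sa-1"),
    "both_null": try_insert(conn, "t3", None, None),
    "both_set": try_insert(conn, "t4", "user-1", "sa-1"),
}
```

Only the `both_set` row is rejected. The both-NULL row is accepted, which is what allows creator attribution to stay best-effort.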

Commit 2 — FGAC dual-write call sites + flag:

  • Adds an FGAC_TASKS_DUAL_WRITE env-var flag, injected into AgentTaskService via FastAPI DI. Gates the dual-write behavior end-to-end.
  • create_task calls register_resource(task, parent_resource=agent) on the authorization service after the Postgres row is persisted, so the task is registered with tenant + owner + parent_agent tuples atomically (via scaleapi/agentex#356's new endpoint).
  • delete_task calls deregister_resource(task) after the Postgres delete. On the delete-by-name route it resolves the task id before deleting, so the post-delete deregister does not depend on looking up a row that is already gone.
  • Both call sites share a _dual_write_with_retry(op_name, do_call, task_id) helper. Retries AuthenticationServiceUnavailableError / AuthenticationGatewayError with exponential backoff + jitter (3 retries → 4 total attempts max), mirroring AgentsACPUseCase.grant_with_retry. Non-transient exceptions are not retried.
  • Emits Datadog metrics (task_fgac_dual_write.attempt|success|retry|failure) tagged with op:register|deregister and exception_class:<name> on failure — these are the rollout signal for AGX1-291's operator runbook.
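
The retry policy described above can be sketched roughly as follows. This is a synchronous simplification, not the helper from the PR: the two exception classes are local stand-ins for the real ones, `emit` is a hypothetical callback in place of a Datadog client, the `sleep` parameter exists only to make the sketch testable, and the delay constants and full-jitter formula are assumptions.

```python
import logging
import random
import time

logger = logging.getLogger(__name__)


class AuthenticationServiceUnavailableError(Exception):
    """Local stand-in for the transient error type named in this PR."""


class AuthenticationGatewayError(Exception):
    """Local stand-in for the transient error type named in this PR."""


TRANSIENT = (AuthenticationServiceUnavailableError, AuthenticationGatewayError)
MAX_RETRIES = 3  # 3 retries, so at most 4 attempts in total


def dual_write_with_retry(op_name, do_call, task_id, emit,
                          sleep=time.sleep, base_delay=0.1, max_delay=2.0):
    """Run do_call(), retrying transient auth errors with backoff and jitter."""
    tags = [f"op:{op_name}"]
    for attempt in range(MAX_RETRIES + 1):
        emit("task_fgac_dual_write.attempt", tags)
        try:
            result = do_call()
        except TRANSIENT as exc:
            if attempt == MAX_RETRIES:
                # Retries exhausted: record the failure and propagate.
                failure_tags = tags + [f"exception_class:{type(exc).__name__}"]
                emit("task_fgac_dual_write.failure", failure_tags)
                raise
            emit("task_fgac_dual_write.retry", tags)
            logger.warning("retrying %s for task_id=%s", op_name, task_id)
            # Exponential backoff capped at max_delay, with full jitter.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
        except Exception as exc:
            # Non-transient errors are reported once and never retried.
            failure_tags = tags + [f"exception_class:{type(exc).__name__}"]
            emit("task_fgac_dual_write.failure", failure_tags)
            raise
        else:
            emit("task_fgac_dual_write.success", tags)
            return result
```

The task id goes to the log line and not into the metric tags, on the assumption that per-task tags would be too high-cardinality for a rollout dashboard.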

Migration safety

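  • ALTER TABLE ... ADD CONSTRAINT ... NOT VALID followed by ALTER TABLE ... VALIDATE CONSTRAINT. Splitting the operation means the ACCESS EXCLUSIVE lock taken by ADD CONSTRAINT is held only briefly rather than for the duration of a scan over existing rows; VALIDATE CONSTRAINT performs that scan under a SHARE UPDATE EXCLUSIVE lock, which does not block reads or writes. tasks is high-write; a CHECK addition without NOT VALID would hold ACCESS EXCLUSIVE for the whole scan, blocking readers and writers until it finished.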
  • Indexes created CONCURRENTLY in an autocommit_block.
  • Migration revision: a1f73ada66c5 (add_task_creator_columns). down_revision is 6c942325c828 (adding_task_cleaned_at, the current alembic head on main); migration_history.txt regenerated via alembic history. The ORM-side CheckConstraint in orm.py matches the DB-side (same constraint name + predicate).
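
A rough sketch of how the upgrade path might be ordered is shown below. It is not the migration file from this PR. `upgrade` takes the Alembic `op` object as a parameter so the sketch runs without Alembic installed (a real migration would import it from `alembic`). The VARCHAR column type, the partial-index predicates, and the choice to run VALIDATE CONSTRAINT inside the autocommit block are all assumptions.

```python
TABLE = "tasks"
CONSTRAINT = "ck_tasks_at_most_one_creator"
PREDICATE = "(creator_user_id IS NULL) OR (creator_service_account_id IS NULL)"
CREATOR_COLUMNS = ("creator_user_id", "creator_service_account_id")


def upgrade(op):
    """Sketch of the upgrade path; in a real migration `op` is `alembic.op`."""
    # Nullable columns with no default are a catalog-only change in Postgres.
    for column in CREATOR_COLUMNS:
        # VARCHAR is an assumed column type.
        op.execute(f"ALTER TABLE {TABLE} ADD COLUMN {column} VARCHAR")

    # NOT VALID skips the scan of existing rows, so the ACCESS EXCLUSIVE lock
    # is needed only long enough to record the constraint.
    op.execute(
        f"ALTER TABLE {TABLE} ADD CONSTRAINT {CONSTRAINT} "
        f"CHECK ({PREDICATE}) NOT VALID"
    )

    # autocommit_block() commits the open transaction first, which releases
    # the lock above before the validation scan and the index builds start.
    with op.get_context().autocommit_block():
        op.execute(f"ALTER TABLE {TABLE} VALIDATE CONSTRAINT {CONSTRAINT}")
        for column in CREATOR_COLUMNS:
            # The WHERE predicate of each partial index is an assumption.
            op.execute(
                f"CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_{TABLE}_{column} "
                f"ON {TABLE} ({column}) WHERE {column} IS NOT NULL"
            )
```

Running the validation outside the transaction that added the constraint matters because Postgres holds locks until the transaction ends. If both statements ran in one transaction, the ACCESS EXCLUSIVE lock would still be held during the scan.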

Rollout

  • Flag-off (default): no behavior change. Audit columns populate but no FGAC tuples are written. Safe to merge and deploy.
  • Flag-on: register_resource and deregister_resource fire on create/delete. If they fail after retries, the Postgres row is still the durable record — orphan auth tuples can be cleaned up out of band per the AGX1-291 operator runbook using the creator-audit columns to identify them.
  • Operator rollout assumes a redeploy cycles pods; the flag is read once at DI-resolve time, so mid-process flips are intentionally invisible.
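
The read-once behavior can be illustrated with a small sketch. `read_flag`, the accepted truthy strings, and `TaskServiceSketch` are hypothetical stand-ins; the actual service receives the value through FastAPI DI as described above.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}  # assumed parsing rule


def read_flag(name, environ=os.environ):
    """Parse a boolean flag from an environment mapping."""
    return environ.get(name, "").strip().lower() in TRUTHY


class TaskServiceSketch:
    """Hypothetical service that captures the flag value at construction."""

    def __init__(self, dual_write_enabled: bool):
        self._dual_write_enabled = dual_write_enabled
        self.registered = []  # stands in for calls to the authorization service

    def create_task(self, task_id):
        if self._dual_write_enabled:
            self.registered.append(task_id)
        return task_id


env = {"FGAC_TASKS_DUAL_WRITE": "true"}
service = TaskServiceSketch(read_flag("FGAC_TASKS_DUAL_WRITE", env))

# Flipping the variable after construction does not affect this instance.
env["FGAC_TASKS_DUAL_WRITE"] = "false"
service.create_task("task-1")
```

Because the boolean is captured in the constructor, the existing instance keeps registering tasks until the process restarts and the flag is read again.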

Audit-trail caveat

Creator attribution is best-effort: tasks created outside an HTTP request context (Temporal activities, background workers, any path that constructs AgentTaskService without request.state.principal_context) leave both columns NULL. The CHECK constraint allows both-NULL, and test_no_resolvable_creator_leaves_both_columns_null exercises this path.
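
A minimal sketch of the best-effort resolution follows. The principal field names `user_id` and `service_account_id`, the `PrincipalModel` stand-in, and the tie-break when both are present are assumptions; the PR only states that the helper handles dict-shaped and pydantic-shaped principals and that a missing principal leaves both columns NULL.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class PrincipalModel:
    """Attribute-style stand-in for a pydantic principal object."""
    user_id: Optional[str] = None
    service_account_id: Optional[str] = None


def principal_field(principal: Any, name: str) -> Optional[str]:
    """Read a field from a dict-shaped or attribute-shaped principal."""
    if principal is None:
        return None
    if isinstance(principal, dict):
        return principal.get(name)
    return getattr(principal, name, None)


def resolve_creator(principal: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (creator_user_id, creator_service_account_id), best-effort."""
    user_id = principal_field(principal, "user_id")
    service_account_id = principal_field(principal, "service_account_id")
    if user_id and service_account_id:
        # Assumed tie-break so the at-most-one CHECK cannot be violated.
        service_account_id = None
    return user_id or None, service_account_id or None
```

With no principal context, for example in a background worker, `resolve_creator(None)` yields `(None, None)`, which the CHECK constraint accepts.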

What changed

  • database/migrations/alembic/versions/2026_05_21_1508_add_task_creator_columns_a1f73ada66c5.py (new): NOT VALID-pattern migration. down_revision = "6c942325c828".
  • src/adapters/orm.py: declarative CheckConstraint mirroring the DB constraint.
  • src/domain/entities/tasks.py: new optional fields on TaskEntity.
  • src/domain/services/task_service.py:
    • _principal_field helper (handles dict-vs-pydantic principal shape from the authn proxy).
    • create_task reads creator_user_id / creator_service_account_id from principal context.
    • AgentTaskService.__init__ takes dual_write_enabled: DEnvironmentVariable(EnvVarKeys.FGAC_TASKS_DUAL_WRITE).
    • _dual_write_with_retry(op_name, do_call, task_id) keyed by op name; reused from both call sites.
  • src/adapters/authorization/adapter_agentex_authz_proxy.py: forwards to agentex-auth's /v1/authz/register and /deregister.
  • src/config/environment_variables.py: new FGAC_TASKS_DUAL_WRITE key.
  • Tests:
    • test_task_audit_columns.py — testcontainers Postgres integration tests for the audit columns (creator population, mutual-exclusion CHECK, both-NULL allowed).
    • test_task_fgac_dual_write.py — covers register-on-create, deregister-on-delete, flag-off skip, transient retry-and-succeed (both register and deregister sides), retry exhaustion propagating with the Postgres row preserved, and the name-route ItemDoesNotExist swallow.
    • Existing unit/integration tests updated for the new dual_write_enabled constructor parameter.

Test plan

  • migration_lint.py — clean.
  • Ruff + ruff-format + alembic migration-safety lint clean (pre-commit hooks).
  • test_task_audit_columns.py — 7/7 pass locally via testcontainers.
  • test_task_fgac_dual_write.py — collects cleanly; runs in CI integration suite.
  • Manual: deploy to staging with flag off, confirm \d tasks shows new columns + constraint + indexes; flip flag on for one account, confirm task_fgac_dual_write.success fires.

asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 13fe4b2 to 7486e5a May 26, 2026 20:22
asherfink changed the title from "feat(AGX1-274): dual-write tasks to spark-authz behind FGAC_TASKS_DUAL_WRITE flag" to "feat(AGX1-274): record task creator identity and FGAC migration safety" May 26, 2026
asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from 7486e5a to b9cb26b May 26, 2026 20:56
asherfink added 2 commits May 27, 2026 17:08
…n creation

Adds two nullable creator-audit columns to the tasks table — creator_user_id
and creator_service_account_id — populated from the principal context at
create time. A CHECK constraint (ck_tasks_one_creator) enforces that at most
one is set.

This replaces the earlier dual-write draft: grants are already issued
unconditionally via grant_with_retry in agents_acp_use_case.py:239, and
per-account rollout routing belongs in agentex-auth (private), not in this
public Apache-2.0 codebase.
asherfink force-pushed the asher.fink/agx1-274-task-dual-write branch from ad1e980 to 3a06be8 May 27, 2026 21:15