
LOADTEST: Batch 2k host uuids, for profile reconciler. #45574

Draft
MagnusHJensen wants to merge 3 commits into main from claude/optimize-apple-reconciler-batching-qOHb4

@MagnusHJensen

Mirrors the Windows MDM reconciler pattern: each 30s tick picks up to 2000 distinct hosts with pending install/remove work, ordered by host_uuid ascending. The cursor is persisted in Redis (via the mysqlredis wrapper) so multi-tick passes converge over time without running the unbounded reconciliation in a single pass.

Intended for load testing: bounds per-tick write pressure on host_mdm_apple_profiles and the nano command queue, which the current all-at-once reconciler can spike during bulk events (team transfer, profile add).
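
To illustrate the batching described above, here is a minimal in-memory sketch of cursor-based selection ordered by host UUID. It is not the PR's SQL: `nextBatch` and the sample UUIDs are hypothetical, and the rule that a short batch resets the cursor to "" is an assumption about how a pass wraps around.

```go
package main

import (
	"fmt"
	"sort"
)

// nextBatch returns up to limit host UUIDs strictly greater than cursor, in
// ascending order, plus the cursor to persist for the next tick.
// Assumption: a short (non-full) batch means the pass reached the end, so the
// cursor resets to "" and the next tick starts over from the beginning.
func nextBatch(pending []string, cursor string, limit int) (batch []string, next string) {
	sorted := append([]string(nil), pending...) // copy so the caller's slice is untouched
	sort.Strings(sorted)

	// Index of the first UUID that satisfies uuid > cursor.
	start := sort.Search(len(sorted), func(i int) bool { return sorted[i] > cursor })
	end := start + limit
	if end > len(sorted) {
		end = len(sorted)
	}
	batch = sorted[start:end]
	if limit > 0 && len(batch) == limit {
		next = batch[len(batch)-1] // full batch: more hosts may remain
	}
	return batch, next
}

func main() {
	pending := []string{"uuid-c", "uuid-a", "uuid-e", "uuid-b", "uuid-d"}
	cursor := ""
	for tick := 1; tick <= 3; tick++ {
		var batch []string
		batch, cursor = nextBatch(pending, cursor, 2)
		fmt.Printf("tick %d: batch=%v next=%q\n", tick, batch, cursor)
	}
}
```

With a batch size of 2 and five pending hosts, the third tick returns a short batch and an empty cursor, which is the wrap-around case.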

Key oddities mirrored from the Windows side:

  • Named return on ReconcileAppleProfiles so the deferred cursor write observes the actual exit error and skips advancement on failure.
  • Cursor write failures are logged-and-swallowed: a transient Redis blip should not poison the cron tick.
  • Steady-state silence: when the entire pending universe fits in one tick (cursor=="" and nextCursor==""), no cursor write or info log.
  • Bare mysql.Datastore returns "" for the cursor; only mysqlredis persists. Unit tests get the no-op stub.
  • Listing predicates filter out hosts whose state already matches desired state, so stale cursors and partial-failure retries converge idempotently.
  • Cursor predicate (h.uuid > ?) pushed into all 4 desired-state UNION arms (install + remove) for early per-branch filtering.

Test wiring: existing reconcile tests proxy the new scoped variant through the legacy ListMDMAppleProfilesToInstallAndRemove mock so subtest overrides flow through without per-subtest changes.

@codecov

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 64.76684% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.72%. Comparing base (4f8737e) to head (fc3fb53).
⚠️ Report is 71 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| server/datastore/mysql/apple_mdm.go | 80.68% | 15 Missing and 13 partials ⚠️ |
| server/service/apple_mdm.go | 25.80% | 17 Missing and 6 partials ⚠️ |
| server/datastore/mysqlredis/apple_recon_cursor.go | 0.00% | 17 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #45574      +/-   ##
==========================================
+ Coverage   66.71%   66.72%   +0.01%     
==========================================
  Files        2734     2741       +7     
  Lines      218824   219202     +378     
  Branches    10947    10947              
==========================================
+ Hits       145979   146261     +282     
- Misses      59626    59721      +95     
- Partials    13219    13220       +1     
| Flag | Coverage Δ |
|---|---|
| backend | 68.57% <64.76%> (+0.01%) ⬆️ |


The previous design ran the heavy desired-state UNION twice per tick
(once to find hosts-with-work, once to fetch payloads for that
window). For load testing we want to keep the original full
desired-state query — which has been SQL-optimized — but only invoke
it on a bounded host window.

New shape:
- Pass 1 (ListNextMDMAppleHostUUIDs): PK range scan on hosts.uuid +
  EXISTS lookup against nano_enrollments.device_id. No UNION, no
  desired-state derivation, no host_mdm_apple_profiles touch.
- Pass 2 (ListMDMAppleProfilesToInstallAndRemoveForHosts): unchanged.
  Runs the full optimized desired-state UNION scoped to the 5000-host
  window. Whichever subset actually has changes gets processed.
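
A rough in-memory sketch of the two-pass shape described above. Everything here is a hypothetical stand-in: `fakeStore`, `listWindow`, `changesFor`, and `tick` are not the real datastore methods, and the wrap-around rule (a short window resets the cursor) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// change is a hypothetical install/remove instruction for one host.
type change struct {
	HostUUID string
	Profile  string
	Op       string // "install" or "remove"
}

// fakeStore is an in-memory stand-in for the datastore, not the real schema.
type fakeStore struct {
	enrolled   []string            // enrolled host UUIDs
	pending    map[string][]change // desired-state delta per host
	heavyCalls int                 // how often the expensive query ran
}

// listWindow plays the role of pass 1: a cheap ordered range scan (uuid > cursor).
func (s *fakeStore) listWindow(cursor string, limit int) []string {
	sorted := append([]string(nil), s.enrolled...)
	sort.Strings(sorted)
	start := sort.Search(len(sorted), func(i int) bool { return sorted[i] > cursor })
	end := start + limit
	if end > len(sorted) {
		end = len(sorted)
	}
	return sorted[start:end]
}

// changesFor plays the role of pass 2: the expensive query, scoped to the window.
func (s *fakeStore) changesFor(window []string) []change {
	s.heavyCalls++
	var out []change
	for _, uuid := range window {
		out = append(out, s.pending[uuid]...)
	}
	return out
}

// tick runs both passes once and returns the work found plus the next cursor.
func tick(s *fakeStore, cursor string, limit int) (changes []change, next string) {
	window := s.listWindow(cursor, limit)
	if len(window) == 0 {
		return nil, "" // end of the universe: wrap around without the heavy query
	}
	changes = s.changesFor(window)
	if len(window) == limit {
		next = window[len(window)-1]
	}
	return changes, next
}

func main() {
	s := &fakeStore{
		enrolled: []string{"h1", "h2", "h3", "h4"},
		pending:  map[string][]change{"h3": {{HostUUID: "h3", Profile: "wifi", Op: "install"}}},
	}
	cursor := ""
	for i := 1; i <= 3; i++ {
		var changes []change
		changes, cursor = tick(s, cursor, 2)
		fmt.Printf("tick %d: %d change(s), next cursor %q\n", i, len(changes), cursor)
	}
	fmt.Println("heavy query calls:", s.heavyCalls)
}
```

In this sketch the first window contains no host with pending work, yet it still consumes a tick, which is the slot cost noted in the trade-offs.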

Trade-offs:
- Label drift / new-work detection latency is bounded by
  total_hosts/batch * tick_interval. At 5000 hosts/tick on a 30s
  tick that's <= 5 min for 50k hosts, <= 2.5 min for 25k.
- Hosts without pending work still consume a slot in their tick's
  batch. Fine because the scoped UNION on 5000 hosts is cheaper than
  the full UNION across the universe.
- The desired-state UNION is evaluated exactly once per tick.

Diverges from ListNextPendingMDMWindowsHostUUIDs, which pushes the
cursor predicate into the desired-state arms to filter to
hosts-with-work. A comment in the code explains why.

Bumps batchSize 2000 -> 5000.