
[scheduler-extender] v2 refactor: reservation API, cache, RBAC, space accounting, agent fixes#193

Merged
AleksZimin merged 18 commits into add-common-scheduler-extender-v2 from add-common-scheduler-extender-v2-refactor on Mar 16, 2026

Conversation


@AleksZimin AleksZimin commented Feb 9, 2026

Description

This PR refines the SDS common scheduler extender and related agent/module pieces on top of add-common-scheduler-extender-v2. Main changes:

Scheduler extender (images/sds-common-scheduler-extender/)

  • Evolved HTTP API: from /api/v1/volumes/* to /v1/lvg/filter-and-score and /v1/lvg/narrow-reservation with matching request/response types (FilterAndScore*, NarrowReservation*, LVMVolumeGroupInput, etc.).
  • Cache refactor: replaced nested LVG-in-cache structure with a pure reservation store (reservation ID → pools + TTL); LVG data comes from the controller-runtime client/informer. Unified StoragePoolKey, TTL cleanup, lazy TTL in reads (removed redundant pools map).
  • Scheduling logic: filter creates reservations across candidate pools; prioritize narrows reservations; PVC/LLV watchers call NarrowReservation / RemoveReservation on bind and lifecycle events; non-Ready LVGs excluded via isLVGSchedulable.
  • Capacity / races: available space uses an LLV-based formula with per-pool unaccounted-space calibration and LVG watcher recalibration; fixes thin-pool totalCapacity handling; reduces wrong free-space assumptions vs VGFree-only.
  • Ops / safety: request-scoped context with timeout in filter/prioritize handlers (avoids hanging on slow/blocked API); RBAC for ReplicatedStorageClass / ReplicatedStoragePool, moduleconfigs.deckhouse.io; newControlPlane module setting (replaces useLinstor) to choose extender vs LINSTOR for replicated PVCs; /stat extended with filter-and-score counters.
  • Tests: unit tests for cache, filter-and-score, narrow-reservation, helpers; linter cleanups (unparam, etc.).

Module / Kubernetes

  • Helm manifests for extender: Deployment, Service, webhook config, RBAC, Secret, ConfigMap; hook common-scheduler-extender-certs for TLS material.
  • OpenAPI values (openapi/values*.yaml) for new settings.
  • API: ReplicatedStorageClass / ReplicatedStoragePool types aligned with CRDs; registration updates.

Agent

  • Support for several thin pools per LVG, where applicable.
  • After lvcreate / lvextend / snapshot operations, use GetLV instead of waiting on scanner/cache to avoid extra requeues and busy-wait.
  • BlockDeviceFilter: sanitize label selectors so In/NotIn with nil/empty values no longer crash reconciliation.

Misc

  • .gitignore updates; optional debug helper under hack/ for watching resources vs pod logs.

Impact on the cluster: deploys/updates the scheduler extender workload and kube-scheduler webhook configuration; may cause scheduler extender pod restarts and webhook reloads during rollout. Does not by itself restart control-plane components beyond what a normal module upgrade does.

Why do we need it, and what problem does it solve?

The v2 extender stack needs a maintainable reservation model, correct free-space and thin-pool accounting under concurrent scheduling, and safe handler behavior (timeouts, RBAC, non-Ready LVGs). Without these, users can see hung filter/prioritize calls, wrong capacity decisions, stale reservations, or agent reconciliation crashes on malformed BlockDeviceFilter selectors.

This PR delivers those fixes and refactors on the existing v2 branch so replicated/local PVC scheduling stays correct and observable.

What is the expected result?

  • After enabling/upgrading the module: sds-common-scheduler-extender pods become Ready; MutatingWebhookConfiguration for the scheduler points to the extender service; no endless hangs in extender logs on List/Watch (RBAC fixed).
  • Scheduling local/replicated PVCs: extender returns filter/prioritize results within the request timeout; reservations align with chosen nodes; NotReady LVGs are not scored as usable pools.
  • Agent: fewer unnecessary requeues after LVM resize/create; BlockDeviceFilter with empty In/NotIn values no longer breaks the BD reconciliation loop.
  • ModuleConfig newControlPlane: when set to true, the extender participates in replicated PVC scheduling; when false or absent, behavior falls back to the LINSTOR-managed path per design.

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
- Add POST /api/v1/volumes/bind endpoint
- Add BindVolumeRequest/BindVolumeResponse API types
- Add LVGRef struct and RemoveVolumeReservationsExcept method to cache
- Implement bindVolume handler: decode request, validate, clear unselected
  LVG/thinpool reservations for the volume
- Add cache tests (thick, thin, multiple keep, empty keep, idempotency)
- Add handler tests (valid bind, validation errors, method not allowed)

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
…Name

- Remove Type field from VolumeInput (filter-prioritize and bind requests)
- Infer type: any LVG with thinPoolName → thin; all empty → thick
- Add consistency validation: reject mixed thinPoolName in LVGs
- RemoveVolumeReservationsExcept: drop volumeType param; infer from keep; when
  keep empty, remove from both thick and thin
- Add TestCache_RemoveVolumeReservationsExcept_EmptyKeep_RemovesFromBoth
- Remove invalid volume type test from bind_volume_test

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Simplify the scheduler-extender cache from a deeply nested structure
(lvgEntry -> thickByPVC/thinByPool -> pvcEntry/volumeEntry) to a clean
two-map reservation store (pools + reservations). LVG resources are no
longer stored in the custom cache; they are read from the
controller-runtime informer cache via client.Client.

Cache changes:
- Unified StoragePoolKey (LVGName + optional ThinPoolName) replaces
  separate thick/thin handling everywhere
- pools map: pre-calculated reservedSize per pool for O(1) lookups
- reservations map: reservationID -> size + set of pools + TTL
- Methods: AddReservation, RemoveReservation, NarrowReservation,
  GetReservedSpace, HasReservation, GetAllPools, GetAllReservations
- Background goroutine for TTL-based cleanup of expired reservations

API changes:
- New routes: /v1/lvg/filter-and-score, /v1/lvg/narrow-reservation
  (replace /api/v1/volumes/filter-prioritize and /api/v1/volumes/bind)
- New request/response types: FilterAndScoreRequest/Response,
  NarrowReservationRequest/Response, LVMVolumeGroupInput,
  ScoredLVMVolumeGroup

Scheduler logic changes:
- filter: collects StoragePoolKeys across all filtered nodes per PVC,
  creates one reservation with N pool keys via AddReservation
- prioritize: after scoring, calls NarrowReservation to release
  reservations on nodes filtered out by kube or other extenders
- Helper functions (getAvailableSpace, checkPoolHasSpace,
  calculatePoolScore) combine client.Client for LVG capacity with
  cache.GetReservedSpace for reservations

Controller changes:
- PVC watcher: on selectedNode -> NarrowReservation to node's pools;
  on bound/delete -> RemoveReservation
- LVG watcher: deleted entirely; informer is started by field indexer
  registration in main.go, stale data expires via TTL
- LLV watcher (new): watches LVMLogicalVolume; on Phase=Created or
  Delete -> RemoveReservation to prevent double-counting

Infrastructure:
- Field indexer on LVMVolumeGroup.Status.Nodes.Name for efficient
  node-to-LVG lookups via client.MatchingFields
- Simplified cache constructor: NewCache(logger, cleanupInterval)
- Removed PVCExpiredDurationSec config, CacheSize config

Files deleted: filter_prioritize.go, bind_volume.go, bind_volume_test.go,
  lvg_watcher_cache.go, lvg_watcher_cache_test.go
Files created: filter_and_score.go, narrow_reservation.go,
  llv_watcher_cache.go

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin force-pushed the add-common-scheduler-extender-v2-refactor branch from 931f375 to 9cccf0e on February 10, 2026 at 21:48
…hedulable check

Add isLVGSchedulable(*LVMVolumeGroup) that checks Status.Phase == Ready.
LVGs in NotReady, Terminating, etc. are excluded from:
- /v1/lvg/filter-and-score (via getAvailableSpace)
- /scheduler/filter (local and replicated PVCs)
- /scheduler/prioritize (local and replicated PVCs)

Integration points:
- getAvailableSpace(): return error if LVG not schedulable
- findMatchedSCLVG(): only consider schedulable LVGs when matching
- findLVGForNodeInRSP(): skip non-schedulable LVGs

Designed for future extension (e.g. Unschedulable field) in one place.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
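The gate itself is a one-liner; a minimal sketch, with the LVMVolumeGroup type reduced to the one field the check inspects:

```go
package main

import "fmt"

type LVMVolumeGroupStatus struct{ Phase string }
type LVMVolumeGroup struct{ Status LVMVolumeGroupStatus }

// isLVGSchedulable centralizes the readiness check so future conditions
// (e.g. an Unschedulable field) can be added in one place, as the
// commit message notes.
func isLVGSchedulable(lvg *LVMVolumeGroup) bool {
	return lvg.Status.Phase == "Ready"
}

func main() {
	fmt.Println(isLVGSchedulable(&LVMVolumeGroup{Status: LVMVolumeGroupStatus{Phase: "NotReady"}})) // false
}
```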
- Remove PoolEntry type and pools map from cache
- GetReservedSpace and GetAllPools now compute reserved size by iterating
  reservations and skipping expired entries (lazy TTL check)
- Expired reservations are effectively ignored immediately after TTL
  without waiting for the 30s cleanup ticker
- AddReservation, removeReservation, NarrowReservation simplified
- HasReservation and GetReservation return false for expired entries
- Add Expired field to ReservationInfo; GetAllReservations returns all
  entries including expired for debug visibility
- Debug endpoints: mark expired reservations with [EXPIRED] in /cache,
  show active/expired counts in /stat
- Add tests: TestGetReservedSpace_SkipsExpired,
  TestGetAllPools_SkipsExpired, TestGetAllReservations_MarksExpired

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin force-pushed the add-common-scheduler-extender-v2-refactor branch from b7372b3 to 231dd6a on February 10, 2026 at 22:58
…am, whitespace)

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin force-pushed the add-common-scheduler-extender-v2-refactor branch from 231dd6a to 4013d50 on February 10, 2026 at 23:02
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin force-pushed the add-common-scheduler-extender-v2-refactor branch from f14e2e0 to 265ed41 on February 10, 2026 at 23:31
- Add handler_test_helpers_test.go: shared test helpers (newTestScheduler,
  readyLVG, readyLVGWithThinPool, notReadyLVG, newTestCache, newFakeClient)
- Add filter_and_score_test.go: 12 tests for validation, filtering, scoring,
  cache, NotReady LVGs, thin pools, and idempotent reservation replace
- Add narrow_reservation_test.go: 8 tests for validation, narrowing,
  non-existent reservation, and cache state verification

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
After lvcreate and lvextend, the in-memory cache contains stale data
until the scanner runs. This caused unnecessary 5s requeues and
blocking busy-wait loops.

- LLV create: replace getLVActualSize with commands.GetLV after lvcreate
- LLV resize: replace getLVActualSize with commands.GetLV after lvextend
- LLV extender: replace FindLV busy-wait loop with GetLV after lvextend
- LLVS snapshot: use GetLV after CreateThinLogicalVolumeSnapshot instead
  of requeueing for cache discovery

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@dmgtn dmgtn force-pushed the add-common-scheduler-extender-v2-refactor branch from a75b782 to a2511e8 on March 8, 2026 at 22:56
@AleksZimin AleksZimin self-assigned this Mar 12, 2026
- Replace getUseLinstor with getNewControlPlane (inverted semantics):
  newControlPlane=true means the extender handles replicated PVC scheduling,
  false/absent means LINSTOR manages it.
- Add RBAC permissions for moduleconfigs.deckhouse.io to fix
  "cannot list resource" error in reflector logs.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
BlockDeviceFilter resources with In/NotIn matchExpressions and nil/empty
values caused metav1.LabelSelectorAsSelector to fail, breaking the
entire block device reconciliation loop.

Add sanitizeLabelSelector() that drops such vacuous expressions before
parsing. Add tests for nil values, empty values, and mixed cases.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
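The sanitizer can be sketched as below. The real sanitizeLabelSelector operates on metav1.LabelSelector; local stand-in types are used here so the sketch runs without Kubernetes dependencies:

```go
package main

import "fmt"

// Requirement and LabelSelector are simplified stand-ins for
// metav1.LabelSelectorRequirement / metav1.LabelSelector.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists", "DoesNotExist"
	Values   []string
}

type LabelSelector struct {
	MatchExpressions []Requirement
}

// sanitizeLabelSelector drops In/NotIn requirements with nil or empty
// Values — the vacuous expressions that made LabelSelectorAsSelector
// fail and abort the whole block device reconciliation loop.
func sanitizeLabelSelector(sel *LabelSelector) *LabelSelector {
	out := &LabelSelector{}
	for _, req := range sel.MatchExpressions {
		if (req.Operator == "In" || req.Operator == "NotIn") && len(req.Values) == 0 {
			continue // skip instead of failing the parse
		}
		out.MatchExpressions = append(out.MatchExpressions, req)
	}
	return out
}

func main() {
	sel := &LabelSelector{MatchExpressions: []Requirement{
		{Key: "a", Operator: "In"},                           // dropped: no values
		{Key: "b", Operator: "Exists"},                       // kept
		{Key: "c", Operator: "NotIn", Values: []string{"x"}}, // kept
	}}
	fmt.Println(len(sanitizeLabelSelector(sel).MatchExpressions)) // 2
}
```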
…oragePool and add debug tool

Add missing get/list/watch permissions for replicatedstorageclasses and
replicatedstoragepools to the scheduler-extender ClusterRole. Without them
the controller-runtime cached client blocks forever on informer sync,
hanging filter/prioritize handlers.

Add hack/debug.go — a standalone diagnostic tool that watches Kubernetes
resources via kubectl and prints colored diffs interleaved with pod logs.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
…r and prioritize handlers

Replace s.ctx (global) with a 4s timeout context derived from r.Context()
so that blocked API calls (e.g. informer waiting for RBAC-denied List/Watch)
return an error instead of hanging the handler goroutine indefinitely.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin force-pushed the add-common-scheduler-extender-v2-refactor branch from 46567d2 to d825918 on March 13, 2026 at 13:45
…ffset approach

Replace stale VGFree-only computation with LLV-based available space
formula: min(totalCapacity - sumAllLLV - unaccountedSpace, reportedFree) - reserved.

Key changes:
- Add per-pool unaccounted space offset to reservation cache
- Add sumLLVSpace helper to sum spec.size for all LLVs on a storage pool
- Add CalibratePoolUnaccountedSpace to compute non-LLV volume offset
- Add LVG watcher controller to recalibrate on LVG status updates
- Register LLV field indexer (spec.lvmVolumeGroupName) for efficient queries
- Update getAvailableSpace to use min(llvBased, reportedFree) as safety net
- Add 16 unit tests covering thick/thin pools, calibration, and edge cases

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
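A numeric sketch of the formula from this commit; inputs are plain int64 bytes here, whereas in the extender they come from the LVG status, the LLV field index, and the reservation cache:

```go
package main

import "fmt"

// availableSpace implements the commit's formula:
//   min(totalCapacity - sumAllLLV - unaccounted, reportedFree) - reserved
func availableSpace(totalCapacity, sumAllLLV, unaccounted, reportedFree, reserved int64) int64 {
	llvBased := totalCapacity - sumAllLLV - unaccounted
	if reportedFree < llvBased {
		llvBased = reportedFree // safety net: never exceed what LVM reports
	}
	return llvBased - reserved
}

func main() {
	// 100 GiB pool, 30 GiB in LLVs, 5 GiB unaccounted, LVM reports 60 GiB
	// free, 10 GiB reserved by in-flight scheduling.
	g := int64(1 << 30)
	fmt.Println(availableSpace(100*g, 30*g, 5*g, 60*g, 10*g) / g) // 50
}
```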
…thin pool tests

totalCapacity for thin pools was incorrectly set to AllocatedSize (space
already handed out to thin LVs), which is 0 for empty pools. This caused
all filter requests to reject every node with "not enough space".

Fix: totalCapacity = AllocatedSize + AvailableSpace (full overprovisioned
capacity of the thin pool, analogous to VGSize for thick pools).

Add 7 thin pool unit tests covering empty pool, in-flight LLVs,
reservations, unaccounted space, and calibration scenarios.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
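The fix itself reduces to one sum; a minimal sketch, with field values passed as plain integers:

```go
package main

import "fmt"

// thinPoolTotalCapacity reflects the corrected formula: AllocatedSize
// alone is 0 for an empty pool (rejecting every node), so the fix adds
// the remaining AvailableSpace to get the full overprovisioned capacity,
// analogous to VGSize for thick pools.
func thinPoolTotalCapacity(allocatedSize, availableSpace int64) int64 {
	return allocatedSize + availableSpace
}

func main() {
	// Empty pool: nothing allocated yet, 40 GiB available.
	g := int64(1 << 30)
	fmt.Println(thinPoolTotalCapacity(0, 40*g) / g) // 40, not 0 as before the fix
}
```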
…lution

RSC spec.storagePool is now deprecated; the RSP name is stored in
status.storagePoolName by the sds-replicated-volume RSC controller.
Update RSC type to include spec.storage, status.storagePoolName, and
add GetStoragePoolName() helper. Update RSP type to include
status.eligibleNodes. Fix all extender code to resolve RSP name via
GetStoragePoolName() instead of the empty spec.storagePool field.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
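The helper can be sketched as below. The preference for status.storagePoolName follows this commit; the fallback to the deprecated spec field and the struct layout are illustrative assumptions:

```go
package main

import "fmt"

// ReplicatedStorageClass is a simplified stand-in for the real CRD type.
type ReplicatedStorageClass struct {
	Spec   struct{ StoragePool string }
	Status struct{ StoragePoolName string }
}

// GetStoragePoolName resolves the RSP name from status.storagePoolName
// (set by the sds-replicated-volume RSC controller), falling back to the
// deprecated spec field for older objects (assumed behavior).
func (rsc *ReplicatedStorageClass) GetStoragePoolName() string {
	if rsc.Status.StoragePoolName != "" {
		return rsc.Status.StoragePoolName
	}
	return rsc.Spec.StoragePool
}

func main() {
	var rsc ReplicatedStorageClass
	rsc.Status.StoragePoolName = "rsp-1"
	fmt.Println(rsc.GetStoragePoolName()) // rsp-1
}
```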
Remove always-constant parameters from readyLVGWithThinPool (vgSize)
and thinLLV (lvgName), hardcoding the values inside the helpers.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
@AleksZimin AleksZimin force-pushed the add-common-scheduler-extender-v2-refactor branch from 05adc48 to 031f3b8 on March 16, 2026 at 10:15
@AleksZimin AleksZimin merged commit b05ab4d into add-common-scheduler-extender-v2 Mar 16, 2026
10 of 11 checks passed
@AleksZimin AleksZimin deleted the add-common-scheduler-extender-v2-refactor branch March 16, 2026 11:57
@duckhawk duckhawk changed the title from "add-common-scheduler-extender-v2 refactor" to "[scheduler-extender] v2 refactor: reservation API, cache, RBAC, space accounting, agent fixes" on Mar 19, 2026