[scheduler-extender] v2 refactor: reservation API, cache, RBAC, space accounting, agent fixes #193
Merged
AleksZimin merged 18 commits into add-common-scheduler-extender-v2 on Mar 16, 2026
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
- Add POST /api/v1/volumes/bind endpoint
- Add BindVolumeRequest/BindVolumeResponse API types
- Add LVGRef struct and RemoveVolumeReservationsExcept method to cache
- Implement bindVolume handler: decode request, validate, clear unselected LVG/thinpool reservations for the volume
- Add cache tests (thick, thin, multiple keep, empty keep, idempotency)
- Add handler tests (valid bind, validation errors, method not allowed)

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
…Name

- Remove Type field from VolumeInput (filter-prioritize and bind requests)
- Infer type: any LVG with thinPoolName → thin; all empty → thick
- Add consistency validation: reject mixed thinPoolName in LVGs
- RemoveVolumeReservationsExcept: drop volumeType param; infer from keep; when keep is empty, remove from both thick and thin
- Add TestCache_RemoveVolumeReservationsExcept_EmptyKeep_RemovesFromBoth
- Remove invalid volume type test from bind_volume_test

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
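The inference rule in this commit can be sketched as follows. This is a minimal illustration, not the extender's code: the `LVGRef` fields, the function name `inferVolumeType`, and the error value are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// LVGRef is a hypothetical stand-in for the per-LVG entry of a request.
type LVGRef struct {
	Name         string
	ThinPoolName string // empty means the LVG is used as thick
}

// errMixed is an illustrative error for inconsistent input.
var errMixed = errors.New("mixed thinPoolName: LVGs must be all thin or all thick")

// inferVolumeType returns "thin" when every LVG names a thin pool, "thick"
// when none does, and an error when the list mixes the two.
func inferVolumeType(lvgs []LVGRef) (string, error) {
	thin := 0
	for _, l := range lvgs {
		if l.ThinPoolName != "" {
			thin++
		}
	}
	switch {
	case thin == 0:
		return "thick", nil
	case thin == len(lvgs):
		return "thin", nil
	default:
		return "", errMixed
	}
}

func main() {
	volType, err := inferVolumeType([]LVGRef{{Name: "vg-a", ThinPoolName: "tp0"}})
	fmt.Println(volType, err)
}
```

Rejecting mixed input up front means later code can treat the whole request as one volume type without re-checking each LVG.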
Simplify the scheduler-extender cache from a deeply nested structure (lvgEntry -> thickByPVC/thinByPool -> pvcEntry/volumeEntry) to a clean two-map reservation store (pools + reservations). LVG resources are no longer stored in the custom cache; they are read from the controller-runtime informer cache via client.Client.

Cache changes:
- Unified StoragePoolKey (LVGName + optional ThinPoolName) replaces separate thick/thin handling everywhere
- pools map: pre-calculated reservedSize per pool for O(1) lookups
- reservations map: reservationID -> size + set of pools + TTL
- Methods: AddReservation, RemoveReservation, NarrowReservation, GetReservedSpace, HasReservation, GetAllPools, GetAllReservations
- Background goroutine for TTL-based cleanup of expired reservations

API changes:
- New routes: /v1/lvg/filter-and-score, /v1/lvg/narrow-reservation (replace /api/v1/volumes/filter-prioritize and /api/v1/volumes/bind)
- New request/response types: FilterAndScoreRequest/Response, NarrowReservationRequest/Response, LVMVolumeGroupInput, ScoredLVMVolumeGroup

Scheduler logic changes:
- filter: collects StoragePoolKeys across all filtered nodes per PVC, creates one reservation with N pool keys via AddReservation
- prioritize: after scoring, calls NarrowReservation to release reservations on nodes filtered out by kube or other extenders
- Helper functions (getAvailableSpace, checkPoolHasSpace, calculatePoolScore) combine client.Client for LVG capacity with cache.GetReservedSpace for reservations

Controller changes:
- PVC watcher: on selectedNode -> NarrowReservation to node's pools; on bound/delete -> RemoveReservation
- LVG watcher: deleted entirely; informer is started by field indexer registration in main.go, stale data expires via TTL
- LLV watcher (new): watches LVMLogicalVolume; on Phase=Created or Delete -> RemoveReservation to prevent double-counting

Infrastructure:
- Field indexer on LVMVolumeGroup.Status.Nodes.Name for efficient node-to-LVG lookups via client.MatchingFields
- Simplified cache constructor: NewCache(logger, cleanupInterval)
- Removed PVCExpiredDurationSec config, CacheSize config

Files deleted: filter_prioritize.go, bind_volume.go, bind_volume_test.go, lvg_watcher_cache.go, lvg_watcher_cache_test.go
Files created: filter_and_score.go, narrow_reservation.go, llv_watcher_cache.go

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
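To make the reservation model concrete, here is a minimal sketch of a store keyed by reservation ID, where each reservation spans a set of pools. The method names follow the commit message; the `Store` type, the signatures, and the internals are assumptions. TTL handling and the background cleanup goroutine are left out, and reserved space is computed by iteration rather than from a pre-calculated pools map.

```go
package main

import (
	"fmt"
	"sync"
)

// StoragePoolKey identifies a thick LVG (empty ThinPoolName) or a thin pool.
type StoragePoolKey struct {
	LVGName      string
	ThinPoolName string
}

type reservation struct {
	size  int64
	pools map[StoragePoolKey]struct{}
}

// Store is a hypothetical reservation store: reservation ID -> size + pools.
type Store struct {
	mu           sync.Mutex
	reservations map[string]*reservation
}

func NewStore() *Store {
	return &Store{reservations: make(map[string]*reservation)}
}

func toSet(pools []StoragePoolKey) map[StoragePoolKey]struct{} {
	set := make(map[StoragePoolKey]struct{}, len(pools))
	for _, p := range pools {
		set[p] = struct{}{}
	}
	return set
}

// AddReservation creates or replaces the reservation id on all given pools.
func (s *Store) AddReservation(id string, size int64, pools ...StoragePoolKey) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reservations[id] = &reservation{size: size, pools: toSet(pools)}
}

// NarrowReservation drops every pool not listed in keep.
// It reports whether the reservation exists.
func (s *Store) NarrowReservation(id string, keep ...StoragePoolKey) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.reservations[id]
	if !ok {
		return false
	}
	keepSet := toSet(keep)
	for p := range r.pools {
		if _, kept := keepSet[p]; !kept {
			delete(r.pools, p)
		}
	}
	return true
}

// RemoveReservation deletes the reservation; a missing id is a no-op.
func (s *Store) RemoveReservation(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.reservations, id)
}

// GetReservedSpace sums the sizes of all reservations that include pool.
func (s *Store) GetReservedSpace(pool StoragePoolKey) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	var total int64
	for _, r := range s.reservations {
		if _, ok := r.pools[pool]; ok {
			total += r.size
		}
	}
	return total
}

func main() {
	s := NewStore()
	a := StoragePoolKey{LVGName: "vg-a"}
	b := StoragePoolKey{LVGName: "vg-b", ThinPoolName: "tp0"}
	s.AddReservation("pvc-1", 10, a, b) // filter: reserve on every candidate pool
	s.NarrowReservation("pvc-1", a)     // prioritize: keep only surviving pools
	fmt.Println(s.GetReservedSpace(a), s.GetReservedSpace(b))
	s.RemoveReservation("pvc-1") // bound or deleted
}
```

One reservation with N pool keys lets the filter step hold space on every candidate node, and narrowing releases the rest once the candidate set shrinks.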
931f375 to 9cccf0e
…hedulable check

Add isLVGSchedulable(*LVMVolumeGroup) that checks Status.Phase == Ready. LVGs in NotReady, Terminating, etc. are excluded from:
- /v1/lvg/filter-and-score (via getAvailableSpace)
- /scheduler/filter (local and replicated PVCs)
- /scheduler/prioritize (local and replicated PVCs)

Integration points:
- getAvailableSpace(): return error if LVG not schedulable
- findMatchedSCLVG(): only consider schedulable LVGs when matching
- findLVGForNodeInRSP(): skip non-schedulable LVGs

Designed for future extension (e.g. Unschedulable field) in one place.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
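A schedulability gate of this kind might look like the sketch below. The `LVMVolumeGroup` type here is a trimmed stand-in for the real CRD type, and the nil check is an addition for the example, not something the commit message states.

```go
package main

import "fmt"

// Trimmed stand-ins for the CRD types; only the field used here is modeled.
type LVMVolumeGroupStatus struct {
	Phase string
}

type LVMVolumeGroup struct {
	Status LVMVolumeGroupStatus
}

const phaseReady = "Ready"

// isLVGSchedulable reports whether the LVG may receive new volumes.
// Keeping the rule in one function leaves a single place to extend later.
func isLVGSchedulable(lvg *LVMVolumeGroup) bool {
	return lvg != nil && lvg.Status.Phase == phaseReady
}

func main() {
	for _, phase := range []string{"Ready", "NotReady", "Terminating"} {
		lvg := &LVMVolumeGroup{Status: LVMVolumeGroupStatus{Phase: phase}}
		fmt.Printf("%s schedulable=%v\n", phase, isLVGSchedulable(lvg))
	}
}
```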
- Remove PoolEntry type and pools map from cache
- GetReservedSpace and GetAllPools now compute reserved size by iterating reservations and skipping expired entries (lazy TTL check)
- Expired reservations are effectively ignored immediately after TTL without waiting for the 30s cleanup ticker
- AddReservation, removeReservation, NarrowReservation simplified
- HasReservation and GetReservation return false for expired entries
- Add Expired field to ReservationInfo; GetAllReservations returns all entries including expired for debug visibility
- Debug endpoints: mark expired reservations with [EXPIRED] in /cache, show active/expired counts in /stat
- Add tests: TestGetReservedSpace_SkipsExpired, TestGetAllPools_SkipsExpired, TestGetAllReservations_MarksExpired

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
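The lazy TTL check can be illustrated with a read path that skips expired entries at lookup time. This is a sketch under assumptions: the `entry` layout, the function names, and passing `now` explicitly (which makes the check easy to test) are all choices made for the example.

```go
package main

import (
	"fmt"
	"time"
)

type poolKey struct {
	LVGName      string
	ThinPoolName string
}

// entry is a hypothetical reservation record with an absolute expiry time.
type entry struct {
	size      int64
	pools     []poolKey
	expiresAt time.Time
}

// expired reports whether the entry's TTL has elapsed at time now.
func (e entry) expired(now time.Time) bool {
	return !now.Before(e.expiresAt)
}

// reservedSpace sums non-expired reservations that include pool. Expired
// entries stay in the map for a later cleanup pass but no longer count.
func reservedSpace(entries map[string]entry, pool poolKey, now time.Time) int64 {
	var total int64
	for _, e := range entries {
		if e.expired(now) {
			continue
		}
		for _, p := range e.pools {
			if p == pool {
				total += e.size
				break
			}
		}
	}
	return total
}

// demo builds one live and one expired reservation on the same pool.
func demo() int64 {
	now := time.Now()
	pool := poolKey{LVGName: "vg-a"}
	entries := map[string]entry{
		"pvc-live":    {size: 10, pools: []poolKey{pool}, expiresAt: now.Add(time.Minute)},
		"pvc-expired": {size: 5, pools: []poolKey{pool}, expiresAt: now.Add(-time.Second)},
	}
	return reservedSpace(entries, pool, now)
}

func main() {
	fmt.Println(demo()) // only the live reservation is counted
}
```

With the check in the read path, the cleanup ticker only reclaims memory; correctness no longer depends on how often it runs.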
b7372b3 to 231dd6a
…am, whitespace)

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
231dd6a to 4013d50
Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
f14e2e0 to 265ed41
- Add handler_test_helpers_test.go: shared test helpers (newTestScheduler, readyLVG, readyLVGWithThinPool, notReadyLVG, newTestCache, newFakeClient)
- Add filter_and_score_test.go: 12 tests for validation, filtering, scoring, cache, NotReady LVGs, thin pools, and idempotent reservation replace
- Add narrow_reservation_test.go: 8 tests for validation, narrowing, non-existent reservation, and cache state verification

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
After lvcreate and lvextend, the in-memory cache contains stale data until the scanner runs. This caused unnecessary 5s requeues and blocking busy-wait loops.

- LLV create: replace getLVActualSize with commands.GetLV after lvcreate
- LLV resize: replace getLVActualSize with commands.GetLV after lvextend
- LLV extender: replace FindLV busy-wait loop with GetLV after lvextend
- LLVS snapshot: use GetLV after CreateThinLogicalVolumeSnapshot instead of requeueing for cache discovery

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
a75b782 to a2511e8
- Replace getUseLinstor with getNewControlPlane (inverted semantics): newControlPlane=true means the extender handles replicated PVC scheduling, false/absent means LINSTOR manages it.
- Add RBAC permissions for moduleconfigs.deckhouse.io to fix "cannot list resource" error in reflector logs.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
BlockDeviceFilter resources with In/NotIn matchExpressions and nil/empty values caused metav1.LabelSelectorAsSelector to fail, breaking the entire block device reconciliation loop.

Add sanitizeLabelSelector() that drops such vacuous expressions before parsing. Add tests for nil values, empty values, and mixed cases.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
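The sanitizing step can be sketched without pulling in the Kubernetes libraries. The `LabelSelector` and `Requirement` types below are local stand-ins that mirror the shape of the metav1 types, and the function body is an assumption about what dropping vacuous expressions means, based only on the commit message.

```go
package main

import "fmt"

// Requirement and LabelSelector are local stand-ins for the metav1 types.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists", "DoesNotExist"
	Values   []string
}

type LabelSelector struct {
	MatchLabels      map[string]string
	MatchExpressions []Requirement
}

// sanitizeLabelSelector returns a copy of sel without In/NotIn expressions
// whose value list is nil or empty, since those cannot be parsed.
func sanitizeLabelSelector(sel *LabelSelector) *LabelSelector {
	if sel == nil {
		return nil
	}
	out := &LabelSelector{MatchLabels: sel.MatchLabels}
	for _, e := range sel.MatchExpressions {
		vacuous := (e.Operator == "In" || e.Operator == "NotIn") && len(e.Values) == 0
		if vacuous {
			continue
		}
		out.MatchExpressions = append(out.MatchExpressions, e)
	}
	return out
}

func main() {
	sel := &LabelSelector{MatchExpressions: []Requirement{
		{Key: "type", Operator: "In", Values: nil},
		{Key: "zone", Operator: "NotIn", Values: []string{}},
		{Key: "size", Operator: "In", Values: []string{"large"}},
		{Key: "ssd", Operator: "Exists"},
	}}
	for _, e := range sanitizeLabelSelector(sel).MatchExpressions {
		fmt.Println(e.Key, e.Operator, e.Values)
	}
}
```

Building a new selector instead of mutating the input keeps the stored resource untouched, so the fix stays local to parsing.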
…oragePool and add debug tool

Add missing get/list/watch permissions for replicatedstorageclasses and replicatedstoragepools to the scheduler-extender ClusterRole. Without them the controller-runtime cached client blocks forever on informer sync, hanging filter/prioritize handlers.

Add hack/debug.go, a standalone diagnostic tool that watches Kubernetes resources via kubectl and prints colored diffs interleaved with pod logs.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
…r and prioritize handlers

Replace s.ctx (global) with a 4s timeout context derived from r.Context() so that blocked API calls (e.g. informer waiting for RBAC-denied List/Watch) return an error instead of hanging the handler goroutine indefinitely.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
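The pattern described here is a per-request deadline derived from the request context. The sketch below uses only the standard library; the wrapper name `withTimeout`, the simulated `blockedCall`, and the 500 response are assumptions for illustration (the commit states the 4s value, shortened here so the demo finishes quickly).

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// withTimeout wraps a call so it runs under a deadline derived from the
// incoming request's context rather than a long-lived global context.
func withTimeout(timeout time.Duration, call func(ctx context.Context) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		if err := call(ctx); err != nil {
			// The status code is an assumption; the point is that the
			// handler returns instead of hanging.
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// blockedCall simulates an API call that never completes on its own,
// such as a cached client waiting on an informer that cannot sync.
func blockedCall(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demoStatus runs the wrapped handler once and returns the HTTP status.
func demoStatus() int {
	h := withTimeout(50*time.Millisecond, blockedCall)
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/scheduler/filter", nil)
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(demoStatus())
}
```

Deriving from r.Context() also means the work is cancelled early if the caller disconnects before the deadline.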
46567d2 to d825918
…ffset approach

Replace stale VGFree-only computation with LLV-based available space formula: min(totalCapacity - sumAllLLV - unaccountedSpace, reportedFree) - reserved.

Key changes:
- Add per-pool unaccounted space offset to reservation cache
- Add sumLLVSpace helper to sum spec.size for all LLVs on a storage pool
- Add CalibratePoolUnaccountedSpace to compute non-LLV volume offset
- Add LVG watcher controller to recalibrate on LVG status updates
- Register LLV field indexer (spec.lvmVolumeGroupName) for efficient queries
- Update getAvailableSpace to use min(llvBased, reportedFree) as safety net
- Add 16 unit tests covering thick/thin pools, calibration, and edge cases

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
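The formula from this commit, written out as a small function. The `poolUsage` struct and its field names are assumptions for the example; only the arithmetic comes from the commit message, and the numbers are made-up units.

```go
package main

import "fmt"

// poolUsage groups the inputs of the formula; the names are illustrative.
type poolUsage struct {
	TotalCapacity int64 // full capacity of the pool
	SumAllLLV     int64 // sum of spec.size over all LLVs on the pool
	Unaccounted   int64 // calibrated offset for non-LLV volumes
	ReportedFree  int64 // free space last reported in LVG status
	Reserved      int64 // in-flight reservations from the cache
}

// availableSpace computes
// min(totalCapacity - sumAllLLV - unaccountedSpace, reportedFree) - reserved.
func availableSpace(u poolUsage) int64 {
	llvBased := u.TotalCapacity - u.SumAllLLV - u.Unaccounted
	free := llvBased
	if u.ReportedFree < free {
		free = u.ReportedFree // reported value acts as a safety net
	}
	return free - u.Reserved
}

func main() {
	// LLV-based estimate (60) is lower than reported free (70): 60 - 5 = 55.
	fmt.Println(availableSpace(poolUsage{
		TotalCapacity: 100, SumAllLLV: 30, Unaccounted: 10, ReportedFree: 70, Reserved: 5,
	}))
	// Reported free (50) is lower than the LLV-based estimate: 50 - 5 = 45.
	fmt.Println(availableSpace(poolUsage{
		TotalCapacity: 100, SumAllLLV: 30, Unaccounted: 10, ReportedFree: 50, Reserved: 5,
	}))
}
```

Taking the minimum keeps the estimate conservative: LLVs that exist as resources but are not yet created on disk reduce the first term, while space consumed outside LLV accounting reduces the second.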
…thin pool tests

totalCapacity for thin pools was incorrectly set to AllocatedSize (space already handed out to thin LVs), which is 0 for empty pools. This caused all filter requests to reject every node with "not enough space".

Fix: totalCapacity = AllocatedSize + AvailableSpace (full overprovisioned capacity of the thin pool, analogous to VGSize for thick pools).

Add 7 thin pool unit tests covering empty pool, in-flight LLVs, reservations, unaccounted space, and calibration scenarios.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
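A short worked example of the fix for an empty thin pool. The `ThinPoolStatus` struct is a trimmed stand-in that models only the two fields named in the commit message, and the values are made-up units.

```go
package main

import "fmt"

// ThinPoolStatus is a trimmed stand-in for the thin pool status fields.
type ThinPoolStatus struct {
	AllocatedSize  int64 // space already handed out to thin LVs
	AvailableSpace int64 // space still available in the pool
}

// thinPoolTotalCapacity is the corrected total: allocated plus available.
func thinPoolTotalCapacity(tp ThinPoolStatus) int64 {
	return tp.AllocatedSize + tp.AvailableSpace
}

func main() {
	empty := ThinPoolStatus{AllocatedSize: 0, AvailableSpace: 100}
	// Before the fix the total was AllocatedSize alone, i.e. 0 here,
	// so every request looked too large for the pool.
	fmt.Println("old total:", empty.AllocatedSize)
	fmt.Println("new total:", thinPoolTotalCapacity(empty))
}
```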
…lution

RSC spec.storagePool is now deprecated; the RSP name is stored in status.storagePoolName by the sds-replicated-volume RSC controller.

Update RSC type to include spec.storage, status.storagePoolName, and add GetStoragePoolName() helper. Update RSP type to include status.eligibleNodes. Fix all extender code to resolve RSP name via GetStoragePoolName() instead of the empty spec.storagePool field.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
Remove always-constant parameters from readyLVGWithThinPool (vgSize) and thinLLV (lvgName), hardcoding the values inside the helpers.

Signed-off-by: Aleksandr Zimin <alexandr.zimin@flant.com>
05adc48 to 031f3b8
Merged commit b05ab4d into add-common-scheduler-extender-v2
10 of 11 checks passed
Description
This PR refines the SDS common scheduler extender and related agent/module pieces on top of add-common-scheduler-extender-v2. Main changes:

Scheduler extender (images/sds-common-scheduler-extender/)
- API routes move from /api/v1/volumes/* to /v1/lvg/filter-and-score and /v1/lvg/narrow-reservation, with matching request/response types (FilterAndScore*, NarrowReservation*, LVMVolumeGroupInput, etc.).
- Reservation cache keyed by StoragePoolKey, with TTL cleanup and lazy TTL in reads (removed redundant pools map).
- Reservations are updated via NarrowReservation / RemoveReservation on bind and lifecycle events; non-Ready LVGs are excluded via isLVGSchedulable.
- Available space is computed from LLVs rather than VGFree-only.
- RBAC for ReplicatedStorageClass / ReplicatedStoragePool and moduleconfigs.deckhouse.io; newControlPlane module setting (replaces useLinstor) to choose extender vs LINSTOR for replicated PVCs; /stat extended with filter-and-score counters.
- Lint fixes (unparam, etc.).

Module / Kubernetes
- OpenAPI values schema (openapi/values*.yaml) updated for new settings.
- ReplicatedStorageClass / ReplicatedStoragePool types aligned with CRDs; registration updates.

Agent
- LLV/LLVS reconcilers call GetLV instead of waiting on scanner/cache, to avoid extra requeues and busy-wait.
- BlockDeviceFilter selectors using In/NotIn with nil/empty values no longer crash reconciliation.

Misc
- .gitignore updates; optional debug helper under hack/ for watching resources vs pod logs.

Impact on the cluster: deploys/updates the scheduler extender workload and kube-scheduler webhook configuration; may cause scheduler extender pod restarts and webhook reloads during rollout. Does not by itself restart control-plane components beyond what a normal module upgrade does.
Why do we need it, and what problem does it solve?
The v2 extender stack needs a maintainable reservation model, correct free-space and thin-pool accounting under concurrent scheduling, and safe handler behavior (timeouts, RBAC, non-Ready LVGs). Without these, users can see hung filter/prioritize calls, wrong capacity decisions, stale reservations, or agent reconciliation crashes on malformed BlockDeviceFilter selectors.

This PR delivers those fixes and refactors on the existing v2 branch so replicated/local PVC scheduling stays correct and observable.
What is the expected result?
- A BlockDeviceFilter with nil/empty In/NotIn values no longer breaks the BD reconciliation loop.
- newControlPlane: when set as intended, the extender participates in replicated PVC scheduling; otherwise behavior falls back to the LINSTOR-managed path per design.

Checklist