Add, Delete Storage Pool commands should be able to execute on a host in maintenance #9301
Conversation
@blueorangutan package

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##               4.18    #9301      +/-  ##
============================================
- Coverage     12.24%   12.24%    -0.01%
  Complexity     9296     9296
============================================
  Files          4699     4699
  Lines        414331   414347      +16
  Branches      51999    51008     -991
============================================
+ Hits          50731    50732       +1
- Misses       357295   357310      +15
  Partials       6305     6305

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10111

@blueorangutan test

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
@abh1sar,
Steps that I followed:
- Keep one of the hosts in a cluster in maintenance mode
- Add a cluster-wide primary storage
- Exception observed
Logs
2024-06-25 13:06:12,682 DEBUG [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) createPool Params @ scheme - nfs storageHost - 10.0.32.4 hostPath - /acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1 port - -1
2024-06-25 13:06:12,690 DEBUG [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) creating pool pri2 on host 1
2024-06-25 13:06:12,694 WARN [c.c.a.m.AgentManagerImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Resource [Host:1] is unreachable: Host 1: Unable to send class com.cloud.agent.api.CreateStoragePoolCommand because agent ref-trl-6898-k-Mr8-kiran-chavala-kvm1 is in maintenance mode
2024-06-25 13:06:12,699 WARN [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
2024-06-25 13:06:12,699 DEBUG [c.c.s.StorageManagerImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Failed to add data store: Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
com.cloud.utils.exception.CloudRuntimeException: Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.createStoragePool(CloudStackPrimaryDataStoreLifeCycleImpl.java:415)
at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.attachCluster(CloudStackPrimaryDataStoreLifeCycleImpl.java:438)
at com.cloud.storage.StorageManagerImpl.createPool(StorageManagerImpl.java:851)
Ideally, when adding cluster-wide primary storage, CloudStack should check for active hosts in the cluster and add the storage through one of them.
When the host comes out of maintenance mode, the new storage should get added to it.
Currently there is no issue if zone-wide storage is added while a host is in maintenance mode; once the host comes out of maintenance mode, I see ModifyStoragePoolCommand getting executed:
2024-06-25 13:23:44,690 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:094fcf46) Seq 1-1815232124807020547: { Ans: , MgmtId: 167780928, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.ModifyStoragePoolAnswer":{"poolInfo":{"host":"10.0.32.4","localPath":"/mnt//42d866c8-547c-3741-a71e-4205227ffe28","hostPath":"/acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1","poolType":"NetworkFilesystem","capacityBytes":"(1.9990 TB) 2197949513728","availableBytes":"(668.27 GB) 717546848256"},"templateInfo":{},"datastoreClusterChildren":[],"result":"true","wait":"0","bypassHostMaintenance":"false"}}] }
No issue with the deletion of a zone-wide storage pool either, even if the host is in maintenance mode:
2024-06-25 13:40:29,655 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:57d7f82f) Request:Seq 1-1815232124807020577: { Cmd , MgmtId: 167780928, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"3","uuid":"42d866c8-547c-3741-a71e-4205227ffe28","host":"10.0.32.4","path":"/acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1","port":"2049","type":"NetworkFilesystem"},"_localPath":"/mnt//42d866c8-547c-3741-a71e-4205227ffe28","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2024-06-25 13:40:29,655 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:57d7f82f) Processing command: com.cloud.agent.api.DeleteStoragePoolCommand
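The cluster-wide behaviour suggested above (only send the pool-creation command to active hosts) can be sketched as follows. This is a minimal illustration with hypothetical `Host` and `ResourceState` types, not the actual CloudStack classes:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: when attaching cluster-wide primary storage, pick only
// hosts whose resource state is Enabled, so CreateStoragePoolCommand is never
// sent to a host that is in maintenance mode.
public class EligibleHostFilter {

    public enum ResourceState { Enabled, Maintenance, PrepareForMaintenance }

    public static class Host {
        public final long id;
        public final ResourceState state;
        public Host(long id, ResourceState state) { this.id = id; this.state = state; }
    }

    // Return the hosts in the cluster that can receive the pool-creation command.
    public static List<Host> eligibleHosts(List<Host> clusterHosts) {
        return clusterHosts.stream()
                .filter(h -> h.state == ResourceState.Enabled)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Host> hosts = Arrays.asList(
                new Host(1, ResourceState.Maintenance),   // like host 1 in the logs above
                new Host(2, ResourceState.Enabled));
        // Only host 2 is eligible; host 1 (in maintenance) is skipped.
        System.out.println(eligibleHosts(hosts).get(0).id);
    }
}
```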
@blueorangutan package

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10124
[SF] Trillian test result (tid-10612)
@blueorangutan test

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

[SF] Trillian test result (tid-10626)
@blueorangutan test keepEnv

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Working on resolving smoke test failures

Codewise LGTM @abh1sar

[SF] Trillian test result (tid-10648)
The approach of SSH-ing into the in-maintenance host to restart the agent even if the agent was already connected is not right, as it breaks the change done in #3239. I think a better solution would be to allow CreateStoragePoolCommand to run on the host in maintenance mode (like ModifyStoragePoolCommand).
Yes @abh1sar, if the agent is already connected, it is better to check and update the storage pools with an agent command.
…he change to restart agent when host was already up and in maintenance
@blueorangutan package

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Have changed the approach, so this code is now obsolete. Please check.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10184

@blueorangutan test

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
vladimirpetrov left a comment:
LGTM based on manual testing with NFS storage.
[SF] Trillian test result (tid-10670)
… maintenance (apache#9301)
* Restart agent when host comes out of maintenance
* Don't send CreateStoragePoolCommand to hosts in maintenance mode
* CreateStoragePoolCommand can run when host in maintenance. Reverted the change to restart agent when host was already up and in maintenance
* Reverted changes done to ResourceManagerImplTest
Description
This PR...
Fixes #9295
When a host is in maintenance mode, the AgentAttache does not allow CreateStoragePoolCommand and DeleteStoragePoolCommand to execute.
As a result, a new storage pool will not be present on the host even after it comes out of maintenance, because the CloudStack agent is not restarted when cancel maintenance is called (#3239).
This also causes a deleted storage pool to never be removed from a host in maintenance.
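The idea of the fix can be sketched as a per-command bypass of the maintenance gate, assuming a flag like the `bypassHostMaintenance` field visible in the command JSON in the logs above. All class and method names here are illustrative, not the actual CloudStack implementation:

```java
// Hypothetical sketch of an agent-side maintenance gate. Storage-pool
// lifecycle commands opt in to running while the host is in maintenance,
// the way ModifyStoragePoolCommand already does.
public class MaintenanceGate {

    public static class Command {
        private boolean bypassHostMaintenance;
        public boolean isBypassHostMaintenance() { return bypassHostMaintenance; }
        public void setBypassHostMaintenance(boolean b) { bypassHostMaintenance = b; }
    }

    public static class CreateStoragePoolCommand extends Command {
        public CreateStoragePoolCommand() {
            // The fix: allow this command on a host in maintenance mode,
            // so the pool is known to the host when maintenance is cancelled.
            setBypassHostMaintenance(true);
        }
    }

    // Returns true when the command may be sent to the host.
    public static boolean canSend(Command cmd, boolean hostInMaintenance) {
        return !hostInMaintenance || cmd.isBypassHostMaintenance();
    }

    public static void main(String[] args) {
        // A pool-creation command passes the gate even during maintenance.
        System.out.println(canSend(new CreateStoragePoolCommand(), true));
    }
}
```

With a gate like this, an unmarked command is still rejected on a host in maintenance, which preserves the behaviour introduced in #3239 without restarting the agent over SSH.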
Types of changes
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?