Add, Delete Storage Pool commands should be able execute on a host in maintenance by abh1sar · Pull Request #9301 · apache/cloudstack

abh1sar · 2024-06-25T09:15:43Z

Description

This PR...

When a host is in maintenance, the AgentAttache does not allow CreateStoragePoolCommand and DeleteStoragePoolCommand to execute.
As a result, a new storage pool is not present on the host even after it comes out of maintenance, because the CloudStack agent is not restarted when cancel maintenance is called (#3239).
It also means that a deleted storage pool is never removed from a host in maintenance.
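
The gating behaviour described above can be pictured with a minimal sketch. This is only an illustrative model: `MaintenanceGateSketch`, `canSend`, `ALLOWED_BEFORE` and `ALLOWED_AFTER` are hypothetical names, not the real AgentAttache API, and the contents of the "before" list are an assumption based on the thread's remark that ModifyStoragePoolCommand already runs on hosts in maintenance.

```java
import java.util.Set;

// Illustrative model only: class and method names are hypothetical,
// not the real AgentAttache API. Only the command names come from this PR.
class MaintenanceGateSketch {
    // Commands accepted by a host in maintenance before the change (assumed subset).
    static final Set<String> ALLOWED_BEFORE = Set.of("ModifyStoragePoolCommand");

    // The same list with the two commands this PR is about added.
    static final Set<String> ALLOWED_AFTER = Set.of(
            "ModifyStoragePoolCommand",
            "CreateStoragePoolCommand",
            "DeleteStoragePoolCommand");

    // A command may be sent if the host is not in maintenance,
    // or if the command is on the allow-list.
    static boolean canSend(String command, boolean hostInMaintenance, Set<String> allowed) {
        return !hostInMaintenance || allowed.contains(command);
    }

    public static void main(String[] args) {
        // Rejected before the change, accepted after it.
        System.out.println(canSend("CreateStoragePoolCommand", true, ALLOWED_BEFORE)); // false
        System.out.println(canSend("CreateStoragePoolCommand", true, ALLOWED_AFTER));  // true
    }
}
```

Under this model, a host that is not in maintenance accepts every command, so the allow-list only matters while maintenance is active.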

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)
build/CI
test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

Major
Minor

Bug Severity

Screenshots (if appropriate):

How Has This Been Tested?

Put a host in maintenance.
Add a new storage pool and verify that it is mounted on the host
Delete a storage pool and verify that it has been unmounted from the host
Take the host out of maintenance and verify again.

How did you try to break this feature and the system with this change?

abh1sar · 2024-06-25T09:24:53Z

@blueorangutan package

blueorangutan · 2024-06-25T09:26:04Z

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov · 2024-06-25T10:04:32Z

Codecov Report

Attention: Patch coverage is 27.27273% with 8 lines in your changes missing coverage. Please review.

Project coverage is 12.24%. Comparing base (351de5f) to head (50e09fe).
Report is 1 commits behind head on 4.18.

Files	Patch %	Lines
...n/java/com/cloud/resource/ResourceManagerImpl.java	0.00%	7 Missing ⚠️
...cycle/CloudStackPrimaryDataStoreLifeCycleImpl.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               4.18    #9301      +/-   ##
============================================
- Coverage     12.24%   12.24%   -0.01%     
  Complexity     9296     9296              
============================================
  Files          4699     4699              
  Lines        414331   414347      +16     
  Branches      51999    51008     -991     
============================================
+ Hits          50731    50732       +1     
- Misses       357295   357310      +15     
  Partials       6305     6305

Flag	Coverage Δ
unittests	`12.24% <27.27%> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

blueorangutan · 2024-06-25T10:22:15Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10111

abh1sar · 2024-06-25T11:11:36Z

@blueorangutan test

blueorangutan · 2024-06-25T11:12:03Z

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

sureshanaparti

clgtm

kiranchavala

@abh1sar,

Steps I followed:

Keep one of the hosts in a cluster in maintenance mode.
Add a cluster-wide primary storage.
An exception is observed.

Logs


2024-06-25 13:06:12,682 DEBUG [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) createPool Params @ scheme - nfs storageHost - 10.0.32.4 hostPath - /acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1 port - -1
2024-06-25 13:06:12,690 DEBUG [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) creating pool pri2 on  host 1
2024-06-25 13:06:12,694 WARN  [c.c.a.m.AgentManagerImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Resource [Host:1] is unreachable: Host 1: Unable to send class com.cloud.agent.api.CreateStoragePoolCommand because agent ref-trl-6898-k-Mr8-kiran-chavala-kvm1 is in maintenance mode
2024-06-25 13:06:12,699 WARN  [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
2024-06-25 13:06:12,699 DEBUG [c.c.s.StorageManagerImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Failed to add data store: Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
com.cloud.utils.exception.CloudRuntimeException: Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
        at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.createStoragePool(CloudStackPrimaryDataStoreLifeCycleImpl.java:415)
        at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.attachCluster(CloudStackPrimaryDataStoreLifeCycleImpl.java:438)
        at com.cloud.storage.StorageManagerImpl.createPool(StorageManagerImpl.java:851)

Ideally, when adding cluster-wide primary storage, CloudStack should check for active hosts in the cluster and add the storage through those.

When the host comes out of maintenance mode, the new storage should then be added to it.
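
A rough sketch of that suggestion, under stated assumptions: `ClusterPoolAttachSketch`, `Host`, `HostState` and `hostsToCreatePoolOn` are hypothetical names made up for illustration and do not correspond to CloudStack classes. It only shows the idea of skipping hosts in maintenance when choosing where to create the pool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "only use active hosts in the cluster";
// none of these types exist in CloudStack under these names.
class ClusterPoolAttachSketch {
    enum HostState { ENABLED, MAINTENANCE }

    static final class Host {
        final long id;
        final HostState state;

        Host(long id, HostState state) {
            this.id = id;
            this.state = state;
        }
    }

    // Returns the ids of hosts that are not in maintenance, in cluster order.
    static List<Long> hostsToCreatePoolOn(List<Host> clusterHosts) {
        List<Long> ids = new ArrayList<>();
        for (Host h : clusterHosts) {
            if (h.state != HostState.MAINTENANCE) {
                ids.add(h.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Host> cluster = List.of(
                new Host(1, HostState.MAINTENANCE),
                new Host(2, HostState.ENABLED));
        System.out.println(hostsToCreatePoolOn(cluster)); // [2]
    }
}
```

With this filter, host 1 from the log above would be skipped. It would still need the pool once it leaves maintenance, which is the second half of the suggestion.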

Currently there is no issue if a zone-wide storage pool is added while a host is in maintenance mode. Once the host comes out of maintenance mode, I see modify storage pool being executed:

 2024-06-25 13:23:44,690 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:094fcf46) Seq 1-1815232124807020547:  { Ans: , MgmtId: 167780928, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.ModifyStoragePoolAnswer":{"poolInfo":{"host":"10.0.32.4","localPath":"/mnt//42d866c8-547c-3741-a71e-4205227ffe28","hostPath":"/acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1","poolType":"NetworkFilesystem","capacityBytes":"(1.9990 TB) 2197949513728","availableBytes":"(668.27 GB) 717546848256"},"templateInfo":{},"datastoreClusterChildren":[],"result":"true","wait":"0","bypassHostMaintenance":"false"}}] }

There is also no issue with deleting a zone-wide storage pool, even if the host is in maintenance mode:


2024-06-25 13:40:29,655 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:57d7f82f) Request:Seq 1-1815232124807020577:  { Cmd , MgmtId: 167780928, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"3","uuid":"42d866c8-547c-3741-a71e-4205227ffe28","host":"10.0.32.4","path":"/acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1","port":"2049","type":"NetworkFilesystem"},"_localPath":"/mnt//42d866c8-547c-3741-a71e-4205227ffe28","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2024-06-25 13:40:29,655 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:57d7f82f) Processing command: com.cloud.agent.api.DeleteStoragePoolCommand

abh1sar · 2024-06-25T19:26:00Z

@blueorangutan package

blueorangutan · 2024-06-25T19:28:04Z

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2024-06-25T20:25:02Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10124

blueorangutan · 2024-06-25T23:20:43Z

[SF] Trillian test result (tid-10612)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42162 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10612-kvm-centos7.zip
Smoke tests completed. 107 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_08_migrate_vm	`Error`	45.85	test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs	`Error`	114.46	test_host_maintenance.py
test_disable_oobm_ha_state_ineligible	`Error`	1513.05	test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance	`Failure`	310.26	test_hostha_kvm.py

abh1sar · 2024-06-26T03:34:47Z

@blueorangutan test

blueorangutan · 2024-06-26T03:36:03Z

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan · 2024-06-26T15:38:41Z

[SF] Trillian test result (tid-10626)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41860 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10626-kvm-centos7.zip
Smoke tests completed. 107 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_08_migrate_vm	`Error`	45.92	test_vm_life_cycle.py
test_01_vpc_site2site_vpn	`Failure`	279.86	test_vpc_vpn.py
test_hostha_enable_ha_when_host_disabled	`Error`	3.68	test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	302.87	test_hostha_kvm.py
test_hostha_kvm_host_recovering	`Error`	7.13	test_hostha_kvm.py

abh1sar · 2024-06-27T05:46:00Z

@blueorangutan test keepEnv

blueorangutan · 2024-06-27T05:48:05Z

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

abh1sar · 2024-06-27T08:40:03Z

Working on resolving smoke test failures

weizhouapache · 2024-06-27T09:43:23Z

codewise lgtm

@abh1sar
What about other resource states, like ErrorInMaintenance, ErrorInPrepareForMaintenance, or PrepareForMaintenance?
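
To make the question concrete, here is a small sketch of treating all maintenance-related states uniformly. The enum below is a trimmed, hypothetical stand-in: the four maintenance-related names come from this thread, while `Enabled`, `Disabled`, `MaintenanceStatesSketch` and `isMaintenanceRelated` are assumptions added for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, trimmed model of host resource states; not the real enum.
class MaintenanceStatesSketch {
    enum ResourceState {
        Enabled,
        Disabled,
        PrepareForMaintenance,
        ErrorInPrepareForMaintenance,
        ErrorInMaintenance,
        Maintenance
    }

    // States that would need the same handling as Maintenance.
    static final Set<ResourceState> MAINTENANCE_RELATED = EnumSet.of(
            ResourceState.PrepareForMaintenance,
            ResourceState.ErrorInPrepareForMaintenance,
            ResourceState.ErrorInMaintenance,
            ResourceState.Maintenance);

    static boolean isMaintenanceRelated(ResourceState state) {
        return MAINTENANCE_RELATED.contains(state);
    }

    public static void main(String[] args) {
        System.out.println(isMaintenanceRelated(ResourceState.ErrorInMaintenance)); // true
        System.out.println(isMaintenanceRelated(ResourceState.Enabled));            // false
    }
}
```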

blueorangutan · 2024-06-27T18:45:25Z

[SF] Trillian test result (tid-10648)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 45084 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10648-kvm-centos7.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_hostha_enable_ha_when_host_disabled	`Error`	4.74	test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance	`Error`	302.76	test_hostha_kvm.py

abh1sar · 2024-06-28T04:23:13Z

The approach of SSH-ing into the in-maintenance host to restart the agent, even if the agent was already connected, is not right, as it breaks the change made in #3239.

I think a better solution would be to allow CreateStoragePoolCommand to run on a host in maintenance mode (like ModifyStoragePoolCommand).

@sureshanaparti @kiranchavala

sureshanaparti · 2024-06-28T04:41:45Z

> The approach of SSH-ing into the in-maintenance host to restart the agent, even if the agent was already connected, is not right, as it breaks the change made in #3239.
>
> I think a better solution would be to allow CreateStoragePoolCommand to run on a host in maintenance mode (like ModifyStoragePoolCommand).
>
> @sureshanaparti @kiranchavala

Yes @abh1sar, if the agent is already connected, it is better to check and update the storage pools with an agent command.

abh1sar · 2024-06-28T05:46:03Z

@blueorangutan package

blueorangutan · 2024-06-28T05:48:03Z

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

abh1sar · 2024-06-28T06:18:59Z

> codewise lgtm
>
> @abh1sar what about other resource states, like ErrorInMaintenance or ErrorInPrepareForMaintenance, PrepareForMaintenance ?

I have changed the approach, so this code is now obsolete. Please check.

blueorangutan · 2024-06-28T06:49:13Z

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10184

abh1sar · 2024-06-28T06:51:19Z

@blueorangutan test

blueorangutan · 2024-06-28T06:52:03Z

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

sureshanaparti

clgtm

vladimirpetrov

LGTM based on manual testing with NFS storage.

blueorangutan · 2024-06-28T18:16:50Z

[SF] Trillian test result (tid-10670)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39649 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10670-kvm-centos7.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test	Result	Time (s)	Test File
test_hostha_kvm_host_fencing	`Error`	106.67	test_hostha_kvm.py

… maintenance (apache#9301)
* Restart agent when host comes out of maintenance
* Don't send CreateStoragePoolCommand to hosts in maintenance mode
* CreateStoragePoolCommand can run when host in maintenance. Reverted the change to restart agent when host was already up and in maintenance
* Reverted changes done to ResourceManagerImplTest

Restart agent when host comes out of maintenance

66e0fbf

abh1sar requested a review from sureshanaparti June 25, 2024 09:16

boring-cyborg bot added component:agent component:orchestration labels Jun 25, 2024

abh1sar requested review from kiranchavala, shwstppr and weizhouapache June 25, 2024 09:16

abh1sar self-assigned this Jun 25, 2024

yadvr added this to the 4.19.1.0 milestone Jun 25, 2024

sureshanaparti approved these changes Jun 25, 2024

sureshanaparti added the status:needs-testing label Jun 25, 2024

kiranchavala reviewed Jun 25, 2024

Don't send CreateStoragePoolCommand to hosts in maintenance mode

85672d2

boring-cyborg bot added component:primary-storage component:solidfire component:storage labels Jun 25, 2024

sureshanaparti assigned kiranchavala Jun 25, 2024

sureshanaparti linked an issue Jun 25, 2024 that may be closed by this pull request

Storage pools are not mounted when the host is in maintenance mode or comes out of maintenance mode #9295

Closed

sureshanaparti requested a review from yadvr June 26, 2024 06:00

yadvr requested a review from harikrishna-patnala June 26, 2024 08:45

abh1sar added 2 commits June 28, 2024 11:10

CreateStoragePoolCommand can run when host in maintenance. Reverted t…

2c031ac

…he change to restart agent when host was already up and in maintenance

Reverted changes done to ResourceManagerImplTest

50e09fe

abh1sar changed the title ~~Restart agent when host comes out of maintenance~~ Add, Delete, Modify Storage Pool commands should be able execute on a host in maintenance Jun 28, 2024

abh1sar changed the title ~~Add, Delete, Modify Storage Pool commands should be able execute on a host in maintenance~~ Add, Delete Storage Pool commands should be able execute on a host in maintenance Jun 28, 2024

abh1sar requested review from kiranchavala and sureshanaparti June 28, 2024 06:19

sureshanaparti approved these changes Jun 28, 2024

sureshanaparti assigned vladimirpetrov and unassigned kiranchavala Jun 28, 2024

vladimirpetrov approved these changes Jun 28, 2024

sureshanaparti merged commit 644f3a3 into apache:4.18 Jun 28, 2024

sureshanaparti mentioned this pull request Jun 28, 2024

Storage pools are not mounted when the host is in maintenance mode or comes out of maintenance mode #9295

Closed
