Skip to content

Add, Delete Storage Pool commands should be able execute on a host in maintenance#9301

Merged
sureshanaparti merged 4 commits intoapache:4.18from
abh1sar:storage-maint-418
Jun 28, 2024
Merged

Add, Delete Storage Pool commands should be able execute on a host in maintenance#9301
sureshanaparti merged 4 commits intoapache:4.18from
abh1sar:storage-maint-418

Conversation

@abh1sar
Copy link
Contributor

@abh1sar abh1sar commented Jun 25, 2024

Description

This PR...

Fixes #9295

When a host is in maintenance, CreateStoragePoolCommand and DeleteStoragePoolCommand are not allowed to execute by the AgentAttache.
This will cause a new storage pool to not be present on the host even after it comes out of maintenance, as the cloudstack agent is not restarted when cancel maintain is called (#3239).
This also causes a deleted storage pool to never be removed from a host in maintenance.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

  1. Put a host in maintenance.
  2. Add a new storage pool and verify that it is mounted on the host
  3. Delete a storage pool and verify that it has been unmounted from the host
  4. Take the host out of maintenance and verify again.

How did you try to break this feature and the system with this change?

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 25, 2024

@blueorangutan package

@blueorangutan
Copy link

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov
Copy link

codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 27.27273% with 8 lines in your changes missing coverage. Please review.

Project coverage is 12.24%. Comparing base (351de5f) to head (50e09fe).
Report is 1 commits behind head on 4.18.

Files Patch % Lines
...n/java/com/cloud/resource/ResourceManagerImpl.java 0.00% 7 Missing ⚠️
...cycle/CloudStackPrimaryDataStoreLifeCycleImpl.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               4.18    #9301      +/-   ##
============================================
- Coverage     12.24%   12.24%   -0.01%     
  Complexity     9296     9296              
============================================
  Files          4699     4699              
  Lines        414331   414347      +16     
  Branches      51999    51008     -991     
============================================
+ Hits          50731    50732       +1     
- Misses       357295   357310      +15     
  Partials       6305     6305              
Flag Coverage Δ
unittests 12.24% <27.27%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10111

@yadvr yadvr added this to the 4.19.1.0 milestone Jun 25, 2024
@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 25, 2024

@blueorangutan test

@blueorangutan
Copy link

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Contributor

@sureshanaparti sureshanaparti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

Copy link
Member

@kiranchavala kiranchavala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abh1sar ,

Steps that i followed

  1. keep one of the host in a cluster in maintenance mode

  2. Add a cluster wide primary storage

  3. Exception observed

Logs


2024-06-25 13:06:12,682 DEBUG [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) createPool Params @ scheme - nfs storageHost - 10.0.32.4 hostPath - /acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1 port - -1
2024-06-25 13:06:12,690 DEBUG [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) creating pool pri2 on  host 1
2024-06-25 13:06:12,694 WARN  [c.c.a.m.AgentManagerImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Resource [Host:1] is unreachable: Host 1: Unable to send class com.cloud.agent.api.CreateStoragePoolCommand because agent ref-trl-6898-k-Mr8-kiran-chavala-kvm1 is in maintenance mode
2024-06-25 13:06:12,699 WARN  [o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
2024-06-25 13:06:12,699 DEBUG [c.c.s.StorageManagerImpl] (qtp1278254413-13:ctx-a68caea6 ctx-a0609d1c) (logid:30bd4a00) Failed to add data store: Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
com.cloud.utils.exception.CloudRuntimeException: Can not create storage pool through host 1 due to CreateStoragePoolCommand returns null
        at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.createStoragePool(CloudStackPrimaryDataStoreLifeCycleImpl.java:415)
        at org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.attachCluster(CloudStackPrimaryDataStoreLifeCycleImpl.java:438)
        at com.cloud.storage.StorageManagerImpl.createPool(StorageManagerImpl.java:851)
        

Ideally, when adding cluster wide primary storage, cloudstack should check for active hosts in the cluster and add the storage,

When the host comes out of maintainence mode the new storage should get added


Currently there is no issue if a zone wide storage is added when a host is in maintenance mode, once the host comes out of maintenance mode I see modify storage pool getting executed

 2024-06-25 13:23:44,690 DEBUG [cloud.agent.Agent] (agentRequest-Handler-4:null) (logid:094fcf46) Seq 1-1815232124807020547:  { Ans: , MgmtId: 167780928, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.ModifyStoragePoolAnswer":{"poolInfo":{"host":"10.0.32.4","localPath":"/mnt//42d866c8-547c-3741-a71e-4205227ffe28","hostPath":"/acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1","poolType":"NetworkFilesystem","capacityBytes":"(1.9990 TB) 2197949513728","availableBytes":"(668.27 GB) 717546848256"},"templateInfo":{},"datastoreClusterChildren":[],"result":"true","wait":"0","bypassHostMaintenance":"false"}}] }
 

No issue with the deletion of zone wide storage pool even if the host is in maintaience mode


2024-06-25 13:40:29,655 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:57d7f82f) Request:Seq 1-1815232124807020577:  { Cmd , MgmtId: 167780928, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"3","uuid":"42d866c8-547c-3741-a71e-4205227ffe28","host":"10.0.32.4","path":"/acs/primary/ref-trl-6897-k-Mr8-kiran-chavala/ref-trl-6897-k-Mr8-kiran-chavala-kvm-pri1","port":"2049","type":"NetworkFilesystem"},"_localPath":"/mnt//42d866c8-547c-3741-a71e-4205227ffe28","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2024-06-25 13:40:29,655 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:57d7f82f) Processing command: com.cloud.agent.api.DeleteStoragePoolCommand

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 25, 2024

@blueorangutan package

@blueorangutan
Copy link

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10124

@blueorangutan
Copy link

[SF] Trillian test result (tid-10612)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42162 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10612-kvm-centos7.zip
Smoke tests completed. 107 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_migrate_vm Error 45.85 test_vm_life_cycle.py
test_01_cancel_host_maintenace_with_no_migration_jobs Error 114.46 test_host_maintenance.py
test_disable_oobm_ha_state_ineligible Error 1513.05 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Failure 310.26 test_hostha_kvm.py

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 26, 2024

@blueorangutan test

@blueorangutan
Copy link

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@sureshanaparti sureshanaparti requested a review from yadvr June 26, 2024 06:00
@blueorangutan
Copy link

[SF] Trillian test result (tid-10626)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41860 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10626-kvm-centos7.zip
Smoke tests completed. 107 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_08_migrate_vm Error 45.92 test_vm_life_cycle.py
test_01_vpc_site2site_vpn Failure 279.86 test_vpc_vpn.py
test_hostha_enable_ha_when_host_disabled Error 3.68 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Error 302.87 test_hostha_kvm.py
test_hostha_kvm_host_recovering Error 7.13 test_hostha_kvm.py

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 27, 2024

@blueorangutan test keepEnv

@blueorangutan
Copy link

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 27, 2024

Working on resolving smoke test failures

@weizhouapache
Copy link
Member

codewise lgtm

@abh1sar
what about other resource states, like ErrorInMaintenance or ErrorInPrepareForMaintenance, PrepareForMaintenance ?

@blueorangutan
Copy link

[SF] Trillian test result (tid-10648)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 45084 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10648-kvm-centos7.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_hostha_enable_ha_when_host_disabled Error 4.74 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Error 302.76 test_hostha_kvm.py

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 28, 2024

The approach of ssh-ing into the in-maintenance-host to restart the agent even if the agent was already connected is not right, as it breaks the change done in #3239

I think better solution would be to allow createStoragePoolCommand to run on the host in maintenance mode (like ModifyStoragePoolCommand)

@sureshanaparti @kiranchavala

@sureshanaparti
Copy link
Contributor

The approach of ssh-ing into the in-maintenance-host to restart the agent even if the agent was already connected is not right, as it breaks the change done in #3239

I think better solution would be to allow createStoragePoolCommand to run on the host in maintenance mode (like ModifyStoragePoolCommand)

@sureshanaparti @kiranchavala

Yes @abh1sar, If the agent is already connected, better check & update the storage pools with agent command.

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 28, 2024

@blueorangutan package

@blueorangutan
Copy link

@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@abh1sar abh1sar changed the title Restart agent when host comes out of maintenance Add, Delete, Modify Storage Pool commands should be able execute on a host in maintenance Jun 28, 2024
@abh1sar abh1sar changed the title Add, Delete, Modify Storage Pool commands should be able execute on a host in maintenance Add, Delete Storage Pool commands should be able execute on a host in maintenance Jun 28, 2024
@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 28, 2024

codewise lgtm

@abh1sar what about other resource states, like ErrorInMaintenance or ErrorInPrepareForMaintenance, PrepareForMaintenance ?

Have changed the approach, so this code is now obsolete. Please check.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10184

@abh1sar
Copy link
Contributor Author

abh1sar commented Jun 28, 2024

@blueorangutan test

@blueorangutan
Copy link

@abh1sar a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Contributor

@sureshanaparti sureshanaparti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

Copy link
Contributor

@vladimirpetrov vladimirpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on manual testing with NFS storage.

@blueorangutan
Copy link

[SF] Trillian test result (tid-10670)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 39649 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9301-t10670-kvm-centos7.zip
Smoke tests completed. 109 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_hostha_kvm_host_fencing Error 106.67 test_hostha_kvm.py

dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Jul 2, 2024
… maintenance (apache#9301)

* Restart agent when host comes out of maintenance

* Don't send CreateStoragePoolCommand to hosts in maintenance mode

* CreateStoragePoolCommand can run when host in maintenance. Reverted the change to restart agent when host was already up and in maintenance

* Reverted changes done to ResourceManagerImplTest
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

No open projects
Status: Done

Development

Successfully merging this pull request may close these issues.

Storage pools are not mounted when the host is in maintenance mode or comes out of maintenance mode

7 participants