Skip to content

Fix false positives in Usage sanity checks of templates and network offerings #8136

Merged
DaanHoogland merged 5 commits intoapache:mainfrom
scclouds:rewrite-usage-sanity-checks-for-template-and-network
Feb 26, 2024
Merged

Fix false positives in Usage sanity checks of templates and network offerings #8136
DaanHoogland merged 5 commits intoapache:mainfrom
scclouds:rewrite-usage-sanity-checks-for-template-and-network

Conversation

@winterhazel
Copy link
Member

Description

This PR addresses two false positives in the Usage sanity checker:

(1)

A sanity check verifies if there are usage records of a template/ISO created after the template/ISO was removed.

In an environment with two zones, if a template gets removed from a zone and added back to it, cloud.template_zone_ref will have two entries related to the template; one marked as removed. In this situation, the operator will receive a false positive stating that there are usage records for the template/ISO after it was removed, even though the template/ISO is available in that zone.

This check was changed to verify if the template/ISO is currently available in that zone.

(2)

A sanity check verifies if there are usage records with a raw usage greater than the aggregation range.

If a VM has more than one NIC using the same network offering, CloudStack will group the usage of both NICs into a single network offering usage record; for example, two NICs used for 1 hour will result in an entry with a raw usage of 2 hours. In this situation, the operator will receive a false positive stating that there are usage records with a raw usage greater than the aggregation range.

This check was changed to consider the average usage of the NICs, instead of the raw usage.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

I changed the initial delay in UsageManagerImpl.java#L315 to 0 in order to run the sanity checks immediately when Usage starts. Then, whenever I wanted to run the sanity checks, I would set the last checked id in /usr/local/libexec/sanity-check-last-id to 1 and restart Usage.

To simulate the scenario in (1), I created the zone zone2, copied a template that was being used in zone1 to zone2 and deleted the template from zone1. Then, I copied it from zone2 to zone1.

Before applying the changes:

  • I ran the sanity checks and verified that CloudStack reported template/ISO usage records created after it was removed;

After applying the changes

  • I ran the sanity checks and verified that no errors were found;
  • I deleted the template from zone1, ran the sanity checks again and verified that CloudStack reported template/ISO usage records created after it was removed.

To simulate the scenario in (2), I set usage.stats.job.aggregation.range to 60 (hourly), created a VM with two NICs and waited until there were network offering usage records corresponding to the NICs. I verified that raw_usage was indeed 2.

Before applying the changes:

  • I ran the sanity checks and verified that CloudStack reported usage records with raw_usage greater than 1.

After applying the changes:

  • I ran the sanity checks and verified that CloudStack did not report usage records with raw_usage greater than 1;
  • I manually changed the raw_usage of a usage record from 2 to 3, ran the sanity checks again and verified that CloudStack reported network offering usage records with raw_usage greater than 1.

@winterhazel winterhazel changed the title Fix false positives in Usage sanity checks for templates and network offerings Fix false positives in Usage sanity checks of templates and network offerings Oct 24, 2023
@codecov
Copy link

codecov bot commented Oct 25, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (49cecae) 30.37% compared to head (234be64) 29.90%.
Report is 30 commits behind head on main.

Files Patch % Lines
.../main/java/com/cloud/usage/UsageSanityChecker.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8136      +/-   ##
============================================
- Coverage     30.37%   29.90%   -0.48%     
+ Complexity    32633    32023     -610     
============================================
  Files          5352     5352              
  Lines        374419   374423       +4     
  Branches      54609    54609              
============================================
- Hits         113719   111957    -1762     
- Misses       245523   247415    +1892     
+ Partials      15177    15051     -126     
Flag Coverage Δ
simulator-marvin-tests 23.47% <0.00%> (-0.67%) ⬇️
uitests 4.38% <ø> (ø)
unit-tests 16.41% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good fixes @winterhazel
clgtm

@DaanHoogland
Copy link
Contributor

@winterhazel , you can add yourself to the contributers list in .asf.yaml
Just create a PR for us to merge. It will give you some extra possibilities, like automatically run the github actions and being able to label PRs.

Copy link
Member

@yadvr yadvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but needs additional review/testing

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@DaanHoogland
Copy link
Contributor

@blueorangutan test alma8 kvm-alma8

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (alma8 mgmt + kvm-alma8) has been kicked to run smoke tests

@winterhazel
Copy link
Member Author

@DaanHoogland @harikrishna-patnala @shwstppr could you guys please validate the changes in this PR for merging? Thank you.

Copy link
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM. Needs testing

@shwstppr
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 8329

@winterhazel
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8338

@shwstppr
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@shwstppr a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Co-authored-by: João Jandre <48719461+JoaoJandre@users.noreply.github.com>
@winterhazel
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✖️ el7 ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 8594

@winterhazel
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@DaanHoogland
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-9149)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 58566 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8136-t9149-kvm-centos7.zip
Smoke tests completed. 125 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_04_create_iso_with_no_checksum Error 66.53 test_iso.py
test_04_deploy_vnf_appliance Error 92.41 test_vnf_templates.py
test_04_deploy_vnf_appliance Error 92.41 test_vnf_templates.py
test_05_delete_vnf_template Error 0.07 test_vnf_templates.py
ContextSuite context=TestVnfTemplates>:teardown Error 0.15 test_vnf_templates.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Error 345.81 test_vpc_redundant.py
test_02_redundant_VPC_default_routes Error 328.36 test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Error 233.75 test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics Error 179.25 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 261.87 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 261.89 test_vpc_redundant.py
test_01_VPC_nics_after_destroy Error 171.66 test_vpc_router_nics.py
test_02_VPC_default_routes Error 172.74 test_vpc_router_nics.py

@DaanHoogland
Copy link
Contributor

@blueorangutan test alma9 kvm-alma9 keepEnv

@blueorangutan
Copy link

@DaanHoogland a [SL] Trillian-Jenkins test job (alma9 mgmt + kvm-alma9) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian test result (tid-9182)
Environment: kvm-alma9 (x2), Advanced Networking with Mgmt server a9
Total time taken: 50461 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8136-t9182-kvm-alma9.zip
Smoke tests completed. 129 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@GutoVeronezi
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8655

@DaanHoogland
Copy link
Contributor

I guess it kind of worked because

14:17:18.487 [main] ERROR com.cloud.usage.UsageSanityChecker - Error: Found 919 usage records with raw_usage > 24

after running

/usr/bin/java -Xmx2G -cp /usr/share/cloudstack-usage/*:/usr/share/cloudstack-usage/lib/*:/usr/share/cloudstack-mysql-ha/lib/*:/etc/cloudstack/usage:/usr/share/java/mysql-connector-java.jar:/usr/share/cloudstack-common com.cloud.usage.UsageSanityChecker

unfortunately I had to look up the issue in the code:

        addCheckCase("SELECT count(*) FROM `cloud_usage`.`cloud_usage` cu where usage_type not in (4,5) and raw_usage > "
                + aggregationHours,
                "usage records with raw_usage > " + aggregationHours,
                lastCheckId);

Out of scope for this PR but I think we can improve there, @winterhazel , cc @vishesh92 .
@JoaoJandre are your concerns met?

Copy link
Contributor

@JoaoJandre JoaoJandre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLGTM, didn't test it

@DaanHoogland DaanHoogland merged commit d171582 into apache:main Feb 26, 2024
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Mar 5, 2024
…fferings (apache#8136)

Co-authored-by: João Jandre <48719461+JoaoJandre@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants