Skip to content

Fixed removal of hosts from certsmap when running certificate auto-renew#4156

Merged
yadvr merged 1 commit intoapache:masterfrom
shapeblue:failed_auto_renew
Jul 15, 2020
Merged

Fixed removal of hosts from certsmap when running certificate auto-renew#4156
yadvr merged 1 commit intoapache:masterfrom
shapeblue:failed_auto_renew

Conversation

@Spaceman1984
Copy link
Contributor

@Spaceman1984 Spaceman1984 commented Jun 19, 2020

Description

When a host connects to a management server, the host IP address and the certificate are stored in memory on the management server. This mapping is checked periodically to determine if any certificates are due to expire.

Before a certificate is renewed, a few checks are done to determine if the host is connected to the management server by fetching the host record from the database. The problem here is if the wrong record is fetched, the host is not checked for renewal.

This PR improves the host record fetch from the database by looking only at hosts that are not removed.

Fixes: #4129

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

This has been tested by setting the ca.framework.cert.validity.period and ca.framework.cert.expiry.alert.period to the same value. This is to ensure that a certificate is up for renewal as soon as it is issued.
Then watch the management server logs to see if auto-renewal happens.

This has also been tested by using two management servers and reprovisioning host security keys from the second management server and still having the certs auto-renew.

@yadvr yadvr added this to the 4.15.0.0 milestone Jun 19, 2020
@Spaceman1984
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@Spaceman1984 a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@Spaceman1984 Spaceman1984 changed the title Fixed removal of hosts from certsmap when running certificate auto-renwew Fixed removal of hosts from certsmap when running certificate auto-renew Jun 19, 2020
@blueorangutan
Copy link

Packaging result: ✔centos7 ✔debian. JID-1421

@Spaceman1984
Copy link
Contributor Author

@blueorangutan test

@blueorangutan
Copy link

@Spaceman1984 a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@yadvr
Copy link
Member

yadvr commented Jun 24, 2020

@Spaceman1984 can you kick packaging and tests again; was purgeHost being called on a host already removed in DB? But how that does affect hosts that are already connected to a management server (i.e. host not removed in DB)?

@Spaceman1984
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@Spaceman1984 a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@Spaceman1984
Copy link
Contributor Author

@rhtyd I didn't see purgeHost being called on a host already removed in my testing. What I observed, was the host being ignored because a check was done on an empty management server field.

@blueorangutan
Copy link

Packaging result: ✔centos7 ✔debian. JID-1443

@Spaceman1984
Copy link
Contributor Author

@blueorangutan test

@blueorangutan
Copy link

@Spaceman1984 a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Contributor

@nvazquez nvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Spaceman1984 this looks good but I still think it does not solve the issue completely: when the certificates are being removed from the active certificates map (in memory) on agent disconnection then those certificates will be skipped from the certificates loop in the CA background task

@Spaceman1984
Copy link
Contributor Author

Spaceman1984 commented Jun 24, 2020

@nvazquez there is no problem with the certs map losing certs when hosts disconnect because the cert map is repopulated when a host connects again.

If you provision keys from a different management server, the host will disconnect but can reconnect to any available management server in it's list.

If you test with 2 management servers and attach your debugger to both at the same time, you will see the cert map being populated on the management server the host is connecting to.

@Spaceman1984
Copy link
Contributor Author

Spaceman1984 commented Jun 24, 2020

Auto-renewal can only happen if the host has a valid certificate. As soon as the certificate becomes invalid, the host can't communicate with the management server anymore and therefore wouldn't be able to get a new certificate. So before the validity period of the certificate would run out and the host is not able to communicate with the management server, auto-renewal must happen. - The auto-renewal process was failing and therefore certificates were not being renewed. If auto-renewal is fixed, then there wouldn't be a problem with disconnected hosts.

@nvazquez
Copy link
Contributor

Thanks @Spaceman1984, tested manually and looks good.

@rhtyd I noticed the internal active certificates map uses the host IP as the key, which is then used for querying in the DB for that IP. Is there any reason for not using the internal host ID or the UUID as the map key and using the host IP instead?

@nvazquez
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos7 ✔debian. JID-1492

@nvazquez
Copy link
Contributor

@blueorangutan test

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@yadvr
Copy link
Member

yadvr commented Jul 1, 2020

@nvazquez yes it's because the code which adds the X509 cert to the hashmap (currently) uses the incoming client's IP address as the key. It may need checking or use host UUID/ID instead of IP address if that corrects the implementation.

@yadvr yadvr requested a review from nvazquez July 1, 2020 03:10
@yadvr
Copy link
Member

yadvr commented Jul 4, 2020

ping @nvazquez please review

Copy link
Contributor

@nvazquez nvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on manual testing and code review

Copy link
Member

@yadvr yadvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but if the certMap is empty that issue may not be fixed by this PR.

@yadvr
Copy link
Member

yadvr commented Jul 8, 2020

@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos7 ✔debian. JID-1554

@yadvr
Copy link
Member

yadvr commented Jul 8, 2020

@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-2034)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44441 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4156-t2034-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 82 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Failure 362.40 test_vpc_redundant.py

@yadvr yadvr merged commit 7b88151 into apache:master Jul 15, 2020
Slair1 pushed a commit to ippathways/cloudstack that referenced this pull request Mar 11, 2021
…to-renewal (apache#4156)

When a host connects to a management server, the host IP address and the certificate are stored in memory on the management server. This mapping is checked periodically to determine if any certificates are due to expire.

Before a certificate is renewed, a few checks are done to determine if the host is connected to the management server by fetching the host record from the database. The problem here is if the wrong record is fetched, the host is not checked for renewal.

This PR improves the host record fetch from the database by looking only at hosts that are not removed.

Fixes: apache#4129
nlgordon pushed a commit to ippathways/cloudstack that referenced this pull request Aug 2, 2022
…to-renewal (apache#4156)

When a host connects to a management server, the host IP address and the certificate are stored in memory on the management server. This mapping is checked periodically to determine if any certificates are due to expire.

Before a certificate is renewed, a few checks are done to determine if the host is connected to the management server by fetching the host record from the database. The problem here is if the wrong record is fetched, the host is not checked for renewal.

This PR improves the host record fetch from the database by looking only at hosts that are not removed.

Fixes: apache#4129
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Certificate auto renewal fails after agent disconnection or re-provision certificate to host

6 participants