OCPBUGS-78093: AWS userProvisioned DNS: Obtain Load Balancer IPs from DNS names#10531
OCPBUGS-78093: AWS userProvisioned DNS: Obtain Load Balancer IPs from DNS names#10531sadasu wants to merge 2 commits intoopenshift:mainfrom
Conversation
|
@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces EC2 NIC/security-group IP derivation with DNS-based resolution: adds getIPsFromDNSName (DNS resolution with exponential backoff) and updates editIgnitionForCustomDNS to build private/public IP lists from APIServerELB.DNSName and SecondaryAPIServerELB.DNSName, with explicit error handling and preserved Ignition edit call. ChangesLoad Balancer IP Discovery
sequenceDiagram
participant Controller
participant DNS
participant ELB as LoadBalancer
participant IgnitionEditor
Controller->>DNS: getIPsFromDNSName(APIServerELB.DNSName)
DNS->>ELB: resolve DNS -> A/AAAA records
ELB-->>DNS: return IP addresses
DNS-->>Controller: IP list (with retries/backoff)
alt Public API enabled
Controller->>DNS: getIPsFromDNSName(SecondaryAPIServerELB.DNSName)
DNS->>ELB: resolve secondary DNS
ELB-->>DNS: return public IPs
DNS-->>Controller: public IP list
else Private install
Controller-->>Controller: publicIPAddresses = privateIPAddresses
end
Controller->>IgnitionEditor: EditIgnitionForCustomDNS(publicIPAddresses, privateIPAddresses)
IgnitionEditor-->>Controller: edited Ignition config
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/jira refresh |
|
@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test ? |
|
@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-aws-custom-dns-techpreview |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/infrastructure/aws/clusterapi/ignition.go`:
- Around line 44-69: The code is populating privateIPAddresses from
awsCluster.Status.Network.APIServerELB and publicIPAddresses from
SecondaryAPIServerELB, but APIServerELB should be public and
SecondaryAPIServerELB private; in the block that handles
awsCluster.Status.Network.APIServerELB use net.LookupIP and append results to
publicIPAddresses (and update the Debugf message and error text to reference
APIServerELB -> public), and in the block for
awsCluster.Status.Network.SecondaryAPIServerELB append results to
privateIPAddresses (and update its Debugf message and error text to reference
SecondaryAPIServerELB -> private); keep the same net.LookupIP, error wrapping,
and nil/empty checks but swap which slice (publicIPAddresses vs
privateIPAddresses) gets populated.
- Around line 45-64: The DNS lookups for APIServerELB and SecondaryAPIServerELB
use net.LookupIP (in the blocks referencing
awsCluster.Status.Network.APIServerELB.DNSName and
awsCluster.Status.Network.SecondaryAPIServerELB.DNSName); replace these with a
context-aware resolver and retry: use (&net.Resolver{}).LookupIP(ctx, "ip",
dnsName) (or resolver.LookupIP) wrapped inside
wait.ExponentialBackoffWithContext to retry until ctx expires, preserving the
existing error wrapping and logs (update the error messages produced where
net.LookupIP errors are currently returned, and ensure you append resolved IPs
to privateIPAddresses and log via logrus.Debugf as before); make sure to use the
function’s ctx (or add one if missing) so retries honor the caller’s context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 23105a17-7116-41d4-9eec-cf72ee1cac74
📒 Files selected for processing (1)
pkg/infrastructure/aws/clusterapi/ignition.go
ad889f0 to
059ca7c
Compare
|
/test e2e-aws-custom-dns-techpreview |
|
/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7 |
|
@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d09b15e0-47ea-11f1-9957-0772f0f29464-0 |
|
/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-fips-tp-arm-f7 |
|
@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f92e2180-47ec-11f1-90e7-6b77341e7dd5-0 |
| } else { | ||
| logrus.Info("AWS: APIServerELB DNS name is not available") | ||
| } |
There was a problem hiding this comment.
I guess this case should never happen. But if it somehow does, we should fail and stop the install because there is no IP to fetch ignition, right?
| logrus.Debugf("found private IP address %s associated with %s", *nic.PrivateIpAddress, *nic.Description) | ||
| privateIPAddresses = append(privateIPAddresses, *nic.PrivateIpAddress) | ||
| // Get public LB IP addresses from SecondaryAPIServerELB DNS name | ||
| if dnsName := awsCluster.Status.Network.SecondaryAPIServerELB.DNSName; dnsName != "" { |
There was a problem hiding this comment.
The field SecondaryAPIServerELB is only available in public clusters (i.e. External publishing). Should we just wrap this block in the check in.InstallConfig.Config.PublicAPI()? For example:
if in.InstallConfig.Config.PublicAPI() {
// Get public LB IP addresses from SecondaryAPIServerELB DNS name
if dnsName := awsCluster.Status.Network.SecondaryAPIServerELB.DNSName; dnsName != "" {
ips, err := getIPsFromDNSName(ctx, dnsName, "SecondaryAPIServerELB")
if err != nil {
return nil, err
}
publicIPAddresses = ips
} else {
logrus.Info("AWS: SecondaryAPIServerELB DNS name is not available")
}
} else {
// For private cluster installs, the API LB IP is the same as the API-Int LB IP
publicIPAddresses = privateIPAddresses
}There was a problem hiding this comment.
For example, an private install will report AWSCluster with:
secondaryapiserverelb:
arn: ""
name: ""
dnsname: ""
scheme: ""| return nil, fmt.Errorf("failed to describe network interfaces: %w", err) | ||
| // Get private LB IP addresses from APIServerELB DNS name | ||
| if dnsName := awsCluster.Status.Network.APIServerELB.DNSName; dnsName != "" { | ||
| ips, err := getIPsFromDNSName(ctx, dnsName, "APIServerELB") |
There was a problem hiding this comment.
nit: How about also including the scheme of the LB so that we can easily know whether the LB is internal or external? For example:
| ips, err := getIPsFromDNSName(ctx, dnsName, "APIServerELB") | |
| ips, err := getIPsFromDNSName(ctx, dnsName, fmt.Sprintf("APIServerELB (scheme: %s)", awsCluster.Status.Network.APIServerELB.Scheme)) |
| Duration: 1 * time.Second, | ||
| Factor: 2.0, | ||
| Jitter: 0.1, | ||
| Steps: 5, |
There was a problem hiding this comment.
I think this is a great idea! I am only concerned about EU Sovereign Cloud (EUSC), which is currently notorious for slow DNS propagation...
Trying to launch a payload job but it failed at image build 😓
There was a problem hiding this comment.
we are only expecting Load Balancers to be created
Oh, this is the concerning part 😅 I think we're resolving API LB's AWS-assigned domain to get the LB IP addresses, right?
Those domains, for example ci-op-d43y6vjn-963aa-zrkxv-ext-c43a9b9b85cb8087.elb.us-east-1.amazonaws.com, can be significantly slow to resolve to IP addresses in EUSC 🤷
There was a problem hiding this comment.
We have opened a related bug: https://redhat.atlassian.net/browse/OCPBUGS-83741. In some cases, it took up to 15 minutes XD
There was a problem hiding this comment.
Understood. Do we increase the duration then? to 15 minutes?
There was a problem hiding this comment.
Yes, for safe, I think so :D
Though, I really expect AWS to sort it out in the next few months and we can bring the value back down.
|
/verified by CI e2e-aws-custom-dns-techpreview indicates that this fix works for regular regions. |
|
@sadasu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
059ca7c to
e69173c
Compare
|
/test e2e-aws-custom-dns-techpreview |
|
@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/infrastructure/aws/clusterapi/ignition.go (1)
34-44: ⚡ Quick winPreserve the last DNS error in the final failure path.
Current retry logic returns timeout/context errors only, so the concrete DNS failure cause is lost. Capturing and wrapping the last lookup error would make failures much easier to triage.
Suggested diff
func getIPsFromDNSName(ctx context.Context, dnsName, lbType string) ([]string, error) { resolver := &net.Resolver{} var ips []net.IP + var lastLookupErr error err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{ Duration: 1 * time.Second, Factor: 2.0, Jitter: 0.1, Steps: 5, }, func(ctx context.Context) (bool, error) { var err error ips, err = resolver.LookupIP(ctx, "ip", dnsName) if err != nil { + lastLookupErr = err logrus.Debugf("AWS: DNS lookup for %s DNS name %q failed, retrying: %v", lbType, dnsName, err) return false, nil } return true, nil }) if err != nil { + if lastLookupErr != nil { + return nil, fmt.Errorf("failed to lookup IP for %s DNS name %q after retries: %w", lbType, dnsName, lastLookupErr) + } return nil, fmt.Errorf("failed to lookup IP for %s DNS name %q after retries: %w", lbType, dnsName, err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/infrastructure/aws/clusterapi/ignition.go` around lines 34 - 44, The retry loop currently discards the concrete DNS lookup error; declare a variable (e.g., lastLookupErr error) outside the retry closure, assign lastLookupErr = err whenever resolver.LookupIP(ctx, "ip", dnsName) returns an error inside the closure, and when the retry returns an error wrap/return lastLookupErr (or include it in the message) instead of only the retry/context error; update the fmt.Errorf call to reference lastLookupErr so the final error preserves the actual DNS failure for resolver.LookupIP, lbType and dnsName.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/infrastructure/aws/clusterapi/ignition.go`:
- Around line 34-44: The retry loop currently discards the concrete DNS lookup
error; declare a variable (e.g., lastLookupErr error) outside the retry closure,
assign lastLookupErr = err whenever resolver.LookupIP(ctx, "ip", dnsName)
returns an error inside the closure, and when the retry returns an error
wrap/return lastLookupErr (or include it in the message) instead of only the
retry/context error; update the fmt.Errorf call to reference lastLookupErr so
the final error preserves the actual DNS failure for resolver.LookupIP, lbType
and dnsName.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e7da8d16-7477-4040-b299-e2a7b8a4585c
📒 Files selected for processing (1)
pkg/infrastructure/aws/clusterapi/ignition.go
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/infrastructure/aws/clusterapi/ignition.go (1)
33-40:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep retrying until the DNS name yields at least one IP.
Line 40 treats any nil lookup error as success, so an empty
ipsresult would flow through as “resolved” and leave ignition customization with no usable LB addresses. Please makelen(ips) > 0part of the success condition.Suggested fix
}, func(ctx context.Context) (bool, error) { var err error ips, err = resolver.LookupIP(ctx, "ip", dnsName) if err != nil { logrus.Debugf("AWS: DNS lookup for %s DNS name %q failed, retrying: %v", lbType, dnsName, err) return false, nil } + if len(ips) == 0 { + logrus.Debugf("AWS: DNS lookup for %s DNS name %q returned no IPs yet, retrying", lbType, dnsName) + return false, nil + } return true, nil })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/infrastructure/aws/clusterapi/ignition.go` around lines 33 - 40, The retry predicate for resolver.LookupIP currently treats a nil error as success even if ips is empty; change the success condition in that anonymous function so it returns true only when err == nil AND len(ips) > 0 (otherwise log/debug that no IPs were returned and return false, nil to keep retrying); update references to resolver.LookupIP, the local variable ips, and the anonymous func(ctx context.Context) (bool, error) to implement this check.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@pkg/infrastructure/aws/clusterapi/ignition.go`:
- Around line 33-40: The retry predicate for resolver.LookupIP currently treats
a nil error as success even if ips is empty; change the success condition in
that anonymous function so it returns true only when err == nil AND len(ips) > 0
(otherwise log/debug that no IPs were returned and return false, nil to keep
retrying); update references to resolver.LookupIP, the local variable ips, and
the anonymous func(ctx context.Context) (bool, error) to implement this check.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8fa10813-5002-4c2c-b699-67fb4ed574c2
📒 Files selected for processing (1)
pkg/infrastructure/aws/clusterapi/ignition.go
32968e8 to
52821f4
Compare
This change obtains the API and API-Int LoadBalancer IP addresses by looking up the IP addresses of the DNSName of APIServerELB (internal LB) and SecondaryAPIServerELB (external LB).
Testing has revealed that in EU Sovereign Cloud (EUSC), it can take up to 15 mins for the DNS propogation of Load Balancer DNS Names. So, increasing timeout for to ~17 mins to account for that.
52821f4 to
c40d89f
Compare
|
/test e2e-aws-custom-dns-techpreview |
tthvo
left a comment
There was a problem hiding this comment.
/approve
Looks good to me! Just waiting on e2e 👀
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tthvo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-4.22-aws-usgov-ipi-custom-dns-mini-perm-arm-f7 |
|
@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/733464c0-4979-11f1-8184-169a80cb3fca-0 |
|
/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7 |
|
@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/97ba8590-4979-11f1-8383-01ade8c75c4e-0 |
|
/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7 |
|
@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5cf824c0-497a-11f1-8921-f3c4ae4b4c1e-0 |
|
/payload-job periodic-ci-openshift-openshift-tests-private-release-4.22-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7 |
|
@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f9d77430-4989-11f1-9e8f-440160f01252-0 |
|
/retest-required |
|
Arghh, I totally forgot arm64 periodic job cannot be run via And the |
|
@sadasu: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Proposed openshift/release#79003 to make the missing cluster-api-actuator-pkg image available to 5.0 and 5.1 CI jobs |
|
/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7 |
|
@sadasu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/71c00040-4a55-11f1-9d8e-788f77b6d3dc-0 |
|
/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7 |
|
@sadasu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7bd358c0-4a55-11f1-9bbc-4217447069b4-0 |
|
I think arm64 jobs (except us-gov bug, which uses amd64) likely fail. Let's try with amd64 ones 😁 /payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-amd-f28-destructive |
|
@tthvo: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0459c170-4a56-11f1-94af-43f919255bde-0 |
This change obtains the API and API-Int LoadBalancer IP addresses by looking up the IP address of the DNSName of APIServerELB (public LB) and SecondaryAPIServerELB (private LB).
The original implementation used security group to find the network interfaces that correspond to the load balancers. This approach did not work in AWS Top Secret regions.
The updated implementations uses the AWSCluster's NetworkStatus field to obtain the DNS Names of the API and API Int Load Balancers and performs a Lookup to obtain their IP addresses. This works for all regions and Load Balancer types like CLB and NLB.
Summary by CodeRabbit