OCPBUGS-78093: AWS userProvisioned DNS: Obtain Load Balancer IPs from DNS names #10531

Open
sadasu wants to merge 2 commits into openshift:main from sadasu:aws-custom-dns-lb-ip-revision

Conversation

@sadasu
Contributor

@sadasu sadasu commented May 1, 2026

This change obtains the API and API-Int load balancer IP addresses by resolving the DNS names of APIServerELB (internal LB) and SecondaryAPIServerELB (external LB).

The original implementation used security groups to find the network interfaces that correspond to the load balancers. This approach did not work in AWS Top Secret regions.

The updated implementation uses the AWSCluster's NetworkStatus field to obtain the DNS names of the API and API-Int load balancers and performs a lookup to obtain their IP addresses. This works in all regions and for all load balancer types, such as CLB and NLB.

Summary by CodeRabbit

  • Refactor
    • Load balancer IP discovery now uses DNS resolution with retry/backoff instead of inspecting cloud network interfaces.
    • Adds explicit error handling and logging for missing or failed DNS lookups.
    • Public API installs obtain public IPs from the secondary endpoint; private installs reuse private addresses as before.

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 1, 2026
@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This change obtains the API and API-Int LoadBalancer IP addresses by looking up the IP address of the DNSName of APIServerELB (public LB) and SecondaryAPIServerELB (private LB).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 1, 2026
@coderabbitai

coderabbitai Bot commented May 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Replaces EC2 NIC/security-group IP derivation with DNS-based resolution: adds getIPsFromDNSName (DNS resolution with exponential backoff) and updates editIgnitionForCustomDNS to build private/public IP lists from APIServerELB.DNSName and SecondaryAPIServerELB.DNSName, with explicit error handling and preserved Ignition edit call.

Changes

Load Balancer IP Discovery

| Layer / File(s) | Summary |
|---|---|
| Helper / Resolution<br>`pkg/infrastructure/aws/clusterapi/ignition.go` | Added getIPsFromDNSName(ctx, dnsName, lbType) to resolve a DNS name to IP addresses using a resolver with exponential-backoff retries and logging. |
| Core Logic<br>`pkg/infrastructure/aws/clusterapi/ignition.go` | Rewrote editIgnitionForCustomDNS to collect privateIPAddresses from APIServerELB.DNSName and publicIPAddresses from SecondaryAPIServerELB.DNSName when Public API is enabled; for private installs publicIPAddresses is set to privateIPAddresses. |
| Error Handling / Logging<br>`pkg/infrastructure/aws/clusterapi/ignition.go` | Added explicit errors when DNS names are missing and added logging around DNS resolution successes/failures. |
| Integration / Invocation<br>`pkg/infrastructure/aws/clusterapi/ignition.go` | Removed prior EC2 NIC/security-group inspection logic; preserved the final call to clusterapi.EditIgnitionForCustomDNS(in, awstypes.Name, publicIPAddresses, privateIPAddresses) and updated imports (net, time, logrus, wait). |

sequenceDiagram
    participant Controller
    participant DNS
    participant ELB as LoadBalancer
    participant IgnitionEditor

    Controller->>DNS: getIPsFromDNSName(APIServerELB.DNSName)
    DNS->>ELB: resolve DNS -> A/AAAA records
    ELB-->>DNS: return IP addresses
    DNS-->>Controller: IP list (with retries/backoff)
    alt Public API enabled
        Controller->>DNS: getIPsFromDNSName(SecondaryAPIServerELB.DNSName)
        DNS->>ELB: resolve secondary DNS
        ELB-->>DNS: return public IPs
        DNS-->>Controller: public IP list
    else Private install
        Controller-->>Controller: publicIPAddresses = privateIPAddresses
    end
    Controller->>IgnitionEditor: EditIgnitionForCustomDNS(publicIPAddresses, privateIPAddresses)
    IgnitionEditor-->>Controller: edited Ignition config

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: obtaining Load Balancer IPs from DNS names instead of security group-based derivation, which is the primary refactoring described in the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The PR modifies only source code (ignition.go) with no test files or Ginkgo test declarations. The check is not applicable to non-test code. |
| Test Structure And Quality | ✅ Passed | No Ginkgo test code is present in this PR. The check only applies to test code review, and all changes are production code in non-test files. |
| Microshift Test Compatibility | ✅ Passed | PR modifies only ignition.go, an infrastructure utility file with no e2e tests. No Ginkgo test patterns found. Check not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. PR only modifies ignition.go with DNS lookup business logic. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies installer provisioning code for DNS-based LB IP resolution. It does not add deployment manifests or pod scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | Library code in pkg/ directory with only regular functions. No process-level code, no stdout writes, no logging misconfiguration. OTE Stdout Contract inapplicable to non-OTE library code. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added. PR modifies infrastructure provisioning code (ignition.go) with DNS-based IP resolution, not test code. Check applies only when e2e tests are added. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@sadasu
Contributor Author

sadasu commented May 1, 2026

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 1, 2026
@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

@sadasu
Contributor Author

sadasu commented May 1, 2026

/test ?

@openshift-ci openshift-ci Bot requested review from patrickdillon and tthvo May 1, 2026 18:25
@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request.

Details

In response to this:

This change obtains the API and API-Int LoadBalancer IP addresses by looking up the IP address of the DNSName of APIServerELB (public LB) and SecondaryAPIServerELB (private LB).

Summary by CodeRabbit

  • Refactor
  • Improved load balancer IP address discovery mechanism, enhancing performance and system reliability.

@sadasu
Contributor Author

sadasu commented May 1, 2026

/test e2e-aws-custom-dns-techpreview

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/infrastructure/aws/clusterapi/ignition.go`:
- Around line 44-69: The code is populating privateIPAddresses from
awsCluster.Status.Network.APIServerELB and publicIPAddresses from
SecondaryAPIServerELB, but APIServerELB should be public and
SecondaryAPIServerELB private; in the block that handles
awsCluster.Status.Network.APIServerELB use net.LookupIP and append results to
publicIPAddresses (and update the Debugf message and error text to reference
APIServerELB -> public), and in the block for
awsCluster.Status.Network.SecondaryAPIServerELB append results to
privateIPAddresses (and update its Debugf message and error text to reference
SecondaryAPIServerELB -> private); keep the same net.LookupIP, error wrapping,
and nil/empty checks but swap which slice (publicIPAddresses vs
privateIPAddresses) gets populated.
- Around line 45-64: The DNS lookups for APIServerELB and SecondaryAPIServerELB
use net.LookupIP (in the blocks referencing
awsCluster.Status.Network.APIServerELB.DNSName and
awsCluster.Status.Network.SecondaryAPIServerELB.DNSName); replace these with a
context-aware resolver and retry: use (&net.Resolver{}).LookupIP(ctx, "ip",
dnsName) (or resolver.LookupIP) wrapped inside
wait.ExponentialBackoffWithContext to retry until ctx expires, preserving the
existing error wrapping and logs (update the error messages produced where
net.LookupIP errors are currently returned, and ensure you append resolved IPs
to privateIPAddresses and log via logrus.Debugf as before); make sure to use the
function’s ctx (or add one if missing) so retries honor the caller’s context.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 23105a17-7116-41d4-9eec-cf72ee1cac74

📥 Commits

Reviewing files that changed from the base of the PR and between 73b340a and 0f73425.

📒 Files selected for processing (1)
  • pkg/infrastructure/aws/clusterapi/ignition.go

Comment thread pkg/infrastructure/aws/clusterapi/ignition.go Outdated
Comment thread pkg/infrastructure/aws/clusterapi/ignition.go Outdated
@sadasu sadasu force-pushed the aws-custom-dns-lb-ip-revision branch 2 times, most recently from ad889f0 to 059ca7c Compare May 1, 2026 20:07
@sadasu
Contributor Author

sadasu commented May 1, 2026

@barbacbd and @tthvo Could you please review?

@sadasu
Contributor Author

sadasu commented May 1, 2026

/test e2e-aws-custom-dns-techpreview

@tthvo
Member

tthvo commented May 4, 2026

/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d09b15e0-47ea-11f1-9957-0772f0f29464-0

@tthvo
Member

tthvo commented May 4, 2026

/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-fips-tp-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-fips-tp-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f92e2180-47ec-11f1-90e7-6b77341e7dd5-0

Comment on lines +80 to 82
} else {
	logrus.Info("AWS: APIServerELB DNS name is not available")
}
Member

I guess this case should never happen. But if it somehow does, we should fail and stop the install because there is no IP to fetch ignition, right?

logrus.Debugf("found private IP address %s associated with %s", *nic.PrivateIpAddress, *nic.Description)
privateIPAddresses = append(privateIPAddresses, *nic.PrivateIpAddress)
// Get public LB IP addresses from SecondaryAPIServerELB DNS name
if dnsName := awsCluster.Status.Network.SecondaryAPIServerELB.DNSName; dnsName != "" {
Member

The field SecondaryAPIServerELB is only available in public clusters (i.e. External publishing). Should we just wrap this block in the check in.InstallConfig.Config.PublicAPI()? For example:

if in.InstallConfig.Config.PublicAPI() {
	// Get public LB IP addresses from SecondaryAPIServerELB DNS name
	if dnsName := awsCluster.Status.Network.SecondaryAPIServerELB.DNSName; dnsName != "" {
		ips, err := getIPsFromDNSName(ctx, dnsName, "SecondaryAPIServerELB")
		if err != nil {
			return nil, err
		}
		publicIPAddresses = ips
	} else {
		logrus.Info("AWS: SecondaryAPIServerELB DNS name is not available")
	}
} else {
	// For private cluster installs, the API LB IP is the same as the API-Int LB IP
	publicIPAddresses = privateIPAddresses
}

Member

For example, a private install will report AWSCluster with:

secondaryapiserverelb:
  arn: ""
  name: ""
  dnsname: ""
  scheme: ""
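
The branching suggested in this thread can be sketched in isolation as follows. This is an illustration under stated assumptions, not the PR's code: `publicIPsFor` is a hypothetical helper, `publicAPI` stands in for `in.InstallConfig.Config.PublicAPI()`, and `resolveSecondary` stands in for resolving `SecondaryAPIServerELB.DNSName`. Returning an error for a missing DNS name on a public install is a choice made in this sketch; the suggestion above only logs.

```go
package main

import (
	"errors"
	"fmt"
)

// publicIPsFor returns the API (public) addresses for an install.
// All parameter names are illustrative stand-ins, see the lead-in.
func publicIPsFor(publicAPI bool, secondaryDNSName string, privateIPs []string,
	resolveSecondary func(dnsName string) ([]string, error)) ([]string, error) {
	if !publicAPI {
		// Private install: SecondaryAPIServerELB is empty, so the API LB
		// addresses are the same as the API-Int LB addresses.
		return privateIPs, nil
	}
	if secondaryDNSName == "" {
		return nil, errors.New("SecondaryAPIServerELB DNS name is not available")
	}
	return resolveSecondary(secondaryDNSName)
}

func main() {
	private := []string{"10.0.0.5"}
	resolve := func(string) ([]string, error) { return []string{"203.0.113.9"}, nil }

	ips, err := publicIPsFor(false, "", private, resolve)
	fmt.Println("private install:", ips, err)

	ips, err = publicIPsFor(true, "ext.elb.example.com", private, resolve)
	fmt.Println("public install:", ips, err)
}
```

Keeping the publish-strategy check outside the DNS-name check means an empty `secondaryapiserverelb` block, as in the status above, is only an error when a public API was actually requested.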

return nil, fmt.Errorf("failed to describe network interfaces: %w", err)
// Get private LB IP addresses from APIServerELB DNS name
if dnsName := awsCluster.Status.Network.APIServerELB.DNSName; dnsName != "" {
ips, err := getIPsFromDNSName(ctx, dnsName, "APIServerELB")
Member

nit: How about also including the scheme of the LB so that we can easily know whether the LB is internal or external? For example:

Suggested change
ips, err := getIPsFromDNSName(ctx, dnsName, "APIServerELB")
ips, err := getIPsFromDNSName(ctx, dnsName, fmt.Sprintf("APIServerELB (scheme: %s)", awsCluster.Status.Network.APIServerELB.Scheme))

Comment on lines +29 to +32
Duration: 1 * time.Second,
Factor: 2.0,
Jitter: 0.1,
Steps: 5,
Member

I think this is a great idea! I am only concerned about EU Sovereign Cloud (EUSC), which is currently notorious for slow DNS propagation...

Trying to launch a payload job but it failed at image build 😓

Member

we are only expecting Load Balancers to be created

Oh, this is the concerning part 😅 I think we're resolving API LB's AWS-assigned domain to get the LB IP addresses, right?

Those domains, for example ci-op-d43y6vjn-963aa-zrkxv-ext-c43a9b9b85cb8087.elb.us-east-1.amazonaws.com, can be significantly slow to resolve to IP addresses in EUSC 🤷

Member

We have opened a related bug: https://redhat.atlassian.net/browse/OCPBUGS-83741. In some cases, it took up to 15 minutes XD

Contributor Author

Understood. Should we increase the duration to 15 minutes, then?

Member

Yes, to be safe, I think so :D

Though, I really expect AWS to sort it out in the next few months and we can bring the value back down.

@sadasu
Contributor Author

sadasu commented May 5, 2026

/verified by CI

e2e-aws-custom-dns-techpreview indicates that this fix works for regular regions.
The fix will be tested in AWS Top Secret regions after merge.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 5, 2026
@openshift-ci-robot
Contributor

@sadasu: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

e2e-aws-custom-dns-techpreview indicates that this fix works for regular regions.
The fix will be tested in AWS Top Secret regions after merge.

@sadasu
Contributor Author

sadasu commented May 5, 2026

/retest

@sadasu sadasu force-pushed the aws-custom-dns-lb-ip-revision branch from 059ca7c to e69173c Compare May 5, 2026 21:03
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 5, 2026
@sadasu
Contributor Author

sadasu commented May 5, 2026

/test e2e-aws-custom-dns-techpreview

@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-78093, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request.

Details

In response to this:

This change obtains the API and API-Int LoadBalancer IP addresses by looking up the IP address of the DNSName of APIServerELB (public LB) and SecondaryAPIServerELB (private LB).

The original implementation used security groups to find the network interfaces that correspond to the load balancers. This approach did not work in AWS Top Secret regions.

The updated implementation uses the AWSCluster's NetworkStatus field to obtain the DNS names of the API and API-Int load balancers and performs a lookup to obtain their IP addresses. This works in all regions and for all load balancer types, such as CLB and NLB.

Summary by CodeRabbit

  • Refactor
  • Load balancer IP discovery now derives addresses via DNS resolution with retry/backoff instead of inspecting cloud network interfaces.
  • Adds explicit error handling and logging for missing or failed DNS lookups.
  • Public API installs obtain public IPs from the secondary endpoint; private installs reuse private addresses as before.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/infrastructure/aws/clusterapi/ignition.go (1)

34-44: ⚡ Quick win

Preserve the last DNS error in the final failure path.

Current retry logic returns timeout/context errors only, so the concrete DNS failure cause is lost. Capturing and wrapping the last lookup error would make failures much easier to triage.

Suggested diff
 func getIPsFromDNSName(ctx context.Context, dnsName, lbType string) ([]string, error) {
 	resolver := &net.Resolver{}
 	var ips []net.IP
+	var lastLookupErr error
 	err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{
 		Duration: 1 * time.Second,
 		Factor:   2.0,
 		Jitter:   0.1,
 		Steps:    5,
 	}, func(ctx context.Context) (bool, error) {
 		var err error
 		ips, err = resolver.LookupIP(ctx, "ip", dnsName)
 		if err != nil {
+			lastLookupErr = err
 			logrus.Debugf("AWS: DNS lookup for %s DNS name %q failed, retrying: %v", lbType, dnsName, err)
 			return false, nil
 		}
 		return true, nil
 	})
 	if err != nil {
+		if lastLookupErr != nil {
+			return nil, fmt.Errorf("failed to lookup IP for %s DNS name %q after retries: %w", lbType, dnsName, lastLookupErr)
+		}
 		return nil, fmt.Errorf("failed to lookup IP for %s DNS name %q after retries: %w", lbType, dnsName, err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/infrastructure/aws/clusterapi/ignition.go` around lines 34 - 44, The
retry loop currently discards the concrete DNS lookup error; declare a variable
(e.g., lastLookupErr error) outside the retry closure, assign lastLookupErr =
err whenever resolver.LookupIP(ctx, "ip", dnsName) returns an error inside the
closure, and when the retry returns an error wrap/return lastLookupErr (or
include it in the message) instead of only the retry/context error; update the
fmt.Errorf call to reference lastLookupErr so the final error preserves the
actual DNS failure for resolver.LookupIP, lbType and dnsName.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e7da8d16-7477-4040-b299-e2a7b8a4585c

📥 Commits

Reviewing files that changed from the base of the PR and between 059ca7c and e69173c.

📒 Files selected for processing (1)
  • pkg/infrastructure/aws/clusterapi/ignition.go

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
pkg/infrastructure/aws/clusterapi/ignition.go (1)

33-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep retrying until the DNS name yields at least one IP.

Line 40 treats any nil lookup error as success, so an empty ips result would flow through as “resolved” and leave ignition customization with no usable LB addresses. Please make len(ips) > 0 part of the success condition.

Suggested fix
 	}, func(ctx context.Context) (bool, error) {
 		var err error
 		ips, err = resolver.LookupIP(ctx, "ip", dnsName)
 		if err != nil {
 			logrus.Debugf("AWS: DNS lookup for %s DNS name %q failed, retrying: %v", lbType, dnsName, err)
 			return false, nil
 		}
+		if len(ips) == 0 {
+			logrus.Debugf("AWS: DNS lookup for %s DNS name %q returned no IPs yet, retrying", lbType, dnsName)
+			return false, nil
+		}
 		return true, nil
 	})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/infrastructure/aws/clusterapi/ignition.go` around lines 33 - 40, The
retry predicate for resolver.LookupIP currently treats a nil error as success
even if ips is empty; change the success condition in that anonymous function so
it returns true only when err == nil AND len(ips) > 0 (otherwise log/debug that
no IPs were returned and return false, nil to keep retrying); update references
to resolver.LookupIP, the local variable ips, and the anonymous func(ctx
context.Context) (bool, error) to implement this check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8fa10813-5002-4c2c-b699-67fb4ed574c2

📥 Commits

Reviewing files that changed from the base of the PR and between e69173c and 32968e8.

📒 Files selected for processing (1)
  • pkg/infrastructure/aws/clusterapi/ignition.go

@sadasu sadasu force-pushed the aws-custom-dns-lb-ip-revision branch from 32968e8 to 52821f4 Compare May 6, 2026 15:44
sadasu added 2 commits May 6, 2026 11:47
This change obtains the API and API-Int LoadBalancer IP addresses by
looking up the IP addresses of the DNSName of APIServerELB (internal LB)
and SecondaryAPIServerELB (external LB).

Testing has revealed that in EU Sovereign Cloud (EUSC), it can take
up to 15 mins for the DNS propagation of Load Balancer DNS names. So,
increasing the timeout to ~17 mins to account for that.
@sadasu sadasu force-pushed the aws-custom-dns-lb-ip-revision branch from 52821f4 to c40d89f Compare May 6, 2026 15:52
@sadasu
Contributor Author

sadasu commented May 6, 2026

/test e2e-aws-custom-dns-techpreview

Member

@tthvo tthvo left a comment

/approve

Looks good to me! Just waiting on e2e 👀

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tthvo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2026
@tthvo
Member

tthvo commented May 6, 2026

/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-4.22-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-verification-tests-main-installation-nightly-4.22-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/733464c0-4979-11f1-8184-169a80cb3fca-0

@tthvo
Member

tthvo commented May 6, 2026

/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/97ba8590-4979-11f1-8383-01ade8c75c4e-0

@tthvo
Member

tthvo commented May 6, 2026

/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5cf824c0-497a-11f1-8921-f3c4ae4b4c1e-0

@tthvo
Member

tthvo commented May 6, 2026

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.22-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

@tthvo: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.22-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f9d77430-4989-11f1-9e8f-440160f01252-0

@tthvo
Member

tthvo commented May 6, 2026

/retest-required

@tthvo
Member

tthvo commented May 6, 2026

Arghh, I totally forgot that arm64 periodic jobs cannot be run via /payload-job on a PR (only amd64 ones can). The aws-usgov job actually installed an amd64 cluster even though its name indicates arm64 (job bug!) 🤦

And the 5.0 jobs can't even pass the image build: DockerBuildFailed: Dockerfile build strategy has failed. Let's find other jobs... 👀

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@sadasu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn | c40d89f | link | true | /test e2e-aws-ovn |
| ci/prow/e2e-aws-ovn-shared-vpc-custom-security-groups | c40d89f | link | false | /test e2e-aws-ovn-shared-vpc-custom-security-groups |
| ci/prow/e2e-aws-byo-subnet-role-security-groups | c40d89f | link | false | /test e2e-aws-byo-subnet-role-security-groups |
| ci/prow/e2e-aws-ovn-shared-vpc-edge-zones | c40d89f | link | false | /test e2e-aws-ovn-shared-vpc-edge-zones |
| ci/prow/e2e-aws-ovn-edge-zones | c40d89f | link | false | /test e2e-aws-ovn-edge-zones |
| ci/prow/e2e-aws-ovn-imdsv2 | c40d89f | link | false | /test e2e-aws-ovn-imdsv2 |
| ci/prow/e2e-aws-ovn-heterogeneous | c40d89f | link | false | /test e2e-aws-ovn-heterogeneous |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sadasu
Contributor Author

sadasu commented May 7, 2026

Proposed openshift/release#79003 to make the missing cluster-api-actuator-pkg image available to the 5.0 and 5.1 CI jobs.

@sadasu
Contributor Author

sadasu commented May 7, 2026

/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@sadasu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-verification-tests-main-installation-nightly-5.0-aws-usgov-ipi-custom-dns-mini-perm-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/71c00040-4a55-11f1-9d8e-788f77b6d3dc-0

@sadasu
Contributor Author

sadasu commented May 7, 2026

/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@sadasu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-arm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7bd358c0-4a55-11f1-9bbc-4217447069b4-0

@tthvo
Member

tthvo commented May 7, 2026

I think the arm64 jobs (except the us-gov one, which actually uses amd64 because of the job bug) will likely fail. Let's try the amd64 ones 😁

/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-amd-f28-destructive
/payload-job periodic-ci-openshift-openshift-tests-private-release-5.0-amd64-nightly-aws-usgov-ipi-custom-dns-mini-perm-tp-f7

@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@tthvo: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-5.0-multi-nightly-aws-eusc-ipi-custom-dns-mini-perm-tp-amd-f28-destructive
  • periodic-ci-openshift-openshift-tests-private-release-5.0-amd64-nightly-aws-usgov-ipi-custom-dns-mini-perm-tp-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0459c170-4a56-11f1-94af-43f919255bde-0


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
