OCPQE-31585: Auto-detect HyperShift with Kubernetes management cluster #30931
xiuwang wants to merge 2 commits into openshift:main
Conversation
Pipeline controller notification: for optional jobs, comment … This repository is configured in: automatic mode
@xiuwang: This pull request references OCPQE-31585, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull request has been approved by: xiuwang
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Walkthrough: Added HyperShift-aware runtime detection in
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
✨ Finishing Touches
🧪 Generate unit tests (beta)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.11.3)
Error: can't load config: unsupported version of the configuration: ""
See https://golangci-lint.run/docs/product/migration-guide for migration instructions
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/util/compat_otp/client.go`:
- Around line 43-44: Replace the call to
compat_otp.NewCLIWithoutNamespace(basename) with a direct call to
exutil.NewCLIWithoutNamespace(basename) so the Ginkgo BeforeEach(SetupProject)
hook (registered by exutil.NewCLI(basename)) is not registered; this ensures
SetupProject() is not executed—locate the return in
NewCLIWithoutNamespace/compat_otp.NewCLIWithoutNamespace and change it to call
exutil.NewCLIWithoutNamespace(basename) instead of going through
compat_otp.NewCLI.
- Around line 150-160: Use a context timeout and explicit error classification
instead of always returning false on any cmd.CombinedOutput error: replace
exec.Command with exec.CommandContext using a context.WithTimeout (e.g., 5s) and
check for context.DeadlineExceeded to treat timeouts as real failures; detect
exec.ErrNotFound (missing oc binary) from cmd.Start/Run/Error and treat it as a
real failure; when the process exits with a non-zero exit code, inspect the
combined output/stderr for expected "API not found" or "no resources found"
messages and only return false for those known-not-found patterns, otherwise
propagate or log the error (do not silently return false). Reference symbols:
exec.CommandContext, context.WithTimeout, cmd.CombinedOutput, exec.ErrNotFound,
KUBECONFIG.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5b303466-2cf2-4b1e-a91b-6d81c4e53dd3
📒 Files selected for processing (1)
test/extended/util/compat_otp/client.go
```go
cmd := exec.Command("oc", "api-resources", "--api-group=config.openshift.io", "-o=name")

// Use KUBECONFIG from environment if set
if kubeconfig := os.Getenv("KUBECONFIG"); kubeconfig != "" {
	cmd.Env = append(os.Environ(), "KUBECONFIG="+kubeconfig)
}

output, err := cmd.CombinedOutput()
if err != nil {
	// If the command fails, assume no clusterversion API
	return false
```
🧩 Analysis chain
🏁 Script executed:
# Search for timeout-related code in the repository
echo "=== Searching for timeout mechanisms in the codebase ==="
rg -i "timeout|context\." test/ --type go | head -30
Repository: openshift/origin
Length of output: 3897
🏁 Script executed:
# Search for other uses of oc api-resources command
echo "=== Searching for other 'oc api-resources' usage ==="
rg "oc.*api-resources" . --type go
Repository: openshift/origin
Length of output: 1232
🏁 Script executed:
# Check if there are any error type checks in this function or similar functions
echo "=== Checking for error handling patterns in the function ==="
sed -n '140,170p' test/extended/util/compat_otp/client.go -A 20 | rg -E "err|Error|status"
Repository: openshift/origin
Length of output: 2357
🏁 Script executed:
# Also check if there are any error type checks or status code handling elsewhere
rg -A 10 "CombinedOutput" test/extended/util/compat_otp/client.go
Repository: openshift/origin
Length of output: 309
🏁 Script executed:
# Check the function name and beginning to understand its purpose
head -50 test/extended/util/compat_otp/client.go | cat -n
Repository: openshift/origin
Length of output: 2048
🏁 Script executed:
# Search for where HasClusterVersionAPI is called (if that's the function name)
rg "HasClusterVersionAPI|clusterversions" test/ --type go -B 2 -A 2
Repository: openshift/origin
Length of output: 4400
🏁 Script executed:
# Check for EnvIsKubernetesCluster constant definition
rg "EnvIsKubernetesCluster" test/extended/util/compat_otp/client.go
Repository: openshift/origin
Length of output: 109
🏁 Script executed:
# Check the complete isHyperShiftWithKubernetesManagement function
rg -B 5 "isHyperShiftWithKubernetesManagement" test/extended/util/compat_otp/client.go -A 20
Repository: openshift/origin
Length of output: 2403
🏁 Script executed:
# Check if there's any context timeout used elsewhere in this file
rg "context\.|timeout|Context" test/extended/util/compat_otp/client.go
Repository: openshift/origin
Length of output: 147
🏁 Script executed:
# Search for exec.Command usage with timeouts in the codebase
rg "exec\.Command.*context|context.*exec\.Command" test/ --type go -B 2 -A 5 | head -40
Repository: openshift/origin
Length of output: 42
🏁 Script executed:
# Let's also check if there are any exec.CommandContext usages in the codebase to see the pattern
rg "CommandContext|exec\.CommandContext" test/ --type go -B 2 -A 3 | head -50
Repository: openshift/origin
Length of output: 1989
🏁 Script executed:
# Check if there are similar CLI probe patterns that handle errors better
rg "oc.*api-resources|exec\.Command.*oc" test/ --type go -B 3 -A 8 | head -60
Repository: openshift/origin
Length of output: 5372
🏁 Script executed:
# Search for how other parts handle CLI failures to understand the convention
rg "CombinedOutput\|Output\(\)" test/extended/util/compat_otp/ -A 3 -B 1
Repository: openshift/origin
Length of output: 42
🏁 Script executed:
# Check if there's any logging or error inspection elsewhere in the codebase
rg "exec\.exit|os/exec|ExitError" test/ --type go | head -20
Repository: openshift/origin
Length of output: 1280
🏁 Script executed:
# Check a specific example of how ExitError is used in db_image_helpers.go
rg "ExitError" test/extended/util/db_image_helpers.go -B 5 -A 3 | head -40
Repository: openshift/origin
Length of output: 396
🏁 Script executed:
# Get the complete hasClusterVersionAPI function - let's be precise
head -165 test/extended/util/compat_otp/client.go | tail -20
Repository: openshift/origin
Length of output: 843
Distinguish probe failures from expected "API not found" results and add execution timeout.
The error handler at lines 157-160 conflates multiple failure modes (missing oc binary, invalid KUBECONFIG, authentication failures, network timeouts, hanging API server) into a single false return. This masks real environment issues and defeats the detection function's purpose.
Add a context timeout using exec.CommandContext and differentiate between:
- Expected failure: exit code indicating resource group/API doesn't exist
- Real failures: missing binary, bad config, auth issues, timeouts
This prevents silent masking of configuration problems when the probe is used to determine environment type at line 43.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/extended/util/compat_otp/client.go` around lines 150 - 160, Use a
context timeout and explicit error classification instead of always returning
false on any cmd.CombinedOutput error: replace exec.Command with
exec.CommandContext using a context.WithTimeout (e.g., 5s) and check for
context.DeadlineExceeded to treat timeouts as real failures; detect
exec.ErrNotFound (missing oc binary) from cmd.Start/Run/Error and treat it as a
real failure; when the process exits with a non-zero exit code, inspect the
combined output/stderr for expected "API not found" or "no resources found"
messages and only return false for those known-not-found patterns, otherwise
propagate or log the error (do not silently return false). Reference symbols:
exec.CommandContext, context.WithTimeout, cmd.CombinedOutput, exec.ErrNotFound,
KUBECONFIG.
Scheduling required tests:
```go
// In this case, we use NewCLIWithoutNamespace to avoid OpenShift-specific
// initialization (like clusterversion checks) that would fail on pure Kubernetes
if isHyperShiftWithKubernetesManagement() {
	return NewCLIWithoutNamespace(basename)
```
Question: Can you clarify why this change is needed?
The existing exutil.NewCLI() already handles pure Kubernetes clusters automatically:
```go
// From test/extended/util/client.go:355
func (c *CLI) SetupProject() string {
	exist, err := DoesApiResourceExist(c.AdminConfig(), "projects", "project.openshift.io")
	if exist {
		return c.setupProject() // Uses OpenShift ProjectRequest API
	}
	return c.setupNamespace() // Uses Kubernetes Namespace API
}
```

So calling exutil.NewCLI() should work fine on Kubernetes management clusters: it will automatically detect the lack of OpenShift APIs and create a Kubernetes namespace instead.
Additional point: Even if a temporary namespace is created by SetupProject(), tests can still run cluster-scoped oc commands using oc.AsAdmin() which uses admin config without namespace context.
Could you share:
- What specific error are you seeing with exutil.NewCLI() on Kubernetes management clusters?
- Is the issue during initialization, or later when tests execute?
🧩 Analysis chain
Let me check the codebase to better understand the flow and where the clusterversion error might originate.
🏁 Script executed:
#!/bin/bash
# Look at the WithoutNamespace implementation and NewCLI init
rg -n "WithoutNamespace\|clusterversion\|ClusterVersion" test/extended/util/client.go | head -60
# Check if clusterversion is accessed during CLI setup/BeforeEach outside SetupProject
rg -n "clusterversion" test/extended/util/ -r --include="*.go" | grep -v "_test.go" | head -40
# Find where clusterversion is accessed in compat_otp or related framework init
rg -n "clusterversion" test/extended/util/compat_otp/ -r --include="*.go" | head -20
# Check what WithoutNamespace does and how it differs from regular NewCLI
ast-grep --pattern 'func ($_ *CLI) WithoutNamespace() $_ {
$$$
}'
# Check if there's any BeforeEach or framework init that accesses clusterversion
rg -n "clusterversion" test/extended/ -r --include="*.go" -l | head -20
Length of output: 3995
🏁 Script executed:
#!/bin/bash
# Look at the WithoutNamespace implementation
grep -n "WithoutNamespace\|setupNamespace\|setupProject\|SetupProject" test/extended/util/client.go | head -40
# Look at what happens during BeforeEach in the CLI lifecycle - specifically any clusterversion access
grep -n "clusterversion\|ClusterVersion" test/extended/util/client.go | head -20
# Look at compat_otp/clusters.go to understand where clusterversion is accessed
cat test/extended/util/compat_otp/clusters.go | head -100
# Check if exutil.NewCLI BeforeEach/AfterEach registers any hooks that access OpenShift-specific APIs
grep -n "BeforeEach\|AfterEach\|JustBeforeEach" test/extended/util/client.go | head -30
# Check what WithoutNamespace actually skips
grep -n -A5 -B2 "func.*WithoutNamespace" test/extended/util/client.go | head -30
Length of output: 6403
@rioliu-rh, great question. Looking at the code more carefully, the reviewer's concern is largely valid. Here's what I found:
The SetupProject() path is already safe:
SetupProject() (client.go line 355) checks for projects API first:
```go
exist, err := DoesApiResourceExist(c.AdminConfig(), "projects", "project.openshift.io")
if exist {
	return c.setupProject() // OpenShift path — accesses ClusterVersions (lines 404, 423)
}
return c.setupNamespace() // Kubernetes path — does NOT access ClusterVersions
```

On a Kubernetes management cluster, there's no projects API, so it routes to setupNamespace(). The reviewer is correct that exutil.NewCLI() should not cause a clusterversion error during setup.
The real problem is likely in test bodies, not CLI initialization:
clusters.go functions like GetClusterVersion(), GetReleaseImage(), etc. call oc get clusterversion directly without any guards. If HyperShift tests call these helpers against the management cluster's kubeconfig, that's where the error surfaces — not during NewCLI() init.
The PR's solution doesn't fully address the root cause:
Returning NewCLIWithoutNamespace(basename) skips the g.BeforeEach(SetupProject) hook, but test bodies can still call GetClusterVersion() and hit the same error. So the fix treats the symptom (CLI initialization) but not the actual failure point.
Additional concern with hasClusterVersionAPI():
The new function calls exec.Command("oc", "api-resources", ...) synchronously at suite initialization time (not in a BeforeEach), which:
- Creates a subprocess side-effect before any test runs
- Depends on oc being in $PATH
- Is redundant with the existing DoesApiResourceExist() utility, which uses the already-configured admin REST client
A cleaner approach would be to guard the OpenShift-specific helpers in clusters.go (like GetClusterVersion) with a check like DoesApiResourceExist(...), or add a HYPERSHIFT=true + no-ClusterVersion check inside those helpers, rather than changing the CLI factory function.
Could you share the actual test or stack trace showing where the clusterversion error occurs? That would confirm whether the fix needs to be in NewCLIForKubeOpenShift or in the helper functions.
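The membership test behind such a guard can be kept pure and unit-testable. The sketch below factors the "is this resource listed?" check out of any live-cluster call; it mirrors the role of the existing `DoesApiResourceExist()` helper but operates on `oc api-resources -o=name` output instead of a REST discovery client, which is an assumption made for illustration, not that helper's real implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// apiResourceListed reports whether a resource name appears in the output of
// "oc api-resources --api-group=<group> -o=name". Keeping this a pure
// function over the captured output makes the guard testable offline.
func apiResourceListed(apiResourcesOutput, resource string) bool {
	for _, line := range strings.Split(apiResourcesOutput, "\n") {
		if strings.TrimSpace(line) == resource {
			return true
		}
	}
	return false
}

func main() {
	// Sample output as it would appear on an OpenShift cluster.
	openshift := "clusterversions.config.openshift.io\nclusteroperators.config.openshift.io\n"
	fmt.Println(apiResourceListed(openshift, "clusterversions.config.openshift.io")) // true
	fmt.Println(apiResourceListed(openshift, "projects.project.openshift.io"))       // false
}
```

A helper like `GetClusterVersion` could then consult this check (or `DoesApiResourceExist`) and return early on plain Kubernetes rather than letting `oc get clusterversion` fail.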
@rioliu-rh Thanks for the review.
You're correct that SetupProject() can detect and handle Kubernetes clusters. However, the issue here is specific to the HyperShift test architecture, not just about Kubernetes vs OpenShift API differences.
The Problem:
- exutil.NewCLI() registers a BeforeEach hook that automatically calls SetupProject() for every test
- In HyperShift scenarios on Kubernetes management clusters (like AKS), this causes the test framework to create a namespace on the management cluster before each test
- However, HyperShift tests actually need:
- Management cluster: Only cluster-scoped/admin access (no namespace needed)
- Guest cluster: Namespace creation happens separately via oc.AsGuestKubeconf() with its own kubeconfig
Please correct me if I missed any thing, thanks.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Thanks for the clarification @xiuwang! I have a follow-up question to better understand the requirement:
You mentioned that HyperShift tests need "only cluster-scoped/admin access (no namespace needed)" on the management cluster. However, even if SetupProject() creates a temporary namespace, your tests can still run cluster-scoped commands without any issues:
```go
// These work fine even with a temp namespace created:
oc.AsAdmin().Run("get").Args("nodes").Execute()
oc.AsAdmin().Run("get").Args("pods", "--all-namespaces").Execute()
oc.AsGuestKubeconf().Run("create").Args("namespace", "test").Execute()
```

The temp namespace on the management cluster doesn't interfere with:
- ✅ Cluster-scoped operations via oc.AsAdmin()
- ✅ Guest cluster operations via oc.AsGuestKubeconf()
- ✅ Any management cluster resources
Could you clarify:
- Does SetupProject() actually fail on Kubernetes management clusters? If so, what's the specific error message?
- Or does it succeed in creating a namespace, but you're trying to avoid it for other reasons (performance, cleanliness, etc.)?
- If it's just about not needing a namespace, can you explain why using oc.AsAdmin() in your tests isn't sufficient?
Understanding the actual problem will help us determine if this approach is the right solution, or if there's a simpler fix.
No, SetupProject() itself doesn't fail on Kubernetes management clusters - it correctly detects the lack of
projects.project.openshift.io API and falls back to creating a regular Kubernetes namespace.
The actual problem: Initialization before SetupProject()
The issue occurs earlier in the initialization chain, specifically in exutil.NewCLI() which is called before any
test runs:
```go
// From test/extended/util/client.go
func NewCLI(namespace string, ...opts) *CLI {
	// ...
	ginkgo.BeforeEach(cli.SetupProject) // ← Registered in BeforeEach
	// ...
}
```
When exutil.NewCLI() is instantiated, it registers the BeforeEach hook. However, before this hook runs, the CLI
initialization performs several checks including:
- Attempting to detect cluster version
- Checking for OpenShift-specific APIs
- Initializing admin config with OpenShift assumptions
On a pure Kubernetes cluster (like AKS management cluster), these checks produce error messages in logs like:
{ fail [github.com/openshift/openshift-tests-private/test/extended/ota/osus/osus.go:207]: Unexpected error:
<*fmt.wrapError | 0xc001764700>:
Error running oc --kubeconfig=/tmp/kubeconfig-1206426696 get clusterversion -o jsonpath={..desired.version}:
StdOut>
error: the server doesn't have a resource type "clusterversion"
StdErr>
error: the server doesn't have a resource type "clusterversion"
exit status 1
Thanks for the detailed explanation @xiuwang! However, I found the actual root cause by examining the test code.
The Real Problem: Test Code Issue, Not CLI Initialization
Looking at the error location you mentioned (openshift-tests-private/test/extended/ota/osus/osus.go:207), the issue is actually caused by module-level variable initialization in the test code:
```go
// osus.go line 204-207
var _ = g.Describe("[sig-updates] OTA osus instance should", func() {
	defer g.GinkgoRecover()
	oc := compat_otp.NewCLI("osusinstace", compat_otp.KubeConfigPath())
	var major, minor, err = getMajorMinorVersion(oc) // ← Line 206
	o.Expect(err).NotTo(o.HaveOccurred())            // ← Line 207
```

And getMajorMinorVersion() calls:
```go
// utils.go:719
func getMajorMinorVersion(oc *exutil.CLI) (string, string, error) {
	clusterVersion, _, err := compat_otp.GetClusterVersion(oc) // ← This runs "oc get clusterversion"
	// ...
}
```

The problem: this code executes when the test package is loaded and the Ginkgo spec tree is constructed, NOT during BeforeEach or test execution. So GetClusterVersion() runs immediately when the test package loads, before any test hooks execute.
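The ordering problem described above can be demonstrated with a minimal standalone Go program (not taken from the PR): package-level variable initializers run before `main`, just as code inside a `g.Describe` closure runs before any `BeforeEach` hook in a Ginkgo test binary.

```go
package main

import "fmt"

var order []string

// Package-level initializers run when the package loads, before main and,
// in a test binary, before any test hooks get a chance to execute.
var _ = record("package-level var init")

func record(step string) struct{} {
	order = append(order, step)
	return struct{}{}
}

func main() {
	record("main")
	fmt.Println(order[0]) // package-level var init
	fmt.Println(order[1]) // main
}
```

This is why moving the `getMajorMinorVersion` call into a `BeforeEach` or the test body, rather than a package-level `var`, changes when (and whether) it runs.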
Why This PR Won't Fix The Problem
Your PR changes NewCLIForKubeOpenShift() to skip namespace creation, but:
- The error is from GetClusterVersion() calling oc get clusterversion
- This will fail on Kubernetes clusters regardless of whether a namespace was created
- The namespace creation is irrelevant to the clusterversion API check
The Correct Solution
Fix the test code in openshift-tests-private repo to check the environment before calling OpenShift-specific APIs:
```go
var _ = g.Describe("[sig-updates] OTA osus instance should", func() {
	defer g.GinkgoRecover()
	oc := compat_otp.NewCLI("osusinstace", compat_otp.KubeConfigPath())

	// Skip clusterversion checks for pure Kubernetes clusters
	var major, minor string
	if os.Getenv("HYPERSHIFT") != "true" {
		var err error
		major, minor, err = getMajorMinorVersion(oc)
		o.Expect(err).NotTo(o.HaveOccurred())
	}
	// ... rest of tests
})
```

Or better yet, check if the API exists:
```go
var major, minor string
exists, _ := exutil.DoesApiResourceExist(oc.AdminConfig(), "clusterversions", "config.openshift.io")
if exists {
	var err error
	major, minor, err = getMajorMinorVersion(oc)
	o.Expect(err).NotTo(o.HaveOccurred())
}
```

This approach:
- ✅ Actually fixes the root cause (test calling OpenShift APIs on Kubernetes clusters)
- ✅ No changes needed to the origin repo
- ✅ Clean separation of concerns
- ✅ Works for all HyperShift scenarios
Summary: The issue is not about CLI initialization or namespace creation; it's about test code calling OpenShift-specific APIs at package load time without checking whether they exist first. The fix should be in the test code, not in the CLI factory functions.
1. Use exutil.NewCLIWithoutNamespace directly to avoid BeforeEach hook
- Changed from compat_otp.NewCLIWithoutNamespace to exutil.NewCLIWithoutNamespace
- Ensures SetupProject() is not executed in HyperShift Kubernetes scenario
2. Improve hasClusterVersionAPI error handling
- Added context timeout (5s) to prevent hanging
- Use exec.CommandContext for better timeout control
- Explicit error classification:
* exec.ErrNotFound: oc binary missing
* context.DeadlineExceeded: command timeout
* Known patterns: 'API not found', 'no resources found'
- Only return false for expected errors, not all errors
@xiuwang: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
HyperShift tests fail when running on pure Kubernetes management clusters (AKS) with errors like:
the server doesn't have a resource type "clusterversion"
This occurs because:
- The compat_otp.NewCLIForKubeOpenShift() function uses exutil.NewCLI(), which tries to access OpenShift-specific APIs such as clusterversion.config.openshift.io

Solution

Add automatic detection of HyperShift environments with Kubernetes management clusters in test/extended/util/compat_otp/client.go.

Key Changes:
- isHyperShiftWithKubernetesManagement() - Detects if the HYPERSHIFT=true environment variable is set and the cluster lacks the clusterversion API (indicating Kubernetes, not OpenShift)
- hasClusterVersionAPI() - Checks if the cluster has OpenShift's clusterversion API by querying api-resources
- NewCLIForKubeOpenShift() enhancement - Now automatically returns NewCLIWithoutNamespace() for HyperShift on Kubernetes management clusters