Skip to content

Added retry for TD policy propagation for security_test#237

Merged
pawbhard merged 3 commits intogrpc:mainfrom
pawbhard:security_test_fix
Mar 31, 2026
Merged

Added retry for TD policy propagation for security_test#237
pawbhard merged 3 commits intogrpc:mainfrom
pawbhard:security_test_fix

Conversation

@pawbhard
Copy link
Copy Markdown
Contributor

Security test sometime fails usually in first or second step due to not receiving policy from TD.

This PR add a retry mechanism to account for this.
Internal bug b/280071258

@pawbhard pawbhard requested a review from sergiitk March 24, 2026 13:03
@pawbhard pawbhard requested a review from a team as a code owner March 24, 2026 13:03
@pawbhard pawbhard requested a review from arjan-bal March 24, 2026 13:03
@pawbhard pawbhard changed the title Added retry for initial file for security_test Added retry for TD policy propagation for security_test Mar 24, 2026
@pawbhard
Copy link
Copy Markdown
Contributor Author

@arjan-bal
Copy link
Copy Markdown
Contributor

Is it possible to change the order in which GCP resources are created to ensure the connection uses TLS from the start? Initializing a server in plaintext and then upgrading to TLS poses a security risk for production workloads.

@pawbhard
Copy link
Copy Markdown
Contributor Author

Is it possible to change the order in which GCP resources are created to ensure the connection uses TLS from the start? Initializing a server in plaintext and then upgrading to TLS poses a security risk for production workloads.

Yes its the correct way and need infra changes. Its already added in description and analysis in b/331206277

Yes it posses risk for production workload, but not to us as we are just using this to test. We will take it as a tech debt item (to be created). These is no point on keeping a failing/flaky test open because of this. As this is not a test failure

@arjan-bal
Copy link
Copy Markdown
Contributor

Yes, it poses a risk for production workloads, but not to us, as we are just using this to test.

Ideally, E2E tests should strictly mirror the customer's user journey. When they deviate, we risk missing real-world regressions. For example, if this test created resources with the desired state instead of patching them—and we still observed these failures—it would highlight a legitimate security issue. By introducing this change, the test loses its ability to catch such vulnerabilities, creating a false sense of security.

We will track this as a tech debt item (ticket to be created). There is no point in keeping a failing/flaky test open because of this.

Do we have an estimate for the actual fix? In the meantime, is buggrep properly catching these failures, or is the on-call engineer spending time manually triaging them? I'd be much more comfortable approving a temporary workaround if it is accompanied by a concrete timeline for the real solution. Without that commitment, I believe we should prioritize the proper fix over masking the symptom.

@pawbhard
Copy link
Copy Markdown
Contributor Author

Yes, it poses a risk for production workloads, but not to us, as we are just using this to test.

Ideally, E2E tests should strictly mirror the customer's user journey. When they deviate, we risk missing real-world regressions. For example, if this test created resources with the desired state instead of patching them—and we still observed these failures—it would highlight a legitimate security issue. By introducing this change, the test loses its ability to catch such vulnerabilities, creating a false sense of security.

We will track this as a tech debt item (ticket to be created). There is no point in keeping a failing/flaky test open because of this.

Do we have an estimate for the actual fix? In the meantime, is buggrep properly catching these failures, or is the on-call engineer spending time manually triaging them? I'd be much more comfortable approving a temporary workaround if it is accompanied by a concrete timeline for the real solution. Without that commitment, I believe we should prioritize the proper fix over masking the symptom.

Agree, We want to mirror customer Journey. Disagree on the point that putting this fix we are losing ability to catch failure as we are currently patching security config. (With or without fix we will not be able to catch what is being described)

On-Caller will pick it up based on priority. Cannot commit on timeline. keeping the issue in test, will not help us in prioritising as I can see issue is open more than 3 years back. I suggest fix what we are currently have, and plan for improvement based on discussion (mirror user journey and not do security patching)

Due to higher number of issue happening, we see conflict in matchers increasing toil o on-caller. (This bug matcher is part of conflicts).

Let us discuss offline if more discussion is needed on this, and converge.

Copy link
Copy Markdown
Member

@sergiitk sergiitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as long as the tests pass

Comment thread framework/xds_k8s_testcase.py Outdated
@pawbhard
Copy link
Copy Markdown
Contributor Author

@pawbhard pawbhard merged commit 5ff4026 into grpc:main Mar 31, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants