Fix server test behavior to stop on first failure #925
rgsl888prabhu wants to merge 1 commit into NVIDIA:main
Conversation
📝 Walkthrough
The test utilities module was modified to add server health monitoring and early test termination. A global flag tracks server status, and a new helper function exits tests if the server becomes unreachable after initially being operational. HTTP client methods now catch connection errors and route them through this early-exit mechanism.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cuopt_server/cuopt_server/tests/utils/utils.py (1)
Lines 320-333: ⚠️ Potential issue | 🟡 Minor
Healthcheck loop is off-by-one against the "30s" contract.
At line 321, `count` is incremented, and line 322 breaks on `count == 30` before performing that attempt, so you effectively allow only 29 tries. This can prematurely exit under slow startup.
Suggested fix
```diff
 def spinup_wait():
     global _server_was_up
     client = RequestClient()
-    count = 0
     result = None
-    while True:
-        count += 1
-        if count == 30:
-            break
+    for _ in range(30):
         try:
             result = client.get("/cuopt/health")
-            break
+            if result is not None and result.status_code == 200:
+                _server_was_up = True
+                return
         except Exception:
             time.sleep(1)
-    if result is None or result.status_code != 200:
+    if result is None or result.status_code != 200:
         pytest.exit(
             "cuOpt server failed to pass healthcheck after 30s. "
             "Skipping all server tests. Check server logs for startup errors (e.g. cudf/GPU).",
             returncode=1,
         )
-    _server_was_up = True
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt_server/cuopt_server/tests/utils/utils.py` around lines 320 - 333, The healthcheck loop allows only 29 attempts because `count` is incremented before the `client.get("/cuopt/health")` attempt; fix by making the loop perform 30 attempts: either move the `count += 1` to after the `try` (so the first attempt counts), or change the break condition to `if count >= 30` (or replace the manual counter with a `for _ in range(30)` loop), ensuring `result = client.get("/cuopt/health")` is executed up to 30 times before exiting and using `result` to decide the pytest.exit.
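The off-by-one is easy to see by counting attempts in isolation. The sketch below is a hypothetical reduction of the two loop shapes discussed above: the body increment stands in for the `client.get("/cuopt/health")` call, and the counts show why the `while`/counter version performs one fewer attempt than intended.

```python
# Hypothetical reduction of the two loop shapes from the review comment.

def attempts_with_counter(limit=30):
    # Mirrors the original spinup_wait loop: increment, then break-check,
    # then attempt. The break fires before the 30th attempt ever runs.
    count = 0
    attempts = 0
    while True:
        count += 1
        if count == limit:
            break
        attempts += 1  # stand-in for client.get("/cuopt/health")
    return attempts

def attempts_with_range(limit=30):
    # The suggested fix: a for-loop guarantees exactly `limit` attempts.
    attempts = 0
    for _ in range(limit):
        attempts += 1  # stand-in for client.get("/cuopt/health")
    return attempts

print(attempts_with_counter())  # 29
print(attempts_with_range())    # 30
```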
ℹ️ Review info
Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e1ed7665-30b5-4697-ac0e-c142968d3c79
📒 Files selected for processing (1)
python/cuopt_server/cuopt_server/tests/utils/utils.py
```python
try:
    res = requests.get(
        self.url + f"/cuopt/solution/{reqId}", headers=headers
    )
except (requests.ConnectionError, requests.ConnectTimeout) as e:
    _exit_if_server_gone(e)
```
🧩 Analysis chain
🏁 Script executed:

```shell
rg -nP 'requests\.(get|post|delete)\(' python/cuopt_server/cuopt_server/tests/utils/utils.py -A8 -B1
```

Repository: NVIDIA/cuopt — Length of output: 1909

🏁 Script executed:

```shell
cat -n python/cuopt_server/cuopt_server/tests/utils/utils.py | sed -n '360,450p'
```

Repository: NVIDIA/cuopt — Length of output: 3329
Add timeout parameter to all requests calls and broaden exception handling.
At lines 373–375, 386–388, 406–412, 431–433, and 439–444, requests.get(), requests.post(), and requests.delete() lack a timeout parameter, so a hung read blocks indefinitely instead of failing promptly. Additionally, the exception handlers only catch ConnectTimeout, not the broader Timeout exception (which also covers read timeouts), weakening fail-fast behavior when the server becomes unresponsive.
Add timeout to all requests calls and catch requests.Timeout instead of just requests.ConnectTimeout.
Suggested approach
```diff
 class RequestClient:
     def __init__(self, port=5555):
         self.ip = "127.0.0.1"
         self.port = port
         self.url = f"http://{self.ip}:{self.port}"
+        self.http_timeout = 10

     def poll_for_completion(self, reqId, delete=True):
@@
-            res = requests.get(
-                self.url + f"/cuopt/solution/{reqId}", headers=headers
-            )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+            res = requests.get(
+                self.url + f"/cuopt/solution/{reqId}",
+                headers=headers,
+                timeout=self.http_timeout,
+            )
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             requests.delete(
-                self.url + f"/cuopt/solution/{reqId}", headers=headers
+                self.url + f"/cuopt/solution/{reqId}",
+                headers=headers,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             res = requests.post(
                 self.url + endpoint,
                 params=params,
                 headers=headers,
                 json=json,
                 data=data,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             return requests.get(
-                self.url + endpoint, headers=headers, json=json
+                self.url + endpoint, headers=headers, json=json, timeout=self.http_timeout
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             return requests.delete(
                 self.url + endpoint,
                 params=params,
                 headers=headers,
                 json=json,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/cuopt_server/cuopt_server/tests/utils/utils.py` around lines 372 -
377, The requests calls in utils.py (e.g., the requests.get to self.url +
f"/cuopt/solution/{reqId}" that currently calls _exit_if_server_gone(e) on
exceptions) must include a timeout argument and broaden exception handling to
catch requests.Timeout instead of only requests.ConnectTimeout; update every
requests.get/requests.post/requests.delete call noted in the diff to pass a
sensible timeout (e.g., timeout=5 or a configurable value) and change the except
clauses from (requests.ConnectionError, requests.ConnectTimeout) to
(requests.ConnectionError, requests.Timeout) so read timeouts are handled and
routed to _exit_if_server_gone or the existing error path.
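Why `requests.Timeout` is the right class to catch can be checked directly against the requests exception hierarchy: `ConnectTimeout` derives from both `ConnectionError` and `Timeout`, while `ReadTimeout` derives from `Timeout` only, so the original `except (ConnectionError, ConnectTimeout)` clause lets read timeouts escape. A quick check (assuming the `requests` package is installed):

```python
# The requests exception hierarchy that motivates the broadened clause.
import requests

# ConnectTimeout is covered by either clause (it is both a
# ConnectionError and a Timeout):
assert issubclass(requests.ConnectTimeout, requests.ConnectionError)
assert issubclass(requests.ConnectTimeout, requests.Timeout)

# ReadTimeout is a Timeout but NOT a ConnectionError, so only the
# broadened `except (ConnectionError, Timeout)` clause catches it:
assert issubclass(requests.ReadTimeout, requests.Timeout)
assert not issubclass(requests.ReadTimeout, requests.ConnectionError)

print("ReadTimeout escapes the original clause; Timeout catches it")
```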
Description
If a server test fails because the server itself is in a bad state, stop the test run at the first such failure instead of letting every remaining test fail the same way.
Issue
closes #894
Checklist