Fix server test behavior to stop on first failure #925

Open
rgsl888prabhu wants to merge 1 commit into NVIDIA:main from rgsl888prabhu:fix_server_runnning_tests

Conversation

@rgsl888prabhu
Collaborator

Description

If a server test fails because the server is unhealthy or unreachable, stop the test run at that first failure instead of running the remaining tests.

Issue

closes #894

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

@rgsl888prabhu requested a review from a team as a code owner March 4, 2026 06:45
@rgsl888prabhu requested a review from Iroy30 March 4, 2026 06:45
@rgsl888prabhu self-assigned this Mar 4, 2026
@rgsl888prabhu added the non-breaking (Introduces a non-breaking change) and improvement (Improves an existing functionality) labels Mar 4, 2026
@coderabbitai
coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

The test utilities module was modified to add server health monitoring and early test termination. A global flag tracks server status, and a new helper function exits tests if the server becomes unreachable after initially being operational. HTTP client methods now catch connection errors and route them through this early-exit mechanism.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Server Health Monitoring: python/cuopt_server/cuopt_server/tests/utils/utils.py | Added _server_was_up flag and _exit_if_server_gone() helper function. Enhanced spinup_wait() to perform initial healthcheck with 30-second timeout and set flag on success. Updated poll_for_completion(), post(), get(), and delete() methods to catch ConnectionError and ConnectTimeout exceptions and route them through early-exit logic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: fixing server test behavior to stop on first failure due to bad server status. |
| Description check | ✅ Passed | The description is directly related to the changeset, referencing issue #894 and explaining the purpose of stopping tests on first server failure. |
| Linked Issues check | ✅ Passed | The PR implements the requirement to detect server unavailability and exit early by adding health checks and connection error handling [#894]. |
| Out of Scope Changes check | ✅ Passed | All changes focus on server health monitoring and early exit mechanisms, directly addressing the linked issue requirement without introducing unrelated modifications. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuopt_server/cuopt_server/tests/utils/utils.py (1)

320-333: ⚠️ Potential issue | 🟡 Minor

Healthcheck loop is off-by-one against the “30s” contract.

At Line 321, count is incremented and Line 322 breaks on count == 30 before performing that attempt, so you effectively allow 29 tries. This can prematurely exit under slow startup.

Suggested fix
 def spinup_wait():
     global _server_was_up
     client = RequestClient()
-    count = 0
     result = None
-    while True:
-        count += 1
-        if count == 30:
-            break
+    for _ in range(30):
         try:
             result = client.get("/cuopt/health")
-            break
+            if result is not None and result.status_code == 200:
+                _server_was_up = True
+                return
         except Exception:
             time.sleep(1)
-    if result is None or result.status_code != 200:
+    if result is None or result.status_code != 200:
         pytest.exit(
             "cuOpt server failed to pass healthcheck after 30s. "
             "Skipping all server tests. Check server logs for startup errors (e.g. cudf/GPU).",
             returncode=1,
         )
-    _server_was_up = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt_server/cuopt_server/tests/utils/utils.py` around lines 320-333,
the healthcheck loop allows only 29 attempts because `count` is incremented and
compared against 30 before the `client.get("/cuopt/health")` attempt; fix by
making the loop perform 30 attempts: either move the `count += 1` to after the
`try` (so the 30th attempt runs before the loop exits), or change the break
condition to `if count > 30`, or replace the manual counter with a
`for _ in range(30)` loop, ensuring `result = client.get("/cuopt/health")` is
executed up to 30 times before exiting and using `result` to decide the
pytest.exit.
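
The off-by-one described in this comment can be checked by counting attempts against a probe that never succeeds. The sketch below is illustrative only: `attempts_original`, `attempts_fixed`, and `always_down` are hypothetical names, the sleep between retries is omitted, and the loops mirror the control flow quoted in the suggested fix rather than the actual `spinup_wait()` body.

```python
def attempts_original(probe, limit=30):
    """Mirror the original loop: increment, check the limit, then try."""
    count = 0
    attempts = 0
    while True:
        count += 1
        if count == limit:
            break  # exits before the attempt numbered `limit` is made
        attempts += 1
        try:
            probe()
            break
        except Exception:
            pass  # the real loop sleeps for a second here
    return attempts


def attempts_fixed(probe, limit=30):
    """Mirror the suggested fix: one attempt per iteration of range(limit)."""
    attempts = 0
    for _ in range(limit):
        attempts += 1
        try:
            probe()
            break
        except Exception:
            pass  # the real loop sleeps for a second here
    return attempts


def always_down():
    """Hypothetical probe standing in for a healthcheck that never succeeds."""
    raise ConnectionError("server not reachable")


print(attempts_original(always_down))  # 29
print(attempts_fixed(always_down))     # 30
```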
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuopt_server/cuopt_server/tests/utils/utils.py`:
- Around line 372-377: The requests calls in utils.py (e.g., the requests.get to
self.url + f"/cuopt/solution/{reqId}" that currently calls
_exit_if_server_gone(e) on exceptions) must include a timeout argument and
broaden exception handling to catch requests.Timeout instead of only
requests.ConnectTimeout; update every requests.get/requests.post/requests.delete
call noted in the diff to pass a sensible timeout (e.g., timeout=5 or a
configurable value) and change the except clauses from
(requests.ConnectionError, requests.ConnectTimeout) to
(requests.ConnectionError, requests.Timeout) so read timeouts are handled and
routed to _exit_if_server_gone or the existing error path.

ℹ️ Review info
Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e1ed7665-30b5-4697-ac0e-c142968d3c79

📥 Commits

Reviewing files that changed from the base of the PR and between a92fd23 and 21bb1e0.

📒 Files selected for processing (1)
  • python/cuopt_server/cuopt_server/tests/utils/utils.py

Comment on lines +372 to +377
try:
    res = requests.get(
        self.url + f"/cuopt/solution/{reqId}", headers=headers
    )
except (requests.ConnectionError, requests.ConnectTimeout) as e:
    _exit_if_server_gone(e)
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -nP 'requests\.(get|post|delete)\(' python/cuopt_server/cuopt_server/tests/utils/utils.py -A8 -B1

Repository: NVIDIA/cuopt

Length of output: 1909


🏁 Script executed:

cat -n python/cuopt_server/cuopt_server/tests/utils/utils.py | sed -n '360,450p'

Repository: NVIDIA/cuopt

Length of output: 3329


Add timeout parameter to all requests calls and broaden exception handling.

At lines 373–375, 386–388, 406–412, 431–433, and 439–444, requests.get(), requests.post(), and requests.delete() are called without a timeout argument. requests applies no timeout by default, so a call to a server that accepts the connection but never responds can block indefinitely and no read timeout is ever raised. The exception handlers also catch only ConnectTimeout rather than its parent class Timeout, which additionally covers read timeouts; this weakens fail-fast behavior when the server becomes unresponsive.

Add timeout to all requests calls and catch requests.Timeout instead of just requests.ConnectTimeout.

Suggested approach
 class RequestClient:
     def __init__(self, port=5555):
         self.ip = "127.0.0.1"
         self.port = port
         self.url = f"http://{self.ip}:{self.port}"
+        self.http_timeout = 10

     def poll_for_completion(self, reqId, delete=True):
@@
-                res = requests.get(
-                    self.url + f"/cuopt/solution/{reqId}", headers=headers
-                )
-            except (requests.ConnectionError, requests.ConnectTimeout) as e:
+                res = requests.get(
+                    self.url + f"/cuopt/solution/{reqId}",
+                    headers=headers,
+                    timeout=self.http_timeout,
+                )
+            except (requests.ConnectionError, requests.Timeout) as e:
                 _exit_if_server_gone(e)
@@
                 requests.delete(
-                    self.url + f"/cuopt/solution/{reqId}", headers=headers
+                    self.url + f"/cuopt/solution/{reqId}",
+                    headers=headers,
+                    timeout=self.http_timeout,
                 )
-            except (requests.ConnectionError, requests.ConnectTimeout) as e:
+            except (requests.ConnectionError, requests.Timeout) as e:
                 _exit_if_server_gone(e)
@@
             res = requests.post(
                 self.url + endpoint,
                 params=params,
                 headers=headers,
                 json=json,
                 data=data,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             return requests.get(
-                self.url + endpoint, headers=headers, json=json
+                self.url + endpoint, headers=headers, json=json, timeout=self.http_timeout
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             return requests.delete(
                 self.url + endpoint,
                 params=params,
                 headers=headers,
                 json=json,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
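
The suggested approach repeats the same `timeout=` argument and `except` tuple at every call site. One alternative, sketched here as an assumption rather than anything the PR or the review proposes, is to route each call through a single wrapper. `guarded_call` and its parameters are hypothetical names; `send` stands for `requests.get`, `requests.post`, or `requests.delete`, and the two fake senders exist only so the sketch runs without a server or the requests package.

```python
def guarded_call(send, url, *, timeout, fatal, on_fatal, **kwargs):
    """Call send(url, timeout=..., **kwargs); hand `fatal` errors to on_fatal."""
    try:
        return send(url, timeout=timeout, **kwargs)
    except fatal as exc:
        # on_fatal is expected to raise (abort the run or re-raise).
        # If it returns instead, its return value is passed back.
        return on_fatal(exc)


# Hypothetical test doubles standing in for requests.get.
def fake_get_ok(url, timeout, **kwargs):
    return {"url": url, "timeout": timeout}


def fake_get_hangs(url, timeout, **kwargs):
    raise TimeoutError(f"no response from {url} within {timeout}s")


seen = []  # collects the exceptions routed to on_fatal
ok = guarded_call(
    fake_get_ok,
    "http://127.0.0.1:5555/cuopt/health",
    timeout=10,
    fatal=(ConnectionError, TimeoutError),
    on_fatal=seen.append,
)
guarded_call(
    fake_get_hangs,
    "http://127.0.0.1:5555/cuopt/health",
    timeout=10,
    fatal=(ConnectionError, TimeoutError),
    on_fatal=seen.append,
)

# With requests, the equivalent call might look like:
#   guarded_call(requests.get, self.url + endpoint,
#                timeout=self.http_timeout,
#                fatal=(requests.ConnectionError, requests.Timeout),
#                on_fatal=_exit_if_server_gone, headers=headers)
```

The trade-off is one more layer of indirection in exchange for a single place to change the timeout or the set of exceptions treated as fatal.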

@anandhkb added this to the 26.04 milestone Mar 4, 2026
Labels

improvement (Improves an existing functionality)
non-breaking (Introduces a non-breaking change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] In CI for cuOpt service, skip the rest of tests if the server goes down

2 participants