Fix server test behavior to stop on first failure #925
rgsl888prabhu wants to merge 1 commit into NVIDIA:main
Conversation
📝 Walkthrough
The test utilities module was modified to add server health monitoring and early test termination. A global flag tracks server status, and a new helper function exits tests if the server becomes unreachable after initially being operational. HTTP client methods now catch connection errors and route them through this early-exit mechanism.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cuopt_server/cuopt_server/tests/utils/utils.py (1)
Lines 320-333: ⚠️ Potential issue | 🟡 Minor
Healthcheck loop is off-by-one against the "30s" contract.
At line 321, `count` is incremented, and line 322 breaks on `count == 30` before performing that attempt, so you effectively allow only 29 tries. This can prematurely exit under slow startup.
Suggested fix
```diff
 def spinup_wait():
     global _server_was_up
     client = RequestClient()
-    count = 0
     result = None
-    while True:
-        count += 1
-        if count == 30:
-            break
+    for _ in range(30):
         try:
             result = client.get("/cuopt/health")
-            break
+            if result is not None and result.status_code == 200:
+                _server_was_up = True
+                return
         except Exception:
             time.sleep(1)
-    if result is None or result.status_code != 200:
+    if result is None or result.status_code != 200:
         pytest.exit(
             "cuOpt server failed to pass healthcheck after 30s. "
             "Skipping all server tests. Check server logs for startup errors (e.g. cudf/GPU).",
             returncode=1,
         )
-    _server_was_up = True
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt_server/cuopt_server/tests/utils/utils.py` around lines 320 - 333, The healthcheck loop allows only 29 attempts because `count` is incremented before the `client.get("/cuopt/health")` attempt; fix by making the loop perform 30 attempts: either move the `count += 1` to after the `try` (so the first attempt counts), or change the break condition to `if count >= 30` (or replace the manual counter with a `for _ in range(30)` loop), ensuring `result = client.get("/cuopt/health")` is executed up to 30 times before exiting and using `result` to decide the pytest.exit.
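The off-by-one is easy to see by counting attempts in isolation. The sketch below is a hypothetical reduction of the two loop shapes discussed above: the body increment stands in for the `client.get("/cuopt/health")` call, and the counts show why the `while`/counter version performs one fewer attempt than intended.

```python
# Hypothetical reduction of the two loop shapes from the review comment.

def attempts_with_counter(limit=30):
    # Mirrors the original spinup_wait loop: increment, then break-check,
    # then attempt. The break fires before the 30th attempt ever runs.
    count = 0
    attempts = 0
    while True:
        count += 1
        if count == limit:
            break
        attempts += 1  # stand-in for client.get("/cuopt/health")
    return attempts

def attempts_with_range(limit=30):
    # The suggested fix: a for-loop guarantees exactly `limit` attempts.
    attempts = 0
    for _ in range(limit):
        attempts += 1  # stand-in for client.get("/cuopt/health")
    return attempts

print(attempts_with_counter())  # 29
print(attempts_with_range())    # 30
```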
ℹ️ Review info
Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e1ed7665-30b5-4697-ac0e-c142968d3c79
📒 Files selected for processing (1)
python/cuopt_server/cuopt_server/tests/utils/utils.py
```python
try:
    res = requests.get(
        self.url + f"/cuopt/solution/{reqId}", headers=headers
    )
except (requests.ConnectionError, requests.ConnectTimeout) as e:
    _exit_if_server_gone(e)
```
🧩 Analysis chain
🏁 Script executed:

```shell
rg -nP 'requests\.(get|post|delete)\(' python/cuopt_server/cuopt_server/tests/utils/utils.py -A8 -B1
```

Repository: NVIDIA/cuopt — Length of output: 1909

🏁 Script executed:

```shell
cat -n python/cuopt_server/cuopt_server/tests/utils/utils.py | sed -n '360,450p'
```

Repository: NVIDIA/cuopt — Length of output: 3329
Add timeout parameter to all requests calls and broaden exception handling.
At lines 373–375, 386–388, 406–412, 431–433, and 439–444, requests.get(), requests.post(), and requests.delete() lack a timeout parameter, so a hung read blocks indefinitely instead of failing promptly. Additionally, the exception handlers only catch ConnectTimeout, not the broader Timeout exception (which also covers read timeouts), weakening fail-fast behavior when the server becomes unresponsive.
Add timeout to all requests calls and catch requests.Timeout instead of just requests.ConnectTimeout.
Suggested approach
```diff
 class RequestClient:
     def __init__(self, port=5555):
         self.ip = "127.0.0.1"
         self.port = port
         self.url = f"http://{self.ip}:{self.port}"
+        self.http_timeout = 10

     def poll_for_completion(self, reqId, delete=True):
@@
-            res = requests.get(
-                self.url + f"/cuopt/solution/{reqId}", headers=headers
-            )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+            res = requests.get(
+                self.url + f"/cuopt/solution/{reqId}",
+                headers=headers,
+                timeout=self.http_timeout,
+            )
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             requests.delete(
-                self.url + f"/cuopt/solution/{reqId}", headers=headers
+                self.url + f"/cuopt/solution/{reqId}",
+                headers=headers,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             res = requests.post(
                 self.url + endpoint,
                 params=params,
                 headers=headers,
                 json=json,
                 data=data,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             return requests.get(
-                self.url + endpoint, headers=headers, json=json
+                self.url + endpoint, headers=headers, json=json, timeout=self.http_timeout
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
@@
             return requests.delete(
                 self.url + endpoint,
                 params=params,
                 headers=headers,
                 json=json,
+                timeout=self.http_timeout,
             )
-        except (requests.ConnectionError, requests.ConnectTimeout) as e:
+        except (requests.ConnectionError, requests.Timeout) as e:
             _exit_if_server_gone(e)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/cuopt_server/cuopt_server/tests/utils/utils.py` around lines 372 -
377, The requests calls in utils.py (e.g., the requests.get to self.url +
f"/cuopt/solution/{reqId}" that currently calls _exit_if_server_gone(e) on
exceptions) must include a timeout argument and broaden exception handling to
catch requests.Timeout instead of only requests.ConnectTimeout; update every
requests.get/requests.post/requests.delete call noted in the diff to pass a
sensible timeout (e.g., timeout=5 or a configurable value) and change the except
clauses from (requests.ConnectionError, requests.ConnectTimeout) to
(requests.ConnectionError, requests.Timeout) so read timeouts are handled and
routed to _exit_if_server_gone or the existing error path.
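Why `requests.Timeout` is the right class to catch can be checked directly against the requests exception hierarchy: `ConnectTimeout` derives from both `ConnectionError` and `Timeout`, while `ReadTimeout` derives from `Timeout` only, so the original `except (ConnectionError, ConnectTimeout)` clause lets read timeouts escape. A quick check (assuming the `requests` package is installed):

```python
# The requests exception hierarchy that motivates the broadened clause.
import requests

# ConnectTimeout is covered by either clause (it is both a
# ConnectionError and a Timeout):
assert issubclass(requests.ConnectTimeout, requests.ConnectionError)
assert issubclass(requests.ConnectTimeout, requests.Timeout)

# ReadTimeout is a Timeout but NOT a ConnectionError, so only the
# broadened `except (ConnectionError, Timeout)` clause catches it:
assert issubclass(requests.ReadTimeout, requests.Timeout)
assert not issubclass(requests.ReadTimeout, requests.ConnectionError)

print("ReadTimeout escapes the original clause; Timeout catches it")
```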
Description
If a server test fails because the server itself is in a bad state, stop the test run at the first such failure instead of letting every remaining test fail the same way.
Issue
closes #894
Checklist