Description
Describe the bug
When solving multiple MIP problems in sequence, the Branch-and-Bound (B&B) solver uses far fewer CPU threads on the 2nd and subsequent solves than on the 1st. This is because the thread count is determined by calling omp_get_max_threads() - 1 at the start of every B&B run, not once at startup.
During the first solve, the Papilo presolver (or a CUDA library loaded at that time) calls omp_set_num_threads() on cuOPT's bundled libgomp, reducing the global OMP thread count (e.g., from 128 to 8). All subsequent solves then read back this contaminated value and use only 8 - 1 = 7 threads instead of the original 128 - 1 = 127.
Root cause location:
cpp/src/mip_heuristics/solver.cu:220-221
if (context.settings.num_cpu_threads < 0) {
  // omp_get_max_threads() is called per-solve, not once at startup.
  // If libgomp's thread count was reduced by Papilo on solve 1,
  // solves 2+ silently get a fraction of the intended parallelism.
  branch_and_bound_settings.num_threads = std::max(1, omp_get_max_threads() - 1);
} else {
  branch_and_bound_settings.num_threads = std::max(1, context.settings.num_cpu_threads);
}

The default value of num_cpu_threads is -1 (cpp/src/math_optimization/solver_settings.cu:87), so every user who does not explicitly pin this setting is affected.
Steps/Code to reproduce bug
#!/usr/bin/env python3
"""
thread_bug_mre.py — Minimal reproducible example for cuOPT OMP thread-count bug.
BUG SUMMARY
-----------
cuOPT's B&B solver determines its thread count by calling omp_get_max_threads()-1
at the start of each solve. During the *first* solve, Papilo (the presolve
library bundled with cuOPT) calls omp_set_num_threads(8) as a side effect of its
own initialization. This permanently lowers omp_get_max_threads() for the entire
process, so every subsequent solve uses 7 threads instead of 127.
OBSERVED OUTPUT (128-logical-core machine)
----------------------------------------------------------
[iter 0] omp_get_max_threads() = 128 → Using 127 CPU threads for B&B
[iter 1] omp_get_max_threads() = 8 → Using 7 CPU threads for B&B ← bug
[iter 2] omp_get_max_threads() = 8 → Using 7 CPU threads for B&B ← bug
The 18x thread-count drop (127 → 7) produces a proportional slowdown in the
branch-and-bound tree search, which is the dominant cost in MIP solving.
ROOT CAUSE
----------
Papilo calls omp_set_num_threads() during its first-pass initialization, likely
to limit its own parallelism. Because OpenMP maintains a single global thread-
count state per process (across all shared libgomp instances), this contaminates
cuOPT's subsequent omp_get_max_threads() calls.
WORKAROUND
----------
Capture omp_get_max_threads() ONCE at import time (before any solve), then
explicitly pin it each iteration via:
settings.set_parameter("num_cpu_threads", str(detected_count))
This bypasses the re-evaluation of omp_get_max_threads() inside cuOPT and locks
the thread count for the lifetime of the run.
USAGE
-----
# Show the bug:
python thread_bug_mre.py
# Show the fix:
python thread_bug_mre.py --fix
# Adjust iterations / time limit:
python thread_bug_mre.py --iters 3 --time-limit 60
"""
import argparse
import ctypes
import sys
import time

# ---------------------------------------------------------------------------
# Capture OMP thread count at import time — before any solve contaminates it.
# This is the key line of the fix; everything else is just plumbing.
# ---------------------------------------------------------------------------
try:
    _gomp = ctypes.CDLL("libgomp.so.1")
    _THREADS_AT_IMPORT = _gomp.omp_get_max_threads()
    _PINNED_THREADS = max(1, _THREADS_AT_IMPORT - 1)
except Exception:
    _gomp = None
    _THREADS_AT_IMPORT = None
    _PINNED_THREADS = None


def _omp_get_max_threads() -> int | None:
    """Return current omp_get_max_threads(), or None if libgomp unavailable."""
    return _gomp.omp_get_max_threads() if _gomp else None
def build_knapsack(n_items: int = 2000, seed: int = 42):
    """
    Build a random multi-dimensional binary knapsack MIP.

    Maximize    sum_i v_i * x_i
    subject to  sum_i w_ij * x_i <= cap_j   for j in 0..n_constraints-1
                x_i in {0, 1}

    Sized to require genuine B&B (not solvable at LP root), while remaining
    small enough to demonstrate the thread-count bug within a few minutes.
    """
    from cuopt.linear_programming.problem import INTEGER, MAXIMIZE, Problem, LinearExpression
    import random

    rng = random.Random(seed)
    n_constraints = 10
    # Capacity set to ~40% of total weight so the problem is genuinely hard.
    weights = [[rng.randint(1, 20) for _ in range(n_items)] for _ in range(n_constraints)]
    values = [rng.randint(1, 100) for _ in range(n_items)]
    caps = [int(0.40 * sum(weights[j])) for j in range(n_constraints)]

    problem = Problem("knapsack_mre")
    problem.ObjSense = MAXIMIZE
    vars_ = [
        problem.addVariable(lb=0, ub=1, obj=float(values[i]), vtype=INTEGER, name=f"x_{i}")
        for i in range(n_items)
    ]
    for j in range(n_constraints):
        expr = LinearExpression(vars_, [float(weights[j][i]) for i in range(n_items)], 0.0)
        problem.addConstraint(expr <= caps[j], name=f"cap_{j}")
    return problem, n_items, n_constraints
def main():
    parser = argparse.ArgumentParser(
        description="MRE for cuOPT OMP thread-count bug",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    parser.add_argument("--fix", action="store_true",
                        help="Apply the workaround (pin num_cpu_threads each iteration)")
    parser.add_argument("--iters", type=int, default=3,
                        help="Number of solve iterations (default: 3)")
    parser.add_argument("--time-limit", type=int, default=60,
                        help="Solver time limit per iteration in seconds (default: 60)")
    parser.add_argument("--n-items", type=int, default=2000,
                        help="Number of binary variables in the knapsack (default: 2000)")
    args = parser.parse_args()

    from cuopt.linear_programming.solver_settings import SolverSettings

    mode = "FIX APPLIED" if args.fix else "BUG MODE (no fix)"
    sep = "=" * 72
    print(sep)
    print(f"cuOPT OMP thread-count MRE — {mode}")
    print(sep)
    print(f"  omp_get_max_threads() at import time : {_THREADS_AT_IMPORT}")
    print(f"  pinned thread count (max-1)          : {_PINNED_THREADS}")
    print(f"  iterations                           : {args.iters}")
    print(f"  time limit / iter                    : {args.time_limit}s")
    print(f"  knapsack items                       : {args.n_items}")
    print(sep)

    for i in range(args.iters):
        omp_before = _omp_get_max_threads()
        print(f"\n--- Iteration {i} (omp_get_max_threads()={omp_before} before solve) ---")
        sys.stdout.flush()

        t_build = time.time()
        problem, n_vars, n_cons = build_knapsack(n_items=args.n_items, seed=42)
        print(f"  Built: {n_vars} vars, {n_cons} constraints ({time.time()-t_build:.2f}s)")

        settings = SolverSettings()
        settings.set_parameter("time_limit", args.time_limit)
        if args.fix:
            if _PINNED_THREADS is not None:
                settings.set_parameter("num_cpu_threads", str(_PINNED_THREADS))
                print(f"  [fix] Pinning num_cpu_threads = {_PINNED_THREADS}")
            else:
                print("  [fix] libgomp unavailable — cannot pin threads")

        t_solve = time.time()
        problem.solve(settings)
        elapsed = time.time() - t_solve
        status = problem.Status.name
        omp_after = _omp_get_max_threads()
        print(f"  Status: {status} | solve time: {elapsed:.1f}s "
              f"| omp_get_max_threads() after: {omp_after}")

        if not args.fix and i == 0 and omp_after != omp_before:
            print(f"\n  *** BUG OBSERVED: omp_get_max_threads() dropped "
                  f"{omp_before} → {omp_after} during solve. "
                  f"Next iteration will use only {omp_after - 1} B&B threads. ***")

    print(f"\n{sep}")
    print("SUMMARY")
    print(sep)
    if args.fix:
        print(f"  With fix: num_cpu_threads pinned to {_PINNED_THREADS} every iteration.")
        print("  All iterations used the same thread count regardless of Papilo side effects.")
    else:
        print("  Without fix: iteration 0 used the full thread count.")
        print("  Iterations 1+ used ~7 threads after Papilo contaminated omp_get_max_threads().")
    print()
    print("  Re-run with --fix to verify the workaround.")
    print(sep)


if __name__ == "__main__":
    main()

Expected behavior
All MIP solves in a process should use the same number of CPU threads (i.e., whatever omp_get_max_threads() returned before the first solve), regardless of what third-party libraries loaded during earlier solves may have called on libgomp internally.
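A library-side fix along these lines would snapshot omp_get_max_threads() exactly once, on first use, and consult the cached value on every subsequent solve. The real fix belongs in cuOPT's C++ (e.g., a static initializer in the B&B setup); the sketch below only illustrates the pattern in Python, with the function names being illustrative rather than actual cuOPT API:

```python
import ctypes
import functools
import os

@functools.lru_cache(maxsize=None)
def initial_max_threads() -> int:
    """Snapshot the OMP thread budget on first call; the cached value is
    immune to later omp_set_num_threads() calls by other libraries."""
    try:
        gomp = ctypes.CDLL("libgomp.so.1")
        return int(gomp.omp_get_max_threads())
    except OSError:
        return os.cpu_count() or 1  # fallback when libgomp is absent

def bnb_thread_count(num_cpu_threads: int = -1) -> int:
    # Mirrors the logic in solver.cu, but reads the cached snapshot:
    # an explicit user setting wins; otherwise use snapshot - 1.
    if num_cpu_threads < 0:
        return max(1, initial_max_threads() - 1)
    return max(1, num_cpu_threads)
```

With this pattern, Papilo can still shrink libgomp's live thread count during solve 1, but `bnb_thread_count()` keeps returning the same value for the lifetime of the process.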
Environment details
- Environment location: Docker
- Method of cuOpt install: Docker image nvidia/cuopt:25.12.0a-cuda12.9-py3.13
- GPU: Applies to any (bug is CPU-side, in the OpenMP thread pool)
- Affected component: MIP / MILP solver (Branch-and-Bound); LP-only solves (PDLP, DualSimplex) are not affected