
[BUG] MIP solver thread count degrades on 2nd+ solve due to omp_get_max_threads() contamination #900

@paulhendricks

Description

Describe the bug

When solving multiple MIP problems in sequence, the Branch-and-Bound (B&B) solver uses far fewer CPU threads on the 2nd and subsequent solves than on the 1st. This is because the thread count is determined by calling omp_get_max_threads() - 1 at the start of every B&B run, not once at startup.

During the first solve, the Papilo presolver (or a CUDA library loaded at that time) calls omp_set_num_threads() on cuOPT's bundled libgomp, reducing the global OMP thread count (e.g., from 128 to 8). All subsequent solves then read back this contaminated value and use only 8 - 1 = 7 threads instead of the original 128 - 1 = 127.

Root cause location:
cpp/src/mip_heuristics/solver.cu:220-221

if (context.settings.num_cpu_threads < 0) {
    // omp_get_max_threads() is called per-solve, not once at startup.
    // If libgomp's thread count was reduced by Papilo on solve 1,
    // solves 2+ silently get a fraction of the intended parallelism.
    branch_and_bound_settings.num_threads = std::max(1, omp_get_max_threads() - 1);
} else {
    branch_and_bound_settings.num_threads = std::max(1, context.settings.num_cpu_threads);
}

The default value of num_cpu_threads is -1 (cpp/src/math_optimization/solver_settings.cu:87), so every user who does not explicitly pin this setting is affected.
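The contamination mechanism can be reproduced in isolation, without cuOPT at all. The sketch below (a standalone illustration, assuming a Linux system where `libgomp.so.1` is loadable; it is skipped gracefully otherwise) shows that `omp_set_num_threads()` changes what `omp_get_max_threads()` returns process-wide, which is exactly what the per-solve read in `solver.cu` is exposed to:

```python
import ctypes
import ctypes.util


def _load_gomp():
    """Try to load the GNU OpenMP runtime; return None if unavailable."""
    for name in ("libgomp.so.1", ctypes.util.find_library("gomp")):
        if not name:
            continue
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None


gomp = _load_gomp()
if gomp is not None:
    before = gomp.omp_get_max_threads()
    # Simulate what Papilo does during the first solve:
    gomp.omp_set_num_threads(2)
    after = gomp.omp_get_max_threads()  # now 2, for the whole process
    gomp.omp_set_num_threads(before)    # restore so nothing else is affected
    print(f"before={before} after={after}")
```

Any later caller that re-reads `omp_get_max_threads()` sees the lowered value, which is why the per-solve evaluation in the B&B setup picks it up.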


Steps/Code to reproduce bug

#!/usr/bin/env python3
"""
thread_bug_mre.py — Minimal reproducible example for cuOPT OMP thread-count bug.

BUG SUMMARY
-----------
cuOPT's B&B solver determines its thread count by calling omp_get_max_threads()-1
at the start of each solve.  During the *first* solve, Papilo (the presolve
library bundled with cuOPT) calls omp_set_num_threads(8) as a side effect of its
own initialization.  This permanently lowers omp_get_max_threads() for the entire
process, so every subsequent solve uses 7 threads instead of 127.

OBSERVED OUTPUT (128-logical-core machine)
----------------------------------------------------------
  [iter 0]  omp_get_max_threads() = 128  →  Using 127 CPU threads for B&B
  [iter 1]  omp_get_max_threads() =   8  →  Using   7 CPU threads for B&B  ← bug
  [iter 2]  omp_get_max_threads() =   8  →  Using   7 CPU threads for B&B  ← bug

The 18x thread-count drop (127 → 7) produces a proportional slowdown in the
branch-and-bound tree search, which is the dominant cost in MIP solving.

ROOT CAUSE
----------
Papilo calls omp_set_num_threads() during its first-pass initialization, likely
to limit its own parallelism.  Because OpenMP keeps a single global thread-
count setting per process (shared by every caller of the same libgomp
instance), this contaminates cuOPT's subsequent omp_get_max_threads() calls.

WORKAROUND
----------
Capture omp_get_max_threads() ONCE at import time (before any solve), then
explicitly pin it each iteration via:
    settings.set_parameter("num_cpu_threads", str(detected_count))

This bypasses the re-evaluation of omp_get_max_threads() inside cuOPT and locks
the thread count for the lifetime of the run.

USAGE
-----
  # Show the bug:
  python thread_bug_mre.py

  # Show the fix:
  python thread_bug_mre.py --fix

  # Adjust iterations / time limit:
  python thread_bug_mre.py --iters 3 --time-limit 60
"""

import argparse
import ctypes
import sys
import time

# ---------------------------------------------------------------------------
# Capture OMP thread count at import time — before any solve contaminates it.
# This is the key line of the fix; everything else is just plumbing.
# ---------------------------------------------------------------------------
try:
    _gomp = ctypes.CDLL("libgomp.so.1")
    _THREADS_AT_IMPORT = _gomp.omp_get_max_threads()
    _PINNED_THREADS    = max(1, _THREADS_AT_IMPORT - 1)
except Exception:
    _gomp = None
    _THREADS_AT_IMPORT = None
    _PINNED_THREADS    = None


def _omp_get_max_threads() -> int | None:
    """Return current omp_get_max_threads(), or None if libgomp unavailable."""
    return _gomp.omp_get_max_threads() if _gomp else None


def build_knapsack(n_items: int = 2000, seed: int = 42):
    """
    Build a random multi-dimensional binary knapsack MIP.

    Maximize  sum_i v_i * x_i
    subject to
        sum_i w_ij * x_i  <=  cap_j   for j in 0..n_constraints-1
        x_i in {0, 1}

    Sized to require genuine B&B (not solvable at LP root), while remaining
    small enough to demonstrate the thread-count bug within a few minutes.
    """
    from cuopt.linear_programming.problem import INTEGER, MAXIMIZE, Problem, LinearExpression

    import random
    rng = random.Random(seed)

    n_constraints = 10
    # Capacity set to ~40% of total weight so the problem is genuinely hard.
    weights = [[rng.randint(1, 20) for _ in range(n_items)] for _ in range(n_constraints)]
    values  = [rng.randint(1, 100) for _ in range(n_items)]
    caps    = [int(0.40 * sum(weights[j])) for j in range(n_constraints)]

    problem = Problem("knapsack_mre")
    problem.ObjSense = MAXIMIZE

    vars_ = [
        problem.addVariable(lb=0, ub=1, obj=float(values[i]), vtype=INTEGER, name=f"x_{i}")
        for i in range(n_items)
    ]

    for j in range(n_constraints):
        expr = LinearExpression(vars_, [float(weights[j][i]) for i in range(n_items)], 0.0)
        problem.addConstraint(expr <= caps[j], name=f"cap_{j}")

    return problem, n_items, n_constraints


def main():
    parser = argparse.ArgumentParser(
        description="MRE for cuOPT OMP thread-count bug",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )
    parser.add_argument("--fix",        action="store_true",
                        help="Apply the workaround (pin num_cpu_threads each iteration)")
    parser.add_argument("--iters",      type=int, default=3,
                        help="Number of solve iterations (default: 3)")
    parser.add_argument("--time-limit", type=int, default=60,
                        help="Solver time limit per iteration in seconds (default: 60)")
    parser.add_argument("--n-items",    type=int, default=2000,
                        help="Number of binary variables in the knapsack (default: 2000)")
    args = parser.parse_args()

    from cuopt.linear_programming.solver_settings import SolverSettings

    mode = "FIX APPLIED" if args.fix else "BUG MODE (no fix)"
    sep  = "=" * 72
    print(sep)
    print(f"cuOPT OMP thread-count MRE  —  {mode}")
    print(sep)
    print(f"  omp_get_max_threads() at import time : {_THREADS_AT_IMPORT}")
    print(f"  pinned thread count (max-1)          : {_PINNED_THREADS}")
    print(f"  iterations                           : {args.iters}")
    print(f"  time limit / iter                    : {args.time_limit}s")
    print(f"  knapsack items                       : {args.n_items}")
    print(sep)

    for i in range(args.iters):
        omp_before = _omp_get_max_threads()
        print(f"\n--- Iteration {i}  (omp_get_max_threads()={omp_before} before solve) ---")
        sys.stdout.flush()

        t_build = time.time()
        problem, n_vars, n_cons = build_knapsack(n_items=args.n_items, seed=42)
        print(f"  Built: {n_vars} vars, {n_cons} constraints  ({time.time()-t_build:.2f}s)")

        settings = SolverSettings()
        settings.set_parameter("time_limit", args.time_limit)

        if args.fix:
            if _PINNED_THREADS is not None:
                settings.set_parameter("num_cpu_threads", str(_PINNED_THREADS))
                print(f"  [fix] Pinning num_cpu_threads = {_PINNED_THREADS}")
            else:
                print("  [fix] libgomp unavailable — cannot pin threads")

        t_solve = time.time()
        problem.solve(settings)
        elapsed = time.time() - t_solve

        status = problem.Status.name
        omp_after = _omp_get_max_threads()
        print(f"  Status: {status}  |  solve time: {elapsed:.1f}s  "
              f"|  omp_get_max_threads() after: {omp_after}")

        if not args.fix and i == 0 and omp_after != omp_before:
            print(f"\n  *** BUG OBSERVED: omp_get_max_threads() dropped "
                  f"{omp_before} → {omp_after} during solve. "
                  f"Next iteration will use only {omp_after - 1} B&B threads. ***")

    print(f"\n{sep}")
    print("SUMMARY")
    print(sep)
    if args.fix:
        print(f"  With fix: num_cpu_threads pinned to {_PINNED_THREADS} every iteration.")
        print("  All iterations used the same thread count regardless of Papilo side effects.")
    else:
        print("  Without fix: iteration 0 used the full thread count.")
        print("  Iterations 1+ used ~7 threads after Papilo contaminated omp_get_max_threads().")
        print()
        print("  Re-run with --fix to verify the workaround.")
    print(sep)


if __name__ == "__main__":
    main()

Expected behavior

All MIP solves in a process should use the same number of CPU threads (i.e., whatever omp_get_max_threads() returned before the first solve), regardless of any omp_set_num_threads() calls made on libgomp by third-party libraries during earlier solves.
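In other words, the fix amounts to snapshotting the thread budget once per process instead of re-reading it on every solve. A minimal Python sketch of that capture-once pattern (the function name is hypothetical, and `os.cpu_count()` stands in for `omp_get_max_threads()` so the sketch is self-contained):

```python
import functools
import os


@functools.lru_cache(maxsize=1)
def bnb_thread_budget() -> int:
    """Snapshot the CPU-thread budget on first call.

    Every later call returns the cached value, so nothing that runs
    afterwards (e.g. a presolver calling omp_set_num_threads) can
    change what the B&B solver sees.
    """
    # os.cpu_count() stands in for omp_get_max_threads() here.
    return max(1, (os.cpu_count() or 1) - 1)
```

The first call (ideally made during library initialization, before any solve) fixes the value; all subsequent solves read the cached result. The C++ equivalent would be a function-local `static const int` initialized from `omp_get_max_threads()` at startup.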


Environment details

  • Environment location: Docker
  • Method of cuOpt install: Docker nvidia/cuopt:25.12.0a-cuda12.9-py3.13
  • GPU: Applies to any (bug is CPU-side, in the OpenMP thread pool)
  • Affected component: MIP / MILP solver (Branch-and-Bound); LP-only solves (PDLP, DualSimplex) are not affected

Metadata

Labels

  • awaiting response — This expects a response from a maintainer or contributor, depending on who requested in the last comment.
  • bug — Something isn't working.
