More efficient large-scale single-node test distribution #3653

@vkarak

Description

Currently, to run a set of $N$ single-node tests on $M$ nodes using the `--distribute` option, ReFrame generates $N \times M$ test jobs. For large-scale runs (many tests, many nodes), this is inefficient for a number of reasons:

  1. ReFrame has to instantiate and submit a very large number of tests.
  2. ReFrame has to generate many stage directories at once.
  3. Since the jobs are independent, overall throughput will be low, because each job has to wait its turn in the scheduler.

Ideally, such a scenario would be fulfilled by submitting a single job per node to be tested and then having ReFrame run the whole set of tests inside that job allocation.
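To make the difference in job counts concrete, here is a minimal Python sketch that models both schemes. It does not use any ReFrame API: the function names, the test names, the node names, and the values $N = 50$ and $M = 200$ are all hypothetical and chosen only for illustration.

```python
from itertools import product


def jobs_per_test_and_node(tests, nodes):
    """Scheme described above as current: one job per (test, node) pair."""
    return [(node, [test]) for test, node in product(tests, nodes)]


def jobs_per_node(tests, nodes):
    """Scheme described above as ideal: one job per node running all tests."""
    return [(node, list(tests)) for node in nodes]


# Hypothetical sizes: N = 50 single-node tests, M = 200 nodes.
tests = [f"test{i}" for i in range(50)]
nodes = [f"nid{i:04d}" for i in range(200)]

current = jobs_per_test_and_node(tests, nodes)
proposed = jobs_per_node(tests, nodes)

print(len(current))   # N * M = 10000 scheduler jobs
print(len(proposed))  # M = 200 scheduler jobs

# Both schemes execute the same number of test runs overall.
runs_current = sum(len(job_tests) for _, job_tests in current)
runs_proposed = sum(len(job_tests) for _, job_tests in proposed)
print(runs_current == runs_proposed)
```

Under these assumptions the number of test runs stays at $N \times M$, but the number of jobs the scheduler has to queue drops from $N \times M$ to $M$.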
