Porting a New Runtime to task-bench (Taskflow)

This guide is for agents implementing a new task-parallel runtime in the task-bench framework. It covers the core API, verification contract, build conventions, and common pitfalls.


A. Framework Architecture

task-bench is a parameterized benchmark suite where a shared core/ library defines task DAGs and kernels, and each runtime provides its own execution driver. The core library handles:

  • CLI parsing: -steps, -width, -type, -kernel, -iter, -scratch, -output, -imbalance, -field, -and, -nodes — all handled by App(argc, argv).
  • Graph topology: TaskGraph::dependencies() and TaskGraph::reverse_dependencies()
  • Task execution: TaskGraph::execute_point() runs the kernel AND validates inputs.
  • Timing/reporting: App::report_timing(elapsed) prints the expected output format.

Your runtime only needs to:

  1. Parse its own thread/process flags (e.g., -core N)
  2. Allocate data buffers
  3. Build and execute the task DAG with correct dependencies
  4. Time the execution and call report_timing()

B. Core API Reference

struct App {
  std::vector<TaskGraph> graphs;  // one per graph (multiple via -and flag)
  long nodes;                     // MPI node count (use 1 for shared-memory)
  App(int argc, char **argv);     // parses all standard CLI args
  void display() const;           // print graph config (call before timing)
  void report_timing(double elapsed_seconds) const;
};

struct TaskGraph : public task_graph_t {
  long offset_at_timestep(long timestep) const;
  long width_at_timestep(long timestep) const;
  long dependence_set_at_timestep(long timestep) const;

  // Returns intervals of dependency points (INCLUSIVE [a, b] pairs)
  std::vector<std::pair<long, long>> dependencies(long dset, long point) const;

  // Buffer-based overload (avoids heap allocation — preferred for hot loops)
  size_t dependencies(long dset, long point, std::pair<long, long> *deps) const;
  size_t num_dependencies(long dset, long point) const;

  void execute_point(long timestep, long point,
                     char *output_ptr, size_t output_bytes,
                     const char **input_ptr, const size_t *input_bytes,
                     size_t n_inputs,
                     char *scratch_ptr, size_t scratch_bytes) const;

  static void prepare_scratch(char *scratch_ptr, size_t scratch_bytes);
};
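
As a minimal sketch of how these calls fit together during DAG construction (the app and gi names are illustrative, not part of the framework):

const TaskGraph &graph = app.graphs[gi];
for (long t = 0; t < graph.timesteps; ++t) {
  long offset = graph.offset_at_timestep(t);
  long width  = graph.width_at_timestep(t);
  long dset   = graph.dependence_set_at_timestep(t);
  for (long point = offset; point < offset + width; ++point) {
    // Inclusive [a, b] intervals of points in timestep t-1 feeding this task
    for (auto interval : graph.dependencies(dset, point)) {
      for (long dep = interval.first; dep <= interval.second; ++dep) {
        // record edge (t-1, dep) -> (t, point), after filtering (Section C)
      }
    }
  }
}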

Key Fields on task_graph_t

Field                     Meaning
timesteps                 Number of timesteps (rows) in the DAG
max_width                 Maximum number of points (columns) per timestep
nb_fields                 Number of buffer slots for circular reuse (t % nb_fields)
output_bytes_per_task     Size of each task's output buffer in bytes
scratch_bytes_per_task    Size of scratch buffer (0 if none needed)
dependence                Enum: TRIVIAL, NO_COMM, STENCIL_1D, ALL_TO_ALL, etc.
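
One straightforward allocation derived from these fields (a sketch; the flat slot index matches the Quick Reference in Section I):

// One flat buffer per graph: nb_fields * max_width slots, each output_bytes_per_task bytes
std::vector<char> buffer((size_t)graph.nb_fields * graph.max_width *
                         graph.output_bytes_per_task);

// Output slot for the task at (timestep t, point x)
char *slot = buffer.data() +
    ((t % graph.nb_fields) * graph.max_width + x) * graph.output_bytes_per_task;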

C. The execute_point Verification Contract

  1. Output: Fills output_ptr with (timestep, point) pairs for downstream validation.

  2. Input validation: Checks each input_ptr[idx] contains (timestep-1, dep_point). Wrong data → assertion failure.

  3. Timestep 0 / no-dependency tasks: Must still pass at least one input. Convention: pass the output buffer as a self-referencing input:

    if (timestep == 0 || input_points.empty()) {
      const char *self_input = output_ptr;
      size_t self_size = output_bytes;
      graph.execute_point(timestep, point, output_ptr, output_bytes,
                          &self_input, &self_size, 1,
                          scratch_ptr, scratch_bytes);
    }
  4. Scratch buffers: Initialize with TaskGraph::prepare_scratch() once per buffer (not per task). It writes a magic value that execute_point validates.

  5. Dependency filtering: dependencies() may return intervals extending beyond the previous timestep's range. Filter: only include d where last_offset <= d < last_offset + last_width.
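
Putting items 2-5 together, the per-task work looks roughly like the sketch below. slot_ptr(field, point) is a hypothetical helper returning the buffer slot for that (field, point) pair in whatever layout your runtime chose, and scratch_ptr comes from your scratch strategy (Section H, pitfall 9). In a real port the input list is typically pre-computed during DAG construction (pitfall 11) rather than inside the task body.

std::vector<const char *> input_ptrs;
std::vector<size_t> input_sizes;
char *out = slot_ptr(timestep % graph.nb_fields, point);  // slot_ptr: hypothetical buffer lookup

if (timestep > 0) {
  long last_offset = graph.offset_at_timestep(timestep - 1);
  long last_width  = graph.width_at_timestep(timestep - 1);
  long dset        = graph.dependence_set_at_timestep(timestep);
  for (auto interval : graph.dependencies(dset, point)) {
    for (long dep = interval.first; dep <= interval.second; ++dep) {
      if (dep < last_offset || dep >= last_offset + last_width) continue;  // item 5
      input_ptrs.push_back(slot_ptr((timestep - 1) % graph.nb_fields, dep));
      input_sizes.push_back(graph.output_bytes_per_task);
    }
  }
}
if (input_ptrs.empty()) {  // item 3: timestep 0 or no surviving dependencies
  input_ptrs.push_back(out);
  input_sizes.push_back(graph.output_bytes_per_task);
}
graph.execute_point(timestep, point, out, graph.output_bytes_per_task,
                    input_ptrs.data(), input_sizes.data(), input_ptrs.size(),
                    scratch_ptr, graph.scratch_bytes_per_task);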


D. Buffer Layout and nb_fields

Each task at timestep t, point x writes to slot [t % nb_fields][x].

  • nb_fields >= timesteps: no reuse conflicts; RAW edges suffice.
  • nb_fields < timesteps: multiple timesteps alias the same slot → WAW and WAR hazards. You need a slot tracker per (field, point) with last_writer and readers_since_write.

See ../docs/taskflow-patterns.md Section 6f for the full WAR/WAW tracking pattern.
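
A minimal sketch of one way to shape that tracker (illustrative names; the referenced document has the complete pattern):

// One entry per buffer slot, indexed by (t % nb_fields) * max_width + point
struct SlotState {
  tf::Task last_writer;           // task that last wrote the slot
  bool has_writer = false;
  std::vector<tf::Task> readers;  // tasks that have read the slot since that write
};
std::vector<SlotState> slots(graph.nb_fields * graph.max_width);

// Writer w of a slot: w succeeds every task in readers (WAR) and last_writer (WAW),
// then the entry is reset to { w, true, {} }.
// Reader r of a slot: r succeeds last_writer (RAW) and is appended to readers.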


E. Build System Conventions

File        Purpose
main.cc     Primary implementation
main_expl   Alternative binary (symlink to main if not applicable)
Makefile    Must produce the main and main_expl targets

A minimal Makefile:
CXX ?= g++
CXXFLAGS = -std=c++17 -fPIC -pthread -O3
INC = -I../core -I<runtime-headers>
LIB = -L../core -lcore_s
include ../core/make_blas.mk
main.o: main.cc
	$(CXX) -c $(CXXFLAGS) $(INC) $<
main: main.o
	$(CXX) $^ $(LIB) $(LDFLAGS) -o $@
main_expl: main
	ln -sf main main_expl

Build core first: make -C core -j$(nproc).


F. CLI Argument Handling

void MyApp::parse_argument(int argc, char **argv) {
  for (int i = 1; i < argc; i++) {
    if (!strcmp(argv[i], "-core")) { nb_cores = atoi(argv[++i]); }
    else if (!strcmp(argv[i], "-p") || !strcmp(argv[i], "-S")) { i++; }  // ignore
  }
}

Accept -core N (or -worker N). Silently ignore flags from other runtimes. App(argc, argv) handles all standard flags.


G. Implementation Skeleton

struct MyRuntimeApp : public App {
  int nb_cores;

  MyRuntimeApp(int argc, char **argv) : App(argc, argv) {
    parse_argument(argc, argv);
  }

  void execute_main_loop() {
    display();

    // 1. Allocate buffers: [nb_fields][max_width] of output_bytes per graph
    // 2. Initialize scratch buffers (prepare_scratch once per buffer)
    // 3. Start timer, build DAG, run, stop timer:
    //    Timer::time_start();
    //    <build task DAG with emplace() + precede()>
    //    executor.run(taskflow).wait();
    //    double elapsed = Timer::time_end();
    // 4. report_timing(elapsed);
  }

  void parse_argument(int argc, char **argv) { /* see above */ }
};

int main(int argc, char **argv) {
  MyRuntimeApp app(argc, argv);
  app.execute_main_loop();
  return 0;
}
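
For step 3 of the skeleton, the construction loop might look like the sketch below (gi is illustrative). It assumes Taskflow's tf::Taskflow/tf::Executor API, where emplace() returns a tf::Task handle and a.precede(b) makes b wait on a; the task body and the WAW/WAR edges from Section D are elided.

tf::Executor executor(nb_cores);
tf::Taskflow taskflow;

const TaskGraph &graph = graphs[gi];  // reference, not a copy (pitfall 10)
std::vector<tf::Task> tasks(graph.timesteps * graph.max_width);

for (long t = 0; t < graph.timesteps; ++t) {
  long offset = graph.offset_at_timestep(t);
  long width  = graph.width_at_timestep(t);
  long dset   = graph.dependence_set_at_timestep(t);
  for (long point = offset; point < offset + width; ++point) {
    // Pre-compute this task's input list here (pitfall 11); capture only what the body needs
    tasks[t * graph.max_width + point] = taskflow.emplace([&graph, t, point]() {
      // gather inputs and call graph.execute_point(t, point, ...) as in Section C
    });
    if (t == 0) continue;
    long last_offset = graph.offset_at_timestep(t - 1);
    long last_width  = graph.width_at_timestep(t - 1);
    for (auto interval : graph.dependencies(dset, point))
      for (long dep = interval.first; dep <= interval.second; ++dep)
        if (dep >= last_offset && dep < last_offset + last_width)
          tasks[(t - 1) * graph.max_width + dep]
              .precede(tasks[t * graph.max_width + point]);
  }
}

executor.run(taskflow).wait();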

H. Common Porting Pitfalls

  1. Forgetting self-referencing input for timestep 0 — execute_point requires at least 1 input. Pass output_ptr as input when no dependencies exist.

  2. Not filtering dependency points — dependencies() can return intervals outside [offset, offset+width). Filter before building input lists.

  3. Ignoring nb_fields buffer reuse — Skipping WAW/WAR tracking causes silent corruption with -field 4 tests.

  4. Forgetting prepare_scratch — Scratch must contain a magic value; execute_point validates it. Call once per buffer, not per task.

  5. Missing LD_LIBRARY_PATH — Must include core/ at runtime so the dynamic loader can find the core library (libcore_s).

  6. Lambda capture of class members — Copy to local variables first:

    int nw = nb_workers;  // copy member to local
    taskflow.emplace([nw]() { ... });
  7. Const captures — By-value captures are const inside the lambda body by default. Declare the lambda mutable if you need to modify them.

  8. Static members in local classes — C++ forbids them. Move to namespace scope.

  9. Scratch buffer strategy — Two mechanisms work:

    • Pre-allocated per-worker array indexed by executor.this_worker_id()
    • thread_local variable

    Either is correct, but all allocation and prepare_scratch() calls must happen outside the lambda, before DAG construction. Never resize, check init flags, or call prepare_scratch() inside the task body — at millions of tasks, per-task branches and lazy allocation add measurable overhead. Pre-allocate scratch arrays for all workers up front, then just index into them inside the lambda.
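
    A sketch of the per-worker variant (assuming the executor exists before DAG construction; tf::Executor provides num_workers() and this_worker_id()):

    std::vector<std::vector<char>> scratch(executor.num_workers(),
        std::vector<char>(graph.scratch_bytes_per_task));
    for (auto &s : scratch)
      TaskGraph::prepare_scratch(s.data(), s.size());   // once per buffer, before run()

    taskflow.emplace([&graph, &scratch, &executor /*, ...*/]() {
      char *scratch_ptr = scratch[executor.this_worker_id()].data();
      // graph.execute_point(..., scratch_ptr, graph.scratch_bytes_per_task);
    });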

  10. Avoid copying TaskGraph into captures — TaskGraph contains multiple std::vectors. Copying it per task causes deep copies with heap allocations. Use a reference or pointer:

    const TaskGraph &graph = graphs[gi];  // reference, zero cost
    taskflow.emplace([&graph, ...]() { graph.execute_point(...); });
  11. Pre-compute dependencies during construction — Call graph.dependencies(dset, point) during the DAG construction loop, not inside task lambdas. The function does non-trivial work and would run on every worker thread at runtime if placed in the lambda.

  12. Keep it simple — In benchmarks of 18 independent ports, the shortest implementations (175-200 lines) consistently ranked in the top 3. Each extra abstraction risks adding heap allocations that increase per-task cost. Capture a pointer to the graph, use flat vectors for tracking, and let Taskflow handle scheduling.


I. Quick Reference Card

1. CLI:         struct MyApp : public App { MyApp(c,v) : App(c,v) { parse -worker } };
2. Buffers:     buf[nb_fields * max_width], each output_bytes_per_task bytes
                Slot index: (t % nb_fields) * max_width + point
3. Dependencies: graph.dependencies(dset, point) → filter to [last_offset, last_offset+last_width)
4. Timestep 0:  pass output_ptr as self-referencing input (n_inputs=1)
5. WAW/WAR:     last_writer[slot], has_writer[slot], readers[slot] per buffer slot
6. Scratch:     pre-allocate per-worker, call prepare_scratch() OUTSIDE the lambda, index inside

Recommended Test Commands

# Basic correctness
./main -kernel compute_bound -iter 1024 -type stencil_1d -steps 10 -width 10 -worker 4

# Empty kernel
./main -kernel empty -type trivial -steps 10 -width 10 -worker 2

# WAR/WAW correctness (field reuse — critical test)
./main -kernel compute_bound -iter 256 -type fft -steps 16 -width 8 -worker 4 -field 2

# Expected: all print "Elapsed Time ..." with no assertion failures

DAG Strategy

Prefer a monolithic DAG (one tf::Taskflow, one executor.run().wait()) — simpler and generally faster at all tested scales (up to 5M tasks). Consider batching only if the DAG would exceed available RAM (~0.5 GB per 1M tasks for metadata).