This guide is for agents implementing a new task-parallel runtime in the task-bench framework. It covers the core API, verification contract, build conventions, and common pitfalls.
task-bench is a parameterized benchmark suite where a shared core/ library defines task
DAGs and kernels, and each runtime provides its own execution driver. The core library handles:
- CLI parsing: `-steps`, `-width`, `-type`, `-kernel`, `-iter`, `-scratch`, `-output`, `-imbalance`, `-field`, `-and`, and `-nodes` — all handled by `App(argc, argv)`.
- Graph topology: `TaskGraph::dependencies()` and `TaskGraph::reverse_dependencies()`
- Task execution: `TaskGraph::execute_point()` runs the kernel AND validates inputs.
- Timing/reporting: `App::report_timing(elapsed)` prints the expected output format.
Your runtime only needs to:
- Parse its own thread/process flags (e.g., `-core N`)
- Allocate data buffers
- Build and execute the task DAG with correct dependencies
- Time the execution and call `report_timing()`
```cpp
struct App {
  std::vector<TaskGraph> graphs;  // one per graph (multiple via -and flag)
  long nodes;                     // MPI node count (use 1 for shared-memory)

  App(int argc, char **argv);     // parses all standard CLI args
  void display() const;           // print graph config (call before timing)
  void report_timing(double elapsed_seconds) const;
};
```
```cpp
struct TaskGraph : public task_graph_t {
  long offset_at_timestep(long timestep) const;
  long width_at_timestep(long timestep) const;
  long dependence_set_at_timestep(long timestep) const;

  // Returns intervals of dependency points (INCLUSIVE [a, b] pairs)
  std::vector<std::pair<long, long>> dependencies(long dset, long point) const;
  // Buffer-based overload (avoids heap allocation — preferred for hot loops)
  size_t dependencies(long dset, long point, std::pair<long, long> *deps) const;
  size_t num_dependencies(long dset, long point) const;

  void execute_point(long timestep, long point,
                     char *output_ptr, size_t output_bytes,
                     const char **input_ptr, const size_t *input_bytes,
                     size_t n_inputs,
                     char *scratch_ptr, size_t scratch_bytes) const;

  static void prepare_scratch(char *scratch_ptr, size_t scratch_bytes);
};
```

Fields inherited from `task_graph_t`:

| Field | Meaning |
|---|---|
| `timesteps` | Number of timesteps (rows) in the DAG |
| `max_width` | Maximum number of points (columns) per timestep |
| `nb_fields` | Number of buffer slots for circular reuse (`t % nb_fields`) |
| `output_bytes_per_task` | Size of each task's output buffer in bytes |
| `scratch_bytes_per_task` | Size of scratch buffer (0 if none needed) |
| `dependence` | Enum: `TRIVIAL`, `NO_COMM`, `STENCIL_1D`, `ALL_TO_ALL`, etc. |
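For instance, a minimal allocation scheme built from these fields might look like the sketch below. This is not core API: `buffers` and `slot_ptr` are illustrative names, and an `App app` is assumed to be in scope.

```cpp
// One flat allocation per graph: nb_fields * max_width slots of
// output_bytes_per_task bytes each. Task (t, point) writes the slot
// at field index t % nb_fields.
std::vector<std::vector<char>> buffers;
for (const TaskGraph &graph : app.graphs) {
  buffers.emplace_back((size_t)graph.nb_fields * graph.max_width *
                       graph.output_bytes_per_task);
}

// Illustrative helper: pointer to the output slot of task (timestep, point).
auto slot_ptr = [](const TaskGraph &graph, std::vector<char> &buf,
                   long timestep, long point) -> char * {
  size_t slot = (size_t)(timestep % graph.nb_fields) * graph.max_width + point;
  return buf.data() + slot * graph.output_bytes_per_task;
};
```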
- Output: Fills `output_ptr` with `(timestep, point)` pairs for downstream validation.
- Input validation: Checks that each `input_ptr[idx]` contains `(timestep - 1, dep_point)`. Wrong data → assertion failure.
- Timestep 0 / no-dependency tasks: Must still pass at least one input. Convention: pass the output buffer as a self-referencing input:

  ```cpp
  if (timestep == 0 || input_points.empty()) {
    const char *self_input = output_ptr;
    size_t self_size = output_bytes;
    graph.execute_point(timestep, point, output_ptr, output_bytes,
                        &self_input, &self_size, 1, scratch_ptr, scratch_bytes);
  }
  ```
- Scratch buffers: Initialize with `TaskGraph::prepare_scratch()` once per buffer (not per task). It writes a magic value that `execute_point` validates.
- Dependency filtering: `dependencies()` may return intervals extending beyond the previous timestep's range. Only include `d` where `last_offset <= d < last_offset + last_width` (see the sketch after this list).
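To make the contract concrete, here is a hedged sketch of a per-task wrapper. The names `run_task` and `buf` are illustrative, the slot layout follows the `(t % nb_fields, point)` scheme above, and the core header include is omitted.

```cpp
#include <vector>

void run_task(const TaskGraph &graph, std::vector<char> &buf,
              long timestep, long point,
              char *scratch_ptr, size_t scratch_bytes) {
  size_t bytes = graph.output_bytes_per_task;
  auto slot = [&](long t, long p) {
    return buf.data() +
           ((size_t)(t % graph.nb_fields) * graph.max_width + p) * bytes;
  };
  char *output_ptr = slot(timestep, point);

  std::vector<const char *> input_ptrs;
  std::vector<size_t> input_sizes;
  if (timestep > 0) {
    long last_offset = graph.offset_at_timestep(timestep - 1);
    long last_width  = graph.width_at_timestep(timestep - 1);
    long dset        = graph.dependence_set_at_timestep(timestep);
    for (auto interval : graph.dependencies(dset, point)) {   // inclusive [a, b]
      for (long dep = interval.first; dep <= interval.second; dep++) {
        if (dep < last_offset || dep >= last_offset + last_width)
          continue;                                           // filter stale points
        input_ptrs.push_back(slot(timestep - 1, dep));
        input_sizes.push_back(bytes);
      }
    }
  }
  if (input_ptrs.empty()) {            // timestep 0 or no dependencies:
    input_ptrs.push_back(output_ptr);  // self-referencing input
    input_sizes.push_back(bytes);
  }
  graph.execute_point(timestep, point, output_ptr, bytes,
                      input_ptrs.data(), input_sizes.data(), input_ptrs.size(),
                      scratch_ptr, scratch_bytes);
}
```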
Each task at timestep `t`, point `x` writes to slot `[t % nb_fields][x]`.

- `nb_fields >= timesteps`: no reuse conflicts; RAW edges suffice.
- `nb_fields < timesteps`: multiple timesteps alias the same slot → WAW and WAR hazards. You need a slot tracker per `(field, point)` with `last_writer` and `readers_since_write`.
See ../docs/taskflow-patterns.md Section 6f for the full WAR/WAW tracking pattern.
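A minimal sketch of such a tracker, assuming Taskflow (`tf::Task`, `Task::precede()`) and flat slot indices (`field * max_width + point`); `SlotTracker` and its members are illustrative names, not core API:

```cpp
#include <vector>
#include <taskflow/taskflow.hpp>

// Per buffer slot: the task that last wrote it and the tasks that have read
// it since that write. A new writer must come after both.
struct SlotTracker {
  std::vector<tf::Task> last_writer;
  std::vector<char> has_writer;
  std::vector<std::vector<tf::Task>> readers;

  explicit SlotTracker(size_t n_slots)
      : last_writer(n_slots), has_writer(n_slots, 0), readers(n_slots) {}

  // Add RAW/WAR/WAW edges for `task`, which reads `in_slots` and writes `out_slot`.
  void add_edges(tf::Task task, const std::vector<size_t> &in_slots, size_t out_slot) {
    for (size_t s : in_slots) {
      if (s == out_slot) continue;                  // self-input (timestep 0): no edge
      if (has_writer[s]) last_writer[s].precede(task);  // RAW: read after last write
      readers[s].push_back(task);
    }
    if (has_writer[out_slot])                       // WAW: write after last write
      last_writer[out_slot].precede(task);
    for (tf::Task &r : readers[out_slot])           // WAR: write after all readers
      r.precede(task);
    readers[out_slot].clear();
    last_writer[out_slot] = task;
    has_writer[out_slot] = 1;
  }
};
```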
| File | Purpose |
|---|---|
| `main.cc` | Primary implementation |
| `main_expl` | Alternative binary (symlink to `main` if not applicable) |
| `Makefile` | Must produce `main` and `main_expl` targets |
```make
CXX ?= g++
CXXFLAGS = -std=c++17 -fPIC -pthread -O3
INC = -I../core -I<runtime-headers>
LIB = -L../core -lcore_s

include ../core/make_blas.mk

main.o: main.cc
	$(CXX) -c $(CXXFLAGS) $(INC) $<

main: main.o
	$(CXX) $^ $(LIB) $(LDFLAGS) -o $@

main_expl: main
	ln -sf main main_expl
```

Build core first: `make -C core -j$(nproc)`.
```cpp
void MyApp::parse_argument(int argc, char **argv) {
  for (int i = 1; i < argc; i++) {
    if (!strcmp(argv[i], "-core")) { nb_cores = atoi(argv[++i]); }
    else if (!strcmp(argv[i], "-p") || !strcmp(argv[i], "-S")) { i++; }  // ignore
  }
}
```

Accept `-core N` (or `-worker N`) and silently ignore flags from other runtimes; `App(argc, argv)` handles all standard flags.
```cpp
struct MyRuntimeApp : public App {
  int nb_cores;

  MyRuntimeApp(int argc, char **argv) : App(argc, argv) {
    parse_argument(argc, argv);
  }

  void execute_main_loop() {
    display();
    // 1. Allocate buffers: [nb_fields][max_width] of output_bytes per graph
    // 2. Initialize scratch buffers (prepare_scratch once per buffer)
    // 3. Start timer, build DAG, run, stop timer:
    //      Timer::time_start();
    //      <build task DAG with emplace() + precede()>
    //      executor.run(taskflow).wait();
    //      double elapsed = Timer::time_end();
    // 4. report_timing(elapsed);
  }

  void parse_argument(int argc, char **argv) { /* see above */ }
};

int main(int argc, char **argv) {
  MyRuntimeApp app(argc, argv);
  app.execute_main_loop();
  return 0;
}
```
- Forgetting self-referencing input for timestep 0 — `execute_point` requires at least 1 input. Pass `output_ptr` as input when no dependencies exist.
- Not filtering dependency points — `dependencies()` can return intervals outside `[offset, offset + width)`. Filter before building input lists.
- Ignoring `nb_fields` buffer reuse — Skipping WAW/WAR tracking causes silent corruption with `-field 4` tests.
- Forgetting `prepare_scratch` — Scratch must contain a magic value; `execute_point` validates it. Call it once per buffer, not per task.
- Missing `LD_LIBRARY_PATH` — Must include `core/` so the linker finds `libcore_s`.
- Lambda capture of class members — Copy to local variables first:

  ```cpp
  int nw = nb_workers;               // copy member to local
  taskflow.emplace([nw]() { ... });
  ```
- Const captures — By-value lambda captures are `const` inside the lambda body by default. Mark the lambda `mutable` if you need to modify them.
- Static members in local classes — C++ forbids them. Move to namespace scope.
- Scratch buffer strategy — Two mechanisms work:
  - Pre-allocated per-worker array indexed by `executor.this_worker_id()`
  - `thread_local` variable

  Either is correct, but all allocation and `prepare_scratch()` calls must happen outside the lambda, before DAG construction. Never resize, check init flags, or call `prepare_scratch()` inside the task body — at millions of tasks, per-task branches and lazy allocation add measurable overhead. Pre-allocate scratch arrays for all workers up front, then just index into them inside the lambda (see the sketch after this list).
- Avoid copying TaskGraph into captures — `TaskGraph` contains multiple `std::vector`s. Copying it per task causes deep copies with heap allocations. Use a reference or pointer:

  ```cpp
  const TaskGraph &graph = graphs[gi];   // reference, zero cost
  taskflow.emplace([&graph, ...]() { graph.execute_point(...); });
  ```
- Pre-compute dependencies during construction — Call `graph.dependencies(dset, point)` during the DAG construction loop, not inside task lambdas. The function does non-trivial work and would run on every worker thread at runtime if placed in the lambda.
- Keep it simple — In benchmarks of 18 independent ports, the shortest implementations (175-200 lines) consistently ranked in the top 3. Each extra abstraction risks adding heap allocations that increase per-task cost. Capture a pointer to the graph, use flat vectors for tracking, and let Taskflow handle scheduling.
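For the per-worker scratch variant referenced in the scratch-buffer pitfall above, a hedged sketch follows. It assumes Taskflow's `tf::Executor::num_workers()` and `this_worker_id()`, an `executor`, `taskflow`, and `graph` already in scope, and uses `scratch` as an illustrative name:

```cpp
// Pre-allocate and initialize one scratch buffer per worker, before building the DAG.
size_t scratch_bytes = graph.scratch_bytes_per_task;
std::vector<std::vector<char>> scratch(executor.num_workers(),
                                       std::vector<char>(scratch_bytes));
if (scratch_bytes) {
  for (auto &s : scratch)
    TaskGraph::prepare_scratch(s.data(), s.size());  // once per buffer, outside any lambda
}

taskflow.emplace([&]() {
  // Inside the task body: just index by worker id, no allocation or init checks.
  char *scratch_ptr =
      scratch_bytes ? scratch[executor.this_worker_id()].data() : nullptr;
  // graph.execute_point(..., scratch_ptr, scratch_bytes);
});
```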
1. CLI: struct MyApp : public App { MyApp(c,v) : App(c,v) { parse -worker } };
2. Buffers: buf[nb_fields * max_width], each output_bytes_per_task bytes
Slot index: (t % nb_fields) * max_width + point
3. Dependencies: graph.dependencies(dset, point) → filter to [last_offset, last_offset+last_width)
4. Timestep 0: pass output_ptr as self-referencing input (n_inputs=1)
5. WAW/WAR: last_writer[slot], has_writer[slot], readers[slot] per buffer slot
6. Scratch: pre-allocate per-worker, call prepare_scratch() OUTSIDE the lambda, index inside
```bash
# Basic correctness
./main -kernel compute_bound -iter 1024 -type stencil_1d -steps 10 -width 10 -worker 4

# Empty kernel
./main -kernel empty -type trivial -steps 10 -width 10 -worker 2

# WAR/WAW correctness (field reuse — critical test)
./main -kernel compute_bound -iter 256 -type fft -steps 16 -width 8 -worker 4 -field 2

# Expected: all print "Elapsed Time ..." with no assertion failures
```

Prefer a monolithic DAG (one `tf::Taskflow`, one `executor.run().wait()`) — simpler and generally faster at all tested scales (up to 5M tasks). Consider batching only if the DAG would exceed available RAM (~0.5 GB per 1M tasks for metadata).
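To make that recommendation concrete, here is a hedged sketch of a monolithic build-and-run loop. It covers RAW edges only (i.e., it assumes `nb_fields >= timesteps`; with field reuse, add the WAW/WAR tracking sketched earlier), and `nb_cores`, the per-task body, and `run_task` are placeholders from the earlier sketches:

```cpp
tf::Executor executor(nb_cores);

Timer::time_start();

tf::Taskflow taskflow;
for (const TaskGraph &graph : app.graphs) {
  std::vector<tf::Task> prev(graph.max_width), curr(graph.max_width);

  for (long t = 0; t < graph.timesteps; t++) {
    long offset = graph.offset_at_timestep(t);
    long width  = graph.width_at_timestep(t);
    long dset   = graph.dependence_set_at_timestep(t);
    long last_offset = t > 0 ? graph.offset_at_timestep(t - 1) : 0;
    long last_width  = t > 0 ? graph.width_at_timestep(t - 1)  : 0;

    for (long p = offset; p < offset + width; p++) {
      tf::Task task = taskflow.emplace([&graph, t, p /*, buffers, scratch... */]() {
        // run_task(graph, buffers[...], t, p, scratch_ptr, scratch_bytes);
      });
      if (t > 0) {
        // Pre-compute dependencies here, in the construction loop, not in the lambda.
        for (auto iv : graph.dependencies(dset, p))            // inclusive [a, b]
          for (long d = iv.first; d <= iv.second; d++)
            if (d >= last_offset && d < last_offset + last_width)
              prev[d].precede(task);                           // RAW edge
      }
      curr[p] = task;
    }
    std::swap(prev, curr);
  }
}

executor.run(taskflow).wait();
double elapsed = Timer::time_end();
app.report_timing(elapsed);
```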