This guide is for agents implementing a new task-parallel runtime in the task-bench framework. It covers the core API, verification contract, build conventions, and common pitfalls.
task-bench is a parameterized benchmark suite where a shared core/ library defines task
DAGs and kernels, and each runtime provides its own execution driver. The core library handles:
- CLI parsing: `-steps`, `-width`, `-type`, `-kernel`, `-iter`, `-scratch`, `-output`, `-imbalance`, `-field`, `-and`, and `-nodes` — all handled by `App(argc, argv)`.
- Graph topology: `TaskGraph::dependencies()` and `TaskGraph::reverse_dependencies()`
- Task execution: `TaskGraph::execute_point()` runs the kernel AND validates inputs.
- Timing/reporting: `App::report_timing(elapsed)` prints the expected output format.
Your runtime only needs to:
- Parse its own thread/process flags (e.g., `-core N`)
- Allocate data buffers
- Build and execute the task DAG with correct dependencies
- Time the execution and call `report_timing()`
```cpp
struct App {
  std::vector<TaskGraph> graphs;  // one per graph (multiple via -and flag)
  long nodes;                     // MPI node count (use 1 for shared-memory)

  App(int argc, char **argv);     // parses all standard CLI args
  void display() const;           // print graph config (call before timing)
  void report_timing(double elapsed_seconds) const;
};
```
```cpp
struct TaskGraph : public task_graph_t {
  long offset_at_timestep(long timestep) const;
  long width_at_timestep(long timestep) const;
  long dependence_set_at_timestep(long timestep) const;

  // Returns intervals of dependency points (INCLUSIVE [a, b] pairs)
  std::vector<std::pair<long, long>> dependencies(long dset, long point) const;
  // Buffer-based overload (avoids heap allocation — preferred for hot loops)
  size_t dependencies(long dset, long point, std::pair<long, long> *deps) const;
  size_t num_dependencies(long dset, long point) const;

  void execute_point(long timestep, long point,
                     char *output_ptr, size_t output_bytes,
                     const char **input_ptr, const size_t *input_bytes,
                     size_t n_inputs,
                     char *scratch_ptr, size_t scratch_bytes) const;

  static void prepare_scratch(char *scratch_ptr, size_t scratch_bytes);
};
```

Fields inherited from `task_graph_t`:

| Field | Meaning |
|---|---|
| `timesteps` | Number of timesteps (rows) in the DAG |
| `max_width` | Maximum number of points (columns) per timestep |
| `nb_fields` | Number of buffer slots for circular reuse (`t % nb_fields`) |
| `output_bytes_per_task` | Size of each task's output buffer in bytes |
| `scratch_bytes_per_task` | Size of scratch buffer (0 if none needed) |
| `dependence` | Enum: `TRIVIAL`, `NO_COMM`, `STENCIL_1D`, `ALL_TO_ALL`, etc. |
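For instance, a minimal allocation scheme built from these fields might look like the sketch below. This is not core API: `buffers` and `slot_ptr` are illustrative names, and an `App app` is assumed to be in scope.

```cpp
// One flat allocation per graph: nb_fields * max_width slots of
// output_bytes_per_task bytes each. Task (t, point) writes the slot
// at field index t % nb_fields.
std::vector<std::vector<char>> buffers;
for (const TaskGraph &graph : app.graphs) {
  buffers.emplace_back((size_t)graph.nb_fields * graph.max_width *
                       graph.output_bytes_per_task);
}

// Illustrative helper: pointer to the output slot of task (timestep, point).
auto slot_ptr = [](const TaskGraph &graph, std::vector<char> &buf,
                   long timestep, long point) -> char * {
  size_t slot = (size_t)(timestep % graph.nb_fields) * graph.max_width + point;
  return buf.data() + slot * graph.output_bytes_per_task;
};
```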
- Output: Fills `output_ptr` with `(timestep, point)` pairs for downstream validation.
- Input validation: Checks that each `input_ptr[idx]` contains `(timestep - 1, dep_point)`. Wrong data → assertion failure.
- Timestep 0 / no-dependency tasks: Must still pass at least one input. Convention: pass the output buffer as a self-referencing input:

  ```cpp
  if (timestep == 0 || input_points.empty()) {
    const char *self_input = output_ptr;
    size_t self_size = output_bytes;
    graph.execute_point(timestep, point, output_ptr, output_bytes,
                        &self_input, &self_size, 1, scratch_ptr, scratch_bytes);
  }
  ```
- Scratch buffers: Initialize with `TaskGraph::prepare_scratch()` once per buffer (not per task). It writes a magic value that `execute_point` validates.
- Dependency filtering: `dependencies()` may return intervals extending beyond the previous timestep's range. Only include `d` where `last_offset <= d < last_offset + last_width` (see the sketch after this list).
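To make the contract concrete, here is a hedged sketch of a per-task wrapper. The names `run_task` and `buf` are illustrative, the slot layout follows the `(t % nb_fields, point)` scheme above, and the core header include is omitted.

```cpp
#include <vector>

void run_task(const TaskGraph &graph, std::vector<char> &buf,
              long timestep, long point,
              char *scratch_ptr, size_t scratch_bytes) {
  size_t bytes = graph.output_bytes_per_task;
  auto slot = [&](long t, long p) {
    return buf.data() +
           ((size_t)(t % graph.nb_fields) * graph.max_width + p) * bytes;
  };
  char *output_ptr = slot(timestep, point);

  std::vector<const char *> input_ptrs;
  std::vector<size_t> input_sizes;
  if (timestep > 0) {
    long last_offset = graph.offset_at_timestep(timestep - 1);
    long last_width  = graph.width_at_timestep(timestep - 1);
    long dset        = graph.dependence_set_at_timestep(timestep);
    for (auto interval : graph.dependencies(dset, point)) {   // inclusive [a, b]
      for (long dep = interval.first; dep <= interval.second; dep++) {
        if (dep < last_offset || dep >= last_offset + last_width)
          continue;                                           // filter stale points
        input_ptrs.push_back(slot(timestep - 1, dep));
        input_sizes.push_back(bytes);
      }
    }
  }
  if (input_ptrs.empty()) {            // timestep 0 or no dependencies:
    input_ptrs.push_back(output_ptr);  // self-referencing input
    input_sizes.push_back(bytes);
  }
  graph.execute_point(timestep, point, output_ptr, bytes,
                      input_ptrs.data(), input_sizes.data(), input_ptrs.size(),
                      scratch_ptr, scratch_bytes);
}
```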
Each task at timestep `t`, point `x` writes to slot `[t % nb_fields][x]`.

- `nb_fields >= timesteps`: no reuse conflicts; RAW edges suffice.
- `nb_fields < timesteps`: multiple timesteps alias the same slot → WAW and WAR hazards. You need a slot tracker per `(field, point)` with `last_writer` and `readers_since_write`.
See ../docs/taskflow-patterns.md Section 6f for the full WAR/WAW tracking pattern.
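A minimal sketch of such a tracker, assuming Taskflow (`tf::Task`, `Task::precede()`) and flat slot indices (`field * max_width + point`); `SlotTracker` and its members are illustrative names, not core API:

```cpp
#include <vector>
#include <taskflow/taskflow.hpp>

// Per buffer slot: the task that last wrote it and the tasks that have read
// it since that write. A new writer must come after both.
struct SlotTracker {
  std::vector<tf::Task> last_writer;
  std::vector<char> has_writer;
  std::vector<std::vector<tf::Task>> readers;

  explicit SlotTracker(size_t n_slots)
      : last_writer(n_slots), has_writer(n_slots, 0), readers(n_slots) {}

  // Add RAW/WAR/WAW edges for `task`, which reads `in_slots` and writes `out_slot`.
  void add_edges(tf::Task task, const std::vector<size_t> &in_slots, size_t out_slot) {
    for (size_t s : in_slots) {
      if (s == out_slot) continue;                  // self-input (timestep 0): no edge
      if (has_writer[s]) last_writer[s].precede(task);  // RAW: read after last write
      readers[s].push_back(task);
    }
    if (has_writer[out_slot])                       // WAW: write after last write
      last_writer[out_slot].precede(task);
    for (tf::Task &r : readers[out_slot])           // WAR: write after all readers
      r.precede(task);
    readers[out_slot].clear();
    last_writer[out_slot] = task;
    has_writer[out_slot] = 1;
  }
};
```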
| File | Purpose |
|---|---|
| `main.cc` | Primary implementation |
| `main_expl` | Alternative binary (symlink to `main` if not applicable) |
| `Makefile` | Must produce `main` and `main_expl` targets |
```make
CXX ?= g++
CXXFLAGS = -std=c++17 -fPIC -pthread -O3
INC = -I../core -I<runtime-headers>
LIB = -L../core -lcore_s

include ../core/make_blas.mk

main.o: main.cc
	$(CXX) -c $(CXXFLAGS) $(INC) $<

main: main.o
	$(CXX) $^ $(LIB) $(LDFLAGS) -o $@

main_expl: main
	ln -sf main main_expl
```

Build core first: `make -C core -j$(nproc)`.
```cpp
void MyApp::parse_argument(int argc, char **argv) {
  for (int i = 1; i < argc; i++) {
    if (!strcmp(argv[i], "-core")) { nb_cores = atoi(argv[++i]); }
    else if (!strcmp(argv[i], "-p") || !strcmp(argv[i], "-S")) { i++; }  // ignore
  }
}
```

Accept `-core N` (or `-worker N`) and silently ignore flags from other runtimes; `App(argc, argv)` handles all standard flags.
```cpp
struct MyRuntimeApp : public App {
  int nb_cores;

  MyRuntimeApp(int argc, char **argv) : App(argc, argv) {
    parse_argument(argc, argv);
  }

  void execute_main_loop() {
    display();
    // 1. Allocate buffers: [nb_fields][max_width] of output_bytes per graph
    // 2. Initialize scratch buffers (prepare_scratch once per buffer)
    // 3. Start timer, build DAG, run, stop timer:
    //      Timer::time_start();
    //      <build task DAG with emplace() + precede()>
    //      executor.run(taskflow).wait();
    //      double elapsed = Timer::time_end();
    // 4. report_timing(elapsed);
  }

  void parse_argument(int argc, char **argv) { /* see above */ }
};

int main(int argc, char **argv) {
  MyRuntimeApp app(argc, argv);
  app.execute_main_loop();
  return 0;
}
```
- Forgetting self-referencing input for timestep 0 — `execute_point` requires at least 1 input. Pass `output_ptr` as input when no dependencies exist.
- Not filtering dependency points — `dependencies()` can return intervals outside `[offset, offset + width)`. Filter before building input lists.
- Ignoring `nb_fields` buffer reuse — Skipping WAW/WAR tracking causes silent corruption with `-field 4` tests.
- Forgetting `prepare_scratch` — Scratch must contain a magic value; `execute_point` validates it. Call it once per buffer, not per task.
- Missing `LD_LIBRARY_PATH` — Must include `core/` so the linker finds `libcore_s`.
- Lambda capture of class members — Copy to local variables first:

  ```cpp
  int nw = nb_workers;               // copy member to local
  taskflow.emplace([nw]() { ... });
  ```
- Const captures — By-value lambda captures are `const` inside the lambda body by default. Mark the lambda `mutable` if you need to modify them.
- Static members in local classes — C++ forbids them. Move to namespace scope.
- Scratch buffer strategy — Two mechanisms work:
  - Pre-allocated per-worker array indexed by `executor.this_worker_id()`
  - `thread_local` variable

  Either is correct, but all allocation and `prepare_scratch()` calls must happen outside the lambda, before DAG construction. Never resize, check init flags, or call `prepare_scratch()` inside the task body — at millions of tasks, per-task branches and lazy allocation add measurable overhead. Pre-allocate scratch arrays for all workers up front, then just index into them inside the lambda (see the sketch after this list).
- Avoid copying TaskGraph into captures — `TaskGraph` contains multiple `std::vector`s. Copying it per task causes deep copies with heap allocations. Use a reference or pointer:

  ```cpp
  const TaskGraph &graph = graphs[gi];   // reference, zero cost
  taskflow.emplace([&graph, ...]() { graph.execute_point(...); });
  ```
- Pre-compute dependencies during construction — Call `graph.dependencies(dset, point)` during the DAG construction loop, not inside task lambdas. The function does non-trivial work and would run on every worker thread at runtime if placed in the lambda.
- Keep it simple — In benchmarks of 18 independent ports, the shortest implementations (175-200 lines) consistently ranked in the top 3. Each extra abstraction risks adding heap allocations that increase per-task cost. Capture a pointer to the graph, use flat vectors for tracking, and let Taskflow handle scheduling.
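For the per-worker scratch variant referenced in the scratch-buffer pitfall above, a hedged sketch follows. It assumes Taskflow's `tf::Executor::num_workers()` and `this_worker_id()`, an `executor`, `taskflow`, and `graph` already in scope, and uses `scratch` as an illustrative name:

```cpp
// Pre-allocate and initialize one scratch buffer per worker, before building the DAG.
size_t scratch_bytes = graph.scratch_bytes_per_task;
std::vector<std::vector<char>> scratch(executor.num_workers(),
                                       std::vector<char>(scratch_bytes));
if (scratch_bytes) {
  for (auto &s : scratch)
    TaskGraph::prepare_scratch(s.data(), s.size());  // once per buffer, outside any lambda
}

taskflow.emplace([&]() {
  // Inside the task body: just index by worker id, no allocation or init checks.
  char *scratch_ptr =
      scratch_bytes ? scratch[executor.this_worker_id()].data() : nullptr;
  // graph.execute_point(..., scratch_ptr, scratch_bytes);
});
```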
1. CLI: struct MyApp : public App { MyApp(c,v) : App(c,v) { parse -worker } };
2. Buffers: buf[nb_fields * max_width], each output_bytes_per_task bytes
Slot index: (t % nb_fields) * max_width + point
3. Dependencies: graph.dependencies(dset, point) → filter to [last_offset, last_offset+last_width)
4. Timestep 0: pass output_ptr as self-referencing input (n_inputs=1)
5. WAW/WAR: last_writer[slot], has_writer[slot], readers[slot] per buffer slot
6. Scratch: pre-allocate per-worker, call prepare_scratch() OUTSIDE the lambda, index inside
```bash
# Basic correctness
./main -kernel compute_bound -iter 1024 -type stencil_1d -steps 10 -width 10 -worker 4

# Empty kernel
./main -kernel empty -type trivial -steps 10 -width 10 -worker 2

# WAR/WAW correctness (field reuse — critical test)
./main -kernel compute_bound -iter 256 -type fft -steps 16 -width 8 -worker 4 -field 2

# Expected: all print "Elapsed Time ..." with no assertion failures
```

Prefer a monolithic DAG (one `tf::Taskflow`, one `executor.run().wait()`) — simpler and generally faster at all tested scales (up to 5M tasks). Consider batching only if the DAG would exceed available RAM (~0.5 GB per 1M tasks for metadata).
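To make that recommendation concrete, here is a hedged sketch of a monolithic build-and-run loop. It covers RAW edges only (i.e., it assumes `nb_fields >= timesteps`; with field reuse, add the WAW/WAR tracking sketched earlier), and `nb_cores`, the per-task body, and `run_task` are placeholders from the earlier sketches:

```cpp
tf::Executor executor(nb_cores);

Timer::time_start();

tf::Taskflow taskflow;
for (const TaskGraph &graph : app.graphs) {
  std::vector<tf::Task> prev(graph.max_width), curr(graph.max_width);

  for (long t = 0; t < graph.timesteps; t++) {
    long offset = graph.offset_at_timestep(t);
    long width  = graph.width_at_timestep(t);
    long dset   = graph.dependence_set_at_timestep(t);
    long last_offset = t > 0 ? graph.offset_at_timestep(t - 1) : 0;
    long last_width  = t > 0 ? graph.width_at_timestep(t - 1)  : 0;

    for (long p = offset; p < offset + width; p++) {
      tf::Task task = taskflow.emplace([&graph, t, p /*, buffers, scratch... */]() {
        // run_task(graph, buffers[...], t, p, scratch_ptr, scratch_bytes);
      });
      if (t > 0) {
        // Pre-compute dependencies here, in the construction loop, not in the lambda.
        for (auto iv : graph.dependencies(dset, p))            // inclusive [a, b]
          for (long d = iv.first; d <= iv.second; d++)
            if (d >= last_offset && d < last_offset + last_width)
              prev[d].precede(task);                           // RAW edge
      }
      curr[p] = task;
    }
    std::swap(prev, curr);
  }
}

executor.run(taskflow).wait();
double elapsed = Timer::time_end();
app.report_timing(elapsed);
```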