🚀 [Performance] Bottlenecks in call_kernel: Repetitive Hashing & Redundant Evaluation
Problem Description
Profiling of the call_kernel execution path reveals two significant performance bottlenecks that together account for the majority of execution time during circuit building.
1. `Pool::add(kernel)` rehashes the entire `KernelPrimitive` on every call
`call_kernel` calls `self.kernel_primitives.add(kernel)`, which performs a `HashMap::get(v)` lookup. This requires hashing the entire `KernelPrimitive` struct—including both `ir_for_later_compilation` and `ir_for_calling` RootCircuit IRs with all their instructions—on every invocation, even when the kernel is already registered.
Profiling Data
| Kernel | `kernel_primitives.add()` | Total `call_kernel` |
|---|---|---|
| `pre_attn_sln` | 1.17s | 1.27s |
| `freivalds_shared_x_qkv` | 3.38s | 4.22s |
Proposed Fix
- Short-term: Cache the kernel ID on the caller side or use pointer-based identity for a fast-path lookup, avoiding repeated full-struct hashing.
- Long-term: Introduce a `register_kernel` method that returns a reusable ID, and a `call_kernel_by_id` variant that skips the `Pool::add` overhead.
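A minimal sketch of the long-term proposal. The `register_kernel` / `call_kernel_by_id` names come from the proposal above; `KernelId` and the pool internals (a `Vec` plus a `HashMap` index) are illustrative assumptions, not the actual implementation:

```rust
use std::collections::HashMap;

// Stand-in for the real KernelPrimitive (which holds two RootCircuit IRs).
#[derive(Hash, PartialEq, Eq, Clone)]
struct KernelPrimitive {
    ir: Vec<u64>,
}

// Small, copyable handle returned by register_kernel.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct KernelId(usize);

#[derive(Default)]
struct Pool {
    kernels: Vec<KernelPrimitive>,
    index: HashMap<KernelPrimitive, usize>,
}

impl Pool {
    // The full-struct hash happens here, once per distinct kernel.
    fn register_kernel(&mut self, k: KernelPrimitive) -> KernelId {
        if let Some(&i) = self.index.get(&k) {
            return KernelId(i);
        }
        let i = self.kernels.len();
        self.index.insert(k.clone(), i);
        self.kernels.push(k);
        KernelId(i)
    }

    // Fast path: O(1) Vec index, no hashing of the kernel body.
    fn call_kernel_by_id(&self, id: KernelId) -> &KernelPrimitive {
        &self.kernels[id.0]
    }
}
```

With this split, the per-call cost of a hot kernel drops from hashing two full IRs to a single array index.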
2. `eval_safe_simd` runs unconditionally for pure-constraint kernels
`call_kernel` always evaluates `kernel.ir_for_calling().eval_safe_simd(...)` for every parallel instance. This is often unnecessary:
- Pure-constraint kernels: For kernels where all IO specs are inputs (no outputs), `ir_for_calling` has its constraints stripped and produces no meaningful output.
- Known outputs: In many use cases, output values are already computed externally (e.g., during model inference). `call_kernel` is invoked only to register the kernel call for later proving, making the SIMD re-computation redundant.
Profiling Data (After fixing issue 1)
| Kernel | `eval_safe_simd` | Total `call_kernel` |
|---|---|---|
| `freivalds_shared_x_qkv` | 74ms | ~100ms |
| `freivalds_shared_x_qkv` | 579ms | ~840ms |
Proposed Fix
- Optimization: Skip `eval_safe_simd` for kernels with no output specs: `!kernel.io_specs().iter().any(|s| s.is_output)`
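A sketch of how this guard could sit inside `call_kernel`. The condition `!specs.iter().any(|s| s.is_output)` is taken from the proposal above; `IoSpec`, `has_outputs`, and the counter standing in for `eval_safe_simd` are hypothetical stand-ins for the real types:

```rust
// Hypothetical stand-in for the real IO spec type.
struct IoSpec {
    is_output: bool,
}

fn has_outputs(specs: &[IoSpec]) -> bool {
    specs.iter().any(|s| s.is_output)
}

fn call_kernel(specs: &[IoSpec], eval_count: &mut usize) {
    // Pure-constraint kernel: every spec is an input, so the evaluation
    // would produce no meaningful output — skip it entirely.
    if has_outputs(specs) {
        // Stands in for kernel.ir_for_calling().eval_safe_simd(...).
        *eval_count += 1;
    }
    // Registering the kernel call for later proving happens either way.
}
```

This keeps the call-registration semantics intact while eliminating the evaluation cost for constraint-only kernels.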