🚀 [Performance] Bottlenecks in call_kernel: Repetitive Hashing & Redundant Evaluation
Problem Description
Profiling of the call_kernel execution path reveals two significant performance bottlenecks that together account for the majority of execution time during circuit building.
1. `Pool::add(kernel)` rehashes the entire `KernelPrimitive` on every call
`call_kernel` calls `self.kernel_primitives.add(kernel)`, which performs a `HashMap::get(v)` lookup. This requires hashing the entire `KernelPrimitive` struct—including both `ir_for_later_compilation` and `ir_for_calling` RootCircuit IRs with all their instructions—on every invocation, even when the kernel is already registered.
Profiling Data
| Kernel | `kernel_primitives.add()` | Total `call_kernel` |
|---|---|---|
| `pre_attn_sln` | 1.17s | 1.27s |
| `freivalds_shared_x_qkv` | 3.38s | 4.22s |
Proposed Fix
- Short-term: Cache the kernel ID on the caller side or use pointer-based identity for a fast-path lookup, avoiding repeated full-struct hashing.
- Long-term: Introduce a `register_kernel` method that returns a reusable ID, and a `call_kernel_by_id` variant that skips the `Pool::add` overhead.
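A minimal sketch of the long-term proposal. The `register_kernel` / `call_kernel_by_id` names come from the proposal above; `KernelId` and the pool internals (a `Vec` plus a `HashMap` index) are illustrative assumptions, not the actual implementation:

```rust
use std::collections::HashMap;

// Stand-in for the real KernelPrimitive (which holds two RootCircuit IRs).
#[derive(Hash, PartialEq, Eq, Clone)]
struct KernelPrimitive {
    ir: Vec<u64>,
}

// Small, copyable handle returned by register_kernel.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct KernelId(usize);

#[derive(Default)]
struct Pool {
    kernels: Vec<KernelPrimitive>,
    index: HashMap<KernelPrimitive, usize>,
}

impl Pool {
    // The full-struct hash happens here, once per distinct kernel.
    fn register_kernel(&mut self, k: KernelPrimitive) -> KernelId {
        if let Some(&i) = self.index.get(&k) {
            return KernelId(i);
        }
        let i = self.kernels.len();
        self.index.insert(k.clone(), i);
        self.kernels.push(k);
        KernelId(i)
    }

    // Fast path: O(1) Vec index, no hashing of the kernel body.
    fn call_kernel_by_id(&self, id: KernelId) -> &KernelPrimitive {
        &self.kernels[id.0]
    }
}
```

With this split, the per-call cost of a hot kernel drops from hashing two full IRs to a single array index.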
2. `eval_safe_simd` runs unconditionally for pure-constraint kernels
`call_kernel` always evaluates `kernel.ir_for_calling().eval_safe_simd(...)` for every parallel instance. This is often unnecessary:
- Pure-constraint kernels: For kernels where all IO specs are inputs (no outputs), `ir_for_calling` has its constraints stripped and produces no meaningful output.
- Known outputs: In many use cases, output values are already computed externally (e.g., during model inference). `call_kernel` is invoked only to register the kernel call for later proving, making the SIMD re-computation redundant.
Profiling Data (After fixing issue 1)
| Kernel | `eval_safe_simd` | Total `call_kernel` |
|---|---|---|
| `freivalds_shared_x_qkv` | 74ms | ~100ms |
| `freivalds_shared_x_qkv` | 579ms | ~840ms |
Proposed Fix
- Optimization: Skip `eval_safe_simd` for kernels with no output specs: `!kernel.io_specs().iter().any(|s| s.is_output)`
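A sketch of how this guard could sit inside `call_kernel`. The condition `!specs.iter().any(|s| s.is_output)` is taken from the proposal above; `IoSpec`, `has_outputs`, and the counter standing in for `eval_safe_simd` are hypothetical stand-ins for the real types:

```rust
// Hypothetical stand-in for the real IO spec type.
struct IoSpec {
    is_output: bool,
}

fn has_outputs(specs: &[IoSpec]) -> bool {
    specs.iter().any(|s| s.is_output)
}

fn call_kernel(specs: &[IoSpec], eval_count: &mut usize) {
    // Pure-constraint kernel: every spec is an input, so the evaluation
    // would produce no meaningful output — skip it entirely.
    if has_outputs(specs) {
        // Stands in for kernel.ir_for_calling().eval_safe_simd(...).
        *eval_count += 1;
    }
    // Registering the kernel call for later proving happens either way.
}
```

This keeps the call-registration semantics intact while eliminating the evaluation cost for constraint-only kernels.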