feat: bidirectional links, trap_exit, and start_linked for actors#166
feat: bidirectional links, trap_exit, and start_linked for actors#166ElFantasma wants to merge 2 commits into
Conversation
🤖 Kimi Code ReviewReview SummaryThe PR implements Erlang-style bidirectional linking with Critical Issues1. Deadlock in Link Registration (
|
🤖 Claude Code ReviewNow I have everything I need. Let me write up the review. PR #166 Review: Bidirectional Links,
|
🤖 Codex Code ReviewFindings
Tests were not run here: Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Greptile SummaryThis PR implements Erlang-style bidirectional actor links with
Confidence Score: 3/5The core link primitives are well-structured and the test suite is thorough, but the post-registration dead-actor check in The duplicate-signal window exists in both backends: Both
|
| Filename | Overview |
|---|---|
| concurrency/src/link.rs | New module implementing bidirectional link primitives; TrapExitFlag/LinkedExitReason are accidentally pub, and propagate_exit inconsistently handles poisoned peer locks with a silent if let Ok drop. |
| concurrency/src/tasks/actor.rs | Adds link/unlink/trap_exit to Context and start_linked to ActorStart; the post-registration dead-actor check in ctx.link() can deliver a duplicate Exit to trapping actors when the peer dies concurrently with link setup. |
| concurrency/src/threads/actor.rs | Mirrors tasks-mode changes for threads; carries the same duplicate-Exit race in ctx.link() and additionally, OS preemption between the two register_link critical sections can produce a missed signal when the peer dies between step A and step B while completion is not yet published. |
| concurrency/src/child_handle.rs | Extends ChildHandle with trap_exit, links, linked_reason, and send_exit fields for type-erased signal delivery; changes are mechanical and consistent. |
| concurrency/src/lib.rs | Adds pub mod link and re-exports Exit; straightforward. |
| examples/exit_reason/src/main.rs | Adds Scenario 9 demonstrating supervisor-style trap_exit; example code is clear and correct. |
Sequence Diagram
sequenceDiagram
participant A as Actor A (linker)
participant RT as register_link
participant B as Actor B (target)
participant PX as propagate_exit
A->>RT: lock own_links, add B entry
A->>RT: lock peer_links, add A entry
Note over RT: Two separate critical sections
B->>PX: actor dies, drain own_links
PX->>A: signal(A) first delivery
PX->>B: completion_tx.send(reason)
A->>B: exit_reason() returns Some(reason)
A->>A: signal(A) duplicate delivery
Note over A: Trapping actor has 2 Exit envelopes
Note over A: for the same peer death
Comments Outside Diff (1)
-
concurrency/src/tasks/actor.rs, line 574-583 (link)Duplicate
Exitdelivery race for trapping actorsAfter
register_linksucceeds, if the peer dies and itspropagate_exitruns before thisexit_reason()check, the linker receives the signal twice: once frompropagate_exit(which foundown_idinpeer_links) and once from this post-registration check (which then seesexit_reason()isSome). For a non-trapping actor the secondcancel()call is a no-op, but for a trapping actor it enqueues a secondExitEnvelope, soexit_receivedfires with two identicalExitnotifications from the same peer death.The same race exists in
threads/actor.rsat the correspondinglink()method.Prompt To Fix With AI
This is a comment left during a code review. Path: concurrency/src/tasks/actor.rs Line: 574-583 Comment: **Duplicate `Exit` delivery race for trapping actors** After `register_link` succeeds, if the peer dies and its `propagate_exit` runs before this `exit_reason()` check, the linker receives the signal twice: once from `propagate_exit` (which found `own_id` in `peer_links`) and once from this post-registration check (which then sees `exit_reason()` is `Some`). For a non-trapping actor the second `cancel()` call is a no-op, but for a trapping actor it enqueues a second `ExitEnvelope`, so `exit_received` fires with two identical `Exit` notifications from the same peer death. The same race exists in `threads/actor.rs` at the corresponding `link()` method. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 4
concurrency/src/tasks/actor.rs:574-583
**Duplicate `Exit` delivery race for trapping actors**
After `register_link` succeeds, if the peer dies and its `propagate_exit` runs before this `exit_reason()` check, the linker receives the signal twice: once from `propagate_exit` (which found `own_id` in `peer_links`) and once from this post-registration check (which then sees `exit_reason()` is `Some`). For a non-trapping actor the second `cancel()` call is a no-op, but for a trapping actor it enqueues a second `ExitEnvelope`, so `exit_received` fires with two identical `Exit` notifications from the same peer death.
The same race exists in `threads/actor.rs` at the corresponding `link()` method.
### Issue 2 of 4
concurrency/src/link.rs:182-184
`propagate_exit` uses `unwrap_or_else` (recover from a poisoned lock) for `own_links` but silently discards poisoned-lock errors for `entry.peer_links` via `if let Ok`. If a peer thread panics while holding its link-table lock, the cleanup step is silently skipped, leaving a stale back-reference to the dying actor in the peer's table. Consistent error handling with `unwrap_or_else` is safer.
```suggestion
let mut peer_table = entry.peer_links.lock().unwrap_or_else(|p| p.into_inner());
peer_table.retain(|e| e.peer_id != own_id);
```
### Issue 3 of 4
concurrency/src/link.rs:36-42
`TrapExitFlag` and `LinkedExitReason` are declared `pub` in a publicly re-exported module (`pub mod link`), so they form part of the library's stable surface. Both are pure internal implementation details — users only ever interact with `ctx.trap_exit(bool)` and `Exit`. Restricting to `pub(crate)` prevents downstream code from accidentally depending on the concrete `Arc<AtomicBool>` / `Arc<Mutex<…>>` representation.
```suggestion
pub(crate) type TrapExitFlag = Arc<AtomicBool>;
/// Per-actor slot holding the exit reason of a linked actor whose death
/// triggered cancellation. When a non-trapping actor is cancelled by a link
/// signal, this slot is set so the actor's own exit reason propagates
/// transitively through further links.
pub(crate) type LinkedExitReason = Arc<Mutex<Option<ExitReason>>>;
```
### Issue 4 of 4
concurrency/src/tasks/actor.rs:825-830
The docstring says the link is established "atomically" and that the race window is "closed", but `start()` and `link()` are two separate operations: the child spawns and can process messages (or even panic) before `parent_ctx.link(...)` runs. The `link()` call does handle a dead target via the post-registration `exit_reason()` check, but the guarantee is not atomic — it's a best-effort check subject to the duplicate-delivery race noted above. The comment should be softened to avoid misleading users about the synchronisation semantics.
```suggestion
/// Start the actor and link it to the caller's context.
///
/// The link is registered as soon as the actor is spawned. If the child
/// dies before the link is established, `link()` detects this and delivers
/// the exit signal immediately. This does not provide strict atomicity —
/// the child may process a message between spawn and the link registration.
fn start_linked<P: Actor>(self, parent_ctx: &Context<P>) -> ActorRef<Self> {
```
Reviews (1): Last reviewed commit: "feat: add bidirectional links, trap_exit..." | Re-trigger Greptile
| if let Ok(mut peer_table) = entry.peer_links.lock() { | ||
| peer_table.retain(|e| e.peer_id != own_id); | ||
| } |
There was a problem hiding this comment.
propagate_exit uses unwrap_or_else (recover from a poisoned lock) for own_links but silently discards poisoned-lock errors for entry.peer_links via if let Ok. If a peer thread panics while holding its link-table lock, the cleanup step is silently skipped, leaving a stale back-reference to the dying actor in the peer's table. Consistent error handling with unwrap_or_else is safer.
| if let Ok(mut peer_table) = entry.peer_links.lock() { | |
| peer_table.retain(|e| e.peer_id != own_id); | |
| } | |
| let mut peer_table = entry.peer_links.lock().unwrap_or_else(|p| p.into_inner()); | |
| peer_table.retain(|e| e.peer_id != own_id); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/link.rs
Line: 182-184
Comment:
`propagate_exit` uses `unwrap_or_else` (recover from a poisoned lock) for `own_links` but silently discards poisoned-lock errors for `entry.peer_links` via `if let Ok`. If a peer thread panics while holding its link-table lock, the cleanup step is silently skipped, leaving a stale back-reference to the dying actor in the peer's table. Consistent error handling with `unwrap_or_else` is safer.
```suggestion
let mut peer_table = entry.peer_links.lock().unwrap_or_else(|p| p.into_inner());
peer_table.retain(|e| e.peer_id != own_id);
```
How can I resolve this? If you propose a fix, please make it concise.| pub type TrapExitFlag = Arc<AtomicBool>; | ||
|
|
||
| /// Per-actor slot holding the exit reason of a linked actor whose death | ||
| /// triggered cancellation. When a non-trapping actor is cancelled by a link | ||
| /// signal, this slot is set so the actor's own exit reason propagates | ||
| /// transitively through further links. | ||
| pub type LinkedExitReason = Arc<Mutex<Option<ExitReason>>>; |
There was a problem hiding this comment.
TrapExitFlag and LinkedExitReason are declared pub in a publicly re-exported module (pub mod link), so they form part of the library's stable surface. Both are pure internal implementation details — users only ever interact with ctx.trap_exit(bool) and Exit. Restricting to pub(crate) prevents downstream code from accidentally depending on the concrete Arc<AtomicBool> / Arc<Mutex<…>> representation.
| pub type TrapExitFlag = Arc<AtomicBool>; | |
| /// Per-actor slot holding the exit reason of a linked actor whose death | |
| /// triggered cancellation. When a non-trapping actor is cancelled by a link | |
| /// signal, this slot is set so the actor's own exit reason propagates | |
| /// transitively through further links. | |
| pub type LinkedExitReason = Arc<Mutex<Option<ExitReason>>>; | |
| pub(crate) type TrapExitFlag = Arc<AtomicBool>; | |
| /// Per-actor slot holding the exit reason of a linked actor whose death | |
| /// triggered cancellation. When a non-trapping actor is cancelled by a link | |
| /// signal, this slot is set so the actor's own exit reason propagates | |
| /// transitively through further links. | |
| pub(crate) type LinkedExitReason = Arc<Mutex<Option<ExitReason>>>; |
Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/link.rs
Line: 36-42
Comment:
`TrapExitFlag` and `LinkedExitReason` are declared `pub` in a publicly re-exported module (`pub mod link`), so they form part of the library's stable surface. Both are pure internal implementation details — users only ever interact with `ctx.trap_exit(bool)` and `Exit`. Restricting to `pub(crate)` prevents downstream code from accidentally depending on the concrete `Arc<AtomicBool>` / `Arc<Mutex<…>>` representation.
```suggestion
pub(crate) type TrapExitFlag = Arc<AtomicBool>;
/// Per-actor slot holding the exit reason of a linked actor whose death
/// triggered cancellation. When a non-trapping actor is cancelled by a link
/// signal, this slot is set so the actor's own exit reason propagates
/// transitively through further links.
pub(crate) type LinkedExitReason = Arc<Mutex<Option<ExitReason>>>;
```
How can I resolve this? If you propose a fix, please make it concise.| /// Atomically start the actor and link it to the caller's context. | ||
| /// | ||
| /// The link is established before the new actor processes any messages, | ||
| /// closing the race window where a child could die before the parent's | ||
| /// `link()` call completes. | ||
| fn start_linked<P: Actor>(self, parent_ctx: &Context<P>) -> ActorRef<Self> { |
There was a problem hiding this comment.
The docstring says the link is established "atomically" and that the race window is "closed", but
start() and link() are two separate operations: the child spawns and can process messages (or even panic) before parent_ctx.link(...) runs. The link() call does handle a dead target via the post-registration exit_reason() check, but the guarantee is not atomic — it's a best-effort check subject to the duplicate-delivery race noted above. The comment should be softened to avoid misleading users about the synchronisation semantics.
| /// Atomically start the actor and link it to the caller's context. | |
| /// | |
| /// The link is established before the new actor processes any messages, | |
| /// closing the race window where a child could die before the parent's | |
| /// `link()` call completes. | |
| fn start_linked<P: Actor>(self, parent_ctx: &Context<P>) -> ActorRef<Self> { | |
| /// Start the actor and link it to the caller's context. | |
| /// | |
| /// The link is registered as soon as the actor is spawned. If the child | |
| /// dies before the link is established, `link()` detects this and delivers | |
| /// the exit signal immediately. This does not provide strict atomicity — | |
| /// the child may process a message between spawn and the link registration. | |
| fn start_linked<P: Actor>(self, parent_ctx: &Context<P>) -> ActorRef<Self> { |
Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/tasks/actor.rs
Line: 825-830
Comment:
The docstring says the link is established "atomically" and that the race window is "closed", but `start()` and `link()` are two separate operations: the child spawns and can process messages (or even panic) before `parent_ctx.link(...)` runs. The `link()` call does handle a dead target via the post-registration `exit_reason()` check, but the guarantee is not atomic — it's a best-effort check subject to the duplicate-delivery race noted above. The comment should be softened to avoid misleading users about the synchronisation semantics.
```suggestion
/// Start the actor and link it to the caller's context.
///
/// The link is registered as soon as the actor is spawned. If the child
/// dies before the link is established, `link()` detects this and delivers
/// the exit signal immediately. This does not provide strict atomicity —
/// the child may process a message between spawn and the link registration.
fn start_linked<P: Actor>(self, parent_ctx: &Context<P>) -> ActorRef<Self> {
```
How can I resolve this? If you propose a fix, please make it concise.
Bot Review ResponsesAll actionable findings addressed in 0f48d9d. Fixed[Bug] Duplicate Fixed by adding [Doc] Doc softened in both tasks and threads modes: "not strictly atomic — the child may begin executing [Bug] Stale entries when linking to dead actor — Codex #3 Same fix as #1: when linking to an already-dead target, [Minor] Fixed. They were unintentionally [Minor] Inconsistent poisoned-mutex handling in Fixed. Now uses [Cosmetic] Lost Restored with an expanded explanation of the panic-fallback semantics. Flagged as wrong / not actionable[Kimi #1] Deadlock in link registration — incorrect. [Kimi #2] Nits left as-is (not blocking)
Test status
|
Closes #131 (the monitor half landed in #165; this PR completes the issue).
Summary
Phase 3d of the supervision-trees roadmap: Erlang-style bidirectional links between actors with
trap_exitsemantics.What's new
Exit { from: ActorId, reason: ExitReason }— system message delivered to linked actors that trap exitsActor::exit_received— new lifecycle callback (default no-op) that receivesExitnotifications. Mirrorsstarted/stopped— noHandler<Exit>boilerplate neededctx.link(&handle)/ctx.unlink(&handle)— bidirectional link managementctx.trap_exit(true)— convert exit signals intoExitmessages instead of dyingstart_linked(parent_ctx)— atomic spawn + link via theActorStarttraitExitReason::Killis untrappable — bypassestrap_exitand cancels the actorA → B → Cpropagate correctlyDesign notes
Arc<Mutex<Vec<LinkEntry>>>). No global state — peers register in each other's tables via closures captured at link time.ExitEnvelopeis a system envelope dispatched viaactor.exit_received()— noHandler<Exit>bound on linked actors.LinkedExitReasonslot on each actor carries the propagated reason so transitive chain propagation works correctly through non-trapping middle actors.link()is a no-op).Files
concurrency/src/link.rs—Exit,LinkEntry,LinkTable,make_signal,register_link,propagate_exitconcurrency/src/child_handle.rs—ChildHandlenow carriestrap_exit,links,linked_reason,send_exitfor type-erased signal deliveryconcurrency/src/tasks/actor.rs—Context/ActorRefgain link state;Actor::exit_receivedcallback;link/unlink/trap_exitmethods;start_linkedonActorStart;run_actorpropagates exit signals on actor deathconcurrency/src/threads/actor.rs— same for threads modeexamples/exit_reasonadds Scenario 9 demonstrating supervisor-style trap_exitTest plan
start_linkedcargo run -p exit_reason(Scenario 9 demonstrates the supervisor pattern)