
Async refactor#617

Open
richarddd wants to merge 25 commits into DelSkayn:master from richarddd:async-refactor

Conversation

@richarddd
Collaborator

Description of changes

Async refactor

Checklist

  • Added change to the changelog
  • Created unit tests for my feature if needed

@Sytten
Collaborator

Sytten commented Feb 7, 2026

I feel I asked for @DelSkayn's take on that PR, or is this a new approach? He had some valid concerns around fair scheduling with the refactor, but I can't find the comment.

Actix had the same issue in the past: it always polled in the same order instead of scheduling fairly like Tokio, so you would sometimes see pretty bad P99 latencies.

// If we polled all the futures at least once,
// or more than one future immediately queued itself after being polled,
// yield back to the parent scheduler.
if yielded > 2 || iteration > self.len.get() {
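The yield condition quoted above can be modeled with a small round-robin sketch. This is a simplified, hypothetical model (plain closures over a `VecDeque`, not the actual scheduler): tasks that immediately re-queue themselves count as "yields", and the round stops once the budget from the snippet is exhausted.

```rust
use std::collections::VecDeque;

/// Result of polling one simulated task: either it finished,
/// or it immediately re-queued itself after being polled.
#[derive(Clone, Copy, PartialEq)]
pub enum Step {
    Done,
    Requeued,
}

/// Simplified model of the old scheduler loop: poll tasks from the front
/// of the queue, re-queue the ones that yield, and stop once we have
/// either seen every task once or more than two tasks re-queued
/// themselves right after being polled. Returns the number of polls
/// performed before yielding back to the parent scheduler.
pub fn poll_round(tasks: &mut VecDeque<Step>) -> usize {
    let len = tasks.len();
    let mut yielded = 0;
    let mut iteration = 0;
    while let Some(step) = tasks.pop_front() {
        iteration += 1;
        if step == Step::Requeued {
            yielded += 1;
            tasks.push_back(step);
        }
        // Same condition as the snippet above.
        if yielded > 2 || iteration > len {
            break;
        }
    }
    iteration
}
```

The point of the budget is the second clause: even if every future keeps re-queuing itself, the round terminates and control returns to the parent, so other work on the executor is not starved.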
Collaborator


That previous implementation might not be an ideal case of fair scheduling, but there was at least an attempt to poll multiple futures, not just the top of the queue.

Comment thread core/src/runtime/task_queue.rs Outdated

let mut progress = false;

while let Some(p) = self.ready.with(|r| r.pop()) {
Collaborator


This new system only drives one future at a time, unless I'm misunderstanding it. I would like to see a benchmark with 10 random futures where 2-3 of them are very slow (10s+ of pending time) to see how the two versions perform.

Collaborator Author

@richarddd richarddd Feb 8, 2026


In a single call to poll(), it drains the entire ready queue and polls every task that has been woken; it's not limited to one future per call.

In the scenario you described, the slow futures would return Pending and sit idle until their wakers fire, while the other futures continue to make progress normally. The slow futures don't consume any CPU while pending; they're just stored in their slots waiting to be re-queued.
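That drain-the-whole-queue behavior can be sketched with a toy model (hypothetical names, not the PR's code): tasks live in fixed slots, a wake pushes the slot index onto a ready queue, and one `poll_ready` call processes every queued slot.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified model of the slot scheduler described above:
/// each task lives in a fixed slot, a waker pushes the slot index onto
/// `ready`, and `poll_ready` drains the whole queue in one call.
pub struct MiniQueue {
    /// `true` while the task in that slot still has work left.
    slots: Vec<bool>,
    ready: VecDeque<usize>,
}

impl MiniQueue {
    /// Start with `n` tasks, all initially ready.
    pub fn new(n: usize) -> Self {
        MiniQueue { slots: vec![true; n], ready: (0..n).collect() }
    }

    /// A waker re-queues its slot; pending tasks cost nothing until then.
    pub fn wake(&mut self, slot: usize) {
        self.ready.push_back(slot);
    }

    /// Drain the entire ready queue, "polling" every woken task once.
    /// Returns how many tasks were polled in this call.
    pub fn poll_ready(&mut self) -> usize {
        let mut polled = 0;
        while let Some(i) = self.ready.pop_front() {
            polled += 1;
            self.slots[i] = false; // pretend the task completed
        }
        polled
    }
}
```

A slow future simply never appears in `ready` until its waker fires, so it costs nothing per scheduler pass.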

Comment thread core/src/runtime/task_queue.rs Outdated
Comment on lines +273 to +282
unsafe fn waker_wake_by_ref(p: *const ()) {
    let slot = &*(p as *const Slot);
    if get_flag(&slot.active) && try_set_true(&slot.queued) {
        let queue = &*slot.queue;
        queue.ready.with(|r| r.push(p as *mut Slot));
        if let Some(w) = queue.waker.take_clone() {
            w.wake_by_ref();
        }
    }
}
Owner


This is unsound when the parallel feature is disabled. Waker is Send + Sync, and therefore all of its functions must be callable from a different thread; read the documentation for RawWakerVTable for more info.

The current implementation uses an UnsafeCell without synchronization when parallel is disabled and can therefore cause race conditions.
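To illustrate the contract being pointed out here: because `Waker` is `Send + Sync`, `wake` can run on an arbitrary thread, so the queue it pushes into needs interior synchronization. The sketch below (hypothetical names, a `Mutex` standing in for the unsynchronized `UnsafeCell` in the snippet above) shows the minimal shape of a sound cross-thread ready queue; it is not the PR's actual fix.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// A ready queue that is safe to push into from any thread, as the
/// Waker contract requires.
#[derive(Clone, Default)]
pub struct ReadyQueue {
    inner: Arc<Mutex<Vec<usize>>>,
}

impl ReadyQueue {
    /// Called from a waker's wake_by_ref, possibly on a foreign thread.
    pub fn push(&self, slot: usize) {
        self.inner.lock().unwrap().push(slot);
    }

    /// Called by the scheduler on its own thread.
    pub fn drain(&self) -> Vec<usize> {
        std::mem::take(&mut *self.inner.lock().unwrap())
    }
}

/// Wake `n` slots from `n` different threads. Without the mutex this
/// would be a data race; with it, every wake is observed exactly once.
pub fn wake_from_threads(n: usize) -> usize {
    let queue = ReadyQueue::default();
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let q = queue.clone();
            thread::spawn(move || q.push(i))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    queue.drain().len()
}
```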

@DelSkayn
Owner

DelSkayn commented Feb 8, 2026

What is your goal for writing a new scheduler? The current scheduler is quite capable, being both O(1) and lock-free. This implementation is simpler, but I don't think that alone is sufficient reason for rewriting the old one.

@Sytten
Collaborator

Sytten commented Feb 10, 2026

@DelSkayn Last time we talked it was more performant by a significant factor, so it was worth it, and less complex overall.

@DelSkayn
Owner

DelSkayn commented Feb 10, 2026

Was that with or without the parallel feature? I can see this being faster without the parallel feature, since pushing into a vector is probably faster than pushing into a lock-free linked list. However, that implementation is unsound and needs a mutex to protect against race conditions, and I doubt that a mutex-wrapped vector is faster than the lock-free linked list.

The current scheduler has a similar implementation to other sub-scheduler implementations like the futures crate's FuturesUnordered and Tokio's JoinSet, which are both implemented as a lock-free list of tasks.

@richarddd richarddd marked this pull request as draft March 9, 2026 13:57
@richarddd
Collaborator Author

richarddd commented Apr 22, 2026

@DelSkayn @Sytten I have left this hanging for too long, but I did some major refactors and experimentation. The whole point of this PR was to simplify the async bridge between JS and Rust while at the same time increasing performance.

This is accomplished via an arena allocation strategy with a pool of slots for futures, which massively improves performance when a lot of spawns are created.
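For readers unfamiliar with the pattern, here is a minimal sketch of an arena of slots with a free list; the names and structure are illustrative only, not the PR's actual code. Freed slot indices are recycled, so repeated spawn/complete cycles reuse memory instead of allocating per future.

```rust
/// Minimal slot arena: values live in `slots`, and indices of vacated
/// slots are kept on a `free` list for O(1) reuse.
pub struct SlotArena<T> {
    slots: Vec<Option<T>>,
    free: Vec<usize>,
}

impl<T> SlotArena<T> {
    pub fn new() -> Self {
        SlotArena { slots: Vec::new(), free: Vec::new() }
    }

    /// O(1): pop a recycled index if one exists, otherwise grow the arena.
    pub fn insert(&mut self, value: T) -> usize {
        match self.free.pop() {
            Some(i) => {
                self.slots[i] = Some(value);
                i
            }
            None => {
                self.slots.push(Some(value));
                self.slots.len() - 1
            }
        }
    }

    /// O(1): clear the slot and push its index onto the free list.
    pub fn remove(&mut self, i: usize) -> Option<T> {
        let value = self.slots[i].take();
        if value.is_some() {
            self.free.push(i);
        }
        value
    }
}
```

Steady-state spawning with this layout never touches the allocator once the arena has grown to the peak number of in-flight futures, which is consistent with the big wins in the spawn-heavy benchmarks below.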

Parallel (worst of a few runs)

| Scenario | master | async-refactor | Δ |
|---|---|---|---|
| spawn 100 | 9.5 µs (10.5 M/s) | 8.8 µs (11.4 M/s) | +9% |
| spawn 1K | 147.9 µs (6.8 M/s) | 42.3 µs (23.6 M/s) | +249% |
| spawn 100K | 6.0 ms (16.6 M/s) | 4.7 ms (21.1 M/s) | +27% |
| spawn 1M | 72.4 ms (13.8 M/s) | 47.2 ms (21.2 M/s) | +54% |
| JS promises 100 | 1039.3 µs (0.10 M/s) | 865.2 µs (0.12 M/s) | +20% |
| JS promises 1K | 4.8 ms (0.21 M/s) | 4.3 ms (0.23 M/s) | +10% |
| JS promises 10K | 48.6 ms (0.21 M/s) | 47.9 ms (0.21 M/s) | same |
| JS promises 100K | 440.9 ms (0.23 M/s) | 457.2 ms (0.22 M/s) | -4% |
| chained 100 | 219.4 µs (0.46 M/s) | 535.0 µs (0.19 M/s) | -59% |
| chained 1K | 1.8 ms (0.57 M/s) | 1.9 ms (0.53 M/s) | -7% |
| chained 10K | 14.9 ms (0.67 M/s) | 15.0 ms (0.67 M/s) | same |
| chained 100K | 147.6 ms (0.68 M/s) | 184.0 ms (0.54 M/s) | -21% |
| concurrent 100 | 20.5 µs (4.9 M/s) | 68.0 µs (1.5 M/s) | -70% |
| concurrent 1K | 83.4 µs (12.0 M/s) | 26.9 µs (37.2 M/s) | +210% |
| concurrent 100K | 6.9 ms (14.4 M/s) | 2.2 ms (44.7 M/s) | +210% |
| concurrent 1M | 83.1 ms (12.0 M/s) | 22.6 ms (44.3 M/s) | +269% |

Non-parallel (worst of a few runs)

| Scenario | master | async-refactor | Δ |
|---|---|---|---|
| spawn 100 | 29.7 µs (3.4 M/s) | 10.6 µs (9.5 M/s) | +179% |
| spawn 1K | 140.7 µs (7.1 M/s) | 52.0 µs (19.3 M/s) | +172% |
| spawn 100K | 11.0 ms (9.1 M/s) | 4.6 ms (21.7 M/s) | +138% |
| spawn 1M | 125.6 ms (8.0 M/s) | 44.3 ms (22.6 M/s) | +183% |
| JS promises 100 | 1601.1 µs (0.06 M/s) | 852.4 µs (0.12 M/s) | +100% |
| JS promises 1K | 8.1 ms (0.12 M/s) | 4.4 ms (0.23 M/s) | +92% |
| JS promises 10K | 77.7 ms (0.13 M/s) | 45.9 ms (0.22 M/s) | +69% |
| JS promises 100K | 703.7 ms (0.14 M/s) | 481.0 ms (0.21 M/s) | +50% |
| chained 100 | 432.4 µs (0.23 M/s) | 273.7 µs (0.37 M/s) | +61% |
| chained 1K | 2.8 ms (0.36 M/s) | 1.4 ms (0.70 M/s) | +94% |
| chained 10K | 27.0 ms (0.37 M/s) | 19.5 ms (0.51 M/s) | +38% |
| chained 100K | 246.8 ms (0.41 M/s) | 138.5 ms (0.72 M/s) | +76% |
| concurrent 100 | 21.4 µs (4.7 M/s) | 23.1 µs (4.3 M/s) | -9% |
| concurrent 1K | 105.9 µs (9.4 M/s) | 30.1 µs (33.2 M/s) | +253% |
| concurrent 100K | 20.4 ms (4.9 M/s) | 1.8 ms (56.1 M/s) | +1045% |
| concurrent 1M | 98.5 ms (10.2 M/s) | 20.7 ms (48.3 M/s) | +374% |

Benchmark in this branch

@richarddd richarddd marked this pull request as ready for review April 23, 2026 07:00


4 participants