@dt dt commented Oct 6, 2025

This change is Reviewable

@dt dt requested review from sumeerbhola and tbg October 6, 2025 22:17

@petermattis petermattis left a comment

I'm good with experimenting with this Go runtime enhancement. I definitely want to see experimental evidence of the benefit.

PS Should probably update the print in schedtrace to include sched.bgqsize.

@rickystewart rickystewart force-pushed the cockroach-go1.23.12 branch 3 times, most recently from 84fef0d to ec86954 Compare October 9, 2025 21:21
@dt dt force-pushed the yield branch 5 times, most recently from e673937 to 642d058 Compare October 16, 2025 02:33
@dt dt changed the title from "runtime: add runtime.BackgroundYield()" to "runtime: add runtime.Yield()" Oct 16, 2025
@sumeerbhola

Given that we are getting closer to merging this, I think we need a plan for how we will maintain it, specifically the couple of lines of code scattered over various parts in proc.go. For example, we need a list of the cases we need to integrate with the changes, and a way to find those cases in the scheduler code.

@dt dt force-pushed the yield branch 2 times, most recently from a642b54 to 4a07233 Compare October 25, 2025 21:45
Author

dt commented Oct 25, 2025

the couple of lines of code scattered over various parts in proc.go

Per the discussion thread above and in the Google doc, I've cut these "lines scattered around" (assuming we're referring to the same ones), since my current approach is to just search the runqs when npidle is zero. This seems to perform about the same, and it better handled adding a netpoll check, which I realized I wanted anyway, so the diff is now much closer to a pure, more self-contained addition. We need two additions to findRunnable, though they don't depend on anything other than the new yieldq, plus two new fields: the yieldq in schedt and the yieldchecks counter/timestamp in G. Other than that, it's just the pure addition of the Yield() function and its helpers.

Author

dt commented Oct 25, 2025

I definitely want to see experimental evidence of the benefit.

Here's without/with making all Pacer.Pace() calls include a call to runtime.Yield() (even those where Pacer is nil/elastic AC is off) on a 5x8vcpu cluster running a TPCC 5k IMPORT while serving kv95 20k QPS in the foreground (note the units changed in the first one). CPU utilization is smoother and higher with Yields.

[Four screenshots, taken 2025-10-25 between 17:11 and 17:12]

Without a foreground workload, IMPORT throughput is, somewhat surprisingly, 5-10% higher with Yields than without, while it is slightly lower when there is a workload to yield to. CPU profiling indicates that while maintaining 98% CPU utilization, the Yield() calls account for about 0.8-0.9% of IMPORT's CPU usage, and at <90% utilization more like 0.4-0.6%.

@petermattis petermattis left a comment

I like that this iteration is less intrusive than the previous one. Mostly comments about needing more comments.

gp.yieldchecks = now

for i := range allp {
// We don't need the extra accuracy (and cost) of runqempty here either.

Is the expense of runqempty problematic? Seems like you're only doing this infrequently.

Author

@dt dt Oct 28, 2025

Not that expensive (unless you have a silly number of cores) at least in relative terms in this branch where we have a whole syscall to netpoll. But I think it makes sense to skip it above when checking the local runq and if we're okay skipping it there, we should be okay skipping it here too (they can all be stolen from concurrently all the same).

@petermattis

The results are very compelling.

@sumeerbhola sumeerbhola left a comment

@sumeerbhola reviewed 1 of 2 files at r3, all commit messages.
Reviewable status: 1 of 5 files reviewed, 25 unresolved discussions (waiting on @dt, @petermattis, and @tbg)


src/runtime/runtime2.go line 515 at r3 (raw file):

	runningnanos  int64 // wall time spent in the running state

	yieldchecks uint32 // a packed approx time and count of maybeYield checks.

This needs a longer code comment elaborating on what exactly this represents and what the packing scheme is.


src/runtime/proc.go line 420 at r3 (raw file):

		// To avoid thrashing between yields, set yieldchecks to 1: if we yield
		// right back and see this sentinel we'll park instead to break the cycle.
		gp.yieldchecks = 1

So sometimes yieldchecks is a packed field and sometimes 1? This needs code comments.


src/runtime/proc.go line 7251 at r3 (raw file):

}

// yield_put is the gopark unlock function for Yield. It enqueues the goroutine

I'm confused by the "unlock function" terminology, given there are callers of gopark that pass nil. And I want to make sure we document in a code comment why this is correct from the perspective of the following comment in gopark:

// unlockf must not access this G's stack, as it may be moved between
// the call to gopark and the call to unlockf.
//
// Note that because unlockf is called after putting the G into a waiting
// state, the G may have already been readied by the time unlockf is called
// unless there is external synchronization preventing the G from being
// readied. If unlockf returns false, it must guarantee that the G cannot be
// externally readied.

We don't access the stack, so that's fine. Does "readied" mean transitioning back to the runnable state? I suppose that can't happen either, because we haven't even put it in the yieldq yet, so no one can discover it and make it runnable. Anything else I am missing?

@dt dt force-pushed the yield branch 5 times, most recently from 891a05a to 1079405 Compare October 28, 2025 03:58
Author

@dt dt left a comment

Reviewable status: 0 of 6 files reviewed, 25 unresolved discussions (waiting on @petermattis, @sumeerbhola, and @tbg)


src/runtime/proc.go line 7251 at r3 (raw file):

Previously, sumeerbhola wrote…

I'm confused by the "unlock function" terminology, given there are callers of gopark that pass nil. And I want to make sure we document in a code comment why this is correct from the perspective of the following comment in gopark:

// unlockf must not access this G's stack, as it may be moved between
// the call to gopark and the call to unlockf.
//
// Note that because unlockf is called after putting the G into a waiting
// state, the G may have already been readied by the time unlockf is called
// unless there is external synchronization preventing the G from being
// readied. If unlockf returns false, it must guarantee that the G cannot be
// externally readied.

We don't access the stack, so that's fine. Does "readied" mean transitioning back to the runnable state? I suppose that can't happen either, because we haven't even put it in the yieldq yet, so no one can discover it and make it runnable. Anything else I am missing?

Yeah, that's pretty much it: nothing is going to ready this G until findRunnable pulls it from the yieldq, and it is ours to put there, so I think we can guarantee it isn't externally readied.

@dt dt force-pushed the yield branch 2 times, most recently from 871cd2f to 83b02b4 Compare October 28, 2025 20:23
@dt dt changed the base branch from cockroach-go1.23.12 to cockroach-go1.25.3 October 28, 2025 20:23
Author

dt commented Oct 28, 2025

Only one data point but I just rebased this from go 1.23 to go 1.25 since CRDB just switched today and the rebase was clean with no conflicts on the scaled back touch points of just findRunnable+two new fields. I think the previous version with the extra pending work atomic did have master/1.25 -> 1.23 conflicts when I went the other way, so maybe some evidence that being more contained is indeed helpful here.

@petermattis petermattis left a comment

This approach looks good to me. I'll defer to @sumeerbhola for final approval. Really excited about the benefits you're seeing in test scenarios.

// Set yieldchecks to just new high timestamp bits, cleaning counter.
gp.yieldchecks = now

// Check runqs of all Ps; if we find anything park free this P to steal.

Nit: anything parked?

Author

if we find anything, park to free this P. Expanded the comment.

@sumeerbhola sumeerbhola left a comment

@sumeerbhola reviewed 2 of 3 files at r5.
Reviewable status: 2 of 6 files reviewed, 39 unresolved discussions (waiting on @dt, @petermattis, and @tbg)


src/runtime/proc.go line 463 at r5 (raw file):

which it will be when they're zero if we don't yield,

what is "they" in "they're"?
I am still not very clear on the yieldchecks states and state transitions. Please spell out the state machine and transitions somewhat precisely in a code comment, ideally where yieldchecks is declared.


src/runtime/proc.go line 470 at r5 (raw file):
Thinking out aloud. There are 3 assignments to yieldchecks in this file
yieldchecks = 1
yieldchecks = now
yieldchecks = prev | (yieldCountMask / 2)

Say one of the latter two ran and it parked itself in the global yieldq. At some point it starts running again. It will call yield and the first assignment runs and it puts itself in the back of the local runq. For it not to park itself in the yieldq on the next yield, the local runq must be empty, yes?

Say it is empty, so it falls through to the code below, where we increment it to 2, so count = 2 and count + 1 = 3, and we don't immediately check the clock. Then the next yield sees the transition, since count = 3 and count + 1 = 4. But the clock value is 0 since we haven't initialized it, so now != prev is true and we will check all the other queues. Seems OK, but this transition between the two modes of using yieldchecks needs a precise explanation and preferably a statement of invariants. I know the earlier comment is trying to do that, but IMHO it doesn't spell it out in detail. Without that, the reader has to do the hard work of fully figuring it out.

// We can clobber yieldchecks here since we're
// actively yielding -- we don't need the counter to decide to do so. And
// our sentinel will in turn be clobbered the very next time the time is put
// in the upper bits, which it will be when they're zero if we don't yield, ...


src/runtime/proc.go line 523 at r5 (raw file):
A longer comment would help. Something like:

// count & (count + 1) will be 0 on transitions from 2^k-1 to 2^k for every value of k, so k=0, 1, ..., which means we read the clock with exponential backoff. When k=11, we reach the maximum value of the counter, and we will also sample on the transition from 2^11-1 to 2^11, after which k will become 0 and we will resume faster sampling.


after which k will become 0 and we will resume faster sampling.

I see now that we don't do that: we set gp.yieldchecks = prev | (yieldCountMask / 2), so we go back to 2^10-1. But we are clobbering the time bits, which seems wrong.
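For readers following the backoff discussion, here is a minimal standalone sketch of the count & (count + 1) test quoted above. It is not the PR's code: the names shouldSampleClock and sampleCounts are hypothetical, and only the bit expression and the 11-bit counter width come from the thread.

```go
package main

import "fmt"

// shouldSampleClock reports whether count has the form 2^k-1
// (0, 1, 3, 7, 15, ...), meaning the next increment carries into a new bit.
// Hypothetical helper; only the bit expression comes from the review thread.
func shouldSampleClock(count uint32) bool {
	return count&(count+1) == 0
}

// sampleCounts lists the counter values in [1, max] that would trigger
// a clock sample under this test.
func sampleCounts(max uint32) []uint32 {
	var out []uint32
	for c := uint32(1); c <= max; c++ {
		if shouldSampleClock(c) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// With an 11-bit counter the largest value is 2047.
	fmt.Println(sampleCounts(2047))
}
```

The gap between consecutive triggering values doubles each time, which is the exponential backoff being described.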


src/runtime/proc.go line 456 at r5 (raw file):

		// yieldq and potentially switching Ps. While that's our preferred choice,
		// we want to avoid thrashing back and forth between multiple Yield-calling
		// goroutines: in such a case it is better to just park one so the other

... just park one on the global yieldq ...


src/runtime/proc.go line 457 at r5 (raw file):

		// we want to avoid thrashing back and forth between multiple Yield-calling
		// goroutines: in such a case it is better to just park one so the other
		// stops seeing it in the queue and yielding to it. To detect and break this

... stops seeing it in the local P's runq and yielding ...


src/runtime/proc.go line 510 at r5 (raw file):

	// uint32 on G: its 11 lower bits store a counter while the remaining 21 bits
	// store nanos quantized to 0.25ms "epochs" by discarding the lower 18 bits.
	// of a int64 nanotime() value. For counter values after increment of 2^k-1,

what is k?


src/runtime/proc.go line 515 at r5 (raw file):

	//
	// Choosing 11 bits for a counter allows backing off to a rate of checking the
	// clock once every 1k calls if called extremely frequently; it seems unlikely

Isn't 10 bits enough to represent 1023, so why 11 bits?


src/runtime/proc.go line 519 at r5 (raw file):

	// higher backoff. The 21 remaining bits allows ~9mins between rollover of
	// the epoch: the slim chance of a false negative is quite acceptable as if we
	// hit it, we just delay one check of the runqs by a quarter millisecond.

This false negative comment is not very clear to me.

IIUC, the risk is that when (count & (count + 1)) == 0, 9 minutes (with fidelity of 0.25ms) have elapsed and we won't check the other runqs. Since we vary the count interval between sampling the clock, due to the exponential backoff on the count interval, we will not keep hitting this pathological case of 9min having elapsed.

The above understanding is not consistent with the comment "we just delay one check of the runqs by a quarter millisecond", since if 9min have elapsed and we exponentially backoff, then 18min will have elapsed. I am probably missing something.

Could you add a longer code comment.


src/runtime/proc.go line 520 at r5 (raw file):

	// the epoch: the slim chance of a false negative is quite acceptable as if we
	// hit it, we just delay one check of the runqs by a quarter millisecond.
	const yieldCountBits, yieldCountMask = 11, (1 << 11) - 1

so yieldCountMask is 2047, and not 1023, yes?


src/runtime/proc.go line 521 at r5 (raw file):

	// hit it, we just delay one check of the runqs by a quarter millisecond.
	const yieldCountBits, yieldCountMask = 11, (1 << 11) - 1
	const yieldEpochShift = 18 - yieldCountBits // only need to shift by the difference, then mask.

18 - 11? Why?
18 is the quantization of the time, which corresponds to 1 epoch.
I would expect we would right shift time by 18. Then we would clear out everything except for the lowest 11 bits.
But we are doing

now := uint32(nanotime()>>yieldEpochShift) &^ yieldCountMask

So we are right shifting by 7 and then clearing the lowest 11 bits. Confused.
Can yieldchecks be moved into a struct with methods and a simple unit test? Hopefully all the methods will inline. It would also help with my earlier state machine comment.


Hmm, we do gp.yieldchecks = now, so I can see we want the quantized value and not use the lowest 11 bits. So that means taking the clock value, quantizing by clearing the lowest 18 bits, then right shifting by 18 bits and then left shifting by 11 bits. I think what is happening here is doing this in one step by right shifting by 7 and clearing the lowest 11. This would be very straightforward code if there were comments explaining what it is doing, but without them it is, at least for me, not decipherable without spending far too much time on it.
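The equivalence described above can be checked with a small standalone sketch. The constant values 11 and 18 and the one-step expression are taken from the quoted code; the function names packTwoStep and packOneStep and the constant epochQuantBits are hypothetical, introduced only for this illustration.

```go
package main

import "fmt"

const (
	yieldCountBits  = 11
	yieldCountMask  = (1 << yieldCountBits) - 1
	epochQuantBits  = 18 // hypothetical name for the 18 discarded low bits
	yieldEpochShift = epochQuantBits - yieldCountBits
)

// packTwoStep quantizes by shifting down 18 bits, then shifts the epoch
// up by 11 bits to leave room for the counter.
func packTwoStep(nanos int64) uint32 {
	return uint32((nanos >> epochQuantBits) << yieldCountBits)
}

// packOneStep shifts down by the net 7 bits and clears the low 11 bits,
// mirroring the expression quoted in the review.
func packOneStep(nanos int64) uint32 {
	return uint32(nanos>>yieldEpochShift) &^ yieldCountMask
}

func main() {
	samples := []int64{0, 1, 1<<18 - 1, 1 << 18, 123456789012, 1<<39 + 5<<18 + 77}
	for _, n := range samples {
		fmt.Println(n, packTwoStep(n), packOneStep(n), packTwoStep(n) == packOneStep(n))
	}
}
```

Both forms keep bits 18 through 38 of the nanosecond value in the upper 21 bits of the result and leave the lower 11 bits zero.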


src/runtime/proc.go line 522 at r5 (raw file):

	const yieldCountBits, yieldCountMask = 11, (1 << 11) - 1
	const yieldEpochShift = 18 - yieldCountBits // only need to shift by the difference, then mask.
	gp.yieldchecks++

When this goroutine comes back from


src/runtime/proc.go line 568 at r5 (raw file):

// park to Yield is considered "waiting" rather than "runnable" as it is blocked
// in this state until there is strictly spare execution capacity available to
// resume it, unlike runnable goroutines which generally take runs running at

take turns?

@dt dt force-pushed the yield branch 2 times, most recently from 65c1703 to b808685 Compare December 5, 2025 21:34
Author

@dt dt left a comment

Reviewable status: 1 of 7 files reviewed, 39 unresolved discussions (waiting on @petermattis, @sumeerbhola, and @tbg)


src/runtime/proc.go line 420 at r3 (raw file):

Previously, sumeerbhola wrote…

So sometimes yieldchecks is a packed field and sometimes 1? This needs code comments.

I reworked this comment.


src/runtime/proc.go line 463 at r5 (raw file):

Previously, sumeerbhola wrote…

which it will be when they're zero if we don't yield,

what is "they" in "they're"?
I am still not very clear on the yieldchecks states and state transitions. Please spell out the state machine and transitions somewhat precisely in a code comment, ideally where yieldchecks is declared.

I reworked this comment.


src/runtime/proc.go line 470 at r5 (raw file):

Previously, sumeerbhola wrote…

Thinking out aloud. There are 3 assignments to yieldchecks in this file
yieldchecks = 1
yieldchecks = now
yieldchecks = prev | (yieldCountMask / 2)

Say one of the latter two ran and it parked itself in the global yieldq. At some point it starts running again. It will call yield and the first assignment runs and it puts itself in the back of the local runq. For it not to park itself in the yieldq on the next yield, the local runq must be empty, yes?

Say it is empty, so it falls through to the code below, where we increment it to 2, so count = 2 and count + 1 = 3, and we don't immediately check the clock. Then the next yield sees the transition, since count = 3 and count + 1 = 4. But the clock value is 0 since we haven't initialized it, so now != prev is true and we will check all the other queues. Seems OK, but this transition between the two modes of using yieldchecks needs a precise explanation and preferably a statement of invariants. I know the earlier comment is trying to do that, but IMHO it doesn't spell it out in detail. Without that, the reader has to do the hard work of fully figuring it out.

// We can clobber yieldchecks here since we're
// actively yielding -- we don't need the counter to decide to do so. And
// our sentinel will in turn be clobbered the very next time the time is put
// in the upper bits, which it will be when they're zero if we don't yield, ...

I reworked this comment and the one below about yieldchecks in detail. I decided that here -- the only function where we use it -- is a better place for this commentary than where the field is defined off in g, but I did add a note there to look at the function.


src/runtime/proc.go line 510 at r5 (raw file):

Previously, sumeerbhola wrote…

what is k?

that just was meant to convey values of the form 2^k-1, i.e. 1, 3, 7, 15, etc. Spelled this out in reworked comment.


src/runtime/proc.go line 515 at r5 (raw file):

Previously, sumeerbhola wrote…

Isn't 10 bits enough to represent 1023, so why 11 bits?

Expanded the comment. We check and then cut the counter in half when it is about to overflow, which gives us another N/2 calls until we'd overflow again; so for N/2 = 1k we need N = 2k, thus 11 bits.
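A rough simulation of that plateau, for illustration only: it assumes the clock epoch never changes between calls (a caller in a tight loop), so the counter is only ever incremented or halved. The function gapsBetweenSamples and the constant names are hypothetical and not from the PR.

```go
package main

import "fmt"

const (
	countBits = 11
	countMask = (1 << countBits) - 1 // 2047
)

// gapsBetweenSamples simulates `calls` consecutive calls during which the
// epoch never changes, and returns the number of calls between consecutive
// clock samples. Hypothetical model, not the runtime's code.
func gapsBetweenSamples(calls int) []int {
	var gaps []int
	count, sinceLast := uint32(0), 0
	for i := 0; i < calls; i++ {
		count++
		sinceLast++
		if count&(count+1) != 0 {
			continue // not of the form 2^k-1: no clock read
		}
		gaps = append(gaps, sinceLast)
		sinceLast = 0
		if count == countMask {
			count = countMask / 2 // saturated: fall back to 1023
		}
	}
	return gaps
}

func main() {
	fmt.Println(gapsBetweenSamples(5000))
}
```

Under these assumptions the gaps double up to 1024 and then stay at 1024, which is where the "once every 1k calls" figure comes from; a 10-bit counter would plateau at 512 instead.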


src/runtime/proc.go line 519 at r5 (raw file):

Previously, sumeerbhola wrote…

This false negative comment is not very clear to me.

IIUC, the risk is that when (count & (count + 1)) == 0, 9 minutes (with fidelity of 0.25ms) have elapsed and we won't check the other runqs. Since we vary the count interval between sampling the clock, due to the exponential backoff on the count interval, we will not keep hitting this pathological case of 9min having elapsed.

The above understanding is not consistent with the comment "we just delay one check of the runqs by a quarter millisecond", since if 9min have elapsed and we exponentially backoff, then 18min will have elapsed. I am probably missing something.

Could you add a longer code comment.

Expanded and reworked the commentary here.


src/runtime/proc.go line 520 at r5 (raw file):

Previously, sumeerbhola wrote…

so yieldCountMask is 2047, and not 1023, yes?

yeah, 2047.


src/runtime/proc.go line 521 at r5 (raw file):

Previously, sumeerbhola wrote…

18 - 11? Why?
18 is the quantization of the time, which corresponds to 1 epoch.
I would expect we would right shift time by 18. Then we would clear out everything except for the lowest 11 bits.
But we are doing

now := uint32(nanotime()>>yieldEpochShift) &^ yieldCountMask

So we are right shifting by 7 and then clearing the lowest 11 bits. Confused.
Can yieldchecks be moved into a struct with methods and a simple unit test? Hopefully all the methods will inline. It would also help with my earlier state machine comment.


Hmm, we do gp.yieldchecks = now, so I can see we want the quantized value and not use the lowest 11 bits. So that means taking the clock value, quantizing by clearing the lowest 18 bits, then right shifting by 18 bits and then left shifting by 11 bits. I think what is happening here is doing this in one step by right shifting by 7 and clearing the lowest 11. This would be very straightforward code if there were comments explaining what it is doing, but without them it is, at least for me, not decipherable without spending far too much time on it.

I spelled this out in the added comment: no reason to shift down by 18 just to shift back up by 11 to make room for the counter, so only shift down by the net 7.


src/runtime/proc.go line 522 at r5 (raw file):

Previously, sumeerbhola wrote…

When this goroutine comes back from

Done.


src/runtime/proc.go line 523 at r5 (raw file):

Previously, sumeerbhola wrote…

A longer comment would help. Something like:

// count & (count + 1) will be 0 on transitions from 2^k-1 to 2^k for every value of k, so k=0, 1, ..., which means we read the clock with exponential backoff. When k=11, we reach the maximum value of the counter, and we will also sample on the transition from 2^11-1 to 2^11, after which k will become 0 and we will resume faster sampling.


after which k will become 0 and we will resume faster sampling.

I see now that we don't do that: we set gp.yieldchecks = prev | (yieldCountMask / 2), so we go back to 2^10-1. But we are clobbering the time bits, which seems wrong.

yeah, expanded this comment to cover this. Not following your concern on clobbering though? prev is the ts bits and we're or'ing it with the half saturated counter?


src/runtime/proc.go line 568 at r5 (raw file):

Previously, sumeerbhola wrote…

take turns?

Done.


src/runtime/runtime2.go line 515 at r3 (raw file):

Previously, sumeerbhola wrote…

This needs a longer code comment elaborating on what exactly this represents and what the packing scheme is.

Done. (refers to Yield() now)

@sumeerbhola sumeerbhola left a comment

LGTM

@sumeerbhola reviewed 2 of 3 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: 4 of 7 files reviewed, 43 unresolved discussions (waiting on @dt, @petermattis, and @tbg)


src/runtime/proc.go line 467 at r7 (raw file):
I don't see any specific "stale epoch" code. I think it is more akin to what I wrote in a previous comment

Where we increment it to 2, and so count = 2 and count + 1 = 3, and we don't immediately check the clock. Then the next yield sees the transition since count = 3 and count + 1 = 4. But the clock value is 0 since we haven't initialized it, so now != prev is true and we will check all the other queues.

Illustrating it with the example above would make it easier to follow.


src/runtime/proc.go line 501 at r7 (raw file):

	// passed. We define "enough" as approximately 0.25ms: long enough to keep
	// overhead low even for a caller in a tight loop, and hopefully even to give
	// the goroutine locally ahead of the blocked work a chance to locally yield

The "the goroutine locally ahead of the blocked work a chance to locally yield" could be clearer.

Presumably the blocked work is the work waiting on one of the other Ps' runqs, or the netpoll work, yes?
And I am guessing the "goroutine locally ahead of the blocked work" is the currently running goroutine on that remote P.
I think elaborating on this will be helpful for the reader.


src/runtime/proc.go line 523 at r7 (raw file):

	// uint32 (yieldchecks): the upper 21 bits store the low bits of the quantized
	// timestamp and the lower 11 bits store the call counter. Given the counter
	// resets to half its value when saturated, this results in plateauing at a

consider adding something like
... when saturated (at 2k-1), this results ...

which will make the later 1k call phrasing trivial to understand.


src/runtime/proc.go line 530 at r7 (raw file):

	// to be effective, this is not a major concern (at worst likely just means an
	// expensive check is deferred for an extra 0.25ms at which point it no longer
	// matches).

I am still unsure about this wrapping 9 min comment (not the logic itself), specifically the "just means ... deferred for an extra 0.25ms".

If 9min have elapsed from, say, when we checked at count=512 to count=1024, we are clearly not calling yield frequently anymore (540s/512 > 1s). And since we got to 512 without resetting to 0, a total of 0.25ms didn't elapse for 512 counter increments, so we were calling it very frequently before (0.25ms/512 = ~0.5 micros). So perhaps all this comment needs to say is that the yield calling code should not have so many orders of magnitude of difference in the calling frequency.

Author

@dt dt left a comment

Reviewable status: 4 of 7 files reviewed, 43 unresolved discussions (waiting on @petermattis, @sumeerbhola, and @tbg)


src/runtime/proc.go line 467 at r7 (raw file):

Previously, sumeerbhola wrote…

I don't see any specific "stale epoch" code. I think it is more akin to what I wrote in a previous comment

Where we increment it to 2, and so count = 2 and count + 1 = 3, and we don't immediately check the clock. Then the next yield sees the transition since count = 3 and count + 1 = 4. But the clock value is 0 since we haven't initialized it, so now != prev is true and we will check all the other queues.

Illustrating it with the example above would make it easier to follow.

Right, "stale epoch" just refers to a "prev" epoch not being equal to now, i.e. being stale, when we next check (at count == 3).

How about this: 1 is a valid packed prev+count value, with prev=0/count=1, so if we later call Yield with no local runq and fall through to the maybe-do-expensive-checks code below, it will just increment it as usual; when count=3 it will compare prev=0 to the clock and do a check.
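To make that trace concrete, here is a toy model of the fall-through path, reconstructed only from the fragments quoted in this thread (the increment, the count & (count + 1) test, the prev/now comparison, and the two assignments). The function step and its signature are hypothetical; the real code in proc.go may be structured differently.

```go
package main

import "fmt"

const (
	countBits = 11
	countMask = (1 << countBits) - 1
)

// step models one fall-through Yield call on a packed yieldchecks value.
// nowEpoch is the current quantized time, already shifted so that its low
// 11 bits are zero. It returns the new packed value and whether the
// expensive runq/netpoll check would run. Hypothetical model only.
func step(packed, nowEpoch uint32) (uint32, bool) {
	packed++
	count := packed & countMask
	if count&(count+1) != 0 {
		return packed, false // not at 2^k-1: skip the clock entirely
	}
	prev := packed &^ countMask
	if prev != nowEpoch {
		return nowEpoch, true // stale epoch: store new time bits, counter back to 0
	}
	if count == countMask {
		return prev | countMask/2, false // saturated within one epoch: halve
	}
	return packed, false
}

func main() {
	now := uint32(5) << countBits // an arbitrary nonzero epoch
	packed := uint32(1)           // the sentinel: prev=0, count=1
	for i := 1; i <= 3; i++ {
		var check bool
		packed, check = step(packed, now)
		fmt.Println("call", i, "epoch", packed>>countBits, "count", packed&countMask, "check", check)
	}
}
```

In this model the first call after the sentinel only increments the counter to 2, and the second reaches count=3, sees prev=0 differ from the current epoch, and runs the check.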


src/runtime/proc.go line 501 at r7 (raw file):

Previously, sumeerbhola wrote…

The "the goroutine locally ahead of the blocked work a chance to locally yield" could be clearer.

Presumably the blocked work is the work waiting on one of the other Ps' runqs, or the netpoll work, yes?
And I am guessing the "goroutine locally ahead of the blocked work" is the currently running goroutine on that remote P.
I think elaborating on this will be helpful for the reader.

I just removed the "locally ahead" mention here; I think the main thing is just low-overhead of yield checks while bounding time work can wait to be noticed by a yield check, so better to just focus on that in the commentary.


src/runtime/proc.go line 523 at r7 (raw file):

Previously, sumeerbhola wrote…

consider adding something like
... when saturated (at 2k-1), this results ...

which will make the later 1k call phrasing trivial to understand.

Done.


src/runtime/proc.go line 530 at r7 (raw file):

Previously, sumeerbhola wrote…

I am still unsure about this wrapping 9 min comment (not the logic itself), specifically the "just means ... deferred for an extra 0.25ms".

If 9min have elapsed from, say, when we checked at count=512 to count=1024, we are clearly not calling yield frequently anymore (540s/512 > 1s). And since we got to 512 without resetting to 0, a total of 0.25ms didn't elapse for 512 counter increments, so we were calling it very frequently before (0.25ms/512 = ~0.5 micros). So perhaps all this comment needs to say is that the yield calling code should not have so many orders of magnitude of difference in the calling frequency.

How about this:

// Note: 21 bits gives us ~2M distinct 0.25ms quantized times before we wrap
// around once every ~9 minutes. Since we compare exact equality, one would
// need to not check the clock at all for ~9mins, then check it on the exact
// 0.25ms tick to not see it change. To not check it at all for 9mins would
// imply a dramatic reduction in Yield call frequency; given frequent calls
// are what make Yield effective, this is not a practical concern.

Change-Id: Idbe3438f5f06cae82dc5dcc56c52347d20e3e20a