Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 10 additions & 6 deletions A61-IPv4-IPv6-dualstack-backends.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,9 +130,13 @@ address list. Specifically:
reports READY, which happens after all handshakes are complete),
we choose that connection. If there is a timer running, we cancel
the timer.
- Each time a subchannel reports TRANSIENT_FAILURE, we will increase a
counter for the number of connection failures. The counter is reset
any time re-resolution is requested.
- We will wait for at least one connection attempt on every address to
fail before we consider the first pass to be complete. At that point,
we will request re-resolution. As per [gRFC A62][A62], we will report
we will request re-resolution if the number of connection failures is at
least the number of subchannels. As per [gRFC A62][A62], we will report
TRANSIENT_FAILURE state and will continue trying to connect. We will
stay in TRANSIENT_FAILURE until either (a) we become connected or (b)
the LB policy is destroyed by the channel shutting down or going IDLE.
Expand All @@ -143,18 +147,18 @@ all times, with no regard for the order of the addresses. Each
individual subchannel will provide [backoff behavior][backoff-spec],
reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff
has finished. The pick_first policy will therefore automatically
request a connection whenever a subchannel reports IDLE. We will count
the number of connection failures, and when that number reaches the
number of subchannels, we will request re-resolution; note that because
request a connection whenever a subchannel reports IDLE. When the
number of connection failures reaches the number of subchannels (and not
in the first pass), we will request re-resolution; note that because
Comment on lines +151 to +152
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not in the first pass? If we're going to trigger this based solely on the number of connection attempt failures, then shouldn't we do that regardless of whether or not we're in the first pass? It seems like if we're already in TF and we get a new address list, we shouldn't effectively reset the counter.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think I changed any semantics here. I just made it more explicit. This paragraph only applies when we swap "modes" away from first pass.

I don't think we need to re-resolve during the first pass as we haven't actually tried all the addresses yet. We trigger re-resolution after that completes. If we are doing a first pass, we're not guaranteed to be in TF...

I could understand basing some of this off of "in TF" vs "not in TF," but the existing design uses "in first pass" vs "not in first pass." So changes there are more invasive.

the backoff state will differ across the subchannels, this may mean that
we have seen multiple failures of a single subchannel and no failures
from another subchannel, but this is a close enough approximation and
very simple to implement.
Comment on lines 133 to 156
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be clearer to decouple the discussion of re-resolution behavior from the rest of the description here by putting it in its own paragraph:

Suggested change
- We will wait for at least one connection attempt on every address to
fail before we consider the first pass to be complete. As per
[gRFC A62][A62], we will report TRANSIENT_FAILURE state and will
continue trying to connect. We will stay in TRANSIENT_FAILURE until
either (a) we become connected or (b) the LB policy is destroyed by the
channel shutting down or going IDLE.
If the first pass completes without a successful connection attempt, we
will switch to a mode where we keep trying to connect to all addresses at
all times, with no regard for the order of the addresses. Each
individual subchannel will provide [backoff behavior][backoff-spec],
reporting TRANSIENT_FAILURE while in backoff and then IDLE when backoff
has finished. The pick_first policy will therefore automatically request a
connection whenever a subchannel reports IDLE.
Each time a subchannel reports TRANSIENT_FAILURE, we will increase a
counter for the number of connection failures. The counter is reset any time
re-resolution is requested. When the number of connection failures reaches
the number of subchannels (and not in the first pass), we will request
re-resolution; note that because the backoff state will differ across the
subchannels, this may mean that we have seen multiple failures of a single
subchannel and no failures from another subchannel, but this is a close
enough approximation and very simple to implement.

Copy link
Copy Markdown
Member Author

@ejona86 ejona86 Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your use of the github "suggestion" feature is broken. You might want to avoid using it next time or properly select the multiple lines necessary to make it useful. (Or... I don't even know. I see you selected multiple lines now. I have no clue what github is doing. Although the multiple lines selected still looks suspect.)

That seems strictly worse to me, as it needs to alternate between "first pass" and "not first pass". It's also not as clear whether this paragraph is a continuation of the "not first pass" paragraph above; in fact, reading what you wrote I'd assume it is a continuation and only applies when we swap "modes", because the "(and not in the first pass)" is a parenthetical.

I don't understand why "note that because the backoff" is combined with a semicolon with the previous sentence. They seem unrelated.

It looks like your new reorganization broke the language, because now we're not requesting re-resolution when we leave the first pass. Was that your intention? I can't actually tell your intention with the language/organization present.


Note that every time the LB policy receives a new address list, it will
start an initial Happy Eyeballs pass over the new list, even if some of
start a new first pass over the new list using Happy Eyeballs, even if some of
the subchannels are not actually new due to their addresses having been
present on both the old and new lists. This means that on the initial
present on both the old and new lists. This means that on the first
pass through the address list for a subsequent address list update, when
pick_first decides to start a connection attempt on a given subchannel
(whether because it is the first subchannel in the list or because the
Expand Down