Skip to content

Add fail action 6, will fallback to serving stale if retry attempts are exhausted#12852

Open
ezelkow1 wants to merge 13 commits intoapache:masterfrom
ezelkow1:fa6
Open

Add fail action 6, will fallback to serving stale if retry attempts are exhausted#12852
ezelkow1 wants to merge 13 commits intoapache:masterfrom
ezelkow1:fa6

Conversation

@ezelkow1
Copy link
Member

@ezelkow1 ezelkow1 commented Feb 3, 2026

Adds a new fail action that will combine actions 2 and 5 so that if retries are exhausted after attempting collapse then it will also check if it can serve stale if it has an object before deciding to go upstream.

This also refactors a tiny bit of the going to origin logic so that when we are in these retry states we can avoid multiple CACHE_LOOKUP hook calls since previously plugins could get hit with these multiple times. Also fixes some stats issues around counting hit vs. stale or dupe counts.

For now I mainly want to see if this passes all tests since I can't run all locally

@ezelkow1 ezelkow1 self-assigned this Feb 3, 2026
@ezelkow1 ezelkow1 requested review from bneradt and removed request for bneradt February 3, 2026 20:37
@bryancall bryancall requested a review from Copilot February 3, 2026 21:40
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new cache write-lock failure mode (cache_open_write_fail_action = 6) that combines read-retry (request collapsing) with a stale-serving fallback when retries are exhausted, along with refactors intended to reduce duplicate CACHE_LOOKUP_COMPLETE hook invocations and improve cache hit/stale statistics attribution.

Changes:

  • Introduces READ_RETRY_STALE_ON_REVALIDATE (action 6) across config, core cache state machine logic, and documentation.
  • Refactors “go to origin” flows and defers CACHE_LOOKUP_COMPLETE in certain read-retry scenarios to avoid repeated plugin hook invocations.
  • Adds a new gold test/replay as a smoke/config-acceptance test for action 6.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/gold_tests/cache/replay/cache-read-retry-stale.replay.yaml Adds a replay covering basic caching behavior with action 6 configured.
tests/gold_tests/cache/cache-read-retry-stale.test.py Adds a gold test wrapper for the new replay.
src/records/RecordsConfig.cc Updates inline config commentary to include actions 5 and 6.
include/proxy/http/HttpConfig.h Adds enum value READ_RETRY_STALE_ON_REVALIDATE = 0x06.
src/proxy/http/HttpConfig.cc Extends config validation logic to include action 6.
src/proxy/http/HttpCacheSM.cc Generalizes READ_RETRY checks to include action 6.
include/proxy/http/HttpTransact.h Adds new state flag + new helper function declarations.
src/proxy/http/HttpTransact.cc Implements stale-serving attribution logic, deferred hook behavior, and go-to-origin refactor.
src/proxy/http/HttpSM.cc Defers CACHE_LOOKUP_COMPLETE hook for actions 5/6 in open-read-fail path.
doc/admin-guide/files/records.yaml.en.rst Documents action 6 and updates references to actions that ignore write-retry configs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ezelkow1 and others added 3 commits February 3, 2026 15:03
yes, this is a good fix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ezelkow1 ezelkow1 marked this pull request as draft February 4, 2026 00:13
@ezelkow1
Copy link
Member Author

ezelkow1 commented Feb 4, 2026

converting to draft, there is one corner case Im trying to work out (who knew trying to defer hook firings until the ultimate result after looping through the SM is known would be complicated :) )

@ezelkow1
Copy link
Member Author

ezelkow1 commented Feb 4, 2026

[approve ci autest 0]

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@bneradt bneradt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I'm Claude (an AI assistant) — asked by @bneradt to review this PR. Here are my concerns:


1. serving_stale_due_to_write_lock changes behavior for existing actions 2 and 3

The new flag is set inside what_is_document_freshness whenever the STALE_ON_REVALIDATE bitmask check passes:

s->serving_stale_due_to_write_lock = true;

This triggers for all actions with the STALE_ON_REVALIDATE bit set: actions 2 (0x02), 3 (0x03), and the new 6 (0x06). The downstream effects in HandleCacheOpenReadHitFreshness (overriding cache_lookup_result to HIT_STALE) and HandleCacheOpenReadHit (changing VIA strings, skipping the REVALIDATION_FAILED warning code path) alter stats and VIA behavior for existing actions 2 and 3. The PR description mentions fixing stats issues, but this is a behavior change to existing, shipped functionality that should be clearly called out. If these are separate bug fixes for actions 2/3, they might be better as a separate commit or PR to isolate risk.

2. Action 5 stale-object path now skips HandleCacheOpenReadHit entirely

Previously for action 5 when READ_RETRY found a stale object:

// old code
s->hdr_info.server_request.destroy();
HandleCacheOpenReadHitFreshness(s);  // → STALE → HandleCacheOpenReadHit → revalidate with conditional request

Now:

// new code, stale + can_serve_stale==false (action 5)
s->cache_info.action = CacheAction_t::NO_ACTION;
handle_cache_write_lock_go_to_origin(s);  // goes directly to origin

The old path went through HandleCacheOpenReadHit, which would build a conditional revalidation request (If-Modified-Since / If-None-Match) before contacting origin. The new path goes directly to origin with NO_ACTION and a destroyed server_request, so no conditional headers are sent. This means a potential 304 response (saving bandwidth) becomes impossible, replaced by a full 200 response that also won't be cached. Depending on the workload, this could be a meaningful performance regression for action 5 with stale content.

3. Inconsistent use of is_read_retry_action helper

A nice helper is defined in HttpCacheSM.cc's anonymous namespace, but then the exact same two-value comparison is written inline in HttpSM.cc (at least 3 places), HttpConfig.cc, and HttpTransact.cc. If a future action is added, all those inline checks need updating. Consider moving this helper to the header (e.g., HttpConfig.h next to the enum) so it can be shared across all files.

4. VIA_SERVER_RESULT = VIA_SERVER_ERROR seems misleading for write lock failure

if (s->serving_stale_due_to_write_lock) {
    SET_VIA_STRING(VIA_SERVER_RESULT, VIA_SERVER_ERROR);
}

The server wasn't contacted and didn't return an error — the issue is cache write lock contention. Setting the VIA server result to "error" could mislead operators monitoring VIA strings for actual origin issues.

5. Temporary override trick to evaluate "real" freshness is fragile

MgmtByte saved_action           = s->cache_open_write_fail_action;
s->cache_open_write_fail_action = static_cast<MgmtByte>(CacheOpenWriteFailAction_t::READ_RETRY);
Freshness_t freshness           = what_is_document_freshness(s, &s->hdr_info.client_request, obj->response_get());
s->cache_open_write_fail_action = saved_action;

This save/restore pattern to prevent the STALE_ON_REVALIDATE short-circuit in what_is_document_freshness is fragile. If what_is_document_freshness later gains side effects based on the action value, this trick silently changes behavior. Consider a dedicated freshness evaluation path (a parameter, or a separate function) that doesn't have the STALE_ON_REVALIDATE short-circuit, rather than temporarily mutating state.

6. Test coverage is minimal

The test verifies action 6 is accepted, basic caching works, and ATS doesn't crash. It does not test the actual stale-serving behavior that is the feature's purpose. Given the complexity of the state machine changes (deferred hooks, new code paths for both fresh and stale objects, interactions between actions 5 and 6), stronger testing would reduce risk.

7. CACHE_LOOKUP_COMPLETE deferral is a behavioral change for action 5

The PR changes action 5's semantics: previously plugins could see CACHE_LOOKUP_COMPLETE multiple times (which the docs noted), now it fires once with the final result. The docs for action 5 are updated accordingly, but existing plugin authors who rely on the old multi-fire behavior may be surprised. This seems like a good improvement, but it's worth calling out as a breaking change for action 5 consumers.

@ezelkow1
Copy link
Member Author

ezelkow1 commented Feb 5, 2026

  1. This is actually a bug fix, the current code is wrong for ALL write_fail_actions involving stale (including previous 2/3). Before it was actually reporting these as FRESH when they were in fact STALE.

  2. Will implement, just instead of always destroying will do it conditionally so we can still get 304's on write lock fails.

  3. Just a nice to have, could be done later

  4. will just remove the server_error so it uses the default since it did not go upstream and served stale here

  5. Sort of agree here too, thinking about just adding a boolean to the what_is_document_freshness instead to denote if it should skip these checks instead of faking it

Evan Zelkowitz added 2 commits February 5, 2026 14:55
…mmediately falling back to serving stale. Now it stores of a copy of the object ptr that was found stale so it can follow the same retry path as 5 which if that fails then it will serve the stale object
@ezelkow1 ezelkow1 marked this pull request as ready for review February 6, 2026 20:31
ezelkow1 and others added 3 commits February 6, 2026 13:32
Removed unnecessary comment about test case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants