Add fail action 6, will fallback to serving stale if retry attempts are exhausted#12852
Add fail action 6, will fallback to serving stale if retry attempts are exhausted#12852ezelkow1 wants to merge 13 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new cache write-lock failure mode (cache_open_write_fail_action = 6) that combines read-retry (request collapsing) with a stale-serving fallback when retries are exhausted, along with refactors intended to reduce duplicate CACHE_LOOKUP_COMPLETE hook invocations and improve cache hit/stale statistics attribution.
Changes:
- Introduces
READ_RETRY_STALE_ON_REVALIDATE(action6) across config, core cache state machine logic, and documentation. - Refactors “go to origin” flows and defers
CACHE_LOOKUP_COMPLETEin certain read-retry scenarios to avoid repeated plugin hook invocations. - Adds a new gold test/replay as a smoke/config-acceptance test for action
6.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/gold_tests/cache/replay/cache-read-retry-stale.replay.yaml | Adds a replay covering basic caching behavior with action 6 configured. |
| tests/gold_tests/cache/cache-read-retry-stale.test.py | Adds a gold test wrapper for the new replay. |
| src/records/RecordsConfig.cc | Updates inline config commentary to include actions 5 and 6. |
| include/proxy/http/HttpConfig.h | Adds enum value READ_RETRY_STALE_ON_REVALIDATE = 0x06. |
| src/proxy/http/HttpConfig.cc | Extends config validation logic to include action 6. |
| src/proxy/http/HttpCacheSM.cc | Generalizes READ_RETRY checks to include action 6. |
| include/proxy/http/HttpTransact.h | Adds new state flag + new helper function declarations. |
| src/proxy/http/HttpTransact.cc | Implements stale-serving attribution logic, deferred hook behavior, and go-to-origin refactor. |
| src/proxy/http/HttpSM.cc | Defers CACHE_LOOKUP_COMPLETE hook for actions 5/6 in open-read-fail path. |
| doc/admin-guide/files/records.yaml.en.rst | Documents action 6 and updates references to actions that ignore write-retry configs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
yes, this is a good fix Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
converting to draft, there is one corner case Im trying to work out (who knew trying to defer hook firings until the ultimate result after looping through the SM is known would be complicated :) ) |
…iable to keep track of when to fire
|
[approve ci autest 0] |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
bneradt
left a comment
There was a problem hiding this comment.
Hi, I'm Claude (an AI assistant) — asked by @bneradt to review this PR. Here are my concerns:
1. serving_stale_due_to_write_lock changes behavior for existing actions 2 and 3
The new flag is set inside what_is_document_freshness whenever the STALE_ON_REVALIDATE bitmask check passes:
s->serving_stale_due_to_write_lock = true;This triggers for all actions with the STALE_ON_REVALIDATE bit set: actions 2 (0x02), 3 (0x03), and the new 6 (0x06). The downstream effects in HandleCacheOpenReadHitFreshness (overriding cache_lookup_result to HIT_STALE) and HandleCacheOpenReadHit (changing VIA strings, skipping the REVALIDATION_FAILED warning code path) alter stats and VIA behavior for existing actions 2 and 3. The PR description mentions fixing stats issues, but this is a behavior change to existing, shipped functionality that should be clearly called out. If these are separate bug fixes for actions 2/3, they might be better as a separate commit or PR to isolate risk.
2. Action 5 stale-object path now skips HandleCacheOpenReadHit entirely
Previously for action 5 when READ_RETRY found a stale object:
// old code
s->hdr_info.server_request.destroy();
HandleCacheOpenReadHitFreshness(s); // → STALE → HandleCacheOpenReadHit → revalidate with conditional requestNow:
// new code, stale + can_serve_stale==false (action 5)
s->cache_info.action = CacheAction_t::NO_ACTION;
handle_cache_write_lock_go_to_origin(s); // goes directly to originThe old path went through HandleCacheOpenReadHit, which would build a conditional revalidation request (If-Modified-Since / If-None-Match) before contacting origin. The new path goes directly to origin with NO_ACTION and a destroyed server_request, so no conditional headers are sent. This means a potential 304 response (saving bandwidth) becomes impossible, replaced by a full 200 response that also won't be cached. Depending on the workload, this could be a meaningful performance regression for action 5 with stale content.
3. Inconsistent use of is_read_retry_action helper
A nice helper is defined in HttpCacheSM.cc's anonymous namespace, but then the exact same two-value comparison is written inline in HttpSM.cc (at least 3 places), HttpConfig.cc, and HttpTransact.cc. If a future action is added, all those inline checks need updating. Consider moving this helper to the header (e.g., HttpConfig.h next to the enum) so it can be shared across all files.
4. VIA_SERVER_RESULT = VIA_SERVER_ERROR seems misleading for write lock failure
if (s->serving_stale_due_to_write_lock) {
SET_VIA_STRING(VIA_SERVER_RESULT, VIA_SERVER_ERROR);
}The server wasn't contacted and didn't return an error — the issue is cache write lock contention. Setting the VIA server result to "error" could mislead operators monitoring VIA strings for actual origin issues.
5. Temporary override trick to evaluate "real" freshness is fragile
MgmtByte saved_action = s->cache_open_write_fail_action;
s->cache_open_write_fail_action = static_cast<MgmtByte>(CacheOpenWriteFailAction_t::READ_RETRY);
Freshness_t freshness = what_is_document_freshness(s, &s->hdr_info.client_request, obj->response_get());
s->cache_open_write_fail_action = saved_action;This save/restore pattern to prevent the STALE_ON_REVALIDATE short-circuit in what_is_document_freshness is fragile. If what_is_document_freshness later gains side effects based on the action value, this trick silently changes behavior. Consider a dedicated freshness evaluation path (a parameter, or a separate function) that doesn't have the STALE_ON_REVALIDATE short-circuit, rather than temporarily mutating state.
6. Test coverage is minimal
The test verifies action 6 is accepted, basic caching works, and ATS doesn't crash. It does not test the actual stale-serving behavior that is the feature's purpose. Given the complexity of the state machine changes (deferred hooks, new code paths for both fresh and stale objects, interactions between actions 5 and 6), stronger testing would reduce risk.
7. CACHE_LOOKUP_COMPLETE deferral is a behavioral change for action 5
The PR changes action 5's semantics: previously plugins could see CACHE_LOOKUP_COMPLETE multiple times (which the docs noted), now it fires once with the final result. The docs for action 5 are updated accordingly, but existing plugin authors who rely on the old multi-fire behavior may be surprised. This seems like a good improvement, but it's worth calling out as a breaking change for action 5 consumers.
|
…mmediately falling back to serving stale. Now it stores of a copy of the object ptr that was found stale so it can follow the same retry path as 5 which if that fails then it will serve the stale object
moving instructions below license
Removed unnecessary comment about test case.
Adds a new fail action that will combine actions 2 and 5 so that if retries are exhausted after attempting collapse then it will also check if it can serve stale if it has an object before deciding to go upstream.
This also refactors a tiny bit of the going to origin logic so that when we are in these retry states we can avoid multiple CACHE_LOOKUP hook calls since previously plugins could get hit with these multiple times. Also fixes some stats issues around counting hit vs. stale or dupe counts.
For now I mainly want to see if this passes all tests since I can't run all locally