cachedb_redis: add ASK redirect handling for cluster resharding#3852
Merged
liviuchircu merged 1 commit intoOpenSIPS:masterfrom Apr 22, 2026
Merged
Conversation
Member
|
This appears to be a partial duplicate of #3815 |
Contributor
Author
|
Thanks for pointing that out @NormB. You're right — the hash slot fix ( The unique part of this PR is the ASK redirect handling (detection + Happy to rebase and drop the hash slot change once #3815 merges, keeping only the ASK redirect logic. Or if you'd prefer, I can split this into a separate ASK-only PR right now. Let me know what works best. |
This was referenced Mar 30, 2026
Add support for Redis ASK redirects during cluster resharding. When a slot is being migrated between nodes, Redis returns an ASK response instead of MOVED. Unlike MOVED (permanent redirect), ASK is a one-time redirect that requires sending the ASKING command to the target node before retrying the original query. The implementation: - Detects ASK responses alongside existing MOVED handling - Sends ASKING command to the target node before retrying - Reuses the MOVED redirect infrastructure (endpoint lookup, reconnection, retry logic) Also refactor parse_moved_reply() into parse_redirect_reply() that accepts the prefix as a parameter, with inline wrappers parse_moved_reply() and parse_ask_reply() for backward compatibility. Partially addresses OpenSIPS#2811
f3ee6c8 to
fc8a736
Compare
Contributor
Author
NormB
added a commit
to NormB/opensips-archive
that referenced
this pull request
Apr 20, 2026
With use_tls=1 on cachedb_redis, every SIP worker that touches a
Redis/Valkey connection SIGSEGVs on the first SSL_new. Two
independent problems stacked on the same line:
1. The old code indexed d->ctx as a per-process array:
SSL_new(((void**)d->ctx)[process_no])
This predates the tls_mgm refactor (upstream 11aa15d,
'refactor TCP/TLS code to have everything in a single process'),
after which ctx became a single SSL_CTX *. Every worker with
process_no > 0 read garbage past the SSL_CTX struct.
2. Even after fixing the dereference, d->ctx is not safely usable
from sibling workers. tls_mgm builds its SSL_CTX in
child_init(PROC_TCP_MAIN), which runs *after* SIP workers and
the MI process have already forked. So those workers see
d->ctx == NULL when their cachedb_redis:child_init ->
cachedb_do_init -> redis_init_ssl path fires. Even waiting for
TCP_MAIN doesn't help: the SSL_CTX was built post-fork in
TCP_MAIN's address space, and OpenSSL embeds pointers into
per-process storage inside the CTX (method tables, providers,
OSSL_LIB_CTX).
Fix: each worker builds its own SSL_CTX from the static
configuration on the tls_domain (cert, pkey, ca, verify_cert).
CTXs are cached per-worker, keyed by tls_domain pointer, so
multiple connections sharing a domain share one CTX inside that
worker. Floors min proto at TLS 1.2 to match Valkey/Redis 7+
defaults and the tls_method=TLSv1_2 that most tls_mgm configs
already set.
Also tightens the adjacent error paths: SSL_free the SSL on
redisInitiateSSL failure, and promote the printf in that path to
LM_ERR for consistency.
Note: a companion parser fix for 'tls-port' in parse_cluster_shards
is still needed for TLS cluster topology updates to use the right
port on MOVED/ASK redirects, but depends on PR OpenSIPS#3852
(parse_cluster_shards introduction); it can land as a small
follow-up once that PR is merged.
liviuchircu
approved these changes
Apr 22, 2026
liviuchircu
pushed a commit
that referenced
this pull request
Apr 22, 2026
Replace the static cluster topology (built once at startup, never
refreshed) with runtime discovery and automatic refresh:
Topology discovery and refresh:
- Probe CLUSTER SHARDS (Redis 7+) with fallback to CLUSTER SLOTS
(Redis 3+) for backward compatibility
- O(1) slot_table[16384] lookup replaces per-query linked-list scan
- Automatic topology refresh on MOVED redirect, connection failure,
or query targeting an unmapped slot (rate-limited to 1/sec)
- Dynamic node creation when MOVED points to an unknown endpoint
- Stale node pruning during refresh with safe connection cleanup
- Cap redirect loop at 5 max redirects to prevent worker hang on
pathological cluster state
Cluster observability via MI commands:
- redis_cluster_info: full topology dump including per-node connection
status, slot assignments, query/error/moved/ask counters, and
last activity timestamp
- redis_cluster_refresh: trigger manual topology refresh (bypasses
rate limit)
- redis_ping_nodes: per-node PING with microsecond latency reporting
- All MI commands support optional group filter parameter
Statistics:
- redis_queries, redis_queries_failed, redis_moved, redis_ask,
redis_topology_refreshes (module-level stat counters)
- Per-node query, error, moved, ask counters in redis_cluster_info
Hash slot correctness:
- Hash tag {…} extraction per Redis Cluster specification
- CRC16 modulo 16384 replaces bitwise AND with slots_assigned
ASK redirect handling:
- Detect ASK responses alongside existing MOVED handling
- Send ASKING command to target node before retrying original query
- Do not update slot map (ASK is a temporary mid-migration redirect)
- Refactor parse_moved_reply into parse_redirect_reply with prefix
parameter; inline wrappers for backward compatibility
Connection reliability:
- TCP keepalive via redis_keepalive parameter (default 10s)
- Stack allocation for redis_moved structs (eliminates OOM paths)
- NULL guards on malformed CLUSTER SHARDS/SLOTS reply elements
- Integer overflow protection in slot and port parsing
- NULL guards in MI command handlers for group_name/initial_url
Documentation:
- New section: Redis Cluster Support (topology discovery, automatic
refresh, MOVED/ASK handling, hash tags)
- MI command reference: redis_cluster_info, redis_cluster_refresh,
redis_ping_nodes
- Authentication URL format documentation (classic, ACL, no-auth)
- New parameter: redis_keepalive
Test suite (186 tests):
- C unit tests: hash slot calculation (37), MI counter helpers (41)
- Integration: topology startup (12), ASK redirect (16), topology
refresh (13), MI commands (50), edge cases (16)
- Trap EXIT handlers for safe cluster state restoration
- python3 preflight checks for JSON-dependent tests
Depends on: #3815 (hash tag + modulo fix), #3852 (ASK redirect)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add support for Redis ASK redirects during cluster resharding, completing the cluster redirect handling started in PR #3639.
Background
When a Redis cluster is resharding (migrating slots between nodes), the source node returns an
ASKresponse instead ofMOVED. The key difference:ASKINGto the target node before retryingWithout ASK handling, queries during resharding operations fail with unhandled error responses.
Implementation
ASKINGcommand to the target node before retrying the original queryparse_moved_reply()intoparse_redirect_reply()that accepts the prefix as a parameter, with inline wrappersparse_moved_reply()andparse_ask_reply()for backward compatibilityNote on hash slot fix
The original version of this PR also included a hash slot calculation fix. That has been removed per @NormB's feedback, as his PR #3815 addresses this more comprehensively (including hash tag support). This PR now contains only the ASK redirect handling.
Testing
parse_redirect_reply()refactor passes integration tests with live slot migrations across a 3-node clusterPartially addresses #2811