-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Description
Summary
When a user thread (created via ngx.thread.spawn) internally uses coroutine.resume() to run a child coroutine that performs cosocket I/O, and the parent user thread is subsequently killed via ngx.thread.kill(), a phantom ctx->uthreads-- occurs when the child coroutine later completes. This causes ctx->uthreads to underflow to -1, triggering an assertion failure (ctx->uthreads == 0) in ngx_http_lua_finalize_threads().
Root Cause
GDB Backtrace
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1 0x00007f72e2693f4f in __pthread_kill_internal (signo=6, threadid=<optimized out>)
at ./nptl/pthread_kill.c:78
#2 0x00007f72e2644fb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f72e262f472 in __GI_abort () at ./stdlib/abort.c:79
#4 0x00007f72e262f395 in __assert_fail_base (
fmt=0x7f72e27a4a90 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x56076190bd70 "ctx->uthreads == 0",
file=file@entry=0x56076190c2e0 "../ngx_lua-0.10.28/src/ngx_http_lua_util.c",
line=line@entry=3352,
function=function@entry=0x56076190d360 <__PRETTY_FUNCTION__.15> "ngx_http_lua_finalize_threads") at ./assert/assert.c:94
#5 0x00007f72e263dec2 in __GI___assert_fail (
assertion=assertion@entry=0x56076190bd70 "ctx->uthreads == 0",
file=file@entry=0x56076190c2e0 "../ngx_lua-0.10.28/src/ngx_http_lua_util.c",
line=line@entry=3352,
function=function@entry=0x56076190d360 <__PRETTY_FUNCTION__.15> "ngx_http_lua_finalize_threads") at ./assert/assert.c:103
#6 0x000056076131c739 in ngx_http_lua_finalize_threads (r=0x56078266f590,
ctx=0x560782ed37d0, L=0x7f72e25c5380) at ../ngx_lua-0.10.28/src/ngx_http_lua_util.c:3352
#7 0x0000560761209236 in ngx_destroy_pool (pool=pool@entry=0x5607827f9f80)
at src/core/ngx_palloc.c:57
#8 0x000056076131fe0b in ngx_http_lua_close_fake_connection (c=0x560781a24270)
at ../ngx_lua-0.10.28/src/ngx_http_lua_util.c:3842
#9 0x0000560761209236 in ngx_destroy_pool (pool=0x560782e7aea0) at src/core/ngx_palloc.c:57
#10 0x000056076126d9b7 in ngx_http_close_connection (c=c@entry=0x560781a244a0)
at src/http/ngx_http_request.c:4051
#11 0x0000560761272a2b in ngx_http_ssl_handshake_handler (c=0x560781a244a0)
at src/http/ngx_http_request.c:1020
#12 0x000056076122c934 in ngx_event_expire_timers () at src/event/ngx_event_timer.c:94
#13 0x000056076122c3d4 in ngx_process_events_and_timers (cycle=cycle@entry=0x560782c80330)
at src/event/ngx_event.c:271
#14 0x0000560761236176 in ngx_worker_process_cycle (cycle=cycle@entry=0x560782c80330,
data=data@entry=0x3) at src/os/unix/ngx_process_cycle.c:798
#15 0x0000560761234646 in ngx_spawn_process (cycle=cycle@entry=0x560782c80330,
proc=proc@entry=0x560761236100 <ngx_worker_process_cycle>, data=data@entry=0x3,
name=name@entry=0x5607618ece4e "worker process", respawn=respawn@entry=-4)
at src/os/unix/ngx_process.c:207
#16 0x0000560761235548 in ngx_start_worker_processes (cycle=cycle@entry=0x560782c80330, n=4,
type=type@entry=-4) at src/os/unix/ngx_process_cycle.c:386
#17 0x00005607612373f4 in ngx_master_process_cycle (cycle=0x560782c80330)
at src/os/unix/ngx_process_cycle.c:242
#18 0x000056076120691d in main (argc=<optimized out>, argv=<optimized out>)
at src/core/nginx.c:387
(gdb) f 6
#6 0x000056076131c739 in ngx_http_lua_finalize_threads (r=0x56078266f590, ctx=0x560782ed37d0, L=0x7f72e25c5380) at ../ngx_lua-0.10.28/src/ngx_http_lua_util.c:3352
(gdb) p ctx->uthreads
$1 = -1
In ngx_http_lua_util.c, the user_co_done label handles coroutine completion by walking up the parent chain:
user_co_done:
// ...
if (ctx->cur_co_ctx->is_uthread) {
ngx_http_lua_del_thread(r, L, ctx, ctx->cur_co_ctx);
ctx->uthreads--;
}The problem is that ngx_http_lua_del_thread() checks co_ref == LUA_NOREF and returns early if the thread was already deleted, but ctx->uthreads-- executes unconditionally after del_thread() returns. This means a thread that was already cleaned up by ngx.thread.kill() gets its uthreads counter decremented a second time.
The conditions for triggering this are:
-
User thread T uses
coroutine.resume(child): T yields withco_op = NGX_HTTP_LUA_USER_CORO_RESUME. At this point, T'sco_ctx->cleanupisNULLbecause T yielded forcoroutine.resume, not for socket I/O. -
Child coroutine does cosocket I/O: The child's
co_ctx->cleanupis set tongx_http_lua_coctx_cleanup, andu->read_co_ctxpoints to the child'sco_ctx. -
T is killed via
ngx.thread.kill(T):cleanup_pending_operation(T->co_ctx)is a no-op (T's cleanup is NULL).del_thread(T)setsco_ref = LUA_NOREFandco_status = CO_DEAD, but does not clearis_uthread. The child's socket I/O is not cancelled because the kill only operates on T'sco_ctx, not the child's. -
Child's socket event fires: The child resumes and completes normally. In
run_threadcase 0, the child is not a user thread, so it falls through to theuser_co_donelabel.next_coctx = child->parent_co_ctx = T->co_ctx(the killed thread).ctx->cur_co_ctxis set to T'sco_ctx, which hasis_uthread = 1. Thedel_thread()call returns early (co_ref is alreadyLUA_NOREF), butctx->uthreads--still executes — a phantom decrement.
Sequence Diagram
Entry thread Thread T (uthread) Child coroutine
| | |
|-- ngx.thread.spawn(T) --> | |
| |-- coroutine.resume(child) -->|
| | (T yields, cleanup=NULL) |
| | |-- sock:receive()
| | | (child yields for I/O)
| | |
|-- ngx.thread.wait(T,...) -| |
| (sets T.waited_by_parent=1) |
| | |
|-- ngx.thread.kill(T) ----| |
| cleanup_pending_op: NO-OP (cleanup=NULL) |
| del_thread: co_ref=NOREF, co_status=DEAD |
| is_uthread=1 NOT cleared |
| child's socket NOT cancelled! |
| uthreads-- (correct, 1st decrement) |
| |
|-- ngx.thread.spawn(probe) --> |
| |
| Socket data arrives ------> |
| |-- completes
| |
| run_thread: user_co_done |
| ctx->cur_co_ctx = T.co_ctx (killed!) |
| is_uthread=1 → del_thread (returns early) |
| uthreads-- (PHANTOM! 2nd decrement) |
| |
|-- finalize_threads: finds probe thread alive |
| but uthreads already 0 → uthreads-- → -1 → CRASH! |
Reproduction
The bug requires specific timing: the killed thread's child coroutine must complete before finalize_threads cleans it up. This happens when the child's socket event arrives in the same epoll batch as other events being processed.
Minimal test case (Test::Nginx)
=== TEST: phantom uthreads-- via killed uthread with live child coroutine
--- config
location = /t {
rewrite_by_lua_block {
ngx.ctx._rewrite = true
}
access_by_lua_block {
local redis_port = $TEST_NGINX_REDIS_PORT
local dns_threads = {}
-- T1: fast thread (completes immediately)
dns_threads[1] = ngx.thread.spawn(function()
local sock = ngx.socket.tcp()
sock:settimeout(2000)
sock:connect("127.0.0.1", redis_port)
sock:send("PING\r\n")
local line = sock:receive()
sock:setkeepalive()
return line
end)
-- T2: uses coroutine.resume with child doing socket I/O
dns_threads[2] = ngx.thread.spawn(function()
local child = coroutine.create(function()
local sock = ngx.socket.tcp()
sock:settimeout(2000)
sock:connect("127.0.0.1", redis_port)
sock:send("PING\r\n")
local line = sock:receive()
sock:setkeepalive()
return line
end)
local ok, res = coroutine.resume(child)
return res
end)
-- Wait for first result, then kill remaining
local ok, res = ngx.thread.wait(dns_threads[1], dns_threads[2])
for _, t in ipairs(dns_threads) do
ngx.thread.kill(t)
end
-- Spawn probe threads
local probe_threads = {}
for i = 1, 10 do
probe_threads[i] = ngx.thread.spawn(function(idx)
local sock = ngx.socket.tcp()
sock:settimeout(2000)
sock:connect("127.0.0.1", redis_port)
sock:send("PING\r\n")
local line = sock:receive()
sock:setkeepalive()
return line
end, i)
end
local ok_count = 0
for i = 1, #probe_threads do
local ok, res = ngx.thread.wait(probe_threads[i])
if ok and res then ok_count = ok_count + 1 end
end
ngx.ctx.ok_count = ok_count
}
content_by_lua_block {
-- reset_ctx → finalize_threads triggers the assertion here
ngx.say("ok_count=", ngx.ctx.ok_count)
}
}
--- request
GET /t
--- response_body
ok_count=10The crash is timing-dependent: T1 and T2's child must both send Redis PING so their responses arrive in the same epoll batch. T1's event is processed first (entry thread resumes, kills T2, spawns probes), then the child's event fires (phantom decrement). Use repeat_each(50) or higher to increase reproduction probability.