nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock #363
The blk_mq_{add,del}_queue_tag_set() functions add queues to and remove
queues from a tagset. They make sure that the tagset and its queues are
marked as shared when two or more queues are attached to the same
tagset. A tagset starts out unshared; when the number of attached queues
reaches two, blk_mq_add_queue_tag_set() marks it as shared along with
all the queues attached to it. When the number of attached queues drops
to one, blk_mq_del_queue_tag_set() needs to mark both the tagset and the
remaining queue as unshared.
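For illustration, here is a condensed sketch of the add-side transition
just described. It follows the shape of the helpers in block/blk-mq.c,
with the freeze-and-update walk abridged behind
blk_mq_update_tag_set_shared():

        static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
                                             struct request_queue *q)
        {
                mutex_lock(&set->tag_list_lock);

                /* A second queue is being attached: go shared. */
                if (!list_empty(&set->tag_list) &&
                    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
                        set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
                        /* Freezes every attached queue, sets the shared
                         * flag on each hctx, then unfreezes. */
                        blk_mq_update_tag_set_shared(set, true);
                }
                if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
                        queue_set_hctx_shared(q, true);
                list_add_tail(&q->tag_set_list, &set->tag_list);

                mutex_unlock(&set->tag_list_lock);
        }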
Both functions need to freeze the current queues in the tagset before
setting or unsetting the BLK_MQ_F_TAG_QUEUE_SHARED flag. While doing so,
both functions hold the set->tag_list_lock mutex, which makes sense: we
do not want queues to be added or deleted in the process. This used to
work fine until commit 98d81f0 ("nvme: use blk_mq_[un]quiesce_tagset")
made the nvme driver quiesce the tagset instead of quiescing individual
queues. blk_mq_quiesce_tagset() quiesces the queues in set->tag_list
while also holding set->tag_list_lock.
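After that commit, the quiesce path looks roughly like this (a sketch
modeled on the upstream helper; the skip-quiesce special case is
elided):

        void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
        {
                struct request_queue *q;

                /* Same mutex the add/del paths hold across their freeze. */
                mutex_lock(&set->tag_list_lock);
                list_for_each_entry(q, &set->tag_list, tag_set_list)
                        blk_mq_quiesce_queue_nowait(q);
                blk_mq_wait_quiesce_done(set);
                mutex_unlock(&set->tag_list_lock);
        }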
This results in a deadlock between two threads, with the following
stacktraces:
__schedule+0x48e/0xed0
schedule+0x5a/0xc0
schedule_preempt_disabled+0x11/0x20
__mutex_lock.constprop.0+0x3cc/0x760
blk_mq_quiesce_tagset+0x26/0xd0
nvme_dev_disable_locked+0x77/0x280 [nvme]
nvme_timeout+0x268/0x320 [nvme]
blk_mq_handle_expired+0x5d/0x90
bt_iter+0x7e/0x90
blk_mq_queue_tag_busy_iter+0x2b2/0x590
? __blk_mq_complete_request_remote+0x10/0x10
? __blk_mq_complete_request_remote+0x10/0x10
blk_mq_timeout_work+0x15b/0x1a0
process_one_work+0x133/0x2f0
? mod_delayed_work_on+0x90/0x90
worker_thread+0x2ec/0x400
? mod_delayed_work_on+0x90/0x90
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2d/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20

__schedule+0x48e/0xed0
schedule+0x5a/0xc0
blk_mq_freeze_queue_wait+0x62/0x90
? destroy_sched_domains_rcu+0x30/0x30
blk_mq_exit_queue+0x151/0x180
disk_release+0xe3/0xf0
device_release+0x31/0x90
kobject_put+0x6d/0x180
nvme_scan_ns+0x858/0xc90 [nvme_core]
? nvme_scan_work+0x281/0x560 [nvme_core]
nvme_scan_work+0x281/0x560 [nvme_core]
process_one_work+0x133/0x2f0
? mod_delayed_work_on+0x90/0x90
worker_thread+0x2ec/0x400
? mod_delayed_work_on+0x90/0x90
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x2d/0x50
? kthread_complete_and_exit+0x20/0x20
ret_from_fork_asm+0x11/0x20
The top stacktrace shows nvme_timeout() being called to handle an nvme
command timeout. The timeout handler tries to disable the controller,
and as a first step it calls blk_mq_quiesce_tagset() to tell blk-mq not
to invoke the queue callback handlers. This thread is stuck waiting for
set->tag_list_lock as it tries to walk the queues in set->tag_list.
The lock is held by the second thread in the bottom stack, which is
waiting for one of the queues to be frozen. The queue usage counter
would only drop to zero once nvme_timeout() finishes, and that will
never happen because the timeout thread is waiting on this mutex
forever.
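In short, the circular wait is:

        timeout worker:  holds a timed-out in-flight request
                         -> waits on set->tag_list_lock in
                            blk_mq_quiesce_tagset()
        scan worker:     holds set->tag_list_lock
                         -> waits in blk_mq_freeze_queue_wait() for the
                            queue usage counter to drop, which needs the
                            timed-out request to finish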
Convert the set->tag_list_lock mutex to a set->tag_list_rwsem
rw_semaphore to avoid the deadlock. Update blk_mq_[un]quiesce_tagset()
to take the semaphore for read, since that is enough to guarantee no
queues will be added or removed. Update blk_mq_{add,del}_queue_tag_set()
to take the semaphore for write while updating set->tag_list, and to
downgrade it to read while freezing the queues. It should be safe to
update set->flags and hctx->flags while holding the semaphore for read,
since the queues are already frozen.
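A sketch of the resulting locking under the scheme described above;
going_shared() and mark_tag_set_shared() are hypothetical stand-ins for
the transition check and the freeze-and-flag walk:

        /* Readers: quiescing only walks the list, so shared access is
         * enough, and it no longer blocks behind a holder that is
         * merely freezing queues. */
        void blk_mq_quiesce_tagset(struct blk_mq_tag_set *set)
        {
                struct request_queue *q;

                down_read(&set->tag_list_rwsem);
                list_for_each_entry(q, &set->tag_list, tag_set_list)
                        blk_mq_quiesce_queue_nowait(q);
                blk_mq_wait_quiesce_done(set);
                up_read(&set->tag_list_rwsem);
        }

        /* Writers: exclusive access while editing set->tag_list,
         * downgraded to read before freezing so that a concurrent
         * quiesce can still proceed. */
        static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
                                             struct request_queue *q)
        {
                down_write(&set->tag_list_rwsem);
                list_add_tail(&q->tag_set_list, &set->tag_list);
                if (going_shared(set)) {  /* hypothetical check */
                        downgrade_write(&set->tag_list_rwsem);
                        /* hypothetical: freeze queues, set
                         * BLK_MQ_F_TAG_QUEUE_SHARED on set and hctxs,
                         * unfreeze -- all under the read lock */
                        mark_tag_set_shared(set);
                        up_read(&set->tag_list_rwsem);
                        return;
                }
                up_write(&set->tag_list_rwsem);
        }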
Fixes: 98d81f0 ("nvme: use blk_mq_[un]quiesce_tagset")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Pull request for series with
subject: nvme: Convert tag_list mutex to rwsemaphore to avoid deadlock
version: 1
url: https://patchwork.kernel.org/project/linux-block/list/?series=1023161