Be more careful with locking db.db_mtx #17418
Conversation
alek-p left a comment
I've already reviewed this internally, and as the PR description states, we've had a good experience running with this patch for the last couple of months.
As I see, in most cases (I've spotted only one that's different), when you are taking …
FWIW, as we're discussing here, I even think, after all the staring at the code, that the locking itself is actually fine; it seems to be a result of optimizations, precisely because things don't need to be over-locked when correctness is guaranteed via other logical dependencies. I think I have actually nailed where the problem is, but @asomers says he can't try it :)
That's because of this comment from @pcd1193182: "So the subtlety here is that the value of the db.db_data and db_buf fields are, I believe, still protected by the db_mtx plus the db_holds refcount. The contents of the buffers are protected by the db_rwlock." So many places need both …
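To make that rule concrete, here is a minimal sketch of the access pattern it implies (a hypothetical helper, not code from this PR; it assumes the caller already holds the dbuf, so db_data cannot be reallocated after db_mtx is dropped):

```c
/*
 * Sketch: read one block pointer out of an indirect dbuf under the
 * rules quoted above.  db_mtx (plus the db_holds refcount) protects
 * the db.db_data pointer value; db_rwlock protects the contents.
 */
static blkptr_t
read_one_bp(dmu_buf_impl_t *db, int idx)
{
	blkptr_t bp;

	mutex_enter(&db->db_mtx);
	blkptr_t *bps = db->db.db_data;	/* pointer value: db_mtx */
	mutex_exit(&db->db_mtx);

	rw_enter(&db->db_rwlock, RW_READER);
	bp = bps[idx];			/* buffer contents: db_rwlock */
	rw_exit(&db->db_rwlock);

	return (bp);
}
```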
I'm sorry, I mixed it up. This is definitely needed, and then there's a bug with dbuf resize. Two different things.
@asomers Are you still awaiting reviewers on this? I've been running with the changes from this PR without any issues for a while now. It would be nice to get in all the "prevents corruption" PRs before 2.4.0.
Does this apply to 2.2.8 also?
amotin left a comment
OK. I went through all this, and I believe most of the locking is not needed -- see below. Only a few I've left uncommented.
Though I see your comments, @amotin, I still struggle to understand the right thing to do, generally, because the locking requirements aren't well documented, nor are they enforced either by the compiler or at runtime. Here are the different descriptions I've seen. From dbuf.h: And here's what @pcd1193182 said in #17118: And later: But I don't see any list of what the various states are, nor how to tell which state a dbuf is in. @amotin added the following in that same discussion thread: And @amotin added some more detail in this PR:
I can't confidently make any changes here without a complete and accurate description of the locking requirements. What I need are:
@amotin can you please help with that? At least with the first part?
@asomers Let me rephrase the key points:
My humble opinion. I think it is a reasonable request to:
It is good to have optimizations, but it is not healthy that knowledge of the locking scheme is limited to a small group of people, with poor documentation and no way to examine the code for correctness.
@asomers Despite my comments on many of the changes here, IIRC there were some that could be useful. Do you plan to clean this up, document it, etc., or will I have to take it over?
Yes. My approach is to create some assertion functions which check that either db_data is locked or is in a state where it doesn't need to be. The WIP is here, but it isn't ready for review yet; probably next week: https://github.com/asomers/zfs/tree/db_data_elide
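The assertion approach might look something like this (the function name comes up later in this PR; the body here is my guess at the intent, not the actual WIP code):

```c
/*
 * Sketch: assert that it is safe to read the db.db_data pointer.
 * "Safe" means the caller holds db_mtx, or the dbuf is held and in
 * a state where the pointer cannot change out from under us.
 */
static void
assert_db_data_addr_locked(dmu_buf_impl_t *db)
{
	if (!MUTEX_HELD(&db->db_mtx)) {
		ASSERT3U(zfs_refcount_count(&db->db_holds), >, 0);
		ASSERT3U(db->db_state, ==, DB_CACHED);
	}
}
```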
@amotin I've eliminated the lock acquisitions as you requested. Please review. Note that while I've run the ZFS test suite with this round of changes, I don't know whether they suffice to solve the original corruption bug. The only way to know that is to run the code in production. But I'd like your review before I try that, because it takes quite a bit of time and effort to get sufficient production time. Not to mention the risk of corrupting customer data again.
@asomers if you can rebase this on the latest commits in the master branch, that should resolve most of the CI build failures. While you're at it, please go ahead and squash the commits.
@behlendorf I've squashed and rebased. However, it's important to remember that I've never been able to reproduce this bug on demand. I've only seen it in production. And this version of the PR was never tested in production, so I can't guarantee that it actually fixes the original bug.
Yup, understood.
```c
	if (dr->dr_dnode->dn_phys->dn_nlevels != 1) {
		parent_db = dr->dr_parent->dr_dbuf;
		assert_db_data_addr_locked(parent_db);
		rw_enter(&parent_db->db_rwlock, RW_READER);
```
If you think this lock is needed, shouldn't this block be similar to the one in dbuf_lightweight_ready() above?
"Similar" how?
Locking dr->dr_parent->dr_dbuf if it is present, or dn->dn_dbuf otherwise, as you've done two chunks above? I see dbuf_lightweight_bp() does check for dn_nlevels == 1 to decide what to access, but I wonder if it is equivalent.
If anything, I think the logic in dbuf_lightweight_ready is wrong, because dbuf_lightweight_bp doesn't actually access any db_data field if dn->dn_phys->dn_nlevels == 1. So I think that both places should look more like the code here. Do you agree? The relevant code was introduced in commit ba67d82.
If dn->dn_phys->dn_nlevels == 1, dbuf_lightweight_bp() returns a pointer into the dn_phys content, which is part of dn->dn_dbuf. And I guess access to that may need to be locked, doesn't it?
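For context, dbuf_lightweight_bp() is roughly the following (paraphrased from module/zfs/dbuf.c with the VERIFY checks elided; consult the tree for the exact code):

```c
static blkptr_t *
dbuf_lightweight_bp(dbuf_dirty_record_t *dr)
{
	dnode_t *dn = dr->dr_dnode;
	if (dn->dn_phys->dn_nlevels == 1) {
		/* The bp lives in dn_phys, which is part of dn->dn_dbuf. */
		return (&dn->dn_phys->dn_blkptr[dr->dt.dll.dr_blkid]);
	} else {
		/* The bp lives in the parent indirect block's db_data. */
		dmu_buf_impl_t *parent_db = dr->dr_parent->dr_dbuf;
		int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
		blkptr_t *bp = parent_db->db.db_data;
		return (&bp[dr->dt.dll.dr_blkid & ((1 << epbs) - 1)]);
	}
}
```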
I think it is different. I think dn_struct_rwlock protects the dnode's structure, like the number of levels, etc., while db_rwlock protects the block pointers. Though I might be wrong; I haven't looked at dn_struct_rwlock closely.
Are you sure? This is the first suggestion I've seen that dn_phys should be protected by db_rwlock. And there's existing code that uses dn_phys without first locking db_rwlock. For example, in dbuf_write_ready.
Maybe not "beyond reasonable doubt", but somewhere around "more likely than not". Why do we check db->db.db_object != DMU_META_DNODE_OBJECT in assert_db_data_contents_locked()? Because for meta dnodes we require db_rwlock to access the block pointers stored in dn_phys/dn_dbuf. At the same time, dn_struct_rwlock must be held when code depends on the dnode structure, like the number of levels, not changing. But please double-check me.
My double-check says that there are many places where dn_phys is accessed without locking db_rwlock. In fact, it's harder to find places where db_rwlock is locked. So either you are mistaken, or we need another PR to clean up dn_phys accesses just like this PR does for db_rwlock.
dn_phys includes many things aside from the block pointers, while db_rwlock protects only the block pointers. Some other fields may be protected by dn_struct_rwlock, etc.
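If that reading is right, the assertion under discussion would look roughly like this (a sketch of the idea only; the actual function in this PR may differ):

```c
/*
 * Sketch: db_data contents that hold block pointers -- indirect
 * blocks, and the meta dnode's dn_phys -- must be read under
 * db_rwlock.  Plain level-0 user data is protected by other means
 * (range locks and the dirty/transaction mechanism).
 */
static void
assert_db_data_contents_locked(dmu_buf_impl_t *db)
{
	if (db->db_level > 0 ||
	    db->db.db_object == DMU_META_DNODE_OBJECT)
		ASSERT(RW_LOCK_HELD(&db->db_rwlock));
}
```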
module/zfs/dbuf.c (outdated)

```c
	 */
	if (db->db_blkid == DMU_BONUS_BLKID || db->db_blkid == DMU_SPILL_BLKID)
		return;
	if (db->db_dirtycnt == 0)
```
I am not sure what this is doing here. If we really need to check it, then we'd need to know that we hold db_mtx.
Because of your comment "I don't believe this condition is needed (or even correct). If the db_dirtycnt below is zero (and it should be protected by db_mtx), then the buffer must be empty.". There, you told me not to bother locking db_rwlock if db_dirtycnt were below zero (but I think you meant "equal to zero").
My comment was specifically about dbuf_verify(), where the code first asserts that db_mtx is held and then verifies that a block that must never have any data does not actually have any. It does not require locking there, once all the conditions are met.
Ok, I'll remove the db_dirtycnt check here.
I see you removed the check, but added locking into dbuf_verify() instead, which, as I just said, is not needed there, since it uses a completely different condition, which it actually asserts.
Different cases are protected in different ways. db_rwlock protects only block pointers. User data are protected with range locks and the dirty/transaction mechanism. That part of the code in dbuf_verify() verifies proper use of dirty buffers and transactions. Range locks are outside of the dbuf scope, so we can't assert them in dbuf.c.
But that doesn't answer the question: what's to prevent any other thread from writing to db.db_data while this function is running?
ping @amotin. Is there anything else that prevents some other thread from writing to db.db_data while this function is running? If not, I think I should re-add the assert_db_data_contents_locked check.
If I re-add the assert_db_data_contents_locked check, would you approve the PR?
In the ZFS leadership conference call today, @amotin explained that the db_rwlock is not needed in dbuf_verify because it already checks that db_dirtycnt is 0. That means that db_data cannot possibly contain block pointers. And the db_mtx, which is held throughout that function, prevents any other thread from modifying db_dirtycnt. That makes sense.
However, I see now that the documentation in dbuf.h doesn't actually say that db_mtx protects db_dirtycnt. And in fact, some functions check db_dirtycnt without locking db_mtx, like free_children. So I guess we need to modify the documentation, and add more assertions in such functions.
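A sketch of what such an assertion could look like, e.g. as a helper that readers of db_dirtycnt go through (hypothetical, and free_children itself is trickier because taking db_mtx there can recurse, as noted in the commit messages below):

```c
/*
 * Sketch: per the rule described above, any reader of db_dirtycnt
 * must hold db_mtx; the dbuf.h documentation would be updated to
 * say so explicitly.
 */
static inline uint8_t
dbuf_dirtycnt(dmu_buf_impl_t *db)
{
	ASSERT(MUTEX_HELD(&db->db_mtx));
	return (db->db_dirtycnt);
}
```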
That new panic must be the result of removing the …
module/zfs/dbuf.c (outdated)

```c
	 * partially fill in a hole.
	 */
	if (db->db_dirtycnt == 0) {
		rw_enter(&db->db_rwlock, FALSE);
```
Why are we passing FALSE here? Shouldn't it be RW_READER or RW_WRITER? FALSE is presumably 0, which will panic on Linux.
Good catch. That may have snuck in as the result of a botched rebase. In any case, I'll fix it.
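For reference, the fix is presumably just to pass an explicit krw_t value; assuming this path only reads the buffer, that would be (sketch):

```c
	if (db->db_dirtycnt == 0) {
		rw_enter(&db->db_rwlock, RW_READER);
```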
scripts/zfs-tests.sh (outdated)

```sh
DEFAULT_RUNFILES="common.run,$(uname | tr '[:upper:]' '[:lower:]').run"
RUNFILES=${RUNFILES:-$DEFAULT_RUNFILES}
FILEDIR=${FILEDIR:-/var/tmp}
DISKS="/dev/vtbd1 /dev/vtbd2 /dev/vtbd3 /dev/vtbd4 /dev/vtbd5"
```
Doesn't this break the zfs test suite on every other platform?
Oops. That was a change that I made locally. I never intended to commit it. I'll revert.
There are suddenly a lot of "Wrong value for OS variable!" failures. I think that …
I am not sure what it means, but when did you last rebase?
Not since September. I've avoided doing that, since it can make the review confusing. But I'll do it now.
Signed-off-by: Alan Somers <asomers@gmail.com>
Lock db_mtx in some places that access db->db_data. But in some places, add assertions that the dbuf is in a state where it will not be copied, rather than locking it.

Lock db_rwlock in some places that access db->db.db_data's contents. But in some places, add assertions that should guarantee the buffer is being accessed by one thread only, rather than locking it.

Closes openzfs#16626
Sponsored by: ConnectWise
Signed-off-by: Alan Somers <asomers@gmail.com>
1) It wasn't actually checking the rwlock for indirect blocks.
2) Per @amotin, "DMU_BONUS_BLKID and DMU_SPILL_BLKID can only exist at level 0", so it was redundantly checking the blkid.
The assertion was no longer true after removing the check for db_dirtycnt in the previous commit.
Either this function needs to acquire db_rwlock, or we need some guarantee that no other thread can modify db_data while db_dirtycnt == 0. From what @amotin said, it sounds like there is no guarantee.
The meta dnode may have bonus or spill blocks, but we don't need to lock db_data for those.
Delete unintended change
According to @amotin that was always the intention. But it wasn't documented, and in practice wasn't always done. Also, don't lock db_rwlock during dbuf_verify. Since db_dirtycnt == 0, we don't need to.
Lock db->db_mtx in some places that access db->db_data. But don't lock it in free_children, even though it does access db->db_data, because that leads to a recurse-on-non-recursive panic.
Lock db->db_rwlock in some places that access db->db.db_data's contents.
Closes #16626
Sponsored by: ConnectWise
Motivation and Context
Fixes occasional in-memory corruption which is usually manifested as a panic with a message like "blkptr XXX has invalid XXX" or "blkptr XXX has no valid DVAs". I suspect that some on-disk corruption bugs have been caused by this same root cause, too.
Description
Always lock dmu_buf_impl_t.db_mtx in places that access the value of dmu_buf_impl_t.db->db_data. And always lock dmu_buf_impl_t.db_rwlock in places that access the contents of dmu_buf_impl_t.db->db_data.

Note that free_children still violates these rules. It can't easily be fixed without causing other problems. A proper fix is left for the future.

How Has This Been Tested?
I cannot reproduce the bug on demand, so I had to rely on statistics to validate the patch.