Implement fine grain locking for build-dir
#16155
src/cargo/core/compiler/locking.rs
```rust
/// Coarse grain locking (Profile level)
Coarse,
```
mtime changes in fingerprints due to uplifting/downlifting order
This will also be a concern for #5931
For non-local packages, we don't check mtimes. Unsure what we do for their build script runs.
I haven't quite thought through how we will handle mtime for the artifact cache.
Checksum freshness (#14136) would sure make this easier as mtimes are painful to deal with.
Might be worth exploring pushing that forward before starting on the artifact cache...
Note that mtimes are still used for build scripts
src/cargo/core/compiler/locking.rs
```rust
/// Coarse grain locking (Profile level)
Coarse,
```
tests/benches need to be run before being uplifted, OR uplifted and locked during execution, which leads to more locking design being needed. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted since it's temporary.)
Do we hold any locks during test execution today? I'm not aware of any.
hmmm, I was under the assumption that the lock in Layout was held while running the tests, but I never explicitly tested that. Upon further inspection, we do indeed release the lock before executing the tests.
src/cargo/core/compiler/locking.rs
```rust
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

self.guard = Some(UnitLockGuard {
    primary: primary_lock,
    _secondary: Some(secondary_lock),
});
Ok(())
```
Have we double checked if we run into problems like #15698?
I took a closer look and it appears that there is a possibility that a similar issue could happen.
I think in practice it would not deadlock, since failing to take a lock would result in the build failing, so the lock would be released when the process exits.
But regardless, I went ahead and added logic to unlock the partial lock if we fail to take the full lock, just in case.
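A minimal sketch of that rollback pattern, using a toy `Lock` type rather than real file locks (names and error handling are illustrative, not cargo's actual code): if taking the second lock fails, the first is released before the error propagates, so no half-taken lock outlives the failure.

```rust
// Toy stand-in for a file lock; `poisoned` simulates a lock that
// cannot be acquired (e.g. an I/O error from the OS).
struct Lock {
    held: bool,
    poisoned: bool,
}

impl Lock {
    fn new(poisoned: bool) -> Self {
        Lock { held: false, poisoned }
    }
    fn lock(&mut self) -> Result<(), String> {
        if self.poisoned {
            return Err("failed to take lock".to_string());
        }
        self.held = true;
        Ok(())
    }
    fn unlock(&mut self) {
        self.held = false;
    }
}

// Acquire `partial` then `full`; on failure, roll back `partial`
// so other processes are not blocked by a half-taken lock.
fn lock_exclusive(partial: &mut Lock, full: &mut Lock) -> Result<(), String> {
    partial.lock()?;
    if let Err(e) = full.lock() {
        partial.unlock(); // roll back the partial lock
        return Err(e);
    }
    Ok(())
}
```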
src/cargo/core/compiler/locking.rs
```rust
pub fn downgrade(&mut self) -> CargoResult<()> {
    let guard = self
        .guard
        .as_ref()
        .context("guard was None while calling downgrade")?;

    // NOTE:
    // > Subsequent flock() calls on an already locked file will convert an
    // > existing lock to the new lock mode.
    // https://man7.org/linux/man-pages/man2/flock.2.html
    //
    // However, `std::fs::File::lock`/`lock_shared` is allowed to change this in the
    // future. So it's probably up to us whether we are okay with relying on this or
    // want to use a different interface to flock.
    guard.primary.lock_shared()?;

    Ok(())
}
```
We should rely on advertised behavior, especially as I'm assuming not all platforms are backed by flock, like Windows.
We could probably move away from the std interface, or perhaps open an issue to see if T-libs would be willing to clarify the behavior of calling `lock_shared()` while holding an exclusive lock.
Side note: I came across this crate which advertises the behavior we need. It's fairly small (and MIT), so we could potentially reuse part of this code for cargo's use case. (Or use it directly, though I don't know cargo's policy on taking dependencies on third party crates.)
Any concerns with keeping the std lock for now and having an action item for this prior to stabilization?
We can move this to an Unresolved issue.
@epage I re-reviewed the changes in this PR and I believe they accomplish the first step in the plan laid out in #t-cargo > Build cache and locking design @ 💬. So I think this PR is now good to be reviewed.
src/cargo/core/compiler/mod.rs
```rust
// TODO: We should probably revalidate the fingerprint here as another Cargo instance could
// have already compiled the crate before we recv'd the lock.
// For large crates re-compiling here would be quite costly.
```
Let's move this TODO out of the code to a place we can track
> We should probably revalidate the fingerprint here as another Cargo instance could
Or we move the lock acquisition up a level to be around the fingerprinting (since that has us read the build unit) and then we move it into the job
Hmmm, good point. I overlooked locking during the fingerprint read 😓
Even if a unit is fresh, we still need to take a lock while evaluating the fingerprint.
I suppose this shouldn't be too problematic: if the unit is fresh we'd immediately unlock that unit, so lock contention stays low.
I think that change should not be too difficult.
#16155 (comment) ties into this and can also have a major effect on the design
src/cargo/core/compiler/mod.rs
```rust
let mut lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking
    && matches!(build_runner.locking_mode, LockingMode::Fine)
{
    Some(CompilationLock::new(build_runner, unit))
```
Can we have a `build_runner.unit_lock(unit)`?
src/cargo/core/compiler/layout.rs
```rust
target: Option<CompileTarget>,
dest: &str,
must_take_artifact_dir_lock: bool,
build_dir_locking_mode: &LockingMode,
```
Why do we take a reference to something that could be made Copy?
no reason, I'll make it Copy :D
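For reference, a fieldless enum like this can simply derive `Copy`, so callers take it by value instead of by `&LockingMode` (variant names mirror the PR; the function is illustrative, not cargo's API):

```rust
// A fieldless enum is trivially Copy: passing it by value is as cheap
// as passing a reference, and avoids a lifetime/indirection for no gain.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LockingMode {
    Coarse,
    Fine,
}

// Takes the mode by value; the caller keeps its own copy.
fn is_fine_grained(mode: LockingMode) -> bool {
    matches!(mode, LockingMode::Fine)
}
```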
src/cargo/core/compiler/locking.rs
```rust
    Ok(())
}

pub fn downgrade(&mut self) -> CargoResult<()> {
```
Can we clarify what this means? Maybe `downgrade_partial`?
src/cargo/core/compiler/locking.rs
```rust
pub fn lock_exclusive(&mut self) -> CargoResult<()> {
    assert!(self.guard.is_none());

    let partial = open_file(&self.partial)?;
    partial.lock()?;

    let full = open_file(&self.full)?;
    full.lock()?;

    self.guard = Some(UnitLockGuard {
        partial,
        _full: Some(full),
    });
    Ok(())
}

pub fn lock_shared(&mut self, ty: &SharedLockType) -> CargoResult<()> {
    assert!(self.guard.is_none());

    let partial = open_file(&self.partial)?;
    partial.lock_shared()?;

    let full = if matches!(ty, SharedLockType::Full) {
        let full_lock = open_file(&self.full)?;
        full_lock.lock_shared()?;
        Some(full_lock)
    } else {
        None
    };

    self.guard = Some(UnitLockGuard {
        partial,
        _full: full,
    });
    Ok(())
}
```
Unlike `Filesystem` locking, this doesn't provide a way to find out what you are blocked on, or even that we are blocked.
I doubt we can send a Blocking message to the user within our current progress system, though that is something I want to eventually redesign.
Can we at least log a message saying what is being blocked?
sure, I can add logging.
I also looked into providing feedback but:
- It was a bit tricky, as `gctx` is not available in the `Work` units that are executed by the job queue, which I believe is the primary interface to shell output.
- I wasn't quite sure of the best way to present this info to the user. I was worried about potentially flooding the screen with messages as units are unlocked and new units get blocked.
src/cargo/core/compiler/locking.rs
```rust
/// This lock is designed to reduce file descriptors by sharing a single file descriptor for a
/// given lock when the lock is shared. The motivation for this is to avoid hitting file descriptor
/// limits when fine grain locking is enabled.
pub struct RcFileLock {
```
This is a lot of complexity when we don't need to lock within our own process?
What if we tracked all locks inside of the `BuildRunner`? We could have a single lock per build unit that we grab exclusively as soon as we know the dep unit path, store them in a `HashMap<PathBuf, FileLock>` (either using `Filesystem` or adding a public constructor for `FileLock`), and hold onto them until the end of the build.
At least for a first step, it simplifies things a lot. It does mean that another build will block until this one is done if they share some build units but not all. That won't be the case for `cargo check` vs `cargo build` or for `cargo check` vs `cargo clippy`. It will be an issue for `cargo check` vs `cargo check --no-default-features`, or `cargo check` vs `cargo check --workspace`. We can at least defer that out of this initial PR and then evaluate different multi-lock designs under this scheme and how much of a need there is for them.
> What if we tracked all locks inside of the `BuildRunner`?

We could try something like this, but the `.rmeta_produce()` called for pipelined builds makes this tricky, since `build_runner` is not in scope for the `Work` closure. (Similar issue as this comment.)
Though it might be possible to plumb that over to the `JobState`.
Unsure how difficult that would be, but I can look into it.
I was suggesting we simplify things down to just one lock per build unit. We grab it when doing the fingerprint check and then hold onto it until the end. We don't need these locks for coordination within our own build; this is just for cross-process coordination. If you have two processes doing `cargo check && cargo check`, the second one will effectively be blocked on the first anyway. Having finer grained locks than this only helps when some leaf crates can be shared but nothing else, like what happens when different features are activated. This seems minor enough, especially once we get the cross-project build cache, which is where these are more likely to live and which will have a different locking scheme.
From #16155 (comment)
I updated the implementation to use a single lock per build unit as mentioned in #t-cargo > Build cache and locking design @ 💬
From my testing we now get parallelism between `cargo check` and `cargo build`, though there is some lock contention for some build units, so I think there are some other scenarios where check and build share build units. More research needed on my side to understand which.
Looks like you expanded on my idea to go ahead with an rwlock to reduce contention. I'm surprised we're still seeing any contention with that choice, unless the build unit needs to be rebuilt again for some reason.
I updated the implementation to use a single lock per build unit as mentioned in #t-cargo > Build cache and locking design @ 💬

The latest commit omits build scripts and proc macros from locking, which we will need to decide how we want to handle. (Added to the open questions list in the PR description.)

From my testing we now get parallelism between `cargo check` and `cargo build`.

Also, I think the CI failure is spurious and unrelated to my changes.
Skipping them means we have race conditions. I would assume the safer route for an MVP would be to lock everything and then iterate from there.
```rust
if let Some(lock) = locks.get_mut(&key) {
    lock.file().lock_shared()?;
} else {
    let fs = Filesystem::new(key.0.clone());
    let lock =
        fs.open_ro_shared_create(&key.0, build_runner.bcx.gctx, &format!("locking {key}"))?;
    locks.insert(key.clone(), lock);
}
```
nit: `locks.entry` may be useful
Tried that, but you run into the issue of not being able to propagate the error from `.lock_shared()` out of the `.and_modify()` closure :(
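For what it's worth, matching on the `Entry` variants directly (instead of `and_modify`) keeps `?` usable in both arms, since no closure is involved. A toy sketch with a stand-in lock type (not cargo's real `FileLock`):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Stand-in for a file lock; counts how many shared locks were taken.
#[derive(Debug)]
struct FakeLock {
    shared_count: u32,
}

impl FakeLock {
    fn lock_shared(&mut self) -> Result<(), String> {
        self.shared_count += 1;
        Ok(())
    }
}

fn open_lock() -> Result<FakeLock, String> {
    Ok(FakeLock { shared_count: 1 })
}

// `?` works in both arms because we match on the Entry rather than
// passing closures to `and_modify`/`or_insert_with`.
fn take_shared(locks: &mut HashMap<String, FakeLock>, key: &str) -> Result<(), String> {
    match locks.entry(key.to_string()) {
        Entry::Occupied(mut e) => e.get_mut().lock_shared()?,
        Entry::Vacant(v) => {
            v.insert(open_lock()?);
        }
    }
    Ok(())
}
```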
```rust
}

let lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking {
    Some(build_runner.lock_manager.lock_shared(build_runner, unit)?)
```
With the current lock upgrade setup, we can deadlock: if two processes grab the shared lock to check the fingerprint and both decide to build, each will block trying to get an exclusive lock but can't, because the other process still holds its shared lock.
Potential solutions:
- Only use exclusive locks, blocking on other builds using the same units
- Between fingerprint and build, drop the shared lock and then acquire an exclusive lock (rather than upgrade), sometimes rebuilding for the same fingerprint after blocking until the other build is complete
- Could be reduced by re-checking the fingerprint but that may not be ideal to do
To review a previous conversation: rwlocks only help when there is some overlap between builds, e.g.
- `check` vs `build` when there are build scripts and proc macros
- different packages selected
- different features selected
Yeah that is a problem.
> Between fingerprint and build, drop the shared lock and then acquire an exclusive lock (rather than upgrade), sometimes rebuilding for the same fingerprint after blocking until the other build is complete

The tricky part here is that we also need a read lock on dependency units: for checking the fingerprint we need a read lock on the dependency units so we can read the upstream fingerprints. So trying to unlock and relock may still result in deadlock.
Taking exclusive locks for fingerprints would probably be the safest against deadlocks, but would also greatly reduce the parallelism wins :/
> Yeah that is a problem.
>
> > Between fingerprint and build, drop the shared lock and then acquire an exclusive lock (rather than upgrade), sometimes rebuilding for the same fingerprint after blocking until the other build is complete
>
> The tricky part here is that we also need a read lock on dependency units. For checking the fingerprint we need a read lock on the dependency units so we can read the upstream fingerprints. So trying to unlock and relock may still result in deadlock.
The idea was:
- grab reader lock for fingerprint
- if fingerprint matches, keep lock until build end
- if fingerprint doesn't match, drop the lock
- grab an exclusive lock
- build the unit
- downgrade to reader lock
- at end of build, drop all locks
I'm missing the deadlock scenario in this
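To make the ordering concrete, here is a toy model of that drop-and-relock protocol that just records the lock-state sequence for one build unit; real code would use cross-process file locks rather than an enum (all names here are illustrative):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LockState {
    Unlocked,
    Shared,
    Exclusive,
}

// Returns the sequence of lock states one unit goes through under the
// protocol: reader lock for the fingerprint check; if fresh, keep it to
// the end of the build; if dirty, drop it, take an exclusive lock to
// build, then downgrade back to a reader lock.
fn run_unit(fingerprint_fresh: bool) -> Vec<LockState> {
    let mut states = vec![LockState::Shared]; // reader lock for fingerprint
    if fingerprint_fresh {
        return states; // fresh: keep shared lock until build end
    }
    states.push(LockState::Unlocked);  // dirty: drop the shared lock
    states.push(LockState::Exclusive); // grab exclusive lock, build the unit
    states.push(LockState::Shared);    // downgrade to reader lock
    states
}
```

Because a dirty unit fully releases its shared lock before requesting the exclusive lock, two processes can no longer hold shared locks while each waits for the other to release, which is the deadlock the upgrade path allowed.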
When we read a fingerprint, we also need to take read locks on dependency units. If we drop the lock on a dirty fingerprint should we also drop the dependency locks?
I was thinking that if we drop those as well, we might run into issues. But thinking about it a bit more, maybe that's not an issue. I think it should be safe to hold on to those shared locks.
wdyt?
In the latest push, I moved to the read for fingerprint, unlock, relock exclusive.
From my testing, it appears to be working :)
When I try to `cargo check` a crate with `tokio = { ..., features = ["full"] }` while a `cargo build` is running, I get:

```text
Blocking waiting for file lock on locking /home/ross/projects/foo/build/debug/build/proc-macro2/4b4e013b0585d84c/.lock
Blocking waiting for file lock on locking /home/ross/projects/foo/build/debug/build/quote/401da6e704cd00e6/.lock
Blocking waiting for file lock on locking /home/ross/projects/foo/build/debug/build/syn/287dc061d31804b8/.lock
    Checking libc v0.2.178
....
```

This is promising, as these are only proc-macros and we get some parallelism.
This is promising as these are only proc-macros and we get some parallelism.
I still haven't fully thought through all of the scenarios where we could end up deadlocking, but CI is passing on Windows now (which it was not previously), so it's a step in the right direction.
This PR adds fine grain locking for the build cache using build unit level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282
Previous attempt: #16089
Design decisions / rationale
Original Design
Implementation details
- `primary.lock` and `secondary.lock` (see the `locking.rs` module docs for more details on the states)
- (see `determine_locking_mode()`)
- The `.cargo-lock` lock is taken as RO shared to continue working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
- `cargo clean` continues to use coarse grain locking for simplicity.
- `UnitGraph`.
- If we exceed (number of build units * 10) file descriptors, we automatically fall back to coarse grain locking and display a warning to the user.
- `RcFileLock`: I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
- A `FileLockInterner` holds on to the file descriptors (`RcFileLock`) until the end of the process. (We could potentially add it to `JobState` if preferred; it would just be a bit more plumbing.)
- `build-dir/<profile>/.cargo-lock`, to stay backwards compatible with previous versions of cargo.

For the rationale for this design, see the discussion #t-cargo > Build cache and locking design @ 💬
Open Questions
- build-dir #16155 (comment)
- `cargo check` and `cargo build`? The current implementation skips locking them as an MVP design.
- build-dir #16155 (comment)
- build-dir #16155 (comment)
- `cargo doc`
- `cargo check` would not take a lock on the artifact-dir but `cargo build` would). This would mean that 2 `cargo build` invocations would not run in parallel, because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal to avoid 2 instances fighting over the CPU while recompiling the same crates.
- `-Zbuild-dir-new-layout`. With the current implementation I am not seeing any perf regression on Linux, but I have yet to test on Windows/macOS.
- `CARGO_BUILD_LOCKING_MODE=coarse`?