Implement fine grain locking for build-dir
#16155
src/cargo/core/compiler/locking.rs
```rust
/// Coarse grain locking (Profile level)
Coarse,
```
mtime changes in fingerprints due to uplifting/downlifting order
This will also be a concern for #5931
For non-local packages, we don't check mtimes. Unsure what we do for their build script runs.
I haven't quite thought through how we will handle mtime for the artifact cache.
Checksum freshness (#14136) would sure make this easier as mtimes are painful to deal with.
Might be worth exploring pushing that forward before starting on the artifact cache...
Note that mtimes are still used for build scripts
src/cargo/core/compiler/locking.rs
```rust
/// Coarse grain locking (Profile level)
Coarse,
```
tests/benches need to be run before being uplifted, OR uplifted and locked during execution, which leads to more locking design being needed. (Also, running pre-uplift introduces other potential side effects, like the path displayed to the user being deleted since it's temporary.)
Do we hold any locks during test execution today? I'm not aware of any.
hmmm, I was under the assumption that the lock in Layout was held while running the tests, but I never explicitly tested that. Upon further inspection, we do indeed release the lock before executing the tests.
src/cargo/core/compiler/locking.rs
```rust
let primary_lock = open_file(&self.primary)?;
primary_lock.lock()?;

let secondary_lock = open_file(&self.secondary)?;
secondary_lock.lock()?;

self.guard = Some(UnitLockGuard {
    primary: primary_lock,
    _secondary: Some(secondary_lock),
});
Ok(())
```
Have we double checked if we run into problems like #15698?
I took a closer look and it appears that there is a possibility that a similar issue could happen.
I think in practice it would not deadlock, since failing to take a lock would result in the build failing, so the lock would be released when the process exits.
But regardless, I went ahead and added logic to unlock the partial lock if we fail to take the full lock, just in case.
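A minimal sketch of that rollback pattern, using a toy `Lock` type rather than real file locks (names and error handling are illustrative, not cargo's actual code): if taking the second lock fails, the first is released before the error propagates, so no half-taken lock outlives the failure.

```rust
// Toy stand-in for a file lock; `poisoned` simulates a lock that
// cannot be acquired (e.g. an I/O error from the OS).
struct Lock {
    held: bool,
    poisoned: bool,
}

impl Lock {
    fn new(poisoned: bool) -> Self {
        Lock { held: false, poisoned }
    }
    fn lock(&mut self) -> Result<(), String> {
        if self.poisoned {
            return Err("failed to take lock".to_string());
        }
        self.held = true;
        Ok(())
    }
    fn unlock(&mut self) {
        self.held = false;
    }
}

// Acquire `partial` then `full`; on failure, roll back `partial`
// so other processes are not blocked by a half-taken lock.
fn lock_exclusive(partial: &mut Lock, full: &mut Lock) -> Result<(), String> {
    partial.lock()?;
    if let Err(e) = full.lock() {
        partial.unlock(); // roll back the partial lock
        return Err(e);
    }
    Ok(())
}
```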
src/cargo/core/compiler/locking.rs
```rust
pub fn downgrade(&mut self) -> CargoResult<()> {
    let guard = self
        .guard
        .as_ref()
        .context("guard was None while calling downgrade")?;

    // NOTE:
    // > Subsequent flock() calls on an already locked file will convert an
    // > existing lock to the new lock mode.
    // https://man7.org/linux/man-pages/man2/flock.2.html
    //
    // However, `std::fs::File::lock`/`lock_shared` is allowed to change this in the
    // future. So it's probably up to us whether we are okay with relying on this or
    // want to use a different interface to flock.
    guard.primary.lock_shared()?;

    Ok(())
}
```
We should rely on advertised behavior, especially as I'm assuming not all platforms are backed by flock, like Windows.
We could probably move away from the std interface, or perhaps open an issue to see if T-libs would be willing to clarify the behavior of calling `lock_shared()` while holding an exclusive lock.
Side note: I came across this crate which advertises the behavior we need. It's fairly small (and MIT), so we could potentially reuse part of this code for cargo's use case. (Or use it directly, though I don't know cargo's policy on taking dependencies on third party crates.)
Any concerns with keeping the std lock for now and having an action item for this prior to stabilization?
We can move this to an Unresolved issue.
@epage I re-reviewed the changes in this PR and I believe they accomplish the first step in the plan laid out in #t-cargo > Build cache and locking design @ 💬. So I think this PR is now good to be reviewed.
src/cargo/core/compiler/mod.rs
```rust
// TODO: We should probably revalidate the fingerprint here as another Cargo instance could
// have already compiled the crate before we recv'd the lock.
// For large crates re-compiling here would be quite costly.
```
Let's move this TODO out of the code to a place we can track
> We should probably revalidate the fingerprint here as another Cargo instance could
Or we move the lock acquisition up a level to be around the fingerprinting (since that has us read the build unit) and then we move it into the job
Hmmm, good point. I overlooked locking during the fingerprint read 😓
Even if a unit is fresh, we still need to take a lock while evaluating the fingerprint.
I suppose this shouldn't be too problematic: if the unit is fresh we'd immediately unlock that unit, so lock contention stays low.
I think that change should not be too difficult.
#16155 (comment) ties into this and can also have a major effect on the design
src/cargo/core/compiler/mod.rs
```rust
let mut lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking
    && matches!(build_runner.locking_mode, LockingMode::Fine)
{
    Some(CompilationLock::new(build_runner, unit))
```
Can we have a `build_runner.unit_lock(unit)`?
src/cargo/core/compiler/layout.rs
```rust
target: Option<CompileTarget>,
dest: &str,
must_take_artifact_dir_lock: bool,
build_dir_locking_mode: &LockingMode,
```
Why do we take a reference to something that could be made Copy?
no reason, I'll make it Copy :D
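For reference, a fieldless enum like this can simply derive `Copy`, so callers take it by value instead of by `&LockingMode` (variant names mirror the PR; the function is illustrative, not cargo's API):

```rust
// A fieldless enum is trivially Copy: passing it by value is as cheap
// as passing a reference, and avoids a lifetime/indirection for no gain.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LockingMode {
    Coarse,
    Fine,
}

// Takes the mode by value; the caller keeps its own copy.
fn is_fine_grained(mode: LockingMode) -> bool {
    matches!(mode, LockingMode::Fine)
}
```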
src/cargo/core/compiler/locking.rs
```rust
    Ok(())
}

pub fn downgrade(&mut self) -> CargoResult<()> {
```
Can we clarify what this means? Maybe `downgrade_partial`?
src/cargo/core/compiler/locking.rs
```rust
pub fn lock_exclusive(&mut self) -> CargoResult<()> {
    assert!(self.guard.is_none());

    let partial = open_file(&self.partial)?;
    partial.lock()?;

    let full = open_file(&self.full)?;
    full.lock()?;

    self.guard = Some(UnitLockGuard {
        partial,
        _full: Some(full),
    });
    Ok(())
}

pub fn lock_shared(&mut self, ty: &SharedLockType) -> CargoResult<()> {
    assert!(self.guard.is_none());

    let partial = open_file(&self.partial)?;
    partial.lock_shared()?;

    let full = if matches!(ty, SharedLockType::Full) {
        let full_lock = open_file(&self.full)?;
        full_lock.lock_shared()?;
        Some(full_lock)
    } else {
        None
    };

    self.guard = Some(UnitLockGuard {
        partial,
        _full: full,
    });
    Ok(())
}
```
Unlike `Filesystem` locking, this doesn't provide a way to find out what you are blocked on, or even that we are blocked.
I doubt we can send a Blocking message to the user within our current progress system, though that is something I want to eventually redesign.
Can we at least log a message saying what is being blocked?
sure, I can add logging.
I also looked into providing feedback but:
- It was a bit tricky, as `gctx` is not available in the `Work` units that are executed by the job queue, which I believe is the primary interface to shell output.
- I wasn't quite sure of the best way to present this info to the user. I was worried about potentially flooding the screen with messages as units are unlocked and new units get blocked.
src/cargo/core/compiler/locking.rs
```rust
/// This lock is designed to reduce file descriptors by sharing a single file descriptor for a
/// given lock when the lock is shared. The motivation for this is to avoid hitting file descriptor
/// limits when fine grain locking is enabled.
pub struct RcFileLock {
```
This is a lot of complexity when we don't need to lock within our own process?
What if we tracked all locks inside of the `BuildRunner`? We could have a single lock per build unit that we grab exclusively as soon as we know the dep unit path, store them in a `HashMap<PathBuf, FileLock>` (either using `Filesystem` or adding a public constructor for `FileLock`), and hold onto them until the end of the build.
At least for a first step, it simplifies things a lot. It does mean that another build will block until this one is done if they share some build units but not all. That won't be the case for `cargo check` vs `cargo build` or for `cargo check` vs `cargo clippy`. It will be an issue for `cargo check` vs `cargo check --no-default-features`, or `cargo check` vs `cargo check --workspace`. We can at least defer that out of this initial PR and then evaluate different multi-lock designs under this scheme and how much of a need there is for them.
> What if we tracked all locks inside of the `BuildRunner`?

We could try something like this, but the `.rmeta_produce()` called for pipelined builds makes this tricky, since `build_runner` is not in scope for the `Work` closure. (Similar issue as this comment.)
Though it might be possible to plumb that over to the `JobState`.
Unsure how difficult that would be, but I can look into it.
I was suggesting we simplify things down to just one lock per build unit. We grab it when doing the fingerprint check and then hold onto it until the end. We don't need these locks for coordination within our own build; this is just for cross-process coordination. If you have two processes doing `cargo check && cargo check`, the second one will effectively be blocked on the first anyway. Having finer grained locks than this only helps when some leaf crates can be shared but nothing else, like what happens when different features are activated. This seems minor enough, especially once we get the cross-project build cache, which is where these are more likely to live and which will have a different locking scheme.
From #16155 (comment)
I updated the implementation to use a single lock per build unit as mentioned in #t-cargo > Build cache and locking design @ 💬
From my testing we now get parallelism between `cargo check` and `cargo build`, though there is some lock contention for some build units, so I think there are some other scenarios where check and build share build units. More research needed on my side to understand which.
Looks like you expanded on my idea to go ahead with an rwlock to reduce contention. I'm surprised we're still seeing any contention with that choice, unless the build unit needs to be rebuilt again for some reason.
I updated the implementation to use a single lock per build unit as mentioned in #t-cargo > Build cache and locking design @ 💬

The latest commit omits build scripts and proc macros from locking, which we will need to decide how we want to handle. (Added to the open questions list in the PR description.)

From my testing we now get parallelism between `cargo check` and `cargo build`.

Also, I think the CI failure is spurious and unrelated to my changes.
Skipping them means we have race conditions. I would assume the safer route for an MVP would be to lock everything and then iterate from there.
```rust
if let Some(lock) = locks.get_mut(&key) {
    lock.file().lock_shared()?;
} else {
    let fs = Filesystem::new(key.0.clone());
    let lock =
        fs.open_ro_shared_create(&key.0, build_runner.bcx.gctx, &format!("locking {key}"))?;
    locks.insert(key.clone(), lock);
}
```
nit: `locks.entry` may be useful
Tried that, but you run into the issue of not being able to propagate the error from `.lock_shared()` out of the `.and_modify()` closure :(
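For what it's worth, matching on the `Entry` variants directly (instead of `and_modify`) keeps `?` usable in both arms, since no closure is involved. A toy sketch with a stand-in lock type (not cargo's real `FileLock`):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Stand-in for a file lock; counts how many shared locks were taken.
#[derive(Debug)]
struct FakeLock {
    shared_count: u32,
}

impl FakeLock {
    fn lock_shared(&mut self) -> Result<(), String> {
        self.shared_count += 1;
        Ok(())
    }
}

fn open_lock() -> Result<FakeLock, String> {
    Ok(FakeLock { shared_count: 1 })
}

// `?` works in both arms because we match on the Entry rather than
// passing closures to `and_modify`/`or_insert_with`.
fn take_shared(locks: &mut HashMap<String, FakeLock>, key: &str) -> Result<(), String> {
    match locks.entry(key.to_string()) {
        Entry::Occupied(mut e) => e.get_mut().lock_shared()?,
        Entry::Vacant(v) => {
            v.insert(open_lock()?);
        }
    }
    Ok(())
}
```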
```rust
}

let lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking {
    Some(build_runner.lock_manager.lock_shared(build_runner, unit)?)
```
With the current lock upgrade setup, we can deadlock: if two processes grab the shared lock to check the fingerprint and both decide to build, each will block trying to get an exclusive lock but can't, because the other process still holds its shared lock.
Potential solutions:
- Only use exclusive locks, blocking on other builds using the same units
- Between fingerprint and build, drop the shared lock and then acquire an exclusive lock (rather than upgrade), sometimes rebuilding for the same fingerprint after blocking until the other build is complete
- Could be reduced by re-checking the fingerprint but that may not be ideal to do
To review a previous conversation: rwlocks only help when there is some overlap between builds, e.g.
- `check` vs `build` when there are build scripts and proc macros
- different packages selected
- different features selected
Yeah that is a problem.
> Between fingerprint and build, drop the shared lock and then acquire an exclusive lock (rather than upgrade), sometimes rebuilding for the same fingerprint after blocking until the other build is complete

The tricky part here is that we also need a read lock on dependency units: for checking the fingerprint we need a read lock on the dependency units so we can read the upstream fingerprints. So trying to unlock and relock may still result in deadlock.
Taking exclusive locks for fingerprints would probably be the safest against deadlocks, but would also greatly reduce the parallelism wins :/
> Yeah that is a problem.
>
> > Between fingerprint and build, drop the shared lock and then acquire an exclusive lock (rather than upgrade), sometimes rebuilding for the same fingerprint after blocking until the other build is complete
>
> The tricky part here is that we also need a read lock on dependency units. For checking the fingerprint we need a read lock on the dependency units so we can read the upstream fingerprints. So trying to unlock and relock may still result in deadlock.
The idea was:
- grab reader lock for fingerprint
- if fingerprint matches, keep lock until build end
- if fingerprint doesn't match, drop the lock
- grab an exclusive lock
- build the unit
- downgrade to reader lock
- at end of build, drop all locks
I'm missing the deadlock scenario in this
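To make the ordering concrete, here is a toy model of that drop-and-relock protocol that just records the lock-state sequence for one build unit; real code would use cross-process file locks rather than an enum (all names here are illustrative):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LockState {
    Unlocked,
    Shared,
    Exclusive,
}

// Returns the sequence of lock states one unit goes through under the
// protocol: reader lock for the fingerprint check; if fresh, keep it to
// the end of the build; if dirty, drop it, take an exclusive lock to
// build, then downgrade back to a reader lock.
fn run_unit(fingerprint_fresh: bool) -> Vec<LockState> {
    let mut states = vec![LockState::Shared]; // reader lock for fingerprint
    if fingerprint_fresh {
        return states; // fresh: keep shared lock until build end
    }
    states.push(LockState::Unlocked);  // dirty: drop the shared lock
    states.push(LockState::Exclusive); // grab exclusive lock, build the unit
    states.push(LockState::Shared);    // downgrade to reader lock
    states
}
```

Because a dirty unit fully releases its shared lock before requesting the exclusive lock, two processes can no longer hold shared locks while each waits for the other to release, which is the deadlock the upgrade path allowed.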
When we read a fingerprint, we also need to take read locks on dependency units. If we drop the lock on a dirty fingerprint should we also drop the dependency locks?
I was thinking that if we drop those as well, we might run into issues. But thinking about it a bit more, maybe that's not an issue. I think it should be safe to hold on to those shared locks.
wdyt?
In the latest push, I moved to the read for fingerprint, unlock, relock exclusive.
From my testing, it appears to be working :)
When I try to `cargo check` a crate with `tokio = { ..., features = ["full"] }` while a `cargo build` is running, I get:

```text
Blocking waiting for file lock on locking /home/ross/projects/foo/build/debug/build/proc-macro2/4b4e013b0585d84c/.lock
Blocking waiting for file lock on locking /home/ross/projects/foo/build/debug/build/quote/401da6e704cd00e6/.lock
Blocking waiting for file lock on locking /home/ross/projects/foo/build/debug/build/syn/287dc061d31804b8/.lock
    Checking libc v0.2.178
....
```

This is promising, as these are only proc-macros and we get some parallelism.
This is promising as these are only proc-macros and we get some parallelism.
I still haven't fully thought through all of the scenarios where we could end up deadlocking, but CI is passing on Windows now (which it was not previously), so it's a step in the right direction.
This PR adds fine grain locking for the build cache using build unit level locking.
I'd recommend reading the design details in this description and then reviewing commit by commit.
Part of #4282
Previous attempt: #16089
Design decisions / rationale
Original Design
Implementation details
- `primary.lock` and `secondary.lock` (see the `locking.rs` module docs for more details on the states)
- (see `determine_locking_mode()`)
- The `.cargo-lock` lock is taken as RO shared to continue working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
- `cargo clean` continues to use coarse grain locking for simplicity.
- `UnitGraph`.
- If we exceed (number of build units * 10) file descriptors, we automatically fall back to coarse grain locking and display a warning to the user.
- `RcFileLock`: I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
- A `FileLockInterner` holds on to the file descriptors (`RcFileLock`) until the end of the process. (We could potentially add it to `JobState` if preferred; it would just be a bit more plumbing.)
- `build-dir/<profile>/.cargo-lock`, to stay backwards compatible with previous versions of cargo.

For the rationale for this design, see the discussion #t-cargo > Build cache and locking design @ 💬
Open Questions
- build-dir #16155 (comment)
- `cargo check` and `cargo build`? The current implementation skips locking them as an MVP design.
- build-dir #16155 (comment)
- build-dir #16155 (comment)
- `cargo doc`
- `cargo check` would not take a lock on the artifact-dir but `cargo build` would). This would mean that 2 `cargo build` invocations would not run in parallel, because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal to avoid 2 instances fighting over the CPU while recompiling the same crates.
- `-Zbuild-dir-new-layout`. With the current implementation I am not seeing any perf regression on Linux, but I have yet to test on Windows/macOS.
- `CARGO_BUILD_LOCKING_MODE=coarse`?