315 changes: 171 additions & 144 deletions README.md
@@ -10,28 +10,6 @@ associated problem space.
point that out explicitly and clearly in the associated patches and Cc
`Christian Brauner <brauner (at) kernel (dot) org>`.**

### Mount a subdirectory instead of the top-level directory

Ability to mount a subdirectory of a regular file system instead of
the top-level directory. E.g., for a file system `/dev/sda1` which
contains a sub-directory `/foobar`, mount `/foobar` without having
to mount its parent directory first. Consider something like this:

```
mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
```

(This is of course already possible via some mount namespacing
shenanigans, but this requires namespacing to be available, and is
not precisely obvious to implement. Explicit kernel support at mount
time would be much preferable.)

**Use-Case:** `systemd-homed` currently mounts a sub-directory of
the per-user LUKS volume as the user's home directory (and not the
root directory of the per-user LUKS volume's file system!), and
implementing this invisibly from the host side requires a
complex mount namespace exercise.

### inotify() events for BSD file locks

BSD file locks (i.e. `flock()`, as opposed to POSIX `F_SETLK` and
@@ -243,22 +221,6 @@ to use `pidfd`s to remove PID recycling security issues, but
currently cannot as it also needs to generically wait for such
unexpected children.

### Mount notifications without rescanning of `/proc/self/mountinfo`

Mount notifications that do not require continuous rescanning of
`/proc/self/mountinfo`. Currently, if a program wants to track
mounts established on the system it can receive `poll()`able
events via a file descriptor to `/proc/self/mountinfo`. When
receiving them it needs to rescan the file from the top and
compare it with the previous scan. This is both slow and
racy. It's slow on systems with a large number of mounts as the
cost for re-scanning the table has to be paid for every change to
the mount table. It's racy because quickly added and removed
mounts might not be noticed.

**Use-Case:** `systemd` tracks the mount table to integrate the mounts
into its own dependency management.

### Asynchronous `close()`

An asynchronous or forced `close()`, that guarantees that
@@ -367,43 +329,6 @@ user namespace. But this doesn't just lock a single mount or mount subtree
it locks all mounts in the mount namespace, i.e., the mount table cannot be
altered.

### Add `OPEN_TREE_CLEAR` flag to `open_tree()`

Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
all mount properties from that mount including the mount's idmapping.
Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
in the filesystem's user namespace.

Locked mount properties cannot be changed. A mount's idmapping becomes
locked if it propagates across user namespaces.

This is useful to get a new, clear mount and also allows the caller to
create a new detached mount with an idmapping attached to the mount. In
other words, the caller may idmap the mount afterwards.

**Use-Case:** A user may already use an idmapped mount for their home
directory. Once a mount has been idmapped, the idmapping cannot be
changed anymore. This keeps the semantics simple and avoids the
lifetime complexity of accounting for concurrent readers or writers
that might still use a given user namespace while it is about to be
changed.
But this poses a problem when the user wants to attach an idmapping to
a mount that is already idmapped. The new flag solves this
problem. A sufficiently privileged user such as a container manager can
create a user namespace for the container which expresses the desired
ownership. Then they can create a new detached mount without any prior
mount properties via `OPEN_TREE_CLEAR` and then attach the idmapping to this
mount.

### Require a user namespace to have an idmapping when attached

Enforce that the user namespace about to be attached to a mount must
have an idmapping written.

**Use-Case:** Tighten the semantics.

### Extend `setns()` to allow attaching to all new namespaces of a process

Add an extension to `setns()` to allow attaching to all namespaces of
@@ -575,69 +500,6 @@ different sources and it should not be possible to generate a
system extension with a key pair that is supposed to be good for
container images only.

### Make statx() on a pidfd return additional info

Make statx() on a pidfd return additional recognizable identifiers in
`.stx_btime`.

**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**

It would be fantastic if issuing statx() on any pidfd would return
the start time of the process in `.stx_btime` even after the process
died.

These fields should in particular be queryable *after* the process
has already exited and been reaped, i.e. after its PID has already
been recycled.

**Usecase:** In systemd we maintain lists of processes in a hash
table. Right now, the key is the PID, but this is less than ideal
because of PID recycling. Being able to use the `.stx_btime`
and/or `.stx_ino` fields instead would be perfect to safely
identify, track and compare processes even after they have ceased to exist.

### API to determine the parent process ID of a pidfd

An API to determine the parent process ID (ppid) of a pidfd would be
good.

This information is relevant to code dealing with pidfds: if
the ppid of a pidfd matches the process's own PID, it can call
`waitid()` on the process; if it doesn't, it cannot, and such a call
would fail. It would be very useful if this could be determined
easily before even calling that syscall.

**Usecase:** systemd manages a multitude of processes, most of which
are its own children, but many of which are not. It would be great if
we could easily determine whether it is worth waiting for
`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
them is the only way to get exit notification.

### Set `comm` field before `exec()`

There should be a way to control the process' `comm` field if
started via `fexecve()`/`execveat()`.

Right now, when `fexecve()`/`execveat()` is used, the `comm` field
(i.e. `/proc/self/comm`) contains a name derived from the numeric fd,
which breaks `ps -C …` and various other tools. In particular when
the fd was opened with `O_CLOEXEC`, the number of the fd in the old
process is completely meaningless.

The goal is to add a way to tell `fexecve()`/`execveat()` what name to use.

Since `comm` is under user control anyway (via `PR_SET_NAME`), it
should be safe to also make it somehow configurable at `fexecve()`
time.

See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.

**Usecase:** In systemd we generally would prefer using `fexecve()`
to safely and race-freely invoke processes, but the fact that `comm`
is useless after invoking a process that way makes the call
unfortunately hard to use for systemd.

### Path-based ACL management in an LSM hook

The LSM module API should have the ability to do path-based (not
@@ -766,12 +628,6 @@ Add an option to go from individual thread to thread-group leader.
**Use-Case:** Allow for a race free way to go from individual thread
to thread-group leader pidfd.

### Namespace ioctl to translate a PID between PID namespaces

**Use-Case:** This makes it possible to e.g., figure out what a given PID in
a PID namespace corresponds to in the caller's PID namespace. For example, to
figure out what the PID of PID 1 inside of a given PID namespace is.

### Useful handling of LSM denials on SCM_RIGHTS

Right now if some LSM such as SELinux denies an `AF_UNIX` socket peer
@@ -811,6 +667,177 @@ on received messages.

## Finished Items

### Namespace ioctl to translate a PID between PID namespaces

[x] Namespace ioctl to translate a PID between PID namespaces

**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇**

**Use-Case:** This makes it possible to e.g., figure out what a given PID in
a PID namespace corresponds to in the caller's PID namespace. For example, to
figure out what the PID of PID 1 inside of a given PID namespace is.
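
As a rough illustration of the use-case above, the following sketch asks a
PID namespace file descriptor which PID its PID 1 has in the caller's PID
namespace. It assumes the `NS_GET_PID_FROM_PIDNS` ioctl from `<linux/nsfs.h>`
as found in recent kernels; the fallback numeric definitions and the helper
names `pid_from_pidns()` and `init_pid_of_pidns()` are illustrative
assumptions, not part of any official API. On kernels without the ioctl the
call is expected to fail with an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

#if defined(__has_include)
#  if __has_include(<linux/nsfs.h>)
#    include <linux/nsfs.h>
#  endif
#endif

/* Fallbacks for older UAPI headers. The values are an assumption
 * taken from recent versions of <linux/nsfs.h>. */
#ifndef NSIO
#define NSIO 0xb7
#endif
#ifndef NS_GET_PID_FROM_PIDNS
#define NS_GET_PID_FROM_PIDNS _IOR(NSIO, 0x6, int)
#endif

/* Translate `pid`, as numbered inside the PID namespace referred to by
 * `pidns_fd`, into the caller's PID namespace. The PID is passed by
 * value and the translated PID is the ioctl's return value.
 * Returns -1 with errno set on failure. */
static pid_t pid_from_pidns(int pidns_fd, pid_t pid)
{
    assert(pidns_fd >= 0);
    return (pid_t) ioctl(pidns_fd, NS_GET_PID_FROM_PIDNS, pid);
}

/* Return the caller-visible PID of PID 1 of the PID namespace found at
 * `pidns_path` (e.g. "/proc/<pid>/ns/pid"), or -1 on failure. */
static pid_t init_pid_of_pidns(const char *pidns_path)
{
    assert(pidns_path != NULL);

    int fd = open(pidns_path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    pid_t pid = pid_from_pidns(fd, 1);
    int saved_errno = errno;   /* close() may clobber errno */
    close(fd);
    errno = saved_errno;
    return pid;
}
```

Calling `init_pid_of_pidns("/proc/self/ns/pid")` should simply yield 1 on a
kernel that supports the ioctl, since the namespace being queried is the
caller's own; the interesting case is passing the namespace file of a
process inside a container.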

### API to determine the parent process ID of a pidfd

[x] API to determine the parent process ID of a pidfd

An API to determine the parent process ID (ppid) of a pidfd would be
good.

This information is relevant to code dealing with pidfds: if
the ppid of a pidfd matches the process's own PID, it can call
`waitid()` on the process; if it doesn't, it cannot, and such a call
would fail. It would be very useful if this could be determined
easily before even calling that syscall.

**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇**

**Usecase:** systemd manages a multitude of processes, most of which
are its own children, but many of which are not. It would be great if
we could easily determine whether it is worth waiting for
`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
them is the only way to get exit notification.

### Set `comm` field before `exec()`

[x] Set `comm` field before `exec()`

There should be a way to control the process' `comm` field if
started via `fexecve()`/`execveat()`.

Right now, when `fexecve()`/`execveat()` is used, the `comm` field
(i.e. `/proc/self/comm`) contains a name derived from the numeric fd,
which breaks `ps -C …` and various other tools. In particular when
the fd was opened with `O_CLOEXEC`, the number of the fd in the old
process is completely meaningless.

The goal is to add a way to tell `fexecve()`/`execveat()` what name to use.

Since `comm` is under user control anyway (via `PR_SET_NAME`), it
should be safe to also make it somehow configurable at `fexecve()`
time.
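
To make the point about user control concrete, here is a minimal sketch of
how a process can already rename itself after the fact with `PR_SET_NAME`
and observe the result in `/proc/self/comm`. The helper names `set_comm()`
and `get_comm()` are hypothetical and only serve the illustration; this shows
the existing `prctl()` mechanism, not the kernel change referenced in this
item.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Set the calling thread's comm. The kernel keeps at most 15 bytes
 * plus the terminating NUL and silently truncates longer names. */
static int set_comm(const char *name)
{
    assert(name != NULL);
    return prctl(PR_SET_NAME, (unsigned long) name, 0, 0, 0);
}

/* Read /proc/self/comm into `buf` and strip the trailing newline.
 * Returns 0 on success, -1 on failure. */
static int get_comm(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    FILE *f = fopen("/proc/self/comm", "r");
    if (!f)
        return -1;

    if (!fgets(buf, (int) size, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

In a single-threaded program, calling `set_comm("demo-comm")` followed by
`get_comm()` should read back the same string. The limitation is that this
only works from inside the already-running program, which is why a fix on
the `exec` side is needed for tools that inspect `comm` from the outside.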

See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.

**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇**

**Usecase:** In systemd we generally would prefer using `fexecve()`
to safely and race-freely invoke processes, but the fact that `comm`
is useless after invoking a process that way makes the call
unfortunately hard to use for systemd.

### Make statx() on a pidfd return additional info

Make statx() on a pidfd return additional recognizable identifiers in
`.stx_btime`.

**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**

It would be fantastic if issuing statx() on any pidfd would return
the start time of the process in `.stx_btime` even after the process
died.

These fields should in particular be queryable *after* the process
has already exited and been reaped, i.e. after its PID has already
been recycled.

**Usecase:** In systemd we maintain lists of processes in a hash
table. Right now, the key is the PID, but this is less than ideal
because of PID recycling. Being able to use the `.stx_btime`
and/or `.stx_ino` fields instead would be perfect to safely
identify, track and compare processes even after they have ceased to exist.
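
A minimal sketch of how such identifiers could be collected, assuming a
kernel that provides `pidfd_open()` and a libc that exposes `statx()`. The
type `struct pidfd_identity` and the function `pidfd_identity_get()` are
hypothetical names for this example. Whether `.stx_btime` is actually filled
in depends on the running kernel, so the sketch checks `stx_mask` rather than
assuming it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical container for the identifiers discussed above. */
struct pidfd_identity {
    uint64_t ino;        /* .stx_ino of the pidfd */
    int      has_btime;  /* non-zero if the kernel reported .stx_btime */
    int64_t  btime_sec;  /* seconds part of .stx_btime, 0 if absent */
};

/* Open a pidfd for `pid` and query it with statx().
 * Returns 0 on success, -1 with errno set on failure. */
static int pidfd_identity_get(pid_t pid, struct pidfd_identity *out)
{
    assert(out != NULL);

    int fd = (int) syscall(SYS_pidfd_open, pid, 0);
    if (fd < 0)
        return -1;

    struct statx stx;
    int r = statx(fd, "", AT_EMPTY_PATH, STATX_INO | STATX_BTIME, &stx);
    int saved_errno = errno;   /* close() may clobber errno */
    close(fd);
    errno = saved_errno;
    if (r < 0)
        return -1;

    out->ino = stx.stx_ino;
    out->has_btime = (stx.stx_mask & STATX_BTIME) != 0;
    out->btime_sec = out->has_btime ? (int64_t) stx.stx_btime.tv_sec : 0;
    return 0;
}
```

A process table could then use the returned inode number as the hash key
instead of the PID. That only helps on kernels where the inode number is
unique per process, which is what the pidfs work referenced above is
about.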

### Allow creating idmapped mounts from idmapped mounts

[x] Allow creating idmapped mounts from idmapped mounts

Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
all mount properties from that mount including the mount's idmapping.
Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
in the filesystem's user namespace.

Locked mount properties cannot be changed. A mount's idmapping becomes
locked if it propagates across user namespaces.

This is useful to get a new, clear mount and also allows the caller to
create a new detached mount with an idmapping attached to the mount. In
other words, the caller may idmap the mount afterwards.

**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇**

**Use-Case:** A user may already use an idmapped mount for their home
directory. Once a mount has been idmapped, the idmapping cannot be
changed anymore. This keeps the semantics simple and avoids the
lifetime complexity of accounting for concurrent readers or writers
that might still use a given user namespace while it is about to be
changed.
But this poses a problem when the user wants to attach an idmapping to
a mount that is already idmapped. The new flag solves this
problem. A sufficiently privileged user such as a container manager can
create a user namespace for the container which expresses the desired
ownership. Then they can create a new detached mount without any prior
mount properties via `OPEN_TREE_CLEAR` and then attach the idmapping to this
mount.

### Require a user namespace to have an idmapping when attached

[x] Require a user namespace to have an idmapping when attached

Enforce that the user namespace about to be attached to a mount must
have an idmapping written.

**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇**

**Use-Case:** Tighten the semantics.

### Mount notifications without rescanning of `/proc/self/mountinfo`

[x] Mount notifications without rescanning of `/proc/self/mountinfo`

Mount notifications that do not require continuous rescanning of
`/proc/self/mountinfo`. Currently, if a program wants to track
mounts established on the system it can receive `poll()`able
events via a file descriptor to `/proc/self/mountinfo`. When
receiving them it needs to rescan the file from the top and
compare it with the previous scan. This is both slow and
racy. It's slow on systems with a large number of mounts as the
cost for re-scanning the table has to be paid for every change to
the mount table. It's racy because quickly added and removed
mounts might not be noticed.

**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇**

**Use-Case:** `systemd` tracks the mount table to integrate the mounts
into its own dependency management.
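
For reference, the rescanning approach criticized above can be sketched
roughly as follows. This illustrates the legacy `/proc/self/mountinfo`
polling pattern, not the fanotify-based mechanism from the referenced
commit; the helper names `count_mounts()` and `wait_for_mount_change()` are
hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Rescan /proc/self/mountinfo from the top and count its lines, one per
 * mount. A real tracker would parse each line and diff it against the
 * previous scan. Returns -1 if the file cannot be opened. */
static long count_mounts(void)
{
    FILE *f = fopen("/proc/self/mountinfo", "r");
    if (!f)
        return -1;

    long n = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            n++;

    fclose(f);
    return n;
}

/* Wait on an open /proc/self/mountinfo descriptor until the mount table
 * changes or `timeout_ms` expires. Changes are signalled as
 * POLLPRI/POLLERR. Returns 1 on change, 0 on timeout, -1 on error. */
static int wait_for_mount_change(int mountinfo_fd, int timeout_ms)
{
    assert(mountinfo_fd >= 0);

    struct pollfd p = { .fd = mountinfo_fd, .events = POLLPRI };
    int r = poll(&p, 1, timeout_ms);
    if (r < 0)
        return -1;

    return (r > 0 && (p.revents & (POLLPRI | POLLERR))) ? 1 : 0;
}
```

A watcher built this way keeps one descriptor open, calls
`wait_for_mount_change()` in a loop, and runs a full rescan such as
`count_mounts()` after every wakeup. Every wakeup costs a complete pass over
the table, and a mount that appears and disappears between two scans leaves
no trace, which are exactly the two problems described above.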

### Mount a subdirectory instead of the top-level directory

[x] Mount a subdirectory instead of the top-level directory

Ability to mount a subdirectory of a regular file system instead of
the top-level directory. E.g., for a file system `/dev/sda1` which
contains a sub-directory `/foobar`, mount `/foobar` without having
to mount its parent directory first. Consider something like this:

```
mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
```

(This is of course already possible via some mount namespacing
shenanigans, but this requires namespacing to be available, and is
not precisely obvious to implement. Explicit kernel support at mount
time would be much preferable.)

**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇**

**Use-Case:** `systemd-homed` currently mounts a sub-directory of
the per-user LUKS volume as the user's home directory (and not the
root directory of the per-user LUKS volume's file system!), and
implementing this invisibly from the host side requires a
complex mount namespace exercise.

### Unmounting of obstructed mounts

[x] ability to unmount obstructed mounts. (this means: you have a stack