From d4e8aa3bc591fc439597a6e0bac452d7efaa11ba Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:13:44 +0200
Subject: [PATCH 1/8] wishlist: mark "Mount a subdirectory instead of the top-level directory" as done

Signed-off-by: Christian Brauner
---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 7615dae..b86d617 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,8 @@ point that out explicitly and clearly in the associated patches and Cc

 ### Mount a subdirectory instead of the top-level directory

+[x] Mount a subdirectory instead of the top-level directory
+
 Ability to mount a subdirectory of a regular file system instead of
 the top-level directory. E.e. for a file system `/dev/sda1` which
 contains a sub-directory `/foobar` mount `/foobar` without having
@@ -26,6 +28,8 @@ shenanigans, but this requires namespacing to be available, and is
 not precisely obvious to implement. Explicit kernel support at mount
 time would be much preferable.)

+**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇**
+
 **Use-Case:** `systemd-homed` currently mounts a sub-directory of
 the per-user LUKS volume as the user's home directory (and not the
 root directory of the per-user LUKS volume's file system!), and in

From 952e2b69824e2e520526f3e19c27772b900a5010 Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:19:40 +0200
Subject: [PATCH 2/8] wishlist: mark "Mount notifications without rescanning of `/proc/self/mountinfo`" as done

Signed-off-by: Christian Brauner
---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index b86d617..4b31932 100644
--- a/README.md
+++ b/README.md
@@ -249,6 +249,8 @@ unexpected children.

 ### Mount notifications without rescanning of `/proc/self/mountinfo`

+[x] Mount notifications without rescanning of `/proc/self/mountinfo`
+
 Mount notifications that do not require continuous rescanning of
 `/proc/self/mountinfo`. Currently, if a program wants to track
 mounts established on the system it can receive `poll()`able
@@ -260,6 +262,8 @@ cost for re-scanning the table has to be paid for every change to
 the mount table. It's racy because quickly added and removed
 mounts might not be noticed.

+**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇**
+
 **Use-Case:** `systemd` tracks the mount table to integrate the mounts
 into it own dependency management.

From ec7af62abbe2066eb502ce1a785a178730f52cc7 Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:21:52 +0200
Subject: [PATCH 3/8] wishlist: mark "Allow creating idmapped mounts from idmapped mounts" as done

Signed-off-by: Christian Brauner
---
 README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4b31932..546977c 100644
--- a/README.md
+++ b/README.md
@@ -375,7 +375,9 @@ user namespace. But this doesn't just lock a single mount or mount subtree
 it locks all mounts in the mount namespace, i.e., the mount table cannot be
 altered.

-### Add `OPEN_TREE_CLEAR` flag to `open_tree()`
+### Allow creating idmapped mounts from idmapped mounts
+
+[x] Allow creating idmapped mounts from idmapped mounts

 Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
 used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
@@ -391,6 +393,8 @@ This is useful to get a new, clear mount and also allows the caller to
 create a new detached mount with an idmapping attached to the mount. Iow,
 the caller may idmap the mount afterwards.

+**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇**
+
 **Use-Case:** A user may already use an idmapped mount for their home
 directory. And once a mount has been idmapped the idmapping cannot be
 changed anymore. This allows for simple semantics and allows to avoid

From 90f62ad84b19ab9ed17ea76417a6e2d3dcad1d24 Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:23:27 +0200
Subject: [PATCH 4/8] wishlist: mark "Require a user namespace to have an idmapping when attached" as done

Signed-off-by: Christian Brauner
---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 546977c..2d68a30 100644
--- a/README.md
+++ b/README.md
@@ -411,9 +411,13 @@ mount.

 ### Require a user namespace to have an idmapping when attached

+[x] Require a user namespace to have an idmapping when attached
+
 Enforce that the user namespace about to be attached to a mount must
 have an idmapping written.

+**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇**
+
 **Use-Case:** Tighten the semantics.

 ### Extend `setns()` to allow attaching to all new namespaces of a process

From f857eac94567d9b6f3b2a5900c0f422b49a133a6 Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:26:17 +0200
Subject: [PATCH 5/8] whishlist: mark "API to determine the parent process ID of a pidfd" as done

Signed-off-by: Christian Brauner
---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 2d68a30..8d4c331 100644
--- a/README.md
+++ b/README.md
@@ -614,6 +614,8 @@ identify, track and compare process even after they ceased to exist.

 ### API to determine the parent process ID of a pidfd

+[x] API to determine the parent process ID of a pidfd
+
 An API to determine the parent process ID (ppid) of a pidfd would be
 good.

@@ -623,6 +625,8 @@ the ppid of a pidfd matches the process own pid it can call
 would fail. It would be very useful if this could be determined
 easily before even calling that syscall.

+**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇**
+
 **Usecase:** systemd manages a multitude of processes, most of which
 are its own children, but many which are not. It would be great if
 we could easily determine whether it is worth waiting for

From 8f9727607b9a37b9fd066b74347ff0f5806e3068 Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:28:13 +0200
Subject: [PATCH 6/8] wishlist: mark "Set `comm` field before `exec()`" as done

Signed-off-by: Christian Brauner
---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 8d4c331..121a612 100644
--- a/README.md
+++ b/README.md
@@ -635,6 +635,8 @@ them is the only way to get exit notification.

 ### Set `comm` field before `exec()`

+[x] Set `comm` field before `exec()`
+
 There should be a way to control the process' `comm` field if
 started via `fexecve()`/`execveat()`.

@@ -653,6 +655,8 @@ time.
 See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
 https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.

+**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇**
+
 **Usecase:** In systemd we generally would prefer using `fexecve()`
 to safely and race-freely invoke processes, but the fact that `comm`
 is useless after invoking a process that way makes the call

From 662e7b792520261ed4c261fa3028265300ca1823 Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:30:29 +0200
Subject: [PATCH 7/8] wishlist: mark "Namespace ioctl to translate a PID between PID namespaces" as done

Signed-off-by: Christian Brauner
---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 121a612..be8a36e 100644
--- a/README.md
+++ b/README.md
@@ -792,6 +792,10 @@ to thread-group leader pidfd.

 ### Namespace ioctl to translate a PID between PID namespaces

+[x] Namespace ioctl to translate a PID between PID namespaces
+
+**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇**
+
 **Use-Case:** This makes it possible to e.g., figure out what a given PID in
 a PID namespace corresponds to in the caller's PID namespace. For example, to
 figure out what the PID of PID 1 inside of a given PID namespace is.

From 171e5e31d27d7c223efeef0414b730a91e71c741 Mon Sep 17 00:00:00 2001
From: Christian Brauner
Date: Fri, 6 Jun 2025 10:33:01 +0200
Subject: [PATCH 8/8] wishlist: move all finished items into the correct section

Signed-off-by: Christian Brauner
---
 README.md | 343 +++++++++++++++++++++++++++---------------------------
 1 file changed, 171 insertions(+), 172 deletions(-)

diff --git a/README.md b/README.md
index be8a36e..d4e1b68 100644
--- a/README.md
+++ b/README.md
@@ -10,32 +10,6 @@ associated problem space.
 point that out explicitly and clearly in the associated patches and Cc
 `Christian Brauner user_ns)`. If idmapped mounts
-are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
-in the filesystems user namespace.
-
-Locked mount properties cannot be changed. A mount's idmapping becomes
-locked if it propagates across user namespaces.
-
-This is useful to get a new, clear mount and also allows the caller to
-create a new detached mount with an idmapping attached to the mount. Iow,
-the caller may idmap the mount afterwards.
-
-**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇**
-
-**Use-Case:** A user may already use an idmapped mount for their home
-directory. And once a mount has been idmapped the idmapping cannot be
-changed anymore. This allows for simple semantics and allows to avoid
-lifetime complexity in order to account for scenarios where concurrent
-readers or writers might still use a given user namespace while it is about
-to be changed.
-But this poses a problem when the user wants to attach an idmapping to
-a mount that is already idmapped. The new flag allows to solve this
-problem. A sufficiently privileged user such as a container manager can
-create a user namespace for the container which expresses the desired
-ownership. Then they can create a new detached mount without any prior
-mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
-mount.
-
-### Require a user namespace to have an idmapping when attached
-
-[x] Require a user namespace to have an idmapping when attached
-
-Enforce that the user namespace about to be attached to a mount must
-have an idmapping written.
-
-**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇**
-
-**Use-Case:** Tighten the semantics.
-
 ### Extend `setns()` to allow attaching to all new namespaces of a process

 Add an extension to `setns()` to allow attaching to all namespaces of
@@ -591,77 +500,6 @@ different sources and it should not be possible to generate a
 system extension with a key pair that is supposed to be good for
 container images only.

-### Make statx() on a pidfd return additional info
-
-Make statx() on a pidfd return additional recognizable identifiers in
-`.stx_btime`.
-
-**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
-
-It would be fantastic if issuing statx() on any pidfd would return
-the start time of the process in `.stx_btime` even after the process
-died.
-
-These fields should in particular be queriable *after* the process
-already exited and has been reaped, i.e. after its PID has already
-been recycled.
-
-**Usecase:** In systemd we maintain lists of processes in a hash
-table. Right now, the key is the PID, but this is less than ideal
-because of PID recycling. By being able to use the `.stx_btime`
-and/or `.stx_ino` fields instead would be perfect to safely
-identify, track and compare process even after they ceased to exist.
-
-### API to determine the parent process ID of a pidfd
-
-[x] API to determine the parent process ID of a pidfd
-
-An API to determine the parent process ID (ppid) of a pidfd would be
-good.
-
-This information is relevant to code dealing with pidfds, since if
-the ppid of a pidfd matches the process own pid it can call
-`waitid()` on the process, if it doesn't it cannot and such a call
-would fail. It would be very useful if this could be determined
-easily before even calling that syscall.
-
-**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇**
-
-**Usecase:** systemd manages a multitude of processes, most of which
-are its own children, but many which are not. It would be great if
-we could easily determine whether it is worth waiting for
-`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
-them is the only way to get exit notification.
-
-### Set `comm` field before `exec()`
-
-[x] Set `comm` field before `exec()`
-
-There should be a way to control the process' `comm` field if
-started via `fexecve()`/`execveat()`.
-
-Right now, when `fexecve()`/`execveat()` is used, the `comm` field
-(i.e. `/proc/self/comm`) contains a name derived of the numeric fd,
-which breaks `ps -C …` and various other tools. In particular when
-the fd was opened with `O_CLOEXEC`, the number of the fd in the old
-process is completely meaningless.
-
-The goal is add a way to tell `fexecve()`/`execveat()` what Name to use.
-
-Since `comm` is under user control anyway (via `PR_SET_NAME`), it
-should be safe to also make it somehow configurable at fexecve()
-time.
-
-See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
-https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
-
-**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇**
-
-**Usecase:** In systemd we generally would prefer using `fexecve()`
-to safely and race-freely invoke processes, but the fact that `comm`
-is useless after invoking a process that way makes the call
-unfortunately hard to use for systemd.
-
 ### Path-based ACL management in an LSM hook

 The LSM module API should have the ability to do path-based (not
@@ -790,16 +628,6 @@ Add an option to go from individual thread to thread-group leader.
 **Use-Case:** Allow for a race free way to go from individual thread
 to thread-group leader pidfd.

-### Namespace ioctl to translate a PID between PID namespaces
-
-[x] Namespace ioctl to translate a PID between PID namespaces
-
-**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇**
-
-**Use-Case:** This makes it possible to e.g., figure out what a given PID in
-a PID namespace corresponds to in the caller's PID namespace. For example, to
-figure out what the PID of PID 1 inside of a given PID namespace is.
-
 ### Useful handling of LSM denials on SCM_RIGHTS

 Right now if some LSM such as SELinux denies an `AF_UNIX` socket peer
@@ -839,6 +667,177 @@ on received messages.

 ## Finished Items

+### Namespace ioctl to translate a PID between PID namespaces
+
+[x] Namespace ioctl to translate a PID between PID namespaces
+
+**🙇 `ca567df74a28a9fb368c6b2d93e864113f73f5c2 ("nsfs: add pid translation ioctls")` 🙇**
+
+**Use-Case:** This makes it possible to e.g., figure out what a given PID in
+a PID namespace corresponds to in the caller's PID namespace. For example, to
+figure out what the PID of PID 1 inside of a given PID namespace is.
+
+### API to determine the parent process ID of a pidfd
+
+[x] API to determine the parent process ID of a pidfd
+
+An API to determine the parent process ID (ppid) of a pidfd would be
+good.
+
+This information is relevant to code dealing with pidfds, since if
+the ppid of a pidfd matches the process own pid it can call
+`waitid()` on the process, if it doesn't it cannot and such a call
+would fail. It would be very useful if this could be determined
+easily before even calling that syscall.
+
+**🙇 `cdda1f26e74b ("pidfd: add ioctl to retrieve pid info")` 🙇**
+
+**Usecase:** systemd manages a multitude of processes, most of which
+are its own children, but many which are not. It would be great if
+we could easily determine whether it is worth waiting for
+`SIGCHLD`/`waitid()` on them or whether waiting for `POLLIN` on
+them is the only way to get exit notification.
+
+### Set `comm` field before `exec()`
+
+[x] Set `comm` field before `exec()`
+
+There should be a way to control the process' `comm` field if
+started via `fexecve()`/`execveat()`.
+
+Right now, when `fexecve()`/`execveat()` is used, the `comm` field
+(i.e. `/proc/self/comm`) contains a name derived of the numeric fd,
+which breaks `ps -C …` and various other tools. In particular when
+the fd was opened with `O_CLOEXEC`, the number of the fd in the old
+process is completely meaningless.
+
+The goal is add a way to tell `fexecve()`/`execveat()` what Name to use.
+
+Since `comm` is under user control anyway (via `PR_SET_NAME`), it
+should be safe to also make it somehow configurable at fexecve()
+time.
+
+See https://github.com/systemd/systemd/commit/35a926777e124ae8c2ac3cf46f44248b5e147294,
+https://github.com/systemd/systemd/commit/8939eeae528ef9b9ad2a21995279b76d382d5c81.
+
+**🙇 `543841d18060 ("exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case")` 🙇**
+
+**Usecase:** In systemd we generally would prefer using `fexecve()`
+to safely and race-freely invoke processes, but the fact that `comm`
+is useless after invoking a process that way makes the call
+unfortunately hard to use for systemd.
+### Make statx() on a pidfd return additional info
+
+Make statx() on a pidfd return additional recognizable identifiers in
+`.stx_btime`.
+
+**🙇 `cb12fd8e0dabb9a1c8aef55a6a41e2c255fcdf4b pidfd: add pidfs` 🙇**
+
+It would be fantastic if issuing statx() on any pidfd would return
+the start time of the process in `.stx_btime` even after the process
+died.
+
+These fields should in particular be queriable *after* the process
+already exited and has been reaped, i.e. after its PID has already
+been recycled.
+
+**Usecase:** In systemd we maintain lists of processes in a hash
+table. Right now, the key is the PID, but this is less than ideal
+because of PID recycling. By being able to use the `.stx_btime`
+and/or `.stx_ino` fields instead would be perfect to safely
+identify, track and compare process even after they ceased to exist.
+
+### Allow creating idmapped mounts from idmapped mounts
+
+[x] Allow creating idmapped mounts from idmapped mounts
+
+Add a new `OPEN_TREE_CLEAR` flag to `open_tree()` that can only be
+used in conjunction with `OPEN_TREE_CLONE`. When specified it will clear
+all mount properties from that mount including the mount's idmapping.
+Requires the caller to be `ns_capable(mntns->user_ns)`. If idmapped mounts
+are encountered the caller must be `ns_capable(sb->user_ns, CAP_SYS_ADMIN)`
+in the filesystems user namespace.
+
+Locked mount properties cannot be changed. A mount's idmapping becomes
+locked if it propagates across user namespaces.
+
+This is useful to get a new, clear mount and also allows the caller to
+create a new detached mount with an idmapping attached to the mount. Iow,
+the caller may idmap the mount afterwards.
+
+**🙇 `c4a16820d901 ("fs: add open_tree_attr()")` 🙇**
+
+**Use-Case:** A user may already use an idmapped mount for their home
+directory. And once a mount has been idmapped the idmapping cannot be
+changed anymore. This allows for simple semantics and allows to avoid
+lifetime complexity in order to account for scenarios where concurrent
+readers or writers might still use a given user namespace while it is about
+to be changed.
+But this poses a problem when the user wants to attach an idmapping to
+a mount that is already idmapped. The new flag allows to solve this
+problem. A sufficiently privileged user such as a container manager can
+create a user namespace for the container which expresses the desired
+ownership. Then they can create a new detached mount without any prior
+mount properties via OPEN_TREE_CLEAR and then attach the idmapping to this
+mount.
+
+### Require a user namespace to have an idmapping when attached
+
+[x] Require a user namespace to have an idmapping when attached
+
+Enforce that the user namespace about to be attached to a mount must
+have an idmapping written.
+
+**🙇 `dacfd001eaf2 ("fs/mnt_idmapping.c: Return -EINVAL when no map is written")` 🙇**
+
+**Use-Case:** Tighten the semantics.
+
+### Mount notifications without rescanning of `/proc/self/mountinfo`
+
+[x] Mount notifications without rescanning of `/proc/self/mountinfo`
+
+Mount notifications that do not require continuous rescanning of
+`/proc/self/mountinfo`. Currently, if a program wants to track
+mounts established on the system it can receive `poll()`able
+events via a file descriptor to `/proc/self/mountinfo`. When
+receiving them it needs to rescan the file from the top and
+compare it with the previous scan. This is both slow and
+racy. It's slow on systems with a large number of mounts as the
+cost for re-scanning the table has to be paid for every change to
+the mount table. It's racy because quickly added and removed
+mounts might not be noticed.
+
+**🙇 `0f46d81f2bce ("fanotify: notify on mount attach and detach")` 🙇**
+
+**Use-Case:** `systemd` tracks the mount table to integrate the mounts
+into it own dependency management.
+
+### Mount a subdirectory instead of the top-level directory
+
+[x] Mount a subdirectory instead of the top-level directory
+
+Ability to mount a subdirectory of a regular file system instead of
+the top-level directory. E.e. for a file system `/dev/sda1` which
+contains a sub-directory `/foobar` mount `/foobar` without having
+to mount its parent directory first. Consider something like this:
+
+```
+mount -t ext4 /dev/sda1 somedir/ -o subdir=/foobar
+```
+
+(This is of course already possible via some mount namespacing
+shenanigans, but this requires namespacing to be available, and is
+not precisely obvious to implement. Explicit kernel support at mount
+time would be much preferable.)
+
+**🙇 `c5c12f871a30 ("fs: create detached mounts from detached mounts")` 🙇**
+
+**Use-Case:** `systemd-homed` currently mounts a sub-directory of
+the per-user LUKS volume as the user's home directory (and not the
+root directory of the per-user LUKS volume's file system!), and in
+order to implement this invisibly from the host side requires a
+complex mount namespace exercise.
+
 ### Unmounting of obstructed mounts

 [x] ability to unmount obstructed mounts. (this means: you have a stack