runc v1.2.0-rc.1 -- "There's a frood who really knows where his towel is."

This is the first release candidate for the 1.2.0 branch of runc. It includes
all patches and bugfixes included in runc 1.1 patch releases (up to and
including 1.1.12). A fair few new features have been added, and some changes
have been made which may affect users. Please help us thoroughly test this
release before we release 1.2.0.

runc now requires a minimum of Go 1.20 to compile.

> NOTE: runc currently will not work properly when compiled with Go 1.22 or
> newer. This is due to some unfortunate glibc behaviour that Go 1.22
> exacerbates in a way that results in containers not being able to start on
> some systems. [See this issue for more information.][runc-4233]

Breaking:

* Several aspects of how mount options work have been adjusted in a way that
  could theoretically break users that have very strange mount option
  strings. This was necessary to fix glaring issues in how mount options were
  being treated. The key changes are:

  - Mount options on bind-mounts that clear a mount flag are now always
    applied. Previously, if a user requested a bind-mount with only clearing
    options (such as `rw,exec,dev`) the options would be ignored and the
    original bind-mount options would be set. Unfortunately this also means
    that container configurations which specified only clearing mount options
    will now actually get what they asked for, which could break existing
    containers (though it seems unlikely that a user who requested a specific
    mount option would consider it "broken" to get the mount options they
    asked for). This also allows us to silently add locked mount flags the
    user *did not explicitly request to be cleared* in rootless mode,
    allowing for easier use of bind-mounts for rootless containers. (#3967)

  - Container configurations using bind-mounts with superblock mount flags
    (i.e.
    filesystem-specific mount flags, referred to as "data" in `mount(2)`, as
    opposed to VFS generic mount flags like `MS_NODEV`) will now return an
    error. This is because superblock mount flags will also affect the host
    mount (as the superblock is shared when bind-mounting), which is obviously
    not acceptable. Previously, these flags were silently ignored so this
    change simply tells users that runc cannot fulfil their request rather
    than just ignoring it. (#3990)

  If any of these changes cause problems in real-world workloads, please
  [open an issue](https://github.com/opencontainers/runc/issues/new/choose)
  so we can adjust the behaviour to avoid compatibility issues.

Added:

* runc has been updated to OCI runtime-spec 1.2.0, and supports all Linux
  features with a few minor exceptions. See
  [`docs/spec-conformance.md`](https://github.com/opencontainers/runc/blob/v1.2.0-rc.1/docs/spec-conformance.md)
  for more details.
* runc now supports id-mapped mounts for bind-mounts (with no restrictions on
  the mapping used for each mount). Other mount types are not currently
  supported. This feature requires `MOUNT_ATTR_IDMAP` kernel support (Linux
  5.12 or newer) as well as kernel support for the underlying filesystem used
  for the bind-mount. See [`mount_setattr(2)`][mount_setattr.2] for a list of
  supported filesystems and other restrictions. (#3717, #3985, #3993)
* Two new mechanisms for reducing the memory usage of our protections against
  [CVE-2019-5736][cve-2019-5736] have been introduced:

  - `runc-dmz` is a minimal binary (~8K) which acts as an additional execve
    stage, allowing us to only need to protect the smaller binary. It should
    be noted that there have been several compatibility issues reported with
    the usage of `runc-dmz` (namely related to capabilities and SELinux). As
    such, this mechanism is **opt-in** and can be enabled by running `runc`
    with the environment variable `RUNC_DMZ=true` (setting this environment
    variable in `config.json` will have no effect).
    This feature can be disabled at build time using the `runc_nodmz` build
    tag. (#3983, #3987)

  - `contrib/memfd-bind` is a helper daemon which will bind-mount a memfd
    copy of `/usr/bin/runc` on top of `/usr/bin/runc`. This entirely
    eliminates per-container copies of the binary, but requires care to
    ensure that upgrades to runc are handled properly, and requires a
    long-running daemon (unfortunately memfds cannot be bind-mounted directly
    and thus require a daemon to keep them alive). (#3987)

* runc will now use `cgroup.kill` if available to kill all processes in a
  container (such as when doing `runc kill`). (#3135, #3825)
* Add support for setting the umask for `runc exec`. (#3661)
* libct/cg: support `SCHED_IDLE` for runc cgroupfs. (#3377)
* checkpoint/restore: implement `--manage-cgroups-mode=ignore`. (#3546)
* seccomp: refactor flags support; add flags to features, set `SPEC_ALLOW` by
  default. (#3588)
* libct/cg/sd: use systemd v240+ new `MAJOR:*` syntax. (#3843)
* Support CFS bandwidth burst for CPU. (#3749, #3145)
* Support time namespaces. (#3876)
* Reduce the `runc` binary size by ~11% by updating
  `github.com/checkpoint-restore/go-criu`. (#3652)
* Add `--pidfd-socket` to `runc run` and `runc exec` to allow for management
  processes to receive a pidfd for the new process, allowing them to avoid
  pid reuse attacks. (#4045)

Deprecated:

* `runc` option `--criu` is now ignored (with a warning), and the option will
  be removed entirely in a future release. Users who need a non-standard
  `criu` binary should rely on the standard way of looking up binaries in
  `$PATH`. (#3316)
* `runc kill` option `-a` is now deprecated. Previously, it had to be
  specified to kill a container (with SIGKILL) which does not have its own
  private PID namespace (so that runc would send SIGKILL to all processes).
  Now, this is done automatically. (#3864, #3825)
* `github.com/opencontainers/runc/libcontainer/user` is now deprecated,
  please use `github.com/moby/sys/user` instead.
  It will be removed in a future release. (#4017)

Changed:

* When Intel RDT feature is not available, its initialization is skipped,
  resulting in slightly faster `runc exec` and `runc run`. (#3306)
* `runc features` is no longer experimental. (#3861)
* libcontainer users that create and kill containers from a daemon process
  (so that the container init is a child of that process) must now implement
  a proper child reaper in case a container does not have its own private PID
  namespace, as documented in `container.Signal`. (#3825)
* Sum `anon` and `file` from `memory.stat` for cgroupv2 root usage, as the
  root does not have `memory.current` for cgroupv2. This aligns cgroupv2 root
  usage more closely with cgroupv1 reporting. Additionally, report root swap
  usage as sum of swap and memory usage, aligned with v1 and existing
  non-root v2 reporting. (#3933)
* Add `swapOnlyUsage` in `MemoryStats`. This field reports swap-only usage.
  For cgroupv1, `Usage` and `Failcnt` are set by subtracting memory usage
  from memory+swap usage. For cgroupv2, `Usage`, `Limit`, and `MaxUsage` are
  set. (#4010)
* libcontainer: `container.Signal` no longer takes an `all` argument. Whether
  or not it is necessary to kill all processes in the container individually
  is now determined automatically. (#3825, #3885)
* seccomp: enable seccomp binary tree optimization. (#3405)
* `runc run`/`runc exec`: ignore SIGURG. (#3368)
* Remove tun/tap from the default device allowlist. (#3468)
* `runc --root non-existent-dir list` now reports an error for non-existent
  root directory. (#3374)

Fixed:

* In case the runc binary resides on tmpfs, `runc init` no longer re-execs
  itself twice.
  (#3342)
* Our seccomp `-ENOSYS` stub now correctly handles multiplexed syscalls on
  s390 and s390x. This solves the issue where syscalls the host kernel did
  not support would return `-EPERM` despite the existence of the `-ENOSYS`
  stub code (this was due to how s390x does syscall multiplexing). (#3474)
* Remove tun/tap from the default device rules. (#3468)
* specconv: avoid mapping "acl" to `MS_POSIXACL`. (#3739)
* libcontainer: fix private PID namespace detection when killing the
  container. (#3866, #3825)
* systemd socket notification: fix race where runc exited before systemd
  properly handled the `READY` notification. (#3291, #3293)
* The `-ENOSYS` seccomp stub is now always generated for the native
  architecture that `runc` is running on. This is needed to work around some
  arguably specification-incompliant behaviour from Docker on architectures
  such as ppc64le, where the allowed architecture list is set to `null`. This
  ensures that we always generate at least one `-ENOSYS` stub for the native
  architecture even with these weird configs. (#4219)

Removed:

* In order to fix performance issues in the "lightweight" bindfd protection
  against [CVE-2019-5736][cve-2019-5736], the temporary `ro` bind-mount of
  `/proc/self/exe` has been removed. runc now creates a binary copy in all
  cases. See the above notes about `memfd-bind` and `runc-dmz` as well as
  `contrib/cmd/memfd-bind/README.md` for more information about how this
  (minor) change in memory usage can be further reduced. (#3987, #3599,
  #2532, #3931)
* libct/cg: Remove `EnterPid` (a function with no users). (#3797)
* libcontainer: Remove `{Pre,Post}MountCmds` which were never used and are
  obsoleted by more generic container hooks.
  (#3350)

[runc-4233]: https://github.com/opencontainers/runc/issues/4233
[mount_setattr.2]: https://man7.org/linux/man-pages/man2/mount_setattr.2.html
[cve-2019-5736]: https://github.com/advisories/GHSA-gxmr-w5mj-v8hh

Thanks to the following contributors who made this release possible:

* Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
* Alban Crequy <albancrequy@microsoft.com>
* Aleksa Sarai <cyphar@cyphar.com>
* Alex Jia <ajia@redhat.com>
* Alexander Eldeib <alexeldeib@gmail.com>
* Andrey Tsygunka <dreamsider@mail.ru>
* Austin Vazquez <macedonv@amazon.com>
* Bjorn Neergaard <bjorn.neergaard@docker.com>
* Brian Goff <cpuguy83@gmail.com>
* Chengen, Du <chengen.du@canonical.com>
* Chethan Suresh <chethan.suresh@sony.com>
* Christian Happ <Christian.Happ@jumo.net>
* Cory Snider <csnider@mirantis.com>
* CrazyMax <crazy-max@users.noreply.github.com>
* Daniel, Dao Quang Minh <dqminh89@gmail.com>
* Danish Prakash <grafitykoncept@gmail.com>
* Davanum Srinivas <davanum@gmail.com>
* Eng Zer Jun <engzerjun@gmail.com>
* Eric Ernst <eric_ernst@apple.com>
* Erik Sjölund <erik.sjolund@gmail.com>
* Evan Phoenix <evan@phx.io>
* Francis Laniel <flaniel@linux.microsoft.com>
* Heran Yang <heran55@126.com>
* Irwin D'Souza <dsouzai.gh@gmail.com>
* Jaroslav Jindrak <dzejrou@gmail.com>
* Jonas Eschenburg <jonas.eschenburg@kuka.com>
* Jordan Rife <jrife0@gmail.com>
* Kailun Qin <kailun.qin@intel.com>
* Kang Chen <kongchen28@gmail.com>
* Kazuki Hasegawa <nanasi880@gmail.com>
* Kir Kolyshkin <kolyshkin@gmail.com>
* Markus Lehtonen <markus.lehtonen@intel.com>
* Masahiro Yamada <masahiroy@kernel.org>
* Mikko Ylinen <mikko.ylinen@intel.com>
* Mrunal Patel <mrunalp@gmail.com>
* Peter Hunt <pehunt@redhat.com>
* Prajwal S N <prajwalnadig21@gmail.com>
* Qiang Huang <h.huangqiang@huawei.com>
* Radostin Stoyanov <rstoyanov@fedoraproject.org>
* Rodrigo Campos <rodrigoca@microsoft.com>
* Ruediger Pluem <ruediger.pluem@vodafone.com>
* Sebastiaan van Stijn <github@gone.nl>
* Shengjing Zhu <zhsj@debian.org>
* Sjoerd van Leent <sjoerd.van.leent@alliander.com>
* SuperQ <superq@gmail.com>
* TTFISH <jiongchiyu@gmail.com>
* Tianon Gravi <admwiggin@gmail.com>
* Vipul Newaskar <vipulnewaskar7@gmail.com>
* Walt Chen <godsarmycy@gmail.com>
* Wang-squirrel <117961776+Wang-squirrel@users.noreply.github.com>
* Wei Fu <fuweid89@gmail.com>
* Zheao Li <me@manjusaka.me>
* Zoe <hi@zoe.im>
* cdoern <cdoern@redhat.com>
* dharmicksai <dharmicksaik@gmail.com>
* guodong <guodong9211@gmail.com>
* hang.jiang <hang.jiang@daocloud.io>
* lengrongfu <lengrongfu@lengrongfudeMacBook-Pro.local>
* lifubang <lifubang@acmcoder.com>
* utam0k <k0ma@utam0k.jp>
* wineway <wangyuweihx@gmail.com>
* yanggang <gang.yang@daocloud.io>
* yaozhenxiu <946666800@qq.com>

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>