I think it was a bad idea to put cryptographic APIs or VPN in the kernel. If userspace is too slow for this, you should either reduce context switch overhead, or create special kind of processes, which are isolated, but quick to switch into. They are repeating Windows mistakes.
1. I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
2. The write-to-RO-page-cache primitive STILL WORKED! It’s just that the particular exploit used had no meaningful effect in the already-root-in-a-container context. If you think you are safe, you’re probably wrong. All you need to make a new exploit is an fd representing something that you aren’t supposed to be able to write. This likely includes CoW things where you are supposed to be able to write after CoW but you aren’t supposed to be able to write to the source.
So:
- Are you using these containers with a common image or even a common layer in an image to isolate dangerous workloads from each other. Oops, they can modify the image layers and corrupt each other. There goes any sort of cross-tenant isolation.
- What if you get an fd backed by the zero page and write to it? This can’t result in anything that the administrator would approve of.
- What if you ro-bind-mount something in? It’s not ro any more.
> I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
I see a lot of projects blocking those sockets in containers as a response to this exploit, but it seems rather strange to me. We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time? It's a rather weird default to use. It's not like we're mass-disabling kernel modules everywhere every time someone discovers an EoP bug, do we? Did we blacklist OpenSSL's binaries after Heartbleed?
I suppose it makes sense as a default on vulnerable kernels (though people running vulnerable kernels should put effort into patching rather than workarounds in my opinion), but these defaults are going to be around ten years from now when copy.fail is a distant memory.
> We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time? It's a rather weird default to use.
The need for this feature/functionality in the fist place is questioned by some:
> As someone who works on the Linux kernel's cryptography code, the regularly occurring AF_ALG exploits are really frustrating. AF_ALG, which was added to the kernel many years ago without sufficient review, should not exist. It's very complex, and it exposes a massive attack surface to unprivileged userspace programs. And it's almost completely unnecessary, as userspace already has its own cryptography code to use. The kernel's cryptography code is just for in-kernel users (for example, dm-crypt).
> The algorithm being used in this [specific] exploit, "authencesn", is even an IPsec implementation detail, which never should have been exposed to userspace as a general-purpose en/decryption API. […]
> Did we blacklist OpenSSL's binaries after Heartbleed?
No, but lots of companies have since migrated away. OpenSSL was harder to move away from because there weren't as obvious drop-in replacements. Blocking a syscall that you never actually used is simple and effective.
In fairness, after heartbleed - there was quite a push to move away from openSSL - like Google's boring ssl, openbsd libressl and Mozilla/nss or gnutls - but the alternative here would be moving to a different kernel, like freebsd or open Solaris/Illumos ...
that's just moving to kernel that had 1000x less eyes on it. Yeah sure it will have less exploits but purely because nobody bothers to look when there are much juicer targets on Linux.
But I am disappointed that we still don't have clear OpenSSL successor, there is nothing to be salvaged from this mess of a project
1000x less eyes is true, but also: Linux, even in the kernel, has a long history of "move fast and break things".
Yes, the syscall API is (famously) stable, but the drivers, for example, are such a mess that many non-Linux projects prefer to take BSD drivers for e.g. WiFi despite them supporting far fewer devices (even if the Linux ones would be license compatible).
iiuc the AF_ALG interface only offers real performance wins if you have specialized hardware that the kernel can offload computations to. If you're not using that hardware, there's little reason not to do the crypto in userspace.
In fact, the authors specifically say on the very first line of their website that the copy/fail primitive can be used as a container escape. The entire premise of this article is flawed and irresponsible.
There is an addendum at the bottom where they admit the page corruption is still problematic even with rootless podman.
Although using this to justify their migration to micro-VMs is very strange to me. Sure for this CVE it would have been better, but surely for a future attack it could hit a component shared across VMs but not containers? Are people really choosing technology based on CVE-of-the-week?
These sorts of vulns are extremely common on Linux. This one is making the rounds for various reasons but it's a good justification for a migration away from containers if your threat model is concerned about it.
MicroVMs have much lower attack surface and you can even toss a container into one if you'd like.
Or use gvisor, which mitigates this vulnerability.
Containers were never a security boundary. VMs have better isolation, which is why people choose them for security. Containers are convenience and usually have better performance.
I see the ‘not a security boundary’ thing repeated constantly, and while it makes sense (eg. they’re sharing the underlying kernel or at least some access to it) if you think about it a little more, VMs are not magically different: they are better isolated, but VMs on the same host still share the host in common. A CVE next week that allows corruption of host state that affects eg every VM under a particular hypervisor will be no less damaging than this CVE is to containers
You are obviously right that these are similar in principle: VM isolation exploit would lead to the same exposure like container-related isolation exploits.
VMs are considered vastly better because the surface area where exploits can happen is smaller and/or better isolated within the kernel.
If you are arguing the latter is not true — and we are all collectively hand-waving away big chunk of the surface area so that may be the case — it would help to be explicit in why you believe an exploit in that area is similarly likely?
I would say it's the fact that "not a security boundary" appears to be a pass/fail statement, whereas the reality is more like a security continuum, along which VMs are further than containers.
> A CVE next week that allows corruption of host state that affects eg every VM under a particular hypervisor will be no less damaging than this CVE is to containers
Yeah this almost never happens though whereas Linux privesc is 10x a day.
They may not provide isolation as VMs but they clearly do limit some attacks. VMs do not provide the same isolation as using physically separate hardware either.
I would have thought they provide better isolation than using multiple users which is the traditional security boundary.
It might depends on what you mean by a container? Are sandboxes such as Bubblewrap and Firejail containers?
Containers are a convenience boundary and they increase complexity of your risk assessments.
It is easy for security scanners to scan a Linux system, but will they inspect your containers, and snaps, and flatpaks, and VMs? It is easy for DevOps to ssh into your Linux server, but can they also get logged in to each container, and do useful things? Your patches and all dependencies are up-to-date on your server, but those containers are still dragging around legacy dependencies, by design. Is your backup system aware of containers and capable of creating backup images or files, that are suitable for restoring back to service?
> Security scanners already support most container and VM image formats in widespread use.
E.g.,
> Container Security stores and scans container images as the images are built, before production. It provides vulnerability and malware detection, along with continuous monitoring of container images. By integrating with the continuous integration and continuous deployment (CI/CD) systems that build container images, Container Security ensures every container reaching production is secure and compliant with enterprise policy.
You need a tool like Anchore and PrismaCloud to scan the container images then monitor them in runtime with PrismaCloud. Trellix can “scan” however most people turn off or exclude container directories on the host because it can interfere with the running container.
If the goal is just preventing full root privileges, a CapabilityBoundingSet in a systemd unit will do.
However copy fail can be used in many other ways not contained by containers or the above settings. For example it can modify the /etc/ssl/certs to prepare for MitM attacks. If you have multiple containers based on the same image then one compromised CA set affects another.
tl;dr - within the container, the exploit works, and elevates to root (uid 0) within the container - BUT because that namespace actually maps to uid 1000 (the user) outside the container, the escalation does not flow up to the host.
But… does this escape the container? If not (the author seems to indicate it does not) then does it matter if you are in Docker or rootless Podman, right, since the end result is always: you have elevated to root within the container. If the rest of the container filesystem isolation does its job, the end result is the same? Though I guess another chained exploit to escape the container would be worse in Docker? Do I have that right?
This is a problem and most people hadn’t considered it before because the caching is done to speed up build pipeline performance:
“ While rootless containers prevent the attacker from escalating to host root, the page cache is still shared across the host. Containers that re-use the same base image layers share the same cached pages for those layers — if a malicious CI job corrupts a binary in the page cache, other containers launched from that same image could end up executing the poisoned version.”
I'm no expert, but the kernel is shared between all containers and the host.
I don't believe the kernel maintains separate page caches for each container; a malicious CI job could corrupt a binary from any container, or the host.
It’s manipulating the binary to make it as small as possible. In golf, the lowest score wins. So, in this context, the smallest binary that still works wins.
Nowhere in the article is mentioned user namespaces completely mitigate the vulnerability, page cache corruption still happens but not being able to obtain root in the target host increases the attack vector to more than just a one liner into having to figure out whether specific shared base image layers are in use and by whom and by what binaries (think of a shared CI platform like the one we run for GNOME).
The article does not prove that you can't get root on the host via page cache corruption, just that the specific exploit strategy they tried didn't work.
There's a specific reason why the exploit targets a setuid binary, if you poison it in memory it will be executed with the permissions of the user owning it, in this case root, meaning a setuid(0) + spawning a new shell will effectively give you root access on the host system, this for systems where uid=0 is equivalent inside and outside the container itself. The vulnerability is still there and is deadly serious, with rootless containers the attack vector just increases, the attacker will have to identify other factors (what containers are using a shared base image, what binaries are being called, what binaries should be overridden in memory etc).
On top of this there's another thing worth mentioning, it's a common thing in Openshift (for non rootless podman) to allow CAP_SETGID/CAP_SETUID for being able to create a container within a container (this is called the allowPrivilegeEscalation in SCCs), that effectively grants you the ability to become uid=root in the container and in that scenario uid=0 matches the host uid=0. The important difference is that specific instance of the root user doesn't have CAP_SYS_ADMIN (or most of the other privileged kernel capabilities) meaning the actions the user can then perform are very limited.
that just prevents the faulty module from loading. So you have time to fix it properly (kernel upgrade)
Technically there should be zero impact (the very very few tools that use it will fall back to userspace), I haven't even found that module loaded in infrastructure
Then check if it is loaded, and if it is, unload/reboot
That's true... for the exploit demo that they released. The primitive that underlies the exploit, however -- a page cache write -- can easily bypass the container boundary. One only needs to hook an executable which is also present in the host.
> (like reading env vars and sending them to an external server) it'd not be able to send credentials or fetch a malware remotely at all due to the DNS queries being intercepted by eBPF and being sent to a CoreDNS proxy.
Wouldn’t the exploit then just use ip addresses directly?
You can work with the idea of a DNS whitelist, as in you pass a list of allowed DNS entries via your .gitlab-ci.yml (or separate config) resolution happens and those entries (IPs) are stored in a list, any other IP not present in that list gets denied by eBPF (which can easily be used to rewrite the source and destination of a packet before the packet actually reaches the NIC for dispatch)
The dedicated website: https://copy.fail
1. I would hope the default seccomp policy blocks AF_ALG in these containers. I bet it doesn’t. Oh well.
2. The write-to-RO-page-cache primitive STILL WORKED! It’s just that the particular exploit used had no meaningful effect in the already-root-in-a-container context. If you think you are safe, you’re probably wrong. All you need to make a new exploit is an fd representing something that you aren’t supposed to be able to write. This likely includes CoW things where you are supposed to be able to write after CoW but you aren’t supposed to be able to write to the source.
So:
- Are you using these containers with a common image or even a common layer in an image to isolate dangerous workloads from each other. Oops, they can modify the image layers and corrupt each other. There goes any sort of cross-tenant isolation.
- What if you get an fd backed by the zero page and write to it? This can’t result in anything that the administrator would approve of.
- What if you ro-bind-mount something in? It’s not ro any more.
I see a lot of projects blocking those sockets in containers as a response to this exploit, but it seems rather strange to me. We're disabling a cryptographic performance enhancement feature entirely because there was a security bug in them that one time? It's a rather weird default to use. It's not like we're mass-disabling kernel modules everywhere every time someone discovers an EoP bug, do we? Did we blacklist OpenSSL's binaries after Heartbleed?
I suppose it makes sense as a default on vulnerable kernels (though people running vulnerable kernels should put effort into patching rather than workarounds in my opinion), but these defaults are going to be around ten years from now when copy.fail is a distant memory.
The need for this feature/functionality in the fist place is questioned by some:
> As someone who works on the Linux kernel's cryptography code, the regularly occurring AF_ALG exploits are really frustrating. AF_ALG, which was added to the kernel many years ago without sufficient review, should not exist. It's very complex, and it exposes a massive attack surface to unprivileged userspace programs. And it's almost completely unnecessary, as userspace already has its own cryptography code to use. The kernel's cryptography code is just for in-kernel users (for example, dm-crypt).
> The algorithm being used in this [specific] exploit, "authencesn", is even an IPsec implementation detail, which never should have been exposed to userspace as a general-purpose en/decryption API. […]
* https://news.ycombinator.com/item?id=47952181#unv_47956312
More than one time.
> a cryptographic performance enhancement feature
It's very rarely used.
> Did we blacklist OpenSSL's binaries after Heartbleed?
No, but lots of companies have since migrated away. OpenSSL was harder to move away from because there weren't as obvious drop-in replacements. Blocking a syscall that you never actually used is simple and effective.
But I am disappointed that we still don't have clear OpenSSL successor, there is nothing to be salvaged from this mess of a project
Yes, the syscall API is (famously) stable, but the drivers, for example, are such a mess that many non-Linux projects prefer to take BSD drivers for e.g. WiFi despite them supporting far fewer devices (even if the Linux ones would be license compatible).
To my knowledge, not many things were using the in-kernel code anyways, the recommended way is to use userland tools...
It's optional for openssl, systemd apparently needs it, but deleting the module from one of my systems didn't cause any issues. /shrug
there is no reason it would be default policy. Else might as well block every socket and just multiplex everything on stdin/out
share and enjoy!
You may be on to something…
Oh, an this [2] just happened
[1] https://github.com/containers/oci-seccomp-bpf-hook/pull/209 [2] https://github.com/moby/moby/pull/52501
Although using this to justify their migration to micro-VMs is very strange to me. Sure for this CVE it would have been better, but surely for a future attack it could hit a component shared across VMs but not containers? Are people really choosing technology based on CVE-of-the-week?
MicroVMs have much lower attack surface and you can even toss a container into one if you'd like.
Or use gvisor, which mitigates this vulnerability.
VMs are not different due to 'magic' but through hardware assist with things like Intel VT-x and AMD-V:
* https://en.wikipedia.org/wiki/X86_virtualization#Hardware-as...
* https://blog.lyc8503.net/en/post/hypervisor-explore/
* https://binarydebt.wordpress.com/2018/10/14/intel-virtualisa...
VMs are considered vastly better because the surface area where exploits can happen is smaller and/or better isolated within the kernel.
If you are arguing the latter is not true — and we are all collectively hand-waving away big chunk of the surface area so that may be the case — it would help to be explicit in why you believe an exploit in that area is similarly likely?
> A CVE next week that allows corruption of host state that affects eg every VM under a particular hypervisor will be no less damaging than this CVE is to containers
Yeah this almost never happens though whereas Linux privesc is 10x a day.
I would have thought they provide better isolation than using multiple users which is the traditional security boundary.
It might depends on what you mean by a container? Are sandboxes such as Bubblewrap and Firejail containers?
It is easy for security scanners to scan a Linux system, but will they inspect your containers, and snaps, and flatpaks, and VMs? It is easy for DevOps to ssh into your Linux server, but can they also get logged in to each container, and do useful things? Your patches and all dependencies are up-to-date on your server, but those containers are still dragging around legacy dependencies, by design. Is your backup system aware of containers and capable of creating backup images or files, that are suitable for restoring back to service?
Does this increase complexity? Yes, it does. Is it worth the cost? Depends on each individual case IMO.
E.g.,
> Container Security stores and scans container images as the images are built, before production. It provides vulnerability and malware detection, along with continuous monitoring of container images. By integrating with the continuous integration and continuous deployment (CI/CD) systems that build container images, Container Security ensures every container reaching production is secure and compliant with enterprise policy.
* https://docs.tenable.com/enclave-security/container-security...
Couldn't you then simply re-run the exploit again as unprivileged podman user and gain root on the host?
If you can orchestrate a container escape from the container's "root", then you're on to something.
However copy fail can be used in many other ways not contained by containers or the above settings. For example it can modify the /etc/ssl/certs to prepare for MitM attacks. If you have multiple containers based on the same image then one compromised CA set affects another.
But… does this escape the container? If not (the author seems to indicate it does not) then does it matter if you are in Docker or rootless Podman, right, since the end result is always: you have elevated to root within the container. If the rest of the container filesystem isolation does its job, the end result is the same? Though I guess another chained exploit to escape the container would be worse in Docker? Do I have that right?
“ While rootless containers prevent the attacker from escalating to host root, the page cache is still shared across the host. Containers that re-use the same base image layers share the same cached pages for those layers — if a malicious CI job corrupts a binary in the page cache, other containers launched from that same image could end up executing the poisoned version.”
I don't believe the kernel maintains separate page caches for each container; a malicious CI job could corrupt a binary from any container, or the host.
On top of this there's another thing worth mentioning, it's a common thing in Openshift (for non rootless podman) to allow CAP_SETGID/CAP_SETUID for being able to create a container within a container (this is called the allowPrivilegeEscalation in SCCs), that effectively grants you the ability to become uid=root in the container and in that scenario uid=0 matches the host uid=0. The important difference is that specific instance of the root user doesn't have CAP_SYS_ADMIN (or most of the other privileged kernel capabilities) meaning the actions the user can then perform are very limited.
Technically there should be zero impact (the very very few tools that use it will fall back to userspace), I haven't even found that module loaded in infrastructure
Then check if it is loaded, and if it is, unload/reboot
If algif_aead was a builtin module, it needs to be disabled by adding initcall_blacklist=algif_aead_init to the boot cmdline.
However initcall_blacklist requires the kernel to be built with CONFIG_KALLSYMS.
Wouldn’t the exploit then just use ip addresses directly?