Prior to v28, iptables rules were written in such a way that they depended upon the default `FORWARD` policy. To get proper container isolation, that default policy had to be set to `DROP`.
That's not the case anymore. Iptables rules have been rewritten to not depend on that default policy, but we still set it because users might (un)knowingly depend on it to secure their systems, and we thought it wasn't worth the trouble to change that after so many years. However, we added an escape hatch in the form of a new daemon parameter, `ip-forward-no-drop`, so that users who don't want that default policy aren't forced to disable iptables integration altogether.
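To opt out of the `DROP` policy while keeping iptables integration, that parameter is passed to the daemon. A minimal sketch, assuming the usual flag-to-`daemon.json` key mapping (double-check the dockerd reference for the exact spelling):

    # /etc/docker/daemon.json
    {
      "ip-forward-no-drop": true
    }

or, equivalently, start the daemon with `dockerd --ip-forward-no-drop`.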
We published a blog post about that, and other security hardening measures we took in v28: https://www.docker.com/blog/docker-engine-28-hardening-conta...
v29.0 will have support for nftables. It'll be marked as experimental in the first few releases to allow us to change anything without worrying about backward compatibility. However, it already provides the same feature coverage as iptables. Things will be a bit different with this firewall backend though: the Engine will refuse to start if the sysctl `net.ipv4.ip_forward` is not set to 1. Users will have to set it on their own, consider the security implications, and take the necessary measures to block forwarding between non-Docker interfaces. Our rules will be isolated in their own nft table, so hopefully it'll feel less like "Docker owns the system".
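Concretely, the setup looks something like this. The sysctl part is standard Linux; the `firewall-backend` key is how I recall the experimental opt-in being spelled, so treat it as an assumption and check the release notes:

    # enable IPv4 forwarding now, and persist it across reboots
    sudo sysctl -w net.ipv4.ip_forward=1
    echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-ip-forward.conf

    # /etc/docker/daemon.json: opt in to the nftables backend
    {
      "firewall-backend": "nftables"
    }

    # after a daemon restart, Docker's rules sit in their own table:
    sudo nft list tables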
> Docker’s lack of UID isolation by default
This is not my area of expertise, but this omits that user namespaces tend to drastically increase the kernel attack surface (despite what some vendors say). For instance: https://blog.qualys.com/vulnerabilities-threat-research/2025....
> Docker makes it quite difficult to deploy IPv6 properly in containers, [...] since Docker relies on NAT [...] The only way around this is to… write your own firewall rules
This is not true anymore. We added a network-level parameter to use IPv6 without NAT while keeping the semantics of `-p` (the port-publishing flag).
For instance, you can create a non-NAT / "routed" network with: `docker network create -o com.docker.network.bridge.gateway_mode_ipv6=routed --ipv6 testnet`. That network will get a ULA subnet assigned if no IPv6 `--subnet` was provided.
If you run a container with a published port, e.g. `docker run --network testnet -p 80/tcp …`, your container's port 80 will be accessible but other ports won't be.
The downside of that approach is that some or all of the routers on your local network need to learn about this subnet so they can route it to the Docker host.
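Putting the pieces together (the image name and router addresses below are made up for illustration):

    # create the routed IPv6 network; a ULA subnet is assigned
    # if no --subnet is given
    docker network create --ipv6 \
      -o com.docker.network.bridge.gateway_mode_ipv6=routed testnet

    # only the published port is reachable from off-host
    docker run -d --network testnet -p 80/tcp nginx

    # find out which subnet was assigned
    docker network inspect testnet \
      --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}'

    # then, on the upstream router (hypothetical addresses):
    # ip -6 route add fd00:f00d::/64 via 2001:db8::1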
If anything, it's a problem with the design of UNIX process management, inherited thoughtlessly, which Docker decided not to deal with on its own. Why does there have to be a whole special, unkillable process whose only job is to call wait(2) in an infinite loop?

Because in the original UNIX design, Ken Thompson apparently did not want to do too much work in the kernel during exit(2): if process A calls exit(2) while having 20 already-exited children it didn't wait for, you either have to reap those 20 processes right there (which involves reading their PCBs from the swap on disk), and then potentially reap their already-exited children, and their grandchildren... or you can just iterate over the process table, set the ppid of A's children to 1, and schedule PID 1 to run and let it deal with reaping one process at a time in wait(2). Essentially, the work is pushed to the scheduler, but the logic itself lives in user space at the cost of PID-space pollution.
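That reaping loop is essentially all a container init like tini does. A minimal sketch of the idea in C (not tini's actual code; error handling and exit-status propagation omitted):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2)
            return 2;

        pid_t child = fork();
        if (child == 0) {
            execvp(argv[1], argv + 1);  /* run the real workload */
            _exit(127);                 /* exec failed */
        }

        /* As PID 1, every orphaned process in the container gets
         * reparented to us, so reap zombies one wait(2) at a time. */
        for (;;) {
            pid_t pid = wait(NULL);
            if (pid == child || pid < 0)  /* workload gone, or no children left */
                break;
        }
        return 0;
    }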