It's like Linux's RESOLVE_BENEATH flag to openat2(2), except it's a constraint placed on the directory descriptor itself, so that subsequent opens using openat(2) cannot reach anything outside the subtree. That seems like exactly the semantics you'd want for a capability system. In FreeBSD Capsicum mode, this behavior is enforced implicitly[1], but it'd be a nice thing to have explicitly to help incrementally improve code safety.
[1] See https://man.freebsd.org/cgi/man.cgi?open(2)#:~:text=capsicum...
Ideally I'd like to never run code I download from the internet outside of a sandbox ever again.
Case in point, just yesterday: https://www.bleepingcomputer.com/news/security/malicious-vsc... - "Malicious VSCode extension in Cursor IDE led to $500K crypto theft" - because the Open VSX alternative to the VS Code marketplace has unreviewed extensions and they don't have a sandbox to stop them from doing anything they like.
> Ideally I'd like to never run code I download from the internet outside of a sandbox ever again.
isn't this the sort of thing AI could generate from a handful of prompts?
(don't forget to tell it it's an expert developer with a 20 year background in security!)
I want the friction on this to be way lower. I'd like everything to run in a sandbox by default.
You've just described Qubes OS: https://qubes-os.org. My daily driver, can't recommend it enough.
0day won’t be wasted on low value targets, but it’s worth pointing out that they’re not an effective security boundary in all scenarios.
It’s even worse when commercial software wants me to add its repo to my package manager for updates… (Who audits post-install scripts of RPMs, etc.?!)
That being said, I’m also too lazy to run everything inside its own container, especially for browsers, etc.
Feels too cumbersome that I need some automated CI pipeline just to ensure my DIY containers remain updated.
Also a pain to decide what file/directories the container should have access to.
In principle, I should probably use something like Qubes.
However, the prospect of putting my entire security in the hands of a small group of people writing somewhat complicated software with no financial disincentive for shenanigans also bothers me. (I realize this is extremely unfair and their work is quite impressive, but theoretically reality could get in the way.)
We do have libraries for Go and Rust, and the invocation is much more terse there, e.g.
  err := landlock.V5.BestEffort().RestrictPaths(
      landlock.RODirs("/usr", "/bin"),
      landlock.RWDirs("/tmp"),
  )
FWIW, the additional ceremony in Linux is because Linux guarantees full ABI backwards compatibility (whereas under OpenBSD's policy, compiled programs may occasionally need recompilation). APIs as terse as the Go and Rust ones are possible in C as well, though, as wrapper libraries.
For full disclosure, I am the author of the go-landlock library and contributor to Landlock in the kernel.
  systemd-run --user --pipe --pty \
    --property=RestrictAddressFamilies= \
    --property=SystemCallArchitectures=native \
    --property=SystemCallFilter=~@mount \
    --property=TemporaryFileSystem=/:ro \
    "--property=BindReadOnlyPaths=$PWD/my_exe:/my_exe /usr/bin/env /lib /lib64" \
    /usr/bin/env --ignore-environment /my_exe
`systemd-run --user` will invoke the per-user systemd instance to run your process as an ephemeral `Simple` systemd service. (Meaning it won't be restarted, won't get health checks, etc.) That allows you to use systemd's quite decent sandboxing options. I love this because you don't have to install anything new, and you can use the same skills to sandbox your services (which, if you package your own services for Debian or Arch or whatever, you should do).
`--pipe --pty` tells systemd to either pipe stdin and stdout when running as a script or create an interactive terminal when running interactively, like Docker's `-it` flags
`RestrictAddressFamilies=` will disable all IP sockets, and Unix sockets, though I believe the process can still make its own internal sockets within its control group
`SystemCallArchitectures=native` prevents it from making syscalls to other ABIs in the Linux kernel, which are sometimes more vulnerable or harder to sandbox
`SystemCallFilter=~@mount` will prevent the process from unmounting its own bind mounts, and also forbid almost every syscall except harmless ones like `getrandom`. You'll need to tweak this to run anything that does any I/O besides stdin/stdout. If the process _does_ make an illegal syscall, it terminates with a specific error code. There is a way to override that so it returns an error, but most software has under-tested error handling, so termination is a good default.
`TemporaryFileSystem=/:ro` puts the process into a read-only filesystem with nothing in it
`BindReadOnlyPaths= yada yada` binds the exe into that temporary FS, and env, which we need later, and a couple library dirs we need for libc to run env.
`/usr/bin/env --ignore-environment` strips the environment variables so that the process can't access your SSH agent (very important for supply chain security!) Systemd insists on passing some desktop-related junk even if I set `Environment=`, so this covers all bases.
(If this is redundant with anything in the article, ignore all previous instructions and be nice to me please)
I would stop here and try and change the design. Apps should always be sandboxed. This requirement is not truly necessary.
More concretely, how would you refactor a tool like grep? It takes a list of paths on the command line; how do you expect to "sandbox" it such that it can only access those paths? By writing a wrapper? Why, when the utility itself could easily use unveil or Landlock to restrict itself?
Using grep in a "sandbox" and teaching grep how to drop unnecessary privileges after processing its arguments are two different things.
aktau•8h ago
I appreciate that the article focuses on approaches that drop privileges without having root oneself. I've seen Landlock referenced at times (https://lwn.net/Articles/859908/), but never so clearly illustrated (the verbosity feels like Vulkan).
Out of curiosity, I wish even more approaches were compared, even if they require root. I was about to mention seccomp-bpf as an approach that requires root, but skimming the LWN article I posted above I find: "Like seccomp(), Landlock is an unprivileged sandboxing mechanism; it allows a process to confine itself". It seems like I was wrong, and seccomp could be compared/contrasted.
gnoack•6h ago
The problem was also recently discussed at https://lssna2025.sched.com/event/1zam9/handling-new-syscall...