Unprivileged sandboxing of popular script languages.

Alice: "Why not Docker?" Bob: "Because sudo docker."

Built with landlock and seccomp:

- lua -> hlua
- python -> hpython
- node -> hnode
- bash -> hsh
This project is a work in progress and has not been audited by security experts.
However, I think it remains useful for educational purposes: exploring Linux's
sometimes daunting security features, and using strace to illustrate how a
program written in a high-level language is translated into syscalls to obtain
its desired (or undesired) effects.
... and it is certainly better than nothing, as I will try to exemplify in the following section. But as always, remember that sandboxing and containerization only limit the extent of a successful attack; they don't give you carte blanche to willy-nilly execute untrusted code.
So given that disclaimer, why did I write this?
Showcasing Linux's security features is only a secondary goal; my primary goal
is for the reader to add strace to her list of favorite tools.
Alice's game
Assume Alice is a game designer with malicious intent and you are her intended
victim.
Being a fan of indie games you of course accept to be a beta-tester for her
latest creation.
She sends you the fun.lua game, and hidden within is the statement:

```lua
os.execute("sudo rm -rf --no-preserve-root /")
```

(or she'll try sudo --askpass if the credentials aren't cached).
A diligent code-reviewer might catch such an obviously malicious
statement, but it can be surprisingly easy to miss in a hurried
glance; try to allow yourself only a few seconds to read the following:
```lua
function run(cmdline)
  local s = os.getenv("SUDO")
  if not s then
    cmdline = "sudo -A " .. cmdline
  end
  os.execute(cmdline)
end

function clean_cache()
  local project = os.getenv("PROJECT_ROOT") or ".."
  local cache = os.getenv("CACHE") or "/tmp"
  run(string.format("rm -r %s/%s", cache, project))
end
```

Did you spot the malicious or unintended transposition?
This is the hardship presented to us by PR culture, and it can provide a false
sense of security.
There are also programming languages
designed to be difficult to read.
And speaking of programming languages:
"the greatest thing about Lua is that you don't have to write Lua."
Meaning that it's very feasible to bundle a compiler for another language,
however non-esoteric (check out: fennel and
Amulet).
But Lua (as well as Python, Node.js, C and many, many more) is an
any-effect-at-any-time language.
This in contrast with Haskell
(check out Learn You a Haskell for Great Good!)
or maybe eff if you're feeling adventurous.
That means that an expected pure/side-effect free operation such as compiling a
piece of source code can include an obfuscated os.execute-attack or worse
if the attacker has a more insidious mind.
Considering that compilers are usually quite extensive pieces of software
they provide ample forestry to hide a malicious tree.
Alice, I suggest you split your malicious code into several commits and PR:s
(preferably large ones close to a deadline).
For the victim, I recommend Ken Thompson's "Reflections on Trusting Trust",
which if you haven't read I expect will shatter any trust you might have
imagined you had in any binary executable
(going all the way back to punchcards and the PDP-1).
This may seem ridiculous, but OCaml
(my yardstick language of languages)
still bundles a bootstrapping binary compiler
to build subsequent compilers: this is very much
"trusting trust".
Even more so since Coq is implemented
in and thus compiled by OCaml; now your trust stack ends in a binary blob:
do you trust it? And do you have the time for the incredibly time-consuming
task of verifying that no malicious opcodes hide within?
So the world is a scary and unsatisfactory environment; let's consider mitigating the consequences of malicious and/or incompetently written code.
Enter no new privileges
Alice's sudo-based rm-attack can be mitigated by a one-liner:
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0).
This call is not expected to fail, but being a conscientious developer it never
hurts to crash-don't-thrash and I present a
copy-pastable snippet:
```c
#include <sys/prctl.h>
#include <stdlib.h>

void no_new_privs(void)
{
	if(0 != prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
		abort();
	}
}
```

You might prefer exit.
I don't: libc:s commonly provide
atexit
which in my opinion is contrary to a fail-early/crash-don't-thrash philosophy:
the operating system already has to assume the responsibility of cleaning up
after a failing process.
(Ever noticed that C coders don't free their allocations when exiting?)
Using exit and atexit reminds me of languages with exceptions and the
nightmare when exception handlers raise exceptions ad infinitum.
Instead consider programming models where failure-is-always-an-option thinking
is prevalent, such as the actor model:
where the non-delivery of a message is a scenario
brought to the forefront (with the real-world scenario of the fallibility of
network connections).
If you are curious I recommend Erlang
(check out Learn You Some Erlang for great good!).
(Erlang does not implement the Actor model in a strict way, but provides a very
enjoyable way to explore its concepts while writing highly concurrent
applications.)
Back to mitigating Alice's attacks: the above
no_new_privs
call is so simple it should always, always, be used:
unless explicitly necessary to gain new privileges.
This is the Principle of least privilege:
if the functionality you intend to provide does not require privileges, your
process should not have any privileges; this is the common thread running
through this attempted raison d'être.
But in the RealWorld,
processes inherit quite a handful of privileges that Alice can still abuse,
as we shall see.
So Alice can't sudo anymore thanks to PR_SET_NO_NEW_PRIVS,
but even a sneaky os.execute("rm -r ~") would still
be a major buzzkill.
The naive Lua specific mitigation is to os.execute = nil before running the
entrypoint of Alice's game.
Well, that may be good enough
(I haven't figured out a way around that mitigation, but I'm reasonably sure
there is an exploit and would be interested in seeing it).
Continuing this idea we can tweak it into at least making this first naive
mitigation useful:
```lua
os.execute = function() error("not allowed") end
```

Especially since the mitigation I suggest below does not even allow the program to try to provide a user-friendly error message.
Enter seccomp
Seccomp is Linux's way of filtering syscalls and so limiting the exposed kernel surface.
Fancy words aside, this means that when the notification of a new vulnerability arrives in your email inbox, you can feel certain that you are not affected because the vulnerable syscalls are rejected by your program. If you aren't subscribed to any CVE mailing lists I recommend:
- Arch Linux's,
- Ubuntu's or
- OpenBSD's mailing lists.
The simplest seccomp filters are essentially accept/reject lists, but they can do more complex things. But as always when it comes to security: easily understandable code is always preferred.
Back to Alice's os.execute-based attacks:
with seccomp enabled with a filter that forbids exec:s,
the kernel will politely kill your process and suggest to the rest of the
system that you received a SIGSYS signal.
In practice this means that your process immediately vanishes, so without a
syscall inspection tool such as strace one is reduced to debugging by:
thou shalt printf("are we nearly here yet?");
Enter strace
If you haven't invoked strace before, or you are curious what syscalls are being used by a program, then try:

```shell
strace lua -e 'print("hello")'
strace python -c 'print("hello")'
```

The output of strace can be quite extensive (and therefore strace provides
sophisticated ways to filter what is traced).
For our hello world example
the interesting syscall can be found towards the end:

```
write(1, "hello\n", 6) = 6
```
Other interesting syscalls to look for are memory allocating syscalls such as
brk and
mmap:
```
mmap(NULL, 151552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f55a8b74000
```
Continuing our investigation of Alice's os.execute-attack, we can trace an
almost as trivial piece of code:
```shell
strace -f lua -e 'os.execute("echo hello")'
strace -f python -c 'import os; os.system("echo hello")'
```

Note the -f (--follow-forks) option that tells strace to continue tracing
spawned child processes.
And now we look for
clone (the syscall that implements
fork)
and execve.
From the trace in the parent:
```
clone3({flags=CLONE_VM|CLONE_VFORK, exit_signal=SIGCHLD, stack=0x7f8160953000, stack_size=0x9000}, 88) = 2304236
```
and in the child:
```
execve("/bin/sh", ["sh", "-c", "echo hello"], 0x7ffebf549aa8 /* 52 vars */) = 0
```
So why not reject clone as well?
Remember, in Linux threads and processes are the same abstraction:
essentially one with shared virtual memory space and the other without
(but even a casual glance at clone's options makes the difference no longer so
clear cut).
Now with clone rejected: both threads as well as processes are no longer
things you need to reason about.
So how do we actually tell seccomp what to accept and what to reject?
Enter Berkeley Packet Filters
Seccomp filters are expected to be binary representations of cBPF, where the "c" stands for "classic" (in contrast with extended BPF, eBPF).
cBPF is not even theoretically Turing complete,
since it lacks unbounded memory: it is restricted to the scratch memory
uint32_t M[16].
That only presents an interesting challenge:
which Project Euler or
Advent of Code problems can be solved in cBPF?
Therefore working with seccomp can provide somewhat of a challenge. So you may want to use an assembler and a preprocessor (I've bundled them together as bpfc) that can interpret the constants commonly used when making syscalls.
I always start with a "reject everything" filter:

```
bad: ret #$SECCOMP_RET_KILL_THREAD
good: ret #$SECCOMP_RET_ALLOW
```
then run a test under strace, look for SIGSYS, reason about the offending
syscall, reluctantly add it to the allowed list and iterate:
```
ld [$$offsetof(struct seccomp_data, nr)$$]
jeq #$__NR_brk, good
bad: ret #$SECCOMP_RET_KILL_THREAD
good: ret #$SECCOMP_RET_ALLOW
```
Eventually when the test passes you have achieved a list of syscalls living up to the principle of least privilege.
The filter you produce might appear very long, but remember that Linux has a
massive amount of syscalls.
Viewed from a security perspective even a moderately long filter is still
a huge reduction of the exposed kernel surface.
But a simple (but still very effective) yes/no approach to filtering
syscalls falls short when it encounters the
"functionality-grouping" syscalls such as
fcntl,
ioctl and
prctl
(which we encountered above).
For these syscalls it becomes necessary to inspect the call arguments
(from hlua's filter):
```
jne #$__NR_fcntl, fcntl_end
ld [$$offsetof(struct seccomp_data, args[1])$$]
jeq #$F_GETFL, good
jmp bad
fcntl_end:
```
Sometimes it may be useful to "tamper" with a syscall instead of rejecting it
outright:
return -1 and set errno to EPERM or ENOSYS to allow a child to recover:
see for example the prlimit check in
hnode's seccomp filter.
Doing the "test-n-strace" dance for a non-trivial test-case you quickly end up
with a filter usually including the
read, write and close syscalls.
(Unsurprisingly these have syscall numbers:
0, 1 and 3.)
write is particularly fun to think about: without it
how can you communicate the result of any computation
in a "everything is a file" system?
The syscall filtering way of expressing this is
seccomp's strict mode:
only allow read, write, exit and sigreturn. The reasoning is that you are only
allowed to use already opened file descriptors (since in this setting
open is forbidden, or more accurately not expressly allowed).
But even moderately interesting Lua applications enjoy using
require.
So it's not unreasonable to allow Lua to open files (which fills in the
number 2 syscall numbering slot).
But then Alice changes her fun.lua game to include (obfuscated of course):
```lua
io.open(os.getenv("HOME") .. "/.aws/credentials", "r"):read("*a")
```

Now Alice has to get this information back to herself, but maybe it's a multiplayer game? Or she obfuscates it in the game's log file and exclaims: "Oh, the game crashed, why don't you send me the logs?"
Alice's intentions might only go as far as
griefing, and she will try to
os.remove your access tokens.
Alice, try removing Chrome/Firefox cookies as well.
This would definitely lose me my sunny disposition.
Removing files maps to the unlink
syscall.
It certainly makes sense to reject it in most cases,
but a plausible legitimate use of unlink is removing intermediate
files (created during compilation, maybe).
So what do we do about Alice's intent to remove your
with-blood-sweat-and-tears.doc file?
Enter landlock
Landlock is a fairly recently added security feature, meant to restrict filesystem access for unprivileged processes in addition to the standard UNIX file permissions. (I will argue that landlock is fairly recent since, at the time of writing, its syscalls have the highest syscall numbers.)
In essence landlock
grants or restricts rights to filesystem operations
on whole filesystem hierarchies. (Note that a single file is a trivial
hierarchy.)
So we can grant read access to /usr/lib only and mitigate Alice's attack on
your access tokens in your home directory. And maybe allow both read and write
to /tmp, and maybe allow removing (i.e. unlinking).
Unless you allow open's
O_TMPFILE flag
in your seccomp filter of course.
The reason this section is bare of example code is that I found, and hope you will too, the concepts behind landlock easily understandable and yet very powerful. Therefore I will not include any sample code here: the sample code provided with landlock is excellent, and is relatively verbatim what I use. My experienced ratio of positive security impact versus time spent learning the feature is huge.
My one criticism of the current implementation of landlock is the inability
to hide files: that is, even though landlock restricts access to a
file, for example /etc/passwd, then stat (or similar) responds with
EACCES instead of ENOENT.
The knowledge that a Linux installation has an /etc/passwd may be of limited
value, but revealing that ~/.aws/credentials exists
can enable an attacker to target her attack more effectively against the
discovered files.
Furthermore an attacker can, given enough access to
(lots, lots and lots of)
system time, enumerate your entire file tree.
This is of course a ridiculous endeavour;
NAME_MAX is 255, and ('z'-'a') + ('Z'-'A') + ('9'-'0') = 59, so the number of
filenames to try is bounded from below by
59^255:
which evaluates to an integer that starts with 36920 and ends with 89299
(not mentioning the other 442 digits).
The counter-argument is that there are other, perhaps better, ways of achieving
this functionality
(chroot maybe),
reducing my criticism to a mere down-prioritized item on a wishlist.
The wrinkle in our concrete setting of providing script hosts is that sometimes
the interpreters want to dynamically load shared libraries which boast a
notorious elusiveness and never appear in the same place twice
(which implies we at least know their velocity).
Hence I have added a set of tools to, at compile time,
tell hpython's landlock rules to allow read access to the path
where the embedded Python instance will look for, say, the libz library.
This is the functionality exercised in
hpython's import test.
The journey starts with the paths utility:

```shell
paths --python-site -lz
```

which for my system suggests that these file system trees are of particular interest:

```
/usr/lib/python3.10
/usr/lib/libz.so.1
```
Behind the scenes dlinfo is used to
resolve the shared libraries.
The paths are then
inspected and converted into a relevant landlock rules code snippet
which is then included and applied in the main program.
Continuing the same wrinkle into the hsh project,
which executes a bash
in the same security setting as the other script host programs,
produces another complication.
Being a fully-fledged shell it desires to link quite a bit
of dynamic libraries, which in turn desire even more of that shared binary
goodness.
ldd /bin/bash exposes the extent of
their desire:

```
linux-vdso.so.1 (0x00007ffe3bed2000)
libreadline.so.8 => /usr/lib/libreadline.so.8 (0x00007f45e8bcc000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f45e8bc7000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f45e89e0000)
libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x00007f45e896c000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f45e8d44000)
```
That indeed is a wrinkle to iron out given the
diversity of Linux distributions.
My awk-front-legs and sed-mid-legs twitch,
but there is a better approach using objdump:

```shell
objdump -p /path/to/program | grep NEEDED
```

which said insect has bundled into the poor_ldd utility
(using the above-mentioned dlinfo-based lib utility).
Now poor_ldd /bin/bash produces a similar output as ldd:

```
/lib64/ld-linux-x86-64.so.2
/usr/lib/libc.so.6
/usr/lib/libdl.so.2
/usr/lib/libncursesw.so.6
/usr/lib/libreadline.so.8
```

which can then be handed off to landlockc to grant a very limited set of read-access rules.
Alice's set of attack vectors is now quite diminished, but we can do even better.
Enter drop capabilities
I have included a code snippet to drop capabilities.
This is a Linux feature I previously hadn't had the need to explore (so take
that code and what comes next with a grain of salt and always:
"trust, but verify").
The classic selling point of capabilities is the scenario to allow unprivileged
users to run ping.
In a pre-capabilities world one would have to obtain the full
power of the privileged user (root) in order to use ping.
Of course setuid reduces the mess of every user su:ing, but still
provides a nice potential attack vector in the ping binary.
Capabilities are basically the idea of splitting root into separate, well,
capabilities that can be granted independently.
(ping requires the CAP_NET_RAW
capability.)
In this project this scenario isn't really applicable (since we start out as unprivileged users). But what may be applicable is the functionality to relinquish granted capabilities from the current process. Maybe this sounds convoluted, but in our current Dockerized world I would say it's fairly common to see images invoke executables in a privileged mode (i.e. not setting another user).
And a noteworthy configuration option of Linux is that you don't have to include the bothersome userland. Here I imagine a barebones server setup: the kernel, a single stand-alone server executable (serving as the init process) and nothing else. In that setting dropping capabilities could be useful.
But even with these restrictions Alice can cause quite a bother:
Enter rlimits
Now Alice is restricted to using a pre-approved set of syscalls and restricted to a pre-approved set of file-system operations on an equally pre-approved subset of the file-system tree.
Her last-ditch effort is to execute a
Denial-of-service attack.
I suggest Alice tries to while(1) allocate at least one page of memory
(getconf PAGESIZE: 4096 bytes),
write a single pseudo-randomly generated
byte to each allocation: forcing the kernel to
copy-on-write.
This will quickly exhaust all available memory, and any unfortunate Linux user
will attest to the ensuing misery.
The mitigation is to apply strict
rlimits.
In this attack RLIMIT_AS
might be the most efficient mitigation.
The common way of applying rlimits is by using the shell's
ulimit command.
Alice then tries a fork bomb.
Rejecting the clone syscall will of course mitigate such an attack, but for
instance: node is determined to spawn worker threads making such a
mitigation ineffective.
Once more rlimits come to the rescue:
RLIMIT_NPROC
restricts the number of processes a process can spawn
(including threads, of course).
Alice, your next attack vector should be to exhaust any available block devices
by creating huge files with your pseudo-random generator.
But again rlimits provides the mitigation:
RLIMIT_FSIZE.
The pattern should be obvious: restrict all available rlimits to the minimum
required to make the intended functionality succeed.
The code snippet used
to restrict the rlimits zeroes any resource limit not expressly
required to be non-zero.
Check the #define RLIMIT_DEFAULT_:s at the top of hlua,
hpython and hnode.
Again we encounter the principle of least privilege.
For instance, this approach guarantees that a file-descriptor-exhaustion
attack is no longer viable:
RLIMIT_NOFILE.
So given Alice's game you're itching to play regardless of her malicious intent: do you now feel safe enough to evaluate her code?
- We can enforce a list of allowed syscalls and their arguments using seccomp
- We can impose an additional layer of access restriction upon the file system hierarchy using landlock
- We can enforce strict resource usage limits on: memory usage, file-descriptor and thread/processes allocation
You might feel safe enough: but what surreal thing will she think of next‽
- "No, but seriously, why not sudo docker?"

  Yes, and seriously: no sudo, but yes Docker. My opinion is that Docker is great (for me, Docker := cgroups + overlayfs packaged into a sleek product, but that's fine), especially in professional CI/Kubernetes/what-have-you settings. With this project I want to showcase a Linux way of sandboxing applications unprivileged: hence no sudo, but yes Docker. Also my guiding principle (other than "trust, but verify", that is) is the principle of least privilege: why require privileges to do something that can be achieved without?

- "Why no binary packages?"
  Because each embedded script host may have a different license, and I do not want to spend the time to study each of them (and would mess up anyway). Also, my aim is for this project to be an educational showcase and a sandbox letting users experiment and get hands-on experience with Linux security features in a non-toy setting, which reduces the value of a pre-built binary package distributed without the sources and tools. The laziness argument coupled with this one guides me to only offer source packages: mostly as a guide for users who do not feel comfortable jumping into the deep end with git clone and make.
This project is intended to be built using Makefile:s and common C build
tools, except for the bpf_asm tool (found in the
Linux kernel sources).
Arch Linux users can use the bpf
package, but other distributions might have to build their own copies.
I have prepared a build script which is used when bpf_asm
is not found (the script is used in the Ubuntu workflow job).
The steps to build the project are then:

```shell
make tools
make build
make check
```

If these steps fail because of missing dependencies you may consult the following table (derived from the packages installed during the Build and test workflow).
| | runtime | build | check |
|---|---|---|---|
| Ubuntu 24.04 | libcap2 lua5.4 python3 libnode109 bash | make pkg-config gcc libcap-dev wget ca-certificates bison flex liblua5.4-dev python3 libpython3-dev libnode-dev | |
| Ubuntu 22.04 | libcap2 lua5.4 python3 libnode72 bash | make pkg-config gcc libcap-dev wget ca-certificates bison flex liblua5.4-dev python3 libpython3-dev libnode-dev | python3-toml |
| Arch Linux | lua python nodejs bash | bpf | |
Pick a release and download the Ubuntu source package asset. Included within are the sources and two helper scripts:
- build-package runs dpkg-buildpackage, as well as checking for missing build-time dependencies
- install-package installs the built package using apt-get; note that you can try out the built binaries without a system-wide installation
These scripts are intended to be run as an unprivileged user, but might
need sudo access to apt-get in order to install missing dependencies.
Both scripts accept an -s option for this case, or you can set the SUDO
environment variable (e.g. SUDO="sudo --askpass").
Pick a release and download the Arch Linux
PKGBUILD
asset, place it in a suitably empty directory and invoke
makepkg,
possibly with --syncdeps and/or --install options when desired.
Note that you can try out the built binaries (found in the created src
subfolder) without a system-wide installation.
- use strace statistics to sort seccomp filters with respect to number of calls
- landlock ABI=2 (see the sandbox example)
- readline or naive REPL with rlwrap
- reference OpenBSD's pledge(2)