Linux kernel v4.14 was released this last Sunday, and there's a bunch of security things I think are interesting:
vmapped kernel stack on arm64
Similar to the same feature on x86, Mark Rutland and Ard Biesheuvel implemented it for arm64, which moves the kernel stack to an isolated and guard-paged vmap area. With traditional stacks, there were two major risks when exhausting the stack: overwriting the thread_info
structure (which contained the
field which is checked during
), and overwriting neighboring stacks (or other things allocated next to the stack). While arm64 previously moved its thread_info off the stack to deal with the former issue, this vmap change adds the last bit of protection by nature of the vmap guard pages. If the kernel tries to write past the end of the stack, it will hit the guard page and fault. (Testing for this is now possible via LKDTM.)
One aspect of the guard page protection that will need further attention (on all architectures) is that if the stack grew because of a giant Variable Length Array on the stack (effectively an implicit
call), it might be possible to jump over the guard page entirely (as seen in the userspace Stack Clash attacks). Thankfully the use of VLAs is rare in the kernel. In the future, hopefully we'll see the addition of PaX/grsecurity's STACKLEAK plugin which, in addition to its primary purpose of clearing the kernel stack on return to userspace, makes sure stack expansion cannot skip over guard pages. This stack probing ability will likely also become directly available from the compiler.
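To make the guard-page-skipping concern concrete, here is a small userspace C model (not kernel code). It treats the stack as a downward-growing address range with a single guard page below it. It compares one large stack-pointer adjustment that only touches the new top of stack against a probed expansion that touches one address per page on the way down. The page size, the struct, and the function names are all illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical layout: the guard page is [guard_lo, guard_lo + PAGE_SIZE). */
struct stack_model {
    uintptr_t guard_lo; /* lowest address of the guard page */
    uintptr_t sp;       /* current stack pointer; the stack grows down */
};

static bool in_guard(const struct stack_model *s, uintptr_t addr)
{
    return addr >= s->guard_lo && addr < s->guard_lo + PAGE_SIZE;
}

/*
 * Unprobed expansion: move sp down by size in one step and touch only the
 * new top of stack. Returns true if that single write lands on the guard.
 */
bool unprobed_alloc_faults(const struct stack_model *s, uintptr_t size)
{
    assert(size <= s->sp); /* keep the model free of address underflow */
    return in_guard(s, s->sp - size);
}

/*
 * Probed expansion: touch one address per page while moving down, so a
 * guard page that is PAGE_SIZE wide cannot be stepped over.
 */
bool probed_alloc_faults(const struct stack_model *s, uintptr_t size)
{
    uintptr_t moved = 0;

    assert(size <= s->sp);
    while (moved < size) {
        uintptr_t left = size - moved;

        moved += left < PAGE_SIZE ? left : PAGE_SIZE;
        if (in_guard(s, s->sp - moved))
            return true;
    }
    return false;
}
```

With the guard page at 0x10000 and sp at 0x14000, an expansion of 0x8000 bytes lands below the guard page in the unprobed case and goes unnoticed, while the probed version hits the guard on its fourth step.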
Related to the
field mentioned above, another class of bug is finding a way to force the kernel into accidentally leaving
open to kernel memory through an unbalanced call to set_fs(). In some areas of the kernel, in order to reuse userspace routines (usually VFS or compat related), code will do something like:
set_fs(KERNEL_DS); ...some code here...; set_fs(USER_DS);
When the set_fs(USER_DS) call goes missing (usually due to a buggy error path), subsequent system calls can suddenly start writing into kernel memory via
(where the to user really means within the
Thomas Garnier implemented USER_DS checking at syscall exit time for x86, arm, and arm64. This means that a broken
setting will not extend beyond the buggy syscall that fails to set it back to USER_DS. Additionally, as part of the discussion on the best way to deal with this feature, Christoph Hellwig and Al Viro (and others) have been making extensive changes to avoid the need for set_fs() being used at all, which should greatly reduce the number of places where it might be possible to introduce such a bug in the future.
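The failure mode and the exit-time check can be modeled in a few lines of userspace C. This is only a sketch: set_fs(), KERNEL_DS and USER_DS here are stand-ins that share names with the kernel's but are defined locally. The current_limit variable, buggy_syscall(), and syscall_exit_check() are hypothetical. How the real kernel reacts when the check trips is not modeled; this version just reports the leak and resets the limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the kernel's segment limits (model only). */
enum seg { USER_DS, KERNEL_DS };

/* Hypothetical per-task limit that user-memory accessors would consult. */
static enum seg current_limit = USER_DS;

static void set_fs(enum seg s)
{
    assert(s == USER_DS || s == KERNEL_DS);
    current_limit = s;
}

/* A "syscall" with the classic bug: the error path skips the restore. */
int buggy_syscall(bool fail)
{
    set_fs(KERNEL_DS);
    if (fail)
        return -1;      /* BUG: missing set_fs(USER_DS) */
    set_fs(USER_DS);
    return 0;
}

/*
 * Model of a check run on every return to userspace: if the limit was
 * left at KERNEL_DS, report it and put it back, so the mistake cannot
 * outlive the syscall that made it.
 */
bool syscall_exit_check(void)
{
    if (current_limit != USER_DS) {
        set_fs(USER_DS);
        return true;    /* a leaked limit was caught */
    }
    return false;
}

bool limit_is_user(void)
{
    return current_limit == USER_DS;
}
```

Calling buggy_syscall(true) leaves the model's limit at KERNEL_DS. The following syscall_exit_check() call detects that and restores USER_DS before any later "syscall" could run with the wrong limit.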
SLUB freelist hardening
A common class of heap attacks is overwriting the freelist pointers stored inline in the unallocated SLUB cache objects. PaX/grsecurity developed an inexpensive defense that XORs the freelist pointer with a global random value (and the storage address). Daniel Micay improved on this by using a per-cache random value, and I refactored the code a bit more. The resulting feature
, enabled with
, makes freelist pointer overwrites very hard to exploit unless an attacker has found a way to expose both the random value and the pointer location. This should render blind heap overflow bugs much more difficult to exploit.
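As a rough illustration of the XOR scheme described above, the following userspace C sketch stores a "next free object" pointer obfuscated with a per-cache secret and the address of the slot it is stored in. The struct and function names are hypothetical and this is not the kernel's code; it only shows why an attacker who writes a raw pointer into the slot does not get that pointer back out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cache state: one secret chosen when the cache is created. */
struct cache_model {
    uintptr_t random;
};

/* Encoding and decoding are the same operation: ptr ^ secret ^ slot address. */
static uintptr_t freelist_xor(const struct cache_model *c,
                              uintptr_t ptr, uintptr_t slot_addr)
{
    return ptr ^ c->random ^ slot_addr;
}

/* Store the next-free pointer inside a free object, obfuscated. */
void set_free_pointer(const struct cache_model *c, void **slot, void *next)
{
    assert(slot != NULL);
    *slot = (void *)freelist_xor(c, (uintptr_t)next, (uintptr_t)slot);
}

/* Recover the next-free pointer from a free object. */
void *get_free_pointer(const struct cache_model *c, void **slot)
{
    assert(slot != NULL);
    return (void *)freelist_xor(c, (uintptr_t)*slot, (uintptr_t)slot);
}
```

A value written by set_free_pointer() round-trips through get_free_pointer(). If an overflow instead writes a raw target address into the slot, decoding yields that address XORed with the secret and the slot address, which is not useful to an attacker who knows neither.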
Additionally, Alexander Popov implemented a simple double-free defense, similar to the fasttop check in the GNU C library, which will catch sequential frees of the same pointer. (And it has already uncovered a bug.)
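The idea behind this kind of check fits in a few lines. The sketch below is a userspace model with hypothetical names, assuming a simple LIFO freelist: freeing an object that is already at the head of the list is rejected. It also shows the limitation implied by "sequential", since a double free with another free in between is not noticed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A free object stores the link to the next free object inline. */
struct object {
    struct object *next;
};

struct freelist {
    struct object *head;
};

/*
 * Push obj onto the freelist. Returns false, leaving the list untouched,
 * if obj is already the most recently freed object.
 */
bool model_free(struct freelist *fl, struct object *obj)
{
    assert(fl != NULL && obj != NULL);
    if (obj == fl->head)
        return false;   /* back-to-back double free detected */
    obj->next = fl->head;
    fl->head = obj;
    return true;
}
```

Freeing a twice in a row is caught, but freeing a, then b, then a again passes the check (and corrupts the list in this model), which is why this is only a cheap first line of defense.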
Future work would be to provide similar metadata protections to the SLAB allocator (though SLAB doesn't store its freelist within the individual unused objects, so it has a different set of exposures compared to SLUB).
setuid-exec stack limitation
Continuing the various additional defenses to protect against future problems related to userspace memory layout manipulation (as shown most recently in the Stack Clash attacks), I implemented an 8MiB stack limit for privileged (i.e. setuid) execs, inspired by a similar protection in grsecurity, after reworking the secureexec handling by LSMs. This complements the unconditional limit to the size of exec arguments that landed in v4.13.
randstruct automatic struct selection
While the bulk of the port of the randstruct gcc plugin from grsecurity landed in v4.13, the last of the work needed to enable automatic struct selection landed in v4.14. This means that the coverage of randomized structures, via
, now includes one of the major targets of exploits: function pointer structures. Without knowing the build-randomized location of a callback pointer an attacker needs to overwrite in a structure, exploits become much less reliable.
structleak passed-by-reference variable initialization
Ard Biesheuvel enhanced the structleak gcc plugin to initialize all variables on the stack that are passed by reference when built with
. Normally the compiler will yell if a variable is used before being initialized, but it silences this warning if the variable's address is passed into a function call first, as it has no way to tell if the function did actually initialize the contents. So the plugin now zero-initializes such variables (if they hadn't already been initialized) before the function call that takes their address. Enabling this feature has a small performance impact, but solves many stack content exposure flaws. (In fact at least one such flaw reported during the v4.15 development cycle was mitigated by this plugin.)
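Here is a hand-written userspace sketch of the pattern in question. The struct and both functions are hypothetical. fill_reply() has an error path that never writes to its argument, and the caller ignores the return value. The memset() stands in for the zero-initialization the plugin would insert automatically; without it, the caller would hand back whatever happened to be on the stack.

```c
#include <assert.h>
#include <string.h>

struct reply {
    int status;
    char name[16];
};

/* Hypothetical helper: on failure it returns without touching *r. */
int fill_reply(struct reply *r, int ok)
{
    assert(r != NULL);
    if (!ok)
        return -1;      /* *r is left exactly as the caller passed it */
    r->status = 1;
    strcpy(r->name, "ok");
    return 0;
}

/*
 * Caller as it would behave after the transformation, written by hand:
 * the variable is zeroed before its address is passed anywhere.
 */
struct reply get_reply(int ok)
{
    struct reply r;

    memset(&r, 0, sizeof(r));   /* stands in for the inserted initialization */
    fill_reply(&r, ok);         /* return value ignored, as buggy callers do */
    return r;
}
```

With the initialization in place, get_reply(0) returns an all-zero structure instead of stale stack contents, even though neither function was fixed.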
improved boot entropy
Laura Abbott and Daniel Micay improved early boot entropy available to the stack protector by both moving the stack protector setup later in the boot, and including the kernel command line in boot entropy collection (since with some devices it changes on each boot).
eBPF JIT for 32-bit ARM
The ARM BPF JIT had been around a while, but it didn't support eBPF (and, as a result, did not provide constant value blinding, which meant it was exposed to being used by an attacker to build arbitrary machine code with BPF constant values). Shubham Bansal spent a bunch of time building a full eBPF JIT for 32-bit ARM which both speeds up eBPF and brings it up to date on JIT exploit defenses in the kernel.
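For readers unfamiliar with constant blinding, the core trick can be sketched as follows. This is a toy model, not the kernel's JIT: the two-instruction "ISA", the struct, and the function names are invented for illustration. A load of an attacker-chosen immediate is rewritten into a load of the immediate XORed with a random value, followed by an XOR with that same random value, so the attacker's chosen bytes never appear verbatim in the emitted code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy instruction set: load an immediate, or XOR an immediate into the register. */
enum op { OP_MOV_IMM, OP_XOR_IMM };

struct insn {
    enum op op;
    uint32_t imm;
};

/*
 * Rewrite "reg = imm" into two instructions whose immediates depend on
 * rnd. Returns the number of instructions written (always 2).
 */
int blind_load_imm(uint32_t imm, uint32_t rnd, struct insn out[2])
{
    out[0] = (struct insn){ OP_MOV_IMM, imm ^ rnd };
    out[1] = (struct insn){ OP_XOR_IMM, rnd };
    return 2;
}

/* Tiny interpreter, used to check that the rewritten code still yields imm. */
uint32_t run_program(const struct insn *prog, int n)
{
    uint32_t reg = 0;

    for (int i = 0; i < n; i++) {
        assert(prog[i].op == OP_MOV_IMM || prog[i].op == OP_XOR_IMM);
        if (prog[i].op == OP_MOV_IMM)
            reg = prog[i].imm;
        else
            reg ^= prog[i].imm;
    }
    return reg;
}
```

The program computes the same value as before, but for any nonzero rnd the first emitted immediate differs from the attacker-supplied constant.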
Tyler Hicks addressed a long-standing deficiency in how seccomp could log action results. In addition to creating a way to mark a specific seccomp filter as needing to be logged with
, he added a new action result,
. With these changes in place, it should be much easier for developers to inspect the results of seccomp filters, and for process launchers to generate logs for their child processes operating under a seccomp filter.
Additionally, I finally found a way to implement an often-requested feature for seccomp, which was to kill an entire process instead of just the offending thread. This was done by creating the
mask (née
) and implementing
That's it for now; please let me know if I missed anything. The v4.15 merge window is now open!
2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.