Search Results: "kees"

16 November 2023

Dimitri John Ledkov: Ubuntu 23.10 significantly reduces the installed kernel footprint


Photo by Pixabay
Ubuntu systems typically have up to 3 kernels installed, before they are auto-removed by apt on classic installs. Historically the installation was optimized for metered download size only. However, kernel size growth and usage no longer warrant such optimizations. During the 23.10 Mantic Minatour cycle, I led a coordinated effort across multiple teams to implement lots of optimizations that together achieved unprecedented install footprint improvements.

Given a typical install of 3 generic kernel ABIs in the default configuration on a regular-sized VM (2 CPU cores 8GB of RAM) the following metrics are achieved in Ubuntu 23.10 versus Ubuntu 22.04 LTS:

  • 2x less disk space used (1,417MB vs 2,940MB, including initrd)

  • 3x less peak RAM usage for the initrd boot (68MB vs 204MB)

  • 0.5x increase in download size (949MB vs 600MB)

  • 2.5x faster initrd generation (4.5s vs 11.3s)

  • approximately the same total time (103s vs 98s, hardware dependent)


For minimal cloud images that do not install either linux-firmware or modules extra the numbers are:

  • 1.3x less disk space used (548MB vs 742MB)

  • 2.2x less peak RAM usage for initrd boot (27MB vs 62MB)

  • 0.4x increase in download size (207MB vs 146MB)


Hopefully, the compromise of download size, relative to the disk space & initrd savings is a win for the majority of platforms and use cases. For users on extremely expensive and metered connections, the likely best saving is to receive air-gapped updates or skip updates.

This was achieved by precompressing kernel modules & firmware files with the maximum level of Zstd compression at package build time; making actual .deb files uncompressed; assembling the initrd using split cpio archives - uncompressed for the pre-compressed files, whilst compressing only the userspace portions of the initrd; enabling in-kernel module decompression support with matching kmod; fixing bugs in all of the above, and landing all of these things in time for the feature freeze. Whilst leveraging the experience and some of the design choices implementations we have already been shipping on Ubuntu Core. Some of these changes are backported to Jammy, but only enough to support smooth upgrades to Mantic and later. Complete gains are only possible to experience on Mantic and later.

The discovered bugs in kernel module loading code likely affect systems that use LoadPin LSM with kernel space module uncompression as used on ChromeOS systems. Hopefully, Kees Cook or other ChromeOS developers pick up the kernel fixes from the stable trees. Or you know, just use Ubuntu kernels as they do get fixes and features like these first.

The team that designed and delivered these changes is large: Benjamin Drung, Andrea Righi, Juerg Haefliger, Julian Andres Klode, Steve Langasek, Michael Hudson-Doyle, Robert Kratky, Adrien Nader, Tim Gardner, Roxana Nicolescu - and myself Dimitri John Ledkov ensuring the most optimal solution is implemented, everything lands on time, and even implementing portions of the final solution.

Hi, It's me, I am a Staff Engineer at Canonical and we are hiring https://canonical.com/careers.

Lots of additional technical details and benchmarks on a huge range of diverse hardware and architectures, and bikeshedding all the things below:

For questions and comments please post to Kernel section on Ubuntu Discourse.



1 May 2023

Russ Allbery: Review: The Amazing Maurice and His Educated Rodents

Review: The Amazing Maurice and His Educated Rodents, by Terry Pratchett
Series: Discworld #28
Publisher: HarperCollins
Copyright: 2001
Printing: 2008
ISBN: 0-06-001235-8
Format: Mass market
Pages: 351
The Amazing Maurice and His Educated Rodents is the 28th Discworld novel and the first marketed for younger readers. Although it has enough references to establish it as taking place on Discworld, it has no obvious connections with the other books and doesn't rely on any knowledge of the series so far. This would not be a bad place to start with Terry Pratchett and see if his writing style and sense of humor is for you. Despite being marketed as young adult, and despite Pratchett's comments in an afterward in the edition I own that writing YA novels is much harder, I didn't think this was that different than a typical Discworld novel. The two main human characters read as about twelve and there were some minor changes in tone, but I'm not sure I would have immediately labeled it as YA if I hadn't already known it was supposed to be. There are considerably fewer obvious pop culture references than average, though; if that's related, I think I'll prefer Pratchett's YA novels, since I think his writing is stronger when he's not playing reference bingo. Maurice (note to US readers: Maurice is pronounced "Morris" in the UK) is a talking cat and the mastermind of a wandering con job. He, a stupid-looking kid with a flute (Maurice's description), and a tribe of talking rats travel the small towns of Discworld. The rats go in first, making a show of breaking into the food, swimming in the cream, and widdling on things that humans don't want widdled on. Once the townspeople are convinced they have a plague of rats, the kid with the flute enters the town and offers to pipe the rats away for a very reasonable fee. He plays his flute, the rats swarm out of town, and they take their money and move on to the next town. It's a successful life that suits Maurice and his growing hoard of gold very well. If only the rats would stop asking pointed questions about the ethics of this scheme. The town of Bad Blintz is the next on their itinerary, and if the rats have their way, will be the last. Their hope is they've gathered enough money by now to find an island, away from humans, where they can live their own lives. But, as is always the case for one last job in fiction, there's something uncannily wrong about Bad Blintz. There are traps everywhere, more brutal and dangerous ones than they've found in any other town, and yet there is no sign of native, unintelligent rats. Meanwhile, Maurice and the boy find a town that looks wealthy but has food shortages, a bounty on rats that is absurdly high, and a pair of sinister-looking rat-catchers who are bringing in collections of rat tails that look suspiciously like bootlaces. The mayor's daughter discovers Maurice can talk and immediately decides she has to take them in hand. Malicia is very certain of her own opinions, not accustomed to taking no for an answer, and is certain that the world follows the logic of stories, even if she has to help it along. This is truly great stuff. I think this might be my favorite Discworld novel to date, although I do have some criticisms that I'll get to in a moment. The best part are the rats, and particularly the blind philosopher rat Dangerous Beans and his assistant Peaches. In the middle of daring infiltration of the trapped sewers in scenes reminiscent of Mission: Impossible, the rats are also having philosophical arguments. They've become something different than the unaltered rats that they call the keekees, but what those differences mean is harder to understand. The older rats are not happy about too many changes and think the rats should keep acting like rats. The younger ones are discovering that they're afraid of shadows because now they understand what the shadows hint at. Dangerous Beans is trying to work out a writing system so that they can keep important thoughts. One of their few guides is a children's book of talking animals, although they quickly discover that the portrayed clothing is annoyingly impractical. But as good as the rats are, Maurice is nearly as much fun in an entirely different way. He is unapologetically out for himself, streetwise and canny in a way that feels apt for a cat, gets bored and mentally wanders off in the middle of conversations, and pretends to agree with people when that's how he can get what he wants. But he also has a weird sense of loyalty and ethics that only shows up when something is truly important. It's a variation on the con man with a heart of gold, but it's a very well-done variation that weaves in a cat's impatience with and inattention to anything that doesn't directly concern them. I was laughing throughout the book. Malicia is an absolute delight, the sort of character who takes over scenes through sheer force of will, and the dumb-looking kid (whose name turns out to be Keith) is a perfect counterbalance: a laid-back, quiet boy who just wants to play his music and is almost entirely unflappable. It's such a great cast. The best part of the plot is the end. I won't spoil it, so I'll only say that Pratchett has the characters do the work on the aftermath that a lot of books skip over. He doesn't have any magical solutions for the world's problems, but he's so very good at restoring one's faith that maybe sometimes those solutions can be constructed. My one complaint with this book is that Pratchett introduces a second villain, and while there are good in-story justifications for it and it's entangled with the primary plot, he added elements of (mild) supernatural horror and evil that I thought were extraneous and unnecessary. He already had enough of a conflict set up without adding that additional element, and I think it undermined the moral complexity of the story. I would have much rather he kept the social dynamics of the town at the core of the story and used that to trigger the moments of sacrifice and philosophy that made the climax work. The Discworld books by this point have gotten very good, but each book seems to have one element like that where it felt like Pratchett took the easy way out of a plot corner or added some story element that didn't really work. I feel like the series is on the verge of having a truly great book that rises above the entire series to date, but never quite gets there. That caveat aside, I thoroughly enjoyed this and had trouble putting it down. Mrs. Frisby and the Rats of Nimh was one of my favorite books as a kid, and this reminded me of it in some good ways (enough so that I think some of the references were intentional). Great stuff. If you were to read only one Discworld book and didn't want to be confused by all the entangled plot threads and established characters, I would seriously consider making it this one. Recommended. Followed by Night Watch in publication order. There doesn't appear to be a direct plot sequel, more's the pity. Rating: 8 out of 10

13 July 2022

Reproducible Builds: Reproducible Builds in June 2022

Welcome to the June 2022 report from the Reproducible Builds project. In these reports, we outline the most important things that we have been up to over the past month. As a quick recap, whilst anyone may inspect the source code of free software for malicious flaws, almost all software is distributed to end users as pre-compiled binaries.

Save the date! Despite several delays, we are pleased to announce dates for our in-person summit this year: November 1st 2022 November 3rd 2022
The event will happen in/around Venice (Italy), and we intend to pick a venue reachable via the train station and an international airport. However, the precise venue will depend on the number of attendees. Please see the announcement mail from Mattia Rizzolo, and do keep an eye on the mailing list for further announcements as it will hopefully include registration instructions.

News David Wheeler filed an issue against the Rust programming language to report that builds are not reproducible because full path to the source code is in the panic and debug strings . Luckily, as one of the responses mentions: the --remap-path-prefix solves this problem and has been used to great effect in build systems that rely on reproducibility (Bazel, Nix) to work at all and that there are efforts to teach cargo about it here .
The Python Security team announced that:
The ctx hosted project on PyPI was taken over via user account compromise and replaced with a malicious project which contained runtime code which collected the content of os.environ.items() when instantiating Ctx objects. The captured environment variables were sent as a base64 encoded query parameter to a Heroku application [ ]
As their announcement later goes onto state, version-pinning using hash-checking mode can prevent this attack, although this does depend on specific installations using this mode, rather than a prevention that can be applied systematically.
Developer vanitasvitae published an interesting and entertaining blog post detailing the blow-by-blow steps of debugging a reproducibility issue in PGPainless, a library which aims to make using OpenPGP in Java projects as simple as possible . Whilst their in-depth research into the internals of the .jar may have been unnecessary given that diffoscope would have identified the, it must be said that there is something to be said with occasionally delving into seemingly low-level details, as well describing any debugging process. Indeed, as vanitasvitae writes:
Yes, this would have spared me from 3h of debugging But I probably would also not have gone onto this little dive into the JAR/ZIP format, so in the end I m not mad.

Kees Cook published a short and practical blog post detailing how he uses reproducibility properties to aid work to replace one-element arrays in the Linux kernel. Kees approach is based on the principle that if a (small) proposed change is considered equivalent by the compiler, then the generated output will be identical but only if no other arbitrary or unrelated changes are introduced. Kees mentions the fantastic diffoscope tool, as well as various kernel-specific build options (eg. KBUILD_BUILD_TIMESTAMP) in order to prepare my build with the known to disrupt code layout options disabled .
Stefano Zacchiroli gave a presentation at GDR S curit Informatique based in part on a paper co-written with Chris Lamb titled Increasing the Integrity of Software Supply Chains. (Tweet)

Debian In Debian in this month, 28 reviews of Debian packages were added, 35 were updated and 27 were removed this month adding to our knowledge about identified issues. Two issue types were added: nondeterministic_checksum_generated_by_coq and nondetermistic_js_output_from_webpack. After Holger Levsen found hundreds of packages in the bookworm distribution that lack .buildinfo files, he uploaded 404 source packages to the archive (with no meaningful source changes). Currently bookworm now shows only 8 packages without .buildinfo files, and those 8 are fixed in unstable and should migrate shortly. By contrast, Debian unstable will always have packages without .buildinfo files, as this is how they come through the NEW queue. However, as these packages were not built on the official build servers (ie. they were uploaded by the maintainer) they will never migrate to Debian testing. In the future, therefore, testing should never have packages without .buildinfo files again. Roland Clobus posted yet another in-depth status report about his progress making the Debian Live images build reproducibly to our mailing list. In this update, Roland mentions that all major desktops build reproducibly with bullseye, bookworm and sid but also goes on to outline the progress made with automated testing of the generated images using openQA.

GNU Guix Vagrant Cascadian made a significant number of contributions to GNU Guix: Elsewhere in GNU Guix, Ludovic Court s published a paper in the journal The Art, Science, and Engineering of Programming called Building a Secure Software Supply Chain with GNU Guix:
This paper focuses on one research question: how can [Guix]((https://www.gnu.org/software/guix/) and similar systems allow users to securely update their software? [ ] Our main contribution is a model and tool to authenticate new Git revisions. We further show how, building on Git semantics, we build protections against downgrade attacks and related threats. We explain implementation choices. This work has been deployed in production two years ago, giving us insight on its actual use at scale every day. The Git checkout authentication at its core is applicable beyond the specific use case of Guix, and we think it could benefit to developer teams that use Git.
A full PDF of the text is available.

openSUSE In the world of openSUSE, SUSE announced at SUSECon that they are preparing to meet SLSA level 4. (SLSA (Supply chain Levels for Software Artifacts) is a new industry-led standardisation effort that aims to protect the integrity of the software supply chain.) However, at the time of writing, timestamps within RPM archives are not normalised, so bit-for-bit identical reproducible builds are not possible. Some in-toto provenance files published for SUSE s SLE-15-SP4 as one result of the SLSA level 4 effort. Old binaries are not rebuilt, so only new builds (e.g. maintenance updates) have this metadata added. Lastly, Bernhard M. Wiedemann posted his usual monthly openSUSE reproducible builds status report.

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 215, 216 and 217 to Debian unstable. Chris Lamb also made the following changes:
  • New features:
    • Print profile output if we were called with --profile and we were killed via a TERM signal. This should help in situations where diffoscope is terminated due to some sort of timeout. [ ]
    • Support both PyPDF 1.x and 2.x. [ ]
  • Bug fixes:
    • Also catch IndexError exceptions (in addition to ValueError) when parsing .pyc files. (#1012258)
    • Correct the logic for supporting different versions of the argcomplete module. [ ]
  • Output improvements:
    • Don t leak the (likely-temporary) pathname when comparing PDF documents. [ ]
  • Logging improvements:
    • Update test fixtures for GNU readelf 2.38 (now in Debian unstable). [ ][ ]
    • Be more specific about the minimum required version of readelf (ie. binutils), as it appears that this patch level version change resulted in a change of output, not the minor version. [ ]
    • Use our @skip_unless_tool_is_at_least decorator (NB. at_least) over @skip_if_tool_version_is (NB. is) to fix tests under Debian stable. [ ]
    • Emit a warning if/when we are handling a UNIX TERM signal. [ ]
  • Codebase improvements:
    • Clarify in what situations the main finally block gets called with respect to TERM signal handling. [ ]
    • Clarify control flow in the diffoscope.profiling module. [ ]
    • Correctly package the scripts/ directory. [ ]
In addition, Edward Betts updated a broken link to the RSS on the diffoscope homepage and Vagrant Cascadian updated the diffoscope package in GNU Guix [ ][ ][ ].

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including:

Testing framework The Reproducible Builds project runs a significant testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:
  • Holger Levsen:
    • Add a package set for packages that use the R programming language [ ] as well as one for Rust [ ].
    • Improve package set matching for Python [ ] and font-related [ ] packages.
    • Install the lz4, lzop and xz-utils packages on all nodes in order to detect running kernels. [ ]
    • Improve the cleanup mechanisms when testing the reproducibility of Debian Live images. [ ][ ]
    • In the automated node health checks, deprioritise the generic kernel warning . [ ]
  • Roland Clobus (Debian Live image reproducibility):
    • Add various maintenance jobs to the Jenkins view. [ ]
    • Cleanup old workspaces after 24 hours. [ ]
    • Cleanup temporary workspace and resulting directories. [ ]
    • Implement a number of fixes and improvements around publishing files. [ ][ ][ ]
    • Don t attempt to preserve the file timestamps when copying artifacts. [ ]
And finally, node maintenance was also performed by Mattia Rizzolo [ ].

Mailing list and website On our mailing list this month: Lastly, Chris Lamb updated the main Reproducible Builds website and documentation in a number of small ways, but primarily published an interview with Hans-Christoph Steiner of the F-Droid project. Chris Lamb also added a Coffeescript example for parsing and using the SOURCE_DATE_EPOCH environment variable [ ]. In addition, Sebastian Crane very-helpfully updated the screenshot of salsa.debian.org s request access button on the How to join the Salsa group. [ ]

Contact If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. However, you can get in touch with us via:

24 June 2022

Kees Cook: finding binary differences

As part of the continuing work to replace 1-element arrays in the Linux kernel, it s very handy to show that a source change has had no executable code difference. For example, if you started with this:
struct foo  
    unsigned long flags;
    u32 length;
    u32 data[1];
 ;
void foo_init(int count)
 
    struct foo *instance;
    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
    ...
    instance = kmalloc(bytes, GFP_KERNEL);
    ...
 ;
And you changed only the struct definition:
-    u32 data[1];
+    u32 data[];
The bytes calculation is going to be incorrect, since it is still subtracting 1 element s worth of space from the desired count. (And let s ignore for the moment the open-coded calculation that may end up with an arithmetic over/underflow here; that can be solved separately by using the struct_size() helper or the size_mul(), size_add(), etc family of helpers.) The missed adjustment to the size calculation is relatively easy to find in this example, but sometimes it s much less obvious how structure sizes might be woven into the code. I ve been checking for issues by using the fantastic diffoscope tool. It can produce a LOT of noise if you try to compare builds without keeping in mind the issues solved by reproducible builds, with some additional notes. I prepare my build with the known to disrupt code layout options disabled, but with debug info enabled:
$ KBF="KBUILD_BUILD_TIMESTAMP=1970-01-01 KBUILD_BUILD_USER=user KBUILD_BUILD_HOST=host KBUILD_BUILD_VERSION=1"
$ OUT=gcc
$ make $KBF O=$OUT allmodconfig
$ ./scripts/config --file $OUT/.config \
        -d GCOV_KERNEL -d KCOV -d GCC_PLUGINS -d IKHEADERS -d KASAN -d UBSAN \
        -d DEBUG_INFO_NONE -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
$ make $KBF O=$OUT olddefconfig
Then I build a stock target, saving the output in before . In this case, I m examining drivers/scsi/megaraid/:
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/before
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/before/
Then I patch and build a modified target, saving the output in after :
$ vi the/source/code.c
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/after
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/after/
And then run diffoscope:
$ diffoscope $OUT/before/ $OUT/after/
If diffoscope output reports nothing, then we re done. Usually, though, when source lines move around other stuff will shift too (e.g. WARN macros rely on line numbers, so the bug table may change contents a bit, etc), and diffoscope output will look noisy. To examine just the executable code, the command that diffoscope used is reported in the output, and we can run it directly, but with possibly shifted line numbers not reported. i.e. running objdump without --line-numbers:
$ ARGS="--disassemble --demangle --reloc --no-show-raw-insn --section=.text"
$ for i in $(cd $OUT/before && echo *.o); do
        echo $i
        diff -u <(objdump $ARGS $OUT/before/$i   sed "0,/^Disassembly/d") \
                <(objdump $ARGS $OUT/after/$i    sed "0,/^Disassembly/d")
done
If I see an unexpected difference, for example:
-    c120:      movq   $0x0,0x800(%rbx)
+    c120:      movq   $0x0,0x7f8(%rbx)
Then I'll search for the pattern with line numbers added to the objdump output:
$ vi <(objdump --line-numbers $ARGS $OUT/after/megaraid_sas_fp.o)
I'd search for "0x0,0x7f8", find the source file and line number above it, open that source file at that position, and look to see where something was being miscalculated:
$ vi drivers/scsi/megaraid/megaraid_sas_fp.c +329
Once tracked down, I'd start over at the "patch and build a modified target" step above, repeating until there were no differences. For example, in the starting example, I'd also need to make this change:
-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = sizeof(*instance) + sizeof(u32) * count;
Though, as hinted earlier, better yet would be:
-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = struct_size(instance, data, count);
But sometimes adding the helper usage will add binary output differences since they're performing overflow checking that might saturate at SIZE_MAX. To help with patch clarity, those changes can be done separately from fixing the array declaration.

2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

5 April 2022

Kees Cook: security things in Linux v5.10

Previously: v5.9 Linux v5.10 was released in December, 2020. Here s my summary of various security things that I found interesting: AMD SEV-ES
While guest VM memory encryption with AMD SEV has been supported for a while, Joerg Roedel, Thomas Lendacky, and others added register state encryption (SEV-ES). This means it s even harder for a VM host to reconstruct a guest VM s state. x86 static calls
Josh Poimboeuf and Peter Zijlstra implemented static calls for x86, which operates very similarly to the static branch infrastructure in the kernel. With static branches, an if/else choice can be hard-coded, instead of being run-time evaluated every time. Such branches can be updated too (the kernel just rewrites the code to switch around the branch ). All these principles apply to static calls as well, but they re for replacing indirect function calls (i.e. a call through a function pointer) with a direct call (i.e. a hard-coded call address). This eliminates the need for Spectre mitigations (e.g. RETPOLINE) for these indirect calls, and avoids a memory lookup for the pointer. For hot-path code (like the scheduler), this has a measurable performance impact. It also serves as a kind of Control Flow Integrity implementation: an indirect call got removed, and the potential destinations have been explicitly identified at compile-time. network RNG improvements
In an effort to improve the pseudo-random number generator used by the network subsystem (for things like port numbers and packet sequence numbers), Linux s home-grown pRNG has been replaced by the SipHash round function, and perturbed by (hopefully) hard-to-predict internal kernel states. This should make it very hard to brute force the internal state of the pRNG and make predictions about future random numbers just from examining network traffic. Similarly, ICMP s global rate limiter was adjusted to avoid leaking details of network state, as a start to fixing recent DNS Cache Poisoning attacks. SafeSetID handles GID
Thomas Cedeno improved the SafeSetID LSM to handle group IDs (which required teaching the kernel about which syscalls were actually performing setgid.) Like the earlier setuid policy, this lets the system owner define an explicit list of allowed group ID transitions under CAP_SETGID (instead of to just any group), providing a way to keep the power of granting this capability much more limited. (This isn t complete yet, though, since handling setgroups() is still needed.) improve kernel s internal checking of file contents
The kernel provides LSMs (like the Integrity subsystem) with details about files as they re loaded. (For example, loading modules, new kernel images for kexec, and firmware.) There wasn t very good coverage for cases where the contents were coming from things that weren t files. To deal with this, new hooks were added that allow the LSMs to introspect the contents directly, and to do partial reads. This will give the LSMs much finer grain visibility into these kinds of operations. set_fs removal continues
With the earlier work landed to free the core kernel code from set_fs(), Christoph Hellwig made it possible for set_fs() to be optional for an architecture. Subsequently, he then removed set_fs() entirely for x86, riscv, and powerpc. These architectures will now be free from the entire class of kernel address limit attacks that only needed to corrupt a single value in struct thead_info. sysfs_emit() replaces sprintf() in /sys
Joe Perches tackled one of the most common bug classes with sprintf() and snprintf() in /sys handlers by creating a new helper, sysfs_emit(). This will handle the cases where kernel code was not correctly dealing with the length results from sprintf() calls, which might lead to buffer overflows in the PAGE_SIZE buffer that /sys handlers operate on. With the helper in place, it was possible to start the refactoring of the many sprintf() callers. nosymfollow mount option
Mattias Nissler and Ross Zwisler implemented the nosymfollow mount option. This entirely disables symlink resolution for the given filesystem, similar to other mount options where noexec disallows execve(), nosuid disallows setid bits, and nodev disallows device files. Quoting the patch, it is useful as a defensive measure for systems that need to deal with untrusted file systems in privileged contexts. (i.e. for when /proc/sys/fs/protected_symlinks isn t a big enough hammer.) Chrome OS uses this option for its stateful filesystem, as symlink traversal as been a common attack-persistence vector. ARMv8.5 Memory Tagging Extension support
Vincenzo Frascino added support to arm64 for the coming Memory Tagging Extension, which will be available for ARMv8.5 and later chips. It provides 4 bits of tags (covering multiples of 16 byte spans of the address space). This is enough to deterministically eliminate all linear heap buffer overflow flaws (1 tag for free , and then rotate even values and odd values for neighboring allocations), which is probably one of the most common bugs being currently exploited. It also makes use-after-free and over/under indexing much more difficult for attackers (but still possible if the target s tag bits can be exposed). Maybe some day we can switch to 128 bit virtual memory addresses and have fully versioned allocations. But for now, 16 tag values is better than none, though we do still need to wait for anyone to actually be shipping ARMv8.5 hardware. fixes for flaws found by UBSAN
The work to make UBSAN generally usable under syzkaller continues to bear fruit, with various fixes all over the kernel for stuff like shift-out-of-bounds, divide-by-zero, and integer overflow. Seeing these kinds of patches land reinforces the the rationale of shifting the burden of these kinds of checks to the toolchain: these run-time bugs continue to pop up. flexible array conversions
The work on flexible array conversions continues. Gustavo A. R. Silva and others continued to grind on the conversions, getting the kernel ever closer to being able to enable the -Warray-bounds compiler flag and clear the path for saner bounds checking of array indexes and memcpy() usage. That s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.11.

2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

5 April 2021

Kees Cook: security things in Linux v5.9

Previously: v5.8 Linux v5.9 was released in October, 2020. Here s my summary of various security things that I found interesting: seccomp user_notif file descriptor injection
Sargun Dhillon added the ability for SECCOMP_RET_USER_NOTIF filters to inject file descriptors into the target process using SECCOMP_IOCTL_NOTIF_ADDFD. This lets container managers fully emulate syscalls like open() and connect(), where an actual file descriptor is expected to be available after a successful syscall. In the process I fixed a couple bugs and refactored the file descriptor receiving code. zero-initialize stack variables with Clang
When Alexander Potapenko landed support for Clang s automatic variable initialization, it did so with a byte pattern designed to really stand out in kernel crashes. Now he s added support for doing zero initialization via CONFIG_INIT_STACK_ALL_ZERO, which besides actually being faster, has a few behavior benefits as well. Unlike pattern initialization, which has a higher chance of triggering existing bugs, zero initialization provides safe defaults for strings, pointers, indexes, and sizes. Like the pattern initialization, this feature stops entire classes of uninitialized stack variable flaws. common syscall entry/exit routines
Thomas Gleixner created architecture-independent code to do syscall entry/exit, since much of the kernel s work during a syscall entry and exit is the same. There was no need to repeat this in each architecture, and having it implemented separately meant bugs (or features) might only get fixed (or implemented) in a handful of architectures. It means that features like seccomp become much easier to build since it wouldn t need per-architecture implementations any more. Presently only x86 has switched over to the common routines. SLAB kfree() hardening
To reach CONFIG_SLAB_FREELIST_HARDENED feature-parity with the SLUB heap allocator, I added naive double-free detection and the ability to detect cross-cache freeing in the SLAB allocator. This should keep a class of type-confusion bugs from biting kernels using SLAB. (Most distro kernels use SLUB, but some smaller devices prefer the slightly more compact SLAB, so this hardening is mostly aimed at those systems.) new CAP_CHECKPOINT_RESTORE capability
Adrian Reber added the new CAP_CHECKPOINT_RESTORE capability, splitting this functionality off of CAP_SYS_ADMIN. The needs for the kernel to correctly checkpoint and restore a process (e.g. used to move processes between containers) continues to grow, and it became clear that the security implications were lower than those of CAP_SYS_ADMIN yet distinct from other capabilities. Using this capability is now the preferred method for doing things like changing /proc/self/exe. debugfs boot-time visibility restriction
Peter Enderborg added the debugfs boot parameter to control the visibility of the kernel s debug filesystem. The contents of debugfs continue to be a common area of sensitive information being exposed to attackers. While this was effectively possible by unsetting CONFIG_DEBUG_FS, that wasn t a great approach for system builders needing a single set of kernel configs (e.g. a distro kernel), so now it can be disabled at boot time. more seccomp architecture support
Michael Karcher implemented the SuperH seccomp hooks, Guo Ren implemented the C-SKY seccomp hooks, and Max Filippov implemented the xtensa seccomp hooks. Each of these included the ever-important updates to the seccomp regression testing suite in the kernel selftests. stack protector support for RISC-V
Guo Ren implemented -fstack-protector (and -fstack-protector-strong) support for RISC-V. This is the initial global-canary support while the patches to GCC to support per-task canaries is getting finished (similar to the per-task canaries done for arm64). This will mean nearly all stack frame write overflows are no longer useful to attackers on this architecture. It s nice to see this finally land for RISC-V, which is quickly approaching architecture feature parity with the other major architectures in the kernel. new tasklet API
Romain Perier and Allen Pais introduced a new tasklet API to make their use safer. Much like the timer_list refactoring work done earlier, the tasklet API is also a potential source of simple function-pointer-and-first-argument controlled exploits via linear heap overwrites. It s a smaller attack surface since it s used much less in the kernel, but it is the same weak design, making it a sensible thing to replace. While the use of the tasklet API is considered deprecated (replaced by threaded IRQs), it s not always a simple mechanical refactoring, so the old API still needs refactoring (since that CAN be done mechanically is most cases). x86 FSGSBASE implementation
Sasha Levin, Andy Lutomirski, Chang S. Bae, Andi Kleen, Tony Luck, Thomas Gleixner, and others landed the long-awaited FSGSBASE series. This provides task switching performance improvements while keeping the kernel safe from modules accidentally (or maliciously) trying to use the features directly (which exposed an unprivileged direct kernel access hole). filter x86 MSR writes
While it s been long understood that writing to CPU Model-Specific Registers (MSRs) from userspace was a bad idea, it has been left enabled for things like MSR_IA32_ENERGY_PERF_BIAS. Boris Petkov has decided enough is enough and has now enabled logging and kernel tainting (TAINT_CPU_OUT_OF_SPEC) by default and a way to disable MSR writes at runtime. (However, since this is controlled by a normal module parameter and the root user can just turn writes back on, I continue to recommend that people build with CONFIG_X86_MSR=n.) The expectation is that userspace MSR writes will be entirely removed in future kernels. uninitialized_var() macro removed
I made treewide changes to remove the uninitialized_var() macro, which had been used to silence compiler warnings. The rationale for this macro was weak to begin with ( the compiler is reporting an uninitialized variable that is clearly initialized ) since it was mainly papering over compiler bugs. However, it creates a much more fragile situation in the kernel since now such uses can actually disable automatic stack variable initialization, as well as mask legitimate unused variable warnings. The proper solution is to just initialize variables the compiler warns about. function pointer cast removals
Oscar Carter has started removing function pointer casts from the kernel, in an effort to allow the kernel to build with -Wcast-function-type. The future use of Control Flow Integrity checking (which does validation of function prototypes matching between the caller and the target) tends not to work well with function casts, so it d be nice to get rid of these before CFI lands. flexible array conversions
As part of Gustavo A. R. Silva s on-going work to replace zero-length and one-element arrays with flexible arrays, he has documented the details of the flexible array conversions, and the various helpers to be used in kernel code. Every commit gets the kernel closer to building with -Warray-bounds, which catches a lot of potential buffer overflows at compile time. That s it for now! Please let me know if you think anything else needs some attention. Next up is Linux v5.10.

2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

9 February 2021

Kees Cook: security things in Linux v5.8

Previously: v5.7 Linux v5.8 was released in August, 2020. Here s my summary of various security things that caught my attention: arm64 Branch Target Identification
Dave Martin added support for ARMv8.5 s Branch Target Instructions (BTI), which are enabled in userspace at execve() time, and all the time in the kernel (which required manually marking up a lot of non-C code, like assembly and JIT code). With this in place, Jump-Oriented Programming (JOP, where code gadgets are chained together with jumps and calls) is no longer available to the attacker. An attacker s code must make direct function calls. This basically reduces the usable code available to an attacker from every word in the kernel text to only function entries (or jump targets). This is a low granularity forward-edge Control Flow Integrity (CFI) feature, which is important (since it greatly reduces the potential targets that can be used in an attack) and cheap (implemented in hardware). It s a good first step to strong CFI, but (as we ve seen with things like CFG) it isn t usually strong enough to stop a motivated attacker. High granularity CFI (which uses a more specific branch-target characteristic, like function prototypes, to track expected call sites) is not yet a hardware supported feature, but the software version will be coming in the future by way of Clang s CFI implementation. arm64 Shadow Call Stack
Sami Tolvanen landed the kernel implementation of Clang s Shadow Call Stack (SCS), which protects the kernel against Return-Oriented Programming (ROP) attacks (where code gadgets are chained together with returns). This backward-edge CFI protection is implemented by keeping a second dedicated stack pointer register (x18) and keeping a copy of the return addresses stored in a separate shadow stack . In this way, manipulating the regular stack s return addresses will have no effect. (And since a copy of the return address continues to live in the regular stack, no changes are needed for back trace dumps, etc.) It s worth noting that unlike BTI (which is hardware based), this is a software defense that relies on the location of the Shadow Stack (i.e. the value of x18) staying secret, since the memory could be written to directly. Intel s hardware ROP defense (CET) uses a hardware shadow stack that isn t directly writable. ARM s hardware defense against ROP is PAC (which is actually designed as an arbitrary CFI defense it can be used for forward-edge too), but that depends on having ARMv8.3 hardware. The expectation is that SCS will be used until PAC is available. Kernel Concurrency Sanitizer infrastructure added
Marco Elver landed support for the Kernel Concurrency Sanitizer, which is a new debugging infrastructure to find data races in the kernel, via CONFIG_KCSAN. This immediately found real bugs, with some fixes having already landed too. For more details, see the KCSAN documentation. new capabilities
Alexey Budankov added CAP_PERFMON, which is designed to allow access to perf(). The idea is that this capability gives a process access to only read aspects of the running kernel and system. No longer will access be needed through the much more powerful abilities of CAP_SYS_ADMIN, which has many ways to change kernel internals. This allows for a split between controls over the confidentiality (read access via CAP_PERFMON) of the kernel vs control over integrity (write access via CAP_SYS_ADMIN). Alexei Starovoitov added CAP_BPF, which is designed to separate BPF access from the all-powerful CAP_SYS_ADMIN. It is designed to be used in combination with CAP_PERFMON for tracing-like activities and CAP_NET_ADMIN for networking-related activities. For things that could change kernel integrity (i.e. write access), CAP_SYS_ADMIN is still required. network random number generator improvements
Willy Tarreau made the network code s random number generator less predictable. This will further frustrate any attacker s attempts to recover the state of the RNG externally, which might lead to the ability to hijack network sessions (by correctly guessing packet states). fix various kernel address exposures to non-CAP_SYSLOG
I fixed several situations where kernel addresses were still being exposed to unprivileged (i.e. non-CAP_SYSLOG) users, though usually only through odd corner cases. After refactoring how capabilities were being checked for files in /sys and /proc, the kernel modules sections, kprobes, and BPF exposures got fixed. (Though in doing so, I briefly made things much worse before getting it properly fixed. Yikes!) RISCV W^X detection
Following up on his recent work to enable strict kernel memory protections on RISCV, Zong Li has now added support for CONFIG_DEBUG_WX as seen for other architectures. Any writable and executable memory regions in the kernel (which are lovely targets for attackers) will be loudly noted at boot so they can get corrected. execve() refactoring continues
Eric W. Biederman continued working on execve() refactoring, including getting rid of the frequently problematic recursion used to locate binary handlers. I used the opportunity to dust off some old binfmt_script regression tests and get them into the kernel selftests. multiple /proc instances
Alexey Gladkov modernized /proc internals and provided a way to have multiple /proc instances mounted in the same PID namespace. This allows for having multiple views of /proc, with different features enabled. (Including the newly added hidepid=4 and subset=pid mount options.) set_fs() removal continues
Christoph Hellwig, with Eric W. Biederman, Arnd Bergmann, and others, have been diligently working to entirely remove the kernel s set_fs() interface, which has long been a source of security flaws due to weird confusions about which address space the kernel thought it should be accessing. Beyond things like the lower-level per-architecture signal handling code, this has needed to touch various parts of the ELF loader, and networking code too. READ_IMPLIES_EXEC is no more for native 64-bit
The READ_IMPLIES_EXEC flag was a work-around for dealing with the addition of non-executable (NX) memory when x86_64 was introduced. It was designed as a way to mark a memory region as well, since we don t know if this memory region was expected to be executable, we must assume that if we need to read it, we need to be allowed to execute it too . It was designed mostly for stack memory (where trampoline code might live), but it would carry over into all mmap() allocations, which would mean sometimes exposing a large attack surface to an attacker looking to find executable memory. While normally this didn t cause problems on modern systems that correctly marked their ELF sections as NX, there were still some awkward corner-cases. I fixed this by splitting READ_IMPLIES_EXEC from the ELF PT_GNU_STACK marking on x86 and arm/arm64, and declaring that a native 64-bit process would never gain READ_IMPLIES_EXEC on x86_64 and arm64, which matches the behavior of other native 64-bit architectures that correctly didn t ever implement READ_IMPLIES_EXEC in the first place. array index bounds checking continues
As part of the ongoing work to use modern flexible arrays in the kernel, Gustavo A. R. Silva added the flex_array_size() helper (as a cousin to struct_size()). The zero/one-member into flex array conversions continue with over a hundred commits as we slowly get closer to being able to build with -Warray-bounds. scnprintf() replacement continues
Chen Zhou joined Takashi Iwai in continuing to replace potentially unsafe uses of sprintf() with scnprintf(). Fixing all of these will make sure the kernel avoids nasty buffer concatenation surprises. That s it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.9.

2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

30 October 2020

Kees Cook: combining apt install and get dist-upgrade ?

I frequently see a pattern in image build/refresh scripts where a set of packages is installed, and then all packages are updated:
apt update
apt install -y pkg1 pkg2 pkg2
apt dist-upgrade -y
While it s not much, this results in redundant work. For example reading/writing package database, potentially running triggers (man-page refresh, ldconfig, etc). The internal package dependency resolution stuff isn t actually different: install will also do upgrades of needed packages, etc. Combining them should be entirely possible, but I haven t found a clean way to do this yet. The best I ve got so far is:
apt update
apt-cache dumpavail   dpkg --merge-avail -
(for i in pkg1 pkg2 pkg3; do echo "$i install")   dpkg --set-selections
apt-get dselect-upgrade
This gets me the effect of running install and upgrade at the same time, but not dist-upgrade (which has slightly different resolution logic that d I d prefer to use). Also, it includes the overhead of what should be an unnecessary update of dpkg s database. Anyone know a better way to do this? Update: Julian Andres Klode pointed out that dist-upgrade actually takes package arguments too just like install. *face palm* I didn t even try it I believed the man-page and the -h output. It works perfectly!

2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

21 September 2020

Kees Cook: security things in Linux v5.7

Previously: v5.6 Linux v5.7 was released at the end of May. Here s my summary of various security things that caught my attention: arm64 kernel pointer authentication
While the ARMv8.3 CPU Pointer Authentication (PAC) feature landed for userspace already, Kristina Martsenko has now landed PAC support in kernel mode. The current implementation uses PACIASP which protects the saved stack pointer, similar to the existing CONFIG_STACKPROTECTOR feature, only faster. This also paves the way to sign and check pointers stored in the heap, as a way to defeat function pointer overwrites in those memory regions too. Since the behavior is different from the traditional stack protector, Amit Daniel Kachhap added an LKDTM test for PAC as well. BPF LSM
The kernel s Linux Security Module (LSM) API provide a way to write security modules that have traditionally implemented various Mandatory Access Control (MAC) systems like SELinux, AppArmor, etc. The LSM hooks are numerous and no one LSM uses them all, as some hooks are much more specialized (like those used by IMA, Yama, LoadPin, etc). There was not, however, any way to externally attach to these hooks (not even through a regular loadable kernel module) nor build fully dynamic security policy, until KP Singh landed the API for building LSM policy using BPF. With this, it is possible (for a privileged process) to write kernel LSM hooks in BPF, allowing for totally custom security policy (and reporting). execve() deadlock refactoring
There have been a number of long-standing races in the kernel s process launching code where ptrace could deadlock. Fixing these has been attempted several times over the last many years, but Eric W. Biederman and Ernd Edlinger decided to dive in, and successfully landed the a series of refactorings, splitting up the problematic locking and refactoring their uses to remove the deadlocks. While he was at it, Eric also extended the exec_id counter to 64 bits to avoid the possibility of the counter wrapping and allowing an attacker to send arbitrary signals to processes they normally shouldn t be able to. slub freelist obfuscation improvements
After Silvio Cesare observed some weaknesses in the implementation of CONFIG_SLAB_FREELIST_HARDENED s freelist pointer content obfuscation, I improved their bit diffusion, which makes attacks require significantly more memory content exposures to defeat the obfuscation. As part of the conversation, Vitaly Nikolenko pointed out that the freelist pointer s location made it relatively easy to target too (for either disclosures or overwrites), so I moved it away from the edge of the slab, making it harder to reach through small-sized overflows (which usually target the freelist pointer). As it turns out, there were a few assumptions in the kernel about the location of the freelist pointer, which had to also get cleaned up. RISCV page table dumping
Following v5.6 s generic page table dumping work, Zong Li landed the RISCV page dumping code. This means it s much easier to examine the kernel s page table layout when running a debug kernel (built with PTDUMP_DEBUGFS), visible in /sys/kernel/debug/kernel_page_tables. array index bounds checking
This is a pretty large area of work that touches a lot of overlapping elements (and history) in the Linux kernel. The short version is: C is bad at noticing when it uses an array index beyond the bounds of the declared array, and we need to fix that. For example, don t do this:
int foo[5];
...
foo[8] = bar;
The long version gets complicated by the evolution of flexible array structure members, so we ll pause for a moment and skim the surface of this topic. While things like CONFIG_FORTIFY_SOURCE try to catch these kinds of cases in the memcpy() and strcpy() family of functions, it doesn t catch it in open-coded array indexing, as seen in the code above. GCC has a warning (-Warray-bounds) for these cases, but it was disabled by Linus because of all the false positives seen due to fake flexible array members. Before flexible arrays were standardized, GNU C supported zero sized array members. And before that, C code would use a 1-element array. These were all designed so that some structure could be the header in front of some data blob that could be addressable through the last structure member:
/* 1-element array */
struct foo  
    ...
    char contents[1];
 ;
/* GNU C extension: 0-element array */
struct foo  
    ...
    char contents[0];
 ;
/* C standard: flexible array */
struct foo  
    ...
    char contents[];
 ;
instance = kmalloc(sizeof(struct foo) + content_size);
Converting all the zero- and one-element array members to flexible arrays is one of Gustavo A. R. Silva s goals, and hundreds of these changes started landing. Once fixed, -Warray-bounds can be re-enabled. Much more detail can be found in the kernel s deprecation docs. However, that will only catch the visible at compile time cases. For runtime checking, the Undefined Behavior Sanitizer has an option for adding runtime array bounds checking for catching things like this where the compiler cannot perform a static analysis of the index values:
int foo[5];
...
for (i = 0; i < some_argument; i++)  
    ...
    foo[i] = bar;
    ...
 
It was, however, not separate (via kernel Kconfig) until Elena Petrova and I split it out into CONFIG_UBSAN_BOUNDS, which is fast enough for production kernel use. With this enabled, it's now possible to instrument the kernel to catch these conditions, which seem to come up with some regularity in Wi-Fi and Bluetooth drivers for some reason. Since UBSAN (and the other Sanitizers) only WARN() by default, system owners need to set panic_on_warn=1 too if they want to defend against attacks targeting these kinds of flaws. Because of this, and to avoid bloating the kernel image with all the warning messages, I introduced CONFIG_UBSAN_TRAP which effectively turns these conditions into a BUG() without needing additional sysctl settings. Fixing "additive" snprintf() usage
A common idiom in C for building up strings is to use sprintf()'s return value to increment a pointer into a string, and build a string with more sprintf() calls:
/* safe if strlen(foo) + 1 < sizeof(string) */
wrote  = sprintf(string, "Foo: %s\n", foo);
/* overflows if strlen(foo) + strlen(bar) > sizeof(string) */
wrote += sprintf(string + wrote, "Bar: %s\n", bar);
/* writing way beyond the end of "string" now ... */
wrote += sprintf(string + wrote, "Baz: %s\n", baz);
The risk is that if these calls eventually walk off the end of the string buffer, it will start writing into other memory and create some bad situations. Switching these to snprintf() does not, however, make anything safer, since snprintf() returns how much it would have written:
/* safe, assuming available <= sizeof(string), and for this example
 * assume strlen(foo) < sizeof(string) */
wrote  = snprintf(string, available, "Foo: %s\n", foo);
/* if (strlen(bar) > available - wrote), this is still safe since the
 * write into "string" will be truncated, but now "wrote" has been
 * incremented by how much snprintf() *would* have written, so "wrote"
 * is now larger than "available". */
wrote += snprintf(string + wrote, available - wrote, "Bar: %s\n", bar);
/* string + wrote is beyond the end of string, and availabe - wrote wraps
 * around to a giant positive value, making the write effectively 
 * unbounded. */
wrote += snprintf(string + wrote, available - wrote, "Baz: %s\n", baz);
So while the first overflowing call would be safe, the next one would be targeting beyond the end of the array and the size calculation will have wrapped around to a giant limit. Replacing this idiom with scnprintf() solves the issue because it only reports what was actually written. To this end, Takashi Iwai has been landing a bunch scnprintf() fixes. That's it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.8.

2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

2 September 2020

Kees Cook: security things in Linux v5.6

Previously: v5.5. Linux v5.6 was released back in March. Here s my quick summary of various features that caught my attention: WireGuard
The widely used WireGuard VPN has been out-of-tree for a very long time. After 3 1/2 years since its initial upstream RFC, Ard Biesheuvel and Jason Donenfeld finished the work getting all the crypto prerequisites sorted out for the v5.5 kernel. For this release, Jason has gotten WireGuard itself landed. It was a twisty road, and I m grateful to everyone involved for sticking it out and navigating the compromises and alternative solutions. openat2() syscall and RESOLVE_* flags
Aleksa Sarai has added a number of important path resolution scoping options to the kernel s open() handling, covering things like not walking above a specific point in a path hierarchy (RESOLVE_BENEATH), disabling the resolution of various magic links (RESOLVE_NO_MAGICLINKS) in procfs (e.g. /proc/$pid/exe) and other pseudo-filesystems, and treating a given lookup as happening relative to a different root directory (as if it were in a chroot, RESOLVE_IN_ROOT). As part of this, it became clear that there wasn t a way to correctly extend the existing openat() syscall, so he added openat2() (which is a good example of the efforts being made to codify Extensible Syscall arguments). The RESOLVE_* set of flags also cover prior behaviors like RESOLVE_NO_XDEV and RESOLVE_NO_SYMLINKS. pidfd_getfd() syscall
In the continuing growth of the much-needed pidfd APIs, Sargun Dhillon has added the pidfd_getfd() syscall which is a way to gain access to file descriptors of a process in a race-less way (or when /proc is not mounted). Before, it wasn t always possible make sure that opening file descriptors via /proc/$pid/fd/$N was actually going to be associated with the correct PID. Much more detail about this has been written up at LWN. openat() via io_uring
With my attack surface reduction hat on, I remain personally suspicious of the io_uring() family of APIs, but I can t deny their utility for certain kinds of workloads. Being able to pipeline reads and writes without the overhead of actually making syscalls is pretty great for performance. Jens Axboe has added the IORING_OP_OPENAT command so that existing io_urings can open files to be added on the fly to the mapping of available read/write targets of a given io_uring. While LSMs are still happily able to intercept these actions, I remain wary of the growing syscall multiplexer that io_uring is becoming. I am, of course, glad to see that it has a comprehensive (if out of tree ) test suite as part of liburing. removal of blocking random pool
After making algorithmic changes to obviate separate entropy pools for random numbers, Andy Lutomirski removed the blocking random pool. This simplifies the kernel pRNG code significantly without compromising the userspace interfaces designed to fetch cryptographically secure random numbers. To quote Andy, This series should not break any existing programs. /dev/urandom is unchanged. /dev/random will still block just after booting, but it will block less than it used to. See LWN for more details on the history and discussion of the series. arm64 support for on-chip RNG
Mark Brown added support for the future ARMv8.5 s RNG (SYS_RNDR_EL0), which is, from the kernel s perspective, similar to x86 s RDRAND instruction. This will provide a bootloader-independent way to add entropy to the kernel s pRNG for early boot randomness (e.g. stack canary values, memory ASLR offsets, etc). Until folks are running on ARMv8.5 systems, they can continue to depend on the bootloader for randomness (via the UEFI RNG interface) on arm64. arm64 E0PD
Mark Brown added support for the future ARMv8.5 s E0PD feature (TCR_E0PD1), which causes all memory accesses from userspace into kernel space to fault in constant time. This is an attempt to remove any possible timing side-channel signals when probing kernel memory layout from userspace, as an alternative way to protect against Meltdown-style attacks. The expectation is that E0PD would be used instead of the more expensive Kernel Page Table Isolation (KPTI) features on arm64. powerpc32 VMAP_STACK
Christophe Leroy added VMAP_STACK support to powerpc32, joining x86, arm64, and s390. This helps protect against the various classes of attacks that depend on exhausting the kernel stack in order to collide with neighboring kernel stacks. (Another common target, the sensitive thread_info, had already been moved away from the bottom of the stack by Christophe Leroy in Linux v5.1.) generic Page Table dumping
Related to RISCV s work to add page table dumping (via /sys/fs/debug/kernel_page_tables), Steven Price extracted the existing implementations from multiple architectures and created a common page table dumping framework (and then refactored all the other architectures to use it). I m delighted to have this because I still remember when not having a working page table dumper for ARM delayed me for a while when trying to implement upstream kernel memory protections there. Anything that makes it easier for architectures to get their kernel memory protection working correctly makes me happy. That s in for now; let me know if there s anything you think I missed. Next up: Linux v5.7.

2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

14 August 2020

Keith Packard: picolibc-news

Picolibc Updates I thought work on picolibc would slow down at some point, but I keep finding more things that need work. I spent a few weeks working in libm and then discovered some important memory allocation bugs in the last week that needed attention too. Cleaning up the Picolibc Math Library Picolibc uses the same math library sources as newlib, which includes code from a range of sources:
  • SunPro (Sun Microsystems). This forms the bulk of the common code for the math library, with copyright dates stretching back to 1993. This code is designed for processors with FPUs and uses 'float' for float functions and 'double' for double functions.
  • NetBSD. This is where the complex functions came from, with Copyright dates of 2007.
  • FreeBSD. fenv support for aarch64, arm and sparc
  • IBM. SPU processor support along with a number of stubs for long-double support where long double is the same as double
  • Various processor vendors have provided processor-specific code for exceptions and a few custom functions.
  • Szabolcs Nagy and Wilco Dijkstra (ARM). These two re-implemented some of the more important functions in 2017-2018 for both float and double using double precision arithmetic and fused multiply-add primitives to improve performance for systems with hardware double precision support.
The original SunPro math code had been split into two levels at some point:
  1. IEEE-754 functions. These offer pure IEEE-754 semantics, including return values and exceptions. They do not set the POSIX errno value. These are all prefixed with __ieee754_ and can be called directly by applications if desired.
  2. POSIX functions. These can offer POSIX semantics, including setting errno and returning expected values when errno is set.
New Code Sponsored by ARM Szabolcs Nagy and Wilco Dijkstra's work in the last few years has been to improve the performance of some of the core math functions, which is much appreciated. They've adopted a more modern coding style (C99) and written faster code at the expense of a larger memory foot print. One interesting choice was to use double computations for the float implementations of various functions. This makes these functions shorter and more accurate than versions done using float throughout. However, for machines which don't have HW double, this pulls in soft double code which adds considerable size to the resulting binary and slows down the computations, especially if the platform does support HW float. The new code also takes advantage of HW fused-multiply-add instructions. Those offer more precision than a sequence of primitive instructions, and so the new code can be much shorter as a result. The method used to detect whether the target machine supported fma operations was slightly broken on 32-bit ARM platforms, where those with 'float' fma acceleration but without 'double' fma acceleration would use the shorter code sequence, but with an emulated fma operation that used the less-precise sequence of operations, leading to significant reductions in the quality of the resulting math functions. I fixed the double fma detection and then also added float fma detection along with implementations of float and double fma for ARM and RISC-V. Now both of those platforms get fma-enhanced math functions where available. Errno Adventures I'd submitted patches to newlib a while ago that aliased the regular math library names to the __ieee754_ functions when the library was configured to not set errno, which is pretty common for embedded environments where a shared errno is a pain anyways. Note the use of the word can in remark about the old POSIX wrapper functions. That's because all of these functions are run-time switchable between _IEEE_ and _POSIX_ mode using the _LIB_VERSION global symbol. When left in the usual _IEEE_ mode, none of this extra code was ever executed, so these wrapper functions never did anything beyond what the underlying __ieee754_ functions did. The new code by Nagy and Dijkstra changed how functions are structured to eliminate the underlying IEEE-754 api. These new functions use tail calls to various __math_ error reporting functions. Those can be configured at library build time to set errno or not, centralizing those decisions in a few functions. The result of this combination of source material is that in the default configuration, some library functions (those written by Nagy and Dijkstra) would set errno and others (the old SunPro code) would not. To disable all errno references, the library would need to be compiled with a set of options, -D_IEEE_LIBM to disable errno in the SunPro code and -DWANT_ERRNO=0 to disable errno in the new code. To enable errno everywhere, you'd set -D_POSIX_MODE to make the default value for _LIB_VERSION be _POSIX_ instead of _IEEE_. To clean all of this up, I removed the run-time _LIB_VERSION variable and made that compile-time. In combination with the earlier work to alias the __ieee754_ functions to the regular POSIX names when _IEEE_LIBM was defined this means that the old SunPro POSIX functions now only get used when _IEEE_LIBM is not defined, and in that case the _LIB_VERSION tests always force use of the errno setting code. In addition, I made the value of WANT_ERRNO depend on whether _IEEE_LIBM was defined, so now a single definition (-D_IEEE_LIBM) causes all of the errno handling from libm to be removed, independent of which code is in use. As part of this work, I added a range of errno tests for the math functions to find places where the wrong errno value was being used. Exceptions As an alternative to errno, C also provides for IEEE-754 exceptions through the fenv functions. These have some significant advantages, including having independent bits for each exception type and having them accumulate instead of sharing errno with a huge range of other C library functions. Plus, they're generally implemented in hardware, so you get exceptions for both library functions and primitive operations. Well, you should get exceptions everywhere, except that the GCC soft float libraries don't support them at all. So, errno can still be useful if you need to know what happened in your library functions when using soft floats. Newlib has recently seen a spate of fenv support being added for various architectures, so I decided that it would be a good idea to add some tests. I added tests for both primitive operations, and then tests for library functions to check both exceptions and errno values. Oddly, this uncovered a range of minor mistakes in various math functions. Lots of these were mistakes in the SunPro POSIX wrapper functions where they modified the return values from the __ieee754_ implementations. Simply removing those value modifications fixed many of those errors. Fixing Memory Allocator bugs Picolibc inherits malloc code from newlib which offers two separate implementations, one big and fast, the other small and slow(er). Selecting between them is done while building the library, and as Picolibc is expected to be used on smaller systems, the small and slow one is the default. Contributed by someone from ARM back in 2012/2013, nano-mallocr reminds me of the old V7 memory allocator. A linked list, sorted in address order, holds discontiguous chunks of available memory. Allocation is done by searching for a large enough chunk in the list. The first one large enough is selected, and if it is large enough, a chunk is split off and left on the free list while the remainder is handed to the application. When the list doesn't have any chunk large enough, sbrk is called to get more memory. Free operations involve walking the list and inserting the chunk in the right location, merging the freed memory with any immediately adjacent chunks to reduce fragmentation. The size of each chunk is stored just before the first byte of memory used by the application, where it remains while the memory is in use and while on the free list. The free list is formed by pointers stored in the active area of the chunk, so the only overhead for chunks in use is the size field. Something Something Padding To deal with the vagaries of alignment, the original nano-mallocr code would allow for there to be 'padding' between the size field and the active memory area. The amount of padding could vary, depending on the alignment required for a particular chunk (in the case of memalign, that padding can be quite large). If present, nano-mallocr would store the padding value in the location immediately before the active area and distinguish that from a regular size field by a negative sign. The whole padding thing seems mysterious to me -- why would it ever be needed when the allocator could simply create chunks that were aligned to the required value and a multiple of that value in size. The only use I could think of was for memalign; adding this padding field would allow for less over-allocation to find a suitable chunk. I didn't feel like this one (infrequent) use case was worth the extra complexity; it certainly caused me difficulty in reading the code. A Few Bugs In reviewing the code, I found a couple of easy-to-fix bugs.
  • calloc was not checking for overflow in multiplication. This is something I've only heard about in the last five or six years -- multiplying the size of each element by the number of elements can end up wrapping around to a small value which may actually succeed and cause the program to mis-behave.
  • realloc copied new_size bytes from the original location to the new location. If the new size was larger than the old, this would read off the end of the original allocation, potentially disclosing information from an adjacent allocation or walk off the end of physical memory and cause some hard fault.
Time For Testing Once I had uncovered a few bugs in this code, I decided that it would be good to write a few tests to exercise the API. With the tests running on four architectures in nearly 60 variants, it seemed like I'd be able to uncover at least a few more failures:
  • Error tests. Allocating too much memory and make sure the correct errors were returned and that nothing obviously untoward happened.
  • Touch tests. Just call the allocator and validate the return values.
  • Stress test. Allocate lots of blocks, resize them and free them. Make sure, using 'mallinfo', that the malloc arena looked reasonable.
These new tests did find bugs. But not where I expected them. Which is why I'm so fond of testing. GCC Optimizations One of my tests was to call calloc and make sure it returned a chunk of memory that appeared to work or failed with a reasonable value. To my surprise, on aarch64, that test never finished. It worked elsewhere, but on that architecture it hung in the middle of calloc itself. Which looked like this:
void * nano_calloc(malloc_size_t n, malloc_size_t elem)
 
    ptrdiff_t bytes;
    void * mem;
    if (__builtin_mul_overflow (n, elem, &bytes))
     
    RERRNO = ENOMEM;
    return NULL;
     
    mem = nano_malloc(bytes);
    if (mem != NULL) memset(mem, 0, bytes);
    return mem;
 
Note the naming here -- nano_mallocr uses nano_ prefixes in the code, but then uses #defines to change their names to those expected in the ABI. (No, I don't understand why either). However, GCC sees the real names and has some idea of what these functions are supposed to do. In particular, the pattern:
foo = malloc(n);
if (foo) memset(foo, '\0', n);
is converted into a shorter and semantically equivalent:
foo = calloc(n, 1);
Alas, GCC doesn't take into account that this optimization is occurring inside of the implementation of calloc. Another sequence of code looked like this:
chunk->size = foo
nano_free((char *) chunk + CHUNK_OFFSET);
Well, GCC knows that the content of memory passed to free cannot affect the operation of the application, and so it converted this into:
nano_free((char *) chunk + CHUNK_OFFSET);
Remember that nano_mallocr stores the size of the chunk just before the active memory. In this case, nano_mallocr was splitting a large chunk into two pieces, setting the size of the left-over part and placing that on the free list. Failing to set that size value left whatever was there before for the size and usually resulted in the free list becoming quite corrupted. Both of these problems can be corrected by compiling the code with a couple of GCC command-line switches (-fno-builtin-malloc and -fno-builtin-free). Reworking Malloc Having spent this much time reading through the nano_mallocr code, I decided to just go through it and make it easier for me to read today, hoping that other people (which includes 'future me') will also find it a bit easier to follow. I picked a couple of things to focus on:
  1. All newly allocated memory should be cleared. This reduces information disclosure between whatever code freed the memory and whatever code is about to use the memory. Plus, it reduces the effect of un-initialized allocations as they now consistently get zeroed memory. Yes, this masks bugs. Yes, this goes slower. This change is dedicated to Kees Cook, but please blame me for it not him.
  2. Get rid of the 'Padding' notion. Every time I read this code it made my brain hurt. I doubt I'll get any smarter in the future.
  3. Realloc could use some love, improving its efficiency in common cases to reduce memory usage.
  4. Reworking linked list walking. nano_mallocr uses a singly-linked free list and open-codes all list walking. Normally, I'd switch to a library implementation to avoid introducing my own bugs, but in this fairly simple case, I think it's a reasonable compromise to open-code the list operations using some patterns I learned while working at MIT from Bob Scheifler.
  5. Discover necessary values, like padding and the limits of the memory space, from the environment rather than having them hard-coded.
Padding To get rid of 'Padding' in malloc, I needed to make sure that every chunk was aligned and sized correctly. Remember that there is a header on every allocated chunk which is stored before the active memory which contains the size of the chunk. On 32-bit machines, that size is 4 bytes. If the machine requires allocations to be aligned on 8-byte boundaries (as might be the case for 'double' values), we're now going to force the alignment of the header to 8-bytes, wasting four bytes between the size field and the active memory. Well, the existing nano_mallocr code also wastes those four bytes to store the 'padding' value. Using a consistent alignment for chunk starting addresses and chunk sizes has made the code a lot simpler and easier to reason about while not using extra memory for normal allocation. Except for memalign, which I'll cover in the next section. realloc The original nano_realloc function was as simple as possible:
mem = nano_malloc(new_size);
if (mem)  
    memcpy(mem, old, MIN(old_size, new_size));
    nano_free(old);
 
return mem;
However, this really performs badly when the application is growing a buffer while accumulating data. A couple of simple optimizations occurred to me:
  1. If there's a free chunk just after the original location, it could be merged to the existing block and avoid copying the data.
  2. If the original chunk is at the end of the heap, call sbrk() to increase the size of the chunk.
The second one seems like the more important case; in a small system, the buffer will probably land at the end of the heap at some point, at which point growing it to the size of available memory becomes quite efficient. When shrinking the buffer, instead of allocating new space and copying, if there's enough space being freed for a new chunk, create one and add it to the free list. List Walking Walking singly-linked lists seem like one of the first things we see when learning pointer manipulation in C:
for (element = head; element; element = element->next)
    do stuff ...
However, this becomes pretty complicated when 'do stuff' includes removing something from the list:
prev = NULL;
for (element = head; element; element = element->next)
    ...
    if (found)
        break;
    ...
    prev = element
if (prev != NULL)
    prev->next = element->next;
else
    head = element->next;
An extra variable, and a test to figure out how to re-link the list. Bob showed me a simpler way, which I'm sure many people are familiar with:
for (ptr = &head; (element = *ptr); ptr = &(element->next))
    ...
    if (found)
        break;
*ptr = element->next;
Insertion is similar, as you would expect:
for (ptr = &head; (element = *ptr); ptr = &(element->next))
    if (found)
        break;
new_element->next = element;
*ptr = new_element;
In terms of memory operations, it's the same -- each 'next' pointer is fetched exactly once and the list is re-linked by performing a single store. In terms of reading the code, once you've seen this pattern, getting rid of the extra variable and the conditionals around the list update makes it shorter and less prone to errors. In the nano_mallocr code, instead of using 'prev = NULL', it actually used 'prev = free_list', and the test for updating the head was 'prev == element', which really caught me unawares. System Parameters Any malloc implementation needs to know a couple of things about the system it's running on:
  1. Address space. The maximum range of possible addresses sets the limit on how large a block of memory might be allocated, and hence the size of the 'size' field. Fortunately, we've got the 'size_t' type for this, so we can just use that.
  2. Alignment requirements. These derive from the alignment requirements of the basic machine types, including pointers, integers and floating point numbers which are formed from a combination of machine requirements (some systems will fault if attempting to use memory with the wrong alignment) along with a compromise between memory usage and memory system performance.
I decided to let the system tell me the alignment necessary using a special type declaration and the 'offsetof' operation:
typedef struct  
    char c;
    union  
    void *p;
    double d;
    long long ll;
    size_t s;
      u;
  align_t;
#define MALLOC_ALIGN        (offsetof(align_t, u))
Because C requires struct fields to be stored in order of declaration, the 'u' field would have to be after the 'c' field, and would have to be assigned an offset equal to the largest alignment necessary for any of its members. Testing on a range of machines yields the following alignment requirements:
Architecture Alignment
x86_64 8
RISC-V 8
aarch64 8
arm 8
x86 4
So, I guess I could have just used a constant value of '8' and not worried about it, but using the compiler-provided value means that running picolibc on older architectures might save a bit of memory at no real cost in the code. Now, the header containing the 'size' field can be aligned to this value, and all allocated blocks can be allocated in units of this value. memalign memalign, valloc and pvalloc all allocate memory with restrictions on the alignment of the base address and length. You'd think these would be simple -- allocate a large chunk, align within that chunk and return the address. However, they also all require that the address can be successfully passed to free. Which means that the allocator needs to do some tricks to make it all work. Essentially, you allocate 'lots' of memory and then arrange that any bytes at the head and tail of the allocation can be returned to the free list. The tail part is easy; if it's large enough to form a free chunk (which must contain the size and a 'next' pointer for the free list), it can be split off. Otherwise, it just sits at the end of the allocation being wasted space. The head part is a bit tricky when it's not large enough to form a free chunk. That's where the 'padding' business came in handy; that can be as small as a 'size_t' value, which (on 32-bit systems) is only four bytes. Now that we're giving up trying to reason about 'padding', any extra block at the start must be big enough to hold a free block, which includes the size and a next pointer. On 32-bit systems, that's just 8 bytes which (for most of our targets) is the same as the alignment value we're using. On 32-bit systems that can use 4-byte alignment, and on 64-bit systems, it's possible that the alignment required by the application for memalign and the alignment of a chunk returned by malloc might be off by too small an amount to create a free chunk. So, we just allocate a lot of extra space; enough so that we can create a block of size 'toosmall + align' at the start and create a free chunk of memory out of that. This works, and at least returns all of the unused memory back for other allocations. Sending Patches Back to Newlib I've sent the floating point fixes upstream to newlib where they've already landed on master. I've sent most of the malloc fixes, but I'm not sure they really care about seeing nano_mallocr refactored. If they do, I'll spend the time necessary to get the changes ported back to the newlib internal APIs and merged upstream.

27 May 2020

Kees Cook: security things in Linux v5.5

Previously: v5.4. I got a bit behind on this blog post series! Let s get caught up. Here are a bunch of security things I found interesting in the Linux kernel v5.5 release: restrict perf_event_open() from LSM
Given the recurring flaws in the perf subsystem, there has been a strong desire to be able to entirely disable the interface. While the kernel.perf_event_paranoid sysctl knob has existed for a while, attempts to extend its control to block all perf_event_open() calls have failed in the past. Distribution kernels have carried the rejected sysctl patch for many years, but now Joel Fernandes has implemented a solution that was deemed acceptable: instead of extending the sysctl, add LSM hooks so that LSMs (e.g. SELinux, Apparmor, etc) can make these choices as part of their overall system policy. generic fast full refcount_t
Will Deacon took the recent refcount_t hardening work for both x86 and arm64 and distilled the implementations into a single architecture-agnostic C version. The result was almost as fast as the x86 assembly version, but it covered more cases (e.g. increment-from-zero), and is now available by default for all architectures. (There is no longer any Kconfig associated with refcount_t; the use of the primitive provides full coverage.) linker script cleanup for exception tables
When Rick Edgecombe presented his work on building Execute-Only memory under a hypervisor, he noted a region of memory that the kernel was attempting to read directly (instead of execute). He rearranged things for his x86-only patch series to work around the issue. Since I d just been working in this area, I realized the root cause of this problem was the location of the exception table (which is strictly a lookup table and is never executed) and built a fix for the issue and applied it to all architectures, since it turns out the exception tables for almost all architectures are just a data table. Hopefully this will help clear the path for more Execute-Only memory work on all architectures. In the process of this, I also updated the section fill bytes on x86 to be a trap (0xCC, int3), instead of a NOP instruction so functions would need to be targeted more precisely by attacks. KASLR for 32-bit PowerPC
Joining many other architectures, Jason Yan added kernel text base-address offset randomization (KASLR) to 32-bit PowerPC. seccomp for RISC-V
After a bit of long road, David Abdurachmanov has added seccomp support to the RISC-V architecture. The series uncovered some more corner cases in the seccomp self tests code, which is always nice since then we get to make it more robust for the future! seccomp USER_NOTIF continuation
When the seccomp SECCOMP_RET_USER_NOTIF interface was added, it seemed like it would only be used in very limited conditions, so the idea of needing to handle normal requests didn t seem very onerous. However, since then, it has become clear that the overhead of a monitor process needing to perform lots of normal open() calls on behalf of the monitored process started to look more and more slow and fragile. To deal with this, it became clear that there needed to be a way for the USER_NOTIF interface to indicate that seccomp should just continue as normal and allow the syscall without any special handling. Christian Brauner implemented SECCOMP_USER_NOTIF_FLAG_CONTINUE to get this done. It comes with a bit of a disclaimer due to the chance that monitors may use it in places where ToCToU is a risk, and for possible conflicts with SECCOMP_RET_TRACE. But overall, this is a net win for container monitoring tools. EFI_RNG_PROTOCOL for x86
Some EFI systems provide a Random Number Generator interface, which is useful for gaining some entropy in the kernel during very early boot. The arm64 boot stub has been using this for a while now, but Dominik Brodowski has now added support for x86 to do the same. This entropy is useful for kernel subsystems performing very earlier initialization whre random numbers are needed (like randomizing aspects of the SLUB memory allocator). FORTIFY_SOURCE for MIPS
As has been enabled on many other architectures, Dmitry Korotin got MIPS building with CONFIG_FORTIFY_SOURCE, so compile-time (and some run-time) buffer overflows during calls to the memcpy() and strcpy() families of functions will be detected. limit copy_ to,from _user() size to INT_MAX
As done for VFS, vsnprintf(), and strscpy(), I went ahead and limited the size of copy_to_user() and copy_from_user() calls to INT_MAX in order to catch any weird overflows in size calculations. Other things
Alexander Popov pointed out some more v5.5 features that I missed in this blog post. I m repeating them here, with some minor edits/clarifications. Thank you Alexander! Edit: added Alexander Popov s notes That s it for v5.5! Let me know if there s anything else that I should call out here. Next up: Linux v5.6.

2020, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

15 November 2017

Kees Cook: security things in Linux v4.14

Previously: v4.13. Linux kernel v4.14 was released this last Sunday, and there s a bunch of security things I think are interesting: vmapped kernel stack on arm64
Similar to the same feature on x86, Mark Rutland and Ard Biesheuvel implemented CONFIG_VMAP_STACK for arm64, which moves the kernel stack to an isolated and guard-paged vmap area. With traditional stacks, there were two major risks when exhausting the stack: overwriting the thread_info structure (which contained the addr_limit field which is checked during copy_to/from_user()), and overwriting neighboring stacks (or other things allocated next to the stack). While arm64 previously moved its thread_info off the stack to deal with the former issue, this vmap change adds the last bit of protection by nature of the vmap guard pages. If the kernel tries to write past the end of the stack, it will hit the guard page and fault. (Testing for this is now possible via LKDTM s STACK_GUARD_PAGE_LEADING/TRAILING tests.) One aspect of the guard page protection that will need further attention (on all architectures) is that if the stack grew because of a giant Variable Length Array on the stack (effectively an implicit alloca() call), it might be possible to jump over the guard page entirely (as seen in the userspace Stack Clash attacks). Thankfully the use of VLAs is rare in the kernel. In the future, hopefully we ll see the addition of PaX/grsecurity s STACKLEAK plugin which, in addition to its primary purpose of clearing the kernel stack on return to userspace, makes sure stack expansion cannot skip over guard pages. This stack probing ability will likely also become directly available from the compiler as well. set_fs() balance checking
Related to the addr_limit field mentioned above, another class of bug is finding a way to force the kernel into accidentally leaving addr_limit open to kernel memory through an unbalanced call to set_fs(). In some areas of the kernel, in order to reuse userspace routines (usually VFS or compat related), code will do something like: set_fs(KERNEL_DS); ...some code here...; set_fs(USER_DS);. When the USER_DS call goes missing (usually due to a buggy error path or exception), subsequent system calls can suddenly start writing into kernel memory via copy_to_user (where the to user really means within the addr_limit range ). Thomas Garnier implemented USER_DS checking at syscall exit time for x86, arm, and arm64. This means that a broken set_fs() setting will not extend beyond the buggy syscall that fails to set it back to USER_DS. Additionally, as part of the discussion on the best way to deal with this feature, Christoph Hellwig and Al Viro (and others) have been making extensive changes to avoid the need for set_fs() being used at all, which should greatly reduce the number of places where it might be possible to introduce such a bug in the future. SLUB freelist hardening
A common class of heap attacks is overwriting the freelist pointers stored inline in the unallocated SLUB cache objects. PaX/grsecurity developed an inexpensive defense that XORs the freelist pointer with a global random value (and the storage address). Daniel Micay improved on this by using a per-cache random value, and I refactored the code a bit more. The resulting feature, enabled with CONFIG_SLAB_FREELIST_HARDENED, makes freelist pointer overwrites very hard to exploit unless an attacker has found a way to expose both the random value and the pointer location. This should render blind heap overflow bugs much more difficult to exploit. Additionally, Alexander Popov implemented a simple double-free defense, similar to the fasttop check in the GNU C library, which will catch sequential free()s of the same pointer. (And has already uncovered a bug.) Future work would be to provide similar metadata protections to the SLAB allocator (though SLAB doesn t store its freelist within the individual unused objects, so it has a different set of exposures compared to SLUB). setuid-exec stack limitation
Continuing the various additional defenses to protect against future problems related to userspace memory layout manipulation (as shown most recently in the Stack Clash attacks), I implemented an 8MiB stack limit for privileged (i.e. setuid) execs, inspired by a similar protection in grsecurity, after reworking the secureexec handling by LSMs. This complements the unconditional limit to the size of exec arguments that landed in v4.13. randstruct automatic struct selection
While the bulk of the port of the randstruct gcc plugin from grsecurity landed in v4.13, the last of the work needed to enable automatic struct selection landed in v4.14. This means that the coverage of randomized structures, via CONFIG_GCC_PLUGIN_RANDSTRUCT, now includes one of the major targets of exploits: function pointer structures. Without knowing the build-randomized location of a callback pointer an attacker needs to overwrite in a structure, exploits become much less reliable. structleak passed-by-reference variable initialization
Ard Biesheuvel enhanced the structleak gcc plugin to initialize all variables on the stack that are passed by reference when built with CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL. Normally the compiler will yell if a variable is used before being initialized, but it silences this warning if the variable s address is passed into a function call first, as it has no way to tell if the function did actually initialize the contents. So the plugin now zero-initializes such variables (if they hadn t already been initialized) before the function call that takes their address. Enabling this feature has a small performance impact, but solves many stack content exposure flaws. (In fact at least one such flaw reported during the v4.15 development cycle was mitigated by this plugin.) improved boot entropy
Laura Abbott and Daniel Micay improved early boot entropy available to the stack protector by both moving the stack protector setup later in the boot, and including the kernel command line in boot entropy collection (since with some devices it changes on each boot). eBPF JIT for 32-bit ARM
The ARM BPF JIT had been around a while, but it didn t support eBPF (and, as a result, did not provide constant value blinding, which meant it was exposed to being used by an attacker to build arbitrary machine code with BPF constant values). Shubham Bansal spent a bunch of time building a full eBPF JIT for 32-bit ARM which both speeds up eBPF and brings it up to date on JIT exploit defenses in the kernel. seccomp improvements
Tyler Hicks addressed a long-standing deficiency in how seccomp could log action results. In addition to creating a way to mark a specific seccomp filter as needing to be logged with SECCOMP_FILTER_FLAG_LOG, he added a new action result, SECCOMP_RET_LOG. With these changes in place, it should be much easier for developers to inspect the results of seccomp filters, and for process launchers to generate logs for their child processes operating under a seccomp filter. Additionally, I finally found a way to implement an often-requested feature for seccomp, which was to kill an entire process instead of just the offending thread. This was done by creating the SECCOMP_RET_ACTION_FULL mask (n e SECCOMP_RET_ACTION) and implementing SECCOMP_RET_KILL_PROCESS. That s it for now; please let me know if I missed anything. The v4.15 merge window is now open!

2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

Russ Allbery: Review: The Piper's Son

Review: The Piper's Son, by Melina Marchetta
Series: Francesca #2
Publisher: Candlewick Press
Copyright: 2010
Printing: 2011
ISBN: 0-7636-5458-2
Format: Kindle
Pages: 330
Tom Mackee's family has fallen apart. The impetus was the death of his uncle Joe in the London tube terrorist bombings, but that was only the start. He destroyed his chances with the only woman he really loved. His father's drinking got out of control, his mother left with his younger sister to live in a different city, and he refused to go with them and abandon his father. But then, six months later, his father abandoned him anyway. As this novel opens, Tom collapses while performing a music set, high on drugs and no sleep, and wakes up to discover his roommates have been fired from their jobs for stealing, and in turn have thrown him out of their apartment. He's at rock bottom. The one place he can turn for a place to stay is his aunt Georgie, the second (although less frequent) viewpoint character of this book. She was the one who took the trip to the UK to try to find out what happened and retrieve her brother's body, and the one who had to return to Australia with nothing. Her life isn't in much better shape than Tom's. She's kept her job, but she's pregnant by her ex-boyfriend but barely talking to him, since he now has a son by another woman he met during their separation. And she's not even remotely over her grief. The whole Finch/Mackee family is, in short, a disaster. But they have a few family relationships left that haven't broken, some underlying basic decency, and some patient and determined friends. I should warn up-front, despite having read this book without knowing this, that this is a sequel to Saving Francesca, set five years later and focusing on secondary characters from the original novel. I've subsequently read that book as well, though, and I don't think reading it first is necessary. This is one of the rare books where being a sequel made it a better stand-alone novel. I never felt a gap of missing story, just a rich and deep background of friendships and previous relationships that felt realistic. People are embedded in networks of relationships even when they feel the most alone, and I really enjoyed seeing that surface in this book. All those patterns from Tom's past didn't feel like information I was missing. They felt like glimpses of what you'd see if you looked into any other person's life. The plot summary above might make The Piper's Son sound like a depressing drama fest, but Marchetta made an excellent writing decision: the worst of this has already happened before the start of the book, and the rest is in the first chapter. This is not at all a book about horrible things happening to people. It's a book about healing. An authentic, prickly, angry healing that doesn't forget and doesn't turn into simple happily-ever-after stories, but does involve a lot of recognition that one has been an ass, and that it's possible to be less of an ass in the future, and maybe some things can be fixed. A plot summary might fool you into thinking that this is a book about a boy and his father, or about dealing with a drunk you still love. It's not. The bright current under this whole story is not father-son bonding. It's female friendships. Marchetta pulls off a beautiful double-story, writing a book that's about Tom, and Georgie, and the layered guilt and tragedy of the Finch/Mackee family, but whose emotional heart is their friends. Francesca, Justine, absent Siobhan. Georgie's friend Lucia. Ned, the cook, and his interactions with Tom's friends. And Tara Finke, also mostly absent, but perfectly written into the story in letters and phone calls. Marchetta never calls unnecessary attention to this, keeping the camera on Tom and Georgie, but the process of reading this book is a dawning realization of just how much work friendship is doing under the surface, how much back-channel conversation is happening off the page, and how much careful and thoughtful and determined work went into providing Tom a floor, a place to get his feet under him, and enough of a shove for him to pull himself together. Pulling that off requires a deft and subtle authorial touch, and I'm in awe at how well it worked. This is a beautifully written novel. Marchetta never belabors an emotional point, sticking with a clear and spare description of actions and thoughts, with just the right sentences scattered here and there to expose the character's emotions. Tom's family is awful at communication, which is much of the reason why they start the book in the situation they're in, but Marchetta somehow manages to write that in a way that didn't just frustrate me or make me want to start banging their heads together. She somehow conveys the extent to which they're trying, even when they're failing, and adds just the right descriptions so that the reader can follow the wordless messages they send each other even when they can't manage to talk directly. I usually find it very hard to connect with people who can only communicate by doing things rather than saying them. It's a high compliment to the author that I felt I understood Tom and his family as well as I did. One bit of warning: while this is not a story of a grand reunion with an alcoholic father where all is forgiven because family, thank heavens, there is an occasional wiggle in that direction. There is also a steady background assumption that one should always try to repair family relationships, and a few tangential notes about the Finches and Mackees that made me think there was a bit more abuse here than anyone involved wants to admit. I don't think the book is trying to make apologies for this, and instead is trying to walk the fine line of talking about realistically messed up families, but I also don't have a strong personal reaction to that type of story. If you have an aversion to "we should all get along because faaaaamily" stories, you may want to skip this book, or at least go in pre-warned. That aside, the biggest challenge I had in reading this book was not breaking into tears. The emotional arc is just about perfect. Tom and Georgie never stay stuck in the same emotional cycle for too long, Marchetta does a wonderful job showing irritating characters from a slightly different angle and having them become much less irritating, and the interactions between Tom, Tara, and Francesca are just perfect. I don't remember reading another book that so beautifully captures that sensation of knowing that you've been a total ass, knowing that you need to stop, but realizing just how much work you're going to have to do, and how hard that work will be, once you own up to how much you fucked up. That point where you keep being an ass for a few moments longer, because stopping is going to hurt so much, but end up stopping anyway because you can't stand yourself any more. And stopping and making amends is hard and hurts badly, and yet somehow isn't quite as bad as you thought it was going to be. This is really great stuff. One final complaint, though: what is it with mainstream fiction and the total lack of denouement? I don't read very much mainstream fiction, but this is the second really good mainstream book I've read (after The Death of Bees) that hits its climax and then unceremoniously dumps the reader on the ground and disappears. Come back here! I wasn't done with these people! I don't need a long happily-ever-after story, but give me at least a handful of pages to be happy with the characters after crying with them for hours! ARGH. But, that aside, the reader does get that climax, and it's note-perfect to the rest of the book. Everyone is still themselves, no one gets suddenly transformed, and yet everything is... better. It's the kind of book you can trust. Highly, highly recommended. Rating: 9 out of 10

5 September 2017

Kees Cook: security things in Linux v4.13

Previously: v4.12. Here s a short summary of some of interesting security things in Sunday s v4.13 release of the Linux kernel: security documentation ReSTification
The kernel has been switching to formatting documentation with ReST, and I noticed that none of the Documentation/security/ tree had been converted yet. I took the opportunity to take a few passes at formatting the existing documentation and, at Jon Corbet s recommendation, split it up between end-user documentation (which is mainly how to use LSMs) and developer documentation (which is mainly how to use various internal APIs). A bunch of these docs need some updating, so maybe with the improved visibility, they ll get some extra attention. CONFIG_REFCOUNT_FULL
Since Peter Zijlstra implemented the refcount_t API in v4.11, Elena Reshetova (with Hans Liljestrand and David Windsor) has been systematically replacing atomic_t reference counters with refcount_t. As of v4.13, there are now close to 125 conversions with many more to come. However, there were concerns over the performance characteristics of the refcount_t implementation from the maintainers of the net, mm, and block subsystems. In order to assuage these concerns and help the conversion progress continue, I added an unchecked refcount_t implementation (identical to the earlier atomic_t implementation) as the default, with the fully checked implementation now available under CONFIG_REFCOUNT_FULL. The plan is that for v4.14 and beyond, the kernel can grow per-architecture implementations of refcount_t that have performance characteristics on par with atomic_t (as done in grsecurity s PAX_REFCOUNT). CONFIG_FORTIFY_SOURCE
Daniel Micay created a version of glibc s FORTIFY_SOURCE compile-time and run-time protection for finding overflows in the common string (e.g. strcpy, strcmp) and memory (e.g. memcpy, memcmp) functions. The idea is that since the compiler already knows the size of many of the buffer arguments used by these functions, it can already build in checks for buffer overflows. When all the sizes are known at compile time, this can actually allow the compiler to fail the build instead of continuing with a proven overflow. When only some of the sizes are known (e.g. destination size is known at compile-time, but source size is only known at run-time) run-time checks are added to catch any cases where an overflow might happen. Adding this found several places where minor leaks were happening, and Daniel and I chased down fixes for them. One interesting note about this protection is that is only examines the size of the whole object for its size (via __builtin_object_size(..., 0)). If you have a string within a structure, CONFIG_FORTIFY_SOURCE as currently implemented will make sure only that you can t copy beyond the structure (but therefore, you can still overflow the string within the structure). The next step in enhancing this protection is to switch from 0 (above) to 1, which will use the closest surrounding subobject (e.g. the string). However, there are a lot of cases where the kernel intentionally copies across multiple structure fields, which means more fixes before this higher level can be enabled. NULL-prefixed stack canary
Rik van Riel and Daniel Micay changed how the stack canary is defined on 64-bit systems to always make sure that the leading byte is zero. This provides a deterministic defense against overflowing string functions (e.g. strcpy), since they will either stop an overflowing read at the NULL byte, or be unable to write a NULL byte, thereby always triggering the canary check. This does reduce the entropy from 64 bits to 56 bits for overflow cases where NULL bytes can be written (e.g. memcpy), but the trade-off is worth it. (Besdies, x86_64 s canary was 32-bits until recently.) IPC refactoring
Partially in support of allowing IPC structure layouts to be randomized by the randstruct plugin, Manfred Spraul and I reorganized the internal layout of how IPC is tracked in the kernel. The resulting allocations are smaller and much easier to deal with, even if I initially missed a few needed container_of() uses. randstruct gcc plugin
I ported grsecurity s clever randstruct gcc plugin to upstream. This plugin allows structure layouts to be randomized on a per-build basis, providing a probabilistic defense against attacks that need to know the location of sensitive structure fields in kernel memory (which is most attacks). By moving things around in this fashion, attackers need to perform much more work to determine the resulting layout before they can mount a reliable attack. Unfortunately, due to the timing of the development cycle, only the manual mode of randstruct landed in upstream (i.e. marking structures with __randomize_layout). v4.14 will also have the automatic mode enabled, which randomizes all structures that contain only function pointers. A large number of fixes to support randstruct have been landing from v4.10 through v4.13, most of which were already identified and fixed by grsecurity, but many were novel, either in newly added drivers, as whitelisted cross-structure casts, refactorings (like IPC noted above), or in a corner case on ARM found during upstream testing. lower ELF_ET_DYN_BASE
One of the issues identified from the Stack Clash set of vulnerabilities was that it was possible to collide stack memory with the highest portion of a PIE program s text memory since the default ELF_ET_DYN_BASE (the lowest possible random position of a PIE executable in memory) was already so high in the memory layout (specifically, 2/3rds of the way through the address space). Fixing this required teaching the ELF loader how to load interpreters as shared objects in the mmap region instead of as a PIE executable (to avoid potentially colliding with the binary it was loading). As a result, the PIE default could be moved down to ET_EXEC (0x400000) on 32-bit, entirely avoiding the subset of Stack Clash attacks. 64-bit could be moved to just above the 32-bit address space (0x100000000), leaving the entire 32-bit region open for VMs to do 32-bit addressing, but late in the cycle it was discovered that Address Sanitizer couldn t handle it moving. With most of the Stack Clash risk only applicable to 32-bit, fixing 64-bit has been deferred until there is a way to teach Address Sanitizer how to load itself as a shared object instead of as a PIE binary. early device randomness
I noticed that early device randomness wasn t actually getting added to the kernel entropy pools, so I fixed that to improve the effectiveness of the latent_entropy gcc plugin. That s it for now; please let me know if I missed anything. As a side note, I was rather alarmed to discover that due to all my trivial ReSTification formatting, and tiny FORTIFY_SOURCE and randstruct fixes, I made it into the most active 4.13 developers list (by patch count) at LWN with 76 patches: a whopping 0.6% of the cycle s patches. ;) Anyway, the v4.14 merge window is open!

2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

30 August 2017

Kees Cook: GRUB and LUKS

I got myself stuck yesterday with GRUB running from an ext4 /boot/grub, but with /boot inside my LUKS LVM root partition, which meant GRUB couldn t load the initramfs and kernel. Luckily, it turns out that GRUB does know how to mount LUKS volumes (and LVM volumes), but all the instructions I could find talk about setting this up ahead of time ( Add GRUB_ENABLE_CRYPTODISK=y to /etc/default/grub ), rather than what the correct manual GRUB commands are to get things running on a failed boot. These are my notes on that, in case I ever need to do this again, since there was one specific gotcha with using GRUB s cryptomount command (noted below). Available devices were the raw disk (hd0), the /boot/grub partition (hd0,msdos1), and the LUKS volume (hd0,msdos5):
grub> ls
(hd0) (hd0,msdos1) (hd0,msdos5)
Used cryptomount to open the LUKS volume (but without ()s! It says it works if you use parens, but then you can t use the resulting (crypto0)):
grub> insmod luks
grub> cryptomount hd0,msdos5
Enter password...
Slot 0 opened.
Then you can load LVM and it ll see inside the LUKS volume:
grub> insmod lvm
grub> ls
(crypto0) (hd0) (hd0,msdos1) (hd0,msdos5) (lvm/rootvg-rootlv)
And then I could boot normally:
grub> configfile $prefix/grub.cfg
After booting, I added GRUB_ENABLE_CRYPTODISK=y to /etc/default/grub and ran update-grub. I could boot normally after that, though I d be prompted twice for the LUKS passphrase (once by GRUB, then again by the initramfs). To avoid this, it s possible to add a second LUKS passphrase, contained in a file in the initramfs, as described here and works for Ubuntu and Debian too. The quick summary is: Create the keyfile and add it to LUKS:
# dd bs=512 count=4 if=/dev/urandom of=/crypto_keyfile.bin
# chmod 0400 /crypto_keyfile.bin
# cryptsetup luksAddKey /dev/sda5 /crypto_keyfile.bin
*enter original password*
Adjust the /etc/crypttab to include passing the file via /bin/cat:
sda5_crypt UUID=4aa5da72-8da6-11e7-8ac9-001cc008534d /crypto_keyfile.bin luks,keyscript=/bin/cat
Add an initramfs hook to copy the key file into the initramfs, keep non-root users from being able to read your initramfs, and trigger a rebuild:
# cat > /etc/initramfs-tools/hooks/crypto_keyfile <<EOF
#!/bin/bash
if [ "$1" = "prereqs" ] ; then
    cp /crypto_keyfile.bin "$ DESTDIR "
fi
EOF
# chmod a+x /etc/initramfs-tools/hooks/crypto_keyfile
# chmod 0700 /boot
# update-initramfs -u
This has the downside of leaving a LUKS passphrase in the clear while you re booted, but if someone has root, they can just get your dm-crypt encryption key directly anyway:
# dmsetup table --showkeys sda5_crypt
0 155797496 crypt aes-cbc-essiv:sha256 e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 0 8:5 2056
And of course if you re worried about Evil Maid attacks, you ll need a real static root of trust instead of doing full disk encryption passphrase prompting from an unverified /boot partition. :)

2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

10 July 2017

Kees Cook: security things in Linux v4.12

Previously: v4.11. Here s a quick summary of some of the interesting security things in last week s v4.12 release of the Linux kernel: x86 read-only and fixed-location GDT
With kernel memory base randomization, it was stil possible to figure out the per-cpu base address via the sgdt instruction, since it would reveal the per-cpu GDT location. To solve this, Thomas Garnier moved the GDT to a fixed location. And to solve the risk of an attacker targeting the GDT directly with a kernel bug, he also made it read-only. usercopy consolidation
After hardened usercopy landed, Al Viro decided to take a closer look at all the usercopy routines and then consolidated the per-architecture uaccess code into a single implementation. The per-architecture code was functionally very similar to each other, so it made sense to remove the redundancy. In the process, he uncovered a number of unhandled corner cases in various architectures (that got fixed by the consolidation), and made hardened usercopy available on all remaining architectures. ASLR entropy sysctl on PowerPC
Continuing to expand architecture support for the ASLR entropy sysctl, Michael Ellerman implemented the calculations needed for PowerPC. This lets userspace choose to crank up the entropy used for memory layouts. LSM structures read-only
James Morris used __ro_after_init to make the LSM structures read-only after boot. This removes them as a desirable target for attackers. Since the hooks are called from all kinds of places in the kernel this was a favorite method for attackers to use to hijack execution of the kernel. (A similar target used to be the system call table, but that has long since been made read-only.) KASLR enabled by default on x86
With many distros already enabling KASLR on x86 with CONFIG_RANDOMIZE_BASE and CONFIG_RANDOMIZE_MEMORY, Ingo Molnar felt the feature was mature enough to be enabled by default. Expand stack canary to 64 bits on 64-bit systems
The stack canary values used by CONFIG_CC_STACKPROTECTOR is most powerful on x86 since it is different per task. (Other architectures run with a single canary for all tasks.) While the first canary chosen on x86 (and other architectures) was a full unsigned long, the subsequent canaries chosen per-task for x86 were being truncated to 32-bits. Daniel Micay fixed this so now x86 (and future architectures that gain per-task canary support) have significantly increased entropy for stack-protector. Expanded stack/heap gap
Hugh Dickens, with input from many other folks, improved the kernel s mitigation against having the stack and heap crash into each other. This is a stop-gap measure to help defend against the Stack Clash attacks. Additional hardening needs to come from the compiler to produce stack probes when doing large stack expansions. Any Variable Length Arrays on the stack or alloca() usage needs to have machine code generated to touch each page of memory within those areas to let the kernel know that the stack is expanding, but with single-page granularity. That s it for now; please let me know if I missed anything. The v4.13 merge window is open!

2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

2 May 2017

Kees Cook: security things in Linux v4.11

Previously: v4.10. Here s a quick summary of some of the interesting security things in this week s v4.11 release of the Linux kernel: refcount_t infrastructure Building on the efforts of Elena Reshetova, Hans Liljestrand, and David Windsor to port PaX s PAX_REFCOUNT protection, Peter Zijlstra implemented a new kernel API for reference counting with the addition of the refcount_t type. Until now, all reference counters were implemented in the kernel using the atomic_t type, but it has a wide and general-purpose API that offers no reasonable way to provide protection against reference counter overflow vulnerabilities. With a dedicated type, a specialized API can be designed so that reference counting can be sanity-checked and provide a way to block overflows. With 2016 alone seeing at least a couple public exploitable reference counting vulnerabilities (e.g. CVE-2016-0728, CVE-2016-4558), this is going to be a welcome addition to the kernel. The arduous task of converting all the atomic_t reference counters to refcount_t will continue for a while to come. CONFIG_DEBUG_RODATA renamed to CONFIG_STRICT_KERNEL_RWX Laura Abbott landed changes to rename the kernel memory protection feature. The protection hadn t been debug for over a decade, and it covers all kernel memory sections, not just rodata . Getting it consolidated under the top-level arch Kconfig file also brings some sanity to what was a per-architecture config, and signals that this is a fundamental kernel protection needed to be enabled on all architectures. read-only usermodehelper A common way attackers use to escape confinement is by rewriting the user-mode helper sysctls (e.g. /proc/sys/kernel/modprobe) to run something of their choosing in the init namespace. To reduce attack surface within the kernel, Greg KH introduced CONFIG_STATIC_USERMODEHELPER, which switches all user-mode helper binaries to a single read-only path (which defaults to /sbin/usermode-helper). Userspace will need to support this with a new helper tool that can demultiplex the kernel request to a set of known binaries. seccomp coredumps Mike Frysinger noticed that it wasn t possible to get coredumps out of processes killed by seccomp, which could make debugging frustrating, especially for automated crash dump analysis tools. In keeping with the existing documentation for SIGSYS, which says a coredump should be generated, he added support to dump core on seccomp SECCOMP_RET_KILL results. structleak plugin Ported from PaX, I landed the structleak plugin which enforces that any structure containing a __user annotation is fully initialized to 0 so that stack content exposures of these kinds of structures are entirely eliminated from the kernel. This was originally designed to stop a specific vulnerability, and will now continue to block similar exposures. ASLR entropy sysctl on MIPS
Matt Redfearn implemented the ASLR entropy sysctl for MIPS, letting userspace choose to crank up the entropy used for memory layouts. NX brk on powerpc Denys Vlasenko fixed a long standing bug where the kernel made assumptions about ELF memory layouts and defaulted the the brk section on powerpc to be executable. Now it s not, and that ll keep process heap from being abused. That s it for now; please let me know if I missed anything. The v4.12 merge window is open!

2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

28 February 2017

Kees Cook: security things in Linux v4.10

Previously: v4.9. Here s a quick summary of some of the interesting security things in last week s v4.10 release of the Linux kernel: PAN emulation on arm64 Catalin Marinas introduced ARM64_SW_TTBR0_PAN, which is functionally the arm64 equivalent of arm s CONFIG_CPU_SW_DOMAIN_PAN. While Privileged eXecute Never (PXN) has been available in ARM hardware for a while now, Privileged Access Never (PAN) will only be available in hardware once vendors start manufacturing ARMv8.1 or later CPUs. Right now, everything is still ARMv8.0, which left a bit of a gap in security flaw mitigations on ARM since CONFIG_CPU_SW_DOMAIN_PAN can only provide PAN coverage on ARMv7 systems, but nothing existed on ARMv8.0. This solves that problem and closes a common exploitation method for arm64 systems. thread_info relocation on arm64 As done earlier for x86, Mark Rutland has moved thread_info off the kernel stack on arm64. With thread_info no longer on the stack, it s more difficult for attackers to find it, which makes it harder to subvert the very sensitive addr_limit field. linked list hardening
I added CONFIG_BUG_ON_DATA_CORRUPTION to restore the original CONFIG_DEBUG_LIST behavior that existed prior to v2.6.27 (9 years ago): if list metadata corruption is detected, the kernel refuses to perform the operation, rather than just WARNing and continuing with the corrupted operation anyway. Since linked list corruption (usually via heap overflows) are a common method for attackers to gain a write-what-where primitive, it s important to stop the list add/del operation if the metadata is obviously corrupted. seeding kernel RNG from UEFI A problem for many architectures is finding a viable source of early boot entropy to initialize the kernel Random Number Generator. For x86, this is mainly solved with the RDRAND instruction. On ARM, however, the solutions continue to be very vendor-specific. As it turns out, UEFI is supposed to hide various vendor-specific things behind a common set of APIs. The EFI_RNG_PROTOCOL call is designed to provide entropy, but it can t be called when the kernel is running. To get entropy into the kernel, Ard Biesheuvel created a UEFI config table (LINUX_EFI_RANDOM_SEED_TABLE_GUID) that is populated during the UEFI boot stub and fed into the kernel entropy pool during early boot. arm64 W^X detection As done earlier for x86, Laura Abbott implemented CONFIG_DEBUG_WX on arm64. Now any dangerous arm64 kernel memory protections will be loudly reported at boot time. 64-bit get_user() zeroing fix on arm
While the fix itself is pretty minor, I like that this bug was found through a combined improvement to the usercopy test code in lib/test_user_copy.c. Hoeun Ryu added zeroing-on-failure testing, and I expanded the get_user()/put_user() tests to include all sizes. Neither improvement alone would have found the ARM bug, but together they uncovered a typo in a corner case. no-new-privs visible in /proc/$pid/status
This is a tiny change, but I like being able to introspect processes externally. Prior to this, I wasn t able to trivially answer the question is that process setting the no-new-privs flag? To address this, I exposed the flag in /proc/$pid/status, as NoNewPrivs. That s all for now! Please let me know if you saw anything else you think needs to be called out. :) I m already excited about the v4.11 merge window opening

2017, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

12 December 2016

Kees Cook: security things in Linux v4.9

Previously: v4.8. Here are a bunch of security things I m excited about in the newly released Linux v4.9: Latent Entropy GCC plugin Building on her earlier work to bring GCC plugin support to the Linux kernel, Emese Revfy ported PaX s Latent Entropy GCC plugin to upstream. This plugin is significantly more complex than the others that have already been ported, and performs extensive instrumentation of functions marked with __latent_entropy. These functions have their branches and loops adjusted to mix random values (selected at build time) into a global entropy gathering variable. Since the branch and loop ordering is very specific to boot conditions, CPU quirks, memory layout, etc, this provides some additional uncertainty to the kernel s entropy pool. Since the entropy actually gathered is hard to measure, no entropy is credited , but rather used to mix the existing pool further. Probably the best place to enable this plugin is on small devices without other strong sources of entropy. vmapped kernel stack and thread_info relocation on x86 Normally, kernel stacks are mapped together in memory. This meant that attackers could use forms of stack exhaustion (or stack buffer overflows) to reach past the end of a stack and start writing over another process s stack. This is bad, and one way to stop it is to provide guard pages between stacks, which is provided by vmalloced memory. Andy Lutomirski did a bunch of work to move to vmapped kernel stack via CONFIG_VMAP_STACK on x86_64. Now when writing past the end of the stack, the kernel will immediately fault instead of just continuing to blindly write. Related to this, the kernel was storing thread_info (which contained sensitive values like addr_limit) at the bottom of the kernel stack, which was an easy target for attackers to hit. Between a combination of explicitly moving targets out of thread_info, removing needless fields, and entirely moving thread_info off the stack, Andy Lutomirski and Linus Torvalds created CONFIG_THREAD_INFO_IN_TASK for x86. CONFIG_DEBUG_RODATA mandatory on arm64 As recently done for x86, Mark Rutland made CONFIG_DEBUG_RODATA mandatory on arm64. This feature controls whether the kernel enforces proper memory protections on its own memory regions (code memory is executable and read-only, read-only data is actually read-only and non-executable, and writable data is non-executable). This protection is a fundamental security primitive for kernel self-protection, so there s no reason to make the protection optional. random_page() cleanup Cleaning up the code around the userspace ASLR implementations makes them easier to reason about. This has been happening for things like the recent consolidation on arch_mmap_rnd() for ET_DYN and during the addition of the entropy sysctl. Both uncovered some awkward uses of get_random_int() (or similar) in and around arch_mmap_rnd() (which is used for mmap (and therefore shared library) and PIE ASLR), as well as in randomize_stack_top() (which is used for stack ASLR). Jason Cooper cleaned things up further by doing away with randomize_range() entirely and replacing it with the saner random_page(), making the per-architecture arch_randomize_brk() (responsible for brk ASLR) much easier to understand. That s it for now! Let me know if there are other fun things to call attention to in v4.9.

2016, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

Next.