Search Results: "Keith Packard"

2 March 2022

Keith Packard: picolibc-testing

Testing Picolibc with the glibc tests Picolibc has a bunch of built-in tests, but more testing is always better, right? I decided to see how hard it would be to run some of the tests provided in the GNU C Library (glibc). Parallel meson build files Similar to how Picolibc uses meson build files to avoid modifying the newlib autotools infrastructure, I decided to take the glibc code and write meson build rules that would compile the tests against Picolibc header files and link against Picolibc libraries. I decided to select a single target for this project so I could focus on getting things building and not worry too much about making it portable. I wanted to pick something that had hardware floating point so that I would have rounding modes and exception support, so I picked the ARM Cortex M7 with hard float ABI:
$ arm-none-eabi-gcc -mcpu=cortex-m7 -mfloat-abi=hard
It should be fairly easy to extend this to other targets, but for now, that's what I've got working. There's a cross-cortex-m7.txt file containing all of the cross compilation details which is used when running meson setup. All of the Picolibc-specific files live in a new picolibc directory so they are isolated from the main glibc code. Pre-loading a pile of hacks Adapt Picolibc to support the Glibc test code required a bunch of random hacks, from providing _unlocked versions of the stdio macros to stubbing out various unsupportable syscalls (like sleep and chdir). Instead of modifying the Glibc code, I created a file called hacks.h which is full of this stuff and used the gcc -include parameter to read that into the compiler before starting compilation on all of the files. Supporting command line parameters The glibc tests all support a range of command line parameters, some of which turned out to be quite useful for this work. Picolibc had limited semihosting support for accessing the command line, but that required modifying applications to go fetch the command line using a special semihosting function. To make this easier, I added a new crt0 variant for picolibc called (oddly) semihost. This extends the existing hosted variant by adding a call to the semihosting library to fetch the current command line and splitting that into words at each space. It doesn't handle any quoting, but it's sufficient for the needs here. Avoiding glibc headers The glibc tests use some glibc-specific extensions to the standard POSIX C library, so I needed to include those in the test builds. Headers for those extensions are mixed in with the rest of the POSIX standard headers, which conflict with the Picolibc versions. To work around this, I stuck stub #include files in the picolibc directory which directly include the appropriate headers for the glibc API extensions. This includes things like argp.h and array_length.h. For other headers which weren't actually needed for picolibc, I created empty files. Adding more POSIX to Picolibc At this point, code was compiling but failing to find various standard POSIX functions which aren't available in Picolibc. That included some syscalls which could be emulated using semihosting, like gettimeofday and getpagesize. It also included some generally useful additions, like replacing ecvtbuf and fcvtbuf with ecvt_r and fcvt_r. The _r variants provide a target buffer size instead of assuming that it was large enough as the Picolibc buf variants did. Which tests are working? So far, I've got some of the tests in malloc, math, misc and stdio-common running. There are a lot of tests in the malloc directory which cover glibc API extensions or require POSIX syscalls not supported by semihosting. I think I've added all of the tests which should be supported. For the math tests, I'm testing the standard POSIX math APIs in both float and double forms, except for the Bessel and Gamma functions. Picolibc's versions of those are so bad that they violate some pretty basic assumptions about error bounds built into the glibc test code. Until Picolibc gets better versions of these functions, we'll have to skip testing them this way. In the misc directory, I've only added tests for ecvt, fcvt, gcvt, dirname and hsearch. I don't think there are any other tests there which should work. Finally, for stdio-common, almost all of the tests expect a fully functioning file system, which semihosting really can't support. As a result, we're only able to run the printf, scanf and some of the sprintf tests. All in all, we're running 78 of the glibc test programs, which is a small fraction of the total tests, but I think it's the bulk of the tests which cover APIs that don't depend too much on the underlying POSIX operating system. Bugs found and fixed in Picolibc This exercise has resulted in 17 fixes in Picolibc, which can be classified as:
  1. Math functions taking sNaN and not raising FE_INVALID and returning qNaN. Almost any operation on an sNaN value is supposed to signal an invalid operand and replace that with a qNaN so that further processing doesn't raise another exception. This was fairly easy to fix, just need to use return x + x; instead of return x;.
  2. Math functions failing to set errno. I'd recently restructured the math library to get rid of the separate IEEE version of the functions which didn't set errno and missed a couple of cases that needed to use the errno-setting helper functions.
  3. Corner cases in string/float conversion, including the need to perform round-to-even for '%a' conversions in printf and supporting negative decimal values for fcvt. This particular exercise led to replacing the ecvtbuf and fcvtbuf APIs with glibc's ecvt_r and fcvt_r versions as those pass explicit buffer lengths, making overflow prevention far more reliable.
  4. A bunch of malloc entry points were not handling failure correctly; allocation values close to the maximum possible size caused numerous numeric wrapping errors with the usual consequences (allocations "succeed", but return a far smaller buffer than requested). Also, realloc was failing to check the return value from malloc before copying the old data, which really isn't a good idea.
  5. Tinystdio's POSIX support code was returning an error instead of EOF at end of file.
  6. Error bounds for the Picolibc math library aren't great; I had to generate Picolibc-specific ulps files. Most functions are between 0 and 3 ulps, but for some reason, the float version of erfc (ercf) has errors as large as 64 ulps. That needs investigation.
Tests added to Picolibc With all of the fixes applied to Picolibc, I added some tests to verify them without relying on running the glibc tests, that includes sNaN vs qNaN tests for math functions, testing fopen and mktemp, checking the printf %a rounding support and adding a ecvt/fcvt tests. I also discovered that the github-based CI system was compiling but not testing when using clang with a riscv target, so that got added in. Where's the source code? The Picolibc changes are sitting on the glibc-testing branch. I'll merge them once the current CI pass finishes. The hacked-up Glibc bits are in a glibc mirror at github in the picolibc project on the picolibc-testing branch. It would be interesting to know what additional tests should be usable in this environment. And, perhaps, finding a way to use this for picolibc CI testing in the future. Concluding thoughts Overall, I'm pretty pleased with these results. The bugs in malloc are fairly serious and could easily result in trouble, but the remaining issues are mostly minor and shouldn't have a big impact on applications using Picolibc. I'll get the changes merged and start thinking about doing another Picolibc release.

1 January 2021

Keith Packard: kgames

Reviving Very Old X Code I've taken the week between Christmas and New Year's off this year. I didn't really have anything serious planned, just taking a break from the usual routine. As often happens, I got sucked into doing a project when I received this simple bug report Debian Bug #974011
I have been researching old terminal and X games recently, and realized
that much of the code from 'xmille' originated from the terminal game
'mille', which is part of bsdgames.
[The copyright and license information] has been stripped out of all
code in the xmille distribution.  Also, none of the included materials
give credit to the original author, Ken Arnold.
The reason the 'xmille' source is missing copyright and license information from the 'mille' files is that they were copied in before that information was added upstream. Xmille forked from Mille around 1987 or so. I wrote the UI parts for the system I had at the time, which was running X10R4. A very basic port to X11 was done at some point, and that's what Debian has in the archive today. At some point in the 90s, I ported Xmille to the Athena widget set, including several custom widgets in an Xaw extension library, Xkw. It's a lot better than the version in Debian, including displaying the cards correctly (the Debian version has some pretty bad color issues). Here's what the current Debian version looks like: Fixing The Bug To fix the missing copyright and license information, I imported the mille source code into the "latest" Xaw-based version. The updated mille code had a number of bug fixes and improvements, along with the copyright information. That should have been sufficient to resolve the issue and I could have constructed a suitable source package from whatever bits were needed and and uploaded that as a replacement 'xmille' package. However, at some later point, I had actually merged xmille into a larger package, 'kgames', which also included a number of other games, including Reversi, Dominoes, Cribbage and ten Solitaire/Patience variants. (as an aside, those last ten games formed the basis for my Patience Palm Pilot application, which seems to have inspired an Android App of the same name...) So began my yak shaving holiday. Building Kgames in 2020 Ok, so getting this old source code running should be easy, right? It's just a bunch of C code designed in the 80s and 90s to work on VAXen and their kin. How hard could it be?
  1. Everything was a 32-bit computer back then; pointers and ints were both 32 bits, so you could cast them with wild abandon and cause no problems. Today, testing revealed segfaults in some corners of the code.
  2. It's K&R C code. Remember that the first version of ANSI-C didn't come out until 1989, and it was years later that we could reliably expect to find an ANSI compiler with a random Unix box.
  3. It's X11 code. Fortunately (?), X11 hasn't changed since these applications were written, so at least that part still works just fine. Imagine trying to build Windows or Mac OS code from the early 90's on a modern OS...
I decided to dig in and add prototypes everywhere; that found a lot of pointer/int casting issues, as well as several lurking bugs where the code was just plain broken. After a day or so, I had things building and running and was no longer hitting crashes. Kgames 1.0 uploaded to Debian New Queue With that done, I decided I could at least upload the working bits to the Debian archive and close the bug reported above. kgames 1.0-2 may eventually get into unstable, presumably once the Debian FTP team realizes just how important fixing this bug is. Or something. Here's what xmille looks like in this version: And here's my favorite solitaire variant too: But They Look So Old Yeah, Xaw applications have a rustic appearance which may appeal to some, but for people with higher resolution monitors and well seasoned eyesight, squinting at the tiny images and text makes it difficult to enjoy these games today. How hard could it be to update them to use larger cards and scalable fonts? Xkw version 2.0 I decided to dig in and start hacking the code, starting by adding new widgets to the Xkw library that used cairo for drawing instead of core X calls. Fortunately, the needs of the games were pretty limited, so I only needed to implement a handful of widgets: The other Xkw widgets all got their rendering switched to using cairo, plus using double buffering to make updates look better. SVG Playing Cards Looking on wikimedia, I found a page referencing a large number of playing cards in SVG form That led me to Adrian Kennard's playing card web site that let me customize and download a deck of cards, licensed using the CC0 Public Domain license. With these cards, I set about rewriting the Xkw playing card widget, stripping out three different versions of bitmap playing cards and replacing them with just these new SVG versions. SVG Xmille Cards Ok, so getting regular playing cards was good, but the original goal was to update Xmille, and that has cards hand drawn by me. I could just use those images, import them into cairo and let it scale them to suit on the screen. I decided to experiment with inkscape's bitmap tracing code to see what it could do with them. First, I had to get them into a format that inkscape could parse. That turned out to be a bit tricky; the original format is as a set of X bitmap layers; each layer painting a single color. I ended up hacking the Xmille source code to generate the images using X, then fetching them with XGetImage and walking them to construct XPM format files which could then be fed into the portable bitmap tools to create PNG files that inkscape could handle. The resulting images have a certain charm: I did replace the text in the images to make it readable, otherwise these are untouched from what inkscape generated. The Results Remember that all of these are applications built using the venerable X toolkit; there are still some non-antialiased graphics visible as the shaped buttons use the X Shape extension. But, all rendering is now done with cairo, so it's all anti-aliased and all scalable. Here's what Xmille looks like after the upgrades: And here's spider: Once kgames 1.0 reaches Debian unstable, I'll upload these new versions.

14 August 2020

Keith Packard: picolibc-news

Picolibc Updates I thought work on picolibc would slow down at some point, but I keep finding more things that need work. I spent a few weeks working in libm and then discovered some important memory allocation bugs in the last week that needed attention too. Cleaning up the Picolibc Math Library Picolibc uses the same math library sources as newlib, which includes code from a range of sources: The original SunPro math code had been split into two levels at some point:
  1. IEEE-754 functions. These offer pure IEEE-754 semantics, including return values and exceptions. They do not set the POSIX errno value. These are all prefixed with __ieee754_ and can be called directly by applications if desired.
  2. POSIX functions. These can offer POSIX semantics, including setting errno and returning expected values when errno is set.
New Code Sponsored by ARM Szabolcs Nagy and Wilco Dijkstra's work in the last few years has been to improve the performance of some of the core math functions, which is much appreciated. They've adopted a more modern coding style (C99) and written faster code at the expense of a larger memory foot print. One interesting choice was to use double computations for the float implementations of various functions. This makes these functions shorter and more accurate than versions done using float throughout. However, for machines which don't have HW double, this pulls in soft double code which adds considerable size to the resulting binary and slows down the computations, especially if the platform does support HW float. The new code also takes advantage of HW fused-multiply-add instructions. Those offer more precision than a sequence of primitive instructions, and so the new code can be much shorter as a result. The method used to detect whether the target machine supported fma operations was slightly broken on 32-bit ARM platforms, where those with 'float' fma acceleration but without 'double' fma acceleration would use the shorter code sequence, but with an emulated fma operation that used the less-precise sequence of operations, leading to significant reductions in the quality of the resulting math functions. I fixed the double fma detection and then also added float fma detection along with implementations of float and double fma for ARM and RISC-V. Now both of those platforms get fma-enhanced math functions where available. Errno Adventures I'd submitted patches to newlib a while ago that aliased the regular math library names to the __ieee754_ functions when the library was configured to not set errno, which is pretty common for embedded environments where a shared errno is a pain anyways. Note the use of the word can in remark about the old POSIX wrapper functions. That's because all of these functions are run-time switchable between _IEEE_ and _POSIX_ mode using the _LIB_VERSION global symbol. When left in the usual _IEEE_ mode, none of this extra code was ever executed, so these wrapper functions never did anything beyond what the underlying __ieee754_ functions did. The new code by Nagy and Dijkstra changed how functions are structured to eliminate the underlying IEEE-754 api. These new functions use tail calls to various __math_ error reporting functions. Those can be configured at library build time to set errno or not, centralizing those decisions in a few functions. The result of this combination of source material is that in the default configuration, some library functions (those written by Nagy and Dijkstra) would set errno and others (the old SunPro code) would not. To disable all errno references, the library would need to be compiled with a set of options, -D_IEEE_LIBM to disable errno in the SunPro code and -DWANT_ERRNO=0 to disable errno in the new code. To enable errno everywhere, you'd set -D_POSIX_MODE to make the default value for _LIB_VERSION be _POSIX_ instead of _IEEE_. To clean all of this up, I removed the run-time _LIB_VERSION variable and made that compile-time. In combination with the earlier work to alias the __ieee754_ functions to the regular POSIX names when _IEEE_LIBM was defined this means that the old SunPro POSIX functions now only get used when _IEEE_LIBM is not defined, and in that case the _LIB_VERSION tests always force use of the errno setting code. In addition, I made the value of WANT_ERRNO depend on whether _IEEE_LIBM was defined, so now a single definition (-D_IEEE_LIBM) causes all of the errno handling from libm to be removed, independent of which code is in use. As part of this work, I added a range of errno tests for the math functions to find places where the wrong errno value was being used. Exceptions As an alternative to errno, C also provides for IEEE-754 exceptions through the fenv functions. These have some significant advantages, including having independent bits for each exception type and having them accumulate instead of sharing errno with a huge range of other C library functions. Plus, they're generally implemented in hardware, so you get exceptions for both library functions and primitive operations. Well, you should get exceptions everywhere, except that the GCC soft float libraries don't support them at all. So, errno can still be useful if you need to know what happened in your library functions when using soft floats. Newlib has recently seen a spate of fenv support being added for various architectures, so I decided that it would be a good idea to add some tests. I added tests for both primitive operations, and then tests for library functions to check both exceptions and errno values. Oddly, this uncovered a range of minor mistakes in various math functions. Lots of these were mistakes in the SunPro POSIX wrapper functions where they modified the return values from the __ieee754_ implementations. Simply removing those value modifications fixed many of those errors. Fixing Memory Allocator bugs Picolibc inherits malloc code from newlib which offers two separate implementations, one big and fast, the other small and slow(er). Selecting between them is done while building the library, and as Picolibc is expected to be used on smaller systems, the small and slow one is the default. Contributed by someone from ARM back in 2012/2013, nano-mallocr reminds me of the old V7 memory allocator. A linked list, sorted in address order, holds discontiguous chunks of available memory. Allocation is done by searching for a large enough chunk in the list. The first one large enough is selected, and if it is large enough, a chunk is split off and left on the free list while the remainder is handed to the application. When the list doesn't have any chunk large enough, sbrk is called to get more memory. Free operations involve walking the list and inserting the chunk in the right location, merging the freed memory with any immediately adjacent chunks to reduce fragmentation. The size of each chunk is stored just before the first byte of memory used by the application, where it remains while the memory is in use and while on the free list. The free list is formed by pointers stored in the active area of the chunk, so the only overhead for chunks in use is the size field. Something Something Padding To deal with the vagaries of alignment, the original nano-mallocr code would allow for there to be 'padding' between the size field and the active memory area. The amount of padding could vary, depending on the alignment required for a particular chunk (in the case of memalign, that padding can be quite large). If present, nano-mallocr would store the padding value in the location immediately before the active area and distinguish that from a regular size field by a negative sign. The whole padding thing seems mysterious to me -- why would it ever be needed when the allocator could simply create chunks that were aligned to the required value and a multiple of that value in size. The only use I could think of was for memalign; adding this padding field would allow for less over-allocation to find a suitable chunk. I didn't feel like this one (infrequent) use case was worth the extra complexity; it certainly caused me difficulty in reading the code. A Few Bugs In reviewing the code, I found a couple of easy-to-fix bugs. Time For Testing Once I had uncovered a few bugs in this code, I decided that it would be good to write a few tests to exercise the API. With the tests running on four architectures in nearly 60 variants, it seemed like I'd be able to uncover at least a few more failures: These new tests did find bugs. But not where I expected them. Which is why I'm so fond of testing. GCC Optimizations One of my tests was to call calloc and make sure it returned a chunk of memory that appeared to work or failed with a reasonable value. To my surprise, on aarch64, that test never finished. It worked elsewhere, but on that architecture it hung in the middle of calloc itself. Which looked like this:
void * nano_calloc(malloc_size_t n, malloc_size_t elem)
    ptrdiff_t bytes;
    void * mem;
    if (__builtin_mul_overflow (n, elem, &bytes))
    return NULL;
    mem = nano_malloc(bytes);
    if (mem != NULL) memset(mem, 0, bytes);
    return mem;
Note the naming here -- nano_mallocr uses nano_ prefixes in the code, but then uses #defines to change their names to those expected in the ABI. (No, I don't understand why either). However, GCC sees the real names and has some idea of what these functions are supposed to do. In particular, the pattern:
foo = malloc(n);
if (foo) memset(foo, '\0', n);
is converted into a shorter and semantically equivalent:
foo = calloc(n, 1);
Alas, GCC doesn't take into account that this optimization is occurring inside of the implementation of calloc. Another sequence of code looked like this:
chunk->size = foo
nano_free((char *) chunk + CHUNK_OFFSET);
Well, GCC knows that the content of memory passed to free cannot affect the operation of the application, and so it converted this into:
nano_free((char *) chunk + CHUNK_OFFSET);
Remember that nano_mallocr stores the size of the chunk just before the active memory. In this case, nano_mallocr was splitting a large chunk into two pieces, setting the size of the left-over part and placing that on the free list. Failing to set that size value left whatever was there before for the size and usually resulted in the free list becoming quite corrupted. Both of these problems can be corrected by compiling the code with a couple of GCC command-line switches (-fno-builtin-malloc and -fno-builtin-free). Reworking Malloc Having spent this much time reading through the nano_mallocr code, I decided to just go through it and make it easier for me to read today, hoping that other people (which includes 'future me') will also find it a bit easier to follow. I picked a couple of things to focus on:
  1. All newly allocated memory should be cleared. This reduces information disclosure between whatever code freed the memory and whatever code is about to use the memory. Plus, it reduces the effect of un-initialized allocations as they now consistently get zeroed memory. Yes, this masks bugs. Yes, this goes slower. This change is dedicated to Kees Cook, but please blame me for it not him.
  2. Get rid of the 'Padding' notion. Every time I read this code it made my brain hurt. I doubt I'll get any smarter in the future.
  3. Realloc could use some love, improving its efficiency in common cases to reduce memory usage.
  4. Reworking linked list walking. nano_mallocr uses a singly-linked free list and open-codes all list walking. Normally, I'd switch to a library implementation to avoid introducing my own bugs, but in this fairly simple case, I think it's a reasonable compromise to open-code the list operations using some patterns I learned while working at MIT from Bob Scheifler.
  5. Discover necessary values, like padding and the limits of the memory space, from the environment rather than having them hard-coded.
Padding To get rid of 'Padding' in malloc, I needed to make sure that every chunk was aligned and sized correctly. Remember that there is a header on every allocated chunk which is stored before the active memory which contains the size of the chunk. On 32-bit machines, that size is 4 bytes. If the machine requires allocations to be aligned on 8-byte boundaries (as might be the case for 'double' values), we're now going to force the alignment of the header to 8-bytes, wasting four bytes between the size field and the active memory. Well, the existing nano_mallocr code also wastes those four bytes to store the 'padding' value. Using a consistent alignment for chunk starting addresses and chunk sizes has made the code a lot simpler and easier to reason about while not using extra memory for normal allocation. Except for memalign, which I'll cover in the next section. realloc The original nano_realloc function was as simple as possible:
mem = nano_malloc(new_size);
if (mem)  
    memcpy(mem, old, MIN(old_size, new_size));
return mem;
However, this really performs badly when the application is growing a buffer while accumulating data. A couple of simple optimizations occurred to me:
  1. If there's a free chunk just after the original location, it could be merged to the existing block and avoid copying the data.
  2. If the original chunk is at the end of the heap, call sbrk() to increase the size of the chunk.
The second one seems like the more important case; in a small system, the buffer will probably land at the end of the heap at some point, at which point growing it to the size of available memory becomes quite efficient. When shrinking the buffer, instead of allocating new space and copying, if there's enough space being freed for a new chunk, create one and add it to the free list. List Walking Walking singly-linked lists seem like one of the first things we see when learning pointer manipulation in C:
for (element = head; element; element = element->next)
    do stuff ...
However, this becomes pretty complicated when 'do stuff' includes removing something from the list:
prev = NULL;
for (element = head; element; element = element->next)
    if (found)
    prev = element
if (prev != NULL)
    prev->next = element->next;
    head = element->next;
An extra variable, and a test to figure out how to re-link the list. Bob showed me a simpler way, which I'm sure many people are familiar with:
for (ptr = &head; (element = *ptr); ptr = &(element->next))
    if (found)
*ptr = element->next;
Insertion is similar, as you would expect:
for (ptr = &head; (element = *ptr); ptr = &(element->next))
    if (found)
new_element->next = element;
*ptr = new_element;
In terms of memory operations, it's the same -- each 'next' pointer is fetched exactly once and the list is re-linked by performing a single store. In terms of reading the code, once you've seen this pattern, getting rid of the extra variable and the conditionals around the list update makes it shorter and less prone to errors. In the nano_mallocr code, instead of using 'prev = NULL', it actually used 'prev = free_list', and the test for updating the head was 'prev == element', which really caught me unawares. System Parameters Any malloc implementation needs to know a couple of things about the system it's running on:
  1. Address space. The maximum range of possible addresses sets the limit on how large a block of memory might be allocated, and hence the size of the 'size' field. Fortunately, we've got the 'size_t' type for this, so we can just use that.
  2. Alignment requirements. These derive from the alignment requirements of the basic machine types, including pointers, integers and floating point numbers which are formed from a combination of machine requirements (some systems will fault if attempting to use memory with the wrong alignment) along with a compromise between memory usage and memory system performance.
I decided to let the system tell me the alignment necessary using a special type declaration and the 'offsetof' operation:
typedef struct  
    char c;
    void *p;
    double d;
    long long ll;
    size_t s;
#define MALLOC_ALIGN        (offsetof(align_t, u))
Because C requires struct fields to be stored in order of declaration, the 'u' field would have to be after the 'c' field, and would have to be assigned an offset equal to the largest alignment necessary for any of its members. Testing on a range of machines yields the following alignment requirements:
Architecture Alignment
x86_64 8
aarch64 8
arm 8
x86 4
So, I guess I could have just used a constant value of '8' and not worried about it, but using the compiler-provided value means that running picolibc on older architectures might save a bit of memory at no real cost in the code. Now, the header containing the 'size' field can be aligned to this value, and all allocated blocks can be allocated in units of this value. memalign memalign, valloc and pvalloc all allocate memory with restrictions on the alignment of the base address and length. You'd think these would be simple -- allocate a large chunk, align within that chunk and return the address. However, they also all require that the address can be successfully passed to free. Which means that the allocator needs to do some tricks to make it all work. Essentially, you allocate 'lots' of memory and then arrange that any bytes at the head and tail of the allocation can be returned to the free list. The tail part is easy; if it's large enough to form a free chunk (which must contain the size and a 'next' pointer for the free list), it can be split off. Otherwise, it just sits at the end of the allocation being wasted space. The head part is a bit tricky when it's not large enough to form a free chunk. That's where the 'padding' business came in handy; that can be as small as a 'size_t' value, which (on 32-bit systems) is only four bytes. Now that we're giving up trying to reason about 'padding', any extra block at the start must be big enough to hold a free block, which includes the size and a next pointer. On 32-bit systems, that's just 8 bytes which (for most of our targets) is the same as the alignment value we're using. On 32-bit systems that can use 4-byte alignment, and on 64-bit systems, it's possible that the alignment required by the application for memalign and the alignment of a chunk returned by malloc might be off by too small an amount to create a free chunk. So, we just allocate a lot of extra space; enough so that we can create a block of size 'toosmall + align' at the start and create a free chunk of memory out of that. This works, and at least returns all of the unused memory back for other allocations. Sending Patches Back to Newlib I've sent the floating point fixes upstream to newlib where they've already landed on master. I've sent most of the malloc fixes, but I'm not sure they really care about seeing nano_mallocr refactored. If they do, I'll spend the time necessary to get the changes ported back to the newlib internal APIs and merged upstream.

3 June 2020

Keith Packard: picolibc-ryu

Float/String Conversion in Picolibc: Enter Ry I recently wrote about this topic having concluded that the best route for now was to use the malloc-free, but imprecise, conversion routines in the tinystdio alternative. A few days later, Sreepathi Pai pointed me at some very recent work in this area: This is amazing! Thirty years after the papers referenced in the previous post, Ulf Adams came up with some really cool ideas and managed to reduce the math required for 64-bit conversion to 128 bit integers. This is a huge leap forward; we were doing long multi-precision computations before, and now it's all short enough to fit in registers (ok, a lot of registers, but still). Getting the Ry Code The code is available on github: Reading through it, it's very clear that the author focuses on performance with lots of tuning for common cases. Still, it's quite readable, especially compared with the newlib multi-precision based code. Picolibc String/Float conversion interface Picolibc has some pretty basic needs for the float/string conversion code, it wants four functions:
  1. __dtoa_engine
    __dtoa_engine(double x, struct dtoa *dtoa, uint8_t max_digits, uint8_t max_decimals);
    This converts the double x to a string of decimal digits and a decimal exponent stored inside the 'dtoa' struct. It limits the total number of digits to max_digits and, optionally (when max_decimals is non-zero), limits the number of fractional digits to max_decimals - 1. This latter supports 'f' formats. Returns the number of digits stored, which is <= max_digits. Less if the number can be accurately represented in fewer digits.
  2. __ftoa_engine
    __ftoa_engine(float x, struct ftoa *ftoa, uint8_t max_digits, uint8_t max_decimals);
    The same as __dtoa_engine, except for floats.
  3. __atod_engine
    __atod_engine(uint64_t m10, int e10);
    To avoid needing to handle stdio inside the conversion function, __atod_engine receives fully parsed values, the base-10 significand (m10) and exponent (e10). The value to convert is m10 * pow(10, e10).
  4. __atof_engine
    __atof_engine(uint32_t m10, int e10);
    The same as __atod_engine, except for floats.
With these, it can do printf, scanf, ecvt, fcvt, gcvt, strtod, strtof and atof. Porting Ry to Picolibc The existing Ry float-to-string code always generates the number of digits necessary for accurate output. I had to hack it up to generate correctly rounded shorter output when max_digits or max_decimals were smaller. I'm not sure I managed to do that correctly, but at least it appears to be passing all of the test cases I have. In normal operation, Ry iteratively removes digits from the answer that aren't necessary to disambiguate with neighboring values. What I changed was to keep removing digits using that method until the answer had few enough digits to fit in the desired length. There's some tricky rounding code that adjusts the final result and I had to bypass that if I'd removed extra digits. That was about the only change necessary to the core algorithm. I also trimmed the code to only include the general case and not the performance improvements, then wrapped it with code to provide the _engine interface. On the string-to-float side, most of what I needed to do was remove the string parsing bits at the start of the function and switch from performance-optimized to space-optimized versions of a couple of internal routines. Correctness Results Because these new functions are now 'exact', I was able to adjust the picolibc tests to compare all of the bits for string/float conversion instead of having to permit a bit of slop in the answers. With those changes, the picolibc test suite passes, which offers some assurance that things aren't completely broken. Size Results Snek uses the 32-bit float versions of the conversion routines, and for that, the size difference is:
   text    data     bss     dec     hex filename
  59068      44   37968   97080   17b38 snek-qemu-riscv-orig.elf
  59430      44   37968   97442   17ca2 snek-qemu-riscv-ryu.elf
362 bytes added to gain accurate printf/strtof results seems like a good trade-off in this case. Performance I haven't measured performance at all, but I suspect that it won't be nearly as problematic on most platforms as the source code makes it appear. And that's because Ry is entirely integer arithmetic with no floating point at all. This avoids using the soft fp code for platforms without hardware float support. Pointers to the Code I haven't merged this to picolibc master yet, it's on the ryu branch: Review, especially of the hack above to return short results, would be greatly appreciated! Thanks again to Ulf Adams for creating this code and to Sreepathi Pai for sending me a note about it!

29 May 2020

Keith Packard: picolibc-string-float

Float/String Conversion in Picolibc Exact conversion between strings and floats seems like a fairly straightforward problem. There are two related problems:
  1. String to Float conversion. In this case, the goal is to construct the floating point number which most closely approximates the number represented by the string.
  2. Float to String conversion. Here, the goal is to generate the shortest string which, when fed back into the String to Float conversion code, exactly reproduces the original value.
When linked together, getting from float to string and back to float is a round trip , and an exact pair of algorithms does this for every floating point value. Solutions for both directions were published in the proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, with the string-to-float version written by William Clinger and the float-to-string version written by Guy Steele and Jon White. These solutions rely on very high precision integer arithmetic to get every case correct, with float-to-string requiring up to 1050 bits for the 64-bit IEEE floating point format. That's a lot of bits. Newlib Float/String Conversion The original newlib code, written in 1998 by David M. Gay, has arbitrary-precision numeric code for these functions to get exact results. However, it has the disadvantages of performing numerous memory allocations, consuming considerable space for the code, and taking a long time for conversions. The first disadvantage, using malloc during conversion, ended up causing a number of CVEs because the results of malloc were not being checked. That's bad on all platforms, but especially bad for embedded systems where reading and writing through NULL pointers may have unknown effects. Upstream newlib applied a quick fix to check the allocations and call abort. Again, on platforms with an OS, that at least provides a way to shut down the program and let the operating environment figure out what to do next. On tiny embedded systems, there may not be any way to log an error message or even restart the system. Ok, so we want to get rid of the calls to abort and have the error reported back through the API call which caused the problem. That's got two issues, one mere technical work, and another mere re-interpretation of specifications. Let's review the specification issue. The libc APIs involved here are: Input: Output: Scanf and printf are both documented to set errno to ENOMEM when they run out of memory, but none of the other functions takes that possibility into account. So we'll make some stuff up and hope it works out: Now, looking back at the technical challenge. That's a simple matter of inserting checks at each allocation, or call which may result in an allocation, and reporting failure back up the call stack, unwinding any intermediate state to avoid leaking memory. Testing Every Possible Allocation Failure There are a lot of allocation calls in the newlib code. And the call stack can get pretty deep. A simple visual inspection of the code didn't seem sufficient to me to validate the allocation checking code. So I instrumented malloc, making it count the number of allocations and fail at a specific one. Now I can count the total number of allocations done over the entire test suite run for each API involved and then run the test suite that many times, failing each allocation in turn and checking to make sure we recover correctly. By that, I mean: There were about 60000 allocations to track, so I ran the test suite that many times, which (with the added malloc tracing enabled) took about 12 hours. Bits Pushed to the Repository With the testing complete, I'm reasonably confident that the code is now working, and that these CVEs are more completely squashed. If someone is interested in back-porting the newlib fixes upstream to newlib, that would be awesome. It's not completely trivial as this part of picolibc has diverged a bit due to the elimination of the reent structure. Picolibc's Tinystdio Float/String Conversion Picolibc contains a complete replacement for stdio which was originally adopted from avr libc. That's a stdio implementation designed to run on 8-bit Atmel processors and focuses on very limited memory use and small code size. It does this while maintaining surprisingly complete support for C99 printf and scanf support. However, it also does this without any arbitrary precision arithmetic, which means it doesn't get the right answer all of the time. For most embedded systems, this is usually a good trade off -- floating point input and output are likely to be largely used for diagnostics and debugging, so mostly correct answers are probably sufficient. The original avr-libc code only supports 32-bit floats, as that's all the ABI on those processors has. I extended that to 64-, 80- and 128- bit floats to cover double and long double on x86 and RISC-V processors. Then I spent a bunch of time adjusting the code to get it to more accurately support C99 standards. Tinystdio also had strtod support, but it was missing ecvt, fcvt and gcvt. For those, picolibc was just falling back to the old newlib code, which introduced all of the memory allocation issues we've just read about. Fixing that so that tinystdio was self-contained and did ecvt, fcvt and gcvt internally required writing those functions in terms of the float-to-string primitives already provided in tinystdio to support printf. gcvt is most easily supported by just calling sprintf. Once complete, the default picolibc build, using tinystdio, no longer does any memory allocation for float/string conversions.

1 July 2017

Keith Packard: DRM-lease-4

DRM leasing part three (vblank) The last couple of weeks have been consumed by getting frame sequence numbers and events handled within the leasing environment (and Vulkan) correctly. Vulkan EXT_display_control extension This little extension provides the bits necessary for applications to track the display of frames to the user.
vkGetSwapchainCounterEXT(VkDevice           device,
             VkSwapchainKHR         swapchain,
             VkSurfaceCounterFlagBitsEXT    counter,
             uint64_t           *pCounterValue);
This function just retrieves the current frame count from the display associated with swapchain.
vkRegisterDisplayEventEXT(VkDevice          device,
              VkDisplayKHR          display,
              const VkDisplayEventInfoEXT   *pDisplayEventInfo,
              const VkAllocationCallbacks   *pAllocator,
              VkFence           *pFence);
This function creates a fence that will be signaled when the specified event happens. Right now, the only event supported is when the first pixel of the next display refresh cycle leaves the display engine for the display. If you want something fancier (like two frames from now), you get to do that on your own using this basic function. drmWaitVBlank drmWaitVBlank is the existing interface for all things sequence related and has three modes (always nice to have one function do three things, I think). It can:
  1. Query the current vblank number
  2. Block until a specified vblank number
  3. Queue an event to be delivered at a specific vblank number
This interface has a few issues: For leases, figuring out the index into the kernel list of crtcs is pretty tricky -- our lease has a subset of those crtcs, so we can't actually compute the global crtc index. drmCrtcGetSequence
int drmCrtcGetSequence(int fd, uint32_t crtcId,
               uint64_t *sequence, uint64_t *ns);
Here's a simple new function hand it a crtc ID and it provides the current frame sequence number and the time when that frame started (in nanoseconds). drmCrtcQueueSequence
int drmCrtcQueueSequence(int fd, uint32_t crtcId,
                 uint32_t flags, uint64_t sequence,
             uint64_t user_data);
struct drm_event_crtc_sequence  
    struct drm_event    base;
    __u64           user_data;
    __u64           time_ns;
    __u64           sequence;
This will cause a CRTC_SEQUENCE event to be delivered at the start of the specified frame sequence. That event will include the frame when the event was actually generated (in case it's late), along with the time (in nanoseconds) when that frame was started. The event also includes a 64-bit user_data value, which can be used to hold a pointer to whatever data the application wants to see in the event handler. The 'flags' argument contains a combination of:
#define DRM_CRTC_SEQUENCE_RELATIVE      0x00000001  /* sequence is relative to current */
#define DRM_CRTC_SEQUENCE_NEXT_ON_MISS      0x00000002  /* Use next sequence if we've missed */
#define DRM_CRTC_SEQUENCE_FIRST_PIXEL_OUT   0x00000004  /* Signal when first pixel is displayed */
These are similar to the values provided for the drmWaitVBlank function, except I've added a selector for when the event should be delivered to align with potential future additions to Vulkan. Right now, the only time you can ask for is first-pixel-out, which says that the event should correspond to the display of the first pixel on the screen. DRM events Vulkan fences With the kernel able to deliver a suitable event at the next frame, all the Vulkan code needed was a to create a fence and hook it up to such an event. The existing fence code only deals with rendering fences, so I added window system interface (WSI) fencing infrastructure and extended the radv driver to be able to handle both kinds of fences within that code. Multiple waiting threads I've now got three places which can be waiting for a DRM event to appear:
  1. Frame sequence fences.
  2. Wait for an idle image. Necessary when you want an image to draw the next frame to.
  3. Wait for the previous flip to complete. The kernel can only queue one flip at a time, so we have to make sure the previous flip is complete before queuing another one.
Vulkan allows these to be run from separate threads, so I needed to deal with multiple threads waiting for a specific DRM event at the same time. XCB has the same problem and goes to great lengths to manage this with a set of locking and signaling primitives so that only one thread is ever doing poll or read from the socket at time. If another thread wants to read at the same time, it will block on a condition variable which is then signaled by the original reader thread at the appropriate time. It's all very complicated, and it didn't work reliably for a number of years. I decided to punt and just create a separate thread for processing all DRM events. It blocks using poll(2) until some events are readable, processes those and then broadcasts to a condition variable to notify any waiting threads that 'something' has happened. Each waiting thread simply checks for the desired condition and if not satisfied, blocks on that condition variable. It's all very simple looking, and seems to work just fine. Code Complete, Validation Remains At this point, all of the necessary pieces are in place for the VR application to take advantage of an HMD using only existing Vulkan extensions. Those will be automatically mapped into DRM leases and DRM events as appropriate. The VR compositor application is working pretty well; tests with Dota 2 show occasional jerky behavior in complex scenes, so there's clearly more work to be done somewhere. I need to go write a pile of tests to independently verify that my code is working. I wonder if I'll need to wire up some kind of light sensor so I can actually tell when frames get displayed as it's pretty easy to get consistent-but-wrong answers in this environment. Source Code

30 May 2017

Keith Packard: DRM-lease-3

DRM leasing part three (Vulkan) With the kernel APIs off for review, and the X RandR bits looking like they're in reasonable shape, I finally found some time to sit down and figure out how I wanted to integrate this into Vulkan. Avoiding two DRM file descriptors Given that a DRM lease is represented by a DRM master file descriptor, we want to use that for all of the operations in the driver, including rendering and mode setting. Using the vulkan driver render node and the lease master node together would require passing buffer objects between the kernel contexts using even more file descriptors. The Mesa Vulkan drivers open the device nodes while enumerating devices, not when they are created. This seems a bit early to me, but it makes sure that the devices being enumerated are actually available for use, and not just present in the system. To replace the render node fd with the lease master fd means hooking into the system early enough that the enumeration code can see the lease fd. And that means creating an instance extension as the instance gets created before devices are enumerated. The VK_KEITHP_kms_display instance extension This simple instance extension provides the necessary hooks to get the lease information from the application down into the driver before the DRM node is opened. In the first implementation, I added a function that could be called before the devices were enumerated to save the information in the Vulkan loader. That worked, but required quite a bit of study of the Vulkan loader and its XML description of the full Vulkan API. Mark Young suggested that a simpler plan would be to chain the information into the VkInstanceCreateInfo pNext field; with no new APIs added to Vulkan, there shouldn't be any need to change the Vulkan loader -- the device driver would advertise the new instance extension and the application could find it. That would have worked great, except the Vulkan loader 'helpfully' elides all instance extensions it doesn't know about before returning the list to the application. I'd say this was a bug and should be fixed, but for now, I've gone ahead and added the few necessary definitions to the loader to make it work. In the application, it's a simple matter of searching for this extension, constructing the VkKmsDisplayInfoKEITHP structure, chaining that into the VkInstanceCreateInfo pNext list and passing that in to the vkCreateInstance call.
typedef struct VkKmsDisplayInfoKEITHP  
    VkStructureType         sType;  /* VK_STRUCTURE_TYPE_KMS_DISPLAY_INFO_KEITHP */
    const void*             pNext;
    int                     fd;
    uint32_t                crtc_id;
    uint32_t                *connector_ids;
    int                     connector_count;
    drmModeModeInfoPtr      mode;
As you can see, this includes the master file descriptor along with all of the information necessary to set the desired video mode using the specified resources. The driver just walks the pNext list from the VkInstanceCreateInfo structure looking for any provided VkKmsDisplayInfoKEITHP structure and pulls the data out. To avoid questions about file descriptor lifetimes, the driver dup's the provided fd. The application is expected to close their copy at a suitable time. The VK_KHR_display extension Vulkan already has an API for directly accessing the raw device, including code for exposing video modes and everything. As tempting as it may be to just go do something simpler, there's a lot to be said for using existing APIs. This extension doesn't provide any direct APIs for acquiring display resources, relying on the VK_EXT_acquire_xlib_display extension for that part. And that takes a VkPhysicalDisplay parameter, which is only available after the device is opened, which is why I created the VK_KEITHP_kms_display extension instead of just using the VK_EXT_acquire_xlib_display extension -- we can't increase the capabilities of the render node opened by the driver, and we don't want to keep two file descriptors around. With the information provided by the VK_KEITHP_kms_display extension, we can implement all of the VK_KHR_display extension APIs, including enumerating planes and modes and creating the necessary display surface. Of course, there's only one plane and one mode, so some of the implementation is pretty simplistic. The big piece of work was to create the swap chain structure and associated frame buffers. A working example I've taken the 'cube' example from the Vulkan loader and hacked it up to use XCB to construct a DRM lease, the VK_KEITHP_kms_display extension to pass that lease into the Vulkan driver. The existing support for the VK_KHR_display extension "just worked", which was pretty satisfying. It's a bit of a mess I'm not satisfied with the mesa code at this point; there's a bunch of code in the radeon driver which should be in the vulkan WSI bits, and the vulkan WSI bits should probably not have the KMS interfaces wired in. I'll ask around and see what other Mesa developers think I should do before restructuring it further; I'll probably have to rewrite it all at least one more time before it's ready to upstream. Seeing the code I'll be cleaning the code up a bit before sending it out for review, but it's already visible in my own repositories:

25 April 2017

Keith Packard: TeleMini3

TeleMini V3.0 Dual-deploy altimeter with telemetry now available TeleMini v3.0 is an update to our original TeleMini v1.0 flight computer. It is a miniature (1/2 inch by 1.7 inch) dual-deploy flight computer with data logging and radio telemetry. Small enough to fit comfortably in an 18mm tube, this powerful package does everything you need on a single board: I don't have anything in these images to show just how tiny this board is but the spacing between the screw terminals is 2.54mm (0.1in), and the whole board is only 13mm wide (1/2in). This was a fun board to design. As you might guess from the version number, we made a couple prototypes of a version 2 using the same CC1111 SoC/radio part as version 1 but in the EasyMini form factor (0.8 by 1.5 inches). Feed back from existing users indicated that bigger wasn't better in this case, so we shelved that design. With the availability of the STM32F042 ARM Cortex-M0 part in a 4mm square package, I was able to pack that, the higher power CC1200 radio part, a 512kB memory part and a beeper into the same space as the original TeleMini version 1 board. There is USB on the board, but it's only on some tiny holes, along with the cortex SWD debugging connection. I may make some kind of jig to gain access to that for configuration, data download and reprogramming. For those interested in an even smaller option, you could remove the screw terminals and battery connector and directly wire to the board, and replace the beeper with a shorter version. You could even cut the rear mounting holes off to make the board shorter; there are no components in that part of the board.

1 April 2017

Keith Packard: DRM-lease-2

DRM leasing part deux (kernel side) I've stabilized the kernel lease implementation so that leases now work reliably, and don't require a reboot before running the demo app a second time. Here's whats changed:
  1. Reference counting is hard. I'm creating a new drm_master, which means creating a new file. There were a bunch of reference counting errors in my first pass; I've cleaned those up while also reducing the changes needed to the rest of the DRM code.
  2. Added a 'mask_lease' value to the leases -- this controls whether resources in the lease are hidden from the lessor, allowing the lessor to continue to do operations on the leased resources if it likes.
  3. Hacked the mutex locking assertions to not crash in normal circumstances. I'm now doing: BUG_ON(__mutex_owner(&master->dev->mode_config.idr_mutex) != current); to make sure the mutex is held by the current thread, instead of just making sure some thread holds the mutex. I have this memory of a better way to do this, but now I can't dig it up. Suggestions welcome, of course.
I'm reasonably pleased with the current state of the code, although I want to squash the patches together so that only the final state of the design is represented, rather than the series of hacks present now. Comments on #dri-devel I spent a bit of time on the #dri-devel IRC channel answering questions about the DRM-lease design and the changes above reflect that. One concern was about mode setting from two masters at the same time. Mode setting depends on a number of shared 'hidden' resources; things like memory fifos and the like. If either lessee or lessor wants to change the displayed modes, they may fail due to conflicts over these resources and not be able to recover easily. A related concern was that the current TEST/render/commit mechanism used by compositors may no longer be reliable as another master could change hidden resource usage between the TEST and commit operations. Daniel Vetter suggested allowing the lessor to 'exclude' lessee mode sets during this operation. One solution would be to have the lessor set the mode on the leased resources before ceding control of the objects, and once set, the lessee shouldn't perform additional mode setting operations. This would limit the flexibility in the lessee quite a bit as it wouldn't be able to pop up overlay planes or change resolution. Questions Let's review the questions from my last post, DRM-lease: Remaining Kernel Work The code is running, and appears stable. However, it's not quite done yet. Here's a list of remaining items that I know about:
  1. Changing leases should update sub-leases. When you reduce the resources in one lease, the kernel should walk any sub-leases and clear out resources which the lessor no longer has access to.
  2. Sending events when leases are created/destroyed. When a lease is created, if the mask_lease value is set, then the lessor should get regular events describing the effective change. Similarly, both lessor and lessee should get events when a lease is changed.
  3. Refactoring the patch series to squash intermediate versions of the new code.
Remaining Other Work Outside of the kernel, I'll be adding X support for this operation. Here's my current thinking: With that, I think this phase of the project will be wrapped up and it will be time to move on to actually hooking up a real HMD and hacking the VR code to use this new stuff. Seeing the code For those interested in seeing the state of the code so far, there's kernel, drm and kmscube repositories here:

28 March 2017

Keith Packard: DRM-lease

DRM display resource leasing (kernel side) So, you've got a fine head-mounted display and want to explore the delights of virtual reality. Right now, on Linux, that means getting the window system to cooperate because the window system is the DRM master and holds sole access to all display resources. So, you plug in your device, play with RandR to get it displaying bits from the window system and then carefully configure your VR application to use the whole monitor area and hope that the desktop will actually grant you the boon of page flipping so that you will get reasonable performance and maybe not even experience tearing. Results so far have been mixed, and depend on a lot of pieces working in ways that aren't exactly how they were designed to work. We could just hack up the window system(s) and try to let applications reserve the HMD monitors and somehow removing them from the normal display area so that other applications don't randomly pop up in the middle of the screen. That would probably work, and would take advantage of much of the existing window system infrastructure for setting video modes and performing page flips. However, we've got a pretty spiffy standard API in the kernel for both of those, and getting the window system entirely out of the way seems like something worth trying. I spent a few hours in Hobart chatting with Dave Airlie during LCA and discussed how this might actually work. Goals
  1. Use KMS interfaces directly from the VR application to drive presentation to the HMD.
  2. Make sure the window system clients never see the HMD as a connected monitor.
  3. Maybe let logind (or other service) manage the KMS resources and hand them out to the window system and VR applications.
  1. Don't make KMS resources appear and disappear. It turns out applications get confused when the set of available CRTCs, connectors and encoders changes at runtime.
An Outline for Multiple DRM masters By the end of our meeting in Hobart, Dave had sketched out a fairly simple set of ideas with me. We'd add support in the kernel to create additional DRM masters. Then, we'd make it possible to 'hide' enough state about the various DRM resources so that each DRM master would automagically use disjoint subsets of resources. In particular, we would.
  1. Pretend that connectors were always disconnected
  2. Mask off crtc and encoder bits so that some of them just didn't seem very useful.
  3. Block access to resources controlled by other DRM masters, just in case someone tried to do the wrong thing.
Refinement with Eric over Swedish Pancakes A couple of weeks ago, Eric Anholt and I had breakfast at the original pancake house and chatted a bit about this stuff. He suggested that the right interface for controlling these new DRM masters was through the existing DRM master interface, and that we could add new ioctls that the current DRM master could invoke to create and manage them. Leasing as a Model I spent some time just thinking about how this might work and came up with a pretty simple metaphor for these new DRM masters. The original DRM master on each VT "owns" the output resources and has final say over their use. However, a DRM master can create another DRM master and "lease" resources it has control over to the new DRM master. Once leased, resources cannot be controlled by the owner unless the owner cancels the lease, or the new DRM master is closed. Here's some terminology:
DRM Master
Any DRM file which can perform mode setting.
The original DRM Master, created by opening /dev/dri/card*
A DRM master which has leased out resources to one or more other DRM masters.
A DRM master which controls resources leased from another DRM master. Each Lessee leases resources from a single Lessor.
Lessee ID
An integer which uniquely identifies a lessee within the tree of DRM masters descending from a single Owner.
The contract between the Lessor and Lessee which identifies which resources which may be controlled by the Lessee. All of the resources must be owned by or leased to the Lessor.
With Eric's input, the interface to create a lease was pretty simple to write down:
int drmModeCreateLease(int fd,
               const uint32_t *objects,
               int num_objects,
               int flags,
               uint32_t *lessee_id);
Given an FD to a DRM master, and a list of objects to lease, a new DRM master FD is returned that holds a lease to those objects. 'flags' can be any combination of O_CLOEXEC and O_NONBLOCK for the newly minted file descriptor. Of course, the owner might want to take some resources back, or even grant new resources to the lessee. So, I added an interface that rewrites the terms of the lease with a new set of objects:
int drmModeChangeLease(int fd,
               uint32_t lessee_id,
               const uint32_t *objects,
               int num_objects);
Note that nothing here makes any promises about the state of the objects across changes in the lease status; the lessor and lessee are expected to perform whatever modesetting is required for the objects to be useful to them. Window System Integration There are two ways to integrate DRM leases into the window system environment:
  1. Have logind "lease" most resources to the window system. When a HMD is connected, it would lease out suitable resources to the VR environment.
  2. Have the window system "own" all of the resources and then add window system interfaces to create new DRM masters leased from its DRM master.
I'll probably go ahead and do 2. in X and see what that looks like. One trick with any of this will be to hide HMDs from any RandR clients listening in on the window system. You probably don't want the window system to tell the desktop that a new monitor has been connected, have it start reconfiguring things, and then have your VR application create a new DRM master, making the HMD appear to have disconnected to the window system and have that go reconfigure things all over again. I'm not sure how this might work, but perhaps having the VR application register something like a passive grab on hot plug events might make sense? Essentially, you want it to hear about monitor connect events, go look to see if the new monitor is one it wants, and if not, release that to other X clients for their use. This can be done in stages, with the ability to create a new DRM master over X done first, and then cleaning up the hotplug stuff later on. Current Status I hacked up the kernel to support the drmModeCreateLease API, and then hacked up kmscube to run two threads with different sets of KMS resources. That ran for nearly a minute before crashing and requiring a reboot. I think there may be some locking issues with page flips from two threads to the same device. I think I also made the wrong decision about how to handle lessors closing down. I tried to let the lessors get deleted and then 'orphan' the lessees. I've rewritten that so that lessees hold a reference on their lessor, keeping the lessor in place until the lessee shuts down. I've also written the kernel parts of the drmModeChangeLease support. Questions

14 March 2017

Keith Packard: Valve

Consulting for Valve in my spare time Valve Software has asked me to help work on a couple of Linux graphics issues, so I'll be doing a bit of consulting for them in my spare time. It should be an interesting diversion from my day job working for Hewlett Packard Enterprise on Memory Driven Computing and other fun things. First thing on my plate is helping support head-mounted displays better by getting the window system out of the way. I spent some time talking with Dave Airlie and Eric Anholt about how this might work and have started on the kernel side of that. A brief synopsis is that we'll split off some of the output resources from the window system and hand them to the HMD compositor to perform mode setting and page flips. After that, I'll be working out how to improve frame timing reporting back to games from a composited desktop under X. Right now, a game running on X with a compositing manager can't tell when each frame was shown, nor accurately predict when a new frame will be shown. This makes smooth animation rather difficult.

31 January 2017

Keith Packard: FOSDEM-2017

FOSDEM 2017 -- The Machine and ChaosKeys Yay! I get to go to FOSDEM this year. I'll be speaking about Free Software for The Machine on Sunday afternoon at 14:00 in K.1.105 (La Fontaine). I'll also be bringing along a number of ChaosKeys and will have them available for either $40 US or 40. You can either bring cash or pre-pay at the Altus Metrum shop. Hope to see you there.

28 January 2017

Bits from Debian: Debian at FOSDEM 2017

On February 4th and 5th, Debian will be attending FOSDEM 2017 in Brussels, Belgium; a yearly gratis event (no registration needed) run by volunteers from the Open Source and Free Software community. It's free, and it's big: more than 600 speakers, over 600 events, in 29 rooms. This year more than 45 current or past Debian contributors will speak at FOSDEM: Alexandre Viau, Bradley M. Kuhn, Daniel Pocock, Guus Sliepen, Johan Van de Wauw, John Sullivan, Josh Triplett, Julien Danjou, Keith Packard, Martin Pitt, Peter Van Eynde, Richard Hartmann, Sebastian Dr ge, Stefano Zacchiroli and Wouter Verhelst, among others. Similar to previous years, the event will be hosted at Universit libre de Bruxelles. Debian contributors and enthusiasts will be taking shifts at the Debian stand with gadgets, T-Shirts and swag. You can find us at stand number 4 in building K, 1 B; CoreOS Linux and PostgreSQL will be our neighbours. See for more details. We are looking forward to meeting you all!

8 January 2017

Keith Packard: embedded-arm-libc

Finding a Libc for tiny embedded ARM systems You'd think this problem would have been solved a long time ago. All I wanted was a C library to use in small embedded systems -- those with a few kB of flash and even fewer kB of RAM. Small system requirements A small embedded system has a different balance of needs: Available small C libraries I've looked at: Current AltOS C library We've been using pdclib for a couple of years. It was easy to get running, but it really doesn't match what we need. In particular, it uses a lot of stack space in the stdio implementation as there's an additional layer of abstraction that isn't necessary. In addition, pdclib doesn't include a math library, so I've had to 'borrow' code from other places where necessary. I've wanted to switch for a while, but there didn't seem to be a great alternative. What's wrong with newlib? The "obvious" embedded C library is newlib. Designed for embedded systems with a nice way to avoid needing a 'real' kernel underneath, newlib has a lot going for it. Most of the functions have a good balance between speed and size, and many of them even offer two implementations depending on what trade-off you need. Plus, the build system 'just works' on multi-lib targets like the family of cortex-m parts. The big problem with newlib is the stdio code. It absolutely requires dynamic memory allocation and the amount of code necessary for 'printf' is larger than the flash space on many of our devices. I was able to get a cortex-m3 application compiled in 41kB of code, and that used a smattering of string/memory functions and printf. How about avr libc? The Atmel world has it pretty good -- avr-libc is small and highly optimized for atmel's 8-bit avr processors. I've used this library with success in a number of projects, although nothing we've ever sold through Altus Metrum. In particular, the stdio implementation is quite nice -- a 'FILE' is effectively a struct containing pointers to putc/getc functions. The library does no buffering at all. And it's tiny -- the printf code lacks a lot of the fancy new stuff, which saves a pile of space. However, much of the places where performance is critical are written in assembly language, making it pretty darn hard to port to another processor. Mixing code together for fun and profit! Today, I decided to try an experiment to see what would happen if I used the avr-libc stdio bits within the newlib environment. There were only three functions written in assembly language, two of them were just stubs while the third was a simple ultoa function with a weird interface. With those coded up in C, I managed to get them wedged into newlib. Figuring out the newlib build system was the only real challenge; it's pretty awful having generated files in the repository and a mix of autoconf 2.64 and 2.68 version dependencies. The result is pretty usable though; my STM 32L discovery board demo application is only 14kB of flash while the original newlib stdio bits needed 42kB and that was still missing all of the 'syscalls', like read, write and sbrk. Here's gitweb pointing at the top of the tiny-stdio tree: gitweb And, of course you can check out the whole thing
git clone git://
'master' remains a plain upstream tree, although I do have a fix on that branch. The new code is all on the tiny-stdio branch. I'll post a note on the newlib mailing list once I've managed to subscribe and see if there is interest in making this option available in the upstream newlib releases. If so, I'll see what might make sense for the Debian libnewlib-arm-none-eabi packages.

19 November 2016

Keith Packard: AltOS-Lisp-2

Updates to Altos Lisp I wrote a few days ago about a tiny lisp interpreter I wrote for AltOS Really, it's almost "done" now, I just wanted to make a few improvements Incremental Collection I was on a walk on Wednesday when I figured out that I didn't need to do a full collection every time; a partial collection that only scanned the upper portion of memory would often find plenty of free space to keep working for a while. To recap, the heap is in two pieces; the ROM piece and the RAM piece. The ROM piece is generated during the build process and never changes afterwards (hence the name), so the only piece which is collected is the RAM piece. Collection works like:
chunk_low = heap base
new_top = heap base
For all of the heap
    Find the first 64 live objects above chunk_low
    Compact them all to new_top
    Rewrite references in the whole heap for them
    Set new_top above the new locations
    Set chunk_low above the old locations
top = new_top
The trick is to realize that there's really no need to start at the bottom of the heap; you can start anywhere you like and compact stuff, possibly leaving holes below that location in the heap. As the heap tends to have long-lived objects slowly sift down to the beginning, it's useful to compact objects higher than that, skipping the compaction process for the more stable area in memory. Each time the whole heap is scanned, the top location is recorded. After that, incremental collects happen starting at that location, and when that doesn't produce enough free space, a full collect is done. The collector now runs a bunch faster on average now. Binary Searches I stuck some linear searches in a few places in the code, the first was in the collector when looking to see where an object had moved to. As there are 64 entries, the search is reduced from 32 to 6 compares on average. The second place was in the frame objects, which hold the list of atom/value bindings for each lexical scope (including the global scope). These aren't terribly large, but a binary search is still a fine plan. I wanted to write down here the basic pattern I'm using for binary searches these days, which avoids some of the boundary conditions I've managed to generate in the past:
int find (needle)  
    int l = 0;
    int r = count - 1;
    while (l <= r)  
        int m = (l + r) >> 1;
        if (haystack[m] < needle)
            l = m + 1;
            r = m - 1;
    return l;
With this version, the caller can then check to see if there's an exact match, and if not, then the returned value is the location in the array where the value should be inserted. If the needle is known to not be in the haystack, and if the haystack is large enough to accept the new value:
void insert(needle)  
    int l = find(needle);
        (num - l) * sizeof (haystack[0]));
    haystack[l] = needle;
Similarly, if the caller just wants to know if the value is in the array:
bool exists(needle)  
    int l = find(needle);
    return (l < count && haystack[l] == needle);
Call with Current Continuation Because the execution stack is managed on the heap, it's completely trivial to provide the scheme-like call with current continuation, which constructs an object which can be 'called' to transfer control to a saved location:
> (+ "hello " (call/cc (lambda (return) (setq boo return) (return "foo "))) "world")
"hello foo world"
> (boo "bar ")
"hello bar world"
> (boo "yikes ")
"hello yikes world"
One thing I'd done previously is dump the entire state of the interpreter on any error, and that included a full stack trace. I adopted that code for printing of these continuation objects:
    expr:   (call/cc (lambda (return) (set (quote boo) return) (return "foo ")))
    state:  val
    values: (call/cc
    sexprs: ()
    expr:   (+ "hello " (call/cc (lambda (return) (set (quote boo) return) (return "foo "))) "world")
    state:  formal
    values: (+
             "hello "
    sexprs: ((call/cc (lambda (return) (set (quote boo) return) (return "foo ")))
The top stack frame is about to return from the call/cc spot with a value; supply a value to 'boo' and that's where you start. The next frame is in the middle of computing formals for the + s-expression. It's found the + function, and the "hello " string and has yet to get the value from call/cc or the value of the "world" string. Once the call/cc "returns", that value will get moved to the values list and the sexpr list will move forward one spot to compute the "world" value. Implementing this whole mechanism took only a few dozen lines of code as the existing stack contexts were already a continuation in effect. The hardest piece was figuring out that I needed to copy the entire stack each time the continuation was created or executed as it is effectively destroyed in the process of evaluation. I haven't implemented dynamic-wind yet; when I did that for nickle, it was a bit of a pain threading execution through the unwind paths. Re-using Frames I decided to try and re-use frames (those objects which hold atom/value bindings for each lexical scope). It wasn't that hard; the only trick was to mark frames which have been referenced from elsewhere as not-for-reuse and then avoid sticking those in the re-use queue. This reduced allocations even further so that for simple looping or tail-calling code, the allocator may never end up being called. How Big Is It? I've managed to squeeze the interpreter and all of the rest of the AltOS system into 25kB of Cortex-M0 code. That leaves space for the 4kB boot loader and 3kB of flash to save/restore the 3kB heap across resets. Adding builtins to control timers and GPIOs would make this a reasonable software load for an Arduino; offering a rather different programming model for those with a taste for adventure. Modern ARM-based Arduino boards have plenty of flash and ram for this. It might be interesting to get this running on the Arduino Zero; there's no real reason to replace the OS either; porting the lisp interpreter into the bare Arduino environment wouldn't take long.

15 November 2016

Keith Packard: AltOS-Lisp

A Tiny Lisp for AltOS I took a bit of a diversion over the last week or so when I wondered how small a lisp interpreter I could write, and whether I could fit that into one of the processors that AltOS runs on. It turns out, you can write a tiny lisp interpreter that fits in about 25kB of ram with a 3kB heap for dynamic data. I decided to target our ChaosKey boards; they're tiny, and I've got a lot of them. That processor offers 28kB of usable flash space (after the 4kB boot loader) and 6kB of ram with the processor running at a steaming 48MHz. I'm not at all sure this is useful, but I always enjoy doing language implementations, and this one presented some 'interesting' challenges: Iterative Compacting Allocator I'm betting someone has built one of these before, but I couldn't find one, so I wrote my own. The basic strategy is to walk the heap to find a subset of the active objects which are allocated sequentially in memory with only unused storage between them. These objects are then compacted in-place, and then the heap is walked again to update all references to the moved objects. Then, the process is restarted to find another subset and move them. By looking for these subsets starting at the bottom of the heap, and working upwards towards the top, the whole heap can be compacted into a contiguous chunk at the bottom of memory. Allocation involves moving a pointer along at the top of active memory; when it gets to the top of the heap, collect and see if there's space now. As always, the hardest part was to make sure all active memory was tied down. The second hardest part was to make sure that all active pointers were updated after any allocation, in case a collect moved the underlying object. That was just bookkeeping, but did consume much of the development time. One additional trick was to terminate the recursion during heap walking by flagging active cons cell locations in a global bitmap and then walking that separately, iterating until that bitmap is empty. Nested lambdas form another recursion which should probably get the same approach, but I haven't done that yet. An unexpected "benefit" of the tiny heap is that the collector gets called a lot, so any referencing bugs will have a good chance of being uncovered in even a short program execution. ROM-able Lisp Instead of implementing all of the language in C, I wanted to be able to implement various pieces in Lisp itself. Because of the complex nature of the evaluation process, adding things like 'let' or even 'defun' turn out to be dramatically simpler in Lisp. However, I didn't want to consume bunches of precious RAM to hold these basic functions. What I did was to create two heaps, one in ROM and the other in RAM. References are be tagged as to which heap they're in. 16-bit Values Lisp programs use a pile of references. Using a full 32 bits for each one would mean having a lot less effective storage. So, instead, I use an offset from the base of the heap. The top bit of the offset is used to distinguish between the ROM heap and the RAM heap. I needed a place to store type information, so I settled on using the bottom two bits of the references. This allows for four direct type values. One of these values is used to indicate an indirect type, where the type is stored in the first byte of the object. The direct types are:
0Cons cell
114-bit int
With 2 tag bits, the allocator needs to work in 32-bit units as the references couldn't point to individual bytes. Finally, I wanted 0 to be nil, so I add four to the offsets within the heaps. The result is that the ROM and RAM heaps can each cover up to 32k - 4 bytes. Note that ints are not stored in the heap; instead they are immediate values stored in 14 bits, providing a range of -8192 to 8191. One can imagine wanting more range in ints at some point. Heap-based Evaluator A simple lisp implementation uses the fact that eval is re-entrant and do the operation on the C stack:
val eval(val exprs)  
    val vals;
    while (exprs)  
        vals = append(vals, eval(car(exprs)));
        exprs = exprs->cdr;
    return execute (car(vals), cdr(vals));
This makes things really simple and provides for a clean framework for implementing various bits of lisp, including control flow and macros. However, it rapidly consumes all of the available memory for a stack, while also requiring separate bookkeeping for the in-use memory in each frame. I replaced this design with one which keeps the lisp stack on the heap, and then performs eval with a state machine with all state stored in global variables so that the memory manager can reference them directly. Each eval operation is performed in a separate 'stack' context, which holds the entire eval state except for the current value, which lives in a separate global variable and is used to pass values out of one stack frame and into another. When the last stack context is finished, the evaluation terminates and the value is returned to the caller. There are nine states in the state machine, each of which is implemented in a separate function, making the state machine a simple matter of pulling the current state from the top of the stack and invoking the associated function:
while (ao_lisp_stack)  
    if (!(*evals[ao_lisp_stack->state])()   ao_lisp_exception)  
        return AO_LISP_NIL;
return ao_lisp_v;
Because there's no C recursion involved, catching exceptions is a simple matter of one test at this level. Primitives like progn, while, cond and eval all take special magic in the state machine to handle; getting all of that working took several attempts before I found the simple loop shown above. Lexical Scoping The last time I did a lisp interpreter, I implemented dynamic scoping. Atoms were all global and had values associated directly with them. Evaluating a lambda started by saving all of the existing global values for the parameter atoms and then binding the new values. When finished, the previous values would be restored. This is almost correct, but provides surprising results for things like:
> (setq baz 1)
> (def foo (lambda (bar) (+ baz bar)))
> (def bletch (lambda (baz) (foo baz)))
> (bletch 2)
The value that foo gets for 'baz' is 2 instead of 1 under dynamic scoping, which most people find surprising. This time, I was determined to use lexical scoping, and it turned out to be surprisingly easy. The first trick was to separate the atoms from their 'value'; each atom can have a different value in different lexical scopes. So, each lexical scope gets a 'frame' object, those contain the value for each atom defined in that scope. There's a global scope which holds all of the globally defined values (like baz, foo and bletch above). Each frame points to its enclosing scope, so you can search upwards to find the right value. The second trick was to realize that the lexical scope of a lambda is the scope in which the lambda itself is evaluated, and that the evaluation of a lambda expression results in a 'function' object, which contains the lambda and its enclosing scope:
> (def foo (lambda (bar bletch)
       ((lambda (baz)
          (+ baz bar))
> (foo 2 3)
In this case, the inner lambda in foo can 'see' the value of bar from the enclosing lambda. More subtly, even if the inner lambda were executed multiple times, it would see the same baz, and could even change it. This can be used to implement all kinds of craziness, including generators:
> (defun make-inc (add)
  ((lambda (base)
     (lambda ()
     (setq base (+ base add))
> (setq plus2 (make-inc 2))
> (plus2)
> (plus2)
The current implementation of each frame is a simple array of atom/value pairs, with a reference to the parent frame to form the full scope. There are dramatically faster implementations of this same concept, but the goal here was small and simple. A Tiny Allocator Optimization With eval consuming heap space for stacks, frames and argument lists, the interpreter was spending a lot of time in the collector. As a simple optimization, I added some free lists for stack frames and cons cells. Stack frames are never referenced when they're finished, so they can always go on the free list. Cons cells used to construct argument lists for functions are usually free. Builtin functions have a bit which indicates whether they might hold on to a reference to the argument list. Interpreted lambdas can't get the list while nlambdas, lexprs and macros do. Each lambda execution creates a new frame, and while it would be possible to discover if that frame 'escapes' the lambda, I decided to not attempt to cache free ones yet. Save and Restore To make the lisp interpreter more useful in tiny computers, I added the ability to save and restore the entire heap to flash. This requires leaving enough space in the flash to preserve the heap, further constraining the amount of flash available for the application. Code All of this code is in the 'lisp' branch of my AltOS repository: AltOS The lisp interpreter is independent from the rest of AltOS and could be re-purposed for another embedded operating system. It runs fine on ChaosKey hardware, and also on the STM32F042 Nucleo-32 board There's also a test framework which runs on Linux, and is how I developed almost all of the code. That's in the src/test directory in the above repository, and is called 'ao_lisp_test'. Towers of Hanoi Here's an implementation of the classic recursive Towers of Hanoi game; it shows most of the current features of the language.
; Towers of Hanoi
; Copyright   2016 Keith Packard <>
; This program is free software; you can redistribute it and/or modify
; it under the terms of the GNU General Public License as published by
; the Free Software Foundation, either version 2 of the License, or
; (at your option) any later version.
; This program is distributed in the hope that it will be useful, but
; WITHOUT ANY WARRANTY; without even the implied warranty of
; General Public License for more details.
; ANSI control sequences
(defun move-to (col row)
  (patom "\033[" row ";" col "H" nil)
(defun clear ()
  (patom "\033[2J" nil)
(defun display-string (x y str)
  (move-to x y)
  (patom str)
; Here's the pieces to display
(setq stack '("*" "**" "***" "****" "*****" "******" "*******"))
(setq top (+ (length stack) 3))
; Here's all of the stacks of pieces
; This is generated when the program is run
(setq stacks nil)
; Display one stack, clearing any
; space above it
(defun display-stack (x y clear stack)
  (cond ((= 0 clear)
     (cond (stack (progn
            (display-string x y (car stack))
            (display-stack x (1+ y) 0 (cdr stack))
    (t (progn
         (display-string x y "          ")
         (display-stack x (1+ y) (1- clear) stack)
; This should probably be included in the rom image...
(defun length (list)
  (cond (list (1+ (length (cdr list))))
; Position of the top of the stack on the screen
; Shorter stacks start further down the screen
(defun stack-pos (y stack)
  (- y (length stack))
; Display all of the stacks, spaced 20 columns apart
(defun display-stacks (x y stacks)
  (cond (stacks (progn
          (display-stack x 0 (stack-pos y (car stacks)) (car stacks))
          (display-stacks (+ x 20) y (cdr stacks)))
; Display all of the stacks, then move the cursor
; out of the way and flush the output
(defun display ()
  (display-stacks 0 top stacks)
  (move-to 1 21)
; Reset stacks to the starting state, with
; all of the pieces in the first stack and the
; other two empty
(defun reset-stacks ()
  (setq stacks (list stack nil nil))
  (length stack)
; more functions which could usefully
; be in the rom image
(defun min (a b)
  (cond ((< a b) a)
(defun nth (list n)
  (cond ((= n 0) (car list))
    ((nth (cdr list) (1- n)))
; Replace a stack in the list of stacks
; with a new value
(defun replace (list pos member)
  (cond ((= pos 0) (cons member (cdr list)))
    ((cons (car list) (replace (cdr list) (1- pos) member)))
; Move a piece from the top of one stack
; to the top of another
(defun move-piece (from to)
  (let ((from-stack (nth stacks from))
    (to-stack (nth stacks to))
    (piece (car from-stack)))
    (setq from-stack (cdr from-stack))
    (setq to-stack (cons piece to-stack))
    (setq stacks (replace stacks from from-stack))
    (setq stacks (replace stacks to to-stack))
    (delay 100)
; The implementation of the game
(defun _hanoi (n from to use)
  (cond ((= 1 n)
      (move-piece from to)
      (_hanoi (1- n) from use to)
      (_hanoi 1 from to use)
      (_hanoi (1- n) use to from)
; A pretty interface which
; resets the state of the game,
; clears the screen and runs
; the program
(defun hanoi ()
  (setq len (reset-stacks))
  (_hanoi len 0 1 2)

12 September 2016

Keith Packard: hopkins

Hopkins Trailer Brake Controller in Subaru Outback My minivan transmission gave up the ghost last year, so I bought a Subaru outback to pull my t@b travel trailer. There isn't a huge amount of space under the dash, so I didn't want to mount a trailer brake controller in the 'usual' spot, right above my right knee. Instead, I bought a Hopkins InSIGHT brake controller, 47297. That comes in three separate pieces which allows for very flexible mounting options. I stuck the 'main' box way up under the dash on the left side of the car. There was a nice flat spot with plenty of space that was facing the right direction: The next trick was to mount the display and control boxes around the storage compartment in the center console: Routing the cables from the controls over to the main unit took a piece of 14ga solid copper wire to use as a fishing line. The display wire was routed above the compartment lid, the control wire was routed below the lid. I'm not entirely happy with the wire routing; I may drill some small holes and then cut the wires to feed them through.

16 August 2016

Keith Packard: udevwrap

Wrapping libudev using LD_PRELOAD Peter Hutterer and I were chasing down an X server bug which was exposed when running the libinput test suite against the X server with a separate thread for input. This was crashing deep inside libudev, which led us to suspect that libudev was getting run from multiple threads at the same time. I figured I'd be able to tell by wrapping all of the libudev calls from the server and checking to make sure we weren't ever calling it from both threads at the same time. My first attempt was a simple set of cpp macros, but that failed when I discovered that libwacom was calling libgudev, which was calling libudev. Instead of recompiling the world with my magic macros, I created a new library which exposes all of the (public) symbols in libudev. Each of these functions does a bit of checking and then simply calls down to the 'real' function. Finding the real symbols Here's the snippet which finds the real symbols:
static void *udev_symbol(const char *symbol)
    static void *libudev;
    static pthread_mutex_t  find_lock = PTHREAD_MUTEX_INITIALIZER;
    void *sym;
    if (!libudev)  
        libudev = dlopen("", RTLD_LOCAL   RTLD_NOW);
    sym = dlsym(libudev, symbol);
    return sym;
Yeah, the libudev version is hard-coded into the source; I didn't want to accidentally load the wrong one. This could probably be improved... Checking for re-entrancy As mentioned above, we suspected that the bug was caused when libudev got called from two threads at the same time. So, our checks are pretty simple; we just count the number of calls into any udev function (to handle udev calling itself). If there are other calls in process, we make sure the thread ID for those is the same as the current thread.
static void udev_enter(const char *func)  
    assert (udev_running == 0   udev_thread == pthread_self());
    udev_thread = pthread_self();
    udev_func[udev_running] = func;
static void udev_exit(void)  
    if (udev_running == 0)
    udev_thread = 0;
    udev_func[udev_running] = 0;
Wrapping functions Now, the ugly part -- libudev exposes 93 different functions, with a wide variety of parameters and return types. I constructed a hacky macro, calls for which could be constructed pretty easily from the prototypes found in libudev.h, and which would construct our stub function:
#define make_func(type, name, formals, actuals)         \
    type name formals                       \
    type ret;                       \
    static void *f;                     \
    if (!f)                         \
        f = udev_symbol(__func__);              \
    udev_enter(__func__);                   \
    ret = ((typeof (&name)) f) actuals;         \
    udev_exit();                        \
    return ret;                     \
There are 93 invocations of this macro (or a variant for void functions) which look much like:
make_func(struct udev *,
      (struct udev *udev),
Using udevwrap To use udevwrap, simply stick the filename of the .so in LD_PRELOAD and run your program normally:
# LD_PRELOAD=/usr/local/lib/ Xorg 
Source code I stuck udevwrap in my git repository:;a=summary You can clone it using
$ git git://

3 August 2016

Keith Packard: chaoskey

ChaosKey v1.0 Released USB Attached True Random Number Generator ChaosKey, our random number generator that attaches via USB, is now available for sale from the altusmetrum store. We talked about this device at Debconf 16 last month Support for this device is included in Linux starting with version 4.1. Plug ChaosKey into your system and the driver will automatically add entropy into the kernel pool, providing a constant supply of true random numbers to help keep the system secure. ChaosKey is free hardware running free software, built with free software on a free operating system.

3 June 2016

Gunnar Wolf: Stop it with those short PGP key IDs!

Debian is quite probably the project that most uses a OpenPGP implementation (that is, GnuPG, or gpg) for many of its internal operations, and that places most trust in it. PGP is also very widely used, of course, in many other projects and between individuals. It is regarded as a secure way to do all sorts of crypto (mainly, encrypting/decrypting private stuff, signing public stuff, certifying other people's identities). PGP's lineage traces back to Phil Zimmerman's program, first published in 1991 By far, not a newcomer PGP is secure, as it was 25 years ago. However, some uses of it might not be so. We went through several migrations related to algorithmic weaknesses (i.e. v3 keys using MD5; SHA1 is strongly discouraged, although not yet completely broken, and it should be avoided as well) or to computational complexity (as the migration away from keys smaller than 2048 bits, strongly prefering 4096 bits). But some vulnerabilities are human usage (that is, configuration-) related. Today, Enrico Zini gave us a heads-up in the #debian-keyring IRC channel, and started a thread in the debian-private mailing list; I understand the mail to a private list was partly meant to get our collective attention, and to allow for potentially security-relevant information to be shared. I won't go into details about what is, is not, should be or should not be private, but I'll post here only what's public information already. What are short and long key IDs? I'll start by quoting Enrico's mail:
there are currently at least 3 ways to refer to a gpg key: short key ID (last 8 hex digits of fingerprint), long key ID (last 16 hex digits) and full fingerprint. The short key ID used to be popular, and since 5 years it is known that it is computationally easy to generate a gnupg key with an arbitrary short key id. A mitigation to this is using "keyid-format long" in gpg.conf, and a better thing to do, especially in scripts, is to use the full fingerprint to refer to a key, or just ship the public key for verification and skip the key servers. Note that in case of keyid collision, gpg will download and import all the matching keys, and will use all the matching keys for verifying signatures.
So... What is this about? We humans are quite bad at recognizing and remembering randomly-generated strings with no inherent patterns in them. Every GPG key can be uniquely identified by its fingerprint, a 128-bit string, usually encoded as ten blocks of four hexadecimal characters (this allows for 160 bits; I guess there's space for a checksum in it). That is, my (full) key's signature is:
AB41 C1C6 8AFD 668C A045  EBF8 673A 03E4 C1DB 921F
However, it's quite hard to recognize such a long string, let alone memorize it! So, we often do what humans do: Given that strong cryptography implies a homogenous probability distribution, people compromised on using just a portion of the key the last portion. The short key ID. Mine is then the last two blocks (shown in boldface): C1DB921F. We can also use what's known as the long key ID, that's twice as long: 64 bits. However, while I can speak my short key ID on a single breath (and maybe even expect you to remember and note it down), try doing so with the long one (shown in italics above): 673A03E4C1DB921F. Nah. Too much for our little, analog brains. This short and almost-rememberable number has then 32 bits of entropy I have less than one in 4,000,000,000 chance of generating a new key with this same short key ID. Besides, key generation is a CPU-intensive operation, so it's quite unlikely we will have a collision, right? Well, wrong. Previous successful attacks on short key IDs Already five years ago, Asheesh Laroia migrated his 1024D key to a 4096R. And, as he describes in his always-entertaining fashion, he made his computer sweat until he was able to create a new key for which the short key ID collided with the old one. It might not seem like a big deal, as he did this non-maliciously, but this easily should have spelt game over for the usage of short key IDs. After all, being able to generate a collision is usually the end for cryptographic systems. Asheesh specifically mentioned in his posting how this could be abused. But we didn't listen. Short key IDs are just too convenient! Besides, they allow us to have fun, can be a means of expression! I know of at least two keys that would qualify as vanity: Obey Arthur Liu's 0x29C0FFEE (created in 2009) and Keith Packard's 0x00000011 (created in 2012). Then we got the Evil 32 project. They developed Scallion, started (AFAICT) in 2012. Scallion automates the search for a 32-bit collision using GPUs; they claim that it takes only four seconds to find a collision. So, they went through the strong set of the public PGP Web of Trust, and created a (32-bit-)colliding key for each of the existing keys. And what happened now? What happened today? We still don't really know, but it seems we found a first potentially malicious collision that is, the first "nonacademic" case. Enrico found two keys sharing the 9F6C6333 short ID, apparently belonging to the same person (as would be the case of Asheesh, mentioned above). After contacting Gustavo, though, he does not know about the second That is, it can be clearly regarded as an impersonation attempt. Besides, what gave away this attempt are the signatures it has: Both keys are signed by what appears to be the same three keys: B29B232A, F2C850CA and 789038F2. Those three keys are not (yet?) uploaded to the keyservers, though... But we can expect them to appear at any point in the future. We don't know who is behind this, or what his purpose is. We just know this looks very evil. Now, don't panic: Gustavo's key is safe. Same for his certifiers, Marga, Agust n and Maxy. It's just a 32-bit collision. So, in principle, the only parties that could be cheated to trust the attacker are humans, right? Nope. Enrico tested on the PGP pathfinder & key statistics service, a keyserver that finds trust paths between any two arbitrary keys in the strong set. Surprise: The pathfinder works on the short key IDs, even when supplied full fingerprints. So, it turns out I have three faked trust paths into our impostor. What next? There are several things this should urge us to do. And there are surely many other important recommendations. But this is a good set of points to start with. [update] I was pointed at Daniel Kahn Gillmor's 2013 blog post, OpenPGP Key IDs are not useful. Daniel argues, in short, that cutting a fingerprint in order to get a (32- or 64-bit) short key ID is the worst of all worlds, and we should rather target either always showing full fingerprints, or not showing it at all (and leaving all the crypto-checking bits to be done by the software, as comparing 160-bit strings is not natural for us humans). [update] This post was picked up by A very interesting discussion continues in their comments.