Search Results: "Michael Stapelberg"

13 April 2024

Paul Tagliamonte: Domo Arigato, Mr. debugfs

Years ago, at what I think I remember was DebConf 15, I hacked for a while on debhelper to write build-ids to Debian binary control files, so that the build-id (more specifically, the ELF note .note.gnu.build-id) wound up in the Debian apt archive metadata. I've always thought this was super cool, and seeing as how Michael Stapelberg blogged some great pointers around the ecosystem, including the fancy new debuginfod service and the find-dbgsym-packages helper, which uses these same headers, I don't think I'm the only one. At work I've been using a lot of Rust, specifically async Rust using tokio. To try and work on my style, and to dig deeper into the how and why of the decisions made in these frameworks, I've decided to hack up a project that I've wanted to do ever since 2015: write a debug filesystem. Let's get to it.

Back to the Future Time to admit something. I really love Plan 9. It's just so good. So many ideas from Plan 9 are just so prescient, and everything just feels right. Not just right like "feels good"; right like "correct". The bit that I've always liked the most is 9p, the network protocol for serving a filesystem over a network. This leads to all sorts of fun programs, like the Plan 9 ftp client being a 9p server: you mount the ftp server and access files like any other files. It's kinda like if fuse were more fully a part of how the operating system worked, but fuse is all running client-side. With 9p there's a single client, and different servers that you can connect to, which may be backed by a hard drive, remote resources over something like SFTP, FTP or HTTP, or even be purely synthetic. The interesting (maybe sad?) part here is that 9p wound up outliving Plan 9 in terms of adoption: 9p is in all sorts of places folks don't usually expect. For instance, the Windows Subsystem for Linux uses the 9p protocol to share files between Windows and Linux. ChromeOS uses it to share files with Crostini, and qemu uses 9p (virtio-9p) to share files between guest and host. If you're noticing a pattern here, you'd be right; for some reason 9p is the go-to protocol to exchange files between hypervisor and guest. Why? I have no idea, except maybe because it is designed well, simple to implement, and it's a lot easier to validate the data being shared and the security boundaries. Simplicity has its value. As a result, there's a lot of lingering 9p support kicking around. Turns out Linux can even handle mounting 9p filesystems out of the box. This means that I can deploy a filesystem to my LAN or my localhost by running a process on top of a computer that needs nothing special, and mount it over the network on an unmodified machine, unlike fuse, where you'd need client-specific software to run in order to mount the directory. For instance, let's mount a 9p filesystem running on my localhost machine, serving requests on 127.0.0.1:564 (tcp) that goes by the name mountpointname, to /mnt.
$ mount -t 9p \
-o trans=tcp,port=564,version=9p2000.u,aname=mountpointname \
127.0.0.1 \
/mnt
Linux will mount away, and attach to the filesystem as the root user, and by default, attach to that mountpoint again for each local user that attempts to use it. Nifty, right? I think so. The server is able to keep track of per-user access and authorization along with the host OS.

WHEREIN I STYX WITH IT Since I wanted to push myself a bit more with Rust and tokio specifically, I opted to implement the whole stack myself, without third party libraries on the critical path where I could avoid it. The 9p protocol (sometimes called Styx, the original name for it) is incredibly simple. It's a series of client-to-server requests, each of which receives a server-to-client response. These are, respectively, T messages, which transmit a request to the server and trigger an R message in response (Reply messages). These messages are TLV payloads with a very straightforward structure: so straightforward, in fact, that I was able to implement a working server off nothing more than a handful of man pages. Later on, after the basics worked, I found a more complete spec page that contains more information about the unix-specific variant that I opted to use (9P2000.u rather than 9P2000), due to the level of Linux support for the 9P2000.u variant over the base 9P2000 protocol.
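To make that framing concrete, here's a minimal sketch of decoding the fixed seven-byte prefix every 9p message starts with (size[4] type[1] tag[2], little-endian); the struct and the example bytes are mine for illustration, not arigato's actual types:

/// Sketch only; arigato's real message types differ. Every 9p message starts
/// with size[4] type[1] tag[2]; size is a little-endian u32 counting the whole
/// message (these seven bytes included), and the reply to a T message
/// conventionally uses the T message's type code plus one.
#[derive(Debug)]
struct Header {
    size: u32, // total length of the message in bytes
    ty: u8,    // message type, e.g. 100 = Tversion, 101 = Rversion
    tag: u16,  // client-chosen id used to pair an R message with its T message
}

fn parse_header(buf: &[u8; 7]) -> Header {
    Header {
        size: u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]),
        ty: buf[4],
        tag: u16::from_le_bytes([buf[5], buf[6]]),
    }
}

fn main() {
    // A hand-rolled Rversion header: 19 bytes total for a "9P2000" version
    // string, tag 0xffff (NOTAG, used for the version exchange).
    let wire = [19, 0, 0, 0, 101, 0xff, 0xff];
    println!("{:?}", parse_header(&wire));
}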

MR ROBOTO The backend stack over at zoo is Rust and tokio running i/o for an HTTP and WebRTC server. I figured I'd pick something fairly similar to write my filesystem with, since 9P can be implemented on basically anything with I/O. That means tokio tcp server bits, which construct and use a 9p server, which has an idiomatic Rusty API that partially abstracts the raw R and T messages, but not so much as to cause issues with hiding implementation possibilities. At each abstraction level, there's an escape hatch, allowing someone to implement any of the layers if required. I called this framework arigato, which can be found over on docs.rs and crates.io.
/// Simplified version of the arigato File trait; this isn't actually
/// the same trait; there's some small cosmetic differences. The
/// actual trait can be found at:
///
/// https://docs.rs/arigato/latest/arigato/server/trait.File.html
trait File {
    /// OpenFile is the type returned by this File via an Open call.
    type OpenFile: OpenFile;

    /// Return the 9p Qid for this file. A file is the same if the Qid is
    /// the same. A Qid contains information about the mode of the file,
    /// version of the file, and a unique 64 bit identifier.
    fn qid(&self) -> Qid;

    /// Construct the 9p Stat struct with metadata about a file.
    async fn stat(&self) -> FileResult<Stat>;

    /// Attempt to update the file metadata.
    async fn wstat(&mut self, s: &Stat) -> FileResult<()>;

    /// Traverse the filesystem tree.
    async fn walk(&self, path: &[&str]) -> FileResult<(Option<Self>, Vec<Self>)>;

    /// Request that a file's reference be removed from the file tree.
    async fn unlink(&mut self) -> FileResult<()>;

    /// Create a file at a specific location in the file tree.
    async fn create(
        &mut self,
        name: &str,
        perm: u16,
        ty: FileType,
        mode: OpenMode,
        extension: &str,
    ) -> FileResult<Self>;

    /// Open the File, returning a handle to the open file, which handles
    /// file i/o. This is split into a second type since it is genuinely
    /// unrelated -- and the fact that a file is Open or Closed can be
    /// handled by the `arigato` server for us.
    async fn open(&mut self, mode: OpenMode) -> FileResult<Self::OpenFile>;
}
/// Simplified version of the arigato OpenFile trait; this isn't actually
/// the same trait; there's some small cosmetic differences. The
/// actual trait can be found at:
///
/// https://docs.rs/arigato/latest/arigato/server/trait.OpenFile.html
trait OpenFile {
    /// iounit to report for this file. The iounit reported is used for Read
    /// or Write operations to signal, if non-zero, the maximum size that is
    /// guaranteed to be transferred atomically.
    fn iounit(&self) -> u32;

    /// Read some number of bytes up to `buf.len()` from the provided
    /// `offset` of the underlying file. The number of bytes read is
    /// returned.
    async fn read_at(
        &mut self,
        buf: &mut [u8],
        offset: u64,
    ) -> FileResult<u32>;

    /// Write some number of bytes up to `buf.len()` from the provided
    /// `offset` of the underlying file. The number of bytes written
    /// is returned.
    async fn write_at(
        &mut self,
        buf: &mut [u8],
        offset: u64,
    ) -> FileResult<u32>;
}

Thanks, decade-ago paultag! Let's do it! Let's use arigato to implement a 9p filesystem we'll call debugfs that will serve all the debug files shipped according to the Packages metadata from the apt archive. We'll fetch the Packages file and construct a filesystem based on the reported Build-Id entries. For those who don't know much about how an apt repo works, here's the 2-second crash course on what we're doing. The first step is to fetch the Packages file, which is specific to a binary architecture (such as amd64, arm64 or riscv64). That architecture is specific to a component (such as main, contrib or non-free). That component is specific to a suite, such as stable, unstable or any of its aliases (bullseye, bookworm, etc). Let's take a look at the Packages.xz file for the unstable-debug suite, main component, for all amd64 binaries.
$ curl \
https://deb.debian.org/debian-debug/dists/unstable-debug/main/binary-amd64/Packages.xz \
  | unxz
This will return the Debian-style rfc2822-like headers, which are an export of the metadata contained inside each .deb file, which apt (or other tools that can use the apt repo format) uses to fetch information about debs. Let's take a look at the debug headers for the netlabel-tools package in unstable, which is a package named netlabel-tools-dbgsym in unstable-debug.
Package: netlabel-tools-dbgsym
Source: netlabel-tools (0.30.0-1)
Version: 0.30.0-1+b1
Installed-Size: 79
Maintainer: Paul Tagliamonte <paultag@debian.org>
Architecture: amd64
Depends: netlabel-tools (= 0.30.0-1+b1)
Description: debug symbols for netlabel-tools
Auto-Built-Package: debug-symbols
Build-Ids: e59f81f6573dadd5d95a6e4474d9388ab2777e2a
Description-md5: a0e587a0cf730c88a4010f78562e6db7
Section: debug
Priority: optional
Filename: pool/main/n/netlabel-tools/netlabel-tools-dbgsym_0.30.0-1+b1_amd64.deb
Size: 62776
SHA256: 0e9bdb087617f0350995a84fb9aa84541bc4df45c6cd717f2157aa83711d0c60
So here, we can parse the package headers in the Packages.xz file and store, for each Build-Id, the Filename where we can fetch the .deb from. Each .deb contains a number of files, but we're only really interested in the files inside the .deb located at or under /usr/lib/debug/.build-id/, which you can find in debugfs under rfc822.rs. It's crude, and very single-purpose, but I'm feeling a bit lazy.
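As a rough illustration (this is not the actual rfc822.rs, which handles more of the format), building that Build-Id-to-Filename index can be as simple as splitting the decompressed Packages file on blank lines and picking out two fields per stanza:

use std::collections::HashMap;

/// Sketch of the index described above: map each Build-Id to the Filename of
/// the .deb that contains its debug file. Continuation lines and other
/// rfc2822 niceties are ignored here, since Build-Ids and Filename fit on
/// single lines.
fn index_packages(packages: &str) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for stanza in packages.split("\n\n") {
        let mut build_ids: Vec<&str> = Vec::new();
        let mut filename: Option<&str> = None;
        for line in stanza.lines() {
            if let Some(v) = line.strip_prefix("Build-Ids:") {
                build_ids = v.split_whitespace().collect();
            } else if let Some(v) = line.strip_prefix("Filename:") {
                filename = Some(v.trim());
            }
        }
        if let Some(f) = filename {
            for id in build_ids {
                index.insert(id.to_string(), f.to_string());
            }
        }
    }
    index
}

fn main() {
    let sample = "Package: netlabel-tools-dbgsym\n\
                  Build-Ids: e59f81f6573dadd5d95a6e4474d9388ab2777e2a\n\
                  Filename: pool/main/n/netlabel-tools/netlabel-tools-dbgsym_0.30.0-1+b1_amd64.deb\n";
    println!("{:?}", index_packages(sample).get("e59f81f6573dadd5d95a6e4474d9388ab2777e2a"));
}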

Who needs dpkg?! For folks who haven't seen it yet, a .deb file is a special type of ar file that contains (usually) three files inside: debian-binary, control.tar.xz and data.tar.xz. The core of an ar file is a fixed-size (60 byte) entry header, followed by the number of data bytes specified in that header.
[8 byte .ar file magic]
[60 byte entry header]
[N bytes of data]
[60 byte entry header]
[N bytes of data]
[60 byte entry header]
[N bytes of data]
...
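For reference (the real parser is ar.rs; this sketch ignores details like GNU-style long names), decoding one of those 60-byte entry headers is mostly a matter of slicing fixed-width ASCII fields:

/// Sketch of decoding a single ar entry header. The layout is
/// name[16] mtime[12] uid[6] gid[6] mode[8] size[10] magic[2],
/// all space-padded ASCII; the data follows immediately and is
/// padded to a 2-byte boundary if its size is odd.
#[derive(Debug)]
struct ArEntry {
    name: String,
    size: u64, // number of data bytes following the header
}

fn parse_entry(header: &[u8; 60]) -> Option<ArEntry> {
    if &header[58..60] != b"`\n" {
        return None; // not a valid entry header
    }
    let name = String::from_utf8_lossy(&header[0..16])
        .trim_end()
        .trim_end_matches('/')
        .to_string();
    let size = String::from_utf8_lossy(&header[48..58]).trim().parse().ok()?;
    Some(ArEntry { name, size })
}

fn main() {
    // Fake header for a 4-byte "debian-binary" member, just to exercise the parser.
    let mut h = [b' '; 60];
    h[..13].copy_from_slice(b"debian-binary");
    h[48..49].copy_from_slice(b"4");
    h[58..60].copy_from_slice(b"`\n");
    println!("{:?}", parse_entry(&h));
}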
First up was to implement a basic ar parser in ar.rs. Before we get into using it to parse a deb, as a quick diversion, let's break apart a .deb file by hand, something that is a bit of a rite of passage (or at least it used to be? I'm getting old) during the Debian nm (new member) process, to take a look at where exactly the .debug file lives inside the .deb file.
$ ar x netlabel-tools-dbgsym_0.30.0-1+b1_amd64.deb
$ ls
control.tar.xz debian-binary
data.tar.xz netlabel-tools-dbgsym_0.30.0-1+b1_amd64.deb
$ tar --list -f data.tar.xz | grep '.debug$'
./usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
Since we know quite a bit about the structure of a .deb file, and I had to implement support from scratch anyway, I opted to implement a (very!) basic debfile parser using HTTP Range requests. HTTP Range requests, if supported by the server (denoted by an accept-ranges: bytes HTTP header in response to an HTTP HEAD request to that file), mean that we can add a header such as range: bytes=8-68 to specifically request that the returned GET body be the byte range provided (in the above case, the bytes starting from byte offset 8 until byte offset 68). This means we can fetch just the ar file entry headers from the .deb file until we get to the file inside the .deb we are interested in (in our case, the data.tar.xz file), at which point we can request the body of that file with a final range request. I wound up writing a struct to handle a read_at-style API surface in hrange.rs, which we can pair with ar.rs above and start to find our data in the .deb remotely, without downloading and unpacking the .deb at all. After we have the body of the data.tar.xz coming back through the HTTP response, we get to pipe it through an xz decompressor (this kinda sucked in Rust, since a tokio AsyncRead is not the same as an http Body response, is not the same as std::io::Read, is not the same as an async (or sync) Iterator, is not the same as what the xz2 crate expects; leading me to read blocks of data into a buffer and stuff them through the decoder by looping over the buffer for each lzma2 packet) and a tarfile parser (similarly troublesome). From there we get to iterate over all entries in the tarfile, stopping when we reach our file of interest. Since we can't seek, but gdb needs to, we'll pull it out of the stream into an in-memory Cursor<Vec<u8>> and pass a handle to it back to the user. From here on out it's a matter of gluing together a File traited struct in debugfs, and serving the filesystem over TCP using arigato. Done deal!
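For flavor, here's what a read_at-style range request looks like; this is a sketch using the reqwest crate rather than the post's own hrange.rs, and the URL is made up:

use reqwest::header::RANGE;

/// Sketch of a read_at-style helper over HTTP Range requests. Assumes the
/// server advertises accept-ranges: bytes; a real implementation would check
/// for a 206 Partial Content response rather than trusting the body blindly.
async fn read_at(client: &reqwest::Client, url: &str, offset: u64, len: u64) -> reqwest::Result<Vec<u8>> {
    let resp = client
        .get(url)
        // Range bounds are inclusive on both ends, hence the - 1.
        .header(RANGE, format!("bytes={}-{}", offset, offset + len - 1))
        .send()
        .await?
        .error_for_status()?;
    Ok(resp.bytes().await?.to_vec())
}

#[tokio::main]
async fn main() -> reqwest::Result<()> {
    let client = reqwest::Client::new();
    // Hypothetical .deb URL: grab the first 60-byte ar entry header, which
    // starts right after the 8-byte !<arch> magic.
    let header = read_at(&client, "https://deb.debian.org/pool/example.deb", 8, 60).await?;
    println!("fetched {} bytes", header.len());
    Ok(())
}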

A quick diversion about compression I was originally hoping to avoid transferring the whole tar file over the network (and therefore also reading the whole debug file into ram, which objectively sucks), but quickly hit issues with figuring out a way around seeking around an xz file. What's interesting is xz has a great primitive to solve this specific problem (specifically, use a block size that allows you to seek to the block just before your desired seek position, only discarding at most block size - 1 bytes), but data.tar.xz files generated by dpkg appear to have a single mega-huge block for the whole file. I don't know why I would have expected any different, in retrospect. That means that this now devolves into the base case of "How do I seek around an lzma2 compressed data stream?", which is a lot more complex of a question. Thankfully, the notoriously brilliant tianon was nice enough to introduce me to Jon Johnson, who did something super similar: adapted a technique to seek inside a compressed gzip file, which lets his service oci.dag.dev seek through Docker container images super fast, based on some prior work such as soci-snapshotter, gztool, and zran.c. He also pulled this party trick off for apk-based distros over at apk.dag.dev, which seems apropos. Jon was nice enough to publish a lot of his work on this, specifically in a central place under the name targz on his GitHub, which has been a ton of fun to read through. The gist is that, by dumping the decompressor's state (window of previous bytes, in-memory data derived from the last N-1 bytes) at specific checkpoints, along with the compressed data stream offset in bytes and decompressed offset in bytes, one can seek to that checkpoint in the compressed stream and pick up where you left off, creating a similar block mechanism against the wishes of gzip. It means you'd need to do an O(n) run over the file, but every request after that will be sped up according to the number of checkpoints you've taken. Given the complexity of xz and lzma2, I don't think this is possible for me at the moment, especially given most of the files I'll be requesting will not be loaded from again, especially when I can just cache the debug header by Build-Id. I want to implement this (because I'm generally curious and Jon has a way of getting someone excited about compression schemes, which is not a sentence I thought I'd ever say out loud), but for now I'm going to move on without this optimization. Such a shame, since it kills a lot of the work that went into seeking around the .deb file in the first place, given the debian-binary and control.tar.gz members are so small.
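To make the checkpoint idea a bit more concrete, here's a tiny sketch of the bookkeeping involved; the names are made up, and actually restoring an lzma2 (or gzip) decoder from a saved window is the hard part that this leaves out:

/// One saved decompressor checkpoint: where we were in the compressed stream,
/// how much plaintext preceded it, and the state (for gzip, the last 32 KiB
/// window) needed to resume decoding from that point.
struct Checkpoint {
    compressed_offset: u64,
    decompressed_offset: u64,
    window: Vec<u8>,
}

/// Pick the checkpoint closest to (but not past) the plaintext offset we want,
/// so we only have to re-decompress target - decompressed_offset bytes.
fn best_checkpoint(checkpoints: &[Checkpoint], target: u64) -> Option<&Checkpoint> {
    checkpoints
        .iter()
        .filter(|c| c.decompressed_offset <= target)
        .max_by_key(|c| c.decompressed_offset)
}

fn main() {
    let cps = vec![
        Checkpoint { compressed_offset: 0, decompressed_offset: 0, window: Vec::new() },
        Checkpoint { compressed_offset: 40_000, decompressed_offset: 1_000_000, window: vec![0; 32 * 1024] },
    ];
    let c = best_checkpoint(&cps, 1_500_000).unwrap();
    println!("resume decoding at compressed byte {}", c.compressed_offset);
}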

The Good First, the good news, right? It works! That's pretty cool. I'm positive my younger self would be amused and happy to see this working, as is current-day paultag. Let's take debugfs out for a spin! First, we need to mount the filesystem. It even works on an entirely unmodified, stock Debian box on my LAN, which is huge. Let's take it for a spin:
$ mount \
-t 9p \
-o trans=tcp,version=9p2000.u,aname=unstable-debug \
192.168.0.2 \
/usr/lib/debug/.build-id/
And, let's prove to ourselves that this actually mounted before we go trying to use it:
$ mount | grep build-id
192.168.0.2 on /usr/lib/debug/.build-id type 9p (rw,relatime,aname=unstable-debug,access=user,trans=tcp,version=9p2000.u,port=564)
Slick. We've got an open connection to the server, where our host will keep a connection alive as root, attached to the filesystem provided in aname. Let's take a look at it.
$ ls /usr/lib/debug/.build-id/
00 0d 1a 27 34 41 4e 5b 68 75 82 8E 9b a8 b5 c2 CE db e7 f3
01 0e 1b 28 35 42 4f 5c 69 76 83 8f 9c a9 b6 c3 cf dc E7 f4
02 0f 1c 29 36 43 50 5d 6a 77 84 90 9d aa b7 c4 d0 dd e8 f5
03 10 1d 2a 37 44 51 5e 6b 78 85 91 9e ab b8 c5 d1 de e9 f6
04 11 1e 2b 38 45 52 5f 6c 79 86 92 9f ac b9 c6 d2 df ea f7
05 12 1f 2c 39 46 53 60 6d 7a 87 93 a0 ad ba c7 d3 e0 eb f8
06 13 20 2d 3a 47 54 61 6e 7b 88 94 a1 ae bb c8 d4 e1 ec f9
07 14 21 2e 3b 48 55 62 6f 7c 89 95 a2 af bc c9 d5 e2 ed fa
08 15 22 2f 3c 49 56 63 70 7d 8a 96 a3 b0 bd ca d6 e3 ee fb
09 16 23 30 3d 4a 57 64 71 7e 8b 97 a4 b1 be cb d7 e4 ef fc
0a 17 24 31 3e 4b 58 65 72 7f 8c 98 a5 b2 bf cc d8 E4 f0 fd
0b 18 25 32 3f 4c 59 66 73 80 8d 99 a6 b3 c0 cd d9 e5 f1 fe
0c 19 26 33 40 4d 5a 67 74 81 8e 9a a7 b4 c1 ce da e6 f2 ff
Outstanding. Let's try using gdb to debug a binary that was provided by the Debian archive, and see if it'll load the ELF by build-id from the right .deb in the unstable-debug suite:
$ gdb -q /usr/sbin/netlabelctl
Reading symbols from /usr/sbin/netlabelctl...
Reading symbols from /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug...
(gdb)
Yes! Yes it will!
$ file /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
/usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter *empty*, BuildID[sha1]=e59f81f6573dadd5d95a6e4474d9388ab2777e2a, for GNU/Linux 3.2.0, with debug_info, not stripped

The Bad Linux's support for 9p is mainline, which is great, but it's not robust. Network issues or server restarts will wedge the mountpoint (Linux can't reconnect when the tcp connection breaks), and things that work fine on local filesystems get translated in a way that causes a lot of network chatter. For instance, just due to the way the syscalls are translated, doing an ls will result in a stat call for each file in the directory, even though Linux had just gotten a stat entry for every file while it was resolving directory names. On top of that, Linux will serialize all I/O with the server, so there are no concurrent requests for file information, writes, or reads pending at the same time to the server; and read and write throughput will degrade as latency increases due to increasing round-trip time, even though there are offsets included in the read and write calls. It works well enough, but is frustrating to run up against, since there's not a lot you can do server-side to help with this beyond implementing the 9P2000.L variant (which, maybe, is worth it).

The Ugly Unfortunately, we don't know the file size(s) until we've actually opened the underlying tar file and found the correct member, so for most files, we don't know the real size to report when getting a stat. We can't parse the tarfiles for every stat call, since that'd make ls even slower (bummer). The only hiccup is that when I report a filesize of zero, gdb throws a bit of a fit; let's try with a size of 0 to start:
$ ls -lah /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
-r--r--r-- 1 root root 0 Dec 31 1969 /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
$ gdb -q /usr/sbin/netlabelctl
Reading symbols from /usr/sbin/netlabelctl...
Reading symbols from /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug...
warning: Discarding section .note.gnu.build-id which has a section size (24) larger than the file size [in module /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug]
[...]
This obviously won't work, since gdb will throw away all our hard work because of stat's output, and neither will loading the real size of the underlying file. That only leaves us with hardcoding a file size and hoping nothing else breaks significantly as a result. Let's try it again:
$ ls -lah /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
-r--r--r-- 1 root root 954M Dec 31 1969 /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug
$ gdb -q /usr/sbin/netlabelctl
Reading symbols from /usr/sbin/netlabelctl...
Reading symbols from /usr/lib/debug/.build-id/e5/9f81f6573dadd5d95a6e4474d9388ab2777e2a.debug...
(gdb)
Much better. I mean, terrible but better. Better for now, anyway.

Kilroy was here Do I think this is a particularly good idea? I mean, kinda. I'm probably going to make some fun 9p arigato-based filesystems for use around my LAN, but I don't think I'll be moving to use debugfs until I can figure out how to ensure the connection is more resilient to changing networks and server restarts, and fix the i/o performance. I think it was a useful exercise and is a pretty great hack, but I don't think this'll be shipping anywhere anytime soon. Along with me publishing this post, I've pushed up all my repos; so you should be able to play along at home! There's a lot more work to be done on arigato, but it does handshake and successfully export a working 9P2000.u filesystem. Check it out on my GitHub at arigato, debugfs and also on crates.io and docs.rs. At least I can say I was here and I got it working after all these years.

27 August 2023

Gunnar Wolf: Interested in adopting the RPi images for Debian?

Back in June 2018, Michael Stapelberg put the Raspberry Pi image building up for adoption. He created the first set of unofficial, experimental Raspberry Pi images for Debian. I promptly answered him, and while it took me some time to actually wrap my head around Michael's work, I managed to eventually do so. By December, I started pushing some updates. Not only that: I didn't think much about it in the beginning, as the needed non-free package was called raspi3-firmware, but by early 2019, I had it running for all of the then-available Raspberry families (so the package was naturally renamed to raspi-firmware). I got my Raspberry Pi 4 at DebConf19 (thanks to Andy, who brought it from Cambridge), and it soon joined the happy Debian family. The images are built daily, and are available at https://raspi.debian.net. In the process, I also adopted Lars' great vmdb2 image building tool, and have kept it decently up to date (yes, I'm currently lagging behind, but I'll get to it soonish). Anyway... This year, I have been seriously neglecting the Raspberry builds. I have simply not had time to regularly test built images, nor to debug why the builder has not picked up building for trixie (testing). And my time availability is not going to improve any time soon. We are close to one month away from moving for six months to Paraná (Argentina), where I'll be focusing on my PhD. And while I do contemplate taking my Raspberries along, I do not foresee being able to put much energy into them. So... This is basically a call for adoption for the Raspberry Debian images building service. I do intend to stick around and try to help. It's not only me (although I'm responsible for the build itself): we have a nice and healthy group of Debian people hanging out in the #debian-raspberrypi channel on OFTC IRC. Don't be afraid, and come ask. I hope giving this project up for adoption will breathe new life into it!

6 November 2022

Arturo Borrero González: Home network refresh: 10G and IPv6

A few days ago, my home network got a refresh that resulted in the enablement of some next-generation technologies for me and my family. Well, next-generation or current-generation, depending on your point of view. Per the ISP standards in Spain (my country), what I'll describe next is literally the most and latest you can get. The post title spoiled it already: I now have a 10G internet uplink and native IPv6, since I changed my ISP to https://digimobil.es. My story began a few months ago when a series of fiber deployments started in my neighborhood by a back-then mostly unknown ISP (digimobil). The workers were deploying the fiber inside the street sewers, and I noticed that they were surrounded by advertisements promoting the fastest FTTH deployment in Spain. Indeed, their website was promoting 1G and 10G fiber, so a few days later I asked the workers when that would be available for subscription. They told me to wait just a couple of months, and the wait ended this week. I called the ISP, and a marketing person told me a lot of unnecessary information about how good a service I was purchasing. I asked about IPv6 availability, but that person had no idea. They called me the next day to confirm that the home router they were installing would support both IPv6 and Wi-Fi 6. I was suspicious about nobody in the marketing department knowing anything about either of the two technologies, but I decided to proceed anyway. Just 24 hours after calling them, a technician came to my house and 45 minutes later the setup was ready. The new home router was a ZTE ZXHN F8648P unit. By the way, it had Linux inside, but I got no GPL copyright notice or anything. It had 1x10G and 4x1G ethernet LAN ports. The optical speed tests that the technician did were giving between 8 Gbps and 9 Gbps in uplink speed, which seemed fair enough. Upon a quick search, there is apparently a community of folks online who already know how to get the most out of this router by unblocking the root account (sorry, in Spanish only) and using other tools. When I plugged the RJ45 into my laptop, the magic happened: the interface got a native, public IPv6 address from the router. I rushed to run the now-classic IPv6 browser test at https://test-ipv6.com/. If you are curious, this was the IPv6 prefix whois information:
route6: 2a0c:5a80::/29
descr: Digi Spain Telecom S.L.U.
origin: AS57269
They were handing my router a prefix like 2a0c:5a80:2218:4a00::/56. I didn't know whether the prefix was somehow static, dynamic, just for me, or anything else. I've been waiting for native IPv6 at home for years. In the past, I've had many ideas and projects to host network services at home leveraging IPv6. But when I finally got it, I didn't know what to do next. I had a 7-month-old baby, and honestly I didn't have the spare time to play a lot with the setup. Actually, I had no need or use for such a fast network either. But my coworker Andrew convinced me: given the price (30 EUR / month), I didn't have any reason not to buy it. In fact, I didn't have any 10G-enabled NIC at home. I had a few laptops with 2.5G ports, though, and that was enough to experience the new network speeds. Since this write-up was inspired by the now almost-legendary post by Michael Stapelberg, My upgrade to 25 Gbit/s Fiber To The Home, I contacted him, and he suggested running a few speed tests using the Ookla suite against his own server. Here are the results:
$ docker run --net host --rm -it docker.io/stapelberg/speedtest:latest -s 50092
[..]
     Server: Michael Stapelberg - Zurich (id = 50092)
        ISP: Digi Spain
    Latency:    34.29 ms   (0.20 ms jitter)
   Download:  2252.42 Mbps (data used: 3.4 GB )
     Upload:  2239.27 Mbps (data used: 2.8 GB )
Packet Loss:     0.0%
 Result URL: https://www.speedtest.net/result/c/cc8d6a78-c6f8-4f71-b554-a79812e10106
$ docker run --net host --rm -it docker.io/stapelberg/speedtest:latest -s 50092
[..]
     Server: Michael Stapelberg - Zurich (id = 50092)
        ISP: Digi Spain
    Latency:    34.05 ms   (0.21 ms jitter)
   Download:  2209.85 Mbps (data used: 3.2 GB )
     Upload:  2223.45 Mbps (data used: 2.9 GB )
Packet Loss:     0.0%
 Result URL: https://www.speedtest.net/result/c/32f9158e-fc1a-47e9-bd33-130e66c25417
This is over IPv6. Very satisfying. Bonus point: when I called my former ISP to cancel the old subscription, I didn't even bother mentioning IPv6. Cheers!

6 March 2021

Michael Stapelberg: Debian Code Search: OpenAPI now available

Debian Code Search now offers an OpenAPI-based API! Various developers have created ad-hoc client libraries based on how the web interface works. The goal of offering an OpenAPI-based API is to provide developers with automatically generated client libraries for a large number of programming languages, that target a stable interface independent of the web interface's implementation details.

Getting started
  1. Visit https://codesearch.debian.net/apikeys/ to download your personal API key. Login via Debian's GitLab instance salsa.debian.org; register there if you have no account yet.
  2. Find the Debian Code Search client library for your programming language. If none exists yet, auto-generate a client library on editor.swagger.io: click "Generate Client".
  3. Search all code in Debian from your own analysis tool, migration tracking dashboard, etc.

curl example
curl \
  -H "x-dcs-apikey: $(cat dcs-apikey-stapelberg.txt)" \
  -X GET \
  "https://codesearch.debian.net/api/v1/search?query=i3Font&match_mode=regexp" 

Web browser example You can try out the API in your web browser in the OpenAPI documentation.

Code example (Go) Here's an example program that demonstrates how to set up an auto-generated Go client for the Debian Code Search OpenAPI, run a query, and aggregate the results:
func burndown() error {
	cfg := openapiclient.NewConfiguration()
	cfg.AddDefaultHeader("x-dcs-apikey", apiKey)
	client := openapiclient.NewAPIClient(cfg)
	ctx := context.Background()
	// Search through the full Debian Code Search corpus, blocking until all
	// results are available:
	results, _, err := client.SearchApi.Search(ctx, "fmt.Sprint(err)", &openapiclient.SearchApiSearchOpts{
		// Literal searches are faster and do not require escaping special
		// characters, regular expression searches are more powerful.
		MatchMode: optional.NewString("literal"),
	})
	if err != nil {
		return err
	}
	// Print to stdout a CSV file with the path and number of occurrences:
	wr := csv.NewWriter(os.Stdout)
	header := []string{"path", "number of occurrences"}
	if err := wr.Write(header); err != nil {
		return err
	}
	occurrences := make(map[string]int)
	for _, result := range results {
		occurrences[result.Path]++
	}
	for _, result := range results {
		o, ok := occurrences[result.Path]
		if !ok {
			continue
		}
		// Print one CSV record per path:
		delete(occurrences, result.Path)
		record := []string{result.Path, strconv.Itoa(o)}
		if err := wr.Write(record); err != nil {
			return err
		}
	}
	wr.Flush()
	return wr.Error()
}
The full example can be found under burndown.go.

Feedback? File a GitHub issue on github.com/Debian/dcs please!

Migration status I'm aware of the following third-party projects using Debian Code Search:
Tool                                   Migration status
Debian Code Search CLI tool            Updated to OpenAPI
identify-incomplete-xs-go-import-path  Update pending
gnome-codesearch                       makes no API queries
If you find any others, please point them to this post in case they are not using Debian Code Search's OpenAPI yet.

21 January 2021

Russell Coker: Links January 2021

Krebs on Security has an informative article about web notifications and how they are being used for spamming and promoting malware [1]. He also includes links for how to permanently disable them. If nothing else, clicking "no" on each new site that wants to send notifications is annoying. Michael Stapelberg wrote an insightful post about inefficiencies in the Debian development processes [2]. While I agree with most of his assessment of Debian issues, I am not going to decrease my involvement in Debian. Of the issues he mentions, the 2 that seem to have the best effort to reward ratio are improvements to mailing list archives (to ideally make it practical to post to lists without subscribing and read responses in the archives) and the issue of forgetting all the complexities of the development process, which can be alleviated by better Wiki pages. In my Debian work I've contributed more to the Wiki in recent times, but not nearly as much as I should. Jacobin has an insightful article "Ending Poverty in the United States Would Actually Be Pretty Easy" [3]. Mark Brown wrote an interesting blog post about the Rust programming language [4]. He links to a couple of longer blog posts about it. Rust has some great features and I've been meaning to learn it. Scientific American has an informative article about research on the spread of fake news and memes [5]. Something to consider when using social media. Bruce Schneier wrote an insightful blog post on whether there should be limits on persuasive technology [6]. Jonathan Dowland wrote an interesting blog post about git rebasing and lab books [7]. I think it's an interesting thought experiment to compare the process of developing code worthy of being committed to a master branch of a VCS to the process of developing a Ph.D thesis. CBS has a disturbing article about the effect of Covid19 on people's lungs [8]. Apparently it usually does more lung damage than long-term smoking, and even 70%+ of people who don't have symptoms of the disease get significant lung damage. People who live in heavily affected countries like the US now have to worry that they might have had the disease and got lung damage without knowing it. Russ Allbery wrote an interesting review of the book Because Internet, about modern linguistics [9]. The topic is interesting and I might read that book at some future time (I have many good books I want to read). Jonathan Carter wrote an interesting blog post about CentOS Streams and why using a totally free OS like Debian is going to be a better option for most users [10]. Linus has slammed Intel for using ECC support as a way of segmenting the market between server and desktop to maximise profits [11]. It would be nice if a company made a line of Ryzen systems with ECC RAM support, but most manufacturers seem to be in on the market segmentation scam. Russ Allbery wrote an interesting review of the book Can't Even, about millennials as the burnout generation and the blame that the corporate culture deserves for this [12].

21 November 2020

Michael Stapelberg: Winding down my Debian involvement

This post is hard to write, both in the emotional sense but also in the "I would have written a shorter letter, but I didn't have the time" sense. Hence, please assume the best of intentions when reading it: it is not my intention to make anyone feel bad about their contributions, but rather to provide some insight into why my frustration level ultimately exceeded the threshold. Debian has been in my life for well over 10 years at this point. A few weeks ago, I visited some old friends at the Zürich Debian meetup after a multi-year period of absence. On my bike ride home, it occurred to me that the topics of our discussions had remarkable overlap with my last visit. We had a discussion about the merits of systemd, which took a detour to respect in open source communities, returned to processes in Debian and eventually culminated in democracies and their theoretical/practical failings. Admittedly, that last one might be a Swiss thing. I say this not to knock on the Debian meetup, but because it prompted me to reflect on what feelings Debian is invoking lately and whether it's still a good fit for me. So I'm finally making a decision that I should have made a long time ago: I am winding down my involvement in Debian to a minimum.

What does this mean? Over the coming weeks, I will:
  • transition packages to be team-maintained where it makes sense
  • remove myself from the Uploaders field on packages with other maintainers
  • orphan packages where I am the sole maintainer
I will try to keep up best-effort maintenance of the manpages.debian.org service and the codesearch.debian.net service, but any help would be much appreciated. For all intents and purposes, please treat me as permanently on vacation. I will try to be around for administrative issues (e.g. permission transfers) and questions addressed directly to me, provided they are easy enough to answer.

Why? When I joined Debian, I was still studying, i.e. I had luxurious amounts of spare time. Now, over 5 years of full time work later, my day job taught me a lot, both about what works in large software engineering projects and how I personally like my computer systems. I am very conscious of how I spend the little spare time that I have these days. The following sections each deal with what I consider a major pain point, in no particular order. Some of them influence each other; for example, if changes worked better, we could have a chance at transitioning packages to be more easily machine readable.

Change process in Debian Over the last few years, my current team at work conducted various smaller and larger refactorings across the entire code base (touching thousands of projects), so we have learnt a lot of valuable lessons about how to effectively do these changes. It irks me that Debian works almost the opposite way in every regard. I appreciate that every organization is different, but I think a lot of my points do actually apply to Debian. In Debian, packages are nudged in the right direction by a document called the Debian Policy, or its programmatic embodiment, lintian. While it is great to have a lint tool (for quick, local/offline feedback), it is even better to not require a lint tool at all. The team conducting the change (e.g. the C++ team introduces a new hardening flag for all packages) should be able to do their work transparently to me. Instead, currently, all packages become lint-unclean, all maintainers need to read up on what the new thing is, how it might break, whether/how it affects them, manually run some tests, and finally decide to opt in. This causes a lot of overhead and manually executed mechanical changes across packages. Notably, the cost of each change is distributed onto the package maintainers in the Debian model. At work, we have found that the opposite works better: if the team behind the change is put in power to do the change for as many users as possible, they can be significantly more efficient at it, which reduces the total cost and time a lot. Of course, exceptions (e.g. a large project abusing a language feature) should still be taken care of by the respective owners, but the important bit is that the default should be the other way around. Debian is lacking tooling for large changes: it is hard to programmatically deal with packages and repositories (see the section below). The closest thing to sending out a change for review is to open a bug report with an attached patch. I thought the workflow for accepting a change from a bug report was too complicated and started mergebot, but only Guido ever signaled interest in the project. Culturally, reviews and reactions are slow. There are no deadlines. I literally sometimes get emails notifying me that a patch I sent out a few years ago (!!) is now merged. This turns projects from a small number of weeks into many years, which is a huge demotivator for me. Interestingly enough, you can see artifacts of the slow online activity manifest themselves in the offline culture as well: I don't want to be discussing systemd's merits 10 years after I first heard about it. Lastly, changes can easily be slowed down significantly by holdouts who refuse to collaborate. My canonical example for this is rsync, whose maintainer refused my patches to make the package use debhelper purely out of personal preference. Granting so much personal freedom to individual maintainers prevents us as a project from raising the abstraction level for building Debian packages, which in turn makes tooling harder. How would things look like in a better world?
  1. As a project, we should strive towards more unification. Uniformity still does not rule out experimentation, it just changes the trade-off from easier experimentation and harder automation to harder experimentation and easier automation.
  2. Our culture needs to shift from "this package is my domain, how dare you touch it" to a shared sense of ownership, where anyone in the project can easily contribute (reviewed) changes without necessarily even involving individual maintainers.
To learn more about what successful large changes can look like, I recommend my colleague Hyrum Wright's talk "Large-Scale Changes at Google: Lessons Learned From 5 Yrs of Mass Migrations".

Fragmented workflow and infrastructure Debian generally seems to prefer decentralized approaches over centralized ones. For example, individual packages are maintained in separate repositories (as opposed to in one repository), each repository can use any SCM (git and svn are common ones) or no SCM at all, and each repository can be hosted on a different site. Of course, what you do in such a repository also varies subtly from team to team, and even within teams. In practice, non-standard hosting options are used rarely enough to not justify their cost, but frequently enough to be a huge pain when trying to automate changes to packages. Instead of using GitLab's API to create a merge request, you have to design an entirely different, more complex system, which deals with intermittently (or permanently!) unreachable repositories and abstracts away differences in patch delivery (bug reports, merge requests, pull requests, email, ...). Wildly diverging workflows are not just a temporary problem either. I participated in long discussions about different git workflows during DebConf 13, and gather that there were similar discussions in the meantime. Personally, I cannot keep enough details of the different workflows in my head. Every time I touch a package that works differently than mine, it frustrates me immensely to re-learn aspects of my day-to-day. After noticing workflow fragmentation in the Go packaging team (which I started), I tried fixing this with the workflow changes proposal, but did not succeed in implementing it. The lack of effective automation and the slow pace of changes in the surrounding tooling, despite my willingness to contribute time and energy, killed any motivation I had.

Old infrastructure: package uploads When you want to make a package available in Debian, you upload GPG-signed files via anonymous FTP. There are several batch jobs (the queue daemon, unchecked, dinstall, possibly others) which run on fixed schedules (e.g. dinstall runs at 01:52 UTC, 07:52 UTC, 13:52 UTC and 19:52 UTC). Depending on timing, I estimated that you might wait for over 7 hours (!!) before your package is actually installable. What's worse for me is that feedback to your upload is asynchronous. I like to do one thing, be done with it, move to the next thing. The current setup requires a many-minute wait and costly task switch for no good technical reason. You might think a few minutes aren't a big deal, but when all the time I can spend on Debian per day is measured in minutes, this makes a huge difference in perceived productivity and fun. The last communication I can find about speeding up this process is ganneff's post from 2008. How would things look like in a better world?
  1. Anonymous FTP would be replaced by a web service which ingests my package and returns an authoritative accept or reject decision in its response.
  2. For accepted packages, there would be a status page displaying the build status and when the package will be available via the mirror network.
  3. Packages should be available within a few minutes after the build completed.

Old infrastructure: bug tracker I dread interacting with the Debian bug tracker. debbugs is a piece of software (from 1994) which is only used by Debian and the GNU project these days. Debbugs processes emails, which is to say it is asynchronous and cumbersome to deal with. Despite running on the fastest machines we have available in Debian (or so I was told when the subject last came up), its web interface loads very slowly. Notably, the web interface at bugs.debian.org is read-only. Setting up a working email setup for reportbug(1) or manually dealing with attachments is a rather big hurdle. For reasons I don't understand, every interaction with debbugs results in many different email threads. Aside from the technical implementation, I also can never remember the different ways that Debian uses pseudo-packages for bugs and processes. I need them rarely enough that I never build a mental model of how they are set up, or working memory of how they are used, but frequently enough to be annoyed by this. How would things look like in a better world?
  1. Debian would switch from a custom bug tracker to a (any) well-established one.
  2. Debian would offer automation around processes. It is great to have a paper-trail and artifacts of the process in the form of a bug report, but the primary interface should be more convenient (e.g. a web form).

Old infrastructure: mailing list archives It baffles me that in 2019, we still don't have a conveniently browsable threaded archive of mailing list discussions. Email and threading are more widely used in Debian than anywhere else, so this is somewhat ironic. Gmane used to paper over this issue, but Gmane's availability over the last few years has been spotty, to say the least (it is down as I write this). I tried to contribute a threaded list archive, but our listmasters didn't seem to care or want to support the project.

Debian is hard to machine-read While it is obviously possible to deal with Debian packages programmatically, the experience is far from pleasant. Everything seems slow and cumbersome. I have picked just 3 quick examples to illustrate my point. debiman needs help from piuparts in analyzing the alternatives mechanism of each package to display the manpages of e.g. psql(1). This is because maintainer scripts modify the alternatives database by calling shell scripts. Without actually installing a package, you cannot know which changes it does to the alternatives database. pk4 needs to maintain its own cache to look up package metadata based on the package name. Other tools parse the apt database from scratch on every invocation. A proper database format, or at least a binary interchange format, would go a long way. Debian Code Search wants to ingest new packages as quickly as possible. There used to be a fedmsg instance for Debian, but it no longer seems to exist. It is unclear where to get notifications from for new packages, and where best to fetch those packages.

Complicated build stack See my Debian package build tools post. It really bugs me that the sprawl of tools is not seen as a problem by others.

Developer experience pretty painful Most of the points discussed so far deal with the experience of developing Debian, but as I recently described in my post "Debugging experience in Debian", the experience when developing using Debian leaves a lot to be desired, too.

I have more ideas At this point, the article is getting pretty long, and hopefully you got a rough idea of my motivation. While I described a number of specific shortcomings above, the final nail in the coffin is actually the lack of a positive outlook. I have more ideas that seem really compelling to me, but, based on how my previous projects have been going, I don't think I can make any of these ideas happen within the Debian project. I intend to publish a few more posts about specific ideas for improving operating systems here. Stay tuned. Lastly, I hope this post inspires someone, ideally a group of people, to improve the developer experience within Debian.

Michael Stapelberg: Linux package managers are slow

I measured how long the most popular Linux distributions' package managers take to install small and large packages (the ack(1p) source code search Perl script and qemu, respectively). Where required, my measurements include metadata updates such as transferring an up-to-date package list. For me, requiring a metadata update is the more common case, particularly on live systems or within Docker containers. All measurements were taken on an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz running Docker 1.13.1 on Linux 4.19, backed by a Samsung 970 Pro NVMe drive boasting many hundreds of MB/s write performance. The machine is located in Zürich and connected to the Internet with a 1 Gigabit fiber connection, so the expected top download speed is 115 MB/s. See Appendix C for details on the measurement method and command outputs.

Measurements Keep in mind that these are one-time measurements. They should be indicative of actual performance, but your experience may vary.

ack (small Perl program)
distribution  package manager  data    wall-clock time  rate
Fedora        dnf              114 MB  33s              3.4 MB/s
Debian        apt              16 MB   10s              1.6 MB/s
NixOS         Nix              15 MB   5s               3.0 MB/s
Arch Linux    pacman           6.5 MB  3s               2.1 MB/s
Alpine        apk              10 MB   1s               10.0 MB/s

qemu (large C program)
distribution  package manager  data    wall-clock time  rate
Fedora        dnf              226 MB  4m37s            1.2 MB/s
Debian        apt              224 MB  1m35s            2.3 MB/s
Arch Linux    pacman           142 MB  44s              3.2 MB/s
NixOS         Nix              180 MB  34s              5.2 MB/s
Alpine        apk              26 MB   2.4s             10.8 MB/s

(Looking for older measurements? See Appendix B (2019).) The difference between the slowest and fastest package managers is 30x! How can Alpine's apk and Arch Linux's pacman be an order of magnitude faster than the rest? They are doing a lot less than the others, and more efficiently, too.

Pain point: too much metadata For example, Fedora transfers a lot more data than others because its main package list is 60 MB (compressed!) alone. Compare that with Alpine's 734 KB APKINDEX.tar.gz. Of course the extra metadata which Fedora provides helps some use case, otherwise they hopefully would have removed it altogether. The amount of metadata seems excessive for the use case of installing a single package, which I consider the main use-case of an interactive package manager. I expect any modern Linux distribution to only transfer absolutely required data to complete my task.

Pain point: no concurrency Because they need to sequence executing arbitrary package maintainer-provided code (hooks and triggers), all tested package managers need to install packages sequentially (one after the other) instead of concurrently (all at the same time). In my blog post "Can we do without hooks and triggers?", I outline that hooks and triggers are not strictly necessary to build a working Linux distribution.

Thought experiment: further speed-ups Strictly speaking, the only required feature of a package manager is to make available the package contents so that the package can be used: a program can be started, a kernel module can be loaded, etc. By only implementing what's needed for this feature, and nothing more, a package manager could likely beat apk's performance. It could, for example:
  • skip archive extraction by mounting file system images (like AppImage or snappy)
  • use compression which is light on CPU, as networks are fast (like apk)
  • skip fsync when it is safe to do so, i.e.:
    • package installations don t modify system state
    • atomic package installation (e.g. an append-only package store)
    • automatically clean up the package store after crashes

Current landscape Here's a table outlining how the various package managers listed on Wikipedia's list of software package management systems fare:
name        scope   package file format                hooks/triggers
AppImage    apps    image: ISO9660, SquashFS           no
snappy      apps    image: SquashFS                    yes: hooks
FlatPak     apps    archive: OSTree                    no
0install    apps    archive: tar.bz2                   no
nix, guix   distro  archive: nar.{bz2,xz}              activation script
dpkg        distro  archive: tar.{gz,xz,bz2} in ar(1)  yes
rpm         distro  archive: cpio.{bz2,lz,xz}          scriptlets
pacman      distro  archive: tar.xz                    install
slackware   distro  archive: tar.{gz,xz}               yes: doinst.sh
apk         distro  archive: tar.gz                    yes: .post-install
Entropy     distro  archive: tar.bz2                   yes
ipkg, opkg  distro  archive: tar{,.gz}                 yes

Conclusion As per the current landscape, there is no distribution-scoped package manager which uses images and leaves out hooks and triggers, not even in smaller Linux distributions. I think that space is really interesting, as it uses a minimal design to achieve significant real-world speed-ups. I have explored this idea in much more detail, and am happy to talk more about it in my post "Introducing the distri research linux distribution".

Appendix C: measurement details (2020)

ack
Fedora's dnf takes almost 33 seconds to fetch and unpack 114 MB.
% docker run -t -i fedora /bin/bash
[root@62d3cae2e2f9 /]# time dnf install -y ack
Fedora 32 openh264 (From Cisco) - x86_64     1.9 kB/s   2.5 kB     00:01
Fedora Modular 32 - x86_64                   6.8 MB/s   4.9 MB     00:00
Fedora Modular 32 - x86_64 - Updates         5.6 MB/s   3.7 MB     00:00
Fedora 32 - x86_64 - Updates                 9.9 MB/s    23 MB     00:02
Fedora 32 - x86_64                            39 MB/s    70 MB     00:01
[ ]
real	0m32.898s
user	0m25.121s
sys	0m1.408s
NixOS's Nix takes a little over 5s to fetch and unpack 15 MB.
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.ack'
unpacking channels...
created 1 symlinks in user environment
installing 'perl5.32.0-ack-3.3.1'
these paths will be fetched (15.55 MiB download, 85.51 MiB unpacked):
  /nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man
  /nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31
  /nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18
  /nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10
  /nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53
  /nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0
  /nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31
  /nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0
  /nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48
  /nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1
copying path '/nix/store/34l8jdg76kmwl1nbbq84r2gka0kw6rc8-perl5.32.0-ack-3.3.1-man' from 'https://cache.nixos.org'...
copying path '/nix/store/czc3c1apx55s37qx4vadqhn3fhikchxi-libunistring-0.9.10' from 'https://cache.nixos.org'...
copying path '/nix/store/9fd4pjaxpjyyxvvmxy43y392l7yvcwy1-perl5.32.0-File-Next-1.18' from 'https://cache.nixos.org'...
copying path '/nix/store/xim9l8hym4iga6d4azam4m0k0p1nw2rm-libidn2-2.3.0' from 'https://cache.nixos.org'...
copying path '/nix/store/9df65igwjmf2wbw0gbrrgair6piqjgmi-glibc-2.31' from 'https://cache.nixos.org'...
copying path '/nix/store/y7i47qjmf10i1ngpnsavv88zjagypycd-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/dj6n505iqrk7srn96a27jfp3i0zgwa1l-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/w9wc0d31p4z93cbgxijws03j5s2c4gyf-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/ifayp0kvijq0n4x0bv51iqrb0yzyz77g-perl-5.32.0' from 'https://cache.nixos.org'...
copying path '/nix/store/z45mp61h51ksxz28gds5110rf3wmqpdc-perl5.32.0-ack-3.3.1' from 'https://cache.nixos.org'...
building '/nix/store/m0rl62grplq7w7k3zqhlcz2hs99y332l-user-environment.drv'...
created 49 symlinks in user environment
real	0m 5.60s
user	0m 3.21s
sys	0m 1.66s
Debian's apt takes almost 10 seconds to fetch and unpack 16 MB.
% docker run -t -i debian:sid
root@1996bb94a2d1:/# time (apt update && apt install -y ack-grep)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (8088 kB/s)
[ ]
The following NEW packages will be installed:
  ack libfile-next-perl libgdbm-compat4 libgdbm6 libperl5.30 netbase perl perl-modules-5.30
0 upgraded, 8 newly installed, 0 to remove and 23 not upgraded.
Need to get 7341 kB of archives.
After this operation, 46.7 MB of additional disk space will be used.
[ ]
real	0m9.544s
user	0m2.839s
sys	0m0.775s
Arch Linux's pacman takes a little under 3s to fetch and unpack 6.5 MB.
% docker run -t -i archlinux/base
[root@9f6672688a64 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
 core            130.8 KiB  1090 KiB/s 00:00
 extra          1655.8 KiB  3.48 MiB/s 00:00
 community         5.2 MiB  6.11 MiB/s 00:01
resolving dependencies...
looking for conflicting packages...
Packages (2) perl-file-next-1.18-2  ack-3.4.0-1
Total Download Size:   0.07 MiB
Total Installed Size:  0.19 MiB
[ ]
real	0m2.936s
user	0m0.375s
sys	0m0.160s
Alpine's apk takes a little over 1 second to fetch and unpack 10 MB.
% docker run -t -i alpine
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/4) Installing libbz2 (1.0.8-r1)
(2/4) Installing perl (5.30.3-r0)
(3/4) Installing perl-file-next (1.18-r0)
(4/4) Installing ack (3.3.1-r0)
Executing busybox-1.31.1-r16.trigger
OK: 43 MiB in 18 packages
real	0m 1.24s
user	0m 0.40s
sys	0m 0.15s

qemu You can expand each of these:
Fedora's dnf takes over 4 minutes to fetch and unpack 226 MB.
% docker run -t -i fedora /bin/bash
[root@6a52ecfc3afa /]# time dnf install -y qemu
Fedora 32 openh264 (From Cisco) - x86_64     3.1 kB/s   2.5 kB     00:00
Fedora Modular 32 - x86_64                   6.3 MB/s   4.9 MB     00:00
Fedora Modular 32 - x86_64 - Updates         6.0 MB/s   3.7 MB     00:00
Fedora 32 - x86_64 - Updates                 334 kB/s    23 MB     01:10
Fedora 32 - x86_64                            33 MB/s    70 MB     00:02
[ ]
Total download size: 181 M
Downloading Packages:
[ ]
real	4m37.652s
user	0m38.239s
sys	0m6.321s
NixOS's Nix takes almost 34s to fetch and unpack 180 MB.
% docker run -t -i nixos/nix
83971cf79f7e:/# time sh -c 'nix-channel --update && nix-env -iA nixpkgs.qemu'
unpacking channels...
created 1 symlinks in user environment
installing 'qemu-5.1.0'
these paths will be fetched (180.70 MiB download, 1146.92 MiB unpacked):
[ ]
real	0m 33.64s
user	0m 16.96s
sys	0m 3.05s
Debian's apt takes over 85 seconds to fetch and unpack 224 MB.
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://deb.debian.org/debian sid InRelease [146 kB]
Get:2 http://deb.debian.org/debian sid/main amd64 Packages [8400 kB]
Fetched 8546 kB in 1s (5998 kB/s)
[ ]
Fetched 216 MB in 43s (5006 kB/s)
[ ]
real	1m25.375s
user	0m29.163s
sys	0m12.835s
Arch Linux's pacman takes almost 44s to fetch and unpack 142 MB.
% docker run -t -i archlinux/base
[root@58c78bda08e8 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
 core          130.8 KiB  1055 KiB/s 00:00
 extra        1655.8 KiB  3.70 MiB/s 00:00
 community       5.2 MiB  7.89 MiB/s 00:01
[ ]
Total Download Size:   135.46 MiB
Total Installed Size:  661.05 MiB
[ ]
real	0m43.901s
user	0m4.980s
sys	0m2.615s
Alpine's apk takes only about 2.4 seconds to fetch and unpack 26 MB.
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[ ]
OK: 78 MiB in 95 packages
real	0m 2.43s
user	0m 0.46s
sys	0m 0.09s

Appendix B: measurement details (2019)

ack You can expand each of these:
Fedora's dnf takes almost 30 seconds to fetch and unpack 107 MB.
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y ack
Fedora Modular 30 - x86_64            4.4 MB/s   2.7 MB     00:00
Fedora Modular 30 - x86_64 - Updates  3.7 MB/s   2.4 MB     00:00
Fedora 30 - x86_64 - Updates           17 MB/s    19 MB     00:01
Fedora 30 - x86_64                     31 MB/s    70 MB     00:02
[ ]
Install  44 Packages
Total download size: 13 M
Installed size: 42 M
[ ]
real	0m29.498s
user	0m22.954s
sys	0m1.085s
NixOS's Nix takes 14s to fetch and unpack 15 MB.
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i perl5.28.2-ack-2.28'
unpacking channels...
created 2 symlinks in user environment
installing 'perl5.28.2-ack-2.28'
these paths will be fetched (14.91 MiB download, 80.83 MiB unpacked):
  /nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2
  /nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48
  /nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man
  /nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27
  /nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31
  /nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53
  /nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16
  /nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28
copying path '/nix/store/gkrpl3k6s43fkg71n0269yq3p1f0al88-perl5.28.2-ack-2.28-man' from 'https://cache.nixos.org'...
copying path '/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27' from 'https://cache.nixos.org'...
copying path '/nix/store/x4knf14z1p0ci72gl314i7vza93iy7yc-perl5.28.2-File-Next-1.16' from 'https://cache.nixos.org'...
copying path '/nix/store/89gi8cbp8l5sf0m8pgynp2mh1c6pk1gk-attr-2.4.48' from 'https://cache.nixos.org'...
copying path '/nix/store/svgkibi7105pm151prywndsgvmc4qvzs-acl-2.2.53' from 'https://cache.nixos.org'...
copying path '/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31' from 'https://cache.nixos.org'...
copying path '/nix/store/57iv2vch31v8plcjrk97lcw1zbwb2n9r-perl-5.28.2' from 'https://cache.nixos.org'...
copying path '/nix/store/zfj7ria2kwqzqj9dh91kj9kwsynxdfk0-perl5.28.2-ack-2.28' from 'https://cache.nixos.org'...
building '/nix/store/q3243sjg91x1m8ipl0sj5gjzpnbgxrqw-user-environment.drv'...
created 56 symlinks in user environment
real	0m 14.02s
user	0m 8.83s
sys	0m 2.69s
Debian's apt takes almost 10 seconds to fetch and unpack 16 MB.
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y ack-grep)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [233 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8270 kB]
Fetched 8502 kB in 2s (4764 kB/s)
[ ]
The following NEW packages will be installed:
  ack ack-grep libfile-next-perl libgdbm-compat4 libgdbm5 libperl5.26 netbase perl perl-modules-5.26
The following packages will be upgraded:
  perl-base
1 upgraded, 9 newly installed, 0 to remove and 60 not upgraded.
Need to get 8238 kB of archives.
After this operation, 42.3 MB of additional disk space will be used.
[ ]
real	0m9.096s
user	0m2.616s
sys	0m0.441s
Arch Linux's pacman takes a little over 3s to fetch and unpack 6.5 MB.
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm ack)
:: Synchronizing package databases...
 core            132.2 KiB  1033K/s 00:00
 extra          1629.6 KiB  2.95M/s 00:01
 community         4.9 MiB  5.75M/s 00:01
[ ]
Total Download Size:   0.07 MiB
Total Installed Size:  0.19 MiB
[ ]
real	0m3.354s
user	0m0.224s
sys	0m0.049s
Alpine's apk takes only about 1 second to fetch and unpack 10 MB.
% docker run -t -i alpine
/ # time apk add ack
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
(1/4) Installing perl-file-next (1.16-r0)
(2/4) Installing libbz2 (1.0.6-r7)
(3/4) Installing perl (5.28.2-r1)
(4/4) Installing ack (3.0.0-r0)
Executing busybox-1.30.1-r2.trigger
OK: 44 MiB in 18 packages
real	0m 0.96s
user	0m 0.25s
sys	0m 0.07s

qemu You can expand each of these:
Fedora's dnf takes over a minute to fetch and unpack 266 MB.
% docker run -t -i fedora /bin/bash
[root@722e6df10258 /]# time dnf install -y qemu
Fedora Modular 30 - x86_64            3.1 MB/s   2.7 MB     00:00
Fedora Modular 30 - x86_64 - Updates  2.7 MB/s   2.4 MB     00:00
Fedora 30 - x86_64 - Updates           20 MB/s    19 MB     00:00
Fedora 30 - x86_64                     31 MB/s    70 MB     00:02
[ ]
Install  262 Packages
Upgrade    4 Packages
Total download size: 172 M
[ ]
real	1m7.877s
user	0m44.237s
sys	0m3.258s
NixOS's Nix takes 38s to fetch and unpack 262 MB.
% docker run -t -i nixos/nix
39e9186422ba:/# time sh -c 'nix-channel --update && nix-env -i qemu-4.0.0'
unpacking channels...
created 2 symlinks in user environment
installing 'qemu-4.0.0'
these paths will be fetched (262.18 MiB download, 1364.54 MiB unpacked):
[ ]
real	0m 38.49s
user	0m 26.52s
sys	0m 4.43s
Debian's apt takes 51 seconds to fetch and unpack 159 MB.
% docker run -t -i debian:sid
root@b7cc25a927ab:/# time (apt update && apt install -y qemu-system-x86)
Get:1 http://cdn-fastly.deb.debian.org/debian sid InRelease [149 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian sid/main amd64 Packages [8426 kB]
Fetched 8574 kB in 1s (6716 kB/s)
[ ]
Fetched 151 MB in 2s (64.6 MB/s)
[ ]
real	0m51.583s
user	0m15.671s
sys	0m3.732s
Arch Linux's pacman takes 1m2s to fetch and unpack 124 MB.
% docker run -t -i archlinux/base
[root@9604e4ae2367 /]# time (pacman -Sy && pacman -S --noconfirm qemu)
:: Synchronizing package databases...
 core       132.2 KiB   751K/s 00:00
 extra     1629.6 KiB  3.04M/s 00:01
 community    4.9 MiB  6.16M/s 00:01
[ ]
Total Download Size:   123.20 MiB
Total Installed Size:  587.84 MiB
[ ]
real	1m2.475s
user	0m9.272s
sys	0m2.458s
Alpine's apk takes only about 2.4 seconds to fetch and unpack 26 MB.
% docker run -t -i alpine
/ # time apk add qemu-system-x86_64
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
[ ]
OK: 78 MiB in 95 packages
real	0m 2.43s
user	0m 0.46s
sys	0m 0.09s

Michael Stapelberg: Debian Code Search: positional index, TurboPFor-compressed

See the Conclusion for a summary if you're impatient :-)

Motivation Over the last few months, I have been developing a new index format for Debian Code Search. This required a lot of careful refactoring, re-implementation, debug tool creation and debugging. Multiple factors motivated my work on a new index format:
  1. The existing index format has a 2G size limit, into which we have bumped a few times, requiring manual intervention to keep the system running.
  2. Debugging the existing system required creating ad-hoc debugging tools, which made debugging sessions unnecessarily lengthy and painful.
  3. I wanted to check whether switching to a different integer compression format would improve performance (it does not).
  4. I wanted to check whether storing positions with the posting lists would improve performance of identifier queries (= queries which are not using any regular expression features), which make up 78.2% of all Debian Code Search queries (it does).
I figured building a new index from scratch was the easiest approach, compared to refactoring the existing index to increase the size limit (point 1). I also figured it would be a good idea to develop the debugging tool in lock step with the index format so that I can be sure the tool works and is useful (point 2).

Integer compression: TurboPFor As a quick refresher, search engines typically store document IDs (representing source code files, in our case) in an ordered list ("posting list"). It usually makes sense to apply at least a rudimentary level of compression: our existing system used variable integer encoding. TurboPFor, the self-proclaimed "Fastest Integer Compression" library, combines an advanced on-disk format with a carefully tuned SIMD implementation to reach better speeds (in micro benchmarks) at less disk usage than Russ Cox's varint implementation in github.com/google/codesearch. If you are curious about its inner workings, check out my article "TurboPFor: an analysis". Applied on the Debian Code Search index, TurboPFor indeed compresses integers better:

Disk space
8.9G codesearch varint index
5.5G TurboPFor index

Switching to TurboPFor (via cgo) for storing and reading the index results in a slight speed-up of a dcs replay benchmark, which is more pronounced the more i/o is required.

Query speed (regexp, cold page cache)
18s codesearch varint index
14s TurboPFor index (cgo)

Query speed (regexp, warm page cache)
15s codesearch varint index
14s TurboPFor index (cgo)

Overall, TurboPFor is an all-around improvement in efficiency, albeit with a high cost in implementation complexity.
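To make the varint baseline concrete, here is a minimal, self-contained sketch of delta-encoding a sorted posting list with variable-length integers, using Go's standard library. It illustrates the general technique only; it is neither TurboPFor's on-disk format nor the exact codesearch implementation:

package main

import (
	"encoding/binary"
	"fmt"
)

// encodePostingList delta-encodes a sorted list of document IDs and writes
// each delta as a variable-length integer. Sorted IDs yield small deltas,
// and small deltas encode into one or two bytes each.
func encodePostingList(docids []uint32) []byte {
	buf := make([]byte, 0, len(docids)*binary.MaxVarintLen32)
	var prev uint32
	for _, id := range docids {
		var tmp [binary.MaxVarintLen32]byte
		n := binary.PutUvarint(tmp[:], uint64(id-prev))
		buf = append(buf, tmp[:n]...)
		prev = id
	}
	return buf
}

func main() {
	docids := []uint32{3, 7, 8, 1500, 1501}
	enc := encodePostingList(docids)
	fmt.Printf("%d doc IDs, %d bytes uncompressed, %d bytes delta+varint\n",
		len(docids), 4*len(docids), len(enc))
}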

Positional index: trade more disk for faster queries This section builds on the previous section: all figures come from the TurboPFor index, which can optionally support positions. Conceptually, we're going from:
type docid uint32
type index map[trigram][]docid
to:
type occurrence struct {
    doc docid
    pos uint32 // byte offset in doc
}
type index map[trigram][]occurrence
The resulting index consumes more disk space, but can be queried faster:
  1. We can do fewer queries: instead of reading all the posting lists for all the trigrams, we can read the posting lists for the query's first and last trigram only.
    This is one of the tricks described in the paper AS-Index: A Structure For String Search Using n-grams and Algebraic Signatures (PDF), and goes a long way without incurring the complexity, computational cost and additional disk usage of calculating algebraic signatures.
  2. Verifying that the delta between the last and the first position matches the length of the query term significantly reduces the number of files to read (lower false positive rate); see the sketch after this list.
  3. The matching phase is quicker: instead of locating the query term in the file, we only need to compare a few bytes at a known offset for equality.
  4. More data is read sequentially (from the index), which is faster.
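To make points 1 and 2 concrete, here is a small sketch built on the types above. The trigram type and the merge strategy are simplifications for illustration, not the actual Debian Code Search code:

package main

import "fmt"

// Types as in the article; the trigram type itself is an assumption made for
// this sketch (the real index keys trigrams more compactly).
type (
	docid      uint32
	trigram    string
	occurrence struct {
		doc docid
		pos uint32 // byte offset in doc
	}
	index map[trigram][]occurrence
)

// candidates reads only the posting lists of the query's first and last
// trigram and keeps a document when the positional delta matches the query
// length (deduplication and verification of the remaining bytes are omitted).
func candidates(idx index, query string) []docid {
	first := trigram(query[:3])
	last := trigram(query[len(query)-3:])
	delta := uint32(len(query) - 3) // expected distance between the two trigrams

	lastPos := make(map[docid]map[uint32]bool)
	for _, occ := range idx[last] {
		if lastPos[occ.doc] == nil {
			lastPos[occ.doc] = make(map[uint32]bool)
		}
		lastPos[occ.doc][occ.pos] = true
	}

	var out []docid
	for _, occ := range idx[first] {
		if lastPos[occ.doc][occ.pos+delta] { // positions line up: likely a real match
			out = append(out, occ.doc)
		}
	}
	return out
}

func main() {
	idx := index{
		"msl": {{doc: 1, pos: 10}, {doc: 2, pos: 0}},
		"eep": {{doc: 1, pos: 13}, {doc: 2, pos: 40}},
	}
	fmt.Println(candidates(idx, "msleep")) // [1]: positions 10 and 13 are 3 bytes apart
}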

Disk space A positional index consumes significantly more disk space, but not so much as to pose a challenge: a Hetzner EX61-NVME dedicated server (64 €/month) provides 1 TB worth of fast NVMe flash storage.
6.5G non-positional
123G positional
93G positional (posrel)

The idea behind the positional index (posrel) is to not store a (doc,pos) tuple on disk, but to store positions, accompanied by a stream of doc/pos relationship bits: 1 means this position belongs to the next document, 0 means this position belongs to the current document. This is an easy way of saving some space without modifying the TurboPFor on-disk format: the posrel technique reduces the index size to about ¾ (93G instead of 123G). With the increase in size, the Linux page cache hit ratio will be lower for the positional index, i.e. more data will need to be fetched from disk for querying the index. As long as the disk can deliver data as fast as you can decompress posting lists, this only translates into one disk seek's worth of additional latency. This is the case with modern NVMe disks that deliver thousands of MB/s, e.g. the Samsung 960 Pro (used in Hetzner's aforementioned EX61-NVME server). The values were measured by running dcs du -h /srv/dcs/shard*/full without and with the -pos argument.
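A minimal sketch of decoding such a posrel stream, following the description above (the real on-disk layout of the index may differ in its details):

package main

import "fmt"

// decodePosrel reconstructs (doc, pos) tuples from a stream of positions plus
// one relationship bit per position: a 1 bit means "this position starts the
// next document", a 0 bit means "same document as the previous position".
func decodePosrel(docids, positions []uint32, posrel []byte) (pairs [][2]uint32) {
	d := -1
	for i, pos := range positions {
		if (posrel[i/8]>>uint(i%8))&1 == 1 {
			d++ // advance to the next document
		}
		pairs = append(pairs, [2]uint32{docids[d], pos})
	}
	return pairs
}

func main() {
	docids := []uint32{7, 12}            // documents of this posting list
	positions := []uint32{4, 90, 13, 70} // two positions per document
	posrel := []byte{0b0101}             // bits, LSB first: 1,0,1,0
	fmt.Println(decodePosrel(docids, positions, posrel))
	// Output: [[7 4] [7 90] [12 13] [12 70]]
}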

Bytes read A positional index requires fewer queries: reading only the first and last trigram's posting lists and positions is sufficient to achieve a lower (!) false positive rate than evaluating all trigrams' posting lists in a non-positional index. As a consequence, fewer files need to be read, resulting in fewer bytes required to read from disk overall. As an additional bonus, in a positional index, more data is read sequentially (index), which is faster than random i/o, regardless of the underlying disk.
regexp queries:      1.2G (index) + 19.8G (files) = 21.0G read
identifier queries:  4.2G (index) + 10.8G (files) = 15.0G read

The values were measured by running iostat -d 25 just before running bench.zsh on an otherwise idle system.

Query speed Even though the positional index is larger and requires more data to be read at query time (see above), thanks to the C TurboPFor library, the 2 queries on a positional index are roughly as fast as the n queries on a non-positional index (about 4s instead of about 3s). This is more than made up for by the combined i/o + matching stage, which shrinks from 18.5s (7.1s i/o + 11.4s matching) to 1.3s.
regexp queries:      3.3s (index) + 7.1s (i/o) + 11.4s (matching) = 21.8s
identifier queries:  3.92s (index) + 1.3s (i/o + matching) = 5.22s

Note that identifier query i/o was sped up not just by needing to read fewer bytes, but also by only having to verify bytes at a known offset instead of needing to locate the identifier within the file.

Conclusion The new index format is overall slightly more efficient. This disk space efficiency allows us to introduce a positional index section for the first time. Most Debian Code Search queries are positional queries (78.2%) and will be answered much more quickly by leveraging the positions. Bottom line: it is beneficial to use a positional index on disk over a non-positional index in RAM.

14 October 2020

Michael Stapelberg: distri: a Linux distribution to research fast package management

Over the last year or so I have worked on a research Linux distribution in my spare time. It's not a distribution for researchers (like Scientific Linux), but my personal playground project to research Linux distribution development, i.e. try out fresh ideas. This article focuses on the package format and its advantages, but there is more to distri, which I will cover in upcoming blog posts.

Motivation I was a Debian Developer for the 7 years from 2012 to 2019, but using the distribution often left me frustrated, ultimately resulting in me winding down my Debian work. Frequently, I was noticing a large gap between the actual speed of an operation (e.g. doing an update) and the possible speed based on back of the envelope calculations. I wrote more about this in my blog post "Package managers are slow". To me, this observation means that either there is potential to optimize the package manager itself (e.g. apt), or what the system does is just too complex. While I remember seeing some "low-hanging fruit", through my work on distri, I wanted to explore whether all the complexity we currently have in Linux distributions such as Debian or Fedora is inherent to the problem space. I have completed enough of the experiment to conclude that the complexity is not inherent: I can build a Linux distribution for general-enough purposes which is much less complex than existing ones. Those were low-hanging fruit from a user perspective. I'm not saying that fixing them is easy in the technical sense; I know too little about apt's code base to make such a statement.

Key idea: packages are images, not archives One key idea is to switch from using archives to using images for package contents. Common package managers such as dpkg(1) use tar(1) archives with various compression algorithms. distri uses SquashFS images, a comparatively simple file system image format that I happen to be familiar with from my work on the gokrazy Raspberry Pi 3 Go platform. This idea is not novel: AppImage and snappy also use images, but only for individual, self-contained applications. distri however uses images for distribution packages with dependencies. In particular, there is no duplication of shared libraries in distri. A nice side effect of using read-only image files is that applications are immutable and can hence not be broken by accidental (or malicious!) modification.

Key idea: separate hierarchies Package contents are made available under a fully-qualified path. E.g., all files provided by package zsh-amd64-5.6.2-3 are available under /ro/zsh-amd64-5.6.2-3. The mountpoint /ro stands for read-only, which is short yet descriptive. Perhaps surprisingly, building software with custom prefix values of e.g. /ro/zsh-amd64-5.6.2-3 is widely supported, thanks to:
  1. Linux distributions, which build software with prefix set to /usr, whereas FreeBSD (and the autotools default) builds with prefix set to /usr/local.
  2. Enthusiast users in corporate or research environments, who install software into their home directories.
Because using a custom prefix is a common scenario, upstream awareness for prefix-correctness is generally high, and the rarely required patch will be quickly accepted.

Key idea: exchange directories Software packages often exchange data by placing or locating files in well-known directories. Here are just a few examples:
  • gcc(1) locates the libusb(3) headers via /usr/include
  • man(1) locates the nginx(1) manpage via /usr/share/man.
  • zsh(1) locates executable programs via PATH components such as /bin
In distri, these locations are called exchange directories and are provided via FUSE in /ro. Exchange directories come in two different flavors:
  1. global. The exchange directory, e.g. /ro/share, provides the union of the share subdirectory of all packages in the package store (see the sketch after this list).
    Global exchange directories are largely used for compatibility, see below.
  2. per-package. Useful for tight coupling: e.g. irssi(1) does not provide any ABI guarantees, so plugins such as irssi-robustirc can declare that they want e.g. /ro/irssi-amd64-1.1.1-1/out/lib/irssi/modules to be a per-package exchange directory and contain files from their lib/irssi/modules.
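As a rough illustration of the global flavor, here is a sketch that computes the union of one subdirectory across all packages in a package store. It operates on plain directories and a simplified path layout; in distri the exchange directories are actually served by a FUSE file system:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// globalExchange computes the union of one subdirectory (e.g. "share") across
// all packages in a package store directory.
func globalExchange(store, sub string) (map[string]string, error) {
	union := make(map[string]string) // entry name -> providing package
	pkgs, err := os.ReadDir(store)
	if err != nil {
		return nil, err
	}
	for _, pkg := range pkgs {
		entries, err := os.ReadDir(filepath.Join(store, pkg.Name(), "out", sub))
		if err != nil {
			continue // this package does not provide the subdirectory
		}
		for _, e := range entries {
			union[e.Name()] = pkg.Name()
		}
	}
	return union, nil
}

func main() {
	union, err := globalExchange("/ro", "share")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d entries in the global share exchange directory\n", len(union))
}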

Search paths sometimes need to be fixed Programs which use exchange directories sometimes use search paths to access multiple exchange directories. In fact, the examples above were taken from gcc(1)'s INCLUDEPATH, man(1)'s MANPATH and zsh(1)'s PATH. These are prominent ones, but more examples are easy to find: zsh(1) loads completion functions from its FPATH. Some search path values are derived from --datadir=/ro/share and require no further attention, but others might derive from e.g. --prefix=/ro/zsh-amd64-5.6.2-3/out and need to be pointed to an exchange directory via a specific command line flag.

FHS compatibility Global exchange directories are used to make distri provide enough of the Filesystem Hierarchy Standard (FHS) that third-party software largely just works. This includes a C development environment. I successfully ran a few programs from their binary packages such as Google Chrome, Spotify, or Microsoft s Visual Studio Code.

Fast package manager I previously wrote about how Linux distribution package managers are too slow. distri's package manager is extremely fast. Its main bottleneck is typically the network link, even at high speed links (I tested with a 100 Gbps link). Its speed comes largely from an architecture which allows the package manager to do less work. Specifically:
  1. Package images can be added atomically to the package store, so we can safely skip fsync(2). Corruption will be cleaned up automatically, and durability is not important: if an interactive installation is interrupted, the user can just repeat it, as it will be fresh on their mind.
  2. Because all packages are co-installable thanks to separate hierarchies, there are no conflicts at the package store level, and no dependency resolution (an optimization problem requiring SAT solving) is required at all.
    In exchange directories, we resolve conflicts by selecting the package with the highest monotonically increasing distri revision number (see the sketch after this list).
  3. distri proves that we can build a useful Linux distribution entirely without hooks and triggers. Not having to serialize hook execution allows us to download packages into the package store with maximum concurrency.
  4. Because we are using images instead of archives, we do not need to unpack anything. This means installing a package is really just writing its package image and metadata to the package store. Sequential writes are typically the fastest kind of storage usage pattern.
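Here is a hedged sketch of the conflict resolution mentioned in point 2: pick the candidate with the highest trailing distri revision number. The package naming scheme is inferred from the examples in this article:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// revision extracts the trailing distri revision from a package name such as
// "zsh-amd64-5.6.2-3" (naming scheme inferred from the examples in the text).
func revision(pkg string) int {
	i := strings.LastIndex(pkg, "-")
	rev, err := strconv.Atoi(pkg[i+1:])
	if err != nil {
		return -1
	}
	return rev
}

// pick resolves a conflict between packages providing the same exchange
// directory entry by selecting the highest distri revision.
func pick(candidates []string) string {
	best := candidates[0]
	for _, c := range candidates[1:] {
		if revision(c) > revision(best) {
			best = c
		}
	}
	return best
}

func main() {
	fmt.Println(pick([]string{"libusb-amd64-1.0.22-7", "libusb-amd64-1.0.23-8"}))
	// Output: libusb-amd64-1.0.23-8
}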
Fast installation also makes other use cases more bearable, such as creating disk images, be it for testing them in qemu(1), booting them on real hardware from a USB drive, or for cloud providers such as Google Cloud.

Fast package builder Contrary to how distribution package builders are usually implemented, the distri package builder does not actually install any packages into the build environment. Instead, distri makes available a filtered view of the package store (only declared dependencies are available) at /ro in the build environment. This means that even for large dependency trees, setting up a build environment happens in a fraction of a second! Such a low latency really makes a difference in how comfortable it is to iterate on distribution packages.

Package stores In distri, package images are installed from a remote package store into the local system package store /roimg, which backs the /ro mount. A package store is implemented as a directory of package images and their associated metadata files. You can easily make available a package store by using distri export. To provide a mirror for your local network, you can periodically distri update from the package store you want to mirror, and then distri export your local copy. Special tooling (e.g. debmirror in Debian) is not required because distri install is atomic (and update uses install). Producing derivatives is easy: just add your own packages to a copy of the package store. The package store is intentionally kept simple to manage and distribute. Its files could be exchanged via peer-to-peer file systems, or synchronized from an offline medium.

distri's first release distri works well enough to demonstrate the ideas explained above. I have branched this state into branch jackherer, distri's first release code name. This way, I can keep experimenting in the distri repository without breaking your installation. From the branch contents, our autobuilder creates:
  1. disk images, which
  2. a package repository. Installations can pick up new packages with distri update.
  3. documentation for the release.
The project website can be found at https://distr1.org. The website is just the README for now, but we can improve that later. The repository can be found at https://github.com/distr1/distri

Project outlook Right now, distri is mainly a vehicle for my spare-time Linux distribution research. I don't recommend anyone use distri for anything but research, and there are no medium-term plans of that changing. At the very least, please contact me before basing anything serious on distri so that we can talk about limitations and expectations. I expect the distri project to live for as long as I have blog posts to publish, and we'll see what happens afterwards. Note that this is a hobby for me: I will continue to explore, at my own pace, parts that I find interesting. My hope is that established distributions might get a useful idea or two from distri.

There's more to come: subscribe to the distri feed I don't want to make this post too long, but there is much more! Please subscribe to the following URL in your feed reader to get all posts about distri: https://michael.stapelberg.ch/posts/tags/distri/feed.xml Next in my queue are articles about hermetic packages and good package maintainer experience (including declarative packaging).

Feedback or questions? I'd love to discuss these ideas in case you're interested! Please send feedback to the distri mailing list so that everyone can participate!

12 August 2020

Michael Stapelberg: Linux distributions: Can we do without hooks and triggers?

Hooks are an extension feature provided by all package managers that are used in larger Linux distributions. For example, Debian uses apt, which has various maintainer scripts. Fedora uses rpm, which has scriptlets. Different package managers use different names for the concept, but all of them offer package maintainers the ability to run arbitrary code during package installation and upgrades. Example hook use cases include adding daemon user accounts to your system (e.g. postgres), or generating/updating cache files. Triggers are a kind of hook which run when other packages are installed. For example, on Debian, the man(1) package comes with a trigger which regenerates the search database index whenever any package installs a manpage. When, for example, the nginx(8) package is installed, a trigger provided by the man(1) package runs. Over the past few decades, Open Source software has become more and more uniform: instead of each piece of software defining its own rules, a small number of build systems are now widely adopted. Hence, I think it makes sense to revisit whether offering extension via hooks and triggers is a net win or net loss.

Hooks preclude concurrent package installation Package managers commonly can make very few assumptions about what hooks do, what preconditions they require, and which conflicts might be caused by running multiple packages' hooks concurrently. Hence, package managers cannot concurrently install packages. At least the hook/trigger part of the installation needs to happen in sequence. While it seems technically feasible to retrofit package manager hooks with concurrency primitives such as locks for mutual exclusion between different hook processes, the required overhaul of all hooks seems like such a daunting task that it might be better to just get rid of the hooks instead. Only deleting code frees you from the burden of maintenance, automated testing and debugging. In Debian, there are 8620 non-generated maintainer scripts, as reported by find shard*/src/*/debian -regex ".*\(pre\|post\)\(inst\|rm\)$" on a Debian Code Search instance.

Triggers slow down installing/updating other packages Personally, I never use the apropos(1) command, so I don't appreciate the man(1) package's trigger which updates the database used by apropos(1). The process takes a long time and, because hooks and triggers must be executed serially (see previous section), blocks my installation or update. When I tell people this, they are often surprised to learn about the existence of the apropos(1) command. I suggest adopting an opt-in model.

Unnecessary work if programs are not used between updates Hooks run when packages are installed. If a package's contents are not used between two updates, running the hook in the first update could have been skipped. Running the hook lazily when the package contents are used reduces unnecessary work. As a welcome side-effect, lazy hook evaluation automatically makes the hook work in operating system images, such as live USB thumb drives or SD card images for the Raspberry Pi. Such images must not ship the same crypto keys (e.g. OpenSSH host keys) to all machines, but instead generate a different key on each machine. Why do users keep packages installed they don't use? It's extra work to remember and clean up those packages after use. Plus, users might not realize or value that having fewer packages installed has benefits such as faster updates. I can also imagine that there are people for whom the cost of re-installing packages incentivizes them to just keep packages installed: you never know when you might need the program again.

Implemented in an interpreted language While working on hermetic packages (more on that in another blog post), where the contained programs are started with modified environment variables (e.g. PATH) via a wrapper bash script, I noticed that the overhead of those wrapper bash scripts quickly becomes significant. For example, when using the excellent magit interface for Git in Emacs, I encountered second-long delays when using hermetic packages compared to standard packages. Re-implementing wrappers in a compiled language provided a significant speed-up. Similarly, getting rid of an extension point which mandates using shell scripts allows us to build an efficient and fast implementation of a predefined set of primitives, where you can reason about their effects and interactions. magit needs to run git a few times for displaying the full status, so small overhead quickly adds up.

Incentivizing more upstream standardization Hooks are an escape hatch for distribution maintainers to express anything which their packaging system cannot express. Distributions should only rely on well-established interfaces such as autoconf's classic ./configure && make && make install (including commonly used flags) to build a distribution package. Integrating upstream software into a distribution should not require custom hooks. For example, instead of requiring a hook which updates a cache of schema files, the library used to interact with those files should transparently (re-)generate the cache or fall back to a slower code path. Distribution maintainers are hard to come by, so we should value their time. In particular, there is a 1:n relationship of packages to distribution package maintainers (software is typically available in multiple Linux distributions), so it makes sense to spend the work in the 1 and have the n benefit.

Can we do without them? If we want to get rid of hooks, we need another mechanism to achieve what we currently achieve with hooks. If the hook is not specific to the package, it can be moved to the package manager. The desired system state should either be derived from the package contents (e.g. required system users can be discovered from systemd service files) or declaratively specified in the package build instructions (more on that in another blog post). This turns hooks (arbitrary code) into configuration, which allows the package manager to collapse and sequence the required state changes. E.g., when 5 packages are installed which each need a new system user, the package manager could update /etc/passwd just once. If the hook is specific to the package, it should be moved into the package contents. This typically means moving the functionality into the program start (or the systemd service file if we are talking about a daemon). If (while?) upstream is not convinced, you can either wrap the program or patch it. Note that this case is relatively rare: I have worked with hundreds of packages and the only package-specific functionality I came across was automatically generating host keys before starting OpenSSH's sshd(8). There is one exception where moving the hook doesn't work: packages which modify state outside of the system, such as bootloaders or kernel images. Even that can be moved out of a package-specific hook, as Fedora demonstrates.
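As a sketch of the "hooks become configuration" idea, here is what collapsing several packages' declared system users into a single passwd update could look like. This is illustrative only; the UID range is made up and a real package manager would allocate UIDs properly and lock the file:

package main

import (
	"fmt"
	"os"
	"strings"
)

// ensureSystemUsers appends all missing system users to a passwd file in one
// write, instead of each package's hook editing the file separately.
func ensureSystemUsers(passwdPath string, wanted []string) error {
	data, err := os.ReadFile(passwdPath)
	if err != nil {
		return err
	}
	existing := make(map[string]bool)
	for _, line := range strings.Split(string(data), "\n") {
		if name, _, ok := strings.Cut(line, ":"); ok {
			existing[name] = true
		}
	}
	var add strings.Builder
	uid := 900 // hypothetical system UID range
	for _, u := range wanted {
		if !existing[u] {
			fmt.Fprintf(&add, "%s:x:%d:%d::/var/empty:/usr/sbin/nologin\n", u, uid, uid)
			uid++
		}
	}
	if add.Len() == 0 {
		return nil
	}
	f, err := os.OpenFile(passwdPath, os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(add.String()) // one update for all packages
	return err
}

func main() {
	// Users declared by several packages, e.g. discovered from systemd units:
	if err := ensureSystemUsers("/tmp/passwd-example", []string{"postgres", "nginx"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}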

Conclusion Global state modifications performed as part of package installation today use hooks, an overly expressive extension mechanism. Instead, all modifications should be driven by configuration. This is feasible because there are only a few different kinds of desired state modifications. This makes it possible for package managers to optimize package installation.

Michael Stapelberg: Optional dependencies don t work

In the i3 projects, we have always tried hard to avoid optional dependencies. There are a number of reasons behind it, and as I have recently encountered some of the downsides of optional dependencies firsthand, I summarized my thoughts in this article.

What is a (compile-time) optional dependency? When building software from source, most programming languages and build systems support conditional compilation: different parts of the source code are compiled based on certain conditions. An optional dependency is conditional compilation hooked up directly to a knob (e.g. command line flag, configuration file, …), with the effect that the software can now be built without an otherwise required dependency. Let's walk through a few issues with optional dependencies.

Inconsistent experience in different environments Software is usually not built by end users, but by packagers, at least when we are talking about Open Source. Hence, end users don't see the knob for the optional dependency, they are just presented with the fait accompli: their version of the software behaves differently than other versions of the same software. Depending on the kind of software, this situation can be made obvious to the user: for example, if the optional dependency is needed to print documents, the program can produce an appropriate error message when the user tries to print a document. Sometimes, this isn't possible: when i3 introduced an optional dependency on cairo and pangocairo, the behavior itself (rendering window titles) worked in all configurations, but non-ASCII characters might break depending on whether i3 was compiled with cairo. For users, it is frustrating to only discover in conversation that a program has a feature that the user is interested in, but it's not available on their computer. For support, this situation can be hard to detect, and even harder to resolve to the user's satisfaction.

Packaging is more complicated Unfortunately, many build systems don't stop the build when optional dependencies are not present. Instead, you sometimes end up with a broken build, or, even worse: with a successful build that does not work correctly at runtime. This means that packagers need to closely examine the build output to know which dependencies to make available. In the best case, there is a summary of available and enabled options, clearly outlining what this build will contain. In the worst case, you need to infer the features from the checks that are done, or work your way through the --help output. The better alternative is to configure your build system such that it stops when any dependency was not found, and thereby have packagers acknowledge each optional dependency by explicitly disabling the option.

Untested code paths bit rot Code paths which are not used will inevitably bit rot. If you have optional dependencies, you need to test both the code path without the dependency and the code path with the dependency. It doesn't matter whether the tests are automated or manual, the test matrix must cover both paths. Interestingly enough, this principle seems to apply to all kinds of software projects (but it slows down as change slows down): one might think that important Open Source building blocks should have enough users to cover all sorts of configurations. However, consider this example: building cairo without libxrender results in all GTK application windows, menus, etc. being displayed as empty grey surfaces. Cairo does not fail to build without libxrender, but the code path clearly is broken without libxrender.

Can we do without them? I'm not saying optional dependencies should never be used. In fact, for bootstrapping, disabling dependencies can save a lot of work and can sometimes allow breaking circular dependencies. For example, in an early bootstrapping stage, binutils can be compiled with --disable-nls to disable internationalization. However, optional dependencies are broken so often that I conclude they are overused. Read on and see for yourself whether you would rather commit to best practices or not introduce an optional dependency.

Best practices If you do decide to make dependencies optional, please:
  1. Set up automated testing for all code path combinations.
  2. Fail the build until packagers explicitly pass a --disable flag.
  3. Tell users their version is missing a dependency at runtime, e.g. in --version.

Michael Stapelberg: distri: 20x faster initramfs (initrd) from scratch

In case you are not yet familiar with why an initramfs (or initrd, or initial ramdisk) is typically used when starting Linux, let me quote the wikipedia definition: [ ] initrd is a scheme for loading a temporary root file system into memory, which may be used as part of the Linux startup process [ ] to make preparations before the real root file system can be mounted. Many Linux distributions do not compile all file system drivers into the kernel, but instead load them on-demand from an initramfs, which saves memory. Another common scenario, in which an initramfs is required, is full-disk encryption: the disk must be unlocked from userspace, but since userspace is encrypted, an initramfs is used.

Motivation Thus far, building a distri disk image was quite slow. This is on an AMD Ryzen 3900X 12-core processor (2019):
distri % time make cryptimage serial=1
80.29s user 13.56s system 186% cpu 50.419 total # 19s image, 31s initrd
Of these 50 seconds, dracut's initramfs generation accounts for 31 seconds (62%)! Initramfs generation time drops to 8.7 seconds once dracut no longer needs to use the single-threaded gzip(1), but the multi-threaded replacement pigz(1). This brings the total time to build a distri disk image down to:
distri % time make cryptimage serial=1
76.85s user 13.23s system 327% cpu 27.509 total # 19s image, 8.7s initrd
Clearly, when you use dracut on any modern computer, you should make pigz available. dracut should fail to compile unless one explicitly opts into the known-slower gzip. For more thoughts on optional dependencies, see "Optional dependencies don't work". But why does it still take 8.7 seconds? Can we go faster? The answer is Yes! I recently built a distri-specific initramfs I'm calling minitrd. I wrote both big parts from scratch:
  1. the initramfs generator program (distri initrd)
  2. a custom Go userland (cmd/minitrd), running as /init in the initramfs.
minitrd generates the initramfs image in 400ms, bringing the total time down to:
distri % time make cryptimage serial=1
50.09s user 8.80s system 314% cpu 18.739 total # 18s image, 400ms initrd
(The remaining time is spent in preparing the file system, then installing and configuring the distri system, i.e. preparing a disk image you can run on real hardware.) How can minitrd be 20 times faster than dracut? dracut is mainly written in shell, with a C helper program. It drives the generation process by spawning lots of external dependencies (e.g. ldd or the dracut-install helper program). I assume that the combination of using an interpreted language (shell) that spawns lots of processes and precludes a concurrent architecture is to blame for the poor performance. minitrd is written in Go, with speed as a goal. It leverages concurrency and uses no external dependencies; everything happens within a single process (but with enough threads to saturate modern hardware). Measuring early boot time using qemu, I measured the dracut-generated initramfs taking 588ms to display the full disk encryption passphrase prompt, whereas minitrd took only 195ms. The rest of this article dives deeper into how minitrd works.

What does an initramfs do? Ultimately, the job of an initramfs is to make the root file system available and continue booting the system from there. Depending on the system setup, this involves the following 5 steps:

1. Load kernel modules to access the block devices with the root file system Depending on the system, the block devices with the root file system might already be present when the initramfs runs, or some kernel modules might need to be loaded first. On my Dell XPS 9360 laptop, the NVMe system disk is already present when the initramfs starts, whereas in qemu, we need to load the virtio_pci module, followed by the virtio_scsi module. How will our userland program know which kernel modules to load? Linux kernel modules declare patterns for their supported hardware as an alias, e.g.:
initrd# grep virtio_pci lib/modules/5.4.6/modules.alias
alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci
Devices in sysfs have a modalias file whose content can be matched against these declarations to identify the module to load:
initrd# cat /sys/devices/pci0000:00/*/modalias
pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00
pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
[ ]
Hence, for the initial round of module loading, it is sufficient to locate all modalias files within sysfs and load the responsible modules. Loading a kernel module can result in new devices appearing. When that happens, the kernel sends a uevent, which the uevent consumer in userspace receives via a netlink socket. Typically, this consumer is udev(7), but in our case, it's minitrd. For each uevent message that comes with a MODALIAS variable, minitrd will load the relevant kernel module(s). When loading a kernel module, its dependencies need to be loaded first. Dependency information is stored in the modules.dep file in a Makefile-like syntax:
initrd# grep virtio_pci lib/modules/5.4.6/modules.dep
kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
To load a module, we can open its file and then call the Linux-specific finit_module(2) system call. Some modules are expected to return an error code, e.g. ENODEV or ENOENT when some hardware device is not actually present. Side note: next to the textual versions, there are also binary versions of the modules.alias and modules.dep files. Presumably, those can be queried more quickly, but for simplicity, I have not (yet?) implemented support in minitrd.
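Putting the pieces of this step together, here is a hedged sketch of modalias-driven module loading: parse modules.alias and modules.dep, match device modalias strings against the alias patterns, and load each module after its dependencies via finit_module(2) (here through the golang.org/x/sys/unix wrapper). It is a simplification of what an initramfs userland has to do, not minitrd's actual code:

package main

import (
	"bufio"
	"fmt"
	"os"
	"path"
	"path/filepath"
	"strings"

	"golang.org/x/sys/unix"
)

const moddir = "/lib/modules/5.4.6" // kernel version hard-coded for this sketch

// parseAliases reads modules.alias lines of the form "alias <pattern> <module>".
func parseAliases() map[string]string {
	m := make(map[string]string)
	f, err := os.Open(filepath.Join(moddir, "modules.alias"))
	if err != nil {
		return m
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if fields := strings.Fields(s.Text()); len(fields) == 3 && fields[0] == "alias" {
			m[fields[1]] = fields[2]
		}
	}
	return m
}

// parseDeps reads modules.dep ("<path>: <dep path> ...") and also builds a map
// from (normalized) module name to module path.
func parseDeps() (depOf map[string][]string, pathOf map[string]string) {
	depOf, pathOf = make(map[string][]string), make(map[string]string)
	f, err := os.Open(filepath.Join(moddir, "modules.dep"))
	if err != nil {
		return
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		p, rest, ok := strings.Cut(s.Text(), ":")
		if !ok {
			continue
		}
		depOf[p] = strings.Fields(rest)
		name := strings.TrimSuffix(filepath.Base(p), ".ko")
		pathOf[strings.ReplaceAll(name, "-", "_")] = p
	}
	return
}

// insmod loads the module at path p after its dependencies, via finit_module(2).
func insmod(depOf map[string][]string, p string) error {
	for _, dep := range depOf[p] {
		if err := insmod(depOf, dep); err != nil {
			return err
		}
	}
	f, err := os.Open(filepath.Join(moddir, p))
	if err != nil {
		return err
	}
	defer f.Close()
	if err := unix.FinitModule(int(f.Fd()), "", 0); err != nil && err != unix.EEXIST {
		return err
	}
	return nil
}

func main() {
	aliases := parseAliases()
	depOf, pathOf := parseDeps()
	// A real implementation walks all of /sys; the PCI bus is enough for qemu:
	modaliases, _ := filepath.Glob("/sys/devices/pci0000:00/*/modalias")
	for _, fn := range modaliases {
		b, err := os.ReadFile(fn)
		if err != nil {
			continue
		}
		modalias := strings.TrimSpace(string(b))
		for pattern, name := range aliases {
			if ok, _ := path.Match(pattern, modalias); !ok {
				continue
			}
			if p, found := pathOf[strings.ReplaceAll(name, "-", "_")]; found {
				fmt.Println("loading", p, "for", modalias)
				if err := insmod(depOf, p); err != nil {
					fmt.Fprintln(os.Stderr, err)
				}
			}
		}
	}
}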

2. Console settings: font, keyboard layout Setting a legible font is necessary for hi-dpi displays. On my Dell XPS 9360 (3200 x 1800 QHD+ display), the following works well:
initrd# setfont latarcyrheb-sun32
Setting the user s keyboard layout is necessary for entering the LUKS full-disk encryption passphrase in their preferred keyboard layout. I use the NEO layout:
initrd# loadkeys neo

3. Block device identification In the Linux kernel, block device enumeration order is not necessarily the same on each boot. Even if it was deterministic, device order could still be changed when users modify their computer's device topology (e.g. connect a new disk to a formerly unused port). Hence, it is good style to refer to disks and their partitions with stable identifiers. This also applies to boot loader configuration, and so most distributions will set a kernel parameter such as root=UUID=1fa04de7-30a9-4183-93e9-1b0061567121. Identifying the block device or partition with the specified UUID is the initramfs's job. Depending on what the device contains, the UUID comes from a different place. For example, ext4 file systems have a UUID field in their file system superblock, whereas LUKS volumes have a UUID in their LUKS header. Canonically, probing a device to extract the UUID is done by libblkid from the util-linux package, but the logic can easily be re-implemented in other languages and changes rarely. minitrd comes with its own implementation to avoid cgo or running the blkid(8) program.
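As an illustration of such probing, here is a sketch that reads the UUID from an ext2/3/4 superblock (superblock at byte offset 1024; within it, the magic 0xEF53 at offset 56 and the 16-byte s_uuid field at offset 104). It is not a full blkid replacement, and the device path in main is hypothetical; LUKS volumes, swap and other formats keep their UUIDs elsewhere:

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

// ext4UUID reads the file system UUID from an ext2/3/4 superblock.
func ext4UUID(device string) (string, error) {
	f, err := os.Open(device)
	if err != nil {
		return "", err
	}
	defer f.Close()
	sb := make([]byte, 1024)
	if _, err := f.ReadAt(sb, 1024); err != nil { // superblock starts at byte 1024
		return "", err
	}
	if binary.LittleEndian.Uint16(sb[56:58]) != 0xEF53 { // s_magic
		return "", fmt.Errorf("%s: no ext superblock magic", device)
	}
	u := sb[104:120] // s_uuid
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16]), nil
}

func main() {
	uuid, err := ext4UUID("/dev/sda2") // hypothetical root partition
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(uuid)
}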

4. LUKS full-disk encryption unlocking (only on encrypted systems) Unlocking a LUKS-encrypted volume is done in userspace. The kernel handles the crypto, but reading the metadata, obtaining the passphrase (or e.g. key material from a file) and setting up the device mapper table entries are done in user space.
initrd# modprobe algif_skcipher
initrd# cryptsetup luksOpen /dev/sda4 cryptroot1
After the user entered their passphrase, the root file system can be mounted:
initrd# mount /dev/dm-0 /mnt

5. Continuing the boot process (switch_root) Now that everything is set up, we need to pass execution to the init program on the root file system with a careful sequence of chdir(2), mount(2), chroot(2), chdir(2) and execve(2) system calls that is explained in this busybox switch_root comment.
initrd# mount -t devtmpfs dev /mnt/dev
initrd# exec switch_root -c /dev/console /mnt /init
To conserve RAM, the files in the temporary file system to which the initramfs archive is extracted are typically deleted.
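A sketch of that sequence using the golang.org/x/sys/unix wrappers; error handling is abbreviated and deleting the initramfs files is omitted. It mirrors the busybox comment referenced above, but it is not minitrd's actual code:

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// switchRoot hands control to the init program on the real root file system:
// chdir to the new root, move its mount over /, chroot, chdir to /, execve.
func switchRoot(newRoot, init string) error {
	if err := unix.Chdir(newRoot); err != nil {
		return err
	}
	if err := unix.Mount(".", "/", "", unix.MS_MOVE, ""); err != nil {
		return err
	}
	if err := unix.Chroot("."); err != nil {
		return err
	}
	if err := unix.Chdir("/"); err != nil {
		return err
	}
	// Replace the current process (PID 1) with the real init:
	return unix.Exec(init, []string{init}, os.Environ())
}

func main() {
	if err := switchRoot("/mnt", "/init"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}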

How is an initramfs generated? An initramfs image (more accurately: archive) is a compressed cpio archive. Typically, gzip compression is used, but the kernel supports a bunch of different algorithms and distributions such as Ubuntu are switching to lz4. Generators typically prepare a temporary directory and feed it to the cpio(1) program. In minitrd, we read the files into memory and generate the cpio archive using the go-cpio package. We use the pgzip package for parallel gzip compression. The following files need to go into the cpio archive:

minitrd Go userland The minitrd binary is copied into the cpio archive as /init and will be run by the kernel after extracting the archive. Like the rest of distri, minitrd is built statically without cgo, which means it can be copied as-is into the cpio archive.

Linux kernel modules Aside from the modules.alias and modules.dep metadata files, the kernel modules themselves reside in e.g. /lib/modules/5.4.6/kernel and need to be copied into the cpio archive. Copying all modules results in an 80 MiB archive, so it is common to only copy modules that are relevant to the initramfs's features. This reduces archive size to 24 MiB. The filtering relies on hard-coded patterns and module names. For example, disk encryption related modules are all kernel modules underneath kernel/crypto, plus kernel/drivers/md/dm-crypt.ko. When generating a host-only initramfs (works on precisely the computer that generated it), some initramfs generators look at the currently loaded modules and just copy those.

Console Fonts and Keymaps The kbd package's setfont(8) and loadkeys(1) programs load console fonts and keymaps from /usr/share/consolefonts and /usr/share/keymaps, respectively. Hence, these directories need to be copied into the cpio archive. Depending on whether the initramfs should be generic (work on many computers) or host-only (works on precisely the computer/settings that generated it), the entire directories are copied, or only the required font/keymap.

cryptsetup, setfont, loadkeys These programs are (currently) required because minitrd does not implement their functionality. As they are dynamically linked, not only the programs themselves need to be copied, but also the ELF dynamic linking loader (path stored in the .interp ELF section) and any ELF library dependencies. For example, cryptsetup in distri declares the ELF interpreter /ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2 and declares dependencies on shared libraries libcryptsetup.so.12, libblkid.so.1 and others. Luckily, in distri, packages contain a lib subdirectory containing symbolic links to the resolved shared library paths (hermetic packaging), so it is sufficient to mirror the lib directory into the cpio archive, recursing into shared library dependencies of shared libraries. cryptsetup also requires the GCC runtime library libgcc_s.so.1 to be present at runtime, and will abort with an error message about not being able to call pthread_cancel(3) if it is unavailable.

time zone data To print log messages in the correct time zone, we copy /etc/localtime from the host into the cpio archive.
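Putting the file list above together, generating the archive with the go-cpio and pgzip packages mentioned earlier might look roughly like this. The import paths, file list and modes are assumptions for the sketch, not minitrd's actual generator:

package main

import (
	"log"
	"os"

	cpio "github.com/cavaliercoder/go-cpio"
	pgzip "github.com/klauspost/pgzip"
)

func main() {
	out, err := os.Create("/tmp/initrd")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	gz := pgzip.NewWriter(out) // parallel gzip compression
	w := cpio.NewWriter(gz)

	// Two of the files discussed above; a real generator also adds kernel
	// modules, console fonts/keymaps and dynamically linked helper programs.
	files := []struct{ src, dst string }{
		{"/tmp/minitrd", "init"}, // the Go userland becomes /init
		{"/etc/localtime", "etc/localtime"},
	}
	for _, f := range files {
		body, err := os.ReadFile(f.src)
		if err != nil {
			log.Fatal(err)
		}
		hdr := &cpio.Header{Name: f.dst, Mode: 0755, Size: int64(len(body))}
		if err := w.WriteHeader(hdr); err != nil {
			log.Fatal(err)
		}
		if _, err := w.Write(body); err != nil {
			log.Fatal(err)
		}
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	if err := gz.Close(); err != nil {
		log.Fatal(err)
	}
}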

minitrd outside of distri? I currently have no desire to make minitrd available outside of distri. While the technical challenges (such as extending the generator to not rely on distri's hermetic packages) are surmountable, I don't want to support people's initramfs remotely. Also, I think that people's efforts should in general be spent on rallying behind dracut and making it work faster, thereby benefiting all Linux distributions that use dracut (increasingly more). With minitrd, I have demonstrated that significant speed-ups are achievable.

Conclusion It was interesting to dive into how an initramfs really works. I had been working with the concept for many years, from small tasks such as "debug why the encrypted root file system is not unlocked" to more complicated tasks such as "set up a root file system on DRBD for a high-availability setup". But even with that sort of experience, I didn't know all the details, until I was forced to implement every little thing. As I suspected going into this exercise, dracut is much slower than it needs to be. Re-implementing its generation stage in a modern language instead of shell helps a lot. Of course, my minitrd does a bit less than dracut, but not drastically so. The overall architecture is the same. I hope my effort helps with two things:
  1. As a teaching implementation: instead of wading through the various components that make up a modern initramfs (udev, systemd, various shell scripts, ), people can learn about how an initramfs works in a single place.
  2. I hope the significant time difference motivates people to improve dracut.

Appendix: qemu development environment Before writing any Go code, I did some manual prototyping. Learning how other people prototype is often immensely useful to me, so I'm sharing my notes here. First, I copied all kernel modules and a statically built busybox binary:
% mkdir -p lib/modules/5.4.6
% cp -Lr /ro/lib/modules/5.4.6/* lib/modules/5.4.6/
% cp ~/busybox-1.22.0-amd64/busybox sh
To generate an initramfs from the current directory, I used:
% find . | cpio -o -H newc | pigz > /tmp/initrd
In distri's Makefile, I append these flags to the QEMU invocation:
-kernel /tmp/kernel \
-initrd /tmp/initrd \
-append "root=/dev/mapper/cryptroot1 rdinit=/sh ro console=ttyS0,115200 rd.luks=1 rd.luks.uuid=63051f8a-54b9-4996-b94f-3cf105af2900 rd.luks.name=63051f8a-54b9-4996-b94f-3cf105af2900=cryptroot1 rd.vconsole.keymap=neo rd.vconsole.font=latarcyrheb-sun32 init=/init systemd.setenv=PATH=/bin rw vga=836"
The vga= mode parameter is required for loading font latarcyrheb-sun32. Once in the busybox shell, I manually prepared the required mount points and kernel modules:
ln -s sh mount
ln -s sh lsmod
mkdir /proc /sys /run /mnt
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
modprobe virtio_pci
modprobe virtio_scsi
As a next step, I copied cryptsetup and dependencies into the initramfs directory:
% for f in /ro/cryptsetup-amd64-2.0.4-6/lib/*; do full=$(readlink -f $f); rel=$(echo $full | sed 's,^/,,g'); mkdir -p $(dirname $rel); install $full $rel; done
% ln -s ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
% cp /ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so
% cp -r /ro/cryptsetup-amd64-2.0.4-6/lib ro/cryptsetup-amd64-2.0.4-6/
% mkdir -p ro/gcc-libs-amd64-8.2.0-3/out/lib64/
% cp /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1
% ln -s /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/cryptsetup-amd64-2.0.4-6/lib
% cp -r /ro/lvm2-amd64-2.03.00-6/lib ro/lvm2-amd64-2.03.00-6/
In busybox, I used the following commands to unlock the root file system:
modprobe algif_skcipher
./cryptsetup luksOpen /dev/sda4 cryptroot1
mount /dev/dm-0 /mnt

Michael Stapelberg: Hermetic packages (in distri)

In distri, packages (e.g. emacs) are hermetic. By hermetic, I mean that the dependencies a package uses (e.g. libusb) don't change, even when newer versions are installed. For example, if package libusb-amd64-1.0.22-7 is available at build time, the package will always use that same version, even after the newer libusb-amd64-1.0.23-8 is installed into the package store. Another way of saying the same thing is: packages in distri are always co-installable. This makes the package store more robust: additions to it will not break the system. On a technical level, the package store is implemented as a directory containing distri SquashFS images and metadata files, into which packages are installed in an atomic way.

Out of scope: plugins are not hermetic by design One exception where hermeticity is not desired is plugin mechanisms: optionally loading out-of-tree code at runtime obviously is not hermetic. As an example, consider glibc's Name Service Switch (NSS) mechanism. Page 29.4.1 Adding another Service to NSS describes how glibc searches $prefix/lib for shared libraries at runtime. Debian ships about a dozen NSS libraries for a variety of purposes, and enterprise setups might add their own into the mix. systemd (as of v245) accounts for 4 NSS libraries, e.g. nss-systemd for user/group name resolution for users allocated through systemd's DynamicUser= option. Having packages be as hermetic as possible remains a worthwhile goal despite any exceptions: I will gladly use a 99% hermetic system over a 0% hermetic system any day. Side note: Xorg's driver model (which can be characterized as a plugin mechanism) does not fall under this category because of its tight API/ABI coupling! For this case, where drivers are only guaranteed to work with precisely the Xorg version for which they were compiled, distri uses per-package exchange directories.

Implementation of hermetic packages in distri On a technical level, the requirement is: all paths used by the program must always result in the same contents. This is implemented in distri via the read-only package store mounted at /ro, e.g. files underneath /ro/emacs-amd64-26.3-15 never change. To change all paths used by a program, in practice, three strategies cover most paths:

ELF interpreter and dynamic libraries Programs on Linux use the ELF file format, which contains two kinds of references: First, the ELF interpreter (PT_INTERP segment), which is used to start the program. For dynamically linked programs on 64-bit systems, this is typically ld.so(8). Many distributions use system-global paths such as /lib64/ld-linux-x86-64.so.2, but distri compiles programs with -Wl,--dynamic-linker=/ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2 so that the full path ends up in the binary. The ELF interpreter is shown by file(1), but you can also use readelf -a $BINARY | grep 'program interpreter' to display it. And secondly, the rpath, a run-time search path for dynamic libraries. Instead of storing full references to all dynamic libraries, we set the rpath so that ld.so(8) will find the correct dynamic libraries. Originally, we used to just set a long rpath, containing one entry for each dynamic library dependency. However, we have since switched to using a single lib subdirectory per package as its rpath, and placing symlinks with full path references into that lib directory, e.g. using -Wl,-rpath=/ro/grep-amd64-3.4-4/lib. This is better for performance, as ld.so uses a per-directory cache. Note that program load times are significantly influenced by how quickly you can locate the dynamic libraries. distri uses a FUSE file system to load programs from, so getting proper -ENOENT caching into place drastically sped up program load times. Instead of compiling software with the -Wl,--dynamic-linker and -Wl,-rpath flags, one can also modify these fields after the fact using patchelf(1). For closed-source programs, this is the only possibility. The rpath can be inspected by using e.g. readelf -a $BINARY | grep RPATH.
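Both references can also be inspected programmatically with Go's debug/elf package; the following small tool prints the interpreter and any rpath/runpath entries, roughly matching the readelf invocations above:

package main

import (
	"debug/elf"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <binary>", os.Args[0])
	}
	f, err := elf.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// PT_INTERP is exposed as the .interp section (a NUL-terminated path).
	if interp := f.Section(".interp"); interp != nil {
		data, err := interp.Data()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("interpreter:", strings.TrimRight(string(data), "\x00"))
	}
	// DT_RPATH/DT_RUNPATH hold the run-time library search path.
	for _, tag := range []elf.DynTag{elf.DT_RPATH, elf.DT_RUNPATH} {
		if vals, err := f.DynString(tag); err == nil && len(vals) > 0 {
			fmt.Printf("%v: %v\n", tag, vals)
		}
	}
}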

Environment variable setup wrapper programs Many programs are influenced by environment variables: to start another program, said program is often found by checking each directory in the PATH environment variable. Such search paths are prevalent in scripting languages, too, to find modules. Python has PYTHONPATH, Perl has PERL5LIB, and so on. To set up these search path environment variables at run time, distri employs an indirection. Instead of e.g. teensy-loader-cli, you run a small wrapper program that calls precisely one execve system call with the desired environment variables. Initially, I used shell scripts as wrapper programs because they are easily inspectable. This turned out to be too slow, so I switched to compiled programs. I'm linking them statically for fast startup, and I'm linking them against musl libc for significantly smaller file sizes than glibc (per-executable overhead adds up quickly in a distribution!). Note that the wrapper programs prepend to the PATH environment variable, they don't replace it in its entirety. This is important so that users have a way to extend the PATH (and other variables) if they so choose. This doesn't hurt hermeticity because it is only relevant for programs that were not present at build time, i.e. plugin mechanisms which, by design, cannot be hermetic.
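For illustration, a minimal sketch of what such a wrapper could look like in its original shell-script form; the dependency entry shown for PATH is illustrative, and distri's actual wrappers are compiled, statically linked programs doing the equivalent in a single execve:
#!/bin/sh
# prepend the bin directories of the package's dependencies (one illustrative
# entry shown), then replace this process with the real binary
PATH=/ro/bash-amd64-5.0-4/bin:$PATH
export PATH
exec /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli "$@"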

Shebang interpreter patching The shebang line of scripts contains a path, too, and hence needs to be changed. We don't do this in distri yet (the number of packaged scripts is small), but we should.
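A sketch of how such shebang patching could be done at package build time; the interpreter package path here is a hypothetical example:
# rewrite the script's first line to a fully qualified, versioned interpreter path
sed -i '1s|^#!/usr/bin/python3$|#!/ro/python3-amd64-3.8.2-1/out/bin/python3|' some-script.py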

Performance requirements The performance improvements in the previous sections are not just nice to have, but practically required when many processes are involved: without them, you'll encounter second-long delays in magit, which spawns many git processes under the covers, or in dracut, which spawns one cp(1) process per file.

Downside: rebuild of packages required to pick up changes Linux distributions such as Debian consider it an advantage to roll out security fixes to the entire system by updating a single shared library package (e.g. openssl). The flip side of that coin is that changes to a single critical package can break the entire system. With hermetic packages, all reverse dependencies must be rebuilt when a library's changes should be picked up by the whole system. E.g., when openssl changes, curl must be rebuilt to pick up the new version of openssl. This approach trades off using more bandwidth and more disk space (temporarily) against reducing the blast radius of any individual package update.

Downside: long env variables are cumbersome to deal with This can be partially mitigated by removing empty directories at build time, which will result in shorter variables. In general, there is no getting around this. One little trick is to use tr : '\n', e.g.:
distri0# echo $PATH
/usr/bin:/bin:/usr/sbin:/sbin:/ro/openssh-amd64-8.2p1-11/out/bin
distri0# echo $PATH | tr : '\n'
/usr/bin
/bin
/usr/sbin
/sbin
/ro/openssh-amd64-8.2p1-11/out/bin

Edge cases The implementation outlined above works well for hundreds of packages; only a small handful exhibited problems of any kind. Here are some issues I encountered:

Issue: accidental ABI breakage in plugin mechanisms NSS libraries built against glibc 2.28 and newer cannot be loaded by glibc 2.27. In all likelihood, such changes do not happen too often, but it does illustrate that glibc's published interface spec is not sufficient for forwards and backwards compatibility. In distri, we could likely use a per-package exchange directory for glibc's NSS mechanism to prevent the above problem from happening in the future.

Issue: wrapper bypass when a program re-executes itself Some programs try to arrange for themselves to be re-executed outside of their current process tree. For example, consider building a program with the meson build system:
  1. When meson first configures the build, it generates ninja files (think Makefiles) which contain command lines that run the meson --internal helper.
  2. Once meson returns, ninja is called as a separate process, so it will not have the environment which the meson wrapper sets up. ninja then runs the previously persisted meson command line. Since the command line uses the full path to meson (not to its wrapper), it bypasses the wrapper.
Luckily, not many programs try to arrange for other process trees to run them. Here is a table summarizing how affected programs might try to arrange for re-execution, whether the technique results in a wrapper bypass, and what we do about it in distri:
technique to execute itself            uses wrapper   mitigation
run-time: find own basename in PATH    yes            wrapper program
compile-time: embed expected path      no; bypass!    configure or patch
run-time: argv[0] or /proc/self/exe    no; bypass!    patch
One might think that setting argv[0] to the wrapper location is a way to side-step this problem. We tried doing this in distri, but had to revert and go the other way.

Misc smaller issues

Appendix: Could other distributions adopt hermetic packages? At a very high level, adopting hermetic packages will require two steps:
  1. Using fully qualified paths whose contents don't change (e.g. /ro/emacs-amd64-26.3-15) generally requires rebuilding programs, e.g. with --prefix set (see the sketch below).
  2. Once you use fully qualified paths you need to make the packages able to exchange data. distri solves this with exchange directories, implemented in the /ro file system which is backed by a FUSE daemon.
The first step is pretty simple, whereas the second step is where I expect controversy around any suggested mechanism.
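For step 1, a minimal sketch of what such a rebuild could look like for an autoconf-based package; the hello package name and version are made-up examples:
./configure --prefix=/ro/hello-amd64-1.0-1/out
make
make install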

Appendix: demo (in distri) This appendix contains commands and their outputs, run on the upcoming distri version supersilverhaze, but verified to work on older versions, too. The /bin directory contains symlinks for the union of all packages' bin subdirectories:
distri0# readlink -f /bin/teensy_loader_cli
/ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
The wrapper program in the bin subdirectory is small:
distri0# ls -lh $(readlink -f /bin/teensy_loader_cli)
-rwxr-xr-x 1 root root 46K Apr 21 21:56 /ro/teensy-loader-cli-amd64-2.1+g20180927-7/bin/teensy_loader_cli
Wrapper programs execute quickly:
distri0# strace -fvy /bin/teensy_loader_cli |& head | cat -n
     1  execve("/bin/teensy_loader_cli", ["/bin/teensy_loader_cli"], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
     2  arch_prctl(ARCH_SET_FS, 0x40c878)       = 0
     3  set_tid_address(0x40ca9c)               = 715
     4  brk(NULL)                               = 0x15b9000
     5  brk(0x15ba000)                          = 0x15ba000
     6  brk(0x15bb000)                          = 0x15bb000
     7  brk(0x15bd000)                          = 0x15bd000
     8  brk(0x15bf000)                          = 0x15bf000
     9  brk(0x15c1000)                          = 0x15c1000
    10  execve("/ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli", ["/ro/teensy-loader-cli-amd64-2.1+"...], ["USER=root", "LOGNAME=root", "HOME=/root", "PATH=/ro/bash-amd64-5.0-4/bin:/r"..., "SHELL=/bin/zsh", "TERM=screen.xterm-256color", "XDG_SESSION_ID=c1", "XDG_RUNTIME_DIR=/run/user/0", "DBUS_SESSION_BUS_ADDRESS=unix:pa"..., "XDG_SESSION_TYPE=tty", "XDG_SESSION_CLASS=user", "SSH_CLIENT=10.0.2.2 42556 22", "SSH_CONNECTION=10.0.2.2 42556 10"..., "SSH_TTY=/dev/pts/0", "SHLVL=1", "PWD=/root", "OLDPWD=/root", "_=/usr/bin/strace", "LD_LIBRARY_PATH=/ro/bash-amd64-5"..., "PERL5LIB=/ro/bash-amd64-5.0-4/ou"..., "PYTHONPATH=/ro/bash-amd64-5.0-4/"...]) = 0
Confirm which ELF interpreter is set for a binary using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep 'program interpreter'
[Requesting program interpreter: /ro/glibc-amd64-2.31-4/out/lib/ld-linux-x86-64.so.2]
Confirm the rpath is set to the package's lib subdirectory using readelf(1):
distri0# readelf -a /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli | grep RPATH
 0x000000000000000f (RPATH)              Library rpath: [/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib]
and verify the lib subdirectory has the expected symlinks and target versions:
distri0# find /ro/teensy-loader-cli-amd64-*/lib -type f -printf '%P -> %l\n'
libc.so.6 -> /ro/glibc-amd64-2.31-4/out/lib/libc-2.31.so
libpthread.so.0 -> /ro/glibc-amd64-2.31-4/out/lib/libpthread-2.31.so
librt.so.1 -> /ro/glibc-amd64-2.31-4/out/lib/librt-2.31.so
libudev.so.1 -> /ro/libudev-amd64-245-11/out/lib/libudev.so.1.6.17
libusb-0.1.so.4 -> /ro/libusb-compat-amd64-0.1.5-7/out/lib/libusb-0.1.so.4.4.4
libusb-1.0.so.0 -> /ro/libusb-amd64-1.0.23-8/out/lib/libusb-1.0.so.0.2.0
To verify the correct libraries are actually loaded, you can set the LD_DEBUG environment variable for ld.so(8):
distri0# LD_DEBUG=libs teensy_loader_cli
[ ]
       678:     find library=libc.so.6 [0]; searching
       678:      search path=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib            (RPATH from file /ro/teensy-loader-cli-amd64-2.1+g20180927-7/out/bin/teensy_loader_cli)
       678:       trying file=/ro/teensy-loader-cli-amd64-2.1+g20180927-7/lib/libc.so.6
       678:
[ ]
NSS libraries that distri ships:
find /lib/ -name "libnss_*.so.2" -type f -printf '%P -> %l\n'
libnss_myhostname.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_myhostname.so.2
libnss_mymachines.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_mymachines.so.2
libnss_resolve.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_resolve.so.2
libnss_systemd.so.2 -> ../systemd-amd64-245-11/out/lib/libnss_systemd.so.2
libnss_compat.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_compat.so.2
libnss_db.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_db.so.2
libnss_dns.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_dns.so.2
libnss_files.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_files.so.2
libnss_hesiod.so.2 -> ../glibc-amd64-2.31-4/out/lib/libnss_hesiod.so.2

16 May 2020

Michael Stapelberg: a new distri linux (fast package management) release

I just released a new version of distri. The focus of this release lies on: See the release notes for more details. The distri research linux distribution project was started in 2019 to research whether a few architectural changes could enable drastically faster package management. While the package managers in common Linux distributions (e.g. apt, dnf, …) top out at data rates of only a few MB/s, distri effortlessly saturates 1 Gbit, 10 Gbit and even 40 Gbit connections, resulting in fast installation and update speeds.

22 October 2017

Michael Stapelberg: Which VCS do Debian's Go package upstreams use?

In the pkg-go team, we are currently discussing which workflows we should standardize on. One of the considerations is what goes into the upstream Git branch of our repositories: should it track the upstream Git repository, or should it contain orig tarball imports? Now, tracking the upstream Git repository only works if upstream actually uses Git. The go tool, which is widely used within the Go community for managing Go packages, supports Git, Mercurial, Bazaar and Subversion. But which of these are actually used in practice? Let's find out! Option 1: If you have the sources lists of all suites locally anyway
/usr/lib/apt/apt-helper cat-file \
  $(apt-get indextargets --format '$(FILENAME)' 'ShortDesc: Sources' 'Origin: Debian') \
    | sed -n 's,Go-Import-Path: ,,gp' \
    | sort -u
Option 2: If you prefer to use a relational database over text files This is the harder option, but also the more complete one. First, we'll need the Go package import paths of all Go packages which are in Debian. We can get them from the ProjectB database, Debian's main PostgreSQL database containing all of the state about the Debian archive. Unfortunately, only Debian Developers have SSH access to a mirror of ProjectB at the moment. I contacted DSA to ask about providing public ProjectB access.
  ssh mirror.ftp-master.debian.org "echo \"SELECT value FROM source_metadata \
  LEFT JOIN metadata_keys ON (source_metadata.key_id = metadata_keys.key_id) \
  WHERE metadata_keys.key = 'Go-Import-Path' GROUP BY value\" | \
    psql -A -t service=projectb" > go_import_path.txt
I uploaded a copy of the resulting go_import_path.txt, if you're curious. Now, let's come up with a little bit of Go to print the VCS responsible for each specified Go import path:
go get -u golang.org/x/tools/go/vcs
cat >vcs4.go <<'EOT'
package main
import (
	"fmt"
	"log"
	"os"
	"sync"
	"golang.org/x/tools/go/vcs"
)
func main() {
	var wg sync.WaitGroup
	for _, arg := range os.Args[1:] {
		wg.Add(1)
		go func(arg string) {
			defer wg.Done()
			rr, err := vcs.RepoRootForImportPath(arg, false)
			if err != nil {
				log.Println(err)
				return
			}
			fmt.Println(rr.VCS.Name)
		}(arg)
	}
	wg.Wait()
}
EOT
Lastly, run it in combination with uniq(1) to discover:
go run vcs4.go $(tr '\n' ' ' < go_import_path.txt) | sort | uniq -c
    760 Git
      1 Mercurial

21 October 2017

Michael Stapelberg: pk4: a new tool to avail the Debian source package producing the specified package

UNIX distributions used to come with the system source code in /usr/src. This is a concept which fascinates me: if you want to change something in any part of your system, just make your change in the corresponding directory, recompile, reinstall, and you can immediately see your changes in action. So, I decided I wanted to build a tool which can give you that impression, without the downsides of additional disk space usage and slower update times because of /usr/src maintenance. The result of this effort is a tool called pk4 (mnemonic: get me the package for ) which I just uploaded to Debian. What distinguishes this tool from an apt source call is the combination of a number of features: If you don't want to wait for the package to clear the NEW queue, you can get it from here in the meantime:
wget https://people.debian.org/~stapelberg/pk4/pk4_1_amd64.deb
sudo apt install ./pk4_1_amd64.deb
You can find the sources and issue tracker at https://github.com/Debian/pk4. Here is a terminal screencast of the tool in action, availing the sources of i3, applying a patch, rebuilding the package and replacing the installed binary packages:

8 October 2017

Michael Stapelberg: Debian stretch on the Raspberry Pi 3 (update)

I previously wrote about my Debian stretch preview image for the Raspberry Pi 3. Now, I'm publishing an updated version, containing the following changes: A couple of issues remain, notably the lack of WiFi and Bluetooth support (see wiki:RaspberryPi3 for details). Any help with fixing these issues is very welcome! As a preview version (i.e. unofficial, unsupported, etc.) until all the necessary bits and pieces are in place to build images in a proper place in Debian, I built and uploaded the resulting image. Find it at https://people.debian.org/~stapelberg/raspberrypi3/2017-10-08/. To install the image, insert the SD card into your computer (I'm assuming it's available as /dev/sdb) and copy the image onto it:
$ wget https://people.debian.org/~stapelberg/raspberrypi3/2017-10-08/2017-10-08-raspberry-pi-3-buster-PREVIEW.img.bz2
$ bunzip2 2017-10-08-raspberry-pi-3-buster-PREVIEW.img.bz2
$ sudo dd if=2017-10-08-raspberry-pi-3-buster-PREVIEW.img of=/dev/sdb bs=5M
If resolving client-supplied DHCP hostnames works in your network, you should be able to log into the Raspberry Pi 3 using SSH after booting it:
$ ssh root@rpi3
# Password is 'raspberry'

2 August 2017

Jonathan Dowland: Debian on the Raspberry Pi3

Back in November, Michael Stapelberg blogged about running (pure) Debian on the Raspberry Pi 3. This is pretty exciting because Raspbian still provides only 32-bit packages, so this means you can run a true ARM64 OS on the Pi. Unfortunately, one of the major missing pieces with Debian on the Pi3 at this time is working video support. A helpful person known as "SandPox" wrote to me in June to explain that they had working video for a custom kernel build on top of pure Debian on the Pi, and they achieved this simply by enabling CONFIG_FB_SIMPLE in the kernel configuration. On request, this has since been enabled for official Debian kernel builds. Michael and I explored this and eventually figured out that this does work when building the kernel using the upstream build instructions, but it doesn't work when building using the Debian kernel package's build instructions. I've since run out of time to look at this more, so I wrote to request help from the debian-kernel mailing list; alas, nobody has replied yet. I've put up the dmesg.txt for a boot with the failing kernel, which might offer some clues. Can anyone help figure out what's wrong? Thanks to Michael for driving efforts for Debian on the Pi, and to SandPox for getting in touch to make their first contribution to Debian. Thanks also to Daniel Silverstone who loaned me an ARM64 VM (from Scaleway) upon which I performed some of my kernel builds.

22 July 2017

Niels Thykier: Improving bulk performance in debhelper

Since debhelper/10.3, there have been a number of performance-related changes. The vast majority primarily improve bulk performance or only have visible effects at larger input sizes. Most visible cases are: For debhelper, this mostly involved: How to take advantage of these improvements in tools that use Dh_Lib: Credits: I would like to thank the following for reporting performance issues, regressions and/or providing patches. The list is in no particular order: Should I have missed your contribution, please do not hesitate to let me know.
Filed under: Debhelper, Debian

