Search Results: "Colin Watson"

13 April 2024

Simon Josefsson: Reproducible and minimal source-only tarballs

With the release of Libntlm version 1.8 the release tarball can be reproduced on several distributions. We also publish a signed minimal source-only tarball, produced by git-archive which is the same format used by Savannah, Codeberg, GitLab, GitHub and others. Reproducibility of both tarballs are tested continuously for regressions on GitLab through a CI/CD pipeline. If that wasn t enough to excite you, the Debian packages of Libntlm are now built from the reproducible minimal source-only tarball. The resulting binaries are reproducible on several architectures. What does that even mean? Why should you care? How you can do the same for your project? What are the open issues? Read on, dear reader This article describes my practical experiments with reproducible release artifacts, following up on my earlier thoughts that lead to discussion on Fosstodon and a patch by Janneke Nieuwenhuizen to make Guix tarballs reproducible that inspired me to some practical work. Let s look at how a maintainer release some software, and how a user can reproduce the released artifacts from the source code. Libntlm provides a shared library written in C and uses GNU Make, GNU Autoconf, GNU Automake, GNU Libtool and gnulib for build management, but these ideas should apply to most project and build system. The following illustrate the steps a maintainer would take to prepare a release:
git clone https://gitlab.com/gsasl/libntlm.git
cd libntlm
git checkout v1.8
./bootstrap
./configure
make distcheck
gpg -b libntlm-1.8.tar.gz
The generated files libntlm-1.8.tar.gz and libntlm-1.8.tar.gz.sig are published, and users download and use them. This is how the GNU project have been doing releases since the late 1980 s. That is a testament to how successful this pattern has been! These tarballs contain source code and some generated files, typically shell scripts generated by autoconf, makefile templates generated by automake, documentation in formats like Info, HTML, or PDF. Rarely do they contain binary object code, but historically that happened. The XZUtils incident illustrate that tarballs with files that are not included in the git archive offer an opportunity to disguise malicious backdoors. I blogged earlier how to mitigate this risk by using signed minimal source-only tarballs. The risk of hiding malware is not the only motivation to publish signed minimal source-only tarballs. With pre-generated content in tarballs, there is a risk that GNU/Linux distributions such as Trisquel, Guix, Debian/Ubuntu or Fedora ship generated files coming from the tarball into the binary *.deb or *.rpm package file. Typically the person packaging the upstream project never realized that some installed artifacts was not re-built through a typical autoconf -fi && ./configure && make install sequence, and never wrote the code to rebuild everything. This can also happen if the build rules are written but are buggy, shipping the old artifact. When a security problem is found, this can lead to time-consuming situations, as it may be that patching the relevant source code and rebuilding the package is not sufficient: the vulnerable generated object from the tarball would be shipped into the binary package instead of a rebuilt artifact. For architecture-specific binaries this rarely happens, since object code is usually not included in tarballs although for 10+ years I shipped the binary Java JAR file in the GNU Libidn release tarball, until I stopped shipping it. For interpreted languages and especially for generated content such as HTML, PDF, shell scripts this happens more than you would like. Publishing minimal source-only tarballs enable easier auditing of a project s code, to avoid the need to read through all generated files looking for malicious content. I have taken care to generate the source-only minimal tarball using git-archive. This is the same format that GitLab, GitHub etc offer for the automated download links on git tags. The minimal source-only tarballs can thus serve as a way to audit GitLab and GitHub download material! Consider if/when hosting sites like GitLab or GitHub has a security incident that cause generated tarballs to include a backdoor that is not present in the git repository. If people rely on the tag download artifact without verifying the maintainer PGP signature using GnuPG, this can lead to similar backdoor scenarios that we had for XZUtils but originated with the hosting provider instead of the release manager. This is even more concerning, since this attack can be mounted for some selected IP address that you want to target and not on everyone, thereby making it harder to discover. With all that discussion and rationale out of the way, let s return to the release process. I have added another step here:
make srcdist
gpg -b libntlm-1.8-src.tar.gz
Now the release is ready. I publish these four files in the Libntlm s Savannah Download area, but they can be uploaded to a GitLab/GitHub release area as well. These are the SHA256 checksums I got after building the tarballs on my Trisquel 11 aramo laptop:
91de864224913b9493c7a6cec2890e6eded3610d34c3d983132823de348ec2ca  libntlm-1.8-src.tar.gz
ce6569a47a21173ba69c990965f73eb82d9a093eb871f935ab64ee13df47fda1  libntlm-1.8.tar.gz
So how can you reproduce my artifacts? Here is how to reproduce them in a Ubuntu 22.04 container:
podman run -it --rm ubuntu:22.04
apt-get update
apt-get install -y --no-install-recommends autoconf automake libtool make git ca-certificates
git clone https://gitlab.com/gsasl/libntlm.git
cd libntlm
git checkout v1.8
./bootstrap
./configure
make dist srcdist
sha256sum libntlm-*.tar.gz
You should see the exact same SHA256 checksum values. Hooray! This works because Trisquel 11 and Ubuntu 22.04 uses the same version of git, autoconf, automake, and libtool. These tools do not guarantee the same output content for all versions, similar to how GNU GCC does not generate the same binary output for all versions. So there is still some delicate version pairing needed. Ideally, the artifacts should be possible to reproduce from the release artifacts themselves, and not only directly from git. It is possible to reproduce the full tarball in a AlmaLinux 8 container replace almalinux:8 with rockylinux:8 if you prefer RockyLinux:
podman run -it --rm almalinux:8
dnf update -y
dnf install -y make wget gcc
wget https://download.savannah.nongnu.org/releases/libntlm/libntlm-1.8.tar.gz
tar xfa libntlm-1.8.tar.gz
cd libntlm-1.8
./configure
make dist
sha256sum libntlm-1.8.tar.gz
The source-only minimal tarball can be regenerated on Debian 11:
podman run -it --rm debian:11
apt-get update
apt-get install -y --no-install-recommends make git ca-certificates
git clone https://gitlab.com/gsasl/libntlm.git
cd libntlm
git checkout v1.8
make -f cfg.mk srcdist
sha256sum libntlm-1.8-src.tar.gz 
As the Magnus Opus or chef-d uvre, let s recreate the full tarball directly from the minimal source-only tarball on Trisquel 11 replace docker.io/kpengboy/trisquel:11.0 with ubuntu:22.04 if you prefer.
podman run -it --rm docker.io/kpengboy/trisquel:11.0
apt-get update
apt-get install -y --no-install-recommends autoconf automake libtool make wget git ca-certificates
wget https://download.savannah.nongnu.org/releases/libntlm/libntlm-1.8-src.tar.gz
tar xfa libntlm-1.8-src.tar.gz
cd libntlm-v1.8
./bootstrap
./configure
make dist
sha256sum libntlm-1.8.tar.gz
Yay! You should now have great confidence in that the release artifacts correspond to what s in version control and also to what the maintainer intended to release. Your remaining job is to audit the source code for vulnerabilities, including the source code of the dependencies used in the build. You no longer have to worry about auditing the release artifacts. I find it somewhat amusing that the build infrastructure for Libntlm is now in a significantly better place than the code itself. Libntlm is written in old C style with plenty of string manipulation and uses broken cryptographic algorithms such as MD4 and single-DES. Remember folks: solving supply chain security issues has no bearing on what kind of code you eventually run. A clean gun can still shoot you in the foot. Side note on naming: GitLab exports tarballs with pathnames libntlm-v1.8/ (i.e.., PROJECT-TAG/) and I ve adopted the same pathnames, which means my libntlm-1.8-src.tar.gz tarballs are bit-by-bit identical to GitLab s exports and you can verify this with tools like diffoscope. GitLab name the tarball libntlm-v1.8.tar.gz (i.e., PROJECT-TAG.ARCHIVE) which I find too similar to the libntlm-1.8.tar.gz that we also publish. GitHub uses the same git archive style, but unfortunately they have logic that removes the v in the pathname so you will get a tarball with pathname libntlm-1.8/ instead of libntlm-v1.8/ that GitLab and I use. The content of the tarball is bit-by-bit identical, but the pathname and archive differs. Codeberg (running Forgejo) uses another approach: the tarball is called libntlm-v1.8.tar.gz (after the tag) just like GitLab, but the pathname inside the archive is libntlm/, otherwise the produced archive is bit-by-bit identical including timestamps. Savannah s CGIT interface uses archive name libntlm-1.8.tar.gz with pathname libntlm-1.8/, but otherwise file content is identical. Savannah s GitWeb interface provides snapshot links that are named after the git commit (e.g., libntlm-a812c2ca.tar.gz with libntlm-a812c2ca/) and I cannot find any tag-based download links at all. Overall, we are so close to get SHA256 checksum to match, but fail on pathname within the archive. I ve chosen to be compatible with GitLab regarding the content of tarballs but not on archive naming. From a simplicity point of view, it would be nice if everyone used PROJECT-TAG.ARCHIVE for the archive filename and PROJECT-TAG/ for the pathname within the archive. This aspect will probably need more discussion. Side note on git archive output: It seems different versions of git archive produce different results for the same repository. The version of git in Debian 11, Trisquel 11 and Ubuntu 22.04 behave the same. The version of git in Debian 12, AlmaLinux/RockyLinux 8/9, Alpine, ArchLinux, macOS homebrew, and upcoming Ubuntu 24.04 behave in another way. Hopefully this will not change that often, but this would invalidate reproducibility of these tarballs in the future, forcing you to use an old git release to reproduce the source-only tarball. Alas, GitLab and most other sites appears to be using modern git so the download tarballs from them would not match my tarballs even though the content would. Side note on ChangeLog: ChangeLog files were traditionally manually curated files with version history for a package. In recent years, several projects moved to dynamically generate them from git history (using tools like git2cl or gitlog-to-changelog). This has consequences for reproducibility of tarballs: you need to have the entire git history available! The gitlog-to-changelog tool also output different outputs depending on the time zone of the person using it, which arguable is a simple bug that can be fixed. However this entire approach is incompatible with rebuilding the full tarball from the minimal source-only tarball. It seems Libntlm s ChangeLog file died on the surgery table here. So how would a distribution build these minimal source-only tarballs? I happen to help on the libntlm package in Debian. It has historically used the generated tarballs as the source code to build from. This means that code coming from gnulib is vendored in the tarball. When a security problem is discovered in gnulib code, the security team needs to patch all packages that include that vendored code and rebuild them, instead of merely patching the gnulib package and rebuild all packages that rely on that particular code. To change this, the Debian libntlm package needs to Build-Depends on Debian s gnulib package. But there was one problem: similar to most projects that use gnulib, Libntlm depend on a particular git commit of gnulib, and Debian only ship one commit. There is no coordination about which commit to use. I have adopted gnulib in Debian, and add a git bundle to the *_all.deb binary package so that projects that rely on gnulib can pick whatever commit they need. This allow an no-network GNULIB_URL and GNULIB_REVISION approach when running Libntlm s ./bootstrap with the Debian gnulib package installed. Otherwise libntlm would pick up whatever latest version of gnulib that Debian happened to have in the gnulib package, which is not what the Libntlm maintainer intended to be used, and can lead to all sorts of version mismatches (and consequently security problems) over time. Libntlm in Debian is developed and tested on Salsa and there is continuous integration testing of it as well, thanks to the Salsa CI team. Side note on git bundles: unfortunately there appears to be no reproducible way to export a git repository into one or more files. So one unfortunate consequence of all this work is that the gnulib *.orig.tar.gz tarball in Debian is not reproducible any more. I have tried to get Git bundles to be reproducible but I never got it to work see my notes in gnulib s debian/README.source on this aspect. Of course, source tarball reproducibility has nothing to do with binary reproducibility of gnulib in Debian itself, fortunately. One open question is how to deal with the increased build dependencies that is triggered by this approach. Some people are surprised by this but I don t see how to get around it: if you depend on source code for tools in another package to build your package, it is a bad idea to hide that dependency. We ve done it for a long time through vendored code in non-minimal tarballs. Libntlm isn t the most critical project from a bootstrapping perspective, so adding git and gnulib as Build-Depends to it will probably be fine. However, consider if this pattern was used for other packages that uses gnulib such as coreutils, gzip, tar, bison etc (all are using gnulib) then they would all Build-Depends on git and gnulib. Cross-building those packages for a new architecture will therefor require git on that architecture first, which gets circular quick. The dependency on gnulib is real so I don t see that going away, and gnulib is a Architecture:all package. However, the dependency on git is merely a consequence of how the Debian gnulib package chose to make all gnulib git commits available to projects: through a git bundle. There are other ways to do this that doesn t require the git tool to extract the necessary files, but none that I found practical ideas welcome! Finally some brief notes on how this was implemented. Enabling bootstrappable source-only minimal tarballs via gnulib s ./bootstrap is achieved by using the GNULIB_REVISION mechanism, locking down the gnulib commit used. I have always disliked git submodules because they add extra steps and has complicated interaction with CI/CD. The reason why I gave up git submodules now is because the particular commit to use is not recorded in the git archive output when git submodules is used. So the particular gnulib commit has to be mentioned explicitly in some source code that goes into the git archive tarball. Colin Watson added the GNULIB_REVISION approach to ./bootstrap back in 2018, and now it no longer made sense to continue to use a gnulib git submodule. One alternative is to use ./bootstrap with --gnulib-srcdir or --gnulib-refdir if there is some practical problem with the GNULIB_URL towards a git bundle the GNULIB_REVISION in bootstrap.conf. The srcdist make rule is simple:
git archive --prefix=libntlm-v1.8/ -o libntlm-v1.8.tar.gz HEAD
Making the make dist generated tarball reproducible can be more complicated, however for Libntlm it was sufficient to make sure the modification times of all files were set deterministically to the timestamp of the last commit in the git repository. Interestingly there seems to be a couple of different ways to accomplish this, Guix doesn t support minimal source-only tarballs but rely on a .tarball-timestamp file inside the tarball. Paul Eggert explained what TZDB is using some time ago. The approach I m using now is fairly similar to the one I suggested over a year ago. If there are problems because all files in the tarball now use the same modification time, there is a solution by Bruno Haible that could be implemented. Side note on git tags: Some people may wonder why not verify a signed git tag instead of verifying a signed tarball of the git archive. Currently most git repositories uses SHA-1 for git commit identities, but SHA-1 is not a secure hash function. While current SHA-1 attacks can be detected and mitigated, there are fundamental doubts that a git SHA-1 commit identity uniquely refers to the same content that was intended. Verifying a git tag will never offer the same assurance, since a git tag can be moved or re-signed at any time. Verifying a git commit is better but then we need to trust SHA-1. Migrating git to SHA-256 would resolve this aspect, but most hosting sites such as GitLab and GitHub does not support this yet. There are other advantages to using signed tarballs instead of signed git commits or git tags as well, e.g., tar.gz can be a deterministically reproducible persistent stable offline storage format but .git sub-directory trees or git bundles do not offer this property. Doing continous testing of all this is critical to make sure things don t regress. Libntlm s pipeline definition now produce the generated libntlm-*.tar.gz tarballs and a checksum as a build artifact. Then I added the 000-reproducability job which compares the checksums and fails on mismatches. You can read its delicate output in the job for the v1.8 release. Right now we insists that builds on Trisquel 11 match Ubuntu 22.04, that PureOS 10 builds match Debian 11 builds, that AlmaLinux 8 builds match RockyLinux 8 builds, and AlmaLinux 9 builds match RockyLinux 9 builds. As you can see in pipeline job output, not all platforms lead to the same tarballs, but hopefully this state can be improved over time. There is also partial reproducibility, where the full tarball is reproducible across two distributions but not the minimal tarball, or vice versa. If this way of working plays out well, I hope to implement it in other projects too. What do you think? Happy Hacking!

12 April 2024

Freexian Collaborators: Debian Contributions: SSO Authentication for jitsi.debian.social, /usr-move updates, and more! (by Utkarsh Gupta)

Contributing to Debian is part of Freexian s mission. This article covers the latest achievements of Freexian and their collaborators. All of this is made possible by organizations subscribing to our Long Term Support contracts and consulting services. P.S. We ve completed over a year of writing these blogs. If you have any suggestions on how to make them better or what you d like us to cover, or any other opinions/reviews you might have, et al, please let us know by dropping an email to us. We d be happy to hear your thoughts. :)

SSO Authentication for jitsi.debian.social, by Stefano Rivera Debian.social s jitsi instance has been getting some abuse by (non-Debian) people sharing sexually explicit content on the service. After playing whack-a-mole with this for a month, and shutting the instance off for another month, we opened it up again and the abuse immediately re-started. Stefano sat down and wrote an SSO Implementation that hooks into Jitsi s existing JWT SSO support. This requires everyone using jitsi.debian.social to have a Salsa account. With only a little bit of effort, we could change this in future, to only require an account to open a room, and allow guests to join the call.

/usr-move, by Helmut Grohne The biggest task this month was sending mitigation patches for all of the /usr-move issues arising from package renames due to the 2038 transition. As a result, we can now say that every affected package in unstable can either be converted with dh-sequence-movetousr or has an open bug report. The package set relevant to debootstrap except for the set that has to be uploaded concurrently has been moved to /usr and is awaiting migration. The move of coreutils happened to affect piuparts which hard codes the location of /bin/sync and received multiple updates as a result.

Miscellaneous contributions
  • Stefano Rivera uploaded a stable release update to python3.11 for bookworm, fixing a use-after-free crash.
  • Stefano uploaded a new version of python-html2text, and updated python3-defaults to build with it.
  • In support of Python 3.12, Stefano dropped distutils as a Build-Dependency from a few packages, and uploaded a complex set of patches to python-mitogen.
  • Stefano landed some merge requests to clean up dead code in dh-python, removed the flit plugin, and uploaded it.
  • Stefano uploaded new upstream versions of twisted, hatchling, python-flexmock, python-authlib, python mitogen, python-pipx, and xonsh.
  • Stefano requested removal of a few packages supporting the Opsis HDMI2USB hardware that DebConf Video team used to use for HDMI capture, as they are not being maintained upstream. They started to FTBFS, with recent sdcc changes.
  • DebConf 24 is getting ready to open registration, Stefano spent some time fixing bugs in the website, caused by infrastructure updates.
  • Stefano reviewed all the DebConf 23 travel reimbursements, filing requests for more information from SPI where our records mismatched.
  • Stefano spun up a Wafer website for the Berlin 2024 mini DebConf.
  • Roberto C. S nchez worked on facilitating the transfer of upstream maintenance responsibility for the dormant Shorewall project to a new team led by the current maintainer of the Shorewall packages in Debian.
  • Colin Watson fixed build failures in celery-haystack-ng, db1-compat, jsonpickle, libsdl-perl, kali, knews, openssh-ssh1, python-json-log-formatter, python-typing-extensions, trn4, vigor, and wcwidth. Some of these were related to the 64-bit time_t transition, since that involved enabling -Werror=implicit-function-declaration.
  • Colin fixed an off-by-one error in neovim, which was already causing a build failure in Ubuntu and would eventually have caused a build failure in Debian with stricter toolchain settings.
  • Colin added an sshd@.service template to openssh to help newer systemd versions make containers and VMs SSH-accessible over AF_VSOCK sockets.
  • Following the xz-utils backdoor, Colin spent some time testing and discussing OpenSSH upstream s proposed inline systemd notification patch, since the current implementation via libsystemd was part of the attack vector used by that backdoor.
  • Utkarsh reviewed and sponsored some Go packages for Lena Voytek and Rajudev.
  • Utkarsh also helped Mitchell Dzurick with the adoption of pyparted package.
  • Helmut sent 10 patches for cross build failures.
  • Helmut partially fixed architecture cross bootstrap tooling to deal with changes in linux-libc-dev and the recent gcc-for-host changes and also fixed a 64bit-time_t FTBFS in libtextwrap.
  • Thorsten Alteholz uploaded several packages from debian-printing: cjet, lprng, rlpr and epson-inkjet-printer-escpr were affected by the newly enabled compiler switch -Werror=implicit-function-declaration. Besides fixing these serious bugs, Thorsten also worked on other bugs and could fix one or the other.
  • Carles updated simplemonitor and python-ring-doorbell packages with new upstream versions.
  • Santiago is still working on the Salsa CI MRs to adapt the build jobs so they can rely on sbuild. Current work includes adapting the images used by the build job, implementing the basic sbuild support the related jobs, and adjusting the support for experimental and *-backports releases..
    Additionally, Santiago reviewed some MR such as Make timeout action explicit in the logs and the subsequent Implement conditional timeout verbosity, and the batch of MRs included in https://salsa.debian.org/salsa-ci-team/pipeline/-/merge_requests/482.
  • Santiago also reviewed applications for the improving Salsa CI in Debian GSoC 2024 project. We received applications from four very talented candidates. The selection process is currently ongoing. A huge thanks to all of them!
  • As part of the DebConf 24 organization, Santiago has taken part in the Content team discussions.

1 April 2024

Colin Watson: Free software activity in March 2024

My Debian contributions this month were all sponsored by Freexian.

19 March 2024

Colin Watson: apt install everything?

On Mastodon, the question came up of how Ubuntu would deal with something like the npm install everything situation. I replied:
Ubuntu is curated, so it probably wouldn t get this far. If it did, then the worst case is that it would get in the way of CI allowing other packages to be removed (again from a curated system, so people are used to removal not being self-service); but the release team would have no hesitation in removing a package like this to fix that, and it certainly wouldn t cause this amount of angst. If you did this in a PPA, then I can t think of any particular negative effects.
OK, if you added lots of build-dependencies (as well as run-time dependencies) then you might be able to take out a builder. But Launchpad builders already run arbitrary user-submitted code by design and are therefore very carefully sandboxed and treated as ephemeral, so this is hardly novel. There s a lot to be said for the arrangement of having a curated system for the stuff people actually care about plus an ecosystem of add-on repositories. PPAs cover a wide range of levels of developer activity, from throwaway experiments to quasi-official distribution methods; there are certainly problems that arise from it being difficult to tell the difference between those extremes and from there being no systematic confinement, but for this particular kind of problem they re very nearly ideal. (Canonical has tried various other approaches to software distribution, and while they address some of the problems, they aren t obviously better at helping people make reliable social judgements about code they don t know.) For a hypothetical package with a huge number of dependencies, to even try to upload it directly to Ubuntu you d need to be an Ubuntu developer with upload rights (or to go via Debian, where you d have to clear a similar hurdle). If you have those, then the first upload has to pass manual review by an archive administrator. If your package passes that, then it still has to build and get through proposed-migration CI before it reaches anything that humans typically care about. On the other hand, if you were inclined to try this sort of experiment, you d almost certainly try it in a PPA, and that would trouble nobody but yourself.

13 March 2024

Freexian Collaborators: Debian Contributions: Upcoming Improvements to Salsa CI, /usr-move, packaging simplemonitor, and more! (by Utkarsh Gupta)

Contributing to Debian is part of Freexian s mission. This article covers the latest achievements of Freexian and their collaborators. All of this is made possible by organizations subscribing to our Long Term Support contracts and consulting services.

/usr-move, by Helmut Grohne Much of the work was spent on handling interaction with time time64 transition and sending patches for mitigating fallout. The set of packages relevant to debootstrap is mostly converted and the patches for glibc and base-files have been refined due to feedback from the upload to Ubuntu noble. Beyond this, he sent patches for all remaining packages that cannot move their files with dh-sequence-movetousr and packages using dpkg-divert in ways that dumat would not recognize.

Upcoming improvements to Salsa CI, by Santiago Ruano Rinc n Last month, Santiago Ruano Rinc n started the work on integrating sbuild into the Salsa CI pipeline. Initially, Santiago used sbuild with the unshare chroot mode. However, after discussion with josch, jochensp and helmut (thanks to them!), it turns out that the unshare mode is not the most suitable for the pipeline, since the level of isolation it provides is not needed, and some test suites would fail (eg: krb5). Additionally, one of the requirements of the build job is the use of ccache, since it is needed by some C/C++ large projects to reduce the compilation time. In the preliminary work with unshare last month, it was not possible to make ccache to work. Finally, Santiago changed the chroot mode, and now has a couple of POC (cf: 1 and 2) that rely on the schroot and sudo, respectively. And the good news is that ccache is successfully used by sbuild with schroot! The image here comes from an example of building grep. At the end of the build, ccache -s shows the statistics of the cache that it used, and so a little more than half of the calls of that job were cacheable. The most important pieces are in place to finish the integration of sbuild into the pipeline. Other than that, Santiago also reviewed the very useful merge request !346, made by IOhannes zm lnig to autodetect the release from debian/changelog. As agreed with IOhannes, Santiago is preparing a merge request to include the release autodetection use case in the very own Salsa CI s CI.

Packaging simplemonitor, by Carles Pina i Estany Carles started using simplemonitor in 2017, opened a WNPP bug in 2022 and started packaging simplemonitor dependencies in October 2023. After packaging five direct and indirect dependencies, Carles finally uploaded simplemonitor to unstable in February. During the packaging of simplemonitor, Carles reported a few issues to upstream. Some of these were to make the simplemonitor package build and run tests reproducibly. A reproducibility issue was reprotest overriding the timezone, which broke simplemonitor s tests. There have been discussions on resolving this upstream in simplemonitor and in reprotest, too. Carles also started upgrading or improving some of simplemonitor s dependencies.

Miscellaneous contributions
  • Stefano Rivera spent some time doing admin on debian.social infrastructure. Including dealing with a spike of abuse on the Jitsi server.
  • Stefano started to prepare a new release of dh-python, including cleaning out a lot of old Python 2.x related code. Thanks to Niels Thykier (outside Freexian) for spear-heading this work.
  • DebConf 24 planning is beginning. Stefano discussed venues and finances with the local team and remotely supported a site-visit by Nattie (outside Freexian).
  • Also in the DebConf 24 context, Santiago took part in discussions and preparations related to the Content Team.
  • A JIT bug was reported against pypy3 in Debian Bookworm. Stefano bisected the upstream history to find the patch (it was already resolved upstream) and released an update to pypy3 in bookworm.
  • Enrico participated in /usr-merge discussions with Helmut.
  • Colin Watson backported a python-channels-redis fix to bookworm, rediscovered while working on debusine.
  • Colin dug into a cluster of celery build failures and tracked the hardest bit down to a Python 3.12 regression, now fixed in unstable. celery should be back in testing once the 64-bit time_t migration is out of the way.
  • Thorsten Alteholz uploaded a new upstream version of cpdb-libs. Unfortunately upstream changed the naming of their release tags, so updating the watch file was a bit demanding. Anyway this version 2.0 is a huge step towards introduction of the new Common Print Dialog Backends.
  • Helmut send patches for 48 cross build failures.
  • Helmut changed debvm to use mkfs.ext4 instead of genext2fs.
  • Helmut sent a debci MR for improving collector robustness.
  • In preparation for DebConf 25, Santiago worked on the Brest Bid.

4 March 2024

Colin Watson: Free software activity in January/February 2024

Two months into my new gig and it s going great! Tracking my time has taken a bit of getting used to, but having something that amounts to a queryable database of everything I ve done has also allowed some helpful introspection. Freexian sponsors up to 20% of my time on Debian tasks of my choice. In fact I ve been spending the bulk of my time on debusine which is itself intended to accelerate work on Debian, but more details on that later. While I contribute to Freexian s summaries now, I ve also decided to start writing monthly posts about my free software activity as many others do, to get into some more detail. January 2024 February 2024

11 February 2024

Freexian Collaborators: Debian Contributions: Upcoming Improvements to Salsa CI, /usr-move, and more! (by Utkarsh Gupta)

Contributing to Debian is part of Freexian s mission. This article covers the latest achievements of Freexian and their collaborators. All of this is made possible by organizations subscribing to our Long Term Support contracts and consulting services.

Upcoming Improvements to Salsa CI, by Santiago Ruano Rinc n Santiago started picking up the work made by Outreachy Intern, Enock Kashada (a big thanks to him!), to solve some long-standing issues in Salsa CI. Currently, the first job in a Salsa CI pipeline is the extract-source job, used to produce a debianize source tree of the project. This job was introduced to make it possible to build the projects on different architectures, on the subsequent build jobs. However, that extract-source approach is sub-optimal: not only it increases the execution time of the pipeline by some minutes, but also projects whose source tree is too large are not able to use the pipeline. The debianize source tree is passed as an artifact to the build jobs, and for those large projects, the size of their source tree exceeds the Salsa s limits. This is specific issue is documented as issue #195, and the proposed solution is to get rid of the extract-source job, relying on sbuild in the very build job (see issue #296). Switching to sbuild would also help to improve the build source job, solving issues such as #187 and #298. The current work-in-progress is very preliminary, but it has already been possible to run the build (amd64), build-i386 and build-source job using sbuild with the unshare mode. The image on the right shows a pipeline that builds grep. All the test jobs use the artifacts of the new build job. There is a lot of remaining work, mainly making the integration with ccache work. This change could break some things, it will also be important to test how the new pipeline works with complex projects. Also, thanks to Emmanuel Arias, we are proposing a Google Summer of Code 2024 project to improve Salsa CI. As part of the ongoing work in preparation for the GSoC 2024 project, Santiago has proposed a merge request to make more efficient how contributors can test their changes on the Salsa CI pipeline.

/usr-move, by Helmut Grohne In January, we sent most of the moving patches for the set of packages involved with debootstrap. Notably missing is glibc, which turns out harder than anticipated via dumat, because it has Conflicts between different architectures, which dumat does not analyze. Patches for diversion mitigations have been updated in a way to not exhibit any loss anymore. The main change here is that packages which are being diverted now support the diverting packages in transitioning their diversions. We also supported a few packages with non-trivial changes such as netplan.io. dumat has been enhanced to better support derivatives such as Ubuntu.

Miscellaneous contributions
  • Python 3.12 migration trundles on. Stefano Rivera helped port several new packages to support 3.12.
  • Stefano updated the Sphinx configuration of DebConf Video Team s documentation, which was broken by Sphinx 7.
  • Stefano published the videos from the Cambridge MiniDebConf to YouTube and PeerTube.
  • DebConf 24 planning has begun, and Stefano & Utkarsh have started work on this.
  • Utkarsh re-sponsored the upload of golang-github-prometheus-community-pgbouncer-exporter for Lena.
  • Colin Watson added Incus support to autopkgtest.
  • Colin discovered Perl::Critic and used it to tidy up some poor practices in several of his packages, including debconf.
  • Colin did some overdue debconf maintenance, mainly around tidying up error message handling in several places (1, 2, 3).
  • Colin figured out how to update the mirror size documentation in debmirror, last updated in 2010. It should now be much easier to keep it up to date regularly.
  • Colin issued a man-db buster update to clean up some irritations due to strict sandboxing.
  • Thorsten Alteholz adopted two more packages, magicfilter and ifhp, for the debian-printing team. Those packages are the last ones of the latest round of adoptions to preserve the old printing protocol within Debian. If you know of other packages that should be retained, please don t hesitate to contact Thorsten.
  • Enrico participated in /usr-merge discussions with Helmut.
  • Helmut sent patches for 16 cross build failures.
  • Helmut supported Matthias Klose (not affiliated with Freexian) with adding -for-host support to gcc-defaults.
  • Helmut uploaded dput-ng enabling dcut migrate and merging two MRs of Ben Hutchings.
  • Santiago took part in the discussions relating to the EU Cyber Resilience Act (CRA) and the Debian public statement that was published last year. He participated in a meeting with Members of the European Parliament (MEPs), Marcel Kolaja and Karen Melchior, and their teams to clarify some points about the impact of the CRA and Debian and downstream projects, and the improvements in the last version of the proposed regulation.

17 January 2024

Colin Watson: Task management

Now that I m freelancing, I need to actually track my time, which is something I ve had the luxury of not having to do before. That meant something of a rethink of the way I ve been keeping track of my to-do list. Up to now that was a combination of things like the bug lists for the projects I m working on at the moment, whatever task tracking system Canonical was using at the moment (Jira when I left), and a giant flat text file in which I recorded logbook-style notes of what I d done each day plus a few extra notes at the bottom to remind myself of particularly urgent tasks. I could have started manually adding times to each logbook entry, but ugh, let s not. In general, I had the following goals (which were a bit reminiscent of my address book): I didn t do an elaborate evaluation of multiple options, because I m not trying to come up with the best possible solution for a client here. Also, there are a bazillion to-do list trackers out there and if I tried to evaluate them all I d never do anything else. I just wanted something that works well enough for me. Since it came up on Mastodon: a bunch of people swear by Org mode, which I know can do at least some of this sort of thing. However, I don t use Emacs and don t plan to use Emacs. nvim-orgmode does have some support for time tracking, but when I ve tried vim-based versions of Org mode in the past I ve found they haven t really fitted my brain very well. Taskwarrior and Timewarrior One of the other Freexian collaborators mentioned Taskwarrior and Timewarrior, so I had a look at those. The basic idea of Taskwarrior is that you have a task command that tracks each task as a blob of JSON and provides subcommands to let you add, modify, and remove tasks with a minimum of friction. task add adds a task, and you can add metadata like project:Personal (I always make sure every task has a project, for ease of filtering). Just running task shows you a task list sorted by Taskwarrior s idea of urgency, with an ID for each task, and there are various other reports with different filtering and verbosity. task <id> annotate lets you attach more information to a task. task <id> done marks it as done. So far so good, so a redacted version of my to-do list looks like this:
$ task ls
ID A Project     Tags                 Description
17   Freexian                         Add Incus support to autopkgtest [2]
 7   Columbiform                      Figure out Lloyds online banking [1]
 2   Debian                           Fix troffcvt for groff 1.23.0 [1]
11   Personal                         Replace living room curtain rail
Once I got comfortable with it, this was already a big improvement. I haven t bothered to learn all the filtering gadgets yet, but it was easy enough to see that I could do something like task all project:Personal and it d show me both pending and completed tasks in that project, and that all the data was stored in ~/.task - though I have to say that there are enough reporting bells and whistles that I haven t needed to poke around manually. In combination with the regular backups that I do anyway (you do too, right?), this gave me enough confidence to abandon my previous text-file logbook approach. Next was time tracking. Timewarrior integrates with Taskwarrior, albeit in an only semi-packaged way, and it was easy enough to set that up. Now I can do:
$ task 25 start
Starting task 00a9516f 'Write blog post about task tracking'.
Started 1 task.
Note: '"Write blog post about task tracking"' is a new tag.
Tracking Columbiform "Write blog post about task tracking"
  Started 2024-01-10T11:28:38
  Current                  38
  Total               0:00:00
You have more urgent tasks.
Project 'Columbiform' is 25% complete (3 of 4 tasks remaining).
When I stop work on something, I do task active to find the ID, then task <id> stop. Timewarrior does the tedious stopwatch business for me, and I can manually enter times if I forget to start/stop a task. Then the really useful bit: I can do something like timew summary :month <name-of-client> and it tells me how much to bill that client for this month. Perfect. I also started using VIT to simplify the day-to-day flow a little, which means I m normally just using one or two keystrokes rather than typing longer commands. That isn t really necessary from my point of view, but it does save some time. Android integration I left Android integration for a bit later since it wasn t essential. When I got round to it, I have to say that it felt a bit clumsy, but it did eventually work. The first step was to set up a taskserver. Most of the setup procedure was OK, but I wanted to use Let s Encrypt to minimize the amount of messing around with CAs I had to do. Getting this to work involved hitting things with sticks a bit, and there s still a local CA involved for client certificates. What I ended up with was a certbot setup with the webroot authenticator and a custom deploy hook as follows (with cert_name replaced by a DNS name in my house domain):
#! /bin/sh
set -eu
cert_name=taskd.example.org
found=false
for domain in $RENEWED_DOMAINS; do
    case "$domain" in
        $cert_name)
            found=:
            ;;
    esac
done
$found   exit 0
install -m 644 "/etc/letsencrypt/live/$cert_name/fullchain.pem" \
    /var/lib/taskd/pki/fullchain.pem
install -m 640 -g Debian-taskd "/etc/letsencrypt/live/$cert_name/privkey.pem" \
    /var/lib/taskd/pki/privkey.pem
systemctl restart taskd.service
I could then set this in /etc/taskd/config (server.crl.pem and ca.cert.pem were generated using the documented taskserver setup procedure):
server.key=/var/lib/taskd/pki/privkey.pem
server.cert=/var/lib/taskd/pki/fullchain.pem
server.crl=/var/lib/taskd/pki/server.crl.pem
ca.cert=/var/lib/taskd/pki/ca.cert.pem
Then I could set taskd.ca on my laptop to /usr/share/ca-certificates/mozilla/ISRG_Root_X1.crt and otherwise follow the client setup instructions, run task sync init to get things started, and then task sync every so often to sync changes between my laptop and the taskserver. I used TaskWarrior Mobile as the client. I have to say I wouldn t want to use that client as my primary task tracking interface: the setup procedure is clunky even beyond the necessity of copying a client certificate around, it expects you to give it a .taskrc rather than having a proper settings interface for that, and it only seems to let you add a task if you specify a due date for it. It also lacks Timewarrior integration, so I can only really use it when I don t care about time tracking, e.g. personal tasks. But that s really all I need, so it meets my minimum requirements. Next? Considering this is literally the first thing I tried, I have to say I m pretty happy with it. There are a bunch of optional extras I haven t tried yet, but in general it kind of has the vim nature for me: if I need something it s very likely to exist or easy enough to build, but the features I don t use don t get in my way. I wouldn t recommend any of this to somebody who didn t already spend most of their time in a terminal - but I do. I m glad people have gone to all the effort to build this so I didn t have to.

15 January 2024

Colin Watson: OpenUK New Year s Honours

Apparently I got an honour from OpenUK. There are a bunch of people I know on that list. Chris Lamb and Mark Brown are familiar names from Debian. Colin King and Jonathan Riddell are people I know from past work in Ubuntu. I ve admired David MacIver s work on Hypothesis and Richard Hughes work on firmware updates from afar. And there are a bunch of other excellent projects represented there: OpenStreetMap, Textualize, and my alma mater of Cambridge to name but a few. My friend Stuart Langridge wrote about being on a similar list a few years ago, and I can t do much better than to echo it: in particular he wrote about the way the open source development community is often at best unwelcoming to people who don t look like Stuart and I do. I can t tell a whole lot about demographic distribution just by looking at a list of names, but while these honours still seem to be skewed somewhat male, I m fairly sure they re doing a lot better in terms of gender balance than my home project of Debian is, for one. I hope this is a sign of improvement for the future, and I ll do what I can to pay it forward.

10 January 2024

Colin Watson: Going freelance

I ve mentioned this in a couple of other places, but I realized I never got round to posting about it on my own blog rather than on other people s services. How remiss of me. Anyway: after much soul-searching, I decided a few months ago that it was time for me to move on from Canonical and the Launchpad team there. Nearly 20 years is a long time to spend at any company, and although there are a bunch of people I ll miss, Launchpad is in a reasonable state where I can let other people have a turn. I m now in business for myself as a freelance developer! My new company is Columbiform, and I m focusing on Debian packaging and custom Python development. My services page has some self-promotion on the sorts of things I can do. My first gig, and the one that made it viable to make this jump, is at Freexian where I m helping with an exciting infrastructure project that we hope will start making Debian developers lives easier in the near future. This is likely to take up most of my time at least through to the end of 2024, but I may have some spare cycles. Drop me a line if you have something where you think I could be a good fit, and we can have a talk about it.

11 November 2022

Reproducible Builds: Reproducible Builds in October 2022

Welcome to the Reproducible Builds report for October 2022! In these reports we attempt to outline the most important things that we have been up to over the past month. As ever, if you are interested in contributing to the project, please visit our Contribute page on our website.

Our in-person summit this year was held in the past few days in Venice, Italy. Activity and news from the summit will therefore be covered in next month s report!
A new article related to reproducible builds was recently published in the 2023 IEEE Symposium on Security and Privacy. Titled Taxonomy of Attacks on Open-Source Software Supply Chains and authored by Piergiorgio Ladisa, Henrik Plate, Matias Martinez and Olivier Barais, their paper:
[ ] proposes a general taxonomy for attacks on opensource supply chains, independent of specific programming languages or ecosystems, and covering all supply chain stages from code contributions to package distribution.
Taking the form of an attack tree, the paper covers 107 unique vectors linked to 94 real world supply-chain incidents which is then mapped to 33 mitigating safeguards including, of course, reproducible builds:
Reproducible Builds received a very high utility rating (5) from 10 participants (58.8%), but also a high-cost rating (4 or 5) from 12 (70.6%). One expert commented that a reproducible build like used by Solarwinds now, is a good measure against tampering with a single build system and another claimed this is going to be the single, biggest barrier .

It was noticed this month that Solarwinds published a whitepaper back in December 2021 in order to:
[ ] illustrate a concerning new reality for the software industry and illuminates the increasingly sophisticated threats made by outside nation-states to the supply chains and infrastructure on which we all rely.
The 12-month anniversary of the 2020 Solarwinds attack (which SolarWinds Worldwide LLC itself calls the SUNBURST attack) was, of course, the likely impetus for publication.
Whilst collaborating on making the Cyrus IMAP server reproducible, Ellie Timoney asked why the Reproducible Builds testing framework uses two remarkably distinctive build paths when attempting to flush out builds that vary on the absolute system path in which they were built. In the case of the Cyrus IMAP server, these happened to be: Asked why they vary in three different ways, Chris Lamb listed in detail the motivation behind to each difference.
On our mailing list this month:
The Reproducible Builds project is delighted to welcome openEuler to the Involved projects page [ ]. openEuler is Linux distribution developed by Huawei, a counterpart to it s more commercially-oriented EulerOS.

Debian Colin Watson wrote about his experience towards making the databases generated by the man-db UNIX manual page indexing tool:
One of the people working on [reproducible builds] noticed that man-db s database files were an obstacle to [reproducibility]: in particular, the exact contents of the database seemed to depend on the order in which files were scanned when building it. The reporter proposed solving this by processing files in sorted order, but I wasn t keen on that approach: firstly because it would mean we could no longer process files in an order that makes it more efficient to read them all from disk (still valuable on rotational disks), but mostly because the differences seemed to point to other bugs.
Colin goes on to describe his approach to solving the problem, including fixing various fits of internal caching, and he ends his post with None of this is particularly glamorous work, but it paid off .
Vagrant Cascadian announced on our mailing list another online sprint to help clear the huge backlog of reproducible builds patches submitted by performing NMUs (Non-Maintainer Uploads). The first such sprint took place on September 22nd, but another was held on October 6th, and another small one on October 20th. This resulted in the following progress:
41 reviews of Debian packages were added, 62 were updated and 12 were removed this month adding to our knowledge about identified issues. A number of issue types were updated too. [1][ ]
Lastly, Luca Boccassi submitted a patch to debhelper, a set of tools used in the packaging of the majority of Debian packages. The patch addressed an issue in the dh_installsysusers utility so that the postinst post-installation script that debhelper generates the same data regardless of the underlying filesystem ordering.

Other distributions F-Droid is a community-run app store that provides free software applications for Android phones. This month, F-Droid changed their documentation and guidance to now explicitly encourage RB for new apps [ ][ ], and FC Stegerman created an extremely in-depth issue on GitLab concerning the APK signing block. You can read more about F-Droid s approach to reproducibility in our July 2022 interview with Hans-Christoph Steiner of the F-Droid Project. In openSUSE, Bernhard M. Wiedemann published his usual openSUSE monthly report.

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including:

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 224 and 225 to Debian:
  • Add support for comparing the text content of HTML files using html2text. [ ]
  • Add support for detecting ordering-only differences in XML files. [ ]
  • Fix an issue with detecting ordering differences. [ ]
  • Use the capitalised version of Ordering consistently everywhere in output. [ ]
  • Add support for displaying font metadata using ttx(1) from the fonttools suite. [ ]
  • Testsuite improvements:
    • Temporarily allow the stable-po pipeline to fail in the CI. [ ]
    • Rename the order1.diff test fixture to json_expected_ordering_diff. [ ]
    • Tidy the JSON tests. [ ]
    • Use assert_diff over get_data and an manual assert within the XML tests. [ ]
    • Drop the ALLOWED_TEST_FILES test; it was mostly just annoying. [ ]
    • Tidy the tests/test_source.py file. [ ]
Chris Lamb also added a link to diffoscope s OpenBSD packaging on the diffoscope.org homepage [ ] and Mattia Rizzolo fix an test failure that was occurring under with LLVM 15 [ ].

Testing framework The Reproducible Builds project operates a comprehensive testing framework at tests.reproducible-builds.org in order to check packages and other artifacts for reproducibility. In October, the following changes were made by Holger Levsen:
  • Run the logparse tool to analyse results on the Debian Edu build logs. [ ]
  • Install btop(1) on all nodes running Debian. [ ]
  • Switch Arch Linux from using SHA1 to SHA256. [ ]
  • When checking Debian debstrap jobs, correctly log the tool usage. [ ]
  • Cleanup more task-related temporary directory names when testing Debian packages. [ ][ ]
  • Use the cdebootstrap-static binary for the 2nd runs of the cdebootstrap tests. [ ]
  • Drop a workaround when testing OpenWrt and coreboot as the issue in diffoscope has now been fixed. [ ]
  • Turn on an rm(1) warning into an info -level message. [ ]
  • Special case the osuosl168 node for running Debian bookworm already. [ ][ ]
  • Use the new non-free-firmware suite on the o168 node. [ ]
In addition, Mattia Rizzolo made the following changes:
  • Ensure that 2nd build has a merged /usr. [ ]
  • Only reconfigure the usrmerge package on Debian bookworm and above. [ ]
  • Fix bc(1) syntax in the computation of the percentage of unreproducible packages in the dashboard. [ ][ ][ ]
  • In the index_suite_ pages, order the package status to be the same order of the menu. [ ]
  • Pass the --distribution parameter to the pbuilder utility. [ ]
Finally, Roland Clobus continued his work on testing Live Debian images. In particular, he extended the maintenance script to warn when workspace directories cannot be deleted. [ ]
If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. However, you can get in touch with us via:

16 October 2022

Colin Watson: Reproducible man-db databases

I ve released man-db 2.11.0 (announcement, NEWS), and uploaded it to Debian unstable. The biggest chunk of work here was fixing some extremely long-standing issues with how the database is built. Despite being in the package name, man-db s database is much less important than it used to be: most uses of man(1) haven t required it in a long time, and both hardware and software improvements mean that even some searches can be done by brute force without needing prior indexing. However, the database is still needed for the whatis(1) and apropos(1) commands. The database has a simple format - no relational structure here, it s just a simple key-value database using old-fashioned DBM-like interfaces and composing a few fields to form values - but there are a number of subtleties involved. The issues tend to amount to this: what does a manual page name mean? At first glance it might seem simple, because you have file names that look something like /usr/share/man/man1/ls.1.gz and that s obviously ls(1). Some pages are symlinks to other pages (which we track separately because it makes it easier to figure out which entries to update when the contents of the file system change), and sometimes multiple pages are even hard links to the same file. The real complications come with whatis references . Pages can list a bunch of names in their NAME section, and the historical expectation is that it should be possible to use those names as arguments to man(1) even if they don t also appear in the file system (although Debian policy has deprecated relying on this for some time). Not only does that mean that man(1) sometimes needs to consult the database, but it also means that the database is inherently more complicated, since a page might list something in its NAME section that conflicts with an actual file name in the file system, and now you need a priority system to resolve ambiguities. There are some other possible causes of ambiguity as well. The people working on reproducible builds in Debian branched out to the related challenge of reproducible installations some time ago: can you take a collection of packages, bootstrap a file system image from them, and reproduce that exact same image somewhere else? This is useful for the same sorts of reasons that reproducible builds are useful: it lets you verify that an image is built from the components it s supposed to be built from, and doesn t contain any other skulduggery by accident or design. One of the people working on this noticed that man-db s database files were an obstacle to that: in particular, the exact contents of the database seemed to depend on the order in which files were scanned when building it. The reporter proposed solving this by processing files in sorted order, but I wasn t keen on that approach: firstly because it would mean we could no longer process files in an order that makes it more efficient to read them all from disk (still valuable on rotational disks), but mostly because the differences seemed to point to other bugs. Having understood this, there then followed several late nights of very fiddly work on the details of how the database is maintained. None of this was conceptually difficult: it mainly amounted to ensuring that we maintain a consistent well-order for different entries that we might want to insert for a given database key, and that we consider the same names for insertion regardless of the order in which we encounter files. As usual, the tricky bit is making sure that we have the right data structures to support this. man-db is written in C which is not very well-supplied with built-in data structures, and originally much of the code was written in a style that tried to minimize memory allocations; this came at the cost of ownership and lifetime often being rather unclear, and it was often difficult to make changes without causing leaks or double-frees. Over the years I ve been gradually introducing better encapsulation to make things easier to follow, and I had to do another round of that here. There were also some problems with caching being done at slightly the wrong layer: we need to make use of a trace of the chain of links followed to resolve a page to its ultimate source file, but we were incorrectly caching that trace and reusing it for any link to the same file, with incorrect results in many cases. Oh, and after doing all that I found that the on-disk representation of a GDBM database is insertion-order-dependent, so I ended up having to manually reorganize the database at the end by reading it all in and writing it all back out in sorted order, which feels really weird to me coming from spending most of my time with PostgreSQL these days. Fortunately the database is small so this takes negligible time. None of this is particularly glamorous work, but it paid off:
# export SOURCE_DATE_EPOCH="$(date +%s)"
# mkdir emptydir disorder
# disorderfs --multi-user=yes --shuffle-dirents=yes --reverse-dirents=no emptydir disorder
# export TMPDIR="$(pwd)/disorder"
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out1.tar
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out2.tar
# cmp out1.tar out2.tar
# echo $?
0

18 February 2022

Colin Watson: Launchpad now supports SSH Ed25519 keys and RSA SHA-2 signatures

As of 2022-02-16, Launchpad supports a couple of features on its SSH endpoints (git.launchpad.net, bazaar.launchpad.net, ppa.launchpad.net, and upload.ubuntu.com) that it previously didn t: Ed25519 public keys (a well-regarded format, supported by OpenSSH since 6.5 in 2014) and signatures with existing RSA public keys using SHA-2 rather than SHA-1 (supported by OpenSSH since 7.2 in 2016). I m hesitant to call these features new , since they ve been around for a long time elsewhere, and people might quite reasonably ask why it s taken us so long. The problem has always been that Launchpad can t really use a normal SSH server such as OpenSSH because it needs features that aren t practical to implement that way, such as virtual filesystems and dynamic user key authorization against the Launchpad database. Instead, we use Twisted Conch, which is a very extensible Python SSH implementation that has generally served us well. The downside is that, because it s an independent implementation and one that occupies a relatively small niche, it often lags behind in terms of newer protocol features. Catching up to this point has been something we ve been working on for around five years, although it s taken a painfully long time for a variety of reasons which I thought some people might find interesting to go into, at least people who have the patience for details of the SSH protocol. Many of the delays were my own responsibility, although realistically we probably couldn t have added Ed25519 support before OpenSSL/cryptography work that landed in 2019. Phew! Thanks to everyone who works on Twisted, cryptography, and OpenSSL - it s been really useful to be able to build on solid lower-level cryptographic primitives - and to those who helped with code review.

2 August 2021

Colin Watson: Launchpad now runs on Python 3!

After a very long porting journey, Launchpad is finally running on Python 3 across all of our systems. I wanted to take a bit of time to reflect on why my emotional responses to this port differ so much from those of some others who ve done large ports, such as the Mercurial maintainers. It s hard to deny that we ve had to burn a lot of time on this, which I m sure has had an opportunity cost, and from one point of view it s essentially running to stand still: there is no single compelling feature that we get solely by porting to Python 3, although it s clearly a prerequisite for tidying up old compatibility code and being able to use modern language facilities in the future. And yet, on the whole, I found this a rewarding project and enjoyed doing it. Some of this may be because by inclination I m a maintenance programmer and actually enjoy this sort of thing. My default view tends to be that software version upgrades may be a pain but it s much better to get that pain over with as soon as you can rather than trying to hold back the tide; you can certainly get involved and try to shape where things end up, but rightly or wrongly I can t think of many cases when a righteously indignant user base managed to arrange for the old version to be maintained in perpetuity so that they never had to deal with the new thing (OK, maybe Perl 5 counts here). I think a more compelling difference between Launchpad and Mercurial, though, may be that very few other people really had a vested interest in what Python version Launchpad happened to be running, because it s all server-side code (aside from some client libraries such as launchpadlib, which were ported years ago). As such, we weren t trying to do this with the internet having Strong Opinions at us. We were doing this because it was obviously the only long-term-maintainable path forward, and in more recent times because some of our library dependencies were starting to drop support for Python 2 and so it was obviously going to become a practical problem for us sooner or later; but if we d just stayed on Python 2 forever then fundamentally hardly anyone else would really have cared directly, only maybe about some indirect consequences of that. I don t follow Mercurial development so I may be entirely off-base, but if other people were yelling at me about how late my project was to finish its port, that in itself would make me feel more negatively about the project even if I thought it was a good idea. Having most of the pressure come from ourselves rather than from outside meant that wasn t an issue for us. I m somewhat inclined to think of the process as an extreme version of paying down technical debt. Moving from Python 2.7 to 3.5, as we just did, means skipping over multiple language versions in one go, and if similar changes had been made more gradually it would probably have felt a lot more like the typical dependency update treadmill. I appreciate why not everyone might want to think of it this way: maybe this is just my own rationalization. Reflections on porting to Python 3 I m not going to defend the Python 3 migration process; it was pretty rough in a lot of ways. Nor am I going to spend much effort relitigating it here, as it s already been done to death elsewhere, and as I understand it the core Python developers have got the message loud and clear by now. At a bare minimum, a lot of valuable time was lost early in Python 3 s lifetime hanging on to flag-day-type porting strategies that were impractical for large projects, when it should have been providing for bilingual strategies (code that runs in both Python 2 and 3 for a transitional period) which is where most libraries and most large migrations ended up in practice. For instance, the early advice to library maintainers to maintain two parallel versions or perhaps translate dynamically with 2to3 was entirely impractical in most non-trivial cases and wasn t what most people ended up doing, and yet the idea that 2to3 is all you need still floats around Stack Overflow and the like as a result. (These days, I would probably point people towards something more like Eevee s porting FAQ as somewhere to start.) There are various fairly straightforward things that people often suggest could have been done to smooth the path, and I largely agree: not removing the u'' string prefix only to put it back in 3.3, fewer gratuitous compatibility breaks in the name of tidiness, and so on. But if I had a time machine, the number one thing I would ask to have been done differently would be introducing type annotations in Python 2 before Python 3 branched off. It s true that it s technically possible to do type annotations in Python 2, but the fact that it s a different syntax that would have to be fixed later is offputting, and in practice it wasn t widely used in Python 2 code. To make a significant difference to the ease of porting, annotations would need to have been introduced early enough that lots of Python 2 library code used them so that porting code didn t have to be quite so much of an exercise of manually figuring out the exact nature of string types from context. Launchpad is a complex piece of software that interacts with multiple domains: for example, it deals with a database, HTTP, web page rendering, Debian-format archive publishing, and multiple revision control systems, and there s often overlap between domains. Each of these tends to imply different kinds of string handling. Web page rendering is normally done mainly in Unicode, converting to bytes as late as possible; revision control systems normally want to spend most of their time working with bytes, although the exact details vary; HTTP is of course bytes on the wire, but Python s WSGI interface has some string type subtleties. In practice I found myself thinking about at least four string-like types (that is, things that in a language with a stricter type system I might well want to define as distinct types and restrict conversion between them): bytes, text, ordinary native strings (str in either language, encoded to UTF-8 in Python 2), and native strings with WSGI s encoding rules. Some of these are emergent properties of writing in the intersection of Python 2 and 3, which is effectively a specialized language of its own without coherent official documentation whose users must intuit its behaviour by comparing multiple sources of information, or by referring to unofficial porting guides: not a very satisfactory situation. Fortunately much of the complexity collapses once it becomes possible to write solely in Python 3. Some of the difficulties we ran into are not ones that are typically thought of as Python 2-to-3 porting issues, because they were changed later in Python 3 s development process. For instance, the email module was substantially improved in around the 3.2/3.3 timeframe to handle Python 3 s bytes/text model more correctly, and since Launchpad sends quite a few different kinds of email messages and has some quite picky tests for exactly what it emits, this entailed a lot of work in our email sending code and in our test suite to account for that. (It took me a while to work out whether we should be treating raw email messages as bytes or as text; bytes turned out to work best.) 3.4 made some tweaks to the implementation of quoted-printable encoding that broke a number of our tests in ways that took some effort to fix, because the tests needed to work on both 2.7 and 3.5. The list goes on. I got quite proficient at digging through Python s git history to figure out when and why some particular bit of behaviour had changed. One of the thorniest problems was parsing HTTP form data. We mainly rely on zope.publisher for this, which in turn relied on cgi.FieldStorage; but cgi.FieldStorage is badly broken in some situations on Python 3. Even if that bug were fixed in a more recent version of Python, we can t easily use anything newer than 3.5 for the first stage of our port due to the version of the base OS we re currently running, so it wouldn t help much. In the end I fixed some minor issues in the multipart module (and was kindly given co-maintenance of it) and converted zope.publisher to use it. Although this took a while to sort out, it seems to have gone very well. A couple of other interesting late-arriving issues were around pickle. For most things we normally prefer safer formats such as JSON, but there are a few cases where we use pickle, particularly for our session databases. One of my colleagues pointed out that I needed to remember to tell pickle to stick to protocol 2, so that we d be able to switch back and forward between Python 2 and 3 for a while; quite right, and we later ran into a similar problem with marshal too. A more surprising problem was that datetime.datetime objects pickled on Python 2 require special care when unpickling on Python 3; rather than the approach that ended up being implemented and documented for Python 3.6, though, I preferred a custom unpickler, both so that things would work on Python 3.5 and so that I wouldn t have to risk affecting the decoding of other pickled strings in the session database. General lessons Writing this over a year after Python 2 s end-of-life date, and certainly nowhere near the leading edge of Python 3 porting work, it s perhaps more useful to look at this in terms of the lessons it has for other large technical debt projects. I mentioned in my previous article that I used the approach of an enormous and frequently-rebased git branch as a working area for the port, committing often and sometimes combining and extracting commits for review once they seemed to be ready. A port of this scale would have been entirely intractable without a tool of similar power to git rebase, so I m very glad that we finished migrating to git in 2019. I relied on this right up to the end of the port, and it also allowed for quick assessments of how much more there was to land. git worktree was also helpful, in that I could easily maintain working trees built for each of Python 2 and 3 for comparison. As is usual for most multi-developer projects, all changes to Launchpad need to go through code review, although we sometimes make exceptions for very simple and obvious changes that can be self-reviewed. Since I knew from the outset that this was going to generate a lot of changes for review, I therefore structured my work from the outset to try to make it as easy as possible for my colleagues to review it. This generally involved keeping most changes to a somewhat manageable size of 800 lines or less (although this wasn t always possible), and arranging commits mainly according to the kind of change they made rather than their location. For example, when I needed to fix issues with / in Python 3 being true division rather than floor division, I did so in one commit across the various places where it mattered and took care not to mix it with other unrelated changes. This is good practice for nearly any kind of development, but it was especially important here since it allowed reviewers to consider a clear explanation of what I was doing in the commit message and then skim-read the rest of it much more quickly. It was vital to keep the codebase in a working state at all times, and deploy to production reasonably often: this way if something went wrong the amount of code we had to debug to figure out what had happened was always tractable. (Although I can t seem to find it now to link to it, I saw an account a while back of a company that had taken a flag-day approach instead with a large codebase. It seemed to work for them, but I m certain we couldn t have made it work for Launchpad.) I can t speak too highly of Launchpad s test suite, much of which originated before my time. Without a great deal of extensive coverage of all sorts of interesting edge cases at both the unit and functional level, and a corresponding culture of maintaining that test suite well when making new changes, it would have been impossible to be anything like as confident of the port as we were. As part of the porting work, we split out a couple of substantial chunks of the Launchpad codebase that could easily be decoupled from the core: its Mailman integration and its code import worker. Both of these had substantial dependencies with complex requirements for porting to Python 3, and arranging to be able to do these separately on their own schedule was absolutely worth it. Like disentangling balls of wool, any opportunity you can take to make things less tightly-coupled is probably going to make it easier to disentangle the rest. (I can see a tractable way forward to porting the code import worker, so we may well get that done soon. Our Mailman integration will need to be rewritten, though, since it currently depends on the Python-2-only Mailman 2, and Mailman 3 has a different architecture.) Python lessons Our database layer was already in pretty good shape for a port, since at least the modern bits of its table modelling interface were already strict about using Unicode for text columns. If you have any kind of pervasive low-level framework like this, then making it be pedantic at you in advance of a Python 3 port will probably incur much less swearing in the long run, as you won t be trying to deal with quite so many bytes/text issues at the same time as everything else. Early in our port, we established a standard set of __future__ imports and started incrementally converting files over to them, mainly because we weren t yet sure what else to do and it seemed likely to be helpful. absolute_import was definitely reasonable (and not often a problem in our code), and print_function was annoying but necessary. In hindsight I m not sure about unicode_literals, though. For files that only deal with bytes and text it was reasonable enough, but as I mentioned above there were also a number of cases where we needed literals of the language s native str type, i.e. bytes in Python 2 and text in Python 3: this was particularly noticeable in WSGI contexts, but also cropped up in some other surprising places. We generally either omitted unicode_literals or used six.ensure_str in such cases, but it was definitely a bit awkward and maybe I should have listened more to people telling me it might be a bad idea. A lot of Launchpad s early tests used doctest, mainly in the style where you have text files that interleave narrative commentary with examples. The development team later reached consensus that this was best avoided in most cases, but by then there were far too many doctests to conveniently rewrite in some other form. Porting doctests to Python 3 is really annoying. You run into all the little changes in how objects are represented as text (particularly u'...' versus '...', but plenty of other cases as well); you have next to no tools to do anything useful like skipping individual bits of a doctest that don t apply; using __future__ imports requires the rather obscure approach of adding the relevant names to the doctest s globals in the relevant DocFileSuite or DocTestSuite; dealing with many exception tracebacks requires something like zope.testing.renormalizing; and whatever code refactoring tools you re using probably don t work properly. Basically, don t have done that. It did all turn out to be tractable for us in the end, and I managed to avoid using much in the way of fragile doctest extensions aside from the aforementioned zope.testing.renormalizing, but it was not an enjoyable experience. Regressions I know of nine regressions that reached Launchpad s production systems as a result of this porting work; of course there were various other regressions caught by CI or in manual testing. (Considering the size of this project, I count it as a resounding success that there were only nine production issues, and that for the most part we were able to fix them quickly.) Equality testing of removed database objects One of the things we had to do while porting to Python 3 was to implement the __eq__, __ne__, and __hash__ special methods for all our database objects. This was quite conceptually fiddly, because doing this requires knowing each object s primary key, and that may not yet be available if we ve created an object in Python but not yet flushed the actual INSERT statement to the database (most of our primary keys are auto-incrementing sequences). We thus had to take care to flush pending SQL statements in such cases in order to ensure that we know the primary keys. However, it s possible to have a problem at the other end of the object lifecycle: that is, a Python object might still be reachable in memory even though the underlying row has been DELETEd from the database. In most cases we don t keep removed objects around for obvious reasons, but it can happen in caching code, and buildd-manager crashed as a result (in fact while it was still running on Python 2). We had to take extra care to avoid this problem. Debian imports crashed on non-UTF-8 filenames Python 2 has some unfortunate behaviour around passing bytes or Unicode strings (depending on the platform) to shutil.rmtree, and the combination of some porting work and a particular source package in Debian that contained a non-UTF-8 file name caused us to run into this. The fix was to ensure that the argument passed to shutil.rmtree is a str regardless of Python version. We d actually run into something similar before: it s a subtle porting gotcha, since it s quite easy to end up passing Unicode strings to shutil.rmtree if you re in the process of porting your code to Python 3, and you might easily not notice if the file names in your tests are all encoded using UTF-8. lazr.restful ETags We eventually got far enough along that we could switch one of our four appserver machines (we have quite a number of other machines too, but the appservers handle web and API requests) to Python 3 and see what happened. By this point our extensive test suite had shaken out the vast majority of the things that could go wrong, but there was always going to be room for some interesting edge cases. One of the Ubuntu kernel team reported that they were seeing an increase in 412 Precondition Failed errors in some of their scripts that use our webservice API. These can happen when you re trying to modify an existing resource: the underlying protocol involves sending an If-Match header with the ETag that the client thinks the resource has, and if this doesn t match the ETag that the server calculates for the resource then the client has to refresh its copy of the resource and try again. We initially thought that this might be legitimate since it can happen in normal operation if you collide with another client making changes to the same resource, but it soon became clear that something stranger was going on: we were getting inconsistent ETags for the same object even when it was unchanged. Since we d recently switched a quarter of our appservers to Python 3, that was a natural suspect. Our lazr.restful package provides the framework for our webservice API, and roughly speaking it generates ETags by serializing objects into some kind of canonical form and hashing the result. Unfortunately the serialization was dependent on the Python version in a few ways, and in particular it serialized lists of strings such as lists of bug tags differently: Python 2 used [u'foo', u'bar', u'baz'] where Python 3 used ['foo', 'bar', 'baz']. In lazr.restful 1.0.3 we switched to using JSON for this, removing the Python version dependency and ensuring consistent behaviour between appservers. Memory leaks This problem took the longest to solve. We noticed fairly quickly from our graphs that the appserver machine we d switched to Python 3 had a serious memory leak. Our appservers had always been a bit leaky, but now it wasn t so much a small hole that we can bail occasionally as the boat is sinking rapidly : A serious memory leak (Yes, this got in the way of working out what was going on with ETags for a while.) I spent ages messing around with various attempts to fix this. Since only a quarter of our appservers were affected, and we could get by on 75% capacity for a while, it wasn t urgent but it was definitely annoying. After spending some quality time with objgraph, for some time I thought traceback reference cycles might be at fault, and I sent a number of fixes to various upstream projects for those (e.g. zope.pagetemplate). Those didn t help the leaks much though, and after a while it became clear to me that this couldn t be the sole problem: Python has a cyclic garbage collector that will eventually collect reference cycles as long as there are no strong references to any objects in them, although it might not happen very quickly. Something else must be going on. Debugging reference leaks in any non-trivial and long-running Python program is extremely arduous, especially with ORMs that naturally tend to end up with lots of cycles and caches. After a while I formed a hypothesis that zope.server might be keeping a strong reference to something, although I never managed to nail it down more firmly than that. This was an attractive theory as we were already in the process of migrating to Gunicorn for other reasons anyway, and Gunicorn also has a convenient max_requests setting that s good at mitigating memory leaks. Getting this all in place took some time, but once we did we found that everything was much more stable: A rather flat memory graph This isn t completely satisfying as we never quite got to the bottom of the leak itself, and it s entirely possible that we ve only papered over it using max_requests: I expect we ll gradually back off on how frequently we restart workers over time to try to track this down. However, pragmatically, it s no longer an operational concern. Mirror prober HTTPS proxy handling After we switched our script servers to Python 3, we had several reports of mirror probing failures. (Launchpad keeps lists of Ubuntu archive and image mirrors, and probes them every so often to check that they re reasonably complete and up to date.) This only affected HTTPS mirrors when probed via a proxy server, support for which is a relatively recent feature in Launchpad and involved some code that we never managed to unit-test properly: of course this is exactly the code that went wrong. Sadly I wasn t able to sort out that gap, but at least the fix was simple. Non-MIME-encoded email headers As I mentioned above, there were substantial changes in the email package between Python 2 and 3, and indeed between minor versions of Python 3. Our test coverage here is pretty good, but it s an area where it s very easy to have gaps. We noticed that a script that processes incoming email was crashing on messages with headers that were non-ASCII but not MIME-encoded (and indeed then crashing again when it tried to send a notification of the crash!). The only examples of these I looked at were spam, but we still didn t want to crash on them. The fix involved being somewhat more careful about both the handling of headers returned by Python s email parser and the building of outgoing email notifications. This seems to be working well so far, although I wouldn t be surprised to find the odd other incorrect detail in this sort of area. Failure to handle non-ISO-8859-1 URL-encoded form input Remember how I said that parsing HTTP form data was thorny? After we finished upgrading all our appservers to Python 3, people started reporting that they couldn t post Unicode comments to bugs, which turned out to be only if the attempt was made using JavaScript, and was because I hadn t quite managed to get URL-encoded form data working properly with zope.publisher and multipart. The current standard describes the URL-encoded format for form data as in many ways an aberrant monstrosity , so this was no great surprise. Part of the problem was some very strange choices in zope.publisher dating back to 2004 or earlier, which I attempted to clean up and simplify. The rest was that Python 2 s urlparse.parse_qs unconditionally decodes percent-encoded sequences as ISO-8859-1 if they re passed in as part of a Unicode string, so multipart needs to work around this on Python 2. I m still not completely confident that this is correct in all situations, but at least now that we re on Python 3 everywhere the matrix of cases we need to care about is smaller. Inconsistent marshalling of Loggerhead s disk cache We use Loggerhead for providing web browsing of Bazaar branches. When we upgraded one of its two servers to Python 3, we immediately noticed that the one still on Python 2 was failing to read back its revision information cache, which it stores in a database on disk. (We noticed this because it caused a deployment to fail: when we tried to roll out new code to the instance still on Python 2, Nagios checks had already caused an incompatible cache to be written for one branch from the Python 3 instance.) This turned out to be a similar problem to the pickle issue mentioned above, except this one was with marshal, which I didn t think to look for because it s a relatively obscure module mostly used for internal purposes by Python itself; I m not sure that Loggerhead should really be using it in the first place. The fix was relatively straightforward, complicated mainly by now needing to cope with throwing away unreadable cache data. Ironically, if we d just gone ahead and taken the nominally riskier path of upgrading both servers at the same time, we might never have had a problem here. Intermittent bzr failures Finally, after we upgraded one of our two Bazaar codehosting servers to Python 3, we had a report of intermittent bzr branch hangs. After some digging I found this in our logs:
Traceback (most recent call last):
  ...
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/twisted/conch/ssh/channel.py", line 136, in addWindowBytes
    self.startWriting()
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/lazr/sshserver/session.py", line 88, in startWriting
    resumeProducing()
  File "/srv/bazaar.launchpad.net/production/codehosting1-rev-20124175fa98fcb4b43973265a1561174418f4bd/env/lib/python3.5/site-packages/twisted/internet/process.py", line 894, in resumeProducing
    for p in self.pipes.itervalues():
builtins.AttributeError: 'dict' object has no attribute 'itervalues'
I d seen this before in our git hosting service: it was a bug in Twisted s Python 3 port, fixed after 20.3.0 but unfortunately after the last release that supported Python 2, so we had to backport that patch. Using the same backport dealt with this. Onwards!

11 June 2021

Colin Watson: SSH quoting

A while back there was a thread on one of our company mailing lists about SSH quoting, and I posted a long answer to it. Since then a few people have asked me questions that caused me to reach for it, so I thought it might be helpful if I were to anonymize the original question and post my answer here. The question was why a sequence of commands involving ssh and fiddly quoting produced the output they did. The first example was this:
$ ssh user@machine.local bash -lc "cd /tmp;pwd"
/home/user
Oh hi, my dubious life choices have been such that this is my specialist subject! This is because SSH command-line parsing is not quite what you expect. First, recall that your local shell will apply its usual parsing, and the actual OS-level execution of ssh will be like this:
[0]: ssh
[1]: user@machine.local
[2]: bash
[3]: -lc
[4]: cd /tmp;pwd
Now, the SSH wire protocol only takes a single string as the command, with the expectation that it should be passed to a shell by the remote end. The OpenSSH client deals with this by taking all its arguments after things like options and the target, which in this case are:
[0]: bash
[1]: -lc
[2]: cd /tmp;pwd
It then joins them with a single space:
bash -lc cd /tmp;pwd
This is passed as a string to the server, which then passes that entire string to a shell for evaluation, so as if you d typed this directly on the server:
sh -c 'bash -lc cd /tmp;pwd'
The shell then parses this as two commands:
bash -lc cd /tmp
pwd
The directory change thus happens in a subshell (actually it doesn t quite even do that, because bash -lc cd /tmp in fact ends up just calling cd because of the way bash -c parses multiple arguments), and then that subshell exits, then pwd is called in the outer shell which still has the original working directory. The second example was this:
$ ssh user@machine.local bash -lc "pwd;cd /tmp;pwd"
/home/user
/tmp
Following the logic above, this ends up as if you d run this on the server:
sh -c 'bash -lc pwd; cd /tmp; pwd'
The third example was this:
$ ssh user@machine.local bash -lc "cd /tmp;cd /tmp;pwd"
/tmp
And this is as if you d run:
sh -c 'bash -lc cd /tmp; cd /tmp; pwd'
Now, I wouldn t have implemented the SSH client this way, because I agree that it s confusing. But /usr/bin/ssh is used as a transport for other things so much that changing its behaviour now would be enormously disruptive, so it s probably impossible to fix. (I have occasionally agitated on openssh-unix-dev@ for at least documenting this better, but haven t made much headway yet; I need to get round to preparing a documentation patch.) Once you know about it you can use the proper quoting, though. In this case that would simply be:
ssh user@machine.local 'cd /tmp;pwd'
Or if you do need to specifically invoke bash -l there for some reason (I m assuming that the original example was reduced from something more complicated), then you can minimise your confusion by passing the whole thing as a single string in the form you want the remote sh -c to see, in a way that ensures that the quotes are preserved and sent to the server rather than being removed by your local shell:
ssh user@machine.local 'bash -lc "cd /tmp;pwd"'
Shell parsing is hard.

25 September 2020

Colin Watson: Porting Launchpad to Python 3: progress report

Launchpad still requires Python 2, which in 2020 is a bit of a problem. Unlike a lot of the rest of 2020, though, there s good reason to be optimistic about progress. I ve been porting Python 2 code to Python 3 on and off for a long time, from back when I was on the Ubuntu Foundations team and maintaining things like the Ubiquity installer. When I moved to Launchpad in 2015 it was certainly on my mind that this was a large body of code still stuck on Python 2. One option would have been to just accept that and leave it as it is, maybe doing more backporting work over time as support for Python 2 fades away. I ve long been of the opinion that this would doom Launchpad to being unmaintainable in the long run, and since I genuinely love working on Launchpad - I find it an incredibly rewarding project - this wasn t something I was willing to accept. We re already seeing some of our important dependencies dropping support for Python 2, which is perfectly reasonable on their terms but which is starting to become a genuine obstacle to delivering important features when we need new features from newer versions of those dependencies. It also looks as though it may be difficult for us to run on Ubuntu 20.04 LTS (we re currently on 16.04, with an upgrade to 18.04 in progress) as long as we still require Python 2, since we have some system dependencies that 20.04 no longer provides. And then there are exciting new features like type hints and async/await that we d like to be able to use. However, until last year there were so many blockers that even considering a port was barely conceivable. What changed in 2019 was sorting out a trifecta of core dependencies. We ported our database layer, Storm. We upgraded to modern versions of our Zope Toolkit dependencies (after contributing various fixes upstream, including some substantial changes to Zope s test runner that we d carried as local patches for some years). And we ported our Bazaar code hosting infrastructure to Breezy. With all that in place, a port seemed more of a realistic possibility. Still, even with this, it was never going to be a matter of just following some standard porting advice and calling it good. Launchpad has almost a million lines of Python code in its main git tree, and around 250 dependencies of which a number are quite Launchpad-specific. In a project that size, not only is following standard porting advice an extremely time-consuming task in its own right, but just about every strange corner case is going to show up somewhere. (Did you know that StringIO.StringIO(None) and io.StringIO(None) do different things even after you account for the native string vs. Unicode text difference? How about the behaviour of .union() on a subclass of frozenset?) Launchpad s test suite is fortunately extremely thorough, but even just starting up the test suite involves importing most of the data model code, so before you can start taking advantage of it you have to make a large fraction of the codebase be at least syntactically-correct Python 3 code and use only modules that exist in Python 3 while still working in Python 2; in a project this size that turns out to be a large effort on its own, and can be quite risky in places. Canonical s product engineering teams work on a six-month cycle, but it just isn t possible to cram this sort of thing into six months unless you do literally nothing else, and please can we put all feature development on hold while we run to stand still is a pretty tough sell to even the most understanding management. Fortunately, we ve been able to grow the Launchpad team in the last year or so, and so it s been possible to put Python 3 on our roadmap on the understanding that we aren t going to get all the way there in one cycle, while still being able to do other substantial feature development work as well. So, with all that preamble, what have we done this cycle? We ve taken a two-pronged approach. From one end, we identified 147 classes that needed to be ported away from some compatibility code in our database layer that was substantially less friendly to Python 3: we ve ported 38 of those, so there s clearly a fair bit more to do, but we were able to distribute this work out among the team quite effectively. From the other end, it was clear that it would be very inefficient to do general porting work when any attempt to even run the test suite would run straight into the same crashes in the same order, so I set myself a target of getting the test suite to start up, and started hacking on an enormous git branch that I never expected to try to land directly: instead, I felt free to commit just about anything that looked reasonable and moved things forward even if it was very rough, and every so often went back to tidy things up and cherry-pick individual commits into a form that included some kind of explanation and passed existing tests so that I could propose them for review. This strategy has been dramatically more successful than anything I ve tried before at this scale. So far this cycle, considering only Launchpad s main git tree, we ve landed 137 Python-3-relevant merge proposals for a total of 39552 lines of git diff output, keeping our existing tests passing along the way and deploying incrementally to production. We have about 27000 more lines of patch at varying degrees of quality to tidy up and merge. Our main development branch is only perhaps 10 or 20 more patches away from the test suite being able to start up, at which point we ll be able to get a buildbot running so that multiple developers can work on this much more easily and see the effect of their work. With the full unlanded patch stack, about 75% of the test suite passes on Python 3! This still leaves a long tail of several thousand tests to figure out and fix, but it s a much more incrementally-tractable kind of problem than where we started. Finally: the funniest (to me) bug I ve encountered in this effort was the one I encountered in the test runner and fixed in zopefoundation/zope.testrunner#106: IDs of failing tests were written to a pipe, so if you have a test suite that s large enough and broken enough then eventually that pipe would reach its capacity and your test runner would just give up and hang. Pretty annoying when it meant an overnight test run didn t give useful results, but also eloquent commentary of sorts.

17 August 2020

Ian Jackson: Doctrinal obstructiveness in Free Software

Any software system has underlying design principles, and any software project has process rules. But I seem to be seeing more often, a pathological pattern where abstract and shakily-grounded broad principles, and even contrived and sophistic objections, are used to block sensible changes. Today I will go through an example in detail, before ending with a plea: PostgreSQL query planner, WITH [MATERIALIZED] optimisation fence Background history PostgreSQL has a sophisticated query planner which usually gets the right answer. For good reasons, the pgsql project has resisted providing lots of knobs to control query planning. But there are a few ways to influence the query planner, for when the programmer knows more than the planner. One of these is the use of a WITH common table expression. In pgsql versions prior to 12, the planner would first make a plan for the WITH clause; and then, it would make a plan for the second half, counting the WITH clause's likely output as a given. So WITH acts as an "optimisation fence". This was documented in the manual - not entirely clearly, but a careful reading of the docs reveals this behaviour:
The WITH query will generally be evaluated as written, without suppression of rows that the parent query might discard afterwards.
Users (authors of applications which use PostgreSQL) have been using this technique for a long time. New behaviour in PostgreSQL 12 In PostgreSQL 12 upstream were able to make the query planner more sophisticated. In particular, it is now often capable of looking "into" the WITH common table expression. Much of the time this will make things better and faster. But if WITH was being used for its side-effect as an optimisation fence, this change will break things: queries that ran very quickly in earlier versions might now run very slowly. Helpfully, pgsql 12 still has a way to specify an optimisation fence: specifying WITH ... AS MATERIALIZED in the query. So far so good. Upgrade path for existing users of WITH fence But what about the upgrade path for existing users of the WITH fence behaviour? Such users will have to update their queries to add AS MATERIALIZED. This is a small change. Having to update a query like this is part of routine software maintenance and not in itself very objectionable. However, this change cannnot be made in advance because pgsql versions prior to 12 will reject the new syntax. So the users are in a bit of a bind. The old query syntax can be unuseably slow with the new database and the new syntax is rejected by the old database. Upgrading both the database and the application, in lockstep, is a flag day upgrade, which every good sysadmin will want to avoid. A solution to this problem Colin Watson proposed a very simple solution: make the earlier PostgreSQL versions accept the new MATERIALIZED syntax. This is correct since the new syntax specifies precisely the actual behaviour of the old databases. It has no deleterious effect on any users of older pgsql versions. It makes it possible to add the new syntax to the application, before doing the database upgrade, decoupling the two upgrades. Colin Watson even provided an implementation of this proposal. The solution is rejected by upstream Unfortunately upstream did not accept this idea. You can read the whole thread yourself if you like. But in summary, the objections were (italic indicates literal quotes): I find these extremely unconvincing, even taken together. Many of them are very unattractive things to hear one's upstream saying. At best they are knee-jerk and inflexible application of very general principles. The authors of these objections seem to have lost sight of the fact that these principles have a purpose. When these kind of software principles work against their purposes, they should be revised, or exceptions made. At worst, it looks like a collective effort to find reasons - any reasons, no matter how bad - not to make this change. The OFFSET 0 trick One of the responses in the thread mentions OFFSET 0. As part of writing the queries in the Xen Project CI system, and preparing for our system upgrade, I had carefully read the relevant pgsql documentation. This OFFSET 0 trick was new to me. But, now that I know the answer, it is easy to provide the right search terms and find, for example, this answer on stackmumble. Apparently adding a no-op OFFSET 0 to the subquery defeats the pgsql 12 query planner's ability to see into the subquery.
I think OFFSET 0 is the better approach since it's more obviously a hack showing that something weird is going on, and it's unlikely we'll ever change the optimiser behaviour around OFFSET 0 ... wheras hopefully CTEs will become inlineable at some point CTEs became inlineable by default in PostgreSQL 12.
So in fact there is a syntax for an optimisation fence that is accepted by both earlier and later PostgreSQL versions. It's even recommended by pgsql devs. It's just not documented, and is described by pgsql developers as a "hack". Astonishingly, the fact that it is a "hack" is given as a reason to use it! Well, I have therefore deployed this "hack". No doubt it will stay in our codebase indefinitely. Please don't be like that! I could come up with a lot more examples of other projects that have exhibited similar arrogance. It is becoming a plague! But every example is contentious, and I don't really feel I need to annoy a dozen separate Free Software communities. So I won't make a laundry list of obstructiveness. If you are an upstream software developer, or a distributor of software to users (eg, a distro maintainer), you have a lot of practical power. In theory it is Free Software so your users could just change it themselves. But for a user or downstream, carrying a patch is often an unsustainable amount of work and risk. Most of us have patches we would love to be running, but which we haven't even written because simply running a nonstandard build is too difficult, no matter how technically excellent our delta. As an upstream, it is very easy to get into a mindset of defending your code's existing behaviour, and to turn your project's guidelines into inflexible rules. Constant exposure to users who make silly mistakes, and rudely ask for absurd changes, can lead to core project members feeling embattled. But there is no need for an upstream to feel embattled! You have the vast majority of the power over the software, and over your project communication fora. Use that power consciously, for good. I can't say that arrogance will hurt you in the short term. Users of software with obstructive upstreams do not have many good immediate options. But we do have longer-term choices: we can choose which software to use, and we can choose whether to try to help improve the software we use. After reading Colin's experience, I am less likely to try to help improve the experience of other PostgreSQL users by contributing upstream. It doesn't seem like there would be any point. Indeed, instead of helping the PostgreSQL community I am now using them as an example of bad practice. I'm only half sorry about that.

comment count unavailable comments

16 November 2017

Colin Watson: Kitten Block equivalent for Firefox 57

I ve been using Kitten Block for years, since I don t really need the blood pressure spike caused by accidentally following links to certain UK newspapers. Unfortunately it hasn t been ported to Firefox 57. I tried emailing the author a couple of months ago, but my email bounced. However, if your primary goal is just to block the websites in question rather than seeing kitten pictures as such (let s face it, the internet is not short of alternative sources of kitten pictures), then it s easy to do with uBlock Origin. After installing the extension if necessary, go to Tools Add-ons Extensions uBlock Origin Preferences My filters, and add www.dailymail.co.uk and www.express.co.uk, each on its own line. (Of course you can easily add more if you like.) Voil : instant tranquility. Incidentally, this also works fine on Android. The fact that it was easy to install a good ad blocker without having to mess about with a rooted device or strange proxy settings was the main reason I switched to Firefox on my phone.

26 September 2017

Colin Watson: A mysterious bug with Twisted plugins

I fixed a bug in Launchpad recently that led me deeper than I expected. Launchpad uses Buildout as its build system for Python packages, and it s served us well for many years. However, we re using 1.7.1, which doesn t support ensuring that packages required using setuptools setup_requires keyword only ever come from the local index URL when one is specified; that s an essential constraint we need to be able to impose so that our build system isn t immediately sensitive to downtime or changes in PyPI. There are various issues/PRs about this in Buildout (e.g. #238), but even if those are fixed it ll almost certainly only be in Buildout v2, and upgrading to that is its own kettle of fish for other reasons. All this is a serious problem for us because newer versions of many of our vital dependencies (Twisted and testtools, to name but two) use setup_requires to pull in pbr, and so we ve been stuck on old versions for some time; this is part of why Launchpad doesn t yet support newer SSH key types, for instance. This situation obviously isn t sustainable. To deal with this, I ve been working for some time on switching to virtualenv and pip. This is harder than you might think: Launchpad is a long-lived and complicated project, and it had quite a number of explicit and implicit dependencies on Buildout s configuration and behaviour. Upgrading our infrastructure from Ubuntu 12.04 to 16.04 has helped a lot (12.04 s baseline virtualenv and pip have some deficiencies that would have required a more complicated bootstrapping procedure). I ve dealt with most of these: for example, I had to reorganise a lot of our helper scripts (1, 2, 3), but there are still a few more things to go. One remaining problem was that our Buildout configuration relied on building several different environments with different Python paths for various things. While this would technically be possible by way of building multiple virtualenvs, this would inflate our build time even further (we re already going to have to cope with some slowdown as a result of using virtualenv, because the build system now has to do a lot more than constructing a glorified link farm to a bunch of cached eggs), and it seems like unnecessary complexity. The obvious thing to do seemed to be to collapse these into a single environment, since there was no obvious reason why it should actually matter if txpkgupload and txlongpoll were carefully kept off the path when running most of Launchpad: so I did that. Then our build system got very sad. Hmm, I thought. To keep our test times somewhat manageable, we run them in parallel across 20 containers, and we randomise the order in which they run to try to shake out test isolation bugs. It s not completely unknown for there to be some oddities resulting from that. So I ran it again. Nope, but slightly differently sad this time. Furthermore, I couldn t reproduce these failures locally no matter how hard I tried. Oh dear. This was obviously not going to be a good day. In fact I spent a while on various different guesswork-based approaches. I found bug 571334 in Ampoule, an AMP-based process pool implementation that we use for some job runners, and proposed a fix for that, but cherry-picking that fix into Launchpad didn t help matters. I tried backing out subsets of my changes and determined that if both txlongpoll and txpkgupload were absent from the Python module path in the context of the tests in question then everything was fine. I tried running strace locally and staring at the output for some time in the hope of enlightenment: that reminded me that the two packages in question install modules under twisted.plugins, which did at least establish a reason they might affect the environment that was more plausible than magic, but nothing much more specific than that. On Friday I was fiddling about with this again and trying to insert some more debugging when I noticed some interesting behaviour around plugin caching. If I caused the txpkgupload plugin to raise an exception when loaded, the Twisted plugin system would remove its dropin.cache (because it was stale) and not create a new one (because there was now no content to put in it). After that, running the relevant tests would fail as I d seen in our buildbot. Aha! This meant that I could also reproduce it by doing an even cleaner build than I d previously tried to do, by removing the cached txpkgupload and txlongpoll eggs and allowing the build system to recreate them. When they were recreated, they didn t contain dropin.cache, instead allowing that to be created on first use. Based on this clue I was able to get to the answer relatively quickly. Ampoule has a specialised bootstrapping sequence for its worker processes that starts by doing this:
from twisted.application import reactors
reactors.installReactor(reactor)
Now, twisted.application.reactors.installReactor calls twisted.plugin.getPlugins, so the very start of this bootstrapping sequence is going to involve loading all plugins found on the module path (I assume it s possible to write a plugin that adds an alternative reactor implementation). If dropin.cache is up to date, then it will just get the information it needs from that; but if it isn t, it will go ahead and import the plugin. If the plugin happens (as Twisted code often does) to run from twisted.internet import reactor at some point while being imported, then that will install the platform s default reactor, and then twisted.application.reactors.installReactor will raise ReactorAlreadyInstalledError. Since Ampoule turns this into an info-level log message for some reason, and the tests in question only passed through error-level messages or higher, this meant that all we could see was that a worker process had exited non-zero but not why. The Twisted documentation recommends generating the plugin cache at build time for other reasons, but we weren t doing that. Fixing that makes everything work again. There are still a few more things needed to get us onto pip, but we re now pretty close. After that we can finally start bringing our dependencies up to date.

29 August 2017

Colin Watson: env chdir

I was recently asked to sort things out so that snap builds on Launchpad could themselves install snaps as build-dependencies. To make this work we need to start doing builds in LXD containers rather than in chroots. As a result I ve been doing some quite extensive refactoring of launchpad-buildd: it previously had the assumption that it was going to use a chroot for everything baked into lots of untested helper shell scripts, and I ve been rewriting those in Python with unit tests and with a single Backend abstraction that isolates the high-level logic from the details of where each build is being performed. This is all interesting work in its own right, but it s not what I want to talk about here. While I was doing all this refactoring, I ran across a couple of methods I wrote a while back which looked something like this:
def chroot(self, args, echo=False):
    """Run a command in the chroot.
    :param args: the command and arguments to run.
    """
    args = set_personality(
        args, self.options.arch, series=self.options.series)
    if echo:
        print("Running in chroot: %s" %
              ' '.join("'%s'" % arg for arg in args))
        sys.stdout.flush()
    subprocess.check_call([
        "/usr/bin/sudo", "/usr/sbin/chroot", self.chroot_path] + args)
def run_build_command(self, args, env=None, echo=False):
    """Run a build command in the chroot.
    This is unpleasant because we need to run it in /build under sudo
    chroot, and there's no way to do this without either a helper
    program in the chroot or unpleasant quoting.  We go for the
    unpleasant quoting.
    :param args: the command and arguments to run.
    :param env: dictionary of additional environment variables to set.
    """
    args = [shell_escape(arg) for arg in args]
    if env:
        args = ["env"] + [
            "%s=%s" % (key, shell_escape(value))
            for key, value in env.items()] + args
    command = "cd /build && %s" % " ".join(args)
    self.chroot(["/bin/sh", "-c", command], echo=echo)
(I ve already replaced the chroot method with a call to Backend.run, but it s easier to see what I m talking about in the original form.) One thing to notice about this code is that it uses several adverbial commands: that is, commands that run another command in a different way. For example, sudo runs another command as another user, while chroot runs another command with a different root directory, and env runs another command with different environment variables set. These commands chain neatly, and they also have the useful property that they take the subsidiary command and its arguments as a list of arguments. coreutils has several other commands that behave this way, and adverbio is another useful example. By contrast, su -c is something you might call a quasi-adverbial command: it does modify the behaviour of another command, but it takes it as a single argument which it then passes to sh -c. Every time you have something that s passed to a shell like this, you need a corresponding layer of shell quoting to escape any shell metacharacters that should be interpreted literally. This is often cumbersome and is easy to get wrong. My Python implementation is as follows, and I wouldn t be totally surprised to discover that it contained a bug:
import re
non_meta_re = re.compile(r'^[a-zA-Z0-9+,./:=@_-]+$')
def shell_escape(arg):
    if non_meta_re.match(arg):
        return arg
    else:
        return "'%s'" % arg.replace("'", "'\\''")
Python >= 3.3 has shlex.quote, which is an improvement and we should probably use that instead, but it s still another thing to forget to call. This is why process-spawning libraries such as Python s subprocess, Perl s system and open, and my own libpipeline for C encourage programmers to use a list syntax and to avoid involving the shell entirely wherever possible. One thing that the standard Unix tools don t let you do in an adverbial way is to change your working directory, and I ve run into this annoying limitation several times. This means that it s difficult to chain that operation together with other adverbs, for example to run a command in a particular working directory inside a chroot. The workaround I used above was to invoke a shell that runs cd /build && ..., but that s another command that s only quasi-adverbial, since the extra shell means an extra layer of shell quoting. (Ian Jackson rightly observes that you can in fact write the necessary adverb as something like sh -ec 'cd "$1"; shift; exec "$@"' chdir. I think that s a bit uglier than I ideally want to use in production code, but you might reasonably think that it s worth it to avoid the extra layer of shell quoting.) I therefore decided that this was a feature that belonged in coreutils, and after a bit of mailing list discussion we felt it was best implemented as a new option to env(1). I sent a patch for this which has been accepted. This means that we have a new composable adverb, env --chdir=NEWDIR, which will allow the run_build_command method above to be rewritten as something like this:
def run_build_command(self, args, env=None, echo=False):
    """Run a build command in the chroot.
    :param args: the command and arguments to run.
    :param env: dictionary of additional environment variables to set.
    """
    env_args = ["env", "--chdir=/build"]
    if env:
        for key, value in env.items():
            env_args.append("%s=%s" % (key, value))
    self.chroot(env_args + args, echo=echo)
The env --chdir option will be in coreutils 8.28. We won t be able to use it in launchpad-buildd until that s available in all Ubuntu series we might want to build for, so in this particular application that s going to take a few years; but other applications may well be able to make use of it sooner.

Next.