Search Results: "Colin Watson"

4 March 2024

Colin Watson: Free software activity in January/February 2024

Two months into my new gig and it s going great! Tracking my time has taken a bit of getting used to, but having something that amounts to a queryable database of everything I ve done has also allowed some helpful introspection. Freexian sponsors up to 20% of my time on Debian tasks of my choice. In fact I ve been spending the bulk of my time on debusine which is itself intended to accelerate work on Debian, but more details on that later. While I contribute to Freexian s summaries now, I ve also decided to start writing monthly posts about my free software activity as many others do, to get into some more detail. January 2024 February 2024

11 February 2024

Freexian Collaborators: Debian Contributions: Upcoming Improvements to Salsa CI, /usr-move, and more! (by Utkarsh Gupta)

Contributing to Debian is part of Freexian s mission. This article covers the latest achievements of Freexian and their collaborators. All of this is made possible by organizations subscribing to our Long Term Support contracts and consulting services.

Upcoming Improvements to Salsa CI, by Santiago Ruano Rinc n Santiago started picking up the work made by Outreachy Intern, Enock Kashada (a big thanks to him!), to solve some long-standing issues in Salsa CI. Currently, the first job in a Salsa CI pipeline is the extract-source job, used to produce a debianize source tree of the project. This job was introduced to make it possible to build the projects on different architectures, on the subsequent build jobs. However, that extract-source approach is sub-optimal: not only it increases the execution time of the pipeline by some minutes, but also projects whose source tree is too large are not able to use the pipeline. The debianize source tree is passed as an artifact to the build jobs, and for those large projects, the size of their source tree exceeds the Salsa s limits. This is specific issue is documented as issue #195, and the proposed solution is to get rid of the extract-source job, relying on sbuild in the very build job (see issue #296). Switching to sbuild would also help to improve the build source job, solving issues such as #187 and #298. The current work-in-progress is very preliminary, but it has already been possible to run the build (amd64), build-i386 and build-source job using sbuild with the unshare mode. The image on the right shows a pipeline that builds grep. All the test jobs use the artifacts of the new build job. There is a lot of remaining work, mainly making the integration with ccache work. This change could break some things, it will also be important to test how the new pipeline works with complex projects. Also, thanks to Emmanuel Arias, we are proposing a Google Summer of Code 2024 project to improve Salsa CI. As part of the ongoing work in preparation for the GSoC 2024 project, Santiago has proposed a merge request to make more efficient how contributors can test their changes on the Salsa CI pipeline.

/usr-move, by Helmut Grohne In January, we sent most of the moving patches for the set of packages involved with debootstrap. Notably missing is glibc, which turns out harder than anticipated via dumat, because it has Conflicts between different architectures, which dumat does not analyze. Patches for diversion mitigations have been updated in a way to not exhibit any loss anymore. The main change here is that packages which are being diverted now support the diverting packages in transitioning their diversions. We also supported a few packages with non-trivial changes such as dumat has been enhanced to better support derivatives such as Ubuntu.

Miscellaneous contributions
  • Python 3.12 migration trundles on. Stefano Rivera helped port several new packages to support 3.12.
  • Stefano updated the Sphinx configuration of DebConf Video Team s documentation, which was broken by Sphinx 7.
  • Stefano published the videos from the Cambridge MiniDebConf to YouTube and PeerTube.
  • DebConf 24 planning has begun, and Stefano & Utkarsh have started work on this.
  • Utkarsh re-sponsored the upload of golang-github-prometheus-community-pgbouncer-exporter for Lena.
  • Colin Watson added Incus support to autopkgtest.
  • Colin discovered Perl::Critic and used it to tidy up some poor practices in several of his packages, including debconf.
  • Colin did some overdue debconf maintenance, mainly around tidying up error message handling in several places (1, 2, 3).
  • Colin figured out how to update the mirror size documentation in debmirror, last updated in 2010. It should now be much easier to keep it up to date regularly.
  • Colin issued a man-db buster update to clean up some irritations due to strict sandboxing.
  • Thorsten Alteholz adopted two more packages, magicfilter and ifhp, for the debian-printing team. Those packages are the last ones of the latest round of adoptions to preserve the old printing protocol within Debian. If you know of other packages that should be retained, please don t hesitate to contact Thorsten.
  • Enrico participated in /usr-merge discussions with Helmut.
  • Helmut sent patches for 16 cross build failures.
  • Helmut supported Matthias Klose (not affiliated with Freexian) with adding -for-host support to gcc-defaults.
  • Helmut uploaded dput-ng enabling dcut migrate and merging two MRs of Ben Hutchings.
  • Santiago took part in the discussions relating to the EU Cyber Resilience Act (CRA) and the Debian public statement that was published last year. He participated in a meeting with Members of the European Parliament (MEPs), Marcel Kolaja and Karen Melchior, and their teams to clarify some points about the impact of the CRA and Debian and downstream projects, and the improvements in the last version of the proposed regulation.

17 January 2024

Colin Watson: Task management

Now that I m freelancing, I need to actually track my time, which is something I ve had the luxury of not having to do before. That meant something of a rethink of the way I ve been keeping track of my to-do list. Up to now that was a combination of things like the bug lists for the projects I m working on at the moment, whatever task tracking system Canonical was using at the moment (Jira when I left), and a giant flat text file in which I recorded logbook-style notes of what I d done each day plus a few extra notes at the bottom to remind myself of particularly urgent tasks. I could have started manually adding times to each logbook entry, but ugh, let s not. In general, I had the following goals (which were a bit reminiscent of my address book): I didn t do an elaborate evaluation of multiple options, because I m not trying to come up with the best possible solution for a client here. Also, there are a bazillion to-do list trackers out there and if I tried to evaluate them all I d never do anything else. I just wanted something that works well enough for me. Since it came up on Mastodon: a bunch of people swear by Org mode, which I know can do at least some of this sort of thing. However, I don t use Emacs and don t plan to use Emacs. nvim-orgmode does have some support for time tracking, but when I ve tried vim-based versions of Org mode in the past I ve found they haven t really fitted my brain very well. Taskwarrior and Timewarrior One of the other Freexian collaborators mentioned Taskwarrior and Timewarrior, so I had a look at those. The basic idea of Taskwarrior is that you have a task command that tracks each task as a blob of JSON and provides subcommands to let you add, modify, and remove tasks with a minimum of friction. task add adds a task, and you can add metadata like project:Personal (I always make sure every task has a project, for ease of filtering). Just running task shows you a task list sorted by Taskwarrior s idea of urgency, with an ID for each task, and there are various other reports with different filtering and verbosity. task <id> annotate lets you attach more information to a task. task <id> done marks it as done. So far so good, so a redacted version of my to-do list looks like this:
$ task ls
ID A Project     Tags                 Description
17   Freexian                         Add Incus support to autopkgtest [2]
 7   Columbiform                      Figure out Lloyds online banking [1]
 2   Debian                           Fix troffcvt for groff 1.23.0 [1]
11   Personal                         Replace living room curtain rail
Once I got comfortable with it, this was already a big improvement. I haven t bothered to learn all the filtering gadgets yet, but it was easy enough to see that I could do something like task all project:Personal and it d show me both pending and completed tasks in that project, and that all the data was stored in ~/.task - though I have to say that there are enough reporting bells and whistles that I haven t needed to poke around manually. In combination with the regular backups that I do anyway (you do too, right?), this gave me enough confidence to abandon my previous text-file logbook approach. Next was time tracking. Timewarrior integrates with Taskwarrior, albeit in an only semi-packaged way, and it was easy enough to set that up. Now I can do:
$ task 25 start
Starting task 00a9516f 'Write blog post about task tracking'.
Started 1 task.
Note: '"Write blog post about task tracking"' is a new tag.
Tracking Columbiform "Write blog post about task tracking"
  Started 2024-01-10T11:28:38
  Current                  38
  Total               0:00:00
You have more urgent tasks.
Project 'Columbiform' is 25% complete (3 of 4 tasks remaining).
When I stop work on something, I do task active to find the ID, then task <id> stop. Timewarrior does the tedious stopwatch business for me, and I can manually enter times if I forget to start/stop a task. Then the really useful bit: I can do something like timew summary :month <name-of-client> and it tells me how much to bill that client for this month. Perfect. I also started using VIT to simplify the day-to-day flow a little, which means I m normally just using one or two keystrokes rather than typing longer commands. That isn t really necessary from my point of view, but it does save some time. Android integration I left Android integration for a bit later since it wasn t essential. When I got round to it, I have to say that it felt a bit clumsy, but it did eventually work. The first step was to set up a taskserver. Most of the setup procedure was OK, but I wanted to use Let s Encrypt to minimize the amount of messing around with CAs I had to do. Getting this to work involved hitting things with sticks a bit, and there s still a local CA involved for client certificates. What I ended up with was a certbot setup with the webroot authenticator and a custom deploy hook as follows (with cert_name replaced by a DNS name in my house domain):
#! /bin/sh
set -eu
for domain in $RENEWED_DOMAINS; do
    case "$domain" in
$found   exit 0
install -m 644 "/etc/letsencrypt/live/$cert_name/fullchain.pem" \
install -m 640 -g Debian-taskd "/etc/letsencrypt/live/$cert_name/privkey.pem" \
systemctl restart taskd.service
I could then set this in /etc/taskd/config (server.crl.pem and ca.cert.pem were generated using the documented taskserver setup procedure):
Then I could set on my laptop to /usr/share/ca-certificates/mozilla/ISRG_Root_X1.crt and otherwise follow the client setup instructions, run task sync init to get things started, and then task sync every so often to sync changes between my laptop and the taskserver. I used TaskWarrior Mobile as the client. I have to say I wouldn t want to use that client as my primary task tracking interface: the setup procedure is clunky even beyond the necessity of copying a client certificate around, it expects you to give it a .taskrc rather than having a proper settings interface for that, and it only seems to let you add a task if you specify a due date for it. It also lacks Timewarrior integration, so I can only really use it when I don t care about time tracking, e.g. personal tasks. But that s really all I need, so it meets my minimum requirements. Next? Considering this is literally the first thing I tried, I have to say I m pretty happy with it. There are a bunch of optional extras I haven t tried yet, but in general it kind of has the vim nature for me: if I need something it s very likely to exist or easy enough to build, but the features I don t use don t get in my way. I wouldn t recommend any of this to somebody who didn t already spend most of their time in a terminal - but I do. I m glad people have gone to all the effort to build this so I didn t have to.

15 January 2024

Colin Watson: OpenUK New Year s Honours

Apparently I got an honour from OpenUK. There are a bunch of people I know on that list. Chris Lamb and Mark Brown are familiar names from Debian. Colin King and Jonathan Riddell are people I know from past work in Ubuntu. I ve admired David MacIver s work on Hypothesis and Richard Hughes work on firmware updates from afar. And there are a bunch of other excellent projects represented there: OpenStreetMap, Textualize, and my alma mater of Cambridge to name but a few. My friend Stuart Langridge wrote about being on a similar list a few years ago, and I can t do much better than to echo it: in particular he wrote about the way the open source development community is often at best unwelcoming to people who don t look like Stuart and I do. I can t tell a whole lot about demographic distribution just by looking at a list of names, but while these honours still seem to be skewed somewhat male, I m fairly sure they re doing a lot better in terms of gender balance than my home project of Debian is, for one. I hope this is a sign of improvement for the future, and I ll do what I can to pay it forward.

10 January 2024

Colin Watson: Going freelance

I ve mentioned this in a couple of other places, but I realized I never got round to posting about it on my own blog rather than on other people s services. How remiss of me. Anyway: after much soul-searching, I decided a few months ago that it was time for me to move on from Canonical and the Launchpad team there. Nearly 20 years is a long time to spend at any company, and although there are a bunch of people I ll miss, Launchpad is in a reasonable state where I can let other people have a turn. I m now in business for myself as a freelance developer! My new company is Columbiform, and I m focusing on Debian packaging and custom Python development. My services page has some self-promotion on the sorts of things I can do. My first gig, and the one that made it viable to make this jump, is at Freexian where I m helping with an exciting infrastructure project that we hope will start making Debian developers lives easier in the near future. This is likely to take up most of my time at least through to the end of 2024, but I may have some spare cycles. Drop me a line if you have something where you think I could be a good fit, and we can have a talk about it.

11 November 2022

Reproducible Builds: Reproducible Builds in October 2022

Welcome to the Reproducible Builds report for October 2022! In these reports we attempt to outline the most important things that we have been up to over the past month. As ever, if you are interested in contributing to the project, please visit our Contribute page on our website.

Our in-person summit this year was held in the past few days in Venice, Italy. Activity and news from the summit will therefore be covered in next month s report!
A new article related to reproducible builds was recently published in the 2023 IEEE Symposium on Security and Privacy. Titled Taxonomy of Attacks on Open-Source Software Supply Chains and authored by Piergiorgio Ladisa, Henrik Plate, Matias Martinez and Olivier Barais, their paper:
[ ] proposes a general taxonomy for attacks on opensource supply chains, independent of specific programming languages or ecosystems, and covering all supply chain stages from code contributions to package distribution.
Taking the form of an attack tree, the paper covers 107 unique vectors linked to 94 real world supply-chain incidents which is then mapped to 33 mitigating safeguards including, of course, reproducible builds:
Reproducible Builds received a very high utility rating (5) from 10 participants (58.8%), but also a high-cost rating (4 or 5) from 12 (70.6%). One expert commented that a reproducible build like used by Solarwinds now, is a good measure against tampering with a single build system and another claimed this is going to be the single, biggest barrier .

It was noticed this month that Solarwinds published a whitepaper back in December 2021 in order to:
[ ] illustrate a concerning new reality for the software industry and illuminates the increasingly sophisticated threats made by outside nation-states to the supply chains and infrastructure on which we all rely.
The 12-month anniversary of the 2020 Solarwinds attack (which SolarWinds Worldwide LLC itself calls the SUNBURST attack) was, of course, the likely impetus for publication.
Whilst collaborating on making the Cyrus IMAP server reproducible, Ellie Timoney asked why the Reproducible Builds testing framework uses two remarkably distinctive build paths when attempting to flush out builds that vary on the absolute system path in which they were built. In the case of the Cyrus IMAP server, these happened to be: Asked why they vary in three different ways, Chris Lamb listed in detail the motivation behind to each difference.
On our mailing list this month:
The Reproducible Builds project is delighted to welcome openEuler to the Involved projects page [ ]. openEuler is Linux distribution developed by Huawei, a counterpart to it s more commercially-oriented EulerOS.

Debian Colin Watson wrote about his experience towards making the databases generated by the man-db UNIX manual page indexing tool:
One of the people working on [reproducible builds] noticed that man-db s database files were an obstacle to [reproducibility]: in particular, the exact contents of the database seemed to depend on the order in which files were scanned when building it. The reporter proposed solving this by processing files in sorted order, but I wasn t keen on that approach: firstly because it would mean we could no longer process files in an order that makes it more efficient to read them all from disk (still valuable on rotational disks), but mostly because the differences seemed to point to other bugs.
Colin goes on to describe his approach to solving the problem, including fixing various fits of internal caching, and he ends his post with None of this is particularly glamorous work, but it paid off .
Vagrant Cascadian announced on our mailing list another online sprint to help clear the huge backlog of reproducible builds patches submitted by performing NMUs (Non-Maintainer Uploads). The first such sprint took place on September 22nd, but another was held on October 6th, and another small one on October 20th. This resulted in the following progress:
41 reviews of Debian packages were added, 62 were updated and 12 were removed this month adding to our knowledge about identified issues. A number of issue types were updated too. [1][ ]
Lastly, Luca Boccassi submitted a patch to debhelper, a set of tools used in the packaging of the majority of Debian packages. The patch addressed an issue in the dh_installsysusers utility so that the postinst post-installation script that debhelper generates the same data regardless of the underlying filesystem ordering.

Other distributions F-Droid is a community-run app store that provides free software applications for Android phones. This month, F-Droid changed their documentation and guidance to now explicitly encourage RB for new apps [ ][ ], and FC Stegerman created an extremely in-depth issue on GitLab concerning the APK signing block. You can read more about F-Droid s approach to reproducibility in our July 2022 interview with Hans-Christoph Steiner of the F-Droid Project. In openSUSE, Bernhard M. Wiedemann published his usual openSUSE monthly report.

Upstream patches The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including:

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 224 and 225 to Debian:
  • Add support for comparing the text content of HTML files using html2text. [ ]
  • Add support for detecting ordering-only differences in XML files. [ ]
  • Fix an issue with detecting ordering differences. [ ]
  • Use the capitalised version of Ordering consistently everywhere in output. [ ]
  • Add support for displaying font metadata using ttx(1) from the fonttools suite. [ ]
  • Testsuite improvements:
    • Temporarily allow the stable-po pipeline to fail in the CI. [ ]
    • Rename the order1.diff test fixture to json_expected_ordering_diff. [ ]
    • Tidy the JSON tests. [ ]
    • Use assert_diff over get_data and an manual assert within the XML tests. [ ]
    • Drop the ALLOWED_TEST_FILES test; it was mostly just annoying. [ ]
    • Tidy the tests/ file. [ ]
Chris Lamb also added a link to diffoscope s OpenBSD packaging on the homepage [ ] and Mattia Rizzolo fix an test failure that was occurring under with LLVM 15 [ ].

Testing framework The Reproducible Builds project operates a comprehensive testing framework at in order to check packages and other artifacts for reproducibility. In October, the following changes were made by Holger Levsen:
  • Run the logparse tool to analyse results on the Debian Edu build logs. [ ]
  • Install btop(1) on all nodes running Debian. [ ]
  • Switch Arch Linux from using SHA1 to SHA256. [ ]
  • When checking Debian debstrap jobs, correctly log the tool usage. [ ]
  • Cleanup more task-related temporary directory names when testing Debian packages. [ ][ ]
  • Use the cdebootstrap-static binary for the 2nd runs of the cdebootstrap tests. [ ]
  • Drop a workaround when testing OpenWrt and coreboot as the issue in diffoscope has now been fixed. [ ]
  • Turn on an rm(1) warning into an info -level message. [ ]
  • Special case the osuosl168 node for running Debian bookworm already. [ ][ ]
  • Use the new non-free-firmware suite on the o168 node. [ ]
In addition, Mattia Rizzolo made the following changes:
  • Ensure that 2nd build has a merged /usr. [ ]
  • Only reconfigure the usrmerge package on Debian bookworm and above. [ ]
  • Fix bc(1) syntax in the computation of the percentage of unreproducible packages in the dashboard. [ ][ ][ ]
  • In the index_suite_ pages, order the package status to be the same order of the menu. [ ]
  • Pass the --distribution parameter to the pbuilder utility. [ ]
Finally, Roland Clobus continued his work on testing Live Debian images. In particular, he extended the maintenance script to warn when workspace directories cannot be deleted. [ ]
If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. However, you can get in touch with us via:

16 October 2022

Colin Watson: Reproducible man-db databases

I ve released man-db 2.11.0 (announcement, NEWS), and uploaded it to Debian unstable. The biggest chunk of work here was fixing some extremely long-standing issues with how the database is built. Despite being in the package name, man-db s database is much less important than it used to be: most uses of man(1) haven t required it in a long time, and both hardware and software improvements mean that even some searches can be done by brute force without needing prior indexing. However, the database is still needed for the whatis(1) and apropos(1) commands. The database has a simple format - no relational structure here, it s just a simple key-value database using old-fashioned DBM-like interfaces and composing a few fields to form values - but there are a number of subtleties involved. The issues tend to amount to this: what does a manual page name mean? At first glance it might seem simple, because you have file names that look something like /usr/share/man/man1/ls.1.gz and that s obviously ls(1). Some pages are symlinks to other pages (which we track separately because it makes it easier to figure out which entries to update when the contents of the file system change), and sometimes multiple pages are even hard links to the same file. The real complications come with whatis references . Pages can list a bunch of names in their NAME section, and the historical expectation is that it should be possible to use those names as arguments to man(1) even if they don t also appear in the file system (although Debian policy has deprecated relying on this for some time). Not only does that mean that man(1) sometimes needs to consult the database, but it also means that the database is inherently more complicated, since a page might list something in its NAME section that conflicts with an actual file name in the file system, and now you need a priority system to resolve ambiguities. There are some other possible causes of ambiguity as well. The people working on reproducible builds in Debian branched out to the related challenge of reproducible installations some time ago: can you take a collection of packages, bootstrap a file system image from them, and reproduce that exact same image somewhere else? This is useful for the same sorts of reasons that reproducible builds are useful: it lets you verify that an image is built from the components it s supposed to be built from, and doesn t contain any other skulduggery by accident or design. One of the people working on this noticed that man-db s database files were an obstacle to that: in particular, the exact contents of the database seemed to depend on the order in which files were scanned when building it. The reporter proposed solving this by processing files in sorted order, but I wasn t keen on that approach: firstly because it would mean we could no longer process files in an order that makes it more efficient to read them all from disk (still valuable on rotational disks), but mostly because the differences seemed to point to other bugs. Having understood this, there then followed several late nights of very fiddly work on the details of how the database is maintained. None of this was conceptually difficult: it mainly amounted to ensuring that we maintain a consistent well-order for different entries that we might want to insert for a given database key, and that we consider the same names for insertion regardless of the order in which we encounter files. As usual, the tricky bit is making sure that we have the right data structures to support this. man-db is written in C which is not very well-supplied with built-in data structures, and originally much of the code was written in a style that tried to minimize memory allocations; this came at the cost of ownership and lifetime often being rather unclear, and it was often difficult to make changes without causing leaks or double-frees. Over the years I ve been gradually introducing better encapsulation to make things easier to follow, and I had to do another round of that here. There were also some problems with caching being done at slightly the wrong layer: we need to make use of a trace of the chain of links followed to resolve a page to its ultimate source file, but we were incorrectly caching that trace and reusing it for any link to the same file, with incorrect results in many cases. Oh, and after doing all that I found that the on-disk representation of a GDBM database is insertion-order-dependent, so I ended up having to manually reorganize the database at the end by reading it all in and writing it all back out in sorted order, which feels really weird to me coming from spending most of my time with PostgreSQL these days. Fortunately the database is small so this takes negligible time. None of this is particularly glamorous work, but it paid off:
# export SOURCE_DATE_EPOCH="$(date +%s)"
# mkdir emptydir disorder
# disorderfs --multi-user=yes --shuffle-dirents=yes --reverse-dirents=no emptydir disorder
# export TMPDIR="$(pwd)/disorder"
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out1.tar
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out2.tar
# cmp out1.tar out2.tar
# echo $?

18 February 2022

Colin Watson: Launchpad now supports SSH Ed25519 keys and RSA SHA-2 signatures

As of 2022-02-16, Launchpad supports a couple of features on its SSH endpoints (,,, and that it previously didn t: Ed25519 public keys (a well-regarded format, supported by OpenSSH since 6.5 in 2014) and signatures with existing RSA public keys using SHA-2 rather than SHA-1 (supported by OpenSSH since 7.2 in 2016). I m hesitant to call these features new , since they ve been around for a long time elsewhere, and people might quite reasonably ask why it s taken us so long. The problem has always been that Launchpad can t really use a normal SSH server such as OpenSSH because it needs features that aren t practical to implement that way, such as virtual filesystems and dynamic user key authorization against the Launchpad database. Instead, we use Twisted Conch, which is a very extensible Python SSH implementation that has generally served us well. The downside is that, because it s an independent implementation and one that occupies a relatively small niche, it often lags behind in terms of newer protocol features. Catching up to this point has been something we ve been working on for around five years, although it s taken a painfully long time for a variety of reasons which I thought some people might find interesting to go into, at least people who have the patience for details of the SSH protocol. Many of the delays were my own responsibility, although realistically we probably couldn t have added Ed25519 support before OpenSSL/cryptography work that landed in 2019. Phew! Thanks to everyone who works on Twisted, cryptography, and OpenSSL - it s been really useful to be able to build on solid lower-level cryptographic primitives - and to those who helped with code review.

2 August 2021

Colin Watson: Launchpad now runs on Python 3!

After a very long porting journey, Launchpad is finally running on Python 3 across all of our systems. I wanted to take a bit of time to reflect on why my emotional responses to this port differ so much from those of some others who ve done large ports, such as the Mercurial maintainers. It s hard to deny that we ve had to burn a lot of time on this, which I m sure has had an opportunity cost, and from one point of view it s essentially running to stand still: there is no single compelling feature that we get solely by porting to Python 3, although it s clearly a prerequisite for tidying up old compatibility code and being able to use modern language facilities in the future. And yet, on the whole, I found this a rewarding project and enjoyed doing it. Some of this may be because by inclination I m a maintenance programmer and actually enjoy this sort of thing. My default view tends to be that software version upgrades may be a pain but it s much better to get that pain over with as soon as you can rather than trying to hold back the tide; you can certainly get involved and try to shape where things end up, but rightly or wrongly I can t think of many cases when a righteously indignant user base managed to arrange for the old version to be maintained in perpetuity so that they never had to deal with the new thing (OK, maybe Perl 5 counts here). I think a more compelling difference between Launchpad and Mercurial, though, may be that very few other people really had a vested interest in what Python version Launchpad happened to be running, because it s all server-side code (aside from some client libraries such as launchpadlib, which were ported years ago). As such, we weren t trying to do this with the internet having Strong Opinions at us. We were doing this because it was obviously the only long-term-maintainable path forward, and in more recent times because some of our library dependencies were starting to drop support for Python 2 and so it was obviously going to become a practical problem for us sooner or later; but if we d just stayed on Python 2 forever then fundamentally hardly anyone else would really have cared directly, only maybe about some indirect consequences of that. I don t follow Mercurial development so I may be entirely off-base, but if other people were yelling at me about how late my project was to finish its port, that in itself would make me feel more negatively about the project even if I thought it was a good idea. Having most of the pressure come from ourselves rather than from outside meant that wasn t an issue for us. I m somewhat inclined to think of the process as an extreme version of paying down technical debt. Moving from Python 2.7 to 3.5, as we just did, means skipping over multiple language versions in one go, and if similar changes had been made more gradually it would probably have felt a lot more like the typical dependency update treadmill. I appreciate why not everyone might want to think of it this way: maybe this is just my own rationalization. Reflections on porting to Python 3 I m not going to defend the Python 3 migration process; it was pretty rough in a lot of ways. Nor am I going to spend much effort relitigating it here, as it s already been done to death elsewhere, and as I understand it the core Python developers have got the message loud and clear by now. At a bare minimum, a lot of valuable time was lost early in Python 3 s lifetime hanging on to flag-day-type porting strategies that were impractical for large projects, when it should have been providing for bilingual strategies (code that runs in both Python 2 and 3 for a transitional period) which is where most libraries and most large migrations ended up in practice. For instance, the early advice to library maintainers to maintain two parallel versions or perhaps translate dynamically with 2to3 was entirely impractical in most non-trivial cases and wasn t what most people ended up doing, and yet the idea that 2to3 is all you need still floats around Stack Overflow and the like as a result. (These days, I would probably point people towards something more like Eevee s porting FAQ as somewhere to start.) There are various fairly straightforward things that people often suggest could have been done to smooth the path, and I largely agree: not removing the u'' string prefix only to put it back in 3.3, fewer gratuitous compatibility breaks in the name of tidiness, and so on. But if I had a time machine, the number one thing I would ask to have been done differently would be introducing type annotations in Python 2 before Python 3 branched off. It s true that it s technically possible to do type annotations in Python 2, but the fact that it s a different syntax that would have to be fixed later is offputting, and in practice it wasn t widely used in Python 2 code. To make a significant difference to the ease of porting, annotations would need to have been introduced early enough that lots of Python 2 library code used them so that porting code didn t have to be quite so much of an exercise of manually figuring out the exact nature of string types from context. Launchpad is a complex piece of software that interacts with multiple domains: for example, it deals with a database, HTTP, web page rendering, Debian-format archive publishing, and multiple revision control systems, and there s often overlap between domains. Each of these tends to imply different kinds of string handling. Web page rendering is normally done mainly in Unicode, converting to bytes as late as possible; revision control systems normally want to spend most of their time working with bytes, although the exact details vary; HTTP is of course bytes on the wire, but Python s WSGI interface has some string type subtleties. In practice I found myself thinking about at least four string-like types (that is, things that in a language with a stricter type system I might well want to define as distinct types and restrict conversion between them): bytes, text, ordinary native strings (str in either language, encoded to UTF-8 in Python 2), and native strings with WSGI s encoding rules. Some of these are emergent properties of writing in the intersection of Python 2 and 3, which is effectively a specialized language of its own without coherent official documentation whose users must intuit its behaviour by comparing multiple sources of information, or by referring to unofficial porting guides: not a very satisfactory situation. Fortunately much of the complexity collapses once it becomes possible to write solely in Python 3. Some of the difficulties we ran into are not ones that are typically thought of as Python 2-to-3 porting issues, because they were changed later in Python 3 s development process. For instance, the email module was substantially improved in around the 3.2/3.3 timeframe to handle Python 3 s bytes/text model more correctly, and since Launchpad sends quite a few different kinds of email messages and has some quite picky tests for exactly what it emits, this entailed a lot of work in our email sending code and in our test suite to account for that. (It took me a while to work out whether we should be treating raw email messages as bytes or as text; bytes turned out to work best.) 3.4 made some tweaks to the implementation of quoted-printable encoding that broke a number of our tests in ways that took some effort to fix, because the tests needed to work on both 2.7 and 3.5. The list goes on. I got quite proficient at digging through Python s git history to figure out when and why some particular bit of behaviour had changed. One of the thorniest problems was parsing HTTP form data. We mainly rely on zope.publisher for this, which in turn relied on cgi.FieldStorage; but cgi.FieldStorage is badly broken in some situations on Python 3. Even if that bug were fixed in a more recent version of Python, we can t easily use anything newer than 3.5 for the first stage of our port due to the version of the base OS we re currently running, so it wouldn t help much. In the end I fixed some minor issues in the multipart module (and was kindly given co-maintenance of it) and converted zope.publisher to use it. Although this took a while to sort out, it seems to have gone very well. A couple of other interesting late-arriving issues were around pickle. For most things we normally prefer safer formats such as JSON, but there are a few cases where we use pickle, particularly for our session databases. One of my colleagues pointed out that I needed to remember to tell pickle to stick to protocol 2, so that we d be able to switch back and forward between Python 2 and 3 for a while; quite right, and we later ran into a similar problem with marshal too. A more surprising problem was that datetime.datetime objects pickled on Python 2 require special care when unpickling on Python 3; rather than the approach that ended up being implemented and documented for Python 3.6, though, I preferred a custom unpickler, both so that things would work on Python 3.5 and so that I wouldn t have to risk affecting the decoding of other pickled strings in the session database. General lessons Writing this over a year after Python 2 s end-of-life date, and certainly nowhere near the leading edge of Python 3 porting work, it s perhaps more useful to look at this in terms of the lessons it has for other large technical debt projects. I mentioned in my previous article that I used the approach of an enormous and frequently-rebased git branch as a working area for the port, committing often and sometimes combining and extracting commits for review once they seemed to be ready. A port of this scale would have been entirely intractable without a tool of similar power to git rebase, so I m very glad that we finished migrating to git in 2019. I relied on this right up to the end of the port, and it also allowed for quick assessments of how much more there was to land. git worktree was also helpful, in that I could easily maintain working trees built for each of Python 2 and 3 for comparison. As is usual for most multi-developer projects, all changes to Launchpad need to go through code review, although we sometimes make exceptions for very simple and obvious changes that can be self-reviewed. Since I knew from the outset that this was going to generate a lot of changes for review, I therefore structured my work from the outset to try to make it as easy as possible for my colleagues to review it. This generally involved keeping most changes to a somewhat manageable size of 800 lines or less (although this wasn t always possible), and arranging commits mainly according to the kind of change they made rather than their location. For example, when I needed to fix issues with / in Python 3 being true division rather than floor division, I did so in one commit across the various places where it mattered and took care not to mix it with other unrelated changes. This is good practice for nearly any kind of development, but it was especially important here since it allowed reviewers to consider a clear explanation of what I was doing in the commit message and then skim-read the rest of it much more quickly. It was vital to keep the codebase in a working state at all times, and deploy to production reasonably often: this way if something went wrong the amount of code we had to debug to figure out what had happened was always tractable. (Although I can t seem to find it now to link to it, I saw an account a while back of a company that had taken a flag-day approach instead with a large codebase. It seemed to work for them, but I m certain we couldn t have made it work for Launchpad.) I can t speak too highly of Launchpad s test suite, much of which originated before my time. Without a great deal of extensive coverage of all sorts of interesting edge cases at both the unit and functional level, and a corresponding culture of maintaining that test suite well when making new changes, it would have been impossible to be anything like as confident of the port as we were. As part of the porting work, we split out a couple of substantial chunks of the Launchpad codebase that could easily be decoupled from the core: its Mailman integration and its code import worker. Both of these had substantial dependencies with complex requirements for porting to Python 3, and arranging to be able to do these separately on their own schedule was absolutely worth it. Like disentangling balls of wool, any opportunity you can take to make things less tightly-coupled is probably going to make it easier to disentangle the rest. (I can see a tractable way forward to porting the code import worker, so we may well get that done soon. Our Mailman integration will need to be rewritten, though, since it currently depends on the Python-2-only Mailman 2, and Mailman 3 has a different architecture.) Python lessons Our database layer was already in pretty good shape for a port, since at least the modern bits of its table modelling interface were already strict about using Unicode for text columns. If you have any kind of pervasive low-level framework like this, then making it be pedantic at you in advance of a Python 3 port will probably incur much less swearing in the long run, as you won t be trying to deal with quite so many bytes/text issues at the same time as everything else. Early in our port, we established a standard set of __future__ imports and started incrementally converting files over to them, mainly because we weren t yet sure what else to do and it seemed likely to be helpful. absolute_import was definitely reasonable (and not often a problem in our code), and print_function was annoying but necessary. In hindsight I m not sure about unicode_literals, though. For files that only deal with bytes and text it was reasonable enough, but as I mentioned above there were also a number of cases where we needed literals of the language s native str type, i.e. bytes in Python 2 and text in Python 3: this was particularly noticeable in WSGI contexts, but also cropped up in some other surprising places. We generally either omitted unicode_literals or used six.ensure_str in such cases, but it was definitely a bit awkward and maybe I should have listened more to people telling me it might be a bad idea. A lot of Launchpad s early tests used doctest, mainly in the style where you have text files that interleave narrative commentary with examples. The development team later reached consensus that this was best avoided in most cases, but by then there were far too many doctests to conveniently rewrite in some other form. Porting doctests to Python 3 is really annoying. You run into all the little changes in how objects are represented as text (particularly u'...' versus '...', but plenty of other cases as well); you have next to no tools to do anything useful like skipping individual bits of a doctest that don t apply; using __future__ imports requires the rather obscure approach of adding the relevant names to the doctest s globals in the relevant DocFileSuite or DocTestSuite; dealing with many exception tracebacks requires something like zope.testing.renormalizing; and whatever code refactoring tools you re using probably don t work properly. Basically, don t have done that. It did all turn out to be tractable for us in the end, and I managed to avoid using much in the way of fragile doctest extensions aside from the aforementioned zope.testing.renormalizing, but it was not an enjoyable experience. Regressions I know of nine regressions that reached Launchpad s production systems as a result of this porting work; of course there were various other regressions caught by CI or in manual testing. (Considering the size of this project, I count it as a resounding success that there were only nine production issues, and that for the most part we were able to fix them quickly.) Equality testing of removed database objects One of the things we had to do while porting to Python 3 was to implement the __eq__, __ne__, and __hash__ special methods for all our database objects. This was quite conceptually fiddly, because doing this requires knowing each object s primary key, and that may not yet be available if we ve created an object in Python but not yet flushed the actual INSERT statement to the database (most of our primary keys are auto-incrementing sequences). We thus had to take care to flush pending SQL statements in such cases in order to ensure that we know the primary keys. However, it s possible to have a problem at the other end of the object lifecycle: that is, a Python object might still be reachable in memory even though the underlying row has been DELETEd from the database. In most cases we don t keep removed objects around for obvious reasons, but it can happen in caching code, and buildd-manager crashed as a result (in fact while it was still running on Python 2). We had to take extra care to avoid this problem. Debian imports crashed on non-UTF-8 filenames Python 2 has some unfortunate behaviour around passing bytes or Unicode strings (depending on the platform) to shutil.rmtree, and the combination of some porting work and a particular source package in Debian that contained a non-UTF-8 file name caused us to run into this. The fix was to ensure that the argument passed to shutil.rmtree is a str regardless of Python version. We d actually run into something similar before: it s a subtle porting gotcha, since it s quite easy to end up passing Unicode strings to shutil.rmtree if you re in the process of porting your code to Python 3, and you might easily not notice if the file names in your tests are all encoded using UTF-8. lazr.restful ETags We eventually got far enough along that we could switch one of our four appserver machines (we have quite a number of other machines too, but the appservers handle web and API requests) to Python 3 and see what happened. By this point our extensive test suite had shaken out the vast majority of the things that could go wrong, but there was always going to be room for some interesting edge cases. One of the Ubuntu kernel team reported that they were seeing an increase in 412 Precondition Failed errors in some of their scripts that use our webservice API. These can happen when you re trying to modify an existing resource: the underlying protocol involves sending an If-Match header with the ETag that the client thinks the resource has, and if this doesn t match the ETag that the server calculates for the resource then the client has to refresh its copy of the resource and try again. We initially thought that this might be legitimate since it can happen in normal operation if you collide with another client making changes to the same resource, but it soon became clear that something stranger was going on: we were getting inconsistent ETags for the same object even when it was unchanged. Since we d recently switched a quarter of our appservers to Python 3, that was a natural suspect. Our lazr.restful package provides the framework for our webservice API, and roughly speaking it generates ETags by serializing objects into some kind of canonical form and hashing the result. Unfortunately the serialization was dependent on the Python version in a few ways, and in particular it serialized lists of strings such as lists of bug tags differently: Python 2 used [u'foo', u'bar', u'baz'] where Python 3 used ['foo', 'bar', 'baz']. In lazr.restful 1.0.3 we switched to using JSON for this, removing the Python version dependency and ensuring consistent behaviour between appservers. Memory leaks This problem took the longest to solve. We noticed fairly quickly from our graphs that the appserver machine we d switched to Python 3 had a serious memory leak. Our appservers had always been a bit leaky, but now it wasn t so much a small hole that we can bail occasionally as the boat is sinking rapidly : A serious memory leak (Yes, this got in the way of working out what was going on with ETags for a while.) I spent ages messing around with various attempts to fix this. Since only a quarter of our appservers were affected, and we could get by on 75% capacity for a while, it wasn t urgent but it was definitely annoying. After spending some quality time with objgraph, for some time I thought traceback reference cycles might be at fault, and I sent a number of fixes to various upstream projects for those (e.g. zope.pagetemplate). Those didn t help the leaks much though, and after a while it became clear to me that this couldn t be the sole problem: Python has a cyclic garbage collector that will eventually collect reference cycles as long as there are no strong references to any objects in them, although it might not happen very quickly. Something else must be going on. Debugging reference leaks in any non-trivial and long-running Python program is extremely arduous, especially with ORMs that naturally tend to end up with lots of cycles and caches. After a while I formed a hypothesis that zope.server might be keeping a strong reference to something, although I never managed to nail it down more firmly than that. This was an attractive theory as we were already in the process of migrating to Gunicorn for other reasons anyway, and Gunicorn also has a convenient max_requests setting that s good at mitigating memory leaks. Getting this all in place took some time, but once we did we found that everything was much more stable: A rather flat memory graph This isn t completely satisfying as we never quite got to the bottom of the leak itself, and it s entirely possible that we ve only papered over it using max_requests: I expect we ll gradually back off on how frequently we restart workers over time to try to track this down. However, pragmatically, it s no longer an operational concern. Mirror prober HTTPS proxy handling After we switched our script servers to Python 3, we had several reports of mirror probing failures. (Launchpad keeps lists of Ubuntu archive and image mirrors, and probes them every so often to check that they re reasonably complete and up to date.) This only affected HTTPS mirrors when probed via a proxy server, support for which is a relatively recent feature in Launchpad and involved some code that we never managed to unit-test properly: of course this is exactly the code that went wrong. Sadly I wasn t able to sort out that gap, but at least the fix was simple. Non-MIME-encoded email headers As I mentioned above, there were substantial changes in the email package between Python 2 and 3, and indeed between minor versions of Python 3. Our test coverage here is pretty good, but it s an area where it s very easy to have gaps. We noticed that a script that processes incoming email was crashing on messages with headers that were non-ASCII but not MIME-encoded (and indeed then crashing again when it tried to send a notification of the crash!). The only examples of these I looked at were spam, but we still didn t want to crash on them. The fix involved being somewhat more careful about both the handling of headers returned by Python s email parser and the building of outgoing email notifications. This seems to be working well so far, although I wouldn t be surprised to find the odd other incorrect detail in this sort of area. Failure to handle non-ISO-8859-1 URL-encoded form input Remember how I said that parsing HTTP form data was thorny? After we finished upgrading all our appservers to Python 3, people started reporting that they couldn t post Unicode comments to bugs, which turned out to be only if the attempt was made using JavaScript, and was because I hadn t quite managed to get URL-encoded form data working properly with zope.publisher and multipart. The current standard describes the URL-encoded format for form data as in many ways an aberrant monstrosity , so this was no great surprise. Part of the problem was some very strange choices in zope.publisher dating back to 2004 or earlier, which I attempted to clean up and simplify. The rest was that Python 2 s urlparse.parse_qs unconditionally decodes percent-encoded sequences as ISO-8859-1 if they re passed in as part of a Unicode string, so multipart needs to work around this on Python 2. I m still not completely confident that this is correct in all situations, but at least now that we re on Python 3 everywhere the matrix of cases we need to care about is smaller. Inconsistent marshalling of Loggerhead s disk cache We use Loggerhead for providing web browsing of Bazaar branches. When we upgraded one of its two servers to Python 3, we immediately noticed that the one still on Python 2 was failing to read back its revision information cache, which it stores in a database on disk. (We noticed this because it caused a deployment to fail: when we tried to roll out new code to the instance still on Python 2, Nagios checks had already caused an incompatible cache to be written for one branch from the Python 3 instance.) This turned out to be a similar problem to the pickle issue mentioned above, except this one was with marshal, which I didn t think to look for because it s a relatively obscure module mostly used for internal purposes by Python itself; I m not sure that Loggerhead should really be using it in the first place. The fix was relatively straightforward, complicated mainly by now needing to cope with throwing away unreadable cache data. Ironically, if we d just gone ahead and taken the nominally riskier path of upgrading both servers at the same time, we might never have had a problem here. Intermittent bzr failures Finally, after we upgraded one of our two Bazaar codehosting servers to Python 3, we had a report of intermittent bzr branch hangs. After some digging I found this in our logs:
Traceback (most recent call last):
  File "/srv/", line 136, in addWindowBytes
  File "/srv/", line 88, in startWriting
  File "/srv/", line 894, in resumeProducing
    for p in self.pipes.itervalues():
builtins.AttributeError: 'dict' object has no attribute 'itervalues'
I d seen this before in our git hosting service: it was a bug in Twisted s Python 3 port, fixed after 20.3.0 but unfortunately after the last release that supported Python 2, so we had to backport that patch. Using the same backport dealt with this. Onwards!

11 June 2021

Colin Watson: SSH quoting

A while back there was a thread on one of our company mailing lists about SSH quoting, and I posted a long answer to it. Since then a few people have asked me questions that caused me to reach for it, so I thought it might be helpful if I were to anonymize the original question and post my answer here. The question was why a sequence of commands involving ssh and fiddly quoting produced the output they did. The first example was this:
$ ssh user@machine.local bash -lc "cd /tmp;pwd"
Oh hi, my dubious life choices have been such that this is my specialist subject! This is because SSH command-line parsing is not quite what you expect. First, recall that your local shell will apply its usual parsing, and the actual OS-level execution of ssh will be like this:
[0]: ssh
[1]: user@machine.local
[2]: bash
[3]: -lc
[4]: cd /tmp;pwd
Now, the SSH wire protocol only takes a single string as the command, with the expectation that it should be passed to a shell by the remote end. The OpenSSH client deals with this by taking all its arguments after things like options and the target, which in this case are:
[0]: bash
[1]: -lc
[2]: cd /tmp;pwd
It then joins them with a single space:
bash -lc cd /tmp;pwd
This is passed as a string to the server, which then passes that entire string to a shell for evaluation, so as if you d typed this directly on the server:
sh -c 'bash -lc cd /tmp;pwd'
The shell then parses this as two commands:
bash -lc cd /tmp
The directory change thus happens in a subshell (actually it doesn t quite even do that, because bash -lc cd /tmp in fact ends up just calling cd because of the way bash -c parses multiple arguments), and then that subshell exits, then pwd is called in the outer shell which still has the original working directory. The second example was this:
$ ssh user@machine.local bash -lc "pwd;cd /tmp;pwd"
Following the logic above, this ends up as if you d run this on the server:
sh -c 'bash -lc pwd; cd /tmp; pwd'
The third example was this:
$ ssh user@machine.local bash -lc "cd /tmp;cd /tmp;pwd"
And this is as if you d run:
sh -c 'bash -lc cd /tmp; cd /tmp; pwd'
Now, I wouldn t have implemented the SSH client this way, because I agree that it s confusing. But /usr/bin/ssh is used as a transport for other things so much that changing its behaviour now would be enormously disruptive, so it s probably impossible to fix. (I have occasionally agitated on openssh-unix-dev@ for at least documenting this better, but haven t made much headway yet; I need to get round to preparing a documentation patch.) Once you know about it you can use the proper quoting, though. In this case that would simply be:
ssh user@machine.local 'cd /tmp;pwd'
Or if you do need to specifically invoke bash -l there for some reason (I m assuming that the original example was reduced from something more complicated), then you can minimise your confusion by passing the whole thing as a single string in the form you want the remote sh -c to see, in a way that ensures that the quotes are preserved and sent to the server rather than being removed by your local shell:
ssh user@machine.local 'bash -lc "cd /tmp;pwd"'
Shell parsing is hard.

25 September 2020

Colin Watson: Porting Launchpad to Python 3: progress report

Launchpad still requires Python 2, which in 2020 is a bit of a problem. Unlike a lot of the rest of 2020, though, there s good reason to be optimistic about progress. I ve been porting Python 2 code to Python 3 on and off for a long time, from back when I was on the Ubuntu Foundations team and maintaining things like the Ubiquity installer. When I moved to Launchpad in 2015 it was certainly on my mind that this was a large body of code still stuck on Python 2. One option would have been to just accept that and leave it as it is, maybe doing more backporting work over time as support for Python 2 fades away. I ve long been of the opinion that this would doom Launchpad to being unmaintainable in the long run, and since I genuinely love working on Launchpad - I find it an incredibly rewarding project - this wasn t something I was willing to accept. We re already seeing some of our important dependencies dropping support for Python 2, which is perfectly reasonable on their terms but which is starting to become a genuine obstacle to delivering important features when we need new features from newer versions of those dependencies. It also looks as though it may be difficult for us to run on Ubuntu 20.04 LTS (we re currently on 16.04, with an upgrade to 18.04 in progress) as long as we still require Python 2, since we have some system dependencies that 20.04 no longer provides. And then there are exciting new features like type hints and async/await that we d like to be able to use. However, until last year there were so many blockers that even considering a port was barely conceivable. What changed in 2019 was sorting out a trifecta of core dependencies. We ported our database layer, Storm. We upgraded to modern versions of our Zope Toolkit dependencies (after contributing various fixes upstream, including some substantial changes to Zope s test runner that we d carried as local patches for some years). And we ported our Bazaar code hosting infrastructure to Breezy. With all that in place, a port seemed more of a realistic possibility. Still, even with this, it was never going to be a matter of just following some standard porting advice and calling it good. Launchpad has almost a million lines of Python code in its main git tree, and around 250 dependencies of which a number are quite Launchpad-specific. In a project that size, not only is following standard porting advice an extremely time-consuming task in its own right, but just about every strange corner case is going to show up somewhere. (Did you know that StringIO.StringIO(None) and io.StringIO(None) do different things even after you account for the native string vs. Unicode text difference? How about the behaviour of .union() on a subclass of frozenset?) Launchpad s test suite is fortunately extremely thorough, but even just starting up the test suite involves importing most of the data model code, so before you can start taking advantage of it you have to make a large fraction of the codebase be at least syntactically-correct Python 3 code and use only modules that exist in Python 3 while still working in Python 2; in a project this size that turns out to be a large effort on its own, and can be quite risky in places. Canonical s product engineering teams work on a six-month cycle, but it just isn t possible to cram this sort of thing into six months unless you do literally nothing else, and please can we put all feature development on hold while we run to stand still is a pretty tough sell to even the most understanding management. Fortunately, we ve been able to grow the Launchpad team in the last year or so, and so it s been possible to put Python 3 on our roadmap on the understanding that we aren t going to get all the way there in one cycle, while still being able to do other substantial feature development work as well. So, with all that preamble, what have we done this cycle? We ve taken a two-pronged approach. From one end, we identified 147 classes that needed to be ported away from some compatibility code in our database layer that was substantially less friendly to Python 3: we ve ported 38 of those, so there s clearly a fair bit more to do, but we were able to distribute this work out among the team quite effectively. From the other end, it was clear that it would be very inefficient to do general porting work when any attempt to even run the test suite would run straight into the same crashes in the same order, so I set myself a target of getting the test suite to start up, and started hacking on an enormous git branch that I never expected to try to land directly: instead, I felt free to commit just about anything that looked reasonable and moved things forward even if it was very rough, and every so often went back to tidy things up and cherry-pick individual commits into a form that included some kind of explanation and passed existing tests so that I could propose them for review. This strategy has been dramatically more successful than anything I ve tried before at this scale. So far this cycle, considering only Launchpad s main git tree, we ve landed 137 Python-3-relevant merge proposals for a total of 39552 lines of git diff output, keeping our existing tests passing along the way and deploying incrementally to production. We have about 27000 more lines of patch at varying degrees of quality to tidy up and merge. Our main development branch is only perhaps 10 or 20 more patches away from the test suite being able to start up, at which point we ll be able to get a buildbot running so that multiple developers can work on this much more easily and see the effect of their work. With the full unlanded patch stack, about 75% of the test suite passes on Python 3! This still leaves a long tail of several thousand tests to figure out and fix, but it s a much more incrementally-tractable kind of problem than where we started. Finally: the funniest (to me) bug I ve encountered in this effort was the one I encountered in the test runner and fixed in zopefoundation/zope.testrunner#106: IDs of failing tests were written to a pipe, so if you have a test suite that s large enough and broken enough then eventually that pipe would reach its capacity and your test runner would just give up and hang. Pretty annoying when it meant an overnight test run didn t give useful results, but also eloquent commentary of sorts.

17 August 2020

Ian Jackson: Doctrinal obstructiveness in Free Software

Any software system has underlying design principles, and any software project has process rules. But I seem to be seeing more often, a pathological pattern where abstract and shakily-grounded broad principles, and even contrived and sophistic objections, are used to block sensible changes. Today I will go through an example in detail, before ending with a plea: PostgreSQL query planner, WITH [MATERIALIZED] optimisation fence Background history PostgreSQL has a sophisticated query planner which usually gets the right answer. For good reasons, the pgsql project has resisted providing lots of knobs to control query planning. But there are a few ways to influence the query planner, for when the programmer knows more than the planner. One of these is the use of a WITH common table expression. In pgsql versions prior to 12, the planner would first make a plan for the WITH clause; and then, it would make a plan for the second half, counting the WITH clause's likely output as a given. So WITH acts as an "optimisation fence". This was documented in the manual - not entirely clearly, but a careful reading of the docs reveals this behaviour:
The WITH query will generally be evaluated as written, without suppression of rows that the parent query might discard afterwards.
Users (authors of applications which use PostgreSQL) have been using this technique for a long time. New behaviour in PostgreSQL 12 In PostgreSQL 12 upstream were able to make the query planner more sophisticated. In particular, it is now often capable of looking "into" the WITH common table expression. Much of the time this will make things better and faster. But if WITH was being used for its side-effect as an optimisation fence, this change will break things: queries that ran very quickly in earlier versions might now run very slowly. Helpfully, pgsql 12 still has a way to specify an optimisation fence: specifying WITH ... AS MATERIALIZED in the query. So far so good. Upgrade path for existing users of WITH fence But what about the upgrade path for existing users of the WITH fence behaviour? Such users will have to update their queries to add AS MATERIALIZED. This is a small change. Having to update a query like this is part of routine software maintenance and not in itself very objectionable. However, this change cannnot be made in advance because pgsql versions prior to 12 will reject the new syntax. So the users are in a bit of a bind. The old query syntax can be unuseably slow with the new database and the new syntax is rejected by the old database. Upgrading both the database and the application, in lockstep, is a flag day upgrade, which every good sysadmin will want to avoid. A solution to this problem Colin Watson proposed a very simple solution: make the earlier PostgreSQL versions accept the new MATERIALIZED syntax. This is correct since the new syntax specifies precisely the actual behaviour of the old databases. It has no deleterious effect on any users of older pgsql versions. It makes it possible to add the new syntax to the application, before doing the database upgrade, decoupling the two upgrades. Colin Watson even provided an implementation of this proposal. The solution is rejected by upstream Unfortunately upstream did not accept this idea. You can read the whole thread yourself if you like. But in summary, the objections were (italic indicates literal quotes): I find these extremely unconvincing, even taken together. Many of them are very unattractive things to hear one's upstream saying. At best they are knee-jerk and inflexible application of very general principles. The authors of these objections seem to have lost sight of the fact that these principles have a purpose. When these kind of software principles work against their purposes, they should be revised, or exceptions made. At worst, it looks like a collective effort to find reasons - any reasons, no matter how bad - not to make this change. The OFFSET 0 trick One of the responses in the thread mentions OFFSET 0. As part of writing the queries in the Xen Project CI system, and preparing for our system upgrade, I had carefully read the relevant pgsql documentation. This OFFSET 0 trick was new to me. But, now that I know the answer, it is easy to provide the right search terms and find, for example, this answer on stackmumble. Apparently adding a no-op OFFSET 0 to the subquery defeats the pgsql 12 query planner's ability to see into the subquery.
I think OFFSET 0 is the better approach since it's more obviously a hack showing that something weird is going on, and it's unlikely we'll ever change the optimiser behaviour around OFFSET 0 ... wheras hopefully CTEs will become inlineable at some point CTEs became inlineable by default in PostgreSQL 12.
So in fact there is a syntax for an optimisation fence that is accepted by both earlier and later PostgreSQL versions. It's even recommended by pgsql devs. It's just not documented, and is described by pgsql developers as a "hack". Astonishingly, the fact that it is a "hack" is given as a reason to use it! Well, I have therefore deployed this "hack". No doubt it will stay in our codebase indefinitely. Please don't be like that! I could come up with a lot more examples of other projects that have exhibited similar arrogance. It is becoming a plague! But every example is contentious, and I don't really feel I need to annoy a dozen separate Free Software communities. So I won't make a laundry list of obstructiveness. If you are an upstream software developer, or a distributor of software to users (eg, a distro maintainer), you have a lot of practical power. In theory it is Free Software so your users could just change it themselves. But for a user or downstream, carrying a patch is often an unsustainable amount of work and risk. Most of us have patches we would love to be running, but which we haven't even written because simply running a nonstandard build is too difficult, no matter how technically excellent our delta. As an upstream, it is very easy to get into a mindset of defending your code's existing behaviour, and to turn your project's guidelines into inflexible rules. Constant exposure to users who make silly mistakes, and rudely ask for absurd changes, can lead to core project members feeling embattled. But there is no need for an upstream to feel embattled! You have the vast majority of the power over the software, and over your project communication fora. Use that power consciously, for good. I can't say that arrogance will hurt you in the short term. Users of software with obstructive upstreams do not have many good immediate options. But we do have longer-term choices: we can choose which software to use, and we can choose whether to try to help improve the software we use. After reading Colin's experience, I am less likely to try to help improve the experience of other PostgreSQL users by contributing upstream. It doesn't seem like there would be any point. Indeed, instead of helping the PostgreSQL community I am now using them as an example of bad practice. I'm only half sorry about that.

comment count unavailable comments

16 November 2017

Colin Watson: Kitten Block equivalent for Firefox 57

I ve been using Kitten Block for years, since I don t really need the blood pressure spike caused by accidentally following links to certain UK newspapers. Unfortunately it hasn t been ported to Firefox 57. I tried emailing the author a couple of months ago, but my email bounced. However, if your primary goal is just to block the websites in question rather than seeing kitten pictures as such (let s face it, the internet is not short of alternative sources of kitten pictures), then it s easy to do with uBlock Origin. After installing the extension if necessary, go to Tools Add-ons Extensions uBlock Origin Preferences My filters, and add and, each on its own line. (Of course you can easily add more if you like.) Voil : instant tranquility. Incidentally, this also works fine on Android. The fact that it was easy to install a good ad blocker without having to mess about with a rooted device or strange proxy settings was the main reason I switched to Firefox on my phone.

26 September 2017

Colin Watson: A mysterious bug with Twisted plugins

I fixed a bug in Launchpad recently that led me deeper than I expected. Launchpad uses Buildout as its build system for Python packages, and it s served us well for many years. However, we re using 1.7.1, which doesn t support ensuring that packages required using setuptools setup_requires keyword only ever come from the local index URL when one is specified; that s an essential constraint we need to be able to impose so that our build system isn t immediately sensitive to downtime or changes in PyPI. There are various issues/PRs about this in Buildout (e.g. #238), but even if those are fixed it ll almost certainly only be in Buildout v2, and upgrading to that is its own kettle of fish for other reasons. All this is a serious problem for us because newer versions of many of our vital dependencies (Twisted and testtools, to name but two) use setup_requires to pull in pbr, and so we ve been stuck on old versions for some time; this is part of why Launchpad doesn t yet support newer SSH key types, for instance. This situation obviously isn t sustainable. To deal with this, I ve been working for some time on switching to virtualenv and pip. This is harder than you might think: Launchpad is a long-lived and complicated project, and it had quite a number of explicit and implicit dependencies on Buildout s configuration and behaviour. Upgrading our infrastructure from Ubuntu 12.04 to 16.04 has helped a lot (12.04 s baseline virtualenv and pip have some deficiencies that would have required a more complicated bootstrapping procedure). I ve dealt with most of these: for example, I had to reorganise a lot of our helper scripts (1, 2, 3), but there are still a few more things to go. One remaining problem was that our Buildout configuration relied on building several different environments with different Python paths for various things. While this would technically be possible by way of building multiple virtualenvs, this would inflate our build time even further (we re already going to have to cope with some slowdown as a result of using virtualenv, because the build system now has to do a lot more than constructing a glorified link farm to a bunch of cached eggs), and it seems like unnecessary complexity. The obvious thing to do seemed to be to collapse these into a single environment, since there was no obvious reason why it should actually matter if txpkgupload and txlongpoll were carefully kept off the path when running most of Launchpad: so I did that. Then our build system got very sad. Hmm, I thought. To keep our test times somewhat manageable, we run them in parallel across 20 containers, and we randomise the order in which they run to try to shake out test isolation bugs. It s not completely unknown for there to be some oddities resulting from that. So I ran it again. Nope, but slightly differently sad this time. Furthermore, I couldn t reproduce these failures locally no matter how hard I tried. Oh dear. This was obviously not going to be a good day. In fact I spent a while on various different guesswork-based approaches. I found bug 571334 in Ampoule, an AMP-based process pool implementation that we use for some job runners, and proposed a fix for that, but cherry-picking that fix into Launchpad didn t help matters. I tried backing out subsets of my changes and determined that if both txlongpoll and txpkgupload were absent from the Python module path in the context of the tests in question then everything was fine. I tried running strace locally and staring at the output for some time in the hope of enlightenment: that reminded me that the two packages in question install modules under twisted.plugins, which did at least establish a reason they might affect the environment that was more plausible than magic, but nothing much more specific than that. On Friday I was fiddling about with this again and trying to insert some more debugging when I noticed some interesting behaviour around plugin caching. If I caused the txpkgupload plugin to raise an exception when loaded, the Twisted plugin system would remove its dropin.cache (because it was stale) and not create a new one (because there was now no content to put in it). After that, running the relevant tests would fail as I d seen in our buildbot. Aha! This meant that I could also reproduce it by doing an even cleaner build than I d previously tried to do, by removing the cached txpkgupload and txlongpoll eggs and allowing the build system to recreate them. When they were recreated, they didn t contain dropin.cache, instead allowing that to be created on first use. Based on this clue I was able to get to the answer relatively quickly. Ampoule has a specialised bootstrapping sequence for its worker processes that starts by doing this:
from twisted.application import reactors
Now, twisted.application.reactors.installReactor calls twisted.plugin.getPlugins, so the very start of this bootstrapping sequence is going to involve loading all plugins found on the module path (I assume it s possible to write a plugin that adds an alternative reactor implementation). If dropin.cache is up to date, then it will just get the information it needs from that; but if it isn t, it will go ahead and import the plugin. If the plugin happens (as Twisted code often does) to run from twisted.internet import reactor at some point while being imported, then that will install the platform s default reactor, and then twisted.application.reactors.installReactor will raise ReactorAlreadyInstalledError. Since Ampoule turns this into an info-level log message for some reason, and the tests in question only passed through error-level messages or higher, this meant that all we could see was that a worker process had exited non-zero but not why. The Twisted documentation recommends generating the plugin cache at build time for other reasons, but we weren t doing that. Fixing that makes everything work again. There are still a few more things needed to get us onto pip, but we re now pretty close. After that we can finally start bringing our dependencies up to date.

29 August 2017

Colin Watson: env chdir

I was recently asked to sort things out so that snap builds on Launchpad could themselves install snaps as build-dependencies. To make this work we need to start doing builds in LXD containers rather than in chroots. As a result I ve been doing some quite extensive refactoring of launchpad-buildd: it previously had the assumption that it was going to use a chroot for everything baked into lots of untested helper shell scripts, and I ve been rewriting those in Python with unit tests and with a single Backend abstraction that isolates the high-level logic from the details of where each build is being performed. This is all interesting work in its own right, but it s not what I want to talk about here. While I was doing all this refactoring, I ran across a couple of methods I wrote a while back which looked something like this:
def chroot(self, args, echo=False):
    """Run a command in the chroot.
    :param args: the command and arguments to run.
    args = set_personality(
        args, self.options.arch, series=self.options.series)
    if echo:
        print("Running in chroot: %s" %
              ' '.join("'%s'" % arg for arg in args))
        "/usr/bin/sudo", "/usr/sbin/chroot", self.chroot_path] + args)
def run_build_command(self, args, env=None, echo=False):
    """Run a build command in the chroot.
    This is unpleasant because we need to run it in /build under sudo
    chroot, and there's no way to do this without either a helper
    program in the chroot or unpleasant quoting.  We go for the
    unpleasant quoting.
    :param args: the command and arguments to run.
    :param env: dictionary of additional environment variables to set.
    args = [shell_escape(arg) for arg in args]
    if env:
        args = ["env"] + [
            "%s=%s" % (key, shell_escape(value))
            for key, value in env.items()] + args
    command = "cd /build && %s" % " ".join(args)
    self.chroot(["/bin/sh", "-c", command], echo=echo)
(I ve already replaced the chroot method with a call to, but it s easier to see what I m talking about in the original form.) One thing to notice about this code is that it uses several adverbial commands: that is, commands that run another command in a different way. For example, sudo runs another command as another user, while chroot runs another command with a different root directory, and env runs another command with different environment variables set. These commands chain neatly, and they also have the useful property that they take the subsidiary command and its arguments as a list of arguments. coreutils has several other commands that behave this way, and adverbio is another useful example. By contrast, su -c is something you might call a quasi-adverbial command: it does modify the behaviour of another command, but it takes it as a single argument which it then passes to sh -c. Every time you have something that s passed to a shell like this, you need a corresponding layer of shell quoting to escape any shell metacharacters that should be interpreted literally. This is often cumbersome and is easy to get wrong. My Python implementation is as follows, and I wouldn t be totally surprised to discover that it contained a bug:
import re
non_meta_re = re.compile(r'^[a-zA-Z0-9+,./:=@_-]+$')
def shell_escape(arg):
    if non_meta_re.match(arg):
        return arg
        return "'%s'" % arg.replace("'", "'\\''")
Python >= 3.3 has shlex.quote, which is an improvement and we should probably use that instead, but it s still another thing to forget to call. This is why process-spawning libraries such as Python s subprocess, Perl s system and open, and my own libpipeline for C encourage programmers to use a list syntax and to avoid involving the shell entirely wherever possible. One thing that the standard Unix tools don t let you do in an adverbial way is to change your working directory, and I ve run into this annoying limitation several times. This means that it s difficult to chain that operation together with other adverbs, for example to run a command in a particular working directory inside a chroot. The workaround I used above was to invoke a shell that runs cd /build && ..., but that s another command that s only quasi-adverbial, since the extra shell means an extra layer of shell quoting. (Ian Jackson rightly observes that you can in fact write the necessary adverb as something like sh -ec 'cd "$1"; shift; exec "$@"' chdir. I think that s a bit uglier than I ideally want to use in production code, but you might reasonably think that it s worth it to avoid the extra layer of shell quoting.) I therefore decided that this was a feature that belonged in coreutils, and after a bit of mailing list discussion we felt it was best implemented as a new option to env(1). I sent a patch for this which has been accepted. This means that we have a new composable adverb, env --chdir=NEWDIR, which will allow the run_build_command method above to be rewritten as something like this:
def run_build_command(self, args, env=None, echo=False):
    """Run a build command in the chroot.
    :param args: the command and arguments to run.
    :param env: dictionary of additional environment variables to set.
    env_args = ["env", "--chdir=/build"]
    if env:
        for key, value in env.items():
            env_args.append("%s=%s" % (key, value))
    self.chroot(env_args + args, echo=echo)
The env --chdir option will be in coreutils 8.28. We won t be able to use it in launchpad-buildd until that s available in all Ubuntu series we might want to build for, so in this particular application that s going to take a few years; but other applications may well be able to make use of it sooner.

27 June 2017

Colin Watson: New address book

I ve had a kludgy mess of electronic address books for most of two decades, and have got rather fed up with it. My stack consisted of: The biggest practical problem with this was that I had the address book that was most convenient for me to add things to (Google Contacts) and the one I used when sending email, and no sensible way to merge them or move things between them. I also wasn t especially comfortable with having all my contact information in a proprietary web service. My goals for a replacement address book system were: I think I have all this now! New stack The obvious basic technology to use is CardDAV: it s fairly complex, admittedly, but lots of software supports it and one of my goals was not having to write my own thing. This meant I needed a CardDAV server, some way to sync the database to and from both Android and the system where I run mutt, and whatever query glue was necessary to get mutt to understand vCards. There are lots of different alternatives here, and if anything the problem was an embarrassment of choice. In the end I just decided to go for things that looked roughly the right shape for me and tried not to spend too much time in analysis paralysis. CardDAV server I went with Xandikos for the server, largely because I know Jelmer and have generally had pretty good experiences with their software, but also because using Git for history of the backend storage seems like something my future self will thank me for. It isn t packaged in stretch, but it s in Debian unstable, so I installed it from there. Rather than the standalone mode suggested on the web page, I decided to set it up in what felt like a more robust way using WSGI. I installed uwsgi, uwsgi-plugin-python3, and libapache2-mod-proxy-uwsgi, and created the following file in /etc/uwsgi/apps-available/xandikos.ini which I then symlinked into /etc/uwsgi/apps-enabled/xandikos.ini:
socket =
uid = xandikos
gid = xandikos
umask = 022
master = true
cheaper = 2
processes = 4
plugin = python3
module = xandikos.wsgi:app
env = XANDIKOSPATH=/srv/xandikos/collections
The port number was arbitrary, as was the path. You need to create the xandikos user and group first (adduser --system --group --no-create-home --disabled-login xandikos). I created /srv/xandikos owned by xandikos:xandikos and mode 0700, and I recommend setting a umask as shown above since uwsgi s default umask is 000 (!). You should also run sudo -u xandikos xandikos -d /srv/xandikos/collections --autocreate and then Ctrl-c it after a short time (I think it would be nicer if there were a way to ask the WSGI wrapper to do this). For Apache setup, I kept it reasonably simple: I ran a2enmod proxy_uwsgi, used htpasswd to create /etc/apache2/xandikos.passwd with a username and password for myself, added a virtual host in /etc/apache2/sites-available/xandikos.conf, and enabled it with a2ensite xandikos:
<VirtualHost *:443>
        ErrorLog /var/log/apache2/xandikos-error.log
        TransferLog /var/log/apache2/xandikos-access.log
        <Location />
                ProxyPass "uwsgi://"
                AuthType Basic
                AuthName "Xandikos"
                AuthBasicProvider file
                AuthUserFile "/etc/apache2/xandikos.passwd"
                Require valid-user
Then service apache2 reload, set the new virtual host up with Let s Encrypt, reloaded again, and off we go. Android integration I installed DAVdroid from the Play Store: it cost a few pounds, but I was OK with that since it s GPLv3 and I m happy to help fund free software. I created two accounts, one for my existing Google Contacts database (and in fact calendaring as well, although I don t intend to switch over to self-hosting that just yet), and one for the new Xandikos instance. The Google setup was a bit fiddly because I have two-step verification turned on so I had to create an app-specific password. The Xandikos setup was straightforward: base URL, username, password, and done. Since I didn t completely trust the new setup yet, I followed what seemed like the most robust option from the DAVdroid contacts syncing documentation, and used the stock contacts app to export my Google Contacts account to a .vcf file and then import that into the appropriate DAVdroid account (which showed up automatically). This seemed straightforward and everything got pushed to Xandikos. There are some weird delays in syncing contacts that I don t entirely understand, but it all seems to get there in the end. mutt integration First off I needed to sync the contacts. (In fact I happen to run mutt on the same system where I run Xandikos at the moment, but I don t want to rely on that, and going through the CardDAV server means that I don t have to poke holes for myself using filesystem permissions.) I used vdirsyncer for this. In ~/.vdirsyncer/config:
status_path = "~/.vdirsyncer/status/"
[pair contacts]
a = "contacts_local"
b = "contacts_remote"
collections = ["from a", "from b"]
[storage contacts_local]
type = "filesystem"
path = "~/.contacts/"
fileext = ".vcf"
[storage contacts_remote]
type = "carddav"
url = "<Xandikos base URL>"
username = "<my username>"
password = "<my password>"
Running vdirsyncer discover and vdirsyncer sync then synced everything into ~/.contacts/. I added an hourly crontab entry to run vdirsyncer -v WARNING sync. Next, I needed a command-line address book tool based on this. khard looked about right and is in stretch, so I installed that. In ~/.config/khard/khard.conf (this is mostly just the example configuration, but I preferred to sort by first name since not all my contacts have neat first/last names):
path = ~/.contacts/<UUID of my contacts collection>/
debug = no
default_action = list
editor = vim
merge_editor = vimdiff
[contact table]
# display names by first or last name: first_name / last_name
display = first_name
# group by address book: yes / no
group_by_addressbook = no
# reverse table ordering: yes / no
reverse = no
# append nicknames to name column: yes / no
show_nicknames = no
# show uid table column: yes / no
show_uids = yes
# sort by first or last name: first_name / last_name
sort = first_name
# extend contacts with your own private objects
# these objects are stored with a leading "X-" before the object name in the vcard files
# every object label may only contain letters, digits and the - character
# example:
#   private_objects = Jabber, Skype, Twitter
private_objects = Jabber, Skype, Twitter
# preferred vcard version: 3.0 / 4.0
preferred_version = 3.0
# Look into source vcf files to speed up search queries: yes / no
search_in_source_files = no
# skip unparsable vcard files: yes / no
skip_unparsable = no
Now khard list shows all my contacts. So far so good. Apparently there are some awkward vCard compatibility issues with creating or modifying contacts from the khard end. I ve tried adding one address from ~/.mutt/aliases using khard and it seems to at least minimally work for me, but I haven t explored this very much yet. I had to install python3-vobject from experimental to fix eventable/vobject#39 saving certain vCard files. Finally, mutt integration. I already had set query_command="lbdbq '%s'" in ~/.muttrc, and I wanted to keep that in place since I still wanted to use LDAP querying as well. I had to write a very small amount of code for this (perhaps I should contribute this to lbdb upstream?), in ~/.lbdb/modules/m_khard:
#! /bin/sh
m_khard_query ()  
    khard email --parsable --remove-first-line --search-in-source-files "$1"
My full ~/.lbdb/rc now reads as follows (you probably won t want the LDAP stuff, but I ve included it here for completeness):
METHODS='m_muttalias m_khard m_ldap'
LDAP_NICKS='debian canonical'
Next steps I ve deleted one account from Google Contacts just to make sure that everything still works (e.g. I can still search for it when composing a new message), but I haven t yet deleted everything. I won t be adding anything new there though. I need to push everything from ~/.mutt/aliases into the new system. This is only about 30 contacts so shouldn t take too long. Overall this feels like a big improvement! It wasn t a trivial amount of setup for just me, but it means I have both better usability for myself and more independence from proprietary services, and I think I can add extra users with much less effort if I need to. Postscript A day later and I ve consolidated all my accounts from Google Contacts and ~/.mutt/aliases into the new system, with the exception of one group that I had defined as a mutt alias and need to work out what to do with. This all went smoothly. I ve filed the new lbdb module as #866178, and the python3-vobject bug as #866181.

6 April 2017

Reproducible builds folks: Reproducible Builds: week 101 in Stretch cycle

Here's what happened in the Reproducible Builds effort between Sunday March 26 and Saturday April 1 2017: Media coverage Sylvain Beucler wrote a follow-up post Practical basics of reproducible builds 2, which like last weeks article is about his experiences making software build reproducibly. Reproducible work in other projects Colin Watson started writing a patch to make launchpad store .buildinfo files. (It's not yet deployed.) Toolchain development and fixes Ximin Luo continued to work on BUILD_PATH_PREFIX_MAP patches for GCC 6 and dpkg. Packages reviewed and fixed, and bugs filed Chris Lamb: Mattia Rizzolo: Reviews of unreproducible packages 49 package reviews have been added, 25 have been updated and 42 have been removed in this week, adding to our knowledge about identified issues. 1 issue type has been updated: Weekly QA work During our reproducibility testing, FTBFS bugs have been detected and reported by: diffoscope development diffoscope 81 was uploaded to experimental by Chris Lamb. It included contributions from: reprotest development reprotest development continued in git, including contributions from: development development continued in git, including contributions from: reproducible-website development Holger switched and to letsencrypt certificates. Misc. This week's edition was written by Ximin Luo and Holger Levsen & reviewed by a bunch of Reproducible Builds folks on IRC & the mailing lists.

11 December 2016

Colin Watson: The sad tale of CVE-2015-1336

Today I released man-db 2.7.6 (announcement, NEWS, git log), and uploaded it to Debian unstable. The major change in this release was a set of fixes for two security vulnerabilities, one of which affected all man-db installations since 2.3.12 (or 2.3.10-66 in Debian), and the other of which was specific to Debian and its derivatives. It s probably obvious from the dates here that this has not been my finest hour in terms of responding to security issues in a timely fashion, and I apologise for that. Some of this is just the usual life reasons, which I shan t bore you by reciting, but some of it has been that fixing this properly in man-db was genuinely rather complicated and delicate. Since I ve previously advocated man-db over some of its competitors on the basis of a better security posture, I think it behooves me to write up a longer description. I took over maintaining man-db over fifteen years ago in slightly unexpected circumstances (I got annoyed with its bug list and made a couple of non-maintainer uploads, and then the previous maintainer died, so I ended up taking over both in Debian and upstream). I was a fairly new developer at the time, and there weren t a lot of people I could ask questions of, but I did my best to recover as much of the history as I could and learn from it. One thing that became clear very quickly, both from my own inspection and from the bug list, was that most of the code had been written in a rather more innocent time. It was absolutely riddled with dangerous uses of the shell, poor temporary file handling, buffer overruns, and various common-or-garden deficiencies of that kind. I spent several years reworking large swathes of the codebase to be more robust against those kinds of bugs by design, and for example libpipeline came out of that effort. The most subtle and risky set of problems came from the fact that the man and mandb programs were installed set-user-id to the man user. Part of this was so that man could maintain preformatted cat pages , and part of it was so that users could run mandb if the system databases were out of date (this is now much less useful since most package managers, including dpkg, support some kind of trigger mechanism that can run mandb whenever new system-level manual pages are installed). One of the first things I did was to make this optional, and this has been a disabled-by-default debconf option in Debian for a long time now. But it s still a supported option and is enabled by default upstream, and when running setuid man and mandb need to take care to drop privileges when dealing with user-controlled data and to write files with the appropriate ownership and permissions. My predecessor had problems related to this such as Debian #26002, and one of the ways they dealt with them was to make /var/cache/man/ set-group-id root, in order that files written to that directory would have consistent group ownership. This always struck me as rather strange and I meant to do something about it at some point, but until the first vulnerability report above I regarded it as mainly a curiosity, since nothing in there was group-writeable anyway. As a result, with the more immediate aim of making the system behave consistently and dealing with bug reports, various bits of code had accreted that assumed that /var/cache/man/ would be man:root 2755, and not all of it was immediately obvious. This interacted with the second vulnerability report in two ways. Firstly, at some level it caused it because I was dealing with the day-to-day problems rather than thinking at a higher level: a series of bugs led me down the path of whacking problems over the head with a recursive chown of /var/cache/man/ from cron, rather than working out why things got that way in the first place. Secondly, once I d done that, I couldn t remove the chown without a much more extensive excursion into all the code that dealt with cache files, for fear of reintroducing those bugs. So although the fix for the second vulnerability is very simple in itself, I couldn t get there without dealing with the first vulnerability. In some ways, of course, cat pages are a bit of an anachronism. Most modern systems can format pages quickly enough that it s not much of an issue. However, I m loath to drop the feature entirely: I m generally wary of assuming that just because I have a fast system that everyone does. So, instead, I did what I should have done years ago: make man and mandb set-group-id man as well as set-user-id man, at which point we can simply make all the cache files and directories be owned by man:man and drop the setgid bit on cache directories. This should be simpler and less prone to difficult-to-understand problems. I expect that my next substantial upstream release will switch to --disable-setuid by default to reduce exposure, though, and distributions can start thinking about whether they want to follow that (Fedora already does, for example). If this becomes widely disabled without complaints then that would be good evidence that it s reasonable to drop the feature entirely. I m not in a rush, but if you do need cat pages then now is a good time to write to me and tell me why. This is the fiddliest set of vulnerabilities I ve dealt with in man-db for quite some time, so I hope that if there are more then I can get back to my previous quick response time.

3 December 2016

Ross Gammon: My Open Source Contributions June November 2016

So much for my monthly blogging! Here s what I have been up to in the Open Source world over the last 6 months. Debian Ubuntu Other Plan for December Debian Before the 5th January 2017 Debian Stretch soft freeze I hope to: Ubuntu Other

1 October 2016

Vincent Sanders: Paul Hollywood and the pistoris stone

There has been a great deal of comment among my friends recently about a particularly British cookery program called "The Great British Bake Off". There has been some controversy as the program is moving from the BBC to a commercial broadcaster.

Part of this discussion comes from all the presenters, excepting Paul Hollywood, declining to sign with the new broadcaster and partly because of speculation the BBC might continue with a similar format show with a new name.

Rob Kendrick provided the start to this conversation by passing on a satirical link suggesting Samuel L Jackson might host "cakes on a plane"

This caused a large number of suggestions for alternate names which I will be reporting but Rob Kendrick, Vivek Das Mohapatra, Colin Watson, Jonathan McDowell, Oki Kuma, Dan Alderman, Dagfinn Ilmari Manns ke, Lesley Mitchell and Daniel Silverstone are the ones to blame.

So that is our list, anyone else got better ideas?