Two weeks ago I upgraded chiark from Debian jessie i386 to bullseye amd64, after nearly 30 years running Debian i386. This went really quite well, in fact!
Background
chiark is my colo - a server I run, which lives in a data centre in London. It hosts ~200 users with shell accounts, various websites and mailing lists, moderators for a number of USENET newsgroups, and countless other services. chiark's internal setup is designed to enable my users to do a maximum number of exciting things with a minimum of intervention from me.
chiark's OS install dates to 1993, when I installed Debian 0.93R5, the first version of Debian to advertise the ability to be upgraded without reinstalling. I think that makes it one of the oldest Debian installations in existence.
Obviously it's had several new hardware platforms too. (There was a prior install of Linux on the initial hardware, remnants of which can maybe still be seen in some obscure corners of chiark's /usr/local.)
chiark's install is also at the very high end of the scale of installation complexity and customisation: reinstalling it completely would be an enormous amount of work. And it's unique.
chiark's upgrade history
chiark's last major OS upgrade was to jessie (Debian 8, released in April 2015). That was in 2016. Since then we have been relying on Debian's excellent security support posture, the Debian LTS project, and more recently Freexian's Debian ELTS project, along with some local updates. The use of ELTS - which supports only a subset of packages - was particularly uncomfortable.
Additionally, chiark was installed with 32-bit x86 Linux (Debian i386), since that was what was supported and available at the time. But 32-bit is looking very long in the tooth.
Why do a skip upgrade
So, I wanted to move to the fairly recent stable release - Debian 11 (bullseye), which is just short of a year old. And I wanted to crossgrade (as it's called) to 64-bit.
In the past, I have found greater success doing direct upgrades, skipping intermediate releases, rather than following the officially-supported path of going via every intermediate release.
Doing a skip upgrade avoids exposure to any packaging bugs which were present only in intermediate release(s). Debian does usually fix bugs, but Debian has many cautious users, so it is not uncommon for bugs to be found after release, and then not be fixed until the next one.
A skip upgrade avoids the need to try to upgrade to already-obsolete releases (which can involve messing about with multiple snapshots from snapshot.debian.org). It is also significantly faster and simpler, which is important not only because it reduces downtime, but also because it removes opportunities (and reduces the time available) for things to go badly.
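For illustration, a pinned apt source for an obsolete intermediate release looks something like this. The timestamp and release name here are made up; [check-valid-until=no] stops apt rejecting the long-expired Release file:

    # Hypothetical sources.list entries for a fixed snapshot of stretch:
    deb [check-valid-until=no] https://snapshot.debian.org/archive/debian/20190701T000000Z/ stretch main
    deb [check-valid-until=no] https://snapshot.debian.org/archive/debian-security/20190701T000000Z/ stretch/updates main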
One downside is that sometimes maintainers aggressively remove compatibility measures for older releases. (And compatibility packages are generally removed quite quickly by even cautious maintainers.) That means that the sysadmin who wants to skip-upgrade needs to do more manual fixing of things that haven't been dealt with automatically. And occasionally one finds compatibility problems that show up only when mixing very old and very new software, that no-one else has seen.
Crossgrading
Crossgrading is fairly complex and hazardous. It is well supported by the low-level tools (eg, dpkg) but the higher-level packaging tools (eg, apt) get very badly confused.
Nowadays the system is so complex that downloading things by hand and manually feeding them to dpkg is impractical, other than as a very occasional last resort.
The approach, generally, has been to set the system up to want to be the new architecture, run apt in a download-only mode, and do the package installation manually, with some fixing up and retrying, until the system is coherent enough for apt to work.
This is the approach I took. (There are now tools that will help with this, but they are only in recent releases and I wanted to go direct. I also doubted that they would work properly on chiark, since it's so unusual.)
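To give a flavour of what that looks like, here is a much-simplified sketch of such a crossgrade bootstrap. This is not my actual recipe; the package list, ordering, and amount of retrying needed are all illustrative only:

    # Sketch only - a real crossgrade needs far more care and fixing up.
    dpkg --add-architecture amd64          # let dpkg and apt see amd64 packages
    apt-get update
    # Fetch, but do not yet install, the new-architecture essentials:
    apt-get --download-only --yes install dpkg:amd64 apt:amd64 libc6:amd64
    # Install the core by hand from the download cache:
    cd /var/cache/apt/archives
    dpkg --install libc6_*_amd64.deb dpkg_*_amd64.deb apt_*_amd64.deb
    dpkg --configure --pending             # then fix up until apt works again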
Peril and planning
Overall, this was a risky strategy to choose. The package dependencies wouldn't necessarily express all of the sequencing needed. But it still seemed that if I could come up with a working recipe, I could do it.
I restored most of one of chiark's backups onto a scratch volume on my laptop. Using the LVM snapshot tools and chroots, I was able to develop and test a set of scripts that would perform the upgrade. This was a very effective approach: my super-fast laptop, with local caches of the package repositories, was able to do many edit, test, debug cycles.
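Roughly, each test cycle looked something like the following (the volume group, names, and paths here are hypothetical, not chiark's real layout):

    # Take a throwaway snapshot of the restored system and test in a chroot:
    lvcreate --snapshot --name test --size 20G vg0/chiark-restore
    mount /dev/vg0/test /mnt/test
    chroot /mnt/test /root/upgrade-recipe.sh   # run the candidate scripts
    # Inspect the result; then discard the snapshot and iterate:
    umount /mnt/test
    lvremove --force vg0/test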
My recipe made heavy use of snapshot.debian.org, to make sure that it wouldn't rot between testing and implementation.
When I had a working scheme, I told my users about the planned downtime. I warned everyone it might take even 2 or 3 days. I made sure that my access arrangements to the data centre were in place, in case I needed to visit in person. (I have remote serial console and power cycler access.)
Reality - the terrible rescue install
My first task on taking the service down was to check that the emergency rescue installation worked: chiark has an ancient USB stick in the back, which I can boot from via the BIOS. The idea is that many things which go wrong can be repaired from there.
I found that that install was too old to understand chiark's storage arrangements: the mdadm tools gave very strange output. So I needed to upgrade it. After some experiments, I rebooted back into the main install, bringing chiark's service back online.
I then used the main install of chiark as a kind of meta-rescue-image for the rescue-image. The process of getting the rescue image upgraded (not even to amd64, but just to something not totally ancient) was fraught. Several times I had to rescue it by copying files in from the main install outside. And the rescue install was on a truly ancient 2G USB stick, which was terribly, terribly slow, and also very small.
I hadn't done any significant planning for this subtask, because it was low-risk: there was little way to break the main install. Due to all these adverse factors, sorting out the rescue image took five hours.
If I had known at the beginning how long it would take, I would have skipped it: five hours is more than it would have taken to go to London and fix something in person.
Reality - the actual core upgrade
I was able to start the actual upgrade in the mid-afternoon. I meticulously checked and executed the steps from my plan.
The terrifying scripts which sequenced the critical package updates ran flawlessly. Within an hour or so I had a system which was running bullseye amd64, albeit with many important packages still missing or unconfigured.
So I didn't need the rescue image after all, nor to go to the data centre.
Fixing all the things
Then I had to deal with all the inevitable fallout from an upgrade.
Notable incidents:
exim4 has a new tainting system
This is to try to help the sysadmin avoid writing unsafe string interpolations. (Little Bobby Tables.) This was done by Exim upstream in a great hurry as part of a security response process.
The new checks meant that the mail configuration did not work at all. I had to turn off the taint check completely. I'm fairly confident that this is correct, because I am hyper-aware of quoting issues and all of my configuration is written to avoid the problems that tainting is supposed to avoid.
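For reference, I believe the relevant knob is the main-section option Exim gained upstream in 4.95 (and which I understand Debian's bullseye exim4 also carries), though exactly where it goes depends on your exim4 configuration layout:

    # Exim main configuration option disabling taint enforcement.
    # Only safe if, as here, your string expansions are already written
    # to avoid the injection problems tainting guards against:
    allow_insecure_tainted_data = yes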
One particular annoyance is that the approach taken for sqlite lookups makes it totally impossible to use more than one sqlite database. I think the sqlite quoting operator which one uses to interpolate values produces tainted output? I need to investigate this properly.
LVM now ignores PVs which are directly contained within LVs by default
chiark has LVM-on-RAID-on-LVM. This generally works really well.
However, there was one edge case where I ended up without the intermediate RAID layer. The result is LVM-on-LVM.
But recent versions of the LVM tools do not look at PVs inside LVs, by default. This is to help you avoid corrupting the state of any VMs you have on your system. I didn't know that at the time, though. All I knew was that LVM was claiming my PV was "unusable", and wouldn't explain why.
I was about to start on a thorough reading of the 15,000-word essay that is the commentary in the default /etc/lvm/lvm.conf, to try to see if anything was relevant, when I received a helpful tipoff on IRC pointing me to the scan_lvs option.
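The fix, once you know about it, is a one-line change in lvm.conf, something like:

    # /etc/lvm/lvm.conf - allow LVM to scan for PVs inside LVs
    # (scan_lvs defaults to 0 in recent releases):
    devices {
        scan_lvs = 1
    }

After which a pvscan should show the inner PV again.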
I need to file a bug asking for the LVM tools to explain why they have declared a PV unusable.
apache2's default config no longer read one of my config files
I had to do a merge (of my changes vs the maintainers' changes) for /etc/apache2/apache2.conf. When doing this merge I failed to notice that the file /etc/apache2/conf.d/httpd.conf was no longer included by default. My merge dropped that line. There were some important things in there, and until I found this the webserver was broken.
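A sanity check after that kind of conffile merge might look like this (a hypothetical check, using Debian's apache2 layout):

    # Did the merge keep the old conf.d include, and is anything still there?
    grep -n 'conf.d' /etc/apache2/apache2.conf
    ls /etc/apache2/conf.d/
    # And does the merged configuration still parse?
    apache2ctl configtest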
dpkg --skip-same-version DTWT during a crossgrade
(This is not a "fix all the things" - I found it when developing my upgrade process.)
When doing a crossgrade, one often wants to say to dpkg "install all these things, but don't reinstall things that have already been done". That's what --skip-same-version is for.
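The intended use during a crossgrade is something like the following (illustrative, not the exact invocation from my scripts):

    # Feed dpkg the whole pile; packages already installed at the same
    # version should be skipped rather than reinstalled:
    dpkg --skip-same-version --install /var/cache/apt/archives/*.deb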
However, the logic had not been updated as part of the work to support multiarch, so it was wrong. I prepared a patched version of dpkg, and inserted it in the appropriate point in my prepared crossgrade plan.
The patch is now filed as bug #1014476 against dpkg upstream.
Mailman
Mailman is no longer in bullseye. It's only available in the previous release, buster.
bullseye has Mailman 3, which is a totally different system - requiring, basically, a completely new install and configuration. Even preserving existing archive links (a very important requirement) is decidedly nontrivial.
I decided to punt on this whole situation. Currently chiark is running buster's version of Mailman. I will have to deal with this at some point and I'm not looking forward to it.
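Keeping one buster package on an otherwise-bullseye system can be done with apt pinning. A hypothetical sketch (I am not claiming this is exactly what chiark does; it also needs a buster entry in sources.list):

    # /etc/apt/preferences.d/mailman
    Explanation: Hypothetical pin keeping buster's mailman in place
    Explanation: while everything else tracks bullseye.
    Package: mailman
    Pin: release n=buster
    Pin-Priority: 990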
Python
Of course that Mailman is Python 2. The Python project's extremely badly handled transition includes a recommendation to change the meaning of #!/usr/bin/python from Python 2 to Python 3.
But Python 3 is a new language, barely compatible with Python 2 even in the most recent iterations of both, and it is usual to need to coinstall them.
Happily Debian have provided the python-is-python2 package to make things work sensibly, albeit with unpleasant imprecations in the package summary description.
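The effect is easy to check (a trivial illustration):

    # With python-is-python2 installed, the unversioned name is Python 2:
    readlink -f /usr/bin/python    # e.g. /usr/bin/python2.7
    python --version               # Python 2.7.x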
USENET news
Oh my god. INN uses many non-portable data formats, which just depend on your C types. And there are complicated daemons, statically linked libraries which cache on-disk data, and much to go wrong.
I had numerous problems with this, and several outages and malfunctions. I may write about that on a future occasion.
(edited 2022-07-20 11:36 +01:00 and 2022-07-30 12:28 +01:00 to fix typos)