Search Results: "mkd"

8 February 2023

Stephan Lachnit: Setting up fast Debian package builds using sbuild, mmdebstrap and apt-cacher-ng

In this post I will give a quick tutorial on how to set up fast Debian package builds using sbuild with mmdebstrap and apt-cacher-ng. The usual tool for building Debian packages is dpkg-buildpackage, or a user-friendly wrapper like debuild, and while these are great tools, if you want to upload something to the Debian archive they lack the required separation from the system they are run on to ensure that your packaging also works on a different system. The usual candidate here is sbuild. But setting up a schroot is tedious and performance tuning can be annoying. There is an alternative backend for sbuild that promises to make everything simpler: unshare. In this tutorial I will show you how to set up sbuild with this backend. In addition to the normal performance tweaking, caching downloaded packages can be a huge performance increase when rebuilding packages. I do rebuilds quite often, mostly when a new dependency got introduced that I didn't specify in debian/control yet or lintian notices something I can easily fix. So let's begin with setting up this caching.

Setting up apt-cacher-ng Install apt-cacher-ng:
sudo apt install apt-cacher-ng
A pop-up will appear; if you are unsure how to answer it, select no, we don't need it for this use case. To enable apt-cacher-ng on your system, create /etc/apt/apt.conf.d/02proxy and insert:
Acquire::http::proxy "http://127.0.0.1:3142";
Acquire::https::proxy "DIRECT";
In /etc/apt-cacher-ng/acng.conf you can adjust the value of ExThreshold to hold packages for a shorter or longer duration. The length depends on your specific use case and resources. A longer threshold takes more disk space; a short threshold like one day effectively only reduces the build time for rebuilds. If you encounter weird issues on apt update at some point in the future, you can try to clean the apt-cacher-ng cache, for example with a small script.
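A minimal sketch of such a cleanup script, assuming the default cache directory /var/cache/apt-cacher-ng and the packaged systemd service:
#!/bin/sh
# stop apt-cacher-ng, wipe its cache and start it again
sudo systemctl stop apt-cacher-ng
sudo rm -rf /var/cache/apt-cacher-ng/*
sudo systemctl start apt-cacher-ng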

Setting up mmdebstrap Install mmdebstrap:
sudo apt install mmdebstrap
We will create a small helper script to ease creating a chroot. Open ~/.local/bin/mmupdate and insert:
#!/bin/sh
mmdebstrap \
  --variant=buildd \
  --aptopt='Acquire::http::proxy "http://127.0.0.1:3142";' \
  --arch=amd64 \
  --components=main,contrib,non-free \
  unstable \
  ~/.cache/sbuild/unstable-amd64.tar.xz \
  http://deb.debian.org/debian
Notes:
  • aptopt enables apt-cacher-ng inside the chroot.
  • --arch sets the CPU architecture (see Debian Wiki).
  • --components sets the archive components; if you don't want non-free packages you might want to remove some entries here.
  • unstable sets the Debian release, you can also set for example bookworm-backports here.
  • unstable-amd64.tar.xz is the output tarball containing the chroot, change accordingly to your pick of the CPU architecture and Debian release.
  • http://deb.debian.org/debian is the Debian mirror, you should set this to the same one you use in your /etc/apt/sources.list.
Make mmupdate executable and run it once:
chmod +x ~/.local/bin/mmupdate
mkdir -p ~/.cache/sbuild
~/.local/bin/mmupdate
If you execute mmupdate again you can see that the downloading stage is much faster thanks to apt-cacher-ng. For me the difference is from about 115s to about 95s. Your results may vary; this depends on the speed of your internet connection, Debian mirror and disk. If you have used the schroot backend and sbuild-update before, you will probably notice that creating a new chroot with mmdebstrap is slower. It would be a bit annoying to do this manually before we start a new Debian packaging session, so let's create a systemd service that does this for us. First create a folder for user services:
mkdir -p ~/.config/systemd/user
Create ~/.config/systemd/user/mmupdate.service and add:
[Unit]
Description=Run mmupdate
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=%h/.local/bin/mmupdate
Start the service and test that it works:
systemctl --user daemon-reload
systemctl --user start mmupdate
systemctl --user status mmupdate
Create ~/.config/systemd/user/mmupdate.timer:
[Unit]
Description=Run mmupdate daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
Enable the timer:
systemctl --user enable mmupdate.timer
Now mmupdate will be run automatically every day. You can adjust the period if you think daily rebuilds are a bit excessive. A neat advantage of periodic rebuilds is that they keep the base files in your apt-cacher-ng cache warm every time they run.
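For example, to run it weekly instead, change the OnCalendar line in ~/.config/systemd/user/mmupdate.timer to:
OnCalendar=weekly
and then run systemctl --user daemon-reload followed by systemctl --user restart mmupdate.timer.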

Setting up sbuild: Install sbuild and (optionally) autopkgtest:
sudo apt install --no-install-recommends sbuild autopkgtest
Create ~/.sbuildrc and insert:
# backend for using mmdebstrap chroots
$chroot_mode = 'unshare';
# build in tmpfs
$unshare_tmpdir_template = '/dev/shm/tmp.sbuild.XXXXXXXX';
# upgrade before starting build
$apt_update = 1;
$apt_upgrade = 1;
# build everything including source for source-only uploads
$build_arch_all = 1;
$build_arch_any = 1;
$build_source = 1;
$source_only_changes = 1;
# go to shell on failure instead of exiting
$external_commands = { "build-failed-commands" => [ [ '%SBUILD_SHELL' ] ] };
# always clean build dir, even on failure
$purge_build_directory = "always";
# run lintian
$run_lintian = 1;
$lintian_opts = [ '-i', '-I', '-E', '--pedantic' ];
# do not run piuparts
$run_piuparts = 0;
# run autopkgtest
$run_autopkgtest = 1;
$autopkgtest_root_args = '';
$autopkgtest_opts = [ '--apt-upgrade', '--', 'unshare', '--release', '%r', '--arch', '%a', '--prefix=/dev/shm/tmp.autopkgtest.' ];
# set uploader for correct signing
$uploader_name = 'Stephan Lachnit <stephanlachnit@debian.org>';
You should adjust uploader_name. If you don't want to run autopkgtest or lintian by default you can also disable them here. Note that for packages that need a lot of space for building, you might want to comment out the unshare_tmpdir_template line to prevent an OOM build failure. You can now build your Debian packages with the sbuild command :)
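For example, assuming the unstable chroot tarball created above and a package source tree (the directory name is just a placeholder), a build looks like this:
cd your-package/
sbuild -d unstable
With the unshare backend, sbuild should pick up the matching tarball from ~/.cache/sbuild based on the distribution and architecture.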

Finishing touches You can add these variables to your ~/.bashrc as a bonus (with adjusted name / email):
export DEBFULLNAME="<your_name>"
export DEBEMAIL="<your_email>"
export DEB_BUILD_OPTIONS="parallel=<threads>"
In particular adjust the value of parallel to ensure parallel builds. If you are new to signing / uploading your package, first install the required tools:
sudo apt install devscripts dput-ng
Create ~/.devscripts and insert:
DEBSIGN_KEYID=<your_gpg_fingerprint>
USCAN_SYMLINK=rename
You can now sign the .changes file with:
debsign ../<pkgname_version_arch>.changes
And for source-only uploads with:
debsign -S ../<pkgname_version_arch>_source.changes
If you don't introduce a new binary package, you always want to go with source-only changes. You can now upload the package to Debian with
dput ../<filename>.changes
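For example, for a hypothetical source-only upload of a package hello at version 2.10-3:
debsign -S ../hello_2.10-3_source.changes
dput ../hello_2.10-3_source.changes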

Update February 22nd: Jochen Sprickerhof, who originally advised me to use the unshare backend, commented that one can also use --include=auto-apt-proxy instead of the --aptopt option in mmdebstrap to detect apt proxies automatically. He also let me know that it is possible to use autopkgtest on tmpfs (the config in the blog post is updated) and added an entry on the sbuild wiki page on how to set up sbuild+unshare with ccache if you often need to build a large package. Further, using --variant=apt and --include=build-essential will produce smaller build chroots if desired. Conversely, one can of course also use the --include option to include debhelper and lintian (or any other packages you like) to further decrease the setup time. However, staying with the buildd variant is a good choice for official uploads.
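As a sketch, the mmdebstrap call in the mmupdate helper could then look like this, with --include=auto-apt-proxy replacing the --aptopt line and everything else as above:
mmdebstrap \
  --variant=buildd \
  --include=auto-apt-proxy \
  --arch=amd64 \
  --components=main,contrib,non-free \
  unstable \
  ~/.cache/sbuild/unstable-amd64.tar.xz \
  http://deb.debian.org/debian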

Resources for further reading https://wiki.debian.org/sbuild
https://www.unix-ag.uni-kl.de/~bloch/acng/html/index.html
https://wiki.ubuntu.com/SimpleSbuild
https://wiki.archlinux.org/title/Systemd/Timers
https://manpages.debian.org/unstable/autopkgtest/autopkgtest-virt-unshare.1.en.html
Thanks for reading!

7 February 2023

Stephan Lachnit: Installing Debian on F2FS rootfs with debootstrap and systemd-boot

I recently got a new NVME drive. My plan was to create a fresh Debian install on an F2FS root partition with compression for maximum performance. As it turns out, this is not entirely trivial to accomplish. For one, the Debian installer does not support F2FS (here is my attempt to add it from 2021). And even if it did, grub does not support F2FS with the extra_attr flag that is required for compression support (at least as of grub 2.06). Luckily, we can install Debian anyway with all these shiny new features when we go the manual road with debootstrap and using systemd-boot as bootloader. We can break down the process into several steps:
  1. Creating the partition table
  2. Creating and mounting the root partition
  3. Bootstrapping with debootstrap
  4. Chrooting into the system
  5. Configure the base system
  6. Define static file system information
  7. Installing the kernel and bootloader
  8. Finishing touches
Warning: Playing around with partitions can easily result in data loss if you mess up! Make sure to double check your commands and create a data backup if you don't feel confident about the process.

Creating the partition table The first step is to create the GPT partition table on the new drive. There are several tools to do this, I recommend the ArchWiki page on this topic for details. For simplicity I just went with GParted since it has an easy GUI, but feel free to use any other tool (a command-line sketch follows the notes below). The layout should look like this:
Type         Partition        Suggested size
 
EFI          /dev/nvme0n1p1           512MiB
Linux swap   /dev/nvme0n1p2             1GiB
Linux fs     /dev/nvme0n1p3        remainder
Notes:
  • The disk names are just an example and have to be adjusted for your system.
  • Don't set disk labels, they don't appear on the new install anyway and some UEFIs might not like it on your boot partition.
  • The size of the EFI partition can be smaller, in practice it's unlikely that you need more than 300 MiB. However some UEFIs might be buggy and if you ever want to install an additional kernel or something like memtest86+ you will be happy to have the extra space.
  • The swap partition can be omitted, it is not strictly needed. If you need more swap for some reason you can also add more using a swap file later (see ArchWiki page). If you know you want to use hibernation (suspend-to-disk), you want to increase the size to something more than the size of your memory.
  • If you used GParted, create the EFI partition as FAT32 and set the esp flag. For the root partition use ext4 or F2FS if available.
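If you prefer the command line over a GUI, a rough sketch with sgdisk (from the gdisk package), using the device and sizes from the table above; note that --zap-all wipes the existing partition table:
sudo sgdisk --zap-all /dev/nvme0n1
sudo sgdisk -n 1:0:+512M -t 1:ef00 /dev/nvme0n1
sudo sgdisk -n 2:0:+1G -t 2:8200 /dev/nvme0n1
sudo sgdisk -n 3:0:0 -t 3:8300 /dev/nvme0n1
sudo mkfs.fat -F 32 /dev/nvme0n1p1
sudo mkswap /dev/nvme0n1p2
The root partition is formatted in the next step.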

Creating and mounting the root partition To create the root partition, we need to install the f2fs-tools first:
sudo apt install f2fs-tools
Now we can create the file system with the correct flags:
mkfs.f2fs -O extra_attr,inode_checksum,sb_checksum,compression,encrypt /dev/nvme0n1p3
For details on the flags visit the ArchWiki page. Next, we need to mount the partition with the correct flags. First, create a working directory:
mkdir bootstrap
cd bootstrap
mkdir root
export DFS=$(pwd)/root
Then we can mount the partition:
sudo mount -o compress_algorithm=zstd:6,compress_chksum,atgc,gc_merge,lazytime /dev/nvme0n1p3 $DFS
Again, for details on the mount options visit the above mentioned ArchWiki page.

Bootstrapping with debootstrap First we need to install the debootstrap package:
sudo apt install debootstrap
Now we can do the bootstrapping:
debootstrap --arch=amd64 --components=main,contrib,non-free,non-free-firmware unstable $DFS http://deb.debian.org/debian
Notes:
  • --arch sets the CPU architecture (see Debian Wiki).
  • --components sets the archive components; if you don't want non-free packages you might want to remove some entries here.
  • unstable is the Debian release, you might want to change that to testing or bookworm.
  • $DFS points to the mounting point of the root partition.
  • http://deb.debian.org/debian is the Debian mirror, you might want to set that to http://ftp.de.debian.org/debian or similar if you have a fast mirror in your area.

Chrooting into the system Before we can chroot into the newly created system, we need to prepare and mount virtual kernel file systems. First create the directories:
sudo mkdir -p $DFS/dev $DFS/dev/pts $DFS/proc $DFS/sys $DFS/run $DFS/sys/firmware/efi/efivars $DFS/boot/efi
Then bind-mount the directories from your system to the mount point of the new system:
sudo mount -v -B /dev $DFS/dev
sudo mount -v -B /dev/pts $DFS/dev/pts
sudo mount -v -B /proc $DFS/proc
sudo mount -v -B /sys $DFS/sys
sudo mount -v -B /run $DFS/run
sudo mount -v -B /sys/firmware/efi/efivars $DFS/sys/firmware/efi/efivars
As a last step, we need to mount the EFI partition:
sudo mount -v -B /dev/nvme0n1p1 $DFS/boot/efi
Now we can chroot into the new system:
sudo chroot $DFS /bin/bash

Configure the base system The first step in the chroot is setting the locales. We need this since we might leak the locales from our base system into the chroot and if this happens we get a lot of annoying warnings.
export LC_ALL=C.UTF-8 LANG=C.UTF-8
apt install locales console-setup
Set your locales:
dpkg-reconfigure locales
Set your keyboard layout:
dpkg-reconfigure keyboard-configuration
Set your timezone:
dpkg-reconfigure tzdata
Now you have a fully functional Debian chroot! However, it is not bootable yet, so let s fix that.

Define static file system information The first step is to make sure the system mounts all partitions on startup with the correct mount flags. This is done in /etc/fstab (see ArchWiki page). Open the file and change its content to:
# file system                               mount point   type   options                                                            dump   pass
# NVME efi partition
UUID=XXXX-XXXX                              /boot/efi     vfat   umask=0077                                                         0      0
# NVME swap
UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX   none          swap   sw                                                                 0      0
# NVME main partition
UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX   /             f2fs   compress_algorithm=zstd:6,compress_chksum,atgc,gc_merge,lazytime   0      1
You need to fill in the UUIDs for the partitions. You can use
ls -lAph /dev/disk/by-uuid/
to match the UUIDs to the more readable disk name under /dev.
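Alternatively, blkid prints the UUIDs directly:
blkid /dev/nvme0n1p1 /dev/nvme0n1p2 /dev/nvme0n1p3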

Installing the kernel and bootloader First install the systemd-boot and efibootmgr packages:
apt install systemd-boot efibootmgr
Now we can install the bootloader:
bootctl install --path=/boot/efi
You can verify the procedure worked with
efibootmgr -v
The next step is to install the kernel, you can find a fitting image with:
apt search linux-image-*
In my case:
apt install linux-image-amd64
After the installation of the kernel, apt will add an entry for systemd-boot automatically. Neat! However, since we are in a chroot the current settings are not bootable. The first reason is the boot partition, which will likely be the one from your current system. To change that, navigate to /boot/efi/loader/entries, it should contain one config file. When you open this file, it should look something like this:
title      Debian GNU/Linux bookworm/sid
version    6.1.0-3-amd64
machine-id 2967cafb6420ce7a2b99030163e2ee6a
sort-key   debian
options    root=PARTUUID=f81d4fae-7dec-11d0-a765-00a0c91e6bf6 ro systemd.machine_id=2967cafb6420ce7a2b99030163e2ee6a
linux      /2967cafb6420ce7a2b99030163e2ee6a/6.1.0-3-amd64/linux
initrd     /2967cafb6420ce7a2b99030163e2ee6a/6.1.0-3-amd64/initrd.img-6.1.0-3-amd64
The PARTUUID needs to point to the partition equivalent to /dev/nvme0n1p3 on your system. You can use
ls -lAph /dev/disk/by-partuuid/
to match the PARTUUIDs to the more readable disk name under /dev. The second problem is the ro flag in options which tells the kernel to boot in read-only mode. The default is rw, so you can just remove the ro flag. Once this is fixed, the new system should be bootable. You can change the boot order with:
efibootmgr --bootorder
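The option expects a comma-separated list of boot entry numbers as shown by efibootmgr -v, for example (the entry numbers here are made up):
efibootmgr --bootorder 0001,0000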
However, before we reboot we might as well add a user and install some basic software.

Finishing touches Add a user:
useradd -m -G sudo -s /usr/bin/bash -c 'Full Name' username
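Note that useradd does not set a password, so you probably want to set one before rebooting (using the username chosen above):
passwd username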
Debian provides a TUI to install a desktop environment. To open it, run:
tasksel
Now you can finally reboot into your new system:
reboot

Resources for further reading https://ivanb.neocities.org/blogs/y2022/debootstrap
https://www.debian.org/releases/stable/amd64/apds03.en.html
https://www.addictivetips.com/ubuntu-linux-tips/set-up-systemd-boot-on-arch-linux/
https://p5r.uk/blog/2020/using-systemd-boot-on-debian-bullseye.html
https://www.linuxfromscratch.org/lfs/view/stable/chapter07/kernfs.html
Thanks for reading!

12 January 2023

Jonathan McDowell: Building a read-only Debian root setup: Part 1

I mentioned in the post about upgrading my home internet that part of the work I did was creating a read-only Debian root with a squashfs image. This post covers the details of how I boot with that image; a later post will cover how I build the squashfs image. First, David Reader kindly pointed me at his rodebian setup, which was helpful in making me think about the whole problem but ultimately not the direction I went. Primarily because on the old router (an RB3011) I am space constrained, with only 120M of usable flash, and so ideally I wanted as much as possible of the system in a well compressed filesystem. squashfs seemed like the best option for that, and ultimately I ended up with a 39M image. I've then used overlayfs to mount a tmpfs, so I get what looks like a writeable system without having to do too many tweaks to the actual install. On the plus side I can then see exactly what is getting written where and decide whether I need to update something in the squashfs. I don't boot with an initrd - for initial testing I booted directly off a USB stick. I've actually ended up continuing to do this in production, because I've had no pressing reason to move it all to booting off internal flash (I've ended up with a Sandisk SDCZ430-032G-G46 which is tiny). However nothing I'm going to describe is dependent on that - this would work perfectly well for an initial UBIFS rootfs on internal NAND. So the basic overview is I boot off a minimal rootfs, mount a squashfs, create an appropriate tmpfs, mount an overlayfs that combines the two, then pivot_root into the overlayfs and exec its init so it becomes the rootfs. For the minimal rootfs I started with busybox, in particular I used the armhf busybox-static package from Debian. My RB5009 is an ARM64, but I wanted to be able to test on the RB3011 as well, which is ARMv7. Picking an armhf binary for the minimal rootfs lets me use the same image for both. Using the static build helps reduce the number of pieces involved in putting it all together. The busybox binary goes in /bin. I was able to cheat and chroot into the empty rootfs and call busybox --install -s to create symlinks for all the tools it provides, but I could have done this manually. There's only a handful that are actually needed, but it's amazing how much is crammed into a 1.2M binary. /sbin/init is a shell script:
Contents
#!/bin/ash
# Make sure we have a sane date
if [ -e /data/saved-date ]; then
        CURRENT_DATE=$(date -Iseconds)
        if [ "$ CURRENT_DATE:0:4 " -lt "2022" -o \
                        "$ CURRENT_DATE:0:4 " -gt "2030" ]; then
                echo Setting initial date
                date -s "$(cat /data/saved-date)"
        fi
fi
# Work out what platform we're on
ARCH=$(uname -m)
if [ "$ ARCH " == "aarch64" ]; then
        ARCH=arm64
else
        ARCH=armhf
fi
# Mount a tmpfs to store the changes
mount -t tmpfs root-rw /mnt/overlay/rw
# Make the directories we need in the tmpfs
mkdir /mnt/overlay/rw/upper
mkdir /mnt/overlay/rw/work
# Mount the squashfs and build an overlay root filesystem of it + the tmpfs
mount -t squashfs -o loop /data/router.${ARCH}.squashfs /mnt/overlay/lower
mount -t overlay \
        -o lowerdir=/mnt/overlay/lower,upperdir=/mnt/overlay/rw/upper,workdir=/mnt/overlay/rw/work \
        overlayfs-root /mnt/root
# Build the directories we need within the new root
mkdir /mnt/root/mnt/flash
mkdir /mnt/root/mnt/overlay
mkdir /mnt/root/mnt/overlay/lower
mkdir /mnt/root/mnt/overlay/rw
# Copy any stored state
if [ -e /data/state.${ARCH}.tar ]; then
        echo Restoring stored state
        cd /mnt/root
        tar xf /data/state.${ARCH}.tar
fi
cd /mnt/root
pivot_root . mnt/flash
echo Switching into root filesystem
exec chroot . sh -c "$(cat <<END
mount --move /mnt/flash/mnt/overlay/lower /mnt/overlay/lower
mount --move /mnt/flash/mnt/overlay/rw /mnt/overlay/rw
exec /sbin/init
END
)"
Most of what the script is doing is sorting out the squashfs + tmpfs backed overlayfs that becomes the full root filesystem, but there are a few other bits to note. First, we pick up a saved date from /data/saved-date - the router has no RTC and while it'll sort itself out with NTP once it gets networking up it's useful to make sure we don't end up comically far in the past or future. Second, the script looks at what architecture we're running and picks up an appropriate squashfs image from /data based on that. This let me use the same USB stick for testing on both the RB3011 and the RB5009. Finally we allow for a /data/state.${ARCH}.tar file to let us pick up changes to the rootfs at boot time - this prevents having to rebuild the squashfs image every time there's a persistent change. The other piece that doesn't show up in the script is that the kernel and its modules are all installed into this initial rootfs (and then symlinked from the squashfs). This lets me build a mostly modular kernel, as long as all the necessary drivers to mount the USB stick are built in. Once the system is fully booted the initial rootfs is available at /mnt/flash, by default mounted read-only (to avoid inadvertent writes), but able to be remounted to update the squashfs image, install a new kernel, or update the state tarball. /mnt/overlay/rw/upper/ is where updates to the overlayfs are written, which provides an easy way to see what files are changing, initially to determine what might need tweaking in the squashfs creation process and subsequently to be able to see what needs updating in the state tarball.
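As a rough sketch (not part of the original setup, and which files to include is up to you), the state tarball could be refreshed from the booted system by remounting the initial rootfs read-write and archiving the relevant parts of the overlay's upper directory:
mount -o remount,rw /mnt/flash
cd /mnt/overlay/rw/upper
tar cf /mnt/flash/data/state.arm64.tar etc/
mount -o remount,ro /mnt/flash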

30 December 2022

Simon Josefsson: Preseeding Trisquel Virtual Machines Using netinst Images

I'm migrating some self-hosted virtual machines to Trisquel, and noticed that Trisquel does not offer cloud-images similar to the Debian Cloud and Ubuntu Cloud images. Thus my earlier approach based on virt-install --cloud-init and cloud-localds does not work with Trisquel. While I hope that Trisquel will eventually publish cloud-compatible images, I wanted to document an alternative approach for Trisquel based on preseeding. This is how I used to install Debian and Ubuntu in the old days, and the automated preseed method is best documented in the Debian installation manual. I was hoping to forget about the preseed format, but maybe it will become one of those legacy technologies that never really disappears? Like FAT16 and 8-bit microcontrollers. Below I assume you have a virtual machine host server up that runs libvirt and has virt-install and similar tools; install them with the following command. I run a pre-release version of Trisquel 11 aramo on my VM-host, but I believe any recent dpkg-based distribution like Trisquel 9/10, PureOS 10, Debian 11 or Ubuntu 20.04/22.04 would work.
apt-get install libvirt-daemon-system virtinst genisoimage cloud-image-utils osinfo-db-tools
The approach can install Trisquel 9 (etiona), Trisquel 10 (nabia) and the pre-release of Trisquel 11. First download and verify the integrity of the netinst images that we will need. Unfortunately the Trisquel 11 netinst beta image does not have any checksum or signature available.
mkdir -p /root/iso
cd /root/iso
wget -q https://mirror.fsf.org/trisquel-images/trisquel-netinst_9.0.2_amd64.iso
wget -q https://mirror.fsf.org/trisquel-images/trisquel-netinst_9.0.2_amd64.iso.asc
wget -q https://mirror.fsf.org/trisquel-images/trisquel-netinst_9.0.2_amd64.iso.sha256
wget -q https://mirror.fsf.org/trisquel-images/trisquel-netinst_10.0.1_amd64.iso
wget -q https://mirror.fsf.org/trisquel-images/trisquel-netinst_10.0.1_amd64.iso.asc
wget -q https://mirror.fsf.org/trisquel-images/trisquel-netinst_10.0.1_amd64.iso.sha256
wget -q -O- https://archive.trisquel.info/trisquel/trisquel-archive-signkey.gpg | gpg --import
sha256sum -c trisquel-netinst_9.0.2_amd64.iso.sha256
gpg --verify trisquel-netinst_9.0.2_amd64.iso.asc
sha256sum -c trisquel-netinst_10.0.1_amd64.iso.sha256
gpg --verify trisquel-netinst_10.0.1_amd64.iso.asc
wget -q https://cdbuilds.trisquel.org/aramo/trisquel-netinst_11.0-20221225_amd64.iso
echo '179566639ca8f14f0c3d5658209c59a0916d9e3bf9c026660cc07b28f2311631  trisquel-netinst_11.0-20221225_amd64.iso' | sha256sum -c
I have developed the following fairly minimal preseed file that works with all three Trisquel releases. Compare it against the official Trisquel 11 preseed skeleton and the Debian 11 example preseed file. You should modify obvious things like SSH key, host/IP settings, partition layout and decide for yourself how to deal with passwords. While Ubuntu/Trisquel usually wants to set up a user account, I prefer to log in as root, hence setting passwd/root-login to true and passwd/make-user to false.

root@trana:~# cat>trisquel.preseed 
d-i debian-installer/locale select en_US
d-i keyboard-configuration/xkb-keymap select us
d-i netcfg/choose_interface select auto
d-i netcfg/disable_autoconfig boolean true
d-i netcfg/get_ipaddress string 192.168.10.201
d-i netcfg/get_netmask string 255.255.255.0
d-i netcfg/get_gateway string 192.168.10.46
d-i netcfg/get_nameservers string 192.168.10.46
d-i netcfg/get_hostname string trisquel
d-i netcfg/get_domain string sjd.se
d-i clock-setup/utc boolean true
d-i time/zone string UTC
d-i mirror/country string manual
d-i mirror/http/hostname string ftp.acc.umu.se
d-i mirror/http/directory string /mirror/trisquel/packages
d-i mirror/http/proxy string
d-i partman-auto/method string regular
d-i partman-partitioning/confirm_write_new_label boolean true
d-i partman/choose_partition select finish
d-i partman/confirm boolean true
d-i partman/confirm_nooverwrite boolean true
d-i partman-basicfilesystems/no_swap boolean false
d-i partman-auto/expert_recipe string myroot :: 1000 50 -1 ext4 \
     $primary{ } $bootable{ } method{ format } \
     format{ } use_filesystem{ } filesystem{ ext4 } \
     mountpoint{ / } \
    .
d-i partman-auto/choose_recipe select myroot
d-i passwd/root-login boolean true
d-i user-setup/allow-password-weak boolean true
d-i passwd/root-password password r00tme
d-i passwd/root-password-again password r00tme
d-i passwd/make-user boolean false
tasksel tasksel/first multiselect
d-i pkgsel/include string openssh-server
popularity-contest popularity-contest/participate boolean false
d-i grub-installer/only_debian boolean true
d-i grub-installer/with_other_os boolean true
d-i grub-installer/bootdev string default
d-i finish-install/reboot_in_progress note
d-i preseed/late_command string mkdir /target/root/.ssh ; echo ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILzCFcHHrKzVSPDDarZPYqn89H5TPaxwcORgRg+4DagE cardno:FFFE67252015 > /target/root/.ssh/authorized_keys
^D
root@trana:~# 
Use the file above as a skeleton for preparing a VM-specific preseed file as follows. The environment variables HOST and IP will be used later on too.

root@trana:~# HOST=foo
root@trana:~# IP=192.168.10.197
root@trana:~# sed -e "s,get_ipaddress string.*,get_ipaddress string $IP," -e "s,get_hostname string.*,get_hostname string $HOST," < trisquel.preseed > vm-$HOST.preseed
root@trana:~# 
The following script is used to prepare the ISO images with the preseed file that we will need. This script is inspired by the Debian Wiki Preseed EditIso page and the Trisquel ISO customization wiki page. There are a couple of variations based on earlier works. Paths are updated to match the Trisquel netinst ISO layout, which differs slightly from Debian. We modify isolinux.cfg to boot the auto label without a timeout. On Trisquel 11 the auto boot label exists, but on Trisquel 9 and Trisquel 10 it does not exist so we add it in order to be able to start the automated preseed installation.

root@trana:~# cat gen-preseed-iso 
#!/bin/sh
# Copyright (C) 2018-2022 Simon Josefsson -- GPLv3+
# https://wiki.debian.org/DebianInstaller/Preseed/EditIso
# https://trisquel.info/en/wiki/customizing-trisquel-iso
set -e
set -x
ISO="$1"
PRESEED="$2"
OUTISO="$3"
LASTPWD="$PWD"
test -f "$ISO"
test -f "$PRESEED"
test ! -f "$OUTISO"
TMPDIR=$(mktemp -d)
mkdir "$TMPDIR/mnt"
mkdir "$TMPDIR/tmp"
cp "$PRESEED" "$TMPDIR"/preseed.cfg
cd "$TMPDIR"
mount "$ISO" mnt/
cp -rT mnt/ tmp/
umount mnt/
chmod +w -R tmp/
gunzip tmp/initrd.gz
echo preseed.cfg | cpio -H newc -o -A -F tmp/initrd
gzip tmp/initrd
chmod -w -R tmp/
sed -i "s/timeout 0/timeout 1/" tmp/isolinux.cfg
sed -i "s/default vesamenu.c32/default auto/" tmp/isolinux.cfg
if ! grep -q auto tmp/adtxt.cfg; then
    cat<<EOF >> tmp/adtxt.cfg
label auto
	menu label ^Automated install
	kernel linux
	append auto=true priority=critical vga=788 initrd=initrd.gz --- quiet
EOF
fi
cd tmp/
find -follow -type f | xargs md5sum > md5sum.txt
cd ..
cd "$LASTPWD"
genisoimage -r -J -b isolinux.bin -c boot.cat \
            -no-emul-boot -boot-load-size 4 -boot-info-table \
            -o "$OUTISO" "$TMPDIR/tmp/"
rm -rf "$TMPDIR"
exit 0
^D
root@trana:~# chmod +x gen-preseed-iso 
root@trana:~# 
Next run the command on one of the downloaded ISO image and the generated preseed file.

root@trana:~# ./gen-preseed-iso /root/iso/trisquel-netinst_10.0.1_amd64.iso vm-$HOST.preseed vm-$HOST.iso
+ ISO=/root/iso/trisquel-netinst_10.0.1_amd64.iso
+ PRESEED=vm-foo.preseed
+ OUTISO=vm-foo.iso
+ LASTPWD=/root
+ test -f /root/iso/trisquel-netinst_10.0.1_amd64.iso
+ test -f vm-foo.preseed
+ test ! -f vm-foo.iso
+ mktemp -d
+ TMPDIR=/tmp/tmp.mNEprT4Tx9
+ mkdir /tmp/tmp.mNEprT4Tx9/mnt
+ mkdir /tmp/tmp.mNEprT4Tx9/tmp
+ cp vm-foo.preseed /tmp/tmp.mNEprT4Tx9/preseed.cfg
+ cd /tmp/tmp.mNEprT4Tx9
+ mount /root/iso/trisquel-netinst_10.0.1_amd64.iso mnt/
mount: /tmp/tmp.mNEprT4Tx9/mnt: WARNING: source write-protected, mounted read-only.
+ cp -rT mnt/ tmp/
+ umount mnt/
+ chmod +w -R tmp/
+ gunzip tmp/initrd.gz
+ echo preseed.cfg
+ cpio -H newc -o -A -F tmp/initrd
5 blocks
+ gzip tmp/initrd
+ chmod -w -R tmp/
+ sed -i s/timeout 0/timeout 1/ tmp/isolinux.cfg
+ sed -i s/default vesamenu.c32/default auto/ tmp/isolinux.cfg
+ grep -q auto tmp/adtxt.cfg
+ cat
+ cd tmp/
+ find -follow -type f
+ xargs md5sum
+ cd ..
+ cd /root
+ genisoimage -r -J -b isolinux.bin -c boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table -o vm-foo.iso /tmp/tmp.mNEprT4Tx9/tmp/
I: -input-charset not specified, using utf-8 (detected in locale settings)
Using GCRY_000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/gcry_sha512.mod (gcry_sha256.mod)
Using XNU_U000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/xnu_uuid.mod (xnu_uuid_test.mod)
Using PASSW000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/password_pbkdf2.mod (password.mod)
Using PART_000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/part_sunpc.mod (part_sun.mod)
Using USBSE000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/usbserial_pl2303.mod (usbserial_ftdi.mod)
Using USBSE001.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/usbserial_ftdi.mod (usbserial_usbdebug.mod)
Using VIDEO000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/videotest.mod (videotest_checksum.mod)
Using GFXTE000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/gfxterm_background.mod (gfxterm_menu.mod)
Using GCRY_001.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/gcry_sha256.mod (gcry_sha1.mod)
Using MULTI000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/multiboot2.mod (multiboot.mod)
Using USBSE002.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/usbserial_usbdebug.mod (usbserial_common.mod)
Using MDRAI000.MOD;1 for  /tmp/tmp.mNEprT4Tx9/tmp/boot/grub/x86_64-efi/mdraid09.mod (mdraid09_be.mod)
Size of boot image is 4 sectors -> No emulation
 22.89% done, estimate finish Thu Dec 29 23:36:18 2022
 45.70% done, estimate finish Thu Dec 29 23:36:18 2022
 68.56% done, estimate finish Thu Dec 29 23:36:18 2022
 91.45% done, estimate finish Thu Dec 29 23:36:18 2022
Total translation table size: 2048
Total rockridge attributes bytes: 24816
Total directory bytes: 40960
Path table size(bytes): 64
Max brk space used 46000
21885 extents written (42 MB)
+ rm -rf /tmp/tmp.mNEprT4Tx9
+ exit 0
root@trana:~#
Now the image is ready for installation, so invoke virt-install as follows. The machine will start directly, launching the preseed automatic installation. At this point, I usually click on the virtual machine in virt-manager to follow screen output until the installation has finished. If everything works OK the machine comes up and I can ssh into it.

root@trana:~# virt-install --name $HOST --disk vm-$HOST.img,size=5 --cdrom vm-$HOST.iso --osinfo linux2020 --autostart --noautoconsole --wait
Using linux2020 default --memory 4096
Starting install...
Allocating 'vm-foo.img'                                                                                                                                     0 B  00:00:00 ... 
Creating domain...                                                                                                                                          0 B  00:00:00     
Domain is still running. Installation may be in progress.
Waiting for the installation to complete.
Domain has shutdown. Continuing.
Domain creation completed.
Restarting guest.
root@trana:~# 
There are some problems that I have noticed that would be nice to fix, but are easy to work around. The first is that at the end of the installation of Trisquel 9 and Trisquel 10, the VM hangs after displaying Sent SIGKILL to all processes followed by Requesting system reboot. I kill the VM manually using virsh destroy foo and start it up again using virsh start foo. For production use I expect to be running Trisquel 11, where the problem doesn't happen, so this does not bother me enough to debug further. The remaining issue is that once booted, a Trisquel 11 VM has lost its DNS nameserver configuration, presumably due to poor integration with systemd-resolved. Both Trisquel 9 and Trisquel 10 use systemd-resolved and DNS works after the first boot, so this appears to be a Trisquel 11 bug. You can work around it with rm -f /etc/resolv.conf && echo 'nameserver A.B.C.D' > /etc/resolv.conf or drink the systemd Kool-Aid. If you want to clean up and re-start the process, here is how you wipe out what you did. After this, you may run the sed, ./gen-preseed-iso and virt-install commands again. Remember, use virsh shutdown foo to gracefully shut down a VM.

root@trana:~# virsh destroy foo
Domain 'foo' destroyed
root@trana:~# virsh undefine foo --remove-all-storage
Domain 'foo' has been undefined
Volume 'vda'(/root/vm-foo.img) removed.
root@trana:~# rm vm-foo.*
root@trana:~# 
Happy hacking on your virtual machines!

26 December 2022

Vincent Bernat: Managing infrastructure with Terraform, CDKTF, and NixOS

A few years ago, I downsized my personal infrastructure. Until 2018, there were a dozen containers running on a single Hetzner server.1 I migrated my emails to Fastmail and my DNS zones to Gandi. It left me with only my blog to self-host. As of today, my low-scale infrastructure is composed of 4 virtual machines running NixOS on Hetzner Cloud and Vultr, a handful of DNS zones on Gandi and Route 53, and a couple of Cloudfront distributions. It is managed by CDK for Terraform (CDKTF), while NixOS deployments are handled by NixOps. In this article, I provide a brief introduction to Terraform, CDKTF, and the Nix ecosystem. I also explain how to use Nix to access these tools within your shell, so you can quickly start using them.

CDKTF: infrastructure as code Terraform is an infrastructure-as-code tool. You can define your infrastructure by declaring resources with the HCL language. This language has some additional features like loops to declare several resources from a list, built-in functions you can call in expressions, and string templates. Terraform relies on a large set of providers to manage resources.

Managing servers Here is a short example using the Hetzner Cloud provider to spawn a virtual machine:
variable "hcloud_token"  
  sensitive = true
 
provider "hcloud"  
  token = var.hcloud_token
 
resource "hcloud_server" "web03"  
  name = "web03"
  server_type = "cpx11"
  image = "debian-11"
  datacenter = "nbg1-dc3"
 
resource "hcloud_rdns" "rdns4-web03"  
  server_id = hcloud_server.web03.id
  ip_address = hcloud_server.web03.ipv4_address
  dns_ptr = "web03.luffy.cx"
 
resource "hcloud_rdns" "rdns6-web03"  
  server_id = hcloud_server.web03.id
  ip_address = hcloud_server.web03.ipv6_address
  dns_ptr = "web03.luffy.cx"
 
HCL expressiveness is quite limited and I find a general-purpose language more convenient to describe all the resources. This is where CDK for Terraform comes in: you can manage your infrastructure using your preferred programming language, including TypeScript, Go, and Python. Here is the previous example using CDKTF and TypeScript:
import { Construct } from "constructs";
import { App, TerraformStack, TerraformVariable, Fn } from "cdktf";
import { HcloudProvider } from "./.gen/providers/hcloud/provider";
import * as hcloud from "./.gen/providers/hcloud";
class MyStack extends TerraformStack {
  constructor(scope: Construct, name: string) {
    super(scope, name);
    const hcloudToken = new TerraformVariable(this, "hcloudToken", {
      type: "string",
      sensitive: true,
    });
    const hcloudProvider = new HcloudProvider(this, "hcloud", {
      token: hcloudToken.value,
    });
    const web03 = new hcloud.server.Server(this, "web03", {
      name: "web03",
      serverType: "cpx11",
      image: "debian-11",
      datacenter: "nbg1-dc3",
      provider: hcloudProvider,
    });
    new hcloud.rdns.Rdns(this, "rdns4-web03", {
      serverId: Fn.tonumber(web03.id),
      ipAddress: web03.ipv4Address,
      dnsPtr: "web03.luffy.cx",
      provider: hcloudProvider,
    });
    new hcloud.rdns.Rdns(this, "rdns6-web03", {
      serverId: Fn.tonumber(web03.id),
      ipAddress: web03.ipv6Address,
      dnsPtr: "web03.luffy.cx",
      provider: hcloudProvider,
    });
  }
}
const app = new App();
new MyStack(app, "cdktf-take1");
app.synth();
Running cdktf synth generates a configuration file for Terraform, terraform plan previews the changes, and terraform apply applies them. Now that you have a general-purpose language, you can use functions.

Managing DNS records While using CDKTF for 4 web servers may seem a tad overkill, this is quite different when it comes to managing a few DNS zones. With DNSControl, which is using JavaScript as a domain-specific language, I was able to define the bernat.ch zone with this snippet of code:
D("bernat.ch", REG_NONE, DnsProvider(DNS_BIND, 0), DnsProvider(DNS_GANDI),
  DefaultTTL('2h'),
  FastMailMX('bernat.ch',  subdomains: ['vincent'] ),
  WebServers('@'),
  WebServers('vincent');
This generated 38 records. With CDKTF, I use:
new Route53Zone(this, "bernat.ch", providers.aws)
  .sign(dnsCMK)
  .registrar(providers.gandiVB)
  .www("@", servers)
  .www("vincent", servers)
  .www("media", servers)
  .fastmailMX(["vincent"]);
All the magic is in the code that I did not show you. You can check the dns.ts file in the cdktf-take1 repository to see how it works. Here is a quick explanation:
  • Route53Zone() creates a new zone hosted by Route 53,
  • sign() signs the zone with the provided master key,
  • registrar() registers the zone to the registrar of the domain and sets up DNSSEC,
  • www() creates A and AAAA records for the provided name pointing to the web servers,
  • fastmailMX() creates the MX records and other support records to direct emails to Fastmail.
Here is the content of the fastmailMX() function. It generates a few records and returns the current zone for chaining:
fastmailMX(subdomains?: string[]) {
  (subdomains ?? [])
    .concat(["@", "*"])
    .forEach((subdomain) =>
      this.MX(subdomain, [
        "10 in1-smtp.messagingengine.com.",
        "20 in2-smtp.messagingengine.com.",
      ])
    );
  this.TXT("@", "v=spf1 include:spf.messagingengine.com ~all");
  ["mesmtp", "fm1", "fm2", "fm3"].forEach((dk) =>
    this.CNAME(`${dk}._domainkey`, `${dk}.${this.name}.dkim.fmhosted.com.`)
  );
  this.TXT("_dmarc", "v=DMARC1; p=none; sp=none");
  return this;
}
I encourage you to browse the repository if you need more information.

About Pulumi My first foray into Terraform was with Pulumi. You can find this attempt on GitHub. This is quite similar to what I currently do with CDKTF. The main difference is that I am using Python instead of TypeScript because I was not familiar with TypeScript at the time.2 Pulumi predates CDKTF and it uses a slightly different approach. CDKTF generates a Terraform configuration (in JSON format instead of HCL), delegating planning, state management, and deployment to Terraform. It is therefore bound to the limitations of what can be expressed by Terraform, notably when you need to transform data obtained from one resource to another.3 Pulumi needs specific providers for each resource. Many Pulumi providers are thin wrappers encapsulating Terraform providers. While Pulumi provides a good user experience, I switched to CDKTF because writing providers for Pulumi is a chore. CDKTF does not require such a step. Outside the big players (AWS, Azure and Google Cloud), the existence, quality, and freshness of the Pulumi providers are inconsistent. Most providers rely on a Terraform provider and they may lag a few versions behind, miss a few resources, or have a few bugs of their own. When a provider does not exist, you can write one with the help of the pulumi-terraform-bridge library. The Pulumi project provides a boilerplate for this purpose. I had a bad experience with it when writing providers for Gandi and Vultr: the Makefile automatically installs Pulumi using a curl | sh pattern and does not work with /bin/sh. There is a lack of interest in community-based contributions4 or even in providers for smaller players.

NixOS & NixOps Nix is a purely functional programming language. Nix is also the name of the package manager that is built on top of the Nix language. It allows users to declaratively install packages. nixpkgs is a repository of packages. You can install Nix on top of a regular Linux distribution. If you want more details, a good resource is the official website, and notably the learn section. There is a steep learning curve, but the reward is tremendous.

NixOS: declarative Linux distribution NixOS is a Linux distribution built on top of the Nix package manager. Here is a configuration snippet to add some packages:
environment.systemPackages = with pkgs;
  [
    bat
    htop
    liboping
    mg
    mtr
    ncdu
    tmux
  ];
It is possible to alter an existing derivation5 to use a different version, enable a specific feature, or apply a patch. Here is how I enable and configure Nginx to disable the stream module, add the Brotli compression module, and add the IP address anonymizer module. Moreover, instead of using OpenSSL 3, I keep using OpenSSL 1.1.6
services.nginx = {
  enable = true;
  package = (pkgs.nginxStable.override {
    withStream = false;
    modules = with pkgs.nginxModules; [
      brotli
      ipscrub
    ];
    openssl = pkgs.openssl_1_1;
  });
};
If you need to add some patches, it is also possible. Here are the patches I added in 2019 to circumvent the DoS vulnerabilities in Nginx until they were fixed in NixOS:7
services.nginx.package = pkgs.nginxStable.overrideAttrs (old: {
  patches = old.patches ++ [
    # HTTP/2: reject zero length headers with PROTOCOL_ERROR.
    (pkgs.fetchpatch {
      url = https://github.com/nginx/nginx/commit/dbdd[…].patch;
      sha256 = "a48190[…]";
    })
    # HTTP/2: limited number of DATA frames.
    (pkgs.fetchpatch {
      url = https://github.com/nginx/nginx/commit/94c5[…].patch;
      sha256 = "af591a[…]";
    })
    #  HTTP/2: limited number of PRIORITY frames.
    (pkgs.fetchpatch {
      url = https://github.com/nginx/nginx/commit/39bb[…].patch;
      sha256 = "1ad8fe[…]";
    })
  ];
});
If you are interested, have a look at my relatively small configuration: common.nix contains the configuration to be applied to any host (SSH, users, common software packages), web.nix contains the configuration for the web servers, isso.nix runs Isso into a systemd container.

NixOps: NixOS deployment tool On a single node, NixOS configuration is in the /etc/nixos/configuration.nix file. After modifying it, you have to run nixos-rebuild switch. Nix fetches all possible dependencies from the binary cache and builds the remaining packages. It creates a new entry in the boot loader menu and activates the new configuration. To manage several nodes, there exist several options, including NixOps, deploy-rs, Colmena, and morph. I do not know all of them, but from my point of view, the differences are not that important. It is also possible to build such a tool yourself as Nix provides the most important building blocks: nix build and nix copy. NixOps is one of the first tools available but I encourage you to explore the alternatives. NixOps configuration is written in Nix. Here is a simplified configuration to deploy znc01.luffy.cx, web01.luffy.cx, and web02.luffy.cx, with the help of the server and web functions:
let
  server = hardware: name: imports: {
    deployment.targetHost = "${name}.luffy.cx";
    networking.hostName = name;
    networking.domain = "luffy.cx";
    imports = [ (./hardware/. + "/${hardware}.nix") ] ++ imports;
  };
  web = hardware: idx: imports:
    server hardware "web${lib.fixedWidthNumber 2 idx}" ([ ./web.nix ] ++ imports);
in {
  network.description = "Luffy infrastructure";
  network.enableRollback = true;
  defaults = import ./common.nix;
  znc01 = server "exoscale" "znc01" [ ./znc.nix ];
  web01 = web "hetzner" 1 [ ./isso.nix ];
  web02 = web "hetzner" 2 [];
}

Tying everything together with Nix The Nix ecosystem is a unified solution to the various problems around software and configuration management. A very interesting feature is the declarative and reproducible developer environments. This is similar to Python virtual environments, except it is not language-specific.

Brief introduction to Nix flakes I am using flakes, a new Nix feature improving reproducibility by pinning all dependencies and making the build hermetic. While the feature is marked as experimental,8 it is widely used and you may see flake.nix and flake.lock at the root of some repositories. As a short example, here is the flake.nix content shipped with Snimpy, an interactive SNMP tool for Python relying on libsmi, a C library:
{
  inputs = {
    nixpkgs.url = "nixpkgs";
    flake-utils.url = "github:numtide/flake-utils";
  };
  outputs = { self, ... }@inputs:
    inputs.flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = inputs.nixpkgs.legacyPackages."${system}";
      in
      {
        # nix build
        packages.default = pkgs.python3Packages.buildPythonPackage {
          name = "snimpy";
          src = self;
          preConfigure = ''echo "1.0.0-0-000000000000" > version.txt'';
          checkPhase = "pytest";
          checkInputs = with pkgs.python3Packages; [ pytest mock coverage ];
          propagatedBuildInputs = with pkgs.python3Packages; [ cffi pysnmp ipython ];
          buildInputs = [ pkgs.libsmi ];
        };
        # nix run + nix shell
        apps.default = {
          type = "app";
          program = "${self.packages."${system}".default}/bin/snimpy";
        };
        # nix develop
        devShells.default = pkgs.mkShell {
          name = "snimpy-dev";
          buildInputs = [
            self.packages."${system}".default.inputDerivation
            pkgs.python3Packages.ipython
          ];
        };
      });
}
If you have Nix installed on your system:
  • nix run github:vincentbernat/snimpy runs Snimpy,
  • nix shell github:vincentbernat/snimpy provides a shell with Snimpy ready-to-use,
  • nix build github:vincentbernat/snimpy builds the Python package, tests included, and
  • nix develop . provides a shell to hack around Snimpy when run from a fresh checkout.9
For more information about Nix flakes, have a look at the tutorial from Tweag.

Nix and CDKTF At the root of the repository I use for CDKTF, there is a flake.nix file to set up a shell with Terraform and CDKTF installed and with the appropriate environment variables to automate my infrastructure. Terraform is already packaged in nixpkgs. However, I need to apply a patch on top of the Gandi provider. Not a problem with Nix!
terraform = pkgs.terraform.withPlugins (p: [
  p.aws
  p.hcloud
  p.vultr
  (p.gandi.overrideAttrs
    (old: {
      src = pkgs.fetchFromGitHub {
        owner = "vincentbernat";
        repo = "terraform-provider-gandi";
        rev = "feature/livedns-key";
        hash = "sha256-V16BIjo5/rloQ1xTQrdd0snoq1OPuDh3fQNW7kiv/kQ=";
      };
    }))
]);
CDKTF is written in TypeScript. I have a package.json file with all the dependencies needed, including the ones to use TypeScript as the language to define infrastructure:
{
  "name": "cdktf-take1",
  "version": "1.0.0",
  "main": "main.js",
  "types": "main.ts",
  "private": true,
  "dependencies": {
    "@types/node": "^14.18.30",
    "cdktf": "^0.13.3",
    "cdktf-cli": "^0.13.3",
    "constructs": "^10.1.151",
    "eslint": "^8.27.0",
    "prettier": "^2.7.1",
    "ts-node": "^10.9.1",
    "typescript": "^3.9.10",
    "typescript-language-server": "^2.1.0"
  }
}
I use Yarn to get a yarn.lock file that can be used directly to declare a derivation containing all the dependencies:
nodeEnv = pkgs.mkYarnModules {
  pname = "cdktf-take1-js-modules";
  version = "1.0.0";
  packageJSON = ./package.json;
  yarnLock = ./yarn.lock;
};
The next step is to generate the CDKTF providers from the Terraform providers and turn them into a derivation:
cdktfProviders = pkgs.stdenvNoCC.mkDerivation {
  name = "cdktf-providers";
  nativeBuildInputs = [
    pkgs.nodejs
    terraform
  ];
  src = nix-filter {
    root = ./.;
    include = [ ./cdktf.json ./tsconfig.json ];
  };
  buildPhase = ''
    export HOME=$(mktemp -d)
    export CHECKPOINT_DISABLE=1
    export DISABLE_VERSION_CHECK=1
    export PATH=${nodeEnv}/node_modules/.bin:$PATH
    ln -nsf ${nodeEnv}/node_modules node_modules
    # Build all providers we have in terraform
    for provider in $(cd ${terraform}/libexec/terraform-providers; echo */*/*/*); do
      version=''${provider##*/}
      provider=''${provider%/*}
      echo "Build $provider@$version"
      cdktf provider add --force-local $provider@$version | cat
    done
    echo "Compile TS → JS"
    tsc
  '';
  installPhase = ''
    mv .gen $out
    ln -nsf ${nodeEnv}/node_modules $out/node_modules
  '';
};
Finally, we can define the development environment:
devShells.default = pkgs.mkShell {
  name = "cdktf-take1";
  buildInputs = [
    pkgs.nodejs
    pkgs.yarn
    terraform
  ];
  shellHook = ''
    # No telemetry
    export CHECKPOINT_DISABLE=1
    # No autoinstall of plugins
    export CDKTF_DISABLE_PLUGIN_CACHE_ENV=1
    # Do not check version
    export DISABLE_VERSION_CHECK=1
    # Access to node modules
    export PATH=$PWD/node_modules/.bin:$PATH
    ln -nsf ${nodeEnv}/node_modules node_modules
    ln -nsf ${cdktfProviders} .gen
    # Credentials
    for p in \
      njf.nznmba.pbz/Nqzvavfgengbe \
      urgmare.pbz/ivaprag@oreang.pu \
      ihyge.pbz/ihyge@ivaprag.oreang.pu; do
        eval $(pass show $(echo $p | tr 'A-Za-z' 'N-ZA-Mn-za-m') | grep '^export')
    done
    eval $(pass show personal/cdktf/secrets | grep '^export')
    export TF_VAR_hcloudToken="$HCLOUD_TOKEN"
    export TF_VAR_vultrApiKey="$VULTR_API_KEY"
    unset VULTR_API_KEY HCLOUD_TOKEN
  '';
};
The derivations listed in buildInputs are available in the provided shell. The content of shellHook is sourced when starting the shell. It sets up some symbolic links to make the JavaScript environment built at an earlier step available, as well as the generated CDKTF providers. It also exports all the credentials.10 I am also using direnv with an .envrc to automatically load the development environment. This also enables the environment to be available from inside Emacs, notably when using lsp-mode to get TypeScript completions. Without direnv, nix develop . can activate the environment. I use the following commands to deploy the infrastructure:11
$ cdktf synth
$ cd cdktf.out/stacks/cdktf-take1
$ terraform plan --out plan
$ terraform apply plan
$ terraform output -json > ~-automation/nixops-take1/cdktf.json
The last command generates a JSON file containing various data to complete the deployment with NixOps.

NixOps The JSON file exported by Terraform contains the list of servers with various attributes:
{
  "hardware": "hetzner",
  "ipv4Address": "5.161.44.145",
  "ipv6Address": "2a01:4ff:f0:b91::1",
  "name": "web05.luffy.cx",
  "tags": [
    "web",
    "continent:NA",
    "continent:SA"
  ]
}
In network.nix, this list is imported and transformed into an attribute set describing the servers. A simplified version looks like this:
let
  lib = inputs.nixpkgs.lib;
  shortName = name: builtins.elemAt (lib.splitString "." name) 0;
  domainName = name: lib.concatStringsSep "." (builtins.tail (lib.splitString "." name));
  server = hardware: name: imports: {
    networking = {
      hostName = shortName name;
      domain = domainName name;
    };
    deployment.targetHost = name;
    imports = [ (./hardware/. + "/${hardware}.nix") ] ++ imports;
  };
  cdktf-servers-json = (lib.importJSON ./cdktf.json).servers.value;
  cdktf-servers = map
    (s:
      let
        tags-maybe-import = map (t: ./. + "/${t}.nix") s.tags;
        tags-import = builtins.filter (t: builtins.pathExists t) tags-maybe-import;
      in
      {
        name = shortName s.name;
        value = server s.hardware s.name tags-import;
      })
    cdktf-servers-json;
in
{ }
// [ ]
// builtins.listToAttrs cdktf-servers
For web05, this expands to:
web05 = {
  networking = {
    hostName = "web05";
    domain = "luffy.cx";
  };
  deployment.targetHost = "web05.luffy.cx";
  imports = [ ./hardware/hetzner.nix ./web.nix ];
};
As for CDKTF, at the root of the repository I use for NixOps, there is a flake.nix file to set up a shell with NixOps configured. Because NixOps does not support rollouts, I usually use the following commands to deploy on a single server:12
$ nix flake update
$ nixops deploy --include=web04
$ ./tests web04.luffy.cx
If the tests are OK, I deploy the remaining nodes gradually with the following command:
$ (set -e; for h in web{03..06}; do nixops deploy --include=$h; done)
nixops deploy rolls out all servers in parallel and therefore could cause a short outage where all Nginx are down at the same time.
This post has been a work-in-progress for the past three years, with the content being updated and refined as I experimented with different solutions. There is still much to explore13 but I feel there is enough content to publish now.

  1. It was an AMD Athlon 64 X2 5600+ with 2 GB of RAM and two 400 GB disks with software RAID. I was paying around €59 per month for it. While it was a good deal in 2008, by 2018 it was no longer cost-effective. It was running on Debian Wheezy with Linux-VServer for isolation, both of which were outdated in 2018.
  2. I also did not use Python because Poetry support in Nix was a bit broken around the time I started hacking around CDKTF.
  3. Pulumi can apply arbitrary functions with the apply() method on an output. It makes it easy to transform data that are not known during the planning stage. Terraform has functions to serve a similar purpose, but they are more limited.
  4. The two mentioned pull requests are not merged yet. The second one is superseded by PR #61, submitted two months later, which enforces the use of /bin/bash. I also submitted PR #56, which was merged 4 months later and quickly reverted without an explanation.
  5. You may consider packages and derivations to be synonyms in the Nix ecosystem.
  6. OpenSSL 3 has outstanding performance regressions.
  7. NixOS can be a bit slow to integrate patches since they need to rebuild parts of the binary cache before releasing the fixes. In this specific case, they were fast: the vulnerability and patches were released on August 13th 2019 and available in NixOS on August 15th. As a comparison, Debian only released the fixed version on August 22nd, which is unusually late.
  8. Because flakes are experimental, many documentations do not use them and it is an additional aspect to learn.
  9. It is possible to replace . with github:vincentbernat/snimpy, like in the other commands, but having Snimpy dependencies without Snimpy source code is less interesting.
  10. I am using pass as a password manager. The password names are only obfuscated to avoid spam.
  11. The cdktf command can wrap the terraform commands, but I prefer to use them directly as they are more flexible.
  12. If the change is risky, I disable the server with CDKTF. This removes it from the web service DNS records.
  13. I would like to replace NixOps with an alternative handling progressive rollouts and checks. I am also considering switching to Nomad or Kubernetes to deploy workloads.

12 December 2022

Jonathan McDowell: Setting up FreshRSS in a subdirectory

Ever since the demise of Google Reader I have been looking for a suitable replacement RSS reader. In the past I used to use Liferea, but that was when I used a single desktop machine; these days I want to be able to read on my phone and multiple machines. I moved to Feedly and it's been mostly ok, but I'm hitting the limit of feeds available in the free tier, and $72/year is a bit more than I can justify to myself. Especially when I have machines already available to me where I could self host something. The problem, of course, is what to host. It seems the best options are all written in PHP, so I had to get over my adverse knee-jerk reaction to that. I ended up on FreshRSS but if it hadn't worked out I'd have tried TinyTinyRSS. Of course I'm hosting on Debian, and the machine I chose to use was already running nginx and PostgreSQL. So I needed to install PHP:
$ sudo apt install php7.4-fpm php-curl php-gmp php-intl php-mbstring \
	php-pgsql php-xml php-zip
I put my FreshRSS install in /srv/freshrss so I grabbed the 1.20.2 release from GitHub (actually 1.20.1 at the time, but I've upgraded to the latest since) and untarred it in there. I gave www-data access to the data directory (sudo chown -R www-data /srv/freshrss/data) (yes, yes, I could have created a new user specifically for FreshRSS, but I've chosen not to for now). There's no actual need to configure things up on the filesystem, you can do the initial setup from the web interface. Which is where the trouble came. I've been an Apache user since 1998 and as a result it's what I know and what I go to. nginx is new to me. And I wanted my FreshRSS instance to live in a subdirectory of an existing TLS-enabled host, rather than have its own hostname. Now, at least FreshRSS copes with this (unlike far too many other projects), you just have to configure your webserver correctly. Which took me more experimentation than I'd like, but I've ended up with the following snippet:
    # PHP files handling
    location ~ ^/freshrss/.+?\.php(/.*)?$ {
        root /srv/freshrss/p;
        fastcgi_pass unix:/run/php/php-fpm.sock;
        fastcgi_split_path_info ^/freshrss(/.+\.php)(/.*)?$;
        set $path_info $fastcgi_path_info;
        fastcgi_param PATH_INFO $path_info;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }
    location ~ ^/freshrss(/.*)?$ {
        root /srv/freshrss/p;
        try_files $1 /freshrss$1/index.php$is_args$args;
    }
Other than the addition of the freshrss prefix this ends up differing slightly from the FreshRSS webserver configuration example. I ended up having to make the path info on the fastcgi_split_path_info optional, and my try_files in the bare directory location directive needed $is_args$args added, or I just ended up in a redirect loop because the session parameters didn't get passed through. I'm sure there's a better way to do it, but I did a bunch of searching and this is how I ended up making it work. Before firing up the web configuration I created a suitable database:
$ sudo -Hu postgres psql
psql (13.8 (Debian 13.8-0+deb11u1))
Type "help" for help.
postgres=# create database freshrss;
CREATE DATABASE
postgres=# create user freshrss with encrypted password 'hunter2';
CREATE ROLE
postgres=# grant all privileges on database freshrss to freshrss;
GRANT
postgres=# \q
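Depending on your pg_hba.conf it may be worth confirming that the new credentials actually work over TCP before pointing FreshRSS at them; a quick check is something like:
$ psql "postgresql://freshrss:hunter2@localhost/freshrss" -c '\conninfo'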
I ran through the local configuration, creating myself a user and adding some feeds, then created a cronjob to fetch updates hourly and keep a log:
# mkdir /var/log/freshrss
# chown :www-data /var/log/freshrss
# chmod 775 /var/log/freshrss
# cat > /etc/cron.d/freshrss-refresh <<'EOF'
33 * * * * www-data /srv/freshrss/app/actualize_script.php > /var/log/freshrss/update-$(date --iso-8601=minutes).log 2>&1
EOF
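Since that writes a new log file every hour they will pile up, so I'd pair it with something to prune old ones; a minimal sketch (the 30 day retention is just an example):
# cat > /etc/cron.d/freshrss-logclean <<EOF
17 3 * * * www-data find /var/log/freshrss -name 'update-*.log' -mtime +30 -delete
EOF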
Experiences so far? Reasonably happy. The interface seems snappy enough, and works well both on mobile and desktop. I'm only running a single user instance at present, but am considering opening it up to some other folk and will see how that scales. And it clearly indicated a number of my feeds that were broken, so I've cleaned up some that are still around and deleted the missing ones. Now I just need to figure out what else I should be subscribed to that I've been putting off due to the Feedly limit!

Vasudev Kamath: Installing Debian from GRML Live CD

I had bought a Thinkpad E470 laptop back in 2018 which was lying unused for quite some time. Recently when I wanted to use it, I found that the keyboard was not working, especially some keys, and after some time the laptop would hang at the Lenovo boot screen. I came back to Bangalore almost after 2 years from my hometown (WFH due to Covid) and thought it was the right time to get my laptop back to a normal working state. After getting the keyboard replaced I noticed that the 1TB HDD is no longer fast enough for my taste! I have to admit I never thought I would start disliking HDDs so quickly, thanks to modern SSD-based work laptops. So as a second upgrade I got the HDD removed from my laptop and got a 240G SSD. Yeah, I know it's a reduction from my original size, but I intend to continue using my old HDD via a USB SATA enclosure as an external HDD which can house the extra data I need to save. So now that I have an SSD I need to install Debian Unstable again on it, and this is where I tried something new. My colleague (name redacted on request) suggested that I use the GRML live CD and install Debian via debootstrap. After giving it a thought I decided to try this out. Some reasons for going ahead with this are listed below:
  1. Debian Installer does not support a proper BTRFS based root file system. It just allows btrfs as root but no subvolume support. Also I'm not sure about the luks support with btrfs as root.
  2. I also wanted to give a try to systemd-boot as my laptop is UEFI capable and I've slowly started disliking Grub.
  3. I really hate installing task-kde-desktop (yeah, you read that right, I've been a KDE user for quite some time now) which pulls in tons of unwanted stuff and bloat. Well, it's not just task-kde-desktop; any other desktop task package does something similar, and I don't want too much unused stuff and too many services running.
Disk Preparation As a first step I went to the GRML website and downloaded the current pre-release. Frankly, I was using GRML for the first time and I was not sure what to expect. When I booted it up I was a bit taken aback to see it is console-based, and I did not have a wired LAN, just a plain wireless dongle (a Jiofi device), and was wondering what it would take to connect. But surprisingly the curses-based UI was pretty straightforward and allowed me to connect to the Wifi AP. Another plus was that the rescue CD had non-free firmware, as the laptop uses an ath10k device and needs non-free blobs to operate. Once I got a shell prompt in the rescue CD the first thing I did was reconfigure console-setup to increase the font size, which was very small on default boot. Once that was done I did the following to create a 1G (FAT32) partition for EFI.
parted -a optimal -s /dev/sda mklabel gpt
parted -a optimal -s /dev/sda mkpart primary vfat 0% 1G
parted -a optimal -s /dev/sda set 1 esp on
mkfs.vfat -n boot_disk -F 32 /dev/sda1
So here is what I did: created a 1G vfat type partition and set the esp flag on it. This will be mounted to /boot/efi for systemd-boot. Next I created a single partition on the rest of the available free disk, which will be used as the root file system. Then I encrypted the root partition using LUKS and created the BTRFS file system on top of it.
cryptsetup luksFormat /dev/sda2
cryptsetup luksOpen /dev/sda2 ENC
mkfs.btrfs -L root_disk /dev/mapper/ENC
Next is to create subvolumes in BTRFS. I followed my colleague's suggestion and created a top-level @ subvolume, below which I created @/home, @/var/log and @/opt. I also enabled compression with zstd at level 1 to avoid battery drain. Finally I marked @ as the default subvolume to avoid having to add it to the fstab entry.
mount -o compress=zstd:1 /dev/mapper/ENC /mnt
btrfs subvol create /mnt/@
cd /mnt/@
btrfs subvol create ./home
btrfs subvol create ./opt
mkdir -p var
btrfs subvol create ./var/log
btrfs subvol set-default /mnt/@
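To double-check the layout before moving on, listing the subvolumes and the default subvolume should show @, @/home, @/opt and @/var/log, for example:
btrfs subvolume list /mnt
btrfs subvolume get-default /mnt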
Bootstrapping Debian Now that the root disk is prepared, the next step is to bootstrap the root file system. I used debootstrap for this job. One thing I missed here compared to the installer was the ability to preseed. I tried looking around to figure out if we can preseed debootstrap but did not find much. If you know of a procedure, do point me to it.
cd /mnt/
debootstrap --include=dbus,locales,tzdata unstable @/ http://deb.debian.org/debian
Well, this just gets a bare minimal installation of Debian; I need to install the rest of the things after this step manually by chrooting into the target folder @/. I like the grml-chroot command for this purpose: it does most of the job of mounting all the required directories like /dev, /proc, /sys etc. But before entering the chroot I need to mount the ESP partition we created to /boot/efi so that I can finalize the installation of the kernel and systemd-boot.
umount /mnt
mount -o compress=zstd:1 /dev/mapper/ENC /mnt
mkdir -p /mnt/boot/efi
mount /dev/sda1 /mnt/boot/efi
grml-chroot /mnt /bin/bash
I remounted the root subvolume @ directly to /mnt now (remember, I made @ the default subvolume before). I also mounted the ESP partition with the FAT32 file system to /boot/efi. Finally I used grml-chroot to get into a chroot of the newly bootstrapped file system. Now I will install the kernel and a minimal KDE desktop, and configure locales and time zone data for the new system. I wanted to use dracut instead of the default initramfs-tools for the initrd. I also need to install cryptsetup and btrfs-progs so I can decrypt and actually boot into my new system.
apt-get update
apt-get install linux-image-amd64 dracut openssh-client \
                        kde-plasma-desktop plasma-workspace-wayland \
                        plasma-nm cryptsetup btrfs-progs sudo
Next is setting up the crypttab and fstab entries for the new system. The following entry is added to fstab:
LABEL="root_disk" / btrfs defaults,compress=zstd:1 0 0
And the crypttab entry
ENCRYPTED_ROOT UUID=xxxx none discard,x-initrd.attach
I've not written the actual UUID above; this is just to show the format of /etc/crypttab. Once these entries are added we need to recreate the initrd. I just reconfigured the installed kernel package to retrigger the recreation of the initrd using dracut. Reconfiguration of locales is done by editing /etc/locale.gen to uncomment en_US.UTF-8, and the time zone by writing Asia/Kolkata to /etc/timezone. I used DEBIAN_FRONTEND=noninteractive to avoid another prompt asking for locale and timezone information.
export DEBIAN_FRONTEND=noninteractive
dpkg-reconfigure locales
dpkg-reconfigure tzdata
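For reference, the file edits described above, plus fetching the UUID needed for the crypttab entry, boil down to something like this (the sed pattern is just an example):
blkid -s UUID -o value /dev/sda2    # UUID for the /etc/crypttab entry
sed -i 's/^# *en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen
echo "Asia/Kolkata" > /etc/timezone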
Added my user with the adduser command and set the root password as well. I added my user to the sudo group so I can use sudo to elevate privileges.
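In terms of commands that is roughly (the username is a placeholder):
adduser myuser             # interactive, sets the password
usermod -aG sudo myuser
passwd root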
Setting up systemd-boot So now a basic usable system is ready; the last part is enabling the systemd-boot configuration, as I'm not going to use grub. I did the following to install systemd-boot. Frankly, I'm no expert on this; it was my colleague's suggestion. Before installing systemd-boot I had to set up the kernel command line. This can be done by writing it to /etc/kernel/cmdline with the following contents:
systemd.gpt_auto=no quiet root=LABEL=root_disk
I'm disabling the systemd GPT auto-generator to avoid a race condition between the crypttab entry and the entry auto-generated by systemd. I ran into this mainly because of my own stupidity of initially not adding the root=LABEL=root_disk entry.
apt-get install -y systemd-boot
bootctl install --make-entry-directory=yes --entry-token=machine-id
dpkg-reconfigure linux-image-6.0.0-5-amd64
Finally, exit from the chroot and reboot into the freshly installed system. systemd-boot already ships a hook file zz-systemd-boot under /etc/kernel, so it's pretty much usable without any manual intervention. Previously, after a kernel installation we had to manually update the kernel image in the EFI partition using bootctl.
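Once booted into the new system, the installed loader and its entries can be inspected with, for example:
bootctl status
bootctl list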
Conclusion Though installing from a live image is not new, and debian-installer also does the same, the difference is more control over the installation and being able to do things which the installer does not let you do (or should I say are not part of the default installation?). If properly automated using scripts, we can leverage this to do custom installations in large scale environments. I know there is FAI, but I've not explored it and felt there is too much to set up for a simple installation with specific requirements. So finally I have a Debian system which differs from the default Debian installation :-). I should thank my colleague for rekindling the nerd inside me who had stopped experimenting quite a long time back.

16 November 2022

Antoine Beaupr : A ZFS migration

In my tubman setup, I started using ZFS on an old server I had lying around. The machine is really old though (2011!) and it "feels" pretty slow. I want to see how much of that is ZFS and how much is the machine. Synthetic benchmarks show that ZFS may be slower than mdadm in RAID-10 or RAID-6 configuration, so I want to confirm that on a live workload: my workstation. Plus, I want easy, regular, high performance backups (with send/receive snapshots) and there's no way I'm going to use BTRFS because I find it too confusing and unreliable. So off we go.

Installation Since this is a conversion (and not a new install), our procedure is slightly different than the official documentation but otherwise it's pretty much in the same spirit: we're going to use ZFS for everything, including the root filesystem. So, install the required packages, on the current system:
apt install --yes gdisk zfs-dkms zfs zfs-initramfs zfsutils-linux
We also tell DKMS that we need to rebuild the initrd when upgrading:
echo REMAKE_INITRD=yes > /etc/dkms/zfs.conf

Partitioning This is going to partition /dev/sdc with:
  • 1MB MBR / BIOS legacy boot
  • 512MB EFI boot
  • 1GB bpool, unencrypted pool for /boot
  • rest of the disk for zpool, the rest of the data
     sgdisk --zap-all /dev/sdc
     sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/sdc
     sgdisk     -n2:1M:+512M   -t2:EF00 /dev/sdc
     sgdisk     -n3:0:+1G      -t3:BF01 /dev/sdc
     sgdisk     -n4:0:0        -t4:BF00 /dev/sdc
    
That will look something like this:
    root@curie:/home/anarcat# sgdisk -p /dev/sdc
    Disk /dev/sdc: 1953525168 sectors, 931.5 GiB
    Model: ESD-S1C         
    Sector size (logical/physical): 512/512 bytes
    Disk identifier (GUID): [REDACTED]
    Partition table holds up to 128 entries
    Main partition table begins at sector 2 and ends at sector 33
    First usable sector is 34, last usable sector is 1953525134
    Partitions will be aligned on 16-sector boundaries
    Total free space is 14 sectors (7.0 KiB)
    Number  Start (sector)    End (sector)  Size       Code  Name
       1              48            2047   1000.0 KiB  EF02  
       2            2048         1050623   512.0 MiB   EF00  
       3         1050624         3147775   1024.0 MiB  BF01  
       4         3147776      1953525134   930.0 GiB   BF00
Unfortunately, we can't be sure of the sector size here, because the USB controller is probably lying to us about it. Normally, this smartctl command should tell us the sector size as well:
root@curie:~# smartctl -i /dev/sdb -qnoserial
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-14-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black Mobile
Device Model:     WDC WD10JPLX-00MBPT0
Firmware Version: 01.01H01
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 17 13:33:04 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Above is the example of the builtin HDD drive. But the SSD device enclosed in that USB controller doesn't support SMART commands, so we can't trust that it really has 512-byte sectors. This matters because we need to tweak the ashift value correctly. We're going to go ahead and assume the SSD drive has the common 4KB sector size, which means ashift=12. Note here that we are not creating a separate partition for swap. Swap on ZFS volumes (AKA "swap on ZVOL") can trigger lockups and that issue is still not fixed upstream. Ubuntu recommends using a separate partition for swap instead. But since this is "just" a workstation, we're betting that we will not suffer from this problem, after hearing a report from another Debian developer running this setup on their workstation successfully. We do not recommend this setup though. In fact, if I were to redo this partition scheme, I would probably use LUKS encryption and set up a dedicated swap partition, as I had problems with ZFS encryption as well.
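For what it's worth, the kernel's view of the logical and physical sector sizes can also be queried directly, although the USB bridge can lie here just the same:
blockdev --getss --getpbsz /dev/sdc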

Creating pools ZFS pools are somewhat like "volume groups" if you are familiar with LVM, except they obviously also do things like RAID-10. (Even though LVM can technically also do RAID, people typically use mdadm instead.) In any case, the guide suggests creating two different pools here: one, in cleartext, for boot, and a separate, encrypted one, for the rest. Technically, the boot partition is required because the Grub bootloader only supports readonly ZFS pools, from what I understand. But I'm a little out of my depth here and just following the guide.

Boot pool creation This creates the boot pool in readonly mode with features that grub supports:
    zpool create \
        -o cachefile=/etc/zfs/zpool.cache \
        -o ashift=12 -d \
        -o feature@async_destroy=enabled \
        -o feature@bookmarks=enabled \
        -o feature@embedded_data=enabled \
        -o feature@empty_bpobj=enabled \
        -o feature@enabled_txg=enabled \
        -o feature@extensible_dataset=enabled \
        -o feature@filesystem_limits=enabled \
        -o feature@hole_birth=enabled \
        -o feature@large_blocks=enabled \
        -o feature@lz4_compress=enabled \
        -o feature@spacemap_histogram=enabled \
        -o feature@zpool_checkpoint=enabled \
        -O acltype=posixacl -O canmount=off \
        -O compression=lz4 \
        -O devices=off -O normalization=formD -O relatime=on -O xattr=sa \
        -O mountpoint=/boot -R /mnt \
        bpool /dev/sdc3
I haven't investigated all those settings and just trust the upstream guide on the above.

Main pool creation This is a more typical pool creation.
    zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=/ -R /mnt \
        rpool /dev/sdc4
Breaking this down:
  • -o ashift=12: mentioned above, 4k sector size
  • -O encryption=on -O keylocation=prompt -O keyformat=passphrase: encryption, prompt for a password, default algorithm is aes-256-gcm, explicit in the guide, made implicit here
  • -O acltype=posixacl -O xattr=sa: enable ACLs, with better performance (not enabled by default)
  • -O dnodesize=auto: related to extended attributes, less compatibility with other implementations
  • -O compression=zstd: enable zstd compression, can be disabled/enabled per dataset with zfs set compression=off rpool/example
  • -O relatime=on: classic atime optimisation, another that could be used on a busy server is atime=off
  • -O canmount=off: do not make the pool mount automatically with mount -a?
  • -O mountpoint=/ -R /mnt: mount pool on / in the future, but /mnt for now
Those settings are all available in zfsprops(8). Other flags are defined in zpool-create(8). The reasoning behind them is also explained in the upstream guide and some also in the Debian wiki (https://wiki.debian.org/ZFS#Advanced_Topics). Those flags were actually not used:
  • -O normalization=formD: normalize file names on comparisons (not storage), implies utf8only=on, which is a bad idea (and effectively meant my first sync failed to copy some files, including this folder from a supysonic checkout). And this cannot be changed after the filesystem is created. Bad, bad, bad.
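To confirm which properties actually ended up on the pool after creation, zfs get works, for example:
zfs get -o property,value compression,encryption,acltype,xattr,relatime rpool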

Side note about single-disk pools Also note that we're living dangerously here: single-disk ZFS pools are rumoured to be more dangerous than not running ZFS at all. The choice quote from this article is:
[...] any error can be detected, but cannot be corrected. This sounds like an acceptable compromise, but it's actually not. The reason it's not is that ZFS' metadata cannot be allowed to be corrupted. If it is, it is likely the zpool will be impossible to mount (and will probably crash the system once the corruption is found). So a couple of bad sectors in the right place will mean that all data on the zpool will be lost. Not some, all. Also there's no ZFS recovery tools, so you cannot recover any data on the drives.
Compared with (say) ext4, where a single disk error can be recovered, this is pretty bad. But we are ready to live with this, with the idea that we'll have hourly offline snapshots that we can easily recover from. It's a trade-off. Also, we're running this on an NVMe/M.2 drive, which typically just blinks out of existence completely and doesn't "bit rot" the way an HDD would. Also, the FreeBSD handbook quick start doesn't have any warnings about their first example, which is with a single disk. So I am reassured, at least.

Creating mount points Next we create the actual filesystems, known as "datasets" which are the things that get mounted on mountpoint and hold the actual files.
  • this creates two containers, for ROOT and BOOT
     zfs create -o canmount=off -o mountpoint=none rpool/ROOT &&
     zfs create -o canmount=off -o mountpoint=none bpool/BOOT
    
    Note that it's unclear to me why those datasets are necessary, but they seem common practice, also used in this FreeBSD example. The OpenZFS guide mentions the Solaris upgrades and Ubuntu's zsys that use that container for upgrades and rollbacks. This blog post seems to explain a bit the layout behind the installer.
  • this creates the actual boot and root filesystems:
     zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian &&
     zfs mount rpool/ROOT/debian &&
     zfs create -o mountpoint=/boot bpool/BOOT/debian
    
    I guess the debian name here is because we could technically have multiple operating systems with the same underlying datasets.
  • then the main datasets:
     zfs create                                 rpool/home &&
     zfs create -o mountpoint=/root             rpool/home/root &&
     chmod 700 /mnt/root &&
     zfs create                                 rpool/var
    
  • exclude temporary files from snapshots:
     zfs create -o com.sun:auto-snapshot=false  rpool/var/cache &&
     zfs create -o com.sun:auto-snapshot=false  rpool/var/tmp &&
     chmod 1777 /mnt/var/tmp
    
  • and skip automatic snapshots in Docker:
     zfs create -o canmount=off                 rpool/var/lib &&
     zfs create -o com.sun:auto-snapshot=false  rpool/var/lib/docker
    
    Notice here a peculiarity: we must create rpool/var/lib to create rpool/var/lib/docker otherwise we get this error:
     cannot create 'rpool/var/lib/docker': parent does not exist
    
    ... and no, just creating /mnt/var/lib doesn't fix that problem. In fact, it makes things even more confusing because an existing directory shadows a mountpoint, which is the opposite of how things normally work. Also note that you will probably need to change storage driver in Docker, see the zfs-driver documentation for details but, basically, I did:
    echo '{ "storage-driver": "zfs" }' > /etc/docker/daemon.json
    
    Note that podman has the same problem (and similar solution):
    printf '[storage]\ndriver = "zfs"\n' > /etc/containers/storage.conf
    
  • make a tmpfs for /run:
     mkdir /mnt/run &&
     mount -t tmpfs tmpfs /mnt/run &&
     mkdir /mnt/run/lock
    
We don't create a /srv, as that's the HDD stuff. Also mount the EFI partition:
mkfs.fat -F 32 /dev/sdc2 &&
mount /dev/sdc2 /mnt/boot/efi/
At this point, everything should be mounted in /mnt. It should look like this:
root@curie:~# LANG=C df -h -t zfs -t vfat
Filesystem            Size  Used Avail Use% Mounted on
rpool/ROOT/debian     899G  384K  899G   1% /mnt
bpool/BOOT/debian     832M  123M  709M  15% /mnt/boot
rpool/home            899G  256K  899G   1% /mnt/home
rpool/home/root       899G  256K  899G   1% /mnt/root
rpool/var             899G  384K  899G   1% /mnt/var
rpool/var/cache       899G  256K  899G   1% /mnt/var/cache
rpool/var/tmp         899G  256K  899G   1% /mnt/var/tmp
rpool/var/lib/docker  899G  256K  899G   1% /mnt/var/lib/docker
/dev/sdc2             511M  4.0K  511M   1% /mnt/boot/efi
Now that we have everything setup and mounted, let's copy all files over.

Copying files This is a list of all the mounted filesystems
for fs in /boot/ /boot/efi/ / /home/; do
    echo "syncing $fs to /mnt$fs..." && 
    rsync -aSHAXx --info=progress2 --delete $fs /mnt$fs
done
You can check that the list is correct with:
mount -l -t ext4,btrfs,vfat | awk '{print $3}'
Note that we skip /srv as it's on a different disk. On the first run, we had:
root@curie:~# for fs in /boot/ /boot/efi/ / /home/; do
        echo "syncing $fs to /mnt$fs..." && 
        rsync -aSHAXx --info=progress2 $fs /mnt$fs
    done
syncing /boot/ to /mnt/boot/...
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/299)  
syncing /boot/efi/ to /mnt/boot/efi/...
     16,831,437 100%  184.14MB/s    0:00:00 (xfr#101, to-chk=0/110)
syncing / to /mnt/...
 28,019,293,280  94%   47.63MB/s    0:09:21 (xfr#703710, ir-chk=6748/839220)rsync: [generator] delete_file: rmdir(var/lib/docker) failed: Device or resource busy (16)
could not make way for new symlink: var/lib/docker
 34,081,267,990  98%   50.71MB/s    0:10:40 (xfr#736577, to-chk=0/867732)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
syncing /home/ to /mnt/home/...
rsync: [sender] readlink_stat("/home/anarcat/.fuse") failed: Permission denied (13)
 24,456,268,098  98%   68.03MB/s    0:05:42 (xfr#159867, ir-chk=6875/172377) 
file has vanished: "/home/anarcat/.cache/mozilla/firefox/s2hwvqbu.quantum/cache2/entries/B3AB0CDA9C4454B3C1197E5A22669DF8EE849D90"
199,762,528,125  93%   74.82MB/s    0:42:26 (xfr#1437846, ir-chk=1018/1983979)rsync: [generator] recv_generator: mkdir "/mnt/home/anarcat/dist/supysonic/tests/assets/\#346" failed: Invalid or incomplete multibyte or wide character (84)
*** Skipping any contents from this failed directory ***
315,384,723,978  96%   76.82MB/s    1:05:15 (xfr#2256473, to-chk=0/2993950)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
Note the failure to transfer that supysonic file? It turns out they had a weird filename in their source tree, since then removed, but still it showed how the utf8only feature might not be such a bad idea. At this point, the procedure was restarted all the way back to "Creating pools", after unmounting all ZFS filesystems (umount /mnt/run /mnt/boot/efi && umount -t zfs -a) and destroying the pool, which, surprisingly, doesn't require any confirmation (zpool destroy rpool). The second run was cleaner:
root@curie:~# for fs in /boot/ /boot/efi/ / /home/; do
        echo "syncing $fs to /mnt$fs..." && 
        rsync -aSHAXx --info=progress2 --delete $fs /mnt$fs
    done
syncing /boot/ to /mnt/boot/...
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/299)  
syncing /boot/efi/ to /mnt/boot/efi/...
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/110)  
syncing / to /mnt/...
 28,019,033,070  97%   42.03MB/s    0:10:35 (xfr#703671, ir-chk=1093/833515)rsync: [generator] delete_file: rmdir(var/lib/docker) failed: Device or resource busy (16)
could not make way for new symlink: var/lib/docker
 34,081,807,102  98%   44.84MB/s    0:12:04 (xfr#736580, to-chk=0/867723)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
syncing /home/ to /mnt/home/...
rsync: [sender] readlink_stat("/home/anarcat/.fuse") failed: Permission denied (13)
IO error encountered -- skipping file deletion
 24,043,086,450  96%   62.03MB/s    0:06:09 (xfr#151819, ir-chk=15117/172571)
file has vanished: "/home/anarcat/.cache/mozilla/firefox/s2hwvqbu.quantum/cache2/entries/4C1FDBFEA976FF924D062FB990B24B897A77B84B"
315,423,626,507  96%   67.09MB/s    1:14:43 (xfr#2256845, to-chk=0/2994364)    
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1333) [sender=3.2.3]
Also note the transfer speed: we seem capped at 76MB/s, or 608Mbit/s. This is not as fast as I was expecting: the USB connection seems to be at around 5Gbps:
anarcat@curie:~$ lsusb -tv | head -4
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    ID 1d6b:0003 Linux Foundation 3.0 root hub
     __ Port 1: Dev 4, If 0, Class=Mass Storage, Driver=uas, 5000M
        ID 0b05:1932 ASUSTek Computer, Inc.
So it shouldn't cap at that speed. It's possible the USB adapter is failing to give me the full speed though. It's not the M.2 SSD drive either, as that has a ~500MB/s bandwidth, according to its spec. At this point, we're about ready to do the final configuration. We drop to single user mode and do the rest of the procedure. That used to be shutdown now, but it seems like the systemd switch broke that, so now you can reboot into grub and pick the "recovery" option. Alternatively, you might try systemctl rescue, as I found out. I also wanted to copy the drive over to another new NVMe drive, but that failed: it looks like the USB controller I have doesn't work with older, non-NVMe drives.

Boot configuration Now we need to enter the new system to rebuild the boot loader and initrd and so on. First, we bind mounts and chroot into the ZFS disk:
mount --rbind /dev  /mnt/dev &&
mount --rbind /proc /mnt/proc &&
mount --rbind /sys  /mnt/sys &&
chroot /mnt /bin/bash
Next we add an extra service that imports the bpool on boot, to make sure it survives a zpool.cache destruction:
cat > /etc/systemd/system/zfs-import-bpool.service <<EOF
[Unit]
DefaultDependencies=no
Before=zfs-import-scan.service
Before=zfs-import-cache.service
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/zpool import -N -o cachefile=none bpool
# Work-around to preserve zpool cache:
ExecStartPre=-/bin/mv /etc/zfs/zpool.cache /etc/zfs/preboot_zpool.cache
ExecStartPost=-/bin/mv /etc/zfs/preboot_zpool.cache /etc/zfs/zpool.cache
[Install]
WantedBy=zfs-import.target
EOF
Enable the service:
systemctl enable zfs-import-bpool.service
I had to trim down /etc/fstab and /etc/crypttab to only contain references to the legacy filesystems (/srv is still BTRFS!). If we don't already have a tmpfs defined in /etc/fstab:
ln -s /usr/share/systemd/tmp.mount /etc/systemd/system/ &&
systemctl enable tmp.mount
Rebuild boot loader with support for ZFS, but also to workaround GRUB's missing zpool-features support:
grub-probe /boot | grep -q zfs &&
update-initramfs -c -k all &&
sed -i 's,GRUB_CMDLINE_LINUX.*,GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian",' /etc/default/grub &&
update-grub
For good measure, make sure the right disk is configured here, for example you might want to tag both drives in a RAID array:
dpkg-reconfigure grub-pc
Install grub to EFI while you're there:
grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian --recheck --no-floppy
Filesystem mount ordering. The rationale here in the OpenZFS guide is a little strange, but I don't dare ignore that.
mkdir /etc/zfs/zfs-list.cache
touch /etc/zfs/zfs-list.cache/bpool
touch /etc/zfs/zfs-list.cache/rpool
zed -F &
Verify that zed updated the cache by making sure these are not empty:
cat /etc/zfs/zfs-list.cache/bpool
cat /etc/zfs/zfs-list.cache/rpool
Once the files have data, stop zed:
fg
Press Ctrl-C.
Fix the paths to eliminate /mnt:
sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/*
Snapshot initial install:
zfs snapshot bpool/BOOT/debian@install
zfs snapshot rpool/ROOT/debian@install
Exit chroot:
exit

Finalizing One last sync was done in rescue mode:
for fs in /boot/ /boot/efi/ / /home/; do
    echo "syncing $fs to /mnt$fs..." && 
    rsync -aSHAXx --info=progress2 --delete $fs /mnt$fs
done
Then we unmount all filesystems:
mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {}
zpool export -a
Reboot, swap the drives, and boot in ZFS. Hurray!

Benchmarks This is a test that was run in single-user mode using fio and the Ars Technica recommended tests, which are:
  • Single 4KiB random write process:
     fio --name=randwrite4k1x --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
    
  • 16 parallel 64KiB random write processes:
     fio --name=randwrite64k16x --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
    
  • Single 1MiB random write process:
     fio --name=randwrite1m1x --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
    
Strangely, that's not exactly what the author, Jim Salter, did in his actual test bench used in the ZFS benchmarking article. The first thing is there's no read test at all, which is already pretty strange. But also it doesn't include stuff like dropping caches or repeating results. So here's my variation, which I called fio-ars-bench.sh for now. It just batches a bunch of fio tests, one by one, 60 seconds each. It should take about 12 minutes to run, as there are 3 pairs of tests, read/write, with and without async. My bias, before building, running and analysing those results, is that ZFS should outperform the traditional stack on writes, but possibly not on reads. It's also possible it outperforms it on both, because it's a newer drive. A new test might be possible with a new external USB drive as well, although I doubt I will find the time to do this.
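Dropping the page cache between runs is just the usual incantation, something like:
sync && echo 3 > /proc/sys/vm/drop_caches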

Results All tests were done on WD Blue SN550 drives, which claim to be able to push 2400MB/s read and 1750MB/s write. An extra drive was bought to move the LVM setup from a WDC WDS500G1B0B-00AS40 SSD, a WD Blue M.2 2280 SSD that was at least 5 years old, spec'd at 560MB/s read, 530MB/s write. Benchmarks were done on the M.2 SSD drive but discarded so that the drive difference is not a factor in the test. In practice, I'm going to assume we'll never reach those numbers because we're not actually NVMe (this is an old workstation!), so the bottleneck isn't the disk itself. For our purposes, it might still give us useful results.

Rescue test, LUKS/LVM/ext4 Those tests were performed with everything shutdown, after either entering the system in rescue mode, or by reaching that target with:
systemctl rescue
The network might have been started before or after the test as well:
systemctl start systemd-networkd
So it should be fairly reliable as basically nothing else is running. Raw numbers, from the job-curie-lvm.log, converted to MiB/s and manually merged:
test                       read MiB/s   read IOPS   write MiB/s   write IOPS
rand4k4g1x                      39.27       10052        212.15        54310
rand4k4g1x--fsync=1             39.29       10057          2.73          699
rand64k256m16x                1297.00       20751       1068.57        17097
rand64k256m16x--fsync=1       1290.90       20654        353.82         5661
rand1m16g1x                    315.15         315        563.77          563
rand1m16g1x--fsync=1           345.88         345        157.01          157
Peaks are at about 20k IOPS and ~1.3GiB/s read, 1GiB/s write in the 64KB blocks with 16 jobs. Slowest is the random 4k block sync write at an abysmal 3MB/s and 700 IOPS. The 1MB read/write tests have lower IOPS, but that is expected.

Rescue test, ZFS This test was also performed in rescue mode. Raw numbers, from the job-curie-zfs.log, converted to MiB/s and manually merged:
test                       read MiB/s   read IOPS   write MiB/s   write IOPS
rand4k4g1x                      77.20       19763         27.13         6944
rand4k4g1x--fsync=1             76.16       19495          6.53         1673
rand64k256m16x                1882.40       30118         70.58         1129
rand64k256m16x--fsync=1       1865.13       29842         71.98         1151
rand1m16g1x                    921.62         921        102.21          102
rand1m16g1x--fsync=1           908.37         908         64.30           64
Peaks are at 1.8GiB/s read, also in the 64k job like above, but much faster. The write is, as expected, much slower at 70MiB/s (compared to 1GiB/s!), but it should be noted the sync write doesn't degrade performance compared to async writes (although it's still below the LVM 300MB/s).

Conclusions Really, ZFS has trouble performing in all write conditions. The random 4k sync write test is the only place where ZFS outperforms LVM in writes, and barely (7MiB/s vs 3MiB/s). Everywhere else, writes are much slower, sometimes by an order of magnitude. And before some ZFS zealot jumps in talking about the SLOG or some other cache that could be added to improve performance, I'll remind you that those numbers are on a bare bones NVMe drive, pretty much as fast storage as you can find on this machine. Adding another NVMe drive as a cache probably will not improve write performance here. Still, those are very different results than the tests performed by Salter, which show ZFS beating traditional configurations in all categories but uncached 4k reads (not writes!). That said, those tests are very different from the tests I performed here, where I test writes on a single disk, not a RAID array, which might explain the discrepancy. Also, note that neither LVM nor ZFS manage to reach the 2400MB/s read and 1750MB/s write performance specification. ZFS does manage to reach 82% of the read performance (1973MB/s) and LVM 64% of the write performance (1120MB/s). LVM hits 57% of the read performance and ZFS hits barely 6% of the write performance. Overall, I'm a bit disappointed in the ZFS write performance here, I must say. Maybe I need to tweak the record size or some other ZFS voodoo, but I'll note that I didn't have to do any such configuration on the other side to kick ZFS in the pants...

Real world experience This section documents not synthetic benchmarks, but actual real world workloads, comparing before and after I switched my workstation to ZFS.

Docker performance I had the feeling that running some git hook (which was firing a Docker container) was "slower" somehow. It seems that, at runtime, ZFS backends are significantly slower than their overlayfs/ext4 equivalent:
May 16 14:42:52 curie systemd[1]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87\x2dinit-merged.mount: Succeeded.
May 16 14:42:52 curie systemd[5161]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87\x2dinit-merged.mount: Succeeded.
May 16 14:42:52 curie systemd[1]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
May 16 14:42:53 curie dockerd[1723]: time="2022-05-16T14:42:53.087219426-04:00" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10 pid=151170
May 16 14:42:53 curie systemd[1]: Started libcontainer container af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10.
May 16 14:42:54 curie systemd[1]: docker-af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10.scope: Succeeded.
May 16 14:42:54 curie dockerd[1723]: time="2022-05-16T14:42:54.047297800-04:00" level=info msg="shim disconnected" id=af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10
May 16 14:42:54 curie dockerd[998]: time="2022-05-16T14:42:54.051365015-04:00" level=info msg="ignoring event" container=af22586fba07014a4d10ab19da10cf280db7a43cad804d6c1e9f2682f12b5f10 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
May 16 14:42:54 curie systemd[2444]: run-docker-netns-f5453c87c879.mount: Succeeded.
May 16 14:42:54 curie systemd[5161]: run-docker-netns-f5453c87c879.mount: Succeeded.
May 16 14:42:54 curie systemd[2444]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
May 16 14:42:54 curie systemd[5161]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
May 16 14:42:54 curie systemd[1]: run-docker-netns-f5453c87c879.mount: Succeeded.
May 16 14:42:54 curie systemd[1]: home-docker-overlay2-17e4d24228decc2d2d493efc401dbfb7ac29739da0e46775e122078d9daf3e87-merged.mount: Succeeded.
Translating this:
  • container setup: ~1 second
  • container runtime: ~1 second
  • container teardown: ~1 second
  • total runtime: 2-3 seconds
Obviously, those timestamps are not quite accurate enough to make precise measurements... After I switched to ZFS:
mai 30 15:31:39 curie systemd[1]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf\x2dinit.mount: Succeeded. 
mai 30 15:31:39 curie systemd[5287]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf\x2dinit.mount: Succeeded. 
mai 30 15:31:40 curie systemd[1]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded. 
mai 30 15:31:40 curie systemd[5287]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded. 
mai 30 15:31:41 curie dockerd[3199]: time="2022-05-30T15:31:41.551403693-04:00" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142 pid=141080 
mai 30 15:31:41 curie systemd[1]: run-docker-runtime\x2drunc-moby-42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142-runc.ZVcjvl.mount: Succeeded. 
mai 30 15:31:41 curie systemd[5287]: run-docker-runtime\x2drunc-moby-42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142-runc.ZVcjvl.mount: Succeeded. 
mai 30 15:31:41 curie systemd[1]: Started libcontainer container 42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142. 
mai 30 15:31:45 curie systemd[1]: docker-42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142.scope: Succeeded. 
mai 30 15:31:45 curie dockerd[3199]: time="2022-05-30T15:31:45.883019128-04:00" level=info msg="shim disconnected" id=42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142 
mai 30 15:31:45 curie dockerd[1726]: time="2022-05-30T15:31:45.883064491-04:00" level=info msg="ignoring event" container=42a1a1ed5912a7227148e997f442e7ab2e5cc3558aa3471548223c5888c9b142 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete" 
mai 30 15:31:45 curie systemd[1]: run-docker-netns-e45f5cf5f465.mount: Succeeded. 
mai 30 15:31:45 curie systemd[5287]: run-docker-netns-e45f5cf5f465.mount: Succeeded. 
mai 30 15:31:45 curie systemd[1]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded. 
mai 30 15:31:45 curie systemd[5287]: var-lib-docker-zfs-graph-41ce08fb7a1d3a9c101694b82722f5621c0b4819bd1d9f070933fd1e00543cdf.mount: Succeeded.
That's double or triple the run time, from 2 seconds to 6 seconds. Most of the time is spent in run time, inside the container. Here's the breakdown:
  • container setup: ~2 seconds
  • container run: ~4 seconds
  • container teardown: ~1 second
  • total run time: about ~6-7 seconds
That's a two- to three-fold increase! Clearly something is going on here that I should tweak. It's possible that code path is less optimized in Docker. I also worry about podman, but apparently it also supports ZFS backends. Possibly it would perform better, but at this stage I wouldn't have a good comparison: maybe it would have performed better on non-ZFS as well...

Interactivity While doing the offsite backups (below), the system became somewhat "sluggish". I felt everything was slow, and I estimate it introduced ~50ms latency in any input device. Arguably, those are all USB and the external drive was connected through USB, but I suspect the ZFS drivers are not as well tuned with the scheduler as the regular filesystem drivers...

Recovery procedures For test purposes, I unmounted all systems during the procedure:
umount /mnt/boot/efi /mnt/boot/run
umount -a -t zfs
zpool export -a
And disconnected the drive, to see how I would recover this system from another Linux system in case of a total motherboard failure. To import an existing pool, plug the device, then import the pool with an alternate root, so it doesn't mount over your existing filesystems, then you mount the root filesystem and all the others:
zpool import -l -a -R /mnt &&
zfs mount rpool/ROOT/debian &&
zfs mount -a &&
mount /dev/sdc2 /mnt/boot/efi &&
mount -t tmpfs tmpfs /mnt/run &&
mkdir /mnt/run/lock

Offsite backup Part of the goal of using ZFS is to simplify and harden backups. I wanted to experiment with shorter recovery times (specifically both the point-in-time recovery objective and the recovery time objective) and faster incremental backups. This is, therefore, part of my backup services. This section documents how an external NVMe enclosure was set up in a pool to mirror the datasets from my workstation. The final setup should include syncoid copying datasets to the backup server regularly, but I haven't finished that configuration yet.
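A sketch of what that regular copy to the backup server might look like, with placeholder host and target pool names:
syncoid -r rpool root@backup.example.com:rpool-backup/curie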

Partitioning The above partitioning procedure used sgdisk, but I couldn't figure out how to do this with sgdisk, so this uses sfdisk to dump the partition table from the first disk and apply it to an external, identical drive:
sfdisk -d /dev/nvme0n1 | sfdisk --no-reread /dev/sda --force

Pool creation This is similar to the main pool creation, except we tweaked a few bits after changing the upstream procedure:
zpool create \
        -o cachefile=/etc/zfs/zpool.cache \
        -o ashift=12 -d \
        -o feature@async_destroy=enabled \
        -o feature@bookmarks=enabled \
        -o feature@embedded_data=enabled \
        -o feature@empty_bpobj=enabled \
        -o feature@enabled_txg=enabled \
        -o feature@extensible_dataset=enabled \
        -o feature@filesystem_limits=enabled \
        -o feature@hole_birth=enabled \
        -o feature@large_blocks=enabled \
        -o feature@lz4_compress=enabled \
        -o feature@spacemap_histogram=enabled \
        -o feature@zpool_checkpoint=enabled \
        -O acltype=posixacl -O xattr=sa \
        -O compression=lz4 \
        -O devices=off \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=/boot -R /mnt \
        bpool-tubman /dev/sdb3
The changes from the main boot pool are the dropped normalization=formD flag and, of course, the pool name and target device. Main pool creation is:
zpool create \
        -o ashift=12 \
        -O encryption=on -O keylocation=prompt -O keyformat=passphrase \
        -O acltype=posixacl -O xattr=sa -O dnodesize=auto \
        -O compression=zstd \
        -O relatime=on \
        -O canmount=off \
        -O mountpoint=/ -R /mnt \
        rpool-tubman /dev/sdb4

First sync I used syncoid to copy all pools over to the external device. syncoid is part of the sanoid project, which is specifically designed to sync snapshots between pools, typically over SSH links, but it can also operate locally. The sanoid command had a --readonly argument to simulate changes, but syncoid didn't, so I tried to fix that with an upstream PR. It seems it would be better to do this by hand, but this was much easier. The full first sync was:
root@curie:/home/anarcat# ./bin/syncoid -r  bpool bpool-tubman
CRITICAL ERROR: Target bpool-tubman exists but has no snapshots matching with bpool!
                Replication to target would require destroying existing
                target. Cowardly refusing to destroy your existing target.
          NOTE: Target bpool-tubman dataset is < 64MB used - did you mistakenly run
                 zfs create bpool-tubman  on the target? ZFS initial
                replication must be to a NON EXISTENT DATASET, which will
                then be CREATED BY the initial replication process.
INFO: Sending oldest full snapshot bpool/BOOT@test (~ 42 KB) to new target filesystem:
44.2KiB 0:00:00 [4.19MiB/s] [========================================================================================================================] 103%            
INFO: Updating new target filesystem with incremental bpool/BOOT@test ... syncoid_curie_2022-05-30:12:50:39 (~ 4 KB):
2.13KiB 0:00:00 [ 114KiB/s] [===============================================================>                                                         ] 53%            
INFO: Sending oldest full snapshot bpool/BOOT/debian@install (~ 126.0 MB) to new target filesystem:
 126MiB 0:00:00 [ 308MiB/s] [=======================================================================================================================>] 100%            
INFO: Updating new target filesystem with incremental bpool/BOOT/debian@install ... syncoid_curie_2022-05-30:12:50:39 (~ 113.4 MB):
 113MiB 0:00:00 [ 315MiB/s] [=======================================================================================================================>] 100%
root@curie:/home/anarcat# ./bin/syncoid -r  rpool rpool-tubman
CRITICAL ERROR: Target rpool-tubman exists but has no snapshots matching with rpool!
                Replication to target would require destroying existing
                target. Cowardly refusing to destroy your existing target.
          NOTE: Target rpool-tubman dataset is < 64MB used - did you mistakenly run
                 zfs create rpool-tubman  on the target? ZFS initial
                replication must be to a NON EXISTENT DATASET, which will
                then be CREATED BY the initial replication process.
INFO: Sending oldest full snapshot rpool/ROOT@syncoid_curie_2022-05-30:12:50:51 (~ 69 KB) to new target filesystem:
44.2KiB 0:00:00 [2.44MiB/s] [===========================================================================>                                             ] 63%            
INFO: Sending oldest full snapshot rpool/ROOT/debian@install (~ 25.9 GB) to new target filesystem:
25.9GiB 0:03:33 [ 124MiB/s] [=======================================================================================================================>] 100%            
INFO: Updating new target filesystem with incremental rpool/ROOT/debian@install ... syncoid_curie_2022-05-30:12:50:52 (~ 3.9 GB):
3.92GiB 0:00:33 [ 119MiB/s] [======================================================================================================================>  ] 99%            
INFO: Sending oldest full snapshot rpool/home@syncoid_curie_2022-05-30:12:55:04 (~ 276.8 GB) to new target filesystem:
 277GiB 0:27:13 [ 174MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/home/root@syncoid_curie_2022-05-30:13:22:19 (~ 2.2 GB) to new target filesystem:
2.22GiB 0:00:25 [90.2MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var@syncoid_curie_2022-05-30:13:22:47 (~ 5.6 GB) to new target filesystem:
5.56GiB 0:00:32 [ 176MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/cache@syncoid_curie_2022-05-30:13:23:22 (~ 627.3 MB) to new target filesystem:
 627MiB 0:00:03 [ 169MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/lib@syncoid_curie_2022-05-30:13:23:28 (~ 69 KB) to new target filesystem:
44.2KiB 0:00:00 [1.40MiB/s] [===========================================================================>                                             ] 63%            
INFO: Sending oldest full snapshot rpool/var/lib/docker@syncoid_curie_2022-05-30:13:23:28 (~ 442.6 MB) to new target filesystem:
 443MiB 0:00:04 [ 103MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/lib/docker/05c0de7fabbea60500eaa495d0d82038249f6faa63b12914737c4d71520e62c5@266253254 (~ 6.3 MB) to new target filesystem:
6.49MiB 0:00:00 [12.9MiB/s] [========================================================================================================================] 102%            
INFO: Updating new target filesystem with incremental rpool/var/lib/docker/05c0de7fabbea60500eaa495d0d82038249f6faa63b12914737c4d71520e62c5@266253254 ... syncoid_curie_2022-05-30:13:23:34 (~ 4 KB):
1.52KiB 0:00:00 [27.6KiB/s] [============================================>                                                                            ] 38%            
INFO: Sending oldest full snapshot rpool/var/lib/flatpak@syncoid_curie_2022-05-30:13:23:36 (~ 2.0 GB) to new target filesystem:
2.00GiB 0:00:17 [ 115MiB/s] [=======================================================================================================================>] 100%            
INFO: Sending oldest full snapshot rpool/var/tmp@syncoid_curie_2022-05-30:13:23:55 (~ 57.0 MB) to new target filesystem:
61.8MiB 0:00:01 [45.0MiB/s] [========================================================================================================================] 108%            
INFO: Clone is recreated on target rpool-tubman/var/lib/docker/ed71ddd563a779ba6fb37b3b1d0cc2c11eca9b594e77b4b234867ebcb162b205 based on rpool/var/lib/docker/05c0de7fabbea60500eaa495d0d82038249f6faa63b12914737c4d71520e62c5@266253254
INFO: Sending oldest full snapshot rpool/var/lib/docker/ed71ddd563a779ba6fb37b3b1d0cc2c11eca9b594e77b4b234867ebcb162b205@syncoid_curie_2022-05-30:13:23:58 (~ 218.6 MB) to new target filesystem:
 219MiB 0:00:01 [ 151MiB/s] [=======================================================================================================================>] 100%
Funny how the CRITICAL ERROR doesn't actually stop syncoid and it just carries on merrily, while telling you it's "cowardly refusing to destroy your existing target"... Maybe that's because my pull request broke something, though... During the transfer, the computer was very sluggish: everything feels like it has ~30-50ms of extra latency:
anarcat@curie:sanoid$ LANG=C top -b -n 1 | head -20
top - 13:07:05 up 6 days,  4:01,  1 user,  load average: 16.13, 16.55, 11.83
Tasks: 606 total,   6 running, 598 sleeping,   0 stopped,   2 zombie
%Cpu(s): 18.8 us, 72.5 sy,  1.2 ni,  5.0 id,  1.2 wa,  0.0 hi,  1.2 si,  0.0 st
MiB Mem :  15898.4 total,   1387.6 free,  13170.0 used,   1340.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1319.8 avail Mem 
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     70 root      20   0       0      0      0 S  83.3   0.0   6:12.67 kswapd0
4024878 root      20   0  282644  96432  10288 S  44.4   0.6   0:11.43 puppet
3896136 root      20   0   35328  16528     48 S  22.2   0.1   2:08.04 mbuffer
3896135 root      20   0   10328    776    168 R  16.7   0.0   1:22.93 zfs
3896138 root      20   0   10588    788    156 R  16.7   0.0   1:49.30 zfs
    350 root       0 -20       0      0      0 R  11.1   0.0   1:03.53 z_rd_int
    351 root       0 -20       0      0      0 S  11.1   0.0   1:04.15 z_rd_int
3896137 root      20   0    4384    352    244 R  11.1   0.0   0:44.73 pv
4034094 anarcat   30  10   20028  13960   2428 S  11.1   0.1   0:00.70 mbsync
4036539 anarcat   20   0    9604   3464   2408 R  11.1   0.0   0:00.04 top
    352 root       0 -20       0      0      0 S   5.6   0.0   1:03.64 z_rd_int
    353 root       0 -20       0      0      0 S   5.6   0.0   1:03.64 z_rd_int
    354 root       0 -20       0      0      0 S   5.6   0.0   1:04.01 z_rd_int
I wonder how much of that is due to syncoid, particularly because I often saw mbuffer and pv in there, which are not strictly necessary for those kinds of operations, as far as I understand. Once that's done, export the pools to disconnect the drive:
zpool export bpool-tubman
zpool export rpool-tubman

Raw disk benchmark Copied the 512GB SSD/M.2 device to another 1024GB NVMe/M.2 device:
anarcat@curie:~$ sudo dd if=/dev/sdb of=/dev/sdc bs=4M status=progress conv=fdatasync
499944259584 octets (500 GB, 466 GiB) copiés, 1713 s, 292 MB/s
119235+1 enregistrements lus
119235+1 enregistrements écrits
500107862016 octets (500 GB, 466 GiB) copiés, 1719,93 s, 291 MB/s
... and that's with both drives over USB, whoohoo, 300MB/s!

Monitoring You should be monitoring your ZFS pools regularly. Normally, the zed daemon monitors all ZFS events; it is the thing that will report when a scrub failed, for example. See this configuration guide. Scrubs should be regularly scheduled to ensure consistency of the pool. This can be done in newer zfsutils-linux versions (bullseye-backports or bookworm) with one of these, depending on the desired frequency:
systemctl enable zfs-scrub-weekly@rpool.timer --now
systemctl enable zfs-scrub-monthly@rpool.timer --now
When the scrub runs, if it finds anything it will send an event which will get picked up by the zed daemon which will then send a notification, see below for an example. TODO: deploy on curie, if possible (probably not because no RAID) TODO: this should be in Puppet
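For reference, on older zfsutils-linux releases without those timer units (or for a one-off check), a scrub can be started and inspected by hand; rpool is used as the example pool here:
zpool scrub rpool
zpool status rpool
# a crude monthly fallback in /etc/cron.d (the schedule is just an example):
# 0 3 1 * * root /sbin/zpool scrub rpool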

Scrub warning example So what happens when problems are found? Here's an example of how I dealt with an error I received. After setting up another server (tubman) with ZFS, I eventually ended up getting a warning from the ZFS toolchain.
Date: Sun, 09 Oct 2022 00:58:08 -0400
From: root <root@anarc.at>
To: root@anarc.at
Subject: ZFS scrub_finish event for rpool on tubman
ZFS has finished a scrub:
   eid: 39536
 class: scrub_finish
  host: tubman
  time: 2022-10-09 00:58:07-0400
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:33:57 with 0 errors on Sun Oct  9 00:58:07 2022
config:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb4    ONLINE       0     1     0
            sdc4    ONLINE       0     0     0
        cache
          sda3      ONLINE       0     0     0
errors: No known data errors
This, in itself, is a little worrisome. But it helpfully links to this more detailed documentation (and props up there: the link still works) which explains this is a "minor" problem (something that could be included in the report). In this case, this happened on a server set up on 2021-04-28, but the disks and server hardware are much older. The server itself (marcos v1) was built around 2011, over 10 years ago now. The hard drive in question is:
root@tubman:~# smartctl -i -qnoserial /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 11 11:02:32 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Some more SMART stats:
root@tubman:~# smartctl -a -qnoserial /dev/sdb | grep -e Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes:     512 bytes logical, 4096 bytes physical
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12464 (206 202 0)
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       10966h+55m+23.757s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       21107792664
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3201579750
That's over a year of power on, which shouldn't be so bad. It has written about 10TB of data (21107792664 LBAs * 512 byte/LBA), which is about two full writes. According to its specification, this device is supposed to support 55 TB/year of writes, so we're far below spec. Note that we are still far from the "non-recoverable read error per bits" spec (1 per 10E15), as we've basically read 13E12 bits (3201579750 LBAs * 512 byte/LBA * 8 bits/byte = 13E12 bits). It's likely this disk was made in 2018, so it is in its fourth year. Interestingly, /dev/sdc is also a Seagate drive, but of a different series:
root@tubman:~# smartctl -qnoserial  -i /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-15-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST4000DM004-2CV104
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Oct 11 11:21:35 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
It has seen many more reads than the other disk, which is also interesting:
root@tubman:~# smartctl -a -qnoserial /dev/sdc | grep -e Head_Flying_Hours -e Power_On_Hours -e Total_LBA -e 'Sector Sizes'
Sector Sizes:     512 bytes logical, 4096 bytes physical
  9 Power_On_Hours          0x0032   059   059   000    Old_age   Always       -       36240
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       33994h+10m+52.118s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       30730174438
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       51894566538
That's 4 years of Head_Flying_Hours, and over 4 years (4 years and 48 days) of Power_On_Hours. The copyright date on that drive's specs goes back to 2016, so it's a much older drive. SMART self-test succeeded.
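As a quick sanity check, the back-of-the-envelope numbers above can be reproduced in shell (8766 is roughly the number of hours in a year):
echo $((21107792664 * 512))      # bytes written by sdb: 10807189843968, ~10.8 TB
echo $((3201579750 * 512 * 8))   # bits read by sdb: 13113670656000, ~13E12 bits
echo $((36240 / 8766))           # sdc power-on hours in years: 4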

Remaining issues
  • TODO: move send/receive backups to offsite host, see also zfs for alternatives to syncoid/sanoid there
  • TODO: setup backup cron job (or timer?)
  • TODO: swap still not setup on curie, see zfs
  • TODO: document this somewhere: bpool and rpool are both pools and datasets. that's pretty confusing, but also very useful because it allows for pool-wide recursive snapshots, which are used for the backup system

fio improvements I really want to improve my experience with fio. Right now, I'm just cargo-culting stuff from other folks and I don't really like it. stressant is a good example of my struggles, in the sense that it doesn't really work that well for disk tests. I would love to have just a single .fio job file that lists multiple jobs to run serially. For example, this file describes the above workload pretty well:
[global]
# cargo-culting Salter
fallocate=none
ioengine=posixaio
runtime=60
time_based=1
end_fsync=1
stonewall=1
group_reporting=1
# no need to drop caches, done by default
# invalidate=1
# Single 4KiB random read/write process
[randread-4k-4g-1x]
rw=randread
bs=4k
size=4g
numjobs=1
iodepth=1
[randwrite-4k-4g-1x]
rw=randwrite
bs=4k
size=4g
numjobs=1
iodepth=1
# 16 parallel 64KiB random read/write processes:
[randread-64k-256m-16x]
rw=randread
bs=64k
size=256m
numjobs=16
iodepth=16
[randwrite-64k-256m-16x]
rw=randwrite
bs=64k
size=256m
numjobs=16
iodepth=16
# Single 1MiB random read/write process
[randread-1m-16g-1x]
rw=randread
bs=1m
size=16g
numjobs=1
iodepth=1
[randwrite-1m-16g-1x]
rw=randwrite
bs=1m
size=16g
numjobs=1
iodepth=1
... except the jobs are actually started in parallel, even though they are stonewall'd, as far as I can tell by the reports. I sent a mail to the fio mailing list for clarification. It looks like the jobs are started in parallel, but actually (correctly) run serially. It seems like this might just be a matter of reporting the right timestamps in the end, although it does feel like starting all the processes (even if not doing any work yet) could skew the results.
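For what it's worth, assuming the job file above is saved as jobs.fio (the file name is mine), it can be run and the per-job names and elapsed times pulled out of the JSON report like this; the jq expression and exact field names are an assumption and may vary between fio versions:
fio --output=results.json --output-format=json jobs.fio
jq -r '.jobs[] | "\(.jobname): \(.elapsed)s"' results.json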

Hangs during procedure During the procedure, it happened a few times where any ZFS command would completely hang. It seems that using an external USB drive to sync stuff didn't work so well: sometimes it would reconnect under a different device (from sdc to sdd, for example), and this would greatly confuse ZFS. Here, for example, is sdd reappearing out of the blue:
May 19 11:22:53 curie kernel: [  699.820301] scsi host4: uas
May 19 11:22:53 curie kernel: [  699.820544] usb 2-1: authorized to connect
May 19 11:22:53 curie kernel: [  699.922433] scsi 4:0:0:0: Direct-Access     ROG      ESD-S1C          0    PQ: 0 ANSI: 6
May 19 11:22:53 curie kernel: [  699.923235] sd 4:0:0:0: Attached scsi generic sg2 type 0
May 19 11:22:53 curie kernel: [  699.923676] sd 4:0:0:0: [sdd] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
May 19 11:22:53 curie kernel: [  699.923788] sd 4:0:0:0: [sdd] Write Protect is off
May 19 11:22:53 curie kernel: [  699.923949] sd 4:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
May 19 11:22:53 curie kernel: [  699.924149] sd 4:0:0:0: [sdd] Optimal transfer size 33553920 bytes
May 19 11:22:53 curie kernel: [  699.961602]  sdd: sdd1 sdd2 sdd3 sdd4
May 19 11:22:53 curie kernel: [  699.996083] sd 4:0:0:0: [sdd] Attached SCSI disk
Next time I run a ZFS command (say zpool list), the command completely hangs (D state) and this comes up in the logs:
May 19 11:34:21 curie kernel: [ 1387.914843] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=71344128 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.914859] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=205565952 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.914874] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=272789504 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.914906] zio pool=bpool vdev=/dev/sdc3 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 19 11:34:21 curie kernel: [ 1387.914932] zio pool=bpool vdev=/dev/sdc3 error=5 type=1 offset=1073225728 size=8192 flags=b08c1
May 19 11:34:21 curie kernel: [ 1387.914948] zio pool=bpool vdev=/dev/sdc3 error=5 type=1 offset=1073487872 size=8192 flags=b08c1
May 19 11:34:21 curie kernel: [ 1387.915165] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=272793600 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.915183] zio pool=bpool vdev=/dev/sdc3 error=5 type=2 offset=339853312 size=4096 flags=184880
May 19 11:34:21 curie kernel: [ 1387.915648] WARNING: Pool 'bpool' has encountered an uncorrectable I/O failure and has been suspended.
May 19 11:34:21 curie kernel: [ 1387.915648] 
May 19 11:37:25 curie kernel: [ 1571.558614] task:txg_sync        state:D stack:    0 pid:  997 ppid:     2 flags:0x00004000
May 19 11:37:25 curie kernel: [ 1571.558623] Call Trace:
May 19 11:37:25 curie kernel: [ 1571.558640]  __schedule+0x282/0x870
May 19 11:37:25 curie kernel: [ 1571.558650]  schedule+0x46/0xb0
May 19 11:37:25 curie kernel: [ 1571.558670]  schedule_timeout+0x8b/0x140
May 19 11:37:25 curie kernel: [ 1571.558675]  ? __next_timer_interrupt+0x110/0x110
May 19 11:37:25 curie kernel: [ 1571.558678]  io_schedule_timeout+0x4c/0x80
May 19 11:37:25 curie kernel: [ 1571.558689]  __cv_timedwait_common+0x12b/0x160 [spl]
May 19 11:37:25 curie kernel: [ 1571.558694]  ? add_wait_queue_exclusive+0x70/0x70
May 19 11:37:25 curie kernel: [ 1571.558702]  __cv_timedwait_io+0x15/0x20 [spl]
May 19 11:37:25 curie kernel: [ 1571.558816]  zio_wait+0x129/0x2b0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.558929]  dsl_pool_sync+0x461/0x4f0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559032]  spa_sync+0x575/0xfa0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559138]  ? spa_txg_history_init_io+0x101/0x110 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559245]  txg_sync_thread+0x2e0/0x4a0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559354]  ? txg_fini+0x240/0x240 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559366]  thread_generic_wrapper+0x6f/0x80 [spl]
May 19 11:37:25 curie kernel: [ 1571.559376]  ? __thread_exit+0x20/0x20 [spl]
May 19 11:37:25 curie kernel: [ 1571.559379]  kthread+0x11b/0x140
May 19 11:37:25 curie kernel: [ 1571.559382]  ? __kthread_bind_mask+0x60/0x60
May 19 11:37:25 curie kernel: [ 1571.559386]  ret_from_fork+0x22/0x30
May 19 11:37:25 curie kernel: [ 1571.559401] task:zed             state:D stack:    0 pid: 1564 ppid:     1 flags:0x00000000
May 19 11:37:25 curie kernel: [ 1571.559404] Call Trace:
May 19 11:37:25 curie kernel: [ 1571.559409]  __schedule+0x282/0x870
May 19 11:37:25 curie kernel: [ 1571.559412]  ? __kmalloc_node+0x141/0x2b0
May 19 11:37:25 curie kernel: [ 1571.559417]  schedule+0x46/0xb0
May 19 11:37:25 curie kernel: [ 1571.559420]  schedule_preempt_disabled+0xa/0x10
May 19 11:37:25 curie kernel: [ 1571.559424]  __mutex_lock.constprop.0+0x133/0x460
May 19 11:37:25 curie kernel: [ 1571.559435]  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
May 19 11:37:25 curie kernel: [ 1571.559537]  spa_all_configs+0x41/0x120 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559644]  zfs_ioc_pool_configs+0x17/0x70 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559752]  zfsdev_ioctl_common+0x697/0x870 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559758]  ? _copy_from_user+0x28/0x60
May 19 11:37:25 curie kernel: [ 1571.559860]  zfsdev_ioctl+0x53/0xe0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.559866]  __x64_sys_ioctl+0x83/0xb0
May 19 11:37:25 curie kernel: [ 1571.559869]  do_syscall_64+0x33/0x80
May 19 11:37:25 curie kernel: [ 1571.559873]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 19 11:37:25 curie kernel: [ 1571.559876] RIP: 0033:0x7fcf0ef32cc7
May 19 11:37:25 curie kernel: [ 1571.559878] RSP: 002b:00007fcf0e181618 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 19 11:37:25 curie kernel: [ 1571.559881] RAX: ffffffffffffffda RBX: 000055b212f972a0 RCX: 00007fcf0ef32cc7
May 19 11:37:25 curie kernel: [ 1571.559883] RDX: 00007fcf0e181640 RSI: 0000000000005a04 RDI: 000000000000000b
May 19 11:37:25 curie kernel: [ 1571.559885] RBP: 00007fcf0e184c30 R08: 00007fcf08016810 R09: 00007fcf08000080
May 19 11:37:25 curie kernel: [ 1571.559886] R10: 0000000000080000 R11: 0000000000000246 R12: 000055b212f972a0
May 19 11:37:25 curie kernel: [ 1571.559888] R13: 0000000000000000 R14: 00007fcf0e181640 R15: 0000000000000000
May 19 11:37:25 curie kernel: [ 1571.559980] task:zpool           state:D stack:    0 pid:11815 ppid:  3816 flags:0x00004000
May 19 11:37:25 curie kernel: [ 1571.559983] Call Trace:
May 19 11:37:25 curie kernel: [ 1571.559988]  __schedule+0x282/0x870
May 19 11:37:25 curie kernel: [ 1571.559992]  schedule+0x46/0xb0
May 19 11:37:25 curie kernel: [ 1571.559995]  io_schedule+0x42/0x70
May 19 11:37:25 curie kernel: [ 1571.560004]  cv_wait_common+0xac/0x130 [spl]
May 19 11:37:25 curie kernel: [ 1571.560008]  ? add_wait_queue_exclusive+0x70/0x70
May 19 11:37:25 curie kernel: [ 1571.560118]  txg_wait_synced_impl+0xc9/0x110 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560223]  txg_wait_synced+0xc/0x40 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560325]  spa_export_common+0x4cd/0x590 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560430]  ? zfs_log_history+0x9c/0xf0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560537]  zfsdev_ioctl_common+0x697/0x870 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560543]  ? _copy_from_user+0x28/0x60
May 19 11:37:25 curie kernel: [ 1571.560644]  zfsdev_ioctl+0x53/0xe0 [zfs]
May 19 11:37:25 curie kernel: [ 1571.560649]  __x64_sys_ioctl+0x83/0xb0
May 19 11:37:25 curie kernel: [ 1571.560653]  do_syscall_64+0x33/0x80
May 19 11:37:25 curie kernel: [ 1571.560656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 19 11:37:25 curie kernel: [ 1571.560659] RIP: 0033:0x7fdc23be2cc7
May 19 11:37:25 curie kernel: [ 1571.560661] RSP: 002b:00007ffc8c792478 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May 19 11:37:25 curie kernel: [ 1571.560664] RAX: ffffffffffffffda RBX: 000055942ca49e20 RCX: 00007fdc23be2cc7
May 19 11:37:25 curie kernel: [ 1571.560666] RDX: 00007ffc8c792490 RSI: 0000000000005a03 RDI: 0000000000000003
May 19 11:37:25 curie kernel: [ 1571.560667] RBP: 00007ffc8c795e80 R08: 00000000ffffffff R09: 00007ffc8c792310
May 19 11:37:25 curie kernel: [ 1571.560669] R10: 000055942ca49e30 R11: 0000000000000246 R12: 00007ffc8c792490
May 19 11:37:25 curie kernel: [ 1571.560671] R13: 000055942ca49e30 R14: 000055942aed2c20 R15: 00007ffc8c795a40
Here's another example, where you see the USB controller bleeping out and back into existence:
mai 19 11:38:39 curie kernel: usb 2-1: USB disconnect, device number 2
mai 19 11:38:39 curie kernel: sd 4:0:0:0: [sdd] Synchronizing SCSI cache
mai 19 11:38:39 curie kernel: sd 4:0:0:0: [sdd] Synchronize Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
mai 19 11:39:25 curie kernel: INFO: task zed:1564 blocked for more than 241 seconds.
mai 19 11:39:25 curie kernel:       Tainted: P          IOE     5.10.0-14-amd64 #1 Debian 5.10.113-1
mai 19 11:39:25 curie kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mai 19 11:39:25 curie kernel: task:zed             state:D stack:    0 pid: 1564 ppid:     1 flags:0x00000000
mai 19 11:39:25 curie kernel: Call Trace:
mai 19 11:39:25 curie kernel:  __schedule+0x282/0x870
mai 19 11:39:25 curie kernel:  ? __kmalloc_node+0x141/0x2b0
mai 19 11:39:25 curie kernel:  schedule+0x46/0xb0
mai 19 11:39:25 curie kernel:  schedule_preempt_disabled+0xa/0x10
mai 19 11:39:25 curie kernel:  __mutex_lock.constprop.0+0x133/0x460
mai 19 11:39:25 curie kernel:  ? nvlist_xalloc.part.0+0x68/0xc0 [znvpair]
mai 19 11:39:25 curie kernel:  spa_all_configs+0x41/0x120 [zfs]
mai 19 11:39:25 curie kernel:  zfs_ioc_pool_configs+0x17/0x70 [zfs]
mai 19 11:39:25 curie kernel:  zfsdev_ioctl_common+0x697/0x870 [zfs]
mai 19 11:39:25 curie kernel:  ? _copy_from_user+0x28/0x60
mai 19 11:39:25 curie kernel:  zfsdev_ioctl+0x53/0xe0 [zfs]
mai 19 11:39:25 curie kernel:  __x64_sys_ioctl+0x83/0xb0
mai 19 11:39:25 curie kernel:  do_syscall_64+0x33/0x80
mai 19 11:39:25 curie kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
mai 19 11:39:25 curie kernel: RIP: 0033:0x7fcf0ef32cc7
mai 19 11:39:25 curie kernel: RSP: 002b:00007fcf0e181618 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
mai 19 11:39:25 curie kernel: RAX: ffffffffffffffda RBX: 000055b212f972a0 RCX: 00007fcf0ef32cc7
mai 19 11:39:25 curie kernel: RDX: 00007fcf0e181640 RSI: 0000000000005a04 RDI: 000000000000000b
mai 19 11:39:25 curie kernel: RBP: 00007fcf0e184c30 R08: 00007fcf08016810 R09: 00007fcf08000080
mai 19 11:39:25 curie kernel: R10: 0000000000080000 R11: 0000000000000246 R12: 000055b212f972a0
mai 19 11:39:25 curie kernel: R13: 0000000000000000 R14: 00007fcf0e181640 R15: 0000000000000000
mai 19 11:39:25 curie kernel: INFO: task zpool:11815 blocked for more than 241 seconds.
mai 19 11:39:25 curie kernel:       Tainted: P          IOE     5.10.0-14-amd64 #1 Debian 5.10.113-1
mai 19 11:39:25 curie kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mai 19 11:39:25 curie kernel: task:zpool           state:D stack:    0 pid:11815 ppid:  2621 flags:0x00004004
mai 19 11:39:25 curie kernel: Call Trace:
mai 19 11:39:25 curie kernel:  __schedule+0x282/0x870
mai 19 11:39:25 curie kernel:  schedule+0x46/0xb0
mai 19 11:39:25 curie kernel:  io_schedule+0x42/0x70
mai 19 11:39:25 curie kernel:  cv_wait_common+0xac/0x130 [spl]
mai 19 11:39:25 curie kernel:  ? add_wait_queue_exclusive+0x70/0x70
mai 19 11:39:25 curie kernel:  txg_wait_synced_impl+0xc9/0x110 [zfs]
mai 19 11:39:25 curie kernel:  txg_wait_synced+0xc/0x40 [zfs]
mai 19 11:39:25 curie kernel:  spa_export_common+0x4cd/0x590 [zfs]
mai 19 11:39:25 curie kernel:  ? zfs_log_history+0x9c/0xf0 [zfs]
mai 19 11:39:25 curie kernel:  zfsdev_ioctl_common+0x697/0x870 [zfs]
mai 19 11:39:25 curie kernel:  ? _copy_from_user+0x28/0x60
mai 19 11:39:25 curie kernel:  zfsdev_ioctl+0x53/0xe0 [zfs]
mai 19 11:39:25 curie kernel:  __x64_sys_ioctl+0x83/0xb0
mai 19 11:39:25 curie kernel:  do_syscall_64+0x33/0x80
mai 19 11:39:25 curie kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
mai 19 11:39:25 curie kernel: RIP: 0033:0x7fdc23be2cc7
mai 19 11:39:25 curie kernel: RSP: 002b:00007ffc8c792478 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
mai 19 11:39:25 curie kernel: RAX: ffffffffffffffda RBX: 000055942ca49e20 RCX: 00007fdc23be2cc7
mai 19 11:39:25 curie kernel: RDX: 00007ffc8c792490 RSI: 0000000000005a03 RDI: 0000000000000003
mai 19 11:39:25 curie kernel: RBP: 00007ffc8c795e80 R08: 00000000ffffffff R09: 00007ffc8c792310
mai 19 11:39:25 curie kernel: R10: 000055942ca49e30 R11: 0000000000000246 R12: 00007ffc8c792490
mai 19 11:39:25 curie kernel: R13: 000055942ca49e30 R14: 000055942aed2c20 R15: 00007ffc8c795a40
I understand those are rather extreme conditions: I would fully expect the pool to stop working if the underlying drives disappear. What doesn't seem acceptable is that a command would completely hang like this.

References See the zfs documentation for more information about ZFS, and tubman for another installation and migration procedure.

10 November 2022

Melissa Wen: V3D enablement in mainline kernel

Hey, If you enjoy using the upstream Linux kernel on your Raspberry Pi system, or just want to give the freshest kernel graphics drivers there a try, the good news is that now you can compile and boot the V3D driver from the mainline on your Raspberry Pi 4. Thanks to the work of Stefan, Peter and Nicolas [1] [2], the V3D enablement reached the Linux kernel mainline. That means hacking and using new features available in the upstream V3D driver directly from the source. However, even for those used to compiling and installing a custom kernel on the Raspberry Pi, there are some quirks to getting the mainline v3d module available in 32-bit and 64-bit systems. I've quickly summarized how to compile and install upstream kernel versions (>=6.0) in this short blog post.
Note: V3D driver is not present in Raspberry Pi models 0-3.
First, I'm assuming that you already know how to cross-compile a custom kernel for your system. If that is not your case, a good tutorial is already available in the Raspberry Pi documentation, but it targets the kernel in the rpi-linux repository (the downstream kernel). From this documentation, the main differences in the upstream kernel are presented below:

Raspberry Pi 4 64-bit (arm64)

Diff short summary:
  1. instead of getting the .config file from bcm2711_defconfig, get it by running make ARCH=arm64 defconfig
  2. compile and install the kernel image and modules as usual, but just copy the dtb file arch/arm64/boot/dts/broadcom/bcm2711-rpi-4-b.dtb to the /boot of your target machine (no /overlays directory to copy)
  3. change /boot/config.txt:
    • comment/remove the dtoverlay=vc4-kms-v3d entry
    • add a device_tree=bcm2711-rpi-4-b.dtb entry

Raspberry Pi 4 32-bit (arm)

Diff short summary:
  1. get the .config file by running make ARCH=arm multi_v7_defconfig
  2. using make ARCH=arm menuconfig or a similar tool, enable CONFIG_ARM_LPAE=y
  3. compile and install the kernel image and modules as usual, but just copy the dtb file arch/arm/boot/dts/bcm2711-rpi-4-b.dtb to the /boot of your target machine (no /overlays directory to copy)
  4. change /boot/config.txt:
    • comment/remove the dtoverlay=vc4-kms-v3d entry
    • add a device_tree=bcm2711-rpi-4-b.dtb entry

Step-by-step for remote deployment:

Set variables Raspberry Pi 4 64-bit (arm64)
cd <path-to-upstream-linux-directory>
KERNEL=$(make kernelrelease)
ARCH="arm64"
CROSS_COMPILE="aarch64-linux-gnu-"
DEFCONFIG=defconfig
IMAGE=Image
DTB_PATH="broadcom/bcm2711-rpi-4-b.dtb"
RPI4=<ip>
TMP=$(mktemp -d)
Raspberry Pi 4 32-bit (arm)
cd <path-to-upstream-linux-directory>
KERNEL=$(make kernelrelease)
ARCH="arm"
CROSS_COMPILE="arm-linux-gnueabihf-"
DEFCONFIG=multi_v7_defconfig
IMAGE=zImage
DTB_PATH="bcm2711-rpi-4-b.dtb"
RPI4=<ip>
TMP=$(mktemp -d)

Get default .config file
make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE $DEFCONFIG
Raspberry Pi 4 32-bit (arm) Additional step for 32-bit system. Enable CONFIG_ARM_LPAE=y using make ARCH=arm menuconfig
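If you prefer to skip the interactive menu, the in-tree scripts/config helper can flip the option non-interactively and olddefconfig will resolve any dependencies; this is an alternative to the menuconfig step, not part of the original instructions:
./scripts/config --file .config --enable ARM_LPAE
make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE olddefconfig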

Cross-compile the mainline kernel
make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE $IMAGE modules dtbs

Install modules to send
make ARCH=$ARCH CROSS_COMPILE=$CROSS_COMPILE INSTALL_MOD_PATH=$TMP modules_install

Copy kernel image, modules and the dtb to your remote system
ssh $RPI4 mkdir -p /tmp/new_modules /tmp/new_kernel
rsync -av $TMP/ $RPI4:/tmp/new_modules/
scp arch/$ARCH/boot/$IMAGE $RPI4:/tmp/new_kernel/$KERNEL
scp arch/$ARCH/boot/dts/$DTB_PATH $RPI4:/tmp/new_kernel
ssh $RPI4 sudo rsync -av /tmp/new_modules/lib/modules/ /lib/modules/
ssh $RPI4 sudo rsync -av /tmp/new_kernel/ /boot/
rm -rf $TMP

Set config.txt of your RPi 4 In your Raspberry Pi 4, open the config file /boot/config.txt (a sample snippet follows this list):
  • comment/remove the dtoverlay=vc4-kms-v3d entry
  • add a device_tree=bcm2711-rpi-4-b.dtb entry
  • add a kernel=<image-name> entry
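Putting those three changes together, the relevant part of /boot/config.txt would look roughly like this (the kernel name is whatever image file you copied to /boot):
# dtoverlay=vc4-kms-v3d
device_tree=bcm2711-rpi-4-b.dtb
kernel=<image-name>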

Why not Kworkflow? You can safely use the steps above, but if you are hacking on the kernel and need to repeat these compile-and-install steps over and over, why not try Kworkflow? Kworkflow is a set of scripts that synthesizes all the steps needed to get a custom kernel compiled and installed on local and remote machines, and it supports kernel building and deployment to Raspberry Pi machines running 32-bit and 64-bit Raspbian. After learning the kernel compilation and installation step by step, you can simply use the kw bd command and have a custom kernel installed on your Raspberry Pi 4.
Note: the userspace 3D acceleration (glx/mesa) is working as expected on arm64, but the driver is not loaded yet for arm. Besides that, a bunch of pte invalid errors may appear when using 3D acceleration; it's a known issue that is still under investigation.

25 October 2022

Dirk Eddelbuettel: nanotime 0.3.7 on CRAN: Enhancements

A new version of our nanotime package arrived at CRAN today as version 0.3.7. nanotime relies on the RcppCCTZ package (as well as the RcppDate package for additional C++ operations) and offers efficient high(er) resolution time parsing and formatting up to nanosecond resolution, using the bit64 package for the actual integer64 arithmetic. Initially implemented using the S3 system, it has benefitted greatly from a rigorous refactoring by Leonardo who not only rejigged nanotime internals in S4 but also added new S4 types for periods, intervals and durations. This release adds a few more operators, plus some other fixes, that were contributed in several PRs by Trevor Davis. The NEWS snippet has the full details.

Changes in version 0.3.7 (2022-10-23)
  • Update mkdocs for material docs generator (Dirk in #102)
  • Use inherits() instead comparing to class() (Trevor Davis in #104)
  • Set default arguments in nanoduration() (Trevor Davis in #105)
  • Add as.nanoduration.difftime() support (Trevor Davis in #106)
  • Add +/- methods for nanotime and difftime objects (Trevor Davis in #110 closing #108, #107)

Thanks to my CRANberries there is also a diff to the previous version. More details and examples are at the nanotime page; code, issue tickets etc at the GitHub repository and all documentation is provided at the nanotime documentation site. If you like this or other open-source work I do, you can now sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

16 October 2022

Colin Watson: Reproducible man-db databases

I've released man-db 2.11.0 (announcement, NEWS), and uploaded it to Debian unstable. The biggest chunk of work here was fixing some extremely long-standing issues with how the database is built. Despite being in the package name, man-db's database is much less important than it used to be: most uses of man(1) haven't required it in a long time, and both hardware and software improvements mean that even some searches can be done by brute force without needing prior indexing. However, the database is still needed for the whatis(1) and apropos(1) commands. The database has a simple format - no relational structure here, it's just a simple key-value database using old-fashioned DBM-like interfaces and composing a few fields to form values - but there are a number of subtleties involved. The issues tend to amount to this: what does a manual page name mean? At first glance it might seem simple, because you have file names that look something like /usr/share/man/man1/ls.1.gz and that's obviously ls(1). Some pages are symlinks to other pages (which we track separately because it makes it easier to figure out which entries to update when the contents of the file system change), and sometimes multiple pages are even hard links to the same file. The real complications come with "whatis references". Pages can list a bunch of names in their NAME section, and the historical expectation is that it should be possible to use those names as arguments to man(1) even if they don't also appear in the file system (although Debian policy has deprecated relying on this for some time). Not only does that mean that man(1) sometimes needs to consult the database, but it also means that the database is inherently more complicated, since a page might list something in its NAME section that conflicts with an actual file name in the file system, and now you need a priority system to resolve ambiguities. There are some other possible causes of ambiguity as well. The people working on reproducible builds in Debian branched out to the related challenge of reproducible installations some time ago: can you take a collection of packages, bootstrap a file system image from them, and reproduce that exact same image somewhere else? This is useful for the same sorts of reasons that reproducible builds are useful: it lets you verify that an image is built from the components it's supposed to be built from, and doesn't contain any other skulduggery by accident or design. One of the people working on this noticed that man-db's database files were an obstacle to that: in particular, the exact contents of the database seemed to depend on the order in which files were scanned when building it. The reporter proposed solving this by processing files in sorted order, but I wasn't keen on that approach: firstly because it would mean we could no longer process files in an order that makes it more efficient to read them all from disk (still valuable on rotational disks), but mostly because the differences seemed to point to other bugs. Having understood this, there then followed several late nights of very fiddly work on the details of how the database is maintained. None of this was conceptually difficult: it mainly amounted to ensuring that we maintain a consistent well-order for different entries that we might want to insert for a given database key, and that we consider the same names for insertion regardless of the order in which we encounter files.
As usual, the tricky bit is making sure that we have the right data structures to support this. man-db is written in C which is not very well-supplied with built-in data structures, and originally much of the code was written in a style that tried to minimize memory allocations; this came at the cost of ownership and lifetime often being rather unclear, and it was often difficult to make changes without causing leaks or double-frees. Over the years I've been gradually introducing better encapsulation to make things easier to follow, and I had to do another round of that here. There were also some problems with caching being done at slightly the wrong layer: we need to make use of a trace of the chain of links followed to resolve a page to its ultimate source file, but we were incorrectly caching that trace and reusing it for any link to the same file, with incorrect results in many cases. Oh, and after doing all that I found that the on-disk representation of a GDBM database is insertion-order-dependent, so I ended up having to manually reorganize the database at the end by reading it all in and writing it all back out in sorted order, which feels really weird to me coming from spending most of my time with PostgreSQL these days. Fortunately the database is small so this takes negligible time. None of this is particularly glamorous work, but it paid off:
# export SOURCE_DATE_EPOCH="$(date +%s)"
# mkdir emptydir disorder
# disorderfs --multi-user=yes --shuffle-dirents=yes --reverse-dirents=no emptydir disorder
# export TMPDIR="$(pwd)/disorder"
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out1.tar
# mmdebstrap --variant=standard --hook-dir=/usr/share/mmdebstrap/hooks/merged-usr \
      unstable out2.tar
# cmp out1.tar out2.tar
# echo $?
0
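If you only care about the man-db piece, the database can also be pulled out of both tarballs and compared directly; a rough sketch, assuming the usual /var/cache/man/index.db location and mmdebstrap's ./-prefixed tar member names:
mkdir db1 db2
tar -C db1 -xf out1.tar ./var/cache/man/index.db
tar -C db2 -xf out2.tar ./var/cache/man/index.db
cmp db1/var/cache/man/index.db db2/var/cache/man/index.db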

25 September 2022

Sergio Talens-Oliag: Kubernetes Static Content Server

This post describes how I've put together a simple static content server for kubernetes clusters using a Pod with a persistent volume and multiple containers: an sftp server to manage contents, a web server to publish them with optional access control and another one to run scripts which need access to the volume filesystem. The sftp server runs using MySecureShell, the web server is nginx and the script runner uses the webhook tool to publish endpoints to call them (the calls will come from other Pods that run backend servers or are executed from Jobs or CronJobs).

History The system was developed because we had a NodeJS API with endpoints to upload files and store them on S3 compatible services that were later accessed via HTTPS, but the requirements changed and we needed to be able to publish folders instead of individual files using their original names and apply access restrictions using our API. Thinking about our requirements, the use of a regular filesystem to keep the files and folders was a good option, as uploading and serving files is simple. For the upload I decided to use the sftp protocol, mainly because I already had an sftp container image based on mysecureshell prepared; once we settled on that we added sftp support to the API server and configured it to upload the files to our server instead of using S3 buckets. To publish the files we added an nginx container configured to work as a reverse proxy that uses the ngx_http_auth_request_module to validate access to the files (the sub request is configurable, in our deployment we have configured it to call our API to check if the user can access a given URL). Finally we added a third container when we needed to execute some tasks directly on the filesystem (using kubectl exec with the existing containers did not seem a good idea, as that is not supported by CronJobs objects, for example). The solution we found, avoiding the NIH Syndrome (i.e. writing our own tool), was to use the webhook tool to provide the endpoints to call the scripts; for now we have three:
  • one to get the disc usage of a PATH,
  • one to hardlink all the files that are identical on the filesystem,
  • one to copy files and folders from S3 buckets to our filesystem.

Container definitions

mysecureshell The mysecureshell container can be used to provide an sftp service with multiple users (although the files are owned by the same UID and GID) using standalone containers (launched with docker or podman) or in an orchestration system like kubernetes, as we are going to do here. The image is generated using the following Dockerfile:
ARG ALPINE_VERSION=3.16.2
FROM alpine:$ALPINE_VERSION as builder
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
RUN apk update &&\
 apk add --no-cache alpine-sdk git musl-dev &&\
 git clone https://github.com/sto/mysecureshell.git &&\
 cd mysecureshell &&\
 ./configure --prefix=/usr --sysconfdir=/etc --mandir=/usr/share/man\
 --localstatedir=/var --with-shutfile=/var/lib/misc/sftp.shut --with-debug=2 &&\
 make all && make install &&\
 rm -rf /var/cache/apk/*
FROM alpine:$ALPINE_VERSION
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
COPY --from=builder /usr/bin/mysecureshell /usr/bin/mysecureshell
COPY --from=builder /usr/bin/sftp-* /usr/bin/
RUN apk update &&\
 apk add --no-cache openssh shadow pwgen &&\
 sed -i -e "s ^.*\(AuthorizedKeysFile\).*$ \1 /etc/ssh/auth_keys/%u "\
 /etc/ssh/sshd_config &&\
 mkdir /etc/ssh/auth_keys &&\
 cat /dev/null > /etc/motd &&\
 add-shell '/usr/bin/mysecureshell' &&\
 rm -rf /var/cache/apk/*
COPY bin/* /usr/local/bin/
COPY etc/sftp_config /etc/ssh/
COPY entrypoint.sh /
EXPOSE 22
VOLUME /sftp
ENTRYPOINT ["/entrypoint.sh"]
CMD ["server"]
The /etc/sftp_config file is used to configure the mysecureshell server to have all the user homes under /sftp/data, only allow them to see the files under their home directories as if it were at the root of the server and close idle connections after 5m of inactivity:
etc/sftp_config
# Default mysecureshell configuration
<Default>
   # All users will have access their home directory under /sftp/data
   Home /sftp/data/$USER
   # Log to a file inside /sftp/logs/ (only works when the directory exists)
   LogFile /sftp/logs/mysecureshell.log
   # Force users to stay in their home directory
   StayAtHome true
   # Hide Home PATH, it will be shown as /
   VirtualChroot true
   # Hide real file/directory owner (just change displayed permissions)
   DirFakeUser true
   # Hide real file/directory group (just change displayed permissions)
   DirFakeGroup true
   # We do not want users to keep forever their idle connection
   IdleTimeOut 5m
</Default>
# vim: ts=2:sw=2:et
The entrypoint.sh script is responsible for preparing the container for the users included on the /secrets/user_pass.txt file (it creates the users with their HOME directories under /sftp/data and a /bin/false shell and creates the key files from /secrets/user_keys.txt if available). The script expects a couple of environment variables:
  • SFTP_UID: UID used to run the daemon and for all the files, it has to be different than 0 (all the files managed by this daemon are going to be owned by the same user and group, even if the remote users are different).
  • SFTP_GID: GID used to run the daemon and for all the files, it has to be different than 0.
And can use the SSH_PORT and SSH_PARAMS values if present. It also requires the following files (they can be mounted as secrets in kubernetes):
  • /secrets/host_keys.txt: Text file containing the ssh server keys in mime format; the file is processed using the reformime utility (the one included on busybox) and can be generated using the gen-host-keys script included on the container (it uses ssh-keygen and makemime).
  • /secrets/user_pass.txt: Text file containing lines of the form username:password_in_clear_text (only the users included on this file are available on the sftp server, in fact in our deployment we use only the scs user for everything).
And optionally can use another one:
  • /secrets/user_keys.txt: Text file that contains lines of the form username:public_ssh_ed25519_or_rsa_key; the public keys are installed on the server and can be used to log into the sftp server if the username exists on the user_pass.txt file.
The contents of the entrypoint.sh script are:
entrypoint.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
# Expects SSH_UID & SSH_GID on the environment and uses the value of the
# SSH_PORT & SSH_PARAMS variables if present
# SSH_PARAMS
SSH_PARAMS="-D -e -p $ SSH_PORT:=22  $ SSH_PARAMS "
# Fixed values
# DIRECTORIES
HOME_DIR="/sftp/data"
CONF_FILES_DIR="/secrets"
AUTH_KEYS_PATH="/etc/ssh/auth_keys"
# FILES
HOST_KEYS="$CONF_FILES_DIR/host_keys.txt"
USER_KEYS="$CONF_FILES_DIR/user_keys.txt"
USER_PASS="$CONF_FILES_DIR/user_pass.txt"
USER_SHELL_CMD="/usr/bin/mysecureshell"
# TYPES
HOST_KEY_TYPES="dsa ecdsa ed25519 rsa"
# ---------
# FUNCTIONS
# ---------
# Validate HOST_KEYS, USER_PASS, SFTP_UID and SFTP_GID
_check_environment() {
  # Check the ssh server keys ... we don't boot if we don't have them
  if [ ! -f "$HOST_KEYS" ]; then
    cat <<EOF
We need the host keys on the '$HOST_KEYS' file to proceed.
Call the 'gen-host-keys' script to create and export them on a mime file.
EOF
    exit 1
  fi
  # Check that we have users ... if we don't we can't continue
  if [ ! -f "$USER_PASS" ]; then
    cat <<EOF
We need at least the '$USER_PASS' file to provision users.
Call the 'gen-users-tar' script to create a tar file to create an archive that
contains public and private keys for users, a 'user_keys.txt' with the public
keys of the users and a 'user_pass.txt' file with random passwords for them 
(pass the list of usernames to it).
EOF
    exit 1
  fi
  # Check SFTP_UID
  if [ -z "$SFTP_UID" ]; then
    echo "The 'SFTP_UID' can't be empty, pass a 'GID'."
    exit 1
  fi
  if [ "$SFTP_UID" -eq "0" ]; then
    echo "The 'SFTP_UID' can't be 0, use a different 'UID'"
    exit 1
  fi
  # Check SFTP_GID
  if [ -z "$SFTP_GID" ]; then
    echo "The 'SFTP_GID' can't be empty, pass a 'GID'."
    exit 1
  fi
  if [ "$SFTP_GID" -eq "0" ]; then
    echo "The 'SFTP_GID' can't be 0, use a different 'GID'"
    exit 1
  fi
}
# Adjust ssh host keys
_setup_host_keys() {
  opwd="$(pwd)"
  tmpdir="$(mktemp -d)"
  cd "$tmpdir"
  ret="0"
  reformime <"$HOST_KEYS"   ret="1"
  for kt in $HOST_KEY_TYPES; do
    key="ssh_host_$ kt _key"
    pub="ssh_host_$ kt _key.pub"
    if [ ! -f "$key" ]; then
      echo "Missing '$key' file"
      ret="1"
    fi
    if [ ! -f "$pub" ]; then
      echo "Missing '$pub' file"
      ret="1"
    fi
    if [ "$ret" -ne "0" ]; then
      continue
    fi
    cat "$key" >"/etc/ssh/$key"
    chmod 0600 "/etc/ssh/$key"
    chown root:root "/etc/ssh/$key"
    cat "$pub" >"/etc/ssh/$pub"
    chmod 0600 "/etc/ssh/$pub"
    chown root:root "/etc/ssh/$pub"
  done
  cd "$opwd"
  rm -rf "$tmpdir"
  return "$ret"
 
# Create users
_setup_user_pass() {
  opwd="$(pwd)"
  tmpdir="$(mktemp -d)"
  cd "$tmpdir"
  ret="0"
  [ -d "$HOME_DIR" ]   mkdir "$HOME_DIR"
  # Make sure the data dir can be managed by the sftp user
  chown "$SFTP_UID:$SFTP_GID" "$HOME_DIR"
  # Allow the user (and root) to create directories inside the $HOME_DIR, if
  # we don't allow it the directory creation fails on EFS (AWS)
  chmod 0755 "$HOME_DIR"
  # Create users
  echo "sftp:sftp:$SFTP_UID:$SFTP_GID:::/bin/false" >"newusers.txt"
  sed -n "/^[^#]/   s/:/ /p  " "$USER_PASS"   while read -r _u _p; do
    echo "$_u:$_p:$SFTP_UID:$SFTP_GID::$HOME_DIR/$_u:$USER_SHELL_CMD"
  done >>"newusers.txt"
  newusers --badnames newusers.txt
  # Disable write permission on the directory to forbid remote sftp users to
  # remove their own root dir (they have already done it); we adjust that
  # here to avoid issues with EFS (see before)
  chmod 0555 "$HOME_DIR"
  # Clean up the tmpdir
  cd "$opwd"
  rm -rf "$tmpdir"
  return "$ret"
 
# Adjust user keys
_setup_user_keys() {
  if [ -f "$USER_KEYS" ]; then
    sed -n "/^[^#]/   s/:/ /p  " "$USER_KEYS"   while read -r _u _k; do
      echo "$_k" >>"$AUTH_KEYS_PATH/$_u"
    done
  fi
}
# Main function
exec_sshd() {
  _check_environment
  _setup_host_keys
  _setup_user_pass
  _setup_user_keys
  echo "Running: /usr/sbin/sshd $SSH_PARAMS"
  # shellcheck disable=SC2086
  exec /usr/sbin/sshd -D $SSH_PARAMS
}
# ----
# MAIN
# ----
case "$1" in
"server") exec_sshd ;;
*) exec "$@" ;;
esac
# vim: ts=2:sw=2:et
The container also includes a couple of auxiliary scripts, the first one can be used to generate the host_keys.txt file as follows:
$ docker run --rm stodh/mysecureshell gen-host-keys > host_keys.txt
Where the script is as simple as:
bin/gen-host-keys
#!/bin/sh
set -e
# Generate new host keys
ssh-keygen -A >/dev/null
# Replace hostname
sed -i -e 's/@.*$/@mysecureshell/' /etc/ssh/ssh_host_*_key.pub
# Print in mime format (stdout)
makemime /etc/ssh/ssh_host_*
# vim: ts=2:sw=2:et
And there is another script to generate a .tar file that contains auth data for the list of usernames passed to it (the file contains a user_pass.txt file with random passwords for the users, public and private ssh keys for them and the user_keys.txt file that matches the generated keys). To generate a tar file for the user scs we can execute the following:
$ docker run --rm stodh/mysecureshell gen-users-tar scs > /tmp/scs-users.tar
To see the contents and the text inside the user_pass.txt file we can do:
$ tar tvf /tmp/scs-users.tar
-rw-r--r-- root/root        21 2022-09-11 15:55 user_pass.txt
-rw-r--r-- root/root       822 2022-09-11 15:55 user_keys.txt
-rw------- root/root       387 2022-09-11 15:55 id_ed25519-scs
-rw-r--r-- root/root        85 2022-09-11 15:55 id_ed25519-scs.pub
-rw------- root/root      3357 2022-09-11 15:55 id_rsa-scs
-rw------- root/root      3243 2022-09-11 15:55 id_rsa-scs.pem
-rw-r--r-- root/root       729 2022-09-11 15:55 id_rsa-scs.pub
$ tar xfO /tmp/scs-users.tar user_pass.txt
scs:20JertRSX2Eaar4x
The source of the script is:
bin/gen-users-tar
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
USER_KEYS_FILE="user_keys.txt"
USER_PASS_FILE="user_pass.txt"
# ---------
# MAIN CODE
# ---------
# Generate user passwords and keys, return 1 if no username is received
if [ "$#" -eq "0" ]; then
  return 1
fi
opwd="$(pwd)"
tmpdir="$(mktemp -d)"
cd "$tmpdir"
for u in "$@"; do
  ssh-keygen -q -a 100 -t ed25519 -f "id_ed25519-$u" -C "$u" -N ""
  ssh-keygen -q -a 100 -b 4096 -t rsa -f "id_rsa-$u" -C "$u" -N ""
  # Legacy RSA private key format
  cp -a "id_rsa-$u" "id_rsa-$u.pem"
  ssh-keygen -q -p -m pem -f "id_rsa-$u.pem" -N "" -P "" >/dev/null
  chmod 0600 "id_rsa-$u.pem"
  echo "$u:$(pwgen -s 16 1)" >>"$USER_PASS_FILE"
  echo "$u:$(cat "id_ed25519-$u.pub")" >>"$USER_KEYS_FILE"
  echo "$u:$(cat "id_rsa-$u.pub")" >>"$USER_KEYS_FILE"
done
tar cf - "$USER_PASS_FILE" "$USER_KEYS_FILE" id_* 2>/dev/null
cd "$opwd"
rm -rf "$tmpdir"
# vim: ts=2:sw=2:et
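Putting the pieces together, a quick standalone test of the image outside kubernetes could look like this; the port, paths and IDs are just an example, and the host_keys.txt, user_pass.txt and user_keys.txt files generated above are expected under ./secrets:
docker run --rm -p 2022:22 \
  -e SFTP_UID=1000 -e SFTP_GID=1000 \
  -v "$(pwd)/secrets:/secrets:ro" \
  -v "$(pwd)/sftp:/sftp" \
  stodh/mysecureshell
sftp -P 2022 scs@localhost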

nginx-scs The nginx-scs container is generated using the following Dockerfile:
ARG NGINX_VERSION=1.23.1
FROM nginx:$NGINX_VERSION
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
RUN rm -f /docker-entrypoint.d/*
COPY docker-entrypoint.d/* /docker-entrypoint.d/
Basically we are removing the existing docker-entrypoint.d scripts from the standard image and adding a new one that configures the web server as we want using a couple of environment variables:
  • AUTH_REQUEST_URI: URL to use for the auth_request, if the variable is not found on the environment auth_request is not used.
  • HTML_ROOT: Base directory of the web server, if not passed the default /usr/share/nginx/html is used.
Note that if we don't pass the variables everything works as if we were using the original nginx image. The contents of the configuration script are:
docker-entrypoint.d/10-update-default-conf.sh
#!/bin/sh
# Replace the default.conf nginx file by our own version.
set -e
if [ -z "$HTML_ROOT" ]; then
  HTML_ROOT="/usr/share/nginx/html"
fi
if [ "$AUTH_REQUEST_URI" ]; then
  cat >/etc/nginx/conf.d/default.conf <<EOF
server {
  listen       80;
  server_name  localhost;
  location / {
    auth_request /.auth;
    root  $HTML_ROOT;
    index index.html index.htm;
  }
  location /.auth {
    internal;
    proxy_pass $AUTH_REQUEST_URI;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-URI \$request_uri;
  }
  error_page   500 502 503 504  /50x.html;
  location = /50x.html {
    root /usr/share/nginx/html;
  }
}
EOF
else
  cat >/etc/nginx/conf.d/default.conf <<EOF
server {
  listen       80;
  server_name  localhost;
  location / {
    root  $HTML_ROOT;
    index index.html index.htm;
  }
  error_page   500 502 503 504  /50x.html;
  location = /50x.html {
    root /usr/share/nginx/html;
  }
}
EOF
fi
# vim: ts=2:sw=2:et
As we will see later, the idea is to use the /sftp/data or /sftp/data/scs folder as the root of the content published by this container and create an Ingress object to provide access to it outside of our kubernetes cluster.
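As a reference, a standalone run of this image publishing the sftp data folder with the auth request enabled could look like the following; the image tag, URL and paths are placeholders, not values from our deployment:
docker run --rm -p 8080:80 \
  -e HTML_ROOT=/sftp/data/scs \
  -e AUTH_REQUEST_URI=http://my-api.example.com/auth \
  -v "$(pwd)/sftp:/sftp:ro" \
  nginx-scs:latest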

webhook-scs The webhook-scs container is generated using the following Dockerfile:
ARG ALPINE_VERSION=3.16.2
ARG GOLANG_VERSION=alpine3.16
FROM golang:$GOLANG_VERSION AS builder
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
ENV WEBHOOK_VERSION 2.8.0
ENV WEBHOOK_PR 549
ENV S3FS_VERSION v1.91
WORKDIR /go/src/github.com/adnanh/webhook
RUN apk update &&\
 apk add --no-cache -t build-deps curl libc-dev gcc libgcc patch
RUN curl -L --silent -o webhook.tar.gz\
 https://github.com/adnanh/webhook/archive/${WEBHOOK_VERSION}.tar.gz &&\
 tar xzf webhook.tar.gz --strip 1 &&\
 curl -L --silent -o ${WEBHOOK_PR}.patch\
 https://patch-diff.githubusercontent.com/raw/adnanh/webhook/pull/${WEBHOOK_PR}.patch &&\
 patch -p1 < ${WEBHOOK_PR}.patch &&\
 go get -d && \
 go build -o /usr/local/bin/webhook
WORKDIR /src/s3fs-fuse
RUN apk update &&\
 apk add ca-certificates build-base alpine-sdk libcurl automake autoconf\
 libxml2-dev libressl-dev mailcap fuse-dev curl-dev
RUN curl -L --silent -o s3fs.tar.gz\
 https://github.com/s3fs-fuse/s3fs-fuse/archive/refs/tags/$S3FS_VERSION.tar.gz &&\
 tar xzf s3fs.tar.gz --strip 1 &&\
 ./autogen.sh &&\
 ./configure --prefix=/usr/local &&\
 make -j && \
 make install
FROM alpine:$ALPINE_VERSION
LABEL maintainer="Sergio Talens-Oliag <sto@mixinet.net>"
WORKDIR /webhook
RUN apk update &&\
 apk add --no-cache ca-certificates mailcap fuse libxml2 libcurl libgcc\
 libstdc++ rsync util-linux-misc &&\
 rm -rf /var/cache/apk/*
COPY --from=builder /usr/local/bin/webhook /usr/local/bin/webhook
COPY --from=builder /usr/local/bin/s3fs /usr/local/bin/s3fs
COPY entrypoint.sh /
COPY hooks/* ./hooks/
EXPOSE 9000
ENTRYPOINT ["/entrypoint.sh"]
CMD ["server"]
Again, we use a multi-stage build because in production we wanted to support functionality that is not yet available in the official releases (streaming the command output as a response instead of waiting until the execution ends); this time, instead of creating a fork, we build the image by applying the patch included in this pull request against a released version of the source. The entrypoint.sh script is used to generate the webhook configuration file for the existing hooks using environment variables (basically the WEBHOOK_WORKDIR and the *_TOKEN variables) and launch the webhook service:
entrypoint.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
WEBHOOK_BIN="$ WEBHOOK_BIN:-/webhook/hooks "
WEBHOOK_YML="$ WEBHOOK_YML:-/webhook/scs.yml "
WEBHOOK_OPTS="$ WEBHOOK_OPTS:--verbose "
# ---------
# FUNCTIONS
# ---------
print_du_yml() {
  cat <<EOF
- id: du
  execute-command: '$WEBHOOK_BIN/du.sh'
  command-working-directory: '$WORKDIR'
  response-headers:
  - name: 'Content-Type'
    value: 'application/json'
  http-methods: ['GET']
  include-command-output-in-response: true
  include-command-output-in-response-on-error: true
  pass-arguments-to-command:
  - source: 'url'
    name: 'path'
  pass-environment-to-command:
  - source: 'string'
    envname: 'OUTPUT_FORMAT'
    name: 'json'
EOF
}
print_hardlink_yml() {
  cat <<EOF
- id: hardlink
  execute-command: '$WEBHOOK_BIN/hardlink.sh'
  command-working-directory: '$WORKDIR'
  http-methods: ['GET']
  include-command-output-in-response: true
  include-command-output-in-response-on-error: true
EOF
}
print_s3sync_yml() {
  cat <<EOF
- id: s3sync
  execute-command: '$WEBHOOK_BIN/s3sync.sh'
  command-working-directory: '$WORKDIR'
  http-methods: ['POST']
  include-command-output-in-response: true
  include-command-output-in-response-on-error: true
  pass-environment-to-command:
  - source: 'payload'
    envname: 'AWS_KEY'
    name: 'aws.key'
  - source: 'payload'
    envname: 'AWS_SECRET_KEY'
    name: 'aws.secret_key'
  - source: 'payload'
    envname: 'S3_BUCKET'
    name: 's3.bucket'
  - source: 'payload'
    envname: 'S3_REGION'
    name: 's3.region'
  - source: 'payload'
    envname: 'S3_PATH'
    name: 's3.path'
  - source: 'payload'
    envname: 'SCS_PATH'
    name: 'scs.path'
  stream-command-output: true
EOF
}
print_token_yml() {
  if [ "$1" ]; then
    cat << EOF
  trigger-rule:
    match:
      type: 'value'
      value: '$1'
      parameter:
        source: 'header'
        name: 'X-Webhook-Token'
EOF
  fi
}
exec_webhook() {
  # Validate WORKDIR
  if [ -z "$WEBHOOK_WORKDIR" ]; then
    echo "Must define the WEBHOOK_WORKDIR variable!" >&2
    exit 1
  fi
  WORKDIR="$(realpath "$WEBHOOK_WORKDIR" 2>/dev/null)"   true
  if [ ! -d "$WORKDIR" ]; then
    echo "The WEBHOOK_WORKDIR '$WEBHOOK_WORKDIR' is not a directory!" >&2
    exit 1
  fi
  # Get TOKENS, if the DU_TOKEN or HARDLINK_TOKEN is defined that is used, if
  # not if the COMMON_TOKEN that is used and in other case no token is checked
  # (that is the default)
  DU_TOKEN="$ DU_TOKEN:-$COMMON_TOKEN "
  HARDLINK_TOKEN="$ HARDLINK_TOKEN:-$COMMON_TOKEN "
  S3_TOKEN="$ S3_TOKEN:-$COMMON_TOKEN "
  # Create webhook configuration
  {
    print_du_yml
    print_token_yml "$DU_TOKEN"
    echo ""
    print_hardlink_yml
    print_token_yml "$HARDLINK_TOKEN"
    echo ""
    print_s3sync_yml
    print_token_yml "$S3_TOKEN"
   >"$WEBHOOK_YML"
  # Run the webhook command
  # shellcheck disable=SC2086
  exec webhook -hooks "$WEBHOOK_YML" $WEBHOOK_OPTS
}
# ----
# MAIN
# ----
case "$1" in
"server") exec_webhook ;;
*) exec "$@" ;;
esac
The entrypoint.sh script generates the configuration file for the webhook server by calling functions that print a yaml section for each hook, optionally adding rules that validate access to them by comparing the value of an X-Webhook-Token header against predefined values. The expected token values are taken from environment variables: we can define a token variable for each hook (DU_TOKEN, HARDLINK_TOKEN or S3_TOKEN) and a fallback value (COMMON_TOKEN); if no token variable is defined for a hook no check is done and everybody can call it. The Hook Definition documentation explains the options you can use for each hook; the ones we have right now do the following (example calls are shown after this list):
  • du: runs on the $WORKDIR directory, passes as first argument to the script the value of the path query parameter and sets the variable OUTPUT_FORMAT to the fixed value json (we use that to print the output of the script in JSON format instead of text).
  • hardlink: runs on the $WORKDIR directory and takes no parameters.
  • s3sync: runs on the $WORKDIR directory and sets a lot of environment variables from values read from the JSON encoded payload sent by the caller (all the values must be sent by the caller even if they are assigned an empty value, if they are missing the hook fails without calling the script); we also set the stream-command-output value to true to make the script show its output as it is working (we patched the webhook source to be able to use this option).
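Assuming webhook's defaults (hooks served under /hooks/<id> on port 9000) and the tokens configured as described above, the du and hardlink hooks can be exercised with something like this; the host name is a placeholder:
curl -H "X-Webhook-Token: $DU_TOKEN" "http://webhook-host:9000/hooks/du?path=scs"
curl -H "X-Webhook-Token: $HARDLINK_TOKEN" "http://webhook-host:9000/hooks/hardlink"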

The du hook script The du hook script code checks if the argument passed is a directory, computes its size using the du command and prints the results in text format or as a JSON dictionary:
hooks/du.sh
#!/bin/sh
set -e
# Script to print disk usage for a PATH inside the scs folder
# ---------
# FUNCTIONS
# ---------
print_error() {
  if [ "$OUTPUT_FORMAT" = "json" ]; then
    echo " \"error\":\"$*\" "
  else
    echo "$*" >&2
  fi
  exit 1
}
usage() {
  if [ "$OUTPUT_FORMAT" = "json" ]; then
    echo " \"error\":\"Pass arguments as '?path=XXX\" "
  else
    echo "Usage: $(basename "$0") PATH" >&2
  fi
  exit 1
}
# ----
# MAIN
# ----
if [ "$#" -eq "0" ]   [ -z "$1" ]; then
  usage
fi
if [ "$1" = "." ]; then
  DU_PATH="./"
else
  DU_PATH="$(find . -name "$1" -mindepth 1 -maxdepth 1)"   true
fi
if [ -z "$DU_PATH" ]   [ ! -d "$DU_PATH/." ]; then
  print_error "The provided PATH ('$1') is not a directory"
fi
# Print disk usage in bytes for the given PATH
OUTPUT="$(du -b -s "$DU_PATH")"
if [ "$OUTPUT_FORMAT" = "json" ]; then
  # Format output as {"path":"PATH","bytes":"BYTES"}
  echo "$OUTPUT" |
    sed -e "s%^\(.*\)\t.*/\(.*\)$%{\"path\":\"\2\",\"bytes\":\"\1\"}%" |
    tr -d '\n'
else
  # Print du output as is
  echo "$OUTPUT"
fi
# vim: ts=2:sw=2:et:ai:sts=2

The s3sync hook script The s3sync hook script uses the s3fs tool to mount a bucket and synchronise data between a folder inside the bucket and a directory on the filesystem using rsync; all values needed to execute the task are taken from environment variables:
hooks/s3sync.sh
#!/bin/ash
set -euo pipefail
set -o errexit
set -o errtrace
# Functions
finish() {
  ret="$1"
  echo ""
  echo "Script exit code: $ret"
  exit "$ret"
}
# Check variables
if [ -z "$AWS_KEY" ] || [ -z "$AWS_SECRET_KEY" ] || [ -z "$S3_BUCKET" ] ||
  [ -z "$S3_PATH" ] || [ -z "$SCS_PATH" ]; then
  [ "$AWS_KEY" ] || echo "Set the AWS_KEY environment variable"
  [ "$AWS_SECRET_KEY" ] || echo "Set the AWS_SECRET_KEY environment variable"
  [ "$S3_BUCKET" ] || echo "Set the S3_BUCKET environment variable"
  [ "$S3_PATH" ] || echo "Set the S3_PATH environment variable"
  [ "$SCS_PATH" ] || echo "Set the SCS_PATH environment variable"
  finish 1
fi
if [ "$S3_REGION" ] && [ "$S3_REGION" != "us-east-1" ]; then
  EP_URL="endpoint=$S3_REGION,url=https://s3.$S3_REGION.amazonaws.com"
else
  EP_URL="endpoint=us-east-1"
fi
# Prepare working directory
WORK_DIR="$(mktemp -p "$HOME" -d)"
MNT_POINT="$WORK_DIR/s3data"
PASSWD_S3FS="$WORK_DIR/.passwd-s3fs"
# Check the moutpoint
if [ ! -d "$MNT_POINT" ]; then
  mkdir -p "$MNT_POINT"
elif mountpoint "$MNT_POINT"; then
  echo "There is already something mounted on '$MNT_POINT', aborting!"
  finish 1
fi
# Create password file
touch "$PASSWD_S3FS"
chmod 0400 "$PASSWD_S3FS"
echo "$AWS_KEY:$AWS_SECRET_KEY" >"$PASSWD_S3FS"
# Mount s3 bucket as a filesystem
s3fs -o dbglevel=info,retries=5 -o "$EP_URL" -o "passwd_file=$PASSWD_S3FS" \
  "$S3_BUCKET" "$MNT_POINT"
echo "Mounted bucket '$S3_BUCKET' on '$MNT_POINT'"
# Remove the password file, just in case
rm -f "$PASSWD_S3FS"
# Check source PATH
ret="0"
SRC_PATH="$MNT_POINT/$S3_PATH"
if [ ! -d "$SRC_PATH" ]; then
  echo "The S3_PATH '$S3_PATH' can't be found!"
  ret=1
fi
# Compute SCS_UID & SCS_GID (by default based on the working directory owner)
SCS_UID="${SCS_UID:=$(stat -c "%u" "." 2>/dev/null)}" || true
SCS_GID="${SCS_GID:=$(stat -c "%g" "." 2>/dev/null)}" || true
# Check destination PATH
DST_PATH="./$SCS_PATH"
if [ "$ret" -eq "0" ] && [ ! -d "$DST_PATH" ]; then
  mkdir -p "$DST_PATH" || ret="$?"
fi
# Copy using rsync
if [ "$ret" -eq "0" ]; then
  rsync -rlptv --chown="$SCS_UID:$SCS_GID" --delete --stats \
    "$SRC_PATH/" "$DST_PATH/" || ret="$?"
fi
# Unmount the S3 bucket
umount -f "$MNT_POINT"
echo "Called umount for '$MNT_POINT'"
# Remove mount point dir
rmdir "$MNT_POINT"
# Remove WORK_DIR
rmdir "$WORK_DIR"
# We are done
finish "$ret"
# vim: ts=2:sw=2:et:ai:sts=2

Deployment objects The system is deployed as a StatefulSet with one replica. Our production deployment is done on AWS and to be able to scale we use EFS for our PersistentVolume; the idea is that the volume has no size limit, its AccessMode can be set to ReadWriteMany and we can mount it from multiple instances of the Pod without issues, even if they are in different availability zones. For development we use k3d and we are also able to scale the StatefulSet for testing because we use a ReadWriteOnce PVC, but it points to a hostPath that is backed by a folder that is mounted on all the compute nodes, so in reality Pods in different k3d nodes use the same folder on the host.

secrets.yaml The secrets file contains the files used by the mysecureshell container; they can be generated using kubernetes pods as follows (we are only creating the scs user):
$ kubectl run "mysecureshell" --restart='Never' --quiet --rm --stdin \
  --image "stodh/mysecureshell:latest" -- gen-host-keys >"./host_keys.txt"
$ kubectl run "mysecureshell" --restart='Never' --quiet --rm --stdin \
  --image "stodh/mysecureshell:latest" -- gen-users-tar scs >"./users.tar"
Once we have the files we can generate the secrets.yaml file as follows:
$ tar xf ./users.tar user_keys.txt user_pass.txt
$ kubectl --dry-run=client -o yaml create secret generic "scs-secrets" \
  --from-file="host_keys.txt=host_keys.txt" \
  --from-file="user_keys.txt=user_keys.txt" \
  --from-file="user_pass.txt=user_pass.txt" > ./secrets.yaml
The resulting secrets.yaml will look like the following file (the base64 would match the content of the files, of course):
secrets.yaml
apiVersion: v1
data:
  host_keys.txt: TWlt...
  user_keys.txt: c2Nz...
  user_pass.txt: c2Nz...
kind: Secret
metadata:
  creationTimestamp: null
  name: scs-secrets

pvc.yaml The persistent volume claim for a simple deployment (one with only one instance of the statefulSet) can be as simple as this:
pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scs-pvc
  labels:
    app.kubernetes.io/name: scs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
In this definition we don't set the storageClassName, so the default one is used.

Volumes in our development environment (k3d) In our development deployment we create the following PersistentVolume as required by the Local Persistence Volume Static Provisioner (note that /volumes/scs-pv has to be created by hand; in our k3d system we mount the same host directory on the /volumes path of all the nodes and create the scs-pv directory by hand before deploying the persistent volume):
k3d-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: scs-pv
  labels:
    app.kubernetes.io/name: scs
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  claimRef:
    name: scs-pvc
  storageClassName: local-storage
  local:
    path: /volumes/scs-pv
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - k3s
And to make sure that everything works as expected we update the PVC definition to add the right storageClassName:
k3d-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scs-pvc
  labels:
    app.kubernetes.io/name: scs
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
  storageClassName: local-storage

Volumes in our production environment (aws) In the production deployment we don't create the PersistentVolume (we are using the aws-efs-csi-driver which supports Dynamic Provisioning) but we add the storageClassName (we set it to the one mapped to the EFS driver, i.e. efs-sc) and set ReadWriteMany as the accessMode:
efs-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scs-pvc
  labels:
    app.kubernetes.io/name: scs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 8Gi
  storageClassName: efs-sc

statefulset.yaml The definition of the statefulSet is as follows:
statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: scs
  labels:
    app.kubernetes.io/name: scs
spec:
  serviceName: scs
  replicas: 1
  selector:
    matchLabels:
      app: scs
  template:
    metadata:
      labels:
        app: scs
    spec:
      containers:
      - name: nginx
        image: stodh/nginx-scs:latest
        ports:
        - containerPort: 80
          name: http
        env:
        - name: AUTH_REQUEST_URI
          value: ""
        - name: HTML_ROOT
          value: /sftp/data
        volumeMounts:
        - mountPath: /sftp
          name: scs-datadir
      - name: mysecureshell
        image: stodh/mysecureshell:latest
        ports:
        - containerPort: 22
          name: ssh
        securityContext:
          capabilities:
            add:
            - IPC_OWNER
        env:
        - name: SFTP_UID
          value: '2020'
        - name: SFTP_GID
          value: '2020'
        volumeMounts:
        - mountPath: /secrets
          name: scs-file-secrets
          readOnly: true
        - mountPath: /sftp
          name: scs-datadir
      - name: webhook
        image: stodh/webhook-scs:latest
        securityContext:
          privileged: true
        ports:
        - containerPort: 9000
          name: webhook-http
        env:
        - name: WEBHOOK_WORKDIR
          value: /sftp/data/scs
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - mountPath: /sftp
          name: scs-datadir
      volumes:
      - name: devfuse
        hostPath:
          path: /dev/fuse
      - name: scs-file-secrets
        secret:
          secretName: scs-secrets
      - name: scs-datadir
        persistentVolumeClaim:
          claimName: scs-pvc
Notes about the containers:
  • nginx: As this is an example the web server is not using an AUTH_REQUEST_URI and uses the /sftp/data directory as the root of the web (to get to the files uploaded for the scs user we will need to use /scs/ as a prefix on the URLs).
  • mysecureshell: We are adding the IPC_OWNER capability to the container to be able to use some of the sftp-* commands inside it, but they are not really needed, so adding the capability is optional.
  • webhook: We are launching this container in privileged mode to be able to use s3fs-fuse, as it will not work otherwise for now (see this kubernetes issue); if the functionality is not needed the container can be executed with regular privileges; besides, as we are not enabling public access to this service we don't define *_TOKEN variables (if required the values should be read from a Secret object, as sketched after these notes).
Notes about the volumes:
  • the devfuse volume is only needed if we plan to use the s3fs command on the webhook container; if not, we can remove the volume definition and its mounts.
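If token checks were enabled, a minimal sketch of how the webhook container could take the value from a Secret instead of a plain environment variable (the scs-webhook-tokens Secret and its key are hypothetical, they are not part of the objects defined in this post) would be:
        env:
        - name: COMMON_TOKEN
          valueFrom:
            secretKeyRef:
              name: scs-webhook-tokens
              key: common-token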

service.yaml To be able to access the different services on the statefulset we publish the relevant ports using the following Service object:
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: scs-svc
  labels:
    app.kubernetes.io/name: scs
spec:
  ports:
  - name: ssh
    port: 22
    protocol: TCP
    targetPort: 22
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  - name: webhook-http
    port: 9000
    protocol: TCP
    targetPort: 9000
  selector:
    app: scs

ingress.yaml To download the scs files from the outside we can add an ingress object like the following (the definition is for testing using the localhost name):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: scs-ingress
  labels:
    app.kubernetes.io/name: scs
spec:
  ingressClassName: nginx
  rules:
  - host: 'localhost'
    http:
      paths:
      - path: /scs
        pathType: Prefix
        backend:
          service:
            name: scs-svc
            port:
              number: 80

Deployment To deploy the statefulSet we create a namespace and apply the object definitions shown before:
$ kubectl create namespace scs-demo
namespace/scs-demo created
$ kubectl -n scs-demo apply -f secrets.yaml
secret/scs-secrets created
$ kubectl -n scs-demo apply -f pvc.yaml
persistentvolumeclaim/scs-pvc created
$ kubectl -n scs-demo apply -f statefulset.yaml
statefulset.apps/scs created
$ kubectl -n scs-demo apply -f service.yaml
service/scs-svc created
$ kubectl -n scs-demo apply -f ingress.yaml
ingress.networking.k8s.io/scs-ingress created
Once the objects are deployed we can check that all is working using kubectl:
$ kubectl  -n scs-demo get all,secrets,ingress
NAME        READY   STATUS    RESTARTS   AGE
pod/scs-0   3/3     Running   0          24s
NAME            TYPE       CLUSTER-IP  EXTERNAL-IP  PORT(S)                  AGE
service/scs-svc ClusterIP  10.43.0.47  <none>       22/TCP,80/TCP,9000/TCP   21s

NAME                   READY   AGE
statefulset.apps/scs   1/1     24s
NAME                         TYPE                                  DATA   AGE
secret/default-token-mwcd7   kubernetes.io/service-account-token   3      53s
secret/scs-secrets           Opaque                                3      39s
NAME                                   CLASS  HOSTS      ADDRESS     PORTS   AGE
ingress.networking.k8s.io/scs-ingress  nginx  localhost  172.21.0.5  80      17s
At this point we are ready to use the system.

Usage examples

File uploads As previously mentioned, in our system the idea is to use the sftp server from other Pods, but to test it we are going to do a kubectl port-forward and connect to the server using our host client and the password we have generated (it is in the user_pass.txt file, inside the users.tar archive):
$ kubectl -n scs-demo port-forward service/scs-svc 2020:22 &
Forwarding from 127.0.0.1:2020 -> 22
Forwarding from [::1]:2020 -> 22
$ PF_PID=$!
$ sftp -P 2020 scs@127.0.0.1                                                 1
Handling connection for 2020
The authenticity of host '[127.0.0.1]:2020 ([127.0.0.1]:2020)' can't be \
  established.
ED25519 key fingerprint is SHA256:eHNwCnyLcSSuVXXiLKeGraw0FT/4Bb/yjfqTstt+088.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '[127.0.0.1]:2020' (ED25519) to the list of known \
  hosts.
scs@127.0.0.1's password: **********
Connected to 127.0.0.1.
sftp> ls -la
drwxr-xr-x    2 sftp     sftp         4096 Sep 25 14:47 .
dr-xr-xr-x    3 sftp     sftp         4096 Sep 25 14:36 ..
sftp> !date -R > /tmp/date.txt                                               2
sftp> put /tmp/date.txt .
Uploading /tmp/date.txt to /date.txt
date.txt                                      100%   32    27.8KB/s   00:00
sftp> ls -l
-rw-r--r--    1 sftp     sftp           32 Sep 25 15:21 date.txt
sftp> ln date.txt date.txt.1                                                 3
sftp> ls -l
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt.1
sftp> put /tmp/date.txt date.txt.2                                           4
Uploading /tmp/date.txt to /date.txt.2
date.txt                                      100%   32    27.8KB/s   00:00
sftp> ls -l                                                                  5
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt
-rw-r--r--    2 sftp     sftp           32 Sep 25 15:21 date.txt.1
-rw-r--r--    1 sftp     sftp           32 Sep 25 15:21 date.txt.2
sftp> exit
$ kill "$PF_PID"
[1]  + terminated  kubectl -n scs-demo port-forward service/scs-svc 2020:22
  1. We connect to the sftp service on the forwarded port with the scs user.
  2. We upload a file we created on the host to the remote directory.
  3. We do a hard link of the uploaded file.
  4. We put a second copy of the file we created locally.
  5. In the file list we can see that the first two files have two hard links.

File retrievals If our ingress is configured right we can download the date.txt file from the URL http://localhost/scs/date.txt:
$ curl -s http://localhost/scs/date.txt
Sun, 25 Sep 2022 17:21:51 +0200

Use of the webhook container To finish this post we are going to show how we can call the hooks directly, from a CronJob and from a Job.

Direct script call (du) In our deployment the direct calls are done from other Pods; to simulate it we are going to do a port-forward and call the script with an existing PATH (the root directory) and a bad one:
$ kubectl -n scs-demo port-forward service/scs-svc 9000:9000 >/dev/null &
$ PF_PID=$!
$ JSON="$(curl -s "http://localhost:9000/hooks/du?path=.")"
$ echo $JSON
{"path":"","bytes":"4160"}
$ JSON="$(curl -s "http://localhost:9000/hooks/du?path=foo")"
$ echo $JSON
{"error":"The provided PATH ('foo') is not a directory"}
$ kill $PF_PID
As we only have files in the base directory we print the disk usage of the . PATH; the output is in JSON format because we export OUTPUT_FORMAT with the value json in the webhook configuration.

Jobs (s3sync) The following Job can be used to synchronise the contents of a directory in an S3 bucket with the SCS filesystem:
webhook-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: s3sync
  labels:
    cronjob: 's3sync'
spec:
  template:
    metadata:
      labels:
        cronjob: 's3sync'
    spec:
      containers:
      - name: s3sync-job
        image: alpine:latest
        command: 
        - "wget"
        - "-q"
        - "--header"
        - "Content-Type: application/json"
        - "--post-file"
        - "/secrets/s3sync.json"
        - "-O-"
        - "http://scs-svc:9000/hooks/s3sync"
        volumeMounts:
        - mountPath: /secrets
          name: job-secrets
          readOnly: true
      restartPolicy: Never
      volumes:
      - name: job-secrets
        secret:
          secretName: webhook-job-secrets
The file with parameters for the script must be something like this:
s3sync.json
{
  "aws": {
    "key": "********************",
    "secret_key": "****************************************"
  },
  "s3": {
    "region": "eu-north-1",
    "bucket": "blogops-test",
    "path": "test"
  },
  "scs": {
    "path": "test"
  }
}
Once we have both files we can run the Job as follows:
$ kubectl -n scs-demo create secret generic webhook-job-secrets \            1
  --from-file="s3sync.json=s3sync.json"
secret/webhook-job-secrets created
$ kubectl -n scs-demo apply -f webhook-job.yaml                              2
job.batch/s3sync created
$ kubectl -n scs-demo get pods -l "cronjob=s3sync"                           3
NAME           READY   STATUS      RESTARTS   AGE
s3sync-zx2cj   0/1     Completed   0          12s
$ kubectl -n scs-demo logs s3sync-zx2cj                                      4
Mounted bucket 's3fs-test' on '/root/tmp.jiOjaF/s3data'
sending incremental file list
created directory ./test
./
kyso.png
Number of files: 2 (reg: 1, dir: 1)
Number of created files: 2 (reg: 1, dir: 1)
Number of deleted files: 0
Number of regular files transferred: 1
Total file size: 15,075 bytes
Total transferred file size: 15,075 bytes
Literal data: 15,075 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.147 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 15,183
Total bytes received: 74
sent 15,183 bytes  received 74 bytes  30,514.00 bytes/sec
total size is 15,075  speedup is 0.99
Called umount for '/root/tmp.jiOjaF/s3data'
Script exit code: 0
$ kubectl -n scs-demo delete -f webhook-job.yaml                             5
job.batch "s3sync" deleted
$ kubectl -n scs-demo delete secrets webhook-job-secrets                     6
secret "webhook-job-secrets" deleted
  1. Here we create the webhook-job-secrets secret that contains the s3sync.json file.
  2. This command runs the job.
  3. Checking the label cronjob=s3sync we get the Pods executed by the job.
  4. Here we print the logs of the completed job.
  5. Once we are finished we remove the Job.
  6. And also the secret.
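The post also mentioned calling the hooks from a CronJob; a minimal sketch of what a CronJob for the hardlink hook could look like follows (the schedule and the object names are made up; no secret is needed because that hook takes no parameters):
cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hardlink
  labels:
    cronjob: 'hardlink'
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            cronjob: 'hardlink'
        spec:
          containers:
          - name: hardlink-cronjob
            image: alpine:latest
            command:
            - "wget"
            - "-q"
            - "-O-"
            - "http://scs-svc:9000/hooks/hardlink"
          restartPolicy: Never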

Final remarks This post has been longer than I expected, but I believe it can be useful for someone; in any case, next time I'll try to explain something shorter or will split it into multiple entries.

20 September 2022

Simon Josefsson: Privilege separation of GSS-API credentials for Apache

To protect web resources with Kerberos you may use Apache HTTPD with mod_auth_gssapi; however, all web scripts (e.g., PHP) run under Apache will have access to the Kerberos long-term symmetric secret credential (keytab). If someone can get it, they can impersonate your server, which is bad. The gssproxy project makes it possible to introduce privilege separation to reduce the attack surface. There is a tutorial for RPM-based distributions (Fedora, RHEL, AlmaLinux, etc), but I wanted to get this to work on a DPKG-based distribution (Debian, Ubuntu, Trisquel, PureOS, etc) and found it worthwhile to document the process. I'm using Ubuntu 22.04 below, but have tested it on Debian 11 as well. I have adopted the gssproxy package in Debian, and testing this setup is part of the scripted autopkgtest/debci regression testing. First install the required packages:
root@foo:~# apt-get update
root@foo:~# apt-get install -y apache2 libapache2-mod-auth-gssapi gssproxy curl
This should give you a working and running web server. Verify it is operational under the proper hostname; I'll use foo.sjd.se in this writeup.
root@foo:~# curl --head http://foo.sjd.se/
HTTP/1.1 200 OK
The next step is to create a keytab containing the Kerberos V5 secrets for your host; the exact steps depend on your environment (usually kadmin ktadd or ipa-getkeytab), but use the string HTTP/foo.sjd.se and then confirm using something like the following.
root@foo:~# ls -la /etc/gssproxy/httpd.keytab
-rw------- 1 root root 176 Sep 18 06:44 /etc/gssproxy/httpd.keytab
root@foo:~# klist -k /etc/gssproxy/httpd.keytab -e
Keytab name: FILE:/etc/gssproxy/httpd.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   2 HTTP/foo.sjd.se@GSSPROXY.EXAMPLE.ORG (aes256-cts-hmac-sha1-96) 
   2 HTTP/foo.sjd.se@GSSPROXY.EXAMPLE.ORG (aes128-cts-hmac-sha1-96) 
root@foo:~# 
The file should be owned by root and not be in the default /etc/krb5.keytab location, so Apache's libapache2-mod-auth-gssapi will have to use gssproxy to use it.

Then configure gssproxy to find the credential and use it with Apache.
root@foo:~# cat<<EOF > /etc/gssproxy/80-httpd.conf
[service/HTTP]
mechs = krb5
cred_store = keytab:/etc/gssproxy/httpd.keytab
cred_store = ccache:/var/lib/gssproxy/clients/krb5cc_%U
euid = www-data
process = /usr/sbin/apache2
EOF
For debugging, it may be useful to enable more gssproxy logging:
root@foo:~# cat<<EOF > /etc/gssproxy/gssproxy.conf
[gssproxy]
debug_level = 1
EOF
root@foo:~#
Restart gssproxy so it finds the new configuration, and monitor syslog as follows:
root@foo:~# tail -F /var/log/syslog &
root@foo:~# systemctl restart gssproxy
You should see something like this in the log file:
Sep 18 07:03:15 foo gssproxy[4076]: [2022/09/18 05:03:15]: Exiting after receiving a signal
Sep 18 07:03:15 foo systemd[1]: Stopping GSSAPI Proxy Daemon
Sep 18 07:03:15 foo systemd[1]: gssproxy.service: Deactivated successfully.
Sep 18 07:03:15 foo systemd[1]: Stopped GSSAPI Proxy Daemon.
Sep 18 07:03:15 foo gssproxy[4092]: [2022/09/18 05:03:15]: Debug Enabled (level: 1)
Sep 18 07:03:15 foo systemd[1]: Starting GSSAPI Proxy Daemon
Sep 18 07:03:15 foo gssproxy[4093]: [2022/09/18 05:03:15]: Kernel doesn't support GSS-Proxy (can't open /proc/net/rpc/use-gss-proxy: 2 (No such file or directory))
Sep 18 07:03:15 foo gssproxy[4093]: [2022/09/18 05:03:15]: Problem with kernel communication! NFS server will not work
Sep 18 07:03:15 foo systemd[1]: Started GSSAPI Proxy Daemon.
Sep 18 07:03:15 foo gssproxy[4093]: [2022/09/18 05:03:15]: Initialization complete.
The NFS-related errors are due to a default gssproxy configuration file; they are harmless, and if you don't use NFS with GSS-API you can silence them like this:
root@foo:~# rm /etc/gssproxy/24-nfs-server.conf
root@foo:~# systemctl try-reload-or-restart gssproxy
The log should now indicate that it loaded the keytab:
Sep 18 07:18:59 foo systemd[1]: Reloading GSSAPI Proxy Daemon 
Sep 18 07:18:59 foo gssproxy[4182]: [2022/09/18 05:18:59]: Received SIGHUP; re-reading config.
Sep 18 07:18:59 foo gssproxy[4182]: [2022/09/18 05:18:59]: Service: HTTP, Keytab: /etc/gssproxy/httpd.keytab, Enctype: 18
Sep 18 07:18:59 foo gssproxy[4182]: [2022/09/18 05:18:59]: New config loaded successfully.
Sep 18 07:18:59 foo systemd[1]: Reloaded GSSAPI Proxy Daemon.
To instruct Apache (or actually, the MIT Kerberos V5 GSS-API library used by mod_auth_gssapi loaded by Apache) to use gssproxy instead of /etc/krb5.keytab as usual, Apache needs to be started in an environment that has GSS_USE_PROXY=1 set. The background is covered by the gssproxy-mech(8) man page and explained by the gssproxy README.

When systemd is used the following can be used to set the environment variable, note the final command to reload systemd.
root@foo:~# mkdir -p /etc/systemd/system/apache2.service.d
root@foo:~# cat<<EOF > /etc/systemd/system/apache2.service.d/gssproxy.conf
[Service]
Environment=GSS_USE_PROXY=1
EOF
root@foo:~# systemctl daemon-reload
The next step is to configure a GSS-API protected Apache resource:
root@foo:~# cat<<EOF > /etc/apache2/conf-available/private.conf
<Location /private>
  AuthType GSSAPI
  AuthName "GSSAPI Login"
  Require valid-user
</Location>
EOF
Enable the configuration and restart Apache (the suggested use of reload is not sufficient, because then it won't be restarted with the newly introduced GSS_USE_PROXY variable). This just applies to the first time; after the first restart you may use reload again.
root@foo:~# a2enconf private
Enabling conf private.
To activate the new configuration, you need to run:
systemctl reload apache2
root@foo:~# systemctl restart apache2
When you have debug messages enabled, the log may look like this:
Sep 18 07:32:23 foo systemd[1]: Stopping The Apache HTTP Server 
Sep 18 07:32:23 foo gssproxy[4182]: [2022/09/18 05:32:23]: Client [2022/09/18 05:32:23]: (/usr/sbin/apache2) [2022/09/18 05:32:23]: connected (fd = 10)[2022/09/18 05:32:23]: (pid = 4651) (uid = 0) (gid = 0)[2022/09/18 05:32:23]:
Sep 18 07:32:23 foo gssproxy[4182]: message repeated 4 times: [ [2022/09/18 05:32:23]: Client [2022/09/18 05:32:23]: (/usr/sbin/apache2) [2022/09/18 05:32:23]: connected (fd = 10)[2022/09/18 05:32:23]: (pid = 4651) (uid = 0) (gid = 0)[2022/09/18 05:32:23]:]
Sep 18 07:32:23 foo systemd[1]: apache2.service: Deactivated successfully.
Sep 18 07:32:23 foo systemd[1]: Stopped The Apache HTTP Server.
Sep 18 07:32:23 foo systemd[1]: Starting The Apache HTTP Server
Sep 18 07:32:23 foo gssproxy[4182]: [2022/09/18 05:32:23]: Client [2022/09/18 05:32:23]: (/usr/sbin/apache2) [2022/09/18 05:32:23]: connected (fd = 10)[2022/09/18 05:32:23]: (pid = 4657) (uid = 0) (gid = 0)[2022/09/18 05:32:23]:
root@foo:~# Sep 18 07:32:23 foo gssproxy[4182]: message repeated 8 times: [ [2022/09/18 05:32:23]: Client [2022/09/18 05:32:23]: (/usr/sbin/apache2) [2022/09/18 05:32:23]: connected (fd = 10)[2022/09/18 05:32:23]: (pid = 4657) (uid = 0) (gid = 0)[2022/09/18 05:32:23]:]
Sep 18 07:32:23 foo systemd[1]: Started The Apache HTTP Server.
Finally, set up a dummy test page on the server:
root@foo:~# echo OK > /var/www/html/private
To verify that the server is working properly you may acquire tickets locally and then use curl to retrieve the GSS-API protected resource. The "--negotiate" option enables SPNEGO and "--user :" asks curl to use the username from the environment.
root@foo:~# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: jas@GSSPROXY.EXAMPLE.ORG
Valid starting Expires Service principal
09/18/22 07:40:37 09/19/22 07:40:37 krbtgt/GSSPROXY.EXAMPLE.ORG@GSSPROXY.EXAMPLE.ORG
root@foo:~# curl --negotiate --user : http://foo.sjd.se/private
OK
root@foo:~#
The log should contain something like this:
Sep 18 07:56:00 foo gssproxy[4872]: [2022/09/18 05:56:00]: Client [2022/09/18 05:56:00]: (/usr/sbin/apache2) [2022/09/18 05:56:00]: connected (fd = 10)[2022/09/18 05:56:00]: (pid = 5042) (uid = 33) (gid = 33)[2022/09/18 05:56:00]:
Sep 18 07:56:00 foo gssproxy[4872]: [CID 10][2022/09/18 05:56:00]: gp_rpc_execute: executing 6 (GSSX_ACQUIRE_CRED) for service "HTTP", euid: 33,socket: (null)
Sep 18 07:56:00 foo gssproxy[4872]: [CID 10][2022/09/18 05:56:00]: gp_rpc_execute: executing 6 (GSSX_ACQUIRE_CRED) for service "HTTP", euid: 33,socket: (null)
Sep 18 07:56:00 foo gssproxy[4872]: [CID 10][2022/09/18 05:56:00]: gp_rpc_execute: executing 1 (GSSX_INDICATE_MECHS) for service "HTTP", euid: 33,socket: (null)
Sep 18 07:56:00 foo gssproxy[4872]: [CID 10][2022/09/18 05:56:00]: gp_rpc_execute: executing 6 (GSSX_ACQUIRE_CRED) for service "HTTP", euid: 33,socket: (null)
Sep 18 07:56:00 foo gssproxy[4872]: [CID 10][2022/09/18 05:56:00]: gp_rpc_execute: executing 9 (GSSX_ACCEPT_SEC_CONTEXT) for service "HTTP", euid: 33,socket: (null)
The Apache log will look like this; notice the authenticated username shown.
127.0.0.1 - jas@GSSPROXY.EXAMPLE.ORG [18/Sep/2022:07:56:00 +0200] "GET /private HTTP/1.1" 200 481 "-" "curl/7.81.0"
Congratulations, and happy hacking!

13 September 2022

Alberto Garc a: Adding software to the Steam Deck with systemd-sysext

Yakuake on SteamOS Introduction: an immutable OS The Steam Deck runs SteamOS, a single-user operating system based on Arch Linux. Although derived from a standard package-based distro, the OS in the Steam Deck is immutable and system updates replace the contents of the root filesystem atomically instead of using the package manager. An immutable OS makes the system more stable and its updates less error-prone, but users cannot install additional packages to add more software. This is not a problem for most users since they are only going to run Steam and its games (which are stored in the home partition). Nevertheless, the OS also has a desktop mode which provides a standard Linux desktop experience, and here it makes sense to be able to install more software. How to do that though? It is possible for the user to become root, make the root filesystem read-write and install additional software there, but any changes will be gone after the next OS update. Modifying the rootfs can also be dangerous if the user is not careful. Ways to add additional software The simplest and safest way to install additional software is with Flatpak, and that's the method recommended in the Steam Deck Desktop FAQ. Flatpak is already installed and integrated in the system via the Discover app so I won't go into more details here. However, while Flatpak works great for desktop applications, not every piece of software is currently available, and Flatpak is also not designed for other types of programs like system services or command-line tools. Fortunately there are several ways to add software to the Steam Deck without touching the root filesystem, each one with different pros and cons. I will probably talk about some of them in the future, but in this post I'm going to focus on one that is already available in the system: systemd-sysext. About systemd-sysext This is a tool included in recent versions of systemd and it is designed to add additional files (in the form of system extensions) to an otherwise immutable root filesystem. Each one of these extensions contains a set of files. When extensions are enabled (aka merged) those files will appear on the root filesystem using overlayfs. From then on the user can open and run them normally as if they had been installed with a package manager. Merged extensions are seamlessly integrated with the rest of the OS. Since extensions are just collections of files they can be used to add new applications but also other things like system services, development tools, language packs, etc. Creating an extension: yakuake I'm using yakuake as an example for this tutorial since the extension is very easy to create, it is an application that some users are demanding and is not easy to distribute with Flatpak. So let's create a yakuake extension. Here are the steps: 1) Create a directory and unpack the files there:
$ mkdir yakuake
$ wget https://steamdeck-packages.steamos.cloud/archlinux-mirror/extra/os/x86_64/yakuake-21.12.1-1-x86_64.pkg.tar.zst
$ tar -C yakuake -xaf yakuake-*.tar.zst usr
2) Create a file called extension-release.NAME under usr/lib/extension-release.d with the fields ID and VERSION_ID taken from the Steam Deck's /etc/os-release file.
$ mkdir -p yakuake/usr/lib/extension-release.d/
$ echo ID=steamos > yakuake/usr/lib/extension-release.d/extension-release.yakuake
$ echo VERSION_ID=3.3.1 >> yakuake/usr/lib/extension-release.d/extension-release.yakuake
3) Create an image file with the contents of the extension:
$ mksquashfs yakuake yakuake.raw
That's it! The extension is ready. A couple of important things: image files must have the .raw suffix and, despite the name, they can contain any filesystem that the OS can mount. In this example I used SquashFS but other alternatives like EroFS or ext4 are equally valid. NOTE: systemd-sysext can also use extensions from plain directories (i.e. skipping the mksquashfs part). Unfortunately we cannot use them in our case because overlayfs does not work with the casefold feature that is enabled on the Steam Deck. Using the extension Once the extension is created you simply need to copy it to a place where systemd-sysext can find it. There are several places where they can be installed (see the manual for a list) but due to the Deck's partition layout and the potentially large size of some extensions it probably makes more sense to store them in the home partition and create a link from one of the supported locations (/var/lib/extensions in this example):
(deck@steamdeck ~)$ mkdir extensions
(deck@steamdeck ~)$ scp user@host:/path/to/yakuake.raw extensions/
(deck@steamdeck ~)$ sudo ln -s $PWD/extensions /var/lib/extensions
Once the extension is installed in that directory you only need to enable and start systemd-sysext:
(deck@steamdeck ~)$ sudo systemctl enable systemd-sysext
(deck@steamdeck ~)$ sudo systemctl start systemd-sysext
After this, if everything went fine you should be able to see (and run) /usr/bin/yakuake. The files should remain there from now on, also if you reboot the device. You can see what extensions are enabled with this command:
$ systemd-sysext status
HIERARCHY EXTENSIONS SINCE
/opt      none       -
/usr      yakuake    Tue 2022-09-13 18:21:53 CEST
If you add or remove extensions from the directory then a simple systemd-sysext refresh is enough to apply the changes. Unfortunately, and unlike distro packages, extensions don't have any kind of post-installation hooks or triggers, so in the case of Yakuake you probably won't see an entry in the KDE application menu immediately after enabling the extension. You can solve that by running kbuildsycoca5 once from the command line. Limitations and caveats Using systemd extensions is generally very easy but there are some things that you need to take into account:
  1. Using extensions is easy (you put them in the directory and voilà!). However, creating extensions is not necessarily always easy. To begin with, any libraries, files, etc., that your extensions may need should be either present in the root filesystem or provided by the extension itself. You may need to combine files from different sources or packages into a single extension, or compile them yourself.
  2. In particular, if the extension contains binaries they should probably come from the Steam Deck repository or they should be built to work with those packages. If you need to build your own binaries then having a SteamOS virtual machine can be handy. There you can install all development files and also test that everything works as expected. One could also create a Steam Deck SDK extension with all the necessary files to develop directly on the Deck.
  3. Extensions are not distribution packages, they don't have dependency information and therefore they should be self-contained. They also lack triggers and other features available in packages. For desktop applications I still recommend using a system like Flatpak when possible.
  4. Extensions are tied to a particular version of the OS and, as explained above, the ID and VERSION_ID of each extension must match the values from /etc/os-release. If the fields don't match then the extension will be ignored. This is to be expected because there's no guarantee that a particular extension is going to work with a different version of the OS. This can happen after a system update. In the best case one simply needs to update the extension's VERSION_ID, but in some cases it might be necessary to create the extension again with different/updated files.
  5. Extensions only install files in /usr and /opt. Any other file in the image will be ignored. This can be a problem if a particular piece of software needs files in other directories.
  6. When extensions are enabled the /usr and /opt directories become read-only because they are now part of an overlayfs. They will remain read-only even if you run steamos-readonly disable! If you really want to make the rootfs read-write you need to disable the extensions (systemd-sysext unmerge) first (see the commands sketched after this list).
  7. Unlike Flatpak or Podman (including toolbox / distrobox), this is (by design) not meant to isolate the contents of the extension from the rest of the system, so you should be careful with what you're installing. On the other hand, this lack of isolation makes systemd-sysext better suited to some use cases than those container-based systems.
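For reference, a minimal sketch of the commands mentioned above (systemd-sysext refresh and unmerge are standard systemd-sysext verbs; steamos-readonly is the SteamOS helper named in point 6):
(deck@steamdeck ~)$ sudo systemd-sysext refresh
(deck@steamdeck ~)$ sudo systemd-sysext unmerge
(deck@steamdeck ~)$ sudo steamos-readonly disable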
Conclusion systemd extensions are an easy way to add software (or data files) to the immutable OS of the Steam Deck in a way that is seamlessly integrated with the rest of the system. Creating them can be more or less easy depending on the case, but using them is extremely simple. Extensions are not packages, and systemd-sysext is not a package manager or a general-purpose tool to solve all problems, but if you are aware of its limitations it can be a practical tool. It is also possible to share extensions with other users, but here the usual warning against installing binaries from untrusted sources applies. Use with caution, and enjoy!

2 September 2022

John Goerzen: Dead USB Drives Are Fine: Building a Reliable Sneakernet

OK, you're probably thinking: John, you talk a lot about things like Gopher and personal radios, and now you want to talk about building a reliable network out of USB drives? Well, yes. In fact, I've already done it.

What is sneakernet? Normally, sneakernet is a sort of tongue-in-cheek reference to using disconnected storage to transport data or messages. By disconnected storage I mean anything like CD-ROMs, hard drives, SD cards, USB drives, and so forth. There are times when loading up 12TB on a device and driving it across town is just faster and easier than using the Internet for the same. And, sometimes you need to get data to places that have no Internet at all. Another reason for sneakernet is security. For instance, if your backup system is online, and your systems being backed up are online, then it could become possible for an attacker to destroy both your primary copy of data and your backups. Or, you might use a dedicated computer with no network connection to do GnuPG (GPG) signing.

What about reliable sneakernet, then? TCP is often considered a reliable protocol. That means that the sending side is generally able to tell if its message was properly received. As with most reliable protocols, we have these components:
  1. After transmitting a piece of data, the sender retains it.
  2. After receiving a piece of data, the receiver sends an acknowledgment (ACK) back to the sender.
  3. Upon receiving the acknowledgment, the sender removes its buffered copy of the data.
  4. If no acknowledgment is received at the sender, it retransmits the data, in case it gets lost in transit.
  5. It reorders any packets that arrive out of order, so that the recipient s data stream is ordered correctly.
Now, a lot of the things I just mentioned for sneakernet are legendarily unreliable. USB drives fail, CD-ROMs get scratched, hard drives get banged up. Think about putting these things in a bicycle bag or airline luggage. Some of them are going to fail. You might think, well, I'll just copy files to a USB drive instead of moving them, and once I get them onto the destination machine, I'll delete them from the source. Congratulations! You are a human retransmit algorithm! We should be able to automate this! And we can.

Enter NNCP NNCP is one of those things that almost defies explanation. It is a toolkit for building asynchronous networks. It can use as a carrier: a pipe, TCP network connection, a mounted filesystem (specifically intended for cases like this), and much more. It also supports multi-hop asynchronous routing and asynchronous meshing, but these are beyond the scope of this particular article. NNCP's transports that involve live communication between two hops already had all the hallmarks of being reliable; there was a positive ACK and retransmit. As of version 8.7.0, NNCP's ACKs themselves can also be asynchronous, meaning that every NNCP transport can now be reliable. Yes, that's right. Your ACKs can flow over tapes and USB drives if you want them to. I use this for archiving and backups. If you aren't already familiar with NNCP, you might take a look at my NNCP page. I also have a lot of blog posts about NNCP. Those pages describe the basics of NNCP: the packet (the unit of transmission in NNCP, which can be tiny or many TB), the end-to-end encryption, and so forth. The new command we will now be interested in is nncp-ack.

The Basic Idea Here are the basic steps to processing this stuff with NNCP:
  1. First, we use nncp-xfer -rx to process incoming packets from the USB (or other media) device. This moves them into the NNCP inbound queue, deleting them from the media device, and verifies the packet integrity.
  2. We use nncp-ack -node $NODE to create ACK packets responding to the packets we just loaded into the rx queue. It writes a list of generated ACKs onto fd 4, which we save off for later use.
  3. We run nncp-toss -seen to process the incoming queue. The use of -seen causes NNCP to remember the hashes of packets seen before, so a duplicate of an already-seen packet will not be processed twice. This command also processes incoming ACKs for packets we've sent out previously; if they pass verification, the relevant packets are removed from the local machine's tx queue.
  4. Now, we use nncp-xfer -keep -tx -mkdir -node $NODE to send outgoing packets to a given node by writing them to a given directory on the media device. -keep causes them to remain in the outgoing queue.
  5. Finally, we use the list of generated ACK packets saved off in step 2 above. That list is passed to nncp-rm -node $NODE -pkt < $FILE to remove those specific packets from the outbound queue. The reason is that there will never be an ACK of an ACK packet (that would create an infinite loop), so if we don't delete them in this manner, they would hang around forever.
You can see these steps follow the same basic outline on upstream's nncp-ack page. One thing to keep in mind: if anything else is running nncp-toss, there is a chance of a race condition between steps 1 and 2 (if nncp-toss gets to it first, it might not get an ack generated). This would sort itself out eventually, presumably, as the sender would retransmit and it would be ACKed later.

Further ideas NNCP guarantees the integrity of packets, but not ordering between packets; if you need that, you might look into my Filespooler program. It is designed to work with NNCP and can provide ordered processing.

An example script Here is a script you might try for this sort of thing. It may have more logic than you really need (you just need the steps above), but hopefully it is clear.
#!/bin/bash
set -eo pipefail
MEDIABASE="/media/$USER"
# The local node name
NODENAME="$(hostname)"
# All nodes.  NODENAME should be in this list.
ALLNODES="node1 node2 node3"
RUNNNCP=""
# If you need to sudo, use something like RUNNNCP="sudo -Hu nncp"
NNCPPATH="/usr/local/nncp/bin"
ACKPATH="$(mktemp -d)"
# Process incoming packets.
#
# Parameters: $1 - the path to scan.  Must contain a directory
# named "nncp".
procrxpath () {
    while [ -n "$1" ]; do
        BASEPATH="$1/nncp"
        shift
        if ! [ -d "$BASEPATH" ]; then
            echo "$BASEPATH doesn't exist; skipping"
            continue
        fi
        echo " *** Incoming: processing $BASEPATH"
        TMPDIR="$(mktemp -d)"
        # This rsync and the one below can help with
        # certain permission issues from weird foreign
        # media.  You could just eliminate it and
        # always use $BASEPATH instead of $TMPDIR below.
        rsync -rt "$BASEPATH/" "$TMPDIR/"
        # You may need these next two lines if using sudo as above.
        # chgrp -R nncp "$TMPDIR"
        # chmod -R g+rwX "$TMPDIR"
        echo "     Running nncp-xfer -rx"
        $RUNNNCP $NNCPPATH/nncp-xfer -progress -rx "$TMPDIR"
        for NODE in $ALLNODES; do
                if [ "$NODE" != "$NODENAME" ]; then
                        echo "     Running nncp-ack for $NODE"
                        # Now, we generate ACK packets for each node we will
                        # process.  nncp-ack writes a list of the created
                        # ACK packets to fd 4.  We'll use them later.
                        # If using sudo, add -C 5 after $RUNNNCP.
                        $RUNNNCP $NNCPPATH/nncp-ack -progress -node "$NODE" \
                           4>> "$ACKPATH/$NODE"
                fi
        done
        rsync --delete -rt "$TMPDIR/" "$BASEPATH/"
        rm -fr "$TMPDIR"
    done
}
proctxpath () {
    while [ -n "$1" ]; do
        BASEPATH="$1/nncp"
        shift
        if ! [ -d "$BASEPATH" ]; then
            echo "$BASEPATH doesn't exist; skipping"
            continue
        fi
        echo " *** Outgoing: processing $BASEPATH"
        TMPDIR="$(mktemp -d)"
        rsync -rt "$BASEPATH/" "$TMPDIR/"
        # You may need these two lines if using sudo:
        # chgrp -R nncp "$TMPDIR"
        # chmod -R g+rwX "$TMPDIR"
        for DESTHOST in $ALLNODES; do
            if [ "$DESTHOST" = "$NODENAME" ]; then
                continue
            fi
            # Copy outgoing packets to this node, but keep them in the outgoing
            # queue with -keep.
            $RUNNNCP $NNCPPATH/nncp-xfer -keep -tx -mkdir -node "$DESTHOST" -progress "$TMPDIR"
            # Here is the key: that list of ACK packets we made above - now we delete them.
            # There will never be an ACK for an ACK, so they'd keep sending forever
            # if we didn't do this.
            if [ -f "$ACKPATH/$DESTHOST" ]; then
                echo "nncp-rm for node $DESTHOST"
                $RUNNNCP $NNCPPATH/nncp-rm -debug -node "$DESTHOST" -pkt < "$ACKPATH/$DESTHOST"
            fi
        done
        rsync --delete -rt "$TMPDIR/" "$BASEPATH/"
        rm -rf "$TMPDIR"
        # We only want to write stuff once.
        return 0
    done
}
procrxpath "$MEDIABASE"/*
echo " *** Initial tossing..."
# We make sure to use -seen to rule out duplicates.
$RUNNNCP $NNCPPATH/nncp-toss -progress -seen
proctxpath "$MEDIABASE"/*
echo "You can unmount devices now."
echo "Done."

This post is also available on my website, where it may be periodically updated.

1 August 2022

Sergio Talens-Oliag: Using Git Server Hooks on GitLab CE to Validate Tags

Since a long time ago I've been a gitlab-ce user, in fact I've set it up on three of the last four companies I've worked for (initially I installed it using the omnibus packages on a debian server but on the last two places I moved to the docker based installation, as it is easy to maintain and we don't need a big installation as the teams using it are small). At the company I work for now (kyso) we are using it to host all our internal repositories and to do all the CI/CD work (the automatic deployments are triggered by web hooks in some cases, but the rest is all done using gitlab-ci). The majority of projects are using nodejs as programming language and we have automated the publication of npm packages on our gitlab instance npm registry and even the publication into the npmjs registry. To publish the packages we have added rules to the gitlab-ci configuration of the relevant repositories and we publish them when a tag is created. As we are lazy by definition, I configured the system to use the tag as the package version; I tested if the contents of the package.json were in sync with the expected version and if they were not I updated the file and did a force push of the tag with the updated file using the following code on the script that publishes the package:
# Update package version & add it to the .build-args
INITIAL_PACKAGE_VERSION="$(npm pkg get version | tr -d '"')"
npm version --allow-same --no-commit-hooks --no-git-tag-version \
  "$CI_COMMIT_TAG"
UPDATED_PACKAGE_VERSION="$(npm pkg get version | tr -d '"')"
echo "UPDATED_PACKAGE_VERSION=$UPDATED_PACKAGE_VERSION" >> .build-args
# Update tag if the version was updated or abort
if [ "$INITIAL_PACKAGE_VERSION" != "$UPDATED_PACKAGE_VERSION" ]; then
  if [ -n "$CI_GIT_USER" ] && [ -n "$CI_GIT_TOKEN" ]; then
    git commit -m "Updated version from tag $CI_COMMIT_TAG" package.json
    git tag -f "$CI_COMMIT_TAG" -m "Updated version from tag"
    git push -f -o ci.skip origin "$CI_COMMIT_TAG"
  else
    echo "!!! ERROR !!!"
    echo "The updated tag could not be uploaded."
    echo "Set CI_GIT_USER and CI_GIT_TOKEN or fix the 'package.json' file"
    echo "!!! ERROR !!!"
    exit 1
  fi
fi
This feels a little dirty (we are leaving commits on the tag but not updating the original branch); I thought about trying to find the branch using the tag and update it, but I dropped the idea pretty soon as there were multiple issues to consider (i.e. we can have tags pointing to commits present in multiple branches, and even if a tag only points to one branch it does not have to be the HEAD of that branch, making the inclusion difficult). In any case this system was working, so we left it until we started to publish to the NPM Registry; as we are using a token to push the packages that we don't want all developers to have access to (right now it would not matter, but when the team grows it will) I started to use gitlab protected branches on the projects that need it and to adjust the .npmrc file using protected variables. The problem then was that we can no longer do a standard force push for a branch (that is the main point of the protected branches feature) unless we use the gitlab api, so the tags with the wrong version started to fail. As the way things were being done seemed dirty anyway I thought that the best way of fixing things was to forbid users to push a tag that includes a version that does not match the package.json version. After thinking about it we decided to use githooks on the gitlab server for the repositories that need it; as we are only interested in tags we are going to use the update hook; it is executed once for each ref to be updated, and takes three parameters:
  • the name of the ref being updated,
  • the old object name stored in the ref,
  • and the new object name to be stored in the ref.
To install our hook we have found the gitaly relative path of each repo and located it on the server filesystem (as I said we are using docker and the gitlab data directory is on /srv/gitlab/data, so the path to the repo has the form /srv/gitlab/data/git-data/repositories/@hashed/xx/yy/hash.git). Once we have the directory we need to:
  • create a custom_hooks sub directory inside it,
  • add the update script (as we only need one script we used that instead of creating an update.d directory, the good thing is that this will also work with a standard git server renaming the base directory to hooks instead of custom_hooks),
  • make it executable, and
  • change the directory and file ownership to make sure it can be read and executed from the gitlab container
On a console session:
$ cd /srv/gitlab/data/git-data/repositories/@hashed/xx/yy/hash.git
$ mkdir custom_hooks
$ edit_or_copy custom_hooks/update
$ chmod 0755 custom_hooks/update
$ chown --reference=. -R custom_hooks
The update script we are using is as follows:
#!/bin/sh
set -e
# kyso update hook
#
# Right now it checks version.txt or package.json versions against the tag name
# (it supports a 'v' prefix on the tag)
# Arguments
ref_name="$1"
old_rev="$2"
new_rev="$3"
# Initial test
if [ -z "$ref_name" ] || [ -z "$old_rev" ] || [ -z "$new_rev" ]; then
  echo "usage: $0 <ref> <oldrev> <newrev>" >&2
  exit 1
fi
# Get the tag short name
tag_name="${ref_name##refs/tags/}"
# Exit if the update is not for a tag
if [ "$tag_name" = "$ref_name" ]; then
  exit 0
fi
# Get the null rev value (string of zeros)
zero=$(git hash-object --stdin </dev/null | tr '0-9a-f' '0')
# Get if the tag is new or not
if [ "$old_rev" = "$zero" ]; then
  new_tag="true"
else
  new_tag="false"
fi
# Get the type of revision:
# - delete: if the new_rev is zero
# - commit: annotated tag
# - tag: un-annotated tag
if [ "$new_rev" = "$zero" ]; then
  new_rev_type="delete"
else
  new_rev_type="$(git cat-file -t "$new_rev")"
fi
# Exit if we are deleting a tag (nothing to check here)
if [ "$new_rev_type" = "delete" ]; then
  exit 0
fi
# Check the version against the tag (supports version.txt & package.json)
if git cat-file -e "$new_rev:version.txt" >/dev/null 2>&1; then
  version="$(git cat-file -p "$new_rev:version.txt")"
  if [ "$version" = "$tag_name" ] || [ "$version" = "${tag_name#v}" ]; then
    exit 0
  else
    EMSG="tag '$tag_name' and 'version.txt' contents '$version' don't match"
    echo "GL-HOOK-ERR: $EMSG"
    exit 1
  fi
elif git cat-file -e "$new_rev:package.json" >/dev/null 2>&1; then
  version="$(
    git cat-file -p "$new_rev:package.json" | jsonpath version | tr -d '\[\]"'
  )"
  if [ "$version" = "$tag_name" ] || [ "$version" = "${tag_name#v}" ]; then
    exit 0
  else
    EMSG="tag '$tag_name' and 'package.json' version '$version' don't match"
    echo "GL-HOOK-ERR: $EMSG"
    exit 1
  fi
else
  # No version.txt or package.json file found
  exit 0
fi
Some comments about it:
  • we are only looking for tags, if the ref_name does not have the prefix refs/tags/ the script does an exit 0,
  • although we are checking if the tag is new or not we are not using the value (in gitlab that is handled by the protected tag feature),
  • if we are deleting a tag the script does an exit 0, we don t need to check anything in that case,
  • we are ignoring if the tag is annotated or not (we set the new_rev_type to tag or commit, but we don t use the value),
  • we test first the version.txt file and if it does not exist we check the package.json file, if it does not exist either we do an exit 0, as there is no version to check against and we allow that on a tag,
  • we add the GL-HOOK-ERR: prefix to the messages to show them on the gitlab web interface (can be tested creating a tag from it),
  • to get the version on the package.json file we use the jsonpath binary (it is installed by the jsonpath ruby gem) because it is available on the gitlab container (initially I used sed to get the value, but a real JSON parser is always a better option).
Once the hook is installed, when a user tries to push a tag to a repository that has a version.txt or package.json file and the tag does not match the version (if version.txt is present it takes precedence), the push fails. If the tag matches or the files are not present the tag is added if the user has permission to add it in gitlab (our hook is only executed if the user is allowed to create or update the tag).
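As a quick illustration (the version number is made up), with a package.json that declares version 1.2.3 only a matching tag is accepted:
$ git tag v1.2.3 && git push origin v1.2.3   # accepted, the 'v' prefix is allowed
$ git tag 1.2.4 && git push origin 1.2.4     # rejected by the update hook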

24 June 2022

Kees Cook: finding binary differences

As part of the continuing work to replace 1-element arrays in the Linux kernel, it s very handy to show that a source change has had no executable code difference. For example, if you started with this:
struct foo {
    unsigned long flags;
    u32 length;
    u32 data[1];
};
void foo_init(int count)
{
    struct foo *instance;
    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
    ...
    instance = kmalloc(bytes, GFP_KERNEL);
    ...
}
And you changed only the struct definition:
-    u32 data[1];
+    u32 data[];
The bytes calculation is going to be incorrect, since it is still subtracting 1 element's worth of space from the desired count. (And let's ignore for the moment the open-coded calculation that may end up with an arithmetic over/underflow here; that can be solved separately by using the struct_size() helper or the size_mul(), size_add(), etc family of helpers.) The missed adjustment to the size calculation is relatively easy to find in this example, but sometimes it's much less obvious how structure sizes might be woven into the code. I've been checking for issues by using the fantastic diffoscope tool. It can produce a LOT of noise if you try to compare builds without keeping in mind the issues solved by reproducible builds, with some additional notes. I prepare my build with the options known to disrupt code layout disabled, but with debug info enabled:
$ KBF="KBUILD_BUILD_TIMESTAMP=1970-01-01 KBUILD_BUILD_USER=user KBUILD_BUILD_HOST=host KBUILD_BUILD_VERSION=1"
$ OUT=gcc
$ make $KBF O=$OUT allmodconfig
$ ./scripts/config --file $OUT/.config \
        -d GCOV_KERNEL -d KCOV -d GCC_PLUGINS -d IKHEADERS -d KASAN -d UBSAN \
        -d DEBUG_INFO_NONE -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
$ make $KBF O=$OUT olddefconfig
Then I build a stock target, saving the output in "before". In this case, I'm examining drivers/scsi/megaraid/:
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/before
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/before/
Then I patch and build a modified target, saving the output in "after":
$ vi the/source/code.c
$ make -jN $KBF O=$OUT drivers/scsi/megaraid/
$ mkdir -p $OUT/after
$ cp $OUT/drivers/scsi/megaraid/*.o $OUT/after/
And then run diffoscope:
$ diffoscope $OUT/before/ $OUT/after/
If diffoscope output reports nothing, then we're done. Usually, though, when source lines move around other stuff will shift too (e.g. WARN macros rely on line numbers, so the bug table may change contents a bit, etc), and diffoscope output will look noisy. To examine just the executable code, the command that diffoscope used is reported in the output, and we can run it directly, without the (possibly shifted) line numbers being reported, i.e. running objdump without --line-numbers:
$ ARGS="--disassemble --demangle --reloc --no-show-raw-insn --section=.text"
$ for i in $(cd $OUT/before && echo *.o); do
        echo $i
        diff -u <(objdump $ARGS $OUT/before/$i | sed "0,/^Disassembly/d") \
                <(objdump $ARGS $OUT/after/$i  | sed "0,/^Disassembly/d")
done
If I see an unexpected difference, for example:
-    c120:      movq   $0x0,0x800(%rbx)
+    c120:      movq   $0x0,0x7f8(%rbx)
Then I'll search for the pattern with line numbers added to the objdump output:
$ vi <(objdump --line-numbers $ARGS $OUT/after/megaraid_sas_fp.o)
I'd search for "0x0,0x7f8", find the source file and line number above it, open that source file at that position, and look to see where something was being miscalculated:
$ vi drivers/scsi/megaraid/megaraid_sas_fp.c +329
Once tracked down, I'd start over at the "patch and build a modified target" step above, repeating until there were no differences. For example, in the starting example, I'd also need to make this change:
-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = sizeof(*instance) + sizeof(u32) * count;
Though, as hinted earlier, better yet would be:
-    size_t bytes = sizeof(*instance) + sizeof(u32) * (count - 1);
+    size_t bytes = struct_size(instance, data, count);
But sometimes adding the helper usage will add binary output differences since they're performing overflow checking that might saturate at SIZE_MAX. To help with patch clarity, those changes can be done separately from fixing the array declaration.

© 2022, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
CC BY-SA 4.0

18 June 2022

Bastian Venthur: blag is now available in Debian

Last year, I wrote my own blog-aware static site generator in Python. I called it blag, named after the blag of the webcomic xkcd. Now I finally got around to packaging and uploading blag to Debian. It passed the NEW queue and is now part of the distribution. That means if you're using Debian, you can install it via:
sudo aptitude install blag
Ubuntu will probably follow soon. For every other system, blag is also available on PyPI:
pip install blag
To get started, you can
mkdir blog && cd blog
blag quickstart                        # fill out some info
nvim content/hello-world.md            # write some content
blag build                             # build the website
Blag is aware of articles and pages: the difference is that articles are part of the blog and will be added to the atom feed, the archive and aggregated in the tag pages; pages are just rendered out to HTML. Articles and pages can be freely mixed in the content directory; what differentiates an article from a page is the existence of the date metadata element:
title: My first article
description: Short description of the article
date: 2022-06-18 23:00
tags: blogging, markdown
## Hello World!
Lorem ipsum.
[...]
blag also comes with a dev-server that rebuilds the website automatically on every change detected; you can start it using:
blag serve
The default theme looks quite ugly, and you probably want to create your own styling to make it more beautiful. The process is not very difficult if you're familiar with jinja templating. Help on that can be found in the Templating section of the online documentation, the offline version in the blag-doc package, or the man page, respectively. Speaking of the blag-doc package: packaging it was surprisingly tricky, and it also took me a lot of trial and error to realize that dh_sphinxdocs alone does not automatically put the generated html output into the appropriate package; rather, you have to list it in the package's .docs file (i.e. blag-doc.docs) so dh_installdocs can install it properly.
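For illustration only, a hedged sketch of the kind of entry that makes dh_installdocs pick up the generated HTML, assuming the Sphinx output lands under docs/_build/html (the actual path depends on how debian/rules runs sphinx-build):
# Hypothetical content of debian/blag-doc.docs; adjust the path to wherever
# sphinx-build writes the generated HTML for this package's build.
cat > debian/blag-doc.docs <<'EOF'
docs/_build/html
EOF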

11 June 2022

Louis-Philippe Véronneau: Updating a rooted Pixel 3a

A short while after getting a Pixel 3a, I decided to root it, mostly to have more control over the charging procedure. In order to preserve battery life, I like my phone to stop charging at around 75% of full battery capacity and to shut down automatically at around 12%. Some Android ROMs have extra settings to manage this, but LineageOS unfortunately does not. Android already comes with a fairly complex mechanism to handle the charge cycle, but it is mostly controlled by the kernel and cannot be easily configured by end-users. acc is a higher-level "systemless" interface for the Android kernel battery management, but one needs root to do anything interesting with it. Once rooted, you can use the AccA app instead of playing on the command line to fine tune your battery settings. Sadly, having a rooted phone also means I need to re-root it each time there is an OS update (typically each week). Somehow, I keep forgetting the exact procedure to do this! Hopefully, I will be able to use this post as a reference in the future :) Note that these instructions might not apply to your exact phone model, proceed with caution!
Extract the boot.img file
This procedure mostly comes from the LineageOS documentation on extracting proprietary blobs from the payload.
  1. Download the latest LineageOS image for your phone.
  2. unzip the image to get the payload.bin file inside it.
  3. Clone the LineageOS scripts git repository: $ git clone https://github.com/LineageOS/scripts
  4. extract the boot image (requires python3-protobuf): $ mkdir extracted-payload $ python3 scripts/update-payload-extractor/extract.py payload.bin --output_dir extracted-payload
You should now have a boot.img file.
Patch the boot image file using Magisk
  1. Upload the boot.img file you previously extracted to your device.
  2. Open Magisk and patch the boot.img file.
  3. Download the patched file back on your computer.
Flash the patched boot image
  1. Enable ADB debug mode on your phone.
  2. Reboot into fastboot mode. $ adb reboot fastboot
  3. Flash the patched boot image file: $ fastboot flash boot magisk_patched-foo.img
  4. Disable ADB debug mode on your phone.
Troubleshooting
In an ideal world, you would do this entire process each time you upgrade to a new LineageOS version. Sadly, this creates friction and makes updating much more troublesome. To simplify things, you can try to flash an old patched boot.img file after upgrading, instead of generating it each time. In my experience, it usually works. When it does not, the device behaves weirdly after a reboot and things that require proprietary blobs (like WiFi) will stop working. If that happens:
  1. Download the latest LineageOS version for your phone.
  2. Reboot into recovery (Power + Volume Down).
  3. Click on "Apply Updates"
  4. Sideload the ROM: $ adb sideload lineageos-foo.zip

26 May 2022

Sergio Talens-Oliag: New Blog Config

As promised, in this post I'm going to explain how I've configured this blog using hugo, asciidoctor and the papermod theme, how I publish it using nginx, how I've integrated the remark42 comment system and how I've automated its publication using gitea and json2file-go. It is a long post, but I hope that at least parts of it can be interesting for some; feel free to ignore it if that is not your case.

Hugo Configuration

Theme settings
The site uses the PaperMod theme and, as I'm using asciidoctor to publish my content, I've adjusted the settings to improve how things are shown with it. The current config.yml file is the one shown below (probably some of the settings are not required or used right now, but I'm including the full file, so this post will always have the latest version of it):
config.yml
baseURL: https://blogops.mixinet.net/
title: Mixinet BlogOps
paginate: 5
theme: PaperMod
destination: public/
enableInlineShortcodes: true
enableRobotsTXT: true
buildDrafts: false
buildFuture: false
buildExpired: false
enableEmoji: true
pygmentsUseClasses: true
minify:
  disableXML: true
  minifyOutput: true
languages:
  en:
    languageName: "English"
    description: "Mixinet BlogOps - https://blogops.mixinet.net/"
    author: "Sergio Talens-Oliag"
    weight: 1
    title: Mixinet BlogOps
    homeInfoParams:
      Title: "Sergio Talens-Oliag Technical Blog"
      Content: >
        ![Mixinet BlogOps](/images/mixinet-blogops.png)
    taxonomies:
      category: categories
      tag: tags
      series: series
    menu:
      main:
        - name: Archive
          url: archives
          weight: 5
        - name: Categories
          url: categories/
          weight: 10
        - name: Tags
          url: tags/
          weight: 10
        - name: Search
          url: search/
          weight: 15
outputs:
  home:
    - HTML
    - RSS
    - JSON
params:
  env: production
  defaultTheme: light
  disableThemeToggle: false
  ShowShareButtons: true
  ShowReadingTime: true
  disableSpecial1stPost: true
  disableHLJS: true
  displayFullLangName: true
  ShowPostNavLinks: true
  ShowBreadCrumbs: true
  ShowCodeCopyButtons: true
  ShowRssButtonInSectionTermList: true
  ShowFullTextinRSS: true
  ShowToc: true
  TocOpen: false
  comments: true
  remark42SiteID: "blogops"
  remark42Url: "/remark42"
  profileMode:
    enabled: false
    title: Sergio Talens-Oliag Technical Blog
    imageUrl: "/images/mixinet-blogops.png"
    imageTitle: Mixinet BlogOps
    buttons:
      - name: Archives
        url: archives
      - name: Categories
        url: categories
      - name: Tags
        url: tags
  socialIcons:
    - name: CV
      url: "https://www.uv.es/~sto/cv/"
    - name: Debian
      url: "https://people.debian.org/~sto/"
    - name: GitHub
      url: "https://github.com/sto/"
    - name: GitLab
      url: "https://gitlab.com/stalens/"
    - name: Linkedin
      url: "https://www.linkedin.com/in/sergio-talens-oliag/"
    - name: RSS
      url: "index.xml"
  assets:
    disableHLJS: true
    favicon: "/favicon.ico"
    favicon16x16:  "/favicon-16x16.png"
    favicon32x32:  "/favicon-32x32.png"
    apple_touch_icon:  "/apple-touch-icon.png"
    safari_pinned_tab:  "/safari-pinned-tab.svg"
  fuseOpts:
    isCaseSensitive: false
    shouldSort: true
    location: 0
    distance: 1000
    threshold: 0.4
    minMatchCharLength: 0
    keys: ["title", "permalink", "summary", "content"]
markup:
  asciidocExt:
    attributes: {}
    backend: html5s
    extensions: ['asciidoctor-html5s','asciidoctor-diagram']
    failureLevel: fatal
    noHeaderOrFooter: true
    preserveTOC: false
    safeMode: unsafe
    sectionNumbers: false
    trace: false
    verbose: false
    workingFolderCurrent: true
privacy:
  vimeo:
    disabled: false
    simple: true
  twitter:
    disabled: false
    enableDNT: true
    simple: true
  instagram:
    disabled: false
    simple: true
  youtube:
    disabled: false
    privacyEnhanced: true
services:
  instagram:
    disableInlineCSS: true
  twitter:
    disableInlineCSS: true
security:
  exec:
    allow:
      - '^asciidoctor$'
      - '^dart-sass-embedded$'
      - '^go$'
      - '^npx$'
      - '^postcss$'
Some notes about the settings:
  • disableHLJS and assets.disableHLJS are set to true; we plan to use rouge on adoc and the inclusion of the hljs assets adds styles that collide with the ones used by rouge.
  • ShowToc is set to true and the TocOpen setting is set to false to make the ToC appear collapsed initially. My plan was to use the asciidoctor ToC, but after trying it I believe the theme one looks nice and I don't need to adjust styles, although it has some issues with the html5s processor (the admonition titles use <h6> and they are shown on the ToC, which is weird); to fix it I've copied layouts/partials/toc.html to my site repository and replaced the range of headings to end at 5 instead of 6 (in fact 5 still seems a lot, but as I don't think I'll use that heading level on the posts it doesn't really matter). See the sketch after this list for how the override is put in place.
  • params.profileMode values are adjusted, but for now I've left it disabled, setting params.profileMode.enabled to false, and I've set the homeInfoParams to show more or less the same content with the latest posts under it (I've added some styles to my custom.css style sheet to center the text and image of the first post to match the look and feel of the profile).
  • On the asciidocExt section I've adjusted the backend to use html5s, I've added the asciidoctor-html5s and asciidoctor-diagram extensions to asciidoctor and set workingFolderCurrent to true to make asciidoctor-diagram work right (haven't tested it yet).
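A minimal sketch of how that ToC override can be put in place, assuming the theme is vendored under themes/PaperMod (Hugo prefers a partial under the site's own layouts/ directory over the theme's copy):
# Copy the theme partial into the site repository so it overrides the theme's
mkdir -p layouts/partials
cp themes/PaperMod/layouts/partials/toc.html layouts/partials/toc.html
# Then edit layouts/partials/toc.html and lower the maximum heading level
# considered for the ToC from 6 to 5.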

Theme customisations
To write in asciidoctor using the html5s processor I've added some files to the assets/css/extended directory:
  1. As said before, I've added the file assets/css/extended/custom.css to make the homeInfoParams look like the profile page and I've also changed some theme styles a little to make things look better with the html5s output:
    custom.css
    /* Fix first entry alignment to make it look like the profile */
    .first-entry { text-align: center; }
    .first-entry img { display: inline; }
    /**
     * Remove margin for .post-content code and reduce padding to make it look
     * better with the asciidoctor html5s output.
     **/
    .post-content code { margin: auto 0; padding: 4px; }
  2. I've also added the file assets/css/extended/adoc.css with some styles taken from the asciidoctor-default.css, see this blog post about the original file; mine is the same after formatting it with css-beautify and editing it to use variables for the colors to support light and dark themes:
    adoc.css
    /* AsciiDoctor */
    table {
        border-collapse: collapse;
        border-spacing: 0
    }
    .admonitionblock>table {
        border-collapse: separate;
        border: 0;
        background: none;
        width: 100%
    }
    .admonitionblock>table td.icon {
        text-align: center;
        width: 80px
    }
    .admonitionblock>table td.icon img {
        max-width: none
    }
    .admonitionblock>table td.icon .title {
        font-weight: bold;
        font-family: "Open Sans", "DejaVu Sans", sans-serif;
        text-transform: uppercase
    }
    .admonitionblock>table td.content {
        padding-left: 1.125em;
        padding-right: 1.25em;
        border-left: 1px solid #ddddd8;
        color: var(--primary)
    }
    .admonitionblock>table td.content>:last-child>:last-child {
        margin-bottom: 0
    }
    .admonitionblock td.icon [class^="fa icon-"] {
        font-size: 2.5em;
        text-shadow: 1px 1px 2px var(--secondary);
        cursor: default
    }
    .admonitionblock td.icon .icon-note::before {
        content: "\f05a";
        color: var(--icon-note-color)
    }
    .admonitionblock td.icon .icon-tip::before {
        content: "\f0eb";
        color: var(--icon-tip-color)
    }
    .admonitionblock td.icon .icon-warning::before {
        content: "\f071";
        color: var(--icon-warning-color)
    }
    .admonitionblock td.icon .icon-caution::before {
        content: "\f06d";
        color: var(--icon-caution-color)
    }
    .admonitionblock td.icon .icon-important::before {
        content: "\f06a";
        color: var(--icon-important-color)
    }
    .conum[data-value] {
        display: inline-block;
        color: #fff !important;
        background-color: rgba(100, 100, 0, .8);
        -webkit-border-radius: 100px;
        border-radius: 100px;
        text-align: center;
        font-size: .75em;
        width: 1.67em;
        height: 1.67em;
        line-height: 1.67em;
        font-family: "Open Sans", "DejaVu Sans", sans-serif;
        font-style: normal;
        font-weight: bold
    }
    .conum[data-value] * {
        color: #fff !important
    }
    .conum[data-value]+b {
        display: none
    }
    .conum[data-value]::after {
        content: attr(data-value)
    }
    pre .conum[data-value] {
        position: relative;
        top: -.125em
    }
    b.conum * {
        color: inherit !important
    }
    .conum:not([data-value]):empty {
        display: none
    }
  3. The previous file uses variables from a partial copy of the theme-vars.css file that changes the highlighted code background color and adds the color definitions used by the admonitions:
    theme-vars.css
    :root {
        /* Solarized base2 */
        /* --hljs-bg: rgb(238, 232, 213); */
        /* Solarized base3 */
        /* --hljs-bg: rgb(253, 246, 227); */
        /* Solarized base02 */
        --hljs-bg: rgb(7, 54, 66);
        /* Solarized base03 */
        /* --hljs-bg: rgb(0, 43, 54); */
        /* Default asciidoctor theme colors */
        --icon-note-color: #19407c;
        --icon-tip-color: var(--primary);
        --icon-warning-color: #bf6900;
        --icon-caution-color: #bf3400;
        --icon-important-color: #bf0000
    }
    .dark {
        --hljs-bg: rgb(7, 54, 66);
        /* Asciidoctor theme colors with tint for dark background */
        --icon-note-color: #3e7bd7;
        --icon-tip-color: var(--primary);
        --icon-warning-color: #ff8d03;
        --icon-caution-color: #ff7847;
        --icon-important-color: #ff3030
    }
  4. The previous styles use font-awesome, so I've downloaded its resources for version 4.7.0 (the one used by asciidoctor), storing font-awesome.css in the assets/css/extended dir (that way it is merged with the rest of the .css files) and copying the fonts to the static/assets/fonts/ dir (so they are served directly):
    FA_BASE_URL="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0"
    curl "$FA_BASE_URL/css/font-awesome.css" \
      > assets/css/extended/font-awesome.css
    for f in FontAwesome.otf fontawesome-webfont.eot \
      fontawesome-webfont.svg fontawesome-webfont.ttf \
      fontawesome-webfont.woff fontawesome-webfont.woff2; do
        curl "$FA_BASE_URL/fonts/$f" > "static/assets/fonts/$f"
    done
  5. As already said, the default highlighter is disabled (it provided a css compatible with rouge), so we need a css to do the highlight styling; as rouge provides a way to export styles, I've created the assets/css/extended/rouge.css file with the thankful_eyes theme:
    rougify style thankful_eyes > assets/css/extended/rouge.css
  6. To support the use of the html5s backend with admonitions I've added a variation of the example found on this blog post to assets/js/adoc-admonitions.js:
    adoc-admonitions.js
    // replace the default admonitions block with a table that uses a format
    // similar to the standard asciidoctor ... as we are using fa-icons here there
    // is no need to add the icons: font entry on the document.
    window.addEventListener('load', function () {
      const admonitions = document.getElementsByClassName('admonition-block')
      for (let i = admonitions.length - 1; i >= 0; i--) {
        const elm = admonitions[i]
        const type = elm.classList[1]
        const title = elm.getElementsByClassName('block-title')[0];
        const label = title.getElementsByClassName('title-label')[0]
            .innerHTML.slice(0, -1);
        elm.removeChild(elm.getElementsByClassName('block-title')[0]);
        const text = elm.innerHTML
        const parent = elm.parentNode
        const tempDiv = document.createElement('div')
        tempDiv.innerHTML = `<div class="admonitionblock ${type}">
        <table>
          <tbody>
            <tr>
              <td class="icon">
                <i class="fa icon-${type}" title="${label}"></i>
              </td>
              <td class="content">
                ${text}
              </td>
            </tr>
          </tbody>
        </table>
      </div>`
        const input = tempDiv.childNodes[0]
        parent.replaceChild(input, elm)
      }
    })
    and enabled its minified use on the layouts/partials/extend_footer.html file adding the following lines to it:
    {{- $admonitions := slice (resources.Get "js/adoc-admonitions.js")
      | resources.Concat "assets/js/adoc-admonitions.js" | minify | fingerprint }}
    <script defer crossorigin="anonymous" src="{{ $admonitions.RelPermalink }}"
      integrity="{{ $admonitions.Data.Integrity }}"></script>

Remark42 configuration
To integrate Remark42 with the PaperMod theme I've created the file layouts/partials/comments.html with the following content, based on the remark42 documentation, including extra code to sync the dark/light setting with the one set on the site:
comments.html
<div id="remark42"></div>
<script>
  var remark_config = {
    host: {{ .Site.Params.remark42Url }},
    site_id: {{ .Site.Params.remark42SiteID }},
    url: {{ .Permalink }},
    locale: {{ .Site.Language.Lang }}
  };
  (function(c) {
    /* Adjust the theme using the local-storage pref-theme if set */
    if (localStorage.getItem("pref-theme") === "dark") {
      remark_config.theme = "dark";
    } else if (localStorage.getItem("pref-theme") === "light") {
      remark_config.theme = "light";
    }
    /* Add remark42 widget */
    for(var i = 0; i < c.length; i++) {
      var d = document, s = d.createElement('script');
      s.src = remark_config.host + '/web/' + c[i] +'.js';
      s.defer = true;
      (d.head || d.body).appendChild(s);
    }
  })(remark_config.components || ['embed']);
</script>
In development I use it with anonymous comments enabled, but to avoid SPAM the production site uses social logins (for now I've only enabled Github & Google; if someone requests additional services I'll check them, but those were the easy ones for me initially). To support theme switching with remark42 I've also added the following inside the layouts/partials/extend_footer.html file:
{{- if (not site.Params.disableThemeToggle) }}
<script>
/* Function to change theme when the toggle button is pressed */
document.getElementById("theme-toggle").addEventListener("click", () => {
  if (typeof window.REMARK42 != "undefined") {
    if (document.body.className.includes('dark')) {
      window.REMARK42.changeTheme('light');
    } else {
      window.REMARK42.changeTheme('dark');
    }
  }
});
</script>
{{- end }}
With this code, if the theme-toggle button is pressed we change the remark42 theme before the PaperMod one (that's needed only here; on page loads the remark42 theme is synced with the main one using the code from the layouts/partials/comments.html shown earlier).

Development setup
To preview the site on my laptop I'm using docker-compose with the following configuration:
docker-compose.yaml
version: "2"
services:
  hugo:
    build:
      context: ./docker/hugo-adoc
      dockerfile: ./Dockerfile
    image: sto/hugo-adoc
    container_name: hugo-adoc-blogops
    restart: always
    volumes:
      - .:/documents
    command: server --bind 0.0.0.0 -D -F
    user: ${APP_UID}:${APP_GID}
  nginx:
    image: nginx:latest
    container_name: nginx-blogops
    restart: always
    volumes:
      - ./nginx/default.conf:/etc/nginx/conf.d/default.conf
    ports:
      - 1313:1313
  remark42:
    build:
      context: ./docker/remark42
      dockerfile: ./Dockerfile
    image: sto/remark42
    container_name: remark42-blogops
    restart: always
    env_file:
      - ./.env
      - ./remark42/env.dev
    volumes:
      - ./remark42/var.dev:/srv/var
To run it properly we have to create the .env file with the current user ID and GID in the variables APP_UID and APP_GID (if we don't do it the files can end up being owned by a user that is not the same as the one running the services):
$ echo "APP_UID=$(id -u)\nAPP_GID=$(id -g)" > .env
The Dockerfile used to generate the sto/hugo-adoc image is:
Dockerfile
FROM asciidoctor/docker-asciidoctor:latest
RUN gem install --no-document asciidoctor-html5s &&\
 apk update && apk add --no-cache curl libc6-compat &&\
 repo_path="gohugoio/hugo" &&\
 api_url="https://api.github.com/repos/$repo_path/releases/latest" &&\
 download_url="$(\
  curl -sL "$api_url"  \
  sed -n "s/^.*download_url\": \"\\(.*.extended.*Linux-64bit.tar.gz\)\"/\1/p"\
 )" &&\
 curl -sL "$download_url" -o /tmp/hugo.tgz &&\
 tar xf /tmp/hugo.tgz hugo &&\
 install hugo /usr/bin/ &&\
 rm -f hugo /tmp/hugo.tgz &&\
 /usr/bin/hugo version &&\
 apk del curl && rm -rf /var/cache/apk/*
# Expose port for live server
EXPOSE 1313
ENTRYPOINT ["/usr/bin/hugo"]
CMD [""]
If you review it you will see that I'm using the docker-asciidoctor image as the base; the idea is that this image has all I need to work with asciidoctor, and to use hugo I only need to download the binary from their latest release at github (as we are using an image based on alpine we also need to install the libc6-compat package, but once that is done things are working fine for me so far). The image does not launch the server by default because I don't want it to; in fact I use the same docker-compose.yml file to publish the site in production, simply calling the container without the arguments passed on the docker-compose.yml file (see later). When running the containers with docker-compose up (or docker compose up if you have the docker-compose-plugin package installed) we also launch a nginx container and the remark42 service so we can test everything together. The Dockerfile for the remark42 image is the original one with an updated version of the init.sh script:
Dockerfile
FROM umputun/remark42:latest
COPY init.sh /init.sh
The updated init.sh is similar to the original, but allows us to use an APP_GID variable and updates the /etc/group file of the container so the files get the right user and group (with the original script the group is always 1001):
init.sh
#!/sbin/dinit /bin/sh
uid="$(id -u)"
if [ "$ uid " -eq "0" ]; then
  echo "init container"
  # set container's time zone
  cp "/usr/share/zoneinfo/$ TIME_ZONE " /etc/localtime
  echo "$ TIME_ZONE " >/etc/timezone
  echo "set timezone $ TIME_ZONE  ($(date))"
  # set UID & GID for the app
  if [ "$ APP_UID " ]   [ "$ APP_GID " ]; then
    [ "$ APP_UID " ]   APP_UID="1001"
    [ "$ APP_GID " ]   APP_GID="$ APP_UID "
    echo "set custom APP_UID=$ APP_UID  & APP_GID=$ APP_GID "
    sed -i "s/^app:x:1001:1001:/app:x:$ APP_UID :$ APP_GID :/" /etc/passwd
    sed -i "s/^app:x:1001:/app:x:$ APP_GID :/" /etc/group
  else
    echo "custom APP_UID and/or APP_GID not defined, using 1001:1001"
  fi
  chown -R app:app /srv /home/app
fi
echo "prepare environment"
# replace  % REMARK_URL %  by content of REMARK_URL variable
find /srv -regex '.*\.\(html\ js\ mjs\)$' -print \
  -exec sed -i "s % REMARK_URL % $ REMARK_URL  g"   \;
if [ -n "$ SITE_ID " ]; then
  #replace "site_id: 'remark'" by SITE_ID
  sed -i "s 'remark' '$ SITE_ID ' g" /srv/web/*.html
fi
echo "execute \"$*\""
if [ "$ uid " -eq "0" ]; then
  exec su-exec app "$@"
else
  exec "$@"
fi
The environment file used with remark42 for development is quite minimal:
env.dev
TIME_ZONE=Europe/Madrid
REMARK_URL=http://localhost:1313/remark42
SITE=blogops
SECRET=123456
ADMIN_SHARED_ID=sto
AUTH_ANON=true
EMOJI=true
And the nginx/default.conf file used to publish the service locally is simple too:
default.conf
server {
 listen 1313;
 server_name localhost;
 location / {
    proxy_pass http://hugo:1313;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
  }
 location /remark42/ {
    rewrite /remark42/(.*) /$1 break;
    proxy_pass http://remark42:8080/;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
  }
}

Production setup
The VM where I'm publishing the blog runs Debian GNU/Linux and uses binaries from local packages and applications packaged inside containers. To run the containers I'm using docker-ce (I could have used podman instead, but I already had it installed on the machine, so I stayed with it). The binaries used on this project are included on the following packages from the main Debian repository:
  • git to clone & pull the repository,
  • jq to parse json files from shell scripts,
  • json2file-go to save the webhook messages to files,
  • inotify-tools to detect when new files are stored by json2file-go and launch scripts to process them,
  • nginx to publish the site using HTTPS and work as proxy for json2file-go and remark42 (I run it using a container),
  • task-spooler to queue the scripts that update the deployment.
And I'm using docker and docker compose from the debian packages on the docker repository (see the sketch after this list for the installation):
  • docker-ce to run the containers,
  • docker-compose-plugin to run docker compose (it is a plugin, so no - in the name).
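A hedged sketch of how those packages can be installed, assuming the Docker apt repository has already been added following the upstream instructions for Debian (deployuser is a placeholder for the regular user mentioned below):
# Assumes the Docker apt repository is already configured on the system
sudo apt update
sudo apt install docker-ce docker-compose-plugin
# Let the regular deployment user run docker without sudo (re-login required)
sudo usermod -aG docker deployuser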

Repository checkout
To manage the git repository I've created a deploy key, added it to gitea and cloned the project to the /srv/blogops path (that directory is owned by a regular user that has permissions to run docker, as said before).
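A minimal sketch of that checkout, assuming an ed25519 key and the clone URL that appears later in the webhook script; the key path and whether ssh or https is used for cloning are assumptions:
# Generate a dedicated deploy key and add the public part to the gitea project
ssh-keygen -t ed25519 -f ~/.ssh/blogops_deploy -C "blogops deploy key"
cat ~/.ssh/blogops_deploy.pub   # add this as a read-only deploy key in gitea
# Clone the project to the path used by the rest of the scripts
git clone https://gitea.mixinet.net/mixinet/blogops.git /srv/blogops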

Compiling the site with hugo
To compile the site we use the docker-compose.yml file seen before; to be able to run it we first build the container images and, once we have them, we launch hugo using docker compose run:
$ cd /srv/blogops
$ git pull
$ docker compose build
$ if [ -d "./public" ]; then rm -rf ./public; fi
$ docker compose run hugo --
The compilation leaves the static HTML on /srv/blogops/public (we remove the directory first because hugo does not clean the destination folder as jekyll does). The deploy script re-generates the site as described and moves the public directory to its final place for publishing.

Running remark42 with docker
In the /srv/blogops/remark42 folder I have the following docker-compose.yml:
docker-compose.yml
version: "2"
services:
  remark42:
    build:
      context: ../docker/remark42
      dockerfile: ./Dockerfile
    image: sto/remark42
    env_file:
      - ../.env
      - ./env.prod
    container_name: remark42
    restart: always
    volumes:
      - ./var.prod:/srv/var
    ports:
      - 127.0.0.1:8042:8080
The ../.env file is loaded to get the APP_UID and APP_GID variables that are used by my version of the init.sh script to adjust file permissions, and the env.prod file contains the rest of the settings for remark42, including the social network tokens (see the remark42 documentation for the available parameters; I don't include my configuration here because some of them are secrets).

Nginx configuration
The nginx configuration for the blogops.mixinet.net site is as simple as:
server {
  listen 443 ssl http2;
  server_name blogops.mixinet.net;
  ssl_certificate /etc/letsencrypt/live/blogops.mixinet.net/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/blogops.mixinet.net/privkey.pem;
  include /etc/letsencrypt/options-ssl-nginx.conf;
  ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
  access_log /var/log/nginx/blogops.mixinet.net-443.access.log;
  error_log  /var/log/nginx/blogops.mixinet.net-443.error.log;
  root /srv/blogops/nginx/public_html;
  location / {
    try_files $uri $uri/ =404;
  }
  include /srv/blogops/nginx/remark42.conf;
}
server {
  listen 80 ;
  listen [::]:80 ;
  server_name blogops.mixinet.net;
  access_log /var/log/nginx/blogops.mixinet.net-80.access.log;
  error_log  /var/log/nginx/blogops.mixinet.net-80.error.log;
  if ($host = blogops.mixinet.net) {
    return 301 https://$host$request_uri;
  }
  return 404;
}
In this configuration the certificates are managed by certbot and the server root directory is /srv/blogops/nginx/public_html, not /srv/blogops/public; the reason is that I want to be able to compile without affecting the running site: the deployment script generates the site on /srv/blogops/public and, if all works well, we rename folders to do the switch, making the change feel almost atomic (a sketch of the swap follows; the full logic is in the webhook script at the end of the post).
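A minimal sketch of that folder swap, mirroring what the deploy script shown later does (the timestamp suffix is just the one used to keep the previous copy around):
# Keep the previous copy with a timestamp suffix and move the new build in place
TS="$(date +%Y%m%d-%H%M%S)"
[ -d /srv/blogops/nginx/public_html ] &&
  mv /srv/blogops/nginx/public_html "/srv/blogops/nginx/public_html-$TS"
mv /srv/blogops/public /srv/blogops/nginx/public_html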

json2file-go configuration
As I have a working WireGuard VPN between the machine running gitea at my home and the VM where the blog is served, I'm going to configure json2file-go to listen for connections on a high port, using a self-signed certificate, on IP addresses only reachable through the VPN. To do it we create a systemd socket to run json2file-go and adjust its configuration to listen on a private IP (we use the FreeBind option on its definition to be able to launch the service even when the IP is not available, that is, when the VPN is down). The following script can be used to set up the json2file-go configuration:
setup-json2file.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
BASE_DIR="/srv/blogops/webhook"
J2F_DIR="$BASE_DIR/json2file"
TLS_DIR="$BASE_DIR/tls"
J2F_SERVICE_NAME="json2file-go"
J2F_SERVICE_DIR="/etc/systemd/system/json2file-go.service.d"
J2F_SERVICE_OVERRIDE="$J2F_SERVICE_DIR/override.conf"
J2F_SOCKET_DIR="/etc/systemd/system/json2file-go.socket.d"
J2F_SOCKET_OVERRIDE="$J2F_SOCKET_DIR/override.conf"
J2F_BASEDIR_FILE="/etc/json2file-go/basedir"
J2F_DIRLIST_FILE="/etc/json2file-go/dirlist"
J2F_CRT_FILE="/etc/json2file-go/certfile"
J2F_KEY_FILE="/etc/json2file-go/keyfile"
J2F_CRT_PATH="$TLS_DIR/crt.pem"
J2F_KEY_PATH="$TLS_DIR/key.pem"
# ----
# MAIN
# ----
# Install packages used with json2file for the blogops site
sudo apt update
sudo apt install -y json2file-go uuid
if [ -z "$(type mkcert)" ]; then
  sudo apt install -y mkcert
fi
sudo apt clean
# Configuration file values
J2F_USER="$(id -u)"
J2F_GROUP="$(id -g)"
J2F_DIRLIST="blogops:$(uuid)"
J2F_LISTEN_STREAM="172.31.31.1:4443"
# Configure json2file
[ -d "$J2F_DIR" ]   mkdir "$J2F_DIR"
sudo sh -c "echo '$J2F_DIR' >'$J2F_BASEDIR_FILE'"
[ -d "$TLS_DIR" ]   mkdir "$TLS_DIR"
if [ ! -f "$J2F_CRT_PATH" ]   [ ! -f "$J2F_KEY_PATH" ]; then
  mkcert -cert-file "$J2F_CRT_PATH" -key-file "$J2F_KEY_PATH" "$(hostname -f)"
fi
sudo sh -c "echo '$J2F_CRT_PATH' >'$J2F_CRT_FILE'"
sudo sh -c "echo '$J2F_KEY_PATH' >'$J2F_KEY_FILE'"
sudo sh -c "cat >'$J2F_DIRLIST_FILE'" <<EOF
$(echo "$J2F_DIRLIST"   tr ';' '\n')
EOF
# Service override
[ -d "$J2F_SERVICE_DIR" ]   sudo mkdir "$J2F_SERVICE_DIR"
sudo sh -c "cat >'$J2F_SERVICE_OVERRIDE'" <<EOF
[Service]
User=$J2F_USER
Group=$J2F_GROUP
EOF
# Socket override
[ -d "$J2F_SOCKET_DIR" ]   sudo mkdir "$J2F_SOCKET_DIR"
sudo sh -c "cat >'$J2F_SOCKET_OVERRIDE'" <<EOF
[Socket]
# Set FreeBind to listen on missing addresses (the VPN can be down sometimes)
FreeBind=true
# Set ListenStream to nothing to clear its value and add the new value later
ListenStream=
ListenStream=$J2F_LISTEN_STREAM
EOF
# Restart and enable service
sudo systemctl daemon-reload
sudo systemctl stop "$J2F_SERVICE_NAME"
sudo systemctl start "$J2F_SERVICE_NAME"
sudo systemctl enable "$J2F_SERVICE_NAME"
# ----
# vim: ts=2:sw=2:et:ai:sts=2
Warning: the script uses mkcert to create the temporary certificates; to install the package on bullseye the backports repository must be available (a sketch of how to enable it follows).
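A hedged sketch of enabling backports for that, assuming the standard Debian mirror (use whatever mirror is already configured in /etc/apt/sources.list):
# Enable bullseye-backports and install mkcert from it
echo "deb http://deb.debian.org/debian bullseye-backports main" |
  sudo tee /etc/apt/sources.list.d/bullseye-backports.list
sudo apt update
sudo apt install -t bullseye-backports mkcert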

Gitea configuration
To make gitea use our json2file-go server we go to the project and enter the hooks/gitea/new page; once there we create a new webhook of type gitea, set the target URL to https://172.31.31.1:4443/blogops and, in the secret field, put the token generated with uuid by the setup script:
sed -n -e 's/blogops://p' /etc/json2file-go/dirlist
The rest of the settings can be left as they are:
  • Trigger on: Push events
  • Branch filter: *
Warning: we are using an internal IP and a self-signed certificate, which means we have to make sure that the webhook section of the app.ini of our gitea server allows us to call the IP and skips TLS verification (you can see the available options in the gitea documentation). The [webhook] section of my server looks like this:
[webhook]
ALLOWED_HOST_LIST=private
SKIP_TLS_VERIFY=true
Once we have the webhook configured we can test it, and if it works our json2file server will store the payload as a file in the /srv/blogops/webhook/json2file/blogops/ folder (a quick way to check the stored payloads is sketched below).
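A quick hedged check that payloads are arriving and contain the fields the processing script relies on (.ref and .after), assuming jq is already installed:
# Show the newest stored payload and the ref/commit it refers to
latest="$(ls -1t /srv/blogops/webhook/json2file/blogops/ | head -n 1)"
jq -r '.ref, .after' "/srv/blogops/webhook/json2file/blogops/$latest"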

The json2file spooler script
With the previous configuration our system is ready to receive webhook calls from gitea and store the messages on files, but we have to do something to process those files once they are saved on our machine. An option could be to use a cronjob to look for new files, but we can do better on Linux using inotify: we will use the inotifywait command from inotify-tools to watch the json2file output directory and execute a script each time a new file is moved inside it or closed after writing (IN_CLOSE_WRITE and IN_MOVED_TO events). To avoid concurrency problems we are going to use task-spooler to launch the scripts that process the webhooks using a queue of length 1, so they are executed one by one in a FIFO queue. The spooler script is this:
blogops-spooler.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
BASE_DIR="/srv/blogops/webhook"
BIN_DIR="$BASE_DIR/bin"
TSP_DIR="$BASE_DIR/tsp"
WEBHOOK_COMMAND="$BIN_DIR/blogops-webhook.sh"
# ---------
# FUNCTIONS
# ---------
queue_job() {
  echo "Queuing job to process file '$1'"
  TMPDIR="$TSP_DIR" TS_SLOTS="1" TS_MAXFINISHED="10" \
    tsp -n "$WEBHOOK_COMMAND" "$1"
}
# ----
# MAIN
# ----
INPUT_DIR="$1"
if [ ! -d "$INPUT_DIR" ]; then
  echo "Input directory '$INPUT_DIR' does not exist, aborting!"
  exit 1
fi
[ -d "$TSP_DIR" ]   mkdir "$TSP_DIR"
echo "Processing existing files under '$INPUT_DIR'"
find "$INPUT_DIR" -type f   sort   while read -r _filename; do
  queue_job "$_filename"
done
# Use inotifywatch to process new files
echo "Watching for new files under '$INPUT_DIR'"
inotifywait -q -m -e close_write,moved_to --format "%w%f" -r "$INPUT_DIR" |
  while read -r _filename; do
    queue_job "$_filename"
  done
# ----
# vim: ts=2:sw=2:et:ai:sts=2
To run it as a daemon we install it as a systemd service using the following script:
setup-spooler.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
BASE_DIR="/srv/blogops/webhook"
BIN_DIR="$BASE_DIR/bin"
J2F_DIR="$BASE_DIR/json2file"
SPOOLER_COMMAND="$BIN_DIR/blogops-spooler.sh '$J2F_DIR'"
SPOOLER_SERVICE_NAME="blogops-j2f-spooler"
SPOOLER_SERVICE_FILE="/etc/systemd/system/$SPOOLER_SERVICE_NAME.service"
# Configuration file values
J2F_USER="$(id -u)"
J2F_GROUP="$(id -g)"
# ----
# MAIN
# ----
# Install packages used with the webhook processor
sudo apt update
sudo apt install -y inotify-tools jq task-spooler
sudo apt clean
# Configure process service
sudo sh -c "cat > $SPOOLER_SERVICE_FILE" <<EOF
[Install]
WantedBy=multi-user.target
[Unit]
Description=json2file processor for $J2F_USER
After=docker.service
[Service]
Type=simple
User=$J2F_USER
Group=$J2F_GROUP
ExecStart=$SPOOLER_COMMAND
EOF
# Restart and enable service
sudo systemctl daemon-reload
sudo systemctl stop "$SPOOLER_SERVICE_NAME"   true
sudo systemctl start "$SPOOLER_SERVICE_NAME"
sudo systemctl enable "$SPOOLER_SERVICE_NAME"
# ----
# vim: ts=2:sw=2:et:ai:sts=2

The gitea webhook processor
Finally, the script that processes the JSON files does the following:
  1. First, it checks if the repository and branch are right,
  2. Then, it fetches and checks out the commit referenced on the JSON file,
  3. Once the files are updated, compiles the site using hugo with docker compose,
  4. If the compilation succeeds the script renames directories to swap the old version of the site by the new one.
If there is a failure the script aborts, but before doing so (and also when the swap succeeds) the system sends an email to the configured address and/or the user that pushed the updates to the repository, with a log of what happened. The current script is this one:
blogops-webhook.sh
#!/bin/sh
set -e
# ---------
# VARIABLES
# ---------
# Values
REPO_REF="refs/heads/main"
REPO_CLONE_URL="https://gitea.mixinet.net/mixinet/blogops.git"
MAIL_PREFIX="[BLOGOPS-WEBHOOK] "
# Address that gets all messages, leave it empty if not wanted
MAIL_TO_ADDR="blogops@mixinet.net"
# If the following variable is set to 'true' the pusher gets mail on failures
MAIL_ERRFILE="false"
# If the following variable is set to 'true' the pusher gets mail on success
MAIL_LOGFILE="false"
# gitea's conf/app.ini value of NO_REPLY_ADDRESS, it is used for email domains
# when the KeepEmailPrivate option is enabled for a user
NO_REPLY_ADDRESS="noreply.example.org"
# Directories
BASE_DIR="/srv/blogops"
PUBLIC_DIR="$BASE_DIR/public"
NGINX_BASE_DIR="$BASE_DIR/nginx"
PUBLIC_HTML_DIR="$NGINX_BASE_DIR/public_html"
WEBHOOK_BASE_DIR="$BASE_DIR/webhook"
WEBHOOK_SPOOL_DIR="$WEBHOOK_BASE_DIR/spool"
WEBHOOK_ACCEPTED="$WEBHOOK_SPOOL_DIR/accepted"
WEBHOOK_DEPLOYED="$WEBHOOK_SPOOL_DIR/deployed"
WEBHOOK_REJECTED="$WEBHOOK_SPOOL_DIR/rejected"
WEBHOOK_TROUBLED="$WEBHOOK_SPOOL_DIR/troubled"
WEBHOOK_LOG_DIR="$WEBHOOK_SPOOL_DIR/log"
# Files
TODAY="$(date +%Y%m%d)"
OUTPUT_BASENAME="$(date +%Y%m%d-%H%M%S.%N)"
WEBHOOK_LOGFILE_PATH="$WEBHOOK_LOG_DIR/$OUTPUT_BASENAME.log"
WEBHOOK_ACCEPTED_JSON="$WEBHOOK_ACCEPTED/$OUTPUT_BASENAME.json"
WEBHOOK_ACCEPTED_LOGF="$WEBHOOK_ACCEPTED/$OUTPUT_BASENAME.log"
WEBHOOK_REJECTED_TODAY="$WEBHOOK_REJECTED/$TODAY"
WEBHOOK_REJECTED_JSON="$WEBHOOK_REJECTED_TODAY/$OUTPUT_BASENAME.json"
WEBHOOK_REJECTED_LOGF="$WEBHOOK_REJECTED_TODAY/$OUTPUT_BASENAME.log"
WEBHOOK_DEPLOYED_TODAY="$WEBHOOK_DEPLOYED/$TODAY"
WEBHOOK_DEPLOYED_JSON="$WEBHOOK_DEPLOYED_TODAY/$OUTPUT_BASENAME.json"
WEBHOOK_DEPLOYED_LOGF="$WEBHOOK_DEPLOYED_TODAY/$OUTPUT_BASENAME.log"
WEBHOOK_TROUBLED_TODAY="$WEBHOOK_TROUBLED/$TODAY"
WEBHOOK_TROUBLED_JSON="$WEBHOOK_TROUBLED_TODAY/$OUTPUT_BASENAME.json"
WEBHOOK_TROUBLED_LOGF="$WEBHOOK_TROUBLED_TODAY/$OUTPUT_BASENAME.log"
# Query to get variables from a gitea webhook json
ENV_VARS_QUERY="$(
  printf "%s" \
    '(.           | @sh "gt_ref=\(.ref);"),' \
    '(.           | @sh "gt_after=\(.after);"),' \
    '(.repository | @sh "gt_repo_clone_url=\(.clone_url);"),' \
    '(.repository | @sh "gt_repo_name=\(.name);"),' \
    '(.pusher     | @sh "gt_pusher_full_name=\(.full_name);"),' \
    '(.pusher     | @sh "gt_pusher_email=\(.email);")'
)"
# ---------
# Functions
# ---------
webhook_log() {
  echo "$(date -R) $*" >>"$WEBHOOK_LOGFILE_PATH"
}
webhook_check_directories() {
  for _d in "$WEBHOOK_SPOOL_DIR" "$WEBHOOK_ACCEPTED" "$WEBHOOK_DEPLOYED" \
    "$WEBHOOK_REJECTED" "$WEBHOOK_TROUBLED" "$WEBHOOK_LOG_DIR"; do
    [ -d "$_d" ]   mkdir "$_d"
  done
 
webhook_clean_directories()  
  # Try to remove empty dirs
  for _d in "$WEBHOOK_ACCEPTED" "$WEBHOOK_DEPLOYED" "$WEBHOOK_REJECTED" \
    "$WEBHOOK_TROUBLED" "$WEBHOOK_LOG_DIR" "$WEBHOOK_SPOOL_DIR"; do
    if [ -d "$_d" ]; then
      rmdir "$_d" 2>/dev/null   true
    fi
  done
}
webhook_accept() {
  webhook_log "Accepted: $*"
  mv "$WEBHOOK_JSON_INPUT_FILE" "$WEBHOOK_ACCEPTED_JSON"
  mv "$WEBHOOK_LOGFILE_PATH" "$WEBHOOK_ACCEPTED_LOGF"
  WEBHOOK_LOGFILE_PATH="$WEBHOOK_ACCEPTED_LOGF"
 
webhook_reject()  
  [ -d "$WEBHOOK_REJECTED_TODAY" ]   mkdir "$WEBHOOK_REJECTED_TODAY"
  webhook_log "Rejected: $*"
  if [ -f "$WEBHOOK_JSON_INPUT_FILE" ]; then
    mv "$WEBHOOK_JSON_INPUT_FILE" "$WEBHOOK_REJECTED_JSON"
  fi
  mv "$WEBHOOK_LOGFILE_PATH" "$WEBHOOK_REJECTED_LOGF"
  exit 0
}
webhook_deployed() {
  [ -d "$WEBHOOK_DEPLOYED_TODAY" ] || mkdir "$WEBHOOK_DEPLOYED_TODAY"
  webhook_log "Deployed: $*"
  mv "$WEBHOOK_ACCEPTED_JSON" "$WEBHOOK_DEPLOYED_JSON"
  mv "$WEBHOOK_ACCEPTED_LOGF" "$WEBHOOK_DEPLOYED_LOGF"
  WEBHOOK_LOGFILE_PATH="$WEBHOOK_DEPLOYED_LOGF"
 
webhook_troubled()  
  [ -d "$WEBHOOK_TROUBLED_TODAY" ]   mkdir "$WEBHOOK_TROUBLED_TODAY"
  webhook_log "Troubled: $*"
  mv "$WEBHOOK_ACCEPTED_JSON" "$WEBHOOK_TROUBLED_JSON"
  mv "$WEBHOOK_ACCEPTED_LOGF" "$WEBHOOK_TROUBLED_LOGF"
  WEBHOOK_LOGFILE_PATH="$WEBHOOK_TROUBLED_LOGF"
 
print_mailto()  
  _addr="$1"
  _user_email=""
  # Add the pusher email address unless it is from the domain NO_REPLY_ADDRESS,
  # which should match the value of that variable on the gitea 'app.ini' (it
  # is the domain used for emails when the user hides it).
  # shellcheck disable=SC2154
  if [ -n "$ gt_pusher_email##*@"$ NO_REPLY_ADDRESS " " ] &&
    [ -z "$ gt_pusher_email##*@* " ]; then
    _user_email="\"$gt_pusher_full_name <$gt_pusher_email>\""
  fi
  if [ "$_addr" ] && [ "$_user_email" ]; then
    echo "$_addr,$_user_email"
  elif [ "$_user_email" ]; then
    echo "$_user_email"
  elif [ "$_addr" ]; then
    echo "$_addr"
  fi
}
mail_success() {
  to_addr="$MAIL_TO_ADDR"
  if [ "$MAIL_LOGFILE" = "true" ]; then
    to_addr="$(print_mailto "$to_addr")"
  fi
  if [ "$to_addr" ]; then
    # shellcheck disable=SC2154
    subject="OK - $gt_repo_name updated to commit '$gt_after'"
    mail -s "$ MAIL_PREFIX $ subject " "$to_addr" \
      <"$WEBHOOK_LOGFILE_PATH"
  fi
}
mail_failure() {
  to_addr="$MAIL_TO_ADDR"
  if [ "$MAIL_ERRFILE" = true ]; then
    to_addr="$(print_mailto "$to_addr")"
  fi
  if [ "$to_addr" ]; then
    # shellcheck disable=SC2154
    subject="KO - $gt_repo_name update FAILED for commit '$gt_after'"
    mail -s "$ MAIL_PREFIX $ subject " "$to_addr" \
      <"$WEBHOOK_LOGFILE_PATH"
  fi
}
# ----
# MAIN
# ----
# Check directories
webhook_check_directories
# Go to the base directory
cd "$BASE_DIR"
# Check if the file exists
WEBHOOK_JSON_INPUT_FILE="$1"
if [ ! -f "$WEBHOOK_JSON_INPUT_FILE" ]; then
  webhook_reject "Input arg '$1' is not a file, aborting"
fi
# Parse the file
webhook_log "Processing file '$WEBHOOK_JSON_INPUT_FILE'"
eval "$(jq -r "$ENV_VARS_QUERY" "$WEBHOOK_JSON_INPUT_FILE")"
# Check that the repository clone url is right
# shellcheck disable=SC2154
if [ "$gt_repo_clone_url" != "$REPO_CLONE_URL" ]; then
  webhook_reject "Wrong repository: '$gt_clone_url'"
fi
# Check that the branch is the right one
# shellcheck disable=SC2154
if [ "$gt_ref" != "$REPO_REF" ]; then
  webhook_reject "Wrong repository ref: '$gt_ref'"
fi
# Accept the file
# shellcheck disable=SC2154
webhook_accept "Processing '$gt_repo_name'"
# Update the checkout
ret="0"
git fetch >>"$WEBHOOK_LOGFILE_PATH" 2>&1 || ret="$?"
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Repository fetch failed"
  mail_failure
fi
# shellcheck disable=SC2154
git checkout "$gt_after" >>"$WEBHOOK_LOGFILE_PATH" 2>&1   ret="$?"
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Repository checkout failed"
  mail_failure
fi
# Remove the build dir if present
if [ -d "$PUBLIC_DIR" ]; then
  rm -rf "$PUBLIC_DIR"
fi
# Build site
docker compose run hugo -- >>"$WEBHOOK_LOGFILE_PATH" 2>&1 || ret="$?"
# go back to the main branch
git switch main && git pull
# Fail if public dir was missing
if [ "$ret" -ne "0" ]   [ ! -d "$PUBLIC_DIR" ]; then
  webhook_troubled "Site build failed"
  mail_failure
fi
# Remove old public_html copies
webhook_log 'Removing old site versions, if present'
find "$NGINX_BASE_DIR" -mindepth 1 -maxdepth 1 -name 'public_html-*' -type d \
  -exec rm -rf {} \; >>"$WEBHOOK_LOGFILE_PATH" 2>&1 || ret="$?"
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Removal of old site versions failed"
  mail_failure
fi
# Switch site directory
TS="$(date +%Y%m%d-%H%M%S)"
if [ -d "$PUBLIC_HTML_DIR" ]; then
  webhook_log "Moving '$PUBLIC_HTML_DIR' to '$PUBLIC_HTML_DIR-$TS'"
  mv "$PUBLIC_HTML_DIR" "$PUBLIC_HTML_DIR-$TS" >>"$WEBHOOK_LOGFILE_PATH" 2>&1  
    ret="$?"
fi
if [ "$ret" -eq "0" ]; then
  webhook_log "Moving '$PUBLIC_DIR' to '$PUBLIC_HTML_DIR'"
  mv "$PUBLIC_DIR" "$PUBLIC_HTML_DIR" >>"$WEBHOOK_LOGFILE_PATH" 2>&1  
    ret="$?"
fi
if [ "$ret" -ne "0" ]; then
  webhook_troubled "Site switch failed"
  mail_failure
else
  webhook_deployed "Site deployed successfully"
  mail_success
fi
# ----
# vim: ts=2:sw=2:et:ai:sts=2
