Search Results: "Michael Prokop"

12 January 2022

Michael Prokop: Revisiting 2021

Uhm yeah, so this shirt didn't age well. :) Mainly to recall what happened, I'm once again revisiting my previous year (previous edition: 2020).

2021 was quite challenging overall. It started with four weeks of distance learning at school. Luckily, at least at school, things got back to "some kind of normal" afterwards. The lockdowns turned out to be an excellent opportunity for practising Geocaching though, and that's what I started to do with my family. It's a great way to grab some fresh air, get to know new areas, and spend time with family and friends; I plan to continue doing this. :) We bought a family season ticket for Freibäder (open-air baths) in Graz; this turned out to be a great investment. I very much enjoyed the open-air swimming with family, as well as swimming laps on my own, and plan to do the same in 2022. Due to the lockdowns and the pandemic, the weekly Badminton sessions sadly didn't really take place, so I pushed towards the above-mentioned outdoor swimming and also some running; with my family we managed to do some cycling, inline skating and even practiced some boulder climbing.

For obvious reasons, plenty of concerts I was looking forward to didn't take place. With my parents we at least managed to attend a concert performance of Puccini's Tosca with Jonas Kaufmann at Schloßbergbühne Kasematten/Graz, and with the kids we saw "Robin Hood" in Oper Graz and "Pippi Langstrumpf" at Studiobühne of Oper Graz. The lack of concerts and rehearsals once again and still severely impacts my playing the drums, including at HTU BigBand Graz. :-/

Grml-wise we managed to publish release 2021.07, codename JauKerl. Debian-wise we got version 11 AKA bullseye released as new stable release in August.

For 2021 I planned to and also managed to minimize buying (new) physical stuff, except for books and other reading stuff. Speaking of reading, 2021 was nice: I managed to finish more than 100 books (see Mein Lesejahr 2021), and I'd like to keep the reading pace. Now let's hope for better times in 2022!

5 July 2021

Michael Prokop: Debian bullseye: changes in util-linux #newinbullseye

Continuing with #newinbullseye. One package that isn't new, but whose tools are used by many of us, is util-linux, providing many essential system utilities. There is util-linux v2.33.1 in Debian/buster and util-linux v2.36.1 in Debian/bullseye, and as usual there are many new features and options available. I don't want to replicate the release notes provided by upstream; instead, make sure to check out the Release highlights sections in the following release notes:

Tools that have been taken over from / moved to other packages

Debian's util-linux source package provides new binary packages: eject (and eject-udeb) and bsdextrautils. The util-linux implementation of /usr/bin/eject is used now, replacing the one previously provided by the eject source package. Overall, from a util-linux perspective the following shifts took place:

Deprecated / removed tools

Tools that are no longer shipped as of Debian/bullseye:

New tools

Debian's bsdutils package (which is provided by the util-linux source package) provides a new tool from util-linux:

The new tools lsirq + irqtop (to monitor kernel interrupts) sadly didn't make it into util-linux's packaging of Debian/bullseye (as without per-CPU data they do not seem mature at this time). The new hardlink tool (to consolidate duplicate files via hardlinks) won't be shipped, as there's an existing hardlink package already.

New features/options

agetty + getty:
--show-issue    display issue file and exit
blkdiscard:
--force         disable all checking
blkid:
-D, --no-part-details      don't print info from partition table
blkzone:
Commands:
open         Open a range of zones.
close        Close a range of zones.
finish       Set a range of zones to Full.
Options:
-f, --force            enforce on block devices used by the system
cfdisk:
--lock[=<mode>]      use exclusive device lock (yes, no or nonblock)
dmesg:
--noescape             don't escape unprintable character
-W, --follow-new       wait and print only new messages
fdisk:
-x, --list-details          like --list but with more details
-n, --noauto-pt             don't create default partition table on empty devices
--lock[=<mode>]             use exclusive device lock (yes, no or nonblock)
fstrim:
-I, --listed-in <list>   trim filesystems listed in specified files
--quiet-unsupported      suppress error messages if trim unsupported
lsblk:
Options:
-E, --dedup <column> de-duplicate output by <column> 
                     (for example 'lsblk --dedup WWN' to de-duplicate devices by WWN number, e.g. multi-path devices)
-M, --merge          group parents of sub-trees (usable for RAIDs, Multi-path)
                     see http://karelzak.blogspot.com/2018/11/lsblk-merge.html
New output columns:
FSVER         filesystem version
PARTTYPENAME  partition type name
DAX           dax-capable device
lscpu:
Options:
-B, --bytes             print sizes in bytes rather than in human readable format
-C, --caches[=<list>]   info about caches in extended readable format
    --output-all        print all available columns for -e, -p or -C
Available output columns for -C:
        ALL-SIZE  size of all system caches
           LEVEL  cache level
            NAME  cache name
        ONE-SIZE  size of one cache
            TYPE  cache type
            WAYS  ways of associativity
    ALLOC-POLICY  allocation policy
    WRITE-POLICY  write policy
        PHY-LINE  number of physical cache line per cache t
            SETS  number of sets in the cache; set lines has the same cache index
   COHERENCY-SIZE  minimum amount of data in bytes transferred from memory to cache         
lslogins:
--lastlog <path>     set an alternate path for lastlog
lsns:
-t, --type time      namespace type time is also supported now (next to mnt, net, ipc, user, pid, uts, cgroup)
mkswap:
--lock[=<mode>]      use exclusive device lock (yes, no or nonblock)
more:
Options:
-n, --lines <number>  the number of lines per screenful
New long options (in addition to the listed equivalent short options):
  --silent       - equivalent to -d
  --logical      - equivalent to -f
  --no-pause     - equivalent to -l
  --print-over   - equivalent to -c
  --clean-print  - equivalent to -p
  --squeeze      - equivalent to -s
  --plain        - equivalent to -u
mount:
Options:
--target-prefix <path>  specifies path use for all mountpoints
Source:
ID=<id>                 specifies device by udev hardware ID
mountpoint:
--nofollow     do not follow symlink
nsenter:
-T, --time[=<file>]    enter time namespace
script:
-I, --log-in <file>           log stdin to file
-O, --log-out <file>          log stdout to file (default)
-B, --log-io <file>           log stdin and stdout to file
-T, --log-timing <file>       log timing information to file
-m, --logging-format <name>   force to 'classic' or 'advanced' format
-E, --echo <when>             echo input (auto, always or never)
sfdisk:
--disk-id <dev> [<str>]           print or change disk label ID (UUID)
--relocate <oper> <dev>           move partition header
--move-use-fsync                  use fsync after each write when move data
--lock[=<mode>]                   use exclusive device lock (yes, no or nonblock)
unshare:
-T, --time[=<file>]       unshare time namespace
--map-user=<uid> <name>   map current user to uid (implies --user)
--map-group=<gid> <name>  map current group to gid (implies --user)
-c, --map-current-user    map current user to itself (implies --user)
--keep-caps               retain capabilities granted in user namespaces
-R, --root=<dir>          run the command with root directory set to <dir>
-w, --wd=<dir>            change working directory to <dir>
-S, --setuid <uid>        set uid in entered namespace
-G, --setgid <gid>        set gid in entered namespace
--monotonic <offset>      set clock monotonic offset (seconds) in time namespaces
--boottime <offset>       set clock boottime offset (seconds) in time namespaces
wipefs:
--lock[=<mode>] use exclusive device lock (yes, no or nonblock)
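When the same script has to run on both buster and bullseye hosts, it can help to probe for one of the options listed above before relying on it. The following is only a minimal sketch, assuming the tool prints its options via --help (as the util-linux tools do); the helper names supports_option and follow_kernel_log are made up for illustration and are not part of util-linux:

```shell
# Hypothetical helper (not part of util-linux): succeeds if the --help
# output of the given command mentions the given option string.
# Note: this is a plain substring match, so probing for "--follow"
# would also match "--follow-new".
supports_option() {
    # $1 = command to probe, $2 = option to look for (e.g. --follow-new)
    "$1" --help 2>&1 | grep -qF -- "$2"
}

# Example: use dmesg --follow-new where available (util-linux >= 2.36 per
# the list above), otherwise fall back to plain --follow.
follow_kernel_log() {
    if supports_option dmesg --follow-new; then
        dmesg --follow-new
    else
        dmesg --follow
    fi
}
```

Probing the help text keeps the script independent of version-number parsing, at the cost of trusting that the help output is accurate.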

9 June 2021

Michael Prokop: efivars is gone with Debian/bullseye #newinbullseye

Continuing with #newinbullseye, it's worth being aware that efivars is gone with the kernel version shipped as of Debian/bullseye. Quoting from wiki.debian.org/UEFI:
The Linux kernel gives access to the UEFI configuration variables via a set of files under /sys, using two different interfaces. The older interface was showing files under /sys/firmware/efi/vars, and this is what was used by default in both Wheezy and Jessie. The new interface is efivarfs, which will expose things in a slightly different format under /sys/firmware/efi/efivars.
This is the new preferred way of using UEFI configuration variables, and Debian switched to it by default from Stretch onwards.
Now, CONFIG_EFI_VARS is no longer enabled in Debian due to commit 20146398c4 (shipped as such with Debian kernel package versions >=5.10.1-1~exp1). As a result, the kernel module efivars is no longer available on systems running Debian kernels >=5.10 (which includes Debian/bullseye). Now, when running such a system in EFI mode, chroot-ing into a system and executing e.g. efibootmgr, it might fail with:
# efibootmgr
EFI variables are not supported on this system.
This is caused by /sys/firmware/efi/vars no longer being available, because of the disabled CONFIG_EFI_VARS. To get this working again, you need to make efivarfs available via:
# mount -t efivarfs efivarfs /sys/firmware/efi/efivars
Then efibootmgr and further tools relying on efivars should work again. FYI: if you're a user of Grml's grml-chroot tool, this is going to be handled out of the box for you.
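For scripted chroot setups, the mount can be made idempotent by checking the mount table first. This is a rough sketch only; the function names efivarfs_mounted and ensure_efivarfs are hypothetical, and the optional file argument exists purely so the check can be exercised without touching the real system:

```shell
# Hypothetical helper: succeeds if efivarfs is mounted according to a
# mounts table (defaults to /proc/mounts; another file can be passed).
efivarfs_mounted() {
    mounts_file="${1:-/proc/mounts}"
    # Fields in /proc/mounts: device mountpoint fstype options dump pass
    awk '$2 == "/sys/firmware/efi/efivars" && $3 == "efivarfs" { found = 1 }
         END { exit !found }' "$mounts_file"
}

# Hypothetical wrapper: mount efivarfs only if needed (requires root).
ensure_efivarfs() {
    # /sys/firmware/efi only exists when the system was booted in EFI mode.
    [ -d /sys/firmware/efi ] || { echo "not booted in EFI mode" >&2; return 1; }
    efivarfs_mounted || mount -t efivarfs efivarfs /sys/firmware/efi/efivars
}
```

Calling ensure_efivarfs as root before efibootmgr should then be safe to repeat.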

27 May 2021

Michael Prokop: What to expect from Debian/bullseye #newinbullseye

Bullseye Banner, Copyright 2020 Juliette Taka

Debian v11 with codename bullseye is supposed to be released as new stable release soon-ish (let's hope for June, 2021! :)). Similar to what we had with #newinbuster and previous releases, now it's time for #newinbullseye! I was the driving force at several of my customers to be well prepared for bullseye before its freeze, and since then we're on good track there overall. In my opinion, Debian's release team did (and still does) a great job; I'm very happy about how unblock requests (not only mine but also ones I kept an eye on) were handled so far. As usual with major upgrades, there are some things to be aware of, and hereby I'm starting my public notes on bullseye that might be worthwhile for other folks as well. My focus is primarily on server systems and looking at things from a sysadmin perspective.

Further readings

Of course start with taking a look at the official Debian release notes; make sure to especially go through What's new in Debian 11 + Issues to be aware of for bullseye. Chris published notes on upgrading to Debian bullseye, and also anarcat published upgrade notes for bullseye.

Package versions

As a starting point, let's look at some selected packages and their versions in buster vs. bullseye as of 2021-05-27 (mainly having amd64 in mind):
Package              buster/v10   bullseye/v11
ansible              2.7.7        2.10.8
apache               2.4.38       2.4.46
apt                  1.8.2.2      2.2.3
bash                 5.0          5.1
ceph                 12.2.11      14.2.20
docker               18.09.1      20.10.5
dovecot              2.3.4        2.3.13
dpkg                 1.19.7       1.20.9
emacs                26.1         27.1
gcc                  8.3.0        10.2.1
git                  2.20.1       2.30.2
golang               1.11         1.15
libc                 2.28         2.31
linux kernel         4.19         5.10
llvm                 7.0          11.0
lxc                  3.0.3        4.0.6
mariadb              10.3.27      10.5.10
nginx                1.14.2       1.18.0
nodejs               10.24.0      12.21.0
openjdk              11.0.9.1     11.0.11+9 + 17~19
openssh              7.9p1        8.4p1
openssl              1.1.1d       1.1.1k
perl                 5.28.1       5.32.1
php                  7.3          7.4+76
postfix              3.4.14       3.5.6
postgres             11           13
puppet               5.5.10       5.5.22
python2              2.7.16       2.7.18
python3              3.7.3        3.9.2
qemu/kvm             3.1          5.2
ruby                 2.5.1        2.7+2
rust                 1.41.1       1.48.0
samba                4.9.5        4.13.5
systemd              241          247.3
unattended-upgrades  1.11.2       2.8
util-linux           2.33.1       2.36.1
vagrant              2.2.3        2.2.14
vim                  8.1.0875     8.2.2434
zsh                  5.7.1        5.8
Linux Kernel

The bullseye release will ship a Linux kernel based on v5.10 (v5.10.28 as of 2021-05-27, with v5.10.38 pending in unstable/sid), whereas buster shipped kernel 4.19. As usual there are plenty of changes in the kernel area and this might warrant a separate blog entry, but to highlight some issues:

One surprising change might be that the scrollback buffer (Shift + PageUp) is gone from the Linux console. Make sure to always use screen/tmux or handle output through a pager of your choice if you need all of it and you're in the console.

The kernel provides BTF support (via CONFIG_DEBUG_INFO_BTF, see #973870), which means it's no longer necessary to install LLVM, Clang, etc (requiring >100MB of disk space), see Gregg's excellent blog post regarding the underlying rationale. Sadly the libbpf-tools packaging didn't make it into bullseye (#978727), but if you want to use your own self-made Debian packages, my notes might be useful.

With kernel version 5.4, SUBDIRS support was removed from kbuild, so if an out-of-tree kernel module (like a *-dkms package) fails to compile on bullseye, make sure to use a recent version of it which uses M= or KBUILD_EXTMOD= instead.

Unprivileged user namespaces are enabled by default (see #898446 + #987777), so programs can create more restricted sandboxes without the need to run as root or via a setuid-root helper. If you prefer to keep this feature restricted (or tools like web browsers, WebKitGTK, Flatpak, don't work), use sysctl -w kernel.unprivileged_userns_clone=0.

The /boot/System.map file(s) no longer provide the actual data; you need to switch to the dbg package if you rely on that information:
% cat /boot/System.map-5.10.0-6-amd64 
ffffffffffffffff B The real System.map is in the linux-image-<version>-dbg package
Be aware, though, that the *-dbg package requires ~5GB of additional disk space.

Systemd

systemd v247 made it into bullseye (updated from v241). Same as for the kernel this might warrant a separate blog entry, but to mention some highlights:

Systemd in bullseye activates its persistent journal functionality by default (storing its files in /var/log/journal/, see #717388).

systemd-timesyncd is no longer part of the systemd binary package itself, but available as standalone package. This allows usage of ntp, chrony, openntpd, without having systemd-timesyncd installed (which prevents race conditions like #889290, which was biting me more than once).

journalctl gained new options:
--cursor-file=FILE      Show entries after cursor in FILE and update FILE
--facility=FACILITY...  Show entries with the specified facilities
--image=IMAGE           Operate on files in filesystem image
--namespace=NAMESPACE   Show journal data from specified namespace
--relinquish-var        Stop logging to disk, log to temporary file system
--smart-relinquish-var  Similar, but NOP if log directory is on root mount
systemctl gained new options:
clean UNIT...                       Clean runtime, cache, state, logs or configuration of unit
freeze PATTERN...                   Freeze execution of unit processes
thaw PATTERN...                     Resume execution of a frozen unit
log-level [LEVEL]                   Get/set logging threshold for manager
log-target [TARGET]                 Get/set logging target for manager
service-watchdogs [BOOL]            Get/set service watchdog state
--with-dependencies                 Show unit dependencies with 'status', 'cat', 'list-units', and 'list-unit-files'
 -T --show-transaction              When enqueuing a unit job, show full transaction
 --what=RESOURCES                   Which types of resources to remove
--boot-loader-menu=TIME             Boot into boot loader menu on next boot
--boot-loader-entry=NAME            Boot into a specific boot loader entry on next boot
--timestamp=FORMAT                  Change format of printed timestamps
If you use systemctl edit to adjust overrides, then you'll now also get the existing configuration file listed as comment, which I consider very helpful.

The MACAddressPolicy behavior with systemd naming schema v241 changed for virtual devices (I plan to write about this in a separate blog post).

There are plenty of new manual pages:

systemd also gained new unit configurations related to security hardening:

Another new unit configuration is SystemCallLog=, which supports listing the system calls to be logged. This is very useful for auditing or temporarily when constructing system call filters.

The cgroupv2 change is also documented in the release notes, but to explicitly mention it also here, quoting from /usr/share/doc/systemd/NEWS.Debian.gz:
systemd now defaults to the unified cgroup hierarchy (i.e. cgroupv2).
This change reflects the fact that cgroups2 support has matured
substantially in both systemd and in the kernel.
All major container tools nowadays should support cgroupv2.
If you run into problems with cgroupv2, you can switch back to the previous,
hybrid setup by adding systemd.unified_cgroup_hierarchy=false to the
kernel command line.
You can read more about the benefits of cgroupv2 at
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
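To see which hierarchy a given host actually ended up with, one common approach is to look at the filesystem type mounted at /sys/fs/cgroup. The sketch below assumes GNU stat and the usual type names (cgroup2fs for the unified hierarchy, tmpfs for the legacy/hybrid layout); the function name cgroup_mode is made up for illustration:

```shell
# Map the filesystem type found at /sys/fs/cgroup to a hierarchy name.
# Illustrative helper only; the type names are assumptions based on
# what GNU stat reports on typical systemd hosts.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "unified (cgroupv2)" ;;
        tmpfs)     echo "legacy or hybrid (cgroupv1)" ;;
        *)         echo "unknown" ;;
    esac
}

# Usage on a live system:
#   cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
```

This only reports the current state; switching back is still done via the kernel command line parameter quoted above.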
Note that cgroup-tools (lssubsys + lscgroup etc) don't work in cgroup2/unified hierarchy yet (see #959022 for the details).

Configuration management

puppet's upstream doesn't provide packages for bullseye yet (see PA-3624 + MODULES-11060), and sadly neither v6 nor v7 made it into bullseye, so when using the packages from Debian you're still stuck with v5.5 (also see #950182).

ansible is also available, and while it looked like only version 2.9.16 would make it into bullseye (see #984557 + #986213), actually version 2.10.8 made it into bullseye.

chef was removed from Debian and is not available with bullseye (due to trademark issues).

Prometheus stack

Prometheus server was updated from v2.7.1 to v2.24.1, and the prometheus service by default applies some systemd hardening now. Also all the usual exporters are still there, but bullseye also gained some new ones:

Virtualization

docker (v20.10.5), ganeti (v3.0.1), libvirt (v7.0.0), lxc (v4.0.6), openstack, qemu/kvm (v5.2), xen (v4.14.1), are all still around, though what's new and noteworthy is that podman version 3.0.1 (tool for managing OCI containers and pods) made it into bullseye.

If you're using the docker packages from upstream, be aware that they still don't seem to understand Debian package version handling. The docker* packages will not be automatically considered for upgrade, as 5:20.10.6~3-0~debian-buster is considered newer than 5:20.10.6~3-0~debian-bullseye:
% apt-cache policy docker-ce
  docker-ce:
    Installed: 5:20.10.6~3-0~debian-buster
    Candidate: 5:20.10.6~3-0~debian-buster
    Version table:
   *** 5:20.10.6~3-0~debian-buster 100
          100 /var/lib/dpkg/status
       5:20.10.6~3-0~debian-bullseye 500
          500 https://download.docker.com/linux/debian bullseye/stable amd64 Packages
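The ordering can be checked directly with dpkg --compare-versions, which applies the same comparison rules apt uses. This is a small sketch that assumes dpkg is installed; the wrapper name deb_version_relation is made up for illustration:

```shell
# Print how the first Debian version string relates to the second one,
# as judged by dpkg. Requires dpkg; the function name is illustrative.
deb_version_relation() {
    if dpkg --compare-versions "$1" gt "$2"; then
        echo newer
    elif dpkg --compare-versions "$1" eq "$2"; then
        echo equal
    else
        echo older
    fi
}

# Usage:
#   deb_version_relation 5:20.10.6~3-0~debian-buster 5:20.10.6~3-0~debian-bullseye
```

For the two docker-ce versions this reports newer: the strings are identical up to "debian-bu", and the next characters are "s" versus "l", where "s" sorts later. That is why the buster build wins over the bullseye one.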
Vagrant is available in version 2.2.14; the package from upstream works perfectly fine on bullseye as well. If you're relying on VirtualBox, be aware that upstream doesn't provide packages for bullseye yet, but the package from Debian/unstable (v6.1.22 as of 2021-05-27) works fine on bullseye (VirtualBox hasn't been shipped with stable releases for quite some time due to lack of cooperation from upstream on security support for older releases, see #794466). If you rely on the virtualbox-guest-additions-iso and its shared folders support, you might be glad to hear that v6.1.22 made it into bullseye (see #988783), properly supporting more recent kernel versions like the one present in bullseye.

debuginfod

There's a new service debuginfod.debian.net (see debian-devel-announce and Debian Wiki), which makes the debugging experience way smoother. You no longer need to download the debugging Debian packages (*-dbgsym/*-dbg), but instead can fetch them on demand, by exporting the following variables (before invoking gdb or alike):
% export DEBUGINFOD_PROGRESS=1    # for optional download progress reporting
% export DEBUGINFOD_URLS="https://debuginfod.debian.net"
BTW: if you can't rely on debuginfod (for whatever reason), I'd like to point your attention towards find-dbgsym-packages from the debian-goodies package.

Vim

Sadly Vim 8.2 once again makes another change for bad defaults (hello mouse behavior!). When incsearch is set, it also applies to :substitute. This makes it veeeeeeeeeery annoying when running something like :%s/\s\+$// to get rid of trailing whitespace characters, because if there are no matches it jumps to the beginning of the file and then back, sigh. To get the old behavior back, you can use this:
au CmdLineEnter : let s:incs = &incsearch | set noincsearch
au CmdLineLeave : let &incsearch = s:incs
rsync

rsync was updated from v3.1.3 to v3.2.3. It provides various checksum enhancements (see option --checksum-choice). We got new capabilities (hardlink-specials, atimes, optional protect-args, stop-at, no crtimes) and the addition of zstd and lz4 compression algorithms. And we got new options:

OpenSSH

OpenSSH was updated from v7.9p1 to 8.4p1, so if you're interested in all the changes, check out the release notes between those versions (8.0, 8.1, 8.2, 8.3 + 8.4). Let's highlight some notable new features:

Misc unsorted

9 April 2021

Michael Prokop: A Ceph war story

It all started with the big bang! We nearly lost 33 of 36 disks on a Proxmox/Ceph Cluster; this is the story of how we recovered them.

At the end of 2020, we eventually had a long outstanding maintenance window for taking care of system upgrades at a customer. During this maintenance window, which involved reboots of server systems, the involved Ceph cluster unexpectedly went into a critical state. What was planned to be a few hours of checklist work in the early evening turned out to be an emergency case; let's call it a nightmare (not only because it included a big part of the night). Since we have learned a few things from our post mortem and RCA, it's worth sharing those with others. But first things first, let's step back and clarify what we had to deal with.

The system and its upgrade

One part of the upgrade included 3 Debian servers (we're calling them server1, server2 and server3 here), running on Proxmox v5 + Debian/stretch with 12 Ceph OSDs each (65.45TB in total), a so-called Proxmox Hyper-Converged Ceph Cluster.

First, we went for upgrading the Proxmox v5/stretch system to Proxmox v6/buster, before updating Ceph Luminous v12.2.13 to the latest v14.2 release, supported by Proxmox v6/buster. The Proxmox upgrade included updating corosync from v2 to v3. As part of this upgrade, we had to apply some configuration changes, like adjusting ring0 + ring1 address settings and adding a mon_host configuration to the Ceph configuration.

During the first two servers' reboots, we noticed configuration glitches. After fixing those, we went for a reboot of the third server as well. Then we noticed that several Ceph OSDs were unexpectedly down. The NTP service wasn't working as expected after the upgrade. The underlying issue is a race condition of ntp with systemd-timesyncd (see #889290). As a result, we had clock skew problems with Ceph, indicating that the Ceph monitors' clocks weren't running in sync (which is essential for proper Ceph operation). We initially assumed that our Ceph OSD failure derived from this clock skew problem, so we took care of it. After yet another round of reboots, to ensure the systems are all running with identical and sane configurations and services, we noticed lots of failing OSDs. This time all but three OSDs (19, 21 and 22) were down:
% sudo ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       65.44138 root default
-2       21.81310     host server1
 0   hdd  1.08989         osd.0    down  1.00000 1.00000
 1   hdd  1.08989         osd.1    down  1.00000 1.00000
 2   hdd  1.63539         osd.2    down  1.00000 1.00000
 3   hdd  1.63539         osd.3    down  1.00000 1.00000
 4   hdd  1.63539         osd.4    down  1.00000 1.00000
 5   hdd  1.63539         osd.5    down  1.00000 1.00000
18   hdd  2.18279         osd.18   down  1.00000 1.00000
20   hdd  2.18179         osd.20   down  1.00000 1.00000
28   hdd  2.18179         osd.28   down  1.00000 1.00000
29   hdd  2.18179         osd.29   down  1.00000 1.00000
30   hdd  2.18179         osd.30   down  1.00000 1.00000
31   hdd  2.18179         osd.31   down  1.00000 1.00000
-4       21.81409     host server2
 6   hdd  1.08989         osd.6    down  1.00000 1.00000
 7   hdd  1.08989         osd.7    down  1.00000 1.00000
 8   hdd  1.63539         osd.8    down  1.00000 1.00000
 9   hdd  1.63539         osd.9    down  1.00000 1.00000
10   hdd  1.63539         osd.10   down  1.00000 1.00000
11   hdd  1.63539         osd.11   down  1.00000 1.00000
19   hdd  2.18179         osd.19     up  1.00000 1.00000
21   hdd  2.18279         osd.21     up  1.00000 1.00000
22   hdd  2.18279         osd.22     up  1.00000 1.00000
32   hdd  2.18179         osd.32   down  1.00000 1.00000
33   hdd  2.18179         osd.33   down  1.00000 1.00000
34   hdd  2.18179         osd.34   down  1.00000 1.00000
-3       21.81419     host server3
12   hdd  1.08989         osd.12   down  1.00000 1.00000
13   hdd  1.08989         osd.13   down  1.00000 1.00000
14   hdd  1.63539         osd.14   down  1.00000 1.00000
15   hdd  1.63539         osd.15   down  1.00000 1.00000
16   hdd  1.63539         osd.16   down  1.00000 1.00000
17   hdd  1.63539         osd.17   down  1.00000 1.00000
23   hdd  2.18190         osd.23   down  1.00000 1.00000
24   hdd  2.18279         osd.24   down  1.00000 1.00000
25   hdd  2.18279         osd.25   down  1.00000 1.00000
35   hdd  2.18179         osd.35   down  1.00000 1.00000
36   hdd  2.18179         osd.36   down  1.00000 1.00000
37   hdd  2.18179         osd.37   down  1.00000 1.00000
Our blood pressure increased slightly! Did we just lose all of our cluster? What happened, and how can we get all the other OSDs back? We stumbled upon this beauty in our logs:
kernel: [   73.697957] XFS (sdl1): SB stripe unit sanity check failed
kernel: [   73.698002] XFS (sdl1): Metadata corruption detected at xfs_sb_read_verify+0x10e/0x180 [xfs], xfs_sb block 0xffffffffffffffff
kernel: [   73.698799] XFS (sdl1): Unmount and run xfs_repair
kernel: [   73.699199] XFS (sdl1): First 128 bytes of corrupted metadata buffer:
kernel: [   73.699677] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
kernel: [   73.700205] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
kernel: [   73.700836] 00000020: 62 44 2b c0 e6 22 40 d7 84 3d e1 cc 65 88 e9 d8  bD+.."@..=..e...
kernel: [   73.701347] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
kernel: [   73.701770] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
ceph-disk[4240]: mount: /var/lib/ceph/tmp/mnt.jw367Y: mount(2) system call failed: Structure needs cleaning.
ceph-disk[4240]: ceph-disk: Mounting filesystem failed: Command '['/bin/mount', '-t', u'xfs', '-o', 'noatime,inode64', '--', '/dev/disk/by-parttypeuuid/4fbd7e29-9d25-41b8-afd0-062c0ceff05d.cdda39ed-5
ceph/tmp/mnt.jw367Y']' returned non-zero exit status 32
kernel: [   73.702162] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
kernel: [   73.702550] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
kernel: [   73.702975] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
kernel: [   73.703373] XFS (sdl1): SB validate failed with error -117.
The same issue was present for the other failing OSDs. We hoped that the data itself was still there, and only the mounting of the XFS partitions failed. The Ceph cluster was initially installed in 2017 with Ceph jewel/10.2 with the OSDs on filestore (nowadays being a legacy approach to storing objects in Ceph). However, we migrated the disks to bluestore since then (with ceph-disk and not yet via ceph-volume, which is what's being used nowadays). Using ceph-disk introduces these 100MB XFS partitions containing basic metadata for the OSD. Given that we had three working OSDs left, we decided to investigate how to rebuild the failing ones. Some folks on #ceph (thanks T1, ormandj + peetaur!) were kind enough to share what working XFS partitions looked like for them. After creating a backup (via dd), we tried to re-create such an XFS partition on server1. We noticed that even mounting a freshly created XFS partition failed:
synpromika@server1 ~ % sudo mkfs.xfs -f -i size=2048 -m uuid="4568c300-ad83-4288-963e-badcd99bf54f" /dev/sdc1
meta-data=/dev/sdc1              isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
synpromika@server1 ~ % sudo mount /dev/sdc1 /mnt/ceph-recovery
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
cache_node_purge: refcount was 1, not zero (node=0x1d3c400)
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x18800/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x18800/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0x24c00/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x24c00/0x1000
SB stripe unit sanity check failed
Metadata corruption detected at 0x433840, xfs_sb block 0xc400/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0xc400/0x1000
releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!releasing dirty buffer (bulk) to free list!found dirty buffer (bulk) on free list!bad magic number
bad magic number
Metadata corruption detected at 0x433840, xfs_sb block 0x0/0x1000
libxfs_writebufr: write verifer failed on xfs_sb bno 0x0/0x1000
releasing dirty buffer (bulk) to free list!mount: /mnt/ceph-recovery: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error.
Ouch. This very much looked related to the actual issue we're seeing. So we tried to execute mkfs.xfs with a bunch of different sunit/swidth settings. Using -d sunit=512 -d swidth=512 at least worked then, so we decided to force its usage in the creation of our OSD XFS partition. This brought us a working XFS partition. Please note, sunit must not be larger than swidth (more on that later!).

Then we reconstructed how to restore all the metadata for the OSD (activate.monmap, active, block_uuid, bluefs, ceph_fsid, fsid, keyring, kv_backend, magic, mkfs_done, ready, require_osd_release, systemd, type, whoami). To identify the UUID, we can read the data from ceph --format json osd dump, like this for all our OSDs (Zsh syntax ftw!):
synpromika@server1 ~ % for f in {0..37} ; printf "osd-$f: %s\n" "$(sudo ceph --format json osd dump | jq -r ".osds[] | select(.osd==$f) | .uuid")"
osd-0: 4568c300-ad83-4288-963e-badcd99bf54f
osd-1: e573a17a-ccde-4719-bdf8-eef66903ca4f
osd-2: 0e1b2626-f248-4e7d-9950-f1a46644754e
osd-3: 1ac6a0a2-20ee-4ed8-9f76-d24e900c800c
[...]
Identifying the corresponding raw device for each OSD UUID is possible via:
synpromika@server1 ~ % UUID="4568c300-ad83-4288-963e-badcd99bf54f"
synpromika@server1 ~ % readlink -f /dev/disk/by-partuuid/"${UUID}"
/dev/sdc1
The OSD's key ID can be retrieved via:
synpromika@server1 ~ % OSD_ID=0
synpromika@server1 ~ % sudo ceph auth get osd."${OSD_ID}" -f json 2>/dev/null | jq -r '.[] | .key'
AQCKFpZdm0We[...]
Now we also need to identify the underlying block device:
synpromika@server1 ~ % OSD_ID=0
synpromika@server1 ~ % sudo ceph osd metadata osd."${OSD_ID}" -f json | jq -r '.bluestore_bdev_partition_path'
/dev/sdc2
With all of this, we reconstructed the keyring, fsid, whoami, block + block_uuid files. All the other files inside the XFS metadata partition are identical on each OSD. So after placing and adjusting the corresponding metadata on the XFS partition for Ceph usage, we got a working OSD, hurray! Since we had to fix yet another 32 OSDs, we decided to automate this XFS partitioning and metadata recovery procedure.

We had a network share available on /srv/backup for storing backups of existing partition data. On each server, we tested the procedure with one single OSD before iterating over the list of remaining failing OSDs. We started with a shell script on server1, then adjusted the script for server2 and server3. This is the script, as we executed it on the 3rd server. Thanks to this, we managed to get the Ceph cluster up and running again. We didn't want to continue with the Ceph upgrade itself during the night though, as we wanted to know exactly what was going on and why the system behaved like that. Time for RCA!

Root Cause Analysis

So all but three OSDs on server2 failed, and the problem seems to be related to XFS. Therefore, our starting point for the RCA was to identify what was different on server2, as compared to server1 + server3. My initial assumption was that this was related to some firmware issues with the involved controller (and as it turned out later, I was right!). The disks were attached as JBOD devices to a ServeRAID M5210 controller (with a stripe size of 512). Firmware state:
synpromika@server1 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.16.0-0092
Firmware Version = 4.660.00-8156
synpromika@server2 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.21.0-0112
Firmware Version = 4.680.00-8489
synpromika@server3 ~ % sudo storcli64 /c0 show all | grep '^Firmware'
Firmware Package Build = 24.16.0-0092
Firmware Version = 4.660.00-8156
This looked very promising, as server2 indeed runs with a different firmware version on the controller. But how so? Well, the motherboard of server2 got replaced by a Lenovo/IBM technician in January 2020, as we had a failing memory slot during a memory upgrade. As part of this procedure, the Lenovo/IBM technician installed the latest firmware versions. According to our documentation, some OSDs were rebuilt (due to the filestore->bluestore migration) in March and April 2020. It turned out that precisely those OSDs were the ones that survived the upgrade. So the surviving drives were created with a different firmware version running on the involved controller. All the other OSDs were created with an older controller firmware. But what difference does this make? Now let's check the firmware changelogs. For the 24.21.0-0097 release we found this:
- Cannot create or mount xfs filesystem using xfsprogs 4.19.x kernel 4.20(SCGCQ02027889)
- xfs_info command run on an XFS file system created on a VD of strip size 1M shows sunit and swidth as 0(SCGCQ02056038)
Our XFS problem certainly was related to the controller's firmware. We also recalled that our monitoring system reported different sunit settings for the OSDs that were rebuilt in March and April. For example, OSD 21 was recreated and got different sunit settings:
WARN  server2.example.org  Mount options of /var/lib/ceph/osd/ceph-21      WARN - Missing: sunit=1024, Exceeding: sunit=512
We compared the new OSD 21 with an existing one (OSD 25 on server3):
synpromika@server2 ~ % systemctl show var-lib-ceph-osd-ceph\\x2d21.mount | grep sunit
Options=rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota
synpromika@server3 ~ % systemctl show var-lib-ceph-osd-ceph\\x2d25.mount | grep sunit
Options=rw,noatime,attr2,inode64,sunit=1024,swidth=512,noquota
Thanks to our documentation, we could compare execution logs of their creation:
% diff -u ceph-disk-osd-25.log ceph-disk-osd-21.log
-synpromika@server2 ~ % sudo ceph-disk -v prepare --bluestore /dev/sdj --osd-id 25
+synpromika@server3 ~ % sudo ceph-disk -v prepare --bluestore /dev/sdi --osd-id 21
[...]
-command_check_call: Running command: /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdj1
-meta-data=/dev/sdj1              isize=2048   agcount=4, agsize=6272 blks
[...]
+command_check_call: Running command: /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdi1
+meta-data=/dev/sdi1              isize=2048   agcount=4, agsize=6336 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
-data     =                       bsize=4096   blocks=25088, imaxpct=25
-         =                       sunit=128    swidth=64 blks
+data     =                       bsize=4096   blocks=25344, imaxpct=25
+         =                       sunit=64     swidth=64 blks
 naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
 log      =internal log           bsize=4096   blocks=1608, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
 realtime =none                   extsz=4096   blocks=0, rtextents=0
[...]
So back then, we even tried to track this down but couldn't make sense of it yet. But now this sounds very much like it is related to the problem we saw with this Ceph/XFS failure. We follow Occam's razor, assuming the simplest explanation is usually the right one, so let's check the disk properties and see what differs:
synpromika@server1 ~ % sudo blockdev --getsz --getsize64 --getss --getpbsz --getiomin --getioopt /dev/sdk
4685545472
2398999281664
512
4096
524288
262144
synpromika@server2 ~ % sudo blockdev --getsz --getsize64 --getss --getpbsz --getiomin --getioopt /dev/sdk
4685545472
2398999281664
512
4096
262144
262144
See the difference between server1 and server2 for identical disks? The getiomin option now reports something different for them:
synpromika@server1 ~ % sudo blockdev --getiomin /dev/sdk            
524288
synpromika@server1 ~ % cat /sys/block/sdk/queue/minimum_io_size
524288
synpromika@server2 ~ % sudo blockdev --getiomin /dev/sdk 
262144
synpromika@server2 ~ % cat /sys/block/sdk/queue/minimum_io_size
262144
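The comparison of these two values can also be done mechanically for every disk at once. The following is a minimal sketch and not part of the original debugging session: the function name check_io_sizes and its optional directory argument are made up for illustration, and it only flags devices whose minimum_io_size is larger than a non-zero optimal_io_size.

```shell
#!/bin/sh
# check_io_sizes [SYSFS_BLOCK_DIR]
# Hypothetical helper: prints one line per block device and flags those
# where minimum_io_size is larger than optimal_io_size.
check_io_sizes() {
    root=${1:-/sys/block}
    for queue in "$root"/*/queue; do
        # Skip entries that do not expose both attributes.
        [ -r "$queue/minimum_io_size" ] && [ -r "$queue/optimal_io_size" ] || continue
        dev=${queue%/queue}
        dev=${dev##*/}
        iomin=$(cat "$queue/minimum_io_size")
        ioopt=$(cat "$queue/optimal_io_size")
        # optimal_io_size is 0 when the device reports no preferred size.
        if [ "$ioopt" -gt 0 ] && [ "$iomin" -gt "$ioopt" ]; then
            echo "$dev: iomin=$iomin ioopt=$ioopt SUSPICIOUS"
        else
            echo "$dev: iomin=$iomin ioopt=$ioopt ok"
        fi
    done
}
```

Calling check_io_sizes without an argument walks /sys/block; passing a directory is only meant for trying the function against a copied or fake tree.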
It doesn't make sense that the minimum I/O size (iomin, AKA BLKIOMIN) is bigger than the optimal I/O size (ioopt, AKA BLKIOOPT). This leads us to Bug 202127 ("cannot mount or create xfs on a 597T device"), which matches our findings here. But why did this XFS partition work in the past and fails now with the newer kernel version?

The XFS behaviour change

Now given that we have backups of all the XFS partitions, we wanted to track down a) when this XFS behaviour was introduced, and b) whether, and if so how, it would be possible to reuse the XFS partition without having to rebuild it from scratch (e.g. if you had no working Ceph OSD or backups left). Let's look at such a failing XFS partition with the Grml live system:
root@grml ~ # grml-version
grml64-full 2020.06 Release Codename Ausgehfuahangl [2020-06-24]
root@grml ~ # uname -a
Linux grml 5.6.0-2-amd64 #1 SMP Debian 5.6.14-2 (2020-06-09) x86_64 GNU/Linux
root@grml ~ # grml-hostname grml-2020-06
Setting hostname to grml-2020-06: done
root@grml ~ # exec zsh
root@grml-2020-06 ~ # dpkg -l xfsprogs util-linux
Desired=Unknown/Install/Remove/Purge/Hold
  Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
 / Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
 / Name           Version      Architecture Description
+++-==============-============-============-=========================================
ii  util-linux     2.35.2-4     amd64        miscellaneous system utilities
ii  xfsprogs       5.6.0-1+b2   amd64        Utilities for managing the XFS filesystem
There it's failing, no matter which mount option we try:
root@grml-2020-06 ~ # mount ./sdd1.dd /mnt
mount: /mnt: mount(2) system call failed: Structure needs cleaning.
root@grml-2020-06 ~ # dmesg | tail -30
[...]
[   64.788640] XFS (loop1): SB stripe unit sanity check failed
[   64.788671] XFS (loop1): Metadata corruption detected at xfs_sb_read_verify+0x102/0x170 [xfs], xfs_sb block 0xffffffffffffffff
[   64.788671] XFS (loop1): Unmount and run xfs_repair
[   64.788672] XFS (loop1): First 128 bytes of corrupted metadata buffer:
[   64.788673] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
[   64.788674] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   64.788675] 00000020: 32 b6 dc 35 53 b7 44 96 9d 63 30 ab b3 2b 68 36  2..5S.D..c0..+h6
[   64.788675] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
[   64.788675] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
[   64.788676] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
[   64.788677] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
[   64.788677] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
[   64.788679] XFS (loop1): SB validate failed with error -117.
root@grml-2020-06 ~ # mount -t xfs -o rw,relatime,attr2,inode64,sunit=1024,swidth=512,noquota ./sdd1.dd /mnt/
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop1, missing codepage or helper program, or other error.
32 root@grml-2020-06 ~ # dmesg | tail -1
[   66.342976] XFS (loop1): stripe width (512) must be a multiple of the stripe unit (1024)
root@grml-2020-06 ~ # mount -t xfs -o rw,relatime,attr2,inode64,sunit=512,swidth=512,noquota ./sdd1.dd /mnt/
mount: /mnt: mount(2) system call failed: Structure needs cleaning.
32 root@grml-2020-06 ~ # dmesg | tail -14
[   66.342976] XFS (loop1): stripe width (512) must be a multiple of the stripe unit (1024)
[   80.751277] XFS (loop1): SB stripe unit sanity check failed
[   80.751323] XFS (loop1): Metadata corruption detected at xfs_sb_read_verify+0x102/0x170 [xfs], xfs_sb block 0xffffffffffffffff 
[   80.751324] XFS (loop1): Unmount and run xfs_repair
[   80.751325] XFS (loop1): First 128 bytes of corrupted metadata buffer:
[   80.751327] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 62 00  XFSB..........b.
[   80.751328] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   80.751330] 00000020: 32 b6 dc 35 53 b7 44 96 9d 63 30 ab b3 2b 68 36  2..5S.D..c0..+h6
[   80.751331] 00000030: 00 00 00 00 00 00 40 08 00 00 00 00 00 00 01 00  ......@.........
[   80.751331] 00000040: 00 00 00 00 00 00 01 01 00 00 00 00 00 00 01 02  ................
[   80.751332] 00000050: 00 00 00 01 00 00 18 80 00 00 00 04 00 00 00 00  ................
[   80.751333] 00000060: 00 00 06 48 bd a5 10 00 08 00 00 02 00 00 00 00  ...H............
[   80.751334] 00000070: 00 00 00 00 00 00 00 00 0c 0c 0b 01 0d 00 00 19  ................
[   80.751338] XFS (loop1): SB validate failed with error -117.
xfs_repair doesn't help either:
root@grml-2020-06 ~ # xfs_info ./sdd1.dd
meta-data=./sdd1.dd              isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@grml-2020-06 ~ # xfs_repair ./sdd1.dd
Phase 1 - find and verify superblock...
bad primary superblock - bad stripe width in superblock !!!
attempting to find secondary superblock...
..............................................................................................Sorry, could not find valid secondary superblock
Exiting now.
With the SB stripe unit sanity check failed message, we could easily track this down to the following commit fa4ca9c:
% git show fa4ca9c5574605d1e48b7e617705230a0640b6da | cat
commit fa4ca9c5574605d1e48b7e617705230a0640b6da
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jun 5 10:06:16 2018 -0700
    
    xfs: catch bad stripe alignment configurations
    
    When stripe alignments are invalid, data alignment algorithms in the
    allocator may not work correctly. Ensure we catch superblocks with
    invalid stripe alignment setups at mount time. These data alignment
    mismatches are now detected at mount time like this:
    
    XFS (loop0): SB stripe unit sanity check failed
    XFS (loop0): Metadata corruption detected at xfs_sb_read_verify+0xab/0x110, xfs_sb block 0xffffffffffffffff
    XFS (loop0): Unmount and run xfs_repair
    XFS (loop0): First 128 bytes of corrupted metadata buffer:
    0000000091c2de02: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 10 00  XFSB............
    0000000023bff869: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000000cdd8c893: 17 32 37 15 ff ca 46 3d 9a 17 d3 33 04 b5 f1 a2  .27...F=...3....
    000000009fd2844f: 00 00 00 00 00 00 00 04 00 00 00 00 00 00 06 d0  ................
    0000000088e9b0bb: 00 00 00 00 00 00 06 d1 00 00 00 00 00 00 06 d2  ................
    00000000ff233a20: 00 00 00 01 00 00 10 00 00 00 00 01 00 00 00 00  ................
    000000009db0ac8b: 00 00 03 60 e1 34 02 00 08 00 00 02 00 00 00 00  ... .4..........
    00000000f7022460: 00 00 00 00 00 00 00 00 0c 09 0b 01 0c 00 00 19  ................
    XFS (loop0): SB validate failed with error -117.
    
    And the mount fails.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
diff --git fs/xfs/libxfs/xfs_sb.c fs/xfs/libxfs/xfs_sb.c
index b5dca3c8c84d..c06b6fc92966 100644
--- fs/xfs/libxfs/xfs_sb.c
+++ fs/xfs/libxfs/xfs_sb.c
@@ -278,6 +278,22 @@ xfs_mount_validate_sb(
                return -EFSCORRUPTED;
        }
 
+       if (sbp->sb_unit) {
+               if (!xfs_sb_version_hasdalign(sbp) ||
+                   sbp->sb_unit > sbp->sb_width ||
+                   (sbp->sb_width % sbp->sb_unit) != 0) {
+                       xfs_notice(mp, "SB stripe unit sanity check failed");
+                       return -EFSCORRUPTED;
+               }
+       } else if (xfs_sb_version_hasdalign(sbp)) {
+               xfs_notice(mp, "SB stripe alignment sanity check failed");
+               return -EFSCORRUPTED;
+       } else if (sbp->sb_width) {
+               xfs_notice(mp, "SB stripe width sanity check failed");
+               return -EFSCORRUPTED;
+       }
+
+
        if (xfs_sb_version_hascrc(&mp->m_sb) &&
            sbp->sb_blocksize < XFS_MIN_CRC_BLOCKSIZE) {
                xfs_notice(mp, "v5 SB sanity check failed");
This change is included in kernel versions 4.18-rc1 and newer:
% git describe --contains fa4ca9c5574605d1e48
v4.18-rc1~37^2~14
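Given that tag, a script can decide whether a kernel is new enough to carry the stricter superblock check. This is a rough sketch under two assumptions: the version string starts with MAJOR.MINOR (as uname -r output does), and the commit was not backported to an older distribution kernel. The function name is hypothetical.

```shell
#!/bin/sh
# kernel_has_stripe_check VERSION
# Returns success if VERSION is 4.18 or newer, i.e. a mainline kernel that
# contains commit fa4ca9c ("xfs: catch bad stripe alignment configurations").
# Assumption: VERSION starts with numeric MAJOR.MINOR.
kernel_has_stripe_check() {
    ver=$1
    major=${ver%%.*}            # text before the first dot
    rest=${ver#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 18 ]; }
}
```

A typical call would be kernel_has_stripe_check "$(uname -r)" && echo "this kernel rejects sunit > swidth".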
Now let's try with an older kernel version (4.9.0), using the old Grml 2017.05 release:
root@grml ~ # grml-version
grml64-small 2017.05 Release Codename Freedatensuppe [2017-05-31]
root@grml ~ # uname -a
Linux grml 4.9.0-1-grml-amd64 #1 SMP Debian 4.9.29-1+grml.1 (2017-05-24) x86_64 GNU/Linux
root@grml ~ # lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.0 (stretch)
Release:        9.0
Codename:       stretch
root@grml ~ # grml-hostname grml-2017-05
Setting hostname to grml-2017-05: done
root@grml ~ # exec zsh
root@grml-2017-05 ~ #
root@grml-2017-05 ~ # xfs_info ./sdd1.dd
xfs_info: ./sdd1.dd is not a mounted XFS filesystem
1 root@grml-2017-05 ~ # xfs_repair ./sdd1.dd
Phase 1 - find and verify superblock...
bad primary superblock - bad stripe width in superblock !!!
attempting to find secondary superblock...
..............................................................................................Sorry, could not find valid secondary superblock
Exiting now.
1 root@grml-2017-05 ~ # mount ./sdd1.dd /mnt
root@grml-2017-05 ~ # mount -t xfs
/root/sdd1.dd on /mnt type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=512,noquota)
root@grml-2017-05 ~ # ls /mnt
activate.monmap  active  block  block_uuid  bluefs  ceph_fsid  fsid  keyring  kv_backend  magic  mkfs_done  ready  require_osd_release  systemd  type  whoami
root@grml-2017-05 ~ # xfs_info /mnt
meta-data=/dev/loop1             isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=128    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Mounting there indeed works! Now, if we mount the filesystem with new and proper sunit/swidth settings using the older kernel, it should rewrite them on disk:
root@grml-2017-05 ~ # mount -t xfs -o sunit=512,swidth=512 ./sdd1.dd /mnt/
root@grml-2017-05 ~ # umount /mnt/
And indeed, mounting this rewritten filesystem then also works with newer kernels:
root@grml-2020-06 ~ # mount ./sdd1.rewritten /mnt/
root@grml-2020-06 ~ # xfs_info /root/sdd1.rewritten
meta-data=/dev/loop1             isize=2048   agcount=4, agsize=6272 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=25088, imaxpct=25
         =                       sunit=64    swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=1608, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@grml-2020-06 ~ # mount -t xfs                
/root/sdd1.rewritten on /mnt type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,sunit=512,swidth=512,noquota)
FTR: The sunit=512,swidth=512 from the xfs mount option is identical to xfs_info's output sunit=64,swidth=64, because mount.xfs's sunit value is given in 512-byte block units (see man 5 xfs), whereas the xfs_info output reported here is in blocks with a block size (bsize) of 4096, so sunit = 512*512 = 64*4096 = 262144 bytes. mkfs uses the minimum and optimal I/O sizes for stripe unit and stripe width; you can check this e.g. via (note that server2 with fixed firmware version reports proper values, whereas server3 with broken controller firmware reports nonsense):
synpromika@server2 ~ % for i in /sys/block/sd*/queue/ ; do printf "%s: %s %s\n" "$i" "$(cat "$i"/minimum_io_size)" "$(cat "$i"/optimal_io_size)" ; done
[...]
/sys/block/sdc/queue/: 262144 262144
/sys/block/sdd/queue/: 262144 262144
/sys/block/sde/queue/: 262144 262144
/sys/block/sdf/queue/: 262144 262144
/sys/block/sdg/queue/: 262144 262144
/sys/block/sdh/queue/: 262144 262144
/sys/block/sdi/queue/: 262144 262144
/sys/block/sdj/queue/: 262144 262144
/sys/block/sdk/queue/: 262144 262144
/sys/block/sdl/queue/: 262144 262144
/sys/block/sdm/queue/: 262144 262144
/sys/block/sdn/queue/: 262144 262144
[...]
synpromika@server3 ~ % for i in /sys/block/sd*/queue/ ; do printf "%s: %s %s\n" "$i" "$(cat "$i"/minimum_io_size)" "$(cat "$i"/optimal_io_size)" ; done
[...]
/sys/block/sdc/queue/: 524288 262144
/sys/block/sdd/queue/: 524288 262144
/sys/block/sde/queue/: 524288 262144
/sys/block/sdf/queue/: 524288 262144
/sys/block/sdg/queue/: 524288 262144
/sys/block/sdh/queue/: 524288 262144
/sys/block/sdi/queue/: 524288 262144
/sys/block/sdj/queue/: 524288 262144
/sys/block/sdk/queue/: 524288 262144
/sys/block/sdl/queue/: 524288 262144
/sys/block/sdm/queue/: 524288 262144
/sys/block/sdn/queue/: 524288 262144
[...]
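To see how these sysfs values end up as the sunit/swidth numbers shown earlier, the arithmetic can be spelled out. The sketch below assumes, as described above, that mkfs takes the stripe unit from minimum_io_size and the stripe width from optimal_io_size; the function name io_to_stripe is illustrative only.

```shell
#!/bin/sh
# io_to_stripe IOMIN_BYTES IOOPT_BYTES [FS_BLOCK_SIZE]
# Prints sunit/swidth in 512-byte units (as used by the mount options)
# and in filesystem blocks (as printed by xfs_info).
io_to_stripe() {
    iomin=$1
    ioopt=$2
    bsize=${3:-4096}   # filesystem block size, 4096 in the examples here
    echo "mount: sunit=$((iomin / 512)),swidth=$((ioopt / 512))"
    echo "xfs_info: sunit=$((iomin / bsize)) swidth=$((ioopt / bsize)) blks"
}
```

With the server3 values, io_to_stripe 524288 262144 yields sunit=1024,swidth=512 in mount units and sunit=128 swidth=64 in blocks, matching the failing filesystem; the server2 values (262144 262144) yield sunit=512,swidth=512 and sunit=64 swidth=64.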
The mismatch between minimum and optimal I/O size reported with the broken controller firmware is the underlying reason why the initially created XFS partitions were created with incorrect sunit/swidth settings. The broken firmware of server1 and server3 was the cause of the incorrect settings: they were ignored by old(er) xfs/kernel versions, but treated as an error by new ones. Make sure to also read the XFS FAQ regarding "How to calculate the correct sunit,swidth values for optimal performance". We also stumbled upon two interesting reads in RedHat's knowledge base: 5075561 + 2150101 (requires an active subscription, though) and #1835947.

Am I affected? How to work around it?

To check whether your XFS mount points are affected by this issue, the following command line should be useful:
awk '$3 == "xfs"{print $2}' /proc/self/mounts | while read mount ; do echo -n "$mount " ; xfs_info $mount | awk '$0 ~ "swidth"{gsub(/.*=/,"",$2); gsub(/.*=/,"",$3); print $2,$3}' | awk '{if ($1 > $2) print "impacted"; else print "OK"}' ; done
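The same check can be written as a small function that reads xfs_info output on standard input, which makes it easy to try against saved output. This is a sketch rather than a replacement for the one-liner: the function name xfs_stripe_status is made up, and besides sunit > swidth it also reports a width that is not a multiple of the unit, loosely mirroring the conditions in the kernel commit quoted earlier (it cannot see the superblock's alignment feature bit).

```shell
#!/bin/sh
# xfs_stripe_status
# Reads xfs_info output on stdin and prints "impacted", "OK" or "unknown".
# Only the data section line carries both sunit= and swidth=, so the log
# section's "sunit=1 blks" line is ignored.
xfs_stripe_status() {
    awk '
        /sunit=/ && /swidth=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^sunit=/)  { sunit  = substr($i, 7) + 0 }
                if ($i ~ /^swidth=/) { swidth = substr($i, 8) + 0 }
            }
            found = 1
        }
        END {
            if (!found) {
                print "unknown"
            } else if (sunit > 0 && (sunit > swidth || swidth % sunit != 0)) {
                print "impacted"
            } else if (sunit == 0 && swidth > 0) {
                print "impacted"
            } else {
                print "OK"
            }
        }'
}
```

On a live system this would be used as xfs_info /some/mountpoint | xfs_stripe_status.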
If you run into the above situation, the only known solution to get your original XFS partition working again is to boot into an older kernel version (4.17 or older), mount the XFS partition with correct sunit/swidth settings, and then boot back into your new system (kernel version wise).

Lessons learned

Thanks: Darshaka Pathirana, Chris Hofstaedtler and Michael Hanscho. Looking for help with your IT infrastructure? Let us know!

16 February 2021

Michael Prokop: How to properly use 3rd party Debian repository signing keys with apt

(Blogging this, since this is a recurring anti-pattern I noticed at several customers and it often comes up during deployments of 3rd party repositories.)

Update on 2021-02-19: clarified that Signed-By requires apt >= 1.1, thanks Vincent Bernat.

Many upstream projects provide Debian repository instructions like this:
curl -fsSL https://example.com/stable/debian.gpg | sudo apt-key add -
Do not follow this, for different reasons, including:
  1. You do not see what you get before adding the GPG key to your global apt trust store
  2. You can't easily script this via your preferred configuration management (the apt-key manpage clearly discourages programmatic usage)
  3. The signing key is considered valid for all your enabled Debian repositories (instead of only a specific one)
  4. You need GnuPG (either gnupg2 or gnupg1) on your system for usage with apt-key
There's a much better approach to this: download the GPG key, make sure it's in the appropriate format, then use it via deb [signed-by=/usr/share/keyrings/ ] in your apt's sources list configuration. Note and FTR: the Signed-By feature is available starting with apt 1.1 (so apt in Debian jessie/8 and older does not support it). TL;DR: As an example, let's demonstrate this with the Tailscale Debian repository for buster.
Downloading the GPG file will give you an ascii-armored GPG file:
% curl -fsSL -o buster.gpg https://pkgs.tailscale.com/stable/debian/buster.gpg
% gpg --keyid-format long buster.gpg 
gpg: WARNING: no command supplied.  Trying to guess what you mean ...
pub   rsa4096/458CA832957F5868 2020-02-25 [SC]
      2596A99EAAB33821893C0A79458CA832957F5868
uid                           Tailscale Inc. (Package repository signing key) <info@tailscale.com>
sub   rsa4096/B1547A3DDAAF03C6 2020-02-25 [E]
% file buster.gpg
buster.gpg: PGP public key block Public-Key (old)
If you have apt version >= 1.4 available (Debian >=stretch/9 and Ubuntu >=bionic/18.04), you can use this file directly as follows:
% sudo mv buster.gpg /usr/share/keyrings/tailscale.asc
% cat /etc/apt/sources.list.d/tailscale.list
deb [signed-by=/usr/share/keyrings/tailscale.asc] https://pkgs.tailscale.com/stable/debian buster main
% sudo apt update
[...]
And you're done! Iff your apt version really is older than 1.4, you need to convert the ascii-armored GPG file into a GPG key public ring file (AKA binary OpenPGP format), either by just dearmor-ing it (if you don't care about checking ID + fingerprint):
% gpg --dearmor < buster.gpg > tailscale.gpg
or if you prefer to go via GPG, you can also use a temporary GPG home directory (if you don't care about going through your personal GPG setup):
% mkdir --mode=700 /tmp/gpg-tmpdir
% gpg --homedir /tmp/gpg-tmpdir --import ./buster.gpg
gpg: keybox '/tmp/gpg-tmpdir/pubring.kbx' created
gpg: /tmp/gpg-tmpdir/trustdb.gpg: trustdb created
gpg: key 458CA832957F5868: public key "Tailscale Inc. (Package repository signing key) <info@tailscale.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
% gpg --homedir /tmp/gpg-tmpdir --output tailscale.gpg  --export-options=export-minimal --export 0x458CA832957F5868
% rm -rf /tmp/gpg-tmpdir
The resulting GPG key public ring file should look like that:
% file tailscale.gpg 
tailscale.gpg: PGP/GPG key public ring (v4) created Tue Feb 25 04:51:20 2020 RSA (Encrypt or Sign) 4096 bits MPI=0xc00399b10bc12858...
% gpg tailscale.gpg 
gpg: WARNING: no command supplied.  Trying to guess what you mean ...
pub   rsa4096/458CA832957F5868 2020-02-25 [SC]
      2596A99EAAB33821893C0A79458CA832957F5868
uid                           Tailscale Inc. (Package repository signing key) <info@tailscale.com>
sub   rsa4096/B1547A3DDAAF03C6 2020-02-25 [E]
Then you can use this GPG file on your system as follows:
% sudo mv tailscale.gpg /usr/share/keyrings/tailscale.gpg
% cat /etc/apt/sources.list.d/tailscale.list
deb [signed-by=/usr/share/keyrings/tailscale.gpg] https://pkgs.tailscale.com/stable/debian buster main
% sudo apt update
[...]
Such a setup ensures:
  1. You can verify the GPG key file (ID + fingerprint)
  2. You can easily ship files via /usr/share/keyrings/ and refer to it in your deployment scripts, configuration management, (and can also easily update or get rid of them again!)
  3. The GPG key is valid only for the repositories with the corresponding [signed-by=/usr/share/keyrings/ ] entry
  4. You don't need to install GnuPG (neither gnupg2 nor gnupg1) on the system which is using the 3rd party Debian repository
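The version-dependent choice described above (ascii-armored key for apt >= 1.4, binary keyring for apt 1.1 up to 1.3, no Signed-By support before 1.1) can be sketched for use in a deployment script. This is only an illustration: both function names are made up, and the version parsing assumes a plain MAJOR.MINOR prefix in the apt version string.

```shell
#!/bin/sh
# keyring_format_for_apt APT_VERSION
# Prints "asc" (armored key usable directly), "gpg" (binary keyring needed)
# or "unsupported" (no signed-by support), following the thresholds above.
# Assumption: APT_VERSION starts with numeric MAJOR.MINOR.
keyring_format_for_apt() {
    ver=$1
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%[!0-9]*}
    if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 4 ]; }; then
        echo asc
    elif [ "$major" -eq 1 ] && [ "$minor" -ge 1 ]; then
        echo gpg
    else
        echo unsupported
    fi
}

# apt_source_line KEYRING URI SUITE COMPONENT
# Prints a sources.list entry pinned to one keyring via signed-by.
apt_source_line() {
    printf 'deb [signed-by=%s] %s %s %s\n' "$1" "$2" "$3" "$4"
}
```

On a Debian system the installed version could be passed in via dpkg-query -W -f='${Version}' apt, and the output of apt_source_line redirected into a file below /etc/apt/sources.list.d/.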
Thanks: Guillem Jover for reviewing an early draft of this blog article.

15 January 2021

Michael Prokop: Revisiting 2020

Mainly to recall what happened last year and to give thoughts and plan for the upcoming year(s), I'm once again revisiting my previous year (previous editions: 2019, 2018, 2017, 2016, 2015, 2014, 2013 + 2012). Due to the Coronavirus disease (COVID-19) pandemic, 2020 was special for several reasons, but overall I consider myself and my family privileged and am very grateful for that.

In terms of IT events, I planned to attend Grazer Linuxdays and DebConf in Haifa/Israel. Sadly Grazer Linuxdays didn't take place at all, and DebConf took place online instead (which I didn't really participate in for several reasons). I took part in the well organized DENOG12 + ATNOG 2020/1 online meetings. I still organize our monthly Security Treff Graz (STG) meetups, and for half of the year, those meetings took place online (which worked OK-ish overall IMO).

Only at the beginning of 2020, I managed to play Badminton (still playing in the highest available training class (in German: "Kader") at the University of Graz / Universitäts-Sportinstitut, USI). For the rest of the year, except for ~2 weeks in October or so, the sessions couldn't occur.

Plenty of concerts I planned to attend were cancelled for obvious reasons, including the ones I would have played myself. But I managed to attend Jazz Redoute 2020 at Dom im Berg, Martin Grubinger in Musikverein Graz and Emiliano Sampaio's Mega Mereneu Project at WIST Moserhofgasse (all before the corona situation kicked in). The concert from Tonč Feinig & RTV Slovenia Big Band occurred under strict regulations in Summer, as well as the opera Elektra by Richard Strauss in a very special setting (only one piano player instead of the orchestra because of a Corona case in the orchestra) in Autumn. At the beginning of 2020, I also visited Literaturshow "Roboter mit Senf" at Literaturhaus Graz. The lack of concerts and rehearsals also severely impacted my playing the drums (including at HTU BigBand Graz), which pretty much didn't take place. :(

Grml-wise we managed to publish release 2020.06, codename Ausgehfuahangl. Regarding jenkins-debian-glue I tried to clarify its state and received some really lovely feedback. I consider 2020 as the year where I dropped regular usage of Jabber (so far my accounts still exist, but I'm no longer regularly online and am not sure for how much longer I'll keep my accounts alive as such).

Business-wise it was our seventh year of business with SynPro Solutions GmbH. No big news but steady and ongoing work with my other business duties Grml Solutions and Grml-Forensic.

As usual, I shared childcare with my wife. Due to the corona situation, my wife got a new working schedule, which shuffled around our schedule a bit on Mondays + Tuesdays. Still, we managed to handle the homeschooling/distance learning quite well. Currently we're sitting in the third lockdown, and yet another round of homeschooling/distance learning is going on those days (let's see how long...). I counted 112 actual school days in all of 2020 for our older daughter, with only 68 school days since our first lockdown on 16th of March, whereas we had 213(!) press conferences by our Austrian government in 2020. (Further rants about the situation in Austria snipped.)

Book reading-wise I managed to complete 60 books (see "Mein Lesejahr 2020"). Once again, I noticed that what felt like good days for me always included reading books, so I'll try to keep my reading pace for 2021. I'll also continue with my hobbies "Buying Books" and "Reading Books", to get worse at Tsundoku. Hoping for vaccination and a more normal 2021, Schwuppdiwupp!

3 July 2020

Michael Prokop: Grml 2020.06 Codename Ausgehfuahangl

We did it again: at the end of June we released Grml 2020.06, codename Ausgehfuahangl. This Grml release (a Linux live system for system administrators) is based on Debian/testing (AKA bullseye), provides current software packages as of June, incorporates up-to-date hardware support and fixes known issues from previous Grml releases.

I am especially fond of our cloud-init and qemu-guest-agent integration, which makes usage and automation in virtual environments like Proxmox VE much more comfortable. Once the Qemu Guest Agent setting is enabled in the VM options (also see Proxmox wiki), you'll see IP address information in the VM summary:

Screenshot of qemu guest agent integration

Using a cloud-init drive allows using an SSH key for login as user "grml", and you can control network settings as well:

Screenshot of cloud-init integration

It was fun to focus and work on this new Grml release together with Darsha, and we hope you enjoy the new Grml release as much as we do!

14 June 2017

Michael Prokop: Grml 2017.05 Codename Freedatensuppe

The Debian stretch release is going to happen soon (on 2017-06-17), and since our latest Grml release is based on a very recent version of Debian stretch, I'm taking this as an opportunity to announce it here as well. So by the end of May we released a new stable release of Grml (the Debian based live system focusing on system administrators' needs), known as version 2017.05 with codename Freedatensuppe. Details about the changes of the new release are available in the official release notes, and as usual the ISOs are available via grml.org/download. With this new Grml release we finally made the switch from file-rc to systemd. From a user's point of view this doesn't change that much, though to prevent having to answer even more mails regarding the switch I wrote down some thoughts in Grml's FAQ. There are some things that we still need to improve and sort out, but overall the switch to systemd so far went better than anticipated (thanks a lot to the pkg-systemd folks, especially Felipe Sateler and Michael Biebl!). And last but not least, Darshaka Pathirana helped me a lot with the systemd integration and polishing the release, many thanks! Happy Grml-ing!

26 May 2017

Michael Prokop: The #newinstretch game: dbgsym packages in Debian/stretch

Debug packages include debug symbols and so far were usually named <package>-dbg in Debian. Those packages are essential if you have to debug failing (especially: crashing) programs. Since December 2015 Debian has automatic dbgsym packages, which are built by default. Those packages are available as <package>-dbgsym, so starting with Debian/stretch you should no longer look for -dbg packages but for -dbgsym instead. Currently there are 13.369 dbgsym packages available for the amd64 architecture of Debian/stretch; compared to the 2.250 packages which I counted as being available for Debian/jessie this is really a huge improvement. (If you're interested in the details of dbgsym packages as a package maintainer take a look at the Automatic Debug Packages page in the Debian wiki.) The dbgsym packages are NOT provided by the usual Debian archive though (which is a good thing, since those packages are quite disk space consuming, e.g. just the amd64 stretch mirror of debian-debug consumes 47GB). Instead there's a new archive called debian-debug. To get access to the dbgsym packages via the debian-debug suite on your Debian/stretch system include the following entry in your apt's sources.list configuration (replace deb.debian.org with whatever mirror you prefer):
deb http://deb.debian.org/debian-debug/ stretch-debug main
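As a minimal sketch, the entry above can also be written to its own file under sources.list.d instead of editing sources.list by hand. The function name add_debug_suite and the file name debian-debug.list are assumptions made for illustration, not something the Debian archive prescribes:

```shell
#!/bin/sh
# Hypothetical helper: write the debian-debug entry for a suite to a file.
# Usage: add_debug_suite <target-file> [suite] [mirror]
add_debug_suite() {
    target=$1
    suite=${2:-stretch}
    mirror=${3:-http://deb.debian.org}
    printf 'deb %s/debian-debug/ %s-debug main\n' "$mirror" "$suite" > "$target"
}

# Assumed usage as root (commented out so the sketch has no side effects):
# add_debug_suite /etc/apt/sources.list.d/debian-debug.list stretch
# apt-get update && apt-get install coreutils-dbgsym
```

Keeping the entry in a separate file makes it easy to remove again once the debugging session is over.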
If you're not yet familiar with the usage of such debug packages let me give you a short demo. Let's start with sending SIGILL (Illegal Instruction) to a running sha256sum process, causing it to generate a so-called core dump file:
% sha256sum /dev/urandom &
[1] 1126
% kill -4 1126
% 
[1]+  Illegal instruction     (core dumped) sha256sum /dev/urandom
% ls
core
% file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'sha256sum /dev/urandom', real uid: 1000, effective uid: 1000, real gid: 1000, effective gid: 1000, execfn: '/usr/bin/sha256sum', platform: 'x86_64'
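Whether a core file actually shows up depends on the shell's core size limit (ulimit -c) and on the kernel's core_pattern setting, so the listing above may look different on your system. The following sketch uses sleep as a harmless stand-in for sha256sum and only shows the part that does not depend on those settings: bash and dash report a process killed by SIGILL (signal 4) with exit status 128 + 4 = 132. The function name sigill_status is made up for this example:

```shell
#!/bin/sh
# Print the current core file size limit; 0 means no core file is written.
ulimit -c

# Hypothetical helper: start a stand-in process, kill it with SIGILL and
# print the exit status the shell reports for it.
sigill_status() {
    sleep 30 &
    pid=$!
    kill -s ILL "$pid"
    wait "$pid"
    echo "$?"
}

# Usage: sigill_status   (prints 132 in bash and dash)
```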
Now we can run the GNU Debugger (gdb) on this core file, executing:
% gdb sha256sum core
[...]
Type "apropos word" to search for commands related to "word"...
Reading symbols from sha256sum...(no debugging symbols found)...done.
[New LWP 1126]
Core was generated by `sha256sum /dev/urandom'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x000055fe9aab63db in ?? ()
(gdb) bt
#0  0x000055fe9aab63db in ?? ()
#1  0x000055fe9aab8606 in ?? ()
#2  0x000055fe9aab4e5b in ?? ()
#3  0x000055fe9aab42ea in ?? ()
#4  0x00007faec30872b1 in __libc_start_main (main=0x55fe9aab3ae0, argc=2, argv=0x7ffc512951f8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc512951e8) at ../csu/libc-start.c:291
#5  0x000055fe9aab4b5a in ?? ()
(gdb) 
As you can see from the many ?? markers, the bt command (short for backtrace) doesn't provide useful information.
So let's install the corresponding debug package, which is coreutils-dbgsym in this case (since the sha256sum binary which generated the core file is part of the coreutils package). Then let's rerun the same gdb steps:
% gdb sha256sum core
[...]
Type "apropos word" to search for commands related to "word"...
Reading symbols from sha256sum...Reading symbols from /usr/lib/debug/.build-id/a4/b946ef7c161f2d215518ca38d3f0300bcbdbb7.debug...done.
done.
[New LWP 1126]
Core was generated by `sha256sum /dev/urandom'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x000055fe9aab63db in sha256_process_block (buffer=buffer@entry=0x55fe9be95290, len=len@entry=32768, ctx=ctx@entry=0x7ffc51294eb0) at lib/sha256.c:526
526     lib/sha256.c: No such file or directory.
(gdb) bt
#0  0x000055fe9aab63db in sha256_process_block (buffer=buffer@entry=0x55fe9be95290, len=len@entry=32768, ctx=ctx@entry=0x7ffc51294eb0) at lib/sha256.c:526
#1  0x000055fe9aab8606 in sha256_stream (stream=0x55fe9be95060, resblock=0x7ffc51295080) at lib/sha256.c:230
#2  0x000055fe9aab4e5b in digest_file (filename=0x7ffc51295f3a "/dev/urandom", bin_result=0x7ffc51295080 "\001", missing=0x7ffc51295078, binary=<optimized out>) at src/md5sum.c:624
#3  0x000055fe9aab42ea in main (argc=<optimized out>, argv=<optimized out>) at src/md5sum.c:1036
As you can see it's reading the debug symbols from /usr/lib/debug/.build-id/a4/b946ef7c161f2d215518ca38d3f0300bcbdbb7.debug and this is what we were looking for.
gdb now also tells us that we don't have lib/sha256.c available. For even better debugging it's useful to have the corresponding source code available. This is also just an apt-get source coreutils ; cd coreutils-8.26/ away:
~/coreutils-8.26 % gdb sha256sum ~/core
[...]
Type "apropos word" to search for commands related to "word"...
Reading symbols from sha256sum...Reading symbols from /usr/lib/debug/.build-id/a4/b946ef7c161f2d215518ca38d3f0300bcbdbb7.debug...done.
done.
[New LWP 1126]
Core was generated by `sha256sum /dev/urandom'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x000055fe9aab63db in sha256_process_block (buffer=buffer@entry=0x55fe9be95290, len=len@entry=32768, ctx=ctx@entry=0x7ffc51294eb0) at lib/sha256.c:526
526           R( h, a, b, c, d, e, f, g, K(25), M(25) );
(gdb) bt
#0  0x000055fe9aab63db in sha256_process_block (buffer=buffer@entry=0x55fe9be95290, len=len@entry=32768, ctx=ctx@entry=0x7ffc51294eb0) at lib/sha256.c:526
#1  0x000055fe9aab8606 in sha256_stream (stream=0x55fe9be95060, resblock=0x7ffc51295080) at lib/sha256.c:230
#2  0x000055fe9aab4e5b in digest_file (filename=0x7ffc51295f3a "/dev/urandom", bin_result=0x7ffc51295080 "\001", missing=0x7ffc51295078, binary=<optimized out>) at src/md5sum.c:624
#3  0x000055fe9aab42ea in main (argc=<optimized out>, argv=<optimized out>) at src/md5sum.c:1036
(gdb) 
Now we're ready for all the debugging magic. :) Thanks to everyone who was involved in getting us the automatic dbgsym package builds in Debian!
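The step from a binary to its debug package (sha256sum belongs to coreutils, hence coreutils-dbgsym) can be sketched as a small lookup. The helper names are hypothetical, and the lookup assumes a Debian system where the path reported by "command -v" is the one recorded in the dpkg database:

```shell
#!/bin/sh
# Hypothetical helper: map a binary package name to its automatic debug package.
dbgsym_name() {
    printf '%s-dbgsym\n' "$1"
}

# Hypothetical helper: find the package owning a program via dpkg-query -S
# and print the matching -dbgsym package name.
dbgsym_for_program() {
    path=$(command -v "$1") || return 1
    owner=$(dpkg-query -S "$path" 2>/dev/null | head -n 1 | cut -d: -f1)
    [ -n "$owner" ] || return 1
    dbgsym_name "$owner"
}

# Example on a Debian system: dbgsym_for_program sha256sum
```

Packages that still ship a hand-made -dbg package do not follow this naming, so treat the result as a first guess rather than a guarantee.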

25 May 2017

Michael Prokop: The #newinstretch game: new forensic packages in Debian/stretch

Repeating what I did for the last Debian releases with the #newinwheezy and #newinjessie games it's time for the #newinstretch game: Debian/stretch AKA Debian 9.0 will include a bunch of packages for people interested in digital forensics. The packages maintained within the Debian Forensics team which are new in the Debian/stretch release as compared to Debian/jessie (and ignoring jessie-backports): Join the #newinstretch game and present packages and features which are new in Debian/stretch.

19 May 2017

Michael Prokop: Debian stretch: changes in util-linux #newinstretch

We're coming closer to the Debian/stretch stable release and similar to what we had with #newinwheezy and #newinjessie it's time for #newinstretch! Hideki Yamane already started the game by blogging about GitHub's Icon font, fonts-octicons and Arturo Borrero Gonzalez wrote a nice article about nftables in Debian/stretch. One package that isn't new but whose tools are used by many of us is util-linux, providing many essential system utilities. We have util-linux v2.25.2 in Debian/jessie and in Debian/stretch there will be util-linux >=v2.29.2. There are many new options available and we also have a few new tools available.
Tools that have been taken over from other packages
New tools
New features/options
agetty (open a terminal and set its mode):
--reload reload prompts on running agetty instances
blkdiscard (discard the content of sectors on a device):
-p, --step <num>    size of the discard iterations within the offset
-z, --zeroout       zero-fill rather than discard
chrt (show or change the real-time scheduling attributes of a process):
-d, --deadline            set policy to SCHED_DEADLINE
-T, --sched-runtime <ns>  runtime parameter for DEADLINE
-P, --sched-period <ns>   period parameter for DEADLINE
-D, --sched-deadline <ns> deadline parameter for DEADLINE
fdformat (do a low-level formatting of a floppy disk):
-f, --from <N>    start at the track N (default 0)
-t, --to <N>      stop at the track N
-r, --repair <N>  try to repair tracks failed during the verification (max N retries)
fdisk (display or manipulate a disk partition table):
-B, --protect-boot            don't erase bootbits when creating a new label
-o, --output <list>           output columns
    --bytes                   print SIZE in bytes rather than in human readable format
-w, --wipe <mode>             wipe signatures (auto, always or never)
-W, --wipe-partitions <mode>  wipe signatures from new partitions (auto, always or never)
New available columns (for -o):
 gpt: Device Start End Sectors Size Type Type-UUID Attrs Name UUID
 dos: Device Start End Sectors Cylinders Size Type Id Attrs Boot End-C/H/S Start-C/H/S
 bsd: Slice Start End Sectors Cylinders Size Type Bsize Cpg Fsize
 sgi: Device Start End Sectors Cylinders Size Type Id Attrs
 sun: Device Start End Sectors Cylinders Size Type Id Flags
findmnt (find a (mounted) filesystem):
-J, --json             use JSON output format
-M, --mountpoint <dir> the mountpoint directory
-x, --verify           verify mount table content (default is fstab)
    --verbose          print more details
flock (manage file locks from shell scripts):
-F, --no-fork            execute command without forking
    --verbose            increase verbosity
getty (open a terminal and set its mode):
--reload               reload prompts on running agetty instances
hwclock (query or set the hardware clock):
--get            read hardware clock and print drift corrected result
--update-drift   update drift factor in /etc/adjtime (requires --set or --systohc)
ldattach (attach a line discipline to a serial line):
-c, --intro-command <string>  intro sent before ldattach
-p, --pause <seconds>         pause between intro and ldattach
logger (enter messages into the system log):
-e, --skip-empty         do not log empty lines when processing files
    --no-act             do everything except the write the log
    --octet-count        use rfc6587 octet counting
-S, --size <size>        maximum size for a single message
    --rfc3164            use the obsolete BSD syslog protocol
    --rfc5424[=<snip>]   use the syslog protocol (the default for remote);
                           <snip> can be notime, or notq, and/or nohost
    --sd-id <id>         rfc5424 structured data ID
    --sd-param <data>    rfc5424 structured data name=value
    --msgid <msgid>      set rfc5424 message id field
    --socket-errors[=<on|off|auto>] print connection errors when using Unix sockets
losetup (set up and control loop devices):
-L, --nooverlap               avoid possible conflict between devices
    --direct-io[=<on|off>]    open backing file with O_DIRECT
-J, --json                    use JSON --list output format
New available --list column:
DIO  access backing file with direct-io
lsblk (list information about block devices):
-J, --json           use JSON output format
New available columns (for --output):
HOTPLUG  removable or hotplug device (usb, pcmcia, ...)
SUBSYSTEMS  de-duplicated chain of subsystems
lscpu (display information about the CPU architecture):
-y, --physical          print physical instead of logical IDs
New available column:
DRAWER  logical drawer number
lslocks (list local system locks):
-J, --json             use JSON output format
-i, --noinaccessible   ignore locks without read permissions
nsenter (run a program with namespaces of other processes):
-C, --cgroup[=<file>]      enter cgroup namespace
    --preserve-credentials do not touch uids or gids
-Z, --follow-context       set SELinux context according to --target PID
rtcwake (enter a system sleep state until a specified wakeup time):
--date <timestamp>   date time of timestamp to wake
--list-modes         list available modes
sfdisk (display or manipulate a disk partition table):
New Commands:
-J, --json <dev>                  dump partition table in JSON format
-F, --list-free [<dev> ...]       list unpartitioned free areas of each device
-r, --reorder <dev>               fix partitions order (by start offset)
    --delete <dev> [<part> ...]   delete all or specified partitions
--part-label <dev> <part> [<str>] print or change partition label
--part-type <dev> <part> [<type>] print or change partition type
--part-uuid <dev> <part> [<uuid>] print or change partition uuid
--part-attrs <dev> <part> [<str>] print or change partition attributes
New Options:
-a, --append                   append partitions to existing partition table
-b, --backup                   backup partition table sectors (see -O)
    --bytes                    print SIZE in bytes rather than in human readable format
    --move-data[=<typescript>] move partition data after relocation (requires -N)
    --color[=<when>]           colorize output (auto, always or never)
                               colors are enabled by default
-N, --partno <num>             specify partition number
-n, --no-act                   do everything except write to device
    --no-tell-kernel           do not tell kernel about changes
-O, --backup-file <path>       override default backup file name
-o, --output <list>            output columns
-w, --wipe <mode>              wipe signatures (auto, always or never)
-W, --wipe-partitions <mode>   wipe signatures from new partitions (auto, always or never)
-X, --label <name>             specify label type (dos, gpt, ...)
-Y, --label-nested <name>      specify nested label type (dos, bsd)
Available columns (for -o):
 gpt: Device Start End Sectors Size Type Type-UUID Attrs Name UUID
 dos: Device Start End Sectors Cylinders Size Type Id Attrs Boot End-C/H/S Start-C/H/S
 bsd: Slice Start  End Sectors Cylinders Size Type Bsize Cpg Fsize
 sgi: Device Start End Sectors Cylinders Size Type Id Attrs
 sun: Device Start End Sectors Cylinders Size Type Id Flags
swapon (enable devices and files for paging and swapping):
-o, --options <list>     comma-separated list of swap options
New available columns (for --show):
UUID   swap uuid
LABEL  swap label
unshare (run a program with some namespaces unshared from the parent):
-C, --cgroup[=<file>]                              unshare cgroup namespace
    --propagation slave|shared|private|unchanged   modify mount propagation in mount namespace
-s, --setgroups allow|deny                         control the setgroups syscall in user namespaces
Deprecated / removed options
sfdisk (display or manipulate a disk partition table):
-c, --id                  change or print partition Id
    --change-id           change Id
    --print-id            print Id
-C, --cylinders <number>  set the number of cylinders to use
-H, --heads <number>      set the number of heads to use
-S, --sectors <number>    set the number of sectors to use
-G, --show-pt-geometry    deprecated, alias to --show-geometry
-L, --Linux               deprecated, only for backward compatibility
-u, --unit S              deprecated, only sector unit is supported
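The new --json switches are probably the most script-friendly additions in the list above. Here is a small sketch of how they can be called, using only options listed in this post (findmnt -J/-M, lsblk -J with the HOTPLUG column). The helper names are made up, and whether the calls succeed depends on the environment, e.g. a container without /proc or /sys:

```shell
#!/bin/sh
# Hypothetical helper: describe the root filesystem as JSON.
root_fs_json() {
    findmnt --json --mountpoint /
}

# Hypothetical helper: list block devices as JSON, restricted to two columns.
block_devices_json() {
    lsblk --json --output NAME,HOTPLUG
}

# Only call the tool when this util-linux installation provides it.
if command -v findmnt >/dev/null 2>&1; then
    root_fs_json || echo "findmnt could not describe /" >&2
fi
```

The JSON output can then be handed to any JSON-aware tool instead of parsing the column layout by hand.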

18 May 2017

Michael Prokop: Debugging a mystery: ssh causing strange exit codes?

[XKCD comic 1722] Recently we had a WTF moment at a customer of mine which is worth sharing. In an automated deployment procedure we're installing Debian systems and setting up MySQL HA/Scalability. Installation of the first node works fine, but during installation of the second node something weird is going on. Even though the deployment procedure reported that everything went fine: it wasn't fine at all. After bisecting to the relevant command lines where it's going wrong we identified that the failure is happening between two ssh/scp commands, which are invoked inside a chroot through a shell wrapper. The ssh command caused a wrong exit code showing up: instead of bailing out with an error (we're running under set -e) it returned with exit code 0 and the deployment procedure continued, even though there was a fatal error. Initially we triggered the bug when two ssh/scp command lines close to each other were executed, but I managed to find a minimal example for demonstration purposes:
# cat ssh_wrapper 
chroot << "EOF" / /bin/bash
ssh root@localhost hostname >/dev/null
exit 1
EOF
echo "return code = $?"
What we'd expect is the following behavior, receiving exit code 1 from the last command line in the chroot wrapper:
# ./ssh_wrapper 
return code = 1
But what we actually get is exit code 0:
# ./ssh_wrapper 
return code = 0
Uhm?! So what's going wrong and what's the fix? Let's find out what's causing the problem:
# cat ssh_wrapper 
chroot << "EOF" / /bin/bash
ssh root@localhost command_does_not_exist >/dev/null 2>&1
exit "$?"
EOF
echo "return code = $?"
# ./ssh_wrapper 
return code = 127
Ok, so if we invoke it with a binary that does not exist we properly get exit code 127, as expected.
What about switching /bin/bash to /bin/sh (which corresponds to dash here) to make sure it's not a bash bug:
# cat ssh_wrapper 
chroot << "EOF" / /bin/sh
ssh root@localhost hostname >/dev/null
exit 1
EOF
echo "return code = $?"
# ./ssh_wrapper 
return code = 1
Oh, but that works as expected!? When looking at this behavior I had the feeling that something is going wrong with file descriptors. So what about wrapping the ssh command line within different tools? No luck with stdbuf -i0 -o0 -e0 ssh root@localhost hostname, nor with script -c "ssh root@localhost hostname" /dev/null and also not with socat EXEC:"ssh root@localhost hostname" STDIO. But it works under unbuffer(1) from the expect package:
# cat ssh_wrapper 
chroot << "EOF" / /bin/bash
unbuffer ssh root@localhost hostname >/dev/null
exit 1
EOF
echo "return code = $?"
# ./ssh_wrapper 
return code = 1
So my bet on something with the file descriptor handling was right. Going through the ssh manpage, what about using ssh's -n option to prevent reading from standard input (stdin)?
# cat ssh_wrapper
chroot << "EOF" / /bin/bash
ssh -n root@localhost hostname >/dev/null
exit 1
EOF
echo "return code = $?"
# ./ssh_wrapper 
return code = 1
Bingo! Quoting ssh(1):
     -n      Redirects stdin from /dev/null (actually, prevents reading from stdin).
             This must be used when ssh is run in the background.  A common trick is
             to use this to run X11 programs on a remote machine.  For example,
             ssh -n shadows.cs.hut.fi emacs & will start an emacs on shadows.cs.hut.fi,
             and the X11 connection will be automatically forwarded over an encrypted
             channel.  The ssh program will be put in the background.  (This does not work
             if ssh needs to ask for a password or passphrase; see also the -f option.)
Let's execute the scripts through strace -ff -s500 ./ssh_wrapper to see what's going on in more detail.
In the strace run without ssh's -n option we see that it's duplicating stdin (file descriptor 0), with the copy getting assigned to file descriptor 4:
dup(0)            = 4
[...]
read(4, "exit 1\n", 16384) = 7
while in the strace run with ssh's -n option being present there's no file descriptor duplication but only:
open("/dev/null", O_RDONLY) = 4
This matches ssh.c's ssh_session2_open function (where stdin_null_flag corresponds to ssh's -n option):
        if (stdin_null_flag)
                in = open(_PATH_DEVNULL, O_RDONLY);
        else
                in = dup(STDIN_FILENO);
This behavior can also be simulated if we explicitly read from /dev/null, and this indeed works as well:
# cat ssh_wrapper
chroot << "EOF" / /bin/bash
ssh root@localhost hostname >/dev/null </dev/null
exit 1
EOF
echo "return code = $?"
# ./ssh_wrapper 
return code = 1
The underlying problem is that both bash and ssh are consuming from stdin. This can be verified via:
# cat ssh_wrapper
chroot << "EOF" / /bin/bash
echo "Inner: pre"
while read line; do echo "Eat: $line" ; done
echo "Inner: post"
exit 3
EOF
echo "Outer: exit code = $?"
# ./ssh_wrapper
Inner: pre
Eat: echo "Inner: post"
Eat: exit 3
Outer: exit code = 0
This behavior applies to bash, ksh, mksh, posh and zsh. Only dash doesn't show this behavior.
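To repeat this comparison on your own system, a loop along the following lines could be used. It is only a sketch: the function name probe_shell is made up, shells that are not installed are reported as such, and newer dash versions may behave differently from the one examined here. An exit code of 0 means the read loop ate the rest of the script, while 3 means the shell had already read ahead and still reached the final exit 3:

```shell
#!/bin/sh
# Hypothetical helper: feed the test script to the given shell on stdin
# and print the exit code that shell returns.
probe_shell() {
    command -v "$1" >/dev/null 2>&1 || { echo "not installed"; return 0; }
    "$1" <<'EOF' >/dev/null 2>&1
echo "Inner: pre"
while read line; do echo "Eat: $line"; done
echo "Inner: post"
exit 3
EOF
    echo "$?"
}

for shell in bash dash ksh mksh posh zsh; do
    printf '%s: %s\n' "$shell" "$(probe_shell "$shell")"
done
```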
To understand the difference between bash and dash executions we can use the following test scripts:
# cat stdin-test-cmp
#!/bin/sh
TEST_SH=bash strace -v -s500 -ff ./stdin-test 2>&1 | tee stdin-test-bash.out
TEST_SH=dash strace -v -s500 -ff ./stdin-test 2>&1 | tee stdin-test-dash.out
# cat stdin-test
#!/bin/sh
: ${TEST_SH:=dash}
$TEST_SH <<"EOF"
echo "Inner: pre"
while read line; do echo "Eat: $line"; done
echo "Inner: post"
exit 3
EOF
echo "Outer: exit code = $?"
When executing ./stdin-test-cmp and comparing the generated files stdin-test-bash.out and stdin-test-dash.out you'll notice that dash consumes all stdin in one single go (a single read(0, ...)), instead of character-by-character as specified by POSIX and implemented by bash, ksh, mksh, posh and zsh. See stdin-test-bash.out on the left side and stdin-test-dash.out on the right side in this screenshot: [screenshot of vimdiff on *.out files] So when ssh tries to read from stdin there's nothing there anymore. Quoting POSIX's sh section:
When the shell is using standard input and it invokes a command that also uses standard input, the shell shall ensure that the standard input file pointer points directly after the command it has read when the command begins execution. It shall not read ahead in such a manner that any characters intended to be read by the invoked command are consumed by the shell (whether interpreted by the shell or not) or that characters that are not read by the invoked command are not seen by the shell. When the command expecting to read standard input is started asynchronously by an interactive shell, it is unspecified whether characters are read by the command or interpreted by the shell. If the standard input to sh is a FIFO or terminal device and is set to non-blocking reads, then sh shall enable blocking reads on standard input. This shall remain in effect when the command completes.
So while we learned that both bash and ssh are consuming from stdin and this needs to be prevented by either using ssh's -n or explicitly specifying stdin, we also noticed that dash's behavior is different from all the other main shells and could be considered a bug (which we reported as #862907).
Lessons learned:
Thanks to Guillem Jover for review and feedback regarding this blog post.
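The whole effect can also be reproduced without a remote host. In the following sketch cat stands in for ssh (an assumption for demonstration purposes: like ssh without -n, cat reads whatever is left on stdin), and the script is fed to bash through a pipe instead of the here-document used in this post. The function name wrapped_exit_code is made up:

```shell
#!/bin/sh
# Hypothetical helper: run a two-line script through bash on stdin.
# $1 is the first line, standing in for the ssh call; the second line
# is always "exit 1".
wrapped_exit_code() {
    printf '%s\nexit 1\n' "$1" | bash
    echo "return code = $?"
}

wrapped_exit_code 'cat >/dev/null'             # cat eats "exit 1": return code = 0
wrapped_exit_code 'cat >/dev/null </dev/null'  # stdin detached:    return code = 1
```

Redirecting stdin from /dev/null is the generic form of the fix; for ssh itself the -n option does the same job.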

28 July 2016

Michael Prokop: systemd backport of v230 available for Debian/jessie

At DebConf 16 I was working on a systemd backport for Debian/jessie. Results are officially available via the Debian archive now. In Debian jessie we have systemd v215 (which originally dates back to 2014-07-03 upstream-wise, plus changes + fixes from pkg-systemd folks of course). Now via Debian backports you have the option to update systemd to a very recent version: v230. If you have jessie-backports enabled it's just an apt install systemd -t jessie-backports away. For the upstream changes between v215 and v230 see upstream's NEWS file. (Actually the systemd backport has been available since 2016-07-19 for amd64, arm64 + armhf, though for mips, mipsel, powerpc, ppc64el + s390x we had to fight against GCC ICEs when compiling on/for Debian/jessie and for the i386 architecture the systemd test-suite identified broken O_TMPFILE permission handling.) Thanks to Alexander Wirt from the backports team for accepting my backport, thanks to intrigeri for the related apparmor backport, Guus Sliepen for the related ifupdown backport and Didier Raboud for the related usb-modeswitch/usb-modeswitch-data backports. Thanks to everyone testing my systemd backport and reporting feedback. Thanks a lot to Felipe Sateler and Martin Pitt for reviews, feedback and cooperation. And special thanks to Michael Biebl for all his feedback, reviews and help with the systemd backport from its very beginnings until the latest upload. PS: I cannot stress enough how fantastic Debian's pkg-systemd team is. Responsive, friendly, helpful, dedicated and skilled folks, thanks folks!

19 July 2016

Michael Prokop: DebConf16 in Capetown/South Africa: Lessons learnt

DebConf 16 in Capetown/South Africa was fantastic for many reasons. My Capetown/South Africa/Culture/Flight related lessons: My technical lessons from DebConf16: BTW, thanks to the video team the recordings from the sessions are available online.

26 May 2016

Michael Prokop: My talk at OSDC 2016: Continuous Integration in Data Centers - Further 3 Years Later

Open Source Data Center Conference (OSDC) was a pleasure and a great event, Netways clearly knows how to run a conference. This year at OSDC 2016 I gave a talk titled "Continuous Integration in Data Centers - Further 3 Years Later". The slides from this talk are available online (PDF, 6.2MB). Thanks to the Netways folks a recording is available as well: This embedded video doesn't work for you? Try heading over to YouTube. Note: my talk was kind of an update and extension for the (German) talk I gave at OSDC 2013. If you're interested, the slides (PDF, 4.3MB) and the recording (YouTube) from my talk in 2013 are available online as well.

18 April 2016

Michael Prokop: Event: DebConf 16

Yes, I'm going to DebConf 16! This year DebConf, the Debian Developer Conference, will take place in Cape Town, South Africa.
Outbound:
2016-06-26 15:40 VIE -> 17:10 LHR BA0703
2016-06-26 21:30 LHR -> 09:55 CPT BA0059
Inbound:
2016-07-09 19:30 CPT -> 06:15 LHR BA0058
2016-07-10 07:55 LHR -> 11:05 VIE BA0696

31 March 2016

Michael Prokop: Event: OSDC 2016

Open Source Data Center Conference (OSDC) is a conference on open source software in data centers and huge IT environments and will take place in Berlin/Germany in April 2016. I will give a talk titled "Continuous Integration in Data Centers - Further 3 Years Later" there. I gave a talk titled "Continuous Integration in data centers" at OSDC in 2013, presenting ways to realize continuous integration/delivery with Jenkins and related tools. Three years later we gained new tools in our continuous delivery pipeline, including Docker, Gerrit and Goss. Over the years we also had to deal with different problems caused by faster release cycles, a growing team and gaining new projects. We therefore established code review in our pipeline, improved our test infrastructure and invested in our infrastructure automation. In this talk I will discuss the lessons we learned over the last years, demonstrate how a proper continuous delivery pipeline can improve your life and how open source tools like Jenkins, Docker and Gerrit can be leveraged for setting up such an environment. Hope to see you there!

24 August 2015

Michael Prokop: DebConf15: Continuous Delivery of Debian packages talk

At the Debian Conference 2015 I gave a talk about Continuous Delivery of Debian packages. My slides are available online (PDF, 753KB). Thanks to the fantastic video team there's also a recording of the talk available: WebM (471MB) and on YouTube.

18 August 2015

Arturo Borrero González: 2015 FLOSS summer report

Good news. Many things happened since my last report (8 months ago), some of them very interesting :-)

debian maintainer

Back in April 2015 I applied to become a Debian Maintainer (DM). I was supported by several Debian Developers (DD), including Ana Guerrero, Anibal Monsalve, Michael Prokop and Vincent Cheng. They are people I have been involved with in one way or another recently (developing, in-person meetings, other talks...).

After a month or two, my PGP key was added to the debian keyring.

And what does this mean? If a DD gives me the corresponding authorization, I can now upload packages directly to the archive without the need for a sponsor.

I have been maintaining packages as a standard contributor since early 2014. From 2014 to 2015 I've learned many many things about Debian. That knowledge was key to becoming a DM.

Google Summer of Code 2015

This is my third year in GSoC. In 2013 and 2014 I was involved with the Netfilter Project, but this time I'm contributing to the Debian project.
Concretely, my project is "Improve the Debian port mipsel".

Most software is developed to run on common CPU architectures like amd64 and i386. However, Debian can run on a large variety of arches (not many operating systems have this power). Developers tend to consider these arches 'exotic' and don't pay much attention to them.
The mips/mipsel architecture is somewhat similar to arm: it's mainly intended for small devices.

My tasks consist mainly of fixing bugs and FTBFS errors on the mipsel architecture.

Roughly speaking, this can be done in two ways: emulating the mipsel arch using qemu, or using a physical mipsel machine. The qemu way is very very slow. Fortunately, as part of my GSoC involvement, I was given a ci20 mipsel board by Imagination Technologies. I have been using this board for all my GSoC work.

Detailing my work during this GSoC deserves its own blog post. However, the Debian workflow for GSoC'15 requires a weekly report, and here are mine:

  1. week 1
  2. week 2
  3. week 3
  4. week 4
  5. week 5
  6. week 6
  7. week 7
  8. week 8
  9. week 9
  10. week 10
  11. week 11
  12. week 12

no longer involved with the Netfilter Project

Such is life. Days only have 24 hours. I had to 'refactor' my priorities and my involvement with the Netfilter Project is now almost none. This happened back in May'15. I was involved in so many things that I was stressed and even had anxiety. Among other things, this hard decision meant that I missed the Netfilter Workshop 2015 in Budapest :-(

My plan for 2016 is to focus on university and pay the bills with my full-time job as a system administrator.

other debian stuff

Regarding packaging, it's worth mentioning my latest new package: liquidprompt. For people who get their hands dirty with the CLI, I recommend it :-)
I made a lot of updates to the other packages as well.

The nftables package is now in jessie-backports. Debian now includes Linux v4 in jessie-backports as well, which means you can start playing with a full-featured nftables right now :-)
I'm looking forward to packaging the next upstream version of nftables, which will include exciting new changes.

best regards!
