
13 May 2022

Antoine Beaupré: BTRFS notes

I'm not a fan of BTRFS. This page serves as a reminder of why, but also a cheat sheet to figure out basic tasks in a BTRFS environment because those are not obvious to me, even after repeatedly having to deal with them. Content warning: there might be mentions of ZFS.

Stability concerns I'm worried about BTRFS stability, which has been historically ... changing. RAID-5 and RAID-6 are still marked unstable, for example. It's kind of a lucky guess whether your current kernel will behave properly with your planned workload. For example, in Linux 4.9, RAID-1 and RAID-10 were marked as "mostly OK" with a note that says:
Needs to be able to create two copies always. Can get stuck in irreversible read-only mode if only one copy can be made.
Even as of now, RAID-1 and RAID-10 have this note:
The simple redundancy RAID levels utilize different mirrors in a way that does not achieve the maximum performance. The logic can be improved so the reads will spread over the mirrors evenly or based on device congestion.
Granted, that's not a stability concern anymore, just performance. A reviewer of a draft of this article actually claimed that BTRFS only reads from one of the drives, which hopefully is inaccurate, but goes to show how confusing all this is. There are other warnings in the Debian wiki that are quite scary. Even the legendary Arch wiki has a warning on top of their BTRFS page, still. Even if those issues are now fixed, it can be hard to tell when they were fixed. There is a changelog by feature, but it explicitly warns that it doesn't know "which kernel version it is considered mature enough for production use", so it's also useless for this. It would have been much better if BTRFS had been released into the world only once those bugs were completely fixed. Or if, at least, features were announced only when they were stable, not just "we merged to mainline, good luck". Even now, we get mixed messages in the official BTRFS documentation, which says "The Btrfs code base is stable" (main page) while at the same time clearly listing unstable parts in the status page (currently RAID56). There are much harsher BTRFS critics than me out there, so I will stop here, but let's just say that I feel a little uncomfortable trusting server data on full RAID arrays to BTRFS. But surely, for a workstation, things should just work smoothly... Right? Well, let's see the snags I hit.

My BTRFS test setup Before I go any further, I should probably clarify how I am testing BTRFS in the first place. The reason I tried BTRFS is that I was ... let's just say "strongly encouraged" by the LWN editors to install Fedora for the terminal emulators series. That, in turn, meant the setup was done with BTRFS, because that was somewhat the default in Fedora 27 (or did I want to experiment? I don't remember, it's been too long already). So Fedora was set up on my 1TB HDD and, with encryption, the partition table looks like this:
NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                      8:0    0 931,5G  0 disk  
├─sda1                   8:1    0   200M  0 part  /boot/efi
├─sda2                   8:2    0     1G  0 part  /boot
├─sda3                   8:3    0   7,8G  0 part  
│ └─fedora_swap        253:5    0   7.8G  0 crypt [SWAP]
└─sda4                   8:4    0 922,5G  0 part  
  └─fedora_crypt       253:4    0 922,5G  0 crypt /
(This might not entirely be accurate: I rebuilt this from the Debian side of things.) This is pretty straightforward, except for the swap partition: normally, I just treat swap like any other logical volume and create it as a logical volume. This is now just speculation, but I bet it was set up this way because "swap" support was only added in BTRFS 5.0. I fully expect BTRFS experts to yell at me now because this is an old setup and BTRFS is so much better now, but that's exactly the point here. That setup is not that old (2018? old? really?), and migrating to a new partition scheme isn't exactly practical right now. But let's move on to more practical considerations.

No builtin encryption BTRFS aims at replacing the entire mdadm, LVM, and ext4 stack with a single entity, adding new features like deduplication, checksums and so on. Yet there is one feature it is critically missing: encryption. See, my typical stack is actually mdadm, LUKS, and then LVM and ext4. This is convenient because I have only a single volume to decrypt. If I were to use BTRFS on servers, I'd need to have one LUKS volume per disk. For a simple RAID-1 array, that's not too bad: one extra key. But for large RAID-10 arrays, this gets really unwieldy. The obvious BTRFS alternative, ZFS, supports encryption out of the box and handles it above the disks, so you only have one passphrase to enter. The main downside of ZFS encryption is that it happens above the "pool" level, so you can typically see filesystem names (and possibly snapshots, depending on how it is built), which is not the case with a more traditional stack.
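To make this concrete, here is a minimal sketch of the per-disk LUKS dance for a hypothetical two-disk BTRFS RAID-1 (device names made up, and obviously destructive, so don't run this blindly): each member gets its own LUKS container, and BTRFS does the mirroring above them:
cryptsetup luksFormat /dev/sdb1
cryptsetup luksFormat /dev/sdc1
cryptsetup open /dev/sdb1 crypt_sdb1    # first passphrase prompt
cryptsetup open /dev/sdc1 crypt_sdc1    # ... and a second one
mkfs.btrfs -m raid1 -d raid1 /dev/mapper/crypt_sdb1 /dev/mapper/crypt_sdc1
With mdadm + LUKS, by contrast, only the single /dev/md device would need to be unlocked.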

Subvolumes, filesystems, and devices I find BTRFS's architecture to be utterly confusing. In the traditional LVM stack (which is itself kind of confusing if you're new to that stuff), you have those layers:
  • disks: let's say /dev/nvme0n1 and nvme1n1
  • RAID arrays with mdadm: let's say the above disks are joined in a RAID-1 array in /dev/md1
  • volume groups or VG with LVM: the above RAID device (technically a "physical volume" or PV) is assigned into a VG, let's call it vg_tbbuild05 (multiple PVs can be added to a single VG which is why there is that abstraction)
  • LVM logical volumes: out of that volume group actually "virtual partitions" or "logical volumes" are created, that is where your filesystem lives
  • filesystem, typically with ext4: that's your normal filesystem, which treats the logical volume as just another block device
A typical server setup would look like this:
NAME                      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1                   259:0    0   1.7T  0 disk  
 nvme0n1p1               259:1    0     8M  0 part  
 nvme0n1p2               259:2    0   512M  0 part  
   md0                     9:0    0   511M  0 raid1 /boot
 nvme0n1p3               259:3    0   1.7T  0 part  
   md1                     9:1    0   1.7T  0 raid1 
     crypt_dev_md1       253:0    0   1.7T  0 crypt 
       vg_tbbuild05-root 253:1    0    30G  0 lvm   /
       vg_tbbuild05-swap 253:2    0 125.7G  0 lvm   [SWAP]
       vg_tbbuild05-srv  253:3    0   1.5T  0 lvm   /srv
 nvme0n1p4               259:4    0     1M  0 part
I stripped the other nvme1n1 disk because it's basically the same. Now, if we look at my BTRFS-enabled workstation, which doesn't even have RAID, we have the following:
  • disk: /dev/sda with, again, /dev/sda4 being where BTRFS lives
  • filesystem: fedora_crypt, which is, confusingly, kind of like a volume group. It's where everything lives. I think.
  • subvolumes: home, root, /, etc. Those are actually the things that get mounted. You'd think you'd mount a filesystem, but no, you mount a subvolume. That is backwards.
It looks something like this to lsblk:
NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                      8:0    0 931,5G  0 disk  
├─sda1                   8:1    0   200M  0 part  /boot/efi
├─sda2                   8:2    0     1G  0 part  /boot
├─sda3                   8:3    0   7,8G  0 part  [SWAP]
└─sda4                   8:4    0 922,5G  0 part  
  └─fedora_crypt       253:4    0 922,5G  0 crypt /srv
Notice how we don't see all the BTRFS volumes here? Maybe it's because I'm mounting this from the Debian side, but lsblk definitely gets confused here. I frankly don't quite understand what's going on, even after repeatedly looking around the rather dismal documentation. But that's what I gather from the following commands:
root@curie:/home/anarcat# btrfs filesystem show
Label: 'fedora'  uuid: 5abb9def-c725-44ef-a45e-d72657803f37
    Total devices 1 FS bytes used 883.29GiB
    devid    1 size 922.47GiB used 916.47GiB path /dev/mapper/fedora_crypt
root@curie:/home/anarcat# btrfs subvolume list /srv
ID 257 gen 108092 top level 5 path home
ID 258 gen 108094 top level 5 path root
ID 263 gen 108020 top level 258 path root/var/lib/machines
I only got to that point through trial and error. Notice how I use an existing mountpoint to list the related subvolumes. If I try to use the filesystem path, the one that's listed in filesystem show, I fail:
root@curie:/home/anarcat# btrfs subvolume list /dev/mapper/fedora_crypt 
ERROR: not a btrfs filesystem: /dev/mapper/fedora_crypt
ERROR: can't access '/dev/mapper/fedora_crypt'
Maybe I just need to use the label? Nope:
root@curie:/home/anarcat# btrfs subvolume list fedora
ERROR: cannot access 'fedora': No such file or directory
ERROR: can't access 'fedora'
This is really confusing. I don't even know if I understand this right, and I've been staring at this all afternoon. Hopefully, the lazyweb will correct me eventually. (As an aside, why are they called "subvolumes"? If something is a "sub" of "something else", that "something else" must exist, right? But no, BTRFS doesn't have "volumes", it only has "subvolumes". Go figure. Presumably the filesystem still holds "files", though; at least empirically it doesn't seem like it lost anything so far.) In any case, at least I can refer to this section in the future, the next time I fumble around the btrfs commandline, as I surely will. I will possibly even update this section as I get better at it, or based on my readers' judicious feedback.

Mounting BTRFS subvolumes So how did I even get to that point? I have this in my /etc/fstab, on the Debian side of things:
UUID=5abb9def-c725-44ef-a45e-d72657803f37   /srv    btrfs  defaults 0   2
This thankfully ignores all the subvolume nonsense because it relies on the UUID. mount tells me that's actually the top-level subvolume (subvol=/):
root@curie:/home/anarcat# mount | grep /srv
/dev/mapper/fedora_crypt on /srv type btrfs (rw,relatime,space_cache,subvolid=5,subvol=/)
Let's see if I can mount the other volumes I have on there. Remember that subvolume list showed I had home, root, and var/lib/machines. Let's try root:
mount -o subvol=root /dev/mapper/fedora_crypt /mnt
Interestingly, root is not the same as /, it's a different subvolume! It seems to be the Fedora root (/, really) filesystem. No idea what is happening here. I also have a home subvolume, let's mount it too, for good measure:
mount -o subvol=home /dev/mapper/fedora_crypt /mnt/home
Note that lsblk doesn't notice those two new mountpoints, and that's normal: it only lists block devices, and subvolumes (rather inconveniently, I'd say) do not show up as devices:
root@curie:/home/anarcat# lsblk 
NAME                   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                      8:0    0 931,5G  0 disk  
├─sda1                   8:1    0   200M  0 part  
├─sda2                   8:2    0     1G  0 part  
├─sda3                   8:3    0   7,8G  0 part  
└─sda4                   8:4    0 922,5G  0 part  
  └─fedora_crypt       253:4    0 922,5G  0 crypt /srv
This is really, really confusing. Maybe I did something wrong in the setup. Maybe it's because I'm mounting it from outside Fedora. Either way, it just doesn't feel right.
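For reference, my understanding is that mounting those subvolumes persistently is just a matter of repeating the fstab entry with a subvol= option, something like this (untested here, same UUID as above):
UUID=5abb9def-c725-44ef-a45e-d72657803f37   /mnt       btrfs  defaults,subvol=root  0   2
UUID=5abb9def-c725-44ef-a45e-d72657803f37   /mnt/home  btrfs  defaults,subvol=home  0   2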

No disk usage per volume If you want to see what's taking up space in one of those subvolumes, tough luck:
root@curie:/home/anarcat# df -h  /srv /mnt /mnt/home
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/fedora_crypt  923G  886G   31G  97% /srv
/dev/mapper/fedora_crypt  923G  886G   31G  97% /mnt
/dev/mapper/fedora_crypt  923G  886G   31G  97% /mnt/home
(Notice, in passing, that it looks like the same filesystem is mounted in different places. In that sense, you'd expect /srv and /mnt (and /mnt/home?!) to be exactly the same, but no: they are entirely different directory structures, which I will not call "filesystems" here because everyone's head will explode in sparks of confusion.) Yes, disk space is shared (that's the Size and Avail columns, makes sense). But nope, no cookie for you: they all have the same Used columns, so you need to actually walk the entire filesystem to figure out how much space each subvolume takes. (For future reference, that's basically:
root@curie:/home/anarcat# time du -schx /mnt/home /mnt /srv
124M    /mnt/home
7.5G    /mnt
875G    /srv
883G    total
real    2m49.080s
user    0m3.664s
sys 0m19.013s
And yes, that was painfully slow.) ZFS actually has some oddities in that regard, but at least it tells me how much disk each volume (and snapshot) takes:
root@tubman:~# time df -t zfs -h
Filesystem         Size  Used Avail Use% Mounted on
rpool/ROOT/debian  3.5T  1.4G  3.5T   1% /
rpool/var/tmp      3.5T  384K  3.5T   1% /var/tmp
rpool/var/spool    3.5T  256K  3.5T   1% /var/spool
rpool/var/log      3.5T  2.0G  3.5T   1% /var/log
rpool/home/root    3.5T  2.2G  3.5T   1% /root
rpool/home         3.5T  256K  3.5T   1% /home
rpool/srv          3.5T   80G  3.5T   3% /srv
rpool/var/cache    3.5T  114M  3.5T   1% /var/cache
bpool/BOOT/debian  571M   90M  481M  16% /boot
real    0m0.003s
user    0m0.002s
sys 0m0.000s
That's 56360 times faster, by the way. But yes, that's not fair: those in the know will know there's a different command to do what df does with BTRFS filesystems, the btrfs filesystem usage command:
root@curie:/home/anarcat# time btrfs filesystem usage /srv
Overall:
    Device size:         922.47GiB
    Device allocated:        916.47GiB
    Device unallocated:        6.00GiB
    Device missing:          0.00B
    Used:            884.97GiB
    Free (estimated):         30.84GiB  (min: 27.84GiB)
    Free (statfs, df):        30.84GiB
    Data ratio:               1.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB  (used: 0.00B)
    Multiple profiles:              no
Data,single: Size:906.45GiB, Used:881.61GiB (97.26%)
   /dev/mapper/fedora_crypt  906.45GiB
Metadata,DUP: Size:5.00GiB, Used:1.68GiB (33.58%)
   /dev/mapper/fedora_crypt   10.00GiB
System,DUP: Size:8.00MiB, Used:128.00KiB (1.56%)
   /dev/mapper/fedora_crypt   16.00MiB
Unallocated:
   /dev/mapper/fedora_crypt    6.00GiB
real    0m0,004s
user    0m0,000s
sys 0m0,004s
Almost as fast as ZFS's df! Good job. But wait. That doesn't actually tell me usage per subvolume. Notice it's filesystem usage, not subvolume usage, which unhelpfully refuses to exist. That command only shows that one "filesystem's" internal statistics, which are pretty opaque. You can also appreciate that it's wasting 6GB of "unallocated" disk space there: I probably did something Very Wrong and should be punished by Hacker News. I also wonder why it has 1.68GB of "metadata" used... At this point, I just really want to throw that thing out of the window and restart from scratch. I don't really feel like learning the BTRFS internals, as they seem oblique and completely bizarre to me. It feels a little like the state of PHP now: it's actually pretty solid, but built upon so many layers of cruft that I still feel it corrupts my brain every time I have to deal with it (needle or haystack first? anyone?)...
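(One partial workaround I found later: per-subvolume accounting does seem to exist, but only if you opt into quota groups, which reportedly carry their own performance costs. Untested on this particular setup, but it should look something like this:
btrfs quota enable /srv
btrfs quota rescan -w /srv   # wait for the initial accounting scan
btrfs qgroup show /srv       # referenced/exclusive bytes per qgroup
The qgroup IDs (0/257 and so on) map back to the subvolume IDs shown by btrfs subvolume list.)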

Conclusion I find BTRFS utterly confusing and I'm worried about its reliability. I think a lot of work is needed on usability and coherence before I would even consider running this anywhere other than a lab, and that's really too bad, because there are really nice features in BTRFS that would greatly help my workflow. (I want to use filesystem snapshots as high-performance, high-frequency backups.) So now I'm experimenting with OpenZFS. It's so much simpler, it just works, and it's rock solid. After this 8-minute read, I had a good understanding of how ZFS worked. Here's the 30-second overview:
  • vdev: a RAID array
  • zpool: a volume group of vdevs
  • datasets: normal filesystems (or block devices, if you want to use another filesystem on top of ZFS)
There are also other special volumes, like caches and logs, that you can (really easily, compared to LVM caching) use to tweak your setup. You might also want to look at recordsize or ashift to tune the filesystem to better fit your workload (or to deal with drives lying about their sector size, I'm looking at you, Samsung), but that's it. Running ZFS on Linux currently involves building kernel modules from scratch on every host, which I think is pretty bad. But I was able to set up a ZFS-only server using this excellent documentation without too much trouble. I'm hoping some day the copyright issues are resolved and we can at least ship binary packages, but the politics (e.g. convincing Debian that this is the right thing to do) and the logistics (e.g. DKMS auto-builders? is that even a thing? how about signed DKMS packages? fun-fun-fun!) seem really impractical. Who knows, maybe hell will freeze over (again) and Oracle will fix the CDDL. I personally think that we should just completely ignore this problem (which wasn't even supposed to be a problem) and ship binary packages directly, but I'm a pragmatist and do not always fit well with the free software fundamentalists. All of this to say that, in the short term, we don't have a reliable, advanced filesystem/logical disk manager in Linux. And that's really too bad.

10 May 2022

Melissa Wen: Multiple syncobjs support for V3D(V) (Part 2)

In the previous post, I described how we enabled multiple syncobjs capabilities in the V3D kernel driver. Now I will tell you what was changed on the userspace side, where we reworked the V3DV sync mechanisms to use Vulkan multiple wait and signal semaphores directly. This change represents greater adherence to the Vulkan submission framework. I was not used to Vulkan concepts and the V3DV driver. Fortunately, I counted on the guidance of the Igalia Graphics team, mainly Iago Toral (thanks!), to understand the Vulkan Graphics Pipeline, sync scopes, and submission order. Therefore, we changed the original V3DV implementation of vkQueueSubmit and all related functions to allow direct mapping of multiple semaphores from V3DV to the V3D-kernel interface. Disclaimer: here's a brief and probably inaccurate background, which we'll go into in more detail later on. In Vulkan, GPU work submissions are described as command buffers. These command buffers, with GPU jobs, are grouped in a command buffer submission batch, specified by vkSubmitInfo, and submitted to a queue for execution. vkQueueSubmit is the command called to submit command buffers to a queue. Besides command buffers, vkSubmitInfo also specifies semaphores to wait on before starting the batch execution and semaphores to signal when all command buffers in the batch are complete. Moreover, a fence in vkQueueSubmit can be signaled when all command buffer batches have completed execution. From this sequence, we can see some implicit ordering guarantees. Submission order defines the start order of execution between command buffers; in other words, it is determined by the order in which pSubmits appear in vkQueueSubmit and pCommandBuffers appear in VkSubmitInfo. However, we don't have any completion guarantees for jobs submitted to different GPU queues, which means they may overlap and complete out of order. Of course, jobs submitted to the same GPU engine follow start and finish order. In terms of signal operation order, a fence is ordered after all semaphore signal operations. In addition to implicit sync, we also have some explicit sync resources, such as semaphores, fences, and events. Considering these implicit and explicit sync mechanisms, we reworked the V3DV implementation of queue submissions to better use the multiple syncobjs capabilities of the kernel. In this merge request, you can find this work: v3dv: add support to multiple wait and signal semaphores. In this blog post, we run through each scope of change of this merge request for a V3D driver-guided description of the multisync support implementation.
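To make the background above a little more tangible, here is what this looks like on the API side. This is not V3DV code, just a bare-bones vkQueueSubmit call with one wait semaphore, one signal semaphore and a fence; queue, cmd_buf, wait_sem, signal_sem and fence are assumed to have been created elsewhere:
/* one batch: wait on wait_sem, run cmd_buf, then signal signal_sem */
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit = {
    .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount   = 1,
    .pWaitSemaphores      = &wait_sem,      /* waited on before the batch starts */
    .pWaitDstStageMask    = &wait_stage,
    .commandBufferCount   = 1,
    .pCommandBuffers      = &cmd_buf,
    .signalSemaphoreCount = 1,
    .pSignalSemaphores    = &signal_sem,    /* signaled when the batch completes */
};
vkQueueSubmit(queue, 1, &submit, fence);    /* fence signals when all batches are done */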

Groundwork and basic code clean-up: As the original V3D-kernel interface allowed only one semaphore, V3DV resorted to booleans to translate multiple semaphores into one. Consequently, if a command buffer batch had at least one semaphore, it needed to wait for all previously submitted jobs to complete before starting its execution. So, instead of just a boolean, we created and changed the structs that store semaphore information to accept the actual list of wait semaphores.

Expose multisync kernel interface to the driver: In the two commits below, we basically updated the DRM V3D interface from that one defined in the kernel and verified if the multisync capability is available for use.

Handle multiple semaphores for all GPU job types: At this point, we were only changing the submission design to consider multiple wait semaphores. Before supporting multisync, V3DV waited for the last submitted job to be signaled when at least one wait semaphore was defined, even when serialization wasn't required. V3DV handles GPU jobs according to the GPU queue to which they are submitted:
  • Control List (CL) for binning and rendering
  • Texture Formatting Unit (TFU)
  • Compute Shader Dispatch (CSD)
Therefore, we changed their submission setup so that jobs submitted to any GPU queue can handle more than one wait semaphore. These commits created all the mechanisms to set arrays of wait and signal semaphores for GPU job submissions:
  • Checking the conditions to define the wait_stage.
  • Wrapping them in a multisync extension.
  • Configuring the generic extension as a multisync extension, according to the kernel interface (described in the previous blog post).
Finally, we extended the ability of GPU jobs to handle multiple signal semaphores, but at this point, no GPU job is actually in charge of signaling them. With this in place, we could rework part of the code that tracks CPU and GPU job completions by verifying the GPU status and threads spawned by Event jobs.

Rework the QueueWaitIdle mechanism to track the syncobj of the last job submitted in each queue: As we had only single in/out syncobj interfaces for semaphores, we used a single last_job_sync to synchronize job dependencies of the previous submission. Although the DRM scheduler guarantees the order in which jobs start executing within the same queue in the kernel space, the order of completion isn't predictable. On the other hand, we still needed to use syncobjs to follow job completion, since we have event threads on the CPU side. Therefore, a more accurate implementation requires last_job syncobjs to track when each engine (CL, TFU, and CSD) is idle. We also needed to keep the driver working on previous versions of the v3d kernel driver with single semaphores, so we kept tracking an "any" last_job_sync to preserve the previous implementation.
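A rough sketch of the bookkeeping this implies (these are not the actual V3DV structures, just an illustration of the idea of one last-job syncobj per engine plus the legacy "any" one):
/* illustration only, not the real V3DV code */
enum v3dv_queue_type {
	V3DV_QUEUE_CL = 0,
	V3DV_QUEUE_TFU,
	V3DV_QUEUE_CSD,
	V3DV_QUEUE_ANY,		/* legacy single-syncobj tracking */
	V3DV_QUEUE_COUNT,
};

struct v3dv_last_job_syncs {
	uint32_t syncs[V3DV_QUEUE_COUNT];	/* one last_job_sync per queue */
};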

Rework synchronization and submission design to let the jobs handle wait and signal semaphores: With multiple semaphores support, the conditions for waiting on and signaling semaphores changed according to the particularities of each GPU job (CL, CSD, TFU) and CPU job restrictions (Events, CSD indirect, etc.). In this sense, we redesigned V3DV semaphore handling and job submissions for command buffer batches in vkQueueSubmit. We scrutinized possible scenarios for submitting command buffer batches to change the original implementation carefully. This resulted in three more commits, whose combined behavior is as follows:
  • We keep track of whether we have submitted a job to each GPU queue (CSD, TFU, CL) and a CPU job for each command buffer. We use syncobjs to track the last job submitted to each GPU queue and a flag that indicates if this represents the beginning of a command buffer.
  • The first GPU job submitted to a GPU queue in a command buffer should wait on the wait semaphores. The first CPU job submitted in a command buffer should call v3dv_QueueWaitIdle() to do the waiting and ignore semaphores (because it is waiting for everything).
  • If the job is not the first but has the serialize flag set, it should wait on the completion of all the last jobs submitted to any GPU queue before running. In practice, this means using syncobjs to track the last job submitted to each queue and adding these syncobjs as job dependencies of this serialized job.
  • If this job is the last job of a command buffer batch, it may be used to signal semaphores if this command buffer batch has only one type of GPU job (because we have guarantees of execution ordering). Otherwise, we emit a no-op job just to signal semaphores. It waits on the completion of all last jobs submitted to any GPU queue and then signals semaphores. Note: we later changed this approach to correctly deal with ordering changes caused by event threads. Whenever we have an event job in the command buffer, we cannot rely on the last-job-in-the-last-command-buffer assumption; we have to wait for all event threads to complete before signaling.
  • After submitting all command buffers, we emit a no-op job to wait on the completion of all last jobs in each queue and signal the fence. Note: at some point, we changed this approach to correctly deal with ordering changes caused by event threads, as mentioned before.

Final considerations With many changes and many rounds of review, the patchset was merged. After more validation and code review, we polished and fixed the implementation together with external contributions. Also, multisync capabilities enabled us to add new features to V3DV and switch the driver to the common synchronization and submission framework:
  • v3dv: expose support for semaphore imports
    This was waiting for multisync support in the v3d kernel, which is already available. Exposing this feature however enabled a few more CTS tests that exposed pre-existing bugs in the user-space driver so we fix those here before exposing the feature.
  • v3dv: Switch to the common submit framework
    This should give you emulated timeline semaphores for free and kernel-assisted sharable timeline semaphores for cheap once you have the kernel interface wired in.
We used a set of games to ensure there was no performance regression in the new implementation. For this, we used GFXReconstruct to capture Vulkan API calls when playing those games. Then, we compared results with and without multisync caps in the kernel space, and also with multisync enabled on v3dv. We didn't observe any compromise in performance, but rather improvements when replaying scenes from the vkQuake game.

Melissa Wen: Multiple syncobjs support for V3D(V) (Part 1)

As you may already know, we at Igalia have been working on several improvements to the 3D rendering drivers of the Broadcom VideoCore GPU, found in Raspberry Pi 4 devices. One of our recent efforts focused on improving the V3D(V) drivers' adherence to the Vulkan submission and synchronization framework. We had to cross various layers of the Linux graphics stack to add support for multiple syncobjs to V3D(V), from the Linux/DRM kernel to the Vulkan driver. We have delivered bug fixes, a generic gate to extend job submission interfaces, and a more direct sync mapping of the Vulkan framework. These changes did not impact the performance of the tested games and brought greater precision to the synchronization mechanisms. Ultimately, support for multiple syncobjs opened the door to new features and other improvements to the V3DV submission framework.

DRM Syncobjs But, first, what are DRM syncobjs?
* DRM synchronization objects (syncobj, see struct &drm_syncobj) provide a
* container for a synchronization primitive which can be used by userspace
* to explicitly synchronize GPU commands, can be shared between userspace
* processes, and can be shared between different DRM drivers.
* Their primary use-case is to implement Vulkan fences and semaphores.
[...]
* At it's core, a syncobj is simply a wrapper around a pointer to a struct
* &dma_fence which may be NULL.
And Jason Ekstrand well-summarized dma_fence features in a talk at the Linux Plumbers Conference 2021:
A struct that represents a (potentially future) event:
  • Has a boolean signaled state
  • Has a bunch of useful utility helpers/concepts, such as refcount, callback wait mechanisms, etc.
Provides two guarantees:
  • One-shot: once signaled, it will be signaled forever
  • Finite-time: once exposed, it is guaranteed to signal in a reasonable amount of time

What does multiple semaphores support mean for Raspberry Pi 4 GPU drivers? For our main purpose, the multiple syncobjs support means that V3DV can submit jobs with more than one wait and signal semaphore. In the kernel space, wait semaphores become explicit job dependencies to wait on before executing the job. Signal semaphores (or post dependencies), in turn, work as fences to be signaled when the job completes its execution, unblocking subsequent jobs that depend on its completion. The development of multisync support comprised many decision-making points and steps, summarized as follows:
  • added to the v3d kernel driver the capability to handle multiple syncobjs;
  • exposed multisync capabilities to userspace through a generic extension;
  • reworked the synchronization mechanisms of the V3DV driver to benefit from this feature;
  • enabled the simulator to work with multiple semaphores; and
  • tested on Vulkan games to verify correctness and possible performance enhancements.
We decided to refactor parts of the V3D(V) submission design in kernel-space and userspace during this development. We improved job scheduling on the V3D kernel side and the V3DV job submission design. We also delivered more accurate synchronization mechanisms and further updates to the Broadcom Vulkan driver running on the Raspberry Pi 4. Therefore, we summarize here the changes in the kernel space, describing the previous state of the driver, the decisions taken, side improvements, and fixes.

From single to multiple binary in/out syncobjs: Initially, V3D was very limited in the number of syncobjs per job submission. The V3D job interfaces (CL, CSD, and TFU) only supported one syncobj (in_sync) to be added as an execution dependency and one syncobj (out_sync) to be signaled when a submission completes. The only exception was the CL submission, which accepts two in_syncs: one for the binner and another for the render job; apart from that, the options were limited. Meanwhile, in userspace, the V3DV driver followed alternative paths to meet Vulkan's synchronization and submission framework. It needed to handle multiple wait and signal semaphores, but the V3D kernel-driver interface only accepted one in_sync and one out_sync. In short, V3DV had to fit multiple semaphores into one when submitting every GPU job.
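To illustrate the limitation, every submit ioctl essentially carried just one pair of handles. This is a simplified sketch, not the real uapi structs (see include/uapi/drm/v3d_drm.h for the authoritative definitions):
/* simplified illustration of the pre-multisync interface */
struct v3d_submit_before_multisync {
	__u32 in_sync;	/* one syncobj to wait on before the job runs  */
	__u32 out_sync;	/* one syncobj signaled when the job completes */
	/* ... job-specific fields ... */
};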

Generic ioctl extension The first decision was how to extend the V3D interface to accept multiple in and out syncobjs. We could extend each ioctl with two entries of syncobj arrays and two entries for their counters. We could create new ioctls with multiple in/out syncobjs. But after examining other drivers' solutions for extending their submission interfaces, we decided to extend the V3D ioctls (v3d_cl_submit_ioctl, v3d_csd_submit_ioctl, v3d_tfu_submit_ioctl) with a generic ioctl extension. I found a curious commit message when I was examining how other developers handled the issue in the past:
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 22 09:23:22 2019 +0000
    drm/i915: Introduce the i915_user_extension_method
    
    An idea for extending uABI inspired by Vulkan's extension chains.
    Instead of expanding the data struct for each ioctl every time we need
    to add a new feature, define an extension chain instead. As we add
    optional interfaces to control the ioctl, we define a new extension
    struct that can be linked into the ioctl data only when required by the
    user. The key advantage being able to ignore large control structs for
    optional interfaces/extensions, while being able to process them in a
    consistent manner.
    
    In comparison to other extensible ioctls, the key difference is the
    use of a linked chain of extension structs vs an array of tagged
    pointers. For example,
    
    struct drm_amdgpu_cs_chunk {
        __u32		chunk_id;
        __u32		length_dw;
        __u64		chunk_data;
    };
[...]
So, inspired by amdgpu_cs_chunk and i915_user_extension, we opted to extend the V3D interface through a generic interface. After applying some suggestions from Iago Toral (Igalia) and Daniel Vetter, we reached the following struct:
struct drm_v3d_extension {
	__u64 next;
	__u32 id;
#define DRM_V3D_EXT_ID_MULTI_SYNC		0x01
	__u32 flags; /* mbz */
};
This generic extension has an id to identify the feature/extension we are adding to an ioctl (that maps the related struct type), a pointer to the next extension, and flags (if needed). Whenever we need to extend the V3D interface again for another specific feature, we subclass this generic extension into the specific one instead of extending ioctls indefinitely.
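To give an idea of how such a chain is consumed, here is a hedged, kernel-style sketch (not the actual v3d code; the function name is made up) of a driver walking the extensions passed by userspace:
/* sketch only: walk a userspace-provided chain of drm_v3d_extension structs */
static int walk_extensions(u64 ext_ptr)
{
	while (ext_ptr) {
		struct drm_v3d_extension ext;

		if (copy_from_user(&ext, u64_to_user_ptr(ext_ptr), sizeof(ext)))
			return -EFAULT;

		switch (ext.id) {
		case DRM_V3D_EXT_ID_MULTI_SYNC:
			/* re-copy the full drm_v3d_multi_sync struct and store it */
			break;
		default:
			return -EINVAL;		/* unknown extension id */
		}
		ext_ptr = ext.next;		/* follow the chain */
	}
	return 0;
}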

Multisync extension For the multiple syncobjs extension, we define a multi_sync extension struct that subclasses the generic extension struct. It has arrays of in and out syncobjs, the respective number of elements in each of them, and a wait_stage value used in CL submissions to determine which job needs to wait for syncobjs before running.
struct drm_v3d_multi_sync {
	struct drm_v3d_extension base;
	/* Array of wait and signal semaphores */
	__u64 in_syncs;
	__u64 out_syncs;
	/* Number of entries */
	__u32 in_sync_count;
	__u32 out_sync_count;
	/* set the stage (v3d_queue) to sync */
	__u32 wait_stage;
	__u32 pad; /* mbz */
};
And if a multisync extension is defined, the V3D driver ignores the previous interface of single in/out syncobjs. Once we had the interface to support multiple in/out syncobjs, the v3d kernel driver needed to handle it. As V3D uses the DRM scheduler for job executions, changing from a single syncobj to multiple ones is quite straightforward. V3D copies the in syncobjs from userspace and uses drm_syncobj_find_fence() + drm_sched_job_add_dependency() to add all in_syncs (wait semaphores) as job dependencies, i.e. syncobjs to be checked by the scheduler before running the job. On CL submissions, we have the bin and render jobs, so V3D follows the value of wait_stage to determine which job depends on those in_syncs to start its execution. When V3D defines the last job in a submission, it replaces the dma_fence of the out_syncs with the done_fence from this last job. It uses drm_syncobj_find() + drm_syncobj_replace_fence() to do that. Therefore, when a job completes its execution and signals done_fence, all out_syncs are signaled too.
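Put together, the flow described above looks roughly like this (a hedged sketch, not the literal v3d code; error handling is abbreviated, and in_syncs/out_syncs stand for the handle arrays copied from drm_v3d_multi_sync):
/* wait semaphores: every in_sync becomes a scheduler dependency of the job */
for (i = 0; i < in_sync_count; i++) {
	struct dma_fence *fence;

	ret = drm_syncobj_find_fence(file_priv, in_syncs[i], 0, 0, &fence);
	if (ret)
		return ret;
	ret = drm_sched_job_add_dependency(&job->base, fence);
	if (ret)
		return ret;
}

/* signal semaphores: point every out_sync at the last job's done fence */
for (i = 0; i < out_sync_count; i++) {
	struct drm_syncobj *syncobj = drm_syncobj_find(file_priv, out_syncs[i]);

	if (syncobj) {
		drm_syncobj_replace_fence(syncobj, job->done_fence);
		drm_syncobj_put(syncobj);
	}
}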

Other improvements to v3d kernel driver This work also made possible some improvements to the original implementation. Following Iago's suggestions, we refactored the job initialization code to allocate memory and initialize a job in one go. With this, we started to clean up resources more cohesively, clearly distinguishing cleanups in case of failure from job completion. We also fixed the resource cleanup when a job is aborted before the DRM scheduler arms it - at that point, drm_sched_job_arm() had recently been introduced to job initialization. Finally, we prepared the semaphore interface to implement timeline syncobjs in the future.

Going Up The patchset that adds multiple syncobjs support and improvements to V3D is available here and comprises four patches:
  • drm/v3d: decouple adding job dependencies steps from job init
  • drm/v3d: alloc and init job in one shot
  • drm/v3d: add generic ioctl extension
  • drm/v3d: add multiple syncobjs support
After extending the V3D kernel interface to accept multiple syncobjs, we worked on V3DV to benefit from V3D multisync capabilities. In the next post, I will describe a little of this work.

27 April 2022

Antoine Beaupré: building Debian packages under qemu with sbuild

I've been using sbuild for a while to build my Debian packages, mainly because it's what is used by the Debian autobuilders, but also because it's pretty powerful and efficient. Configuring it just right, however, can be a challenge. In my quick Debian development guide, I had a few pointers on how to configure sbuild with the normal schroot setup, but today I finished a qemu based configuration.

Why I want to use qemu mainly because it provides better isolation than a chroot. I sponsor packages sometimes, and while I typically audit the source code before building, it still feels like the extra protection shouldn't hurt. I also like the idea of unifying my existing virtual machine setup with my build setup. My current VM setup is kind of all over the place: libvirt, vagrant, GNOME Boxes, etc. I've been slowly converging on libvirt, however, and most solutions I use right now rely on qemu under the hood, certainly not chroots... I could also have decided to go with containers like LXC, LXD, Docker (with conbuilder, whalebuilder, docker-buildpackage), systemd-nspawn (with debspawn), unshare (with schroot --chroot-mode=unshare), or whatever: I didn't feel those offer the level of isolation that is provided by qemu. The main downside of this approach is that it is (obviously) slower than native builds. But on modern hardware, that cost should be minimal.

How Basically, you need this:
sudo mkdir -p /srv/sbuild/qemu/
sudo apt install sbuild-qemu
sudo sbuild-qemu-create -o /srv/sbuild/qemu/unstable.img unstable https://deb.debian.org/debian
Then to make this used by default, add this to ~/.sbuildrc:
# run autopkgtest inside the schroot
$run_autopkgtest = 1;
# tell sbuild to use autopkgtest as a chroot
$chroot_mode = 'autopkgtest';
# tell autopkgtest to use qemu
$autopkgtest_virt_server = 'qemu';
# tell autopkgtest-virt-qemu the path to the image
# use --debug there to show what autopkgtest is doing
$autopkgtest_virt_server_options = [ '--', '/srv/sbuild/qemu/%r-%a.img' ];
# tell plain autopkgtest to use qemu, and the right image
$autopkgtest_opts = [ '--', 'qemu', '/srv/sbuild/qemu/%r-%a.img' ];
# no need to cleanup the chroot after build, we run in a completely clean VM
$purge_build_deps = 'never';
# no need for sudo
$autopkgtest_root_args = '';
Note that the above will use the default autopkgtest (1GB, one core) and qemu (128MB, one core) configuration, which might be a little low on resources. You probably want to be explicit about it, with something like this:
# extra parameters to pass to qemu
# --enable-kvm is not necessary, detected on the fly by autopkgtest
my @_qemu_options = ('--ram-size=4096', '--cpus=2');
# tell autopkgtest-virt-qemu the path to the image
# use --debug there to show what autopkgtest is doing
$autopkgtest_virt_server_options = [ @_qemu_options, '--', '/srv/sbuild/qemu/%r-%a.img' ];
$autopkgtest_opts = [ '--', 'qemu', @_qemu_options, '/srv/sbuild/qemu/%r-%a.img' ];
This configuration will:
  1. create a virtual machine image in /srv/sbuild/qemu for unstable
  2. tell sbuild to use that image to create a temporary VM to build the packages
  3. tell sbuild to run autopkgtest (which should really be default)
  4. tell autopkgtest to use qemu for builds and for tests
Note that the VM created by sbuild-qemu-create has an unlocked root account with an empty password.

Other useful tasks
  • enter the VM to run tests; changes will be discarded (thanks Nick Brown for the sbuild-qemu-boot tip!):
     sbuild-qemu-boot /srv/sbuild/qemu/unstable-amd64.img
    
    That program ships only with bookworm and later; an equivalent command is:
     qemu-system-x86_64 -snapshot -enable-kvm -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,id=rng-device0 -m 2048 -nographic /srv/sbuild/qemu/unstable-amd64.img
    
    The key argument here is -snapshot.
  • enter the VM to make permanent changes, which will not be discarded:
     sudo sbuild-qemu-boot --readwrite /srv/sbuild/qemu/unstable-amd64.img
    
    Equivalent command:
     sudo qemu-system-x86_64 -enable-kvm -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,id=rng-device0 -m 2048 -nographic /srv/sbuild/qemu/unstable-amd64.img
    
  • update the VM (thanks lavamind):
     sudo sbuild-qemu-update /srv/sbuild/qemu/unstable-amd64.img
    
  • build in a specific VM regardless of the suite specified in the changelog (e.g. UNRELEASED, bookworm-backports, bookworm-security, etc):
     sbuild --autopkgtest-virt-server-opts="-- qemu /var/lib/sbuild/qemu/bookworm-amd64.img"
    
    Note that you'd also need to pass --autopkgtest-opts if you want autopkgtest to run in the correct VM as well:
     sbuild --autopkgtest-opts="-- qemu /var/lib/sbuild/qemu/unstable.img" --autopkgtest-virt-server-opts="-- qemu /var/lib/sbuild/qemu/bookworm-amd64.img"
    
    You might also need parameters like --ram-size if you customized it above.
And yes, this is all quite complicated and could be streamlined a little, but that's what you get when you have years of legacy and just want to get stuff done. It seems to me autopkgtest-virt-qemu should have a magic flag that starts a shell for you, but it doesn't look like that's a thing. When that program starts, it just says ok and sits there. Maybe the authors consider the above to be simple enough (see also bug #911977 for a discussion of this problem).

Live access to a running test When autopkgtest starts a VM, it uses this funky qemu commandline:
qemu-system-x86_64 -m 4096 -smp 2 -nographic -net nic,model=virtio -net user,hostfwd=tcp:127.0.0.1:10022-:22 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0,id=rng-device0 -monitor unix:/tmp/autopkgtest-qemu.w1mlh54b/monitor,server,nowait -serial unix:/tmp/autopkgtest-qemu.w1mlh54b/ttyS0,server,nowait -serial unix:/tmp/autopkgtest-qemu.w1mlh54b/ttyS1,server,nowait -virtfs local,id=autopkgtest,path=/tmp/autopkgtest-qemu.w1mlh54b/shared,security_model=none,mount_tag=autopkgtest -drive index=0,file=/tmp/autopkgtest-qemu.w1mlh54b/overlay.img,cache=unsafe,if=virtio,discard=unmap,format=qcow2 -enable-kvm -cpu kvm64,+vmx,+lahf_lm
... which is a typical qemu commandline, I'm sorry to say. That gives us a VM with those settings (paths are relative to a temporary directory, /tmp/autopkgtest-qemu.w1mlh54b/ in the above example):
  • the shared/ directory is, well, shared with the VM
  • port 10022 is forwarded to the VM's port 22, presumably for SSH, but no SSH server is started by default
  • the ttyS0 and ttyS1 UNIX sockets are mapped to the first two serial ports (use nc -U to talk with those)
  • the monitor UNIX socket is a qemu control socket (see the QEMU monitor documentation, also nc -U)
In other words, it's possible to access the VM with:
nc -U /tmp/autopkgtest-qemu.w1mlh54b/ttyS1
The nc socket interface is ... not great, but it works well enough. And you can probably fire up an SSHd to get a better shell if you feel like it.
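The monitor socket speaks the usual QEMU human monitor protocol, so, using the temporary path from the example above (which changes on every run), something like this should work to poke at the VM:
nc -U /tmp/autopkgtest-qemu.w1mlh54b/monitor
(qemu) info status
(qemu) system_powerdown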

Nitty-gritty details no one cares about

Fixing hang in sbuild cleanup I'm having a hard time making heads or tails of this, but please bear with me. In sbuild + schroot, there's this notion that we don't really need to clean up after ourselves inside the schroot, as the schroot will just be deleted anyway. This behavior seems to be handled by the internal "Session Purged" parameter. At least in lib/Sbuild/Build.pm, we can see this:
my $is_cloned_session = (defined ($session->get('Session Purged')) &&
             $session->get('Session Purged') == 1) ? 1 : 0;
[...]
if ($is_cloned_session) {
    $self->log("Not cleaning session: cloned chroot in use\n");
} else {
    if ($purge_build_deps) {
        # Removing dependencies
        $resolver->uninstall_deps();
    } else {
        $self->log("Not removing build depends: as requested\n");
    }
}
The schroot builder defines that parameter as:
    $self->set('Session Purged', $info->{'Session Purged'});
... which is ... a little confusing to me. $info is:
my $info = $self->get('Chroots')->get_info($schroot_session);
... so I presume that depends on whether the schroot was correctly cleaned up? I stopped digging there... ChrootUnshare.pm is way more explicit:
$self->set('Session Purged', 1);
I wonder if we should do something like this with the autopkgtest backend. I guess people might technically use it with something other than qemu, but qemu is the typical use case of the autopkgtest backend, in my experience. Or at least it is certainly used with things that clean up after themselves. Right? For some reason, before I added this line to my configuration:
$purge_build_deps = 'never';
... the "Cleanup" step would just completely hang. It was quite bizarre.

Digression on the diversity of VM-like things There are a lot of different virtualization solutions one can use (e.g. Xen, KVM, Docker or Virtualbox). I have also found libguestfs to be useful to operate on virtual images in various ways. Libvirt and Vagrant are also useful wrappers on top of the above systems. There are, in particular, a lot of different tools which use Docker, virtual machines, or some sort of isolation stronger than chroot to build packages. Here are some of the alternatives I am aware of. Take, for example, Whalebuilder, which uses Docker to build packages instead of pbuilder or sbuild. Docker provides more isolation than a simple chroot: in whalebuilder, packages are built without network access and inside a virtualized environment. Keep in mind there are limitations to Docker's security, and that pbuilder and sbuild do build under a different user, which will limit the security issues with building untrusted packages. On the upside, some of these things are being fixed: whalebuilder is now an official Debian package (whalebuilder) and has added the feature of passing custom arguments to dpkg-buildpackage. None of those solutions (except the autopkgtest/qemu backend) are implemented as an sbuild plugin, which would greatly reduce their complexity. I was previously using Qemu directly to run virtual machines, and had to create VMs by hand with various tools. This didn't work so well, so I switched to using Vagrant as a de-facto standard to build development environment machines, but I'm returning to Qemu because it uses a similar backend to KVM and can be used to host longer-running virtual machines through libvirt. The great thing now is that autopkgtest has good support for qemu and sbuild has bridged the gap and can use it as a build backend. I originally had found those bugs in that setup, but all of them are now fixed:
  • #911977: sbuild: how do we correctly guess the VM name in autopkgtest?
  • #911979: sbuild: fails on chown in autopkgtest-qemu backend
  • #911963: autopkgtest qemu build fails with proxy_cmd: parameter not set
  • #911981: autopkgtest: qemu server warns about missing CPU features
So we have unification! It's possible to run your virtual machines and Debian builds using a single VM image backend storage, which is no small feat, in my humble opinion. See the sbuild-qemu blog post for the announcement. Now I just need to figure out how to merge Vagrant, GNOME Boxes, and libvirt together, which should be a matter of placing images in the right place... right? See also hosting.

pbuilder vs sbuild I was previously using pbuilder and switched in 2017 to sbuild. AskUbuntu.com has a good comparison between pbuilder and sbuild that shows they are pretty similar. The big advantage of sbuild is that it is the tool in use on the buildds, and it's written in Perl instead of shell. My concerns about switching were POLA (I'm used to pbuilder), the fact that pbuilder runs as a separate user (which works with sbuild as well now, if the _apt user is present), and setting up COW semantics in sbuild (you can't just plug cowbuilder in there; you need to configure overlayfs or aufs, which was non-trivial in Debian jessie). Ubuntu folks, again, have more documentation there. Debian also has extensive documentation, especially about how to configure overlays. I was ultimately convinced by stapelberg's post on the topic, which shows how much simpler sbuild really is...

Who Thanks lavamind for the introduction to the sbuild-qemu package.

21 April 2022

Andy Simpkins: Firmware and Debian

There has been a flurry of activity on the Debian mailing lists ever since Steve McIntyre raised the issue of including non-free firmware as part of official Debian installation images. Firstly, I should point out that I am in complete agreement with Steve's proposal to include non-free firmware as part of an installation image. Likewise, I think that we should have a separate archive section for firmware, because without doing so it will soon become almost impossible to install onto any new hardware. However, as always, the issue is more nuanced than a first glance would suggest. Let's start by defining what firmware is. Firmware is any software that runs outside the orchestration of the operating system. Typically firmware will be executed on a processor(s) separate from the processor(s) running the OS, but this does not need to be the case. As Debian, we are content that our systems can operate using fully free and open source software and firmware. We can install our OS without needing any non-free firmware. This is an illusion! Each and every PC platform contains non-free firmware. It may be possible to run free firmware on some graphics controllers, Wi-Fi chip-sets, or Ethernet cards, and we can (and perhaps should) choose to spend our money on systems where this is the case. When installing a new system, we might still be forced to hold our nose and install with non-free firmware on the peripheral before we are able to upgrade it to FLOSS firmware later, if that exists or is even possible to do. However, after the installation we are running a full FLOSS system in terms of software and firmware. We all (almost without exception) are running proprietary firmware whether we like it or not. Even after carefully selecting graphics and network hardware with FLOSS firmware options, we still haven't escaped from non-free firmware. Other peripherals contain firmware too: each keyboard, each disk (SSDs and spinning rust). Even the USB memory stick that you use to hold the Debian installation image contains a microcontroller and hence also contains firmware that runs on it.
  1. Much of this firmware can not even be updated.
  2. Some can be updated, but the firmware is stored in FLASH ROM and the hardware vendor has defeated all programming methods (possibly circumventable with a hardware mod).
  3. Some of it can be updated but requires external device programmers (and often the programming connections are a series of test points dotted around the board and not on a header in order to make programming as difficult as possible).
  4. Sometimes the firmware can be updated from within the host operating system (i.e. Debian)
  5. Sometimes, as Steve pointed out in his post, the hardware vendor has enough firmware on a peripheral to perform basic functions, perhaps enough to install the OS, but requires additional firmware to enable specific features (e.g. higher screen resolutions, hardware-accelerated functions, etc.)
  6. Finally some vendors don't even bother with any non-volatile storage beyond a basic boot loader, and firmware must be loaded before the device can be used in any mode.
What about the motherboard? If we are lucky, we might be able to run a FLOSS implementation of the UEFI subsystem (edk2/tianocore, for example); indeed the non-AMD64/i386 platforms based around ARM and MIPS architectures are often the most free when it comes to firmware. What about the microcode on the processor? Personally, I wasn't aware that this was updatable firmware until the Spectre and Meltdown classes of vulnerabilities arose a few years back. So back to Debian images including non-free firmware. This is specifically to address the last two use cases mentioned above, i.e. where firmware needs to be loaded to achieve a minimum functioning of a device, although it could also include motherboard support and microcode as well. As far as I can tell, the proposal exists for several reasons. #1: Because some freely distributable firmware is required for more and more devices in order to install Debian, or because, whilst Debian can be installed, a desktop environment can not be started or can not fully function. #2: Because, frankly, it is less work to produce, test and maintain fewer installation images (as someone who performs tests on our images, this clearly gets my vote :-)). And perhaps most important of all, #3: Because our least experienced users, and new users, will download an official image and give up if things don't "just work"™. Steve's proposal option 5 would address these issues and I fully support it. I would love to see separate repositories for firmware and firmware non-free. Additionally, to accompany firmware non-free, I would like to have information on what the firmware actually does. Can I run my hardware without it? What function(s) are limited without the firmware? Better yet, is there a FLOSS equivalent that I can load instead? Is this something that we can present in the Debian installer? I would love not to require non-free firmware, but if I can't, I would love it if d-i would enable a user to make an informed choice as to what, if any, firmware is installed. Should we be requesting (requiring?) this information for any non-free firmware image that we carry in the archive? Finally, let's consider firmware in the wider general case, not just the case where we need to load firmware from within Debian on each and every boot. Personally, I am annoyed whenever a hardware manufacturer has gone out of their way to prevent firmware updates. Let's face it: software contains bugs, and we can assume that the software making up a firmware image will as well. Critical (security) vulnerabilities found in firmware, especially if it runs on the same processor(s) as the OS, can impact the wider system, not just the device itself. This means that, without updatable firmware, the hardware itself should be withdrawn from use whilst it would otherwise still function. By preventing firmware updates, vendors are forcing early obsolescence in the hardware they sell, perhaps good for their bottom line, but certainly no good for users or the environment. Here I can practice what I preach. As an Electronic Engineer / Systems architect, I have been beating the drum for In System Updatable firmware for ALL programmable devices in a system, be it a simple peripheral or a deeply embedded system. I can honestly say that over the last 20 years (yes, I have been banging this particular drum for that long) I have had 100% success in arguing this case commercially. Having device programmers in R&D departments is one thing, but that is an additional cost for production and field service.
Needing custom programming headers or even a bed-of-nails fixture to connect your target device to a programmer is more trouble than it is worth. Finally, the ability to update firmware in the field means that you can launch your product on schedule, make a sale and ship to a customer even if the first thing that you need to do is download an update. Offering that to any project manager will make you very popular indeed. So what if this firmware is non-free? As long as the firmware resides in non-volatile media without needing the OS to interact with it, we as a project don't need to carry it in our archives. And we as principled individuals can vote with our feet and wallets by choosing to purchase devices that have free firmware. But where that isn't an option, I'll take updatable but non-free firmware over non-free firmware that can not be updated, any day of the week. Sure, the manufacturer can choose to no longer support the firmware, and it is shocking how soon this happens; often in the consumer market, the manufacturer has withdrawn support for a product before it even reaches the end user (in which case we should boycott that manufacturer in future until they either change their ways or go bust). But again, if firmware can be updated in-system, that would at least allow the possibility of open firmware to arise. Indeed, the only commercial case I have seen to argue against updatable firmware has been either for DRM (in which case, good, let's get rid of both), or for RF licence compliance, and even then it is tenuous, because in this case the manufacturer wants ISP for its own use right up until a device is shipped out the door, typically achieved by blowing one-time programmable fuse links.

20 April 2022

Russell Coker: Android Without Play

A while ago I was given a few reasonably high-end Android phones to give away. I gave two very nice phones to someone who looks after refugees so a couple of refugee families could make video calls to relatives. The third phone is a Huawei Nova 7i [1] which doesn't have the Google Play Store. The Nova 7i is a ridiculously powerful computer (8G of RAM in a phone!!!) but without the Google Play Store it's not much use to the average phone user. It has the HuaWei App Gallery, which isn't as bad as most of the proprietary app stores of small players in the Android world; it has SnapChat, TikTok, Telegram, Alibaba, WeChat, and Grays auction (an app I didn't even know existed) along with many others. It also links to ApkPure (apparently a 3rd party app installer that obtains APK files for major commercial apps) for Facebook among others. The ApkPure thing might be Huawei outsourcing the violation of Facebook's terms of service. For the moment I've decided to only use free software on this phone and use my old phone for non-free stuff (Facebook, LinkedIn, etc). The eventual aim is that I can carry only a phone with free software for normal use and carry a second phone if I'm active on LinkedIn or something. My recollection is that when I first got the phone (almost 2 years ago) it didn't have such a range of apps. The first thing to install was F-Droid [2] as the app repository. F-Droid has a repository of thousands of free software Android apps as well as some apps that are slightly less free, which are tagged appropriately. You can install the F-Droid app from the web site. As an aside, I had to go to settings and enable "force old index format" to get the list of packages; I don't know why, as other phones had worked without it. Here are the F-Droid apps I installed: Future Plans The current main things I'm missing are a calendar, a contact list, and a shared note-taking system (like Google Keep). For calendaring and a contact list, the CalDAV and CardDAV protocols seem best. The most common implementation on the server side appears to be DAViCal [5]. The Nextcloud system supports CalDAV, CardDAV, and web editing of notes and documents (including LibreOffice if you install that plugin) [6]. But it is huge and demands write access to all its own code (bad for security), and it's not packaged for Debian. Also, in my tests it gave me an error 401 when I tried to authenticate to it from the Android Nextcloud client. I've seen a positive review of Radicale, a simple CalDAV and CardDAV server that doesn't need a database [7]. I prefer the Unix philosophy of keeping things simple with file storage unless there's a real need for anything else. I don't think that anything I ever do with calendaring will require the PostgreSQL database that DAViCal uses. I'll give Radicale a go for CalDAV and CardDAV, but I still need something for shared notes (shopping lists etc). Suggestions welcome. Current Status Lack of a contacts list is a major loss of functionality in a phone. I could store contacts in the phone memory or on the SIM, but I would still have to get all my old contacts in there, and getting something half working reduces motivation for getting it working properly. Lack of a calendar is also a problem; again, I could work around that by exporting all my Google calendars as iCal URLs, but I'd rather get it working correctly. The lack of shared notes may be a harder problem to solve given the failure of Nextcloud.
For that I would consider just having the keep.google.com web site always open in Mozilla, at least in the short term. At the moment I require two phones, my new Android phone without Google and the old one for my contacts list etc. Hopefully in a week or so I'll have my new phone doing contacts, calendaring, and notes. Then my old phone will just be for proprietary apps which I don't need most of the time, and I can leave it at home when I don't need that sort of thing.
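As an illustration of how little is involved, a Radicale setup for CalDAV and CardDAV can be as small as a config file along these lines (a minimal sketch, not from the post; the paths, listen address, and htpasswd settings are all example values):
# /etc/radicale/config -- example values only
[server]
hosts = 0.0.0.0:5232

[auth]
type = htpasswd
htpasswd_filename = /etc/radicale/users
htpasswd_encryption = md5

[storage]
# calendars and address books are stored as plain files under this directory
filesystem_folder = /var/lib/radicale/collections
An Android client such as DAVx5 (available in F-Droid) could then sync contacts and calendars from that server to the phone.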

6 April 2022

Bits from Debian: Infomaniak Platinum Sponsor of DebConf22

We are very pleased to announce that Infomaniak has committed to support DebConf22 as a Platinum sponsor. This is the fourth year in a row that Infomaniak is sponsoring The Debian Conference at the highest tier! Infomaniak is Switzerland's largest web-hosting company, also offering backup and storage services, solutions for event organizers, live-streaming and video on demand services. It wholly owns its datacenters and all elements critical to the functioning of the services and products provided by the company (both software and hardware). With this commitment as Platinum Sponsor, Infomaniak helps make our annual conference possible and directly supports the progress of Debian and Free Software, helping to strengthen the community that continues to collaborate on Debian projects throughout the rest of the year. Thank you very much, Infomaniak, for your support of DebConf22! Become a sponsor too! DebConf22 will take place from July 17th to 24th, 2022 at the Innovation and Training Park (ITP) in Prizren, Kosovo, and will be preceded by DebCamp, from July 10th to 16th. And DebConf22 is still accepting sponsors! Interested companies and organizations may contact the DebConf team through sponsors@debconf.org, and visit the DebConf22 website at https://debconf22.debconf.org/sponsors/become-a-sponsor.

Jonathan Dowland: One, by Be

picture of a vinyl record
The sublime One, by Be is a pastoral, English summer time instrumental improvisation around field recordings and the theme of the honey bee. A lovely piece to accompany deep thinking. I'm reminded of Virginia Astley. Be are associated with Caught by the River, a collective who explore ways of stepping out of daily digital life and embracing nature, walks, calm, etc.

4 April 2022

Arturo Borrero Gonz lez: Wikimedia Toolforge and Grid Engine

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez. One of the most important and successful products provided by the Wikimedia Cloud Services team at the Wikimedia Foundation is Toolforge, a hosting service commonly known in the industry as Platform as a Service (PaaS). In particular, it is a platform that allows users and developers to run and use a variety of applications with the ultimate goal of helping the Wikimedia mission from the technical side. Toolforge is powered by two different backend engines, Kubernetes and Grid Engine. The two backends have traditionally offered different features for tool developers. But as time moves forward we've learnt that Kubernetes is the future. Explaining why is the purpose of this blog post: we want to share more information and the reasoning behind this mindset. There are a number of reasons that make Grid Engine poorly suited to remain as an execution backend in Toolforge: As mentioned above, our desire is to cover all our grid-like needs with Kubernetes, a technology which has several benefits: The relationship between Toolforge and Grid Engine has been interesting over the years. The grid has been used for quite a long time, we have plenty of documentation and established good practices. On the other hand, the grid is hard to maintain, imposes a heavy burden on the WMCS team and is a technology we must eventually discontinue. How to accommodate the two realities is a refreshing challenge, one that we hope to tackle together in the near future. A tradeoff exists here, but it is clear to us which option is best. So we will work on deprecating and removing Grid Engine and migrating use cases to Kubernetes. This deprecation, however, will be done with care, as we know our technical community relies on the grid for some important Toolforge tools. And some of these workflows will need some adaptation in order to be fully supported on Kubernetes. Stay tuned for more information on current and upcoming work surrounding the Wikimedia Toolforge service. The next blog post will share more concrete details. This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

31 March 2022

Russell Coker: Links March 2022

Anarcat wrote a great blog post about switching from OpenNTPD to Chrony which gives a good overview of how NTP works and how accurate the different versions are [1]. Bleeping Computer has an amusing article about criminals who copied a lot of data from NVidia servers including specs of their latest products [2], they are threatening to release all the data if NVidia doesn't stop crippling their GPUs to make them unsuitable for cryptocurrency mining. I don't support these criminals, but I think NVidia should allow people who buy hardware to use their property as they choose. If cryptocurrency miners buy all the NVidia products then NVidia still makes the sales, they could even auction them to make more money. NPR has a disturbing article about the way execution by lethal injection works in the US [3]. It seems that most people die in an extremely unpleasant way. It makes the North Korean execution by anti-aircraft gun seem civilised. The DirtyPipe vulnerability is the latest serious security issue in the Linux kernel [4]. The report of how it was discovered is very interesting and should be read by all sysadmins. SE Linux will not save you from this as the vulnerability allows writing to read-only files like /etc/passwd. Politico has an insightful analysis of Putin, and it's not good news: he wants to conquer all territory that had ever been part of a Russian empire at any time in history [5]. The Guardian has an informative article about the EU's attempts to debunk Russian propaganda about Covid19 [6]. Fortunately the sanctions are reducing Russia's ability to do such things now. The Guardian has an interesting article about a project to use literary analysis to predict wars [7]. Funded by the German military, but funding was cut after it was proven to work. The Fact Act is a proposal by David Brin for political changes in the US to involve scientists and statisticians in an official advisory role in the legislative process [8], it's an idea with a lot of potential. Technology Review has an interesting interview with the leader of the NSA's Research Directorate [9]. In 2008 the EFF posted a long and informative article about the RIAA's war against music fans [10]. I had followed a lot of the news about this when it was happening, but I still learnt some things from this article that I hadn't known at the time. Also considering past legal battles in the context of the current situation is useful. As an aside, all the music I want to listen to is now on YouTube and youtube-dl works really well for me. The 1952 edition of Psychiatry: Journal of Interpersonal Relations has an interesting article, "On Cooling the Mark Out" [11], which starts with how criminal gangs engaged in fraud try to make their victims come to terms with the loss in a way that doesn't involve the police. But it goes on to cover ways of dealing with loss of status in general. The layout is hacky, with words broken by hyphens in the middle of lines, as it appears to have been scanned from paper, converted to MS-Word, and from there to PDF. But it's worth it. The Internet Heist by Cory Doctorow is an insightful series of 3 articles about the MPAA's (MAFIAA) attempts to take over all TV distribution in the US [12]. Wired has an interesting excerpt from the book "Spies, Lies, and Algorithms: The History and Future of American Intelligence", by Amy B. Zegart [13]. An interesting summary of the open source intelligence systems (which have nothing to do with open source as free software).
But it would be interesting to have an open source intelligence organisation along similar lines to open source software. The guy who tracks billionaires' private jets is an example of this.

24 March 2022

Ingo Juergensmann: New Server NVMe Issues

My current server is somewhat aged. I bought it new in July 2014 with a 6-core Xeon E5-2630L, 32 GB RAM and 4x 3.5" hot-swappable drives. Luckily I had the opportunity to extend the memory to 128 GB RAM at no additional cost by using memory from my ex-employer. It also has 4x 2 TB WD Red HDDs with 5400 rpm hooked up to the SATA backplane, but unfortunately only two of them are SATA-3 with 6 Gbit/s. The new server is a used/refurbished Supermicro server with 2x 14-core Xeon E5-2683 and 256 GB RAM and 4x 3.5" hot-swappable drives. It also came with a Hardware-RAID SAS/SATA 8-port controller with BBU. I also ordered two slim drive kits (MCP-220-81504-0N & MCP-220-81506-0N) to be able to use 2x 3.5" slots for rotational HDDs as cheap storage. Right now I added 2x 128 GB Supermicro SATA DOMs, 4x WD Red 4 TB SSDs and a Sonnet Fusion 4x4 Silent and 4x 1 TB Seagate Firecuda 520 NVMe disks. And here the issue starts: the NVMes should be capable of 4-5 GB/s, but they are connected to a PCIe 3.0 x16 port via the Sonnet Fusion 4x4, which itself features a PCIe bridge, so bifurcation is not necessary. When doing some tests with bonnie++ I get around 1 GB/s transfer rates out of a RAID10 setup with all 4 NVMes. In fact, regardless of the RAID level there are only transfer rates of about 1-1.2 GB/s with bonnie++. (All software RAIDs with mdadm.) But also when constructing a RAID each NVMe gives around 300-600 MB/s in sync speed, except for one exception: RAID1. Regardless of how many NVMe disks are in a RAID1 setup, the sync speed is up to 2.5 GB/s for each of the NVMe disks. So the lower transfer rates with bonnie++ or other RAID levels shouldn't be limited by bus speed nor by CPU speed. Alas, atop shows up to 100% CPU usage for all tests. In my understanding RAID10 should perform similarly to RAID1 in terms of syncing, and better in bonnie++ tests (up to 2x write and 4x read speed compared to a single disk). For the bonnie++ runs I made some tests that are available here. You can find the test parameters listed in the hostname column: Baldur is the hostname, then followed by the layout (near-2, far-2, offset-2), chunk size and concurrency of bonnie++. In the end there was no big impact of the chunk size of the RAID. So, now I'm wondering what the reason for the slow performance of those 4x NVMe disks is? Bus speed of the PCIe 3.0 x16 shouldn't be the cause, because I assume that the software RAID will need to transfer the blocks in RAID1 as well as in RAID10 over the bus. Same goes for the CPU: the amount of CPU work should be roughly the same for RAID1 and for RAID10. RAID10 should even have an advantage because the blocks only need to be synced to 2 disks in a stripe set. Bonnie++ tests are a different topic for sure. But when testing reading with dd from the md-devices I only get around 1-1.5 GB/s as well. Even when using LVM RAID instead of LVM on top of md RAID. All NVMe disks are already set to 4k and the IO scheduler is set to mq-deadline. Is there anything I could do to improve the performance of the NVMe disks? On the other hand, pure transfer rates are not that important to a server that runs a dozen VMs. Here the improved IOPS performance over rotational disks is a clear performance gain. But I'm still curious if I could get maybe 2 GB/s out of a RAID10 setup with the NVMe disks. Then again, having two independent RAID1 setups for MariaDB and for PostgreSQL databases might be a better choice over a single RAID10 setup?
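For reference, a sketch of the kind of test run described above (not the exact commands from the post; the device names, mount point, and chunk size are examples):
# 4-device RAID10 with the far-2 layout and a 512K chunk (values are examples)
mdadm --create /dev/md0 --level=10 --raid-devices=4 --layout=f2 --chunk=512 \
  /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.ext4 /dev/md0 && mount /dev/md0 /mnt/test
# bonnie++: -c sets concurrency, -f skips the slow per-character tests
bonnie++ -d /mnt/test -u root -c 4 -f
# raw sequential read from the md device, bypassing the page cache
dd if=/dev/md0 of=/dev/null bs=1M count=16384 iflag=direct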

26 February 2022

Russ Allbery: INN 2.6.5

This is a bit of a sneak preview announcement since I'm waiting for the ISC mirror to update before sending the official announcement to the normal channels, but INN 2.6.5 has been released. (The release was finalized a few days ago, and I'm a bit behind in posting it.) This is a bug fix and minor feature release over INN 2.6.4, and the upgrade should be painless. You can download the new release from ftp.isc.org (once it updates) or my personal INN pages. The latter also has links to the full changelog and the other INN documentation. As always, thanks to Julien ÉLIE for preparing this release and doing most of the maintenance work on INN! Changes in this release:

23 February 2022

Ian Jackson: Rooting an Eos Fairphone 4

Last week I received (finally) my Fairphone 4, supplied with a de-googled operating system, which I had ordered from the E Foundation's shop in December. (I am very hard on hardware and my venerable Fairphone 2 is really on its last legs.) I expect to have full control over the software on any computing device I own which is as complicated, capable, and therefore, hazardous, as a mobile phone. Unfortunately the Eos image (they prefer to spell it "/e/ os", srsly!) doesn't come with a way to get root without taking fairly serious measures including unlocking the bootloader. Unlocking the bootloader wouldn't be desirable for me but I can't live without root. So. I started with these helpful instructions: https://forum.xda-developers.com/t/fairphone-4-root.4376421/ I found the whole process a bit of a trial, and I thought I would write down what I did. But, it's not straightforward, at least for someone like me who only has a dim understanding of all this Android stuff. Unfortunately, due to the number of missteps and restarts, what I actually did is not really a sensible procedure. So here is a retcon of a process I think will work: Unlock the bootloader The E Foundation provide instructions for unlocking the bootloader on a stock FP4, here https://doc.e.foundation/devices/FP4/install and they seem applicable to the Murena phone supplied with Eos pre-installed, too. NB that unlocking the bootloader wipes the phone. So we do it first. So:
  1. Power on the phone, with no SIM installed
  2. You get a welcome screen.
  3. Skip all things on startup including wifi
  4. Go to the very end of the settings, tap a gazillion times on the phone's version until you're a developer
  5. In the developer settings, allow usb debugging
  6. In the developer settings, allow oem bootloader unlocking
  7. Connect a computer via a USB cable, say yes on phone to USB debugging
  8. adb reboot bootloader
  9. The phone will reboot into a texty kind of screen, the bootloader
  10. fastboot flashing unlock
  11. The phone will reboot, back to the welcome screen
  12. Repeat steps 3-9 (maybe not all are necessary)
  13. fastboot flashing unlock_critical
  14. The phone will reboot, back to the welcome screen
Note that although you are running fastboot, you must run this command with the phone in bootloader mode, not fastboot (aka "fastbootd") mode. If you run fastboot flashing unlock from fastboot you just get a "don't know what you're talking about" error. I found conflicting instructions on what kind of Vulcan nerve pinches could be used to get into which boot modes, and had poor experiences with those. adb reboot bootloader always worked reliably for me. Some docs say to run fastboot oem unlock; I used flashing. Maybe this depends on the Android tools version. Initial privacy prep and OTA update We want to update the supplied phone OS. The build mine shipped with is too buggy to run Magisk, the application we are going to use to root the phone. (With the pre-installed phone OS, Magisk crashes at the "patch boot image" step.) But I didn't want to let the phone talk to Google, even for the push notifications registration.
  1. From the welcome screen, skip all things except location, date, time. Notably, do not set up wifi
  2. In settings, microg section
    1. turn off cloud messaging
    2. turn off google safetynet
    3. turn off google registration (NB you must do this after the other two, because their sliders become dysfunctional after you turn google registration off)
    4. turn off both location modules
  3. In settings, location section, turn off allowed location for browser and magic earth
  4. Now go into settings and enable wifi, giving it your wifi details
  5. Tell the phone to update its operating system. This is a big download.
Install Magisk, the root manager (As a starting point I used these instructions https://www.xda-developers.com/how-to-install-magisk/ and a lot of random forum posts.) You will need the official boot.img. Bizarrely there doesn't seem to be a way to obtain this from the phone. Instead, you must download it. You can find it by starting at https://doc.e.foundation/devices/FP4/install which links to https://images.ecloud.global/stable/FP4/. At the time of writing, the most recent version, whose version number seemed to correspond to the OS update I installed above, was IMG-e-0.21-r-20220123158735-stable-FP4.zip.
  1. Download the giant zipfile to your computer
  2. Unzip it to extract boot.img
  3. Copy the file to your phone's "storage". E.g., via adb: with the phone booted into the main operating system, using USB debugging, adb push boot.img /storage/self/primary/Download. (See the consolidated sketch after these steps.)
  4. On the phone, open the browser, and enter https://f-droid.org. Click on the link to install f-droid. You will need to enable installing apps from the browser (follow the provided flow to the settings, change the setting, and then use Back, and you can do the install). If you wish, you can download the f-droid apk separately on a computer, and verify it with pgp.
  5. Using f-droid, install Magisk. You will need to enable installing apps from f-droid. (I installed Magisk from f-droid because 1. I was going to trust f-droid anyway 2. it has a shorter URL than Magisk's.)
  6. Open the Magisk app. Tell Magisk to install (Magisk, not the app). There will be only one option: patch boot file. Tell it to patch the boot.img file from before.
  7. Transfer the magisk_patched-THING.img back to your computer (eg via adb pull).
  8. adb reboot bootloader
  9. fastboot boot magisk_patched-THING.img (again, NB, from bootloader mode, not from fastboot mode)
  10. In Magisk you'll see it shows as installed. But it's not really; you've just booted from an image with it. Ask to install Magisk with "Direct install".
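Putting the adb and fastboot parts of those steps together (a sketch only; the patched filename varies, so magisk_patched-THING.img stands in for whatever name Magisk produces):
adb push boot.img /storage/self/primary/Download
# ... patch the image in the Magisk app on the phone ...
adb pull /storage/self/primary/Download/magisk_patched-THING.img
adb reboot bootloader
fastboot boot magisk_patched-THING.img
# then, in Magisk, choose "Direct install" to make it permanent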
After you have done all this, I believe that each time you do an over-the-air OS update, you must, between installing the update and rebooting the phone, ask Magisk to "Install to inactive slot (after OTA)". Presumably if you forget you must do the fastboot boot dance again. After all this, I was able to use tsu in Termux. There's a strange behaviour with the root prompt you get apropos Termux's request for root; I found that it definitely worked if Termux wasn't the foreground app. You have to leave the bootloader unlocked. However, as I understand it, the phone's encryption will still prevent an attacker from hoovering the data out of your phone. The bootloader lock is to prevent someone tricking you into entering the decryption passkey into a trojaned device. Other things to change There are probably other things to change. I have not yet transferred my Signal account from my old phone. It is possible that Signal will require me to re-enable the google push notifications, but I hope that having disabled them in microg it will be happy to use its own system, as it does on my old phone.


17 January 2022

Wouter Verhelst: Different types of Backups

In my previous post, I explained how I recently set up backups for my home server to be synced using Amazon's services. I received a (correct) comment on that by Iustin Pop which pointed out that while it is reasonably cheap to upload data into Amazon's offering, the reverse -- extracting data -- is not as cheap. He is right, in that extracting data from S3 Glacier Deep Archive costs over an order of magnitude more than it costs to store it there on a monthly basis -- in my case, I expect to have to pay somewhere in the vicinity of 300-400 USD for a full restore. However, I do not consider this to be a major problem, as these backups are only to fulfill the rarer of the two types of backups cases. There are two reasons why you should have backups. The first is the most common one: "oops, I shouldn't have deleted that file". This happens reasonably often; people will occasionally delete or edit a file that they did not mean to, and then they will want to recover their data. At my first job, a significant part of my job was to handle recovery requests from users who had accidentally deleted a file that they still needed. Ideally, backups to handle this type of situation are easily accessible to end users, and are performed reasonably frequently. A system that automatically creates and deletes filesystem snapshots (such as the zfsnap script for ZFS snapshots, which I use on my server) works well. The crucial bit here is to ensure that it is easier to copy an older version of a file than it is to start again from scratch -- if a user must file a support request that may or may not be answered within a day or so, it is likely they will not do so for a file they were working on for only half a day, which means they lose half a day of work in such a case. If, on the other hand, they can just go into the snapshots directory themselves and it takes them all of two minutes to copy their file, then they will also do that for files they only created half an hour ago, so they don't even lose half an hour of work and can get right back to it. This means that backup strategies to mitigate the "oops I lost a file" case ideally do not involve off-site file storage, and instead are performed online. The second case is the much rarer one, but (when required) has the much bigger impact: "oops the building burned down". Variants of this can involve things like lightning strikes, thieves, earth quakes, and the like; in all cases, the point is that you want to be able to recover all your files, even if every piece of equipment you own is no longer usable. That being the case, you will first need to replace that equipment, which is not going to be cheap, and it is also not going to be an overnight thing. In order to still be useful after you lost all your equipment, they must also be stored off-site, and should preferably be offline backups, too. Since replacing your equipment is going to cost you time and money, it's fine if restoring the backups is going to take a while -- you can't really restore from backup any time soon anyway. And since you will lose a number of days of content that you can't create when you can only fall back on your off-site backups, it's fine if you also lose a few days of content that you will have to re-create. 
All in all, the two types of backups have opposing requirements: "oops I lost a file" backups should be performed often and should be easily available; "oops I lost my building" backups should not be easily available, and are ideally done less often, so you don't pay a high amount of money for storage of your off-sites. In my opinion, if you have good "lost my file" backups, then it's also fine if the recovery of your backups are a bit more expensive. You don't expect to have to ever pay for these; you may end up with a situation where you don't have a choice, and then you'll be happy that the choice is there, but as long as you can reasonably pay for the worst case scenario of a full restore, it's not a case you should be worried about much. As such, and given that a full restore from Amazon Storage Gateway is going to be somewhere between 300 and 400 USD for my case -- a price I can afford, although it's not something I want to pay every day -- I don't think it's a major issue that extracting data is significantly more expensive than uploading data. But of course, this is something everyone should consider for themselves...
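To make the first case concrete: with ZFS, the "oops I lost a file" workflow described above can be as simple as the following (a sketch using plain zfs commands rather than the zfsnap script mentioned earlier; the dataset name and paths are examples):
# take a dated snapshot of the dataset (normally done from cron or a systemd timer)
zfs snapshot tank/home@$(date +%Y-%m-%d)
# list what is available to restore from
zfs list -t snapshot -r tank/home
# users can copy an old version straight out of the hidden snapshot directory
cp /tank/home/.zfs/snapshot/2022-01-01/report.txt ~/report.txt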

16 January 2022

Wouter Verhelst: Backing up my home server with Bacula and Amazon Storage Gateway

I have a home server. Initially conceived and sized so I could digitize my (rather sizeable) DVD collection, I started using it for other things; I added a few play VMs on it, started using it as a destination for the deja-dup-based backups of my laptop and the time machine-based ones of the various macs in the house, and used it as the primary location of all the photos I've taken with my cameras over the years (currently taking up somewhere around 500G) as well as those that were taking at our wedding (another 100G). To add to that, I've copied the data that my wife had on various older laptops and external hard drives onto this home server as well, so that we don't lose the data should something happen to one or more of these bits of older hardware. Needless to say, the server was running full, so a few months ago I replaced the 4x2T hard drives that I originally put in the server with 4x6T ones, and there was much rejoicing. But then I started considering what I was doing. Originally, the intent was for the server to contain DVD rips of my collection; if I were to lose the server, I could always re-rip the collection and recover that way (unless something happened that caused me to lose both at the same time, of course, but I consider that sufficiently unlikely that I don't want to worry about it). Much of the new data on the server, however, cannot be recovered like that; if the server dies, I lose my photos forever, with no way of recovering them. Obviously that can't be okay. So I started looking at options to create backups of my data, preferably in ways that make it easily doable for me to automate the backups -- because backups that have to be initiated are backups that will be forgotten, and backups that are forgotten are backups that don't exist. So let's not try that. When I was still self-employed in Belgium and running a consultancy business, I sold a number of lower-end tape libraries for which I then configured bacula, and I preferred a solution that would be similar to that without costing an arm and a leg. I did have a look at a few second-hand tape libraries, but even second hand these are still way outside what I can budget for this kind of thing, so that was out too. After looking at a few solutions that seemed very hackish and would require quite a bit of handholding (which I don't think is a good idea), I remembered that a few years ago, I had a look at the Amazon Storage Gateway for a customer. This gateway provides a virtual tape library with 10 drives and 3200 slots (half of which are import/export slots) over iSCSI. The idea is that you install the VM on a local machine, you connect it to your Amazon account, you connect your backup software to it over iSCSI, and then it syncs the data that you write to Amazon S3, with the ability to archive data to S3 Glacier or S3 Glacier Deep Archive. I didn't end up using it at the time because it required a VMWare virtualization infrastructure (which I'm not interested in), but I found out that these days, they also provide VM images for Linux KVM-based virtual machines (amongst others), so that changes things significantly. After making a few calculations, I figured out that for the amount of data that I would need to back up, I would require a monthly budget of somewhere between 10 and 20 USD if the bulk of the data would be on S3 Glacier Deep Archive. This is well within my means, so I gave it a try. 
The VM's technical requirements state that you need to assign four vCPUs and 16GiB of RAM, which just so happens to be the exact amount of RAM and CPU that my physical home server has. Obviously we can't do that. I tried getting away with 4GiB and 2 vCPUs, but that didn't work; the backup failed out after about 500G out of 2T had been written, due to the VM running out of resources. On the VM's console I found complaints that it required more memory, and I saw it mention something in the vicinity of 7GiB instead, so I decided to try again, this time with 8GiB of RAM rather than 4. This worked, and the backup was successful. As far as bacula is concerned, the tape library is just a (very big...) normal tape library, and I got data throughput of about 30M/s while the VM's upload buffer hadn't run full yet, with things slowing down to pretty much my Internet line speed when it had. With those speeds, Bacula finished the backup successfully in "1 day 6 hours 43 mins 45 secs", although the storage gateway was still uploading things to S3 Glacier for a few hours after that. All in all, this seems like a viable backup solution for large(r) amounts of data, although I haven't yet tried to perform a restore.
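For what it's worth, hooking a Linux host up to the gateway's virtual tape library is a standard open-iscsi exercise, roughly like this (a sketch; the gateway address is a placeholder and the resulting device names depend on what else is attached):
# discover the gateway's iSCSI targets and log in to them
iscsiadm -m discovery -t sendtargets -p 192.0.2.10:3260
iscsiadm -m node --login
# the VTL's changer and tape drives then appear as /dev/sg* and /dev/nst* devices
lsscsi -g
Bacula's Autochanger and Device resources can then point at those /dev/sg* and /dev/nst* nodes just as they would for physical hardware.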

Russell Coker: SSD Endurance

I previously wrote about the issue of swap potentially breaking SSD [1]. My conclusion was that swap wouldn't be a problem as no normally operating systems that I run had swap using any significant fraction of total disk writes. In that post the most writes I could see was 128GB written per day on a 120G Intel SSD (writing the entire device once a day). My post about swap and SSD was based on the assumption that you could get many thousands of writes to the entire device, which was incorrect. Here's a background on the terminology from WD [2]. So in the case of the 120G Intel SSD I was doing over 1 DWPD (Drive Writes Per Day), which is in the middle of the range of SSD capability; Intel doesn't specify the DWPD or TBW (Tera Bytes Written) for that device. The most expensive and high end NVMe device sold by my local computer store is the Samsung 980 Pro which has a warranty of 150TBW for the 250G device and 600TBW for the 1TB device [3]. That means that the system which used to have an Intel SSD would have exceeded the warranty in 3 years if it had a 250G device. My current workstation has been up for just over 7 days and has averaged 110GB written per day. It has some light VM use and the occasional kernel compile, a fairly typical developer workstation. Its storage is 2*Crucial 1TB NVMe devices in a BTRFS RAID-1; the NVMe devices are the old series of Crucial ones and are rated for 200TBW, which means that they can be expected to last for 5 years under the current load. This isn't a real problem for me as the performance of those devices is lower than I hoped for, so I will buy faster ones before they are 5yo anyway. My home server (and my wife's workstation) is averaging 325GB per day on the SSDs used for the RAID-1 BTRFS filesystem for root and for most data that is written frequently (including VMs). The SSDs are 500G Samsung 850 EVOs [4] which are rated at 150TBW, which means just over a year of expected lifetime. The SSDs are much more than a year old; I think Samsung stopped selling them more than a year ago. Between the 2 SSDs SMART reports 18 uncorrectable errors and btrfs device stats reports 55 errors on one of them. I'm not about to immediately replace them, but it appears that they are well past their prime. The server which runs my blog (among many other things) is averaging over 1TB written per day. It currently has a RAID-1 of hard drives for all storage, but its previous incarnation (which probably had about the same amount of writes) had a RAID-1 of enterprise SSDs for the most written data. After a few years of running like that (and some time running with someone else's load before it) the SSDs became extremely slow (sustained writes of 15MB/s) and started getting errors. So that's a pair of SSDs that were burned out. Conclusion The amounts of data being written are steadily increasing. Recent machines with more RAM can decrease storage usage in some situations, but that doesn't compare to the increased use of checksummed and logged filesystems, VMs, databases for local storage, and other things that multiply writes. The amount of writes allowed under warranty isn't increasing much and there are new technologies for larger SSD storage that decrease the DWPD rating of the underlying hardware. For the systems I own it seems that they are all going to exceed the rated TBW for the SSDs before I have other reasons to replace them, and they aren't particularly high usage systems. A mail server for a large number of users would hit it much earlier. RAID of SSDs is a really good thing.
Replacement of SSDs is something that should be planned for, and a way of swapping SSDs to less important uses is also good (my parents have some SSDs that are too small for my current use but which work well for them). Another thing to consider is that if you have a server with spare drive bays you could put some extra SSDs in to spread the wear among a larger RAID-10 array. Instead of having a 2*SSD BTRFS RAID-1 for a server you could have 6*SSD to get a 3x longer lifetime than a regular RAID-1 before the SSDs wear out (BTRFS supports this sort of thing). Based on these calculations and the small number of errors I've seen on my home server, I'll add a 480G SSD I have lying around to the array to spread the load and keep it running for a while longer.
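As a back-of-the-envelope check on the figures above (using the quoted TBW ratings and the measured daily write volumes; real endurance will also depend on write amplification):
200 TBW / 110 GB/day = about 1,800 days, or roughly 5 years (workstation NVMe)
150 TBW / 325 GB/day = about 460 days, or roughly 1.3 years (home server 850 EVOs)
With 6 devices in a 2-copy BTRFS RAID-1, each device sees 2/6 = 1/3 of the total writes, hence roughly 3x the lifetime of a 2-device mirror.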

4 January 2022

Jonathan McDowell: Upgrading from a CC2531 to a CC2538 Zigbee coordinator

Previously I set up a CC2531 as a Zigbee coordinator for my home automation. This has turned out to be a good move, with the 4 gang wireless switch being particularly useful. However the range of the CC2531 is fairly poor; it has a simple PCB antenna. It's also a very basic device. I set about trying to improve the range and scalability and settled upon a CC2538 + CC2592 device, which features an MMCX antenna connector. This device also has the advantage that it's ARM based, which I'm hopeful means I might be able to build some firmware myself using a standard GCC toolchain. For now I fetched the JetHome firmware from https://github.com/jethome-ru/zigbee-firmware/tree/master/ti/coordinator/cc2538_cc2592 (JH_2538_2592_ZNP_UART_20211222.hex) - while it's possible to do USB directly with the CC2538 my board doesn't have those bits so going the external USB UART route is easier. The device had some existing firmware on it, so I needed to erase this to force a drop into the boot loader. That means soldering up the JTAG pins and hooking it up to my Bus Pirate for OpenOCD goodness.
OpenOCD config
source [find interface/buspirate.cfg]
buspirate_port /dev/ttyUSB1
buspirate_mode normal
buspirate_vreg 1
buspirate_pullup 0
transport select jtag
source [find target/cc2538.cfg]
Steps to erase
$ telnet localhost 4444
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Open On-Chip Debugger
> mww 0x400D300C 0x7F800
> mww 0x400D3008 0x0205
> shutdown
shutdown command invoked
Connection closed by foreign host.
At that point I can switch to the UART connection (on PA0 + PA1) and flash using cc2538-bsl:
$ git clone https://github.com/JelmerT/cc2538-bsl.git
$ cc2538-bsl/cc2538-bsl.py -p /dev/ttyUSB1 -e -w -v ~/JH_2538_2592_ZNP_UART_20211222.hex
Opening port /dev/ttyUSB1, baud 500000
Reading data from /home/noodles/JH_2538_2592_ZNP_UART_20211222.hex
Firmware file: Intel Hex
Connecting to target...
CC2538 PG2.0: 512KB Flash, 32KB SRAM, CCFG at 0x0027FFD4
Primary IEEE Address: 00:12:4B:00:22:22:22:22
    Performing mass erase
Erasing 524288 bytes starting at address 0x00200000
    Erase done
Writing 524256 bytes starting at address 0x00200000
Write 232 bytes at 0x0027FEF88
    Write done
Verifying by comparing CRC32 calculations.
    Verified (match: 0x74f2b0a1)
I then wanted to migrate from the old device to the new without having to repair everything. So I shut down Home Assistant and backed up the CC2531 network information using zigpy-znp (which is already installed for Home Assistant):
python3 -m zigpy_znp.tools.network_backup /dev/zigbee > cc2531-network.json
I copied the backup to cc2538-network.json and modified the coordinator_ieee to be the new device's MAC address (rather than end up with 2 devices claiming the same MAC if/when I reuse the CC2531) and did:
python3 -m zigpy_znp.tools.network_restore --input cc2538-network.json /dev/ttyUSB1
The old CC2531 needed to be unplugged first, otherwise I got a RuntimeError: "Network formation refused, RF environment is likely too noisy. Temporarily unscrew the antenna or shield the coordinator with metal until a network is formed." error. After that I updated my udev rules to map the CC2538 to /dev/zigbee and restarted Home Assistant. To my surprise it came up and detected the existing devices without any extra effort on my part. However that resulted in 2 coordinators being shown in the visualisation, with the old one turning up as unk_manufacturer. Fixing that involved editing /etc/homeassistant/.storage/core.device_registry and removing the entry which had the old MAC address, removing the device entry in /etc/homeassistant/.storage/zha.storage for the old MAC and then finally firing up sqlite to modify the Zigbee database:
$ sqlite3 /etc/homeassistant/zigbee.db
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> DELETE FROM devices_v6 WHERE ieee = '00:12:4b:00:11:11:11:11';
sqlite> DELETE FROM endpoints_v6 WHERE ieee = '00:12:4b:00:11:11:11:11';
sqlite> DELETE FROM in_clusters_v6 WHERE ieee = '00:12:4b:00:11:11:11:11';
sqlite> DELETE FROM neighbors_v6 WHERE ieee = '00:12:4b:00:11:11:11:11' OR device_ieee = '00:12:4b:00:11:11:11:11';
sqlite> DELETE FROM node_descriptors_v6 WHERE ieee = '00:12:4b:00:11:11:11:11';
sqlite> DELETE FROM out_clusters_v6 WHERE ieee = '00:12:4b:00:11:11:11:11';
sqlite> .quit
So far it all seems a bit happier than with the CC2531; I've been able to pair a light bulb that was previously detected but would not integrate, which suggests the range is improved. (This post is another in the set of things I should write down so I can just grep my own website when I forget what I did to do "foo".)
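For completeness, the udev mapping mentioned above can be a one-line rule along these lines (a sketch; the vendor/product IDs here are for a common CP2102 USB UART and are only an example, check lsusb for the actual values):
# /etc/udev/rules.d/99-zigbee.rules
SUBSYSTEM=="tty", ATTRS{idVendor}=="10c4", ATTRS{idProduct}=="ea60", SYMLINK+="zigbee"
After reloading the rules (udevadm control --reload-rules) and replugging the adapter, /dev/zigbee points at the coordinator regardless of which ttyUSB number it ends up with.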

3 January 2022

Paul Wise: FLOSS Activities December 2021

Focus This month I didn't have any particular focus. I just worked on issues in my info bubble.

Changes

Issues

Review
  • Spam: reported 166 Debian mailing list posts
  • Patches: reviewed libpst upstream patches
  • Debian packages: sponsored nsis, memtest86+
  • Debian wiki: RecentChanges for the month
  • Debian BTS usertags: changes for the month
  • Debian screenshots:

Administration
  • libpst: setup GitHub presence, migrate from hg to git, requested details from bug reporters
  • plac: cleaned up git repo anomalies
  • Debian BTS: unarchive/reopen/triage bugs for reintroduced packages: stardict, node-carto
  • Debian wiki: unblock IP addresses, approve accounts

Communication
  • Respond to queries from Debian users and contributors on the mailing lists and IRC

Sponsors The purple-discord, python-plac, sptag, smart-open, libpst, memtest86+, oci-python-sdk work was sponsored. All other work was done on a volunteer basis.

Russ Allbery: Review: Crashed

Review: Crashed, by Adam Tooze
Publisher: Penguin Books
Copyright: 2018
Printing: 2019
ISBN: 0-525-55880-2
Format: Kindle
Pages: 615
The histories of the 2008 financial crisis that I have read focus almost exclusively on the United States. They also stop after the bank rescue and TARP or, if they press on into the aftermath, focus on the resulting damage to the US economy and the widespread pain of falling housing prices and foreclosure. Crashed does neither, instead arguing that 2008 was a crisis of European banks as much as American banks. It extends its history to cover the sovereign debt crisis in the eurozone, treating it as a continuation of the same crisis in a different guise. In the process, Tooze makes a compelling argument that one can draw a clear, if wandering, line from the moral revulsion at the propping up of the international banking system to Brexit and Trump. Qualifications first, since they are important for this type of comprehensive and, in places, surprising and counterintuitive history. Adam Tooze is Kathryn and Shelby Cullom Davis Professor of History at Columbia University and the director of its European Institute. His previous books have won multiple awards, and Crashed won the Lionel Gelber Prize for non-fiction on foreign policy. That it won a prize in that topic, rather than history or economics, is a hint at Tooze's chosen lens. The first half of the book is the lead-up and response to the crisis provoked by the collapse in value of securitized US mortgages and leading to the failure of Lehman Brothers, the failure in all but name of AIG, and a massive bank rescue. The financial instruments at the center of the crisis are complex and difficult to understand, and Tooze provides only brief explanation. This therefore may not be the best first book on the crisis; for that, I would still recommend Bethany McClean and Joe Nocera's All the Devils Are Here, although it's hard to beat Michael Lewis's storytelling in The Big Short. Tooze is not interested in dwelling on a blow-by-blow account of the crisis and initial response, and some of his account feels perfunctory. He is instead interested in describing its entangled global sweep. The new detail I took from the first half of Crashed is the depth of involvement of the European banks in what is often portrayed as a US crisis. Tooze goes into more specifics than other accounts on the eurodollar market, run primarily through the City of London, and the vast dollar-denominated liabilities of European banks. When the crisis struck, the breakdown of liquidity markets left those banks with no source of dollar funding to repay dollar-denominated short-term loans. The scale of dollar borrowing by European banks was vast, dwarfing the currency reserves or trade surpluses of their home countries. An estimate from the Bank of International Settlements put the total dollar funding needs for European banks at more than $2 trillion. The institution that saved the European banks was the United States Federal Reserve. This was an act of economic self-protection, not largesse; in the absence of dollar liquidity, the fire sale of dollar assets by European banks in a desperate attempt to cover their loans would have exacerbated the market crash. But it's remarkable in its extent, and in how deeply this contradicts the later public political position that 2008 was an American recession caused by American banks. 52% of the mortgage-backed securities purchased by the Federal Reserve in its quantitative easing policies (popularly known as QE1, QE2, and QE3) were sold by foreign banks. 
Deutsche Bank and Credit Suisse unloaded more securities on the Fed than any American bank by a significant margin. And when that wasn't enough, the Fed went farther and extended swap lines to major national banks, providing them dollar liquidity that they could then pass along to their local institutions. In essence, in Tooze's telling, the US Federal Reserve became the reserve bank for the entire world, preventing a currency crisis by providing dollars to financial systems both foreign and domestic, and it did so with a remarkable lack of scrutiny. Its swap lines avoided public review until 2010, when Bloomberg won a court fight to extract the records. That allowed the European banks that benefited to hide the extent of their exposure.
In Europe, the bullish CEOs of Deutsche Bank and Barclays claimed exceptional status because they avoided taking aid from their national governments. What the Fed data reveal is the hollowness of those boasts. The banks might have avoided state-sponsored recapitalization, but every major bank in the entire world was taking liquidity assistance on a grand scale from its local central bank, and either directly or indirectly by way of the swap lines from the Fed.
The emergency steps taken by Timothy Geithner in the Treasury Department were nearly as dramatic as those of the Federal Reserve. Without regard for borders, and pushing the boundary of their legal authority, they intervened massively in the world (not just the US) economy to save the banking and international finance system. And it worked. One of the benefits of a good history is to turn stories about heroes and villains into more nuanced information about motives and philosophies. I came away from Sheila Bair's account of the crisis furious at Geithner's protection of banks from any meaningful consequences for their greed. Tooze's account, and analysis, agrees with Bair in many respects, but Bair was continuing a personal fight and Tooze has more space to put Geithner into context. That context tells an interesting story about the shape of political economics in the 21st century. Tooze identifies Geithner as an institutionalist. His goal was to keep the system running, and he was acutely aware of what would happen if it failed. He therefore focused on the pragmatic and the practical: the financial system was about to collapse, he did whatever was necessary to keep it working, and that effort was successful. Fairness, fault, and morals were treated as irrelevant. This becomes more obvious when contrasted with the eurozone crisis, which started with a Greek debt crisis in the wake of the recession triggered by the 2008 crisis. Greece is tiny by the standards of the European economy, so at first glance there is no obvious reason why its debt crisis should have perturbed the financial system. Under normal circumstances, its lenders should have been able to absorb such relatively modest losses. But the immediate aftermath of the 2008 crisis was not normal circumstances, particularly in Europe. The United States had moved aggressively to recapitalize its banks using the threat of compensation caps and government review of their decisions. The European Union had not; European countries had done very little, and their banks were still in a fragile state. Worse, the European Central Bank had sent signals that the market interpreted as guaranteeing the safety of all European sovereign debt equally, even though this was explicitly ruled out by the Lisbon Treaty. If Greece defaulted on its debt, not only would that be another shock to already-precarious banks, it would indicate to the market that all European debt was not equal and other countries may also be allowed to default. As the shape of the Greek crisis became clearer, the cost of borrowing for all of the economically weaker European countries began rising towards unsustainable levels. In contrast to the approach taken by the United States government, though, Europe took a moralistic approach to the crisis. Jean-Claude Trichet, then president of the European Central Bank, held the absolute position that defaulting on or renegotiating the Greek debt was unthinkable and would not be permitted, even though there was no realistic possibility that Greece would be able to repay. He also took a conservative hard line on the role of the ECB, arguing that it could not assist in this crisis. (Tooze is absolutely scathing towards Trichet, who comes off in this account as rigidly inflexible, volatile, and completely irrational.) Germany's position, represented by Angela Merkel, was far more realistic: Greece's debt should be renegotiated and the creditors would have to accept losses. 
This is, in Tooze's account, clearly correct, and indeed is what eventually happened. But the problem with Merkel's position was the potential fallout. The German government was still in denial about the health of its own banks, and political opinion, particularly in Merkel's coalition, was strongly opposed to making German taxpayers responsible for other people's debts. Stopping the progression of a Greek default to a loss of confidence in other European countries would require backstopping European sovereign debt, and Merkel was not willing to support this. Tooze is similarly scathing towards Merkel, but I'm not sure it's warranted by his own account. She seemed, even in his account, boxed in by domestic politics and the tight constraints of the European political structure. Regardless, even after Trichet's term ended and he was replaced by the far more pragmatic Mario Draghi, Germany and Merkel continued to block effective action to relieve Greece's debt burden. As a result, the crisis lurched from inadequate stopgap to inadequate stopgap, forcing crippling austerity, deep depressions, and continued market instability while pretending unsustainable debt would magically become payable through sufficient tax increases and spending cuts. US officials such as Geithner, who put morals and arguably legality aside to do whatever was needed to save the system, were aghast. One takeaway from this is that expansionary austerity is the single worst macroeconomic idea that anyone has ever had.
In the summer of 2012 [the IMF's] staff revisited the forecasts they had made in the spring of 2010 as the eurozone crisis began and discovered that they had systematically underestimated the negative impact of budget cuts. Whereas they had started the crisis believing that the multiplier was on average around 0.5, they now concluded that from 2010 forward it had been in excess of 1. This meant that cutting government spending by 1 euro, as the austerity programs demanded, would reduce economic activity by more than 1 euro. So the share of the state in economic activity actually increased rather than decreased, as the programs presupposed. It was a staggering admission. Bad economics and faulty empirical assumptions had led the IMF to advocate a policy that destroyed the economic prospects for a generation of young people in Southern Europe.
Another takeaway, though, is central to Tooze's point in the final section of the book: the institutionalists in the United States won the war on financial collapse via massive state interventions to support banks and the financial system, a model that Europe grudgingly had to follow when attempting to reject it caused vast suffering while still failing to stabilize the financial system. But both did so via actions that were profoundly and obviously unfair, and only questionably legal. Bankers suffered few consequences for their greed and systematic mismanagement, taking home their normal round of bonuses while millions of people lost their homes and unemployment rates for young men in some European countries exceeded 50%. In Europe, the troika's political pressure against Greece and Italy was profoundly anti-democratic. The financial elite achieved their goal of saving the financial system. It could have failed, that failure would have been catastrophic, and their actions are defensible on pragmatic grounds. But they completely abandoned the moral high ground in the process. The political forces opposed to centrist neoliberalism attempted to step into that moral gap. On the Left, that came in the form of mass protest movements, Occupy Wall Street, Bernie Sanders, and parties such as Syriza in Greece. The Left, broadly, took the moral side of debtors, holding that the primary pain of the crisis should instead be born by the wealthy creditors who were more able to absorb it. The Right by contrast, in the form of the Tea Party movement inside the Republican Party in the United States and the nationalist parties in Europe, broadly blamed debtors for taking on excessive debt and focused their opposition on use of taxpayer dollars to bail out investment banks and other institutions of the rich. Tooze correctly points out that the Right's embrace of racist nationalism and incoherent demagoguery obscures the fact that their criticism of the elite center has real merit and is partly shared by the Left. As Tooze sketches out, the elite centrist consensus held in most of Europe, beating back challenges from both the Left and the Right, although it faltered in the UK, Poland, and Hungary. In the United States, the Democratic Party similarly solidified around neoliberalism and saw off its challenges from the Left. The Republican Party, however, essentially abandoned the centrist position, embracing the Right. That left the Democratic Party as the sole remaining neoliberal institutionalist party, supplemented by a handful of embattled Republican centrists. Wall Street and its money swung to the Democratic Party, but it was deeply unpopular on both the Left and the Right and this shift may have hurt them more than helped. The Democrats, by not abandoning the center, bore the brunt of the residual anger over the bank bailout and subsequent deep recession. Tooze sees in that part of the explanation for Trump's electoral victory over Hilary Clinton. This review is already much too long, and I haven't even mentioned Tooze's clear explanation of the centrality of treasury bonds to world finances, or his discussions of Russian and Ukraine, China, or Brexit, all of which I thought were excellent. This is not only an comprehensive history of both of the crises and international politics of the time period. 
It is also a thought-provoking look at how drastic of interventions are required to keep the supposed free market working, who is left to suffer after those interventions, and the political consequences of the choice to prioritize the stability of a deeply inequitable and unsafe financial system. At least in the United States, there is now a major political party that is likely to oppose even mundane international financial institutions, let alone another major intervention. The neoliberal center is profoundly weakened. But nothing has been done to untangle the international financial system, and little has been done to reduce its risk. The world will go into the next financial challenge still suffering from a legitimacy crisis. Given the miserly, condescending, and dismissive treatment of the suffering general populace after moving heaven and earth to save the banking system, that legitimacy crisis is arguably justified, but an uncontrolled crash of the financial system is not likely to be any kinder to the average citizen than it is to the investment bankers. Crashed is not the best-written book at a sentence-by-sentence level. Tooze's prose is choppy and a bit awkward, and his paragraphs occasionally wander away from a clear point. But the content is excellent and thought-provoking, filling in large sections of the crisis picture that I had not previously been aware of and making a persuasive argument for its continuing effects on current politics. Recommended if you're not tired of reading about financial crises. Rating: 8 out of 10

29 December 2021

Noah Meyerhans: When You Could Hear Security Scans

Have you ever wondered what a security probe of a computer sounded like? I'd guess probably not, because on the face of it that doesn't make a whole lot of sense. But there was a time when I could very clearly discern the sound of a computer being scanned. It sounded like a small mechanical heart beat: Click-click click-click click-click Prior to 2010, I had a computer under my desk with what at the time were not unheard-of properties: Its storage was based on a stack of spinning metal platters (a now-antiquated device known as a "hard drive"), and it had a publicly routable IPv4 address with an unfiltered connection to the Internet. Naturally it ran Linux and an ssh server. As was common in those days, service logging was handled by a syslog daemon. The syslog daemon would sort log messages based on various criteria and record them "somewhere". In most simple environments, "somewhere" was simply a file on local storage. When writing to a local file, syslog daemons can be optionally configured to use the fsync() system call to ensure that writes are flushed to disk. Practically speaking, what this meant is that a page of disk-backed memory would be written to the disk as soon as an event occurred that triggered a log message. Because of potential performance implications, fsync() was not typically enabled for most log files. However, due to the more sensitive nature of authentication logs, it was often enabled for /var/log/auth.log. In the first decade of the 2000s, there was a fairly unsophisticated worm loose on the Internet that would probe sshd with some common username/password combinations. The worm would pause for a second or so between login attempts, most likely in an effort to avoid automated security responses. The effect was that a system being probed by this worm would generate a disk write every second, with a very distinct audible signature from the hard drive. I think this situation is a fun demonstration of a side-channel data leak. It's primitive and doesn't leak very much information, but it was certainly enough to make some inference about the state of the system in question. Of course, side-channel leakage issues have been a concern for ages, but I like this one for its simplicity. It was something that could be explained and demonstrated easily, even to somebody with a relatively limited understanding of "how computers work", unlike, for instance, measuring electromagnetic emanations from CPU power management units. For a different take on the sounds of a computing infrastructure, Peep (The Network Auralizer) won an award at a USENIX conference long, long ago. I'd love to see a modern deployment of such a system. I'm sure you could build something for your cloud deployment using something like AWS EventBridge or Amazon SQS fairly easily. For more on research into actual real-world side-channel attacks, you can read A Survey of Microarchitectural Side-channel Vulnerabilities, Attacks and Defenses in Cryptography or A Survey of Electromagnetic Side-Channel Attacks and Discussion on their Case-Progressing Potential for Digital Forensics.
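The audible part hinged on that sync behaviour. In the traditional sysklogd configuration syntax (still accepted by rsyslog, though rsyslog only honours it when file syncing is enabled), a leading "-" in front of the file name suppresses the fsync() after each message, and Debian's stock configuration omits the "-" for auth.log. Roughly, the relevant lines look like this (a sketch, not the exact config from that machine):
auth,authpriv.*                 /var/log/auth.log
*.*;auth,authpriv.none          -/var/log/syslog
With the first line, every failed login attempt is flushed to the platters immediately, which is exactly the once-per-second click described above.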

Next.