Search Results: "rbd"

23 October 2021

Arthur Diniz: Dropdown for GitHub workflows input parameters

Dropdown for GitHub workflows input parameters Sometimes when we look at CI/CD tools embedded within git-based software repository manager like GitHub, GitLab or Bitbucket, we ran into a lack of some features. This time me and my DevOps/SRE team were facing a pain of not being able to have the option to create drop-downs within GitHub workflows using input parameters. Although this functionality is already available on other platforms such as Bitbucket, the specific client we were working on stored the code inside GitHub. At first I thought that someone has already solved this problem somehow, but doing an extensive search on the internet I found several angry GitHub users opening requests within the Support Community and even in the stack overflow. comment-1 comment-2 comment-3 comment-5 comment-4 So I decided to create a solution for this, always thinking about simplicity and in a way that makes it easy to get this missing functionality. I started by creating an input array pattern using commas and using a tag (the selector) e.g brackets as the default value marker. Here is an example of what an input string would look like:
name: gh-action-dropdown-list-input
on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment'
        required: true
        default: 'dev,staging,[uat],prod'
Now the final question that would turn out to be the most complicated to deal with. How can I change the GitHub Actions interface to replace the input pattern we created earlier to a dropdown? The simplest answer I thought was to create a chrome and firefox extension that would do all this logic behind the scenes and replace the HTML input element with the selected tag containing the array values and leaving the tag value (selector) always as the default. All code was developed in pure JavaScript, open-source licensed under Apache 2.0 and available at https://github.com/arthurbdiniz/gh-action-dropdown-list-input.

Install extension Once installed, the extension is ready to use and the final result we see is the Actions interface with drop-downs. :) showcase-1 showcase-2

Configuring selectors Go to the top right corner of the browser you are using and click on the extension logo. A screen will popup with tag options. Choose the right tags for you and save it.
This action might require reloading the GitHub workflow tab. config

Have fun using drop-downs inside GitHub. If you liked this project please share this post and if possible star within the repository. Also feel free to connect with me on LinkedIn: https://www.linkedin.com/in/arthurbdiniz

References
  • https://github.community/t/can-workflow-dispatch-input-be-option-list/127338
  • https://stackoverflow.com/questions/69296314/dropdown-for-github-workflows-input-parameters

20 October 2021

Arturo Borrero Gonz lez: Iterating on how we do NFS at Wikimedia Cloud Services

Logos This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez. NFS is a central piece of infrastructure that is essential to services like Toolforge. Recently, the Cloud Services team at Wikimedia had been reviewing how we do NFS. The current situation NFS is a central piece of technology for some of the services that the Wikimedia Cloud Services team offers to the community. We have several shares that power different use cases: Toolforge user home directories live on NFS, and Cloud VPS users can also access dumps using this protocol. The current setup involves several physical hardware servers, with about 20TB of storage, offering shares over 10G links to the cloud. For the system to be more fault-tolerant, we duplicate each share for redundancy using DRBD. Running NFS on dedicated hardware servers has traditionally offered us advantages: mostly on the performance and the capacity fields. As time has passed, we have been enumerating more and more reasons to review how we do NFS. For one, the current setup is in violation of some of our internal rules regarding realm separation. Additionally, we had been longing for additional flexibility managing our servers: we wanted to use virtual machines managed by Openstack Nova. The DRBD-based high-availability system required mostly a hand-crafted procedure for failover/failback. There s also some scalability concerns as NFS is easy to grow up, but not to grow horizontally, and of course, we have to be able to keep the tenancy setup while doing so, something that NFS does by using LDAP/Unix users and may get complicated too when growing. In general, the servers have become too big to fail , clearly technical debt, and it has taken us years to decide on taking on the task to rethink the architecture. It s worth mentioning that in an ideal world, we wouldn t depend on NFS, but the truth is that it will still be a central piece of infrastructure for years to come in services like Toolforge. Over a series of brainstorming meetings, the WMCS team evaluated the situation and sorted out the many moving parts. The team managed to boil down the potential service future to two competing options: Then we decided to research both options in parallel. For a number of reasons, the evaluation was timeboxed to three weeks. Both ideas had a couple of points in common: the NFS data would be stored on our Ceph farm via Cinder volumes, and we would rely on Ceph reliability to avoid using DRBD. Another open topic was how to back up data from Ceph, to store our important bits in more than one basket. We will get to the back up topic later. The manila experiment The Wikimedia Foundation was an early adopter of some Openstack components (Nova, Glance, Designate, Horizon), but Manila was never evaluated for usage until now. Our approach for this experiment was to closely follow the upstream guidelines. We read the documentation and tried to understand the different setups you can build with Manila. As we often feel with other Openstack components, the documentation doesn t perfectly describe how to introduce a given component in your particular local setup. Here we use an admin-controller flat-topology Neutron network. This network is shared by all tenants (or projects) in our Openstack deployment. Also, Manila can use many different driver backends, for things like NetApps or CephFS that we don t use , yet. After some research, the generic driver was the one that seemed to better fit our use case. The generic driver leverages Nova virtual machines instances plus Cinder volume to create and manage the shares. In general, Manila supports two operational modes, whether it should create/destroy the share servers (i.e, the virtual machine instances) or not. This option is called driver_handles_share_server (or DHSS) and takes a boolean value. We were interested in trying with DHSS=true, to really benefit from the potential of the setup. Manila diagram NFS idea 6, original image in Wikitech So, after sorting all these variables, we moved on with our initial testing. We built a PoC setup as depicted in the diagram above, with the manila-share component running in a virtual machine inside the cloud. The PoC led to us reporting several bugs upstream: In some cases we tried to address these bugs ourselves: It s worth mentioning that the upstream community was extra-welcoming to us, and we re thankful for that. However, at the end of our three-week period, our Manila setup still wasn t working as expected. Your experience may change with other drivers perhaps the ZFSonLinux or the CephFS ones. In general, we were having trouble making the setup work as expected, so we decided to abandon this approach in favor of the other option we were considering at the beginning. Simple virtual machine serving NFS The alternative was to create a Nova virtual machine instance by hand and to configure it using puppet. We have been investing in an automation framework lately, so the idea is to not actually create the server by hand. Anyway, the data would be decoupled from the instance into Cinder volumes, which led us to the question we left for later: How should we back up those terabytes of important information? Just to be clear, the backup problem was independent of the above options; with Manila we would still have had to solve the same challenge. We would like to see our data be backed up somewhere else other than in Ceph. And that s exactly where we are at right now. We ve been exploring different backup strategies and will finally use the Cinder backup API. Conclusion The iteration will end with the dedicated NFS hardware servers being stopped, and the shares being served from within the cloud. The migration will take some time to happen because we will check and double-check that everything works as expected (including from the performance point of view) before making definitive changes. We already have some plans to make sure our users experience as little service impact as possible. The most troublesome shares will be those related to Toolforge. At some point we will need to disallow writes to the NFS share, rsync the data out of the hardware servers into the Cinder volumes, point the NFS clients to the new virtual machines, and then enable writes again. The main Toolforge share has about 8TB of data, so this will take a while. We will have more updates in the future. Who knows, perhaps our next-next iteration, in a couple of years, will see us adopting Openstack Manila for good. Featured image credit: File:(from break water) Manila Skyline panoramio.jpg, ewol, CC BY-SA 3.0 This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

31 May 2021

Russell Coker: Some Ideas About Storage Reliability

Hard Drive Brands When people ask for advice about what storage to use they often get answers like use brand X, it works well for me and brand Y had a heap of returns a few years ago . I m not convinced there is any difference between the small number of manufacturers that are still in business. One problem we face with reliability of computer systems is that the rate of change is significant, so every year there will be new technological developments to improve things and every company will take advantage of them. Storage devices are unique among computer parts for their requirement for long-term reliability. For most other parts in a computer system a fault that involves total failure is usually easy to fix and even a fault that causes unreliable operation usually won t spread it s damage too far before being noticed (except in corner cases like RAM corruption causing corrupted data on disk). Every year each manufacturer will bring out newer disks that are bigger, cheaper, faster, or all three. Those disks will be expected to remain in service for 3 years in most cases, and for consumer disks often 5 years or more. The manufacturers can t test the new storage technology for even 3 years before releasing it so their ability to prove the reliability is limited. Maybe you could buy some 8TB disks now that were manufactured to the same design as used 3 years ago, but if you buy 12TB consumer grade disks, the 20TB+ data center disks, or any other device that is pushing the limits of new technology then you know that the manufacturer never tested it running for as long as you plan to run it. Generally the engineering is done well and they don t have many problems in the field. Sometimes a new range of disks has a significant number of defects, but that doesn t mean the next series of disks from the same manufacturer will have problems. The issues with SSDs are similar to the issues with hard drives but a little different. I m not sure how much of the improvements in SSDs recently have been due to new technology and how much is due to new manufacturing processes. I had a bad experience with a nameless brand SSD a couple of years ago and now stick to the better known brands. So for SSDs I don t expect a great quality difference between devices that have the names of major computer companies on them, but stuff that comes from China with the name of the discount web store stamped on it is always a risk. Hard Drive vs SSD A few years ago some people were still avoiding SSDs due to the perceived risk of new technology. The first problem with this is that hard drives have lots of new technology in them. The next issue is that hard drives often have some sort of flash storage built in, presumably a SSHD or Hybrid Drive gets all the potential failures of hard drives and SSDs. One theoretical issue with SSDs is that filesystems have been (in theory at least) designed to cope with hard drive failure modes not SSD failure modes. The problem with that theory is that most filesystems don t cope with data corruption at all. If you want to avoid losing data when a disk returns bad data and claims it to be good then you need to use ZFS, BTRFS, the NetApp WAFL filesystem, Microsoft ReFS (with the optional file data checksum feature enabled), or Hammer2 (which wasn t production ready last time I tested it). Some people are concerned that their filesystem won t support wear levelling for SSD use. When a flash storage device is exposed to the OS via a block interface like SATA there isn t much possibility of wear levelling. If flash storage exposes that level of hardware detail to the OS then you need a filesystem like JFFS2 to use it. I believe that most SSDs have something like JFFS2 inside the firmware and use it to expose what looks like a regular block device. Another common concern about SSD is that it will wear out from too many writes. Lots of people are using SSD for the ZIL (ZFS Intent Log) on the ZFS filesystem, that means that SSD devices become the write bottleneck for the system and in some cases are run that way 24*7. If there was a problem with SSDs wearing out I expect that ZFS users would be complaining about it. Back in 2014 I wrote a blog post about whether swap would break SSD [1] (conclusion it won t). Apart from the nameless brand SSD I mentioned previously all of my SSDs in question are still in service. I have recently had a single Samsung 500G SSD give me 25 read errors (which BTRFS recovered from the other Samsung SSD in the RAID-1), I have yet to determine if this is an ongoing issue with the SSD in question or a transient thing. I also had a 256G SSD in a Hetzner DC give 23 read errors a few months after it gave a SMART alert about Wear_Leveling_Count (old age). Hard drives have moving parts and are therefore inherently more susceptible to vibration than SSDs, they are also more likely to cause vibration related problems in other disks. I will probably write a future blog post about disks that work in small arrays but not in big arrays. My personal experience is that SSDs are at least as reliable as hard drives even when run in situations where vibration and heat aren t issues. Vibration or a warm environment can cause data loss from hard drives in situations where SSDs will work reliably. NVMe I think that NVMe isn t very different from other SSDs in terms of the actual storage. But the different interface gives some interesting possibilities for data loss. OS, filesystem, and motherboard bugs are all potential causes of data loss when using a newer technology. Future Technology The latest thing for high end servers is Optane Persistent memory [2] also known as DCPMM. This is NVRAM that fits in a regular DDR4 DIMM socket that gives performance somewhere between NVMe and RAM and capacity similar to NVMe. One of the ways of using this is Memory Mode where the DCPMM is seen by the OS as RAM and the actual RAM caches the DCPMM (essentially this is swap space at the hardware level), this could make multiple terabytes of RAM not ridiculously expensive. Another way of using it is App Direct Mode where the DCPMM can either be a simulated block device for regular filesystems or a byte addressable device for application use. The final option is Mixed Memory Mode which has some DCPMM in Memory Mode and some in App Direct Mode . This has much potential for use of backups and to make things extra exciting App Direct Mode has RAID-0 but no other form of RAID. Conclusion I think that the best things to do for storage reliability are to have ECC RAM to avoid corruption before the data gets written, use reasonable quality hardware (buy stuff with a brand that someone will want to protect), and avoid new technology. New hardware and new software needed to talk to new hardware interfaces will have bugs and sometimes those bugs will lose data. Filesystems like BTRFS and ZFS are needed to cope with storage devices returning bad data and claiming it to be good, this is a very common failure mode. Backups are a good thing.

30 May 2021

Russell Coker: Wifi Performance on Linux

Wifi usually just works. In the past I haven t had to worry much about performance as for home use things have always been bearable and at work it s never been my job so I just file a bug report with the relevant people when things go wrong. But a few years ago I had some problems. For my home network I got a free Wifi AP which wasn t performing well. My AP supported 802.11 modes b/g or g/n (b, g, and n are slow, medium, and fast speeds). I initially had the AP running in b/g mode because I had an 802.11b USB wifi device that I used. When I replaced that with one that did 802.11g I tried changing the AP to g/n mode but performance was even worse on my laptop (although quite good on phones) so I switched back. For phones it appeared to work well giving 54Mb/s while on my laptop (a second hand Thinkpad X1 Carbon) it was giving 11Mb/s at best and often much less than that. The best demonstration of problems was to start transferring a large file while pinging a system on the LAN the AP was connected to. Usually it would give ping times of 1s or more, sometimes 5s+ ping times. While this was happening the Invalid misc count increased rapidly, often by more than 100 per second. The results of Google searches suggest that Invalid misc is due to interference and recommend changing the channel. My AP had been on channel 1 which had performed poorly, channels 2-8 were ok, and channel 9 seemed reasonably good. As an aside trying all channels manually is not a good idea, it takes a lot of time and gives little useful data. After changing to channel 9 it still only gave about 500KB/s when transferring large files with ping times of about 100ms, but that s a big improvement. I tried running iwlist scanning to scan the Wifi network for other APs, that showed that channel 1 was used a lot but didn t make it clear what I should do other than that. The next thing I tried was the Wifi Analyser app on Android [1] (which doesn t work on my latest phone, I don t know if it s still being actively maintained, it will definitely work on older phones). That has a nice graph mode that shows which channels are used and how the frequencies spread and interfere with other channels. One thing I hadn t realised before I looked at the graphs is that 802.11n uses 4 channels and interferes past that. If you have two 802.11n devices you don t have much space left out of the 14 channels available. To make more space I configured the Wifi AP in my ADSL modem to 802.11b/g mode and assigned it a channel away from the others making 4 channels available with no interference. After that iwconfig reported between 60 and 120Mb/s and I got consistent transfer rates over 1.5MB/s while ping times remained below 100ms. The 5GHz frequency range is less congested. But at the time I didn t feel like buying 5GHz equipment. Since that time I had signed up with an ISP that had a good deal on a Wifi AP that had 5GHz. Now I have all my devices configured to use 5GHz or 2.4GHz depending on which they think is best. So there s less devices on 2.4GHz and the AP is configured for 20MHz channel width in the 2.4GHz range (which means 802.11b/g). Conclusion 802.11n seems to be a bad idea unless you run the only AP in an area. In a suburban area you will have 3 other houses broadcasting in your area and 802.11n is bad for everyone. The worst case scenario would be one person using 802.11n and interfering with everyone else s 802.11g and then having everyone else turn on 802.11n to try and make things faster. 5GHz is less congested as most people run old hardware. It also has a shorter range which has the upside of getting less interference from other people. I m considering installing 5GHz APs at both ends of my house and configuring all my new devices to not use 2.4GHz. Wifi spectrum analysis software is much better than manual testing of channels or trying to deduce things from the output if iwlist scanning .

12 August 2020

Michael Stapelberg: distri: 20x faster initramfs (initrd) from scratch

In case you are not yet familiar with why an initramfs (or initrd, or initial ramdisk) is typically used when starting Linux, let me quote the wikipedia definition: [ ] initrd is a scheme for loading a temporary root file system into memory, which may be used as part of the Linux startup process [ ] to make preparations before the real root file system can be mounted. Many Linux distributions do not compile all file system drivers into the kernel, but instead load them on-demand from an initramfs, which saves memory. Another common scenario, in which an initramfs is required, is full-disk encryption: the disk must be unlocked from userspace, but since userspace is encrypted, an initramfs is used.

Motivation Thus far, building a distri disk image was quite slow: This is on an AMD Ryzen 3900X 12-core processor (2019):
distri % time make cryptimage serial=1
80.29s user 13.56s system 186% cpu 50.419 total # 19s image, 31s initrd
Of these 50 seconds, dracut s initramfs generation accounts for 31 seconds (62%)! Initramfs generation time drops to 8.7 seconds once dracut no longer needs to use the single-threaded gzip(1) , but the multi-threaded replacement pigz(1) : This brings the total time to build a distri disk image down to:
distri % time make cryptimage serial=1
76.85s user 13.23s system 327% cpu 27.509 total # 19s image, 8.7s initrd
Clearly, when you use dracut on any modern computer, you should make pigz available. dracut should fail to compile unless one explicitly opts into the known-slower gzip. For more thoughts on optional dependencies, see Optional dependencies don t work . But why does it take 8.7 seconds still? Can we go faster? The answer is Yes! I recently built a distri-specific initramfs I m calling minitrd. I wrote both big parts from scratch:
  1. the initramfs generator program (distri initrd)
  2. a custom Go userland (cmd/minitrd), running as /init in the initramfs.
minitrd generates the initramfs image in 400ms, bringing the total time down to:
distri % time make cryptimage serial=1
50.09s user 8.80s system 314% cpu 18.739 total # 18s image, 400ms initrd
(The remaining time is spent in preparing the file system, then installing and configuring the distri system, i.e. preparing a disk image you can run on real hardware.) How can minitrd be 20 times faster than dracut? dracut is mainly written in shell, with a C helper program. It drives the generation process by spawning lots of external dependencies (e.g. ldd or the dracut-install helper program). I assume that the combination of using an interpreted language (shell) that spawns lots of processes and precludes a concurrent architecture is to blame for the poor performance. minitrd is written in Go, with speed as a goal. It leverages concurrency and uses no external dependencies; everything happens within a single process (but with enough threads to saturate modern hardware). Measuring early boot time using qemu, I measured the dracut-generated initramfs taking 588ms to display the full disk encryption passphrase prompt, whereas minitrd took only 195ms. The rest of this article dives deeper into how minitrd works.

What does an initramfs do? Ultimately, the job of an initramfs is to make the root file system available and continue booting the system from there. Depending on the system setup, this involves the following 5 steps:

1. Load kernel modules to access the block devices with the root file system Depending on the system, the block devices with the root file system might already be present when the initramfs runs, or some kernel modules might need to be loaded first. On my Dell XPS 9360 laptop, the NVMe system disk is already present when the initramfs starts, whereas in qemu, we need to load the virtio_pci module, followed by the virtio_scsi module. How will our userland program know which kernel modules to load? Linux kernel modules declare patterns for their supported hardware as an alias, e.g.:
initrd# grep virtio_pci lib/modules/5.4.6/modules.alias
alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci
Devices in sysfs have a modalias file whose content can be matched against these declarations to identify the module to load:
initrd# cat /sys/devices/pci0000:00/*/modalias
pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00
pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
[ ]
Hence, for the initial round of module loading, it is sufficient to locate all modalias files within sysfs and load the responsible modules. Loading a kernel module can result in new devices appearing. When that happens, the kernel sends a uevent, which the uevent consumer in userspace receives via a netlink socket. Typically, this consumer is udev(7) , but in our case, it s minitrd. For each uevent messages that comes with a MODALIAS variable, minitrd will load the relevant kernel module(s). When loading a kernel module, its dependencies need to be loaded first. Dependency information is stored in the modules.dep file in a Makefile-like syntax:
initrd# grep virtio_pci lib/modules/5.4.6/modules.dep
kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
To load a module, we can open its file and then call the Linux-specific finit_module(2) system call. Some modules are expected to return an error code, e.g. ENODEV or ENOENT when some hardware device is not actually present. Side note: next to the textual versions, there are also binary versions of the modules.alias and modules.dep files. Presumably, those can be queried more quickly, but for simplicitly, I have not (yet?) implemented support in minitrd.

2. Console settings: font, keyboard layout Setting a legible font is necessary for hi-dpi displays. On my Dell XPS 9360 (3200 x 1800 QHD+ display), the following works well:
initrd# setfont latarcyrheb-sun32
Setting the user s keyboard layout is necessary for entering the LUKS full-disk encryption passphrase in their preferred keyboard layout. I use the NEO layout:
initrd# loadkeys neo

3. Block device identification In the Linux kernel, block device enumeration order is not necessarily the same on each boot. Even if it was deterministic, device order could still be changed when users modify their computer s device topology (e.g. connect a new disk to a formerly unused port). Hence, it is good style to refer to disks and their partitions with stable identifiers. This also applies to boot loader configuration, and so most distributions will set a kernel parameter such as root=UUID=1fa04de7-30a9-4183-93e9-1b0061567121. Identifying the block device or partition with the specified UUID is the initramfs s job. Depending on what the device contains, the UUID comes from a different place. For example, ext4 file systems have a UUID field in their file system superblock, whereas LUKS volumes have a UUID in their LUKS header. Canonically, probing a device to extract the UUID is done by libblkid from the util-linux package, but the logic can easily be re-implemented in other languages and changes rarely. minitrd comes with its own implementation to avoid cgo or running the blkid(8) program.

4. LUKS full-disk encryption unlocking (only on encrypted systems) Unlocking a LUKS-encrypted volume is done in userspace. The kernel handles the crypto, but reading the metadata, obtaining the passphrase (or e.g. key material from a file) and setting up the device mapper table entries are done in user space.
initrd# modprobe algif_skcipher
initrd# cryptsetup luksOpen /dev/sda4 cryptroot1
After the user entered their passphrase, the root file system can be mounted:
initrd# mount /dev/dm-0 /mnt

5. Continuing the boot process (switch_root) Now that everything is set up, we need to pass execution to the init program on the root file system with a careful sequence of chdir(2) , mount(2) , chroot(2) , chdir(2) and execve(2) system calls that is explained in this busybox switch_root comment.
initrd# mount -t devtmpfs dev /mnt/dev
initrd# exec switch_root -c /dev/console /mnt /init
To conserve RAM, the files in the temporary file system to which the initramfs archive is extracted are typically deleted.

How is an initramfs generated? An initramfs image (more accurately: archive) is a compressed cpio archive. Typically, gzip compression is used, but the kernel supports a bunch of different algorithms and distributions such as Ubuntu are switching to lz4. Generators typically prepare a temporary directory and feed it to the cpio(1) program. In minitrd, we read the files into memory and generate the cpio archive using the go-cpio package. We use the pgzip package for parallel gzip compression. The following files need to go into the cpio archive:

minitrd Go userland The minitrd binary is copied into the cpio archive as /init and will be run by the kernel after extracting the archive. Like the rest of distri, minitrd is built statically without cgo, which means it can be copied as-is into the cpio archive.

Linux kernel modules Aside from the modules.alias and modules.dep metadata files, the kernel modules themselves reside in e.g. /lib/modules/5.4.6/kernel and need to be copied into the cpio archive. Copying all modules results in a 80 MiB archive, so it is common to only copy modules that are relevant to the initramfs s features. This reduces archive size to 24 MiB. The filtering relies on hard-coded patterns and module names. For example, disk encryption related modules are all kernel modules underneath kernel/crypto, plus kernel/drivers/md/dm-crypt.ko. When generating a host-only initramfs (works on precisely the computer that generated it), some initramfs generators look at the currently loaded modules and just copy those.

Console Fonts and Keymaps The kbd package s setfont(8) and loadkeys(1) programs load console fonts and keymaps from /usr/share/consolefonts and /usr/share/keymaps, respectively. Hence, these directories need to be copied into the cpio archive. Depending on whether the initramfs should be generic (work on many computers) or host-only (works on precisely the computer/settings that generated it), the entire directories are copied, or only the required font/keymap.

cryptsetup, setfont, loadkeys These programs are (currently) required because minitrd does not implement their functionality. As they are dynamically linked, not only the programs themselves need to be copied, but also the ELF dynamic linking loader (path stored in the .interp ELF section) and any ELF library dependencies. For example, cryptsetup in distri declares the ELF interpreter /ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2 and declares dependencies on shared libraries libcryptsetup.so.12, libblkid.so.1 and others. Luckily, in distri, packages contain a lib subdirectory containing symbolic links to the resolved shared library paths (hermetic packaging), so it is sufficient to mirror the lib directory into the cpio archive, recursing into shared library dependencies of shared libraries. cryptsetup also requires the GCC runtime library libgcc_s.so.1 to be present at runtime, and will abort with an error message about not being able to call pthread_cancel(3) if it is unavailable.

time zone data To print log messages in the correct time zone, we copy /etc/localtime from the host into the cpio archive.

minitrd outside of distri? I currently have no desire to make minitrd available outside of distri. While the technical challenges (such as extending the generator to not rely on distri s hermetic packages) are surmountable, I don t want to support people s initramfs remotely. Also, I think that people s efforts should in general be spent on rallying behind dracut and making it work faster, thereby benefiting all Linux distributions that use dracut (increasingly more). With minitrd, I have demonstrated that significant speed-ups are achievable.

Conclusion It was interesting to dive into how an initramfs really works. I had been working with the concept for many years, from small tasks such as debug why the encrypted root file system is not unlocked to more complicated tasks such as set up a root file system on DRBD for a high-availability setup . But even with that sort of experience, I didn t know all the details, until I was forced to implement every little thing. As I suspected going into this exercise, dracut is much slower than it needs to be. Re-implementing its generation stage in a modern language instead of shell helps a lot. Of course, my minitrd does a bit less than dracut, but not drastically so. The overall architecture is the same. I hope my effort helps with two things:
  1. As a teaching implementation: instead of wading through the various components that make up a modern initramfs (udev, systemd, various shell scripts, ), people can learn about how an initramfs works in a single place.
  2. I hope the significant time difference motivates people to improve dracut.

Appendix: qemu development environment Before writing any Go code, I did some manual prototyping. Learning how other people prototype is often immensely useful to me, so I m sharing my notes here. First, I copied all kernel modules and a statically built busybox binary:
% mkdir -p lib/modules/5.4.6
% cp -Lr /ro/lib/modules/5.4.6/* lib/modules/5.4.6/
% cp ~/busybox-1.22.0-amd64/busybox sh
To generate an initramfs from the current directory, I used:
% find .   cpio -o -H newc   pigz > /tmp/initrd
In distri s Makefile, I append these flags to the QEMU invocation:
-kernel /tmp/kernel \
-initrd /tmp/initrd \
-append "root=/dev/mapper/cryptroot1 rdinit=/sh ro console=ttyS0,115200 rd.luks=1 rd.luks.uuid=63051f8a-54b9-4996-b94f-3cf105af2900 rd.luks.name=63051f8a-54b9-4996-b94f-3cf105af2900=cryptroot1 rd.vconsole.keymap=neo rd.vconsole.font=latarcyrheb-sun32 init=/init systemd.setenv=PATH=/bin rw vga=836"
The vga= mode parameter is required for loading font latarcyrheb-sun32. Once in the busybox shell, I manually prepared the required mount points and kernel modules:
ln -s sh mount
ln -s sh lsmod
mkdir /proc /sys /run /mnt
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
modprobe virtio_pci
modprobe virtio_scsi
As a next step, I copied cryptsetup and dependencies into the initramfs directory:
% for f in /ro/cryptsetup-amd64-2.0.4-6/lib/*; do full=$(readlink -f $f); rel=$(echo $full   sed 's,^/,,g'); mkdir -p $(dirname $rel); install $full $rel; done
% ln -s ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
% cp /ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so
% cp -r /ro/cryptsetup-amd64-2.0.4-6/lib ro/cryptsetup-amd64-2.0.4-6/
% mkdir -p ro/gcc-libs-amd64-8.2.0-3/out/lib64/
% cp /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1
% ln -s /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/cryptsetup-amd64-2.0.4-6/lib
% cp -r /ro/lvm2-amd64-2.03.00-6/lib ro/lvm2-amd64-2.03.00-6/
In busybox, I used the following commands to unlock the root file system:
modprobe algif_skcipher
./cryptsetup luksOpen /dev/sda4 cryptroot1
mount /dev/dm-0 /mnt

25 July 2017

Reproducible builds folks: Reproducible Builds: week 117 in Buster cycle

Here's what happened in the Reproducible Builds effort between Sunday July 16 and Saturday July 22 2017: Toolchain development Bernhard M. Wiedemann wrote a tool to automatically run through different sources of non-determinism, and report which of these caused irreproducibility. Dan Kegel's patches to fpm were merged. Bugs filed Patches submitted upstream: Patches filed in Debian: Reviews of unreproducible packages 73 package reviews have been added, 44 have been updated and 50 have been removed in this week, adding to our knowledge about identified issues. No issue types were updated. Weekly QA work During our reproducibility testing, FTBFS bugs have been detected and reported by: diffoscope development reprotest development Ximin also restarted the discussion with autopkgtest-devel about code reuse for reprotest. Santiago Torres began a series of patches to make reprotest more distro-agnostic, with the aim of making it usable on Arch Linux. Ximin reviewed these patches. Misc. This week's edition was written by Ximin Luo, Bernhard M. Wiedemann and Chris Lamb & reviewed by a bunch of Reproducible Builds folks on IRC & the mailing lists.

5 May 2017

Patrick Matth i: Be careful: Upgrading Debian Jessie to Stretch, with Pacemaker DRBD and an nested ext4 LVM hosted on VMware products

Detached DRBD (diskless) In the past I setup some new Pacemaker clustered nodes with a fresh Debian Stretch installation. I followed our standard installation guide, created also shared replicated DRBD storage, but whenever I tried to mount the ext4 storage DRBD detached the disks on both node sides with I/O errors. After recreating it, using other storage volumes and testing my ProLiant hardware (whop I thought it had got a defect..) it still occurs, but somewhere in the middle of testing, a quicker setup without LVM it worked fine, hum.. Much later I found this (only post at this time about it) on the DRBD-user mailinglist: [0]
This means, if you use the combination of VMware-Product -> Debian Stretch -> local Storage -> DRBD -> LVM -> ext4 you will be affected by this bug. This happens, because VMware always publishs the information, that the guest is able to support the WRITE SAME feature, which is wrong. Since the DRBD version, which is also shipped with Stretch, DRBD now also supports WRITE SAME, so it tries to use this feature, but this fails then.
This is btw the same reason, why VMware users see in their dmesg this:
WRITE SAME failed.Manually zeroing.
As a workaround I am using now systemd, to disable WRITE SAME for all attached block devices in the guest. Simply run the following:
for i in find /sys/block/*/device/scsi_disk/*/max_write_same_blocks ; do echo w $i 0 ; done > /etc/tmpfiles.d/write_same.conf
[0]: http://lists.linbit.com/pipermail/drbd-user/2017-January/022931.html Pacemaker failovers with DRBD+LVM do not work If you use a DRBD with a nested LVM, you already had to add the following lines to your /etc/lvm/lvm.conf in past Debian releases (assuming that sdb and sdc are DRBD devices):
filter = [ r /dev/sdb.* /dev/sdc.* ]
write_cache_state = 0
Wit Debian Stretch this is not enough. Your failovers will result in a broken state on the second node, because it can not find your LVs and VGs. I found out, that killing lvmetad helps. So I also added a global_filter (it should be used for all LVM services):
global_filter = [ r /dev/sdb.* /dev/sdc.* ]
But this also didn t helped.. My only solution was to disable lvmetad (which I am also not using at all). So adding this all in combination works now for me and failovers are as smooth as with Jessie:
filter = [ r /dev/sdb.* /dev/sdc.* ]
global_filter = [ r /dev/sdb.* /dev/sdc.* ]
write_cache_state = 0
use_lvmetad = 0
Do not forget to update your initrd, so that the LVM configuration is updated on booting your server:
update-initramfs -k all -u
Reboot, that s it :)

19 October 2016

Reproducible builds folks: Reproducible Builds: week 77 in Stretch cycle

What happened in the Reproducible Builds effort between Sunday October 9 and Saturday October 15 2016: Media coverage Documentation update After discussions with HW42, Steven Chamberlain, Vagrant Cascadian, Daniel Shahaf, Christopher Berg, Daniel Kahn Gillmor and others, Ximin Luo has started writing up more concrete and detailed design plans for setting SOURCE_ROOT_DIR for reproducible debugging symbols, buildinfo security semantics and buildinfo security infrastructure. Toolchain development and fixes Dmitry Shachnev noted that our patch for #831779 has been temporarily rejected by docutils upstream; we are trying to persuade them again. Tony Mancill uploaded javatools/0.59 to unstable containing original patch by Chris Lamb. This fixed an issue where documentation Recommends: substvars would not be reproducible. Ximin Luo filed bug 77985 to GCC as a pre-requisite for future patches to make debugging symbols reproducible. Packages reviewed and fixed, and bugs filed The following updated packages have become reproducible - in our current test setup - after being fixed: The following updated packages appear to be reproducible now, for reasons we were not able to figure out. (Relevant changelogs did not mention reproducible builds.) Some uploads have addressed some reproducibility issues, but not all of them: Some uploads have addressed nearly all reproducibility issues, except for build path issues: Patches submitted that have not made their way to the archive yet: Reviews of unreproducible packages 101 package reviews have been added, 49 have been updated and 4 have been removed in this week, adding to our knowledge about identified issues. 3 issue types have been updated: Weekly QA work During of reproducibility testing, some FTBFS bugs have been detected and reported by: tests.reproducible-builds.org Debian: Openwrt/LEDE/NetBSD/coreboot/Fedora/archlinux: Misc. We are running a poll to find a good time for an IRC meeting. This week's edition was written by Ximin Luo, Holger Levsen & Chris Lamb and reviewed by a bunch of Reproducible Builds folks on IRC.

17 March 2016

Patrick Matth i: Debian Jessie 8.3: Short howto for Corosync+Pacemaker Active/Passive Cluster with two nodes and DRBD/LVM

Hello, since I had to change my old heartbeat v1 setup to an more modern Corosync+Pacemaker setup, because heartbeat v1 does not support systemd (it first looks like it is working, but it fails on service start/stops), I want to share a simple setup: First you have to activate the jessie-backports repository, because the cluster stack is not available/broken in Debian Jessie. Install the required packages with:
apt-get install -t jessie-backports libqb0 fence-agents pacemaker corosync pacemaker-cli-utils crmsh drbd-utils
After that configure your DRBD and LVM (VG+LV) on it (there are enough tutorials for it). Then deploy this configuration to /etc/corosync/corosync.conf:
totem
version: 2
token: 3000
token_retransmits_before_loss_const: 10
clear_node_high_bit: yes
crypto_cipher: none
crypto_hash: none
transport: udpu
interface
ringnumber: 0
bindnetaddr: 192.168.99.0

logging
to_logfile: yes
logfile: /var/log/corosync/corosync.log
debug: off
timestamp: on
logger_subsys
subsys: QUORUM
debug: off

quorum
provider: corosync_votequorum
two_node: 1
wait_for_all: 1
nodelist
node
ring0_addr: node1-1

node
ring0_addr: node1-2

Both nodes require a passwordless keypair, which is copied to the another node, so that you can ssh from one to each other. Then you can start with crm configure:
property stonith-enabled=no
property no-quorum-policy=ignore
property default-resource-stickiness=100 primitive DRBD_r0 ocf:linbit:drbd params drbd_resource= r0 op start interval= 0 timeout= 240 \
op stop interval= 0 timeout= 100 \
op monitor role=Master interval=59s timeout=30s \
op monitor role=Slave interval=60s timeout=30s
primitive LVM_r0 ocf:heartbeat:LVM params volgrpname= data1 op monitor interval= 30s
primitive SRV_MOUNT_1 ocf:heartbeat:Filesystem params device= /dev/mapper/data1-lv1 directory= /srv/storage fstype= ext4 options= noatime,nodiratime,nobarrier op monitor interval= 40s primitive IP-rsc ocf:heartbeat:IPaddr2 params ip= 123.123.123.123 nic= eth0 cidr_netmask= 24 meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive IPInt-rsc ocf:heartbeat:IPaddr2 params ip= 192.168.99.4 nic= eth1 cidr_netmask= 24 meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart primitive MariaDB-rsc lsb:mysql meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Redis-rsc lsb:redis-server meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Memcached-rsc lsb:memcached meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive PHPFPM-rsc lsb:php5-fpm meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Apache2-rsc lsb:apache2 meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Nginx-rsc lsb:nginx meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart group APCLUSTER LVM_r0 SRV_MOUNT_1 IP-rsc IPInt-rsc MariaDB-rsc Redis-rsc Memcached-rsc PHPFPM-rsc Apache2-rsc Nginx-rsc
ms ms_DRBD_APCLUSTER DRBD_r0 meta master-max= 1 master-node-max= 1 clone-max= 2 clone-node-max= 1 notify= true colocation APCLUSTER_on_DRBD_r0 inf: APCLUSTER ms_DRBD_APCLUSTER:Master
order APCLUSTER_after_DRBD_r0 inf: ms_DRBD_APCLUSTER:promote APCLUSTER:start commit
The last (bold marked) lines made me some headache. In short they define that the DRBD device on the active node has to be the primary one and that it is required to start the APCLUSTER on the host, since the LVM, filesystem and services require to access its data. Just a short copy paste howto for an simple use case with not so much deep explanaitions..

8 February 2016

Lunar: Reproducible builds: week 41 in Stretch cycle

What happened in the reproducible builds effort this week:

Toolchain fixes After remarks from Guillem Jover, Lunar updated his patch adding generation of .buildinfo files in dpkg.

Packages fixed The following packages have become reproducible due to changes in their build dependencies: dracut, ent, gdcm, guilt, lazarus, magit, matita, resource-agents, rurple-ng, shadow, shorewall-doc, udiskie. The following packages became reproducible after getting fixed:
  • disque/1.0~rc1-5 by Chris Lamb, noticed by Reiner Herrmann.
  • dlm/4.0.4-2 by Ferenc W gner.
  • drbd-utils/8.9.6-1 by Apollon Oikonomopoulos.
  • java-common/0.54 by by Emmanuel Bourg.
  • libjibx1.2-java/1.2.6-1 by Emmanuel Bourg.
  • libzstd/0.4.7-1 by Kevin Murray.
  • python-releases/1.0.0-1 by Jan Dittberner.
  • redis/2:3.0.7-2 by Chris Lamb, noticed by Reiner Herrmann.
  • tetex-brev/4.22.github.20140417-3 by Petter Reinholdtsen.
Some uploads fixed some reproducibility issues, but not all of them:
  • anarchism/14.0-4 by Holger Levsen.
  • hhvm/3.11.1+dfsg-1 by Faidon Liambotis.
  • netty/1:4.0.34-1 by Emmanuel Bourg.
Patches submitted which have not made their way to the archive yet:
  • #813309 on lapack by Reiner Herrmann: removes the test log and sorts the files packed into the static library locale-independently.
  • #813345 on elastix by akira: suggest to use the $datetime placeholder in Doxygen footer.
  • #813892 on dietlibc by Reiner Herrmann: remove gzip headers, sort md5sums file, and sort object files linked in static libraries.
  • #813912 on git by Reiner Herrmann: remove timestamps from documentation generated with asciidoc, remove gzip headers, and sort md5sums and tclIndex files.

reproducible.debian.net For the first time, we've reached more than 20,000 packages with reproducible builds for sid on amd64 with our current test framework. Vagrant Cascadian has set up another test system for armhf. Enabling four more builder jobs to be added to Jenkins. (h01ger)

Package reviews 233 reviews have been removed, 111 added and 86 updated in the previous week. 36 new FTBFS bugs were reported by Chris Lamb and Alastair McKinstry. New issue: timestamps_in_manpages_generated_by_yat2m. The description for the blacklisted_on_jenkins issue has been improved. Some packages are also now tagged with blacklisted_on_jenkins_armhf_only.

Misc. Steven Chamberlain gave an update on the status of FreeBSD and variants after the BSD devroom at FOSDEM 16. He also discussed how jails can be used for easier and faster reproducibility tests. The video for h01ger's talk in the main track of FOSDEM 16 about the reproducible ecosystem is now available.

4 January 2016

Lunar: Reproducible builds: week 36 in Stretch cycle

What happened in the reproducible builds effort between December 27th and January 2nd: Infrastructure dak now silently accepts and discards .buildinfo files (commit 1, 2), thanks to Niels Thykier and Ansgar Burchardt. This was later confirmed as working by Mattia Rizzolo. Packages fixed The following packages have become reproducible due to changes in their build dependencies: banshee-community-extensions, javamail, mono-debugger-libs, python-avro. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues, but not all of them: Untested changes: reproducible.debian.net The testing distribution (the upcoming stretch) is now tested on armhf. (h01ger) Four new armhf build nodes provided by Vagrant Cascandian were integrated in the infrastructer. This allowed for 9 new armhf builder jobs. (h01ger) The RPM-based build system, koji, is now in unstable and testing. (Marek Marczykowski-G recki, Ximin Luo). Package reviews 131 reviews have been removed, 71 added and 53 updated in the previous week. 58 new FTBFS reports were made by Chris Lamb and Chris West. New issues identified this week: nondeterminstic_ordering_in_gsettings_glib_enums_xml, nondeterminstic_output_in_warnings_generated_by_breathe, qt_translate_noop_nondeterminstic_ordering. Misc. Steven Chamberlain explained in length why reproducible cross-building across architectures mattered, and posted results of his tests comparing a stage1 debootstrapped chroot of linux-i386 once done from official Debian packages, the others cross-built from kfreebsd-amd64.

26 July 2015

Lunar: Reproducible builds: week 12 in Stretch cycle

What happened in the reproducible builds effort this week: Toolchain fixes Eric Dorlan uploaded automake-1.15/1:1.15-2 which makes the output of mdate-sh deterministic. Original patch by Reiner Herrmann. Kenneth J. Pronovici uploaded epydoc/3.0.1+dfsg-8 which now honors SOURCE_DATE_EPOCH. Original patch by Reiner Herrmann. Chris Lamb submitted a patch to dh-python to make the order of the generated maintainer scripts deterministic. Chris also offered a fix for a source of non-determinism in dpkg-shlibdeps when packages have alternative dependencies. Dhole provided a patch to add support for SOURCE_DATE_EPOCH to gettext. Packages fixed The following 78 packages became reproducible in our setup due to changes in their build dependencies: chemical-mime-data, clojure-contrib, cobertura-maven-plugin, cpm, davical, debian-security-support, dfc, diction, dvdwizard, galternatives, gentlyweb-utils, gifticlib, gmtkbabel, gnuplot-mode, gplanarity, gpodder, gtg-trace, gyoto, highlight.js, htp, ibus-table, impressive, jags, jansi-native, jnr-constants, jthread, jwm, khronos-api, latex-coffee-stains, latex-make, latex2rtf, latexdiff, libcrcutil, libdc0, libdc1394-22, libidn2-0, libint, libjava-jdbc-clojure, libkryo-java, libphone-ui-shr, libpicocontainer-java, libraw1394, librostlab-blast, librostlab, libshevek, libstxxl, libtools-logging-clojure, libtools-macro-clojure, litl, londonlaw, ltsp, macsyfinder, mapnik, maven-compiler-plugin, mc, microdc2, miniupnpd, monajat, navit, pdmenu, pirl, plm, scikit-learn, snp-sites, sra-sdk, sunpinyin, tilda, vdr-plugin-dvd, vdr-plugin-epgsearch, vdr-plugin-remote, vdr-plugin-spider, vdr-plugin-streamdev, vdr-plugin-sudoku, vdr-plugin-xineliboutput, veromix, voxbo, xaos, xbae. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which have not made their way to the archive yet: reproducible.debian.net The statistics on the main page of reproducible.debian.net are now updated every five minutes. A random unreviewed package is suggested in the look at a package form on every build. (h01ger) A new package set based new on the Core Internet Infrastructure census has been added. (h01ger) Testing of FreeBSD has started, though no results yet. More details have been posted to the freebsd-hackers mailing list. The build is run on a new virtual machine running FreeBSD 10.1 with 3 cores and 6 GB of RAM, also sponsored by Profitbricks. strip-nondeterminism development Andrew Ayer released version 0.009 of strip-nondeterminism. The new version will strip locales from Javadoc, include the name of files causing errors, and ignore unhandled (but rare) zip64 archives. debbindiff development Lunar continued its major refactoring to enhance code reuse and pave the way to fuzzy-matching and parallel processing. Most file comparators have now been converted to the new class hierarchy. In order to support for archive formats, work has started on packaging Python bindings for libarchive. While getting support for more archive formats with a common interface is very nice, libarchive is a stream oriented library and might have bad performance with how debbindiff currently works. Time will tell if better solutions need to be found. Documentation update Lunar started a Reproducible builds HOWTO intended to explain the different aspects of making software build reproducibly to the different audiences that might have to get involved like software authors, producers of binary packages, and distributors. Package reviews 17 obsolete reviews have been removed, 212 added and 46 updated this week. 15 new bugs for packages failing to build from sources have been reported by Chris West (Faux), and Mattia Rizzolo. Presentations Lunar presented Debian efforts and some recipes on making software build reproducibly at Libre Software Meeting 2015. Slides and a video recording are available. Misc. h01ger, dkg, and Lunar attended a Core Infrastructure Initiative meeting. The progress and tools mode for the Debian efforts were shown. Several discussions also helped getting a better understanding of the needs of other free software projects regarding reproducible builds. The idea of a global append only log, similar to the logs used for Certificate Transparency, came up on multiple occasions. Using such append only logs for keeping records of sources and build results has gotten the name Binary Transparency Logs . They would at least help identifying a compromised software signing key. Whether the benefits in using such logs justify the costs need more research.

12 January 2015

Russell Coker: Systemd Notes

A few months ago I gave a lecture about systemd for the Linux Users of Victoria. Here are some of my notes reformatted as a blog post: Scripts in /etc/init.d can still be used, they work the same way as they do under sysvinit for the user. You type the same commands to start and stop daemons. To get a result similar to changing runlevel use the systemctl isolate command. Runlevels were never really supported in Debian (unlike Red Hat where they were used for starting and stopping the X server) so for Debian users there s no change here. The command systemctl with no params shows a list of loaded services and highlights failed units. The command journalctl -u UNIT-PATTERN shows journal entries for the unit(s) in question. The pattern uses wildcards not regexs. The systemd journal includes the stdout and stderr of all daemons. This solves the problem of daemons that don t log all errors to syslog and leave the sysadmin wondering why they don t work. The command systemctl status UNIT gives the status and last log entries for the unit in question. A program can use ioctl(fd, TIOCSTI, ) to push characters into a tty buffer. If the sysadmin runs an untrusted program with the same controlling tty then it can cause the sysadmin shell to run hostile commands. The system call setsid() to create a new terminal session is one solution but managing which daemons can be started with it is difficult. The way that systemd manages start/stop of all daemons solves this. I am glad to be rid of the run_init program we used to use on SE Linux systems to deal with this. Systemd has a mechanism to ask for passwords for SSL keys and encrypted filesystems etc. There have been problems with that in the past but I think they are all fixed now. While there is some difficulty during development the end result of having one consistent way of managing this will be better than having multiple daemons doing it in different ways. The commands systemctl enable and systemctl disable enable/disable daemon start at boot which is easier than the SysVinit alternative of update-rc.d in Debian. Systemd has built in seat management, which is not more complex than consolekit which it replaces. Consolekit was installed automatically without controversy so I don t think there should be controversy about systemd replacing consolekit. Systemd improves performance by parallel start and autofs style fsck. The command systemd-cgtop shows resource use for cgroups it creates. The command systemd-analyze blame shows what delayed the boot process and
systemd-analyze critical-chain shows the critical path in boot delays. Sysremd also has security features such as service private /tmp and restricting service access to directory trees. Conclusion For basic use things just work, you don t need to learn anything new to use systemd. It provides significant benefits for boot speed and potentially security. It doesn t seem more complex than other alternative solutions to the same problems. https://wiki.debian.org/systemd http://freedesktop.org/wiki/Software/systemd/Optimizations/ http://0pointer.de/blog/projects/security.html

30 December 2012

Iustin Pop: Interesting tool of the day: ghc-gc-tune

Courtesy of a recent Google+ post/Stack overflow answer, I stumbled upon ghc-gc-tune, a simple but nice tool which generates interesting graphs for Haskell programs. What is does is quite trivial: iterate over a range of arena and heap sizes, and run a specified program with those RTS options, then generate a graph comparing the performance (by default cpu time) across the combinations of the values. Note that for newer GHC versions, you'll need to link with -rtsopts to allow for the -A/-H custom sizes. The reason I mention it is that the graphs it generates can be quite interesting. For one of the Ganeti programs, hspace, run with the command line ./hspace --simu p,20,1t,96g,16,1 --disk-template drbd, it generates this graph: runtime graph with trivial usage This picture, if I read it correctly, says that this program is actually well behaved (well, +RTS -s says "3 MB total memory in use" with default options), and that the optimum sizes actually relate to the (L1? L2?) cache size. But note that the maximum difference is only about ~1.6 . By changing the parameters to hspace and make it allocate more memory (./hspace --simu p,20,1t,128g,256,32 --disk-template plain --tiered=1g,128m,1, ~52MB reported by +RTS -s), the graph changes significantly: runtime graph with 50mb usage Now we have somewhat the opposite situation: very small arena sizes are detrimental (and by a big factor, 4.5 ), large arena/heap sizes are OK-ish, and the sweet spot is around 2-4MB arena size with heap sizes up to 4MB. Now maybe these particular examples were not very elightening (and they were definitely not well-conducted tests, etc.), but they should allow some intuition into how the program behaves. Plus, the tool can also generate other plots, for example peak memory usage.

26 November 2012

Russell Coker: Links November 2012

Julian Treasure gave an informative TED talk about The 4 Ways Sound Affects US [1]. Among other things he claims that open plan offices reduce productivity by 66%! He suggests that people who work in such offices wear headphones and play bird-songs. Naked Capitalism has an interesting interview between John Cusack and Jonathan Turley about how the US government policy of killing US citizens without trial demonstrates the failure of their political system [2]. Washington s blog has an interesting article on the economy in Iceland [3]. Allowing the insolvent banks to go bankrupt was the best thing that they have ever done for their economy. Clay Shirky wrote an insightful article about the social environment of mailing lists and ways to limit flame-wars [4]. ZRep is an interesting program that mirrors ZFS filesystems via regular snapshots and send/recv operations [5]. It seems that it could offer similar benefits to DRBD but at the file level and with greater reliability. James Lockyer gave a movingTEDx talk about his work in providing a legal defence for the wrongly convicted [6]. This has included overturning convictions after as much as half a century in which the falsely accused had already served a life sentence. Nathan Myers wrote an epic polemic about US government policy since 9-11 [7]. It s good to see that some Americans realise it s wrong. There is an insightful TED blog post about TED Fellow Salvatore Iaconesi who has brain cancer [8]. Apparently he had some problems with medical records in proprietary formats which made it difficult to get experts to properly assess his condition. Open document standards can be a matter of life and death and should be mandated by federal law. Paul Wayper wrote an interesting and amusing post about Emotional Computing which compares the strategies of Apple, MS, and the FOSS community among other things [9]. Kevin Allocca of Youtube gave an insightful TED talk about why videos go viral [10]. Jason Fried gave an interesting TED talk Why Work Doesn t Happen at Work [11]. His main issues are distraction and wasted time in meetings. He gives some good ideas for how to improve productivity. But they can also be used for sabotage. If someone doesn t like their employer then they could call for meetings, incite managers to call meetings, and book meetings so that they don t follow each other and thus waste more of the day (EG meetings at 1PM and 3PM instead of having the second meeting when the first finishes). Shyam Sankar gave an interesting TED talk about human computer cooperation [12]. He describes the success of human-computer partnerships in winning chess tournaments, protein folding, and other computational challenges. It seems that the limit for many types of computation will be the ability to get people and computers to work together efficiently. Cory Doctorow wrote an interesting and amusing article for Locus Magazine about some of the failings of modern sci-fi movies [13]. He is mainly concerned with pointless movies that get the science and technology aspects wrong and the way that the blockbuster budget process drives the development of such movies. Of course there are many other things wrong with sci-fi movies such as the fact that most of them are totally implausible (EG aliens who look like humans). The TED blog has an interesting interview with Catarina Mota about hacker spaces and open hardware [14]. Sociological Images has an interesting article about sporting behaviour [15]. They link to a very funny youtube video of a US high school football team who make the other team believe that they aren t playing until they win [16] Related posts:
  1. Links April 2012 Karen Tse gave an interesting TED talk about how to...
  2. Links March 2012 Washington s Blog has an informative summary of recent articles about...
  3. Links November 2011 Forbes has an interesting article about crowd-sourcing by criminals and...

27 April 2012

Russell Coker: BTRFS and ZFS as Layering Violations

LWN has an interesting article comparing recent developments in the Linux world to the Unix Wars that essentially killed every proprietary Unix system [1]. The article is really interesting and I recommend reading it, it s probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it). A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do. The Benefits of Layers I don t believe as strongly in the BTRFS/ZFS design as the commentator probably thinks. The current way my servers (and a huge number of other Linux systems) work of having RAID to form a reliable array of disks from a set of cheap disks for the purpose of reliability and often capacity or performance is a good thing. I have storage on top of the RAID array and can fix the RAID without bothering about the filesystem(s) and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV, among other things this means that the same code paths are used and I m less likely to encounter a bug when I install a new system. LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it without interfering with other filesystems. When using layered storage I can easily add or change layers when it s appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data. When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can t separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem. Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That s a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn t support such things and using software RAID underneath ZFS or BTRFS loses some features. When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID loses reliability features for ZFS and BTRFS that means that no matter how you might implement those filesystems with DRBD it seems that you will lose somehow. It seems that neither BTRFS nor ZFS supports a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that didn t change) so it s not possible to use BTRFS or ZFS RAID-1 with an NBD device. The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that. The Benefits of Integration When RAID and the filesystem are separate things (with some added abstraction from LVM) it s difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise it s operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance. It would be possible to have a RAID driver that includes checksums for all blocks, it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data which would of course need checksums (because the filesystem could be used on a less reliable block device) and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round) [3]. Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it s not a direct list of blocks on disk and thus can t be extracted if there is a problem with ZFS. Snapshots are an essential feature by today s standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead. The way that ZFS supports snapshots which inherit encryption keys is also interesting. Conclusion It s technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that put checksums on all blocks. But it appears that there isn t much interest in developing such things. So while people would use it (and people are using ZFS ZVols as block devices for other filesystems as described in a comment on Mark Round s blog) it s probably not going to be implemented. Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment. ZFS on Linux isn t a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that s not possible when you have important Linux software that needs fast access to the storage. Related posts:
  1. Starting with BTRFS Based on my investigation of RAID reliability [1] I have...
  2. ZFS vs BTRFS on Cheap Dell Servers I previously wrote about my first experiences with BTRFS [1]....
  3. Reliability of RAID ZDNet has an insightful article by Robin Harris predicting the...

8 February 2012

Russell Coker: More DRBD Performance tests

I ve previously written Some Notes on DRBD [1] and a post about DRBD Benchmarking [2]. Previously I had determined that replication protocol C gives the best performance for DRBD, that the batch-time parameters for Ext4 aren t worth touching for a single IDE disk, that barrier=0 gives a massive performance boost, and that DRBD gives a significant performance hit even when the secondary is not connected. Below are the results of some more tests of delivering mail from my Postal benchmark to my LMTP server which uses the Dovecot delivery agent to write it to disk, the rates are in messages per minute where each message is an average of 70K in size. The ext4 filesystem is used for all tests and the filesystem features list is has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize .
p4-2.8
Default Ext4 1663
barrier=0 2875
DRBD no secondary al-extents=7 645
DRBD no secondary default 2409
DRBD no secondary al-extents=1024 2513
DRBD no secondary al-extents=3389 2650
DRBD connected 1575
DRBD connected al-extents=1024 1560
DRBD connected al-extents=1024 Gig-E 1544
The al-extents option determines the size of the dirty areas that need to be resynced when a failed node rejoins the cluster. The default is 127 extents of 4M each for a block size of 508MB to be synchronised. The maximum is 3389 for a synchronisation block size of just over 13G. Even with fast disks and gigabit Ethernet it s going to take a while to synchronise things if dirty zones are 13GB in size. In my tests using the maximum size of al-extents gives a 10% performance benefit in disconnected mode while a size of 1024 gives a 4% performance boost. Changing the al-extents size seems to make no significant difference for a connected DRBD device. All the tests on connected DRBD devices were done with 100baseT apart from the last one which was a separate Gigabit Ethernet cable connecting the two systems. Conclusions For the level of traffic that I m using it seems that Gigabit Ethernet provides no performance benefit, the fact that it gave a slightly lower result is not relevant as the difference is within the margin of error. Increasing the al-extents value helps with disconnected performance, a value of 1024 gives a 4% performance boost. I m not sure that a value of 3389 is a good idea though. The ext4 barriers are disabled by DRBD so a disconnected DRBD device gives performance that is closer to a barrier=0 mount than a regular ext4 mount. With the significant performance difference between connected and disconnected modes it seems possible that for some usage scenarios it could be useful to disable the DRBD secondary at times of peak load it depends on whether DRBD is used as a really current backup or a strict mirror. Future Tests I plan to do some tests of DRBD over Linux software RAID-1 and tests to compare RAID-1 with and without bitmap support. I also plan to do some tests with the BTRFS filesystem, I know it s not ready for production but it would still be nice to know what the performance is like. But I won t use the same systems, they don t have enough CPU power. In my previous tests I established that a 1.5GHz P4 isn t capable of driving the 20G IDE disk to it s maximum capacity and I m not sure that the 2.8GHz P4 is capable of running a RAID to it s capacity. So I will use a dual-core 64bit system with a pair of SATA disks for future tests. The difference in performance between 20G IDE disks and 160G SATA disks should be a lot less than the performance difference between a 2.8GHz P4 and a dual-core 64bit CPU. Related posts:
  1. DRBD Benchmarking I ve got some performance problems with a mail server that s...
  2. Some Notes on DRBD DRBD is a system for replicating a block device across...
  3. Ethernet bonding Bonding is one of the terms used to describe multiple...

5 February 2012

Russell Coker: Reliability of RAID

ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6. Of course most systems in the field aren t using even RAID-6. All the most economical hosting options involve just RAID-1 and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct, you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it will be theoretically possible to reconstruct the data in some situations by determining which disk should have it s data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads). For the common case of a RAID-1 array if one disk returns a few bad sectors then probably most people will just hope that it doesn t hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers. Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of consumer disks had such errors during the 32 month study period [2]. This is a concern as I run enough RAID-1 systems with consumer disks that it is very improbable that I m not getting such errors. So the question is, how can I discover such errors and fix them? In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn t send an email to alert the sysadmin! I have filed Debian bug #658701 with a patch to make mdadm send email about this. But this really isn t going to help a lot as the email will be sent AFTER the kernel has synchronised the data with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn t seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won t know which disk is failing. Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used, in one case the kernel will check 2.7TB of RAID even when there s only 470G in use! The BTRFS Filesystem The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct! I ve just done a quick test of this. I created a filesystem with the command mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid? and copied /dev/urandom to it until it was full. I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem that worked correctly although I was disappointed that it didn t report any errors, I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command btrfs scrub start . and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors are. I was a little disappointed that the command btrfs scrub status . just gave me a count of the corrected errors and didn t mention which device had the errors. It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration. BTRFS is rumored to have performance problems, I will test this but don t have time to do so right now. Anyway I m not always particularly concerned about performance, I have some systems where reliability is important enough to justify a performance loss. BTRFS and Xen The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead of having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4. This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage. BTRFS and DRBD The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead. It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data! Comparing Risks I don t want to use BTRFS in production now because of the risk of bugs. While it s unlikely to have really serious bugs it s theoretically possible that as bug could deny access to data until kernel code is fixed and it s also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it s almost certain that I will lose small amounts of data and it s most probable that I have silently lost data on many occasions without realising. Related posts:
  1. Some RAID Issues I just read an interesting paper titled An Analysis of...
  2. ECC RAM is more useful than RAID A common myth in the computer industry seems to be...
  3. Software vs Hardware RAID Should you use software or hardware RAID? Many people claim...

5 January 2012

Russell Coker: DRBD Benchmarking

I ve got some performance problems with a mail server that s using DRBD so I ve done some benchmark tests to try and improve things. I used Postal for testing delivery to an LMTP server [1]. The version of Postal I released a few days ago had a bug that made LMTP not work, I ll release a new version to fix that next time I work on Postal or when someone sends me a request for LMTP support (so far no-one has asked for LMTP support so I presume that most users don t mind that it s not yet working). The local spool on my test server is managed by Dovecot, the Dovecot delivery agent stores the mail and the Dovecot POP and IMAP servers provide user access. For delivery I m using the LMTP server I wrote which has been almost ready for GPL release for a couple of years. All I need to write is a command-line parser to support delivery options for different local delivery agents. Currently my LMTP server is hard-coded to run /usr/lib/dovecot/deliver and has it s parameters hard-coded too. As an aside if someone would like to contribute some GPL C/C++ code to convert a string like /usr/lib/dovecot/deliver -e -f %from% -d %to% -n into something that will populate an argv array for execvp() then that would be really appreciated. Authentication is to a MySQL server running on a fast P4 system. The MySQL server was never at any fraction of it s CPU or disk IO capacity so using a different authentication system probably wouldn t have given different results. I used MySQL because it s what I m using in production. Apart from my LMTP server and the new version of Postal all software involved in the testing is from Debian/Squeeze. The Tests All tests were done on a 20G IDE disk. I started testing with a Pentium-4 1.5GHz system with 768M of RAM but then moved to a Pentium-4 2.8GHz system with 1G of RAM when I found CPU time to be a bottleneck with barrier=0. All test results are for the average number of messages delivered per minute for a 19 minute test run where the first minute s results are discarded. The delivery process used 12 threads to deliver mail.
P4-1.5 p4-2.8
Default Ext4 1468 1663
Ext4 max_batch_time=30000 1385 1656
Ext4 barrier=0 1997 2875
Ext4 on DRBD no secondary 1810 2409
When doing the above tests the 1.5GHz system was using 100% CPU time when the filesystem was mounted with barrier=0, about half of that was for system (although I didn t make notes at the time). So the testing on the 1.5GHz system showed that increasing the Ext4 max_batch_time number doesn t give a benefit for a single disk, that mounting with barrier=0 gives a significant performance benefit, and that using DRBD in disconnected mode gives a good performance benefit through forcing barrier=0. As an aside I wonder why they didn t support barriers on DRBD given all the other features that they have for preserving data integrity. The tests with the 2.8GHz system demonstrate the performance benefits of having adequate CPU power, as an aside I hope that Ext4 is optimised for multi-core CPUs because if a 20G IDE disk needs a 2.8GHz P4 then modern RAID arrays probably require more CPU power than a single core can provide. It s also interesting to note that a degraded DRBD device (where the secondary has never been enabled) only gives 84% of the performance of /dev/sda4 when mounted with barrier=0.
p4-2.8
Default Ext4 1663
Ext4 max_batch_time=30000 1656
Ext4 min_batch_time=15000,max_batch_time=30000 1626
Ext4 max_batch_time=0 1625
Ext4 barrier=0 2875
Ext4 on DRBD no secondary 2409
Ext4 on DRBD connected C 1575
Ext4 on DRBD connected B 1428
Ext4 on DRBD connected A 1284
Of all the options for batch times that I tried it seemed that every change decreased the performance slightly but as the greatest decrease in performance was only slightly more than 2% it doesn t matter much. One thing that really surprised me was the test results from different replication protocols. The DRBD replication protocols are documented here [2]. Protocol C is fully synchronous a write request doesn t complete until the remote node has it on disk. Protocol B is memory synchronous, the write is complete when it s on a local disk and in RAM on the other node. Protocol A is fully asynchronous, a write is complete when it s on a local disk. I had expected protocol A to give the best performance as it has lower latency for critical write operations and for protocol C to be the worst. My theory is that DRBD has a performance bug for the protocols that the developers don t recommend. One other thing I can t explain is that according to iostat the data partition on the secondary DRBD node had almost 1% more sectors written than the primary and the number of writes was more than 1% greater on the secondary. I had hoped that with protocol A the writes would be combined on the secondary node to give a lower disk IO load. I filed Debian bug report #654206 about the kernel not exposing the correct value for max_batch_time. The fact that no-one else has reported that bug (which is in kernels from at least 2.6.32 to 3.1.0) is an indication that not many people have found it useful. Conclusions When using DRBD use protocol C as it gives better integrity and better performance. Significant CPU power is apparently required for modern filesystems. The fact that a Maxtor 20G 7200rpm IDE disk [3] can t be driven at full speed by a 1.5GHz P4 was a surprise to me. DRBD significantly reduces performance when compared to a plain disk mounted with barrier=0 (for a fair comparison). The best that DRBD could do in my tests was 55% of native performance when connected and 84% of native performance when disconnected. When comparing a cluster of cheap machines running DRBD on RAID-1 arrays to a single system running RAID-6 with redundant PSUs etc the performance loss from DRBD is a serious problem that can push the economic benefit back towards the single system. Next I will benchmark DRBD on RAID-1 and test the performance hit of using bitmaps with Linux software RAID-1. If anyone knows how to make a HTML table look good then please let me know. It seems that the new blog theme that I m using prevents borders. Update: I mentioned my Debian bug report about the mount option and the fact that it s all on Debian/Squeeze. Related posts:
  1. I need an LMTP server I am working on a system where a front-end mail...
  2. Some Notes on DRBD DRBD is a system for replicating a block device across...
  3. paper about ZCAV This paper by Rodney Van Meter about ZCAV (Zoned Constant...

17 December 2011

Russell Coker: Some Notes on DRBD

DRBD is a system for replicating a block device across multiple systems. It s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it s committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this. I m installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I m not using DRBD because the failure conditions that I m planning for don t include two disks entirely failing. I m planning for a system having an outage for a while so it s OK to have some inbound and outbound mail delayed but it s not OK for the mail store to be unavailable. Global changes I ve made in /etc/drbd.d/global_common.conf In the common section I changed the protocol from C to B , this means that a write() system call returns after data is committed locally and sent to the other node. This means that if the primary node goes permanently offline AND if the secondary node has a transient power failure or kernel crash causing the buffer contents to be lost then writes can be lost. I don t think that this scenario is likely enough to make it worth choosing protocol C and requiring that all writes go to disk on both nodes before they are considered to be complete. In the net section I added the following: sndbuf-size 512k;
data-integrity-alg sha1; This uses a larger network sending buffer (apparently good for fast local networks although I d have expected that the low delay on a local Gig-E would give a low bandwidth delay product) and to use sha1 hashes on all packets (why does it default to no data integrity). Reserved Numbers The default port number that is used is 7789. I think it s best to use ports below 1024 for system services so I ve setup some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I ll use different ports even if they aren t configured to ever run on the same system. You never know when the cluster assignment will change and DRBD port numbers seems like something that could potentially cause real problems if there was a port conflict. Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in a /etc/fstab file or a Xen configuration file causing data loss. As an aside I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU, there is some degree of protection against a DomU using a block device that is already being used in the Dom0 but no protection against the Dom0 messing with the DomU s resources. It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device. As an aside it would be nice if there was some sort of mount helper for determining which devices shouldn t be mounted locally and which mount options are permitted it MIGHT be OK to do a read-only mount of a DomU s filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices. Initial Synchronisation After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don t then it will probably refuse to accept one copy of the data as primary as it can t seem to realise that both are inconsistent. I can t understand why it does this in the case where there are two nodes with inconsistent data, you know for sure that there is no good data so there should be an operation to zero both devices and make them equal. Instead there The solution sometimes seems to be to run drbdsetup /dev/drbd0 primary - (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in Inconsistent/Inconsistent state then the solution seems to involve running drbdadm -- --overwrite-data-of-peer primary db0-mysql (for the case of a resource named db0-mysql defined in /etc/drbd.d/db0-mysql.res). Also it seems that some commands can only be run from one node. So if you have a primary node that s in service and another node in Secondary/Unknown state (IE disconnected) with data state Inconsistent/DUnknown then while you would expect to be able to connect from the secondary node is appears that nothing other than a drbdadm connect command run from the primary node will get things going.

Next.