Search Results: "rbd"

28 January 2023

Emmanuel Kasper: Table of correspondence between AWS / Azure / Red Hat OpenShift Container Platform / upstream projects

If you know the Amazon Web Services or Azure portfolio, and you are interested in OpenShift or OKD, the OpenShift community distribution, this is a table of corresponding technologies. OpenShift is Red Hat's Kubernetes distribution: it is basically upstream Kubernetes delivered with monitoring, logging, CI/CD, an underlying OS, and tested upgrade paths, none of which come with a manual kubernetes.io kubeadm install. After passing the two corresponding certifications, my opinion on cloud operators is that they are very much a step back in the direction of proprietary software. You can rebuild their cloud stack with open source components, but that is a lot of integration work, similar to using the Linux From Scratch distribution instead of something like Debian. A good middle ground is the OpenShift and OKD Kubernetes distributions, which integrate the most common cloud components but allow installation on your own hardware or on the cloud provider of your choice.

AWS	Azure	OpenShift	OpenShift upstream project
CloudTrail		Kubernetes API Server audit log	Kubernetes
CloudWatch	Azure Monitor, Azure Log Analytics	OpenShift Monitoring	Prometheus, Kubernetes Metrics
AWS Artifact		Compliance Operator	OpenSCAP
AWS Trusted Advisor	Azure Advisor	Insights
AWS Marketplace		Red Hat Marketplace	OperatorHub
AWS Identity and Access Management (IAM)	Azure Active Directory, Azure AD DS	Red Hat SSO	Keycloak
AWS Elastic Beanstalk	Azure App Services	OpenShift Source2Image (S2I)	Source2Image (S2I)
AWS S3	Azure Blob Storage**	ODF Rados Gateway	Rook RGW
AWS Elastic Block Store	Azure Disk Storage	ODF Rados Block Device	Rook RBD
AWS Elastic File System	Azure Files	ODF Ceph FS	Rook CephFS
AWS ELB Classic	Azure Load Balancer	MetalLB Operator	MetalLB
AWS ELB Application Load Balancer	Azure Application Gateway	OpenShift Router	HAProxy
Amazon Simple Notification Service		OpenShift Streams for Apache Kafka	Apache Kafka
Amazon GuardDuty	Microsoft Defender for Cloud	API Server audit log review, ACS Runtime detection	StackRox
Amazon Inspector	Microsoft Defender for Cloud	Quay.io container scanner, ACS Vulnerability Assessment	Clair, StackRox
AWS Lambda	Azure Serverless	OpenShift Serverless*	Knative
AWS Key Management Service	Azure Key Vault	could be done with HashiCorp Vault	Vault
AWS WAF		NGINX Ingress Controller Operator with ModSecurity	NGINX ModSecurity
Amazon ElastiCache		Redis Enterprise Operator	Redis, memcached as alternative
AWS Relational Database Service	Azure SQL	Crunchy Data Operator	PostgreSQL
	Azure Arc	OpenShift ACM	Open Cluster Management
AWS Scaling Group	Azure Scale Set	OpenShift Autoscaler	OKD Autoscaler

* OpenShift Serverless requires the application to be packaged as a container, something AWS Lambda does not require.
** Azure Blob Storage covers the object storage use case of S3, but is itself not S3 compatible.

28 September 2022

Vincent Fourmond: Version 3.1 of QSoas is out

The new version of QSoas has just been released! It brings in a host of new features, as the releases before, but maybe the most important change is the following...

Binary images now freely available!

Starting from now, all the binary images for the new versions of QSoas will be freely available from the download page. You can download the precompiled versions of QSoas for MacOS or Windows. So now you have no reason not to try it!

My aim with making the binaries freely available is also to simplify the release process for me and therefore increase the rate at which new versions are released.

Improvements to the fit interface

Some work went into improving the fit interface, in particular for the handling of fit trajectories when doing parameter space exploration, for difficult fits with many parameters and many local minima. The fit window now features real menus, along with tabs and a way to display the terminal (see the menus and the tabs selection on the image).

Individual fits have also been improved, with, among others, the possibility to easily simulate voltammograms with the kinetic-system fits, and the handling of Marcus-Hush-Chidsey (or Marcus "distribution of states") kinetics for electron transfers.

Column and row names

This release greatly improves the handling of column and row names, including commands to easily modify them, the possibility to use Ruby formulas to change them, and a much better way to read and write them to data files. Mastering the use of column names (and to a lesser extent, row names) can greatly simplify data handling, especially when dealing with files with a large number of columns.

Complex numbers

Version 3.1 brings in support for formulas handling complex numbers. Although it is not possible to store complex numbers directly into datasets, it is easy to separate them into real and imaginary parts to your liking.

Scripting improvement

Two important improvements for scripting are included in version 3.1. The first is the possibility to define virtual files inside a script file, which makes it easy to define subfunctions to run using commands like run-for-each. The second is the possibility to define variables to be reused later (like the script arguments) using the new command let.

There are a lot of other new features, improvements and so on, look for the full list there.

About QSoas

QSoas is a powerful open source data analysis program that focuses on flexibility and powerful fitting capacities. It is released under the GNU General Public License. It is described in Fourmond, Anal. Chem., 2016, 88 (10), pp 5050-5052. Current version is 3.1. You can download its source code or precompiled versions for MacOS and Windows there. Alternatively, you can clone from the GitHub repository.

1 September 2022

Emmanuel Kasper: OpenShift vs. AWS product mapping

If you know the Amazon Web Services portfolio, and you are interested in OpenShift or OKD, the OpenShift community distribution, this is a table of corresponding technologies. OpenShift is Red Hat's Kubernetes distribution: it is basically upstream Kubernetes delivered with monitoring, logging, CI/CD, an underlying OS, and tested upgrade paths, none of which come with a manual kubernetes.io kubeadm install.

AWS	OpenShift	OpenShift upstream project
CloudTrail	Kubernetes API Server audit log	Kubernetes
CloudWatch	OpenShift Monitoring	Prometheus
AWS Artifact	Compliance Operator	OpenSCAP
AWS Trusted Advisor	Insights
AWS Marketplace	OpenShift OperatorHub
AWS Identity and Access Management (IAM)	Red Hat SSO	Keycloak
AWS Elastic Beanstalk	OpenShift Source2Image (S2I)	Source2Image (S2I)
AWS S3	ODF Rados Gateway	Rook RGW
AWS Elastic Block Store	ODF Rados Block Device	Rook RBD
AWS Elastic File System	ODF Ceph FS	Rook CephFS
Amazon Simple Notification Service	OpenShift Streams for Apache Kafka	Apache Kafka
Amazon GuardDuty	API Server audit log review, ACS Runtime detection	StackRox
Amazon Inspector	Quay.io container scanner, ACS Vulnerability Assessment	Clair, StackRox
AWS Lambda	OpenShift Serverless*	Knative
AWS Key Management Service	could be done with HashiCorp Vault	Vault
AWS WAF	NGINX Ingress Controller Operator with ModSecurity	NGINX ModSecurity
Amazon ElastiCache	Redis Enterprise Operator	Redis, memcached as alternative
AWS Relational Database Service	Crunchy Data Operator	PostgreSQL

* OpenShift Serverless requires the application to be packaged as a container, something AWS Lambda does not require.

22 April 2022

Russell Coker: Joplin Notes

In response to my post about Android phones without Google Play [1] I received an email recommending Joplin for notes on Android [2]. Joplin supports storing notes via a number of protocols including Nextcloud and WebDAV. I set up WebDAV because it's easiest; here are the Digital Ocean instructions for WebDAV on Apache [3]. That basically works.

One problem for my use case is that the Joplin client doesn't support accounts on multiple servers, and the only released way of sharing notes between accounts is using the paid Joplin Cloud service. There is a Joplin Server in beta which allows sharing notes, but it is designed to run in Docker and is written in TypeScript, so it was too much pain to set up.

One mitigating factor is that there are Notebooks, which are collections of notes. So if multiple people who trust each other share an account, they can have Notebooks for personal notes and a Notebook for shared notes.

There is also a Snap install of the client for Debian [4]. Snap isn't my favourite way of doing things, but packaging JavaScript programs will probably be painful, so I'll do it if I continue using Joplin.

23 October 2021

Arthur Diniz: Dropdown for GitHub workflows input parameters

Sometimes when we look at CI/CD tools embedded within git-based software repository managers like GitHub, GitLab or Bitbucket, we run into a lack of some features. This time my DevOps/SRE team and I were facing the pain of not being able to create drop-downs within GitHub workflows using input parameters. Although this functionality is already available on other platforms such as Bitbucket, the specific client we were working for stored the code in GitHub. At first I thought that someone had already solved this problem somehow, but after an extensive search on the internet I found several angry GitHub users opening requests within the Support Community and even on Stack Overflow.

So I decided to create a solution for this, always keeping simplicity in mind, so that this missing functionality would be easy to get. I started by creating an input array pattern that separates values with commas and uses a tag (the selector), e.g. brackets, as the default value marker. Here is an example of what an input string would look like:

name: gh-action-dropdown-list-input
on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment'
        required: true
        default: 'dev,staging,[uat],prod'

Now for the final question, which turned out to be the most complicated to deal with: how can I change the GitHub Actions interface to replace the input pattern we created earlier with a dropdown? The simplest answer I could think of was to create a Chrome and Firefox extension that does all this logic behind the scenes and replaces the HTML input element with a select tag containing the array values, always leaving the tagged value (the selector) as the default. All code was developed in pure JavaScript, is open source under the Apache 2.0 license, and is available at https://github.com/arthurbdiniz/gh-action-dropdown-list-input.
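To make the input pattern concrete, here is a minimal sketch of how such a string could be split into dropdown options plus a default. It is written in Python purely for illustration (the extension itself is JavaScript), and the function name `parse_dropdown`, its parameters, and the fallback to the first option when nothing is tagged are all assumptions of this sketch, not the extension's actual code.

```python
def parse_dropdown(value, open_tag="[", close_tag="]"):
    """Split 'dev,staging,[uat],prod' into (options, default).

    Hypothetical helper: the tag characters are assumed to be non-empty,
    and the first tagged entry is taken as the default.
    """
    options = []
    default = None
    for raw in value.split(","):
        item = raw.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        tagged = (
            len(item) > len(open_tag) + len(close_tag)
            and item.startswith(open_tag)
            and item.endswith(close_tag)
        )
        if tagged:
            # Strip the selector tags to recover the plain option value.
            item = item[len(open_tag):len(item) - len(close_tag)]
            if default is None:
                default = item
        options.append(item)
    if default is None and options:
        # Assumption: with no tagged entry, fall back to the first option.
        default = options[0]
    return options, default


if __name__ == "__main__":
    print(parse_dropdown("dev,staging,[uat],prod"))
```

With the example workflow above, the options would be dev, staging, uat and prod, with uat preselected.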

Install extension

Chrome

Firefox

Once installed, the extension is ready to use and the final result we see is the Actions interface with drop-downs. :)

Configuring selectors

Go to the top right corner of the browser you are using and click on the extension logo. A screen will pop up with tag options. Choose the right tags for you and save.
This action might require reloading the GitHub workflow tab.

Have fun using drop-downs inside GitHub. If you liked this project, please share this post and, if possible, star the repository. Also feel free to connect with me on LinkedIn: https://www.linkedin.com/in/arthurbdiniz

References

https://github.community/t/can-workflow-dispatch-input-be-option-list/127338

https://stackoverflow.com/questions/69296314/dropdown-for-github-workflows-input-parameters

20 October 2021

Arturo Borrero González: Iterating on how we do NFS at Wikimedia Cloud Services

This post was originally published in the Wikimedia Tech blog, authored by Arturo Borrero Gonzalez.

NFS is a central piece of infrastructure that is essential to services like Toolforge. Recently, the Cloud Services team at Wikimedia has been reviewing how we do NFS.

The current situation

NFS is a central piece of technology for some of the services that the Wikimedia Cloud Services team offers to the community. We have several shares that power different use cases: Toolforge user home directories live on NFS, and Cloud VPS users can also access dumps using this protocol. The current setup involves several physical hardware servers, with about 20TB of storage, offering shares over 10G links to the cloud. For the system to be more fault-tolerant, we duplicate each share for redundancy using DRBD.

Running NFS on dedicated hardware servers has traditionally offered us advantages, mostly in performance and capacity.

As time has passed, we have been enumerating more and more reasons to review how we do NFS. For one, the current setup is in violation of some of our internal rules regarding realm separation. Additionally, we had been longing for additional flexibility in managing our servers: we wanted to use virtual machines managed by Openstack Nova. The DRBD-based high-availability system required a mostly hand-crafted procedure for failover/failback. There are also some scalability concerns, as NFS is easy to scale up but not to scale horizontally, and of course we have to be able to keep the tenancy setup while doing so, something that NFS does by using LDAP/Unix users and that may get complicated too when growing.

In general, the servers have become "too big to fail", clearly technical debt, and it has taken us years to decide on taking on the task of rethinking the architecture. It's worth mentioning that in an ideal world we wouldn't depend on NFS, but the truth is that it will still be a central piece of infrastructure for years to come in services like Toolforge.

Over a series of brainstorming meetings, the WMCS team evaluated the situation and sorted out the many moving parts. The team managed to boil down the potential service future to two competing options:

Adopt and introduce a new Openstack component into our cloud: Manila. This was the right choice if we were interested in a general NFS-as-a-service offering for our Cloud VPS users.
Put the data on Cinder volumes and serve NFS from a couple of virtual machines created by hand. This was the right choice if we wanted something that required low effort to engineer and adopt.

Then we decided to research both options in parallel. For a number of reasons, the evaluation was timeboxed to three weeks. Both ideas had a couple of points in common: the NFS data would be stored on our Ceph farm via Cinder volumes, and we would rely on Ceph reliability to avoid using DRBD. Another open topic was how to back up data from Ceph, to store our important bits in more than one basket. We will get to the backup topic later.

The Manila experiment

The Wikimedia Foundation was an early adopter of some Openstack components (Nova, Glance, Designate, Horizon), but Manila was never evaluated for usage until now. Our approach for this experiment was to closely follow the upstream guidelines. We read the documentation and tried to understand the different setups you can build with Manila. As we often feel with other Openstack components, the documentation doesn't perfectly describe how to introduce a given component in your particular local setup. Here we use an admin-controlled flat-topology Neutron network. This network is shared by all tenants (or projects) in our Openstack deployment. Also, Manila can use many different driver backends, for things like NetApps or CephFS, which we don't use yet. After some research, the generic driver was the one that seemed to better fit our use case. The generic driver leverages Nova virtual machine instances plus Cinder volumes to create and manage the shares.

In general, Manila supports two operational modes, depending on whether it should create/destroy the share servers (i.e., the virtual machine instances) or not. This option is called driver_handles_share_servers (or DHSS) and takes a boolean value. We were interested in trying with DHSS=true, to really benefit from the potential of the setup.

Manila diagram: NFS idea 6, original image in Wikitech

So, after sorting all these variables, we moved on with our initial testing. We built a PoC setup as depicted in the diagram above, with the manila-share component running in a virtual machine inside the cloud. The PoC led to us reporting several bugs upstream:

In some cases we tried to address these bugs ourselves:

It's worth mentioning that the upstream community was extra-welcoming to us, and we're thankful for that. However, at the end of our three-week period, our Manila setup still wasn't working as expected. Your experience may differ with other drivers, perhaps the ZFSonLinux or the CephFS ones. In general, we were having trouble making the setup work as expected, so we decided to abandon this approach in favor of the other option we were considering at the beginning.

Simple virtual machine serving NFS

The alternative was to create a Nova virtual machine instance by hand and to configure it using Puppet. We have been investing in an automation framework lately, so the idea is to not actually create the server by hand. Anyway, the data would be decoupled from the instance into Cinder volumes, which led us to the question we left for later: how should we back up those terabytes of important information? Just to be clear, the backup problem was independent of the above options; with Manila we would still have had to solve the same challenge. We would like to see our data backed up somewhere other than in Ceph. And that's exactly where we are right now. We've been exploring different backup strategies and will finally use the Cinder backup API.

Conclusion

The iteration will end with the dedicated NFS hardware servers being stopped, and the shares being served from within the cloud. The migration will take some time to happen because we will check and double-check that everything works as expected (including from the performance point of view) before making definitive changes. We already have some plans to make sure our users experience as little service impact as possible. The most troublesome shares will be those related to Toolforge. At some point we will need to disallow writes to the NFS share, rsync the data out of the hardware servers into the Cinder volumes, point the NFS clients to the new virtual machines, and then enable writes again. The main Toolforge share has about 8TB of data, so this will take a while.

We will have more updates in the future. Who knows, perhaps our next-next iteration, in a couple of years, will see us adopting Openstack Manila for good.

Featured image credit: File:(from break water) Manila Skyline panoramio.jpg, ewol, CC BY-SA 3.0

31 May 2021

Russell Coker: Some Ideas About Storage Reliability

Hard Drive Brands

When people ask for advice about what storage to use they often get answers like "use brand X, it works well for me and brand Y had a heap of returns a few years ago". I'm not convinced there is any difference between the small number of manufacturers that are still in business.

One problem we face with reliability of computer systems is that the rate of change is significant, so every year there will be new technological developments to improve things and every company will take advantage of them. Storage devices are unique among computer parts for their requirement for long-term reliability. For most other parts in a computer system a fault that involves total failure is usually easy to fix, and even a fault that causes unreliable operation usually won't spread its damage too far before being noticed (except in corner cases like RAM corruption causing corrupted data on disk).

Every year each manufacturer will bring out newer disks that are bigger, cheaper, faster, or all three. Those disks will be expected to remain in service for 3 years in most cases, and for consumer disks often 5 years or more. The manufacturers can't test the new storage technology for even 3 years before releasing it, so their ability to prove the reliability is limited. Maybe you could buy some 8TB disks now that were manufactured to the same design as used 3 years ago, but if you buy 12TB consumer grade disks, the 20TB+ data center disks, or any other device that is pushing the limits of new technology, then you know that the manufacturer never tested it running for as long as you plan to run it. Generally the engineering is done well and they don't have many problems in the field. Sometimes a new range of disks has a significant number of defects, but that doesn't mean the next series of disks from the same manufacturer will have problems.

The issues with SSDs are similar to the issues with hard drives but a little different. I'm not sure how much of the improvement in SSDs recently has been due to new technology and how much is due to new manufacturing processes. I had a bad experience with a nameless brand SSD a couple of years ago and now stick to the better known brands. So for SSDs I don't expect a great quality difference between devices that have the names of major computer companies on them, but stuff that comes from China with the name of the discount web store stamped on it is always a risk.

Hard Drive vs SSD

A few years ago some people were still avoiding SSDs due to the perceived risk of new technology. The first problem with this is that hard drives have lots of new technology in them. The next issue is that hard drives often have some sort of flash storage built in; presumably a "SSHD" or "Hybrid Drive" gets all the potential failures of hard drives and SSDs.

One theoretical issue with SSDs is that filesystems have been (in theory at least) designed to cope with hard drive failure modes, not SSD failure modes. The problem with that theory is that most filesystems don't cope with data corruption at all. If you want to avoid losing data when a disk returns bad data and claims it to be good then you need to use ZFS, BTRFS, the NetApp WAFL filesystem, Microsoft ReFS (with the optional file data checksum feature enabled), or Hammer2 (which wasn't production ready last time I tested it).

Some people are concerned that their filesystem won't support "wear levelling" for SSD use. When a flash storage device is exposed to the OS via a block interface like SATA there isn't much possibility of wear levelling. If flash storage exposes that level of hardware detail to the OS then you need a filesystem like JFFS2 to use it. I believe that most SSDs have something like JFFS2 inside the firmware and use it to expose what looks like a regular block device.

Another common concern about SSDs is that they will wear out from too many writes. Lots of people are using SSD for the ZIL (ZFS Intent Log) on the ZFS filesystem, which means that SSD devices become the write bottleneck for the system and in some cases are run that way 24*7. If there was a problem with SSDs wearing out I expect that ZFS users would be complaining about it. Back in 2014 I wrote a blog post about whether swap would break SSD [1] (conclusion: it won't). Apart from the nameless brand SSD I mentioned previously, all of my SSDs in question are still in service. I have recently had a single Samsung 500G SSD give me 25 read errors (which BTRFS recovered from the other Samsung SSD in the RAID-1); I have yet to determine if this is an ongoing issue with the SSD in question or a transient thing. I also had a 256G SSD in a Hetzner DC give 23 read errors a few months after it gave a SMART alert about Wear_Leveling_Count (old age).

Hard drives have moving parts and are therefore inherently more susceptible to vibration than SSDs; they are also more likely to cause vibration related problems in other disks. I will probably write a future blog post about disks that work in small arrays but not in big arrays.

My personal experience is that SSDs are at least as reliable as hard drives even when run in situations where vibration and heat aren't issues. Vibration or a warm environment can cause data loss from hard drives in situations where SSDs will work reliably.

NVMe

I think that NVMe isn't very different from other SSDs in terms of the actual storage. But the different interface gives some interesting possibilities for data loss. OS, filesystem, and motherboard bugs are all potential causes of data loss when using a newer technology.

Future Technology

The latest thing for high end servers is Optane Persistent memory [2], also known as DCPMM. This is NVRAM that fits in a regular DDR4 DIMM socket and gives performance somewhere between NVMe and RAM, with capacity similar to NVMe. One of the ways of using this is "Memory Mode", where the DCPMM is seen by the OS as RAM and the actual RAM caches the DCPMM (essentially this is swap space at the hardware level); this could make multiple terabytes of RAM not ridiculously expensive. Another way of using it is "App Direct Mode", where the DCPMM can either be a simulated block device for regular filesystems or a byte addressable device for application use. The final option is "Mixed Memory Mode", which has some DCPMM in "Memory Mode" and some in "App Direct Mode".

This has much potential for use of backups, and to make things extra exciting "App Direct Mode" has RAID-0 but no other form of RAID.

Conclusion

I think that the best things to do for storage reliability are to have ECC RAM to avoid corruption before the data gets written, use reasonable quality hardware (buy stuff with a brand that someone will want to protect), and avoid new technology. New hardware, and the new software needed to talk to new hardware interfaces, will have bugs, and sometimes those bugs will lose data.

Filesystems like BTRFS and ZFS are needed to cope with storage devices returning bad data and claiming it to be good; this is a very common failure mode.

Backups are a good thing.

30 May 2021

Russell Coker: Wifi Performance on Linux

Wifi usually just works. In the past I haven't had to worry much about performance, as for home use things have always been bearable and at work it's never been my job, so I just file a bug report with the relevant people when things go wrong. But a few years ago I had some problems. For my home network I got a free Wifi AP which wasn't performing well.

My AP supported 802.11 modes b/g or g/n (b, g, and n are slow, medium, and fast speeds). I initially had the AP running in b/g mode because I had an 802.11b USB wifi device that I used. When I replaced that with one that did 802.11g I tried changing the AP to g/n mode, but performance was even worse on my laptop (although quite good on phones) so I switched back.

For phones it appeared to work well, giving 54Mb/s, while on my laptop (a second hand Thinkpad X1 Carbon) it was giving 11Mb/s at best and often much less than that. The best demonstration of problems was to start transferring a large file while pinging a system on the LAN the AP was connected to. Usually it would give ping times of 1s or more, sometimes 5s+ ping times. While this was happening the "Invalid misc" count increased rapidly, often by more than 100 per second.

The results of Google searches suggest that "Invalid misc" is due to interference and recommend changing the channel. My AP had been on channel 1, which had performed poorly; channels 2-8 were ok, and channel 9 seemed reasonably good. As an aside, trying all channels manually is not a good idea: it takes a lot of time and gives little useful data. After changing to channel 9 it still only gave about 500KB/s when transferring large files, with ping times of about 100ms, but that's a big improvement.

I tried running "iwlist scanning" to scan the Wifi network for other APs. That showed that channel 1 was used a lot but didn't make it clear what I should do other than that.

The next thing I tried was the Wifi Analyser app on Android [1] (which doesn't work on my latest phone; I don't know if it's still being actively maintained, but it will definitely work on older phones). That has a nice graph mode that shows which channels are used and how the frequencies spread and interfere with other channels. One thing I hadn't realised before I looked at the graphs is that 802.11n uses 4 channels and interferes past that. If you have two 802.11n devices you don't have much space left out of the 14 channels available. To make more space I configured the Wifi AP in my ADSL modem to 802.11b/g mode and assigned it a channel away from the others, making 4 channels available with no interference. After that iwconfig reported between 60 and 120Mb/s and I got consistent transfer rates over 1.5MB/s while ping times remained below 100ms.

The 5GHz frequency range is less congested. But at the time I didn't feel like buying 5GHz equipment.

Since that time I have signed up with an ISP that had a good deal on a Wifi AP that had 5GHz. Now I have all my devices configured to use 5GHz or 2.4GHz depending on which they think is best. So there are fewer devices on 2.4GHz, and the AP is configured for 20MHz channel width in the 2.4GHz range (which means 802.11b/g).

Conclusion

802.11n seems to be a bad idea unless you run the only AP in an area. In a suburban area you will have 3 other houses broadcasting in your area and 802.11n is bad for everyone. The worst case scenario would be one person using 802.11n and interfering with everyone else's 802.11g, and then having everyone else turn on 802.11n to try and make things faster.

5GHz is less congested as most people run old hardware. It also has a shorter range, which has the upside of getting less interference from other people. I'm considering installing 5GHz APs at both ends of my house and configuring all my new devices to not use 2.4GHz.

Wifi spectrum analysis software is much better than manual testing of channels or trying to deduce things from the output of "iwlist scanning".

[1] https://play.google.com/store/apps/details?id=com.farproc.wifi.analyzer

12 August 2020

Michael Stapelberg: distri: 20x faster initramfs (initrd) from scratch

In case you are not yet familiar with why an initramfs (or initrd, or initial ramdisk) is typically used when starting Linux, let me quote the Wikipedia definition:

"[…] initrd is a scheme for loading a temporary root file system into memory, which may be used as part of the Linux startup process […] to make preparations before the real root file system can be mounted."

Many Linux distributions do not compile all file system drivers into the kernel, but instead load them on-demand from an initramfs, which saves memory. Another common scenario in which an initramfs is required is full-disk encryption: the disk must be unlocked from userspace, but since userspace is encrypted, an initramfs is used.

Motivation

Thus far, building a distri disk image was quite slow. This is on an AMD Ryzen 3900X 12-core processor (2019):

distri % time make cryptimage serial=1
80.29s user 13.56s system 186% cpu 50.419 total # 19s image, 31s initrd

Of these 50 seconds, dracut's initramfs generation accounts for 31 seconds (62%)!

Initramfs generation time drops to 8.7 seconds once dracut no longer needs to use the single-threaded gzip(1) and can use the multi-threaded replacement pigz(1) instead. This brings the total time to build a distri disk image down to:

distri % time make cryptimage serial=1
76.85s user 13.23s system 327% cpu 27.509 total # 19s image, 8.7s initrd

Clearly, when you use dracut on any modern computer, you should make pigz available. dracut should fail to compile unless one explicitly opts into the known-slower gzip. For more thoughts on optional dependencies, see "Optional dependencies don't work".

But why does it still take 8.7 seconds? Can we go faster?

The answer is Yes! I recently built a distri-specific initramfs I'm calling minitrd. I wrote both big parts from scratch:

the initramfs generator program (distri initrd)
a custom Go userland (cmd/minitrd), running as /init in the initramfs.

minitrd generates the initramfs image in 400ms, bringing the total time down to:

distri % time make cryptimage serial=1
50.09s user 8.80s system 314% cpu 18.739 total # 18s image, 400ms initrd

(The remaining time is spent in preparing the file system, then installing and configuring the distri system, i.e. preparing a disk image you can run on real hardware.)

How can minitrd be 20 times faster than dracut?

dracut is mainly written in shell, with a C helper program. It drives the generation process by spawning lots of external dependencies (e.g. ldd or the dracut-install helper program). I assume that the combination of using an interpreted language (shell) that spawns lots of processes and precludes a concurrent architecture is to blame for the poor performance.

minitrd is written in Go, with speed as a goal. It leverages concurrency and uses no external dependencies; everything happens within a single process (but with enough threads to saturate modern hardware).

Measuring early boot time using qemu, the dracut-generated initramfs took 588ms to display the full disk encryption passphrase prompt, whereas minitrd took only 195ms.

The rest of this article dives deeper into how minitrd works.

What does an initramfs do?

Ultimately, the job of an initramfs is to make the root file system available and continue booting the system from there. Depending on the system setup, this involves the following 5 steps:

1. Load kernel modules to access the block devices with the root file system

Depending on the system, the block devices with the root file system might already be present when the initramfs runs, or some kernel modules might need to be loaded first. On my Dell XPS 9360 laptop, the NVMe system disk is already present when the initramfs starts, whereas in qemu, we need to load the virtio_pci module, followed by the virtio_scsi module.

How will our userland program know which kernel modules to load? Linux kernel modules declare patterns for their supported hardware as an alias, e.g.:

initrd# grep virtio_pci lib/modules/5.4.6/modules.alias
alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci

Devices in sysfs have a modalias file whose content can be matched against these declarations to identify the module to load:

initrd# cat /sys/devices/pci0000:00/*/modalias
pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00
pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00
[…]
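The matching step can be sketched in a few lines of Go. This is only an illustration, not minitrd's actual code: the names `parseAliases` and `modulesFor` are made up for this example, and Go's `path.Match` stands in for the fnmatch-style matching that the kernel tooling uses, which is close enough for patterns like the one shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"strings"
)

// alias is one "alias <pattern> <module>" line from modules.alias.
type alias struct{ pattern, module string }

// parseAliases (hypothetical helper) reads modules.alias-style content and
// returns the pattern/module pairs in file order.
func parseAliases(content string) []alias {
	var out []alias
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 3 || fields[0] != "alias" {
			continue // skip comments and anything that is not an alias line
		}
		out = append(out, alias{pattern: fields[1], module: fields[2]})
	}
	return out
}

// modulesFor (hypothetical helper) returns every module whose alias pattern
// matches the given modalias string. path.Match is an assumption standing in
// for fnmatch(3).
func modulesFor(aliases []alias, modalias string) []string {
	var mods []string
	for _, a := range aliases {
		if ok, err := path.Match(a.pattern, modalias); err == nil && ok {
			mods = append(mods, a.module)
		}
	}
	return mods
}

func main() {
	aliases := parseAliases("alias pci:v00001AF4d*sv*sd*bc*sc*i* virtio_pci\n")
	modalias := "pci:v00001AF4d00001005sv00001AF4sd00000004bc00scFFi00"
	fmt.Println(modulesFor(aliases, modalias))
}
```

Fed with the alias line and the first modalias string from above, the sketch reports virtio_pci as the module to load.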

Hence, for the initial round of module loading, it is sufficient to locate all modalias files within sysfs and load the responsible modules. Loading a kernel module can result in new devices appearing. When that happens, the kernel sends a uevent, which the uevent consumer in userspace receives via a netlink socket. Typically, this consumer is udev(7), but in our case, it's minitrd. For each uevent message that comes with a MODALIAS variable, minitrd will load the relevant kernel module(s). When loading a kernel module, its dependencies need to be loaded first. Dependency information is stored in the modules.dep file in a Makefile-like syntax:

initrd# grep virtio_pci lib/modules/5.4.6/modules.dep
kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
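As a rough sketch of how such a file could be turned into a load order, the following Go snippet parses the Makefile-like lines and walks the dependencies depth-first, so that every dependency comes before the module that needs it. The function names are invented for this example, and the two extra lines with empty dependency lists in `main` are assumed sample data, not taken from a real modules.dep.

```go
package main

import (
	"fmt"
	"strings"
)

// parseDeps (hypothetical helper) parses modules.dep content, where each
// line has the form "<module>: <dependency> <dependency> ...".
func parseDeps(content string) map[string][]string {
	deps := make(map[string][]string)
	for _, line := range strings.Split(content, "\n") {
		mod, rest, ok := strings.Cut(line, ":")
		if !ok {
			continue // not a "module: deps" line
		}
		deps[mod] = strings.Fields(rest)
	}
	return deps
}

// loadOrder (hypothetical helper) returns mod and everything it depends on,
// dependencies first. It does not assume that the listed dependencies are
// already complete or sorted.
func loadOrder(deps map[string][]string, mod string) []string {
	var order []string
	seen := make(map[string]bool)
	var visit func(m string)
	visit = func(m string) {
		if seen[m] {
			return
		}
		seen[m] = true
		for _, d := range deps[m] {
			visit(d)
		}
		order = append(order, m)
	}
	visit(mod)
	return order
}

func main() {
	// First line as shown above; the other two lines are assumed sample data.
	deps := parseDeps(`kernel/drivers/virtio/virtio_pci.ko: kernel/drivers/virtio/virtio_ring.ko kernel/drivers/virtio/virtio.ko
kernel/drivers/virtio/virtio_ring.ko:
kernel/drivers/virtio/virtio.ko:`)
	for _, m := range loadOrder(deps, "kernel/drivers/virtio/virtio_pci.ko") {
		fmt.Println(m)
	}
}
```

The module that was asked for always comes last in the returned slice, so a loader can simply walk the slice front to back.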

To load a module, we can open its file and then call the Linux-specific finit_module(2) system call. Some modules are expected to return an error code, e.g. ENODEV or ENOENT when some hardware device is not actually present. Side note: next to the textual versions, there are also binary versions of the modules.alias and modules.dep files. Presumably, those can be queried more quickly, but for simplicity, I have not (yet?) implemented support in minitrd.
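To make the uevent handling mentioned above more concrete, here is a small sketch of how a kernel uevent datagram could be decoded once it has been read from the netlink socket. A kernel uevent consists of an "ACTION@DEVPATH" header followed by NUL-separated KEY=VALUE pairs. The function name `parseUevent` and the sample device path are assumptions made for this example; opening the netlink socket itself is left out.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// parseUevent (hypothetical helper) splits a kernel uevent datagram into its
// "ACTION@DEVPATH" header and its NUL-separated KEY=VALUE variables.
func parseUevent(msg []byte) (header string, vars map[string]string) {
	vars = make(map[string]string)
	fields := bytes.Split(bytes.TrimRight(msg, "\x00"), []byte{0})
	header = string(fields[0])
	for _, f := range fields[1:] {
		if k, v, ok := strings.Cut(string(f), "="); ok {
			vars[k] = v
		}
	}
	return header, vars
}

func main() {
	// Sample message; the device path is made up for illustration.
	msg := []byte("add@/devices/pci0000:00/0000:00:04.0\x00" +
		"ACTION=add\x00" +
		"MODALIAS=pci:v00001AF4d00001004sv00001AF4sd00000008bc01sc00i00\x00")
	header, vars := parseUevent(msg)
	fmt.Println(header)
	if modalias, ok := vars["MODALIAS"]; ok {
		fmt.Println("would look up a module for", modalias)
	}
}
```

Messages without a MODALIAS variable can simply be ignored for the purpose of module loading.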

2. Console settings: font, keyboard layout

Setting a legible font is necessary for hi-dpi displays. On my Dell XPS 9360 (3200 x 1800 QHD+ display), the following works well:

initrd# setfont latarcyrheb-sun32

Setting the user's keyboard layout is necessary for entering the LUKS full-disk encryption passphrase in their preferred keyboard layout. I use the NEO layout:

initrd# loadkeys neo

3. Block device identification

In the Linux kernel, block device enumeration order is not necessarily the same on each boot. Even if it was deterministic, device order could still be changed when users modify their computer's device topology (e.g. connect a new disk to a formerly unused port). Hence, it is good style to refer to disks and their partitions with stable identifiers. This also applies to boot loader configuration, and so most distributions will set a kernel parameter such as `root=UUID=1fa04de7-30a9-4183-93e9-1b0061567121`.

Identifying the block device or partition with the specified `UUID` is the initramfs's job. Depending on what the device contains, the UUID comes from a different place. For example, `ext4` file systems have a UUID field in their file system superblock, whereas LUKS volumes have a UUID in their LUKS header. Canonically, probing a device to extract the UUID is done by `libblkid` from the `util-linux` package, but the logic can easily be re-implemented in other languages and changes rarely. `minitrd` comes with its own implementation to avoid cgo or running the `blkid(8)` program.

4. LUKS full-disk encryption unlocking (only on encrypted systems)

Unlocking a LUKS-encrypted volume is done in userspace. The kernel handles the crypto, but reading the metadata, obtaining the passphrase (or e.g. key material from a file) and setting up the device mapper table entries are done in userspace.

initrd# modprobe algif_skcipher
initrd# cryptsetup luksOpen /dev/sda4 cryptroot1

After the user has entered their passphrase, the root file system can be mounted:

initrd# mount /dev/dm-0 /mnt

5. Continuing the boot process (switch_root)

Now that everything is set up, we need to pass execution to the init program on the root file system with a careful sequence of chdir(2), mount(2), chroot(2), chdir(2) and execve(2) system calls that is explained in this busybox switch_root comment.

initrd# mount -t devtmpfs dev /mnt/dev
initrd# exec switch_root -c /dev/console /mnt /init

To conserve RAM, the files in the temporary file system to which the initramfs archive is extracted are typically deleted.

How is an initramfs generated?

An initramfs image (more accurately: archive) is a compressed cpio archive. Typically, gzip compression is used, but the kernel supports a bunch of different algorithms and distributions such as Ubuntu are switching to lz4. Generators typically prepare a temporary directory and feed it to the `cpio(1)` program. In `minitrd`, we read the files into memory and generate the cpio archive using the go-cpio package. We use the pgzip package for parallel gzip compression. The following files need to go into the cpio archive:

minitrd Go userland

The `minitrd` binary is copied into the cpio archive as `/init` and will be run by the kernel after extracting the archive. Like the rest of distri, `minitrd` is built statically without cgo, which means it can be copied as-is into the cpio archive.

Linux kernel modules

Aside from the `modules.alias` and `modules.dep` metadata files, the kernel modules themselves reside in e.g. `/lib/modules/5.4.6/kernel` and need to be copied into the cpio archive. Copying all modules results in an 80 MiB archive, so it is common to only copy modules that are relevant to the initramfs's features. This reduces archive size to 24 MiB. The filtering relies on hard-coded patterns and module names. For example, disk encryption related modules are all kernel modules underneath `kernel/crypto`, plus `kernel/drivers/md/dm-crypt.ko`. When generating a host-only initramfs (works on precisely the computer that generated it), some initramfs generators look at the currently loaded modules and just copy those.

Console Fonts and Keymaps

The `kbd` package's `setfont(8)` and `loadkeys(1)` programs load console fonts and keymaps from `/usr/share/consolefonts` and `/usr/share/keymaps`, respectively. Hence, these directories need to be copied into the cpio archive. Depending on whether the initramfs should be generic (work on many computers) or host-only (works on precisely the computer/settings that generated it), the entire directories are copied, or only the required font/keymap.

cryptsetup, setfont, loadkeys

These programs are (currently) required because `minitrd` does not implement their functionality. As they are dynamically linked, not only the programs themselves need to be copied, but also the ELF dynamic linking loader (path stored in the `.interp` ELF section) and any ELF library dependencies. For example, `cryptsetup` in distri declares the ELF interpreter `/ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2` and declares dependencies on shared libraries `libcryptsetup.so.12`, `libblkid.so.1` and others. Luckily, in distri, packages contain a `lib` subdirectory containing symbolic links to the resolved shared library paths (hermetic packaging), so it is sufficient to mirror the lib directory into the cpio archive, recursing into shared library dependencies of shared libraries. `cryptsetup` also requires the GCC runtime library `libgcc_s.so.1` to be present at runtime, and will abort with an error message about not being able to call `pthread_cancel(3)` if it is unavailable.

time zone data

To print log messages in the correct time zone, we copy `/etc/localtime` from the host into the cpio archive.

minitrd outside of distri?

I currently have no desire to make `minitrd` available outside of distri. While the technical challenges (such as extending the generator to not rely on distri's hermetic packages) are surmountable, I don't want to support people's initramfs remotely. Also, I think that people's efforts should in general be spent on rallying behind `dracut` and making it work faster, thereby benefiting all Linux distributions that use dracut (increasingly more). With `minitrd`, I have demonstrated that significant speed-ups are achievable.

Conclusion

It was interesting to dive into how an initramfs really works. I had been working with the concept for many years, from small tasks such as "debug why the encrypted root file system is not unlocked" to more complicated tasks such as "set up a root file system on DRBD for a high-availability setup". But even with that sort of experience, I didn't know all the details, until I was forced to implement every little thing. As I suspected going into this exercise, `dracut` is much slower than it needs to be. Re-implementing its generation stage in a modern language instead of shell helps a lot. Of course, my `minitrd` does a bit less than `dracut`, but not drastically so. The overall architecture is the same. I hope my effort helps with two things:

As a teaching implementation: instead of wading through the various components that make up a modern initramfs (udev, systemd, various shell scripts, …), people can learn about how an initramfs works in a single place.

I hope the significant time difference motivates people to improve `dracut`.

Appendix: qemu development environment

Before writing any Go code, I did some manual prototyping. Learning how other people prototype is often immensely useful to me, so I'm sharing my notes here. First, I copied all kernel modules and a statically built busybox binary:

% mkdir -p lib/modules/5.4.6
% cp -Lr /ro/lib/modules/5.4.6/* lib/modules/5.4.6/
% cp ~/busybox-1.22.0-amd64/busybox sh

To generate an initramfs from the current directory, I used:

% find . | cpio -o -H newc | pigz > /tmp/initrd

In distri's Makefile, I append these flags to the QEMU invocation:

-kernel /tmp/kernel \
-initrd /tmp/initrd \
-append "root=/dev/mapper/cryptroot1 rdinit=/sh ro console=ttyS0,115200 rd.luks=1 rd.luks.uuid=63051f8a-54b9-4996-b94f-3cf105af2900 rd.luks.name=63051f8a-54b9-4996-b94f-3cf105af2900=cryptroot1 rd.vconsole.keymap=neo rd.vconsole.font=latarcyrheb-sun32 init=/init systemd.setenv=PATH=/bin rw vga=836"

The vga= mode parameter is required for loading font latarcyrheb-sun32. Once in the busybox shell, I manually prepared the required mount points and kernel modules:

ln -s sh mount
ln -s sh lsmod
mkdir /proc /sys /run /mnt
mount -t proc proc /proc
mount -t sysfs sys /sys
mount -t devtmpfs dev /dev
modprobe virtio_pci
modprobe virtio_scsi

As a next step, I copied cryptsetup and dependencies into the initramfs directory:

% for f in /ro/cryptsetup-amd64-2.0.4-6/lib/*; do full=$(readlink -f $f); rel=$(echo $full | sed 's,^/,,g'); mkdir -p $(dirname $rel); install $full $rel; done
% ln -s ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-linux-x86-64.so.2
% cp /ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so ro/glibc-amd64-2.27-3/out/lib/ld-2.27.so
% cp -r /ro/cryptsetup-amd64-2.0.4-6/lib ro/cryptsetup-amd64-2.0.4-6/
% mkdir -p ro/gcc-libs-amd64-8.2.0-3/out/lib64/
% cp /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1
% ln -s /ro/gcc-libs-amd64-8.2.0-3/out/lib64/libgcc_s.so.1 ro/cryptsetup-amd64-2.0.4-6/lib
% cp -r /ro/lvm2-amd64-2.03.00-6/lib ro/lvm2-amd64-2.03.00-6/

In busybox, I used the following commands to unlock the root file system:

modprobe algif_skcipher
./cryptsetup luksOpen /dev/sda4 cryptroot1
mount /dev/dm-0 /mnt

25 July 2017

Reproducible builds folks: Reproducible Builds: week 117 in Buster cycle

Here's what happened in the Reproducible Builds effort between Sunday July 16 and Saturday July 22 2017:

Toolchain development

Bernhard M. Wiedemann wrote a tool to automatically run through different sources of non-determinism, and report which of these caused irreproducibility. Dan Kegel's patches to fpm were merged.

Bugs filed

Patches submitted upstream:

Bernhard M. Wiedemann:
- Sort file lists:
  - eric5 merged
  - libsass-python merged
  - tcl merged
  - python3.6 in progress
  - drbd
  - blobwars
- Instead of fixing ordering issues in a custom .pak archive (same as blobwars above), we allow installing individual data files to avoid the issue:
  - edgar avoid sort
- Omit the build date entirely:
  - fence-agents
- SOURCE_DATE_EPOCH support:
  - criu, merged
  - dapl, merged
  - shorewall, merged
  - youtube-dl in progress
  - automake
  - crosstool-ng
  - docker
  - drbd
  - drbd-utils
  - getdp
  - infinipath-psm
  - opa-fm
  - opa-psm2
  - texinfo
- geany/glfw unknown

Patches filed in Debian:

Adrian Bunk:
- #868599 filed against ocaml-curses.
- #868609 filed against le.
- #868612 filed against mixxx.
- #868855 filed against softhsm2.
- #868858 filed against gwc.
- #869086 filed against dsniff.
Chris Lamb:
- #868790 filed against castle-game-engine, forwarded upstream.
- #868843 filed against xorg-server, forwarded upstream.
- #869516 filed against libcdio.
Drew Parsons:
- #868505 filed against sdpa.
Lucas Nussbaum:
- #868904 filed against gwc.
- #868927 filed against python-pybedtools.
Sascha Steinbiss:
- #868772 filed against ragel.

Reviews of unreproducible packages

73 package reviews have been added, 44 have been updated and 50 have been removed in this week, adding to our knowledge about identified issues. No issue types were updated.

Weekly QA work

During our reproducibility testing, FTBFS bugs have been detected and reported by:

Adrian Bunk (106)
Daniel Stender (1)
Drew Parsons (1)
Félix Sipma (1)
Lucas Nussbaum (25)

diffoscope development

Juliana Rodrigues:
- Add new XML comparator. (Closes: #866120)
Guangyuan Yang:
- Fix 2 cases in test_device on FreeBSD
Chris Lamb:
- comparators.xml: Fix EPUB "missing file" tests; they ship a META-INF/container.xml file.
- comparators.sqlite: Simplify file detection in Sqlite3Database.RE_FILE_TYPE
- Style and attribution fixes to XML comparator and comparators.directory
Ximin Luo:
- main, logging: restore old logger settings to avoid pytest vomiting in certain situations
- comparators/directory: Fix #868534 by expecting less strict test output

reprotest development

Ximin Luo:
- Use autopkgtest upstream paths, makes things easier to import
- Add script for importing autopkgtest code, and import autopkgtest 4.4

Ximin also restarted the discussion with autopkgtest-devel about code reuse for reprotest. Santiago Torres began a series of patches to make reprotest more distro-agnostic, with the aim of making it usable on Arch Linux. Ximin reviewed these patches.

Misc.

This week's edition was written by Ximin Luo, Bernhard M. Wiedemann and Chris Lamb & reviewed by a bunch of Reproducible Builds folks on IRC & the mailing lists.

5 May 2017

Patrick Matthäi: Be careful: Upgrading Debian Jessie to Stretch, with Pacemaker DRBD and a nested ext4 LVM hosted on VMware products

Detached DRBD (diskless)

Some time ago I set up some new Pacemaker clustered nodes with a fresh Debian Stretch installation. I followed our standard installation guide and also created shared replicated DRBD storage, but whenever I tried to mount the ext4 storage, DRBD detached the disks on both nodes with I/O errors. After recreating it, using other storage volumes and testing my ProLiant hardware (whoops, I thought it had a defect), the problem still occurred. Somewhere in the middle of testing, however, a quicker setup without LVM worked fine. Much later I found this (the only post about it at the time) on the DRBD-user mailing list: [0]
This means that if you use the combination VMware product -> Debian Stretch -> local storage -> DRBD -> LVM -> ext4, you will be affected by this bug. This happens because VMware always advertises that the guest is able to support the WRITE SAME feature, which is wrong. As of the DRBD version shipped with Stretch, DRBD also supports WRITE SAME, so it tries to use this feature, which then fails.
This is, by the way, also the reason why VMware users see this in their dmesg:

WRITE SAME failed. Manually zeroing.

As a workaround, I am now using systemd (via tmpfiles.d) to disable WRITE SAME for all attached block devices in the guest. Simply run the following:

for i in `find /sys/block/*/device/scsi_disk/*/max_write_same_blocks`; do echo "w $i - - - - 0"; done > /etc/tmpfiles.d/write_same.conf

[0]: http://lists.linbit.com/pipermail/drbd-user/2017-January/022931.html

Pacemaker failovers with DRBD+LVM do not work

If you use DRBD with a nested LVM, you already had to add the following lines to your /etc/lvm/lvm.conf in past Debian releases (assuming that sdb and sdc are DRBD devices):

filter = [ "r|/dev/sdb.*|", "r|/dev/sdc.*|" ]
write_cache_state = 0

With Debian Stretch this is not enough. Your failovers will result in a broken state on the second node, because it cannot find your LVs and VGs. I found out that killing lvmetad helps, so I also added a global_filter (which applies to all LVM services):

global_filter = [ "r|/dev/sdb.*|", "r|/dev/sdc.*|" ]

But this didn't help either. My only solution was to disable lvmetad (which I am not using anyway). Adding all of this in combination now works for me, and failovers are as smooth as with Jessie:

filter = [ "r|/dev/sdb.*|", "r|/dev/sdc.*|" ]
global_filter = [ "r|/dev/sdb.*|", "r|/dev/sdc.*|" ]
write_cache_state = 0
use_lvmetad = 0

Do not forget to update your initrd, so that the LVM configuration is updated on booting your server:

update-initramfs -k all -u

Reboot, that's it :)

19 October 2016

Reproducible builds folks: Reproducible Builds: week 77 in Stretch cycle

What happened in the Reproducible Builds effort between Sunday October 9 and Saturday October 15 2016:

Media coverage

despinosa wrote a blog post on Vala and reproducibility
h01ger and lynxis gave a talk called "From Reproducible Debian builds to Reproducible OpenWrt, LEDE" (video, slides) at the OpenWrt Summit 2016 held in Berlin, together with ELCE, held by the Linux Foundation.
A discussion on debian-devel@ resulted in a nice quotable comment from Paul Wise: "(Reproducible) builds from source (with continuous rechecking) is the only way to have enough confidence that a Debian user has the freedoms promised to them by the Debian social contract."
Chris Lamb will present a talk at Software Freedom Kosovo on reproducible builds on Saturday 22nd October.

Documentation update

After discussions with HW42, Steven Chamberlain, Vagrant Cascadian, Daniel Shahaf, Christopher Berg, Daniel Kahn Gillmor and others, Ximin Luo has started writing up more concrete and detailed design plans for setting SOURCE_ROOT_DIR for reproducible debugging symbols, buildinfo security semantics and buildinfo security infrastructure.

Toolchain development and fixes

Dmitry Shachnev noted that our patch for #831779 has been temporarily rejected by docutils upstream; we are trying to persuade them again. Tony Mancill uploaded javatools/0.59 to unstable containing the original patch by Chris Lamb. This fixed an issue where documentation Recommends: substvars would not be reproducible. Ximin Luo filed bug 77985 to GCC as a pre-requisite for future patches to make debugging symbols reproducible.

Packages reviewed and fixed, and bugs filed

The following updated packages have become reproducible - in our current test setup - after being fixed:

cobbler/2.6.6+dfsg1-13 by Thomas Goirand, original patch by Chris Lamb.
collectd/5.6.1-1 by Marc Fournier.
fonts-tiresias/0.1-3 by Gürkan Myczko, original patch by Chris Lamb.
fntsample/4.0-2 by , original patch by Chris Lamb.
fpga-icestorm/0~20160913git266e758-2 by Ruben Undheim, original patch by Chris Lamb.
frog/0.13.5-1 by Maarten van Gompel, original patch by Chris Lamb.
lambda-align/1.0.0-2 by Sascha Steinbiss, original patch by Chris Lamb.
pleiades/1.7.0-2 by Hideki Yamane, original patch by Chris Lamb.
sweethome3d/5.2+dfsg-1 by Markus Koschany, original fix by Gabriele Giacone.
trac-subtickets/0.2.0-2 by W. Martin Borgert.

The following updated packages appear to be reproducible now, for reasons we were not able to figure out. (Relevant changelogs did not mention reproducible builds.)

aodh/3.0.0-2 by Thomas Goirand.
eog-plugins/3.16.5-1 by Michael Biebl.
flam3/3.0.1-5 by Daniele Adriana Goulart Lopes.
hyphy/2.2.7+dfsg-1 by Andreas Tille.
libbson/1.4.1-1 by A. Jesse Jiryu Davis.
libmongoc/1.4.1-1 by A. Jesse Jiryu Davis.
lxc/1:2.0.5-1 by Evgeni Golov.
spice-gtk/0.33-1 by Liang Guo.
spice-vdagent/0.17.0-1 by Liang Guo.
tnef/1.4.12-1 by Kevin Coyner.

Some uploads have addressed some reproducibility issues, but not all of them:

chktex/1.7.6-1 by Thorsten Alteholz, original patch by Sascha Steinbiss.
dbus/1.10.12-1 by Simon McVittie.
doomsday/1.15.8-3 by Markus Koschany, #839338 by Lucas Nussbaum.
emacs25/25.1+1-1 by Rob Browning.
gpgme1.0/1.7.0-3 by Daniel Kahn Gillmor.
monkeysign/2.2.0 by Antoine Beaupré.
python-attrs/16.2.0-1 by Tristan Seligmann, original patch by Chris Lamb.
shotwell/0.24.0-1 by Jörg Frings-Fürst, original patch by Alexis Bienvenüe.
supple/1.0.6-2 by Daniel Silverstone.
why/2.36-1 by Ralf Treinen, original patch by Valentin Lorentz.

Some uploads have addressed nearly all reproducibility issues, except for build path issues:

palo/1.96 by Helge Deller, #778437 by Chris Lamb.
rbdoom3bfg/1.1.0~preview3+dfsg+git20160807-1 by Tobias Frost.
singular/4.0.3-p3+ds-1 by Jerome Benoit.
varnish/5.0.0-3 by Stig Sandbeck Mathisen, original patch by Chris Lamb.
yaml-cpp/0.5.2-4 by Paul Novotny, original patch by Reiner Herrmann.

Patches submitted that have not made their way to the archive yet:

#840741 filed against http-icons by Chris Lamb.
#840177 filed against qconf by Chris Lamb.
#840845 filed against python-pygraphviz by Chris Lamb.
#840346 filed against qjoypad by Chris Lamb.

Reviews of unreproducible packages

101 package reviews have been added, 49 have been updated and 4 have been removed in this week, adding to our knowledge about identified issues. 3 issue types have been updated:

Added max_output_size_reached, ftbfs_due_to_jenkins_semaphore_setup, and build_id_differences_only.

Weekly QA work

During reproducibility testing, some FTBFS bugs have been detected and reported by:

Anders Kaseorg (1)
Chris Lamb (18)

tests.reproducible-builds.org

Debian:

h01ger has turned off the "Scheduled in testing+unstable+experimental" regular IRC notifications and turned them into emails to those running jenkins.d.n.
Re-add opi2a armhf node and 3 new builder jobs for a total of 60 build jobs for armhf. (h01ger and vagrant)
vagrant suggested adding a variation of init systems affecting the build, and h01ger added it to the TODO list.
Steven Chamberlain submitted a patch so that now all buildinfo files are collected (unsigned yet) at submit@buildinfo.kfreebsd.eu.
Holger enabled CPU type variation (Intel Haswell or AMD Opteron 62xx) for i386. Thanks to Profitbricks.com for their great and continued support!

Openwrt/LEDE/NetBSD/coreboot/Fedora/archlinux:

Increase memory on the 2 build nodes from 12 to 16gb, thanks to profitbricks.com

Misc.

We are running a poll to find a good time for an IRC meeting. This week's edition was written by Ximin Luo, Holger Levsen & Chris Lamb and reviewed by a bunch of Reproducible Builds folks on IRC.

17 March 2016

Patrick Matthäi: Debian Jessie 8.3: Short howto for Corosync+Pacemaker Active/Passive Cluster with two nodes and DRBD/LVM

Hello, since I had to change my old heartbeat v1 setup to a more modern Corosync+Pacemaker setup, because heartbeat v1 does not support systemd (at first it looks like it is working, but it fails on service starts/stops), I want to share a simple setup:

Two nodes (node1-1 and node1-2)
Active/Passive setup
Shared IP (here: 123.123.123.123/24)
Internal network on eth1 (here: 192.168.99.0/24)
DRBD shared storage
LVM on top of DRBD
Multiple services, depending also on the DRBD/LVM storage

First you have to activate the jessie-backports repository, because the cluster stack is not available/broken in Debian Jessie. Install the required packages with:

apt-get install -t jessie-backports libqb0 fence-agents pacemaker corosync pacemaker-cli-utils crmsh drbd-utils

After that configure your DRBD and LVM (VG+LV) on it (there are enough tutorials for it). Then deploy this configuration to /etc/corosync/corosync.conf:

totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.99.0
    }
}

logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}

nodelist {
    node {
        ring0_addr: node1-1
    }
    node {
        ring0_addr: node1-2
    }
}

Both nodes require a passwordless keypair, which is copied to the other node, so that you can ssh from one to the other. Then you can start with crm configure:

property stonith-enabled=no
property no-quorum-policy=ignore
property default-resource-stickiness=100

primitive DRBD_r0 ocf:linbit:drbd params drbd_resource="r0" op start interval="0" timeout="240" \
op stop interval="0" timeout="100" \
op monitor role=Master interval=59s timeout=30s \
op monitor role=Slave interval=60s timeout=30s
primitive LVM_r0 ocf:heartbeat:LVM params volgrpname="data1" op monitor interval="30s"
primitive SRV_MOUNT_1 ocf:heartbeat:Filesystem params device="/dev/mapper/data1-lv1" directory="/srv/storage" fstype="ext4" options="noatime,nodiratime,nobarrier" op monitor interval="40s"

primitive IP-rsc ocf:heartbeat:IPaddr2 params ip="123.123.123.123" nic="eth0" cidr_netmask="24" meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive IPInt-rsc ocf:heartbeat:IPaddr2 params ip="192.168.99.4" nic="eth1" cidr_netmask="24" meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart

primitive MariaDB-rsc lsb:mysql meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Redis-rsc lsb:redis-server meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Memcached-rsc lsb:memcached meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive PHPFPM-rsc lsb:php5-fpm meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Apache2-rsc lsb:apache2 meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart
primitive Nginx-rsc lsb:nginx meta migration-threshold=2 op monitor interval=20 timeout=60 on-fail=restart

group APCLUSTER LVM_r0 SRV_MOUNT_1 IP-rsc IPInt-rsc MariaDB-rsc Redis-rsc Memcached-rsc PHPFPM-rsc Apache2-rsc Nginx-rsc
ms ms_DRBD_APCLUSTER DRBD_r0 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"

colocation APCLUSTER_on_DRBD_r0 inf: APCLUSTER ms_DRBD_APCLUSTER:Master
order APCLUSTER_after_DRBD_r0 inf: ms_DRBD_APCLUSTER:promote APCLUSTER:start

commit

The last lines (the colocation and order constraints) gave me some headaches. In short, they define that the DRBD device on the active node has to be the primary one and that this is required to start APCLUSTER on that host, since LVM, the file system and the services need to access its data. This is just a short copy-and-paste howto for a simple use case, without much in-depth explanation.

8 February 2016

Lunar: Reproducible builds: week 41 in Stretch cycle

What happened in the reproducible builds effort this week:

Toolchain fixes

After remarks from Guillem Jover, Lunar updated his patch adding generation of .buildinfo files in dpkg.

Packages fixed

The following packages have become reproducible due to changes in their build dependencies: dracut, ent, gdcm, guilt, lazarus, magit, matita, resource-agents, rurple-ng, shadow, shorewall-doc, udiskie.

The following packages became reproducible after getting fixed:

disque/1.0~rc1-5 by Chris Lamb, noticed by Reiner Herrmann.

dlm/4.0.4-2 by Ferenc Wágner.

drbd-utils/8.9.6-1 by Apollon Oikonomopoulos.

java-common/0.54 by Emmanuel Bourg.

libjibx1.2-java/1.2.6-1 by Emmanuel Bourg.

libzstd/0.4.7-1 by Kevin Murray.

python-releases/1.0.0-1 by Jan Dittberner.

redis/2:3.0.7-2 by Chris Lamb, noticed by Reiner Herrmann.

tetex-brev/4.22.github.20140417-3 by Petter Reinholdtsen.

Some uploads fixed some reproducibility issues, but not all of them:

anarchism/14.0-4 by Holger Levsen.

hhvm/3.11.1+dfsg-1 by Faidon Liambotis.

netty/1:4.0.34-1 by Emmanuel Bourg.

Patches submitted which have not made their way to the archive yet:

#813309 on lapack by Reiner Herrmann: removes the test log and sorts the files packed into the static library locale-independently.

#813345 on elastix by akira: suggest to use the `$datetime` placeholder in Doxygen footer.

#813892 on dietlibc by Reiner Herrmann: remove gzip headers, sort `md5sums` file, and sort object files linked in static libraries.

#813912 on git by Reiner Herrmann: remove timestamps from documentation generated with asciidoc, remove gzip headers, and sort md5sums and tclIndex files.

reproducible.debian.net

For the first time, we've reached more than 20,000 packages with reproducible builds for sid on `amd64` with our current test framework. Vagrant Cascadian has set up another test system for `armhf`, enabling four more builder jobs to be added to Jenkins. (h01ger)

Package reviews

233 reviews have been removed, 111 added and 86 updated in the previous week. 36 new FTBFS bugs were reported by Chris Lamb and Alastair McKinstry. New issue: timestamps_in_manpages_generated_by_yat2m. The description for the blacklisted_on_jenkins issue has been improved. Some packages are also now tagged with blacklisted_on_jenkins_armhf_only.

Misc.

Steven Chamberlain gave an update on the status of FreeBSD and variants after the BSD devroom at FOSDEM '16. He also discussed how jails can be used for easier and faster reproducibility tests. The video for h01ger's talk in the main track of FOSDEM '16 about the reproducible ecosystem is now available.

4 January 2016

Lunar: Reproducible builds: week 36 in Stretch cycle

What happened in the reproducible builds effort between December 27th and January 2nd:

Infrastructure

dak now silently accepts and discards .buildinfo files (commit 1, 2), thanks to Niels Thykier and Ansgar Burchardt. This was later confirmed as working by Mattia Rizzolo.

Packages fixed

The following packages have become reproducible due to changes in their build dependencies: banshee-community-extensions, javamail, mono-debugger-libs, python-avro.

The following packages became reproducible after getting fixed:

avrdude/6.2-5 by Milan Kupcevic.
blosxom/2.1.2-2 uploaded by Rhonda D'Vine, original patches (#777292, #793001) by Chris Lamb and akira.
buzztrax/0.10.2-2 uploaded by Sebastian Dröge, original patch by Chris Lamb, fixed upstream.
dx/1:4.4.4-8 by Graham Inggs.
gap-guava/3.12+ds1-3 by Jerome Benoit.
goffice/0.10.26-1 uploaded by Dmitry Smirnov, fixed upstream.
gunroar/0.15.dfsg1-8 uploaded by Markus Koschany, original patch by Reiner Herrmann.
iceweasel/43.0.2-1 by Mike Hommey.
ii-esu/1.0a.dfsg1-7 uploaded by Markus Koschany, original patch by Reiner Herrmann.
jing-trang/20131210+dfsg+1-4 by Samuel Thibault.
mstflint/4.1.0+1.46.gb1cdaf7-1 by Mehdi Dogguy.
mu-cade/0.11.dfsg1-9 uploaded by Markus Koschany, original patch by Reiner Herrmann.
mumble/1.2.12-1 by Christopher Knadle.
netris/0.52-10 by Rhonda D'Vine, original patches (#778201, #793707) by Chris Lamb and akira.
onboard/1.1.2-2 uploaded by Mike Gabriel, original patch by Reiner Herrmann.
parsec47/0.2.dfsg1-7 uploaded by Markus Koschany, original patch by Reiner Herrmann.
pathological/1.1.3-14 by Markus Koschany.
projectl/1.001.dfsg1-8 uploaded by Markus Koschany, original patch by Reiner Herrmann.
re2c/0.15.3-1 by JCF Ploemen.
s3d/0.2.2-14 by Sven Eckelmann.
tulip/4.8.0dfsg-2 by Yann Dirson.
val-and-rick/0.1a.dfsg1-5 uploaded by Markus Koschany, original patch by Reiner Herrmann.
xterm/321-1 by Sven Joachim.

Some uploads fixed some reproducibility issues, but not all of them:

debian-installer/20160101 uploaded by Cyril Brulebois with several fixes from Steven Chamberlain.
drbd-utils/8.9.5-1 by Apollon Oikonomopoulos.
hhvm/3.11.0+dfsg-1 by Faidon Liambotis.
rkward/0.6.4-1 uploaded by Thomas Friedrichsmeier, original patch by Philip Rinn.
rsbackup/3.0-2 uploaded by Matthew Vernon, original patches (#777394, #793716) by Chris Lamb and akira.
tin/1:2.3.2-1 by Marco d'Itri.
transdecoder/2.0.1+dfsg-2 uploaded by Andreas Tille, original patch by Chris Lamb.
tumiki-fighters/0.2.dfsg1-7 uploaded by Markus Koschany, original patch by Reiner Herrmann.

Untested changes:

fltk1.1/1.1.10-20 by Aaron M. Ucko, currently FTBFS.
fltk1.3/1.3.3-5 by Aaron M. Ucko, currently FTBFS.

reproducible.debian.net The testing distribution (the upcoming stretch) is now tested on armhf. (h01ger) Four new armhf build nodes provided by Vagrant Cascadian were integrated into the infrastructure. This allowed for 9 new armhf builder jobs. (h01ger) The RPM-based build system, koji, is now in unstable and testing. (Marek Marczykowski-Górecki, Ximin Luo)

Package reviews 131 reviews have been removed, 71 added and 53 updated in the previous week. 58 new FTBFS reports were made by Chris Lamb and Chris West. New issues identified this week: nondeterminstic_ordering_in_gsettings_glib_enums_xml, nondeterminstic_output_in_warnings_generated_by_breathe, qt_translate_noop_nondeterminstic_ordering.

Misc. Steven Chamberlain explained at length why reproducible cross-building across architectures matters, and posted results of his tests comparing two stage1 debootstrapped chroots of linux-i386: one built from official Debian packages, the other cross-built from kfreebsd-amd64.

26 July 2015

Lunar: Reproducible builds: week 12 in Stretch cycle

What happened in the reproducible builds effort this week:

Toolchain fixes Eric Dorland uploaded automake-1.15/1:1.15-2 which makes the output of mdate-sh deterministic. Original patch by Reiner Herrmann. Kenneth J. Pronovici uploaded epydoc/3.0.1+dfsg-8 which now honors SOURCE_DATE_EPOCH. Original patch by Reiner Herrmann. Chris Lamb submitted a patch to dh-python to make the order of the generated maintainer scripts deterministic. Chris also offered a fix for a source of non-determinism in dpkg-shlibdeps when packages have alternative dependencies. Dhole provided a patch to add support for SOURCE_DATE_EPOCH to gettext.

Packages fixed The following 78 packages became reproducible in our setup due to changes in their build dependencies: chemical-mime-data, clojure-contrib, cobertura-maven-plugin, cpm, davical, debian-security-support, dfc, diction, dvdwizard, galternatives, gentlyweb-utils, gifticlib, gmtkbabel, gnuplot-mode, gplanarity, gpodder, gtg-trace, gyoto, highlight.js, htp, ibus-table, impressive, jags, jansi-native, jnr-constants, jthread, jwm, khronos-api, latex-coffee-stains, latex-make, latex2rtf, latexdiff, libcrcutil, libdc0, libdc1394-22, libidn2-0, libint, libjava-jdbc-clojure, libkryo-java, libphone-ui-shr, libpicocontainer-java, libraw1394, librostlab-blast, librostlab, libshevek, libstxxl, libtools-logging-clojure, libtools-macro-clojure, litl, londonlaw, ltsp, macsyfinder, mapnik, maven-compiler-plugin, mc, microdc2, miniupnpd, monajat, navit, pdmenu, pirl, plm, scikit-learn, snp-sites, sra-sdk, sunpinyin, tilda, vdr-plugin-dvd, vdr-plugin-epgsearch, vdr-plugin-remote, vdr-plugin-spider, vdr-plugin-streamdev, vdr-plugin-sudoku, vdr-plugin-xineliboutput, veromix, voxbo, xaos, xbae. The following packages became reproducible after getting fixed:

analog/2:6.0-21 uploaded by Andreas Beckmann, original patch by Dhole.
base-passwd/3.5.38 uploaded by Colin Watson, original patch by Juan Picca.
debconf/1.5.57 uploaded by Colin Watson, original patch by Lunar.
ipband/0.8.1-4 by Mats Erik Andersson.
kfreebsd-10/10.1~svn274115-7 by Steven Chamberlain.
libcommons-cli-java/1.3.1-1 by tony mancill.
libpsl/0.7.1-1 by Daniel Kahn Gillmor.
maven-archiver/2.6-3 by Emmanuel Bourg.
mtink/1.0.16-9 by Graham Inggs.
ocamlweb/1.39-2 uploaded by Mehdi Dogguy, original patch by Chris Lamb.
rbdoom3bfg/1.0.3+repack1+git20150625-1 by Tobias Frost.
spatialite-tools/4.2.1~rc1-2 by Bas Couwenberg.
task/2.4.4+dfsg-1 by Sebastien Badia.

Some uploads fixed some reproducibility issues but not all of them:

bullet/2.83.4+dfsg-1 by Markus Koschany.
cdo/1.6.6+dfsg.1-2 by Alastair McKinstry.
fish/2.2.0-1 uploaded by Tristan Seligmann, original patch by Chris Lamb.
sympy/0.7.6-3 by Sergey B Kirpichev.
xtables-addons/2.7-1 uploaded by Dmitry Smirnov, original patch by Reiner Herrmann.

Patches submitted which have not made their way to the archive yet:

#792178 on gunroar by Reiner Herrmann: use C locale when sorting source files.
#792181 on tth by Reiner Herrmann: remove timestamps from generated HTML files.
#792285 on pkgconf by Juan Picca: set LC_ALL=C when running sort.
#792319 on jsmath-fonts by Chris Lamb: set TZ=UTC when calling unzip.
#792424 on swh-plugins by Chris Lamb: sort inputs in Makefile.
#792525 on ruby-standalone by Reiner Herrmann: use UTC and C locale when formatting the manpage date for the documentation.
#792528 on dict-foldoc by Reiner Herrmann: use C locale when formatting the date for the documentation.
#792529 on tomatoes by Reiner Herrmann: use date from debian/changelog in version string.
#792593 on lives by Dhole: process a Perl hash in stable order.
#792596 on jsmath by Dhole: set TZ=UTC when calling unzip.
#792597 on jsmath-fonts-sprite by Dhole: set TZ=UTC when calling unzip.
#792598 on libreoffice-canzeley-client by Dhole: set TZ=UTC when calling unzip.
#792599 on openthesaurus by Dhole: set TZ=UTC when calling unzip.
#792602 on fonts-stix by Dhole: set TZ=UTC when calling unzip.
#792667 on jack-audio-connection-kit: use date from debian/changelog in manpages.
#792668 on pyhoca-gui: remove date from package version number.
#792671 on apertium-dbus: remove *.pyo and *.pyc from binary package.
#792673 on bup: use date from debian/changelog when generating version strings.
#792684 on cain by Chris Lamb: ensure stable permissions when creating source tarball.
#792709 on dict-jargon by Dhole: set timestamp in archive using the latest entry of debian/changelog.
#792727 on libaqbanking by Micha Lenk (upstream): sort source files in documentation.
#792763 on docbook-dsssl by Chris Lamb: sort input files when creating changelog.
#792770 on lynx-cur by Reiner Herrmann: use C locale when sorting configuration files.
#792771 on mu-cade by Reiner Herrmann: use C locale when sorting source files.
#792772 on titanion by Reiner Herrmann: use C locale when sorting source files.
#792783 on linuxlogo by Reiner Herrmann: use C locale when sorting source files.
#792821 on pkg-config by Juan Picca: use C locale when sorting source files.
#792828 on tiger by Daniel Kahn Gillmor: use C locale when listing source files.
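
Several of the patches above replace "the time of the build" with a fixed date, and the toolchain fixes mention SOURCE_DATE_EPOCH for the same purpose. A minimal Python sketch of that pattern follows. The helper names are hypothetical, and the fallback to the current time is an assumption about how a tool might behave when the variable is unset.

```python
import os
import time


def build_timestamp(environ=None) -> int:
    """Return SOURCE_DATE_EPOCH if it is set, otherwise the current time."""
    env = os.environ if environ is None else environ
    value = env.get("SOURCE_DATE_EPOCH")
    return int(value) if value else int(time.time())


def format_build_date(environ=None) -> str:
    """Format the build date in UTC with a fixed, locale-independent layout."""
    t = time.gmtime(build_timestamp(environ))
    # Plain numeric formatting avoids locale-dependent month and day names.
    return "%04d-%02d-%02d" % (t.tm_year, t.tm_mon, t.tm_mday)
```

Using `time.gmtime` rather than local time plays the same role as setting TZ=UTC in the unzip patches: the result no longer depends on where the build machine happens to be.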

reproducible.debian.net The statistics on the main page of reproducible.debian.net are now updated every five minutes. A random unreviewed package is suggested in the "look at a package" form on every build. (h01ger) A new package set based on the Core Internet Infrastructure census has been added. (h01ger) Testing of FreeBSD has started, though there are no results yet. More details have been posted to the freebsd-hackers mailing list. The build is run on a new virtual machine running FreeBSD 10.1 with 3 cores and 6 GB of RAM, also sponsored by Profitbricks.

strip-nondeterminism development Andrew Ayer released version 0.009 of strip-nondeterminism. The new version will strip locales from Javadoc, include the name of files causing errors, and ignore unhandled (but rare) zip64 archives.

debbindiff development Lunar continued its major refactoring to enhance code reuse and pave the way to fuzzy-matching and parallel processing. Most file comparators have now been converted to the new class hierarchy. In order to support more archive formats, work has started on packaging Python bindings for libarchive. While getting support for more archive formats with a common interface is very nice, libarchive is a stream-oriented library and might perform badly with how debbindiff currently works. Time will tell if better solutions need to be found.

Documentation update Lunar started a Reproducible builds HOWTO intended to explain the different aspects of making software build reproducibly to the different audiences that might have to get involved, like software authors, producers of binary packages, and distributors.

Package reviews 17 obsolete reviews have been removed, 212 added and 46 updated this week. 15 new bugs for packages failing to build from sources have been reported by Chris West (Faux) and Mattia Rizzolo.

Presentations Lunar presented Debian efforts and some recipes on making software build reproducibly at Libre Software Meeting 2015. Slides and a video recording are available.

Misc. h01ger, dkg, and Lunar attended a Core Infrastructure Initiative meeting. The progress and tools made for the Debian efforts were shown. Several discussions also helped in getting a better understanding of the needs of other free software projects regarding reproducible builds. The idea of a global append-only log, similar to the logs used for Certificate Transparency, came up on multiple occasions. Using such append-only logs for keeping records of sources and build results has gotten the name "Binary Transparency Logs". They would at least help in identifying a compromised software signing key. Whether the benefits of using such logs justify the costs needs more research.

12 January 2015

Russell Coker: Systemd Notes

A few months ago I gave a lecture about systemd for the Linux Users of Victoria. Here are some of my notes reformatted as a blog post:

Scripts in /etc/init.d can still be used; they work the same way as they do under sysvinit for the user. You type the same commands to start and stop daemons.
To get a result similar to changing runlevel use the systemctl isolate command. Runlevels were never really supported in Debian (unlike Red Hat where they were used for starting and stopping the X server) so for Debian users there's no change here.
The command systemctl with no params shows a list of loaded services and highlights failed units.
The command journalctl -u UNIT-PATTERN shows journal entries for the unit(s) in question. The pattern uses wildcards, not regexes.
The systemd journal includes the stdout and stderr of all daemons. This solves the problem of daemons that don't log all errors to syslog and leave the sysadmin wondering why they don't work.
The command systemctl status UNIT gives the status and last log entries for the unit in question.
A program can use ioctl(fd, TIOCSTI, ) to push characters into a tty buffer. If the sysadmin runs an untrusted program with the same controlling tty then it can cause the sysadmin shell to run hostile commands. The system call setsid() to create a new terminal session is one solution but managing which daemons can be started with it is difficult. The way that systemd manages start/stop of all daemons solves this. I am glad to be rid of the run_init program we used to use on SE Linux systems to deal with this.
Systemd has a mechanism to ask for passwords for SSL keys and encrypted filesystems etc. There have been problems with that in the past but I think they are all fixed now. While there is some difficulty during development, the end result of having one consistent way of managing this will be better than having multiple daemons doing it in different ways.
The commands systemctl enable and systemctl disable enable/disable daemon start at boot, which is easier than the SysVinit alternative of update-rc.d in Debian.
Systemd has built in seat management, which is not more complex than consolekit which it replaces. Consolekit was installed automatically without controversy so I don't think there should be controversy about systemd replacing consolekit.
Systemd improves performance by parallel start and autofs style fsck.
The command systemd-cgtop shows resource use for cgroups it creates.
The command systemd-analyze blame shows what delayed the boot process and systemd-analyze critical-chain shows the critical path in boot delays.
Systemd also has security features such as service private /tmp and restricting service access to directory trees.

Conclusion For basic use things just work; you don't need to learn anything new to use systemd. It provides significant benefits for boot speed and potentially security. It doesn't seem more complex than other alternative solutions to the same problems.

https://wiki.debian.org/systemd
http://freedesktop.org/wiki/Software/systemd/Optimizations/
http://0pointer.de/blog/projects/security.html
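
The note that journalctl -u takes wildcards rather than regular expressions is easy to get wrong, because the same pattern can be valid in both syntaxes and match different things. The Python sketch below shows the difference. It uses Python's `fnmatch` as a stand-in for shell-style globbing (systemd has its own matcher, so this is only an analogy), and the unit names are hypothetical.

```python
import fnmatch
import re

# Hypothetical unit names, for illustration only.
units = ["ssh.service", "sshd.service", "apache2.service", "ssh.socket"]

pattern = "ssh*.service"

# As a wildcard: "*" matches any run of characters, "." is a literal dot.
glob_hits = fnmatch.filter(units, pattern)

# As a regex: "h*" means zero or more "h", and "." matches any one character.
regex_hits = [u for u in units if re.fullmatch(pattern, u)]

print("wildcard:", glob_hits)
print("regex:   ", regex_hits)
```

Read as a wildcard the pattern picks up both ssh.service and sshd.service; read as a regular expression it only matches ssh.service.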

30 December 2012

Iustin Pop: Interesting tool of the day: ghc-gc-tune

Courtesy of a recent Google+ post/Stack Overflow answer, I stumbled upon ghc-gc-tune, a simple but nice tool which generates interesting graphs for Haskell programs. What it does is quite trivial: iterate over a range of arena and heap sizes, and run a specified program with those RTS options, then generate a graph comparing the performance (by default CPU time) across the combinations of the values. Note that for newer GHC versions, you'll need to link with -rtsopts to allow for the -A/-H custom sizes. The reason I mention it is that the graphs it generates can be quite interesting. For one of the Ganeti programs, hspace, run with the command line ./hspace --simu p,20,1t,96g,16,1 --disk-template drbd, it generates this graph:

This picture, if I read it correctly, says that this program is actually well behaved (well, +RTS -s says "3 MB total memory in use" with default options), and that the optimum sizes actually relate to the (L1? L2?) cache size. But note that the maximum difference is only about 1.6×. By changing the parameters to hspace to make it allocate more memory (~52MB reported by +RTS -s), using the command line

./hspace --simu p,20,1t,128g,256,32 --disk-template plain --tiered=1g,128m,1

the graph changes significantly:

Now we have somewhat the opposite situation: very small arena sizes are detrimental (and by a big factor, 4.5×), large arena/heap sizes are OK-ish, and the sweet spot is around 2-4MB arena size with heap sizes up to 4MB. Now maybe these particular examples were not very enlightening (and they were definitely not well-conducted tests, etc.), but they should allow some intuition into how the program behaves. Plus, the tool can also generate other plots, for example peak memory usage.
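
The sweep described at the top of this post (try every combination of arena and heap size, time the program, compare) can be sketched in Python as below. This is not how ghc-gc-tune itself is written. The function names are made up, and the default runner measures wall-clock time rather than the CPU time the tool reports by default. The `+RTS -A<size> -H<size> -RTS` form is the standard way to pass these sizes to a GHC-compiled program.

```python
import itertools
import subprocess
import time


def rts_args(arena: str, heap: str) -> list:
    """Build the RTS option group for one (arena, heap) combination."""
    return ["+RTS", f"-A{arena}", f"-H{heap}", "-RTS"]


def sweep(command, arenas, heaps, run=None):
    """Time `command` for every (arena, heap) pair.

    Returns a dict mapping (arena, heap) to the measured seconds.
    `run` takes an argv list and returns a float; it can be replaced
    for testing or to measure something other than wall-clock time.
    """
    def default_run(argv):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        return time.perf_counter() - start

    run = run or default_run
    return {
        (a, h): run(list(command) + rts_args(a, h))
        for a, h in itertools.product(arenas, heaps)
    }


def best(results):
    """Return the (arena, heap) pair with the lowest measured time."""
    return min(results, key=results.get)
```

A call such as `sweep(["./hspace", "--simu", "p,20,1t,96g,16,1", "--disk-template", "drbd"], ["64k", "1m", "4m"], ["1m", "4m"])` would then give the raw numbers behind a graph like the ones above, assuming the binary was linked with -rtsopts.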

26 November 2012

Russell Coker: Links November 2012

Julian Treasure gave an informative TED talk about The 4 Ways Sound Affects US [1]. Among other things he claims that open plan offices reduce productivity by 66%! He suggests that people who work in such offices wear headphones and play bird-songs.
Naked Capitalism has an interesting interview between John Cusack and Jonathan Turley about how the US government policy of killing US citizens without trial demonstrates the failure of their political system [2].
Washington's Blog has an interesting article on the economy in Iceland [3]. Allowing the insolvent banks to go bankrupt was the best thing that they have ever done for their economy.
Clay Shirky wrote an insightful article about the social environment of mailing lists and ways to limit flame-wars [4].
ZRep is an interesting program that mirrors ZFS filesystems via regular snapshots and send/recv operations [5]. It seems that it could offer similar benefits to DRBD but at the file level and with greater reliability.
James Lockyer gave a moving TEDx talk about his work in providing a legal defence for the wrongly convicted [6]. This has included overturning convictions after as much as half a century in which the falsely accused had already served a life sentence.
Nathan Myers wrote an epic polemic about US government policy since 9-11 [7]. It's good to see that some Americans realise it's wrong.
There is an insightful TED blog post about TED Fellow Salvatore Iaconesi who has brain cancer [8]. Apparently he had some problems with medical records in proprietary formats which made it difficult to get experts to properly assess his condition. Open document standards can be a matter of life and death and should be mandated by federal law.
Paul Wayper wrote an interesting and amusing post about Emotional Computing which compares the strategies of Apple, MS, and the FOSS community among other things [9].
Kevin Allocca of Youtube gave an insightful TED talk about why videos go viral [10].
Jason Fried gave an interesting TED talk Why Work Doesn't Happen at Work [11]. His main issues are distraction and wasted time in meetings. He gives some good ideas for how to improve productivity. But they can also be used for sabotage. If someone doesn't like their employer then they could call for meetings, incite managers to call meetings, and book meetings so that they don't follow each other and thus waste more of the day (EG meetings at 1PM and 3PM instead of having the second meeting when the first finishes).
Shyam Sankar gave an interesting TED talk about human computer cooperation [12]. He describes the success of human-computer partnerships in winning chess tournaments, protein folding, and other computational challenges. It seems that the limit for many types of computation will be the ability to get people and computers to work together efficiently.
Cory Doctorow wrote an interesting and amusing article for Locus Magazine about some of the failings of modern sci-fi movies [13]. He is mainly concerned with pointless movies that get the science and technology aspects wrong and the way that the blockbuster budget process drives the development of such movies. Of course there are many other things wrong with sci-fi movies such as the fact that most of them are totally implausible (EG aliens who look like humans).
The TED blog has an interesting interview with Catarina Mota about hacker spaces and open hardware [14].
Sociological Images has an interesting article about sporting behaviour [15]. They link to a very funny youtube video of a US high school football team who make the other team believe that they aren't playing until they win [16].


27 April 2012

Russell Coker: BTRFS and ZFS as Layering Violations

LWN has an interesting article comparing recent developments in the Linux world to the Unix Wars that essentially killed every proprietary Unix system [1]. The article is really interesting and I recommend reading it; it's probably only available to subscribers at the moment but should be generally available in a week or so (I used my Debian access sponsored by HP to read it). A comment on that article cites my previous post about the reliability of RAID [2] and then goes on to disagree with my conclusion that using the filesystem for everything is the right thing to do.

The Benefits of Layers I don't believe as strongly in the BTRFS/ZFS design as the commentator probably thinks. The current way my servers (and a huge number of other Linux systems) work, of having RAID to form a reliable array of disks from a set of cheap disks for the purpose of reliability and often capacity or performance, is a good thing. I have storage on top of the RAID array and can fix the RAID without bothering about the filesystem(s) and have done so in the past. I can also test the RAID array without involving any filesystem specific code. Then I have LVM running on top of the RAID array in exactly the same way that it runs on top of a single hard drive or SSD in the case of a laptop or netbook. So Linux on a laptop is much the same as Linux on a server in terms of storage once we get past the issue of whether a single disk or a RAID array is used for the LVM PV. Among other things this means that the same code paths are used and I'm less likely to encounter a bug when I install a new system.

LVM provides multiple LVs which can be used for filesystems, swap, or anything else that uses storage. So if a filesystem gets badly corrupted I can umount it, create an LVM snapshot, and then take appropriate measures to try and fix it without interfering with other filesystems.

When using layered storage I can easily add or change layers when it's appropriate. For example I have encryption on only some LVs on my laptop and netbook systems (there is no point encrypting the filesystem used for .iso files of Linux distributions) and on some servers I use RAID-0 for cached data.

When using a filesystem like BTRFS or ZFS which includes subvolumes (similar in result to LVM in some cases) and internal RAID you can't separate the layers. So if something gets corrupted then you have to deal with all the complexity of BTRFS or ZFS instead of just fixing the one layer that has a problem.

Update: One thing I forgot to mention when I first published this is the benefits of layering for some uncommon cases such as network devices. I can run an Ext4 filesystem over a RAID-1 array which has one device on NBD on another system. That's a bit unusual but it is apparently working well for some people. The internal RAID on ZFS and BTRFS doesn't support such things and using software RAID underneath ZFS or BTRFS loses some features.

When using DRBD you might have two servers with local RAID arrays, DRBD on top of that, and then an Ext4 filesystem. As any form of RAID other than internal RAID loses reliability features for ZFS and BTRFS, that means that no matter how you might implement those filesystems with DRBD it seems that you will lose somehow. It seems that neither BTRFS nor ZFS supports a disconnected RAID mode (like a Linux software RAID with a bitmap so it can resync only the parts that changed) so it's not possible to use BTRFS or ZFS RAID-1 with an NBD device.

The only viable way of combining ZFS data integrity features with DRBD replication seems to be using a zvol for DRBD and then running Ext4 on top of that.

The Benefits of Integration When RAID and the filesystem are separate things (with some added abstraction from LVM) it's difficult to optimise the filesystem for RAID performance at the best of times and impossible in many cases. When the filesystem manages RAID it can optimise its operation to match the details of the RAID layout. I believe that in some situations ZFS will use mirroring instead of RAID-Z for small writes to reduce the load and that ZFS will combine writes into a single RAID-Z stripe (or set of contiguous RAID-Z stripes) to improve write performance.

It would be possible to have a RAID driver that includes checksums for all blocks; it could then read from another device when a checksum fails and give some of the reliability features that ZFS and BTRFS offer. Then to provide all the reliability benefits of ZFS you would at least need a filesystem that stores multiple copies of the data which would of course need checksums (because the filesystem could be used on a less reliable block device) and therefore you would end up with two checksums on the same data. Note that if you want to have a RAID array with checksums on all blocks then ZFS has a volume management feature (which is well described by Mark Round) [3]. Such a zvol could be used for a block device in a virtual machine and in an ideal world it would be possible to use one as swap space. But the zvol is apparently managed with all the regular ZFS mechanisms so it's not a direct list of blocks on disk and thus can't be extracted if there is a problem with ZFS.

Snapshots are an essential feature by today's standards. The ability to create lots of snapshots with low overhead is a significant feature of filesystems like BTRFS and ZFS. Now it is possible to run BTRFS or ZFS on top of a volume manager like LVM which does snapshots to cover the case of the filesystem getting corrupted. But again that would end up with two sets of overhead.

The way that ZFS supports snapshots which inherit encryption keys is also interesting.

Conclusion It's technically possible to implement some of the ZFS features as separate layers, such as a software RAID implementation that puts checksums on all blocks. But it appears that there isn't much interest in developing such things. So while people would use it (and people are using ZFS ZVols as block devices for other filesystems as described in a comment on Mark Round's blog) it's probably not going to be implemented.

Therefore we have a choice of all the complexity and features of BTRFS or ZFS, or the current RAID+LVM+Ext4 option. While the complexity of BTRFS and ZFS is a concern for me (particularly as BTRFS is new and ZFS is really complex and not well supported on Linux) it seems that there is no other option for certain types of large storage at the moment.

ZFS on Linux isn't a great option for me, but for some of my clients it seems to be the only option. ZFS on Solaris would be a better option in some ways, but that's not possible when you have important Linux software that needs fast access to the storage.
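
The idea mentioned above of a RAID driver that keeps a checksum for every block and reads from another device when a checksum fails can be illustrated with a toy in-memory mirror. This is a sketch of the concept only: the class name and layout are invented for the example, the "devices" are Python dicts, and it says nothing about how ZFS, BTRFS or Linux software RAID are actually implemented.

```python
import hashlib


class ChecksummedMirror:
    """Toy RAID-1 with a per-block checksum kept apart from the data."""

    def __init__(self, n_devices=2):
        # Each "device" is a dict mapping block number to block contents.
        self.devices = [dict() for _ in range(n_devices)]
        self.checksums = {}

    def write(self, block_no, data: bytes):
        """Record the checksum and write the block to every device."""
        self.checksums[block_no] = hashlib.sha256(data).digest()
        for dev in self.devices:
            dev[block_no] = data

    def read(self, block_no) -> bytes:
        """Return the first copy whose checksum matches, repairing the rest."""
        expected = self.checksums[block_no]
        for dev in self.devices:
            data = dev.get(block_no)
            if data is not None and hashlib.sha256(data).digest() == expected:
                # Overwrite any damaged copies with the verified one.
                for other in self.devices:
                    other[block_no] = data
                return data
        raise IOError(f"block {block_no}: no copy matches its checksum")
```

A plain mirror without the checksum would have no way to tell which of two differing copies is the good one, which is the reliability gap the post describes.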

