Search Results: "alberto"

13 September 2022

Alberto Garc a: Adding software to the Steam Deck with systemd-sysext

Introduction: an immutable OS The Steam Deck runs SteamOS, a single-user operating system based on Arch Linux. Although derived from a standard package-based distro, the OS in the Steam Deck is immutable and system updates replace the contents of the root filesystem atomically instead of using the package manager. An immutable OS makes the system more stable and its updates less error-prone, but users cannot install additional packages to add more software. This is not a problem for most users since they are only going to run Steam and its games (which are stored in the home partition). Nevertheless, the OS also has a desktop mode which provides a standard Linux desktop experience, and here it makes sense to be able to install more software. How to do that though? It is possible for the user to become root, make the root filesytem read-write and install additional software there, but any changes will be gone after the next OS update. Modifying the rootfs can also be dangerous if the user is not careful. Ways to add additional software The simplest and safest way to install additional software is with Flatpak, and that s the method recommended in the Steam Deck Desktop FAQ. Flatpak is already installed and integrated in the system via the Discover app so I won t go into more details here. However, while Flatpak works great for desktop applications not every piece of software is currently available, and Flatpak is also not designed for other types of programs like system services or command-line tools. Fortunately there are several ways to add software to the Steam Deck without touching the root filesystem, each one with different pros and cons. I will probably talk about some of them in the future, but in this post I m going to focus on one that is already available in the system: systemd-sysext. About systemd-sysext This is a tool included in recent versions of systemd and it is designed to add additional files (in the form of system extensions) to an otherwise immutable root filesystem. Each one of these extensions contains a set of files. When extensions are enabled (aka merged ) those files will appear on the root filesystem using overlayfs. From then on the user can open and run them normally as if they had been installed with a package manager. Merged extensions are seamlessly integrated with the rest of the OS. Since extensions are just collections of files they can be used to add new applications but also other things like system services, development tools, language packs, etc. Creating an extension: yakuake I m using yakuake as an example for this tutorial since the extension is very easy to create, it is an application that some users are demanding and is not easy to distribute with Flatpak. So let s create a yakuake extension. Here are the steps: 1) Create a directory and unpack the files there:

$ mkdir yakuake
$ wget https://steamdeck-packages.steamos.cloud/archlinux-mirror/extra/os/x86_64/yakuake-21.12.1-1-x86_64.pkg.tar.zst
$ tar -C yakuake -xaf yakuake-*.tar.zst usr

2) Create a file called extension-release.NAME under usr/lib/extension-release.d with the fields ID and VERSION_ID taken from the Steam Deck s /etc/os-release file.

$ mkdir -p yakuake/usr/lib/extension-release.d/
$ echo ID=steamos > yakuake/usr/lib/extension-release.d/extension-release.yakuake
$ echo VERSION_ID=3.3.1 >> yakuake/usr/lib/extension-release.d/extension-release.yakuake

3) Create an image file with the contents of the extension:

$ mksquashfs yakuake yakuake.raw

That s it! The extension is ready. A couple of important things: image files must have the .raw suffix and, despite the name, they can contain any filesystem that the OS can mount. In this example I used SquashFS but other alternatives like EroFS or ext4 are equally valid. NOTE: systemd-sysext can also use extensions from plain directories (i.e skipping the mksquashfs part). Unfortunately we cannot use them in our case because overlayfs does not work with the casefold feature that is enabled on the Steam Deck. Using the extension Once the extension is created you simply need to copy it to a place where systemd-systext can find it. There are several places where they can be installed (see the manual for a list) but due to the Deck s partition layout and the potentially large size of some extensions it probably makes more sense to store them in the home partition and create a link from one of the supported locations (/var/lib/extensions in this example):

(deck@steamdeck ~)$ mkdir extensions
(deck@steamdeck ~)$ scp user@host:/path/to/yakuake.raw extensions/
(deck@steamdeck ~)$ sudo ln -s $PWD/extensions /var/lib/extensions

Once the extension is installed in that directory you only need to enable and start systemd-sysext:

(deck@steamdeck ~)$ sudo systemctl enable systemd-sysext
(deck@steamdeck ~)$ sudo systemctl start systemd-sysext

After this, if everything went fine you should be able to see (and run) /usr/bin/yakuake. The files should remain there from now on, also if you reboot the device. You can see what extensions are enabled with this command:

$ systemd-sysext status
HIERARCHY EXTENSIONS SINCE
/opt      none       -
/usr      yakuake    Tue 2022-09-13 18:21:53 CEST

If you add or remove extensions from the directory then a simple systemd-sysext refresh is enough to apply the changes. Unfortunately, and unlike distro packages, extensions don t have any kind of post-installation hooks or triggers, so in the case of Yakuake you probably won t see an entry in the KDE application menu immediately after enabling the extension. You can solve that by running kbuildsycoca5 once from the command line. Limitations and caveats Using systemd extensions is generally very easy but there are some things that you need to take into account:

Using extensions is easy (you put them in the directory and voil !). However, creating extensions is not necessarily always easy. To begin with, any libraries, files, etc., that your extensions may need should be either present in the root filesystem or provided by the extension itself. You may need to combine files from different sources or packages into a single extension, or compile them yourself.
In particular, if the extension contains binaries they should probably come from the Steam Deck repository or they should be built to work with those packages. If you need to build your own binaries then having a SteamOS virtual machine can be handy. There you can install all development files and also test that everything works as expected. One could also create a Steam Deck SDK extension with all the necessary files to develop directly on the Deck
Extensions are not distribution packages, they don t have dependency information and therefore they should be self-contained. They also lack triggers and other features available in packages. For desktop applications I still recommend using a system like Flatpak when possible.
Extensions are tied to a particular version of the OS and, as explained above, the ID and VERSION_ID of each extension must match the values from /etc/os-release. If the fields don t match then the extension will be ignored. This is to be expected because there s no guarantee that a particular extension is going to work with a different version of the OS. This can happen after a system update. In the best case one simply needs to update the extension s VERSION_ID, but in some cases it might be necessary to create the extension again with different/updated files.
Extensions only install files in /usr and /opt. Any other file in the image will be ignored. This can be a problem if a particular piece of software needs files in other directories.
When extensions are enabled the /usr and /opt directories become read-only because they are now part of an overlayfs. They will remain read-only even if you run steamos-readonly disable !!. If you really want to make the rootfs read-write you need to disable the extensions (systemd-sysext unmerge) first.
Unlike Flatpak or Podman (including toolbox / distrobox), this is (by design) not meant to isolate the contents of the extension from the rest of the system, so you should be careful with what you re installing. On the other hand, this lack of isolation makes systemd-sysext better suited to some use cases than those container-based systems.

Conclusion systemd extensions are an easy way to add software (or data files) to the immutable OS of the Steam Deck in a way that is seamlessly integrated with the rest of the system. Creating them can be more or less easy depending on the case, but using them is extremely simple. Extensions are not packages, and systemd-sysext is not a package manager or a general-purpose tool to solve all problems, but if you are aware of its limitations it can be a practical tool. It is also possible to share extensions with other users, but here the usual warning against installing binaries from untrusted sources applies. Use with caution, and enjoy!

5 July 2022

Alberto Garc a: Running the Steam Deck s OS in a virtual machine using QEMU

Introduction The Steam Deck is a handheld gaming computer that runs a Linux-based operating system called SteamOS. The machine comes with SteamOS 3 (code name holo ), which is in turn based on Arch Linux. Although there is no SteamOS 3 installer for a generic PC (yet), it is very easy to install on a virtual machine using QEMU. This post explains how to do it. The goal of this VM is not to play games (you can already install Steam on your computer after all) but to use SteamOS in desktop mode. The Gamescope mode (the console-like interface you normally see when you use the machine) requires additional development to make it work with QEMU and will not work with these instructions. A SteamOS VM can be useful for debugging, development, and generally playing and tinkering with the OS without risking breaking the Steam Deck. Running the SteamOS desktop in a virtual machine only requires QEMU and the OVMF UEFI firmware and should work in any relatively recent distribution. In this post I m using QEMU directly, but you can also use virt-manager or some other tool if you prefer, we re emulating a standard x86_64 machine here. General concepts SteamOS is a single-user operating system and it uses an A/B partition scheme, which means that there are two sets of partitions and two copies of the operating system. The root filesystem is read-only and system updates happen on the partition set that is not active. This allows for safer updates, among other things. There is one single /home partition, shared by both partition sets. It contains the games, user files, and anything that the user wants to install there. Although the user can trivially become root, make the root filesystem read-write and install or change anything (the pacman package manager is available), this is not recommended because

it increases the chances of breaking the OS, and
any changes will disappear with the next OS update.

A simple way for the user to install additional software that survives OS updates and doesn t touch the root filesystem is Flatpak. It comes preinstalled with the OS and is integrated with the KDE Discover app. Preparing all necessary files The first thing that we need is the installer. For that we have to download the Steam Deck recovery image from here: https://store.steampowered.com/steamos/download/?ver=steamdeck&snr= Once the file has been downloaded, we can uncompress it and we ll get a raw disk image called steamdeck-recovery-4.img (the number may vary). Note that the recovery image is already SteamOS (just not the most up-to-date version). If you simply want to have a quick look you can play a bit with it and skip the installation step. In this case I recommend that you extend the image before using it, for example with truncate -s 64G steamdeck-recovery-4.img or, better, create a qcow2 overlay file and leave the original raw image unmodified: qemu-img create -f qcow2 -F raw -b steamdeck-recovery-4.img steamdeck-recovery-extended.qcow2 64G But here we want to perform the actual installation, so we need a destination image. Let s create one: $ qemu-img create -f qcow2 steamos.qcow2 64G Installing SteamOS Now that we have all files we can start the virtual machine:

$ qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \
    -device usb-ehci -device usb-tablet \
    -device intel-hda -device hda-duplex \
    -device VGA,xres=1280,yres=800 \
    -drive if=pflash,format=raw,readonly=on,file=/usr/share/ovmf/OVMF.fd \
    -drive if=virtio,file=steamdeck-recovery-4.img,driver=raw \
    -device nvme,drive=drive0,serial=badbeef \
    -drive if=none,id=drive0,file=steamos.qcow2

Note that we re emulating an NVMe drive for steamos.qcow2 because that s what the installer script expects. This is not strictly necessary but it makes things a bit easier. If you don t want to do that you ll have to edit ~/tools/repair_device.sh and change DISK and DISK_SUFFIX. SteamOS installer shortcuts

Once the system has booted we ll see a KDE Plasma session with a few tools on the desktop. If we select Reimage Steam Deck and click Proceed on the confirmation dialog then SteamOS will be installed on the destination drive. This process should not take a long time. Now, once the operation finishes a new confirmation dialog will ask if we want to reboot the Steam Deck, but here we have to choose Cancel . We cannot use the new image yet because it would try to boot into the Gamescope session, which won t work, so we need to change the default desktop session. SteamOS comes with a helper script that allows us to enter a chroot after automatically mounting all SteamOS partitions, so let s open a Konsole and make the Plasma session the default one in both partition sets:

$ sudo steamos-chroot --disk /dev/nvme0n1 --partset A
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit
$ sudo steamos-chroot --disk /dev/nvme0n1 --partset B
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit

After this we can shut down the virtual machine. Our new SteamOS drive is ready to be used. We can discard the recovery image now if we want. Booting SteamOS and first steps To boot SteamOS we can use a QEMU line similar to the one used during the installation. This time we re not emulating an NVMe drive because it s no longer necessary.

$ cp /usr/share/OVMF/OVMF_VARS.fd .
$ qemu-system-x86_64 -enable-kvm -smp cores=4 -m 8G \
   -device usb-ehci -device usb-tablet \
   -device intel-hda -device hda-duplex \
   -device VGA,xres=1280,yres=800 \
   -drive if=pflash,format=raw,readonly=on,file=/usr/share/ovmf/OVMF.fd \
   -drive if=pflash,format=raw,file=OVMF_VARS.fd \
   -drive if=virtio,file=steamos.qcow2 \
   -device virtio-net-pci,netdev=net0 \
   -netdev user,id=net0,hostfwd=tcp::2222-:22

(the last two lines redirect tcp port 2222 to port 22 of the guest to be able to SSH into the VM. If you don t want to do that you can omit them) If everything went fine, you should see KDE Plasma again, this time with a desktop icon to launch Steam and another one to Return to Gaming Mode (which we should not use because it won t work). See the screenshot that opens this post. Congratulations, you re running SteamOS now. Here are some things that you probably want to do:

(optional) Change the keyboard layout in the system settings (the default one is US English)
Set the password for the deck user: run passwd on a terminal
Enable / start the SSH server: sudo systemctl enable sshd and/or sudo systemctl start sshd .
SSH into the machine: ssh -p 2222 deck@localhost

Updating the OS to the latest version The Steam Deck recovery image doesn t install the most recent version of SteamOS, so now we should probably do a software update.

First of all ensure that you re giving enought RAM to the VM (in my examples I run QEMU with -m 8G). The OS update might fail if you use less.
(optional) Change the OS branch if you want to try the beta release: sudo steamos-select-branch beta (or main, if you want the bleeding edge)
Check the currently installed version in /etc/os-release (see the BUILD_ID variable)
Check the available version: steamos-update check
Download and install the software update: steamos-update

Note: if the last step fails after reaching 100% with a post-install handler error then go to Connections in the system settings, rename Wired Connection 1 to something else (anything, the name doesn t matter), click Apply and run steamos-update again. This works around a bug in the update process. Recent images fix this and this workaround is not necessary with them. As we did with the recovery image, before rebooting we should ensure that the new update boots into the Plasma session, otherwise it won t work:

$ sudo steamos-chroot --partset other
# steamos-readonly disable
# echo '[Autologin]' > /etc/sddm.conf.d/zz-steamos-autologin.conf
# echo 'Session=plasma.desktop' >> /etc/sddm.conf.d/zz-steamos-autologin.conf
# steamos-readonly enable
# exit

After this we can restart the system. If everything went fine we should be running the latest SteamOS release. Enjoy! Reporting bugs SteamOS is under active development. If you find problems or want to request improvements please go to the SteamOS community tracker. Edit 06 Jul 2022: Small fixes, mention how to install the OS without using NVMe.

3 December 2020

Alberto Garc a: Subcluster allocation for qcow2 images

In previous blog posts I talked about QEMU s qcow2 file format and how to make it faster. This post gives an overview of how the data is structured inside the image and how that affects performance, and this presentation at KVM Forum 2017 goes further into the topic. This time I will talk about a new extension to the qcow2 format that seeks to improve its performance and reduce its memory requirements. Let s start by describing the problem. Limitations of qcow2 One of the most important parameters when creating a new qcow2 image is the cluster size. Much like a filesystem s block size, the qcow2 cluster size indicates the minimum unit of allocation. One difference however is that while filesystems tend to use small blocks (4 KB is a common size in ext4, ntfs or hfs+) the standard qcow2 cluster size is 64 KB. This adds some overhead because QEMU always needs to write complete clusters so it often ends up doing copy-on-write and writing to the qcow2 image more data than what the virtual machine requested. This gets worse if the image has a backing file because then QEMU needs to copy data from there, so a write request not only becomes larger but it also involves additional read requests from the backing file(s). Because of that qcow2 images with larger cluster sizes tend to:

grow faster, wasting more disk space and duplicating data.
increase the amount of necessary I/O during cluster allocation,
reducing the allocation performance.

Unfortunately, reducing the cluster size is in general not an option because it also has an impact on the amount of metadata used internally by qcow2 (reference counts, guest-to-host cluster mapping). Decreasing the cluster size increases the number of clusters and the amount of necessary metadata. This has direct negative impact on I/O performance, which can be mitigated by caching it in RAM, therefore increasing the memory requirements (the aforementioned post covers this in more detail). Subcluster allocation The problems described in the previous section are well-known consequences of the design of the qcow2 format and they have been discussed over the years. I have been working on a way to improve the situation and the work is now finished and available in QEMU 5.2 as a new extension to the qcow2 format called extended L2 entries. The so-called L2 tables are used to map guest addresses to data clusters. With extended L2 entries we can store more information about the status of each data cluster, and this allows us to have allocation at the subcluster level. The basic idea is that data clusters are now divided into 32 subclusters of the same size, and each one of them can be allocated separately. This allows combining the benefits of larger cluster sizes (less metadata and RAM requirements) with the benefits of smaller units of allocation (less copy-on-write, smaller images). If the subcluster size matches the block size of the filesystem used inside the virtual machine then we can eliminate the need for copy-on-write entirely. So with subcluster allocation we get:

Sixteen times less metadata per unit of allocation, greatly reducing the amount of necessary L2 cache.
Much faster I/O during allocating when the image has a backing file, up to 10-15 times more I/O operations per second for the same cluster size in my tests (see chart below).
Smaller images and less duplication of data.

This figure shows the average number of I/O operations per second that I get with 4KB random write requests to an empty 40GB image with a fully populated backing file.

I/O performance comparison between traditional and extended qcow2 images

Things to take into account:

The performance improvements described earlier happen during allocation. Writing to already allocated (sub)clusters won t be any faster.
If the image does not have a backing file chances are that the allocation performance is equally fast, with or without extended L2 entries. This depends on the filesystem, so it should be tested before enabling this feature (but note that the other benefits mentioned above still apply).
Images with extended L2 entries are sparse, that is, they have holes and because of that their apparent size will be larger than the actual disk usage.
It is not recommended to enable this feature in compressed images, as compressed clusters cannot take advantage of any of the benefits.
Images with extended L2 entries cannot be read with older versions of QEMU.

How to use this? Extended L2 entries are available starting from QEMU 5.2. Due to the nature of the changes it is unlikely that this feature will be backported to an earlier version of QEMU. In order to test this you simply need to create an image with extended_l2=on, and you also probably want to use a larger cluster size (the default is 64 KB, remember that every cluster has 32 subclusters). Here is an example:

$ qemu-img create -f qcow2 -o extended_l2=on,cluster_size=128k img.qcow2 1T

And that s all you need to do. Once the image is created all allocations will happen at the subcluster level. More information This work was presented at the 2020 edition of the KVM Forum. Here is the video recording of the presentation, where I cover all this in more detail: You can also find the slides here. Acknowledgments This work has been possible thanks to Outscale, who have been sponsoring Igalia and my work in QEMU.

And thanks of course to the rest of the QEMU development team for their feedback and help with this!

23 March 2020

Bits from Debian: New Debian Developers and Maintainers (January and February 2020)

The following contributors got their Debian Developer accounts in the last two months:

Gard Spreemann (gspr)
Jonathan Bustillos (jathan)
Scott Talbert (swt2c)

The following contributors were added as Debian Maintainers in the last two months:

Thiago Andrade Marques
William Grzybowski
Sudip Mukherjee
Birger Schacht
Michael Robin Crusoe
Lars Tangvald
Alberto Molina Coballes
Emmanuel Arias
Hsieh-Tseng Shen
Jamie Strandboge

Congratulations!

29 August 2017

Jeremy Bicha: GNOME Tweaks 3.25.91

The GNOME 3.26 release cycle is in its final bugfix stage before release. Here s a look at what s new in GNOME Tweaks since my last post. I ve heard people say that GNOME likes to remove stuff. If that were true, how would there be anything left in GNOME? But maybe it s partially true. And maybe it s possible for removals to be a good thing?

Removal #1: Power Button Settings The Power page in Tweaks 3.25.91 looks a bit empty. In previous releases, the Tweaks app had a When the Power button is pressed setting that nearly duplicated the similar setting in the Settings app (gnome-control-center). I worked to restore support for Power Off as one of its options. Since this is now in Settings 3.25.91, there s no need for it to be in Tweaks any more. Removal #2: Hi-DPI Settings GNOME Tweaks offered a basic control to scale windows 2x for Hi-DPI displays. More advanced support is now in the Settings app. I suspect that fractional scaling won t be supported in GNOME 3.26 but it s something to look forward to in GNOME 3.28! Removal #3 Global Dark Theme I am announcing today that one of the oldest and popular tweaks will be removed from Tweaks 3.28 (to be released next March). Global Dark Theme is being removed because:

Changing the Global Dark Theme option required closing any currently running apps and reopening them to get the correct theme.
It didn t work for sandboxed apps (Flatpak and Snap)
It only worked for gtk3 apps (it can t work on gtk2 apps)
Some themes never supported a Dark variant. The switch wouldn t do anything at all with a theme like that.

Adwaita now has a separate Adwaita Dark theme. Arc has 2 different dark variations. Therefore, if you are a theme developer, you have about 6-7 months to offer a dark version of your theme. The dark version can be distributed the same way as your regular version. Removal #4 Some letters from our name In case you haven t noticed, GNOME Tweak Tool is now GNOME Tweaks. This better matches the GNOME app naming style. Thanks Alberto Fanjul for this improvement! For other details of what s changed including a helpful scrollbar fix from Ant nio Fernandes, see the NEWS file.

7 June 2017

Jeremy Bicha: GNOME Tweak Tool 3.25.2

Today, I released the first development snapshot (3.25.2) of what will be GNOME Tweak Tool 3.26. Many of the panels have received UI updates. Here are a few highlights. Before this version, Tweak Tool didn t report its own version number on its About dialog! Also, as far as I know, there was no visible place in the default GNOME install for you to see what version of GTK+ is on your system. Especially now that GNOME and GTK+ releases don t share the same version numbers any more, I thought it was useful information to be in a tweak app.

Florian M llner updated the layout of the GNOME Shell Extensions page:

Rui Matos added a new Disable While Typing tweak to the Touchpad section.

Alberto Fanjul added a Battery Percentage tweak for GNOME Shell s top bar.

I added a Left/Right Placement tweak for the window buttons (minimize, maximize, close) . This screenshot shows a minimize and close button on the left.

I think it s well known that Ubuntu s window buttons have been on the left for years but GNOME has kept the window buttons on the right. In fact, the GNOME 3 default is a single close button (see the other screenshots). For Unity (Ubuntu s default UI from 2011 until this year), it made sense for the buttons to be on the left because of how Unity s menu bar worked (the right side was used by the indicator system status menus). I don t believe the Ubuntu Desktop team has decided yet which side the window buttons will be on or which buttons there will be. I m ok with either side but I think I have a slight preference towards putting them on the right like Windows does. One reason I m not too worried about the Ubuntu default is that it s now very easy to switch them to the other side! If Ubuntu includes a dock like the excellent Dash to Dock in the default install, I think it makes sense for Ubuntu to add a minimize button by default. My admittedly unusual opinion is that there s no need for a maximize button.

For one thing, GNOME is thoroughly tested with one window button; adding a second one shouldn t be too big of a deal, but maybe adding a 3rd button might not work as well with the design of some apps.
When I maximize an app, I either double-click the titlebar or drag the app to the top of the screen so a maximize button just isn t needed.
A dedicated maximize just doesn t make as much sense when there is more than one possible maximization state. Besides traditional maximize, there is now left and right semi-maximize. There s even a goal for GNOME 3.26 to support quarter-tiling .

Other Changes and Info

Ikey Doherty ported Tweak Tool from python2 to python3.
Florian M llner switched the build system to meson. For an app like Tweak Tool, meson makes the build faster and simpler for developers to maintain.
For more details about what s changed, see the log and the NEWS
GNOME Tweak Tool 3.26 will be released alongside GNOME 3.26 in mid-September.

7 April 2017

Arturo Borrero Gonz lez: openvpn deployment with Debian Stretch

Debian Stretch feels like an excellent release by the Debian project. The final stable release is about to happen in the short term. Among the great things you can do with Debian, you could set up a VPN using the openvpn software. In this blog post I will describe how I ve deployed myself an openvpn server using Debian Stretch, my network environment and my configurations & workflow. Before all, I would like to reference my requisites and the characteristics of what I needed:

a VPN server which allows internet clients to access our datacenter internal network (intranet) securely
strong authentications mechanisms for the users (user/password + client certificate)
the user/password information is stored in a LDAP server of the datacenter
support for several (hundreds?) of clients
only need to route certain subnets (intranet) through the VPN, not the entire network traffic of the clients
full IPv4 & IPv6 dual stack support, of course
a group of system admins will perform changes to the configurations, adding and deleting clients

I agree this is a rather complex scenario and not all the people will face these requirements. The service diagram has this shape: VPN diagram

(DIA source file) So, it works like this:

clients connect via internet to our openvpn server, vpn.example.com
the openvpn server validates the connection and the tunnel is established (green)
now the client is virtually inside our network (blue)
the client wants to access some intranet resource, the tunnel traffic is NATed (red)

Our datacenter intranet is using public IPv4 addressing, but the VPN tunnels use private IPv4 addresses. To don t mix public and private address NAT is used. Obviously we don t want to invest public IPv4 addresses in our internal tunnels. We don t have this limitations in IPv6, we could use public IPv6 addresses within the tunnels. But we prefer sticking to a hard dual stack IPv4/IPv6 approach and also use private IPv6 addresses inside the tunnels and also NAT the IPv6 from private to public. This way, there are no differences in how IPv4 and IPv6 network are managed. We follow this approach for the addressing:

client 1 tunnel: 192.168.100.11, fd00:0:1::11
client 1 public NAT: x.x.x.11, x:x::11
client 2 tunnel: 192.168.100.12, fd00:0:1::12
client 2 public NAT: x.x.x.12, x:x::12
[ ]

The NAT runs in the VPN server, since this is kind of a router. We use nftables for this task. As the final win, I will describe how we manage all this configuration using the git version control system. Using git we can track which admin made which change. A git hook will deploy the files from the git repo itself to /etc/ so the services can read them. The VPN server networking configuration is as follows (/etc/network/interfaces file, adjust to your network environments):

auto lo
iface lo inet loopback
# main public IPv4 address of vpn.example.com
allow-hotplug eth0
iface eth0 inet static
        address x.x.x.4
        netmask 255.255.255.0
        gateway x.x.x.1
# main public IPv6 address of vpn.example.com
iface eth0 inet6 static
        address x:x:x:x::4
        netmask 64
        gateway x:x:x:x::1
# NAT Public IPv4 addresses (used to NAT tunnel of client 1)
auto eth0:11
iface eth0:11 inet static
        address x.x.x.11
        netmask 255.255.255.0
# NAT Public IPv6 addresses (used to NAT tunnel of client 1)
iface eth0:11 inet6 static
        address x:x:x:x::11
        netmask 64
# NAT Public IPv4 addresses (used to NAT tunnel of client 2)
auto eth0:12
iface eth0:12 inet static
        address x.x.x.12
        netmask 255.255.255.0
# NAT Public IPv6 addresses (used to NAT tunnel of client 2)
iface eth0:12 inet6 static
        address x:x:x:x::12
        netmask 64

Thanks to the amazing and tireless work of the Alberto Gonzalez Iniesta (DD), the openvpn package in debian is in very good shape, ready to use. In vpn.example.com, install the required packages:

% sudo aptitude install openvpn openvpn-auth-ldap nftables git sudo

Two git repositories will be used, one for the openvpn configuration and another for nftables (the nftables config is described later):

% sudo mkdir -p /srv/git/vpn.example.com-nft.git
% sudo git init --bare /srv/git/vpn.example.com-nft.git
% sudo mkdir -p /srv/git/vpn.example.com-openvpn.git
% sudo git init --bare /srv/git/vpn.example.com-openvpn.git
% sudo chown -R :git /srv/git/*
% sudo chmod -R g+rw /srv/git/*

The repositories belong to the git group, a system group we create to let systems admins operate the server using git:

% sudo addgroup --system git
% sudo adduser admin1 git
% sudo adduser admin2 git

For the openvpn git repository, we need at least this git hook (file /srv/git/vpn.example.com-openvpn.git/hooks/post-receive with execution permission):

#!/bin/bash
NAME="hooks/post-receive"
OPENVPN_ROOT="/etc/openvpn"
export GIT_WORK_TREE="$OPENVPN_ROOT"
UNAME=$(uname -n)
info()
 
        echo "$ UNAME  $ NAME  $1 ..."
 
info "checkout latest data to $GIT_WORK_TREE"
sudo git checkout -f
info "cleaning untracked files and dirs at $GIT_WORK_TREE"
sudo git clean -f -d

For this hook to work, sudo permissions are required (file /etc/sudoers.d/openvpn-git):

User_Alias      OPERATORS = admin1, admin2
Defaults        env_keep += "GIT_WORK_TREE"
 
OPERATORS       ALL=(ALL) NOPASSWD:/usr/bin/git checkout -f
OPERATORS       ALL=(ALL) NOPASSWD:/usr/bin/git clean -f -d

Please review this sudoers file to match your environment and security requirements. The openvpn package deploys several systemd services:

% dpkg -L openvpn   grep service
/lib/systemd/system/openvpn-client@.service
/lib/systemd/system/openvpn-server@.service
/lib/systemd/system/openvpn.service
/lib/systemd/system/openvpn@.service

We don t need all of them, we can use the simple openvpn.service:

% sudo systemctl edit --full openvpn.service

And put a content like this:

% systemctl cat openvpn.service
# /etc/systemd/system/openvpn.service
[Unit]
Description=OpenVPN server
Documentation=man:openvpn(8)
Documentation=https://community.openvpn.net/openvpn/wiki/Openvpn23ManPage
Documentation=https://community.openvpn.net/openvpn/wiki/HOWTO
 
[Service]
PrivateTmp=true
KillMode=mixed
Type=forking
ExecStart=/usr/sbin/openvpn --daemon ovpn --status /run/openvpn/%i.status 10 --cd /etc/openvpn --config /etc/openvpn/server.conf --writepid /run/openvpn/server.pid
PIDFile=/run/openvpn/server.pid
ExecReload=/bin/kill -HUP $MAINPID
WorkingDirectory=/etc/openvpn
ProtectSystem=yes
CapabilityBoundingSet=CAP_IPC_LOCK CAP_NET_ADMIN CAP_NET_BIND_SERVICE CAP_NET_RAW CAP_SETGID CAP_SETUID CAP_SYS_CHROOT CAP_DAC_READ_SEARCH CAP_AUDIT_WRITE
LimitNPROC=10
DeviceAllow=/dev/null rw
DeviceAllow=/dev/net/tun rw
 
[Install]
WantedBy=multi-user.target

We can move on now to configure nftables to perform the NATs. First, it s good to load the NAT configuration at boot time, so you need a service file like this (/etc/systemd/system/nftables.service):

[Unit]
Description=nftables
Documentation=man:nft(8) http://wiki.nftables.org
 
[Service]
Type=oneshot
RemainAfterExit=yes
StandardInput=null
ProtectSystem=full
ProtectHome=true
WorkingDirectory=/etc/nftables.d
ExecStart=/usr/sbin/nft -f ruleset.nft
ExecReload=/usr/sbin/nft -f ruleset.nft
ExecStop=/usr/sbin/nft flush ruleset
 
[Install]
WantedBy=multi-user.target

The nftables git hooks are implemented as described in nftables managed with git. We are interested in the git hooks: (file /srv/git/vpn.example.com-nft.git/hooks/post-receive):

#!/bin/bash
NAME="hooks/post-receive"
NFT_ROOT="/etc/nftables.d"
RULESET="$ NFT_ROOT /ruleset.nft"
export GIT_WORK_TREE="$NFT_ROOT"
UNAME=$(uname -n)
info()
 
        echo "$ UNAME  $ NAME  $1 ..."
 
info "checkout latest data to $GIT_WORK_TREE"
sudo git checkout -f
info "cleaning untracked files and dirs at $GIT_WORK_TREE"
sudo git clean -f -d
info "deploying new ruleset"
set -e
cd $NFT_ROOT && sudo nft -f $RULESET
info "new ruleset deployment was OK"

This hook moves our nftables configuration to /etc/nftables.d and then applies it to the kernel. So a single commit changes the runtime configuration of the server. You could implement some QA using the git hook update, check this file! Remember, git hooks requires exec permissions to work. Of course, you will need again a sudo policy for these nft hooks. Finally, we can start configuring both openvpn and nftables using git. For the VPN you will require the configure the PKI side: server certificates, and the CA signing your client s certificates. You can check openvpn s own documentation about this. Your first commit for openvpn could be the server.conf file:

plugin		/usr/lib/openvpn/openvpn-plugin-auth-pam.so common-auth
mode		server
user		nobody
group		nogroup
port		1194
proto		udp6
daemon
comp-lzo
persist-key
persist-tun
tls-server
cert		/etc/ssl/private/vpn.example.com_pub.crt
key		/etc/ssl/private/vpn.example.com_priv.pem
ca		/etc/ssl/cacert/clients_ca.pem
dh		/etc/ssl/certs/dh2048.pem
cipher		AES-128-CBC
dev		tun
topology	subnet
server		192.168.100.0 255.255.255.0
server-ipv6	fd00:0:1:35::/64
ccd-exclusive
client-config-dir ccd
max-clients	100
inactive	43200
keepalive	10 360
log-append	/var/log/openvpn.log
status		/var/log/openvpn-status.log
status-version	1
verb		4
mute		20

Don t forget the ccd/ directory. This directory contains a file per user using the VPN service. Each file is named after the CN of the client certificate:

# private addresses for client 1
ifconfig-push		192.168.100.11 255.255.255.0
ifconfig-ipv6-push	fd00:0:1::11/64
# routes to the intranet network
push "route-ipv6 x:x:x:x::/64"
push "route x.x.3.128 255.255.255.240"

# private addresses for client 2
ifconfig-push		192.168.100.12 255.255.255.0
ifconfig-ipv6-push	fd00:0:1::12/64
# routes to the intranet network
push "route-ipv6 x:x:x:x::/64"
push "route x.x.3.128 255.255.255.240"

You end with at leats these files in the openvpn git tree:

server.conf
ccd/CN=CLIENT_1
ccd/CN=CLIENT_2

Please note that if you commit a change to ccd/, the changes are read at runtime by openvpn. In the other hand, changes to server.conf require you to restart the openvpn service by hand. Remember, the addressing is like this:

(DIA source file) In the nftables git tree, you should put a ruleset like this (a single file named ruleset.nft is valid):

flush ruleset
table ip nat  
	map mapping_ipv4_snat  
		type ipv4_addr : ipv4_addr
		elements =  	192.168.100.11 : x.x.x.11,
				192.168.100.12 : x.x.x.12  
	 
	map mapping_ipv4_dnat  
		type ipv4_addr : ipv4_addr
		elements =  	x.x.x.11 : 192.168.100.11,
				x.x.x.12 : 192.168.100.12  
	 
	chain prerouting  
		type nat hook prerouting priority -100; policy accept;
		dnat to ip daddr map @mapping_ipv4_dnat
	 
	chain postrouting  
		type nat hook postrouting priority 100; policy accept;
		oifname "eth0" snat to ip saddr map @mapping_ipv4_snat
	 
 
table ip6 nat  
	map mapping_ipv6_snat  
		type ipv6_addr : ipv6_addr
		elements =  	fd00:0:1::11 : x:x:x::11,
				fd00:0:1::12 : x:x:x::12  
	 
	map mapping_ipv6_dnat  
		type ipv6_addr : ipv6_addr
		elements =  	x:x:x::11 : fd00:0:1::11,
				x:x:x::12 : fd00:0:1::12  
	 
	chain prerouting  
		type nat hook prerouting priority -100; policy accept;
		dnat to ip6 daddr map @mapping_ipv6_dnat
	 
	chain postrouting  
		type nat hook postrouting priority 100; policy accept;
		oifname "eth0" snat to ip6 saddr map @mapping_ipv6_snat
	 
 
table inet filter  
	chain forward  
		type filter hook forward priority 0; policy accept;
		# some forwarding filtering policy, if required, for both IPv4 and IPv6

Since the server is in fact routing packets between the tunnel and the public network, we require forwarding enabled in sysctl:

net.ipv4.conf.all.forwarding = 1
net.ipv6.conf.all.forwarding = 1

Of course, the VPN clients will require a client.conf file which looks like this:

client
remote vpn.example.com 1194
dev tun
proto udp
resolv-retry infinite
comp-lzo
verb 5
nobind
persist-key
persist-tun
user nobody
group nogroup
 
tls-client
ca      /etc/ssl/cacert/server_ca.crt
pkcs12  /home/user/mycertificate.p12
verify-x509-name vpn.example.com name
cipher AES-128-CBC
auth-user-pass
auth-nocache

Workflow for the system admins:

git clone the openvpn repo
modify ccd/ and server.conf
git commit the changes, push to the server
if server.conf was modified, restart openvpn
git clone the nftables repo
modify ruleset
git commit the changes, push to the server

Comments via email welcome!

8 February 2017

Alberto Garc a: QEMU and the qcow2 metadata checks

When choosing a disk image format for your virtual machine one of the factors to take into considerations is its I/O performance. In this post I ll talk a bit about the internals of qcow2 and about one of the aspects that can affect its performance under QEMU: its consistency checks. As you probably know, qcow2 is QEMU s native file format. The first thing that I d like to highlight is that this format is perfectly fine in most cases and its I/O performance is comparable to that of a raw file. When it isn t, chances are that this is due to an insufficiently large L2 cache. In one of my previous blog posts I wrote about the qcow2 L2 cache and how to tune it, so if your virtual disk is too slow, you should go there first. I also recommend Max Reitz and Kevin Wolf s qcow2: why (not)? talk from KVM Forum 2015, where they talk about a lot of internal details and show some performance tests. qcow2 clusters: data and metadata A qcow2 file is organized into units of constant size called clusters. The cluster size defaults to 64KB, but a different value can be set when creating a new image: qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G Clusters can contain either data or metadata. A qcow2 file grows dynamically and only allocates space when it is actually needed, so apart from the header there s no fixed location for any of the data and metadata clusters: they can appear mixed anywhere in the file. Here s an example of what it looks like internally:

In this example we can see the most important types of clusters that a qcow2 file can have:

Header: this one contains basic information such as the virtual size of the image, the version number, and pointers to where the rest of the metadata is located, among other things.
Data clusters: the data that the virtual machine sees.
L1 and L2 tables: a two-level structure that maps the virtual disk that the guest can see to the actual location of the data clusters in the qcow2 file.
Refcount table and blocks: a two-level structure with a reference count for each data cluster. Internal snapshots use this: a cluster with a reference count >= 2 means that it s used by other snapshots, and therefore any modifications require a copy-on-write operation.

Metadata overlap checks In order to detect corruption when writing to qcow2 images QEMU (since v1.7) performs several sanity checks. They verify that QEMU does not try to overwrite sections of the file that are already being used for metadata. If this happens, the image is marked as corrupted and further access is prevented. Although in most cases these checks are innocuous, under certain scenarios they can have a negative impact on disk write performance. This depends a lot on the case, and I want to insist that in most scenarios it doesn t have any effect. When it does, the general rule is that you ll have more chances of noticing it if the storage backend is very fast or if the qcow2 image is very large. In these cases, and if I/O performance is critical for you, you might want to consider tweaking the images a bit or disabling some of these checks, so let s take a look at them. There are currently eight different checks. They re named after the metadata sections that they check, and can be divided into the following categories:

Checks that run in constant time. These are equally fast for all kinds of images and I don t think they re worth disabling.
- main-header
- active-l1
- refcount-table
- snapshot-table
Checks that run in variable time but don t need to read anything from disk.
- refcount-block
- active-l2
- inactive-l1
Checks that need to read data from disk. There is just one check here and it s only needed if there are internal snapshots.
- inactive-l2

By default all tests are enabled except for the last one (inactive-l2), because it needs to read data from disk. Disabling the overlap checks Tests can be disabled or enabled from the command line using the following syntax: -drive file=hd.qcow2,overlap-check.inactive-l2=on
-drive file=hd.qcow2,overlap-check.snapshot-table=off It s also possible to select the group of checks that you want to enable using the following syntax: -drive file=hd.qcow2,overlap-check.template=none
-drive file=hd.qcow2,overlap-check.template=constant
-drive file=hd.qcow2,overlap-check.template=cached
-drive file=hd.qcow2,overlap-check.template=all Here, none means that no tests are enabled, constant enables all tests from group 1, cached enables all tests from groups 1 and 2, and all enables all of them. As I explained in the previous section, if you re worried about I/O performance then the checks that are probably worth evaluating are refcount-block, active-l2 and inactive-l1. I m not counting inactive-l2 because it s off by default. Let s look at the other three:

inactive-l1: This is a variable length check because it depends on the number of internal snapshots in the qcow2 image. However its performance impact is likely to be negligible in all cases so I don t think it s worth bothering with.
active-l2: This check depends on the virtual size of the image, and on the percentage that has already been allocated. This check might have some impact if the image is very large (several hundred GBs or more). In that case one way to deal with it is to create an image with a larger cluster size. This also has the nice side effect of reducing the amount of memory needed for the L2 cache.
refcount-block: This check depends on the actual size of the qcow2 file and it s independent from its virtual size. This check is relatively expensive even for small images, so if you notice performance problems chances are that they are due to this one. The good news is that we have been working on optimizing it, so if it s slowing down your VMs the problem might go away completely in QEMU 2.9.

Conclusion The qcow2 consistency checks are useful to detect data corruption, but they can affect write performance. If you re unsure and you want to check it quickly, open an image with overlap-check.template=none and see for yourself, but remember again that this will only affect write operations. To obtain more reliable results you should also open the image with cache=none in order to perform direct I/O and bypass the page cache. I ve seen performance increases of 50% and more, but whether you ll see them depends a lot on your setup. In many cases you won t notice any difference. I hope this post was useful to learn a bit more about the qcow2 format. There are other things that can help QEMU perform better, and I ll probably come back to them in future posts, so stay tuned! Acknowledgments My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the rest of the QEMU development team.

20 September 2016

Reproducible builds folks: Reproducible Builds: week 73 in Stretch cycle

What happened in the Reproducible Builds effort between Sunday September 11 and Saturday September 17 2016: Toolchain developments Ximin Luo started a new series of tools called (for now) debrepatch, to make it easier to automate checks that our old patches to Debian packages still apply to newer versions of those packages, and still make these reproducible. Ximin Luo updated one of our few remaining patches for dpkg in #787980 to make it cleaner and more minimal. The following tools were fixed to produce reproducible output:

naturaldocs/1.51-2 by Petter Reinholdtsen, original patch by Chris Lamb.

Packages reviewed and fixed, and bugs filed The following updated packages have become reproducible - in our current test setup - after being fixed:

elog/3.1.2-1-1 by Roger Kalt, original patch by Reiner Herrmann.
eyed3/0.6.18-3 by Petter Reinholdtsen, original patch by Chris Lamb.
frog/0.13.5-1 by Maarten van Gompel, original patch by Chris Lamb.
gtranslator/2.91.7-3 by Andreas Henriksson, original patch by Reiner Herrmann.
sozi/12.05-1.1 by Daniel Kahn Gillmor, original patch by Chris Lamb.

The following updated packages appear to be reproducible now, for reasons we were not able to figure out. (Relevant changelogs did not mention reproducible builds.)

evince/3.21.92-1 by Michael Biebl.
gnome-control-center/1:3.21.92-2 by Rapha l Hertzog.
libipathverbs/1.3-2 by Ana Beatriz Guerrero Lopez.
pagekite/0.5.8e-2 by Petter Reinholdtsen.

The following 3 packages were not changed, but have become reproducible due to changes in their build-dependencies: jaxrs-api python-lua zope-mysqlda. Some uploads have addressed some reproducibility issues, but not all of them:

eurephia/1.1.0-6 by Alberto Gonzalez Iniesta, original patch by Chris Lamb.
fdroidserver/0.7.0-1 by Hans-Christoph Steiner, original patch by Chris Lamb.
mini-buildd/1.0.18 by Stephan S rken.
nbc/1.2.1.r4+dfsg-3 by Petter Reinholdtsen, original patch by Chris Lamb.
ncurses/6.0+20160910-1 by Sven Joachim, #818067 by Niels Thykier.
python-kinterbasdb/3.3.0-4 by Santiago Vila, original patch by Chris Lamb.
snapper/0.3.3-1 Hideki Yamane, original patch by Sascha Steinbiss.

Patches submitted that have not made their way to the archive yet:

#838188 filed against ocaml by Johannes Schauer.

Reviews of unreproducible packages 462 package reviews have been added, 524 have been updated and 166 have been removed in this week, adding to our knowledge about identified issues. 25 issue types have been updated:

Added a new annotation for issues called "fix-deterministic" to help us update package reviews more easily. This indicates whether we expect that an issue would always happen on Jenkins; i.e. if there is a successful build, then we know the issue is fixed for that package and can update our notes.
Added random_order_in_sisu_javax_inject_named and too_much_input_for_diff.
Removed timestamps_in_manpages_generated_by_ronn.
Updated timestamps_in_allegro_dat_files. Additionally, 21 issues were marked with "fix-deterministic".

Weekly QA work FTBFS bugs have been reported by:

Chris Lamb (10)
Filip Pytloun (1)
Santiago Vila (1)

diffoscope development A new version of diffoscope 60 was uploaded to unstable by Mattia Rizzolo. It included contributions from:

Mattia Rizzolo:
- Various packaging and testing improvements.
HW42:
- minor wording fixes
Reiner Herrmann:
- minor wording fixes

It also included from changes previous weeks; see either the changes or commits linked above, or previous blog posts 72 71 70. strip-nondeterminism development New versions of strip-nondeterminism 0.027-1 and 0.028-1 were uploaded to unstable by Chris Lamb. It included contributions from:

Chris Lamb:
- Testing improvements, including better handling of timezones.

disorderfs development A new version of disorderfs 0.5.1 was uploaded to unstable by Chris Lamb. It included contributions from:

Andrew Ayer and Chris Lamb:
- Support relative paths for ROOTDIR; it no longer needs to be an absolute path.
Chris Lamb:
- Print the behaviour (shuffle/reverse/sort) on startup to stdout.

It also included from changes previous weeks; see either the changes or commits linked above, or previous blog posts 70. Misc. This week's edition was written by Ximin Luo and reviewed by a bunch of Reproducible Builds folks on IRC.

24 May 2016

Alberto Garc a: I/O bursts with QEMU 2.6

QEMU 2.6 was released a few days ago. One new feature that I have been working on is the new way to configure I/O limits in disk drives to allow bursts and increase the responsiveness of the virtual machine. In this post I ll try to explain how it works. The basic settings First I will summarize the basic settings that were already available in earlier versions of QEMU. Two aspects of the disk I/O can be limited: the number of bytes per second and the number of operations per second (IOPS). For each one of them the user can set a global limit or separate limits for read and write operations. This gives us a total of six different parameters. I/O limits can be set using the throttling.* parameters of -drive, or using the QMP block_set_io_throttle command. These are the names of the parameters for both cases:

-drive	block_set_io_throttle
throttling.iops-total	iops
throttling.iops-read	iops_rd
throttling.iops-write	iops_wr
throttling.bps-total	bps
throttling.bps-read	bps_rd
throttling.bps-write	bps_wr

It is possible to set limits for both IOPS and bps at the same time, and for each case we can decide whether to have separate read and write limits or not, but if iops-total is set then neither iops-read nor iops-write can be set. The same applies to bps-total and bps-read/write. The default value of these parameters is 0, and it means unlimited. In its most basic usage, the user can add a drive to QEMU with a limit of, say, 100 IOPS with the following -drive line:

-drive file=hd0.qcow2,throttling.iops-total=100

We can do the same using QMP. In this case all these parameters are mandatory, so we must set to 0 the ones that we don t want to limit:

     "execute": "block_set_io_throttle",
     "arguments":  
        "device": "virtio0",
        "iops": 100,
        "iops_rd": 0,
        "iops_wr": 0,
        "bps": 0,
        "bps_rd": 0,
        "bps_wr": 0

I/O bursts While the settings that we have just seen are enough to prevent the virtual machine from performing too much I/O, it can be useful to allow the user to exceed those limits occasionally. This way we can have a more responsive VM that is able to cope better with peaks of activity while keeping the average limits lower the rest of the time. Starting from QEMU 2.6, it is possible to allow the user to do bursts of I/O for a configurable amount of time. A burst is an amount of I/O that can exceed the basic limit, and there are two parameters that control them: their length and the maximum amount of I/O they allow. These two can be configured separately for each one of the six basic parameters described in the previous section, but here we ll use iops-total as an example. The I/O limit during bursts is set using iops-total-max , and the maximum length (in seconds) is set with iops-total-max-length . So if we want to configure a drive with a basic limit of 100 IOPS and allow bursts of 2000 IOPS for 60 seconds, we would do it like this (the line is split for clarity):

   -drive file=hd0.qcow2,
          throttling.iops-total=100,
          throttling.iops-total-max=2000,
          throttling.iops-total-max-length=60

Or with QMP:

     "execute": "block_set_io_throttle",
     "arguments":  
        "device": "virtio0",
        "iops": 100,
        "iops_rd": 0,
        "iops_wr": 0,
        "bps": 0,
        "bps_rd": 0,
        "bps_wr": 0,
        "iops_max": 2000,
        "iops_max_length": 60,

With this, the user can perform I/O on hd0.qcow2 at a rate of 2000 IOPS for 1 minute before it s throttled down to 100 IOPS. The user will be able to do bursts again if there s a sufficiently long period of time with unused I/O (see below for details). The default value for iops-total-max is 0 and it means that bursts are not allowed. iops-total-max-length can only be set if iops-total-max is set as well, and its default value is 1 second. Controlling the size of I/O operations When applying IOPS limits all I/O operations are treated equally regardless of their size. This means that the user can take advantage of this in order to circumvent the limits and submit one huge I/O request instead of several smaller ones. QEMU provides a setting called throttling.iops-size to prevent this from happening. This setting specifies the size (in bytes) of an I/O request for accounting purposes. Larger requests will be counted proportionally to this size. For example, if iops-size is set to 4096 then an 8KB request will be counted as two, and a 6KB request will be counted as one and a half. This only applies to requests larger than iops-size: smaller requests will be always counted as one, no matter their size. The default value of iops-size is 0 and it means that the size of the requests is never taken into account when applying IOPS limits. Applying I/O limits to groups of disks In all the examples so far we have seen how to apply limits to the I/O performed on individual drives, but QEMU allows grouping drives so they all share the same limits. This feature is available since QEMU 2.4. Please refer to the post I wrote when it was published for more details. The Leaky Bucket algorithm I/O limits in QEMU are implemented using the leaky bucket algorithm (specifically the Leaky bucket as a meter variant). This algorithm uses the analogy of a bucket that leaks water constantly. The water that gets into the bucket represents the I/O that has been performed, and no more I/O is allowed once the bucket is full. To see the way this corresponds to the throttling parameters in QEMU, consider the following values:

  iops-total=100
  iops-total-max=2000
  iops-total-max-length=60

Water leaks from the bucket at a rate of 100 IOPS.
Water can be added to the bucket at a rate of 2000 IOPS.
The size of the bucket is 2000 x 60 = 120000.
If iops-total-max is unset then the bucket size is 100.

The bucket is initially empty, therefore water can be added until it s full at a rate of 2000 IOPS (the burst rate). Once the bucket is full we can only add as much water as it leaks, therefore the I/O rate is reduced to 100 IOPS. If we add less water than it leaks then the bucket will start to empty, allowing for bursts again. Note that since water is leaking from the bucket even during bursts, it will take a bit more than 60 seconds at 2000 IOPS to fill it up. After those 60 seconds the bucket will have leaked 60 x 100 = 6000, allowing for 3 more seconds of I/O at 2000 IOPS. Also, due to the way the algorithm works, longer burst can be done at a lower I/O rate, e.g. 1000 IOPS during 120 seconds. Acknowledgments As usual, my work in QEMU is sponsored by Outscale and has been made possible by Igalia and the help of the QEMU development team.

Enjoy QEMU 2.6!

17 December 2015

Alberto Garc a: Improving disk I/O performance in QEMU 2.5 with the qcow2 L2 cache

QEMU 2.5 has just been released, with a lot of new features. As with the previous release, we have also created a video changelog. I plan to write a few blog posts explaining some of the things I have been working on. In this one I m going to talk about how to control the size of the qcow2 L2 cache. But first, let s see why that cache is useful. The qcow2 file format qcow2 is the main format for disk images used by QEMU. One of the features of this format is that its size grows on demand, and the disk space is only allocated when it is actually needed by the virtual machine. A qcow2 file is organized in units of constant size called clusters. The virtual disk seen by the guest is also divided into guest clusters of the same size. QEMU defaults to 64KB clusters, but a different value can be specified when creating a new image: qemu-img create -f qcow2 -o cluster_size=128K hd.qcow2 4G In order to map the virtual disk as seen by the guest to the qcow2 image in the host, the qcow2 image contains a set of tables organized in a two-level structure. These are called the L1 and L2 tables. There is one single L1 table per disk image. This table is small and is always kept in memory. There can be many L2 tables, depending on how much space has been allocated in the image. Each table is one cluster in size. In order to read or write data to the virtual disk, QEMU needs to read its corresponding L2 table to find out where that data is located. Since reading the table for each I/O operation can be expensive, QEMU keeps a cache of L2 tables in memory to speed up disk access.

The L2 cache can have a dramatic impact on performance. As an example, here s the number of I/O operations per second that I get with random read requests in a fully populated 20GB disk image:

L2 cache size	Average IOPS
1 MB	5100
1,5 MB	7300
2 MB	12700
2,5 MB	63600

If you re using an older version of QEMU you might have trouble getting the most out of the qcow2 cache because of this bug, so either upgrade to at least QEMU 2.3 or apply this patch. (in addition to the L2 cache, QEMU also keeps a refcount cache. This is used for cluster allocation and internal snapshots, but I m not covering it in this post. Please refer to the qcow2 documentation if you want to know more about refcount tables) Understanding how to choose the right cache size In order to choose the cache size we need to know how it relates to the amount of allocated space. The amount of virtual disk that can be mapped by the L2 cache (in bytes) is: disk_size = l2_cache_size * cluster_size / 8 With the default values for cluster_size (64KB) that is disk_size = l2_cache_size * 8192 So in order to have a cache that can cover n GB of disk space with the default cluster size we need l2_cache_size = disk_size_GB * 131072 QEMU has a default L2 cache of 1MB (1048576 bytes) so using the formulas we ve just seen we have 1048576 / 131072 = 8 GB of virtual disk covered by that cache. This means that if the size of your virtual disk is larger than 8 GB you can speed up disk access by increasing the size of the L2 cache. Otherwise you ll be fine with the defaults. How to configure the cache size Cache sizes can be configured using the -drive option in the command-line, or the blockdev-add QMP command. There are three options available, and all of them take bytes:

l2-cache-size: maximum size of the L2 table cache
refcount-cache-size: maximum size of the refcount block cache
cache-size: maximum size of both caches combined

There are two things that need to be taken into account:

Both the L2 and refcount block caches must have a size that is a multiple of the cluster size.
If you only set one of the options above, QEMU will automatically adjust the others so that the L2 cache is 4 times bigger than the refcount cache.

This means that these three options are equivalent: -drive file=hd.qcow2,l2-cache-size=2097152 -drive file=hd.qcow2,refcount-cache-size=524288 -drive file=hd.qcow2,cache-size=2621440 Although I m not covering the refcount cache here, it s worth noting that it s used much less often than the L2 cache, so it s perfectly reasonable to keep it small: -drive file=hd.qcow2,l2-cache-size=4194304,refcount-cache-size=262144 Reducing the memory usage The problem with a large cache size is that it obviously needs more memory. QEMU has a separate L2 cache for each qcow2 file, so if you re using many big images you might need a considerable amount of memory if you want to have a reasonably sized cache for each one. The problem gets worse if you add backing files and snapshots to the mix. Consider this scenario:

Here, hd0 is a fully populated disk image, and hd1 a freshly created image as a result of a snapshot operation. Reading data from this virtual disk will fill up the L2 cache of hd0, because that s where the actual data is read from. However hd0 itself is read-only, and if you write data to the virtual disk it will go to the active image, hd1, filling up its L2 cache as a result. At some point you ll have in memory cache entries from hd0 that you won t need anymore because all the data from those clusters is now retrieved from hd1. Let s now create a new live snapshot:

Now we have the same problem again. If we write data to the virtual disk it will go to hd2 and its L2 cache will start to fill up. At some point a significant amount of the data from the virtual disk will be in hd2, however the L2 caches of hd0 and hd1 will be full as a result of the previous operations, even if they re no longer needed. Imagine now a scenario with several virtual disks and a long chain of qcow2 images for each one of them. See the problem? I wanted to improve this a bit so I was working on a new setting that allows the user to reduce the memory usage by cleaning unused cache entries when they are not being used. This new setting is available in QEMU 2.5, and is called cache-clean-interval . It defines an interval (in seconds) after which all cache entries that haven t been accessed are removed from memory. This example removes all unused cache entries every 15 minutes: -drive file=hd.qcow2,cache-clean-interval=900 If unset, the default value for this parameter is 0 and it disables this feature. Further information In this post I only intended to give a brief summary of the qcow2 L2 cache and how to tune it in order to increase the I/O performance, but it is by no means an exhaustive description of the disk format. If you want to know more about the qcow2 format here s a few links:

The qcow2 file specification, from the QEMU repository.
qcow2 L2/refcount cache configuration, the original text this post is based on. It also covers the refcount block cache.
The reinvention of qcow2, a presentation (slides and video) by Kevin Wolf at KVM Forum 2011.
qcow2 why (not)?, a presentation (slides and video) by Max Reitz and Kevin Wolf at KVM Forum 2015.

Acknowledgments My work in QEMU is sponsored by Outscale and has been made possible by Igalia and the invaluable help of the QEMU development team.

Enjoy QEMU 2.5!

30 October 2015

Laura Arjona: Look at that nice looking FreedomBox!

This is a guest post by Alberto Fuentes, Debian contributor. Thanks!! I m rebuilding one my home server and decided to take a look at the FreedomBox project as the base for it. The 0.6 version was recently released and I wasn t aware of how advanced the project is already! They have a virtualbox image ready for some quick test. It took me longer to download it than to start using it. Here s is a pic of what it looks like to entice you to try it :)

All this is already on debian right now and you can turn any debian sid installation into a FreedomBox just by installing a package. The setup generates everything private on the first run, so even the virtualbox image can be used as the final thing They use Plinth (Django) to integrate the applications into the web interface. More info on how to help integrate more Debian packages here. A live demo is going to be streamed this Friday 30 Oct 2015 and a hackaton is scheduled for this saturday 31 Oct 2015. Cheers!
Filed under: My experiences and opinion Tagged: Debian, English, Freedom, FreedomBox, selfhosting

14 August 2015

Alberto Garc a: I/O limits for disk groups in QEMU 2.4

QEMU 2.4.0 has just been released, and among many other things it comes with some of the stuff I have been working on lately. In this blog post I am going to talk about disk I/O limits and the new feature to group several disks together. Disk I/O limits Disk I/O limits allow us to control the amount of I/O that a guest can perform. This is useful for example if we have several VMs in the same host and we want to reduce the impact they have on each other if the disk usage is very high. The I/O limits can be set using the QMP command block_set_io_throttle, or with the command line using the throttling.* options for the -drive parameter (in brackets in the examples below). Both the throughput and the number of I/O operations can be limited. For a more fine-grained control, the limits of each one of them can be set on read operations, write operations, or the combination of both:

bps (throttling.bps-total): Total throughput limit (in bytes/second).
bps_rd (throttling.bps-read): Read throughput limit.
bps_wr (throttling.bps-write): Write throughput limit.
iops (throttling.iops-total): Total I/O operations per second.
iops_rd (throttling.iops-read): Read I/O operations per second.
iops_wr (throttling.iops-write): Write I/O operations per second.

Example:

-drive if=virtio,file=hd1.qcow2,throttling.bps-write=52428800,throttling.iops-total=6000

In addition to that, it is also possible to configure the maximum burst size, which defines a pool of I/O that the guest can perform without being limited:

bps_max (throttling.bps-total-max): Total maximum (in bytes).
bps_rd_max (throttling.bps-read-max): Read maximum.
bps_wr_max (throttling.bps-write-max): Write maximum.
iops_max (throttling.iops-total-max): Total maximum of I/O operations.
iops_rd_max (throttling.iops-read-max): Read I/O operations.
iops_wr_max (throttling.iops-write-max): Write I/O operations.

One additional parameter named iops_size allows us to deal with the case where big I/O operations can be used to bypass the limits we have set. In this case, if a particular I/O operation is bigger than iops_size then it is counted several times when it comes to calculating the I/O limits. So a 128KB request will be counted as 4 requests if iops_size is 32KB.

iops_size (throttling.iops-size): Size of an I/O request (in bytes).

Group throttling All of these parameters I ve just described operate on individual disk drives and have been available for a while. Since QEMU 2.4 however, it is also possible to have several drives share the same limits. This is configured using the new group parameter. The way it works is that each disk with I/O limits is member of a throttle group, and the limits apply to the combined I/O of all group members using a round-robin algorithm. The way to put several disks together is just to use the group parameter with all of them using the same group name. Once the group is set, there s no need to pass the parameter to block_set_io_throttle anymore unless we want to move the drive to a different group. Since the I/O limits apply to all group members, it is enough to use block_set_io_throttle in just one of them. Here s an example of how to set groups using the command line:

-drive if=virtio,file=hd1.qcow2,throttling.iops-total=6000,throttling.group=foo
-drive if=virtio,file=hd2.qcow2,throttling.iops-total=6000,throttling.group=foo
-drive if=virtio,file=hd3.qcow2,throttling.iops-total=3000,throttling.group=bar
-drive if=virtio,file=hd4.qcow2,throttling.iops-total=6000,throttling.group=foo
-drive if=virtio,file=hd5.qcow2,throttling.iops-total=3000,throttling.group=bar
-drive if=virtio,file=hd6.qcow2,throttling.iops-total=5000

In this example, hd1, hd2 and hd4 are all members of a group named foo with a combined IOPS limit of 6000, and hd3 and hd5 are members of bar. hd6 is left alone (technically it is part of a 1-member group). Next steps I am currently working on providing more I/O statistics for disk drives, including latencies and average queue depth on a user-defined interval. The code is almost ready. Next week I will be in Seattle for the KVM Forum where I will hopefully be able to finish the remaining bits.

I will also attend LinuxCon North America. Igalia is sponsoring the event and we have a booth there. Come if you want to talk to us or see our latest demos with WebKit for Wayland. See you in Seattle!

21 July 2014

Chris Lamb: Disabling internet for specific processes with libfiu

My primary usecase is to prevent testsuites and build systems from contacting internet-based services. This, at the very least, introduces an element of non-determinism and malicious code at worst. I use Alberto Bertogli's libfiu for this, specifically the fiu-run utility which part of the fiu-utils package on Debian and Ubuntu. Here's a contrived example, where I prevent Curl from talking to the internet:

$ fiu-run -x -c 'enable name=posix/io/net/connect' curl google.com
curl: (6) Couldn't resolve host 'google.com'

... and here's an example of it detecting two possibly internet-connecting tests:

$ fiu-run -x -c 'enable name=posix/io/net/connect' ./manage.py text
[..]
----------------------------------------------------------------------
Ran 892 tests in 2.495s
FAILED (errors=2)
Destroying test database for alias 'default'...

Note that libfiu inherits all the drawbacks of LD_PRELOAD; in particular, we cannot limit the child process that calls setuid binaries such as /bin/ping:

$ fiu-run -x -c 'enable name=posix/io/net/connect' ping google.com
PING google.com (173.194.41.65) 56(84) bytes of data.
64 bytes from lhr08s01.1e100.net (17.194.41.65): icmp_req=1 ttl=57 time=21.7 ms
64 bytes from lhr08s01.1e100.net (17.194.41.65): icmp_req=2 ttl=57 time=18.9 ms
[..]

Whilst it would certainly be more robust and flexible to use iptables such as allowing localhost and other local socket connections but disabling all others I gravitate towards this entirely userspace solution as it requires no setup and I can quickly modify it to block other calls on an ad-hoc basis. The list of other "modules" libfiu supports is viewable here.

1 April 2014

Lars Wirzenius: Obnam 1.7.4 release (backup software)

I have just released version 1.7.4 of Obnam, my backup program. Actually, the release date was yesterday, but I had trouble building the binaries. Version 1.7.4, released 2014-03-31

The manual is now dual-licensed under GNU GPL v3 or later, and Creative Commons CC-BY-SA 4.0.
The 1.7.3 release never went out. Let's pretend it wasn't even tagged in git, and everyone will be happy.

Bug fixes:

Obnam FUSE got another bug fix from Valery Yundin, to fix a bug I introduced in 1.7. Reading big files via obnam mount should now work better.
Fix count of backed up files. It used to always count directories. Reported by Alberto Fuentes as Debian bug 742384.
obnam diff latest would fail due to a programming error. Reported by Junyx.

13 February 2014

Steve Kemp: Secure your rsync shares, please.

Recently I started doing a internet-wide scan for rsync servers, thinking it might be fun to write a toy search-engine/indexer. Even the basics such as searching against the names of exported shares would be interesting, I thought. Today I abandoned that after exploring some of the results, (created with zmap), because there's just too much private data out there, wide open IP redacted for obvious reason:

shelob ~ $ rsync  rsync://xx.xx.xx.xx/
ginevra        	Ginevra backup
krsna          	Alberto Laptop Backup
franziska      	Franz Laptop Backup
genoveffa      	Franz Laptop Backup 2

Some nice shares there. Lets see if they're as open as they appear to be:

shelob ~ $ rsync  rsync://xx.xx.xx.xx/ginevra/home/
drwxrwsr-x        4096 2013/10/30 13:42:29 .
drwxr-sr-x        4096 2009/02/03 10:32:27 abl
drwxr-s---       12288 2014/02/12 20:05:22 alberto
drwxr-xr-x        4096 2011/12/13 17:12:46 alessandra
drwxr-sr-x       20480 2014/02/12 22:55:01 backup
drwxr-xr-x        4096 2008/10/03 14:51:29 bertacci
..

Yup. Backups of /home, /etc/, and more. I found numerous examples of this, along with a significant number of hosts that exported "www" + "sql", as a pair, and a large number of hosts that just exported "squid/". I assume they must be some cpanel-like system, because I can't understand why thousands of people would export the same shares with the same comments otherwise. I still would like to run the indexer, but with so much easy content to steal, well I think the liability would kill me. I considered not posting this, but I suspect "bad people" already know..,

28 November 2012

Alberto Garc a: QEMU and open hardware: SPEC and FMC TDC

Working with open hardware Some weeks ago at LinuxCon EU in Barcelona I talked about how to use QEMU to improve the reliability of device drivers. At Igalia we have been using this for some projects. One of them is the Linux IndustryPack driver. For this project I virtualized two boards: the TEWS TPCI200 PCI carrier and the GE IP-Octal 232 module. This work helped us find some bugs in the device driver and improve its quality. Now, those two boards are examples of products available in the market. But fortunately we can use the same approach to develop for hardware that doesn t exist yet, or is still in a prototype phase. Such is the case of a project we are working on: adding Linux support for this FMC Time-to-digital converter.

This piece of hardware is designed by CERN and is published under the CERN Open Hardware Licence, which, in their own words is to hardware what the General Public Licence (GPL) is to software . The Open Hardware repository hosts a number of projects that have been published under this license. Why we use QEMU So we are developing the device driver for this hardware, as my colleague Samuel explains in his blog. I m the responsible of virtualizing it using QEMU. There are two main reasons why we want to do this:

Limited availability of the hardware: although the specification is pretty much ready, this is still a prototype. The board is not (yet) commercially available. With virtual hardware, the whole development team can have as many boards as it needs.
Testing: we can test the software against the virtual driver, force all kinds of conditions and scenarios, including the ones that would probably require us to physically damage the board.

While the first point might be the most obvious one, testing the software is actually the one we re more interested in. My colleague Miguel wrote a detailed blog post on how we have been using QEMU to do testing. Writing the virtual hardware Writing a virtual version of a particular piece of hardware for this purpose is not as hard as it might look. First, the point is not to reproduce accurately how the hardware works, but rather how it behaves from the operating system point of view: the hardware is a black box that the OS talks to. Second, it s not necessary to have a complete emulation of the hardware, there s no need to support every single feature, particularly if your software is not going to use it. The emulation can start with the basic functionality and then grow as needed. The FMC TDC, for example, is an FMC card which is in our case connected to a PCIe bridge called SPEC (also available in the Open Hardware repository). We need to emulate both cards in order to have a working system, but the emulation is, at the moment, treating both as if they were just one, which makes it a bit easier to have a prototype and from the device driver point of view doesn t really make a difference. Later the emulation can be split in two as I did with with TPCI200 and IP-Octal 232. This would allow us to support more FMC hardware without having to rewrite the bridging code. There s also code in the emulation to force different kind of scenarios that we are using to test if the driver behaves as expected and handles errors correctly. Those tests include the simulation of input in the any of the lines, simulation of noise, DMA errors, etc.

And we have written a set of test cases and a continuous integration system, so the driver is automatically tested every time the code is updated. If you want details on this I recommend you again to read Miguel s post.

5 November 2012

Alberto Garc a: Igalia at LinuxCon Europe

I came to Barcelona with a few other Igalians this week for LinuxCon, the Embedded
Linux Conference and the KVM Forum. We are sponsoring the event and we have a couple of presentations this year, one about QEMU, device drivers and industrial hardware (which I gave today, slides here) and the other about the Grilo multimedia framework (by Juan Su rez). We ll be around the whole week so you can come and talk to us anytime. You can find us at our booth on the ground floor, where you ll also be able to see a few demos of our latest work and get some merchandising.

3 October 2012

Alberto Garc a: IndustryPack, QEMU and LinuxCon

IndustryPack drivers for Linux In the past months we have been working at Igalia to give Linux support to IndustryPack devices. IndustryPack modules are small boards ( mezzanine ) that are attached to a carrier board, which serves as a bridge between them and the host bus (PCI, VME, ). We wrote the drivers for the TEWS TPCI200 PCI carrier and the GE IP-OCTAL-232 module.

My mate Samuel was the lead developer of the kernel drivers. He published some details about this work in his blog some time ago. The drivers are available in latest Linux release (3.6 as of this writing) but if you want the bleeding-edge version you can get it from here (make sure to use the staging-next branch). IndustryPack emulation for QEMU Along with Samuel s work on the kernel driver, I have been working to add emulation of the aformentioned IndustryPack devices to QEMU. The work consists on three parts:

TPCI200, the bridge between PCI and IndustryPack.
The IndustryPack bus.
IPOCTAL-232, an IndustryPack module with eight RS-232 serial ports.

I decided to split the emulation like this to be as close as possible to how the hardware works and to make it easier to reuse the code to implement other IndustryPack devices. The emulation is functional and can be used with the existing Linux driver. Just make sure to enable CONFIG_IPACK_BUS, CONFIG_BOARD_TPCI200 and CONFIG_SERIAL_IPOCTAL in the kernel configuration. I submitted the code to QEMU, but it hasn t been integrated yet, so if you want to test it you ll need to patch it yourself: get the QEMU source code and apply the TPCI200 patch and the IP-Octal 232 patch. Those patches have been tested with QEMU 1.2.0. And here s how you run QEMU with support for these devices:

$ qemu -device tpci200 -device ipoctal

The IP-Octal board implements eight RS-232 serial ports. Each one of those can be redirected to a character device in the host using the functionality provided by QEMU. The serial0 to serial7 parameters can be used to specify each one of the redirections. Example:

$ qemu -device tpci200 -device ipoctal,serial0=pty

With this, the first serial port of the IP-Octal board ( /dev/ipoctal.0.0.0 on the guest) will be redirected to a newly-allocated pty on the host. LinuxCon Europe Having virtual hardware allows us to test and debug the Linux driver more easily. In November I ll be in Barcelona with the rest of the Igalia OS team for LinuxCon Europe and the KVM Forum. I will be talking about how to use QEMU to improve the robustness of device drivers and speed up their development.. Some other Igalians will also be there, including Juan Su rez who will be talking about the Grilo multimedia framework. See you in Barcelona!

28 July 2012

Alberto Garc a: GUADEC 2012

Third day of GUADEC already. And in Coru a!

This is a very special city for me. I came here in 1996 to study Computer Science. Here I discovered UNIX for the first time, and spent hours learning how to use it. It s funny to see now those old UNIX servers being displayed in a small museum in the auditorium where the main track takes place. It was also here where I learnt about free software, installed my first Debian system, helped creating the local LUG and met the awesome people that founded Igalia with me. Then we went international, but our headquarters and many of our people are still here so I guess we can still call this home. So, needless to say, we are very happy to have GUADEC here this time. I hope you all are enjoying the conference as much as we are. I m quite satisfied with how it s been going so far, the local team has done a good job organising everything and taking care of lots of details to make the life of all attendees easier. I especially want to stress all the effort put into the network infrastructure, one of the best that I remember in a GUADEC conference. At Igalia we ve been very busy lately. We re putting lots of effort in making WebKit better, but our work is not limited to that. Our talks this year show some of the things we ve been doing:

We are also coordinating 4 BOFs (a11y, GNOME OS, WebKit and Grilo) and hosted a UX hackfest in our offices before the conference. And we have a booth next to the info desk where you can get some merchandising and see our interactivity demos.

In case you missed the conference this year, all talks are being recorded and the videos are expected to be published really soon (before the end of the conference). So enjoy the remaining days of GUADEC, and enjoy Coru a! And of course if you re staying after the conference and want to know more about the city or about Galicia, don t hesitate to ask me or anyone from the local team, we ll be glad to help you.

Next.