Search Results: "Vincent Bernat"

3 December 2016

Vincent Bernat: Build-time dependency patching for Android

This post shows how to patch an external dependency for an Android project at build-time with Gradle. This leverages the Transform API and Javassist, a Java bytecode manipulation tool.
buildscript {
    dependencies {
        classpath 'com.android.tools.build:gradle:2.2.+'
        classpath 'com.android.tools.build:transform-api:1.5.+'
        classpath 'org.javassist:javassist:3.21.+'
        classpath 'commons-io:commons-io:2.4'
    }
}
Disclaimer: I am not a seasoned Android programmer, so take this with a grain of salt.

Context This section adds some context to the example. Feel free to skip it. Dashkiosk is an application to manage dashboards on many displays. It provides an Android application you can install on one of those cheap Android sticks. Under the hood, the application is an embedded webview backed by the Crosswalk Project web runtime, which brings an up-to-date web engine, even for older versions of Android1. Recently, a security vulnerability was spotted in how invalid certificates were handled. When a certificate cannot be verified, the webview defers the decision to the host application by calling the onReceivedSslError() method:
Notify the host application that an SSL error occurred while loading a resource. The host application must call either callback.onReceiveValue(true) or callback.onReceiveValue(false). Note that the decision may be retained for use in response to future SSL errors. The default behavior is to pop up a dialog.
The default behavior is specific to Crosswalk webview: the Android builtin one just cancels the load. Unfortunately, the fix applied by Crosswalk is different and, as a side effect, the onReceivedSslError() method is not invoked anymore2. Dashkiosk comes with an option to ignore TLS errors3. The mentioned security fix breaks this feature. The following example will demonstrate how to patch Crosswalk to recover the previous behavior4.

Simple method replacement Let's replace the shouldDenyRequest() method from the org.xwalk.core.internal.SslUtil class with this version:
// In SslUtil class
public static boolean shouldDenyRequest(int error) {
    return false;
}

Transform registration Gradle Transform API enables the manipulation of compiled class files before they are converted to DEX files. To declare a transform and register it, include the following code in your build.gradle:
import com.android.build.api.transform.Context
import com.android.build.api.transform.QualifiedContent
import com.android.build.api.transform.Transform
import com.android.build.api.transform.TransformException
import com.android.build.api.transform.TransformInput
import com.android.build.api.transform.TransformOutputProvider
import org.gradle.api.logging.Logger
class PatchXWalkTransform extends Transform {
    Logger logger = null;
    public PatchXWalkTransform(Logger logger) {
        this.logger = logger
    }
    @Override
    String getName() {
        return "PatchXWalk"
    }
    @Override
    Set<QualifiedContent.ContentType> getInputTypes() {
        return Collections.singleton(QualifiedContent.DefaultContentType.CLASSES)
    }
    @Override
    Set<QualifiedContent.Scope> getScopes() {
        return Collections.singleton(QualifiedContent.Scope.EXTERNAL_LIBRARIES)
    }
    @Override
    boolean isIncremental() {
        return true
    }
    @Override
    void transform(Context context,
                   Collection<TransformInput> inputs,
                   Collection<TransformInput> referencedInputs,
                   TransformOutputProvider outputProvider,
                   boolean isIncremental) throws IOException, TransformException, InterruptedException {
        // We should do something here
    }
}
// Register the transform
android.registerTransform(new PatchXWalkTransform(logger))
The getInputTypes() method should return the set of data types consumed by the transform. In our case, we want to transform classes. Another possibility is to transform resources. The getScopes() method should return the set of scopes for the transform. In our case, we are only interested in the external libraries. It's also possible to transform our own classes. The isIncremental() method returns true because we support incremental builds. The transform() method is expected to take all the provided inputs and copy them (with or without modifications) to the location supplied by the output provider. We haven't implemented this method yet; as written, this causes the removal of all external dependencies from the application.

Noop transform To keep all external dependencies unmodified, we must copy them:
@Override
void transform(Context context,
               Collection<TransformInput> inputs,
               Collection<TransformInput> referencedInputs,
               TransformOutputProvider outputProvider,
               boolean isIncremental) throws IOException, TransformException, InterruptedException {
    inputs.each {
        it.jarInputs.each {
            def jarName = it.name
            def src = it.getFile()
            def dest = outputProvider.getContentLocation(jarName,
                                                         it.contentTypes, it.scopes,
                                                         Format.JAR);
            def status = it.getStatus()
            if (status == Status.REMOVED) { // ❶
                logger.info("Remove ${src}")
                FileUtils.delete(dest)
            } else if (!isIncremental || status != Status.NOTCHANGED) { // ❷
                logger.info("Copy ${src}")
                FileUtils.copyFile(src, dest)
            }
        }
    }
}
We also need two additional imports:
import com.android.build.api.transform.Status
import org.apache.commons.io.FileUtils
Since we are handling external dependencies, we only have to manage JAR files. Therefore, we only iterate on jarInputs and not on directoryInputs. There are two cases to handle for an incremental build: either the file has been removed (❶) or it has been modified (❷). In all other cases, we can safely assume the file is already correctly copied.

JAR patching When the external dependency is the Crosswalk JAR file, we also need to modify it. Here is the first part of the code (replacing ❷):
if ("$ src " ==~ ".*/org.xwalk/xwalk_core.*/classes.jar")  
    def pool = new ClassPool()
    pool.insertClassPath("$ src ")
    def ctc = pool.get('org.xwalk.core.internal.SslUtil') //  
    def ctm = ctc.getDeclaredMethod('shouldDenyRequest')
    ctc.removeMethod(ctm) //  
    ctc.addMethod(CtNewMethod.make("""
public static boolean shouldDenyRequest(int error)  
    return false;
 
""", ctc)) //  
    def sslUtilBytecode = ctc.toBytecode() //  
    // Write back the JAR file
    //  
  else  
    logger.info("Copy $ src ")
    FileUtils.copyFile(src, dest)
 
We also need the following additional imports to use Javassist:
import javassist.ClassPath
import javassist.ClassPool
import javassist.CtNewMethod
Once we have located the JAR file we want to modify, we add it to our classpath and retrieve the class we are interested in (❶). We locate the appropriate method and delete it (❷). Then, we add our custom method using the same name (❸). The whole operation is done in memory. We retrieve the bytecode of the modified class in ❹. The remaining step is to rebuild the JAR file (❺):
def input = new JarFile(src)
def output = new JarOutputStream(new FileOutputStream(dest))
// ❶
input.entries().each {
    if (!it.getName().equals("org/xwalk/core/internal/SslUtil.class")) {
        def s = input.getInputStream(it)
        output.putNextEntry(new JarEntry(it.getName()))
        IOUtils.copy(s, output)
        s.close()
    }
}
// ❷
output.putNextEntry(new JarEntry("org/xwalk/core/internal/SslUtil.class"))
output.write(sslUtilBytecode)
output.close()
We need the following additional imports:
import java.util.jar.JarEntry
import java.util.jar.JarFile
import java.util.jar.JarOutputStream
import org.apache.commons.io.IOUtils
There are two steps. In ❶, all classes are copied to the new JAR, except the SslUtil class. In ❷, the modified bytecode for SslUtil is added to the JAR. That's all! You can view the complete example on GitHub.
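Once registered, the transform runs as part of any regular build that produces DEX files; no extra invocation is needed. For instance, with the standard Gradle wrapper (assuming your project ships one), the following is enough to exercise it:
./gradlew clean assembleDebug
The original dependency on disk stays untouched: the patched classes.jar only exists in the intermediate build outputs managed by the output provider.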

More complex method replacement In the above example, the new method doesn't use any external dependency. Let's suppose we also want to replace the sslErrorFromNetErrorCode() method from the same class with the following one:
import org.chromium.net.NetError;
import android.net.http.SslCertificate;
import android.net.http.SslError;
// In SslUtil class
public static SslError sslErrorFromNetErrorCode(int error,
                                                SslCertificate cert,
                                                String url) {
    switch(error) {
        case NetError.ERR_CERT_COMMON_NAME_INVALID:
            return new SslError(SslError.SSL_IDMISMATCH, cert, url);
        case NetError.ERR_CERT_DATE_INVALID:
            return new SslError(SslError.SSL_DATE_INVALID, cert, url);
        case NetError.ERR_CERT_AUTHORITY_INVALID:
            return new SslError(SslError.SSL_UNTRUSTED, cert, url);
        default:
            break;
    }
    return new SslError(SslError.SSL_INVALID, cert, url);
}
The major difference with the previous example is that we need to import some additional classes.

Android SDK import The classes from the Android SDK are not part of the external dependencies. They need to be imported separately. The full path of the JAR file is:
androidJar = "$ android.getSdkDirectory().getAbsolutePath() /platforms/" +
             "$ android.getCompileSdkVersion() /android.jar"
We need to load it before adding the new method into SslUtil class:
def pool = new ClassPool()
pool.insertClassPath(androidJar)
pool.insertClassPath("$ src ")
def ctc = pool.get('org.xwalk.core.internal.SslUtil')
def ctm = ctc.getDeclaredMethod('sslErrorFromNetErrorCode')
ctc.removeMethod(ctm)
pool.importPackage('android.net.http.SslCertificate');
pool.importPackage('android.net.http.SslError');
// [...]

External dependency import We must also import org.chromium.net.NetError and therefore, we need to put the appropriate JAR in our classpath. The easiest way is to iterate through all the external dependencies and add them to the classpath.
def pool = new ClassPool()
pool.insertClassPath(androidJar)
inputs.each {
    it.jarInputs.each {
        def jarName = it.name
        def src = it.getFile()
        def status = it.getStatus()
        if (status != Status.REMOVED) {
            pool.insertClassPath("${src}")
        }
    }
}
def ctc = pool.get('org.xwalk.core.internal.SslUtil')
def ctm = ctc.getDeclaredMethod('sslErrorFromNetErrorCode')
ctc.removeMethod(ctm)
pool.importPackage('android.net.http.SslCertificate');
pool.importPackage('android.net.http.SslError');
pool.importPackage('org.chromium.net.NetError');
ctc.addMethod(CtNewMethod.make("…"))
// Then, rebuild the JAR...
Happy hacking!

  1. Before Android 4.4, the webview was severely outdated. Starting from Android 5, the webview is shipped as a separate component with updates. Embedding Crosswalk is still convenient as you know exactly which version you can rely on.
  2. I hope to have this fixed in later versions.
  3. This may seem harmful and you are right. However, if you have an internal CA, it is currently not possible to provide its own trust store to a webview. Moreover, the system trust store is not used either. You also may want to use TLS for authentication only with client certificates, a feature supported by Dashkiosk.
  4. Crosswalk being an open-source project, an alternative would have been to patch the Crosswalk source code and recompile it. However, Crosswalk embeds Chromium and recompiling the whole thing consumes a lot of resources.

2 May 2016

Vincent Bernat: Pragmatic Debian packaging

While the creation of Debian packages is abundantly documented, most tutorials are targeted at packages implementing the Debian policy. Moreover, Debian packaging has a reputation for being unnecessarily difficult1 and many people prefer to use less constrained tools2 like fpm or CheckInstall. However, I would like to show how building Debian packages with the official tools can become straightforward if you bend some rules:
  1. No source package will be generated. Packages will be built directly from a checkout of a VCS repository.
  2. Additional dependencies can be downloaded during build. Packaging each dependency individually is painstaking work, notably when you have to deal with some fast-paced ecosystems like Java, Javascript and Go.
  3. The produced packages may bundle dependencies. This is likely to raise some concerns about security and long-term maintenance, but this is a common trade-off in many ecosystems, notably Java, Javascript and Go.

Pragmatic packages 101 In the Debian archive, you have two kinds of packages: the source packages and the binary packages. Each binary package is built from a source package. You need a name for each package. As stated in the introduction, we won't generate a source package but we will work with its unpacked form, which is any source tree containing a debian/ directory. In our examples, we will start with a source tree containing only a debian/ directory, but you are free to include this debian/ directory in an existing project. As an example, we will package memcached, a distributed memory cache. There are four files to create:
  • debian/compat,
  • debian/changelog,
  • debian/control, and
  • debian/rules.
The first one is easy. Just put 9 in it:
echo 9 > debian/compat
The second one has the following content:
memcached (0-0) UNRELEASED; urgency=medium
  * Fake entry
 -- Happy Packager <happy@example.com>  Tue, 19 Apr 2016 22:27:05 +0200
The only important information is the name of the source package, memcached, on the first line. Everything else can be left as is, as it won't influence the generated binary packages.
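If you would rather not write this file by hand, the dch tool from the devscripts package can generate an equivalent stub; the invocation below is only illustrative:
dch --create --package memcached --newversion 0-0 "Fake entry"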

The control file debian/control describes the metadata of both the source package and the generated binary packages. We have to write a block for each of them.
Source: memcached
Maintainer: Vincent Bernat <bernat@debian.org>

Package: memcached
Architecture: any
Description: high-performance memory object caching system
The source package is called memcached. We have to use the same name as in debian/changelog. We generate only one binary package: memcached. In the remainder of the example, when you see memcached, it is the name of a binary package. The Architecture field should be set to either any or all. Use all exclusively if the package contains only architecture-independent files. When in doubt, just stick to any. The Description field contains a short description of the binary package.

The build recipe The last mandatory file is debian/rules. It's the recipe of the package. We need to retrieve memcached, build it and install its file tree in debian/memcached/. It looks like this:
#!/usr/bin/make -f
DISTRIBUTION = $(shell lsb_release -sr)
VERSION = 1.4.25
PACKAGEVERSION = $(VERSION)-0~$(DISTRIBUTION)0
TARBALL = memcached-$(VERSION).tar.gz
URL = http://www.memcached.org/files/$(TARBALL)
%:
    dh $@
override_dh_auto_clean:
override_dh_auto_test:
override_dh_auto_build:
override_dh_auto_install:
    wget -N --progress=dot:mega $(URL)
    tar --strip-components=1 -xf $(TARBALL)
    ./configure --prefix=/usr
    make
    make install DESTDIR=debian/memcached
override_dh_gencontrol:
    dh_gencontrol -- -v$(PACKAGEVERSION)
The empty targets override_dh_auto_clean, override_dh_auto_test and override_dh_auto_build keep debhelper from being too smart. The override_dh_gencontrol target sets the package version3 without updating debian/changelog. If you ignore the slight boilerplate, the recipe is quite similar to what you would have done with fpm:
DISTRIBUTION=$(lsb_release -sr)
VERSION=1.4.25
PACKAGEVERSION=${VERSION}-0~${DISTRIBUTION}0
TARBALL=memcached-${VERSION}.tar.gz
URL=http://www.memcached.org/files/${TARBALL}
wget -N --progress=dot:mega ${URL}
tar --strip-components=1 -xf ${TARBALL}
./configure --prefix=/usr
make
make install DESTDIR=/tmp/installdir
# Build the final package
fpm -s dir -t deb \
    -n memcached \
    -v ${PACKAGEVERSION} \
    -C /tmp/installdir \
    --description "high-performance memory object caching system"
You can review the whole package tree on GitHub and build it with dpkg-buildpackage -us -uc -b.

Pragmatic packages 102 At this point, we can iterate and add several improvements to our memcached package. None of those are mandatory but they are usually worth the additional effort.

Build dependencies Our initial build recipe only works when several packages are installed, like wget and libevent-dev. They are not present on all Debian systems. You can easily express that you need them by adding a Build-Depends section for the source package in debian/control:
Source: memcached
Build-Depends: debhelper (>= 9),
               wget, ca-certificates, lsb-release,
               libevent-dev
Always specify the debhelper (>= 9) dependency as we heavily rely on it. We don't require make or a C compiler because it is assumed that the build-essential meta-package is installed and pulls them in. dpkg-buildpackage will complain if the dependencies are not met. If you want to install those packages from your CI system, you can use the following command4:
mk-build-deps \
    -t 'apt-get -o Debug::pkgProblemResolver=yes --no-install-recommends -qqy' \
    -i -r debian/control
You may also want to investigate pbuilder or sbuild, two tools to build Debian packages in a clean isolated environment.
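For illustration, a minimal pbuilder-based workflow could look like the following; the options shown are indicative rather than a complete setup:
sudo pbuilder create --distribution sid   # create the base chroot once
pdebuild                                  # build the current package inside the chroot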

Runtime dependencies If the resulting package is installed on a freshly installed machine, it won't work because it will be missing libevent, a library required by memcached. You can express the dependencies needed by each binary package by adding a Depends field. Moreover, for dynamic libraries, you can automatically get the right dependencies by using some substitution variables:
Package: memcached
Depends: ${misc:Depends}, ${shlibs:Depends}
The resulting package will contain the following information:
$ dpkg -I ../memcached_1.4.25-0\~unstable0_amd64.deb | grep Depends
 Depends: libc6 (>= 2.17), libevent-2.0-5 (>= 2.0.10-stable)

Integration with init system Most packaged daemons come with some integration with the init system. This integration ensures the daemon will be started on boot and restarted on upgrade. For Debian-based distributions, there are several init systems available. The most prominent ones are:
  • System-V init is the historical init system. More modern inits are able to reuse scripts written for this init, so this is a safe common denominator for packaged daemons.
  • Upstart is the less-historical init system for Ubuntu (used in Ubuntu 14.10 and previous releases).
  • systemd is the default init system for Debian since Jessie and for Ubuntu since 15.04.
Writing a correct script for the System-V init is error-prone. Therefore, I usually prefer to provide a native configuration file for the default init system of the targeted distribution (Upstart and systemd).

System-V If you want to provide a System-V init script, have a look at /etc/init.d/skeleton on the most ancient distribution you want to target and adapt it5. Put the result in debian/memcached.init. It will be installed at the right place, invoked on install, upgrade and removal. On Debian-based systems, many init scripts allow user customizations by providing a /etc/default/memcached file. You can ship one by putting its content in debian/memcached.default.
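As an illustration, a debian/memcached.default feeding the variables used by the Upstart and systemd examples below could look like this (the values are just an example):
# Defaults for memcached, sourced by the init integration
USER=_memcached
PORT=11211
CACHESIZE=64
MAXCONN=1024
OPTIONS=""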

Upstart Providing an Upstart job is similar: put it in debian/memcached.upstart. For example:
description "memcached daemon"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
respawn limit 5 60
expect daemon
script
  . /etc/default/memcached
  exec memcached -d -u $USER -p $PORT -m $CACHESIZE -c $MAXCONN $OPTIONS
end script
When writing an Upstart job, the most important directive is expect. Be sure to get it right. Here, we use expect daemon and memcached is started with the -d flag.

systemd Providing a systemd unit is a bit more complex. The content of the file should go in debian/memcached.service. For example:
[Unit]
Description=memcached daemon
After=network.target
[Service]
Type=forking
EnvironmentFile=/etc/default/memcached
ExecStart=/usr/bin/memcached -d -u $USER -p $PORT -m $CACHESIZE -c $MAXCONN $OPTIONS
Restart=on-failure
[Install]
WantedBy=multi-user.target
We reuse /etc/default/memcached even if it is not considered good practice with systemd6. As with Upstart, the Type directive is quite important. We use forking as memcached is started with the -d flag. You also need to add a build dependency on dh-systemd in debian/control:
Source: memcached
Build-Depends: debhelper (>= 9),
               wget, ca-certificates, lsb-release,
               libevent-dev,
               dh-systemd
And you need to modify the default rule in debian/rules:
%:
    dh $@ --with systemd
The extra complexity is a bit unfortunate but systemd integration is not part of debhelper7. Without those additional modifications, the unit will get installed but you won't get proper integration and the service won't be enabled on install or boot.

Dedicated user Many daemons don't need to run as root and it is good practice to ship a dedicated user. In the case of memcached, we can provide a _memcached user8. Add a debian/memcached.postinst file with the following content:
#!/bin/sh
set -e
case "$1" in
    configure)
        adduser --system --disabled-password --disabled-login --home /var/empty \
                --no-create-home --quiet --force-badname --group _memcached
        ;;
esac
#DEBHELPER#
exit 0
There is no cleanup of the user when the package is removed for two reasons:
  1. Less stuff to write.
  2. The user could still own some files.
The adduser utility will do the right thing whether the requested user already exists or not. You need to add it as a dependency in debian/control:
Package: memcached
Depends: ${misc:Depends}, ${shlibs:Depends}, adduser
The #DEBHELPER# marker is important as it will be replaced by some code to handle the service configuration files (or some other stuff). You can review the whole package tree on GitHub and build it with dpkg-buildpackage -us -uc -b.

Pragmatic packages 103 It is possible to leverage debhelper to reduce the recipe size and make it more declarative. This section is quite optional and requires understanding a bit more about how a Debian package is built. Feel free to skip it.

The big picture There are four steps to build a regular Debian package:
  1. debian/rules clean should clean the source tree to make it pristine.
  2. debian/rules build should trigger the build. For an autoconf-based software, like memcached, this step should execute something like ./configure && make.
  3. debian/rules install should install the file tree of each binary package. For an autoconf-based software, this step should execute make install DESTDIR=debian/memcached.
  4. debian/rules binary will pack the different file trees into binary packages.
You don't directly write each of those targets. Instead, you let dh, a component of debhelper, do most of the work. The following debian/rules file should do almost everything correctly for many source packages:
#!/usr/bin/make -f
%:
    dh $@
For each of the four targets described above, you can run dh with --no-act to see what it would do. For example:
$ dh build --no-act
   dh_testdir
   dh_update_autotools_config
   dh_auto_configure
   dh_auto_build
   dh_auto_test
Each of those helpers has a manual page. Helpers starting with dh_auto_ are a bit "magic". For example, dh_auto_configure will try to automatically configure a package prior to building: it will detect the build system and invoke ./configure, cmake or Makefile.PL. If one of the helpers does not do the right thing, you can replace it by using an override target:
override_dh_auto_configure:
    ./configure --with-some-grog
Those helpers are also configurable, so you can alter their behaviour a bit by invoking them with additional options:
override_dh_auto_configure:
    dh_auto_configure -- --with-some-grog
This way, ./configure will be called with your custom flag but also with a lot of default flags like --prefix=/usr for better integration. In the initial memcached example, we overrode all those magic targets. dh_auto_clean, dh_auto_configure and dh_auto_build are converted to no-ops to avoid any unexpected behaviour. dh_auto_install is hijacked to do the whole build process. Additionally, we modified the behavior of the dh_gencontrol helper by forcing the version number instead of using the one from debian/changelog.

Automatic builds As memcached is an autoconf-enabled package, dh knows how to build it: ./configure && make && make install. Therefore, we can let it handle most of the work with this debian/rules file:
#!/usr/bin/make -f
DISTRIBUTION = $(shell lsb_release -sr)
VERSION = 1.4.25
PACKAGEVERSION = $(VERSION)-0~$(DISTRIBUTION)0
TARBALL = memcached-$(VERSION).tar.gz
URL = http://www.memcached.org/files/$(TARBALL)
%:
    dh $@ --with systemd
override_dh_auto_clean:
    wget -N --progress=dot:mega $(URL)
    tar --strip-components=1 -xf $(TARBALL)
override_dh_auto_test:
    # Don't run the whitespace test
    rm t/whitespace.t
    dh_auto_test
override_dh_gencontrol:
    dh_gencontrol -- -v$(PACKAGEVERSION)
The dh_auto_clean target is hijacked to download and set up the source tree9. We don't override the dh_auto_configure step, so dh will execute the ./configure script with the appropriate options. We don't override the dh_auto_build step either: dh will execute make. dh_auto_test is invoked after the build and it will run the memcached test suite. We need to override it because one of the tests complains about odd whitespace in the debian/ directory. We suppress this rogue test and let dh_auto_test execute the rest of the test suite. dh_auto_install is not overridden either, so dh will execute some variant of make install. To get a better sense of the difference, here is a diff:
--- memcached-intermediate/debian/rules 2016-04-30 14:02:37.425593362 +0200
+++ memcached/debian/rules  2016-05-01 14:55:15.815063835 +0200
@@ -12,10 +12,9 @@
 override_dh_auto_clean:
-override_dh_auto_test:
-override_dh_auto_build:
-override_dh_auto_install:
    wget -N --progress=dot:mega $(URL)
    tar --strip-components=1 -xf $(TARBALL)
-   ./configure --prefix=/usr
-   make
-   make install DESTDIR=debian/memcached
+
+override_dh_auto_test:
+   # Don't run the whitespace test
+   rm t/whitespace.t
+   dh_auto_test
It is up to you to decide if dh can do some work for you, but you could try to start from a minimal debian/rules and only override some targets.

Install additional files While make install installed the essential files for memcached, you may want to put additional files in the binary package. You could use cp in your build recipe, but you can also declare them:
  • files listed in debian/memcached.docs will be copied to /usr/share/doc/memcached by dh_installdocs,
  • files listed in debian/memcached.examples will be copied to /usr/share/doc/memcached/examples by dh_installexamples,
  • files listed in debian/memcached.manpages will be copied to the appropriate subdirectory of /usr/share/man by dh_installman.
Here is an example using wildcards for debian/memcached.docs:
doc/*.txt
If you need to copy some files to an arbitrary location, you can list them along with their destination directories in debian/memcached.install and dh_install will take care of the copy. Here is an example:
scripts/memcached-tool usr/bin
Using those files makes the build process more declarative. It is a matter of taste and you are free to use cp in debian/rules instead. You can review the whole package tree on GitHub.

Other examples The GitHub repository contains some additional examples. They all follow the same scheme:
  • dh_auto_clean is hijacked to download and setup the source tree
  • dh_gencontrol is modified to use a computed version
Notably, you'll find daemons in Java, Go, Python and Node.js. The goal of those examples is to demonstrate that using Debian tools to build Debian packages can be straightforward. Hope this helps.

  1. People may remember the time before debhelper 7.0.50 (circa 2009) when debian/rules was a daunting beast. However, nowadays, the boilerplate is quite reduced.
  2. The complexity is not the only reason. Those alternative tools enable the creation of RPM packages, something that the Debian tools obviously don't.
  3. There are many ways to version a package. Again, if you want to be pragmatic, the proposed solution should be good enough for Ubuntu. On Debian, it doesn't cover upgrades from one distribution version to another, but we assume that nowadays, systems get reinstalled instead of being upgraded.
  4. You also need to install the devscripts and equivs packages.
  5. It's also possible to use a script provided by upstream. However, there is no such thing as an init script that works on all distributions. Compare the proposed script with the skeleton, check if it is using start-stop-daemon and if it sources /lib/lsb/init-functions before considering it. If it seems to fit, you can install it yourself in debian/memcached/etc/init.d/. debhelper will ensure its proper integration.
  6. Instead, a user wanting to customize the options is expected to edit the unit with systemctl edit.
  7. See #822670
  8. The Debian Policy doesn't provide any hint on the naming convention for those system users. A common usage is to prefix the daemon name with an underscore (like _memcached). Another common usage is to use Debian- as a prefix. The main drawback of the latter solution is that the name is likely to be replaced by the UID in ps and top because of its length.
  9. We could call dh_auto_clean at the end of the target to let it invoke make clean. However, it is assumed that a fresh checkout is used before each build.

10 April 2016

Vincent Bernat: Testing network software with pytest and Linux namespaces

Started in 2008, lldpd is an implementation of IEEE 802.1AB-2005 (aka LLDP) written in C. While it contains some unit tests, like many other network-related software at the time, their coverage is pretty poor: they are hard to write because the code is written in an imperative style and tightly coupled with the system. It would require extensive mocking1. While a rewrite (complete or iterative) would help to make the code more test-friendly, it would be quite an effort and would likely introduce operational bugs along the way. To get better test coverage, the major features of lldpd are now verified through integration tests. Those tests leverage Linux network namespaces to set up a lightweight and isolated environment for each test. They run through pytest, a powerful testing tool.

pytest in a nutshell pytest is a Python testing tool whose primary use is to write tests for Python applications but is versatile enough for other creative usages. It is bundled with three killer features:
  • you can directly use the assert keyword,
  • you can inject fixtures in any test function, and
  • you can parametrize tests.

Assertions With unittest, the unit testing framework included with Python, and many similar frameworks, unit tests have to be encapsulated into a class and use the provided assertion methods. For example:
class testArithmetics(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 3, 4)
The equivalent with pytest is simpler and more readable:
def test_addition():
    assert 1 + 3 == 4
pytest will analyze the AST and display useful error messages in case of failure. For further information, see Benjamin Peterson's article.

Fixtures A fixture is the set of actions performed in order to prepare the system to run some tests. With classic frameworks, you can only define one fixture for a set of tests:
class testInVM(unittest.TestCase):
    def setUp(self):
        self.vm = VM('Test-VM')
        self.vm.start()
        self.ssh = SSHClient()
        self.ssh.connect(self.vm.public_ip)
    def tearDown(self):
        self.ssh.close()
        self.vm.destroy()
    def test_hello(self):
        stdin, stdout, stderr = self.ssh.exec_command("echo hello")
        stdin.close()
        self.assertEqual(stderr.read(), b"")
        self.assertEqual(stdout.read(), b"hello\n")
In the example above, we want to test various commands on a remote VM. The fixture launches a new VM and configures an SSH connection. However, if the SSH connection cannot be established, the fixture will fail and the tearDown() method won't be invoked. The VM will be left running. Instead, with pytest, we could do this:
@pytest.yield_fixture
def vm():
    r = VM('Test-VM')
    r.start()
    yield r
    r.destroy()
@pytest.yield_fixture
def ssh(vm):
    ssh = SSHClient()
    ssh.connect(vm.public_ip)
    yield ssh
    ssh.close()
def test_hello(ssh):
    stdin, stdout, stderr = ssh.exec_command("echo hello")
    stdin.close()
    assert stderr.read() == b""
    assert stdout.read() == b"hello\n"
The first fixture will provide a freshly booted VM. The second one will set up an SSH connection to the VM provided as an argument. Fixtures are used through dependency injection: just give their names in the signature of the test functions and fixtures that need them. Each fixture only handles the lifetime of one entity. Whether a dependent test function or fixture succeeds or fails, the VM will always be destroyed in the end.

Parameters If you want to run the same test several times with a varying parameter, you can dynamically create test functions or use one test function with a loop. With pytest, you can parametrize test functions and fixtures:
@pytest.mark.parametrize("n1, n2, expected", [
    (1, 3, 4),
    (8, 20, 28),
    (-4, 0, -4)])
def test_addition(n1, n2, expected):
    assert n1 + n2 == expected

Testing lldpd The general plan to test a feature in lldpd is the following:
  1. Setup two namespaces.
  2. Create a virtual link between them.
  3. Spawn a lldpd process in each namespace.
  4. Test the feature in one namespace.
  5. Check with lldpcli that we get the expected result in the other.
Here is a typical test using the most interesting features of pytest:
@pytest.mark.skipif('LLDP-MED' not in pytest.config.lldpd.features,
                    reason="LLDP-MED not supported")
@pytest.mark.parametrize("classe, expected", [
    (1, "Generic Endpoint (Class I)"),
    (2, "Media Endpoint (Class II)"),
    (3, "Communication Device Endpoint (Class III)"),
    (4, "Network Connectivity Device")])
def test_med_devicetype(lldpd, lldpcli, namespaces, links,
                        classe, expected):
    links(namespaces(1), namespaces(2))
    with namespaces(1):
        lldpd("-r")
    with namespaces(2):
        lldpd("-M", str(classe))
    with namespaces(1):
        out = lldpcli("-f", "keyvalue", "show", "neighbors", "details")
        assert out['lldp.eth0.lldp-med.device-type'] == expected
First, the test will be executed only if lldpd was compiled with LLDP-MED support. Second, the test is parametrized. We will execute four distinct tests, one for each role that lldpd should be able to take as an LLDP-MED-enabled endpoint. The signature of the test has four parameters that are not covered by the parametrize() decorator: lldpd, lldpcli, namespaces and links. They are fixtures. A lot of magic happens in those to keep the actual tests short:
  • lldpd is a factory to spawn an instance of lldpd. When called, it will set up the current namespace (setting up the chroot, creating the user and group for privilege separation, replacing some files to be distribution-agnostic, etc.), then call lldpd with the additional parameters provided. The output is recorded and added to the test report in case of failure. The module also contains the creation of the pytest.config.lldpd object that is used to record the features supported by lldpd and skip non-matching tests. You can read fixtures/programs.py for more details.
  • lldpcli is also a factory, but it spawns instances of lldpcli, the client to query lldpd. Moreover, it will parse the output in a dictionary to reduce boilerplate.
  • namespaces is one of the most interesting pieces. It is a factory for Linux namespaces. It will spawn a new namespace or refer to an existing one. It is possible to switch from one namespace to another (with with) as they are context managers. Behind the scenes, the factory maintains the appropriate file descriptors for each namespace and switches to them with setns(). Once the test is done, everything is wiped out as the file descriptors are garbage collected. You can read fixtures/namespaces.py for more details. It is quite reusable in other projects2.
  • links contains helpers to handle network interfaces: creation of virtual ethernet links between namespaces, creation of bridges, bonds and VLANs, etc. It relies on the pyroute2 module. You can read fixtures/network.py for more details.
You can see an example of a test run on the Travis build for 0.9.2. Since each test is correctly isolated, it's possible to run parallel tests with pytest -n 10 --boxed. To catch even more bugs, both the address sanitizer (ASAN) and the undefined behavior sanitizer (UBSAN) are enabled. In case of a problem, notably a memory leak, the faulty program will exit with a non-zero exit code and the associated test will fail.
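As an illustration only (not the project's exact CI configuration), assuming an autotools-style build where the integration tests are wired into make check, enabling both sanitizers for a local run could look like this:
./configure CFLAGS="-fsanitize=address,undefined -g -O1"
make
sudo make check   # the namespace fixtures require root (see footnote 2)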

  1. A project like cwrap would definitely help. However, it lacks support for Netlink and raw sockets that are essential in lldpd operations.
  2. There are three main limitations in the use of namespaces with this fixture. First, when creating a user namespace, only root is mapped to the current user. With lldpd, we have two users (root and _lldpd). Therefore, the tests have to run as root. The second limitation is with the PID namespace. It's not possible for a process to switch from one PID namespace to another. When you call setns() on a PID namespace, only children of the current process will be in the new PID namespace. The PID namespace is convenient to ensure everyone gets killed once the tests are terminated, but you must keep in mind that /proc must be mounted in children only. The third limitation is that, for some namespaces (PID and user), all threads of a process must be part of the same namespace. Therefore, don't use threads in tests. Use the multiprocessing module instead.

3 January 2016

Lunar: Reproducible builds: week 35 in Stretch cycle

What happened in the reproducible builds effort between December 20th and December 26th:

Toolchain fixes Mattia Rizzolo rebased our experimental versions of debhelper (twice!) and dpkg on top of the latest releases. Reiner Herrmann submitted a patch for mozilla-devscripts to sort the file list in generated preferences.js files. To be able to lift the restriction that packages must be built in the same path, translation support for the __FILE__ C pre-processor macro would also be required. Joerg Sonnenberger submitted a patch back in 2010 that would still be useful today. Chris Lamb started work on providing a deterministic mode for debootstrap.

Packages fixed The following packages have become reproducible due to changes in their build dependencies: bouncycastle, cairo-dock-plug-ins, darktable, gshare, libgpod, pafy, ruby-redis-namespace, ruby-rouge, sparkleshare. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues, but not all of them: Patches submitted which have not made their way to the archive yet:

reproducible.debian.net Statistics for package sets are now visible for the armhf architecture. (h01ger) The second build now has a longer timeout (18 hours) than the first build (12 hours). This should prevent wasting resources when a machine is loaded. (h01ger) Builds of Arch Linux packages are now done using a tmpfs. (h01ger) 200 GiB have been added to jenkins.debian.net (thanks to ProfitBricks!) to make room for new jobs. The current count is at 962 and growing!

diffoscope development Aside from some minor bugs that have been fixed, a one-line change made huge memory (and time) savings as the output of the transformation tool is now streamed line by line instead of being loaded entirely in memory at once.

disorderfs development Andrew Ayer released disorderfs version 0.4.2-1 on December 22nd. It fixes a memory corruption error when processing command line arguments that could cause command line options to be ignored.

Documentation update Many small improvements for the documentation on reproducible-builds.org sent by Georg Koppen were merged.

Package reviews 666 (!) reviews have been removed, 189 added and 162 updated in the previous week. 151 new fail to build from source reports have been made by Chris West, Chris Lamb, Mattia Rizzolo, and Niko Tyni. New issues identified: unsorted_filelist_in_xul_ext_preferences, nondeterminstic_output_generated_by_moarvm.

Misc. Steven Chamberlain drew our attention to one analysis of the Juniper ScreenOS Authentication Backdoor: "Whilst this may have been added in source code, it was well-disguised in the disassembly and just 7 instructions long. I thought this was a good example of the current state-of-the-art, and why we'd like our binaries and, eventually, installer and VM images reproducible IMHO." Joanna Rutkowska has mentioned possible ways for Qubes to become reproducible on their development mailing-list.

1 September 2015

Lunar: Reproducible builds: week 18 in Stretch cycle

What happened in the reproducible builds effort this week:

Toolchain fixes Aurélien Jarno uploaded glibc/2.21-0experimental1 which will fix the issue where locales-all did not behave exactly like locales despite having it in the Provides field. Lunar rebased the pu/reproducible_builds branch for dpkg on top of the released 1.18.2. This made visible an issue with udebs and automatically generated debug packages. The summary from the meeting at DebConf15 between ftpmasters, dpkg maintainers and reproducible builds folks has been posted to the relevant mailing lists.

Packages fixed The following 70 packages became reproducible due to changes in their build dependencies: activemq-activeio, async-http-client, classworlds, clirr, compress-lzf, dbus-c++, felix-bundlerepository, felix-framework, felix-gogo-command, felix-gogo-runtime, felix-gogo-shell, felix-main, felix-shell-tui, felix-shell, findbugs-bcel, gco, gdebi, gecode, geronimo-ejb-3.2-spec, git-repair, gmetric4j, gs-collections, hawtbuf, hawtdispatch, jack-tools, jackson-dataformat-cbor, jackson-dataformat-yaml, jackson-module-jaxb-annotations, jmxetric, json-simple, kryo-serializers, lhapdf, libccrtp, libclaw, libcommoncpp2, libftdi1, libjboss-marshalling-java, libmimic, libphysfs, libxstream-java, limereg, maven-debian-helper, maven-filtering, maven-invoker, mochiweb, mongo-java-driver, mqtt-client, netty-3.9, openhft-chronicle-queue, openhft-compiler, openhft-lang, pavucontrol, plexus-ant-factory, plexus-archiver, plexus-bsh-factory, plexus-cdc, plexus-classworlds2, plexus-component-metadata, plexus-container-default, plexus-io, pytone, scolasync, sisu-ioc, snappy-java, spatial4j-0.4, tika, treeline, wss4j, xtalk, zshdb. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which have not made their way to the archive yet: Chris Lamb also noticed that binaries shipped with libsilo-bin did not work.

Documentation update Chris Lamb and Ximin Luo assembled a proper specification for SOURCE_DATE_EPOCH in the hope of convincing more upstreams to adopt it. Thanks to Holger it is published under a non-Debian domain name. Lunar documented the easiest way to solve issues with file ordering and timestamps in tarballs that came with tar/1.28-1. Some examples on how to use SOURCE_DATE_EPOCH have been improved to support systems without GNU date.

reproducible.debian.net armhf is finally being tested, which also means the remote building of Debian packages finally works! This paves the way to performing the tests on even more architectures and doing variations on CPU and date. Some packages even produce the same binary Arch:all packages on different architectures (1, 2). (h01ger) Tests for FreeBSD are finally running. (h01ger) As it seems the gcc5 transition has cooled off, we schedule sid more often than testing again on amd64. (h01ger) disorderfs has been built and installed on all build nodes (amd64 and armhf). One issue related to permissions for root and unprivileged users needs to be solved before disorderfs can be used on reproducible.debian.net. (h01ger)

strip-nondeterminism Version 0.011-1 has been released on August 29th. The new version updates dh_strip_nondeterminism to match recent changes in debhelper. (Andrew Ayer)

disorderfs disorderfs, the new FUSE filesystem to ease testing of filesystem-related variations, is now almost ready to be used. Version 0.2.0 adds support for extended attributes. Since then Andrew Ayer also added support to reverse directory entries instead of shuffling them, and arbitrary padding to the number of blocks used by files.

Package reviews 142 reviews have been removed, 48 added and 259 updated this week. Santiago Vila renamed the not_using_dh_builddeb issue into varying_mtimes_in_data_tar_gz_or_control_tar_gz to align better with other tag names. New issue identified this week: random_order_in_python_doit_completion. 37 FTBFS issues have been reported by Chris West (Faux) and Chris Lamb.

Misc. h01ger gave a talk at FrOSCon on August 23rd. Recordings are already online. These reports are being reviewed and enhanced every week by many people hanging out on #debian-reproducible. Huge thanks!

25 August 2015

Lunar: Reproducible builds: week 17 in Stretch cycle

A good amount of the Debian reproducible builds team had the chance to enjoy face-to-face interactions during DebConf15.
Names in red and blue were all present at DebConf15
Picture of the "reproducible builds" talk during DebConf15
Hugging people with whom one has been working tirelessly for months gives a lot of warm-fuzzy feelings. Several recorded and hallway discussions paved the way to solve the remaining issues to get reproducible builds into Debian proper. Both talks from the Debian Project Leader and the release team mentioned the effort as important for the future of Debian. A forty-five minute talk presented the state of the reproducible builds effort. It was then followed by an hour-long roundtable to discuss current blockers regarding dpkg, .buildinfo and their integration in the archive.

Picture of the "reproducible builds" roundtable during DebConf15

Toolchain fixes Reiner Herrmann submitted a patch to make rdfind sort the processed files before doing any operation. Chris Lamb proposed a new patch for wheel implementing support for SOURCE_DATE_EPOCH instead of the custom WHEEL_FORCE_TIMESTAMP. akira sent one making man2html SOURCE_DATE_EPOCH aware. Stéphane Glondu reported that dpkg-source would not respect tarball permissions when unpacking under a umask of 002. After hours of iterative testing during the DebConf workshop, Sandro Knauß created a test case showing how pdflatex output can be non-deterministic with some PNG files.

Packages fixed The following 65 packages became reproducible due to changes in their build dependencies: alacarte, arbtt, bullet, ccfits, commons-daemon, crack-attack, d-conf, ejabberd-contrib, erlang-bear, erlang-cherly, erlang-cowlib, erlang-folsom, erlang-goldrush, erlang-ibrowse, erlang-jiffy, erlang-lager, erlang-lhttpc, erlang-meck, erlang-p1-cache-tab, erlang-p1-iconv, erlang-p1-logger, erlang-p1-mysql, erlang-p1-pam, erlang-p1-pgsql, erlang-p1-sip, erlang-p1-stringprep, erlang-p1-stun, erlang-p1-tls, erlang-p1-utils, erlang-p1-xml, erlang-p1-yaml, erlang-p1-zlib, erlang-ranch, erlang-redis-client, erlang-uuid, freecontact, givaro, glade, gnome-shell, gupnp, gvfs, htseq, jags, jana, knot, libconfig, libkolab, libmatio, libvsqlitepp, mpmath, octave-zenity, openigtlink, paman, pisa, pynifti, qof, ruby-blankslate, ruby-xml-simple, timingframework, trace-cmd, tsung, wings3d, xdg-user-dirs, xz-utils, zpspell. The following packages became reproducible after getting fixed: Uploads that might have fixed reproducibility issues: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which have not made their way to the archive yet: Stéphane Glondu reported two issues regarding embedded build dates in omake and cduce. Aurélien Jarno submitted a fix for the breakage of the make-dfsg test suite. As binutils now creates deterministic libraries by default, Aurélien's patch makes use of a wrapper to give the U flag to ar. Reiner Herrmann reported an issue with pound which embeds random dhparams in its code during the build. Better solutions are yet to be found.

reproducible.debian.net Package pages on reproducible.debian.net now have a new layout improving readability, designed by Mattia Rizzolo, h01ger, and Ulrike. The navigation is now on the left as vertical space is more valuable nowadays. armhf is now enabled on all pages except the dashboard. Actual tests on armhf are expected to start shortly. (Mattia Rizzolo, h01ger) The limit on how many packages people can schedule using the reschedule script on Alioth has been bumped to 200. (h01ger) mod_rewrite is now used instead of JavaScript for the form in the dashboard. (h01ger) Following the rename of the software, debbindiff has mostly been replaced by either diffoscope or differences in generated HTML and IRC notification output. Connections to UDD have been made more robust. (Mattia Rizzolo)

diffoscope development diffoscope version 31 was released on August 21st. This version improves fuzzy-matching by using the tlsh algorithm instead of ssdeep. New command line options are available: --max-diff-input-lines and --max-diff-block-lines to override limits on diff input and output (Reiner Herrmann), --debugger to dump the user into pdb in case of crashes (Mattia Rizzolo). jar archives should now be detected properly (Reiner Herrmann). Several general code cleanups were also done by Chris Lamb.

strip-nondeterminism development Andrew Ayer released strip-nondeterminism version 0.010-1. Java properties files in jars should now be detected more accurately. A missing dependency spotted by Stéphane Glondu has been added.

Testing directory ordering issues: disorderfs During the reproducible builds workshop at DebConf, participants identified that we were still short of a good way to test variations on filesystem behaviors (e.g. file ordering or disk usage). Andrew Ayer took a couple of hours to create disorderfs. Based on FUSE, disorderfs is an overlay filesystem that will mount the content of a directory at another location. For this first version, it will make the order in which files appear in a directory random.

Documentation update Dhole documented how to implement support for SOURCE_DATE_EPOCH in Python, bash, Makefiles, CMake, and C. Chris Lamb started to convert the wiki page describing SOURCE_DATE_EPOCH into a Freedesktop-like specification in the hope that it will convince more upstreams to adopt it.

Package reviews 44 reviews have been removed, 192 added and 77 updated this week. New issues identified this week: locale_dependent_order_in_devlibs_depends, randomness_in_ocaml_startup_files, randomness_in_ocaml_packed_libraries, randomness_in_ocaml_custom_executables, undeterministic_symlinking_by_rdfind, random_build_path_by_golang_compiler, and images_in_pdf_generated_by_latex. 117 new FTBFS bugs have been reported by Chris Lamb, Chris West (Faux), and Niko Tyni.

Misc. Some reproducibility issues may only surface very late. Chris Lamb noticed that the test suite for python-pykmip was now failing because its test certificates have expired. Let's hope no packages are hiding a certificate valid for 10 years somewhere in their source! Pictures courtesy and copyright of Debian's own paparazzi: Aigars Mahinovs.

20 June 2015

Lunar: Reproducible builds: week 5 in Stretch cycle

What happened in the reproducible builds effort this week:

Toolchain fixes Uploads that should help other packages: Patch submitted for toolchain issues: Some discussions have been started in Debian and with upstream:

Packages fixed The following 8 packages became reproducible due to changes in their build dependencies: access-modifier-checker, apache-log4j2, jenkins-xstream, libsdl-perl, maven-shared-incremental, ruby-pygments.rb, ruby-wikicloth, uimaj. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which did not make their way to the archive yet: Discussions that have been started:

reproducible.debian.net Holger Levsen added two new package sets: pkg-javascript-devel and pkg-php-pear. The list of packages with and without notes is now sorted by age of the latest build. Mattia Rizzolo added support for email notifications so that maintainers can be warned when a package becomes unreproducible. Please ask Mattia or Holger or in the #debian-reproducible IRC channel if you want to be notified for your packages!

strip-nondeterminism development Andrew Ayer fixed the gzip handler so that it skips adding a predetermined timestamp when there was none.

Documentation update Lunar added documentation about mtimes of files extracted using unzip being timezone dependent. He also wrote a short example on how to test reproducibility. Stephen Kitt updated the documentation about timestamps in PE binaries. Documentation and scripts to perform weekly reports were published by Lunar.

Package reviews 50 obsolete reviews have been removed, 51 added and 29 updated this week. Thanks Chris West and Mathieu Bridon amongst others. Newly identified issues:

Misc. Lunar will be talking (in French) about reproducible builds at Pas Sage en Seine on June 19th, at 15:00 in Paris. Meeting will happen this Wednesday, 19:00 UTC.

27 May 2015

Vincent Bernat: Live patching QEMU for VENOM mitigation

CVE-2015-3456, also known as VENOM, is a security vulnerability in QEMU virtual floppy controller:
The Floppy Disk Controller (FDC) in QEMU, as used in Xen [...] and KVM, allows local guest users to cause a denial of service (out-of-bounds write and guest crash) or possibly execute arbitrary code via the FD_CMD_READ_ID, FD_CMD_DRIVE_SPECIFICATION_COMMAND, or other unspecified commands.
Even when QEMU has been configured with no floppy drive, the floppy controller code is still active. The vulnerability is easy to test1:
#include <stddef.h>
#include <sys/io.h>

#define FDC_IOPORT 0x3f5
#define FD_CMD_READ_ID 0x0a

int main() {
    ioperm(FDC_IOPORT, 1, 1);
    outb(FD_CMD_READ_ID, FDC_IOPORT);
    for (size_t i = 0;; i++)
        outb(0x42, FDC_IOPORT);
    return 0;
}
Once the fix is installed, all processes still have to be restarted for the upgrade to be effective. It is possible to minimize the downtime by leveraging virsh save. Another possibility would be to patch the running processes. The Linux kernel attracted a lot of interest in this area, with solutions like Ksplice (mostly killed by Oracle), kGraft (by SUSE) and kpatch (by Red Hat), and the inclusion of a common framework in the kernel. Userspace has far fewer out-of-the-box solutions2. I present here a simple and self-contained way to patch a running QEMU to remove the vulnerability without requiring any noticeable downtime. Here is a short demonstration:

Proof of concept First, let's find a workaround that would be simple to implement through live patching: while modifying running code text is possible, it is easier to modify a single variable.

Concept Looking at the code of the floppy controller and the patch, we can avoid the vulnerability by not accepting any command on the FIFO port. Each request would be answered with "Invalid command" (0x80) and a user won't be able to push more bytes to the FIFO until the answer is read and the FIFO queue is reset. Of course, the floppy controller would be rendered useless in this state. But who cares? The list of commands accepted by the controller on the FIFO port is contained in the handlers[] array:
static const struct {
    uint8_t value;
    uint8_t mask;
    const char* name;
    int parameters;
    void (*handler)(FDCtrl *fdctrl, int direction);
    int direction;
} handlers[] = {
    { FD_CMD_READ, 0x1f, "READ", 8, fdctrl_start_transfer, FD_DIR_READ },
    { FD_CMD_WRITE, 0x3f, "WRITE", 8, fdctrl_start_transfer, FD_DIR_WRITE },
    /* [...] */
    { 0, 0, "unknown", 0, fdctrl_unimplemented }, /* default handler */
};
To avoid browsing the array each time a command is received, another array is used to map each command to the appropriate handler:
/* Associate command to an index in the 'handlers' array */
static uint8_t command_to_handler[256];
static void fdctrl_realize_common(FDCtrl *fdctrl, Error **errp)
{
    int i, j;
    static int command_tables_inited = 0;
    /* Fill 'command_to_handler' lookup table */
    if (!command_tables_inited) {
        command_tables_inited = 1;
        for (i = ARRAY_SIZE(handlers) - 1; i >= 0; i--) {
            for (j = 0; j < sizeof(command_to_handler); j++) {
                if ((j & handlers[i].mask) == handlers[i].value) {
                    command_to_handler[j] = i;
                }
            }
        }
    }
    /* [...] */
}
Our workaround is to modify the command_to_handler[] array to map all commands to the fdctrl_unimplemented() handler (the last one in the handlers[] array).

Testing with gdb To check if the workaround works as expected, we test it with gdb. Unless you have compiled QEMU yourself, you need to install a package with debug symbols. Unfortunately, on Debian, they are not available, yet3. On Ubuntu, you can install the qemu-system-x86-dbgsym package after enabling the appropriate repositories. The following function for gdb maps every command to the unimplemented handler:
define patch
  set $handler = sizeof(handlers)/sizeof(*handlers)-1
  set $i = 0
  while ($i < 256)
   set variable command_to_handler[$i++] = $handler
  end
  printf "Done!\n"
end
Attach to the vulnerable process (with attach), call the function (with patch) and detach from the process (with detach). You can check that the exploit is not working anymore. This could easily be automated.
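For instance, a minimal way to script the gdb session (assuming the define block above is saved in a file called patch-venom.gdb and that a single QEMU process is running; the file name and that assumption are mine, not from the original post):
# Attach, load the helper, run it, then detach on exit.
pid=$(pidof qemu-system-x86_64)
gdb -batch -p "$pid" -x patch-venom.gdb -ex patch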

Limitations Using gdb has two main limitations:
  1. It needs to be installed on each host to be patched.
  2. The debug packages need to be installed as well. Moreover, it can be difficult to fetch previous versions of those packages.

Writing a custom patcher To overcome those limitations, we can write a custom patcher using the ptrace() system call without relying on debug symbols being present.

Finding the right memory spot Before being able to modify the command_to_handler[] array, we need to know its location. The first clue is given by the symbol table. To query it, use readelf -s:
$ readelf -s /usr/lib/debug/.build-id/09/95121eb46e2a4c13747ac2bad982829365c694.debug | \
>   sed -n -e 1,3p -e /command_to_handler/p
Symbol table '.symtab' contains 27066 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
  8485: 00000000009f9d00   256 OBJECT  LOCAL  DEFAULT   26 command_to_handler
This table is usually stripped out of the executable to save space, as shown below:
$ file -b /usr/bin/qemu-system-x86_64 | tr , \\n
ELF 64-bit LSB shared object
 x86-64
 version 1 (SYSV)
 dynamically linked
 interpreter /lib64/ld-linux-x86-64.so.2
 for GNU/Linux 2.6.32
 BuildID[sha1]=0995121eb46e2a4c13747ac2bad982829365c694
 stripped
If your distribution provides a debug package, the debug symbols are installed in /usr/lib/debug. Most modern distributions now rely on the build ID4 to map an executable to its debugging symbols, as in the example above; a quick way to do this lookup by hand is sketched after the list below. Without a debug package, you need to recompile the existing package without stripping debug symbols in a clean environment5. On Debian, this can be done by setting the DEB_BUILD_OPTIONS environment variable to nostrip. We now have two possible cases:
  • the easy one, and
  • the hard one.
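The build-ID lookup mentioned above can be done by hand; here is a minimal sketch (bash, using the QEMU binary as an example):
bin=/usr/bin/qemu-system-x86_64
id=$(readelf -n "$bin" | sed -n 's/.*Build ID: //p')
ls /usr/lib/debug/.build-id/${id:0:2}/${id:2}.debug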

The easy case On x86, here is the standard layout of a regular Linux process in memory6: Memory layout of a regular process on x86 The random gaps (ASLR) are here to prevent an attacker from reliably jumping to a particular exploited function in memory. On x86-64, the layout is quite similar. The important point is that the base address of the executable is fixed. The memory mapping of a process is also available through /proc/PID/maps. Here is a shortened and annotated example on x86-64:
$ cat /proc/3609/maps
00400000-00401000         r-xp 00000000 fd:04 483  not-qemu [text segment]
00601000-00602000         r--p 00001000 fd:04 483  not-qemu [data segment]
00602000-00603000         rw-p 00002000 fd:04 483  not-qemu [BSS segment]
[random gap]
02419000-0293d000         rw-p 00000000 00:00 0    [heap]
[random gap]
7f0835543000-7f08356e2000 r-xp 00000000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08356e2000-7f08358e2000 ---p 0019f000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08358e2000-7f08358e6000 r--p 0019f000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08358e6000-7f08358e8000 rw-p 001a3000 fd:01 9319 /lib/x86_64-linux-gnu/libc-2.19.so
7f08358e8000-7f08358ec000 rw-p 00000000 00:00 0
7f08358ec000-7f083590c000 r-xp 00000000 fd:01 5138 /lib/x86_64-linux-gnu/ld-2.19.so
7f0835aca000-7f0835acd000 rw-p 00000000 00:00 0
7f0835b08000-7f0835b0c000 rw-p 00000000 00:00 0
7f0835b0c000-7f0835b0d000 r--p 00020000 fd:01 5138 /lib/x86_64-linux-gnu/ld-2.19.so
7f0835b0d000-7f0835b0e000 rw-p 00021000 fd:01 5138 /lib/x86_64-linux-gnu/ld-2.19.so
7f0835b0e000-7f0835b0f000 rw-p 00000000 00:00 0
[random gap]
7ffdb0f85000-7ffdb0fa6000 rw-p 00000000 00:00 0    [stack]
With a regular executable, the value given in the symbol table is an absolute memory address:
$ readelf -s not-qemu | \
>   sed -n -e 1,3p -e /command_to_handler/p
Symbol table '.dynsym' contains 9 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    47: 0000000000602080   256 OBJECT  LOCAL  DEFAULT   25 command_to_handler
So, the address of command_to_handler[], in the above example, is just 0x602080.

The hard case To enhance security, it is possible to load some executables at a random base address, just like a library. Such an executable is called a Position Independent Executable (PIE). An attacker won't be able to rely on a fixed address to find some helpful function. Here is the new memory layout: Memory layout of a PIE process on x86 With a PIE process, the value in the symbol table is now an offset from the base address.
$ readelf -s not-qemu-pie | sed -n -e 1,3p -e /command_to_handler/p
Symbol table '.dynsym' contains 17 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    47: 0000000000202080   256 OBJECT  LOCAL  DEFAULT   25 command_to_handler
If we look at /proc/PID/maps, we can figure out where the array is located in memory:
$ cat /proc/12593/maps
7f6c13565000-7f6c13704000 r-xp 00000000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c13704000-7f6c13904000 ---p 0019f000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c13904000-7f6c13908000 r--p 0019f000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c13908000-7f6c1390a000 rw-p 001a3000 fd:01 9319  /lib/x86_64-linux-gnu/libc-2.19.so
7f6c1390a000-7f6c1390e000 rw-p 00000000 00:00 0
7f6c1390e000-7f6c1392e000 r-xp 00000000 fd:01 5138  /lib/x86_64-linux-gnu/ld-2.19.so
7f6c13b2e000-7f6c13b2f000 r--p 00020000 fd:01 5138  /lib/x86_64-linux-gnu/ld-2.19.so
7f6c13b2f000-7f6c13b30000 rw-p 00021000 fd:01 5138  /lib/x86_64-linux-gnu/ld-2.19.so
7f6c13b30000-7f6c13b31000 rw-p 00000000 00:00 0
7f6c13b31000-7f6c13b33000 r-xp 00000000 fd:04 4594  not-qemu-pie [text segment]
7f6c13cf0000-7f6c13cf3000 rw-p 00000000 00:00 0
7f6c13d2e000-7f6c13d32000 rw-p 00000000 00:00 0
7f6c13d32000-7f6c13d33000 r--p 00001000 fd:04 4594  not-qemu-pie [data segment]
7f6c13d33000-7f6c13d34000 rw-p 00002000 fd:04 4594  not-qemu-pie [BSS segment]
[random gap]
7f6c15c46000-7f6c15c67000 rw-p 00000000 00:00 0     [heap]
[random gap]
7ffe823b0000-7ffe823d1000 rw-p 00000000 00:00 0     [stack]
The base address is 0x7f6c13b31000, the offset is 0x202080 and therefore, the location of the array is 0x7f6c13d33080. We can check with gdb:
(gdb) print &command_to_handler
$1 = (uint8_t (*)[256]) 0x7f6c13d33080 <command_to_handler>

Patching a memory spot Once we know the location of the command_to_handler[] array in memory, patching it is quite straightforward. First, we start tracing the target process:
/* Attach to the running process */
static int
patch_attach(pid_t pid)
{
    int status;
    siginfo_t si;
    printf("[.] Attaching to PID %d...\n", pid);
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        fprintf(stderr, "[!] Unable to attach to PID %d: %m\n", pid);
        return -1;
    }
    if (waitpid(pid, &status, 0) == -1) {
        fprintf(stderr, "[!] Error while attaching to PID %d: %m\n", pid);
        return -1;
    }
    assert(WIFSTOPPED(status)); /* Tracee may have died */
    if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == -1) {
        fprintf(stderr, "[!] Unable to read siginfo for PID %d: %m\n", pid);
        return -1;
    }
    assert(si.si_signo == SIGSTOP); /* Other signals may have been received */
    printf("[*] Successfully attached to PID %d\n", pid);
    return 0;
}
Then, we retrieve the command_to_handler[] array, modify it and put it back in memory7.
static int
patch_doit(pid_t pid, unsigned char *target)
{
    int ret = -1;
    unsigned char *command_to_handler = NULL;
    size_t i;
    /* Get the table */
    printf("[.] Retrieving command_to_handler table...\n");
    command_to_handler = ptrace_read(pid,
                                     target,
                                     QEMU_COMMAND_TO_HANDLER_SIZE);
    if (command_to_handler == NULL) {
        fprintf(stderr, "[!] Unable to read command_to_handler table: %m\n");
        goto out;
    }
    /* Check if the table has already been patched. */
    /* [...] */
    /* Patch it */
    printf("[.] Patching QEMU...\n");
    for (i = 0; i < QEMU_COMMAND_TO_HANDLER_SIZE; i++) {
        command_to_handler[i] = QEMU_NOT_IMPLEMENTED_HANDLER;
    }
    if (ptrace_write(pid, target, command_to_handler,
           QEMU_COMMAND_TO_HANDLER_SIZE) == -1) {
        fprintf(stderr, "[!] Unable to patch command_to_handler table: %m\n");
        goto out;
    }
    printf("[*] QEMU successfully patched!\n");
    ret = 0;
out:
    free(command_to_handler);
    return ret;
}
Since ptrace() only allows reading or writing one word at a time, ptrace_read() and ptrace_write() are wrappers to read or write arbitrarily large chunks of memory8. Here is the code for ptrace_read():
/* Read memory of the given process */
static void *
ptrace_read(pid_t pid, void *address, size_t size)
{
    /* Allocate the buffer */
    uword_t *buffer = malloc((size/sizeof(uword_t) + 1)*sizeof(uword_t));
    if (!buffer) return NULL;
    /* Read word by word */
    size_t readsz = 0;
    do {
        errno = 0;
        if ((buffer[readsz/sizeof(uword_t)] =
                ptrace(PTRACE_PEEKTEXT, pid,
                       (unsigned char*)address + readsz,
                       0)) && errno) {
            fprintf(stderr, "[!] Unable to peek one word at address %p: %m\n",
                    (unsigned char *)address + readsz);
            free(buffer);
            return NULL;
        }
        readsz += sizeof(uword_t);
    } while (readsz < size);
    return (unsigned char *)buffer;
}

Putting the pieces together The patcher is provided with the following information:
  • the PID of the process to be patched,
  • the command_to_handler[] offset from the symbol table, and
  • the build ID of the executable file used to get this offset (as a safety measure).
The main steps are:
  1. Attach to the process with ptrace().
  2. Get the executable name from /proc/PID/exe.
  3. Parse /proc/PID/maps to find the address of the text segment (it's the first one; a rough shell equivalent is sketched below).
  4. Do some sanity checks:
    • check there is an ELF header at this location (4-byte magic number),
    • check the executable type (ET_EXEC for regular executables, ET_DYN for PIE), and
    • get the build ID and compare with the expected one.
  5. From the base address and the provided offset, compute the location of the command_to_handler[] array.
  6. Patch it.
You can find the complete patcher on GitHub.
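For illustration only, steps 2 and 3 can be approximated from the shell (the PID is the one used in the example run below; the real patcher does this in C):
pid=16833
readlink /proc/$pid/exe                 # executable name
sed -n '1s/-.*//p' /proc/$pid/maps      # start of the first mapping, i.e. the base address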
$ ./patch --build-id 0995121eb46e2a4c13747ac2bad982829365c694 \
>         --offset 9f9d00 \
>         --pid 16833
[.] Attaching to PID 16833...
[*] Successfully attached to PID 16833
[*] Executable name is /usr/bin/qemu-system-x86_64
[*] Base address is 0x7f7eea912000
[*] Both build IDs match
[.] Retrieving command_to_handler table...
[.] Patching QEMU...
[*] QEMU successfully patched!

  1. The complete code for this test is on GitHub.
  2. An interesting project seems to be Katana. But there are also some insightful hacking papers on the subject.
  3. Some packages come with a -dbg package with debug symbols, some others don't. Fortunately, a proposal to automatically produce debugging symbols for everything is near completion.
  4. The Fedora Wiki contains the rationale behind the build ID.
  5. If the build is incorrectly reproduced, the build ID won t match. The information provided by the debug symbols may or may not be correct. Debian currently has a reproducible builds effort to ensure that each package can be reproduced.
  6. Anatomy of a program in memory is a great blog post explaining in more details how a program lives in memory.
  7. Being an uninitialized static variable, the variable is in the BSS section. This section is mapped to a writable memory segment. Even if it weren't the case, on Linux, the ptrace() system call is still allowed to write: Linux will copy the page and mark it as private.
  8. With Linux 3.2 or later, process_vm_readv() and process_vm_writev() can be used to transfer data from/to a remote process without using ptrace() at all. However, ptrace() would still be needed to reliably stop the main thread.

4 February 2015

Vincent Bernat: Directory bookmarks with Zsh

There are numerous projects to implement directory bookmarks in your favorite shell. An inherent limitation of those implementations is that they are only an enhanced cd command: you cannot use a bookmark in an arbitrary command. UPDATED: My initial implementation with Zsh was using dynamic named directories. I have been pointed on Twitter to a simpler way to implement bookmarks. The article has been updated to reflect that. As a side note, it is also possible to just use shell variables1. Zsh comes with a little-known feature called static named directories. They are declared with the hash builtin and can be referred to by prepending ~ to them:
$ hash -d -- -lldpd=/home/bernat/code/deezer/lldpd
$ echo ~-lldpd/README.md
/home/bernat/code/deezer/lldpd/README.md
$ head -n1 ~-lldpd/README.md
lldpd: implementation of IEEE 802.1ab (LLDP)
Because ~-lldpd is substituted during file name expansion, it is possible to use it in any command like a regular directory, as shown above. The - prefix is only here to avoid collisions with home directories. Bookmarks are kept in a dedicated directory, $MARKPATH. Each bookmark is a symbolic link to the target directory: for example, ~-lldpd should be expanded to $MARKPATH/lldpd which points to the appropriate directory. Assuming that you have populated $MARKPATH with some links, here is how bookmarks are registered:
for link ($MARKPATH/*(N@)) {
    hash -d -- -${link:t}=${link:A}
}
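For example, before this loop runs, $MARKPATH just needs to exist and contain one symbolic link per bookmark; an illustrative setup (the directory location is an assumption, not from the original article):
MARKPATH=$HOME/.marks
mkdir -p $MARKPATH
ln -s ~/code/deezer/lldpd $MARKPATH/lldpd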
You also automatically get completion and prompt expansion:
$ pwd
/home/bernat/code/deezer/lldpd/src/lib
$ echo ${(%):-%~}
~-lldpd/src/lib
The last step is to manage bookmarks without adding or removing symbolic links manually. The following bookmark() function will display the existing bookmarks when called without arguments, will remove a bookmark when called with -d or add the current directory as a bookmark otherwise.
bookmark() {
    if (( $# == 0 )); then
        # When no arguments are provided, just display existing
        # bookmarks
        for link in $MARKPATH/*(N@); do
            local markname="$fg[green]${link:t}$reset_color"
            local markpath="$fg[blue]${link:A}$reset_color"
            printf "%-30s -> %s\n" $markname $markpath
        done
    else
        # Otherwise, we may want to add a bookmark or delete an
        # existing one.
        local -a delete
        zparseopts -D d=delete
        if (( $+delete[1] )); then
            # With -d, we delete an existing bookmark
            command rm $MARKPATH/$1
        else
            # Otherwise, add a bookmark to the current
            # directory. The first argument is the bookmark
            # name. "." is special and means the bookmark should
            # be named after the current directory.
            local name=$1
            [ $name == "." ] && name=${PWD:t}
            ln -s $PWD $MARKPATH/$name
        fi
    fi
}
Find the complete version on GitHub.

Dynamic named directories Another (more complex) way to achieve the same thing is to use dynamic named directories. I was initially using this solution but it is far more complex. This section is only kept for historical reasons. You can find the complete implementation in my GitHub repository. During file name expansion, a ~ followed by a string in square brackets is provided to the zsh_directory_name() function which will eventually reply with a directory name. This feature can be used to implement directory bookmarks:
$ cd ~[@lldpd]
$ pwd
/home/bernat/code/deezer/lldpd
$ echo ~[@lldpd]/README.md
/home/bernat/code/deezer/lldpd/README.md
$ head -n1 ~[@lldpd]/README.md
lldpd: implementation of IEEE 802.1ab (LLDP)
As with static named directories, because ~[@lldpd] is substituted during file name expansion, it is possible to use it in any command like a regular directory.

Basic implementation Bookmarks are still kept in a dedicated directory, $MARKPATH, and are still symbolic links. Here is how the core feature is implemented:
_bookmark_directory_name() {
    emulate -L zsh # ❶
    setopt extendedglob
    case $1 in
        n)
            [[ $2 != (#b)"@"(?*) ]] && return 1 # ❷
            typeset -ga reply
            reply=(${${:-$MARKPATH/$match[1]}:A}) # ❸
            return 0
            ;;
        *)
            return 1
            ;;
    esac
    return 0
}
add-zsh-hook zsh_directory_name _bookmark_directory_name
zsh_directory_name() is a function accepting hooks2: instead of defining it directly, we define another function and register it as a hook with add-zsh-hook. The hook is expected to handle different situations. The first one is to be able to transform a dynamic name into a regular directory name. In this case, the first parameter of the function is n and the second one is the dynamic name. In ❶, the call to emulate will restore the pristine behaviour of Zsh and also ensure that any option set in the scope of the function will not have an impact outside. The function can then be reused safely in another environment. In ❷, we check that the dynamic name starts with @ followed by at least one character. Otherwise, we declare we don't know how to handle it. Another hook will get the chance to do something. (#b) is a globbing flag: it activates backreferences for parenthesised groups. When a match is found, it is stored as an array, $match. In ❸, we build the reply. We could have just returned $MARKPATH/$match[1] but to hide the symbolic link mechanism, we use the A modifier to ask Zsh to resolve symbolic links if possible. Zsh allows nested substitutions, so it is possible to use modifiers and flags on anything. ${:-$MARKPATH/$match[1]} is a common trick to turn $MARKPATH/$match[1] into a parameter substitution and be able to apply the A modifier on it.

Completion Zsh is also able to ask for completion of a dynamic directory name. In this case, the completion system calls the hook function with c as the first argument.
_bookmark_directory_name() {
    # [...]
    case $1 in
        c)
            # Completion
            local expl
            local -a dirs
            dirs=($MARKPATH/*(N@:t)) # ❶
            dirs=("@"${^dirs}) # ❷
            _wanted dynamic-dirs expl 'bookmarked directory' compadd -S\] -a dirs
            return
            ;;
        # [...]
    esac
    # [...]
}
First, in ❶, we create a list of possible bookmarks. In *(N@:t), N@ is a glob qualifier: N makes the glob expand to nothing if there is no match (otherwise, we would get an error) while @ only returns symbolic links. t is a modifier which removes all leading pathname components. This is equivalent to using basename or ${something##*/} in POSIX shells but it plays nice with glob expressions. In ❷, we just add @ before each bookmark name. If we have b1, b2 and b3 as bookmarks, ${^dirs} expands to {b1,b2,b3} and therefore "@"${^dirs} expands to the (@b1 @b2 @b3) array. The result is then fed into the completion system.

Prompt expansion Many people put the name of the current directory in their prompt. It would be nice to have the bookmark name instead of the full name when we are below a bookmarked directory. That's also possible!
$ pwd
/home/bernat/code/deezer/lldpd/src/lib
$ echo ${(%):-%~}
~[@lldpd]/src/lib
The prompt expansion system calls the hook function with d as first argument and the file name to transform.
_bookmark_directory_name() {
    # [...]
    case $1 in
        d)
            local link slink
            local -A links
            for link ($MARKPATH/*(N@)) {
                links[${#link:A}$'\0'${link:A}]=${link:t} # ❶
            }
            for slink (${(@On)${(k)links}}) {
                link=${slink#*$'\0'} # ❷
                if [[ $2 = (#b)(${link})(|/*) ]]; then
                    typeset -ga reply
                    reply=("@"${links[$slink]} $(( ${#match[1]} )) )
                    return 0
                fi
            }
            return 1
            ;;
        # [...]
    esac
    # [...]
}
OK. This is some black Zsh wizardry. Feel free to skip the explanation. This is a bit complex because we want to substitute the most specific bookmark, hence the need to sort bookmarks by the lengths of their targets. In ❶, the associative array $links is created by iterating on each symbolic link ($link) in the $MARKPATH directory. The goal is to map a target directory to the matching bookmark name. However, we need to iterate on this map from the longest to the shortest key. To achieve that, we prepend each key with its length. Remember, ${link:A} is the absolute path with symbolic links resolved, so ${#link:A} is the length of this path. We concatenate the length of the target directory with the target directory name and use $'\0' as a separator because this is the only safe character for this purpose. The result is mapped to the bookmark name. The second loop is an iteration on the keys of the associative array $links (thanks to the use of the k parameter flag in ${(k)links}). Those keys are turned into an array (@ parameter flag) and sorted numerically in descending order (On parameter flag). Since the keys are directory names prefixed by their lengths, the first match will be the longest one. In ❷, we extract the directory name from the key by removing the length and the null character at the beginning. Then, we check if the extracted directory name matches the file name we have been provided. Again, (#b) just activates backreferences. With extended globbing, we can use the alternation operator, |. So, when the file name either matches the directory name exactly or is somewhere deeper, we create the reply, which is an array whose first member is the bookmark name and the second member is the length of the matched prefix.

Easy typing Typing ~[@ is cumbersome. Hopefully, Zsh line editor can be extended with additional bindings. The following snippet will substitute @@ (if typed without a pause) by ~[@:
vbe-insert-bookmark()  
    emulate -L zsh
    LBUFFER=$ LBUFFER "~[@"
 
zle -N vbe-insert-bookmark
bindkey '@@' vbe-insert-bookmark
In combination with the autocd option and completion, it is quite easy to jump to a bookmarked directory.
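As an illustration of that workflow (a sketch; it assumes the binding and completion code above are loaded in your ~/.zshrc):
setopt autocd
# Typing @@lld<TAB> expands to ~[@lldpd]; pressing Enter then jumps to the bookmarked directory.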

  1. For example:
    $ lldpd=/home/bernat/code/deezer/lldpd
    $ echo $lldpd/README.md
    /home/bernat/code/deezer/lldpd/README.md
    $ head -n1 $lldpd/README.md
    lldpd: implementation of IEEE 802.1ab (LLDP)
    
    The drawback is that you don't have a separate namespace for your bookmarks. You can still use a special prefix for that. Also, no prompt expansion.
  2. Other functions accepting hooks are chpwd() or precmd().

23 December 2014

Vincent Bernat: Eudyptula Challenge: superfast Linux kernel booting

The Eudyptula Challenge is a series of programming exercises for the Linux kernel, that start from a very basic Hello world kernel module, moving on up in complexity to getting patches accepted into the main Linux kernel source tree.
One of the first tasks of this quite interesting challenge is to compile and boot your own kernel. eudyptula-boot is a self-contained shell script to boot any kernel image to a shell. It is packed with the following features: In the following video, eudyptula-boot is used to boot the host kernel and execute a few commands:
In the next one, we use it to boot a custom kernel with an additional system call. This is the fifteenth task of the Eudyptula Challenge. A test program is used to check that the system call is working as expected. Additionally, we demonstrate how to attach a debugger to the running kernel.
While this hack could be used to run containers3 with an increased isolation, the performance of the 9p filesystem is unfortunately quite poor.

  1. The only requirement is to have 9p virtio support enabled. This can easily be enabled with make kvmconfig.
  2. Only udev is started.
  3. A good way to start a container is to combine --root, --force and --exec parameters. Add --readwrite to the mix if you want to keep the modifications.

18 November 2014

Erich Schubert: Generate iptables rules via pyroman

Vincent Bernat blogged on using Netfilter rulesets, pointing out that inserting the rules one-by-one using iptables calls may leave your firewall temporarily incomplete, eventually half-working, and that this approach can be slow.
He's right with that, but there are tools that do this properly. ;-)
Some years ago, for a multi-homed firewall, I wrote a tool called Pyroman. Using rules specified either in Python or XML syntax, it generates a firewall ruleset for you.
But it also addresses the points Vincent raised:
  • It uses iptables-restore to load the firewall more efficiently than by calling iptables a hundred times
  • It will backup the previous firewall, and roll-back on errors (or lack of confirmation, if you are remote and use --safe)
It also has a nice feature for use in staging: it can generate firewall rule sets offline, to allow you to review them before use, or to transfer them to a different host. Not all functionality is supported though (e.g. the Firewall.hostname constant usable in python conditionals will still be the name of the host you generate the rules on - you may want to add a --hostname parameter to pyroman)
pyroman --print-verbose will generate a script readable by iptables-restore except for one problem: it contains both the rules for IPv4 and for IPv6, separated by #### IPv6 rules. It will also annotate the origin of the rule, for example:
# /etc/pyroman/02_icmpv6.py:82
-A rfc4890f -p icmpv6 --icmpv6-type 255 -j DROP
indicates that this particular line was produced due to line 82 in file /etc/pyroman/02_icmpv6.py. This makes debugging easier. In particular it allows pyroman to produce a meaningful error message if the rules are rejected by the kernel: it will tell you which line caused the rule that was rejected.
For the next version, I will probably add --output-ipv4 and --output-ipv6 options to make this more convenient to use. So far, pyroman is meant to be used on the firewall itself.
Note: if you have configured a firewall that you are happy with, you can always use iptables-save to dump the current firewall. But it will not preserve comments, obviously.

16 November 2014

Vincent Bernat: Staging a Netfilter ruleset in a network namespace

A common way to build a firewall ruleset is to run a shell script calling iptables and ip6tables. This is convenient since you get access to variables and loops. There are three major drawbacks with this method:
  1. While the script is running, the firewall is temporarily incomplete. Even if existing connections can be arranged to be left untouched, the new ones may not be allowed to be established (or unauthorized flows may be allowed). Also, essential NAT rules or mangling rules may be absent.
  2. If an error occurs, you are left with a half-working firewall. Therefore, you should ensure that some rules authorizing remote access are set very early. Or implement some kind of automatic rollback system.
  3. Building a large firewall can be slow. Each ip{,6}tables command will download the ruleset from the kernel, add the rule and upload the whole modified ruleset to the kernel.

Using iptables-restore A classic way to solve these problems is to build a rule file that will be read by iptables-restore and ip6tables-restore1. Those tools send the ruleset to the kernel in one pass. The kernel applies it atomically. Usually, such a file is built with ip{,6}tables-save but a script can fit the task. The ruleset syntax understood by ip{,6}tables-restore is similar to the syntax of ip{,6}tables but each table has its own block and chain declaration is different. See the following example:
$ iptables -P FORWARD DROP
$ iptables -t nat -A POSTROUTING -s 192.168.0.0/24 -j MASQUERADE
$ iptables -N SSH
$ iptables -A SSH -p tcp --dport ssh -j ACCEPT
$ iptables -A INPUT -i lo -j ACCEPT
$ iptables -A OUTPUT -o lo -j ACCEPT
$ iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
$ iptables -A FORWARD -j SSH
$ iptables-save
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 192.168.0.0/24 -j MASQUERADE
COMMIT
*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:SSH - [0:0]
-A INPUT -i lo -j ACCEPT
-A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -j SSH
-A OUTPUT -o lo -j ACCEPT
-A SSH -p tcp -m tcp --dport 22 -j ACCEPT
COMMIT
As you see, we have one block for the nat table and one block for the filter table. The user-defined chain SSH is declared at the top of the filter block with other builtin chains. Here is a script diverting ip{,6}tables commands to build such a file (heavily relying on some Zsh-fu2):
#!/bin/zsh
set -e
work=$(mktemp -d)
trap "rm -rf $work" EXIT
# ❶ Redefine ip{,6}tables
iptables() {
    # Intercept -t
    local table="filter"
    [[ -n ${@[(r)-t]} ]] && {
        # Which table?
        local index=${(k)@[(r)-t]}
        table=${@[(( index + 1 ))]}
        argv=( $argv[1,(( $index - 1 ))] $argv[(( $index + 2 )),$#] )
    }
    [[ -n ${@[(r)-N]} ]] && {
        # New user chain
        local index=${(k)@[(r)-N]}
        local chain=${@[(( index + 1 ))]}
        print ":${chain} -" >> ${work}/${0}-${table}-userchains
        return
    }
    [[ -n ${@[(r)-P]} ]] && {
        # Policy for a builtin chain
        local index=${(k)@[(r)-P]}
        local chain=${@[(( index + 1 ))]}
        local policy=${@[(( index + 2 ))]}
        print ":${chain} ${policy}" >> ${work}/${0}-${table}-policy
        return
    }
    # iptables-restore only handle double quotes
    echo ${${(q-)@}//\'/\"} >> ${work}/${0}-${table}-rules #'
}
functions[ip6tables]=${functions[iptables]}
# ❷ Build the final ruleset that can be parsed by ip{,6}tables-restore
save() {
    for table (${work}/${1}-*-rules(:t:s/-rules//)) {
        print "*${${table}#${1}-}"
        [ ! -f ${work}/${table}-policy ] || cat ${work}/${table}-policy
        [ ! -f ${work}/${table}-userchains ] || cat ${work}/${table}-userchains
        cat ${work}/${table}-rules
        print "COMMIT"
    }
}
# ❸ Execute rule files
for rule in $(run-parts --list --regex '^[.a-zA-Z0-9_-]+$' ${0%/*}/rules); do
    . $rule
done
# ❹ Execute ip{,6}tables-restore
ret=0
save iptables  | iptables-restore  || ret=$?
save ip6tables | ip6tables-restore || ret=$?
exit $ret
In ❶, a new iptables() function is defined and will shadow the iptables command. It will try to locate the -t parameter to know which table should be used. If such a parameter exists, the table is remembered in the $table variable and removed from the list of arguments. Defining a new chain (with -N) is also handled, as well as setting the policy (with -P). In ❷, the save() function will output a ruleset that should be parseable by ip{,6}tables-restore. In ❸, user rules are executed. Each ip{,6}tables command will call the previously defined function. When no error has occurred, in ❹, ip{,6}tables-restore is invoked. The command will either succeed or fail. This method works just fine3. However, the second method is more elegant.

Using a network namespace A hybrid approach is to build the firewall rules with ip{,6}tables in a newly created network namespace, save them with ip{,6}tables-save and apply them in the main namespace with ip{,6}tables-restore. Here is the gist (still using Zsh syntax):
#!/bin/zsh
set -e
alias main='/bin/true ||'
[ -n "$iptables" ] || {
    # ❶ Execute ourself in a dedicated network namespace
    iptables=1 unshare --net -- \
        $0 4> >(iptables-restore) 6> >(ip6tables-restore)
    # ❷ In main namespace, disable iptables/ip6tables commands
    alias iptables=/bin/true
    alias ip6tables=/bin/true
    alias main='/bin/false ||'
}
# ❸ In both namespaces, execute rule files
for rule in $(run-parts --list --regex '^[.a-zA-Z0-9_-]+$' ${0%/*}/rules); do
    . $rule
done
# ❹ In test namespace, save the rules
[ -z "$iptables" ] || {
    iptables-save >&4
    ip6tables-save >&6
}
In ❶, the current script is executed in a new network namespace. Such a namespace has its own ruleset that can be modified without altering the one in the main namespace. The $iptables environment variable tells in which namespace we are. In the new namespace, we execute all the rule files (❸). They contain classic ip{,6}tables commands. If an error occurs, we stop here and nothing happens, thanks to the use of set -e. Otherwise, in ❹, the rulesets of the new namespace are saved using ip{,6}tables-save and sent to dedicated file descriptors. Now, the execution in the main namespace resumes in ❶. The results of ip{,6}tables-save are fed to ip{,6}tables-restore. At this point, the firewall is mostly operational. However, we will replay the rule files (❸) but the ip{,6}tables commands will be disabled (❷). Additional commands in the rule files, like enabling IP forwarding, will be executed. The new namespace does not provide the same environment as the main namespace. For example, there is no network interface in it, so we cannot get or set IP addresses. A command that must not be executed in the new namespace should be prefixed by main:
main ip addr add 192.168.15.1/24 dev lan-guest
You can look at a complete example on GitHub.

  1. Another nifty tool is iptables-apply which will apply a rule file and rollback after a given timeout unless the change is confirmed by the user.
  2. As you can see in the snippet, Zsh comes with some powerful features to handle arrays. Another big advantage of Zsh is it does not require quoting every variable to avoid field splitting. Hence, the script can handle values with spaces without a problem, making it far more robust.
  3. If I were nitpicking, there are three small flaws with it. First, when an error occurs, it can be difficult to match the appropriate location in your script since you get the position in the ruleset instead. Second, a table can be used before it is defined. So, it may be difficult to spot some copy/paste errors. Third, the IPv4 firewall may fail while the IPv6 firewall is applied, and vice-versa. Those flaws are not present in the next method.

Vincent Bernat: Intel Wireless 7260 as an access point

My home router acts as an access point with an Intel Dual-Band Wireless-AC 7260 wireless card. This card supports 802.11ac (on the 5 GHz band) and 802.11n (on both the 5 GHz and 2.4 GHz band). While this seems a very decent card to use in managed mode, this is not really a great choice for an access point.
$ lspci -k -nn -d 8086:08b1
03:00.0 Network controller [0280]: Intel Corporation Wireless 7260 [8086:08b1] (rev 73)
        Subsystem: Intel Corporation Dual Band Wireless-AC 7260 [8086:4070]
        Kernel driver in use: iwlwifi
TL;DR: Use an Atheros card instead.

Limitations First, the card is said to be dual-band, but you can only use one band at a time because there is only one radio. Almost all wireless cards have this limitation. If you want to use both the 2.4 GHz band and the less crowded 5 GHz band, two cards are usually needed.

5 GHz band There is no support for setting up an access point on the 5 GHz band: the firmware doesn't allow it. This can be checked with iw:
$ iw reg get
country CH: DFS-ETSI
        (2402 - 2482 @ 40), (N/A, 20), (N/A)
        (5170 - 5250 @ 80), (N/A, 20), (N/A)
        (5250 - 5330 @ 80), (N/A, 20), (0 ms), DFS
        (5490 - 5710 @ 80), (N/A, 27), (0 ms), DFS
        (57240 - 65880 @ 2160), (N/A, 40), (N/A), NO-OUTDOOR
$ iw list
Wiphy phy0
[...]
        Band 2:
                Capabilities: 0x11e2
                        HT20/HT40
                        Static SM Power Save
                        RX HT20 SGI
                        RX HT40 SGI
                        TX STBC
                        RX STBC 1-stream
                        Max AMSDU length: 3839 bytes
                        DSSS/CCK HT40
                Frequencies:
                        * 5180 MHz [36] (20.0 dBm) (no IR)
                        * 5200 MHz [40] (20.0 dBm) (no IR)
                        * 5220 MHz [44] (20.0 dBm) (no IR)
                        * 5240 MHz [48] (20.0 dBm) (no IR)
                        * 5260 MHz [52] (20.0 dBm) (no IR, radar detection)
                          DFS state: usable (for 192 sec)
                          DFS CAC time: 60000 ms
                        * 5280 MHz [56] (20.0 dBm) (no IR, radar detection)
                          DFS state: usable (for 192 sec)
                          DFS CAC time: 60000 ms
[...]
While the 5 GHz band is allowed by the CRDA, all frequencies are marked with no IR. Here is the explanation for this flag:
The no-ir flag exists to allow regulatory domain definitions to disallow a device from initiating radiation of any kind and that includes using beacons, so for example AP/IBSS/Mesh/GO interfaces would not be able to initiate communication on these channels unless the channel does not have this flag.

Multiple SSID This card can only advertise one SSID. Managing several of them is useful to set up distinct wireless networks, like a public access (routed to Tor), a guest access and a private access. iw can confirm this:
$ iw list
        valid interface combinations:
                 * #{ managed } <= 1, #{ AP, P2P-client, P2P-GO } <= 1, #{ P2P-device } <= 1,
                   total <= 3, #channels <= 1
Here is the output of an Atheros card able to manage 8 SSID:
$ iw list
        valid interface combinations:
                 * #{ managed, WDS, P2P-client } <= 2048, #{ IBSS, AP, mesh point, P2P-GO } <= 8,
                   total <= 2048, #channels <= 1

Configuration as an access point Except for those two limitations, the card works fine as an access point. Here is the configuration that I use for hostapd:
interface=wlan-guest
driver=nl80211
# Radio
ssid=XXXXXXXXX
hw_mode=g
channel=11
# 802.11n
wmm_enabled=1
ieee80211n=1
ht_capab=[HT40-][SHORT-GI-20][SHORT-GI-40][DSSS_CCK-40]
# WPA
auth_algs=1
wpa=2
wpa_passphrase=XXXXXXXXXXXXXXX
wpa_key_mgmt=WPA-PSK
wpa_pairwise=TKIP
rsn_pairwise=CCMP
Because of the use of channel 11, only the 802.11n HT40- rate can be enabled. Look at the Wikipedia page for 802.11n to check whether you can use HT40-, HT40+ or both.

Vincent Bernat: Replacing Swisscom router by a Linux box

I have recently moved to Lausanne, Switzerland. Broadband Internet access is not as cheap as in France. Free, a French ISP, provides an FTTH access with a bandwidth of 1 Gbps1 for about €38 (including TV and phone service), while Swisscom provides roughly the same service for about €200 2. Swisscom fiber access was available for my apartment and I chose the 40 Mbps contract without phone service for about €80. Like many ISPs, Swisscom provides an Internet box with an additional box for TV. I didn't unpack the TV box as I have no use for it. The Internet box comes with some nice features like the ability to set up firewall rules, a guest wireless access and some file sharing possibilities. No shell access! I bought a small PC to act as a router and replace the Internet box. I have loaded the upcoming Debian Jessie on it. You can find the whole software configuration in a GitHub repository. This blog post only covers the Swisscom-specific setup (and QoS). Have a look at those two blog posts for related topics:

Ethernet The Internet box is packed with a Siligence-branded 1000BX SFP3. This SFP receives and transmits data on the same fiber using a different wavelength for each direction. Instead of using a network card with an SFP port, I bought a Netgear GS110TP which comes with 8 gigabit copper ports and 2 fiber SFP ports. It is a cheap switch bundled with many interesting features like VLAN and LLDP. It works fine if you don't expect too much from it.

IPv4 IPv4 connectivity is provided over VLAN 10. A DHCP client is mandatory. Moreover, the DHCP vendor class identifier option (option 60) needs to be advertised. This can be done by adding the following line to /etc/dhcp/dhclient.conf when using the ISC DHCP client:
send vendor-class-identifier "100008,0001,,Debian";
The first two numbers are here to identify the service you are requesting. I suppose this can be read as requesting the Swisscom residential access service. You can put whatever you want after that. Once you get a lease, you need to use a browser to identify yourself to Swisscom on the first use.

IPv6 Swisscom provides IPv6 access through the 6rd protocol. This is a tunneling mechanism to facilitate IPv6 deployment across an IPv4 infrastructure. This kind of tunnel is natively supported by Linux since kernel version 2.6.33. To set up IPv6, you need the base IPv6 prefix and the 6rd gateway. Some ISPs provide those values through DHCP (option 212) but this is not the case for Swisscom. The gateway is 6rd.swisscom.com and the prefix is 2a02:1200::/28. After appending the IPv4 address to the prefix, you still get 4 bits for internal subnets. Swisscom doesn't provide a fixed IPv4 address, so it is not possible to precompute the IPv6 prefix. When installed as a DHCP hook (in /etc/dhcp/dhclient-exit-hooks.d/6rd), the following script configures the tunnel:
sixrd_iface=internet6
sixrd_mtu=1472                  # This is 1500 - 20 - 8 (PPPoE header)
sixrd_ttl=64
sixrd_prefix=2a02:1200::/28     # No way to guess, just have to know it.
sixrd_br=193.5.29.1             # That's "6rd.swisscom.com"
sixrd_down() {
    ip tunnel del ${sixrd_iface} || true
}
sixrd_up() {
    ipv4=${new_ip_address:-$old_ip_address}
    sixrd_subnet=$(ruby <<EOF
require 'ipaddr'
prefix = IPAddr.new "${sixrd_prefix}", Socket::AF_INET6
prefixlen = ${sixrd_prefix#*/}
ipv4 = IPAddr.new "${ipv4}", Socket::AF_INET
ipv6 = IPAddr.new (prefix.to_i + (ipv4.to_i << (64 + 32 - prefixlen))), Socket::AF_INET6
puts ipv6
EOF
)
    # Let's configure the tunnel
    ip tunnel add ${sixrd_iface} mode sit local $ipv4 ttl $sixrd_ttl
    ip tunnel 6rd dev ${sixrd_iface} 6rd-prefix ${sixrd_prefix}
    ip addr add ${sixrd_subnet}1/64 dev ${sixrd_iface}
    ip link set mtu ${sixrd_mtu} dev ${sixrd_iface}
    ip link set ${sixrd_iface} up
    ip route add default via ::${sixrd_br} dev ${sixrd_iface}
}
case $reason in
    BOUND|REBOOT)
        sixrd_down
        sixrd_up
        ;;
    RENEW|REBIND)
        if [ "$new_ip_address" != "$old_ip_address" ]; then
            sixrd_down
            sixrd_up
        fi
        ;;
    STOP|EXPIRE|FAIL|RELEASE)
        sixrd_down
        ;;
esac
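As a worked example, here is the same computation performed by hand for a made-up IPv4 lease of 84.226.43.21 (the address and the resulting prefix are purely illustrative):
ruby -ripaddr -e '
prefix = IPAddr.new "2a02:1200::/28", Socket::AF_INET6
ipv4   = IPAddr.new "84.226.43.21", Socket::AF_INET
ipv6   = IPAddr.new (prefix.to_i + (ipv4.to_i << (64 + 32 - 28))), Socket::AF_INET6
puts ipv6
'
This prints 2a02:1205:4e22:b150:: and the hook would then put 2a02:1205:4e22:b150::1/64 on the tunnel interface.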
The computation of the IPv6 prefix is offloaded to Ruby instead of trying to use the shell for that. Even if the ipaddr module is pretty basic, it suits the job. Swisscom uses the same MTU for all clients. Because some of them are using PPPoE, the MTU is 1472 instead of 1480. You can easily check your MTU with this handy online MTU test tool. It is not uncommon for PMTUD to be broken on some parts of the Internet. While not ideal, setting up TCP MSS clamping will alleviate any problem you may run into with an MTU less than 1500:
ip6tables -t mangle -A POSTROUTING -o internet6 \
          -p tcp --tcp-flags SYN,RST SYN \
          -j TCPMSS --clamp-mss-to-pmtu

QoS UPDATED: Unfortunately, this section is incorrect, including its premise. Have a look at Dave Taht's comment for more details. Once upon a time, QoS was a tricky subject. The Wonder Shaper was a common way to get a somewhat working setup. Nowadays, thanks to the work of the Bufferbloat project, there are two simple steps to get something quite good:
  1. Reduce the queue of your devices to something like 32 packets. This helps TCP to detect congestion and act accordingly while still being able to saturate a gigabit link.
    ip link set txqueuelen 32 dev lan
    ip link set txqueuelen 32 dev internet
    ip link set txqueuelen 32 dev wlan
    
  2. Change the root qdisc to fq_codel. A qdisc receives packets to be sent from the kernel and decides how they are handed to the network card. Packets can be dropped, reordered or rate-limited. fq_codel is a queuing discipline combining fair queuing and controlled delay. Fair queuing means that all flows get an equal chance to be served; put another way, a high-bandwidth flow won't starve the queue. Controlled delay means that the queue size will be limited to ensure the latency stays low. This is achieved by dropping packets more aggressively when the queue grows.
    tc qdisc replace dev lan root fq_codel
    tc qdisc replace dev internet root fq_codel
    tc qdisc replace dev wlan root fq_codel
    

  1. Maximum download speed is 1 Gbps, while maximum upload speed is 200 Mbps.
  2. This is the standard Vivo XL package rated at CHF 169, plus the 1 Gbps option at CHF 80.
  3. There are two references on it: SGA 441SFP0-1Gb and OST-1000BX-S34-10DI. It transmits on the 1310 nm wavelength and receives on the 1490 nm one.

1 August 2014

Raphaël Hertzog: My Free Software Activities in July 2014

This is my monthly summary of my free software related activities. If you're among the people who made a donation to support my work (548.59 €, thanks everybody!), then you can learn how I spent your money. Otherwise it's just an interesting status update on my various projects. Distro Tracker Now that tracker.debian.org is live, people reported bugs (on the new tracker.debian.org pseudo-package that I requested) faster than I could fix them. Still I spent many, many hours on this project, reviewing submitted patches (thanks to Christophe Siraut, Joseph Herlant, Dimitri John Ledkov, Vincent Bernat, James McCoy, Andrew Starr-Bochicchio who all submitted some patches!), fixing bugs, making sure the code works with Django 1.7, and started the same with Python 3. I added a tox.ini so that I can easily run the test suite in all 4 supported environments (created by tox as virtualenv with the combinations of Django 1.6/1.7 and Python 2.7/3.4). Over the month, the git repository has seen 73 commits, we fixed 16 bugs and other issues that were only reported over IRC in #debian-qa. With the help of Enrico Zini and Martin Zobel, we enabled the possibility to login via sso.debian.org (Debian's official SSO) so that Debian developers don't even have to explicitly create their account. As usual more help is needed and I'll gladly answer your questions and review your patches. Misc packaging work Publican. I pushed a new upstream release of publican and dropped a useless build-dependency that was plagued by a difficult to fix RC bug (#749357 for the curious, I tried to investigate but it needs major work for make 4.x compatibility). GNOME 3.12. With gnome-shell 3.12 hitting unstable, I had to update gnome-shell-timer (and filed an upstream ticket at the same time), a GNOME Shell extension to start some run-down counters. Django 1.7. I packaged python-django 1.7 release candidate 1 in experimental (found a small bug, submitted a ticket with a patch that got quickly merged) and filed 85 bugs against all the reverse dependencies to ask their maintainers to test their package with Django 1.7 (that we want to upload before the freeze obviously). We identified a pain point in upgrade for packages using South and tried to discuss it with upstream, but after closer investigation, none of the packages are really affected. But the problem can hit administrators of non-packaged Django applications. Misc stuff. I filed a few bugs (#754282 against git-import-orig uscan, #756319 against wnpp to see if someone would be willing to package loomio), reviewed an updated package for django-ratelimit in #755611, made a non-maintainer upload of mairix (without prior notice) to update the package to a new upstream release and bring it to modern packaging norms (Mako failed to make an upload in 4 years so I just went ahead and did what I would have done if it were mine). Kali work resulting in Debian contributions Kali wants to switch from being based on stable to being based on testing so I did try to set up britney to manage a new kali-rolling repository and encountered some problems that I reported to debian-release. Niels Thykier has been very helpful and even managed to improve britney thanks to the very specific problem that the kali setup triggered. Since we use reprepro, I did write some Python wrapper to transform the HeidiResult file in a set of reprepro commands but at the same time I filed #756399 to request proper support of heidi files in reprepro.
While analyzing britney's excuses file, I also noticed that the Kali mirrors contain many source packages that are useless because they only concern architectures that we don't host (and I filed #756523 against reprepro). While trying to build a live image of kali-rolling, I noticed that libdb5.1 and db5.1-util were still marked as priority standard when in fact Debian already switched to db5.3 and thus should only be optional (I filed #756623 against ftp.debian.org). When doing some upgrade tests from kali (wheezy based) to kali-rolling (jessie based) I noticed some problems that were also affecting Debian Jessie. I filed #756629 against libfile-fcntllock-perl (with a patch), and also #756618 against texlive-base (missing Replaces header). I also pinged Colin Watson on #734946 because I got a spurious base-passwd prompt during upgrade (that was triggered because schroot copied my unstable's /etc/passwd file in the kali chroot and the package noticed a difference on the shell of all system users). Thanks See you next month for a new summary of my activities.


5 May 2014

Vincent Bernat: Dashkiosk: manage dashboards on multiple displays

Dashkiosk is a solution to manage dashboards on multiple displays. It comes in four parts:
  1. A server will manage the screens by sending them the URLs to be displayed. A web interface enables an administrator to configure groups of dashboards and attach them to a set of displays.
  2. A receiver runs in a browser attached to each screen. On start, it contacts the server and waits for the URL to display.
  3. An Android application provides a simple fullscreen webview to display the receiver.
  4. A Chromecast custom receiver which will run the regular receiver to display dashboards using Google Chromecast devices. The server is able to drive Chromecast devices through nodecastor, a reimplementation of the sender API.
For a demo, have a look at the following video (it is also available as an Ogg Theora video).

27 April 2014

Vincent Bernat: Local corporate APT repositories

Distributing software efficiently across your platform can be difficult. Every distribution comes with a package manager which is usually suited for this task. APT can be relied upon when using Debian or a derivative. Unfortunately, the official repositories may not contain everything you need. When you require unpackaged software or more recent versions, it is possible to set up your own local repository. Most of what is presented here was set up for Dailymotion and was greatly inspired by the work done by Raphaël Pinson at Orange.

Setting up your repositories There are three kinds of repositories you may want to setup:
  1. A distribution mirror. Such a mirror will save bandwidth, provide faster downloads and permanent access, even when someone searches Google on Google.
  2. A local repository for your own packages with the ability to have a staging zone to test packages on some servers before putting them in production.
  3. Mirrors for unofficial repositories, like Ubuntu PPA. To avoid unexpected changes, such a repository will also get a staging and a production zone.
Before going further, it is quite important to understand what a repository is. Let's illustrate with the following line from my /etc/apt/sources.list:
deb http://ftp.debian.org/debian/ unstable main contrib non-free
In this example, http://ftp.debian.org/debian/ is the repository and unstable is the distribution. A distribution is subdivided into components. We have three components: main, contrib and non-free. To set up the repositories, we will use reprepro. This is not the only solution but it has a good balance between versatility and simplicity. reprepro can only handle one repository. So, the first choice is about how you will split your packages into repositories, distributions and components. Here is what matters:
  • A repository cannot contain two identical packages (same name, same version, same architecture).
  • Inside a component, you can only have one version of a package.
  • Usually, a distribution is a subset of the versions while a component is a subset of the packages. For example, in Debian, with the distribution unstable, you choose to get the most recent versions while with the component main, you choose to get DFSG-free software only.
If you go for several repositories, you will have to handle several reprepro instances and won't be able to easily copy packages from one place to another. At Dailymotion, we put everything in the same repository but it would also be perfectly valid to have three repositories:
  • one to mirror the distribution,
  • one for your local packages, and
  • one to mirror unofficial repositories.
Here is our target setup: Local APT repository

Initial setup First, create a system user to work with the repositories:
$ adduser --system --disabled-password --disabled-login \
>         --home /srv/packages \
>         --group reprepro
All operations should be done with this user only. If you want to set up several repositories, create a directory for each of them. Each repository has those subdirectories:
  • conf/ contains the configuration files,
  • gpg/ contains the GPG stuff to sign the repository1,
  • logs/ contains the logs,
  • www/ contains the repository that should be exported by the web server.
Here is the content of conf/options:
outdir +b/www
logdir +b/logs
gnupghome +b/gpg
Then, you need to create the GPG key to sign the repository:
$ GNUPGHOME=gpg gpg --gen-key
Please select what kind of key you want:
   (1) RSA and RSA (default)
   (2) DSA and Elgamal
   (3) DSA (sign only)
   (4) RSA (sign only)
Your selection? 1
RSA keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 4096
Requested keysize is 4096 bits
Please specify how long the key should be valid.
         0 = key does not expire
      <n>  = key expires in n days
      <n>w = key expires in n weeks
      <n>m = key expires in n months
      <n>y = key expires in n years
Key is valid for? (0) 10y
Key expires at mer. 08 nov. 2023 22:30:58 CET
Is this correct? (y/N) y
Real name: Dailymotion Archive Automatic Signing Key
Email address: the-it-operations@dailymotion.com
Comment: 
[...]
By setting an empty password, you allow reprepro to run unattended. You will have to distribute the public key of your new repository to let APT check the archive signature. An easy way is to ship it in some package.
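One possibility (the file name is illustrative and the destination is simply the standard APT keyring directory of that era):
$ GNUPGHOME=gpg gpg --export > dailymotion-archive.gpg
A package would then install this file as /etc/apt/trusted.gpg.d/dailymotion-archive.gpg, or it could be added manually with apt-key add.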

Local mirror of an official distribution Let s start by mirroring a distribution. We want a local mirror of Ubuntu Precise. For this, we need to do two things:
  1. Setup a new distribution in conf/distributions.
  2. Configure the update sources in conf/updates.
Let s add this block to conf/distributions:
# Ubuntu Precise
Origin: Ubuntu
Label: Ubuntu
Suite: precise
Version: 12.04
Codename: precise
Architectures: i386 amd64
Components: main restricted universe multiverse
UDebComponents: main restricted universe multiverse
Description: Ubuntu Precise 12.04 (with updates and security)
Contents: .gz .bz2
UDebIndices: Packages Release . .gz
Tracking: minimal
Update: - ubuntu-precise ubuntu-precise-updates ubuntu-precise-security
SignWith: yes
This defines the precise distribution in our repository. It contains four components: main, restricted, universe and multiverse (like the regular distribution in official repositories). The Update line starts with a dash. This means reprepro will mark everything as deleted before updating with the provided sources. Old packages will not be kept when they are removed from Ubuntu. In conf/updates, we define the sources:
# Ubuntu Precise
Name: ubuntu-precise
Method: http://fr.archive.ubuntu.com/ubuntu
Fallback: http://de.archive.ubuntu.com/ubuntu
Suite: precise
Components: main restricted universe multiverse
UDebComponents: main restricted universe multiverse
Architectures: amd64 i386
VerifyRelease: 437D05B5
GetInRelease: no
# Ubuntu Precise Updates
Name: ubuntu-precise-updates
Method: http://fr.archive.ubuntu.com/ubuntu
Fallback: http://de.archive.ubuntu.com/ubuntu
Suite: precise-updates
Components: main restricted universe multiverse
UDebComponents: main restricted universe multiverse
Architectures: amd64 i386
VerifyRelease: 437D05B5
GetInRelease: no
# Ubuntu Precise Security
Name: ubuntu-precise-security
Method: http://fr.archive.ubuntu.com/ubuntu
Fallback: http://de.archive.ubuntu.com/ubuntu
Suite: precise-security
Components: main restricted universe multiverse
UDebComponents: main restricted universe multiverse
Architectures: amd64 i386
VerifyRelease: 437D05B5
GetInRelease: no
The VerifyRelease lines are the GPG key fingerprints used to check the remote repository. The key needs to be imported into the local keyring:
$ gpg --keyring /usr/share/keyrings/ubuntu-archive-keyring.gpg \
>     --export 437D05B5 | GNUPGHOME=gpg gpg --import
Another important point is that we merge three distributions (precise, precise-updates and precise-security) into a single distribution (precise) in our local repository. This may cause some difficulties with tools expecting the three distributions to be available (like the Debian Installer2). Next, you can run reprepro and ask it to update your local mirror:
$ reprepro update
This will take some time on the first run. You can execute this command every night. reprepro is not the fastest mirror solution, but it is easy to set up, flexible and reliable.
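For example, a nightly refresh could be scheduled with a cron entry along these lines (the file path and repository directory are the hypothetical ones used above):
# /etc/cron.d/reprepro-mirror (illustrative)
0 2 * * * reprepro reprepro -b /srv/packages/dailymotion update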

Repository for local packages Let's configure the repository to accept local packages. For each official distribution (like precise), we will configure two distributions:
  • precise-staging contains packages that have not been fully tested and are not ready to go to production.
  • precise-prod contains production packages copied from precise-staging.
In our workflow, packages are introduced in precise-staging where they can be tested and will be copied to precise-prod when we want them to be available for production. You can adopt a more complex workflow if you need. The reprepro part is quite easy. We add the following blocks into conf/distributions:
# Dailymotion Precise packages (staging)
Origin: Dailymotion
Label: dm-staging
Suite: precise-staging
Codename: precise-staging
Architectures: i386 amd64 source
Components: main role/dns role/database role/web
Description: Dailymotion Precise staging repository
Contents: .gz .bz2
Tracking: keep
SignWith: yes
NotAutomatic: yes
Log: packages.dm-precise-staging.log
 --type=dsc email-changes
# Dailymotion Precise packages (prod)
Origin: Dailymotion
Label: dm-prod
Suite: precise-prod
Codename: precise-prod
Architectures: i386 amd64 source
Components: main role/dns role/database role/web
Description: Dailymotion Precise prod repository
Contents: .gz .bz2
Tracking: keep
SignWith: yes
Log: packages.dm-precise-prod.log
First, notice we use several components:
  • main will contain packages that are not specific to a subset of the platform. If you put a package in main, it should work correctly on any host.
  • role/* are components dedicated to a subset of the platform. For example, in role/dns, we ship a custom version of BIND.
The staging distribution has the NotAutomatic flag, which prevents the package manager from installing those packages unless the user explicitly requests it. Just below, when a new dsc file is uploaded, the hook email-changes will be executed. It should be in the conf/ directory. The Origin and Label lines are quite important to be able to define an explicit policy of which packages should be installed. Let's say we use the following /etc/apt/sources.list file:
# Ubuntu packages
deb http://packages.dm.gg/dailymotion precise main restricted universe multiverse
# Dailymotion packages
deb http://packages.dm.gg/dailymotion precise-prod    main role/dns
deb http://packages.dm.gg/dailymotion precise-staging main role/dns
All servers have the precise-staging distribution. We must ensure we won't install those packages by mistake. The NotAutomatic flag is one possible safe-guard. We also use a tailored /etc/apt/preferences:
Explanation: Dailymotion packages of a specific component should be more preferred
Package: *
Pin: release o=Dailymotion, l=dm-prod, c=role/*
Pin-Priority: 950
Explanation: Dailymotion packages should be preferred
Package: *
Pin: release o=Dailymotion, l=dm-prod
Pin-Priority: 900
Explanation: staging should never be preferred
Package: *
Pin: release o=Dailymotion, l=dm-staging
Pin-Priority: -100
By default, packages will have a priority of 500. By setting a priority of -100 to the staging distribution, we ensure the packages cannot be installed at all. This is stronger than NotAutomatic which sets the priority to 1. When a package exists in Ubuntu and in our local repository, we ensure that, if this is a production package, we will use ours by using a priority of 900 (or 950 if we match a specific role component). Have a look at the How APT Interprets Priorities section of apt_preferences(5) manual page for additional information. Keep in mind that version matters only when the priority is the same. To check if everything works as you expect, use apt-cache policy:
$ apt-cache policy php5-memcache
  Installed: 3.0.8-1~precise2~dm1
  Candidate: 3.0.8-1~precise2~dm1
  Version table:
 *** 3.0.8-1~precise2~dm1 0
        950 http://packages.dm.gg/dailymotion/ precise-prod/role/web amd64 Packages
        100 /var/lib/dpkg/status
     3.0.8-1~precise1~dm4 0
        900 http://packages.dm.gg/dailymotion/ precise-prod/main amd64 Packages
       -100 http://packages.dm.gg/dailymotion/ precise-staging/main amd64 Packages
     3.0.6-1 0
        500 http://packages.dm.gg/dailymotion/ precise/universe amd64 Packages
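For example, pulling the same hypothetical package from staging onto a test host would look like this:
$ apt-get install -t precise-staging wackadoodle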
The -t precise-staging option raises the priority of this distribution to 990. Once you have tested your package, you can copy it from the staging distribution to the production distribution:
$ reprepro -C main copysrc precise-prod precise-staging wackadoodle

Local mirror of third-party repositories Sometimes, you want software published on some third-party repository without repackaging it yourself. A common example is the repositories provided by hardware vendors. Like for an Ubuntu mirror, there are two steps: defining the distribution and defining the source. We chose to put such mirrors into the same distributions as our local packages but with a dedicated component for each mirror. This way, those third-party packages will share the same workflow as our local packages: they will appear in the staging distribution, we validate them and copy them to the production distribution. The first step is to add the components and an appropriate Update line to conf/distributions:
Origin: Dailymotion
Label: dm-staging
Suite: precise-staging
Components: main role/dns role/database role/web vendor/hp
Update: hp
# [...]
Origin: Dailymotion
Label: dm-prod
Suite: precise-prod
Components: main role/dns role/database role/web vendor/hp
# [...]
We added the vendor/hp component to both the staging and the production distributions. However, only the staging distribution gets an Update line (remember, packages will be copied manually into the production distribution). We declare the source in conf/updates:
# HP repository
Name: hp
Method: http://downloads.linux.hp.com/SDR/downloads/ManagementComponentPack/
Suite: precise/current
Components: non-free>vendor/hp
Architectures: i386 amd64
VerifyRelease: 2689B887
GetInRelease: no
Don't forget to add the GPG key to your local keyring. Notice an interesting feature of reprepro: we copy the remote non-free component to our local vendor/hp component. Then, you can synchronize the mirror with reprepro update. Once the packages have been tested, you will have to copy them into the production distribution.
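That copy step can use the same commands as before; a sketch with a hypothetical HP package name:
$ reprepro -C vendor/hp copysrc precise-prod precise-staging hp-health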

Building Debian packages Our reprepro setup seems complete, but how do we put packages into the staging distribution? You have several options to build Debian packages for your local repository. It really depends on how much time you want to invest in this activity:
  1. Build packages from source by adding a debian/ directory. This is the classic way of building Debian packages. You can start from scratch or use an existing package as a base. In the latter case, the package can be from the official archive but for a more recent distribution, a backport, or a package from an unofficial repository.
  2. Use a tool that will create a binary package from a directory, like fpm. Such a tool will try to guess a lot of things to minimize your work. It can even download everything for you.
There is no universal solution. If you don't have the time budget for building packages from source, have a look at fpm. I would advise you to use the first approach when possible because you will get those perks for free:
  • You keep the sources in your repository. Whenever you need to rebuild something to fix an emergency bug, you won't have to hunt for the sources, which may be unavailable when you need them the most. Of course, this only works if you build packages that don't download stuff directly from the Internet.
  • You also keep the recipe3 to build the package in your repository. If someone enables some option and rebuilds the package, you won't accidentally drop this option on the next build. Those changes can be documented in debian/changelog. Moreover, you can use a version control software for the whole debian/ directory.
  • You can propose your package for inclusion into Debian. This will help many people once the package hits the archive.

Builders We chose pbuilder as a builder4. Its setup is quite straightforward. Here is our /etc/pbuilderrc:
DISTRIBUTION=$DIST
NAME="$DIST-$ARCH"
MIRRORSITE=http://packages.dm.gg/dailymotion
COMPONENTS=("main" "restricted" "universe" "multiverse")
OTHERMIRROR="deb http://packages.dm.gg/dailymotion ${DIST}-staging main"
HOOKDIR=/etc/pbuilder/hooks.d
BASE=/var/cache/pbuilder/dailymotion
BASETGZ=$BASE/$NAME/base.tgz
BUILDRESULT=$BASE/$NAME/results/
APTCACHE=$BASE/$NAME/aptcache/
DEBBUILDOPTS="-sa"
KEYRING="/usr/share/keyrings/dailymotion-archive.keyring.gpg"
DEBOOTSTRAPOPTS=("--arch" "$ARCH" "--variant=buildd" "${DEBOOTSTRAPOPTS[@]}" "--keyring=$KEYRING")
APTKEYRINGS=("$KEYRING")
EXTRAPACKAGES=("dailymotion-archive-keyring")
pbuilder is expected to be invoked with DIST, ARCH and optionally ROLE environment variables. Building the initial bases can be done like this:
for ARCH in i386 amd64; do
  for DIST in precise; do
    export ARCH
    export DIST
    pbuilder --create
  done
done
We don't create a base for each role. Instead, we use a D hook to add the appropriate source:
#!/bin/bash
[ -z "$ROLE" ]    
  cat >> /etc/apt/sources.list <<EOF
deb http://packages.dm.gg/dailymotion $ DIST -staging role/$ ROLE 
EOF
 
apt-get update
We ensure packages from our staging distribution are preferred over other packages by adding an /etc/apt/preferences file in an E hook:
#!/bin/bash
cat > /etc/apt/preferences <<EOF
Explanation: Dailymotion packages are of higher priority
Package: *
Pin: release o=Dailymotion
Pin-Priority: 900
EOF
We also use a C hook to get a shell in case there is an error. This is convenient for debugging a problem:
#!/bin/bash
apt-get install -y --force-yes vim less
cd /tmp/buildd/*/debian/..
/bin/bash < /dev/tty > /dev/tty 2> /dev/tty
A manual build can be run with:
$ ARCH=amd64 DIST=precise ROLE=web pbuilder \
>         --build somepackage.dsc

Version numbering To avoid applying complex rules to choose a version number for a package, we chose to treat everything as a backport, even in-house software. We use the following scheme: X-Y~preciseZ+dmW.
  • X is the upstream version5.
  • Y is the Debian version. If there is no Debian version, use 0.
  • Z is the Ubuntu backport version. Again, if such a version doesn't exist, use 0.
  • W is our version of the package. We increment it when we make a change to the packaging. This is the only number we are allowed to control. All the others are set by an upstream entity, unless it doesn't exist, in which case you use 0.
Let's suppose you need to backport wackadoodle. It is available in a more recent version of Ubuntu as 1.4-3. Your first backport will be 1.4-3~precise0+dm1. After a change to the packaging, the version will be 1.4-3~precise0+dm2. A new upstream version 1.5 is available and you need it. You will use 1.5-0~precise0+dm1. Later, this new upstream version will be available in some version of Ubuntu as 1.5-3ubuntu1. You will rebase your changes on this version and get 1.5-3ubuntu1~precise0+dm1. When using Debian instead of Ubuntu, a compatible convention could be: X-Y~bpo70+Z~dm+W.
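As an illustration of the first backport above, the changelog entry might be created with dch from devscripts; the flags force a version lower than the original and a non-standard target distribution, and this is only a sketch, not the exact command we use:
$ dch -b -v 1.4-3~precise0+dm1 -D precise-staging \
>     --force-distribution "Backport for Dailymotion."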

Uploading To upload a package, a common setup is the following workflow:
  1. Upload the source package to an incoming directory.
  2. reprepro will notice the source package, check its correctness (signature, distribution) and put it in the archive.
  3. The builder will notice a new package needs to be built and build it.
  4. Once the package is built, the builder will upload the result to the incoming directory.
  5. reprepro will notice the new binary package and integrate it into the archive.
This workflow has the disadvantage of having many moving pieces and of leaving the user in the dark while the compilation is in progress. As an alternative, a simple script can be used to execute each step synchronously. The user can then follow on their terminal that everything works as expected. Once we have the .changes file, the build script just issues the appropriate command to include the result in the archive:
$ reprepro -C main include precise-staging \
>      wackadoodle_1.4-3~precise0+dm4_amd64.changes
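For the record, here is a rough sketch of what such a synchronous build script could look like; the paths and the repository location are assumptions matching the pbuilder configuration above, not the exact script we run:
#!/bin/sh -e
# Hypothetical wrapper: build a source package with pbuilder,
# then include the result in the staging distribution.
DSC=$1
export DIST=${DIST:-precise} ARCH=${ARCH:-amd64}
pbuilder --build "$DSC"
# BUILDRESULT as configured in /etc/pbuilderrc above
cd /var/cache/pbuilder/dailymotion/$DIST-$ARCH/results
reprepro -b /srv/packages/dailymotion -C main \
    include ${DIST}-staging *.changes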
Happy hacking!

  1. The gpg/ directory could be shared by several repositories.
  2. We taught Debian Installer to work with our setup with an appropriate preseed file.
  3. fpm-cookery is a convenient tool to write recipes for fpm, similar to Homebrew or a BSD port tree. It could be used to achieve the same goal.
  4. sbuild is an alternative to pbuilder and is the official builder for both Debian and Ubuntu. Historically, pbuilder was more focused on developers' needs.
  5. For a Git snapshot, we use something like 1.4-git20130905+1-ae42dc1 which is a snapshot made after version 1.4 (use 0.0 if no version has ever been released) at the given date. The following 1 is to be able to package different snapshots at the same date while the hash is here in case you need to retrieve the exact snapshot.

18 March 2014

Vincent Bernat: EDNS client subnet support for BIND

To provide geolocation-aware answers with BIND, a common solution is to use a patch adding GeoIP support. A client can be directed to the closest (and hopefully fastest) web server:
view "FRANCE"  
     match-clients   geoip_cityDB_country_FR;  ;
     zone "example.com" in  
         type master;
         file "france.example.com.dns";
      ;
 ;
view "GERMANY"  
     match-clients   geoip_cityDB_country_DE;  ;
     zone "example.com" in  
         type master;
         file "germany.example.com.dns";
      ;
 ;
/* [...] */
view "DEFAULT"  
    zone "example.com" in  
        type master;
        file "example.com.dns";
     ;
 ;
However, an end user does not usually talk directly to authoritative servers. They proxy the query to a third-party recursor server which will query the authoritative server on their behalf. The recursor also caches the answer to be able to serve it directly to other clients. In most cases, we can still rely on the recursor GeoIP location to forward the client to the closest web server because it is located in the client's ISP network, as shown on the following schema: Query for www.example.com through an ISP recursor
  1. Juan is located in China and wants to know the IP address of www.example.com. She queries her ISP resolver.
  2. The resolver asks the authoritative server for the answer.
  3. Because the IP address of the resolver is located in China, the authoritative server decides to answer with the IP address of the web server located in Japan which is the closest one.
  4. Juan can now enjoy short round-trips with the web server.
However, this is not the case when using a public recursor as provided by Google or OpenDNS. In this case, the IP address of the end client and the source IP address of the recursor may not share the same locality. For example, in the following schema, the authoritative server now thinks it is in relation with a European customer and answers with the IP address of the web server located in Europe: Query for www.example.com through an open recursor Moreover, caching makes the problem worse. To solve this problem, a new EDNS extension to expose the client subnet has been proposed. When using this extension, the recursor will provide the client subnet to the authoritative server for it to build an optimized reply. The subnet is vague enough to respect the client's privacy but precise enough to be able to locate it. A patched version of dig allows one to make queries with this new extension:
$ geoiplookup 138.231.136.0
GeoIP Country Edition: FR, France
$ ./bin/dig/dig @dns-02.dailymotion.com www.dailymotion.com \
>     +client=138.231.136.0/24
; <<>> DiG 9.8.1-P1-geoip-1.3 <<>> @dns-02.dailymotion.com www.dailymotion.com +client=138.231.136.0/24
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23312
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 138.231.136.0/24/24
;; QUESTION SECTION:
;www.dailymotion.com.           IN      A
;; ANSWER SECTION:
www.dailymotion.com.    600     IN      A       195.8.215.136
www.dailymotion.com.    600     IN      A       195.8.215.137
;; Query time: 20 msec
;; SERVER: 188.65.127.2#53(188.65.127.2)
;; WHEN: Sun Oct 20 15:44:47 2013
;; MSG SIZE  rcvd: 91
$ geoiplookup 195.8.215.136
GeoIP Country Edition: FR, France
In the above example, a client located in France gets a reply with two IP addresses located in France. If we are now a US client, we will get IP addresses located in the US:
$ geoiplookup 170.149.100.0
GeoIP Country Edition: US, United States
$ ./bin/dig/dig @dns-02.dailymotion.com www.dailymotion.com \
>     +client=170.149.100.0/24
; <<>> DiG 9.8.1-P1-geoip-1.3 <<>> @dns-02.dailymotion.com www.dailymotion.com +client=170.149.100.0/24
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23187
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; CLIENT-SUBNET: 170.149.100.0/24/24
;; QUESTION SECTION:
;www.dailymotion.com.           IN      A
;; ANSWER SECTION:
www.dailymotion.com.    600     IN      A       188.65.120.135
www.dailymotion.com.    600     IN      A       188.65.120.136
;; Query time: 18 msec
;; SERVER: 188.65.127.2#53(188.65.127.2)
;; WHEN: Sun Oct 20 15:47:22 2013
;; MSG SIZE  rcvd: 91
$ geoiplookup 188.65.120.135
GeoIP Country Edition: US, United States
The recursor is expected to cache the two different answers and only serve them if the client matches the appropriate subnet (the one confirmed in the answer from the authoritative server). With this new extension, the authoritative server knows that Juan is located in China and answers with the appropriate IP address: Query for www.example.com through an open recursor with client subnet Not many authoritative servers support this extension (PowerDNS and gdnsd, as far as I know). At Dailymotion, we have built a patch for BIND. It only works when BIND is configured as an authoritative server and it doesn't expose any configuration knobs. Feel free to use it (at your own risk). Once installed, you need to register yourself with OpenDNS and with Google to receive queries with the extension enabled.

24 February 2014

Vincent Bernat: Coping with the TCP TIME-WAIT state on busy Linux servers

TL;DR: Do not enable net.ipv4.tcp_tw_recycle. The Linux kernel documentation is not very helpful about what net.ipv4.tcp_tw_recycle does:
Enable fast recycling TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.
Its sibling, net.ipv4.tcp_tw_reuse is a little bit more documented but the language is about the same:
Allow to reuse TIME-WAIT sockets for new connections when it is safe from protocol viewpoint. Default value is 0. It should not be changed without advice/request of technical experts.
The mere result of this lack of documentation is that we find numerous tuning guides advising to set both these settings to 1 to reduce the number of entries in the TIME-WAIT state. However, as stated by the tcp(7) manual page, the net.ipv4.tcp_tw_recycle option is quite problematic for public-facing servers as it won't handle connections from two different computers behind the same NAT device, which is a problem hard to detect and waiting to bite you:
Enable fast recycling of TIME-WAIT sockets. Enabling this option is not recommended since this causes problems when working with NAT (Network Address Translation).
I will provide here a more detailed explanation in the hope of teaching people who are wrong on the Internet. xkcd illustration As a sidenote, despite the use of ipv4 in its name, the net.ipv4.tcp_tw_recycle control also applies to IPv6. Also, keep in mind we are looking at the TCP stack of Linux. This is completely unrelated to Netfilter connection tracking which may be tweaked in other ways1.

About TIME-WAIT state Let's rewind a bit and have a close look at this TIME-WAIT state. What is it? See the TCP state diagram below2: TCP state diagram Only the end closing the connection first will reach the TIME-WAIT state. The other end will follow a path which usually permits quickly getting rid of the connection. You can have a look at the current state of connections with ss -tan:
$ ss -tan | head -5
LISTEN     0  511             *:80              *:*     
SYN-RECV   0  0     192.0.2.145:80    203.0.113.5:35449
SYN-RECV   0  0     192.0.2.145:80   203.0.113.27:53599
ESTAB      0  0     192.0.2.145:80   203.0.113.27:33605
TIME-WAIT  0  0     192.0.2.145:80   203.0.113.47:50685

Purpose There are two purposes for the TIME-WAIT state:
  • The most known one is to prevent delayed segments from one connection being accepted by a later connection relying on the same quadruplet (source address, source port, destination address, destination port). The sequence number also needs to be in a certain range to be accepted. This narrows the problem a bit, but it still exists, especially on fast connections with large receive windows. RFC 1337 explains in detail what happens when the TIME-WAIT state is deficient3. Here is an example of what could be avoided if the TIME-WAIT state wasn't shortened:
Duplicate segments accepted in another connection
  • The other purpose is to ensure the remote end has closed the connection. When the last ACK is lost, the remote end stays in the LAST-ACK state4. Without the TIME-WAIT state, a connection could be reopened while the remote end still thinks the previous connection is valid. When it receives a SYN segment (and the sequence number matches), it will answer with a RST as it is not expecting such a segment. The new connection will be aborted with an error:
Last ACK lost RFC 793 requires the TIME-WAIT state to last twice the time of the MSL. On Linux, this duration is not tunable and is defined in include/net/tcp.h as one minute:
#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                                  * state, about 60 seconds     */
There have been propositions to turn this into a tunable value but it has been refused on the grounds that the TIME-WAIT state is a good thing.

Problems Now, let's see why this state can be annoying on a server handling a lot of connections. There are three aspects of the problem:
  • the slot taken in the connection table preventing new connections of the same kind,
  • the memory occupied by the socket structure in the kernel, and
  • the additional CPU usage.
The result of ss -tan state time-wait | wc -l is not a problem per se!

Connection table slot A connection in the TIME-WAIT state is kept for one minute in the connection table. This means another connection with the same quadruplet (source address, source port, destination address, destination port) cannot exist. For a web server, the destination address and the destination port are likely to be constant. If your web server is behind a L7 load-balancer, the source address will also be constant. On Linux, the client port is by default allocated in a port range of about 30,000 ports (this can be changed by tuning net.ipv4.ip_local_port_range). This means that only 30,000 connections can be established between the web server and the load-balancer every minute, so about 500 connections per second. If the TIME-WAIT sockets are on the client side, such a situation is easy to detect. The call to connect() will return EADDRNOTAVAIL and the application will log some error message about that. On the server side, this is more complex as there is no log and no counter to rely on. If in doubt, you can come up with something sensible to list the number of used quadruplets:
$ ss -tan 'sport = :80' | awk '{print $(NF)" "$(NF-1)}' | \
>     sed 's/:[^ ]*//g' | sort | uniq -c
    696 10.24.2.30 10.33.1.64
   1881 10.24.2.30 10.33.1.65
   5314 10.24.2.30 10.33.1.66
   5293 10.24.2.30 10.33.1.67
   3387 10.24.2.30 10.33.1.68
   2663 10.24.2.30 10.33.1.69
   1129 10.24.2.30 10.33.1.70
  10536 10.24.2.30 10.33.1.73
The solution is more quadruplets5. This can be done in several ways (in order of difficulty to set up):
  • use more client ports by setting net.ipv4.ip_local_port_range to a wider range (see the example below),
  • use more server ports by asking the web server to listen to several additional ports (81, 82, 83, …),
  • use more client IPs by configuring additional IPs on the load balancer and using them in a round-robin fashion,
  • use more server IPs by configuring additional IPs on the web server6.
Of course, a last solution is to tweak net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle. Don't do that yet, we will cover those settings later.
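As an illustration of the first option in the list above, widening the ephemeral port range is a single sysctl change (the values are only an example):
$ sudo sysctl -w net.ipv4.ip_local_port_range="1025 65535"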

Memory With many connections to handle, leaving a socket open for one additional minute may cost your server some memory. For example, if you want to handle about 10,000 new connections per second, you will have about 600,000 sockets in the TIME-WAIT state. How much memory does it represent? Not that much! First, from the application point of view, a TIME-WAIT socket does not consume any memory: the socket has been closed. In the kernel, a TIME-WAIT socket is present in three structures (for three different purposes):
  1. A hash table of connections, named the TCP established hash table (despite containing connections in other states) is used to locate an existing connection, for example when receiving a new segment. Each bucket of this hash table contains both a list of connections in the TIME-WAIT state and a list of regular active connections. The size of the hash table depends on the system memory and is printed at boot:
    $ dmesg   grep "TCP established hash table"
    [    0.169348] TCP established hash table entries: 65536 (order: 8, 1048576 bytes)
    
    It is possible to override it by specifying the number of entries on the kernel command line with the thash_entries parameter. Each element of the list of connections in the TIME-WAIT state is a struct tcp_timewait_sock, while the type for other states is struct tcp_sock7:
    struct tcp_timewait_sock {
        struct inet_timewait_sock tw_sk;
        u32    tw_rcv_nxt;
        u32    tw_snd_nxt;
        u32    tw_rcv_wnd;
        u32    tw_ts_offset;
        u32    tw_ts_recent;
        long   tw_ts_recent_stamp;
    };
    struct inet_timewait_sock {
        struct sock_common  __tw_common;
        int                     tw_timeout;
        volatile unsigned char  tw_substate;
        unsigned char           tw_rcv_wscale;
        __be16 tw_sport;
        unsigned int tw_ipv6only     : 1,
                     tw_transparent  : 1,
                     tw_pad          : 6,
                     tw_tos          : 8,
                     tw_ipv6_offset  : 16;
        unsigned long            tw_ttd;
        struct inet_bind_bucket *tw_tb;
        struct hlist_node        tw_death_node;
    };
    
  2. A set of lists of connections, called the "death row", is used to expire the connections in the TIME-WAIT state. They are ordered by how much time is left before expiration. It uses the same memory space as for the entries in the hash table of connections. This is the struct hlist_node tw_death_node member of struct inet_timewait_sock.
  3. A hash table of bound ports, holding the locally bound ports and the associated parameters, is used to determine if it is safe to listen to a given port or to find a free port in the case of dynamic bind. The size of this hash table is the same as the size of the hash table of connections:
    $ dmesg   grep "TCP bind hash table"
    [    0.169962] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
    
    Each element is a struct inet_bind_bucket. There is one element for each locally bound port. A TIME-WAIT connection to a web server is locally bound to the port 80 and shares the same entry as its sibling TIME-WAIT connections. On the other hand, a connection to a remote service is locally bound to some random port and does not share its entry.
So, we are only concerned by the space occupied by struct tcp_timewait_sock and struct inet_bind_bucket. There is one struct tcp_timewait_sock for each connection in the TIME-WAIT state, inbound or outbound. There is one dedicated struct inet_bind_bucket for each outbound connection and none for an inbound connection. A struct tcp_timewait_sock is only 168 bytes while a struct inet_bind_bucket is 48 bytes:
$ sudo apt-get install linux-image-$(uname -r)-dbg
[...]
$ gdb /usr/lib/debug/boot/vmlinux-$(uname -r)
(gdb) print sizeof(struct tcp_timewait_sock)
 $1 = 168
(gdb) print sizeof(struct tcp_sock)
 $2 = 1776
(gdb) print sizeof(struct inet_bind_bucket)
 $3 = 48
So, if you have about 40,000 inbound connections in the TIME-WAIT state, it should eat less than 10MB of memory. If you have about 40,000 outbound connections in the TIME-WAIT state, you need to account for 2.5MB of additional memory. Let's check that by looking at the output of slabtop. Here is the result on a server with about 50,000 connections in the TIME-WAIT state, 45,000 of which are outbound connections:
$ sudo slabtop -o | grep -E '(^  OBJS|tw_sock_TCP|tcp_bind_bucket)'
  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 50955  49725  97%    0.25K   3397       15     13588K tw_sock_TCP            
 44840  36556  81%    0.06K    760       59      3040K tcp_bind_bucket
There is nothing to change here: the memory used by TIME-WAIT connections is really small. If your server needs to handle thousands of new connections per second, you need far more memory to be able to efficiently push data to clients. The overhead of TIME-WAIT connections is negligible.

CPU On the CPU side, searching for a free local port can be a bit expensive. The work is done by the inet_csk_get_port() function which uses a lock and iterates over locally bound ports until a free port is found. A large number of entries in this hash table is usually not a problem if you have a lot of outbound connections in the TIME-WAIT state (like ephemeral connections to a memcached server): since the connections usually share the same profile, the function will quickly find a free port as it iterates over them sequentially.

Other solutions If you still think you have a problem with TIME-WAIT connections after reading the previous section, there are three additional solutions to solve them:
  • disable socket lingering,
  • net.ipv4.tcp_tw_reuse, and
  • net.ipv4.tcp_tw_recycle.

Socket lingering When close() is called, any remaining data in the kernel buffers will be sent in the background and the socket will eventually transition to the TIME-WAIT state. The application can continue to work immediately and assume that all data will eventually be safely delivered. However, an application can choose to disable this behaviour, known as socket lingering. There are two flavors:
  1. In the first one, any remaining data will be discarded and instead of closing the connection with the normal four-packet connection termination sequence, the connection will be closed with a RST (and therefore, the peer will detect an error) and will be immediately destroyed. No TIME-WAIT state in this case.
  2. With the second flavor, if there is any data still remaining in the socket send buffer, the process will sleep when calling close() until either all the data is sent and acknowledged by the peer or the configured linger timer expires. It is possible for a process to avoid sleeping by setting the socket as non-blocking. In this case, the same process happens in the background. It permits the remaining data to be sent during a configured timeout but, if the data is successfully sent, the normal close sequence is run and you get a TIME-WAIT state. In the other case, you'll get the connection closed with a RST and the remaining data is discarded.
In both cases, disabling socket lingering is not a one-size-fits-all solution. It may be used by some applications like HAProxy or Nginx when it is safe to use from the upper protocol point of view. There are good reasons to not disable it unconditionally.

net.ipv4.tcp_tw_reuse The TIME-WAIT state prevents delayed segments from being accepted in an unrelated connection. However, under certain conditions, it is possible to assume a new connection's segment cannot be misinterpreted as an old connection's segment. RFC 1323 presents a set of TCP extensions to improve performance over high-bandwidth paths. Among other things, it defines a new TCP option carrying two four-byte timestamp fields. The first one is the current value of the timestamp clock of the TCP sending the option while the second one is the most recent timestamp received from the remote host. By enabling net.ipv4.tcp_tw_reuse, Linux will reuse an existing connection in the TIME-WAIT state for a new outgoing connection if the new timestamp is strictly bigger than the most recent timestamp recorded for the previous connection: an outgoing connection in the TIME-WAIT state can be reused after just one second. How is it safe? The first purpose of the TIME-WAIT state was to avoid duplicate segments being accepted in an unrelated connection. Thanks to the use of timestamps, such duplicate segments will come with an outdated timestamp and therefore be discarded. The second purpose was to ensure the remote end is not in the LAST-ACK state because of the loss of the last ACK. The remote end will retransmit the FIN segment until:
  1. it gives up (and tears down the connection), or
  2. it receives the ACK it is waiting for (and tears down the connection), or
  3. it receives a RST (and tears down the connection).
If the FIN segments are received in a timely manner, the local end socket will still be in the TIME-WAIT state and the expected ACK segments will be sent. Once a new connection replaces the TIME-WAIT entry, the SYN segment of the new connection is ignored (thanks to the timestamps) and won't be answered by a RST but only by a retransmission of the FIN segment. The FIN segment will then be answered with a RST (because the local connection is in the SYN-SENT state) which will allow the transition out of the LAST-ACK state. The initial SYN segment will eventually be resent (after one second) because there was no answer and the connection will be established without apparent error, except a slight delay: Last ACK lost and timewait reuse It should be noted that when a connection is reused, the TWRecycled counter is increased (despite its name).
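On a client machine where this behaviour is wanted for outgoing connections, the knob is a regular sysctl; a minimal sketch:
$ sudo sysctl -w net.ipv4.tcp_tw_reuse=1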

net.ipv4.tcp_tw_recycle This mechanism also relies on the timestamp option but affects both incoming and outgoing connections which is handy when the server usually closes the connection first8. The TIME-WAIT state is scheduled to expire sooner: it will be removed after the retransmission timeout (RTO) interval which is computed from the RTT and its variance. You can spot the appropriate values for a living connection with the ss command:
$ ss --info  sport = :2112 dport = :4057
State      Recv-Q Send-Q    Local Address:Port        Peer Address:Port   
ESTAB      0      1831936   10.47.0.113:2112          10.65.1.42:4057    
         cubic wscale:7,7 rto:564 rtt:352.5/4 ato:40 cwnd:386 ssthresh:200 send 4.5Mbps rcv_space:5792
To keep the same guarantees the TIME-WAIT state was providing, while reducing the expiration timer, when a connection enters the TIME-WAIT state, the latest timestamp is remembered in a dedicated structure containing various metrics for previous known destinations. Then, Linux will drop any segment from the remote host whose timestamp is not strictly bigger than the latest recorded timestamp, unless the TIME-WAIT state would have expired:
if (tmp_opt.saw_tstamp &&
    tcp_death_row.sysctl_tw_recycle &&
    (dst = inet_csk_route_req(sk, &fl4, req, want_cookie)) != NULL &&
    fl4.daddr == saddr &&
    (peer = rt_get_peer((struct rtable *)dst, fl4.daddr)) != NULL) {
        inet_peer_refcheck(peer);
        if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
            (s32)(peer->tcp_ts - req->ts_recent) >
                                        TCP_PAWS_WINDOW) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
                goto drop_and_release;
        }
}
When the remote host is in fact a NAT device, the condition on timestamps will forbid all of the hosts except one behind the NAT device from connecting during one minute because they do not share the same timestamp clock. If in doubt, it is far better to disable this option since it leads to problems that are difficult to detect and difficult to diagnose. The LAST-ACK state is handled in the exact same way as for net.ipv4.tcp_tw_reuse.
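If you suspect such rejections are happening, the PAWS counters (the code above increments LINUX_MIB_PAWSPASSIVEREJECTED) are exposed through netstat; the exact wording of the matching lines depends on your net-tools version:
$ netstat -s | grep -i stamp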

Summary The universal solution is to increase the number of possible quadruplets by using, for example, more server ports. This will allow you to not exhaust the possible connections with TIME-WAIT entries. On the server side, do not enable net.ipv4.tcp_tw_recycle unless you are pretty sure you will never have NAT devices in the mix. Enabling net.ipv4.tcp_tw_reuse is useless for incoming connections. On the client side, enabling net.ipv4.tcp_tw_reuse is another almost-safe solution. Enabling net.ipv4.tcp_tw_recycle in addition to net.ipv4.tcp_tw_reuse is mostly useless. And a final quote by W. Richard Stevens, in Unix Network Programming:
The TIME_WAIT state is our friend and is there to help us (i.e., to let old duplicate segments expire in the network). Instead of trying to avoid the state, we should understand it.

  1. Notably, fiddling with net.netfilter.nf_conntrack_tcp_timeout_time_wait won't change anything on how the TCP stack will handle the TIME-WAIT state.
  2. This diagram is licensed under the LaTeX Project Public License 1.3. The original file is available on this page.
  3. The first work-around proposed in RFC 1337 is to ignore RST segments in the TIME-WAIT state. This behaviour is controlled by net.ipv4.rfc1337 which is not enabled by default on Linux because this is not a complete solution to the problem described in the RFC.
  4. While in the LAST-ACK state, a connection will retransmit the last FIN segment until it gets the expected ACK segment. Therefore, it is unlikely we stay long in this state.
  5. On the client side, older kernels also have to find a free local tuple (source address and source port) for each outgoing connection. Increasing the number of server ports or IPs won't help in this case. Linux 3.2 is recent enough to be able to share the same local tuple for different destinations. Thanks to Willy Tarreau for his insight on this aspect.
  6. This last solution may seem a bit dumb since you could just use more ports but some servers are not able to be configured this way. The second-to-last solution can also be quite cumbersome to set up, depending on the load-balancing software, but uses fewer IPs than the last solution.
  7. The use of a dedicated memory structure for sockets in the TIME-WAIT state dates back to Linux 2.6.14. The struct sock_common structure is a bit more verbose and I won't copy it here.
  8. When the server closes the connection first, it gets the TIME-WAIT state while the client will consider the corresponding quadruplet free and hence may reuse it for a new connection.

1 January 2014

Vincent Bernat: Testing infrastructure with serverspec

Checking if your servers are configured correctly can be done with IT automation tools like Puppet, Chef, Ansible or Salt. They allow an administrator to specify a target configuration and ensure it is applied. They can also run in a dry-run mode and report servers not matching the expected configuration. On the other hand, serverspec is a tool to bring the well known RSpec, a testing tool for the Ruby programming language frequently used for test-driven development, to the infrastructure world. It can be used to remotely test server state through an SSH connection. Why would one use such an additional tool? Many things are easier to express with a test than with a configuration change, like for example checking that a service is correctly installed by checking it is listening on some port.

Getting started Good knowledge of Ruby may help but is not a prerequisite to the use of serverspec. Writing tests feels like writing what we expect in plain English. If you think you need to know more about Ruby first, a short online tutorial is enough to get started. serverspec's homepage contains a short and concise tutorial on how to get started. Please, read it. As a first illustration, here is a test checking a service is correctly listening on port 80:
describe port(80) do
  it { should be_listening }
end
The following test will spot servers still running with Debian Squeeze instead of Debian Wheezy:
describe command("lsb_release -d") do
  it { should return_stdout /wheezy/ }
end
Conditional tests are also possible. For example, we want to check the miimon parameter of bond0, but only when the interface is present:
has_bond0 = file('/sys/class/net/bond0').directory?
# miimon should be set to something other than 0, otherwise, no checks
# are performed.
describe file("/sys/class/net/bond0/bonding/miimon"), :if => has_bond0 do
  it { should be_file }
  its(:content) { should_not eq "0\n" }
end
serverspec comes with a complete documentation of available resource types (like port and command) that can be used after the keyword describe. When a test is too complex to be expressed with simple expectations, it can be specified with arbitrary commands. In the below example, we check if memcached is configured to use almost all the available system memory:
# We want memcached to use almost all memory. With a 2GB margin.
describe "memcached" do
  it "should use almost all memory" do
    total = command("vmstat -s | head -1").stdout # arbitrary shell command
    total = /\d+/.match(total)[0].to_i
    total /= 1024
    args = process("memcached").args # another serverspec resource type
    memcached = /-m (\d+)/.match(args)[1].to_i
    (total - memcached).should be > 0
    (total - memcached).should be < 2000
  end
end
A bit more arcane, but still understandable: we combine an arbitrary shell command and another serverspec resource type, as pointed out by the comments.

Advanced use Out of the box, serverspec provides a strong foundation to build a compliance tool to be run on all systems. It comes with some useful advanced tips, like sharing tests among similar hosts or executing several tests in parallel. I have set up a GitHub repository to be used as a template to get the following features:
  • assign roles to servers and tests to roles;
  • parallel execution;
  • report generation & viewer.

Host classification By default, serverspec-init generates a template where each host has its own directory with its unique set of tests. serverspec only handles test execution on remote hosts: the test execution flow (which tests are executed on which servers) is delegated to some Rakefile1. Instead of extracting the list of hosts to test from a directory hierarchy, we can extract it from a file (or from an LDAP server or from any source) and attach a set of roles to each of them:
hosts = File.foreach("hosts")
  .map { |line| line.strip }
  .map do |host|
  {
    :name => host.strip,
    :roles => roles(host.strip),
  }
end
The roles() function should return a list of roles for a given hostname. It could be something as simple as this:
def roles(host)
  roles = [ "all" ]
  case host
  when /^web-/
    roles << "web"
  when /^memc-/
    roles << "memcache"
  when /^lb-/
    roles << "lb"
  when /^proxy-/
    roles << "proxy"
  end
  roles
end
In the snippet below, we create a task for each server as well as a server:all task that will execute the tests for all hosts. Pay attention to how we attach the roles to each server when building the pattern of tests to run.
namespace :server do
  desc "Run serverspec to all hosts"
  task :all => hosts.map { |h| h[:name] }
  hosts.each do |host|
    desc "Run serverspec to host #{host[:name]}"
    ServerspecTask.new(host[:name].to_sym) do |t|
      t.target = host[:name]
      # Build the list of tests to execute from server roles
      t.pattern = './spec/{' + host[:roles].join(",") + '}/*_spec.rb'
    end
  end
end
You can check the list of tasks created:
$ rake -T
rake check:server:all      # Run serverspec to all hosts
rake check:server:web-10   # Run serverspec to host web-10
rake check:server:web-11   # Run serverspec to host web-11
rake check:server:web-12   # Run serverspec to host web-12
Then, you need to modify spec/spec_helper.rb to tell serverspec to fetch the host to test from the environment variable TARGET_HOST instead of extracting it from the spec file name.

Parallel execution By default, each task is executed when the previous one has finished. With many hosts, this can take some time. rake provides the -j flag to specify the number of tasks to be executed in parallel and the -m flag to apply parallelism to all tasks:
$ rake -j 10 -m check:server:all

Reports rspec is invoked for each host. Therefore, the output is something like this:
$ rake spec
env TARGET_HOST=web-10 /usr/bin/ruby -S rspec spec/web/apache2_spec.rb spec/all/debian_spec.rb
......
Finished in 0.99715 seconds
6 examples, 0 failures
env TARGET_HOST=web-11 /usr/bin/ruby -S rspec spec/web/apache2_spec.rb spec/all/debian_spec.rb
......
Finished in 1.45411 seconds
6 examples, 0 failures
This does not scale well if you have dozens or hundreds of hosts to test. Moreover, the output is mangled with parallel execution. Fortunately, rspec comes with the ability to save results in JSON format. Those per-host results can then be consolidated into a single JSON file. All this can be done in the Rakefile:
  1. For each task, set rspec_opts to --format json --out ./reports/current/#{target}.json. This is done automatically by the subclass ServerspecTask which also handles passing the hostname in an environment variable and a more concise and colored output.
  2. Add a task to collect the generated JSON files into a single report. The test source code is also embedded in the report to make it self-sufficient. Moreover, this task is executed automatically by adding it as a dependency of the last serverspec-related task.
Have a look at the complete Rakefile for more details on how this is done. A very simple web-based viewer can handle those reports2. It shows the test results as a matrix with failed tests in red: Report viewer example Clicking on any test will display the necessary information to troubleshoot errors, including the test short description, the complete test code, the expectation message and the backtrace: Report viewer showing detailed error I hope this additional layer will help make serverspec another feather in the IT cap, between an automation tool and a supervision tool.

  1. A Rakefile is a Makefile where tasks and their dependencies are described in plain Ruby. rake will execute them in the appropriate order.
  2. The viewer is available in the GitHub repository in the viewer/ directory.
