Search Results: "Enrico Zini"

23 November 2021

Enrico Zini: Really lossy compression of JPEG

Suppose you have a tool that archives images, or scientific data, and it has a test suite. It would be good to collect sample files for the test suite, but they are often so big one can't really bloat the repository with them. But does the test suite need everything that is in those files? Not necessarily. For example, if one's testing code that reads EXIF metadata, one doesn't care about what is in the image. That technique works extremely well. I can take GRIB files that are several megabytes in size, zero out their data payload, and get nice 1Kb samples for the test suite. I've started to collect and organise the little hacks I use for this into a tool I called mktestsample:
$ mktestsample -v samples1/*
2021-11-23 20:16:32 INFO common samples1/cosmo_2d+0.grib: size went from 335168b to 120b
2021-11-23 20:16:32 INFO common samples1/grib2_ifs.arkimet: size went from 4993448b to 39393b
2021-11-23 20:16:32 INFO common samples1/polenta.jpg: size went from 3191475b to 94517b
2021-11-23 20:16:32 INFO common samples1/test-ifs.grib: size went from 1986469b to 4860b
Those are massive savings, but I'm not satisfied about those almost 94Kb of JPEG:
$ ls -la samples1/polenta.jpg
-rw-r--r-- 1 enrico enrico 94517 Nov 23 20:16 samples1/polenta.jpg
$ gzip samples1/polenta.jpg
$ ls -la samples1/polenta.jpg.gz
-rw-r--r-- 1 enrico enrico 745 Nov 23 20:16 samples1/polenta.jpg.gz
I believe I did all I could: completely blank out image data, set quality to zero, maximize subsampling, and tweak quantization to throw everything away. Still, the result is a 94Kb file that can be gzipped down to 745 bytes. Is there something I'm missing? I suppose JPEG is better at storing an image than at storing the lack of an image. I cannot really complain :) I can still commit compressed samples of large images to a git repository, taking very little data indeed. That's really nice!
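To get a feel for why a blanked payload shrinks so much once compressed, here is a minimal sketch. It is not part of mktestsample, and the size is borrowed from the listing above only as a convenient number. It compares how gzip handles a run of zero bytes and the same amount of random bytes.

```python
import gzip
import os


def gzipped_size(payload: bytes) -> int:
    """Return the size in bytes of payload after gzip compression."""
    return len(gzip.compress(payload))


size = 94517  # same length as the blanked polenta.jpg above
zeros = gzipped_size(bytes(size))       # a payload made only of zero bytes
noise = gzipped_size(os.urandom(size))  # random data, which gzip cannot shrink
print(f"{size}b of zeros gzip to {zeros}b, {size}b of random data gzip to {noise}b")
```

The zeroed payload ends up tiny, which is consistent with a blanked sample compressing well inside a git repository.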

8 November 2021

Enrico Zini: An educational debugging session

This morning we realised that a test case failed on Fedora 34 only (the link is in Italian) and we set to debugging.
The initial analysis
This is the initial reproducer:
$ PROJ_DEBUG=3 python test
test_recipe (tests.test_litota3.TestLITOTA3NordArkimetIFS) ... pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: Open of /lib64/../share/proj/proj.db failed
pj_open_lib(proj.db): call fopen(/lib64/../share/proj/proj.db) - succeeded
proj_create: no database context specified
Cannot instantiate source_crs
EXCEPTION in py_coast(): ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
Note that opening /lib64/../share/proj/proj.db sometimes succeeds, sometimes fails. It's some kind of Schrödinger path, which works or not depending on how you observe it:
# ls -lad /lib64
lrwxrwxrwx 1 1000 1000 9 Jan 26  2021 /lib64 -> usr/lib64
$ ls -la /lib64/../share/proj/proj.db
-rw-r--r-- 1 root root 8925184 Jan 28  2021 /lib64/../share/proj/proj.db
$ cd /lib64/../share/proj/
$ cd /lib64
$ cd ..
$ cd share
-bash: cd: share: No such file or directory
And indeed, stat(2) finds it, and sqlite doesn't (the file is a sqlite database):
$ stat /lib64/../share/proj/proj.db
  File: /lib64/../share/proj/proj.db
  Size: 8925184     Blocks: 17432      IO Block: 4096   regular file
Device: 33h/51d Inode: 56907       Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-11-08 14:09:12.334350779 +0100
Modify: 2021-01-28 05:38:11.000000000 +0100
Change: 2021-11-08 13:42:51.758874327 +0100
 Birth: 2021-11-08 13:42:51.710874051 +0100
$ sqlite3 /lib64/../share/proj/proj.db
Error: unable to open database "/lib64/../share/proj/proj.db": unable to open database file
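One way to see how the same path can exist for one observer and not for another is to compare a lookup done by the kernel, which follows the /lib64 symlink before applying "..", with a purely textual normalisation, which cancels "lib64/.." out first. The sketch below rebuilds a similar layout in a temporary directory. The function name and layout are made up for illustration. That sqlite or bash treat the path textually is an assumption here, not something established above.

```python
import os
import tempfile


def compare_lookups(base: str) -> dict:
    """Build a tiny "lib64 -> usr/lib64" layout under base and look up the db two ways."""
    os.makedirs(os.path.join(base, "usr", "lib64"))
    os.makedirs(os.path.join(base, "usr", "share", "proj"))
    db = os.path.join(base, "usr", "share", "proj", "proj.db")
    with open(db, "wb"):
        pass
    # lib64 is a relative symlink, like "/lib64 -> usr/lib64" in the listing above
    os.symlink(os.path.join("usr", "lib64"), os.path.join(base, "lib64"))

    path = os.path.join(base, "lib64", "..", "share", "proj", "proj.db")
    # Textual normalisation: "lib64/.." cancels out, giving base/share/proj/proj.db
    lexical = os.path.normpath(path)
    return {
        "db": db,
        "path": path,
        # The kernel follows the symlink first, so ".." means base/usr
        "kernel_lookup_ok": os.path.exists(path),
        "lexical": lexical,
        "lexical_lookup_ok": os.path.exists(lexical),
    }


with tempfile.TemporaryDirectory() as base:
    res = compare_lookups(base)
    print("kernel lookup:", res["kernel_lookup_ok"], "textual lookup:", res["lexical_lookup_ok"])
```

Any code that tidies up the path as a string before opening it would end up looking for /share/proj/proj.db, which matches the failing cd sequence shown earlier.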
A minimal reproducer
Later on we started stripping layers of code towards a minimal reproducer: here it is. It works or doesn't work depending on whether proj is linked explicitly, or via MagPlus:
$ cat
#include <magics/ProjP.h>
int main()
{
    magics::ProjP p("EPSG:4326", "+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs");
    return 0;
}
$ g++ -o tc -I/usr/include/magics  -lMagPlus
$ ./tc
proj_create: Open of /lib64/../share/proj/proj.db failed
proj_create: no database context specified
terminate called after throwing an instance of 'magics::MagicsException'
  what():  ProjP: cannot create crs to crs from [EPSG:4326] to [+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +over +units=m +no_defs]
Aborted (core dumped)
$ g++ -o tc -I/usr/include/magics -lproj  -lMagPlus
$ ./tc
What is going on here? A difference between the two is the path used to link to the proj shared library:
$ ldd ./tc | grep proj
 => /lib64/ (0x00007fd4919fb000)
$ g++ -o tc -I/usr/include/magics   -lMagPlus
$ ldd ./tc | grep proj
 => /lib64/../lib64/ (0x00007f6d1051b000)
Common sense screams that this should not matter, but we chased an intuition and found that one of the ways proj looks for its database is relative to its shared library. Indeed, gdb in hand, that dladdr call returns /lib64/../lib64/. From /lib64/../lib64/, proj strips two paths from the end, presumably to pass from something like /something/usr/lib/ to /something/usr. So, dladdr returns /lib64/../lib64/, which becomes /lib64/../, which becomes /lib64/../share/proj/proj.db, which exists on the file system and is used as a path to the database. But depending on how you look at it, that path might or might not be valid: it passes the stat(2) check that stops the lookup for candidate paths, but sqlite is unable to open it.
Why does the other path work?
By linking in the other way, dladdr returns /lib64/, which becomes /share/proj/proj.db, which doesn't exist, which triggers a fallback to a PROJ_LIB constant defined at compile time, which is a path that works no matter how you look at it.
Why that weird path with libMagPlus?
To complete the picture, we found that libMagPlus is packaged with an rpath set, which is known to cause trouble:
# readelf -d /usr/lib64/ | grep rpath
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib64]
The workaround
We found that one can set PROJ_LIB in the environment to override the normal proj database lookup. Building on that, we came up with a simple way to override it on Fedora 34 only:
    if distro is not None and distro.linux_distribution()[:2] == ("Fedora", "34") and "PROJ_LIB" not in os.environ:
        self.env_overrides["PROJ_LIB"] = "/usr/share/proj/"
This has been a most edifying and educational debugging session, with only the necessary modicum of curses and swearwords. Working in a team of excellent people really helps.

2 November 2021

Enrico Zini: help2man and subcommands

help2man is quite nice for autogenerating manpages from command line help, making sure that they stay up to date as command line options evolve. It works quite well, except for commands with subcommands, like Python programs that use argparse's add_subparsers. So, here's a quick hack that calls help2man for each subcommand, and stitches everything together in a simple manpage.
import re
import shutil
import sys
import subprocess
import tempfile
# TODO: move to argparse
command = sys.argv[1]
# Use to get the program version
res = subprocess.run([sys.executable, "", "--version"], stdout=subprocess.PIPE, text=True, check=True)
version = res.stdout.strip()
# Call the main commandline help to get a list of subcommands
res = subprocess.run([sys.executable, command, "--help"], stdout=subprocess.PIPE, text=True, check=True)
subcommands = re.sub(r'^.+\{(.+)\}.+$', r'\1', res.stdout, flags=re.DOTALL).split(',')
# Generate a help2man --include file with an extra section for each subcommand
with tempfile.NamedTemporaryFile("wt") as tf:
    print("[>DESCRIPTION]", file=tf)
    for subcommand in subcommands:
        res = subprocess.run(
                ["help2man", f"--name={command}", "--section=1",
                 "--no-info", "--version-string=dummy", f"./{command} {subcommand}"],
                stdout=subprocess.PIPE, text=True, check=True)
        subcommand_doc = re.sub(r'^.+.SH DESCRIPTION', '', res.stdout, flags=re.DOTALL)
        print(".SH ", subcommand.upper(), " SUBCOMMAND", file=tf)
        # Append the subcommand's own description under its section header
        tf.write(subcommand_doc)
    with open(f" command", "rt") as fd:
        shutil.copyfileobj(fd, tf)
    # Make sure help2man sees everything written so far
    tf.flush()
    # Call help2man on the main command line help, with the extra include file
    # we just generated
    subprocess.run(
            ["help2man", f"--include={tf.name}", f"--name={command}",
             "--section=1", "--no-info", f"--version-string={version}",
             "--output=arkimaps.1", "./arkimaps"],
            check=True)

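The subcommand discovery step in the help2man script relies on argparse printing the available subcommands as a brace-delimited, comma-separated list in its help output. Here is a small sketch of that one step in isolation. The help text is a made-up sample in the usual argparse layout, and the function name is hypothetical.

```python
import re
from typing import List

# Hypothetical help output, in the layout argparse uses for subparsers
SAMPLE_HELP = """\
usage: demo [-h] {fetch,render,dispatch} ...

positional arguments:
  {fetch,render,dispatch}

optional arguments:
  -h, --help            show this help message and exit
"""


def list_subcommands(help_text: str) -> List[str]:
    """Extract subcommand names from the last {a,b,c} group in a help text."""
    # With DOTALL the pattern swallows the whole text and keeps only what is
    # inside the last pair of braces
    names = re.sub(r'^.+\{(.+)\}.+$', r'\1', help_text, flags=re.DOTALL)
    return names.split(',')


print(list_subcommands(SAMPLE_HELP))
```

This is a text-scraping shortcut: if the help output contains no braces at all, the substitution does nothing and the whole help text comes back as a single bogus "subcommand".
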
22 October 2021

Enrico Zini: Scanning for imports in Python scripts

I had to package a nontrivial Python codebase, and I needed to declare its dependencies. I could do git grep -h import | sort -u, then review the output by hand, but I lacked the motivation for it. Much better to take a stab at solving the general problem. One fun part is scanning a directory tree, using ast to find import statements scattered around the code:
class Scanner:
    def __init__(self):
        self.names: Set[str] = set()
    def scan_dir(self, root: str):
        for dirpath, dirnames, filenames, dir_fd in os.fwalk(root):
            for fn in filenames:
                # Python sources are scanned directly
                if fn.endswith(".py"):
                    with dirfd_open(fn, dir_fd=dir_fd) as fd:
                        self.scan_file(fd, os.path.join(dirpath, fn))
                # Executable files are scanned only if they have a Python shebang
                st = os.stat(fn, dir_fd=dir_fd)
                if st.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
                    with dirfd_open(fn, dir_fd=dir_fd) as fd:
                        try:
                            lead = fd.readline()
                        except UnicodeDecodeError:
                            continue
                        if re_python_shebang.match(lead):
                            self.scan_file(fd, os.path.join(dirpath, fn))
    def scan_file(self, fd: TextIO, pathname: str):
        log.info("Reading file %s", pathname)
        try:
            tree = ast.parse(fd.read(), pathname)
        except SyntaxError as e:
            log.warning("%s: file cannot be parsed", pathname, exc_info=e)
            return
        self.scan_tree(tree)
    def scan_tree(self, tree: ast.AST):
        for stm in tree.body:
            if isinstance(stm, ast.Import):
                for alias in stm.names:
                    if not isinstance(alias.name, str):
                        print("NAME", repr(alias.name), stm)
                    self.names.add(alias.name)
            elif isinstance(stm, ast.ImportFrom):
                if stm.module is not None:
                    self.names.add(stm.module)
            elif hasattr(stm, "body"):
                # Recurse into functions, classes, conditionals and so on
                self.scan_tree(stm)
Another fun part is grouping the imported module names by where in sys.path they have been found:
    scanner = Scanner()
    by_sys_path: Dict[str, List[str]] = collections.defaultdict(list)
    for name in sorted(scanner.names):
        spec = importlib.util.find_spec(name)
        if spec is None or spec.origin is None:
            # Modules that cannot be located are grouped under an empty key
            by_sys_path[""].append(name)
        else:
            for sp in sys.path:
                if spec.origin.startswith(sp):
                    by_sys_path[sp].append(name)
                    break
    for sys_path, names in sorted(by_sys_path.items()):
        print(f"{sys_path or 'unidentified'}:")
        for name in names:
            print(f"  {name}")
An example. It's kind of nice how it can at least tell apart stdlib modules so one doesn't need to read through those:
$ ./scan-imports  /himblick
Maybe such a tool already exists and works much better than this? From a quick search I didn't find it, and it was fun to (re)invent it.
Updates:
Jakub Wilk pointed me to an old python-modules script that finds Debian dependencies. The AST scanning code should be refactored to use ast.NodeVisitor.
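As a rough idea of what the ast.NodeVisitor refactoring mentioned above could look like, here is a minimal sketch. The class and function names are made up for illustration and this is not the code from the actual tool. The visitor dispatches on node type, so there is no need to recurse by hand into anything that has a body.

```python
import ast
from typing import Set


class ImportCollector(ast.NodeVisitor):
    """Collect the names of the modules imported anywhere in a syntax tree."""

    def __init__(self) -> None:
        self.names: Set[str] = set()

    def visit_Import(self, node: ast.Import) -> None:
        # "import a, b.c" yields one alias per imported module
        for alias in node.names:
            self.names.add(alias.name)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        # node.module is None for "from . import x"
        if node.module is not None:
            self.names.add(node.module)


def collect_imports(source: str, filename: str = "<string>") -> Set[str]:
    """Parse Python source and return the set of imported module names."""
    collector = ImportCollector()
    collector.visit(ast.parse(source, filename))
    return collector.names


sample = (
    "import os\n"
    "from collections import abc\n"
    "def f():\n"
    "    import json\n"
    "try:\n"
    "    import re, sys\n"
    "except ImportError:\n"
    "    from . import fallback\n"
)
print(sorted(collect_imports(sample)))
```

Since the default generic_visit walks every child node, this also reaches imports placed in except or else branches, which a recursion that only follows body attributes would skip.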

10 September 2021

Enrico Zini: A nightmare of confcalls and microphones

I had this nightmare where I had a very, very important confcall. I joined with Chrome. Chrome said Failed to access your microphone - Cannot use microphone for an unknown reason. Could not start audio source. I joined with Firefox. Firefox chose Monitor of Built-in Audio Analog Stereo as a microphone, and did not let me change it. Not in the browser, not in pavucontrol. I joined with the browser on my phone, and the webpage said This meeting needs to use your microphone and camera. Select *Allow* when your browser asks for permissions. But the question never came. I could hear people talking. I had very important things to say. I tried typing them in the chat window, but they weren't seeing it. The meeting ended. I was on the verge of tears.
Tell me, Mr. Anderson, what good is a phone call when you are unable to speak?
Since this nightmare happened for real, including the bit about tears in the end, let's see that it doesn't happen again. I should now have three working systems, which hopefully won't all break again all at the same time.
Fixing Chrome
I can reproduce this reliably, on Bullseye's standard Chromium 90.0.4430.212-1, just launched on an empty profile, no extensions. The webpage has camera and microphone allowed. Chrome doesn't show up in the recording tab of pulseaudio. Nothing on Chrome's stdout/stderr. JavaScript console has:
Logger.js:154 2021-09-10Txx:xx:xx.xxxZ [features/base/tracks] Failed to create local tracks
DOMException: Could not start audio source
I found the answer here:
I had the similar problem once with chromium. i could solve it by switching in preferences->microphone-> from "default" to "intern analog stereo".
Opening the little popup next to the microphone/mute button allows choosing other microphones, which work. Only "Same as system (Default)" does not work.
Fixing Firefox
I have firefox-esr 78.13.0esr-1~deb11u1. In Jitsi, microphone selection is disabled on the toolbar and in the settings menu. In pavucontrol, changing the recording device for Firefox has no effect. If for some reason the wrong microphone got chosen, those are not ways of fixing it. What I found works is to click on the camera permission icon, remove microphone permission, then reload the page. At that point Firefox will ask for permission again, and that microphone selection seems to work. Relevant bugs: on Jitsi and on Firefox. Since this is well known (once you find the relevant issues), I'd have appreciated Jitsi at least showing a link to an explanation of workarounds on Firefox, instead of just disabling microphone selection.
Fixing Jitsi on the phone side
I really don't want to preemptively give camera and microphone permissions to my phone browser. I noticed that there's the Jitsi app on F-Droid and much as I hate to use an app when a website would work, at least in this case it's a way to keep the permission sets separate, so I installed that.
Fixing pavucontrol?
I tried to find out why I can't change input device for Firefox on pavucontrol. I only managed to find an Ask Ubuntu question with no answer and a Unix StackExchange question with no answer.

20 July 2021

Enrico Zini: Run a webserver for a specific user *only*

I'm creating a program that uses the web browser for its user interface, and I'm reasonably sure I'm not the first person doing this. Normally such a program would listen to a port on localhost, and tell the browser to connect to it. Bonus points for listening to a randomly allocated free port, so that one does not need to involve some amount of luck to get the program started. However, using a local port still means that any user on the local machine can connect to it, which is generally a security issue. A possible solution would be to use AF_UNIX Unix Domain Sockets, which are supported by various web servers, but as far as I understand not currently by browsers. I checked Firefox and Chrome, and they currently seem to fail to even acknowledge the use case. I'm reasonably sure I'm not the first person doing this, and yes, it's intended as an understatement. So, dear Lazyweb, is there a way to securely use a browser as a UI for a user's program, without exposing access to the backend to other users in the system?
Access token in the URL
Emanuele Di Giacomo suggests adding an access token to the URL that gets passed to the browser. This would work to protect access on localhost: even if the application cannot use HTTPS, other users cannot see packets that go through the local interface, so both the access token and the session cookie that one could send afterwards would be protected.
Network namespaces
I thought about isolating server and browser in a private network namespace with something like unshare(1), but it seems to require root. Johannes Schauer Marin Rodrigues wrote to correct that:
It's possible to unshare the network namespace by first unsharing the user namespace and thus becoming root which is possible without being root since #898446 got fixed. For example you can run this as the normal user:
lxc-usernsexec -- lxc-unshare -s NETWORK -- ip addr
If you don't want to depend on lxc, you can write a wrapper in Perl or Python. I have a Perl implementation of that in mmdebstrap.
Firewalling
Martin Schuster wrote to suggest another option:
I had the same issue. My approach was "weird", but worked: Block /outgoing/ connections to the port, unless the uid is correct. That might be counter-intuitive, but of course all connections /to/ localhost will be done /from/ localhost also. Something like:
iptables -A OUTPUT -p tcp -d localhost --dport 8123 -m owner --uid-owner joe -j ACCEPT
iptables -A OUTPUT -p tcp -d localhost --dport 8123 -j REJECT
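The "access token in the URL" idea, combined with listening on a randomly allocated free port, can be sketched with the standard library alone. This is only an illustration under my own assumptions: the handler, the token query parameter and the helper name are made up, and a real program would likely swap the token for a session cookie after the first request.

```python
import http.server
import secrets
import threading
from urllib.parse import parse_qs, urlparse

# Random token generated at startup, known only to whoever receives the URL
TOKEN = secrets.token_urlsafe(32)


class TokenHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        supplied = query.get("token", [""])[0]
        # Constant-time comparison of the supplied token with the expected one
        if not secrets.compare_digest(supplied.encode(), TOKEN.encode()):
            self.send_error(403, "Missing or wrong access token")
            return
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the example quiet
        pass


def start_server():
    """Listen on a free localhost port and return (server, URL including the token)."""
    # Port 0 asks the kernel for any free port
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), TokenHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    return server, f"http://127.0.0.1:{port}/?token={TOKEN}"
```

The URL returned by start_server() is the one to hand to the browser, for instance with webbrowser.open(url). Requests from other local users that do not carry the token get a 403.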

30 June 2021

Enrico Zini: Systemd containers with unittest

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. Unit testing some parts of Transilience, like the apt and systemd actions, or remote Mitogen connections, can really use a containerized system for testing. To have that, I reused my work on nspawn-runner to build a simple and very fast system of ephemeral containers, with minimal dependencies, based on systemd-nspawn and btrfs snapshots.
Setup
To be able to use systemd-nspawn --ephemeral, the chroots need to be btrfs subvolumes. If you are not running on a btrfs filesystem, you can create one to run the tests, even on a file:
fallocate -l 1.5G testfile
/usr/sbin/mkfs.btrfs testfile
sudo mount -o loop testfile test_chroots/
I created a script to set up the test environment, here is an extract:
mkdir -p test_chroots
cat << EOF > "test_chroots/CACHEDIR.TAG"
Signature: 8a477f597d28d172789f06886806bc55
# chroots used for testing transilience, can be regenerated with make-test-chroot
EOF
btrfs subvolume create test_chroots/buster
eatmydata debootstrap --variant=minbase --include=python3,dbus,systemd buster test_chroots/buster
CACHEDIR.TAG is a nice trick to tell backup software not to bother backing up the contents of this directory, since it can be easily regenerated. eatmydata is optional, and it speeds up debootstrap quite a bit.
Running unittest with sudo
Here's a simple helper to drop root as soon as possible, and regain it only when needed. Note that it needs $SUDO_UID and $SUDO_GID, which are set by sudo, to know which user to drop into:
import contextlib
import os
class ProcessPrivs:
    """Drop root privileges and regain them only when needed"""
    def __init__(self):
        self.orig_uid, self.orig_euid, self.orig_suid = os.getresuid()
        self.orig_gid, self.orig_egid, self.orig_sgid = os.getresgid()
        if "SUDO_UID" not in os.environ:
            raise RuntimeError("Tests need to be run under sudo")
        self.user_uid = int(os.environ["SUDO_UID"])
        self.user_gid = int(os.environ["SUDO_GID"])
        self.dropped = False
    def drop(self):
        """Drop root privileges"""
        if self.dropped:
            return
        # Keep root as the saved id, so that it can be regained later
        os.setresgid(self.user_gid, self.user_gid, 0)
        os.setresuid(self.user_uid, self.user_uid, 0)
        self.dropped = True
    def regain(self):
        """Regain root privileges"""
        if not self.dropped:
            return
        os.setresuid(self.orig_suid, self.orig_suid, self.user_uid)
        os.setresgid(self.orig_sgid, self.orig_sgid, self.user_gid)
        self.dropped = False
    @contextlib.contextmanager
    def root(self):
        """Regain root privileges for the duration of this context manager"""
        if not self.dropped:
            yield
        else:
            self.regain()
            try:
                yield
            finally:
                self.drop()
    @contextlib.contextmanager
    def user(self):
        """Drop root privileges for the duration of this context manager"""
        if self.dropped:
            yield
        else:
            self.drop()
            try:
                yield
            finally:
                self.regain()
privs = ProcessPrivs()
privs.drop()
As soon as this module is loaded, root privileges are dropped, and can be regained for as little as possible using a handy context manager:
    with privs.root():
        subprocess.run(["systemd-run", ...], check=True, capture_output=True)
Using the chroot from test cases
The infrastructure to set up and spin down ephemeral machines is relatively simple, once one has worked out the nspawn incantations:
class Chroot:
    """Manage an ephemeral chroot"""
    running_chroots: Dict[str, "Chroot"] = {}
    def __init__(self, name: str, chroot_dir: Optional[str] = None):
        self.name = name
        if chroot_dir is None:
            self.chroot_dir = self.get_chroot_dir(name)
        else:
            self.chroot_dir = chroot_dir
        self.machine_name = f"transilience-{uuid.uuid4()}"
    def start(self):
        """
        Start nspawn on this given chroot.
        The systemd-nspawn command is run contained into its own unit using
        systemd-run
        """
        unit_config = [
            # ...
        ]
        cmd = ["systemd-run"]
        for c in unit_config:
            cmd.append(f"--property={c}")
        cmd.extend((
            "systemd-nspawn",
            "--ephemeral",
            # ...
            f"--directory={self.chroot_dir}",
            f"--machine={self.machine_name}",
            "--notify-ready=yes"))
        log.info("%s: starting machine using image %s", self.machine_name, self.chroot_dir)
        log.debug("%s: running %s", self.machine_name, " ".join(shlex.quote(c) for c in cmd))
        with privs.root():
            subprocess.run(cmd, check=True, capture_output=True)
        log.debug("%s: started", self.machine_name)
        self.running_chroots[self.machine_name] = self
    def stop(self):
        """Stop the running ephemeral containers"""
        cmd = ["machinectl", "terminate", self.machine_name]
        log.debug("%s: running %s", self.machine_name, " ".join(shlex.quote(c) for c in cmd))
        with privs.root():
            subprocess.run(cmd, check=True, capture_output=True)
        log.debug("%s: stopped", self.machine_name)
        del self.running_chroots[self.machine_name]
    @classmethod
    def create(cls, chroot_name: str) -> "Chroot":
        """Start an ephemeral machine from the given master chroot"""
        res = cls(chroot_name)
        res.start()
        return res
    @classmethod
    def get_chroot_dir(cls, chroot_name: str):
        """Locate a master chroot under test_chroots/"""
        chroot_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "test_chroots", chroot_name))
        if not os.path.isdir(chroot_dir):
            raise RuntimeError(f"{chroot_dir} does not exist or is not a chroot directory")
        return chroot_dir
# We need to use atexit, because unittest won't run
# tearDown/tearDownClass/tearDownModule methods in case of KeyboardInterrupt
# and we need to make sure to terminate the nspawn containers at exit
@atexit.register
def cleanup():
    # Use a list to prevent changing running_chroots during iteration
    for chroot in list(Chroot.running_chroots.values()):
        chroot.stop()
And here's a TestCase mixin that starts a containerized system and opens a Mitogen connection to it:
class ChrootTestMixin:
    """
    Mixin to run tests over a setns connection to an ephemeral systemd-nspawn
    container running one of the test chroots
    """
    chroot_name = "buster"
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        import mitogen
        from transilience.system import Mitogen
        cls.broker = mitogen.master.Broker()
        cls.router = mitogen.master.Router(cls.broker)
        cls.chroot = Chroot.create(cls.chroot_name)
        with privs.root():
            cls.system = Mitogen(
                    cls.chroot.name, "setns", kind="machinectl",
                    container=cls.chroot.machine_name, router=cls.router)
    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()
        # Shut down the Mitogen broker, then terminate the container
        cls.broker.shutdown()
        cls.chroot.stop()
Running tests
Once the tests are set up, everything goes on as normal, except one needs to run nose2 with sudo:
sudo nose2-3
Spin up time for containers is pretty fast, and the tests drop root as soon as possible, and only regain it for as little as needed. Also, dependencies for all this are minimal and available on most systems, and the setup instructions seem pretty straightforward.

29 June 2021

Enrico Zini: Building a Transilience playbook in a zipapp

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. Mitogen is a great library, but scarily complicated, and I've been wondering how hard it would be to make alternative connection methods for Transilience. Here's a wild idea: can I package a whole Transilience playbook, plus dependencies, in a zipapp, then send the zipapp to the machine to be provisioned, and run it locally? It turns out I can.
Creating the zipapp
This is somewhat hackish, but until I can rely on Python 3.9's improved importlib.resources module, I cannot think of a better way:
    def zipapp(self, target: str, interpreter=None):
        """
        Bundle this playbook into a self-contained zipapp
        """
        import zipapp
        import jinja2
        import transilience
        if interpreter is None:
            interpreter = sys.executable
        if getattr(transilience.__loader__, "archive", None):
            # Recursively iterating module directories requires Python 3.9+
            raise NotImplementedError("Cannot currently create a zipapp from a zipapp")
        with tempfile.TemporaryDirectory() as workdir:
            # Copy transilience
            shutil.copytree(os.path.dirname(__file__), os.path.join(workdir, "transilience"))
            # Copy jinja2
            shutil.copytree(os.path.dirname(jinja2.__file__), os.path.join(workdir, "jinja2"))
            # Copy argv[0] as __main__.py, the entry point zipapp looks for
            shutil.copy(sys.argv[0], os.path.join(workdir, "__main__.py"))
            # Copy argv[0]/roles
            role_dir = os.path.join(os.path.dirname(sys.argv[0]), "roles")
            if os.path.isdir(role_dir):
                shutil.copytree(role_dir, os.path.join(workdir, "roles"))
            # Turn everything into a zipapp
            zipapp.create_archive(workdir, target, interpreter=interpreter, compressed=True)
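For readers who have not used the zipapp module before, here is a stripped-down sketch of the same mechanism outside of Transilience. The function name, file names and message are made up for illustration. A directory containing a __main__.py is turned into a single compressed archive that the Python interpreter can run directly.

```python
import os
import subprocess
import sys
import tempfile
import zipapp


def build_and_run(message: str) -> str:
    """Bundle a one-line program into a zipapp, run it, and return its output."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "src")
        os.mkdir(src)
        # zipapp uses __main__.py in the source directory as the entry point
        with open(os.path.join(src, "__main__.py"), "wt") as fd:
            fd.write(f"print({message!r})\n")
        target = os.path.join(workdir, "demo.pyz")
        zipapp.create_archive(src, target, interpreter=sys.executable, compressed=True)
        # Run the archive with the current interpreter, so this does not
        # depend on the executable bit or on the shebang line
        res = subprocess.run([sys.executable, target], stdout=subprocess.PIPE, text=True, check=True)
        return res.stdout.strip()


print(build_and_run("hello from a zipapp"))
```

The interpreter argument only controls the shebang line written at the start of the archive, which matters when the file is executed directly, as with ./test.pyz.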
Since the zipapp contains not just the playbook, the roles, and the roles' assets, but also Transilience and Jinja2, it can run on any system that has a Python 3.7+ interpreter, and nothing else! I added it to the standard set of playbook command line options, so any Transilience playbook can turn itself into a self-contained zipapp:
$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--local LOCAL]
                 [--ansible-to-python role | --ansible-to-ast role | --zipapp file.pyz]
  --zipapp file.pyz     bundle this playbook in a self-contained executable
                        python zipapp
Loading assets from the zipapp
I had to create ZipFile varieties of some bits of infrastructure in Transilience, to load templates, files, and Ansible yaml files from zip files. You can see above a way to detect if a module is loaded from a zipfile: check if the module's __loader__ attribute has an archive attribute. Here's a Jinja2 template loader that looks into a zip:
class ZipLoader(jinja2.BaseLoader):
    def __init__(self, archive: zipfile.ZipFile, root: str):
        self.zipfile = archive
        self.root = root
    def get_source(self, environment: jinja2.Environment, template: str):
        path = os.path.join(self.root, template)
        # Members of a zip archive are read as bytes, Jinja2 wants text
        with self.zipfile.open(path, "r") as fd:
            source = fd.read().decode()
        return source, None, lambda: True
I also created a FileAsset abstract interface to represent a local file, and had Role.lookup_file return an appropriate instance:
    def lookup_file(self, path: str) -> str:
        """
        Resolve a pathname inside the place where the role assets are stored.
        Returns a pathname to the file
        """
        if self.role_assets_zipfile is not None:
            return ZipFileAsset(self.role_assets_zipfile, os.path.join(self.role_assets_root, path))
        else:
            return LocalFileAsset(os.path.join(self.role_assets_root, path))
An interesting side effect of having smarter local file accessors is that I can cache the contents of small files and transmit them to the remote host together with the other action parameters, saving a potential network round trip for each builtin.copy action that has a small source.
The result
The result is kind of fun:
$ time ./provision --zipapp test.pyz
real    0m0.203s
user    0m0.174s
sys 0m0.029s
$ time scp test.pyz root@test:
test.pyz                                                                                                         100%  528KB 388.9KB/s   00:01
real    0m1.576s
user    0m0.010s
sys 0m0.007s
And on the remote:
# time ./test.pyz --local=test
2021-06-29 18:05:41,546 test: [connected 0.000s]
2021-06-29 18:12:31,555 test: 88 total actions in 0.00ms: 87 unchanged, 0 changed, 1 skipped, 0 failed, 0 not executed.
real    0m0.979s
user    0m0.783s
sys 0m0.172s
Compare with a Mitogen run:
$ time PYTHONPATH=../transilience/ ./provision
2021-06-29 18:13:44 test: [connected 0.427s]
2021-06-29 18:13:46 test: 88 total actions in 2.50s: 87 unchanged, 0 changed, 1 skipped, 0 failed, 0 not executed.
real    0m2.697s
user    0m0.856s
sys 0m0.042s
From a single test run, not a good benchmark, it's 0.203 + 1.576 + 0.979 = 2.758s with the zipapp and 2.697s with Mitogen. Even if I've been lucky, it's a similar order of magnitude.
What can I use this for?
This was mostly a fun hack. It could however be the basis for a Fabric-based connector, or a clusterssh-based connector, or for bundling a Transilience playbook into an installation image, or to add a provisioning script to the boot partition of a Raspberry Pi. It looks like an interesting trick to have up one's sleeve. One could even build an Ansible-based connector(!) in which a simple Ansible playbook, with no facts gathering, is used to build the zipapp, push it to remote systems and run it. That would be the wackiest way of speeding up Ansible, ever! Next: using Systemd containers with unittest, for Transilience's test suite.

26 June 2021

Enrico Zini: Ansible conditionals in Transilience

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. I thought a lot of what I managed to do so far with Transilience would be impossible, but then here I am. How about Ansible conditionals? Those must be impossible, right? Let's give it a try.
A quick recon of Ansible sources
Looking into Ansible's sources, when expressions are lists of strings AND-ed together. The expressions are Jinja2 expressions that Ansible pastes into a mini-template, renders, and checks the string that comes out.
A quick recon of Jinja2
Jinja2 has a convenient function (jinja2.Environment.compile_expression) that compiles a template snippet into a Python function. It can also parse a template into an AST that can be inspected in various ways.
Evaluating Ansible conditionals in Python
Environment.compile_expression seems to really do precisely what we need for this, straight out of the box. There is an issue with the concept of "defined": for Ansible it seems to mean "the variable is present in the template context". In Transilience instead, all variables are fields in the Role dataclass, and can be None when not set. This means that we need to remove variables that are set to None before passing the parameters to the compiled Jinja2 expression:
class Conditional:
    """
    An Ansible conditional expression
    """
    def __init__(self, engine: template.Engine, body: str):
        # Original unparsed expression
        self.body: str = body
        # Expression compiled to a callable
        self.expression: Callable = engine.env.compile_expression(body)
    def evaluate(self, ctx: Dict[str, Any]):
        # Variables set to None count as undefined for Ansible
        ctx = {name: val for name, val in ctx.items() if val is not None}
        return self.expression(**ctx)
Generating Python code
Transilience does not only support running Ansible roles, but also converting them to Python code. I can keep this up by traversing the Jinja2 AST generating Python expressions. The code is straightforward enough that I can throw in a bit of pattern matching to make some expressions more idiomatic for Python:
class Conditional:
    def __init__(self, engine: template.Engine, body: str):
        parser = jinja2.parser.Parser(engine.env, body, state='variable')
        self.jinja2_ast: nodes.Node = parser.parse_expression()

    def get_python_code(self) -> str:
        return to_python_code(self.jinja2_ast)
def to_python_code(node: nodes.Node) -> str:
    if isinstance(node, nodes.Name):
        if node.ctx == "load":
            return f"self.{node.name}"
        else:
            raise NotImplementedError(f"jinja2 Name nodes with ctx={node.ctx!r} are not supported: {node!r}")
    elif isinstance(node, nodes.Test):
        if node.name == "defined":
            return f"{to_python_code(node.node)} is not None"
        elif node.name == "undefined":
            return f"{to_python_code(node.node)} is None"
        else:
            raise NotImplementedError(f"jinja2 Test nodes with name={node.name!r} are not supported: {node!r}")
    elif isinstance(node, nodes.Not):
        if isinstance(node.node, nodes.Test):
            # Special case match well-known structures for more idiomatic Python
            if node.node.name == "defined":
                return f"{to_python_code(node.node.node)} is None"
            elif node.node.name == "undefined":
                return f"{to_python_code(node.node.node)} is not None"
        elif isinstance(node.node, nodes.Name):
            return f"not {to_python_code(node.node)}"
        return f"not ({to_python_code(node.node)})"
    elif isinstance(node, nodes.Or):
        return f"({to_python_code(node.left)} or {to_python_code(node.right)})"
    elif isinstance(node, nodes.And):
        return f"({to_python_code(node.left)} and {to_python_code(node.right)})"
    else:
        raise NotImplementedError(f"jinja2 {node.__class__} nodes are not supported: {node!r}")
Scanning for variables Lastly, I can implement scanning conditionals for variable references to add as fields to the Role dataclass:
class FindVars(jinja2.visitor.NodeVisitor):
    def __init__(self):
        self.found: Set[str] = set()

    def visit_Name(self, node):
        # Only collect names that are read by the expression
        if node.ctx == "load":
            self.found.add(node.name)

class Conditional:
    def list_role_vars(self) -> Sequence[str]:
        fv = FindVars()
        fv.visit(self.jinja2_ast)
        return fv.found
The result in action Take this simple Ansible task:
 - name: Example task
   file:
      state: touch
      path: /tmp/test
   when: (is_test is defined and is_test) or debug is defined
Run it through ./provision --ansible-to-python test and you get:
from __future__ import annotations
from typing import Any
from transilience import role
from transilience.actions import builtin, facts
class Role(role.Role):
    # Role variables used by templates
    debug: Any = None
    is_test: Any = None
    def all_facts_available(self):
        if ((self.is_test is not None and self.is_test)
                or self.debug is not None):
            self.add(
                builtin.file(path='/tmp/test', state='touch'),
                name='Example task')
Besides one harmless set of parentheses too many, what I wasn't sure would be possible is there, right there, staring at me with a mischievous grin. Next: Building a Transilience playbook in a zipapp.

25 June 2021

Enrico Zini: Parsing YAML

This is part of a series of posts on ideas for an Ansible-like provisioning system, implemented in Transilience. The time has come for me to try and prototype whether it's possible to load some Transilience roles from Ansible's YAML instead of Python. The data models of Transilience and Ansible are not exactly the same. Some of the differences that come to mind: To simplify the work, I'll start from loading a single role out of Ansible, not an entire playbook. TL;DR: scroll to the bottom of the post for the conclusion! Loading tasks The first problem of loading an Ansible task is to figure out which of the keys is the module name. I have so far failed to find precise reference documentation about what keywords are used to define a task, so I'm going by guesswork, and if needed a look at Ansible's sources. My first attempt goes by excluding all known non-module keywords:
        candidates = []
        for key in task_info.keys():
            # Skip the keys that are known not to be module names
            if key in ("name", "args", "notify"):
                continue
            candidates.append(key)
        if len(candidates) != 1:
            raise RoleNotLoadedError(f"could not find a known module in task {task_info!r}")
        modname = candidates[0]
        if modname.startswith("ansible.builtin."):
            name = modname[16:]
        else:
            name = modname
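As a self-contained illustration of the same guesswork, here is a minimal sketch of the lookup as a standalone function. The function name `find_module`, the `TASK_KEYWORDS` tuple and the use of `ValueError` are assumptions made for this example, not Transilience's actual API:

```python
from typing import Any, Dict, Tuple

# Keys treated as task keywords rather than module names.
# Assumption for illustration: this list is deliberately incomplete.
TASK_KEYWORDS = ("name", "args", "notify")


def find_module(task_info: Dict[str, Any]) -> Tuple[str, Any]:
    """Return (module name, module arguments) for a parsed task dict."""
    candidates = [key for key in task_info if key not in TASK_KEYWORDS]
    if len(candidates) != 1:
        raise ValueError(f"could not find a known module in task {task_info!r}")
    modname = candidates[0]
    prefix = "ansible.builtin."
    # Strip the fully qualified prefix, if present
    name = modname[len(prefix):] if modname.startswith(prefix) else modname
    return name, task_info[modname]


task = {
    "name": "touch a file",
    "ansible.builtin.file": {"path": "/tmp/test", "state": "touch"},
}
module, params = find_module(task)
```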
This means that Ansible keywords like when or with will break the parsing, and it's fine since they are not supported yet. args seems to carry arguments to the module, when the module main argument is not a dict, as may happen at least with the command module. Task parameters One can do all sorts of chaotic things to pass parameters to Ansible tasks: for example string lists can be lists of strings or strings with comma-separated lists, and they can be preprocessed via Jinja2 templating, and they can be complex data structures that might contain strings that need Jinja2 preprocessing. I ended up mapping the behaviours I encountered in an AST-like class hierarchy which includes recursive complex structures. Variables Variables look hard: Ansible has a big free messy cauldron of global variables, and Transilience needs a predefined list of per-role variables. However, variables are mainly used inside Jinja2 templates, and Jinja2 can parse to an Abstract Syntax Tree and has useful methods to examine its AST. Using that, I managed with reasonable effort to scan an Ansible role and generate a list of all the variables it uses! I can then use that list, filter out facts-specific names like ansible_domain, and use them to add variable definitions to the Transilience roles. That is exciting! Handlers Before loading tasks, I load handlers as one-action roles, and index them by name. When an Ansible task notifies a handler, I can then look up by name the roles I generated in the earlier pass, and I have all that I need. Parsed Abstract Syntax Tree Most of the results of all this parsing started looking like an AST, so I changed the rest of the prototype to generate an AST. This means that, for a well defined subset of Ansible's YAML, there exists now a tool that is able to parse it into an AST and reason with it. Transilience's playbooks gained a --ansible-to-ast option to parse an Ansible role and dump the resulting AST as JSON:
$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--ansible-to-python role]
                 [--ansible-to-ast role]
Provision my VPS
optional arguments:
  -C, --check           do not perform changes, but check if changes would be
  --ansible-to-ast role
                        print the AST of the given Ansible role as understood
                        by Transilience
The result is extremely verbose, since every parameter is itself a node in the tree, but I find it interesting. Here is, for example, a node for an Ansible task which has a templated parameter:
      "node": "task",
      "action": "builtin.blockinfile",
          "node": "parameter",
          "type": "scalar",
          "value": "/etc/aliases"
          "node": "parameter",
          "type": "template_string",
          "value": "root: {{postmaster}}\n{% for name, dest in aliases.items() %}\n{{name}}: {{dest}}\n{% endfor %}\n"
        "name": "configure /etc/aliases",
        "blockinfile": {},
        "notify": "reread /etc/aliases"
      "notify": [
Here's a node for an Ansible template task converted to Transilience's model:
      "node": "task",
      "action": "builtin.copy",
          "node": "parameter",
          "type": "scalar",
          "value": "/etc/dovecot/local.conf"
          "node": "parameter",
          "type": "template_path",
          "value": "dovecot.conf"
        "name": "configure dovecot",
        "template": {},
        "notify": "restart dovecot"
      "notify": [
Executing The first iteration of prototype code for executing parsed Ansible roles is a little exercise in closures and dynamically generated types:
    def get_role_class(self) -> Type[Role]:
        # If we have handlers, instantiate role classes for them
        handler_classes = {}
        for name, ansible_role in self.handlers.items():
            handler_classes[name] = ansible_role.get_role_class()

        # Create all the functions to start actions in the role
        start_funcs = []
        for task in self.tasks:
            start_funcs.append(task.get_start_func(handlers=handler_classes))

        # Function that calls all the 'Action start' functions
        def role_main(self):
            for func in start_funcs:
                func(self)

        if self.uses_facts:
            role_cls = type(self.name, (Role,), {
                "start": lambda host: None,
                "all_facts_available": role_main
            })
            role_cls = dataclass(role_cls)
            role_cls = with_facts(facts.Platform)(role_cls)
        else:
            role_cls = type(self.name, (Role,), {
                "start": role_main
            })
            role_cls = dataclass(role_cls)
        return role_cls
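The closure-plus-generated-type trick can be tried in isolation. The following is a minimal sketch that uses `dataclasses.make_dataclass`, a standard library shortcut for the `type()` plus `dataclass()` combination; the `Role` base class, `make_role_class` and the `Mail` example are stand-ins invented for this illustration, not Transilience code:

```python
from dataclasses import field, fields, make_dataclass
from typing import Any, Callable, List


class Role:
    """Stand-in base class for illustration; not Transilience's real Role."""


def make_role_class(name: str, variables: List[str],
                    start_funcs: List[Callable]) -> type:
    # Closure that runs every per-task start function, in order
    def role_main(self):
        for func in start_funcs:
            func(self)

    # Each role variable becomes a dataclass field defaulting to None
    return make_dataclass(
        name,
        [(var, Any, field(default=None)) for var in variables],
        bases=(Role,),
        namespace={"start": role_main},
    )


log: List[str] = []
Mail = make_role_class(
    "Mail",
    ["postmaster"],
    [lambda role: log.append(f"configure aliases for {role.postmaster}")],
)
Mail(postmaster="enrico").start()
```

The start functions are captured by the closure, so the generated class only needs to carry the variables as fields.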
Now that the parsed Ansible role is a proper AST, I'm considering redesigning that using a generic Role class that works as an AST interpreter. Generating Python I maintain a library that can turn an invoice into Python code, and I have a convenient AST. I can't not generate Python code out of an Ansible role!
$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--ansible-to-python role]
                 [--ansible-to-ast role]
Provision my VPS
optional arguments:
  --ansible-to-python role
                        print the given Ansible role as Transilience Python
  --ansible-to-ast role
                        print the AST of the given Ansible role as understood
                        by Transilience
And will you look at this annotated extract:
$ ./provision --ansible-to-python mailserver
from __future__ import annotations
from typing import Any
from transilience import role
from transilience.actions import builtin, facts
# Role classes generated from Ansible handlers!
class ReloadPostfix(role.Role):
    def start(self):
        self.add(
            builtin.systemd(unit='postfix', state='reloaded'),
            name='reload postfix')

class RestartDovecot(role.Role):
    def start(self):
        self.add(
            builtin.systemd(unit='dovecot', state='restarted'),
            name='restart dovecot')
# The role, including a standard set of facts
class Role(role.Role):
    # These are the variables used by Jinja2 template files and strings. I need
    # to use Any, since Ansible variables are not typed
    aliases: Any = None
    myhostname: Any = None
    postmaster: Any = None
    virtual_domains: Any = None
    def all_facts_available(self):
        # A Jinja2 string inside a string list!
                    'certbot', 'certonly', '-d',
                    self.render_string('mail.{{ansible_domain}}'), '-n',
                    '/etc/letsencrypt/live/mail.{{ansible_domain}}/fullchain.pem'
            name='obtain mail.* letsencrypt certificate')
        # A converted template task!
            name='configure dovecot',
            # Notify referring to the corresponding Role class!
        # Referencing a variable collected from a fact!
            builtin.copy(dest='/etc/mailname', content=self.ansible_domain),
            name='configure /etc/mailname',
Conclusion Transilience can load a (growing) subset of Ansible syntax, one role at a time, which contains: The role loader in Transilience now looks for YAML when it does not find a Python module, and runs it pipelined and fast! There is code to generate Python code from an Ansible module: you can take an Ansible role, convert it to Python, and then work on it to add more complex logic, or clean it up for adding it to a library of reusable roles! Next: Ansible conditionals

23 June 2021

Enrico Zini: Transilience check mode

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. I added check mode to Transilience, to do everything except perform changes, like Ansible does:
$ ./provision --help
usage: provision [-h] [-v] [--debug] [-C] [--to-python role]
Provision my VPS
optional arguments:
  -h, --help        show this help message and exit
  -v, --verbose     verbose output
  --debug           verbose output
  -C, --check       do not perform changes, but check if changes would be    NEW!
                    needed                                                   NEW!
It was quite straightforward to add a new field to the base Action class, and tweak the implementations not to perform changes if it is True:
# Shortcut function to annotate dataclass fields with documentation metadata
def doc(default: Any, doc: str, **kw):
    return field(default=default, metadata={"doc": doc})
class Action:
    check: bool = doc(False, "when True, check if the action would perform changes, but do nothing")
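To show what the metadata buys, here is a small runnable sketch that reads the documentation back with `dataclasses.fields()`. The `describe()` helper is hypothetical, added only for this example, and this version of `doc()` also forwards its extra keyword arguments to `field()`:

```python
from dataclasses import dataclass, field, fields
from typing import Any


def doc(default: Any, doc: str, **kw):
    # Attach a documentation string to a dataclass field via its metadata
    return field(default=default, metadata={"doc": doc}, **kw)


@dataclass
class Action:
    check: bool = doc(False, "when True, check if the action would perform changes, but do nothing")


def describe(cls) -> str:
    # Hypothetical helper: one "name: documentation" line per field
    return "\n".join(f"{f.name}: {f.metadata.get('doc', '')}" for f in fields(cls))


description = describe(Action)
```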
Like with Ansible, check mode takes about the same time as a normal run which does not perform changes. Unlike Ansible, with Transilience this is actually pretty fast! ;) Next step: parsing YAML!

18 June 2021

Enrico Zini: Playbooks, host vars, group vars

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. Host variables Ansible allows specifying per-host variables, and I like that. Let's try to model a host as a dataclass:
@dataclass
class Host:
    """
    A host to be provisioned.
    """
    name: str
    type: str = "Mitogen"
    args: Dict[str, Any] = field(default_factory=dict)

    def _make_system(self) -> System:
        cls = getattr(transilience.system, self.type)
        return cls(self.name, **self.args)
This should have enough information to create a connection to the host, and can be subclassed to add host-specific dataclass fields. Host variables can then be provided as default constructor arguments when instantiating Roles:
    # Add host/group variables to role constructor args
    host_fields = {f.name: f for f in fields(host)}
    for field in fields(role_cls):
        if field.name in host_fields:
            role_kwargs.setdefault(field.name, getattr(host, field.name))
    role = role_cls(**role_kwargs)
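Here is the same idea as a self-contained sketch that can be run on its own. The `Host` and `Role` classes below are simplified stand-ins, and `make_role` and the example values are assumptions made for illustration:

```python
from dataclasses import dataclass, fields


@dataclass
class Host:
    name: str
    site_admin: str = "admin@localhost"  # hypothetical host variable


@dataclass
class Role:
    site_admin: str = ""
    port: int = 80


def make_role(role_cls, host, **role_kwargs):
    # Host fields act as defaults for role fields with the same name;
    # explicitly passed arguments always win.
    host_fields = {f.name for f in fields(host)}
    for f in fields(role_cls):
        if f.name in host_fields:
            role_kwargs.setdefault(f.name, getattr(host, f.name))
    return role_cls(**role_kwargs)


host = Host(name="server")
from_host = make_role(Role, host)
overridden = make_role(Role, host, site_admin="someone@localhost")
```

Using `setdefault` is what keeps host variables as fallbacks rather than overrides.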
Group variables It looks like I can model groups and group variables by using dataclasses as mixins:
class Webserver:
    server_name: str = ""
class Srv1(Webserver):
    ...
Doing things like filtering all hosts that are members of a given group can be done with a simple isinstance or issubclass test. Playbooks So far Transilience is executing on one host at a time, and Ansible can execute on a whole host inventory. Since most of running a playbook is I/O bound, we can parallelize hosts using threads, without worrying too much about the performance impact of the GIL. Let's introduce a Playbook class as the main entry point for a playbook:
class Playbook:
    def setup_logging(self):
        ...

    def make_argparser(self):
        description = inspect.getdoc(self)
        if not description:
            description = "Provision systems"
        parser = argparse.ArgumentParser(description=description)
        parser.add_argument("-v", "--verbose", action="store_true",
                            help="verbose output")
        parser.add_argument("--debug", action="store_true",
                            help="verbose output")
        return parser

    def hosts(self) -> Sequence[Host]:
        """
        Generate a sequence with all the systems on which the playbook needs to run
        """
        return ()

    def start(self, runner: Runner):
        """
        Start the playbook on the given runner.

        This method is called once for each system returned by hosts()
        """
        raise NotImplementedError(f"{self.__class__.__name__}.start is not implemented")

    def main(self):
        parser = self.make_argparser()
        self.args = parser.parse_args()
        self.setup_logging()

        # Start all the runners in separate threads
        threads = []
        for host in self.hosts():
            runner = Runner(host)
            self.start(runner)
            t = threading.Thread(target=runner.main)
            threads.append(t)
            t.start()

        # Wait for all threads to complete
        for t in threads:
            t.join()
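The thread-per-host pattern is easy to try in isolation. This is a minimal sketch in which `Runner` is a stand-in that only records which host it ran on, and `run_all` is a name invented for the example:

```python
import threading
from typing import List


class Runner:
    """Stand-in runner: a real one would talk to the host over the network."""

    def __init__(self, host: str, results: List[str]):
        self.host = host
        self.results = results

    def main(self):
        # list.append is thread-safe in CPython, so no lock is used here
        self.results.append(f"provisioned {self.host}")


def run_all(hosts: List[str]) -> List[str]:
    results: List[str] = []
    threads = []
    # Start one runner thread per host
    for host in hosts:
        t = threading.Thread(target=Runner(host, results).main)
        threads.append(t)
        t.start()
    # Wait for all threads to complete
    for t in threads:
        t.join()
    # Completion order is not deterministic, so sort for a stable result
    return sorted(results)


done = run_all(["srv1", "srv2", "srv3"])
```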
And an actual playbook will now look something like this:
from dataclasses import dataclass
import sys
from transilience import Playbook, Host
@dataclass
class MyServer(Host):
    srv_root: str = "/srv"
    site_admin: str = ""

class VPS(Playbook):
    """
    Provision my VPS
    """
    def hosts(self):
        yield MyServer(name="server", args={
            "method": "ssh",
            "hostname": "",
            "username": "root",
        })
    def start(self, runner):
                aliases= ... )
if __name__ == "__main__":
It looks quite straightforward to me, works on any number of hosts, and has a proper command line interface:
./provision  --help
usage: provision [-h] [-v] [--debug]
Provision my VPS
optional arguments:
  -h, --help     show this help message and exit
  -v, --verbose  verbose output
  --debug        verbose output
Next step: check mode!

17 June 2021

Enrico Zini: Reimagining Ansible variables

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. While experimenting with Transilience, I've been giving some thought about Ansible variables. My gripes I like the possibility to define host and group variables, and I like to have a set of variables that are autodiscovered on the target systems. I do not like to have everything all blended in a big bucket of global variables. Let's try some more prototyping. My fiddlings First, Role classes could become dataclasses, too, and declare the variables and facts that they intend to use (typed, even!):
class Role(role.Role):
    """
    Postfix mail server configuration
    """
    # Postmaster username
    postmaster: str = None
    # Public name of the mail server
    myhostname: str = None
    # Email aliases defined on this mail server
    aliases: Dict[str, str] = field(default_factory=dict)
Using dataclasses.asdict() I immediately gain context variables for rendering Jinja2 templates:
class Role:
    # [...]
    def render_file(self, path: str, **kwargs):
        """
        Render a Jinja2 template from a file, using as context all Role fields,
        plus the given kwargs.
        """
        ctx = asdict(self)
        ctx.update(kwargs)
        return self.template_engine.render_file(path, ctx)

    def render_string(self, template: str, **kwargs):
        """
        Render a Jinja2 template from a string, using as context all Role fields,
        plus the given kwargs.
        """
        ctx = asdict(self)
        ctx.update(kwargs)
        return self.template_engine.render_string(template, ctx)
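A runnable sketch of the same pattern follows. To keep it free of dependencies it swaps Jinja2 for the standard library's `string.Template`, so the `$name` placeholders are an artifact of that substitution; the field defaults are made up for the example:

```python
from dataclasses import asdict, dataclass, field
from string import Template
from typing import Dict


@dataclass
class Role:
    postmaster: str = "root"
    myhostname: str = "localhost"
    aliases: Dict[str, str] = field(default_factory=dict)

    def render_string(self, template: str, **kwargs) -> str:
        # All dataclass fields become template variables...
        ctx = asdict(self)
        # ...and explicit keyword arguments extend or override them
        ctx.update(kwargs)
        # string.Template stands in for Jinja2 here
        return Template(template).substitute(ctx)


role = Role(postmaster="enrico")
address = role.render_string("$postmaster@$myhostname")
greeting = role.render_string("$hello $postmaster", hello="hi")
```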
I can also model results from fact gathering into dataclass members:
# From ansible/module_utils/facts/system/
class Platform(Facts):
    """
    Facts from the platform module
    """
    ansible_system: Optional[str] = None
    ansible_kernel: Optional[str] = None
    ansible_kernel_version: Optional[str] = None
    ansible_machine: Optional[str] = None
    # [...]
    ansible_userspace_architecture: Optional[str] = None
    ansible_machine_id: Optional[str] = None

    def summary(self):
        return "gather platform facts"

    def run(self, system: transilience.system.System):
        # ... collect facts
        ...
I like that this way, one can explicitly declare what variables a Facts action will collect, and what variables a Role needs. At this point, I can add machinery to allow a Role to declare what Facts it needs, and automatically have the fields from the Facts class added to the Role class. Then, when facts are gathered, I can make sure that their fields get copied over to the Role classes that use them. In a way, variables become role-scoped, and Facts subclasses can be used like some kind of Role mixin, that contributes only field members:
# Postfix mail server configuration
class Role(role.Role):
    # Postmaster username
    postmaster: str = None
    # Public name of the mail server
    myhostname: str = None
    # Email aliases defined on this mail server
    aliases: Dict[str, str] = field(default_factory=dict)
    # All fields from actions.facts.Platform are inherited here!
    def have_facts(self, facts):
        # self.ansible_domain comes from actions.facts.Platform
        self.add(builtin.command(
            argv=["certbot", "certonly", "-d", f"mail.{self.ansible_domain}", "-n", "--apache"],
            creates=f"/etc/letsencrypt/live/mail.{self.ansible_domain}/fullchain.pem"
        ), name="obtain mail.* certificate")
        # the template context will have the Role variables, plus the variables
        # of all the Facts the Role uses
        with self.notify(ReloadPostfix):
            ), name="configure /etc/postfix/")
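The idea of a Facts class contributing its fields to a Role can be sketched with plain dataclass inheritance. In this illustration the `with_facts` decorator is a hypothetical reimplementation, `Platform` is cut down to two fields, and the way `have_facts` copies values over is an assumption about the mechanism, not Transilience's actual code:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Platform:
    """Cut-down stand-in for a Facts class."""
    ansible_system: Optional[str] = None
    ansible_domain: Optional[str] = None


def with_facts(facts_cls):
    """Hypothetical decorator: mix the fields of facts_cls into a role class."""
    def decorate(role_cls):
        # Build a subclass inheriting from both, then regenerate __init__
        combined = type(role_cls.__name__, (role_cls, facts_cls), {})
        return dataclass(combined)
    return decorate


@with_facts(Platform)
@dataclass
class Role:
    postmaster: str = "root"

    def have_facts(self, facts: Platform):
        # Copy gathered facts into the role's own, role-scoped fields
        for f in fields(facts):
            setattr(self, f.name, getattr(facts, f.name))


role = Role(postmaster="enrico")
role.have_facts(Platform(ansible_system="Linux", ansible_domain="example.org"))
```

After the copy, `role.ansible_domain` is available next to `role.postmaster`, which is what lets templates and f-strings treat facts like any other role variable.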
One can also fill in variables when instantiating Roles, making parameterized generic Roles possible and easy:
                "me": "enrico",
Outcomes I like where this is going: having well defined variables for facts and roles, means that the variables that get into play can be explicitly defined, well known, and documented. I think this design lends itself quite well to role reuse: I have a feeling that, this way, it may be much easier to create generic libraries of Roles that one can reuse to compose complex playbooks. Since roles are just Python modules, we even already know how to package and distribute them! Next step: Playbooks, host vars, group vars.

14 June 2021

Enrico Zini: Pipelining

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. Running actions on a server is nice, but a network round trip for each action is not very efficient. If I need to run a linear sequence of actions, I can stream them all to the server, and then read replies streamed from the server as they get executed. This technique is called pipelining and one can see it used, for example, in Redis, or Mitogen. Roles Ansible has the concept of "Roles" as a series of related tasks: I'll play with that. Here's an example role to install and setup fail2ban:
class Role(role.Role):
    def main(self):
                enabled = true
                enabled = true
        ), name="configure fail2ban")
I prototyped roles as classes, with methods that push actions down the pipeline. If an action fails, all further actions for the same role won't be executed, and will be marked as skipped. Since skipping is applied per-role, it means that I can blissfully stream actions for multiple roles to the server down the same pipe, and errors in one role will stop executing that role and not others. Potentially I can get multiple roles going with a single network round-trip:
import sys
from transilience.system import Mitogen
from transilience.runner import Runner
def main():
    system = Mitogen("my server", "ssh", hostname="", username="root")
    runner = Runner(system)
    # Send roles to the server
    # Run until all roles are done
if __name__ == "__main__":
That looks like a playbook, using Python as glue rather than YAML. Decision making in roles Besides filing a series of actions, a role may need to take decisions based on the results of previous actions, or on facts discovered from the server. In that case, we need to wait until the results we need come back from the server, and then decide if we're done or if we want to send more actions down the pipe. Here's an example role that installs and configures Prosody:
from transilience import actions, role
from transilience.actions import builtin
from .handlers import RestartProsody
class Role(role.Role):
    """
    Set up prosody XMPP server
    """
    def main(self):
        self.add(actions.facts.Platform(), then=self.have_facts)
            name=["certbot", "python-certbot-apache"],
        ), name="install support packages")
            name=["prosody", "prosody-modules", "lua-sec", "lua-event", "lua-dbi-sqlite3"],
        ), name="install prosody packages")
    def have_facts(self, facts):
        facts = facts.facts  # Malkovich Malkovich Malkovich!
        domain = facts["domain"]
        ctx = {
            "ansible_domain": domain
        }
        self.add(builtin.command(
            argv=["certbot", "certonly", "-d", f"chat.{domain}", "-n", "--apache"],
            creates=f"/etc/letsencrypt/live/chat.{domain}/fullchain.pem"
        ), name="obtain chat certificate")
        with self.notify(RestartProsody):
                content=self.template_engine.render_file("roles/prosody/templates/prosody.cfg.lua", ctx),
            ), name="write prosody configuration")
            ), name="write prosody firewall")
    # ...
This files some general actions down the pipe, with a hook that says: when the results of this action come back, run self.have_facts(). At that point, the role can use the results to build certbot command lines, render prosody's configuration from Jinja2 templates, and use the results to file further actions down the pipe. Note that this way, while the server is potentially still busy installing prosody, we're already streaming prosody's configuration to it. If anything goes wrong with the installation of prosody's package, the role will be marked as failed and all further actions of the same role, even those filed by have_facts() will be skipped. Notify and handlers In the previous example self.notify() also appears: that's my attempt to model the equivalent of Ansible's handlers. If any of the actions inside the with produce changes, then the RestartProsody role will be executed, potentially filing more actions at the end of the playbook. The runner will take care of collecting all the triggered role classes in a set, which discards duplicates, and then running the main() method of all resulting roles, which will cause more actions to be filed down the pipe. Action conditions Sometimes some actions are only meaningful as consequences of other actions. Let's take, for example, enabling buster-backports as an extra apt source:
        a = self.add(builtin.copy(
            content="deb [arch=amd64] buster-backports main contrib",
        ), name="enable backports")
        ), name="update after enabling backports",
           # Run only if the previous copy changed anything
           when={a: ResultState.CHANGED},
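How the receiving end might honour such a condition can be sketched in a few lines. This is a toy model under stated assumptions: actions are keyed by name rather than by object, the SKIPPED state and the `run_pipeline` function are invented for the example, and nothing here reflects how Transilience actually implements the check:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class ResultState(Enum):
    NOOP = "noop"
    CHANGED = "changed"
    SKIPPED = "skipped"  # assumed name for the "skipped" state


@dataclass
class Action:
    name: str
    run: Callable[[], ResultState]
    # Maps the name of an earlier action to the state it must have reached
    when: Dict[str, ResultState] = field(default_factory=dict)


def run_pipeline(actions: List[Action]) -> Dict[str, ResultState]:
    states: Dict[str, ResultState] = {}
    for action in actions:
        # Run only if every referenced earlier action ended in the wanted state
        if all(states.get(prev) == wanted for prev, wanted in action.when.items()):
            states[action.name] = action.run()
        else:
            states[action.name] = ResultState.SKIPPED
    return states


def pipeline(copy_result: ResultState) -> Dict[str, ResultState]:
    return run_pipeline([
        Action("enable backports", lambda: copy_result),
        Action("update after enabling backports", lambda: ResultState.CHANGED,
               when={"enable backports": ResultState.CHANGED}),
    ])


unchanged = pipeline(ResultState.NOOP)
changed = pipeline(ResultState.CHANGED)
```

Because the condition travels with the action, the decision is taken where the action runs, without waiting for a round trip to the controlling side.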
Here we want to update Apt's cache, which is a slow operation, only after we actually write /etc/apt/sources.list.d/debian-buster-backports.list. If the file was already there from a previous run, we can skip downloading the new package lists. The when= attribute adds an annotation to the action that is sent down the pipeline, that says that it should only be run if the state of a previous action matches the given one. In this case, when on the remote it's the turn of "update after enabling backports", it gets skipped unless the state of the previous "enable backports" action is CHANGED. Effects of pipelining I ported enough of Ansible's modules to be able to run the provisioning scripts of my VPS entirely via Transilience. This is the playbook run as plain Ansible:
$ time ansible-playbook vps.yaml
servername       : ok=55   changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
real    2m10.072s
user    0m33.149s
sys 0m10.379s
This is the same playbook run with Ansible sped up via the Mitogen backend, which makes Ansible more bearable:
$ export ANSIBLE_STRATEGY=mitogen_linear
$ time ansible-playbook vps.yaml
servername       : ok=55   changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
real    0m24.428s
user    0m8.479s
sys 0m1.894s
This is the same playbook ported to Transilience:
$ time ./provision
real    0m2.585s
user    0m0.659s
sys 0m0.034s
Doing nothing went from 2 minutes down to 3 seconds! That's the kind of running time that finally makes me comfortable with maintaining my VPS by editing the playbook only, and never logging in to mess with the system configuration by hand! Next steps I'm quite happy with what I have: I can now maintain my VPS with a simple script with quick iterative cycles. I might use it to develop new playbooks, and port them to ansible only when they're tested and need to be shared with infrastructure that needs to rely on something more solid and battle tested than a prototype provisioning system. I might also keep working on it as I have more interesting ideas that I'd like to try. I feel like Ansible reached some architectural limits that are hard to overcome without a major redesign, and are in many ways hardcoded in its playbook configuration. It's nice to be able to try out new designs without that baggage. I'd love it if even just the library of Transilience actions could grow, and gain widespread use. Ansible modules standardized a set of management operations, that I think became the way people think about system management, and should really be broadly available outside of Ansible. If you are interested in playing with Transilience, such as: do get in touch or send a pull request! :) Next step: Reimagining Ansible variables.

Enrico Zini: Use ansible actions in a script

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. I like many of the modules provided with Ansible: they are convenient, platform-independent implementations of common provisioning steps. They'd be fantastic to have in a library that I could use in normal programs. This doesn't look easy to do with Ansible code as it is. Also, the code quality of various Ansible modules doesn't fit something I'd want in a standard library of cross-platform provisioning functions. Modeling Actions I want to keep the declarative, idempotent aspect of describing actions on a system. A good place to start could be a hierarchy of dataclasses that hold the same parameters as ansible modules, plus a run() method that performs the action:
@dataclass
class Action:
    """
    Base class for all action implementations.

    An Action is the equivalent of an ansible module: a declarative
    representation of an idempotent operation on a system.

    An Action can be run immediately, or serialized, sent to a remote system,
    run, and sent back with its results.
    """
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    result: Result = field(default_factory=Result)

    def summary(self):
        """
        Return a short text description of this action
        """
        return self.__class__.__name__

    def run(self, system: transilience.system.System):
        """
        Perform the action
        """
        self.result.state = ResultState.NOOP
I like that Ansible tasks have names, and I hate having to give names to trivial tasks like "Create directory /foo/bar", so I added a summary() method so that trivial tasks like that can take care of naming themselves. Dataclasses allow introspecting fields and annotating them with extra metadata, and together with docstrings, I can make actions reasonably self-documenting. I ported some of Ansible's modules over: see complete list in the git repository. Running Actions in a script With a bit of glue code I can now run Ansible-style functions from a plain Python script:
from transilience.runner import Script
script = Script()
for i in range(10):
    script.builtin.file(state="touch", path=f"/tmp/test{i}")
Running Actions remotely Dataclasses have an asdict function that makes them trivially serializable. If their members stick to data types that can be serialized with Mitogen and the run implementation doesn't use non-pure, non-stdlib Python modules, then I can trivially run actions on all sorts of remote systems using Mitogen:
from transilience.runner import Script
from transilience.system import Mitogen
script = Script(system=Mitogen("my server", "ssh", hostname="", username="user"))
for i in range(10):
    script.builtin.file(state="touch", path=f"/tmp/test{i}")
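The serialization round trip that makes this possible can be sketched with the standard library alone. In this illustration JSON stands in for Mitogen's own transport, and the `File` class, `serialize`, `deserialize` and the `registry` lookup are all assumptions made for the example:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class File:
    """Stand-in for an action; not Transilience's real builtin.file."""
    path: str
    state: str = "file"


def serialize(action) -> str:
    # The class name tells the receiving side which action to rebuild
    return json.dumps({"action": type(action).__name__, "args": asdict(action)})


def deserialize(payload: str, registry):
    data = json.loads(payload)
    return registry[data["action"]](**data["args"])


registry = {"File": File}
original = File(path="/tmp/test0", state="touch")
payload = serialize(original)
rebuilt = deserialize(payload, registry)
```

Restricting fields to plain data types is what keeps this step trivial: anything `asdict` produces here survives the trip unchanged.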
How fast would that be, compared to Ansible?
$ time ansible-playbook test.yaml
real    0m15.232s
user    0m4.033s
sys 0m1.336s
$ time ./test_script
real    0m4.934s
user    0m0.547s
sys 0m0.049s
With a network round-trip for each single operation I'm already 3x faster than Ansible, and it can run on nspawn containers, too! I always wanted to have a library of ansible modules useable in normal scripts, and I've always been angry with Ansible for not bundling their backend code in a generic library. Well, now there's the beginning of one! Sweet! Next step, pipelining.

Enrico Zini: My gripes with Ansible

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience. Musing about Ansible I like infrastructure as code. I like to be able to represent an entire system as text files in a git repository, and to be able to use that to recreate the system, from my Virtual Private Server, to my print server and my stereo, to build machines, to other kinds of systems I might end up setting up. I like that the provisioning work I do on a machine can be self-documenting and replicable at will. The good For that I quite like Ansible, in principle: simple (in theory) YAML files describe a system in (reasonably) high-level steps, and it can be run on (almost) any machine that happens to have a simple Python interpreter installed. I also like many of the modules provided with Ansible: they are convenient, platform-independent implementations of common provisioning steps. They'd be fantastic to have in a library that I could use in normal programs. The bad Unfortunately, Ansible is slow. Running the playbook on my VPS takes about 3 whole minutes even if I'm just changing a line in a configuration file. This means that most of the time, instead of changing that line in the playbook and running it, to then figure out after 3 minutes that it was the wrong line, or I made a spelling mistake in the playbook, I end up logging into the server and editing in place. That defeats the whole purpose, but that level of latency between iterations is just unacceptable to me. The ugly I also think that Ansible has outgrown its original design, and the supposedly declarative, idempotent YAML has become a full declarative scripting language in disguise, whose syntax is extremely awkward and verbose. If I'm writing declarative descriptions, YAML is great. If I'm writing loops and conditionals, I want to write code, not templated YAML. I also keep struggling trying to use Ansible to provision chroots and nspawn containers. 
A personal experiment: Transilience
There's another thing I like in Ansible: it's written in Python, which is a language I'm comfortable with. Compared to other platforms, it's one that I'm more likely to be able to control beyond being a simple user. What if I can port Ansible modules into a library of high-level provisioning functions, that I can just run via normal Python scripts? What if I can find a way to execute those scripts remotely and not just locally? I've started writing some prototype code, and the biggest problem is, of course, finding a name. Ansible comes from Ursula K. Le Guin's Hainish Cycle novels, where it is a device that allows its users to communicate near-instantaneously over interstellar distances. Traveling, however, is still constrained by the speed of light. Later in the same universe, the novels A Fisherman of the Inland Sea and The Shobies' Story talk about experiments with instantaneous interstellar travel, as a science Ursula Le Guin called transilience:
Transilience: n. A leap across or from one thing to another [1913 Webster]
Transilience. I like everything about this name. Now that the hardest problem is solved, the rest is just a simple matter of implementation details.

9 June 2021

Enrico Zini: Ansible recurse and follow quirks

I'm reading Ansible's builtin.file sources for, uhm, reasons, and the use of follow stood out to my eyes. Reading on, not only that. I feel like the ansible codebase needs a serious review, at least in essential core modules like this one. In the file module documentation it says:
This flag indicates that filesystem links, if they exist, should be followed.
In the recursive_set_attributes implementation instead, follow means "follow symlinks to directories", but if a symlink to a file is found, it does not get followed, kind of. What happens is that ansible will try to change the mode of the symlink, which makes sense on some operating systems. And it does try to use lchmod if present. But if not, this happens:
# Attempt to set the perms of the symlink but be
# careful not to change the perms of the underlying
# file while trying
underlying_stat = os.stat(b_path)
os.chmod(b_path, mode)
new_underlying_stat = os.stat(b_path)
if underlying_stat.st_mode != new_underlying_stat.st_mode:
    os.chmod(b_path, stat.S_IMODE(underlying_stat.st_mode))
So it tries doing chmod on the symlink, and if that changed the mode of the actual file, it switches it back. I would have appreciated a comment documenting on which systems a hack like this makes sense. As it is, it opens a very short time window in which a symlink attack can make a system file vulnerable, and an exception thrown by the second stat will make it vulnerable permanently. What about follow following links during recursion: how does it avoid loops? I don't see a cache of (device, inode) pairs visited. Let's try:
fatal: [localhost]: FAILED! => {"changed": false, "details": "maximum recursion depth exceeded", "gid": 1000, "group": "enrico", "mode": "0755", "msg": "mode must be in octal or symbolic form", "owner": "enrico", "path": "/tmp/test/test1", "size": 0, "state": "directory", "uid": 1000}
Ok, it, uhm, delegates handling that to the Python stack size. I guess it means that a ln -s .. foo in a directory that gets recursed will always fail the task. Fun!
More quirks
Turning a symlink into a hardlink is considered a noop if the symlink points to the same file:
- hosts: localhost
  tasks:
   - name: create test file
     file:
        path: /tmp/testfile
        state: touch
   - name: create test link
     file:
        path: /tmp/testlink
        state: link
        src: /tmp/testfile
   - name: turn it into a hard link
     file:
        path: /tmp/testlink
        state: hard
        src: /tmp/testfile
$ ansible-playbook test3.yaml
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
PLAY [localhost] ************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ******************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [create test file] *****************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [create test link] *****************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [turn it into a hard link] *********************************************************************************************************************************************************************************************
ok: [localhost]
PLAY RECAP ******************************************************************************************************************************************************************************************************************
localhost                  : ok=4    changed=2    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
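Coming back to the recursion loop seen earlier: the cache of (device, inode) pairs could look roughly like the sketch below. This is not Ansible's code; walk_following_links is a hypothetical helper, and the scratch directory only reproduces the ln -s .. foo case.

```python
import os
import tempfile


def walk_following_links(top):
    """Yield every directory under top, following symlinks to directories,
    but never entering the same (device, inode) pair twice."""
    seen = set()
    stack = [top]
    while stack:
        path = stack.pop()
        st = os.stat(path)  # follows symlinks, so a loop resolves to a known inode
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # already visited through another path: skip it
        seen.add(key)
        yield path
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)


# Reproduce the "ln -s .. foo" case in a scratch directory
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    os.symlink("..", os.path.join(root, "sub", "foo"))
    visited = list(walk_following_links(root))
```

With the set in place, the symlink back to the parent is noticed on its first reappearance, and the walk ends after visiting the two real directories instead of running into the interpreter's recursion limit.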
More quirks
Converting a directory into a hardlink should work, but it doesn't because unlink is used instead of rmdir:
- hosts: localhost
  tasks:
   - name: create test dir
     file:
        path: /tmp/testdir
        state: directory
   - name: turn it into a symlink
     file:
        path: /tmp/testdir
        state: hard
        src: /tmp/
        force: yes
$ ansible-playbook test4.yaml
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
PLAY [localhost] ************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ******************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [create test dir] ******************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [turn it into a symlink] ***********************************************************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "gid": 1000, "group": "enrico", "mode": "0755", "msg": "Error while replacing: [Errno 21] Is a directory: b'/tmp/testdir'", "owner": "enrico", "path": "/tmp/testdir", "size": 0, "state": "directory", "uid": 1000}
PLAY RECAP ******************************************************************************************************************************************************************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0
More quirks
This is hard to test, but it looks like if source and destination have the same inode number, but live on different filesystems, the operation is considered a successful noop. It should probably be something like:
if (st1.st_dev, st1.st_ino) == (st2.st_dev, st2.st_ino):
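A small sketch of that comparison in a runnable form. The helper name same_file and the scratch files are made up for illustration; the standard library's os.path.samefile performs the same device and inode check.

```python
import os
import tempfile


def same_file(path1, path2):
    """True only if both paths resolve to the same inode on the same device."""
    st1 = os.stat(path1)
    st2 = os.stat(path2)
    return (st1.st_dev, st1.st_ino) == (st2.st_dev, st2.st_ino)


with tempfile.TemporaryDirectory() as root:
    a = os.path.join(root, "a")
    b = os.path.join(root, "b")
    c = os.path.join(root, "c")
    with open(a, "w"), open(c, "w"):
        pass  # create two empty, unrelated files
    os.link(a, b)  # b is a hard link to a
    linked = same_file(a, b)
    unrelated = same_file(a, c)
    stdlib_agrees = os.path.samefile(a, b)
```

Comparing st_ino alone is not enough, because inode numbers are only unique within one filesystem; including st_dev closes that gap.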

8 June 2021

Enrico Zini: Mock syscalls with C++

I wrote and maintain some C++ code to stream high quantities of data as fast as possible, and I try to use splice and sendfile when available. The availability of those system calls varies at runtime according to a number of factors, and the code needs to be written to fall back to read/write loops depending on what the splice and sendfile syscalls say. The tricky issue is unit testing: since the code path chosen depends on the kernel, the test suite will test one path or the other depending on the machine and filesystems where the tests are run. It would be nice to be able to mock the syscalls, and replace them during tests, and it looks like I managed. First I made catalogues of the syscalls I want to be able to mock. One with function pointers, for performance, and one with std::function, for flexibility:
/**
 * Linux versions of syscalls to use for concrete implementations.
 */
struct ConcreteLinuxBackend
{
    static ssize_t (*read)(int fd, void *buf, size_t count);
    static ssize_t (*write)(int fd, const void *buf, size_t count);
    static ssize_t (*writev)(int fd, const struct iovec *iov, int iovcnt);
    static ssize_t (*sendfile)(int out_fd, int in_fd, off_t *offset, size_t count);
    static ssize_t (*splice)(int fd_in, loff_t *off_in, int fd_out,
                             loff_t *off_out, size_t len, unsigned int flags);
    static int (*poll)(struct pollfd *fds, nfds_t nfds, int timeout);
    static ssize_t (*pread)(int fd, void *buf, size_t count, off_t offset);
};

/**
 * Mockable versions of syscalls to use for testing concrete implementations.
 */
struct ConcreteTestingBackend
{
    static std::function<ssize_t(int fd, void *buf, size_t count)> read;
    static std::function<ssize_t(int fd, const void *buf, size_t count)> write;
    static std::function<ssize_t(int fd, const struct iovec *iov, int iovcnt)> writev;
    static std::function<ssize_t(int out_fd, int in_fd, off_t *offset, size_t count)> sendfile;
    static std::function<ssize_t(int fd_in, loff_t *off_in, int fd_out,
                                 loff_t *off_out, size_t len, unsigned int flags)> splice;
    static std::function<int(struct pollfd *fds, nfds_t nfds, int timeout)> poll;
    static std::function<ssize_t(int fd, void *buf, size_t count, off_t offset)> pread;
    static void reset();
};
Then I converted the code to templates, parameterized on the catalogue class. Explicit template instantiation helps in making sure that one doesn't need to include template code in all sorts of places. Finally, I can have a RAII class for mocking:
/**
 * RAII mocking of syscalls for concrete stream implementations
 */
struct MockConcreteSyscalls
{
    std::function<ssize_t(int fd, void *buf, size_t count)> orig_read;
    std::function<ssize_t(int fd, const void *buf, size_t count)> orig_write;
    std::function<ssize_t(int fd, const struct iovec *iov, int iovcnt)> orig_writev;
    std::function<ssize_t(int out_fd, int in_fd, off_t *offset, size_t count)> orig_sendfile;
    std::function<ssize_t(int fd_in, loff_t *off_in, int fd_out,
                          loff_t *off_out, size_t len, unsigned int flags)> orig_splice;
    std::function<int(struct pollfd *fds, nfds_t nfds, int timeout)> orig_poll;
    std::function<ssize_t(int fd, void *buf, size_t count, off_t offset)> orig_pread;
    // Save the current catalogue entries on construction...
    MockConcreteSyscalls()
        : orig_read(ConcreteTestingBackend::read),
          orig_write(ConcreteTestingBackend::write),
          orig_writev(ConcreteTestingBackend::writev),
          orig_sendfile(ConcreteTestingBackend::sendfile),
          orig_splice(ConcreteTestingBackend::splice),
          orig_poll(ConcreteTestingBackend::poll),
          orig_pread(ConcreteTestingBackend::pread)
    {
    }
    // ...and restore them when the mock goes out of scope
    ~MockConcreteSyscalls()
    {
        ConcreteTestingBackend::read = orig_read;
        ConcreteTestingBackend::write = orig_write;
        ConcreteTestingBackend::writev = orig_writev;
        ConcreteTestingBackend::sendfile = orig_sendfile;
        ConcreteTestingBackend::splice = orig_splice;
        ConcreteTestingBackend::poll = orig_poll;
        ConcreteTestingBackend::pread = orig_pread;
    }
};
And here's the specialization to pretend sendfile and splice aren't available:
/**
 * Mock sendfile and splice as if they weren't available on this system
 */
struct DisableSendfileSplice : public MockConcreteSyscalls
{
    DisableSendfileSplice()
    {
        ConcreteTestingBackend::sendfile = [](int out_fd, int in_fd, off_t *offset, size_t count) -> ssize_t {
            errno = EINVAL;
            return -1;
        };
        ConcreteTestingBackend::splice = [](int fd_in, loff_t *off_in, int fd_out,
                                            loff_t *off_out, size_t len, unsigned int flags) -> ssize_t {
            errno = EINVAL;
            return -1;
        };
    }
};
It's now also possible to reproduce in the test suite all sorts of system-related issues we might observe in production over time.

5 June 2021

Enrico Zini: Ansible blockinfile oddity

I was reading Ansible's blockinfile sources for, uhm, reasons, and the code flow looked a bit odd. So I checked what happens if a file has spurious block markers. Given this file:
$ cat /tmp/test.orig
And this playbook:
$ cat test.yaml
- hosts: localhost
  tasks:
   - name: test blockinfile
     blockinfile:
        block: NEWLINE
        path: /tmp/test
You get this result:
$ cat /tmp/test
I was hoping that I was reading the code incorrectly, but it turns out that Ansible's blockinfile matches the last pair of begin-end markers it finds, in whatever order it finds them.
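To make that behaviour concrete, here is a toy model of the matching logic as described in the previous paragraph. It is a sketch, not Ansible's code: replace_block and the sample lines are hypothetical, and only the default marker strings are taken from blockinfile.

```python
BEGIN = "# BEGIN ANSIBLE MANAGED BLOCK"
END = "# END ANSIBLE MANAGED BLOCK"


def replace_block(lines, block):
    """Toy model: remember the LAST begin and the LAST end marker seen,
    without checking that begin comes before end."""
    begin = end = None
    for idx, line in enumerate(lines):
        if line == BEGIN:
            begin = idx
        elif line == END:
            end = idx
    new_block = [BEGIN, *block, END]
    if begin is None or end is None:
        return lines + new_block  # no complete pair: append at the end
    first, last = min(begin, end), max(begin, end)
    return lines[:first] + new_block + lines[last + 1:]


# Hypothetical input with the markers in the wrong order
original = ["line0", END, "line1", BEGIN, "line2"]
result = replace_block(original, ["NEWLINE"])
```

In this model, everything between the reversed markers, line1 included, is replaced by the new block. A stricter implementation would only accept an end marker that follows a begin marker.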

21 April 2021

Enrico Zini: Python output buffering

Here's a little toy program that displays a message like a split-flap display:
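import sys
import time

def display(line: str):
    cur = '0' * len(line)
    while True:
        print(cur, end="\r")
        if cur == line:
            break
        # Pause between frames (the exact delay here is a guess)
        time.sleep(0.1)
        cur = "".join(chr(min(ord(c) + 1, ord(oc))) for c, oc in zip(cur, line))
    print()  # move past the animated line once it is complete

message = " ".join(sys.argv[1:])
display(message)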
This only works if the script's stdout is unbuffered. Pipe the output through cat, and you get a long wait, and then the final string, without the animation. What is happening is that since the output is not going to a terminal, optimizations kick in that buffer the output and send it in bigger chunks, to make processing bulk I/O more efficient. I haven't found a good introductory explanation of buffering in Python's documentation. The details seem to be scattered in the io module documentation and they mostly assume that one is already familiar with concepts like unbuffered, line-buffered or block-buffered. The libc documentation has a good quick introduction that one can read to get up to speed.
Controlling buffering in Python
In Python, one can force a buffer flush with the flush() method of the output file object, like sys.stdout.flush(), to make sure pending buffered output gets sent. Python's print() function also supports flush=True as an optional argument:
    print(cur, end="\r", flush=True)
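To see which situation a script is in, one can ask the stream itself. A minimal sketch, where describe is a made-up helper; isatty() and line_buffering are standard attributes of Python's text streams.

```python
import sys


def describe(stream):
    """Summarise how a text stream is set up."""
    return "tty={} line_buffering={}".format(
        stream.isatty(), getattr(stream, "line_buffering", None)
    )


report = describe(sys.stdout)
# Report on stderr, so it stays visible when stdout is piped elsewhere
print(report, file=sys.stderr)
```

Running it once directly and once piped through cat shows the two configurations side by side.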
If one wants to change the default buffering for a stream, since Python 3.7 there's a convenient reconfigure() method, which can switch line buffering (and write-through) on or off, but cannot change the buffer size:
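A minimal sketch of that call. The first part applies it to the real stdout; the second part repeats it on a throwaway in-memory stream (raw and stream are illustration-only names) to show the effect without needing a pipe.

```python
import io
import sys

# On the real stdout this is a one-liner (Python 3.7+). The hasattr guard is
# only there because stdout may have been replaced by an object without it.
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(line_buffering=True)

# The same call on a throwaway stream, to observe what reaches the lower layer
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
stream.reconfigure(line_buffering=True)
stream.write("hello\n")  # with line buffering on, the newline triggers a flush
arrived = raw.getvalue()
```

After the write, the bytes have already been handed to the underlying binary stream, with no explicit flush() needed.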
Otherwise, the technique is to reassign sys.stdout to something that has the behaviour one wants (code from this StackOverflow thread):
import io
# Python 3, open as binary, then wrap in a TextIOWrapper with write-through.
sys.stdout = io.TextIOWrapper(open(sys.stdout.fileno(), 'wb', 0), write_through=True)
If one needs all this to implement a progressbar, one should make sure to have a look at the progressbar module first.