
5 February 2022

Reproducible Builds: Reproducible Builds in January 2022

Welcome to the January 2022 report from the Reproducible Builds project. In our reports, we try to outline the most important things that have been happening in the past month. As ever, if you are interested in contributing to the project, please visit our Contribute page on our website.
An interesting blog post was published by Paragon Initiative Enterprises about Gossamer, a proposal for securing the PHP software supply chain. Utilising code-signing and third-party attestations, Gossamer aims to mitigate the risks within the notorious PHP world by publishing attestations to a transparency log. Their post, titled Solving Open Source Supply Chain Security for the PHP Ecosystem, goes into some detail regarding the design, scope and implementation of the system.
This month, the Linux Foundation announced SupplyChainSecurityCon, a conference focused on exploring the security threats affecting the software supply chain, sharing best practices and mitigation tactics. The conference is part of the Linux Foundation's Open Source Summit North America and will take place June 21st–24th 2022, both virtually and in Austin, Texas.

Debian There was significant progress made in the Debian Linux distribution this month, including:

Other distributions kpcyrd reported on Twitter about the release of version 0.2.0 of pacman-bintrans, an experiment with binary transparency for the Arch Linux package manager, pacman. This new version is now able to query rebuilderd to check if a package was independently reproduced.
In the world of openSUSE, however, Bernhard M. Wiedemann posted his monthly reproducible builds status report.

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 199, 200, 201 and 202 to Debian unstable (that were later backported to Debian bullseye-backports by Mattia Rizzolo), as well as made the following changes to the code itself:
  • New features:
    • First attempt at incremental output support with a timeout. Now passing, for example, --timeout=60 will mean that diffoscope will not recurse into any sub-archives after 60 seconds total execution time has elapsed. Note that this is not a fixed/strict timeout due to implementation issues. [ ][ ]
    • Support both variants of odt2txt, including the one provided by the unoconv package. [ ]
  • Bug fixes:
    • Do not return with a UNIX exit code of 0 if we encounter a file whose human-readable metadata matches the literal file contents. [ ]
    • Don't fail if comparing a nonexistent file with a .pyc file (and add test). [ ][ ]
    • If the debian.deb822 module raises any exception on import, re-raise it as an ImportError. This should fix diffoscope on some Fedora systems. [ ]
    • Even if a Sphinx .inv inventory file is labelled "The remainder of this file is compressed using zlib", it might not actually be. In this case, don't traceback and simply return the original content. [ ]
  • Documentation:
    • Improve documentation for the new --timeout option due to a few misconceptions. [ ]
    • Drop reference in the manual page claiming the ability to compare non-existent files on the command-line. (This has not been possible since version 32 which was released in September 2015). [ ]
    • Update "X has been modified after NT_GNU_BUILD_ID has been applied" messages to, for example, not duplicate the full filename in the diffoscope output. [ ]
  • Codebase improvements:
    • Tidy some control flow. [ ]
    • Correct a recompile typo. [ ]
In addition, Alyssa Ross fixed the comparison of CBFS names that contain spaces [ ], Sergei Trofimovich fixed whitespace for compatibility with version 21.12 of the Black source code reformatter [ ] and Zbigniew Jędrzejewski-Szmek fixed JSON detection with a new version of file [ ].
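As a side note, the new --timeout option described above is easy to try out; a minimal invocation (the package filenames here are made up for illustration) would be:
diffoscope --timeout=60 hello_1.0-1_amd64.deb hello_1.0-2_amd64.deb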

Testing framework The Reproducible Builds project runs a significant testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:
  • Frédéric Pierret (fepitre):
    • Add Debian bookworm to package set creation. [ ]
  • Holger Levsen:
    • Install the po4a package where appropriate, as it is needed for the Reproducible Builds website job [ ]. In addition, run the i18n.sh and contributors.sh scripts [ ].
    • Correct some grammar in Debian live image build output. [ ]
    • Shell monitor improvements:
      • Only show the offline node section if there are offline nodes. [ ]
      • Colorise offline nodes. [ ]
      • Shrink screen usage. [ ][ ][ ]
    • Node health check improvements:
      • Detect if live package builds encounter incomplete snapshots. [ ][ ][ ]
      • Detect if a host is running with today's date (when it should be set artificially in the future). [ ]
    • Use the devscripts package from bullseye-backports on Debian nodes. [ ]
    • Use the Munin monitoring package from bullseye-backports on Debian nodes too. [ ]
    • Update New Year handling, needed to be able to detect real and fake dates. [ ][ ]
    • Improve the error message of the script that powercycles the arm64 architecture nodes hosted by Codethink. [ ]
  • Mattia Rizzolo:
    • Use the new --timeout option added in diffoscope version 202. [ ]
  • Roland Clobus:
    • Update the build scripts now that the hooks for live builds are maintained upstream in the live-build repository. [ ]
    • Show info lines in Jenkins when reproducible hooks have been active. [ ]
    • Use unique folders for the artifacts from each live Debian version. [ ]
  • Vagrant Cascadian:
    • Switch the Debian armhf architecture nodes to use a new proxy. [ ]
    • Misc. node maintenance. [ ].

Upstream patches The Reproducible Builds project attempts to fix as many currently-unreproducible packages as possible. In January, we wrote a large number of such patches, including:

And finally If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. You can also get in touch with us via:

21 November 2021

Antoine Beaupré: mbsync vs OfflineIMAP

After recovering from my latest email crash (previously, previously), I had to figure out which tool I should be using. I had many options, but I figured I would start with a popular one (mbsync). I also evaluated OfflineIMAP, which was resurrected from the Python 2 apocalypse and which I had used before, for a long time. Read on for the details.

Benchmark setup All programs were tested against a Dovecot 1:2.3.13+dfsg1-2 server, running Debian bullseye. The client is a Purism 13v4 laptop with a Samsung SSD 970 EVO 1TB NVMe drive. The server is a custom build with an AMD Ryzen 5 2600 CPU, and a RAID-1 array made of two NVMe drives (Intel SSDPEKNW010T8 and WDC WDS100T2B0C). The mail spool I am testing against has almost 400k messages and takes 13GB of disk space:
$ notmuch count --exclude=false
372758
$ du -sh --exclude xapian Maildir
13G Maildir
The baseline we are comparing against is SMD (syncmaildir) which performs the sync in about 7-8 seconds locally (3.5 seconds for each push/pull command) and about 10-12 seconds remotely. Anything close to that or better is good enough. I do not have recent numbers for a SMD full sync baseline, but the setup documentation mentions 20 minutes for a full sync. That was a few years ago, and the spool has obviously grown since then, so that is not a reliable baseline. A baseline for a full sync might also be set with rsync, which copies files at nearly 40MB/s, or 317Mb/s!
anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
 12,647,814,731 100%   37.85MB/s    0:05:18 (xfr#394981, to-chk=0/395815)    
72.38user 106.10system 5:19.59elapsed 55%CPU (0avgtext+0avgdata 15988maxresident)k
8816inputs+26305112outputs (0major+50953minor)pagefaults 0swaps
That is 5 minutes to transfer the entire spool. Incremental syncs are obviously pretty fast too:
anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/395815)    
1.42user 0.81system 0:03.31elapsed 67%CPU (0avgtext+0avgdata 14100maxresident)k
120inputs+0outputs (3major+12709minor)pagefaults 0swaps
As an extra curiosity, here's the performance with tar, pretty similar to rsync, minus the incremental mode which I cannot be bothered to figure out right now:
anarcat@angela:tmp(main)$ time ssh shell.anarc.at tar --exclude xapian -cf - Maildir/ | pv -s 13G | tar xf -
56.68user 58.86system 5:17.08elapsed 36%CPU (0avgtext+0avgdata 8764maxresident)k
0inputs+0outputs (0major+7266minor)pagefaults 0swaps
12,1GiO 0:05:17 [39,0MiB/s] [===================================================================> ] 92%
Interesting that rsync manages to almost beat a plain tar on file transfer, I'm actually surprised by how well it performs here, considering there are many little files to transfer. (But then again, this maybe is exactly where rsync shines: while tar needs to glue all those little files together, rsync can just directly talk to the other side and tell it to do live changes. Something to look at in another article maybe?) Since both ends are NVMe drives, those should easily saturate a gigabit link. And in fact, a backup of the server mail spool achieves much faster transfer rate on disks:
anarcat@marcos:~$ tar fc - Maildir | pv -s 13G > Maildir.tar
15,0GiO 0:01:57 [ 131MiB/s] [===================================] 115%
That's 131MiB per second, vastly faster than the gigabit link. The client has similar performance:
anarcat@angela:~(main)$ tar fc - Maildir | pv -s 17G > Maildir.tar
16,2GiO 0:02:22 [ 116MiB/s] [==================================] 95%
So those disks should be able to saturate a gigabit link, and they are not the bottleneck on fast links. Which begs the question of what is blocking performance of a similar transfer over the gigabit link, but that's another question altogether, because no sync program ever reaches the above performance anyways. Finally, note that when I migrated to SMD, I wrote a small performance comparison that could be interesting here. It shows SMD to be faster than OfflineIMAP, but not as much as we see here. In fact, it looks like OfflineIMAP slowed down significantly since then (May 2018), but this could be due to my larger mail spool as well.

mbsync The isync (AKA mbsync) project is written in C and supports syncing Maildir and IMAP folders, with possibly multiple replicas. I haven't tested this but I suspect it might be possible to sync between two IMAP servers as well. It supports partial mirrors, message flags, full folder support, and "trash" functionality.

Complex configuration file I started with this .mbsyncrc configuration file:
SyncState *
Sync New ReNew Flags
IMAPAccount anarcat
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPStore anarcat-remote
Account anarcat
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir-mbsync/
Channel anarcat
# AKA Far, convert when all clients are 1.4+
Master :anarcat-remote:
# AKA Near
Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
Patterns *
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both
Long gone are the days where I would spend a long time reading a manual page to figure out the meaning of every option. If that's your thing, you might like this one. But I'm more of a "EXAMPLES section" kind of person now, and I somehow couldn't find a sample file on the website. I started from the Arch wiki one but it's actually not great because it's made for Gmail (which is not a usual Dovecot server). So a sample config file in the manpage would be a great addition. Thankfully, the Debian package ships one in /usr/share/doc/isync/examples/mbsyncrc.sample but I only found that after I wrote my configuration. It was still useful and I recommend people take a look if they want to understand the syntax. Also, that syntax is a little overly complicated. For example, Far needs colons, like:
Far :anarcat-remote:
Why? That seems just too complicated. I also found that sections are not clearly identified: IMAPAccount and Channel mark section beginnings, for example, which is not at all obvious until you learn about mbsync's internals. There are also weird ordering issues: the SyncState option needs to be before IMAPAccount, presumably because it's global. Using a more standard format like .INI or TOML could improve that situation.

Stellar performance A transfer of the entire mail spool takes 56 minutes and 6 seconds, which is impressive. It's not quite "line rate": the resulting mail spool was 12GB (which is a problem, see below), which turns out to be about 29Mbit/s and therefore not maxing the gigabit link, and an order of magnitude slower than rsync. The incremental runs are roughly 2 seconds, which is even more impressive, as that's actually faster than rsync:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.015       0.052       1.930       2.029       2.105       
user        0.660       0.040       0.592       0.661       0.722       
sys         0.338       0.033       0.268       0.341       0.387    
Those tests were performed with isync 1.3.0-2.2 on Debian bullseye. Tests with a newer isync release originally failed because of a corrupted message that triggered bug 999804 (see below). Running 1.4.3 under valgrind works around the bug, but adds a 50% performance cost, the full sync running in 1h35m. Once the upstream patch is applied, performance with 1.4.3 is fairly similar, considering that the new sync included the register folder with 4000 messages:
120.74user 213.19system 59:47.69elapsed 9%CPU (0avgtext+0avgdata 105420maxresident)k
29128inputs+28284376outputs (0major+45711minor)pagefaults 0swaps
That is ~13GB in ~60 minutes, which gives us 28.3Mbps. Incrementals are also pretty similar to 1.3.x, again considering the double-connect cost:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.500       0.087       2.340       2.491       2.629       
user        0.718       0.037       0.679       0.711       0.793       
sys         0.322       0.024       0.284       0.320       0.365
Those tests were all done on a Gigabit link, but what happens on a slower link? My server uplink is slow: 25 Mbps down, 6 Mbps up. There, mbsync is worse than the SMD baseline:
===> multitime results
1: mbsync -a
Mean        Std.Dev.    Min         Median      Max
real        31.531      0.724       30.764      31.271      33.100      
user        1.858       0.125       1.721       1.818       2.131       
sys         0.610       0.063       0.506       0.600       0.695       
That's 30 seconds for a sync, which is an order of magnitude slower than SMD.
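As an aside, the timing tables in this article come from the multitime tool; a typical invocation, assuming (say) eight runs, is simply:
multitime -n 8 mbsync -a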

Great user interface Compared to OfflineIMAP and (ahem) SMD, the mbsync UI is kind of neat:
anarcat@angela:~(main)$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 1/2  B: 204/205  F: +0/0 *0/0 #0/0  N: +1/200 *0/0 #0/0
(Note that nice switch away from slavery-related terms too.) The display is minimal, and yet informative. It's not obvious what it all means at first glance, but the manpage is useful at least for clarifying that:
This represents the cumulative progress over channels, boxes, and messages affected on the far and near side, respectively. The message counts represent added messages, messages with updated flags, and trashed messages, respectively. No attempt is made to calculate the totals in advance, so they grow over time as more information is gathered. (Emphasis mine).
In other words:
  • C 2/2: channels done/total (2 done out of 2)
  • B 204/205: mailboxes done/total (204 out of 205)
  • F: changes on the far side
  • N: +10/200 *0/0 #0/0: changes on the "near" side:
    • +10/200: 10 out of 200 messages downloaded
    • *0/0: no flag changed
    • #0/0: no message deleted
You get used to it, in a good way. It does not, unfortunately, show up when you run it in systemd, which is a bit annoying as I like to see a summary of mail traffic in the logs.

Interoperability issue In my notmuch setup, I have bound key S to "mark spam", which basically assigns the tag spam to the message and removes a bunch of others. Then I have a notmuch-purge script which moves that message to the spam folder, for training purposes. It basically does this:
notmuch search --output=files --format=text0 "$search_spam" \
      | xargs -r -0 mv -t "$HOME/Maildir/${PREFIX}junk/cur/"
This method, which worked fine in SMD (and also OfflineIMAP), created this error on sync:
Maildir error: duplicate UID 37578.
And indeed, there are now two messages with that UID in the mailbox:
anarcat@angela:~(main)$ find Maildir/.junk/ -name '*U=37578*'
Maildir/.junk/cur/1637427889.134334_2.angela,U=37578:2,S
Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S
This is actually a known limitation or, as mbsync(1) calls it, a "RECOMMENDATION":
When using the more efficient default UID mapping scheme, it is important that the MUA renames files when moving them between Maildir folders. Mutt always does that, while mu4e needs to be configured to do it:
(setq mu4e-change-filenames-when-moving t)
So it seems I would need to fix my script. It's unclear how the paths should be renamed, which is unfortunate, because I would need to change my script to adapt to mbsync, but I can't tell how just from reading the above. (A manual fix is actually to rename the file to remove the U= field: mbsync will generate a new one and then sync correctly.) Fortunately, someone else already fixed that issue: afew, a notmuch tagging script (much puns, such hurt), has a move mode that can rename files correctly, specifically designed to deal with mbsync. I had already been told about afew, but it's one more reason to standardize my notmuch hooks on that project, it looks like. Update: I have tried to use afew and found it has significant performance issues. It also has a completely different paradigm to what I am used to: it assumes all incoming mail has a new tag and lays its own tags on top of that (inbox, sent, etc). It can only move files from one folder at a time (see this bug), which breaks my spam training workflow. In general, I sync my tags into folders (e.g. ham, spam, sent) and message flags (e.g. inbox is F, unread is "not S", etc), and afew is not well suited for this (although there are hacks that try to fix this). I have worked hard to make my tagging scripts idempotent, and it's something afew doesn't currently have. Still, it would be better to have that code in Python than bash, so maybe I should consider my options here.
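For the record, the manual fix mentioned above can be scripted; this is a quick sketch of my own (not something mbsync or afew ship), assuming the usual name,U=12345:2,S Maildir layout:
# strip the ,U=NNNN part from affected filenames so that mbsync assigns
# a fresh UID on the next sync; -n avoids clobbering an existing file
for f in Maildir/.junk/cur/*,U=*; do
    [ -e "$f" ] || continue   # the glob did not match anything
    mv -n "$f" "$(echo "$f" | sed 's/,U=[0-9]*//')"
done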

Stability issues The newer release in Debian bookworm (currently at 1.4.3) has stability issues on full sync. I filed bug 999804 in Debian about this, which led to a thread on the upstream mailing list. I have found at least three distinct crashes that could be double-free bugs "which might be exploitable in the worst case", not a reassuring prospect. The thing is: mbsync is really fast, but the downside of that is that it's written in C, and with that comes a whole set of security issues. The Debian security tracker has only three CVEs on isync, but the above issues show there could be many more. Reading the source code certainly did not make me very comfortable with trusting it with untrusted data. I considered sandboxing it with systemd (below) but having systemd run as a --user process makes that difficult. I also considered using an apparmor profile but that is not trivial because we need to allow SSH and only some parts of it... Thankfully, upstream has been diligent at addressing the issues I have found. They provided a patch within a few days which did fix the sync issues. Update: upstream actually took the issue very seriously. They not only got CVE-2021-44143 assigned for my bug report, they also audited the code and found several more issues collectively identified as CVE-2021-3657, which actually also affect 1.3 (i.e. Debian 11/bullseye/stable). Somehow my corpus doesn't trigger that issue, but it was still considered serious enough to warrant a CVE. So on the one hand: excellent response from upstream; but on the other hand: how many more of those could there be in there?

Automation with systemd The Arch wiki has instructions on how to set up mbsync as a systemd service. It suggests using the --verbose (-V) flag which is a little intense here, as it outputs 1444 lines of messages. I have used the following .service file:
[Unit]
Description=Mailbox synchronization service
ConditionHost=!marcos
Wants=network-online.target
After=network-online.target
Before=notmuch-new.service
[Service]
Type=oneshot
ExecStart=/usr/bin/mbsync -a
Nice=10
IOSchedulingClass=idle
NoNewPrivileges=true
[Install]
WantedBy=default.target
And the following .timer:
[Unit]
Description=Mailbox synchronization timer
ConditionHost=!marcos
[Timer]
OnBootSec=2m
OnUnitActiveSec=5m
Unit=mbsync.service
[Install]
WantedBy=timers.target
Note that we trigger notmuch through systemd, with the Before= directive above and also by adding mbsync.service to the notmuch-new.service file:
[Unit]
Description=notmuch new
After=mbsync.service
[Service]
Type=oneshot
Nice=10
ExecStart=/usr/bin/notmuch new
[Install]
WantedBy=mbsync.service
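With those three unit files in ~/.config/systemd/user/, enabling the whole thing is the usual dance (shown here mostly for completeness):
systemctl --user daemon-reload
systemctl --user enable --now mbsync.timer
# confirm the timer is scheduled
systemctl --user list-timers mbsync.timer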
An improvement over polling repeatedly with a .timer would be to wake up only on IMAP notify, but neither imapnotify nor goimapnotify seem to be packaged in Debian. It would also not cover the "sent folder" use case, where we need to wake up on local changes.

Password-less setup The sample file suggests this should work:
IMAPStore remote
Tunnel "ssh -q host.remote.com /usr/sbin/imapd"
Add BatchMode, restrict to IdentitiesOnly, provide a password-less key just for this, add compression (-C), find the Dovecot imap binary, and you get this:
IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"
And it actually seems to work:
$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 0/2  B: 0/1  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0imap(anarcat): Error: net_connect_unix(/run/dovecot/stats-writer) failed: Permission denied
C: 2/2  B: 205/205  F: +0/0 *0/0 #0/0  N: +1/1 *3/3 #0/0imap(anarcat)<1611280><90uUOuyElmEQlhgAFjQyWQ>: Info: Logged out in=10808 out=15396642 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=8087
It's a bit noisy, however. dovecot/imap doesn't have a "usage" to speak of, but even the source code doesn't hint at a way to disable that Error message, so that's unfortunate. That socket is owned by root:dovecot so presumably Dovecot runs the imap process as $user:dovecot, which we can't do here. Oh well? Interestingly, the SSH setup is not faster than IMAP. With IMAP:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.367       0.065       2.220       2.376       2.458       
user        0.793       0.047       0.731       0.776       0.871       
sys         0.426       0.040       0.364       0.434       0.476
With SSH:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.515       0.088       2.274       2.532       2.594       
user        0.753       0.043       0.645       0.766       0.804       
sys         0.328       0.045       0.212       0.340       0.393
Basically: 200ms slower. Tolerable.

Migrating from SMD The above was how I migrated to mbsync on my first workstation. The work on the second one was more streamlined, especially since the corruption on mailboxes was fixed:
  1. install isync, with the patch:
    dpkg -i isync_1.4.3-1.1~_amd64.deb
    
  2. copy all files over from previous workstation to avoid a full resync (optional):
    rsync -a --info=progress2 angela:Maildir/ Maildir-mbsync/
    
  3. rename all files to match new hostname (optional):
    find Maildir-mbsync/ -type f -name '*.angela,*' -print0 | rename -0 's/\.angela,/\.curie,/'
    
  4. trash the notmuch database (optional):
    rm -rf Maildir-mbsync/.notmuch/xapian/
    
  5. disable all smd and notmuch services:
    systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
    
  6. do one last sync with smd:
    smd-pull --show-tags ; smd-push --show-tags ; notmuch new ; notmuch-sync-flagged -v
    
  7. backup notmuch on the client and server:
    notmuch dump | pv > notmuch.dump
    
  8. backup the maildir on the client and server:
    cp -al Maildir Maildir-bak
    
  9. create the SSH key:
    ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
    cat .ssh/id_ed25519_mbsync.pub
    
  10. add to .ssh/authorized_keys on the server, like this: command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...
  11. move old files aside, if present:
    mv Maildir Maildir-smd
    
  12. move new files in place (CRITICAL SECTION BEGINS!):
    mv Maildir-mbsync Maildir
    
  13. run a test sync, only pulling changes: mbsync --create-near --remove-none --expunge-none --noop anarcat-register
  14. if that works well, try with all mailboxes: mbsync --create-near --remove-none --expunge-none --noop -a
  15. if that works well, try again with a full sync: mbsync register, then mbsync -a
  16. reindex and restore the notmuch database, this should take ~25 minutes:
    notmuch new
    pv notmuch.dump | notmuch restore
    
  17. enable the systemd services and retire the smd-* services:
    systemctl --user enable mbsync.timer notmuch-new.service
    systemctl --user start mbsync.timer
    rm ~/.config/systemd/user/smd*
    systemctl daemon-reload
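One extra precaution, not part of the procedure above but cheap to run right before the test sync in step 13: compare raw (non-hidden) message counts between the old spool and the new one, to catch any gross mismatch early:
find Maildir-smd/ -type f ! -name '.*' | wc -l
find Maildir/ -type f ! -name '.*' | wc -l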
During the migration, notmuch helpfully told me the full list of those lost messages:
[...]
Warning: cannot apply tags to missing message: CAN6gO7_QgCaiDFvpG3AXHi6fW12qaN286+2a7ERQ2CQtzjSEPw@mail.gmail.com
Warning: cannot apply tags to missing message: CAPTU9Wmp0yAmaxO+qo8CegzRQZhCP853TWQ_Ne-YF94MDUZ+Dw@mail.gmail.com
Warning: cannot apply tags to missing message: F5086003-2917-4659-B7D2-66C62FCD4128@gmail.com
[...]
Warning: cannot apply tags to missing message: mailman.2.1316793601.53477.sage-members@mailman.sage.org
Warning: cannot apply tags to missing message: mailman.7.1317646801.26891.outages-discussion@outages.org
Warning: cannot apply tags to missing message: notmuch-sha1-000458df6e48d4857187a000d643ac971deeef47
Warning: cannot apply tags to missing message: notmuch-sha1-0079d8e0c3340e6f88c66f4c49fca758ea71d06d
Warning: cannot apply tags to missing message: notmuch-sha1-0194baa4cfb6d39bc9e4d8c049adaccaa777467d
Warning: cannot apply tags to missing message: notmuch-sha1-02aede494fc3f9e9f060cfd7c044d6d724ad287c
Warning: cannot apply tags to missing message: notmuch-sha1-06606c625d3b3445420e737afd9a245ae66e5562
Warning: cannot apply tags to missing message: notmuch-sha1-0747b020f7551415b9bf5059c58e0a637ba53b13
[...]
As detailed in the crash report, all of those were actually innocuous and could be ignored. Also note that we completely trash the notmuch database because it's actually faster to reindex from scratch than let notmuch slowly figure out that all mails are new and all the old mails are gone. The fresh indexing took:
nov 19 15:08:54 angela notmuch[2521117]: Processed 384679 total files in 23m 41s (270 files/sec.).
nov 19 15:08:54 angela notmuch[2521117]: Added 372610 new messages to the database.
A reindex on top of an existing database, by comparison, ran about half as fast, at around 120 files/sec.

Current config file Putting it all together, I ended up with the following configuration file:
SyncState *
Sync All
# IMAP side, AKA "Far"
IMAPAccount anarcat-imap
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"
IMAPStore anarcat-remote
Account anarcat-tunnel
# Maildir side, AKA "Near"
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir/
# what binds Maildir and IMAP
Channel anarcat
Far :anarcat-remote:
Near :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
#Patterns *
Patterns * !register  !.register
# Automatically create missing mailboxes, both locally and on the server
Create Both
#Create Near
# Sync the movement of messages between folders and deletions, add after making sure the sync works
Expunge Both
# Propagate mailbox deletion
Remove both
IMAPAccount anarcat-register-imap
Host imap.anarc.at
User register
PassCmd "pass imap.anarc.at-register"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPAccount anarcat-register-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C register@imap.anarc.at /usr/lib/dovecot/imap"
IMAPStore anarcat-register-remote
Account anarcat-register-tunnel
MaildirStore anarcat-register-local
SubFolders Maildir++
Inbox ~/Maildir/.register/
Channel anarcat-register
Far :anarcat-register-remote:
Near :anarcat-register-local:
Create Both
Expunge Both
Remove both
Note that it may be out of sync with my live (and private) configuration file, as I do not publish my "dotfiles" repository publicly for security reasons.
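A quick way to sanity-check a configuration like this one without touching any mail is to just list the mailboxes each channel resolves to (the -l flag, if I read the manpage correctly):
mbsync -l anarcat
mbsync -l anarcat-register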

OfflineIMAP I've used OfflineIMAP for a long time before switching to SMD. I don't exactly remember why or when I started using it, but I do remember it became painfully slow as I started using notmuch, and would sometimes crash mysteriously. It's been a while, so my memory is hazy on that. It also kind of died in a fire when Python 2 stopped being maintained. The main author moved on to a different project, imapfw, which could serve as a framework to build IMAP clients, but never seemed to implement all of the OfflineIMAP features and certainly not configuration file compatibility. Thankfully, a new team of volunteers ported OfflineIMAP to Python 3 and we can now test that new version to see if it is an improvement over mbsync.

Crash on full sync The first thing that happened on a full sync is this crash:
Copy message from RemoteAnarcat:junk:
 ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Thread 'Copy message from RemoteAnarcat:junk' terminated with exception:
Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/imaputil.py", line 406, in utf7m_decode
    for c in binary.decode():
AttributeError: 'memoryview' object has no attribute 'decode'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/threadutil.py", line 146, in run
    Thread.run(self)
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Last 1 debug messages logged for Copy message from RemoteAnarcat:junk prior to exception:
thread: Register new thread 'Copy message from RemoteAnarcat:junk' (account 'Anarcat')
ERROR: Exceptions occurred during the run!
ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
Folder junk [acc: Anarcat]:
 Copy message UID 30626 (29008/49310) RemoteAnarcat:junk -> LocalAnarcat:junk
Command exited with non-zero status 100
5252.91user 535.86system 3:21:00elapsed 47%CPU (0avgtext+0avgdata 846304maxresident)k
96344inputs+26563792outputs (1189major+2155815minor)pagefaults 0swaps
That only transferred about 8GB of mail, which gives us a transfer rate of 5.3Mbit/s, more than 5 times slower than mbsync. This bug is possibly limited to the bullseye version of offlineimap3 (the lovely 0.0~git20210225.1e7ef9e+dfsg-4), while the current sid version (the equally gorgeous 0.0~git20211018.e64c254+dfsg-1) seems unaffected.

Tolerable performance The new release still crashes, except it does so at the very end, which is an improvement, since the mails do get transferred:
 *** Finished account 'Anarcat' in 511:12
ERROR: Exceptions occurred during the run!
ERROR: Exception parsing message with ID (<20190619152034.BFB8810E07A@marcos.anarc.at>) from imaplib (response type: bytes).
 AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(
ERROR: Exception parsing message with ID (<40A270DB.9090609@alternatives.ca>) from imaplib (response type: bytes).
 AttributeError: decoding with 'x-mac-roman' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(
ERROR: IMAP server 'RemoteAnarcat' does not have a message with UID '32686'
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 889, in _fetch_from_imap
    raise OfflineImapError(reason, severity)
Command exited with non-zero status 1
8273.52user 983.80system 8:31:12elapsed 30%CPU (0avgtext+0avgdata 841936maxresident)k
56376inputs+43247608outputs (811major+4972914minor)pagefaults 0swaps
"offlineimap  -o " took 8 hours 31 mins 15 secs
This is 8h31m for transferring 12G, which is around 3.1Mbit/s. That is nine times slower than mbsync, almost an order of magnitude! Now that we have a full sync, we can test incremental synchronization. That is also much slower:
===> multitime results
1: sh -c "offlineimap -o || true"
            Mean        Std.Dev.    Min         Median      Max
real        24.639      0.513       23.946      24.526      25.708      
user        23.912      0.473       23.404      23.795      24.947      
sys         1.743       0.105       1.607       1.729       2.002
That is also an order of magnitude slower than mbsync, and significantly slower than what you'd expect from a sync process. ~30 seconds is long enough to make me impatient and distracted; 3 seconds, less so: I can wait and see the results almost immediately.

Integrity check That said: this is still on a gigabit link. It's technically possible that OfflineIMAP performs better than mbsync over a slow link, but I haven't tested that theory. The OfflineIMAP mail spool is missing quite a few messages as well:
anarcat@angela:~(main)$ find Maildir-offlineimap -type f -type f -a \! -name '.*' | wc -l
381463
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385247
... although that's probably all either new messages or the register folder, so OfflineIMAP might actually be in a better position there. But digging in more, it seems like the actual per-folder diff is fairly similar to mbsync: a few messages missing here and there. Considering OfflineIMAP's instability and poor performance, I have not looked any deeper in those discrepancies.
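For what it's worth, the per-folder comparison was nothing fancy; a rough sketch of the idea (an ad-hoc throwaway of mine, not a polished tool) is to count non-hidden files per directory in each spool and diff the results:
for d in Maildir Maildir-offlineimap; do
    # count messages per folder, excluding hidden state files
    (cd "$d" && find . -type f ! -name '.*' -printf '%h\n' | sort | uniq -c) > "/tmp/counts-$d.txt"
done
diff /tmp/counts-Maildir.txt /tmp/counts-Maildir-offlineimap.txt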

Other projects to evaluate Those are all the options I have considered, in alphabetical order:
  • doveadm-sync: requires dovecot on both ends, can tunnel over SSH, may have performance issues in incremental sync, written in C
  • fdm: fetchmail replacement, IMAP/POP3/stdin/Maildir/mbox/NNTP support, SOCKS support (for Tor), complex rules for delivering to specific mailboxes, adding headers, piping to commands, etc.; discarded because there is no (real) support for keeping mail on the server, and it is written in C
  • getmail: fetchmail replacement, IMAP/POP3 support, supports incremental runs, classification rules, Python
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap, but requires running an IMAP server locally, Perl
  • isync/mbsync: TLS client certs and SSH tunnels, fast, incremental, IMAP/POP/Maildir support, multiple mailbox, trash and recursion support, and generally has good words from multiple Debian and notmuch people (Arch tutorial), written in C, review above
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, mentions rsmtp which is a nice name for rsendmail. not evaluated because it seems awfully complex to setup, Haskell
  • nncp: treat the local spool as another mail server, not really compatible with my "multiple clients" setup, Golang
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, review above, Python
Most projects were not evaluated due to lack of time.

Conclusion I'm now using mbsync to sync my mail. I'm a little disappointed by the synchronisation times over the slow link, but I guess that's par for the course if we use IMAP. We are bound by the network speed much more than with custom protocols. I'm also worried about the C implementation and the crashes I have witnessed, but I am encouraged by the fast upstream response. Time will tell if I will stick with that setup. I'm certainly curious about the promises of interimap and mail-sync, but I have run out of time on this project.

29 June 2021

Antoine Beaupré: Another syncmaildir crash

So I had another major email crash with my syncmaildir setup. This time I was at least able to confirm the issue, and I still haven't lost mail thanks to backups and sheer luck (again).

The crash It is not really worth going over the crash in details, it's fairly similar to the last one: something bad happened and smd started destroying everything. The hint is that it takes a long time to do what usually takes seconds. It helps that I now have a second monitor showing logs. I still lost much more mail than the last time. I used to have "301 723 messages", according to notmuch. But then when I ran smd-pull by hand, it was telling me:
95K emails scanned
Oops. You can see notmuch happily noticing the destroyed files on the server:
jun 28 16:33:40 marcos notmuch[28532]: No new mail. Removed 65498 messages. Detected 1699 file renames.
jun 28 16:36:05 marcos notmuch[29746]: No new mail. Removed 68883 messages. Detected 2488 file renames.
jun 28 16:41:40 marcos notmuch[31972]: No new mail. Removed 118295 messages. Detected 3657 file renames.
The final count ended up being 81 042 messages, according to notmuch. A whopping 220 000 mails deleted. The interesting bit, this time around, is that I caught smd in the act of running two processes in parallel:
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir. 
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir... 
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir... 
So clearly that is the source of the bug.
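After the fact, the overlap is easy to confirm from the journal; something like this (with the time window adjusted to the incident) should show both units starting within the same second:
journalctl --user -u smd-pull.service -u smd-push.service \
    --since "2021-06-28 16:25" --until "2021-06-28 16:35"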

Recovery Emergency stop on curie:
notmuch dump > notmuch.dump
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
On marcos (the server), guessed the number of messages delivered since the last backup to be 71, just looking at timestamps in the mail log. Made a list:
grep postfix/local /var/log/mail.log | tail -71 > lost-mail
Found postfix queue IDs:
sed 's/.*\]://;s/:.*//' lost-mail > qids
Turn those into message IDs and find those that are missing from the disk (I had previously run notmuch new just to be sure it was up to date):
while read qid ; do 
    grep "$qid: message-id" /var/log/mail.log
done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do
    sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid
done
Copy this back on curie as missing-msgids and:
$ wc -l missing-msgids 
48 missing-msgids
$ while read msgid ; do notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done < missing-msgids
mailman.189.1624881611.23397.nodes-reseaulibre.ca@reseaulibre.ca
AnwMy7rdSpK-N-vt4AiOag@ismtpd0148p1mdw1.sendgrid.net
only two mails missing! whoohoo! Copy those back onto marcos as really-missing-msgids, and look at the full mail logs to see what they are:
~anarcat/src/koumbit-scripts/mail/postfix-trace --from-file really-missing-msgids2
I actually remembered deleting those, so no mail lost! Rebuild the list of msgids that were lost, on marcos:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//'
Copy that on curie as lost-mail-msgids, then copy the files over in a test dir:
while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:restore/Maildir-angela/
If that looks about right, on marcos:
find restore/Maildir-angela/ -type f | wc -l
... should match the number of missing mails, roughly. Copy them into the real spool:
while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:Maildir/
Then on the server, notmuch new should find the new emails, and we shouldn't have any lost mail anymore:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done
Then, crucial moment, try to pull the new mails from the backups on curie:
anarcat@curie:~(main)$ smd-pull  -n  --show-tags -v
Found lockfile of a dead instance. Ignored.
Phase 0: handshake
Phase 1: changes detection
    5K emails scanned
   10K emails scanned
   15K emails scanned
   20K emails scanned
   25K emails scanned
   30K emails scanned
   35K emails scanned
   40K emails scanned
   45K emails scanned
   50K emails scanned
Phase 2: synchronization
Phase 3: agreement
default: smd-client@localhost: TAGS: stats::new-mails(49687), del-mails(0), bytes-received(215752279), xdelta-received(3703852)
"smd-pull  -n  --show-tags -v" took 3 mins 39 secs
This brought me back to the state after the backup plus the mails delivered during the day, which means I had to catch up with all the emails I had already read during the holidays (1440 mails!), but thankfully I made a dump of the notmuch database on curie at the start of the procedure, so this actually restored a sane state:
pv notmuch.dump | notmuch restore
Phew!

Workaround I have filed this as a bug in upstream issue 18. Considering I filed 11 issues and only 3 of those were closed, I'm not holding my breath. I nevertheless filed PR 19 in the hope that this will fix my particular issue, but I'm not even sure this is the right fix...
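In the meantime, a stopgap that doesn't depend on upstream at all (untested, and definitely not what PR 19 does) would be to wrap the ExecStart= commands of both user units in the same flock(1) lock, so that a pull simply waits for a running push to finish instead of racing it; the lock path here is my own invention:
flock /home/anarcat/.smd/systemd.lock smd-pull   # in smd-pull.service
flock /home/anarcat/.smd/systemd.lock smd-push   # in smd-push.service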

Fix At this point, I'm really ready to give up on SMD. It's really, really nice to be able to sync mail over SSH because I don't need to store my IMAP password on disk. But surely there are more reliable syncing mechanisms. I do not remember ever losing that much mail before. At worst, offlineimap would duplicate emails like mad, but never destroy my entire mail spool that way. As mentioned before, there are other programs that sync mail. I'm looking at:
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, see comment below
  • isync/mbsync: might be faster, I remember having trouble switching from offlineimap to this, has support for TLS client certs, running over SSH, and generally has good words from multiple Debian and notmuch people
  • getmail: just downloads email, might not be enough
  • nncp: treat the local spool as another mail server, might not be compatible with my "multiple clients" setup
  • doveadm-sync: requires dovecot on both ends, but supports using SSH to sync, will try this next, may have performance problems, see comment below
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, seems awfully complicated to setup, mentions rsmtp which is a nice name for rsendmail

23 March 2021

Antoine Beaupré: Major email crash with syncmaildir

TL;DR: lost half my mail (150,000 messages, ~6GB) last night. Cause uncertain, but possibly a combination of a dead CMOS battery, systemd OnCalendar=daily, a (locking?) bug in syncmaildir, and generally, a system too exotic and complicated.

The crash So I somehow lost half my mail:
anarcat@angela:~(main)$ du -sh Maildir/
7,9G    Maildir/
anarcat@curie:~(main)$ du -sh Maildir
14G     Maildir
anarcat@marcos:~$ du -sh Maildir
8,0G    Maildir
Those are three different machines:
  • angela: my laptop, not always on
  • curie: my workstation, mostly always on
  • marcos: my mail server, always on
Those mails are synchronized using a rather exotic system based on SSH, syncmaildir and rsendmail. The anomaly started on curie:
-- Reboot --
mar 22 16:13:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:13:00 curie smd-pull[4801]: rm: impossible de supprimer '/home/anarcat/.smd/workarea/Maildir': Le dossier n'est pas vide
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:13:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:14:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:14:00 curie smd-pull[7025]:  4091 ?        00:00:00 smd-push
mar 22 16:14:00 curie smd-pull[7025]: Already running.
mar 22 16:14:00 curie smd-pull[7025]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:14:00 curie smd-pull[7025]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 4091) run(rm /home/anarcat/.smd/lock))
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:14:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
Then it seems like smd-push (from curie) started destroying the universe for some reason:
mar 22 16:20:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:20:00 curie smd-pull[9319]:  4091 ?        00:00:00 smd-push
mar 22 16:20:00 curie smd-pull[9319]: Already running.
mar 22 16:20:00 curie smd-pull[9319]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:20:00 curie smd-pull[9319]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(ru
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:20:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:21:34 curie smd-push[4091]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:21:35 curie smd-push[9374]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:21:35 curie systemd[3199]: smd-push.service: Succeeded.
Notice the del-mails(293920) there: it is actively trying to destroy basically every email in my mail spool. Then somehow push and pull started both at once:
mar 22 16:21:35 curie systemd[3199]: Started push emails with syncmaildir.
mar 22 16:21:35 curie systemd[3199]: Starting push emails with syncmaildir...
mar 22 16:22:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:22:00 curie smd-pull[10333]:  9455 ?        00:00:00 smd-push
mar 22 16:22:00 curie smd-pull[10333]: Already running.
mar 22 16:22:00 curie smd-pull[10333]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:22:00 curie smd-pull[10333]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: Data transmission failed.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marco
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open requested file.
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:22:00 curie smd-push[9455]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start push emails with syncmaildir.
There it seems push tried to destroy the universe again: del-mails(293920). Interestingly, the push started again in parallel with the pull, right that minute:
mar 22 16:22:00 curie systemd[3199]: Starting push emails with syncmaildir...
... but didn't complete for a while, here's pull trying to start again:
mar 22 16:24:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:24:00 curie smd-pull[12051]: 10466 ?        00:00:00 smd-push
mar 22 16:24:00 curie smd-pull[12051]: Already running.
mar 22 16:24:00 curie smd-pull[12051]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:24:00 curie smd-pull[12051]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 10466) run(rm /home/anarcat/.smd/lock))
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
... and the long push finally resolving:
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: No such file or directory
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open requested file.
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie smd-push[10466]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start push emails with syncmaildir.
mar 22 16:24:00 curie systemd[3199]: Starting push emails with syncmaildir...
This pattern repeats until 16:35, when that locking issue silently recovered somehow:
mar 22 16:35:03 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:35:41 curie smd-pull[20788]: default: smd-client@localhost: TAGS: stats::new-mails(5), del-mails(1), bytes-received(21885), xdelta-received(6863398)
mar 22 16:35:42 curie smd-pull[21373]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:35:42 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:35:42 curie systemd[3199]: Started pull emails with syncmaildir.
mar 22 16:36:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:36:36 curie smd-pull[21738]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(214)
mar 22 16:36:37 curie smd-pull[21816]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:36:37 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:36:37 curie systemd[3199]: Started pull emails with syncmaildir.
... notice that huge xdelta-received there, that's 7GB right there. Mysteriously, the curie mail spool survived this, possibly because smd-pull started failing again:
mar 22 16:38:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:38:00 curie smd-pull[23556]: 21887 ?        00:00:00 smd-push
mar 22 16:38:00 curie smd-pull[23556]: Already running.
mar 22 16:38:00 curie smd-pull[23556]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:38:00 curie smd-pull[23556]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 21887) run(rm /home/anarcat/.smd/lock))
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:38:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
That could have been when I got on angela to check my mail, and it was busy doing the nasty removal stuff... although the times don't match. Here is when angela came back online:
anarcat@angela:~(main)$ last
anarcat  :0           :0               Mon Mar 22 19:57   still logged in
reboot   system boot  5.10.0-0.bpo.3-a Mon Mar 22 19:57   still running
anarcat  :0           :0               Mon Mar 22 17:43 - 18:47  (01:03)
reboot   system boot  5.10.0-0.bpo.3-a Mon Mar 22 17:39   still running
Then finally the sync on curie started failing with:
mar 22 16:46:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:46:42 curie smd-pull[27455]: smd-server: ERROR: Client aborted, removing /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.new and /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.mtime.new
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: Failed to copy Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: The destination already exists but its content differs.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: To fix this problem you have two options:
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - rename Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S by hand so that Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   can be copied without replacing it.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   Executing  cd; mv -n "Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "Maildir/.koumbit/cur/1616446002.1.localhost"  should work.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - run smd-push so that your changes to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   are propagated to the other mailbox
mar 22 16:46:42 curie smd-pull[27455]: default: smd-client@localhost: TAGS: error::context(copy-message) probable-cause(concurrent-mailbox-edit) human-intervention(necessary) suggested-actions(run(mv -n "/home/anarcat/.smd/workarea/Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "/home/anarcat/.smd/workarea/Maildir/.koumbit/tmp/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S") run(smd-push default))
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:46:42 curie systemd[3199]: Failed to start pull emails with syncmaildir.
It went on like this until I found the problem. This is, presumably, a good thing because those emails were not being destroyed. On angela, things looked like this:
-- Reboot --
mar 22 17:39:29 angela systemd[1677]: Started run notmuch new at least once a day.
mar 22 17:39:29 angela systemd[1677]: Started run smd-pull regularly.
mar 22 17:40:46 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open Maildir/.tor/new/1616446842.M285912P26118.marcos,S=8860,W=8996: Maildir/.tor/new/1616446842.M285912P26118.marcos,S=8860,W=8996: No such file or directory
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open requested file.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: Data transmission failed.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: This problem is transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: server sent ABORT or connection died
mar 22 17:43:18 angela smd-pull[3916]: default: smd-server@smd-server-anarcat: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(retry)
mar 22 17:43:18 angela smd-pull[3916]: default: smd-client@localhost: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Failed with result 'exit-code'.
mar 22 17:43:18 angela systemd[1677]: Failed to start pull emails with syncmaildir.
mar 22 17:43:18 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:29 angela smd-pull[4847]: default: smd-client@localhost: TAGS: stats::new-mails(29), del-mails(0), bytes-received(401519), xdelta-received(38914)
mar 22 17:43:29 angela smd-pull[5600]: register: smd-client@localhost: TAGS: stats::new-mails(2), del-mails(0), bytes-received(92150), xdelta-received(471)
mar 22 17:43:29 angela systemd[1677]: smd-pull.service: Succeeded.
mar 22 17:43:29 angela systemd[1677]: Started pull emails with syncmaildir.
mar 22 17:43:29 angela systemd[1677]: Starting push emails with syncmaildir...
mar 22 17:43:32 angela smd-push[5693]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(217)
mar 22 17:43:33 angela smd-push[6575]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(219)
mar 22 17:43:33 angela systemd[1677]: smd-push.service: Succeeded.
mar 22 17:43:33 angela systemd[1677]: Started push emails with syncmaildir.
Notice how long it took to get the first error, in that first failure: it failed after 3 minutes! Presumably that's when it started deleting all that mail. And this is during pull, not push, so the error didn't come from angela.

Affected data It seems 2GB of mail from my main INBOX was destroyed. Another 2.4GB of spam (kept for training purposes) was also destroyed, along with 700MB of Sent mail. The rest is hard to figure out, because the folders are actually still there, just smaller. So I relied on ncdu to figure out the size changes. (Note that I don't really archive (or delete much of) my mail since I use notmuch, which is why the INBOX is so large...) Concretely, according to the notmuch-new.service which still runs periodically on marcos, here are the changes that happened on the server:
mar 22 16:17:12 marcos notmuch[10729]: Added 7 new messages to the database. Removed 57985 messages. Detected 1372 file renames.
mar 22 16:22:43 marcos notmuch[12826]: No new mail. Removed 143842 messages. Detected 6072 file renames.
mar 22 16:27:02 marcos notmuch[13969]: No new mail. Removed 82071 messages. Detected 1783 file renames.
mar 22 16:29:45 marcos notmuch[15079]: Added 22743 new messages to the database. Detected 1 file rename.
mar 22 16:31:48 marcos notmuch[16196]: Added 22779 new messages to the database. Removed 5 messages.
mar 22 16:33:11 marcos notmuch[17192]: Added 3711 new messages to the database.
mar 22 16:40:41 marcos notmuch[19122]: Added 74558 new messages to the database. Detected 1 file rename.
mar 22 16:43:21 marcos notmuch[20325]: Added 9061 new messages to the database. Detected 4 file renames.
mar 22 17:43:08 marcos notmuch[7420]: Added 1793 new messages to the database. Detected 6 file renames.
That is basically the entire mail spool destroyed at first (283 898 messages), and then bits and pieces of it progressively re-added (134 645 messages), somehow, so 149 253 mails were lost, presumably.
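As a quick sanity check (nothing more), those totals can be recomputed in plain Python from the notmuch log lines above:
removed = 57985 + 143842 + 82071                       # messages removed in the first three runs
re_added = 22743 + 22779 + 3711 + 74558 + 9061 + 1793  # messages progressively re-added
print(removed)             # 283898
print(re_added)            # 134645
print(removed - re_added)  # 149253 messages presumably lost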

Recovery I disabled the services all over the place:
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
(Well, technically, I did that only on angela, as I thought the problem was there. Luckily, curie kept going but it seems like it was harmless.) I made a backup of the mail spool on curie:
tar cf - Maildir/ | pv -s 14G | gzip -c > Maildir.tgz
Then I crossed my fingers and ran smd-push -v -s, as that was suggested by smd error codes themselves. That thankfully started restoring mail. It failed a few times on weird cases of files being duplicates, but I resolved this by following the instructions. Or mostly: I actually deleted the files instead of moving them, which made smd even unhappier (if there ever was such a thing). I had to recreate some of those files, so, lesson learned: do follow the advice smd gives you, even if it seems useless or strange. But then smd-push was humming along, uploading tens of thousands of messages, saturating the upload in the office, refilling the mail spool on the server... yaay!... ? Except... well, of course that didn't quite work: the mail spool in the office eventually started to grow beyond the size of the mail spool on the workstation. That is what smd-push eventually settled on:
default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(151697), del-mails(0), bytes-received(7539147811), xdelta-received(10881198)
It recreated 151 697 emails, adding about 2000 emails to the pool, kind of from nowhere at all. On marcos, before:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
    4,0 GiB [##########] /.notmuch
  717,3 MiB [#         ] /.Archives.2014
  498,2 MiB [#         ] /.feeds.debian-planet
  453,1 MiB [#         ] /.Archives.2012
  414,5 MiB [#         ] /.debian
  408,2 MiB [#         ] /.quoifaire
  389,8 MiB [          ] /.rapports
  356,6 MiB [          ] /.tor
  182,6 MiB [          ] /.koumbit
  179,8 MiB [          ] /tmp
   56,8 MiB [          ] /.nn
   43,0 MiB [          ] /.act-mtl
   32,6 MiB [          ] /.feeds.sysadvent
   31,7 MiB [          ] /.feeds.releases
   31,4 MiB [          ] /.Sent.2005
   26,3 MiB [          ] /.sage
   25,5 MiB [          ] /.freedombox
   24,0 MiB [          ] /.feeds.git-annex
   21,1 MiB [          ] /.Archives.2011
   19,1 MiB [          ] /.Sent.2003
   16,7 MiB [          ] /.bugtraq
   16,2 MiB [          ] /.mlug
 Total disk usage:   8,0 GiB  Apparent size:   7,6 GiB  Items: 184426
After:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
    4,7 GiB [##########] /.notmuch
    2,7 GiB [#####     ] /.junk
    1,9 GiB [###       ] /cur
  717,3 MiB [#         ] /.Archives.2014
  659,3 MiB [#         ] /.Sent
  513,9 MiB [#         ] /.Archives.2012
  498,2 MiB [#         ] /.feeds.debian-planet
  449,6 MiB [          ] /.Archives.2015
  414,5 MiB [          ] /.debian
  408,2 MiB [          ] /.quoifaire
  389,8 MiB [          ] /.rapports
  380,8 MiB [          ] /.Archives.2013
  356,6 MiB [          ] /.tor
  261,1 MiB [          ] /.Archives.2011
  240,9 MiB [          ] /.koumbit
  183,6 MiB [          ] /.Archives.2010
  179,8 MiB [          ] /tmp
  128,4 MiB [          ] /.lists
  106,1 MiB [          ] /.inso-interne
  103,0 MiB [          ] /.github
   75,0 MiB [          ] /.nanog
   69,8 MiB [          ] /.full-disclosure
 Total disk usage:  16,2 GiB  Apparent size:  15,5 GiB  Items: 341143
That is 156 717 files more. On curie:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------------------------------------
    2,7 GiB [##########] /.junk
    2,3 GiB [########  ] /.notmuch
    1,9 GiB [######    ] /cur
  661,2 MiB [##        ] /.Archives.2014
  655,3 MiB [##        ] /.Sent
  512,0 MiB [#         ] /.Archives.2012
  447,3 MiB [#         ] /.Archives.2015
  438,5 MiB [#         ] /.feeds.debian-planet
  406,5 MiB [#         ] /.quoifaire
  383,6 MiB [#         ] /.debian
  378,6 MiB [#         ] /.Archives.2013
  303,3 MiB [#         ] /.tor
  296,0 MiB [#         ] /.rapports
  237,6 MiB [          ] /.koumbit
  233,2 MiB [          ] /.Archives.2011
  182,1 MiB [          ] /.Archives.2010
  127,0 MiB [          ] /.lists
  104,8 MiB [          ] /.inso-interne
  102,7 MiB [          ] /.register
   89,6 MiB [          ] /.github
   67,1 MiB [          ] /.full-disclosure
   66,5 MiB [          ] /.nanog
 Total disk usage:  13,3 GiB  Apparent size:  12,6 GiB  Items: 342465
Interestingly, there are more files, but less disk usage. It's possible the notmuch database there is more efficient. So maybe there's nothing to worry about. Last night's marcos backup has:
root@marcos:/home/anarcat# find /mnt/home/anarcat/Maildir | pv -l | wc -l
 341k 0:00:16 [20,4k/s] [                             <=>                                                                                                                                     ]
341040
... 341040 files, which seems about right, considering some mail was delivered during the day. An audit can be performed with hashdeep:
borg mount /media/sdb2/borg/::marcos-auto-2021-03-22 /mnt
hashdeep -c sha256 -r /mnt/home/anarcat/Maildir | pv -l -s 341k > Maildir-backup-manifest.txt
And then compared with:
hashdeep -c sha256 -k Maildir-backup-manifest.txt Maildir/
Some extra files should show up in the Maildir, and very few should actually be missing, because I would not normally have deleted much of the previous day's mail the very next day. The actual summary hashdeep gave me was:
hashdeep: Audit failed
   Input files examined: 0
  Known files expecting: 0
          Files matched: 339080
Files partially matched: 0
            Files moved: 782
        New files found: 107
  Known files not found: 106
So 107 files added, 106 deleted. Seems good enough for me... Postfix was stopped at Mar 22 21:12:59 to try and stop external events from confusing things even further. I reviewed the delivery log to see if mail that came in during the problem window disappeared:
grep 'dovecot:.*stored mail into mailbox' /var/log/mail.log |
  tail -20 |
  sed 's/.*msgid=<//;s/>.*//' |
  while read msgid; do
    notmuch count --exclude=false id:$msgid |
      grep 0 && echo $msgid missing;
  done
And things looked okay. Now of course if we go further back, we find mail I actually deleted (because I do do that sometimes), so it's hard to use this log as an audit trail. We can only hope that the curie spool is sufficiently coherent to be relied on. Worst case, we'll have to restore from last night's backup, but that's getting far away now: I get hundreds of mails a day in that mail spool, and resetting back to last night does not seem like a good idea. A dry run of smd-pull on angela seems to agree that it's missing some files:
default: smd-client@localhost: TAGS: stats::new-mails(154914), del-mails(0), bytes-received(0), xdelta-received(0)
... a number of mails somewhere in between the other two, go figure. A "wet" run of this was started, without deletion (-n), which gave us:
default: smd-client@localhost: TAGS: stats::new-mails(154911), del-mails(0), bytes-received(7658160107), xdelta-received(10837609)
Strange that it sync'd three fewer emails, but that's still better than nothing, and we have a mail spool on angela again:
anarcat@angela:~(main)$ notmuch new
purging with prefix '.': spam moved (0), ham moved (0), deleted (0), done
Note: Ignoring non-mail file: /home/anarcat/Maildir//.uidvalidity
Processed 1779 total files in 26s (66 files/sec.).
Added 1190 new messages to the database. Removed 3 messages. Detected 593 file renames.
tagging with prefix '.': spam, sent, feeds, koumbit, tor, lists, rapports, folders, done.
Notice how only 1190 messages were re-added, that is because I killed notmuch before it had time to remove all those mails from its database.

Possible causes I am totally at a loss as to why smd started destroying everything like it did. But a few things come to mind:
  1. I rewired my office on that day.
  2. This meant unplugging curie, the workstation.
  3. It has a bad CMOS battery (known problem), so it jumped around the time continuum a few times, sometimes by years.
  4. The smd services are run from a systemd unit with OnCalendar=*:0/2. I have heard that major time jumps can make scheduled jobs "pile up", and it seems that is what happened in this case.
  5. It's possible that locking in smd is not as great as it could be, and that it corrupted its internal data structures on curie, which led it to command a destruction of the remote mail spool (see the locking sketch below).
It's also possible that there was a disk failure on the server, marcos. But since it's running on a (software) RAID-1 array, and no errors have been found (according to dmesg), I don't think that's a plausible hypothesis.
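As an aside on point 5: the stale-lock symptom visible in the logs above ("remove /home/anarcat/.smd/lock by hand") is the kind of thing kernel-level advisory locks avoid, since they vanish with the process that holds them. This is only an illustrative Python sketch of that technique, not smd's actual implementation; the lock path and messages are made up:
import fcntl
import sys

lock = open("/tmp/example-sync.lock", "w")   # hypothetical lock file
try:
    # flock(2) locks are released automatically when the process dies,
    # so they cannot go stale and never need to be removed by hand
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit("another instance is already running")

# ... the actual synchronization work would happen here ...

fcntl.flock(lock, fcntl.LOCK_UN)
lock.close()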

Lessons learned
  1. follow what smd says, even if it seems useless or strange.
  2. trust but verify: just backup everything before you do anything, especially the largest data set.
  3. daily backups are not great for email, unless you're ready to lose a day of email (which I'm not).
  4. hashdeep is great. I keep finding new use cases for it. Last time it was to audit my camera SD card to make sure I didn't forget anything, and now this. It's fast and powerful.
  5. borg is great too. The FUSE mount was especially useful, and it was pretty fast to explore the backup, even through that overhead: checksumming 15GB of mail took about 35 minutes, which gives a respectable 8MB/s, probably bottlenecked by the crap external USB drive I use for backups (!).
  6. I really need to finish my backup system so that I have automated offsite backups, although in this case that would actually have been much slower (certainly not 8MB/s!).

Workarounds and solutions I set up fake-hwclock on curie, so that the next power failure will not upset my clock that badly. I am thinking of switching to ZFS or BTRFS for most of my filesystems, so that I can use filesystem snapshots (including remotely!) as a backup strategy. This seems so much more powerful than crawling the filesystem for changes, and allows for truly offsite backups protected from an attacker (hopefully). But it's a long way off. I'm also thinking of rebuilding my mail setup without smd. It's not the first time something like this happens with smd. It is the first time I am more confident it's the root cause of the problem, however, and it makes me really nervous for the future. I have used offlineimap in the past and it seems it was finally ported to Python 3, so that could be an option again. isync/mbsync is another option, which I tried before but do not remember why I didn't switch. A complete redesign with something like getmail and/or nncp could also be an option. But alas, I lack the time to go crazy with those experiments. Somehow, doing like everyone else and just going with Google still doesn't seem to be an option for me. Screw big tech. But I am afraid they will win, eventually. In any case, I'm just happy I got mail again, strangely.

30 March 2020

Axel Beckert: How do you type on a keyboard with only 46 or even 28 keys?

Some of you might have noticed that I'm into keyboards, since a few years ago into mechanical keyboards to be precise. Preface It basically started when the Swiss Mechanical Keyboard Meetup (whose website I started later on) was held in the hackerspace of the CCCZH. I mostly used TKL keyboards (i.e. keyboards with just the, for me useless, number block missing) and tried to get my hands on more keyboards with Trackpoints (but failed so far). At some point a year or two ago, I started looking into smaller keyboards, to have a mechanical keyboard with me when travelling. I first bought a Vortex Core at Candykeys. The size was nice and especially having all layers labelled on the keys was helpful, but nevertheless I soon noticed that the smaller the keyboards get, the more important it is that they're properly programmable. The Vortex Core is programmable, but not the keys in the bottom right corner, which are exactly the keys I wanted to change to get a cursor block down there. (Later I found out that there are possibilities to get this done, either with an alternative firmware and a hack of it, or by desoldering all switches and mounting an alternative PCB called Atom47.) 40% Keyboards So at some point I ordered a MiniVan keyboard from The Van Keyboards (MiniVan keyboards will soon be available again at The Key Dot Company), here shown with GMK Paperwork (also bought from and designed by The Van Keyboards):
The MiniVan PCBs are fully programmable with the free and open source firmware QMK, and I started to use that more and more instead of bigger keyboards. Layers With the MiniVan I learned the concept of layers. Layers are similar to what many laptop keyboards do with the Fn key, and to some extent also to what the German standard layout does with the AltGr key: layers are basically alternative key maps you can switch to with a special key (often called "Fn", "Fn1", "Fn2", etc., or, especially if there are two additional layers, "Raise" and "Lower"). There are several concepts for how these layers can be reached with these keys: My MiniVan Layout For the MiniVan, two additional layers suffice easily, but since I have a few characters on multiple layers and also have mouse control and media keys crammed in there, I have three additional layers on my MiniVan keyboards:

TRNS means transparent, i.e. use the settings from lower layers.
I also use a feature that allows me to bind different actions to a key depending on whether I just tap the key or hold it. Some also call this "tap dance". This is especially popular on the usually rather huge spacebar; there, the term SpaceFn has been coined, probably after this discussion on Geekhack. I use this for all my layer switching keys: With this layout I can type English texts as fast as I can type them on a standard or TKL layout. German umlauts are a bit more difficult because they require 4 to 6 key presses per umlaut, as I use the Compose key functionality (mapped to the Menu key between the spacebars and the cursor block). So to type an Ä on my MiniVan, I have to:
  1. press and release Menu (i.e. Compose); then
  2. press and hold either Shift-Spacebar (i.e. Shift-Fn1) or Slash (i.e. Fn2), then
  3. press N for a double quote (i.e. Shift-Fn1-N or Fn2-N) and then release all keys, and finally
  4. press and release the base character for the umlaut, in this case Shift-A.
And now just use these concepts and reduce the amount of keys to 28: 30% and Sub-30% Keyboards In late 2019 I stumbled upon a nice little keyboard kit shop on Etsy called WorldspawnsKeebs, which I (and probably most other people in the mechanical keyboard scene) hadn't taken into account when looking for keyboards. They offer mostly kits for keyboards of 40% size and below, most of them rather simple and not expensive. For about 30 you get a complete sub-30% keyboard kit named Alpha28, consisting of a minimal acrylic case and a PCB and electronics set (without switches and keycaps though, but that's very common for keyboard kits as it leaves the choice of switches and key caps to you). This Alpha28 keyboard is, by the way, fully open source, as the source code (i.e. design files) for the hardware is published under a free license (MIT license) on GitHub. And here's how my Alpha28 looks with GMK Mitolet (part of the GMK Pulse group-buy) key caps:
So we only have character keys, Enter (labelled "Data", as there was no 1u Enter key with that row profile in that key cap set; I'll also call it Data for the rest of this posting) and a small spacebar, not even modifier keys. The Default Alpha28 Layout The original key layout by the developer of the Alpha28 used the spacebar as Shift on hold and as Space if just tapped, and the Data key always switches to the next layer, i.e. it switches the layer permanently on tap and not just on hold. This way that key rotates through all layers. In all other layers, V switches back to the default layer. I assume that the modifiers on the second layer are also on tap and apply to the next normal key. This has the advantage that you don't have to bend your fingers for some key combos, but you have to remember on which layer you are at the moment. (IIRC QMK allows you to show that via LEDs or similar.) Kinda just like vi. My Alpha28 Layout But maybe because I'm more an Emacs person, I dislike remembering states myself and don't mind bending my fingers. So I decided to develop my own layout using tap-or-hold and only doing layer switches by holding down keys:

A triangle means that the settings from lower layers are used, N/A means the key does nothing.
It might not be very obvious, but on the default layer, all keys in the bottom row and most keys on the row ends have tap-or-hold configurations.
Basic ideas / Bottom row if hold / Other rows if hold / How the keys are divided into layers
Using the Alpha28 This layout works surprisingly well for me. Only for Minus, Equal, Single Quote and Semicolon do I still often have to think or try whether they're on layer 1 or 2, as on my 40%s (MiniVan, Zlant, etc.) I have them all on layer 1 (and in general one layer less over all). And for really seldom used keys like Insert, PrintScreen, ScrollLock or Pause, I might have to consult my own documentation. They're somewhere in the middle of the keyboard, either on layer 1, 2, or 3. ;-) And of course, typing umlauts takes even two keys more per umlaut than on the MiniVan, since on the one hand Menu is not on the default layer, and on the other hand I don't have this nice shifted number row and actually have to also press Shift to get a double quote. So to type an Ä on my Alpha, I have to:
  1. press and release Space-F (i.e. Fn1-F) for Menu (i.e. Compose); then
  2. press and hold A-Spacebar-L (i.e. Shift-Fn1-L) for getting a double quote, then
  3. press and release the base character for the umlaut, in this case L-A for Shift-A (because we can't use A for Shift, as I can't hold a key and then press it again :-).
Conclusion If the characters on upper layers are not labelled (unlike on the Vortex Core), i.e. especially on all self-made layouts, typing is a bit like playing that old children's game Memory: as soon as you remember (or your muscle memory knows) where some special characters are, typing gets faster. Otherwise, you start with trial and error, or look at the documentation. Or give up. ;-) Nevertheless, typing on a sub-30% keyboard like the Alpha28 is much more difficult and slower than on a 40% keyboard like the MiniVan. So the Alpha28 very likely won't become my daily driver, while the MiniVan de facto already is my daily driver. But I like these kinds of challenges, as others like the game Memory. So I ordered three more 30% and sub-30% keyboard kits at WorldspawnsKeebs for soldering on the upcoming weekend during the COVID19 lockdown: And if I at some point want to try to type with even fewer keys, I'll try a Butterstick keyboard with just 20 keys. It's a chorded keyboard where you have to press multiple keys at the same time to get one character: So to get an A from the missing middle row, you have to press Q and Z simultaneously, to get Escape, press Q and W simultaneously, to get Control, press Q, W, Z and X simultaneously, etc. And if that's not even enough, I already bought a keyboard kit named Ginny (or Ginni, the developer can't seem to decide) with just 10 keys from an acquaintance. Couldn't resist when offered his surplus kits. :-) It uses the ASETNIOP layout which was initially developed for on-screen keyboards on tablets.

25 April 2017

Keith Packard: TeleMini3

TeleMini V3.0 Dual-deploy altimeter with telemetry now available TeleMini v3.0 is an update to our original TeleMini v1.0 flight computer. It is a miniature (1/2 inch by 1.7 inch) dual-deploy flight computer with data logging and radio telemetry. Small enough to fit comfortably in an 18mm tube, this powerful package does everything you need on a single board: I don't have anything in these images to show just how tiny this board is, but the spacing between the screw terminals is 2.54mm (0.1in), and the whole board is only 13mm wide (1/2in). This was a fun board to design. As you might guess from the version number, we made a couple prototypes of a version 2 using the same CC1111 SoC/radio part as version 1 but in the EasyMini form factor (0.8 by 1.5 inches). Feedback from existing users indicated that bigger wasn't better in this case, so we shelved that design. With the availability of the STM32F042 ARM Cortex-M0 part in a 4mm square package, I was able to pack that, the higher power CC1200 radio part, a 512kB memory part and a beeper into the same space as the original TeleMini version 1 board. There is USB on the board, but it's only on some tiny holes, along with the cortex SWD debugging connection. I may make some kind of jig to gain access to that for configuration, data download and reprogramming. For those interested in an even smaller option, you could remove the screw terminals and battery connector and directly wire to the board, and replace the beeper with a shorter version. You could even cut the rear mounting holes off to make the board shorter; there are no components in that part of the board.

28 February 2017

Chris Lamb: Free software activities in February 2017

Here is my monthly update covering what I have been doing in the free software world (previous month):
Reproducible builds

Whilst anyone can inspect the source code of free software for malicious flaws, most software is distributed pre-compiled to end users. The motivation behind the Reproducible Builds effort is to permit verification that no flaws have been introduced either maliciously or accidentally during this compilation process by promising identical results are always generated from a given source, thus allowing multiple third-parties to come to a consensus on whether a build was compromised. (I have been awarded a grant from the Core Infrastructure Initiative to fund my work in this area.) This month I:
I also made the following changes to our tooling:
diffoscope

diffoscope is our in-depth and content-aware diff utility that can locate and diagnose reproducibility issues.

  • New features:
    • Add a machine-readable JSON output format. (Closes: #850791).
    • Add an --exclude option. (Closes: #854783).
    • Show results from debugging packages last. (Closes: #820427).
    • Extract archive members using an auto-incrementing integer avoiding the need to sanitise filenames. (Closes: #854723).
    • Apply --max-report-size to --text output. (Closes: #851147).
    • Specify <html lang="en"> in the HTML output. (re. #849411).
  • Bug fixes:
    • Fix errors when comparing directories with non-directories. (Closes: #835641).
    • Device and RPM fallback comparisons require xxd. (Closes: #854593).
    • Fix tests that call xxd on Debian Jessie due to change of output format. (Closes: #855239).
    • Add missing Recommends for comparators. (Closes: #854655).
    • Importing submodules (ie. parent.child) will attempt to import parent. (Closes: #854670).
    • Correct logic of module_exists ensuring we correctly skip the debian.deb822 tests when python3-debian is not installed. (Closes: #854745).
    • Clean all temporary files in the signal handler thread instead of attempting to pass the exception back to the main thread. (Closes: #852013).
    • Fix behaviour of setting report maximums to zero (ie. no limit).
  • Optimisations:
    • Don't uselessly run xxd(1) on non-directories.
    • No need to track libarchive directory locations.
    • Optimise create_limited_print_func.
  • Tests:
    • When comparing two empty directories, ensure that the mtime of the directory is consistent to avoid non-deterministic failures.
    • Ensure we can at least import the "deb_fallback" and "rpm_fallback" modules.
    • Add test for symlink differing in destination.
    • Add tests for --progress, --status-fd and profiling output options as well as the Deb{Changes,Buildinfo,Dsc} and RPM fallback comparisons.
    • Add get_data and @skip_unless_module_exists test helpers.
    • Mark impossible-to-reach code to improve test coverage.

buildinfo.debian.net

buildinfo.debian.net is my experiment into how to process, store and distribute .buildinfo files after the Debian archive software has processed them.

  • Drop raw_text fields now as we've moved these to Amazon S3.
  • Drop storage of Installed-Build-Depends and subsequently-orphaned Binary package instances to recover diskspace.

strip-nondeterminism

strip-nondeterminism is our tool to remove specific non-deterministic results from a completed build.

  • Print log entry when fixing a file. (Closes: #777239).
  • Run our entire testsuite in autopkgtests, not just the first test. (Closes: #852517).
  • Don't test for stat(2)'s blksize and block attributes. (Closes: #854937).
  • Use error() from Dh_Lib.pm over "manual" die().


Debian
Debian LTS

This month I have been paid to work 13 hours on Debian Long Term Support (LTS). In that time I did the following:
  • "Frontdesk" duties, triaging CVEs, etc.
  • Issued DLA 817-1 for libphp-phpmailer, correcting a local file disclosure vulnerability where insufficient parsing of HTML messages could potentially be used by an attacker to read a local file.
  • Issued DLA 826-1 for wireshark, which fixes a denial of service vulnerability where a malformed NATO Ground Moving Target Indicator Format ("STANAG 4607") capture file could cause memory exhaustion or an infinite loop.

Uploads
  • python-django (1:1.11~beta1-1) New upstream beta release.
  • redis (3:3.2.8-1) New upstream release.
  • gunicorn (19.6.0-11) Use ${misc:Pre-Depends} to populate Pre-Depends for dpkg-maintscript-helper.
  • dh-virtualenv (1.0-1~bpo8+1) Upload to jessie-backports.

I sponsored the following uploads: I also performed the following QA uploads:
  • dh-kpatches (0.99.36+nmu4) Make kernel builds reproducible.
Finally, I made the following non-maintainer uploads:
  • cpio (2.12+dfsg-3) Remove rmt.8.gz to prevent a piuparts error.
  • dot-forward (1:0.71-2.2) Correct a FTBFS; we don't install anything to /usr/sbin, so use GNU Make's $(wildcard ..) over the shell's own * expansion.


FTP Team

As a Debian FTP assistant I ACCEPTed 116 packages: autobahn-cpp, automat, bglibs, bitlbee, bmusb, bullet, case, certspotter, checkit-tiff, dash-el, dash-functional-el, debian-reference, el-x, elisp-bug-hunter, emacs-git-messenger, emacs-which-key, examl, genwqe-user, giac, golang-github-cloudflare-cfssl, golang-github-docker-goamz, golang-github-docker-libnetwork, golang-github-go-openapi-spec, golang-github-google-certificate-transparency, golang-github-karlseguin-ccache, golang-github-karlseguin-expect, golang-github-nebulouslabs-bolt, gpiozero, gsequencer, jel, libconfig-mvp-slicer-perl, libcrush, libdist-zilla-config-slicer-perl, libdist-zilla-role-pluginbundle-pluginremover-perl, libevent, libfunction-parameters-perl, libopenshot, libpod-weaver-section-generatesection-perl, libpodofo, libprelude, libprotocol-http2-perl, libscout, libsmali-1-java, libtest-abortable-perl, linux, linux-grsec, linux-signed, lockdown, lrslib, lua-curses, lua-torch-cutorch, mariadb-10.1, mini-buildd, mkchromecast, mocker-el, node-arr-exclude, node-brorand, node-buffer-xor, node-caller, node-duplexer3, node-ieee754, node-is-finite, node-lowercase-keys, node-minimalistic-assert, node-os-browserify, node-p-finally, node-parse-ms, node-plur, node-prepend-http, node-safe-buffer, node-text-table, node-time-zone, node-tty-browserify, node-widest-line, npd6, openoverlayrouter, pandoc-citeproc-preamble, pydenticon, pyicloud, pyroute2, pytest-qt, pytest-xvfb, python-biomaj3, python-canonicaljson, python-cgcloud, python-gffutils, python-h5netcdf, python-imageio, python-kaptan, python-libtmux, python-pybedtools, python-pyflow, python-scrapy, python-scrapy-djangoitem, python-signedjson, python-unpaddedbase64, python-xarray, qcumber, r-cran-urltools, radiant, repo, rmlint, ruby-googleauth, ruby-os, shutilwhich, sia, six, slimit, sphinx-celery, subuser, swarmkit, tmuxp, tpm2-tools, vine, wala & x265. I additionally filed 8 RC bugs against packages that had incomplete debian/copyright files against: checkit-tiff, dash-el, dash-functional-el, libcrush, libopenshot, mkchromecast, pytest-qt & x265.

20 August 2016

Daniel Stender: Collected notes from Python packaging

Here are some collected notes on some particular problems from packaging Python stuff for Debian, and more are coming up like this in the future. Some of the issues discussed here might be rather simple and even benign for the experienced packager, but maybe this is helpful for people coming across the same issues for the first time, wondering what's going wrong. But some of the things discussed aren't easy. Here are the notes for this posting, in no particular order. UnicodeDecodeError on open() in Python 3 running in non-UTF-8 environments I came across this problem again recently when packaging httpbin 0.5.0. The build breaks in the following way, e.g. when trying to build with sbuild in a chroot; that's the first run of setup.py with the default Python 3 interpreter:
I: pybuild base:184: python3.5 setup.py clean 
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    os.path.join(os.path.dirname(__file__), 'README.rst')).read()
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2386: ordinal not in range(128)
E: pybuild pybuild:274: clean: plugin distutils failed with: exit code=1: python3.5 setup.py clean 
One comes across UnicodeDecodeErrors quite often on different occasions, mostly related to Python 2 packaging (like here). Here, setup.py tries to read long_description from the opened UTF-8 encoded file README.rst:
long_description = open(
    os.path.join(os.path.dirname(__file__), 'README.rst')).read()
This is a problem for Python 3.5 (and other Python 3 versions) when setup.py is executed by an interpreter run in a non-UTF-8 environment [1]:
$ LANG=C.UTF-8 python3.5 setup.py clean
running clean
$ LANG=C python3.5 setup.py clean
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    os.path.join(os.path.dirname(__file__), 'README.rst')).read()
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2386: ordinal not in range(128)
$ LANG=C python2.7 setup.py clean
running clean
The reason for this error is that the default encoding for the file object returned by open(), e.g. in Python 3.5, is platform dependent, so read() fails when there is a mismatch:
>>> readme = open('README.rst')
>>> readme
<_io.TextIOWrapper name='README.rst' mode='r' encoding='ANSI_X3.4-1968'>
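That platform-dependent default comes from the locale. A quick way to check what open() will pick in a given environment (assuming a stock Python 3 interpreter) is:
import locale
# the encoding the built-in open() uses when none is given;
# under a C/POSIX locale this reports ANSI_X3.4-1968, i.e. ASCII
print(locale.getpreferredencoding(False))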
Non-UTF-8 build environments (because $LANG isn't set at all, or is set to C) are common or even the default in Debian packaging; for example, the primary environment of the continuous integration resp. test builds for reproducible builds features that (see here). That's also true for the base system of the sbuild environment:
$ schroot -d / -c unstable-amd64-sbuild -u root
(unstable-amd64-sbuild)root@varuna:/# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
A problem like this is mostly solved elegantly by installing a workaround in debian/rules. A quick and easy fix is to add export LC_ALL=C.UTF-8 there, which supersedes the locale settings of the build environment. $LC_ALL is commonly used to change the existing locale settings; it overrides all other locale variables with the same value (see here). C.UTF-8 is a UTF-8 locale which is available by default in a base system; it can be used without installing the locales package (or worse, the huge locales-all package):
(unstable-amd64-sbuild)root@varuna:/# locale -a
C
C.UTF-8
POSIX
This problem ideally should be taken care of upstream. In Python 3, the default open() is io.open(), to which a specific encoding can be given, so that the UnicodeDecodeError disappears. Python 2's built-in open() doesn't support that, but io.open() is available there, too. A fix which works for both Python branches goes like this:
import io
long_description = io.open(
    os.path.join(os.path.dirname(__file__), 'README.rst'), encoding='utf-8').read()
non-deterministic order of requirements in egg-info/requires.txt This problem appeared in prospector/0.11.7-5 in the reproducible builds test builds; that was the first package revision of Prospector running on Python 3 [2]. It was revealed that there were differences in the sorting order of the [with_everything] dependencies resp. requirements in prospector-0.11.7.egg-info/requires.txt if the package was built on varying systems:
$ debbindiff b1/prospector_0.11.7-5_amd64.changes b2/prospector_0.11.7-5_amd64.changes 
 ... 
  prospector_0.11.7-5_all.deb
      file list
      @@ -1,3 +1,3 @@
       -rw-r--r--   0        0        0        4 2016-04-01 20:01:56.000000 debian-binary
      --rw-r--r--   0        0        0     4343 2016-04-01 20:01:56.000000 control.tar.gz
      +-rw-r--r--   0        0        0     4344 2016-04-01 20:01:56.000000 control.tar.gz
       -rw-r--r--   0        0        0    74512 2016-04-01 20:01:56.000000 data.tar.xz
      control.tar.gz
          control.tar
              ./md5sums
                  md5sums
                  Files in package differ
      data.tar.xz
          data.tar
              ./usr/share/prospector/prospector-0.11.7.egg-info/requires.txt
              @@ -1,12 +1,12 @@
               
               [with_everything]
              +pyroma>=1.6,<2.0
               frosted>=1.4.1
               vulture>=0.6
              -pyroma>=1.6,<2.0
Reproducible builds folks recognized this to be a toolchain problem and set up the issue randonmness_in_python_setuptools_requires.txt to cover this. Plus, a wishlist bug against python-setuptools was filed to fix this. The patch which was provided by Chris Lamb adds sorting of dependencies in requires.txt for Setuptools by adding sorted() (stdlib) to _write_requirements() in command/egg_info.py:
--- a/setuptools/command/egg_info.py
+++ b/setuptools/command/egg_info.py
@@ -406,7 +406,7 @@ def warn_depends_obsolete(cmd, basename, filename):
 def _write_requirements(stream, reqs):
     lines = yield_lines(reqs or ())
     append_cr = lambda line: line + '\n'
-    lines = map(append_cr, lines)
+    lines = map(append_cr, sorted(lines))
     stream.writelines(lines)
O.k. In the toolchain, nothing sorts these requirements alphanumerically to make differences disappear, but then what is the reason for these differences in the Prospector packages? The problem is somewhat subtle. In setup.py, [with_everything] is a dictionary entry of _OPTIONAL (which is used for extras_require) that is created by a list comprehension out of the other values in that dictionary. The code goes like this:
_OPTIONAL = {
    'with_frosted': ('frosted>=1.4.1',),
    'with_vulture': ('vulture>=0.6',),
    'with_pyroma': ('pyroma>=1.6,<2.0',),
    'with_pep257': (),  # note: this is no longer optional, so this option will be removed in a future release
}
_OPTIONAL['with_everything'] = [req for req_list in _OPTIONAL.values() for req in req_list]
The result, the new _OPTIONAL dictionary including the key with_everything (which, without further sorting, is the source of the list of requirements in requires.txt), looks e.g. like this (code snippet run through my IPython):
In [3]: _OPTIONAL
Out[3]: 
{'with_everything': ['vulture>=0.6', 'pyroma>=1.6,<2.0', 'frosted>=1.4.1'],
 'with_vulture': ('vulture>=0.6',),
 'with_pyroma': ('pyroma>=1.6,<2.0',),
 'with_frosted': ('frosted>=1.4.1',),
 'with_pep257': ()}
That list comprehension iterates over the other dictionary entries to gather the value of with_everything, and Python programmers know that dictionaries are not indexed: the order of the key-value pairs isn't fixed, but depends on the (randomized) hashing of the keys, which can vary from build to build. That's the source of the non-determinism in this Debian package revision of Prospector. This problem has been fixed by a patch, which just presorts the list of requirements before it gets added to _OPTIONAL:
@@ -76,8 +76,8 @@
     'with_pyroma': ('pyroma>=1.6,<2.0',),
     'with_pep257': (),  # note: this is no longer optional, so this option will be removed in a future release
 }
-_OPTIONAL['with_everything'] = [req for req_list in _OPTIONAL.values() for req in req_list]
-
+with_everything = [req for req_list in _OPTIONAL.values() for req in req_list]
+_OPTIONAL['with_everything'] = sorted(with_everything)
In comparison to the list method sort(), the function sorted() does not change iterables in-place, but returns a new object. Both could be used. As a side note, egg-info/requires.txt isn't even needed, but that's another issue.
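A tiny illustration of that difference, reusing the requirement strings from the example above:
reqs = ['vulture>=0.6', 'pyroma>=1.6,<2.0', 'frosted>=1.4.1']
print(sorted(reqs))  # new list: ['frosted>=1.4.1', 'pyroma>=1.6,<2.0', 'vulture>=0.6']
print(reqs)          # the original list is unchanged
reqs.sort()          # sorts in place and returns None
print(reqs)          # now sorted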

  1. As an alternative to prefixing LC_ALL, env -i could be used to get an empty environment.
  2. 0.11.7-4 already ran on Python 3, but that package revision was in experimental due to the switch to Python 3 and therefore not tested by the reproducible builds continuous integration.

7 May 2016

Keith Packard: Altos1.6.3

AltOS 1.6.3 Bdale and I are pleased to announce the release of AltOS version 1.6.3. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, STMF042, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. Version 1.6.3 adds idle mode to AltosDroid and has bug fixes for our host software on desktops, laptops and Android devices, along with BlueTooth support for Windows. 1.6.3 is in Beta test for Android; if you want to use the beta version, join the AltosDroid beta program AltOS AltOS fixes: AltosUI and TeleGPS Applications AltosUI and TeleGPS New Features: AltosUI and TeleGPS Fixes: AltosDroid AltosDroid new features: AltosDroid bug fixes: Documentation

10 April 2016

Vincent Bernat: Testing network software with pytest and Linux namespaces

Started in 2008, lldpd is an implementation of IEEE 802.1AB-2005 (aka LLDP) written in C. While it contains some unit tests, like much other network-related software at the time, the coverage of those is pretty poor: they are hard to write because the code is written in an imperative style and tightly coupled with the system. It would require extensive mocking [1]. While a rewrite (complete or iterative) would help to make the code more test-friendly, it would be quite an effort and would likely introduce operational bugs along the way. To get better test coverage, the major features of lldpd are now verified through integration tests. Those tests leverage Linux network namespaces to set up a lightweight and isolated environment for each test. They run through pytest, a powerful testing tool.

pytest in a nutshell pytest is a Python testing tool whose primary use is to write tests for Python applications but is versatile enough for other creative usages. It is bundled with three killer features:
  • you can directly use the assert keyword,
  • you can inject fixtures in any test function, and
  • you can parametrize tests.

Assertions With unittest, the unit testing framework included with Python, and many similar frameworks, unit tests have to be encapsulated into a class and use the provided assertion methods. For example:
class testArithmetics(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 3, 4)
The equivalent with pytest is simpler and more readable:
def test_addition():
    assert 1 + 3 == 4
pytest will analyze the AST and display useful error messages in case of failure. For further information, see Benjamin Peterson's article.

Fixtures A fixture is the set of actions performed in order to prepare the system to run some tests. With classic frameworks, you can only define one fixture for a set of tests:
class testInVM(unittest.TestCase):
    def setUp(self):
        self.vm = VM('Test-VM')
        self.vm.start()
        self.ssh = SSHClient()
        self.ssh.connect(self.vm.public_ip)
    def tearDown(self):
        self.ssh.close()
        self.vm.destroy()
    def test_hello(self):
        stdin, stdout, stderr = self.ssh.exec_command("echo hello")
        stdin.close()
        self.assertEqual(stderr.read(), b"")
        self.assertEqual(stdout.read(), b"hello\n")
In the example above, we want to test various commands on a remote VM. The fixture launches a new VM and configures an SSH connection. However, if the SSH connection cannot be established, the fixture will fail and the tearDown() method won't be invoked. The VM will be left running. Instead, with pytest, we could do this:
@pytest.yield_fixture
def vm():
    r = VM('Test-VM')
    r.start()
    yield r
    r.destroy()
@pytest.yield_fixture
def ssh(vm):
    ssh = SSHClient()
    ssh.connect(vm.public_ip)
    yield ssh
    ssh.close()
def test_hello(ssh):
    stdin, stdout, stderr = ssh.exec_command("echo hello")
    stdin.close()
    assert stderr.read() == b""
    assert stdout.read() == b"hello\n"
The first fixture will provide a freshly booted VM. The second one will set up an SSH connection to the VM provided as an argument. Fixtures are used through dependency injection: just give their names in the signature of the test functions and fixtures that need them. Each fixture only handles the lifetime of one entity. Whether a dependent test function or fixture succeeds or fails, the VM will always be destroyed in the end.

Parameters If you want to run the same test several times with a varying parameter, you can dynamically create test functions or use one test function with a loop. With pytest, you can parametrize test functions and fixtures:
@pytest.mark.parametrize("n1, n2, expected", [
    (1, 3, 4),
    (8, 20, 28),
    (-4, 0, -4)])
def test_addition(n1, n2, expected):
    assert n1 + n2 == expected
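Fixtures can be parametrized as well, in which case every test that uses the fixture is run once per parameter. A minimal, self-contained sketch (the values are arbitrary):
import pytest

@pytest.fixture(params=[1, 2, 3])
def number(request):
    # request.param holds the current value from the params list.
    return request.param

def test_positive(number):
    # This test runs three times, once per value provided by the fixture.
    assert number > 0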

Testing lldpd The general plan to test a feature in lldpd is the following:
  1. Setup two namespaces.
  2. Create a virtual link between them.
  3. Spawn a lldpd process in each namespace.
  4. Test the feature in one namespace.
  5. Check with lldpcli that we get the expected result in the other.
Here is a typical test using the most interesting features of pytest:
@pytest.mark.skipif('LLDP-MED' not in pytest.config.lldpd.features,
                    reason="LLDP-MED not supported")
@pytest.mark.parametrize("classe, expected", [
    (1, "Generic Endpoint (Class I)"),
    (2, "Media Endpoint (Class II)"),
    (3, "Communication Device Endpoint (Class III)"),
    (4, "Network Connectivity Device")])
def test_med_devicetype(lldpd, lldpcli, namespaces, links,
                        classe, expected):
    links(namespaces(1), namespaces(2))
    with namespaces(1):
        lldpd("-r")
    with namespaces(2):
        lldpd("-M", str(classe))
    with namespaces(1):
        out = lldpcli("-f", "keyvalue", "show", "neighbors", "details")
        assert out['lldp.eth0.lldp-med.device-type'] == expected
First, the test will be executed only if lldpd was compiled with LLDP-MED support. Second, the test is parametrized. We will execute four distinct tests, one for each role that lldpd should be able to take as an LLDP-MED-enabled endpoint. The signature of the test has four parameters that are not covered by the parametrize() decorator: lldpd, lldpcli, namespaces and links. They are fixtures. A lot of magic happens in them to keep the actual tests short:
  • lldpd is a factory to spawn an instance of lldpd. When called, it will set up the current namespace (setting up the chroot, creating the user and group for privilege separation, replacing some files to be distribution-agnostic, etc.), then call lldpd with the additional parameters provided. The output is recorded and added to the test report in case of failure. The module also contains the creation of the pytest.config.lldpd object that is used to record the features supported by lldpd and skip non-matching tests. You can read fixtures/programs.py for more details.
  • lldpcli is also a factory, but it spawns instances of lldpcli, the client to query lldpd. Moreover, it will parse the output in a dictionary to reduce boilerplate.
  • namespaces is one of the most interesting pieces. It is a factory for Linux namespaces. It will spawn a new namespace or refer to an existing one. It is possible to switch from one namespace to another (with with) as they are contexts. Behind the scenes, the factory maintains the appropriate file descriptors for each namespace and switches to them with setns(). Once the test is done, everything is wiped out as the file descriptors are garbage collected. You can read fixtures/namespaces.py for more details; a simplified sketch of the underlying idea follows this list. It is quite reusable in other projects2.
  • links contains helpers to handle network interfaces: creation of virtual Ethernet links between namespaces, creation of bridges, bonds and VLANs, etc. It relies on the pyroute2 module. You can read fixtures/network.py for more details.
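To give an idea of what the namespaces fixture does behind the scenes, here is a rough, simplified sketch. It is not the actual fixtures/namespaces.py code: it only handles network namespaces, assumes Python 3.12 or later for os.unshare() and os.setns(), and must run as root.
import os
import pytest

class NetworkNamespace:
    """Keep a file descriptor on a network namespace and enter it as a context."""
    def __init__(self):
        # Remember the original namespace, create a new one, keep a handle
        # on it, then switch back to the original.
        self.original = os.open("/proc/self/ns/net", os.O_RDONLY)
        os.unshare(os.CLONE_NEWNET)
        self.fd = os.open("/proc/self/ns/net", os.O_RDONLY)
        os.setns(self.original, os.CLONE_NEWNET)

    def __enter__(self):
        os.setns(self.fd, os.CLONE_NEWNET)
        return self

    def __exit__(self, *exc):
        os.setns(self.original, os.CLONE_NEWNET)

@pytest.fixture
def namespaces():
    created = {}
    def factory(index):
        # Spawn a new namespace or return the existing one with this index.
        if index not in created:
            created[index] = NetworkNamespace()
        return created[index]
    yield factory
    for ns in created.values():
        os.close(ns.fd)
        os.close(ns.original)
With something along these lines, with namespaces(1): ... runs the enclosed block with that namespace as the current network namespace, which is essentially how the tests above address each side of the virtual link.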
You can see an example of a test run on the Travis build for 0.9.2. Since each test is correctly isolated, it's possible to run tests in parallel with pytest -n 10 --boxed. To catch even more bugs, both the address sanitizer (ASAN) and the undefined behavior sanitizer (UBSAN) are enabled. In case of a problem, notably a memory leak, the faulty program will exit with a non-zero exit code and the associated test will fail.

  1. A project like cwrap would definitely help. However, it lacks support for Netlink and raw sockets, which are essential to lldpd's operation.
  2. There are three main limitations in the use of namespaces with this fixture. First, when creating a user namespace, only root is mapped to the current user. With lldpd, we have two users (root and _lldpd). Therefore, the tests have to run as root. The second limitation is with the PID namespace. It's not possible for a process to switch from one PID namespace to another. When you call setns() on a PID namespace, only children of the current process will be in the new PID namespace. The PID namespace is convenient to ensure everyone gets killed once the tests are terminated, but you must keep in mind that /proc must be mounted in children only. The third limitation is that, for some namespaces (PID and user), all threads of a process must be part of the same namespace. Therefore, don't use threads in tests; use the multiprocessing module instead.
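Regarding that last point, a tiny, self-contained sketch of the process-based workaround (hypothetical names, not taken from the lldpd test suite): run the namespaced work in a child process and pass the result back through a queue.
import multiprocessing

def _in_child(queue):
    # A child process can live in its own PID or user namespace, whereas
    # the threads of an existing process cannot switch namespaces.
    queue.put("result computed inside the child")

def run_in_child():
    queue = multiprocessing.Queue()
    child = multiprocessing.Process(target=_in_child, args=(queue,))
    child.start()
    child.join()
    return queue.get()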

30 September 2015

Chris Lamb: Free software activities in September 2015

Inspired by Raphaël Hertzog, here is a monthly update covering a large part of what I have been doing in the free software world:
Debian The Reproducible Builds project was also covered in depth on LWN as well as in Lunar's weekly reports (#18, #19, #20, #21, #22).
Uploads
  • redis: A new upstream release, as well as overhauling the systemd configuration, maintaining feature parity with sysvinit and adding various security hardening features.
  • python-redis: Attempting to get its Debian Continuous Integration tests to pass successfully.
  • libfiu: Ensuring we do not FTBFS under exotic locales.
  • gunicorn: Dropping a dependency on python-tox now that tests are disabled.

RC bugs

I also filed FTBFS bugs against actdiag, actdiag, bangarang, bmon, bppphyview, cervisia, choqok, cinnamon-control-center, clasp, composer, cpl-plugin-naco, dirspec, django-countries, dmapi, dolphin-plugins, dulwich, elki, eqonomize, eztrace, fontmatrix, freedink, galera-3, golang-git2go, golang-github-golang-leveldb, gopher, gst-plugins-bad0.10, jbofihe, k3b, kalgebra, kbibtex, kde-baseapps, kde-dev-utils, kdesdk-kioslaves, kdesvn, kdevelop-php-docs, kdewebdev, kftpgrabber, kile, kmess, kmix, kmldonkey, knights, konsole4, kpartsplugin, kplayer, kraft, krecipes, krusader, ktp-auth-handler, ktp-common-internals, ktp-text-ui, libdevice-cdio-perl, libdr-tarantool-perl, libevent-rpc-perl, libmime-util-java, libmoosex-app-cmd-perl, libmoosex-app-cmd-perl, librdkafka, libxml-easyobj-perl, maven-dependency-plugin, mmtk, murano-dashboard, node-expat, node-iconv, node-raw-body, node-srs, node-websocket, ocaml-estring, ocaml-estring, oce, odb, oslo-config, oslo.messaging, ovirt-guest-agent, packagesearch, php-svn, php5-midgard2, phpunit-story, pike8.0, plasma-widget-adjustableclock, plowshare4, procps, pygpgme, pylibmc, pyroma, python-admesh, python-bleach, python-dmidecode, python-libdiscid, python-mne, python-mne, python-nmap, python-nmap, python-oslo.middleware, python-riemann-client, python-traceback2, qdjango, qsapecng, ruby-em-synchrony, ruby-ffi-rzmq, ruby-nokogiri, ruby-opengraph-parser, ruby-thread-safe, shortuuid, skrooge, smb4k, snp-sites, soprano, stopmotion, subtitlecomposer, svgpart, thin-provisioning-tools, umbrello, validator.js, vdr-plugin-prefermenu, vdr-plugin-vnsiserver, vdr-plugin-weather, webkitkde, xbmc-pvr-addons, xfsdump & zanshin.

18 November 2014

Erich Schubert: Generate iptables rules via pyroman

Vincent Bernat blogged on using Netfilter rulesets, pointing out that inserting the rules one-by-one using iptables calls may leave your firewall temporarily incomplete and possibly half-working, and that this approach can be slow.
He's right with that, but there are tools that do this properly. ;-)
Some years ago, for a multi-homed firewall, I wrote a tool called Pyroman. Using rules specified either in Python or XML syntax, it generates a firewall ruleset for you.
But it also addresses the points Vincent raised:
  • It uses iptables-restore to load the firewall more efficiently than by calling iptables a hundred times
  • It will backup the previous firewall, and roll-back on errors (or lack of confirmation, if you are remote and use --safe)
It also has a nice feature for use in staging: it can generate firewall rule sets offline, allowing you to review them before use or transfer them to a different host. Not all functionality is supported though (e.g. the Firewall.hostname constant usable in Python conditionals will still be the name of the host you generate the rules on; you may want to add a --hostname parameter to pyroman).
pyroman --print-verbose will generate a script readable by iptables-restore except for one problem: it contains both the rules for IPv4 and for IPv6, separated by #### IPv6 rules. It will also annotate the origin of the rule, for example:
# /etc/pyroman/02_icmpv6.py:82
-A rfc4890f -p icmpv6 --icmpv6-type 255 -j DROP
indicates that this particular line was produced due to line 82 in file /etc/pyroman/02_icmpv6.py. This makes debugging easier. In particular it allows pyroman to produce a meaningful error message if the rules are rejected by the kernel: it will tell you which line caused the rule that was rejected.
For the next version, I will probably add --output-ipv4 and --output-ipv6 options to make this more convenient to use. So far, pyroman is meant to be used on the firewall itself.
Note: if you have configured a firewall that you are happy with, you can always use iptables-save to dump the current firewall. But it will not preserve comments, obviously.

13 September 2014

Keith Packard: Altos1.5

AltOS 1.5 EasyMega support, features and bug fixes Bdale and I are pleased to announce the release of AltOS version 1.5. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a major release of AltOS, including support for our new EasyMega board and a host of new features and bug fixes.
AltOS Firmware EasyMega added, new features and fixes Our new flight computer, EasyMega, is a TeleMega without any radios:
AltOS Changes We've made a few improvements in the firmware:
AltOS Bug Fixes We also fixed a few bugs in the firmware:
AltosUI and TeleGPS EasyMega support, OS integration and more The AltosUI and TeleGPS applications have a few changes for this release:

16 June 2014

Keith Packard: Altos1.4

AltOS 1.4 TeleGPS support, features and bug fixes Bdale and I are pleased to announce the release of AltOS version 1.4. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a major release of AltOS, including support for our new TeleGPS board and a host of new features and bug fixes.
AltOS Firmware TeleGPS added, new features and fixes Our new tracker, TeleGPS, works quite differently than a flight computer. TeleGPS transmits our digital telemetry protocol, APRS and radio direction finding beacons. For TeleMega, we've made the firing time for the additional pyro channels (A-D) configurable, in case the default (50ms) isn't long enough.
AltOS Beeping Changes The three-beep startup tones have been replaced with a report of the current battery voltage. This is nice on all of the boards, but particularly useful with EasyMini, which doesn't have the benefit of telemetry reporting its state. We also changed the other state tones to Farnsworth spacing. This makes them all faster, and easier to distinguish from the numeric reports of voltage and altitude. Finally, we've added the ability to change the frequency of the beeper tones. This is nice when you have two Altus Metrum flight computers in the same ebay and want to be able to tell the beeps apart.
AltOS Bug Fixes Fixed a bug which prevented you from using TeleMega's extra pyro channel Flight State After configuration value. AltOS 1.3.2 on TeleMetrum v2.0 and TeleMega would reset the flight number to 2 after erasing flights; that's been fixed.
AltosUI New Maps, igniter tab and a few fixes With TeleGPS tracks now potentially ranging over a much wider area than a typical rocket flight, the Maps interface has been updated to include zooming and multiple map styles. It also now uses less memory, which should make it work on a wider range of systems. For TeleMega, we've added an Igniter tab to the flight monitor interface so you can check voltages on the extra pyro channels before pushing the button. We're hoping that the new Maps interface will load and run on machines with limited memory for Java applications; please let us know if this changes anything for you.
TeleGPS All new application just for TeleGPS While TeleGPS shares the same telemetry and data logging capabilities as all of the Altus Metrum flight computers, its use as a tracker is expected to be both broader and simpler than the rocketry-specific systems. We've built a custom TeleGPS application that incorporates the mapping and data visualization aspects of AltosUI, but eliminates all of the rocketry-specific flight state tracking.

15 February 2014

Keith Packard: Altos1.3.2

AltOS 1.3.2 Bug fixes and improved APRS support Bdale and I are pleased to announce the release of AltOS version 1.3.2. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a minor release of AltOS, including bug fixes for TeleMega, TeleMetrum v2.0 and AltosUI.
AltOS Firmware GPS Satellite reporting and APRS improved Firmware version 1.3.1 has a bug on TeleMega when it has data from more than 12 GPS satellites. This causes buffer overruns within the firmware. 1.3.2 limits the number of reported satellites to 12. APRS now continues to send the last known good GPS position, and reports GPS lock status and number of sats in view in the APRS comment field, along with the battery and igniter voltages.
AltosUI TeleMega GPS Satellite, GPS max height and Fire Igniters AltosUI was crashing when TeleMega reported that it had data from more than 12 satellites. While the TeleMega firmware has been fixed to never do that, AltosUI also has a fix in case you fly a TeleMega board without updated firmware. GPS max height is now displayed in the flight statistics. As the u-Blox GPS chips now provide accurate altitude information, we've added the maximum height as computed by GPS here. Fire Igniters now uses the letters A through D to label the extra TeleMega pyro channels instead of the numbers 0-3.

23 January 2014

Keith Packard: Altos1.3.1

AltOS 1.3.1 Bug fixes and improved APRS support Bdale and I are pleased to announce the release of AltOS version 1.3.1. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a minor release of AltOS, including bug fixes for TeleMega, TeleMetrum v2.0 and AltosUI.
AltOS Firmware Antenna down fixed and APRS improved Firmware version 1.3 has a bug in the support for operating the flight computer with the antenna facing downwards; the accelerometer calibration data would be incorrect. Furthermore, the accelerometer self-test routine would be confused if the flight computer were moved in the first second after power on. The firmware now simply re-tries the self-test several times. I went out and bought a real APRS radio, the Yaesu FT1D, to replace my venerable VX 7R. With this in hand, I changed our APRS support to use the compressed position format, which takes fewer bytes to send, offers increased resolution and includes altitude data. I took the altitude data out of the comment field and replaced that with battery and igniter voltages. This makes APRS reasonably useful in pad mode to monitor the state of the flight computer before boost. Anyone with a TeleMega should update to the new firmware eventually, although there aren't any critical bug fixes here, unless you're trying to operate the device with the antenna pointing downwards.
AltosUI TeleMega support and offline map loading improved. I added all of the new TeleMega sensor data as possible elements in the graph. This lets you see roll rates and horizontal acceleration values for the whole flight. The Fire Igniter dialog now lists all of the TeleMega extra pyro channels so you can play with those on the ground as well. Our offline satellite images are downloaded from Google, but they restrict us to reading 50 images per minute. When we tried to download a 9x9 grid of images to save for later use on the flight line, Google would stop feeding us maps after the first 50. You'd have to poke the button a second time to try and fill in the missing images. We fixed this by just limiting how fast we load maps, and now we can reliably load an 11x11 grid of images. Of course, there are also a few minor bug fixes, so it's probably worth updating even if the above issues don't affect you.

19 December 2013

Keith Packard: Altos1.3

AltOS 1.3 TeleMega and EasyMini support Bdale and I are pleased to announce the release of AltOS version 1.3. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a major release of AltOS as it includes support for both of our brand new flight computers, TeleMega and EasyMini.
AltOS Firmware New hardware, new features and fixes Our new advanced flight computer, TeleMega, required a lot of new firmware features, including: Our new easy-to-use flight computer, EasyMini, also uses a new processor, the LPC11U14, which is an ARM Cortex-M0 part. For our existing cc1111 devices, there are some minor bug fixes for the flight software, so you should plan on re-flashing flight units at some point. However, there aren't any incompatible changes, so you don't have to do it all at once. Bug fixes:
AltosUI Redesigned for TeleMega and EasyMini support AltosUI has also seen quite a bit of work for the 1.3 release, but almost all of that was a massive internal restructuring necessary to support flight computers with a wide range of sensors. From the user's perspective, it's pretty similar with a few changes:

18 December 2013

Daniel Pocock: Embedding Python in multi-threaded C++ applications

Embedding Python into other applications to provide a scripting mechanism is a popular practice. Ganglia can run user-supplied Python scripts for metric collection and the Blender project does it too, allowing users to develop custom tools and use script code to direct their animations. There are various reasons people choose Python:
  • Modern, object-oriented style of programming
  • Interpreted language (no need to compile things, just edit and run the code)
  • Python has a vast array of modules providing many features such as database and network access, data processing, etc.
The bottom line is that the application developer who chooses to embed Python in their existing application can benefit from the product of all this existing code multiplied by the imagination of their users.
Enter repro repro is the SIP proxy of the reSIProcate project. reSIProcate is an advanced SIP implementation developed in C++. repro is a multi-threaded process. repro's most serious competitor is the Kamailio SIP proxy. Kamailio has its own bespoke scripting language that it has inherited from the SIP Express Router (SER) family of projects. repro has always been far more rigid in its capabilities than Kamailio. On the other hand, while Kamailio has given users great flexibility, it has also come at a cost: users can easily build configurations that are not valid or may not do what they really intend if they don't understand the intricacies of the SIP protocol. Here is an example of the Kamailio configuration script (from Daniel's excellent blog about building a Skype-like service in less than an hour). Kamailio also has a wide array of plugins for things like database and LDAP access. repro only had embedded bdb and MySQL support. Embedding Python into repro appears to be a quick way to fill many of these gaps and allow users to combine the power of the reSIProcate stack with their own custom routing logic. On the other hand, it is not simply copying the Kamailio scripting solution: rather, it provides a distinctive alternative.
Starting the integration Embedding Python is such a popular practice that there is even dedicated documentation on the subject. As well as looking there, I also looked over the example provided by the embedded Python module for Ganglia. Looking over the Ganglia mod_python code, I noticed a lot of boilerplate code for reference counting and other tedious activities. Given that reSIProcate is C++ code, I thought I would look for a C++ solution to this and I came across PyCXX. PyCXX is licensed under BSD-like terms similar to reSIProcate itself, so it is a good fit. There is also the alternative Boost.Python API; however, reSIProcate has been built without Boost dependencies, so I decided to stick with PyCXX. I looked over the PyCXX examples and the documentation and was able to complete a first cut of the embedded Python scripting feature very quickly.
Using PyCXX One unusual thing I noticed about PyCXX is that the Debian package, python-cxx-dev, does not provide any shared library. Instead, some uncompiled source files are provided and each project using PyCXX must compile them and link them statically itself. Here is how I do that in the Makefile.am for pyroute in repro:
AM_CXXFLAGS = -I $(top_srcdir)
reproplugin_LTLIBRARIES = libpyroute.la
libpyroute_la_SOURCES = PyRoutePlugin.cxx
libpyroute_la_SOURCES += PyRouteWorker.cxx
libpyroute_la_SOURCES += $(PYCXX_SRCDIR)/cxxextensions.c
libpyroute_la_SOURCES += $(PYCXX_SRCDIR)/cxx_extensions.cxx
libpyroute_la_SOURCES += $(PYCXX_SRCDIR)/cxxsupport.cxx
libpyroute_la_SOURCES += $(PYCXX_SRCDIR)/../IndirectPythonInterface.cxx
libpyroute_la_CPPFLAGS = $(DEPS_PYTHON_CFLAGS)
libpyroute_la_LDFLAGS = -module -avoid-version
libpyroute_la_LDFLAGS += $(DEPS_PYTHON_LIBS)
EXTRA_DIST = example.py
noinst_HEADERS = PyRouteWorker.hxx
noinst_HEADERS += PyThreadSupport.hxx
The value PYCXX_SRCDIR must be provided on the configure command line. On Debian, it is /usr/share/python2.7/CXX/Python2.
Going multi-threaded My initial implementation simply invoked the Python method from the main routing thread of the repro SIP proxy. This meant that it would only be suitable for executing functions that complete quickly, ruling out the use of any Python scripts that talk to network servers or perform other slow activities. When the proxy becomes heavily loaded, it is important that it can complete many tasks asynchronously, such as forwarding chat messages between users in real-time. Therefore, it was essential to extend the solution to run the Python scripts in a pool of worker threads. At this point, I had an initial feeling that there may be danger in just calling the Python methods from some other random threads started by my own code. I went to see the manual and I came across this specific documentation about the subject. It looks quite easy: just wrap the call to the user-supplied Python code in something like this:
PyGILState_STATE gstate;
gstate = PyGILState_Ensure();
/* Perform Python actions here. */
result = CallSomeFunction();
/* evaluate result or handle exception */
/* Release the thread. No Python API allowed beyond this point. */
PyGILState_Release(gstate);
Unfortunately, I found that this would not work and that one of two problems would occur when using this code:
  • The thread blocks on the call to PyGILState_Ensure()
  • The program crashes with a segmentation fault when the call to a Python method was invoked
Exactly which of these outcomes I experienced seemed to depend on whether I tried to explicitly call PyEval_ReleaseThread() from the main thread after doing the Py_Initialize() and other setup tasks. I tried various permutations of using PyGILState_Ensure()/PyGILState_Release() and/or PyEval_SaveThread()/PyEval_ReleaseThread(), but I always had one of the same problems. The next thing that occurred to me is that maybe PyCXX provides some framework for thread integration: I had a look through the code and couldn't find any reference to the threading functionality from the C API. I went looking for more articles and mailing list discussions and found implementation notes such as this one in Linux Journal and this wiki from the Blender developers. Most of them just appeared to be repeating what was in the manual, with a few subtle differences, but none of this provided an immediate solution. Eventually, I discovered this other blog about concurrency with embedded Python and it suggests something not highlighted in any of the other resources: calling PyThreadState_New(m_interpreterState) in each thread after it starts and before it does anything else. Combining this with the use of PyEval_SaveThread()/PyEval_ReleaseThread() fixed the problem: the use of PyThreadState_New() was not otherwise mentioned in the relevant section of the Python guide. I decided to take this solution a step further and create a convenient C++ class to encapsulate the logic; you can see this in PyThreadSupport.hxx:
class PyExternalUser
{
   public:
      PyExternalUser(PyInterpreterState* interpreterState)
       : mInterpreterState(interpreterState),
         mThreadState(PyThreadState_New(mInterpreterState)) {};

      // RAII helper: constructing acquires the thread state, destruction releases it.
      class Use
      {
         public:
            Use(PyExternalUser& user)
             : mUser(user)
            { PyEval_RestoreThread(mUser.getThreadState()); };
            ~Use() { mUser.setThreadState(PyEval_SaveThread()); };
         private:
            PyExternalUser& mUser;
      };
      friend class Use;

   protected:
      PyThreadState* getThreadState() { return mThreadState; };
      void setThreadState(PyThreadState* threadState) { mThreadState = threadState; };

   private:
      PyInterpreterState* mInterpreterState;
      PyThreadState* mThreadState;
};
and the way to use it is demonstrated in the PyRouteWorker class. Observe how PyExternalUser::Use is instantiated in the PyRouteWorker::process() method: when it goes out of scope (either due to a normal return, an error or an exception) the necessary call to PyEval_SaveThread() is made in the PyExternalUser::Use::~Use() destructor.
Using other Python modules and DSO problems All of the above worked for basic Python such as this trivial example script:
def on_load():
    '''Do initialisation when module loads'''
    print 'example: on_load invoked'
def provide_route(method, request_uri, headers):
    '''Process a request URI and return the target URI(s)'''
    print 'example: method = ' + method
    print 'example: request_uri = ' + request_uri
    print 'example: From = ' + headers["From"]
    print 'example: To = ' + headers["To"]
    routes = list()
    routes.append('sip:bob@example.org')
    routes.append('sip:alice@example.org')
    return routes
However, it needs a more credible and useful test: using the python-ldap module to query an LDAP server seems like a good choice. Upon trying to use import ldap in the Python script, repro would refuse to load the script, choking on an error like this:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/ldap/__init__.py", line 22, in <module>
    import _ldap
ImportError: /usr/lib/python2.7/dist-packages/_ldap.so: undefined symbol: PyExc_SystemError
I looked at the file _ldap.so and discovered that it is linked with the LDAP libraries but not explicitly linked to any version of the Python runtime libraries. It expects the application hosting it to provide the Python symbols globally. In my own implementation, my embedded Python encapsulation code is provided as a DSO plugin, similar to the way plugins are loaded in Ganglia or Apache. The DSO links to Python: the DSO is loaded by a dlopen() call from the main process. The main repro binary has no direct link to the Python libraries. Adding RTLD_GLOBAL to the top-level dlopen() call for loading the plugin is one way to ensure the Python symbols are made available to the Python modules loaded indirectly by the Python interpreter. This solution may be suitable for applications that don't mix and match many different components.
Doing something useful with it Now that it was all working nicely, I took a boilerplate LDAP Python example and used it to make a trivial script that converts sip:user@example.org to something like sip:9001@pbx.example.org, assuming that 9001 is the telephoneNumber associated with the user@ email address in LDAP. It is surprisingly simple and easily adaptable to local requirements depending upon the local LDAP structures:
import ldap
from urlparse import urlparse
def on_load():
    '''Do initialisation when module loads'''
    #print 'ldap router: on_load invoked'
def provide_route(method, request_uri, headers):
    '''Process a request URI and return the target URI(s)'''
    #print 'ldap router: request_uri = ' + request_uri
    _request_uri = urlparse(request_uri)
    routes = list()
    
    # Basic LDAP server parameters:
    server_uri = 'ldaps://ldap.example.org'
    base_dn = "dc=example,dc=org"
    # this domain will be appended to the phone numbers when creating
    # the target URI:
    phone_domain = 'pbx.example.org'
    # urlparse is not great for "sip:" URIs,
    # the user@host portion is in the 'path' element:
    filter = "(&(objectClass=inetOrgPerson)(mail=%s))" % _request_uri.path
    #print "Using filter: %s" % filter
    try:
        con = ldap.initialize(server_uri)
        scope = ldap.SCOPE_SUBTREE
        retrieve_attributes = None
        result_id = con.search(base_dn, scope, filter, retrieve_attributes)
        result_set = []
        while 1:
            timeout = 1
            result_type, result_data = con.result(result_id, 0, None)
            if (result_data == []):
                break
            else:
                if result_type == ldap.RES_SEARCH_ENTRY:
                    result_set.append(result_data)
        if len(result_set) == 0:
            #print "No Results."
            return routes
        for i in range(len(result_set)):
            for entry in result_set[i]:
                if entry[1].has_key('telephoneNumber'):
                    phone = entry[1]['telephoneNumber'][0]
                    routes.append('sip:' + phone + '@' + phone_domain)
    except ldap.LDAPError, error_message:
        print "Couldn't Connect. %s " % error_message
    return routes
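As a side note, the asynchronous search()/result() loop above can usually be replaced by python-ldap's synchronous search_s() helper. A shortened, hypothetical sketch of the same lookup (same imaginary server and schema as above; error handling omitted):
import ldap

def lookup_phones(mail, server_uri='ldaps://ldap.example.org',
                  base_dn='dc=example,dc=org'):
    # search_s() performs the whole search synchronously and returns a
    # list of (dn, attributes) tuples.
    con = ldap.initialize(server_uri)
    filterstr = "(&(objectClass=inetOrgPerson)(mail=%s))" % mail
    results = con.search_s(base_dn, ldap.SCOPE_SUBTREE, filterstr,
                           ['telephoneNumber'])
    return [attrs['telephoneNumber'][0] for dn, attrs in results
            if 'telephoneNumber' in attrs]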
Embedded Python opens up a world of possibilities After Ganglia 3.1.0 introduced an embedded Python scripting facility, dozens of new modules started appearing on GitHub. Python scripting lowers the barrier for new contributors to a project and makes it much easier to fine-tune free software projects to meet local requirements: hopefully we will see similar trends with the repro SIP proxy and other projects that choose Python. The code is committed here in the reSIProcate repository. These features will appear in the next beta release of reSIProcate and Debian packages will be available in unstable in a few days.

29 November 2013

Keith Packard: Black Friday 2013

Back on Black (Friday) Event Altus Metrum is pleased to announce our Back on Black (Friday) event! For the first time since the Black Forest fire in June, we're re-opening our web store this weekend with a host of new and classic Altus Metrum products, including a special pre-order discount on our latest-and-greatest flight computer design, TeleMega. This weekend only, Friday, 29 November 2013 through Monday, 2 December, 2013, the first 40 TeleMega direct orders placed through our web store will receive a special $50 pre-order discount (regular $400, now only $350!).
TeleMega is an advanced flight computer with 9-axis IMU, 6 pyro channels, uBlox Max 7Q GPS and 40mW telemetry system. We designed TeleMega to be the ideal flight computer for sustainers and other complex projects. TeleMega production is currently in process, and we expect to be ready to ship in mid-December. Pre-order now and we won't charge you until we ship. Learn more about TeleMega at: http://altusmetrum.org/TeleMega/
We are also pleased to announce that TeleBT is back in stock. Priced at $150, TeleBT is our latest ground station that connects to your laptop over USB or your Android device over BlueTooth. Learn more about TeleBT at http://altusmetrum.org/TeleBT/
Another new product we're thrilled to announce is EasyMini! Priced at only $80, EasyMini is a two-channel flight computer with built-in data logging and USB data download. Like our more advanced flight computers, EasyMini is loaded with sophisticated electronics and firmware, designed to be very simple to use yet capable enough for high performance airframes. Perfect as a first flight computer, EasyMini is also great as a backup deployment controller in complex projects. Learn more about EasyMini at: http://altusmetrum.org/EasyMini/
Also in stock for immediate shipment is MicroPeak, our 1.9 gram recording altimeter available for $50. The MicroPeak USB adapter, also $50, has been improved to make data downloading a snap. Read more about these at: http://altusmetrum.org/MicroPeak http://altusmetrum.org/MicroPeakUSB
You can learn more about these and all our other Altus Metrum products at http://altusmetrum.org. The special discount on TeleMega pre-orders is available only on orders placed directly through Bdale's web store at http://shop.gag.com
Thank you all for your support of Altus Metrum during 2013. It's been a rough year, but we're having a great time updating our existing products and designing new stuff! We look forward to returning products like TeleMetrum and TeleMini to the market soon, and plan to introduce even more new products soon.

5 September 2013

Keith Packard: Airfest-altimeter-testing

Altimeter Testing at Airfest Bdale and I, along with AJ Towns and Mike Beattie, spent last weekend in Argonia, Kansas, flying rockets with our Kloudbusters friends at Airfest 19. We had a great time! AJ and Mike both arrived a week early at Bdale's to build L3 project airframes, and both flew successful cert flights at Airfest! Airfest was an opportunity for us to test fly prototypes of new flight electronics Bdale and I have spent the last few weeks developing, and I thought I'd take a few minutes today to write some notes about what we built and flew.
TeleMega We've been working on TeleMega for quite a while. It's a huge step up in complexity from our original TeleMetrum, as it has a raft of external sensors and six pyro circuits. Bdale flew TeleMega in his new fiberglass 4 airframe on a Loki 75mm blue M demo motor. GPS tracking was excellent; you can see here that GPS altitude tracked the barometric sensor timing exactly: GPS lost lock when the motor lit, but about 3 seconds after motor burnout, it re-acquired the satellite signals and was reporting usable altitude data right away. The GPS reported altitude was higher than the baro sensor, but that can be explained by our approximation of an atmospheric model used to convert pressure into altitude. The rest of the flight was also nominal; TeleMega deployed drogue and main chutes just fine.
TeleMetrum We've redesigned TeleMetrum. The new version uses better sensors (MS5607 baro sensor, MMA6555 accelerometer) and a higher power radio (CC1120 40mW). The board is the same size, all the connectors are in the same places so it's a drop-in replacement, and it's still got two pyro channels and USB for configuration, data download and battery charging. I loaded up my Candy-Cane airframe with a small 5 grain 38mm CTI classic: The flight computer worked perfectly, but GPS reception was not as good as we'd like to see: Given how well TeleMega was receiving GPS signals, I'm hopeful that we'll be able to tweak TeleMetrum to improve performance.
TeleMini We've also redesigned TeleMini. It's still a two-channel flight computer with logging and telemetry, but we've replaced the baro sensor with the MS5607, added on-board flash for increased logging space and added on-board screw terminals for an external battery and power switch. You can still use one of our 3.7V batteries, but you can also use another battery providing from 3.7 to 15V. I was hoping to finish up the firmware and fly it, but I ran out of time before the launch. The good news is that all of the components of the board have been tested and work correctly, and the firmware is "feature complete", meaning we've gotten all of the features coded, it's just not quite working yet.
EasyMini EasyMini is a new product for us. It's essentially the same as a TeleMini, but without a radio. Two channels, baro-only, with logging. Like TeleMini, it includes an on-board USB connector and can use either one of our 3.7V batteries, or an external battery from 3.7V to 15V. EasyMini and TeleMini are the same size, and have holes in the same places, so you can swap between them easily. I flew EasyMini in my Koala airframe with a 29mm 3 grain CTI blue-streak motor. EasyMini successfully deployed the main chute and logged flight data: We also sent a couple of boards home with Kevin Trojanowski and Greg Rothman for them to play with.
TeleGPS TeleGPS is a GPS tracker, incorporating a u-blox Max receiver and a 70cm transmitter. It can send position information via APRS or our usual digital telemetry formats.
I was also hoping to have the TeleGPS firmware working, and I spent a couple of nights in the motel coding, but didn't manage to finish up. So, no data from this board either.
Production Plans Given the success of the latest TeleMega prototype, we're hoping to have it into production first. We'll do some more RF testing on the bench with the boards to make sure it meets our standards before sending it out for the first production run. The goal is to have TeleMega ready to sell by the end of October. TeleMetrum clearly needs work on the layout to improve GPS RF performance. With the testing equipment that Bdale is in the midst of re-acquiring, it should be possible to finish this up fairly soon. However, the flight firmware looks great, so we're hoping to get these done in time to sell by the end of November. TeleMini is looking great from a hardware perspective, but the firmware needs work. Once the firmware is running, we'll need to make enough test flights to shake out any remaining issues before moving forward with it. EasyMini is also looking finished; I've got a stack of prototypes and will be getting people to fly them at my local launch in another couple of weeks. The plan here is to build a small batch by hand and get them into the store once we're finished testing, using those to gauge interest before we pay for a larger production run.
